Personalized Daily ArXiv Papers 2026-02-02

[gpt-5]	Prompt	Completion	Total
Token	69552	64139	133691
Cost	$0.09	$0.64	$0.73

Total arXiv papers: 830

Total scanned papers: 481

Total relevant papers: 42

Table of contents with paper titles:

Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation Authors: Andrei Panferov, Erik Schultheis, Soroush Tabesh, Dan Alistarh
Float8@2bits: Entropy Coding Enables Data-Free Model Compression Authors: Patrick Putzky, Martin Genzel, Mattes Mollenhauer, Sebastian Schulze, Thomas Wollmann, Stefan Dietzel
Breaking the Blocks: Continuous Low-Rank Decomposed Scaling for Unified LLM Quantization and Adaptation Authors: Pingzhi Tang, Ruijie Zhou, Fanxu Meng, Wenjie Pei, Muhan Zhang
MixQuant: Pushing the Limits of Block Rotations in Post-Training Quantization Authors: Sai Sanjeet, Ian Colbert, Pablo Monteagudo-Lago, Giuseppe Franco, Yaman Umuroglu, Nicholas J. Fraser
A Random Matrix Theory of Masked Self-Supervised Regression Authors: Arie Wortsman Zurich, Federica Gerace, Bruno Loureiro, Yue M. Lu
Sparse Attention as Compact Kernel Regression Authors: Saul Santos, Nuno Gonçalves, Daniel C. McNamee, André F. T. Martins
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs Authors: Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
Symmetry Breaking in Transformers for Efficient and Interpretable Training Authors: Eva Silverstein, Daniel Kunin, Vasudev Shyam
Names Don't Matter: Symbol-Invariant Transformer for Open-Vocabulary Learning Authors: İlker Işık, Wenchao Li
AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism Authors: Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long
EUGens: Efficient, Unified, and General Dense Layers Authors: Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski
Exact closed-form Gaussian moments of residual layers Authors: Simon Kuang, Xinfan Lin
Learnable Permutation for Structured Sparsity on Transformer Models Authors: Zekai Li, Ji Liu, Guanchen Li, Yixing Xu, Ziqiong Liu, Xuanwu Yin, Dong Li, Emad Barsoum
Residual Context Diffusion Language Models Authors: Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu
Layerwise Progressive Freezing Enables STE-Free Training of Deep Binary Neural Networks Authors: Evan Gibson Smith, Bashima Islam
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification Authors: Haoyun Jiang, Junqi He, Feng Hong, Xinlong Yang, Jianwei Zhang, Zheng Li, Zhengyang Zhuge, Zhiyong Chen, Bo Han, Junyang Lin, Jiangchao Yao
YuriiFormer: A Suite of Nesterov-Accelerated Transformers Authors: Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet
Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks Authors: Puyu Wang, Junyu Zhou, Philipp Liznerski, Marius Kloft
Perplexity Cannot Always Tell Right from Wrong Authors: Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, Razvan Pascanu
Stabilizing Transformer Training Through Consensus Authors: Shyam Venkatasubramanian, Sean Moushegian, Michael Lin, Mir Park, Ankit Singhal, Connor Lee
DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning Authors: Abhishek Tyagi, Yunuo Cen, Shrey Dhorajiya, Bharadwaj Veeravalli, Xuanyao Fong
Language Model Circuits Are Sparse in the Neuron Basis Authors: Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann
Is Hierarchical Quantization Essential for Optimal Reconstruction? Authors: Shirin Reyhanian, Laurenz Wiskott
Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features Authors: Yiting Liu, Zhi-Hong Deng
FlexLoRA: Entropy-Guided Flexible Low-Rank Adaptation Authors: Muqing Liu, Chongjie Si, Yuheng Jia
SpanNorm: Reconciling Training Stability and Performance in Deep Transformers Authors: Chao Wang, Bei Li, Jiaqi Zhang, Xinyu Liu, Yuchun Fan, Linkun Lyu, Xin Chen, Jingang Wang, Tong Xiao, Peng Pei, Xunliang Cai
Hierarchical Shift Mixing -- Beyond Dense Attention in Transformers Authors: Robert Forchheimer
Matterhorn: Efficient Analog Sparse Spiking Transformer Architecture with Masked Time-To-First-Spike Encoding Authors: Zhanglu Yan, Kaiwen Tang, Zixuan Zhu, Zhenyu Bai, Qianhui Liu, Weng-Fai Wong
TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training Authors: Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Dongyang Li, Yupeng Su, Sijia Liu, Zheng Zhang
Towards Resiliency in Large Language Model Serving with KevlarFlow Authors: Shangshu Qian, Kipling Liu, P. C. Sruthi, Lin Tan, Yongle Zhang
Understanding Generalization from Embedding Dimension and Distributional Convergence Authors: Junjie Yu, Zhuoli Ouyang, Haotian Deng, Chen Wei, Wenxiao Ma, Jianyu Zhang, Zihan Deng, Quanying Liu
HetCCL: Accelerating LLM Training with Heterogeneous GPUs Authors: Heehoon Kim, Jaehwan Lee, Taejeoung Kim, Jongwon Park, Jinpyo Kim, Pyongwon Suh, Ryan H. Choi, Sangwoo Lee, Jaejin Lee
Local Intrinsic Dimension of Representations Predicts Alignment and Generalization in AI Models and Human Brain Authors: Junjie Yu, Wenxiao Ma, Chen Wei, Jianyu Zhang, Haotian Deng, Zihan Deng, Quanying Liu
NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models Authors: Haisong Gong, Zhibo Liu, Qiang Liu, Shu Wu, Liang Wang
Shattered Compositionality: Counterintuitive Learning Dynamics of Transformers for Arithmetic Authors: Xingyu Zhao, Darsh Sharma, Rheeya Uppaal, Yiqiao Zhong
Mano: Restriking Manifold Optimization for LLM Training Authors: Yufei Gu, Zeke Xie
Context Structure Reshapes the Representational Geometry of Language Models Authors: Eghbal A. Hosseini, Yuxuan Li, Yasaman Bahri, Declan Campbell, Andrew Kyle Lampinen
Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA Authors: Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Wanqi Yang, Yinghuan Shi
FOCUS: DLLMs Know How to Tame Their Compute Bound Authors: Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini
Is Softmax Loss All You Need? A Principled Analysis of Softmax-family Loss Authors: Yuanhao Pu, Defu Lian, Enhong Chen
SOMBRERO: Measuring and Steering Boundary Placement in End-to-End Hierarchical Sequence Models Authors: Pit Neitemeier, Alessio Serra, Jiaze Li, Sascha Wirges, Lukas Balles, Jan Hendrik Metzen
Stabilizing Consistency Training: A Flow Map Analysis and Self-Distillation Authors: Youngjoong Kim, Duhoe Kim, Woosung Kim, Jaesik Park

1. Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation

ArXiv ID: 2601.22813

Authors: Andrei Panferov, Erik Schultheis, Soroush Tabesh, Dan Alistarh

Abstract: The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, we improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, both on the forward and on the backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST-DASLab/Quartet-II .

Comment: Quantization for efficient large-scale training: fully NVFP4 training with an unbiased micro-scaled quantizer (MS-EDEN) improving gradient estimation; systems-level kernels on Blackwell GPUs.

Relevance: 10 Novelty: 9
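
As a rough illustration of why stochastic rounding (SR) is described as unbiased in the abstract above, the sketch below rounds a value to a uniform grid and compares the average of many SR draws with round-to-nearest. It is a minimal pure-Python example under simplifying assumptions: the uniform grid, the function names, and the constants are illustrative only and are not NVFP4, MS-EDEN, or anything taken from the Quartet II code.

```python
import math
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of `step`; round up with probability equal to
    the fractional position of x between its two neighbouring grid points."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * step


def round_nearest(x: float, step: float) -> float:
    """Deterministic round-to-nearest on the same (assumed) uniform grid."""
    return round(x / step) * step


rng = random.Random(0)          # fixed seed so the run is repeatable
x, step, n = 0.3, 1.0, 100_000  # illustrative values

# Averaging many SR draws approaches x itself (unbiased, but noisy per draw).
sr_mean = sum(stochastic_round(x, step, rng) for _ in range(n)) / n
# Round-to-nearest always returns the same grid point, so its error never averages out.
rtn_value = round_nearest(x, step)

print(f"SR mean over {n} draws: {sr_mean:.4f}")
print(f"round-to-nearest:       {rtn_value:.4f}")
```

Each individual SR draw is noisier than round-to-nearest, which is the kind of quantization error the abstract reports reducing with MS-EDEN.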

2. Float8@2bits: Entropy Coding Enables Data-Free Model Compression

ArXiv ID: 2601.22787

Authors: Patrick Putzky, Martin Genzel, Mattes Mollenhauer, Sebastian Schulze, Thomas Wollmann, Stefan Dietzel

Abstract: Post-training compression is currently divided into two contrasting regimes. On the one hand, fast, data-free, and model-agnostic methods (e.g., NF4 or HQQ) offer maximum accessibility but suffer from functional collapse at extreme bit-rates below 4 bits. On the other hand, techniques leveraging calibration data or extensive recovery training achieve superior fidelity but impose high computational constraints and face uncertain robustness under data distribution shifts. We introduce EntQuant, the first framework to unite the advantages of these distinct paradigms. By matching the performance of data-dependent methods with the speed and universality of data-free techniques, EntQuant enables practical utility in the extreme compression regime. Our method decouples numerical precision from storage cost via entropy coding, compressing a 70B parameter model in less than 30 minutes. We demonstrate that EntQuant not only achieves state-of-the-art results on standard evaluation sets and models, but also retains functional performance on more complex benchmarks with instruction-tuned models, all at modest inference overhead.

Comment: Model compression and efficiency: extreme-rate post-training compression via entropy coding decoupled from precision (data-free), achieving SOTA at ≤4 bits.

Relevance: 10 Novelty: 9
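
The idea of decoupling numerical precision from storage cost can be illustrated with the Shannon entropy of a symbol stream: if the quantized values are stored as 8-bit codes but their distribution is highly skewed, an ideal entropy coder needs far fewer than 8 bits per symbol on average. The sketch below is a toy calculation under assumed inputs (the symbol list and the 8-bit storage width are made up for illustration); it does not reproduce EntQuant.

```python
import math
from collections import Counter


def empirical_entropy_bits(symbols) -> float:
    """Shannon entropy (bits per symbol) of the empirical symbol distribution.
    This lower-bounds the average code length of any lossless symbol code."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical quantized weights: stored as 8-bit codes, but heavily skewed.
# Empirical probabilities are 1/2, 1/4, 1/8, 1/16, 1/16.
symbols = [0] * 8 + [1] * 4 + [2] * 2 + [3] + [4]

storage_bits = 8  # assumed nominal width of each stored code
entropy_bits = empirical_entropy_bits(symbols)

print(f"nominal width: {storage_bits} bits/symbol")
print(f"entropy bound: {entropy_bits:.3f} bits/symbol")  # 1.875 for this toy input
```

The values keep their nominal 8-bit representation, while the achievable storage cost is governed by the entropy of their distribution.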

3. Breaking the Blocks: Continuous Low-Rank Decomposed Scaling for Unified LLM Quantization and Adaptation

ArXiv ID: 2601.22716

Authors: Pingzhi Tang, Ruijie Zhou, Fanxu Meng, Wenjie Pei, Muhan Zhang

Abstract: Current quantization methods for LLMs predominantly rely on block-wise structures to maintain efficiency, often at the cost of representational flexibility. In this work, we demonstrate that element-wise quantization can be made as efficient as block-wise scaling while providing strictly superior expressive power by modeling the scaling manifold as continuous low-rank matrices ($S = BA$). We propose Low-Rank Decomposed Scaling (LoRDS), a unified framework that rethinks quantization granularity through this low-rank decomposition. By "breaking the blocks" of spatial constraints, LoRDS establishes a seamless efficiency lifecycle: it provides high-fidelity PTQ initialization refined via iterative optimization, enables joint QAT of weights and scaling factors, and facilitates high-rank multiplicative PEFT adaptation. Unlike additive PEFT approaches such as QLoRA, LoRDS enables high-rank weight updates within a low-rank budget while incurring no additional inference overhead. Supported by highly optimized Triton kernels, LoRDS consistently outperforms state-of-the-art baselines across various model families in both quantization and downstream fine-tuning tasks. Notably, on Llama3-8B, our method achieves up to a 27.0% accuracy improvement at 3 bits over NormalFloat quantization and delivers a 1.5x inference speedup on NVIDIA RTX 4090 while enhancing PEFT performance by 9.6% on downstream tasks over 4bit QLoRA, offering a robust and integrated solution for unified compression and adaptation of LLMs.

Comment: Compression/Efficiency: unified low-rank decomposed element-wise scaling enabling quantization, joint QAT, and high-rank multiplicative PEFT with no extra inference cost.

Relevance: 10 Novelty: 9
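
One way to see how a low-rank scale matrix $S = BA$ relates to block-wise scaling: a per-row, per-block scale table expanded to element-wise form equals the product of that table with a 0/1 block-indicator matrix, so it is a special case with rank at most the number of blocks. The sketch below checks this on a tiny example. It is an illustration of that linear-algebra fact only; the helper names and numbers are assumptions, and nothing here comes from the LoRDS implementation.

```python
def matmul(left, right):
    """Plain-Python matrix product of two lists of rows."""
    inner, cols = len(right), len(right[0])
    return [[sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
            for row in left]


def expand_block_scales(block_scales, block_size):
    """Element-wise scale matrix in which every column of a block shares one scale."""
    num_cols = len(block_scales[0]) * block_size
    return [[row[j // block_size] for j in range(num_cols)] for row in block_scales]


def block_indicator(num_blocks, block_size):
    """A[k][j] = 1 if column j belongs to block k, else 0."""
    num_cols = num_blocks * block_size
    return [[1.0 if j // block_size == k else 0.0 for j in range(num_cols)]
            for k in range(num_blocks)]


# Hypothetical scales: 3 weight rows, 2 column blocks of 4 columns each.
B = [[0.5, 2.0],
     [1.0, 4.0],
     [0.25, 8.0]]
A = block_indicator(num_blocks=2, block_size=4)

S_lowrank = matmul(B, A)               # 3 x 8 matrix of rank <= 2
S_block = expand_block_scales(B, 4)    # the usual block-wise scale layout

print(S_lowrank == S_block)  # the two constructions coincide
```

Letting `A` be an arbitrary learned matrix rather than a fixed indicator removes the block constraint while keeping the parameter count at r(m + n) for an m x n weight with rank r.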

4. MixQuant: Pushing the Limits of Block Rotations in Post-Training Quantization

ArXiv ID: 2601.22347

Authors: Sai Sanjeet, Ian Colbert, Pablo Monteagudo-Lago, Giuseppe Franco, Yaman Umuroglu, Nicholas J. Fraser

Abstract: Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of full-vector rotations, the effect of block structure on outlier suppression remains poorly understood. To fill this gap, we present the first systematic, non-asymptotic analysis of outlier suppression for block Hadamard rotations. Our analysis reveals that outlier suppression is fundamentally limited by the geometry of the input vector. In particular, post-rotation outliers are deterministically minimized when the pre-rotation $\ell_1$ norm mass is evenly distributed across blocks. Guided by these insights, we introduce MixQuant, a block rotation-aware PTQ framework that redistributes activation mass via permutations prior to rotation. We propose a greedy mass diffusion algorithm to calibrate permutations by equalizing the expected blockwise $\ell_1$ norms. To avoid adding inference overhead, we identify permutation-equivariant regions in transformer architectures to merge the resulting permutations into model weights before deployment. Experiments show that MixQuant consistently improves accuracy across all block sizes, recovering up to 90% of the full-vector rotation perplexity when quantizing Llama3 1B to INT4 with block size 16, compared to 46% without permutations.

Comment: Model Compression and Efficiency: PTQ with block rotations analyzed non-asymptotically; introduces permutation-based mass diffusion for outlier suppression.

Relevance: 10 Novelty: 8
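
The claim that post-rotation outliers shrink when $\ell_1$ mass is spread evenly across blocks can be checked on a toy vector. The sketch below applies a normalized Hadamard transform to each block of size 4 and compares the largest output magnitude when two outliers share a block versus when a permutation places them in different blocks. The vectors and the hand-picked permutation are assumptions for illustration; this is not the greedy mass diffusion algorithm from the paper.

```python
import math


def hadamard_transform(values):
    """Normalized Walsh-Hadamard transform; len(values) must be a power of two."""
    v = list(values)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for j in range(start, start + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(v))
    return [x * scale for x in v]


def block_rotate(vector, block_size):
    """Apply the Hadamard rotation independently to each contiguous block."""
    out = []
    for start in range(0, len(vector), block_size):
        out.extend(hadamard_transform(vector[start:start + block_size]))
    return out


def peak(vector):
    return max(abs(x) for x in vector)


block_size = 4
clustered = [8.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # both outliers in block 0
spread = [8.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0]     # same entries, permuted

peak_clustered = peak(block_rotate(clustered, block_size))
peak_spread = peak(block_rotate(spread, block_size))

print(peak_clustered, peak_spread)  # 8.0 4.0
```

In this example the block rotation does not reduce the peak at all when the mass is concentrated in one block, while the permuted input halves it.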

5. A Random Matrix Theory of Masked Self-Supervised Regression

ArXiv ID: 2601.23208

Authors: Arie Wortsman Zurich, Federica Gerace, Bruno Loureiro, Yue M. Lu

Abstract: In the era of transformer models, masked self-supervised learning (SSL) has become a foundational training paradigm. A defining feature of masked SSL is that training aggregates predictions across many masking patterns, giving rise to a joint, matrix-valued predictor rather than a single vector-valued estimator. This object encodes how coordinates condition on one another and poses new analytical challenges. We develop a precise high-dimensional analysis of masked modeling objectives in the proportional regime where the number of samples scales with the ambient dimension. Our results provide explicit expressions for the generalization error and characterize the spectral structure of the learned predictor, revealing how masked modeling extracts structure from data. For spiked covariance models, we show that the joint predictor undergoes a Baik–Ben Arous–Péché (BBP)-type phase transition, identifying when masked SSL begins to recover latent signals. Finally, we identify structured regimes in which masked self-supervised learning provably outperforms PCA, highlighting potential advantages of SSL objectives over classical unsupervised methods.

Comment: Representation Learning: high-dimensional random matrix theory for masked self-supervised regression with BBP-type phase transition and explicit generalization error.

Relevance: 9 Novelty: 9

6. Sparse Attention as Compact Kernel Regression

ArXiv ID: 2601.22766

Authors: Saul Santos, Nuno Gonçalves, Daniel C. McNamee, André F. T. Martins

Abstract: Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.

Comment: Model Architecture: provides a kernel-theoretic framework for sparse attention (entmax/compact kernels), offering principled attention design alternatives.
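
To make the contrast between dense and sparse attention weights concrete, the sketch below implements sparsemax (the $\alpha = 2$ member of the entmax family, i.e. $n = 1$ in the abstract's $\alpha = 1 + \frac{1}{n}$) next to softmax and applies both to the same score vector. It uses the standard sort-and-threshold procedure for sparsemax; the scores are arbitrary illustrative values, and the code is not taken from the paper or from Memory Mosaics.

```python
import math


def softmax(scores):
    """Dense attention weights: every entry is strictly positive."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex.
    Scores below the threshold tau receive exactly zero weight."""
    ordered = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z in enumerate(ordered, start=1):
        cumsum += z
        # The support is the largest k for which 1 + k * z_(k) > sum of the top-k scores.
        if 1.0 + k * z > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


scores = [1.0, 0.5, -1.0]  # hypothetical query-key scores

dense = softmax(scores)
sparse = sparsemax(scores)

print([round(p, 3) for p in dense])  # all entries positive
print(sparse)                        # [0.75, 0.25, 0.0]
```

The low-scoring key is dropped entirely by sparsemax, which mirrors the bounded support of the compact kernels discussed in the abstract.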