Personalized Daily ArXiv Papers 2026-03-12

[gpt-5]	Prompt	Completion	Total
Token	47370	42790	90160
Cost	$0.06	$0.43	$0.49

Total arXiv papers: 585

Total scanned papers: 365

Total relevant papers: 28

Table of contents with paper titles:

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training Authors: Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fanqi Yu, Ruijun Huang, Fang Dong, Xin Zhang, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Yuan Cheng, Tun Lu, Fan Yang, Li Shang
LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation Authors: Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon
Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias Authors: Borun D Chowdhury
ConFu: Contemplate the Future for Better Speculative Sampling Authors: Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design Authors: Junzhuo Li, Peijie Jiang, Changxin Tian, Jia Liu, Zhiqiang Zhang, Xuming Hu
Leech Lattice Vector Quantization for Efficient LLM Compression Authors: Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough, Markus Nagel
MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios Authors: Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye
Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors Authors: Zegu Zhang, Jian Zhang
ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping Authors: Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma
A New Tensor Network: Tubal Tensor Train and Its Applications Authors: Salman Ahmadi-Asl, Valentin Leplat, Anh-Huy Phan, Andrzej Cichocki
RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators Authors: Xinsheng Tang, Yangcheng Li, Nan Wang, Zhiyi Shu, Xingyu Ling, Junna Xing, Peng Zhou, Qiang Liu
Marginals Before Conditionals Authors: Mihir Sahasrabudhe
Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning Authors: Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz
Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought Authors: Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi, Defeng Sun
Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models Authors: Anurag Mishra
The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers Authors: Peter Balogh
SCORE: Replacing Layer Stacking with Contractive Recurrent Depth Authors: Guillaume Godin
On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD Authors: Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction Authors: Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang
Factorized Neural Implicit DMD for Parametric Dynamics Authors: Siyuan Chen, Zhecheng Wang, Yixin Chen, Yue Chang, Peter Yichen Chen, Eitan Grinspun, Jonathan Panuelos
Large Spikes in Stochastic Gradient Descent: A Large-Deviations View Authors: Benjamin Gess, Daniel Heydecker
Training Language Models via Neural Cellular Automata Authors: Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
Digging Deeper: Learning Multi-Level Concept Hierarchies Authors: Oscar Hill, Mateo Espinosa Zarlenga, Mateja Jamnik
Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation Authors: Viktorija Poļaka, Ivo Pascal de Jong, Andreea Ioana Sburlea
Quantization Robustness of Monotone Operator Equilibrium Networks Authors: James Li, Philip H. W. Leong, Thomas Chaffey
ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning Authors: Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen, Jiarui Feng, Dongqi Fu, Qifan Wang, Jiayi Liu, Jun Xiao, Xiangjun Fan, Benyu Zhang, Hong Li, Zhining Liu, Hyunsik Yoo, Zhichen Zeng, Tianxin Wei, Hanghang Tong
A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality Authors: Eng-Jon Ong, Omer Bobrowski, Gesine Reinert, Primoz Skraba
Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation Authors: Jianlong Chen, Zhiming Zhou

1. The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

ArXiv ID: 2603.10444

Authors: Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fanqi Yu, Ruijun Huang, Fang Dong, Xin Zhang, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Yuan Cheng, Tun Lu, Fan Yang, Li Shang

Abstract: Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.

Comment: Analyzes anisotropy and mean-bias as rank-one driver of FP4 instability and proposes mean subtraction — matches Model Compression and Efficiency: quantization stability.

Relevance: 10 Novelty: 9
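
Illustrative sketch (not the authors' code): the abstract's argument is that a shared mean inflates the absmax scale of a quantization block, so the small variation around the mean collapses into one or two bins. The toy below assumes a symmetric 4-bit integer grid as a stand-in for FP4, a per-block mean kept in high precision, and hypothetical helper names (`quantize_block`, `quantize_centered`); the paper's actual kernels and the point at which the mean is removed may differ.

```python
import random

LEVELS = 7  # assumption: symmetric 4-bit integer grid [-7, 7] as a stand-in for FP4


def quantize_block(block, levels=LEVELS):
    """Absmax blockwise quantization: one scale per block, set by the largest magnitude."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return list(block)
    scale = absmax / levels
    return [round(x / scale) * scale for x in block]


def quantize_centered(block, levels=LEVELS):
    """Subtract the block mean, quantize the residual, then add the mean back."""
    mean = sum(block) / len(block)  # assumption: the mean is stored in high precision
    residual = [x - mean for x in block]
    return [q + mean for q in quantize_block(residual, levels)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# A block dominated by a common offset (mean 5.0) with small variation around it.
random.seed(0)
block = [5.0 + random.gauss(0.0, 0.1) for _ in range(32)]
err_plain = mse(block, quantize_block(block))
err_centered = mse(block, quantize_centered(block))
print(f"plain: {err_plain:.6f}  centered: {err_centered:.6f}")
```

With the offset removed, the scale is set by the spread of the residual rather than by the offset, so the same number of levels resolves the variation far more finely.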

2. LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

ArXiv ID: 2603.10899

Authors: Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon

Abstract: Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future", in which a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long-context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.

Comment: KV cache eviction with learned importance prediction without draft generation — matches Model Compression and Efficiency: cache/memory optimization for LLM inference.

Relevance: 10 Novelty: 8
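
Illustrative sketch (not the LookaheadKV implementation): score-based eviction, as described in the abstract, reduces to keeping the cached positions with the highest estimated importance under a fixed budget. The function names and the example scores below are hypothetical; in the paper the scores come from learned modules, which are not reproduced here.

```python
def select_kv_to_keep(importance, budget):
    """Return the positions to keep: the `budget` highest-scoring ones, in original order."""
    if budget >= len(importance):
        return list(range(len(importance)))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])


def evict(keys, values, importance, budget):
    """Drop every cached key/value pair whose position is not selected."""
    keep = select_kv_to_keep(importance, budget)
    return [keys[i] for i in keep], [values[i] for i in keep]


# Hypothetical per-position importance scores for a six-token prompt.
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.8]
kept = select_kv_to_keep(scores, budget=3)
keys, values = evict(list("abcdef"), list("ABCDEF"), scores, budget=3)
print(kept, keys, values)
```

Positions are returned in their original order so that the surviving entries stay aligned with their positional information.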

3. Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias

ArXiv ID: 2603.10123

Authors: Borun D Chowdhury

Abstract: The "Lost in the Middle" phenomenon -- a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle -- is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: the U-shape is already present at initialization, before any training or positional encoding takes effect. It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Cesàro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated $\mathcal{O}(1)$ anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order $\mathcal{O}(1/(H-1)!)$, where $H$ is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step 0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.

Comment: Exact theory of transformer position bias at initialization — matches Model Architecture: analysis/innovations on transformers and training dynamics.

Relevance: 10 Novelty: 8
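
Illustrative sketch (a toy model, not the paper's derivation): uniform causal attention is the Cesàro matrix, whose row i averages positions 0..i. The code below assumes each layer is an equal-weight mix of the identity (residual path) and that matrix, then reads off how much each input position contributes to the final token after `depth` layers. The 0.5/0.5 mixing and the function names are assumptions made for illustration only.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def influence_on_last_token(n_tokens, depth):
    """Weight of every input position in the final token's output after `depth` layers."""
    # One layer: 0.5 * identity (residual) + 0.5 * Cesàro matrix (uniform causal attention).
    layer = [[(0.5 if i == j else 0.0) + (0.5 / (i + 1) if j <= i else 0.0)
              for j in range(n_tokens)]
             for i in range(n_tokens)]
    total = [[1.0 if i == j else 0.0 for j in range(n_tokens)] for i in range(n_tokens)]
    for _ in range(depth):
        total = matmul(layer, total)
    return total[-1]


row = influence_on_last_token(n_tokens=16, depth=4)
mid = len(row) // 2
print(f"first: {row[0]:.4f}  middle: {row[mid]:.4f}  last: {row[-1]:.4f}")
```

In this toy model the first and last positions both outweigh the middle one: repeated causal averaging piles weight onto early tokens, while the residual path preserves a spike at the final token.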

4. ConFu: Contemplate the Future for Better Speculative Sampling

ArXiv ID: 2603.08899

Authors: Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun

Abstract: Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8-11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.

Comment: Speculative decoding with contemplate tokens and MoE gating to boost acceptance — matches Model Compression and Efficiency and Mixture-of-Experts.

Relevance: 10 Novelty: 8
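
Illustrative sketch (generic greedy speculative decoding, not ConFu itself): the draft-then-verify loop that the abstract builds on can be written in a few lines. The toy `draft_next` and `target_next` callables are hypothetical stand-ins for real models, and a real system would verify all proposed tokens in one parallel target pass rather than one call per token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step; returns the tokens appended to `prefix`."""
    # The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # The target model checks each proposed token against its own greedy choice.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all accepted: the target adds one bonus token
    return accepted


def target_next(ctx):
    """Toy target model: always counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


out_drift = speculative_step([0], draft_next, target_next, k=4)
out_exact = speculative_step([0], target_next, target_next, k=4)
print(out_drift, out_exact)
```

The drifting draft yields three accepted tokens plus one correction, while a perfect draft yields k accepted tokens plus a bonus token, which is why draft quality drives the speedup.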

5. Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

ArXiv ID: 2603.10379

Authors: Junzhuo Li, Peijie Jiang, Changxin Tian, Jia Liu, Zhiqiang Zhang, Xuming Hu

Abstract: This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.

Comment: Model Architecture/Efficiency: MoE scaling law optimizing expert vs. attention FLOPs; explicit formula for optimal compute allocation under sparsity.

Relevance: 10 Novelty: 8

6. Leech Lattice Vector Quantization for Efficient LLM Compression

ArXiv ID: 2603.11021

Authors: Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough, Markus Nagel

Abstract: Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. While vector quantization (VQ) overcomes these limits by encoding blocks of parameters jointly, practical implementations must avoid the need for expensive lookup mechanisms or other explicit codebook storage. Lattice approaches address this through highly structured and dense packing. This paper explores the Leech lattice, which, with its optimal sphere packing and kissing configurations at 24 dimensions, is the highest dimensional lattice known with such optimal properties. To make the Leech lattice usable for LLM quantization, we extend an existing search algorithm based on the extended Golay code construction to i) support indexing, enabling conversion to and from bitstrings without materializing the codebook, ii) allow angular search over a union of Leech lattice shells, and iii) propose a fully parallelisable dequantization kernel. Together this yields a practical algorithm, namely Leech Lattice Vector Quantization (LLVQ). LLVQ delivers state-of-the-art LLM quantization performance, outperforming recent methods such as QuIP#, QTIP, and PVQ. These results highlight the importance of high-dimensional lattices for scalable, theoretically grounded model compression.

Comment: Model compression and efficiency: high-dimensional Leech lattice vector quantization with codebook-free indexing and parallel dequantization.

Relevance: 10 Novelty: 8

7. MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios

ArXiv ID: 2603.09983

Authors: Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye

Abstract: Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low-information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE-SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in TPS over the SOTA SD-based baseline, and an average 4.04x speedup over all standard baselines. Code is available at https://github.com/lshAlgorithm/MoE-SpAc .

Comment: HPC/efficiency for MoE: speculative decoding as lookahead for memory management with dynamic partitioning and async prefetch/eviction.

Relevance: 10 Novelty: 8

8. Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors

ArXiv ID: 2603.10935

Authors: Zegu Zhang, Jian Zhang

Abstract: Variational autoencoders (VAEs) frequently suffer from posterior collapse, where latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work has characterized this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. In this paper, we propose a fundamentally different approach: instead of avoiding collapse through architectural constraints or hyperparameter tuning, we eliminate the possibility of collapse altogether by leveraging the multiplicity of Gaussian mixture model (GMM) clusterings. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors through alternating optimization and selection. The key insight is that models trained to satisfy multiple distinct clustering constraints develop a historical barrier -- a region in parameter space that remains stable even when subsequently trained with a single objective. We prove that this barrier excludes the collapsed solution, and demonstrate through extensive experiments on synthetic and real-world datasets that our method achieves non-collapsed representations regardless of decoder variance or regularization strength. Our approach requires no explicit stability conditions (e.g., $\sigma^{\prime 2} < \lambda_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/historical-consensus-vae.

Comment: Representation Learning: proposes iterative selection of Gaussian mixture priors for VAEs to provably avoid posterior collapse across architectures.