Personalized Daily ArXiv Papers 2025-10-08

[gpt-5]	Prompt	Completion	Total
Token	66428	59725	126153
Cost	$0.08	$0.6	$0.68

Total arXiv papers: 640

Total scanned papers: 401

Total relevant papers: 41

Table of contents with paper titles:

Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density Authors: Randall Balestriero, Nicolas Ballas, Mike Rabbat, Yann LeCun
Critical attention scaling in long-context transformers Authors: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet
vAttention: Verified Sparse Attention Authors: Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates Authors: Alex Iacob, Andrej Jovanovic, Mher Safaryan, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane
PatternKV: Flattening KV Representation Expands Quantization Headroom Authors: Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction Authors: Utkarsh Saxena, Kaushik Roy
Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM Authors: Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning Authors: Dmitriy Shopkhoev, Denis Makhov, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving Authors: Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang
Exact Causal Attention with 10% Fewer Operations Authors: Dmitry Rybin, Yushun Zhang, Ding Tian, Zhihang Lin, Ruoyu Sun, Zhi-Quan Luo
ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization Authors: Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang
Training Dynamics Impact Post-Training Quantization Robustness Authors: Albert Catalan-Tatjer, Niccolò Ajroldi, Jonas Geiping
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting Authors: Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai
Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning Authors: Andrew Ly, Pulin Gong
Computing frustration and near-monotonicity in deep neural networks Authors: Joel Wendin, Erik G. Larsson, Claudio Altafini
Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods Authors: Martin Benfeghoul, Teresa Delgado, Adnan Oomerjee, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval Authors: Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, Hao Peng
On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond Authors: Chenxiao Yang, Cai Zhou, David Wipf, Zhiyuan Li
Fundamental Limits of Crystalline Equivariant Graph Neural Networks: A Circuit Complexity Perspective Authors: Yang Cao, Zhao Song, Jiahao Zhang, Jiale Zhao
OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training Authors: Hongpei Li, Han Zhang, Huikang Liu, Dongdong Ge, Yinyu Ye
Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding Authors: Shrenik Bhansali, Larry Heck
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing Authors: Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models Authors: Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona
Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis Authors: Joachim Diederich
ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics Authors: Luke Thompson, Davy Guan, Dai Shi, Slade Matthews, Junbin Gao, Andi Han
Approximate Gaussianity Beyond Initialisation in Neural Networks Authors: Edward Hirst, Sanjaye Ramgoolam
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices Authors: Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee
On the Theory of Continual Learning with Gradient Descent for Neural Networks Authors: Hossein Taheri, Avishek Ghosh, Arya Mazumdar
Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime Authors: Andreas Maurer, Erfan Mirzaei, Massimiliano Pontil
AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning Authors: Yurun Song, Zhuoyi Yang, Ian G. Harris, Sangeetha Abdu Jyothi
Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics Authors: Christopher Hoang, Mengye Ren
Revisiting Long-context Modeling from Context Denoising Perspective Authors: Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
Improved High-probability Convergence Guarantees of Decentralized SGD Authors: Aleksandar Armacki, Ali H. Sayed
From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs Authors: Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia
Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models Authors: David Debot, Giuseppe Marra
Analyzing the Effect of Embedding Norms and Singular Values to Oversmoothing in Graph Neural Networks Authors: Dimitrios Kelesis, Dimitris Fotakis, Georgios Paliouras
MixReasoning: Switching Modes to Think Authors: Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang
Probing the Difficulty Perception Mechanism of Large Language Models Authors: Sunbowen Lee, Qingyu Yin, Chak Tou Leong, Jialiang Zhang, Yicheng Gong, Xiaoyu Shen
Latent Speech-Text Transformer Authors: Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
Scalable In-context Ranking with Generative Models Authors: Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits Authors: Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

1. Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density

ArXiv ID: 2510.05949

Authors: Randall Balestriero, Nicolas Ballas, Mike Rabbat, Yann LeCun

Abstract: Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample $x$ efficiently and in closed-form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and ImageNet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as JEPA-SCORE.

Comment: Author match

2. Critical attention scaling in long-context transformers

ArXiv ID: 2510.05554

Authors: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

Abstract: As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.

Comment: Strong match to Model Architecture and Representation Learning: rigorous theory of attention scaling in long-context Transformers, identifying critical β_n ≍ log n to prevent rank-collapse.

Relevance: 10 Novelty: 9
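
Sketch (illustrative, not from the paper): a toy, pure-Python look at why a fixed score scale lets attention weights flatten as $n$ grows, while a scale proportional to $\log n$ keeps the peak weight from vanishing. The score distribution, the single exactly matching key, and the constant 2 in front of $\log n$ are assumptions chosen for the demo; this is not the simplified model the authors analyze.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(n, beta, seed=0):
    """Largest softmax weight for one query over n keys, with scores scaled by beta.

    Assumed toy setup: n - 1 keys have similarity drawn from [-1, 0.5],
    and one key matches the query exactly (similarity 1).
    """
    rng = random.Random(seed)
    scores = [rng.uniform(-1.0, 0.5) for _ in range(n - 1)] + [1.0]
    return max(softmax([beta * s for s in scores]))

for n in (100, 10_000):
    fixed = max_attention_weight(n, beta=1.0)                 # constant scale
    scaled = max_attention_weight(n, beta=2.0 * math.log(n))  # log-n scale
    print(n, fixed, scaled)
```

With the constant scale, the weight on the matching key shrinks roughly like 1/n; with the log-n scale it stays bounded away from zero in this toy setup.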

3. vAttention: Verified Sparse Attention

ArXiv ID: 2510.05688

Authors: Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Abstract: State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.

Comment: Sparse Attention with guarantees: unified top-k and sampling providing user-specified (epsilon, delta) accuracy with strong efficiency gains

Relevance: 10 Novelty: 9
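
Sketch (illustrative, not the vAttention algorithm): one simple way to combine exact top-$k$ terms with a uniformly sampled, rescaled estimate of the remaining terms, for a single query with scalar values. The function names, the scalar-value simplification, and the fixed sample size are assumptions; the sketch does not choose the sample budget from an $(\epsilon, \delta)$ target, which is the part that gives the paper its guarantees.

```python
import math
import random

def full_attention(scores, values):
    """Exact softmax-weighted average of scalar values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def hybrid_attention(scores, values, k, num_samples, rng):
    """Exact top-k terms plus a scaled uniform sample of the remaining terms."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sorting every score is for clarity only; a real system would use an
    # approximate top-k so that it never has to touch all keys.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top, rest = order[:k], order[k:]
    m = scores[top[0]]
    num = sum(math.exp(scores[i] - m) * values[i] for i in top)
    den = sum(math.exp(scores[i] - m) for i in top)
    if rest and num_samples > 0:
        sample = rng.sample(rest, min(num_samples, len(rest)))
        scale = len(rest) / len(sample)  # each sampled token stands in for `scale` tokens
        num += scale * sum(math.exp(scores[i] - m) * values[i] for i in sample)
        den += scale * sum(math.exp(scores[i] - m) for i in sample)
    return num / den

rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(200)]
values = [rng.gauss(0.0, 1.0) for _ in range(200)]
exact = full_attention(scores, values)
approx = hybrid_attention(scores, values, k=8, num_samples=32, rng=rng)
print(exact, approx)
```

When the sample covers every remaining token, the estimate reduces to full attention, which is a convenient sanity check for this kind of estimator.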

4. MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates

ArXiv ID: 2510.05361

Authors: Alex Iacob, Andrej Jovanovic, Mher Safaryan, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane

Abstract: Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP. We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees. Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.

Comment: HPC/Distributed Training: multi-timescale adaptive optimizers with local updates reduce communication, with convergence guarantees.

Relevance: 10 Novelty: 8
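
Sketch (illustrative, not from the paper): a small calculation of the time-scale mismatch the abstract describes. For an exponential-moving-average momentum with decay `beta`, the weight left on a gradient seen `steps` updates ago is proportional to `beta ** steps`. The decay rates and the synchronization interval below are hypothetical values chosen only to show the orders of magnitude.

```python
import math

def ema_half_life(beta):
    """Number of steps after which a gradient's weight in the EMA has halved."""
    return math.log(0.5) / math.log(beta)

def weight_after(beta, steps):
    """Fraction of its initial weight that a gradient retains after `steps` updates."""
    return beta ** steps

sync_interval = 128  # hypothetical number of local steps between communications
for beta in (0.9, 0.99, 0.999):  # hypothetical momentum decay rates
    print(beta, ema_half_life(beta), weight_after(beta, sync_interval))
```

A momentum with `beta = 0.9` forgets a gradient within a few steps, far sooner than a synchronization interval of this length, whereas `beta = 0.999` still retains most of it; tracking several such time scales at once is the idea the abstract attributes to MT-DAO.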

5. PatternKV: Flattening KV Representation Expands Quantization Headroom

ArXiv ID: 2510.05176

Authors: Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Abstract: KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.

Comment: Model Compression and Efficiency: proposes a pattern-aligned residual quantization scheme for KV-cache to flatten distributions and enable low-bit inference with less memory/bandwidth.

Relevance: 10 Novelty: 8
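
Sketch (illustrative, not the PatternKV implementation): the align-then-quantize-the-residual idea on a single vector. The patterns here are fixed random vectors rather than mined online, and the per-vector min-max quantizer, the helper names, and the noise level are all assumptions made for the demo.

```python
import random

def quantize(vec, bits):
    """Uniform min-max quantization of one vector, returned already dequantized."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((x - lo) / step) * step for x in vec]

def nearest_pattern(vec, patterns):
    """Pattern with the smallest squared Euclidean distance to vec."""
    return min(patterns, key=lambda p: sum((a - b) ** 2 for a, b in zip(vec, p)))

def residual_quantize(vec, patterns, bits):
    """Quantize only the residual to the nearest pattern, then add the pattern back."""
    pattern = nearest_pattern(vec, patterns)
    residual = [a - b for a, b in zip(vec, pattern)]
    return [b + r for b, r in zip(pattern, quantize(residual, bits))]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
dim = 64
patterns = [[rng.uniform(-4.0, 4.0) for _ in range(dim)] for _ in range(2)]
vec = [p + rng.gauss(0.0, 0.1) for p in patterns[1]]  # a vector close to pattern 1

direct_err = mse(vec, quantize(vec, bits=2))
residual_err = mse(vec, residual_quantize(vec, patterns, bits=2))
print(direct_err, residual_err)
```

Because the residual spans a much narrower range than the raw vector, the same 2-bit grid resolves it far more finely, which is the flattening effect the abstract refers to.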

6. KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction

ArXiv ID: 2510.05373

Authors: Utkarsh Saxena, Kaushik Roy

Abstract: Quantizing the key-value (KV) cache is a promising strategy for improving the inference efficiency of large language models (LLMs). However, aggressive quantization to very low precision (e.g., 2 bits) introduces significant errors in the stored key and value tensors, which propagate through the dot-product attention mechanism and ultimately degrade generation quality. To address this, we propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization in the extreme low-precision regime. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters that explicitly compensate for errors introduced by quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3 model families, KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression. Furthermore, we implement a custom attention kernel that results in up to 2.55x faster inference compared to a Flash Attention baseline, enabling efficient long-context LLM inference.

Comment: Strong match to Compression/Efficiency: KV-cache quantization to very low precision with Hadamard rotation and linear correction plus a fast attention kernel for efficient long-context inference.

Relevance: 10 Novelty: 8
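
Sketch (illustrative, not the KVLinC implementation): why an orthonormal Hadamard rotation can help low-bit quantization when a few channels carry outliers. Only the rotation is shown; the linear correction adapters are not. The outlier values, the per-vector min-max quantizer, and the helper names are assumptions for the demo.

```python
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two.

    The orthonormal Sylvester-Hadamard matrix is symmetric and its own inverse,
    so applying fwht twice returns the input.
    """
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [x * scale for x in out]

def quantize(vec, bits):
    """Uniform min-max quantization of one vector, returned already dequantized."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((x - lo) / step) * step for x in vec]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
vec = [rng.gauss(0.0, 1.0) for _ in range(64)]
vec[0], vec[1] = 10.0, -10.0  # two hypothetical outlier channels

direct_err = mse(vec, quantize(vec, bits=2))
rotated_err = mse(vec, fwht(quantize(fwht(vec), bits=2)))  # rotate, quantize, rotate back
print(direct_err, rotated_err)
```

The rotation spreads the outlier energy across all channels, so the min-max range shrinks and the 2-bit grid becomes finer for the bulk of the entries.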

7. Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

ArXiv ID: 2510.05544

Authors: Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang

Abstract: Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and alternating least-squares implementation. We apply PGSVD to both LLM and VLM, showing better accuracy at the same compression levels and inference speedup.

Comment: Strong match to Compression/Efficiency: activation-informed theoretical bounds and Pareto-guided low-rank rank selection (PGSVD) for zero-shot LLM/VLM compression.

Relevance: 10 Novelty: 8

8. COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

ArXiv ID: 2509.22075

Authors: Dmitriy Shopkhoev, Denis Makhov, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis

Abstract: Post-training compression of large language models (LLMs) largely relies on low-rank weight approximation, which represents each column of a weight matrix in a shared low-dimensional subspace. While this is a computationally efficient strategy, the imposed structural constraint is rigid and can lead to a noticeable model accuracy drop. In this work, we propose CoSpaDi (Compression via Sparse Dictionary Learning), a novel training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization in which each weight matrix is represented with a dense dictionary and a column-sparse coefficient matrix. This formulation enables a union-of-subspaces representation: different columns of the original weight matrix are approximated in distinct subspaces spanned by adaptively selected dictionary atoms, offering greater expressiveness than a single invariant basis. Crucially, CoSpaDi leverages a small calibration dataset to optimize the factorization such that the output activations of compressed projection layers closely match those of the original ones, thereby minimizing functional reconstruction error rather than mere weight approximation. This data-aware strategy preserves better model fidelity without any fine-tuning under reasonable compression ratios. Moreover, the resulting structured sparsity allows efficient sparse-dense matrix multiplication and is compatible with post-training quantization for further memory and latency gains. We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50% compression ratios, demonstrating consistent superiority over state-of-the-art data-aware low-rank methods both in accuracy and perplexity. Our results establish structured sparse dictionary learning as a powerful alternative to conventional low-rank approaches for efficient LLM deployment.

Comment: Model Compression and Efficiency: training-free sparse dictionary factorization guided by calibration to compress LLMs; structured sparsity compatible with quantization and efficient sparse-dense ops.

Relevance: 10 Novelty: 8

9. Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving

ArXiv ID: 2510.05245

Authors: Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang

Abstract: As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE) architecture has emerged as a prevailing design for achieving state-of-the-art performance across a wide range of tasks. MoE models use sparse gating to activate only a handful of expert sub-networks per input, achieving billion-parameter capacity with inference costs akin to much smaller models. However, such models often pose challenges for hardware deployment due to the massive data volume introduced by the MoE layers. To address the challenges of serving MoE models, we propose Stratum, a system-hardware co-design approach that combines the novel memory technology Monolithic 3D-Stackable DRAM (Mono3D DRAM), near-memory processing (NMP), and GPU acceleration. The logic and Mono3D DRAM dies are connected through hybrid bonding, whereas the Mono3D DRAM stack and GPU are interconnected via silicon interposer. Mono3D DRAM offers higher internal bandwidth than HBM thanks to the dense vertical interconnect pitch enabled by its monolithic structure, which supports implementations of higher-performance near-memory processing. Furthermore, we tackle the latency differences introduced by aggressive vertical scaling of Mono3D DRAM along the z-dimension by constructing internal memory tiers and assigning data across layers based on access likelihood, guided by topic-based expert usage prediction to boost NMP throughput. The Stratum system achieves up to 8.29x improvement in decoding throughput and 7.66x better energy efficiency across various benchmarks compared to GPU baselines.

Comment: High Performance Computing: system–hardware co-design (Mono3D DRAM + NMP) for MoE serving with tiered memory and expert-usage prediction.

Relevance: 10 Novelty: 8

10. Exact Causal Attention with 10% Fewer Operations

ArXiv ID: 2510.05175

Authors: Dmitry Rybin, Yushun Zhang, Ding Tian, Zhihang Lin, Ruoyu Sun, Zhi-Quan Luo

Abstract: We present Fast Causal Attention (FCA), an algorithm that computes exact Causal Attention using 10% fewer operations. FCA accelerates a special class of matrix multiplications where either one operand or the output matrix is upper- or lower-triangular. This includes all operations in the forward and backward passes of Causal Attention, such as the masked product $\mathrm{Mask}(QK^{T})$. For these matrix multiplications on GPU, FCA reaches noticeable accelerations over the default PyTorch implementations and Triton compiled kernels. FCA is built upon algebraic identities discovered via machine learning and combinatorial search.

Comment: Compression/Efficiency/HPC: exact causal attention with ~10% fewer operations via new masked matmul identities and GPU-optimized kernels.

Relevance: 10 Novelty: 8

11. ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

ArXiv ID: 2510.05528

Authors: Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang

Abstract: Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.

Comment: Model Compression and Efficiency: semi-structured 2:4 pruning via adaptive matrix factorization with block-diagonal wrappers

Relevance: 10 Novelty: 8
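
Sketch (illustrative, not ARMOR): the 2:4 sparsity pattern itself, using plain magnitude pruning on one weight row. This is the kind of conventional baseline the abstract contrasts with; the function name and the magnitude criterion are assumptions, and ARMOR's block-diagonal wrappers and block coordinate descent are not shown.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

row = [0.1, -0.9, 0.4, 0.2, 1.5, -0.3, 0.05, -2.0]
print(prune_2_of_4(row))
```

Every group of four consecutive weights ends up with exactly two nonzeros, which is the structure that sparse tensor hardware can accelerate.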

12. Training Dynamics Impact Post-Training Quantization Robustness

ArXiv ID: 2510.06213

Authors: Albert Catalan-Tatjer, Niccolò Ajroldi, Jonas Geiping

Abstract: While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.

Comment: Compression/Efficiency: analysis of post-training quantization robustness tied to training dynamics and hyperparameters in LLMs.

Relevance: 10 Novelty: 7

13. Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting

ArXiv ID: 2510.05497

Authors: Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai

Abstract: Large Language Models (LLMs) with Mixture of Experts (MoE) architectures achieve remarkable performance improvements, but their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit serving systems. To forecast the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across three state-of-the-art large-scale MoE models (200B-671B) using over 24,000 requests spanning diverse workloads. With the resulting 150GB+ trace files, we perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Taking wafer-scale GPUs as a case study, we demonstrate that minor architectural modifications leveraging our insights achieve substantial performance gains, delivering 6.3X and 4.0X average speedups on DeepSeek V3 and Qwen3, respectively. Our work provides the first comprehensive data-centric analysis of MoE models at scale. Our profiling traces and analysis results are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace. We will also release our simulation framework shortly to facilitate future research in this area.

Comment: High Performance Computing: data-movement-centric profiling and forecasting for large-scale MoE serving; informs system design (e.g., wafer-scale GPUs).

Relevance: 10 Novelty: 7

14. Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning

ArXiv ID: 2510.05606

Authors: Andrew Ly, Pulin Gong

Abstract: Fundamental limits to predictability are central to our understanding of many physical and computational systems. Here we show that, despite its remarkable capabilities, deep learning exhibits such fundamental limits rooted in the fractal, riddled geometry of its basins of attraction: any initialization that leads to one solution lies arbitrarily close to another that leads to a different one. We derive sufficient conditions for the emergence of riddled basins by analytically linking features widely observed in deep learning, including chaotic learning dynamics and symmetry-induced invariant subspaces, to reveal a general route to riddling in realistic deep networks. The resulting basins of attraction possess an infinitely fine-scale fractal structure characterized by an uncertainty exponent near zero, so that even large increases in the precision of initial conditions yield only marginal gains in outcome predictability. Riddling thus imposes a fundamental limit on the predictability and hence reproducibility of neural network training, providing a unified account of many empirical observations. These results reveal a general organizing principle of deep learning with important implications for optimization and the safe deployment of artificial intelligence.

Comment: Training Dynamics Theory: links chaotic dynamics and symmetry-induced invariant subspaces to riddled basins, revealing limits to predictability.

Relevance: 9 Novelty: 8

15. Computing frustration and near-monotonicity in deep neural networks

ArXiv ID: 2510.05286

Authors: Joel Wendin, Erik G. Larsson, Claudio Altafini

Abstract: For the signed graph associated to a deep neural network, one can compute the frustration level, i.e., test how close or distant the graph is to structural balance. For all the pretrained deep convolutional neural networks we consider, we find that the frustration is always less than expected from null models. From a statistical physics point of view, and in particular in reference to an Ising spin glass model, the reduced frustration indicates that the amount of disorder encoded in the network is less than in the null models. From a functional point of view, low frustration (i.e., proximity to structural balance) means that the function representing the network behaves near-monotonically, i.e., more similarly to a monotone function than in the null models. Evidence of near-monotonic behavior along the partial order determined by frustration is observed for all networks we consider. This confirms that the class of deep convolutional neural networks tends to have a more ordered behavior than expected from null models, and suggests a novel form of implicit regularization.

Comment: Representation Learning: analyzes trained DNNs via signed-graph frustration to reveal near-monotonic structure and implicit regularization.

Relevance: 9 Novelty: 8
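
Sketch (illustrative, not the authors' method): what the frustration of a signed graph measures, computed by brute force on tiny graphs. An edge is violated when its sign disagrees with the product of the +/-1 labels on its endpoints, and the frustration is the minimum number of violated edges over all labellings. The unweighted edge count and the function name are assumptions; graphs derived from real networks are far too large for exhaustive search.

```python
from itertools import product

def frustration(num_nodes, signed_edges):
    """Minimum number of edges violated by any +/-1 node labelling (brute force)."""
    best = len(signed_edges)
    for spins in product((1, -1), repeat=num_nodes):
        violated = sum(
            1 for i, j, sign in signed_edges if sign * spins[i] * spins[j] < 0
        )
        best = min(best, violated)
    return best

# Triangle with two negative edges: structurally balanced.
balanced = [(0, 1, -1), (1, 2, -1), (0, 2, +1)]
# Triangle with a single negative edge: no labelling satisfies all three edges.
unbalanced = [(0, 1, +1), (1, 2, +1), (0, 2, -1)]
print(frustration(3, balanced), frustration(3, unbalanced))
```

A frustration of zero means the graph is structurally balanced; larger values indicate more disorder in the sense the abstract borrows from spin-glass models.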

16. Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods

ArXiv ID: 2510.05901

Authors: Martin Benfeghoul, Teresa Delgado, Adnan Oomerjee, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas

Abstract: Transformers' quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on SWA. Component-level diagnostics reveal this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, combining attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.

Comment: Strong match to Model Architecture and Efficiency: analyzes hybrid linear-attention conversions and proposes methods (e.g., SSD, HedgeCATs) to ensure genuine linear attention usage post-conversion.

Relevance: 9 Novelty: 8
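
The idea behind Scheduled Sliding-window Dropout, as described in the abstract, is to stochastically suppress the softmax branch during training so the linear branch cannot be bypassed. The sketch below is one hypothetical reading of that idea: the linear decay schedule, the additive combination of branches, and all function names are assumptions, not details from the paper.

```python
import random


def swa_drop_prob(step, total_steps, p_start=0.9, p_end=0.0):
    """Hypothetical schedule: linearly decay the drop probability over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def hybrid_output(linear_out, swa_out, drop_p, rng, training=True):
    """Combine the two branches; during training, drop the sliding-window
    softmax branch with probability drop_p (additive mixing is an assumption)."""
    if training and rng.random() < drop_p:
        return list(linear_out)  # softmax branch suppressed
    return [a + b for a, b in zip(linear_out, swa_out)]


rng = random.Random(0)
for step in (0, 500, 1000):
    p = swa_drop_prob(step, 1000)
    print(step, round(p, 2), hybrid_output([1.0, 2.0], [0.5, 0.5], p, rng))
```

At inference time (`training=False`) both branches are always used, so the dropout only affects what the linear component is forced to learn.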

17. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

ArXiv ID: 2510.05381

Authors: Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, Hao Peng

Abstract: Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This gap is commonly attributed to retrieval failures -- the models' inability to identify relevant information in the long inputs. Accordingly, recent efforts often focus on evaluating and improving LLMs' retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one -- or should it? This paper presents findings that the answer to this question may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%--85%) as input length increases, even while it remains well within the models' claimed lengths. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace, and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously-unrealized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. They motivate our simple, model-agnostic mitigation strategy that transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement of GPT-4o up to 4% on an already strong baseline.

Comment: Training dynamics/representation insight: shows long-context length alone degrades LLM performance independent of retrieval; proposes a simple mitigation to reduce effective context.

Relevance: 9 Novelty: 8
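
The mitigation described in the abstract, prompting the model to recite the retrieved evidence before solving, can be sketched as a prompt template. The wording and the function name below are hypothetical; the paper's actual prompt is not reproduced here.

```python
def recite_then_solve_prompt(long_context: str, question: str) -> str:
    """Build a prompt that asks for evidence recitation before the answer.

    The instruction text is an illustrative assumption, not the paper's prompt.
    """
    return (
        f"{long_context}\n\n"
        f"Question: {question}\n\n"
        "First, recite verbatim the passages from the context above that are "
        "needed to answer the question. Then, using only the recited passages, "
        "solve the problem step by step and state the final answer."
    )


prompt = recite_then_solve_prompt("...long document text...", "What is x?")
print(prompt)
```

The recited passages end up adjacent to the reasoning in the model's own output, which is how a long-context task is turned into an effectively short-context one.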

18. On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond

ArXiv ID: 2510.06190

Authors: Chenxiao Yang, Cai Zhou, David Wipf, Zhiyuan Li

Abstract: This paper formally studies generation processes, including auto-regressive next-token prediction and masked diffusion, that abstract beyond architectural specifics. At this level of abstraction, we quantify their benefits and limitations through measurable criteria such as computational hardness and learnability. In particular, we demonstrate that allowing generation to proceed beyond autoregression and current masked diffusion, with capabilities to rewrite and length-variable edit, can bring significant theoretical and empirical advantages, with important implications for frontier LLMs that aspire to tackle increasingly hard problems and work universally across domains beyond natural language, such as coding and science.

Comment: Foundational generation paradigm analysis: formal study beyond autoregression/diffusion with rewrite/edit capabilities and associated learnability/hardness results.

Relevance: 9 Novelty: 8

19. Fundamental Limits of Crystalline Equivariant Graph Neural Networks: A Circuit Complexity Perspective

ArXiv ID: 2510.05494

Authors: Yang Cao, Zhao Song, Jiahao Zhang, Jiale Zhao

Abstract: Graph neural networks (GNNs) have become a core paradigm for learning on relational data. In materials science, equivariant GNNs (EGNNs) have emerged as a compelling backbone for crystalline-structure prediction, owing to their ability to respect Euclidean symmetries and periodic boundary conditions. Despite strong empirical performance, their expressive power in periodic, symmetry-constrained settings remains poorly understood. This work characterizes the intrinsic computational and expressive limits of EGNNs for crystalline-structure prediction through a circuit-complexity lens. We analyze the computations carried out by EGNN layers acting on node features, atomic coordinates, and lattice matrices, and prove that, under polynomial precision, embedding width $d=O(n)$ for $n$ nodes, $O(1)$ layers, and $O(1)$-depth, $O(n)$-width MLP instantiations of the message/update/readout maps, these models admit a simulation by a uniform $\mathsf{TC}^0$ threshold-circuit family of polynomial size (with an explicit constant-depth bound). Situating EGNNs within $\mathsf{TC}^0$ provides a concrete ceiling on the decision and prediction problems solvable by such architectures under realistic resource constraints and clarifies which architectural modifications (e.g., increased depth, richer geometric primitives, or wider layers) are required to transcend this regime. The analysis complements Weisfeiler-Lehman style results that do not directly transfer to periodic crystals, and offers a complexity-theoretic foundation for symmetry-aware graph learning on crystalline systems.

Comment: Model Architecture Theory: circuit-complexity characterization (TC^0) of crystalline equivariant GNNs, clarifying expressive/computational limits under symmetry constraints.

Relevance: 9 Novelty: 8

20. OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training

ArXiv ID: 2510.05186

Authors: Hongpei Li, Han Zhang, Huikang Liu, Dongdong Ge, Yinyu Ye

Abstract: Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. However, despite recent progress in reducing memory consumption through activation offloading, existing approaches remain largely heuristic and coarse-grained, often overlooking the fine-grained trade-offs between memory, computation, and scheduling latency. In this work, we revisit the pipeline scheduling problem from a principled optimization perspective. We observe that prevailing strategies either rely on static rules or aggressively offload activations without fully leveraging the interaction between memory constraints and scheduling efficiency. To address this, we formulate scheduling as a constrained optimization problem that jointly accounts for memory capacity, activation reuse, and pipeline bubble minimization. Solving this model yields fine-grained schedules that reduce pipeline bubbles while adhering to strict memory budgets. Our approach complements existing offloading techniques: whereas prior approaches trade memory for time in a fixed pattern, we dynamically optimize the tradeoff with respect to model structure and hardware configuration. Experimental results demonstrate that our method consistently improves both throughput and memory utilization. In particular, we reduce idle pipeline time by up to 50% under the same per-device memory limit, and in some cases, enable the training of larger models within limited memory budgets.

Comment: High Performance Computing: optimized pipeline-parallel scheduling jointly accounting for memory capacity, activation reuse, and bubble minimization.

Relevance: 9 Novelty: 8
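
For a sense of scale on the "pipeline bubbles" the abstract aims to reduce, the sketch below evaluates the standard idealized bubble fraction for a synchronous pipeline schedule with equal stage times, (p - 1) / (m + p - 1) for p stages and m microbatches. This textbook formula is used here as an assumption for illustration; it is not OptPipe's optimization model, which also accounts for memory and offloading.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idealized idle fraction of a synchronous pipeline schedule,
    assuming every stage takes the same time per microbatch."""
    return (stages - 1) / (microbatches + stages - 1)


# Illustrative configurations (hypothetical, not from the paper).
for m in (4, 8, 32):
    print(f"4 stages, {m} microbatches: {bubble_fraction(4, m):.3f}")
```

More microbatches shrink the bubble but increase the number of activations held in memory, which is the trade-off the paper formulates as a constrained optimization problem.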

21. Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding

ArXiv ID: 2510.05421

Authors: Shrenik Bhansali, Larry Heck

Abstract: Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data/compute cost and can yield brittle drafters under distribution drift. We introduce \emph{Draft, Verify, \& Improve (DVI)}, a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier, and during generation, verifier accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple \emph{KL$\rightarrow$RL} schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single-model deployment. On Spec-Bench, DVI achieves a $2.16\times$ wall-time speedup, on par with SoTA approaches like EAGLE-2, while using orders of magnitude less data for training, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that \emph{training-aware} self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.

Comment: Inference Efficiency: training-aware speculative decoding (self-speculation) with online updates for lossless speedups.

Relevance: 9 Novelty: 8
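
The draft-and-verify loop described in the abstract can be sketched with toy deterministic "models". The sketch uses greedy verification (accept a drafted token only if it equals the verifier's own choice), which makes the output identical to plain verifier decoding. The toy token functions, the block size, and the `feedback` records are all hypothetical; DVI's actual training signal and drafter head are not modeled here.

```python
def verifier_next(context):
    """Toy verifier: deterministic next token from the context (hypothetical)."""
    return (sum(context) * 31 + len(context)) % 7


def drafter_next(context):
    """Toy drafter: agrees with the verifier except when the token would be 3."""
    tok = verifier_next(context)
    return 0 if tok == 3 else tok


def greedy_decode(prompt, n_new):
    """Baseline: plain autoregressive decoding with the verifier."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(verifier_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_new, block=4):
    out = list(prompt)
    feedback = []  # (context, drafted_token, accepted) records
    while len(out) - len(prompt) < n_new:
        # Draft a block of tokens autoregressively with the cheap model.
        draft = []
        for _ in range(block):
            draft.append(drafter_next(out + draft))
        # Verify token by token (a real system scores the block in one pass).
        for tok in draft:
            target = verifier_next(out)
            accepted = tok == target
            feedback.append((tuple(out), tok, accepted))
            out.append(target)  # on rejection the verifier's token is used
            if not accepted or len(out) - len(prompt) >= n_new:
                break
    return out[len(prompt):], feedback


tokens, feedback = speculative_decode([1, 2], 20)
print(tokens == greedy_decode([1, 2], 20))
```

The `feedback` list stands in for the accept/reject decisions that, per the abstract, DVI converts into supervision for the drafter; how that update is performed is outside this sketch.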

22. VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

ArXiv ID: 2510.05213

Authors: Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka

Abstract: Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and codes can be found in https://yixiaowang7.github.io/ver_page/.

Comment: Model architecture: dynamic expert routing (MoE-style) with patchwise routing and curriculum top-K annealing; parameter-efficient fine-tuning of expert library.

Relevance: 9 Novelty: 7
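
Patchwise expert routing with an annealed top-K, as named in the abstract, can be sketched as follows: each image patch has a score per expert, the router keeps the K best experts for that patch, and K shrinks over training. The linear annealing schedule, the score values, and the function names are assumptions for illustration and are not taken from the paper.

```python
def annealed_k(step, total_steps, k_start, k_end):
    """Hypothetical schedule: linearly anneal K from k_start down to k_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))


def route_patches(patch_scores, k):
    """For each patch, return the indices of its k highest-scoring experts."""
    return [
        sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        for scores in patch_scores
    ]


# Two patches, three experts (made-up router scores).
scores = [[0.1, 0.9, 0.3], [0.7, 0.2, 0.5]]
print(route_patches(scores, annealed_k(0, 100, 3, 1)))    # early: many experts
print(route_patches(scores, annealed_k(100, 100, 3, 1)))  # late: one expert
```

Starting with a large K lets every expert receive gradient early on, while the small final K gives the sparse, task-specific selection the abstract describes.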

23. Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

ArXiv ID: 2510.06107

Authors: Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona

Abstract: Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose \textbf{Distributional Semantics Tracing (DST)}, a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the model's layer at which a hallucination becomes inevitable, identifying a specific \textbf{commitment layer} where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret using the lens of dual-process theory: a fast, heuristic \textbf{associative pathway} (akin to System 1) and a slow, deliberate \textbf{contextual pathway} (akin to System 2), leading to predictable failure modes such as \textit{Reasoning Shortcut Hijacks}. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation ($\rho = -0.863$) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.

Comment: Strong match to Representation Learning: proposes a framework to trace internal representations, identifies a commitment layer and dual-pathway mechanism underlying hallucinations in Transformers.

Relevance: 8 Novelty: 8

24. Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis

ArXiv ID: 2510.05106

Authors: Joachim Diederich

Abstract: The design of safety-critical agents based on large language models (LLMs) requires more than simple prompt engineering. This paper presents a comprehensive information-theoretic analysis of how rule encodings in system prompts influence attention mechanisms and compliance behaviour. We demonstrate that rule formats with low syntactic entropy and highly concentrated anchors reduce attention entropy and improve pointer fidelity, but reveal a fundamental trade-off between anchor redundancy and attention entropy that previous work failed to recognize. Through formal analysis of multiple attention architectures including causal, bidirectional, local sparse, kernelized, and cross-attention mechanisms, we establish bounds on pointer fidelity and show how anchor placement strategies must account for competing fidelity and entropy objectives. Combining these insights with a dynamic rule verification architecture, we provide a formal proof that hot reloading of verified rule sets increases the asymptotic probability of compliant outputs. These findings underscore the necessity of principled anchor design and dual enforcement mechanisms to protect LLM-based agents against prompt injection attacks while maintaining compliance in evolving domains.

Comment: Model Architecture Analysis: information-theoretic bounds on attention mechanisms (causal/bidirectional/sparse/kernelized/cross-attention) for rule encoding/compliance.

Relevance: 8 Novelty: 8
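
The "attention entropy" mentioned in the abstract can be illustrated with the Shannon entropy of a single attention distribution: scores concentrated on one anchor token give low entropy, while diffuse scores give high entropy. Treating attention entropy as plain Shannon entropy over softmax weights is an assumption here, and the score values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


diffuse = softmax([0.0, 0.0, 0.0, 0.0])       # no clear anchor
concentrated = softmax([5.0, 0.0, 0.0, 0.0])  # one strong anchor token

print(attention_entropy(diffuse))
print(attention_entropy(concentrated))
```

The diffuse case attains the maximum entropy log(4) for four tokens; the concentrated case is much lower, matching the abstract's point that concentrated anchors reduce attention entropy.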

25. ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics

ArXiv ID: 2510.05482

Authors: Luke Thompson, Davy Guan, Dai Shi, Slade Matthews, Junbin Gao, Andi Han

Abstract: Molecular dynamics (MD) simulations underpin modern computational drug discovery, materials science, and biochemistry. Recent machine learning models provide high-fidelity MD predictions without the need to repeatedly solve quantum mechanical forces, enabling significant speedups over conventional pipelines. Yet many such methods typically enforce strict equivariance and rely on sequential rollouts, thus limiting their flexibility and simulation efficiency. They are also commonly single-task, trained on individual molecules and fixed timeframes, which restricts generalization to unseen compounds and extended timesteps. To address these issues, we propose Atomistic Transformer Operator for Molecules (ATOM), a pretrained transformer neural operator for multitask molecular dynamics. ATOM adopts a quasi-equivariant design that requires no explicit molecular graph and employs a temporal attention mechanism, allowing for the accurate parallel decoding of multiple future states. To support operator pretraining across chemicals and timescales, we curate TG80, a large, diverse, and numerically stable MD dataset with over 2.5 million femtoseconds of trajectories across 80 compounds. ATOM achieves state-of-the-art performance on established single-task benchmarks, such as MD17, RMD17 and MD22. After multitask pretraining on TG80, ATOM shows exceptional zero-shot generalization to unseen molecules across varying time horizons. We believe ATOM represents a significant step toward accurate, efficient, and transferable molecular dynamics models.

Comment: Model Architecture: introduces a transformer neural operator with quasi-equivariance and temporal attention enabling parallel multi-step decoding and cross-molecule operator pretraining.

Relevance: 8 Novelty: 8

26. Approximate Gaussianity Beyond Initialisation in Neural Networks

ArXiv ID: 2510.05218

Authors: Edward Hirst, Sanjaye Ramgoolam

Abstract: Ensembles of neural network weight matrices are studied through the training process for the MNIST classification problem, testing the efficacy of matrix models for representing their distributions, under assumptions of Gaussianity and permutation-symmetry. The general 13-parameter permutation invariant Gaussian matrix models are found to be effective models for the correlated Gaussianity in the weight matrices, beyond the range of applicability of the simple Gaussian with independent identically distributed matrix variables, and notably well beyond the initialisation step. The representation theoretic model parameters, and the graph-theoretic characterisation of the permutation invariant matrix observables give an interpretable framework for the best-fit model and for small departures from Gaussianity. Additionally, the Wasserstein distance is calculated for this class of models and used to quantify the movement of the distributions over training. Throughout the work, the effects of varied initialisation regimes, regularisation, layer depth, and layer width are tested for this formalism, identifying limits where particular departures from Gaussianity are enhanced and how more general, yet still highly-interpretable, models can be developed.

Comment: Representation Learning: analyzes weight distributions during training via permutation-invariant Gaussian matrix models and tracks dynamics with Wasserstein distance.