Personalized Daily ArXiv Papers 2025-10-09

[gpt-5]	Prompt	Completion	Total
Token	53375	55607	108982
Cost	$0.07	$0.56	$0.62

Total arXiv papers: 631

Total scanned papers: 373

Total relevant papers: 29

Table of contents with paper titles:

1. Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin
   Authors: Enrique Queipo-de-Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv
2. Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
   Authors: Tomohiro Hayase, Benoît Collins, Ryo Karakida
3. The Effect of Attention Head Count on Transformer Approximation
   Authors: Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li
4. Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts
   Authors: Fangshuo Liao, Anastasios Kyrillidis
5. The Markovian Thinker
   Authors: Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy
6. SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation
   Authors: Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, Bowen Zhou
7. From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
   Authors: Zheng-An Chen, Tao Luo
8. Artificial Hippocampus Networks for Efficient Long-Context Modeling
   Authors: Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, Lai Wei
9. Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data
   Authors: Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec
10. Efficient numeracy in language models through single-token number embeddings
   Authors: Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten
11. From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining
   Authors: Seng Pei Liew, Takuya Kato
12. Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models
   Authors: Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, Sadao Kurohashi
13. Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture
   Authors: John Dunbar, Scott Aaronson
14. The Effect of Label Noise on the Information Content of Neural Representations
   Authors: Ali Hussaini Umar, Franky Kevin Nando Tezoh, Jean Barbier, Santiago Acevedo, Alessandro Laio
15. Grouped Differential Attention
   Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Wai Ting Cheung, Beomgyu Kim, Taehwan Kim, Haesol Lee, Junhyeok Lee, Dongpin Oh, Eunhwan Park
16. A General Constructive Upper Bound on Shallow Neural Nets Complexity
   Authors: Frantisek Hakl, Vit Fojtik
17. Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation
   Authors: Arjun Krishnakumar, Rhea Sanjay Sukthanker, Hannan Javed Mahadik, Gabriela Kadlecová, Vladyslav Moroshan, Timur Carstensen, Frank Hutter, Aaron Klein
18. Native Hybrid Attention for Efficient Sequence Modeling
   Authors: Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng
19. A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures
   Authors: Nhat M. Hoang, Do Xuan Long, Cong-Duy Nguyen, Min-Yen Kan, Luu Anh Tuan
20. Sharpness-Aware Data Generation for Zero-shot Quantization
   Authors: Dung Hoang-Anh, Cuong Pham Trung Le, Jianfei Cai, Thanh-Toan Do
21. Accelerating Inference for Multilayer Neural Networks with Quantum Computers
   Authors: Arthur G. Rattew, Po-Wei Huang, Naixu Guo, Lirandë Pira, Patrick Rebentrost
22. Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization
   Authors: Eliot Shekhtman, Yichen Zhou, Ingvar Ziemann, Nikolai Matni, Stephen Tu
23. Heptapod: Language Modeling on Visual Signals
   Authors: Yongxin Zhu, Jiawei Chen, Yuanzhe Chen, Zhuo Chen, Dongya Jia, Jian Cong, Xiaobin Zhuang, Yuping Wang, Yuxuan Wang
24. BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression
   Authors: Cristian Meo, Varun Sarathchandran, Avijit Majhi, Shao Hung, Carlo Saccardi, Ruben Imhoff, Roberto Deidda, Remko Uijlenhoet, Justin Dauwels
25. Angular Constraint Embedding via SpherePair Loss for Constrained Clustering
   Authors: Shaojie Zhang, Ke Chen
26. GUIDE: Guided Initialization and Distillation of Embeddings
   Authors: Khoa Trinh, Gaurav Menghani, Erik Vee
27. Chem-NMF: Multi-layer $\alpha$-divergence Non-Negative Matrix Factorization for Cardiorespiratory Disease Clustering, with Improved Convergence Inspired by Chemical Catalysts and Rigorous Asymptotic Analysis
   Authors: Yasaman Torabi, Shahram Shirani, James P. Reilly
28. Cocoon: A System Architecture for Differentially Private Training with Correlated Noises
   Authors: Donghwan Kim, Xin Gu, Jinho Baek, Timothy Lo, Younghoon Min, Kwangsik Shin, Jongryool Kim, Jongse Park, Kiwan Maeng
29. Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
   Authors: Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos

1. Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

ArXiv ID: 2510.06477

Authors: Enrique Queipo-de-Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv

Abstract: Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.

Comment: Author match

2. Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix

ArXiv ID: 2510.06685

Authors: Tomohiro Hayase, Benoît Collins, Ryo Karakida

Abstract: Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko-Pastur law, which has been believed in previous work. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.

Comment: Provides a rigorous random-matrix-theoretic analysis of self-attention spectra, advancing theoretical understanding of Transformer architecture and representation dynamics.

Relevance: 10 Novelty: 9

3. The Effect of Attention Head Count on Transformer Approximation

ArXiv ID: 2510.06662

Authors: Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li

Abstract: Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $O(1/\epsilon^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, where approximation is entirely achieved by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.

Comment: Model Architecture theory: establishes upper and lower bounds on transformer approximation as a function of attention head count, including a first rigorous lower bound in a nonlinear practical setting.

Relevance: 10 Novelty: 9

4. Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts

ArXiv ID: 2510.07205

Authors: Fangshuo Liao, Anastasios Kyrillidis

Abstract: Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or only top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, where the router's learning process is "guided" by the experts, that recovers the teacher's parameters. Moreover, we show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. To our knowledge, our analysis is the first to bring novel insights in understanding the optimization landscape of the MoE architecture.

Comment: Matches Model Architecture (MoE) and Representation Learning: provable joint training dynamics for soft-routed MoE; also includes post-training pruning with convergence guarantees (Model Compression/Efficiency).

Relevance: 10 Novelty: 9

5. The Markovian Thinker

ArXiv ID: 2510.06557

Authors: Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy

Abstract: Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

Comment: High-Performance/Algorithmic Efficiency: redesigns the reasoning environment to a Markovian, constant-state setup enabling linear compute and constant memory for very long thinking.

Relevance: 10 Novelty: 9
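
Note on entry 5 (The Markovian Thinker): the abstract's claim of linear compute with constant memory can be illustrated with a rough counting sketch. It is not the authors' Delethink implementation. It assumes attention compute is proportional to the number of (new token, context token) pairs, ignores the prompt, and uses hypothetical function names and a hypothetical `carryover` parameter for the short state kept at each chunk boundary.

```python
def longcot_attention_cost(n_tokens: int) -> int:
    """Pair count when every new token attends to all earlier thinking tokens."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_attention_cost(n_tokens: int, chunk: int, carryover: int) -> int:
    """Pair count when the context is cut back to `carryover` tokens at each chunk boundary."""
    assert 0 <= carryover < chunk
    cost, produced, context = 0, 0, 0
    while produced < n_tokens:
        context += 1          # the new token joins the context
        cost += context       # and attends to everything currently in context
        produced += 1
        if context == chunk:  # boundary: reset, keeping only the carryover (assumed scheme)
            context = carryover
    return cost


if __name__ == "__main__":
    # Chunk size 8K as in the abstract; the carryover of 512 tokens is an assumption.
    for n in (24_000, 96_000):
        full = longcot_attention_cost(n)
        chunked = chunked_attention_cost(n, chunk=8_000, carryover=512)
        print(n, full / chunked)
```

In this toy model the context never exceeds `chunk` tokens, so memory stays bounded, and the chunked count grows in proportion to the number of chunks, while the full-context count grows quadratically with thinking length.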

6. SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

ArXiv ID: 2510.06303

Authors: Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, Bowen Zhou

Abstract: We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.

Comment: Model Architecture and Efficiency: introduces a hybrid AR–diffusion decoding paradigm enabling blockwise parallel generation and reports scaling across dense and MoE models.

Relevance: 10 Novelty: 9
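
Note on entry 6 (SDAR): the decoding pattern described in the abstract, autoregressive across blocks and parallel within a block, can be sketched with a toy loop. This is an illustration under assumptions, not the SDAR algorithm. The confidence-threshold acceptance rule, the `predict` callback signature, and `toy_predict` are all hypothetical stand-ins for a real model.

```python
MASK = None  # placeholder for a not-yet-decoded position


def decode_block(predict, prefix, block_size, threshold):
    """Fill one block by repeatedly committing confident proposals in parallel."""
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        proposals = predict(prefix, block)  # one (token, confidence) pair per position
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        accept = [i for i in masked if proposals[i][1] >= threshold]
        if not accept:
            # Always make progress: commit the single most confident masked position.
            accept = [max(masked, key=lambda i: proposals[i][1])]
        for i in accept:
            block[i] = proposals[i][0]
        steps += 1
    return block, steps


def generate(predict, prompt, n_blocks, block_size, threshold):
    """Blocks are produced left to right; each block conditions on all earlier tokens."""
    seq = list(prompt)
    for _ in range(n_blocks):
        block, _ = decode_block(predict, seq, block_size, threshold)
        seq.extend(block)
    return seq


def toy_predict(prefix, block):
    """Hypothetical model: proposes the absolute position as the token,
    with confidence that rises as more of the block is filled in."""
    filled = sum(tok is not MASK for tok in block)
    return [(len(prefix) + i, 0.5 + 0.25 * filled) for i in range(len(block))]
```

With a low threshold the toy model commits a whole block in one step; with a high threshold it falls back to one position per step. This mirrors the speed versus caution trade-off that a decoding threshold controls.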

7. From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics

ArXiv ID: 2510.06954

Authors: Zheng-An Chen, Tao Luo

Abstract: Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in [Zhou et al. NeurIPS 2022] to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages. In the first stage, asymmetric weight perturbations from random initialization sustain non-degenerate gradient dynamics in parameter matrices, facilitating systematic escape from small initialization regimes. Subsequently, these matrices undergo condensation, progressively aligning toward the target orientation. In the second stage, the previously static key-query matrices actively participate in training, driving the normalized matrices toward asymptotic rank collapse. This two-stage framework generalizes classical directional convergence results.

Comment: Representation Learning/Training Dynamics: theoretical two-stage analysis of Transformer attention training (condensation then rank collapse) under gradient flow.

Relevance: 10 Novelty: 8

8. Artificial Hippocampus Networks for Efficient Long-Context Modeling

ArXiv ID: 2510.07318

Authors: Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, Lai Wei

Abstract: Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer's KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: https://github.com/ByteDance-Seed/AHN.

Comment: Model Architecture and Efficiency: hybrid memory design combining Transformer KV cache with learnable RNN-like compressive long-term memory (AHN) to cut FLOPs and cache.

Relevance: 10 Novelty: 8
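
Note on entry 8 (Artificial Hippocampus Networks): the memory layout in the abstract, a lossless sliding window plus a fixed-size compressed state for out-of-window tokens, can be sketched as follows. The paper instantiates the compressor with Mamba2, DeltaNet, and Gated DeltaNet; this sketch swaps in a plain outer-product accumulation (a linear-attention-style update) purely for illustration. The class name, its methods, and the equal key and value dimensions are assumptions.

```python
from collections import deque


class WindowWithCompressedMemory:
    """Toy memory: exact recent (key, value) pairs plus a fixed-size summary of older ones."""

    def __init__(self, window: int, dim: int):
        self.window = window
        self.kv = deque()  # short-term memory: the most recent pairs, kept exactly
        self.state = [[0.0] * dim for _ in range(dim)]  # long-term memory: dim x dim, never grows

    def append(self, key, value):
        self.kv.append((key, value))
        if len(self.kv) > self.window:
            old_k, old_v = self.kv.popleft()
            # Fold the evicted pair into the state: state += outer(old_k, old_v).
            # (Assumed update rule, standing in for the learned recurrent module.)
            for i, ki in enumerate(old_k):
                row = self.state[i]
                for j, vj in enumerate(old_v):
                    row[j] += ki * vj

    def read_long_term(self, query):
        """Return query @ state, a lossy readout of everything that left the window."""
        dim = len(self.state)
        return [sum(query[i] * self.state[i][j] for i in range(dim)) for j in range(dim)]
```

However many tokens are appended, the window holds at most `window` pairs and the state stays `dim x dim`, which is the source of the bounded cache size that the abstract reports.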

9. Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data

ArXiv ID: 2510.06377

Authors: Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec

Abstract: Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 94% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT's zero-shot transfer harnesses task-table context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.

Comment: Proposes a new Transformer variant with Relational Attention over rows/columns/PK–FK links, a clear architecture innovation for relational data and representation learning.