Personalized Daily ArXiv Papers 2025-09-26

[gpt-5]	Prompt	Completion	Total
Token	54298	55137	109435
Cost	$0.07	$0.55	$0.62

Total arXiv papers: 681

Total scanned papers: 372

Total relevant papers: 45

Table of contents with paper titles:

Towards Atoms of Large Language Models Authors: Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding Authors: Ayan Sar, Sampurna Roy, Kanav Gupta, Anurag Kaushish, Tanupriya Choudhury, Abhijit Kumar
Behind RoPE: How Does Causal Mask Encode Positional Information? Authors: Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi
SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips Authors: Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang
Myosotis: structured computation for attention like layer Authors: Evgenii Egorov, Hanno Ackermann, Markus Nagel, Hong Cai
Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures Authors: Sampurna Roy, Ayan Sar, Anurag Kaushish, Kanav Gupta, Tanupriya Choudhury, Abhijit Kumar
Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias Authors: Shuofeng Zhang, Ard Louis
Scaling Laws are Redundancy Laws Authors: Yuda Bi, Vince D Calhoun
Physics of Learning: A Lagrangian perspective to different learning paradigms Authors: Siyuan Guo, Bernhard Schölkopf
Explicit and Effectively Symmetric Schemes for Neural SDEs Authors: Daniil Shmelev, Cristopher Salvi
Mechanism of Task-oriented Information Removal in In-context Learning Authors: Hakaze Cho, Haolin Yang, Gouki Minegishi, Naoya Inoue
FastEagle: Cascaded Drafting for Accelerating Speculative Decoding Authors: Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren
TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix Authors: Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli
Aligning Inductive Bias for Data-Efficient Generalization in State Space Models Authors: Qiyu Chen, Guozhang Chen
Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training Authors: Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma
Explaining Grokking and Information Bottleneck through Neural Collapse Emergence Authors: Keitaro Sakamoto, Issei Sato
Decoupled-Value Attention for Prior-Data Fitted Networks: GP Inference for Physical Equations Authors: Kaustubh Sharma, Simardeep Singh, Parikshit Pareek
Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models Authors: Nikolay Blagoev, Bart Cox, Jérémie Decouchant, Lydia Y. Chen
Binary Autoencoder for Mechanistic Interpretability of Large Language Models Authors: Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue
Feature Augmentation of GNNs for ILPs: Local Uniqueness Suffices Authors: Qingyu Han, Qian Li, Linxin Yang, Qian Chen, Qingjiang Shi, Ruoyu Sun
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say Authors: Jacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan, Viktor Prasanna
On Theoretical Interpretations of Concept-Based In-Context Learning Authors: Huaze Tang, Tianren Peng, Shao-lun Huang
WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP Authors: Moshe Kimhi, Erez Koifman, Ehud Rivlin, Eli Schwartz, Chaim Baskin
LATTS: Locally Adaptive Test-Time Scaling Authors: Theo Uscidda, Matthew Trager, Michael Kleinman, Aditya Chattopadhyay, Wei Xia, Stefano Soatto
Latent Twins Authors: Matthias Chung, Deepanshu Verma, Max Collins, Amit N. Subrahmanya, Varuni Katti Sastry, Vishwas Rao
Function Spaces Without Kernels: Learning Compact Hilbert Space Representations Authors: Su Ann Low, Quentin Rommel, Kevin S. Miller, Adam J. Thorpe, Ufuk Topcu
Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport Authors: Annabel Ma, Kaiying Hou, David Alvarez-Melis, Melanie Weber
CLUE: Conflict-guided Localization for LLM Unlearning Framework Authors: Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs Authors: Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models Authors: Bum Jun Kim, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo
Maxout Polytopes Authors: Andrei Balakin, Shelby Cox, Georg Loho, Bernd Sturmfels
SiNGER: A Clearer Voice Distills Vision Transformers Further Authors: Geunhyeok Yu, Sunjae Jeong, Yoonyoung Choi, Jaeseung Kim, Hyoseok Hwang
Toward Robust and Efficient ML-Based GPU Caching for Modern Inference Authors: Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng
Alignment Unlocks Complementarity: A Framework for Multiview Circuit Representation Learning Authors: Zhengyuan Shi, Jingxin Wang, Wentao Jiang, Chengyu Ma, Ziyang Zheng, Zhufei Chu, Weikang Qian, Qiang Xu
Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs Authors: Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li
No Prior, No Leakage: Revisiting Reconstruction Attacks in Trained Neural Networks Authors: Yehonatan Refael, Guy Smorodinsky, Ofir Lindenbaum, Itay Safran
Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting Authors: Zida Liang, Jiayi Zhu, Weiqiang Sun
Parallel Thinking, Sequential Answering: Bridging NAR and AR for Efficient Reasoning Authors: Qihang Ai, Haiyun Jiang
Understanding and Improving Adversarial Robustness of Neural Probabilistic Circuits Authors: Weixin Chen, Han Zhao
Shaping Initial State Prevents Modality Competition in Multi-modal Fusion: A Two-stage Scheduling Framework via Fast Partial Information Decomposition Authors: Jiaqi Tang, Yinsong Xu, Yang Liu, Qingchao Chen
SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization Authors: Jiehui Luo, Yuguo Yin, Yuxin Xie, Jinghan Ru, Xianwei Zhuang, Minghua He, Aofan Liu, Zihan Xiong, Dongchao Yang
Learning Greens Operators through Hierarchical Neural Networks Inspired by the Fast Multipole Method Authors: Emilio McAllister Fognini, Marta M. Betcke, Ben T. Cox
Implicit Augmentation from Distributional Symmetry in Turbulence Super-Resolution Authors: Julia Balla, Jeremiah Bailey, Ali Backour, Elyssa Hofgard, Tommi Jaakkola, Tess Smidt, Ryley McConkey
CAD-Tokenizer: Towards Text-based CAD Prototyping via Modality-Specific Tokenization Authors: Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian
Differential-Integral Neural Operator for Long-Term Turbulence Forecasting Authors: Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu

1. Towards Atoms of Large Language Models

ArXiv ID: 2509.20784

Authors: Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

Abstract: The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove conditions under which atoms satisfy the Restricted Isometry Property (RIP), ensuring stable sparse representations over the atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact $\ell_1$ recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAE size and recovery capacity. Overall, this work systematically introduces and validates the Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability. Code available at https://github.com/ChenhuiHu/towards_atoms.

Comment: Representation Learning + Autoencoders: formalizes atomic units with RIP/uniqueness guarantees and shows threshold-activated SAEs recover stable sparse representations in LLMs.

Relevance: 10 Novelty: 9
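
Illustration (not from the paper): a minimal sketch of a single-layer SAE with a hard-threshold activation, run on a toy dictionary with orthonormal columns so that sparse codes are recovered exactly. The function name `threshold_sae_forward`, the tied encoder/decoder, the threshold value 0.5, and the toy data are all assumptions made for this example; the authors' actual setup is in their repository.

```python
import numpy as np

def threshold_sae_forward(x, W_enc, b_enc, W_dec, theta):
    """One-layer SAE with a hard-threshold activation (hypothetical layout).

    Pre-activations at or below `theta` are zeroed; the rest pass through unchanged.
    """
    pre = W_enc @ x + b_enc
    codes = np.where(pre > theta, pre, 0.0)
    return codes, W_dec @ codes

rng = np.random.default_rng(0)
d = k = 16

# Toy dictionary with orthonormal columns: ||D a|| = ||a|| for every a,
# so the restricted isometry constant is 0 (the easiest possible case).
D, _ = np.linalg.qr(rng.standard_normal((d, k)))

# A 3-sparse code over the toy atoms and the representation it generates.
a_true = np.zeros(k)
a_true[[2, 7, 11]] = [1.5, 2.0, 1.0]
x = D @ a_true

# Tied weights (encoder = D^T, decoder = D) are an assumption of this sketch.
codes, x_hat = threshold_sae_forward(x, D.T, np.zeros(k), D, theta=0.5)

print("active atoms:", np.flatnonzero(codes))
print("max reconstruction error:", np.abs(x - x_hat).max())
```

With an overcomplete, non-orthogonal dictionary the recovery is no longer automatic, which is where conditions such as RIP come in.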

2. Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding

ArXiv ID: 2509.20581

Authors: Ayan Sar, Sampurna Roy, Kanav Gupta, Anurag Kaushish, Tanupriya Choudhury, Abhijit Kumar

Abstract: Transformer architectures have achieved state-of-the-art performance across natural language tasks, yet they fundamentally misrepresent the hierarchical nature of human language by processing text as flat token sequences. This results in quadratic computational cost, weak compositional generalization, and inadequate discourse-level modeling. We propose Hierarchical Resolution Transformer (HRT), a novel wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units. HRT constructs a multi-resolution attention, enabling bottom-up composition and top-down contextualization. By employing exponential sequence reduction across scales, HRT achieves O(n log n) complexity, offering significant efficiency improvements over standard transformers. We evaluated HRT on a diverse suite of benchmarks, including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, and results demonstrated that HRT outperforms standard transformer baselines by an average of +3.8% on GLUE, +4.5% on SuperGLUE, and +6.1% on Long Range Arena, while reducing memory usage by 42% and inference latency by 37% compared to BERT and GPT style models of similar parameter count. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules, showing that each contributes independently to both efficiency and accuracy. Our findings establish HRT as the first architecture to align computational structure with the hierarchical organization of human language, demonstrating that multi-scale, wavelet-inspired processing yields both theoretical efficiency gains and practical improvements in language understanding.

Comment: Model Architecture and Efficiency: introduces a wavelet-inspired Hierarchical Resolution Transformer with multi-resolution attention and O(n log n) complexity.

Relevance: 10 Novelty: 8

3. Behind RoPE: How Does Causal Mask Encode Positional Information?

ArXiv ID: 2509.21042

Authors: Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi

Abstract: While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.

Comment: Model Architecture/Representation Learning: theoretical and empirical analysis of positional information from causal masks and their interaction with RoPE in Transformer decoders, revealing non-trivial induced attention patterns.

Relevance: 10 Novelty: 8
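
Illustration (not from the paper): a minimal sketch of the simplest case in which a causal mask alone makes attention position-dependent. All raw scores are set to zero, so there are no parameters and no content signal; after masking and softmax, query i spreads its weight uniformly over positions 0..i, and its weights therefore depend on i. The helper name `causal_attention_weights` is an assumption of this example, and it does not reproduce the paper's analysis of why nearby query-key pairs are favored.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over raw scores after applying a causal (lower-triangular) mask."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is always
    # unmasked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

n = 5
weights = causal_attention_weights(np.zeros((n, n)))

# Row i holds 1/(i+1) on its first i+1 entries and 0 elsewhere.
print(np.round(weights, 3))
```

Because row i averages over a prefix whose length grows with i, the output at each position carries information about where that position sits in the sequence.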

4. SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

ArXiv ID: 2509.21271

Authors: Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang

Abstract: The emergence of Superchips represents a significant advancement in next-generation AI hardware. These Superchips employ a tightly coupled heterogeneous architecture that integrates GPU and CPU on the same package, which offers unprecedented computational power. However, there has been scant research investigating how LLM training benefits from this new architecture. In this work, for the first time, we study LLM training solutions based on offloading for Superchips. We observe important differences between Superchips and traditional loosely-coupled GPU-CPU architecture, which necessitate revisiting prevailing assumptions about offloading. Based on that, we present SuperOffload, a Superchip-centric offloading system that simultaneously uses Hopper GPU, Grace CPU, and NVLink-C2C interconnect more efficiently. SuperOffload accomplishes this via a combination of techniques, such as adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and a highly optimized Adam optimizer for Grace CPUs. Our evaluation of SuperOffload on NVIDIA GH200 demonstrates up to 2.5x throughput improvement compared to state-of-the-art offloading-based systems, enabling training of up to a 25B model on a single Superchip while achieving high training throughput. We also extend SuperOffload with ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism, enabling training of a 13B model with sequence lengths up to 1 million tokens on 8 GH200 while achieving 55% MFU.

Comment: High Performance Computing: Superchip-centric offloading with adaptive weight offload, bucketization repartitioning, casting, speculative execution, and CPU-optimized Adam for large-scale LLM training.

Relevance: 10 Novelty: 8

5. Myosotis: structured computation for attention like layer

ArXiv ID: 2509.20503

Authors: Evgenii Egorov, Hanno Ackermann, Markus Nagel, Hong Cai

Abstract: Attention layers apply a sequence-to-sequence mapping whose parameters depend on the pairwise interactions of the input elements. However, without any structural assumptions, memory and compute scale quadratically with the sequence length. The two main ways to mitigate this are to introduce sparsity by ignoring a sufficient number of pairwise interactions or to introduce recurrent dependence along them, as SSMs do. Although both approaches are reasonable, they both have disadvantages. We propose a novel algorithm that combines the advantages of both concepts. Our idea is based on the efficient inversion of tree-structured matrices.

Comment: Model Architecture + Efficiency: proposes an attention-like layer combining sparsity and recurrence via efficient inversion of tree-structured matrices to reduce quadratic compute/memory.

Relevance: 10 Novelty: 8

6. Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures

ArXiv ID: 2509.20577

Authors: Sampurna Roy, Ayan Sar, Anurag Kaushish, Kanav Gupta, Tanupriya Choudhury, Abhijit Kumar

Abstract: Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we propose Dynamic Reasoning Chains through Depth Specialised Mixture of Experts (DS-MoE), a modular framework that extends the Mixture of Experts paradigm from width-based to depth specialised computation. DS-MoE introduces expert modules optimised for distinct reasoning depths: shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta-cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. We trained and evaluated DS-MoE on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS-MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform-depth transformers, while delivering 2.8 per cent higher accuracy on complex multi-step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS-MoE as a significant advancement in adaptive neural architectures, demonstrating that depth-specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.

Comment: Model Architecture: Mixture-of-Experts with depth-specialized experts and learned routing (conditional/dynamic computation) in Transformers.

Relevance: 10 Novelty: 8

7. Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

ArXiv ID: 2509.21181

Authors: Shuofeng Zhang, Ard Louis

Abstract: For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $\{ \lVert \widehat{w_p} \rVert_r \}_{r \in [1,p]}$ with sample size. We solve this basic but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal spike and a bulk of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$'s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of all $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat {w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $\ell_r$ norm is used.

Comment: Representation Learning/Theory: closed-form scaling laws for ||w||_r under ℓ_p-biased interpolation and corresponding implicit bias in diagonal linear networks, clarifying training dynamics across norms.

Relevance: 9 Novelty: 9

8. Scaling Laws are Redundancy Laws

ArXiv ID: 2509.20721

Authors: Yuda Bi, Vince D Calhoun

Abstract: Scaling laws, a defining feature of deep learning, reveal a striking power-law improvement in model performance with increasing dataset and model size. Yet, their mathematical origins, especially the scaling exponent, have remained elusive. In this work, we show that scaling laws can be formally explained as redundancy laws. Using kernel regression, we show that a polynomial tail in the data covariance spectrum yields an excess risk power law with exponent alpha = 2s / (2s + 1/beta), where beta controls the spectral tail and 1/beta measures redundancy. This reveals that the learning curve's slope is not universal but depends on data redundancy, with steeper spectra accelerating returns to scale. We establish the law's universality across boundedly invertible transformations, multi-modal mixtures, finite-width approximations, and Transformer architectures in both linearized (NTK) and feature-learning regimes. This work delivers the first rigorous mathematical explanation of scaling laws as finite-sample redundancy laws, unifying empirical observations with theoretical foundations.

Comment: Representation Learning/Training Dynamics: provides a theoretical explanation of scaling laws via data covariance spectrum redundancy, including transformers in NTK and feature-learning regimes.

Relevance: 9 Novelty: 9
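
Illustration (not from the paper): a small worked example of the exponent formula quoted in the abstract, alpha = 2s / (2s + 1/beta). The abstract does not define s; it is treated here simply as a positive parameter, and the function name `scaling_exponent` and the chosen values of s and beta are assumptions made for this sketch.

```python
def scaling_exponent(s: float, beta: float) -> float:
    """Excess-risk power-law exponent alpha = 2s / (2s + 1/beta), as quoted in the abstract."""
    if s <= 0 or beta <= 0:
        raise ValueError("s and beta are assumed to be positive")
    return 2.0 * s / (2.0 * s + 1.0 / beta)

# Hold s fixed and vary beta: a steeper spectral tail (larger beta) means less
# redundancy (smaller 1/beta) and a larger exponent, which stays below 1.
for beta in (1.0, 2.0, 4.0):
    print(f"s=1.0, beta={beta}: alpha={scaling_exponent(1.0, beta):.4f}")
```

For s = 1 the exponent rises from 2/3 at beta = 1 to 0.8 at beta = 2, matching the abstract's statement that steeper spectra accelerate returns to scale.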

9. Physics of Learning: A Lagrangian perspective to different learning paradigms

ArXiv ID: 2509.21049

Authors: Siyuan Guo, Bernhard Schölkopf

Abstract: We study the problem of building an efficient learning system. Efficient learning processes information in the least time, i.e., it reaches a desired error threshold with the least number of observations. Building upon least action principles from physics, we derive classic learning algorithms, Bellman's optimality equation in reinforcement learning, and the Adam optimizer in generative models from first principles, i.e., the Learning $\textit{Lagrangian}$. We postulate that learning searches for stationary paths in the Lagrangian, and learning algorithms are derivable by seeking the stationary trajectories.

Comment: Representation Learning / Training Dynamics: introduces a Learning Lagrangian, deriving classic algorithms (e.g., Bellman optimality, Adam) from least-action principles.

Relevance: 9 Novelty: 9

10. Explicit and Effectively Symmetric Schemes for Neural SDEs

ArXiv ID: 2509.20599

Authors: Daniil Shmelev, Cristopher Salvi

Abstract: Backpropagation through (neural) SDE solvers is traditionally approached in two ways: discretise-then-optimise, which offers accurate gradients but incurs prohibitive memory costs due to storing the full computational graph (even when mitigated by checkpointing); and optimise-then-discretise, which achieves constant memory cost by solving an auxiliary backward SDE, but suffers from slower evaluation and gradient approximation errors. Algebraically reversible solvers promise both memory efficiency and gradient accuracy, yet existing methods such as the Reversible Heun scheme are often unstable under complex models and large step sizes. We address these limitations by introducing a novel class of stable, near-reversible Runge--Kutta schemes for neural SDEs. These Explicit and Effectively Symmetric (EES) schemes retain the benefits of reversible solvers while overcoming their instability, enabling memory-efficient training without severe restrictions on step size or model complexity. Through numerical experiments, we demonstrate the superior stability and reliability of our schemes, establishing them as a practical foundation for scalable and accurate training of neural SDEs.

Comment: High Performance Computing / Efficiency: proposes stable, near-reversible explicit Runge–Kutta schemes for neural SDEs enabling memory-efficient training with accurate gradients.