Personalized Daily ArXiv Papers 2025-11-11

[gpt-5]	Prompt	Completion	Total
Token	77291	63654	140945
Cost	$0.1	$0.64	$0.73

Total arXiv papers: 738

Total scanned papers: 446

Total relevant papers: 54

Table of contents with paper titles:

MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference Authors: Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim
Route Experts by Sequence, not by Token Authors: Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You
Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving Authors: Hui Zeng, Daming Zhao, Pengfei Yang, Wenxuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai
How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy Authors: Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang
MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling Authors: Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu
Rethinking Parameter Sharing as Graph Coloring for Structured Compression Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang
PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference Authors: Yushu Zhao, Zheng Wang, Minjia Zhang
Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
Next-Latent Prediction Transformers Learn Compact World Models Authors: Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S. Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, John Langford
Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Philippe von Wurstemberger
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization Authors: Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen
A Fully Polynomial-Time Algorithm for Robustly Learning Halfspaces over the Hypercube Authors: Gautam Chandrasekaran, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan
Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas? Authors: Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail I. Katsnelson, Dmitry Vetrov
Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin Authors: Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang
MobileLLM-Pro Technical Report Authors: Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi, Adithya Sagar
The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models Authors: Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen
Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels Authors: Yong-Ming Tian, Shuang Liang, Shao-Qun Zhang, Feng-Lei Fan
LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi
Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence Authors: Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum
TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning Authors: Qifeng Lei, Zhiyong Yang, Qianqian Xu, Cong Hua, Peisong Wen, Qingming Huang
P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats Authors: Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah
Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration Authors: Stef Cuyckens, Xiaoling Yi, Robin Geens, Joren Dumoulin, Martin Wiesner, Chao Fang, Marian Verhelst
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability Authors: Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon
QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations Authors: Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan
MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression Authors: Lionel Levine, Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh
CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems Authors: Mohammad Helal Uddin, Sai Krishna Ghanta, Liam Seymour, Sabur Baidya
Learning to Focus: Focal Attention for Selective and Scalable Transformers Authors: Dhananjay Ram, Wei Xia, Stefano Soatto
Understanding the role of depth in the neural tangent kernel for overparameterized neural networks Authors: William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi
Minimum Width of Deep Narrow Networks for Universal Approximation Authors: Xiao-Song Yang, Qi Zhou, Xuan Zhou
Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder Authors: Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen
Rank-1 LoRAs Encode Interpretable Reasoning Signals Authors: Jake Ward, Paul Riechers, Adam Shai
Diversified Flow Matching with Translation Identifiability Authors: Sagar Shrestha, Xiao Fu
TNT: Improving Chunkwise Training for Test-Time Memorization Authors: Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni
Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees Authors: Mohammad Shirzadi, Ali Safarpoor Dehkordi, Ahad N. Zehmakan
DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning Authors: Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba
Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki
A Provably-Correct and Robust Convex Model for Smooth Separable NMF Authors: Junjun Pan, Valentin Leplat, Michael Ng, Nicolas Gillis
Physics-Informed Design of Input Convex Neural Networks for Consistency Optimal Transport Flow Matching Authors: Fanghui Song, Zhongjian Wang, Jiebao Sun
C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning Authors: Antonios Valkanas, Soumyasundar Pal, Pavel Rumiantsev, Yingxue Zhang, Mark Coates
On the Convergence and Stability of Distributed Sub-model Training Authors: Yuyang Deng, Fuli Qiao, Mehrdad Mahdavi
How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation Authors: Zinuo You, Jin Zheng, John Cartlidge
Rep2Text: Decoding Full Text from a Single LLM Token Representation Authors: Haiyan Zhao, Zirui He, Fan Yang, Ali Payani, Mengnan Du
From Kernels to Attention: A Transformer Framework for Density and Score Estimation Authors: Vasily Ilin, Peter Sushko
First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation Authors: Dmytro Vitel, Anshuman Chhabra
Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily Authors: Asela Hevapathige, Asiri Wijesinghe, Ahad N. Zehmakan
Sampling and Loss Weights in Multi-Domain Training Authors: Mahdi Salmani, Pratik Worah, Meisam Razaviyayn, Vahab Mirrokni
Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks Authors: Dian Jin, Yancheng Yuan, Xiaoming Tao
An Efficient Gradient-Aware Error-Bounded Lossy Compressor for Federated Learning Authors: Zhijing Ye, Sheng Di, Jiamin Wang, Zhiqing Zhong, Zhaorui Zhang, Xiaodong Yu
Mixtures of SubExperts for Large Language Continual Learning Authors: Haeyong Kang
Non-Negative Stiefel Approximating Flow: Orthogonalish Matrix Optimization for Interpretable Embeddings Authors: Brian B. Avants (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA), Nicholas J. Tustison (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA), James R Stone (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA)
Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding Authors: Qian Ma, Ruoxiang Xu, Yongqiang Cai
Walsh-Hadamard Neural Operators for Solving PDEs with Discontinuous Coefficients Authors: Giorrgio M. Cavallazzi, Miguel Perex Cuadrado, Alfredo Pinelli
Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention Authors: Wenjie Hu, Sidun Liu, Peng Qiao, Zhenglun Sun, Yong Dou
Recursive Dynamics in Fast-Weights Homeostatic Reentry Networks: Toward Reflective Intelligence Authors: B. G. Chae

1. MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

ArXiv ID: 2511.06010

Authors: Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

Abstract: The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.

Comment: Compression/Efficiency and HPC: Shared KV Attention transforming memory-bound KV cache ops to compute-bound GEMMs with MoE-inspired sparse attention and disaggregated infrastructure.

Relevance: 10 Novelty: 9
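
The GEMV-to-GEMM transformation described in the MoSKA abstract above can be illustrated with a toy score computation. This is a minimal sketch, not the authors' kernel: it assumes a shared key matrix `K` and one query vector per concurrent request in `Q`, covers only the query-key scores (softmax and values are omitted), and the function names and the `key_rows_read` counter are hypothetical, introduced here to make the memory traffic visible.

```python
def gemv_per_request(K, Q):
    """Baseline: every request scans the shared keys on its own (one GEMV each)."""
    scores, key_rows_read = [], 0
    for q in Q:
        row_scores = []
        for k in K:
            key_rows_read += 1  # the same key row is re-read for every request
            row_scores.append(sum(x * y for x, y in zip(k, q)))
        scores.append(row_scores)
    return scores, key_rows_read


def gemm_batched(K, Q):
    """Batched: each shared key row is loaded once and reused by all requests."""
    scores = [[0] * len(K) for _ in Q]
    key_rows_read = 0
    for i, k in enumerate(K):
        key_rows_read += 1  # one load per key row, regardless of batch size
        for r, q in enumerate(Q):
            scores[r][i] = sum(x * y for x, y in zip(k, q))
    return scores, key_rows_read


# Hypothetical toy data: 3 shared keys, 4 concurrent requests.
K = [[1, 0], [0, 1], [1, 1]]
Q = [[2, 3], [1, -1], [0, 4], [5, 5]]
per_request, reads_gemv = gemv_per_request(K, Q)
batched, reads_gemm = gemm_batched(K, Q)
print(per_request == batched, reads_gemv, reads_gemm)
```

Both paths produce identical scores; the batched path performs the same arithmetic per key row loaded but reads each shared row once instead of once per request, which is the sense in which batching shifts the work from memory-bound toward compute-bound.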

2. Route Experts by Sequence, not by Token

ArXiv ID: 2511.06494

Authors: Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You

Abstract: Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.

Comment: Model Architecture: MoE routing innovation (sequence-level TopK) enabling dynamic expert allocation under fixed budget, improving efficiency at high sparsity.

Relevance: 10 Novelty: 8
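
The shift from a per-token to a per-sequence expert budget described in the SeqTopK abstract above can be sketched as follows. This is an illustrative reading, not the authors' code (which is at the URL in the abstract): it assumes the selection is over the T·K highest-scoring (token, expert) pairs, uses made-up router scores, and does not handle details such as guaranteeing every token at least one expert. The function names are hypothetical.

```python
def topk_routing(scores, k):
    """Standard TopK: every token keeps its own k highest-scoring experts."""
    routed = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda e: -row[e])
        routed.append(sorted(ranked[:k]))
    return routed


def seq_topk_routing(scores, k):
    """Sequence-level TopK: keep the T*k best (token, expert) pairs overall."""
    num_tokens = len(scores)
    budget = num_tokens * k  # same total budget as per-token TopK
    pairs = [(row[e], t, e) for t, row in enumerate(scores) for e in range(len(row))]
    pairs.sort(key=lambda p: -p[0])
    routed = [[] for _ in range(num_tokens)]
    for _, t, e in pairs[:budget]:
        routed[t].append(e)
    return [sorted(experts) for experts in routed]


# Hypothetical router scores: 3 tokens, 4 experts, K = 2.
scores = [
    [0.40, 0.35, 0.15, 0.10],
    [0.90, 0.05, 0.03, 0.02],  # "easy" token: one expert dominates
    [0.30, 0.30, 0.25, 0.15],  # "hard" token: several experts matter
]
per_token = topk_routing(scores, 2)
per_sequence = seq_topk_routing(scores, 2)
print(per_token)     # [[0, 1], [0, 1], [0, 1]]
print(per_sequence)  # [[0, 1], [0], [0, 1, 2]]
```

In this toy case the total number of activated experts stays at 6, but one slot moves from the confident token to the ambiguous one.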

3. Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

ArXiv ID: 2511.06029

Authors: Hui Zeng, Daming Zhao, Pengfei Yang, Wenxuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai

Abstract: Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56x.

Comment: Compression/Efficiency: adaptive layer- and time-aware KV cache pruning with relevance-aware retention for long-form LLM reasoning.

Relevance: 10 Novelty: 8

4. How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy

ArXiv ID: 2511.06044

Authors: Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

Abstract: The attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information, and its expressivity has been proved to be powerful. Nevertheless, its quadratic complexity restricts its practicability. Although several studies have proposed sparse forms of attention, they lack theoretical analysis of how expressivity is maintained while complexity is reduced. In this paper, we put forward Random Batch Attention (RBA), a linear self-attention mechanism with theoretical support for its ability to maintain expressivity. Random Batch Attention has several significant strengths: (1) It has linear time complexity. In addition, it can be implemented in parallel along a new dimension, which yields substantial memory savings. (2) It can improve most existing models by replacing their attention mechanisms, including many previously improved attention mechanisms. (3) It has a theoretical explanation of convergence, as it derives from Random Batch Methods in computational mathematics. Experiments on large graphs demonstrate the advantages mentioned above. In addition, the theoretical modeling of the self-attention mechanism is a new tool for future research on attention-mechanism analysis.

Comment: Model Architecture/Efficiency: Random Batch Attention, a linear-time self-attention with theoretical expressivity and parallelization benefits.

Relevance: 10 Novelty: 8

5. MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

ArXiv ID: 2511.05811

Authors: Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

Abstract: Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while achieving up to 34% higher training throughput.

Comment: Model Compression and Efficiency: FP8 training with microscaling and automatic scaling for throughput and numerical stability.

Relevance: 10 Novelty: 8
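
The two-level microscaling idea in the MOSS abstract above (one high-precision global scale combined with compact power-of-two local scales) can be sketched as below. This is only an illustration under stated assumptions: integer rounding to [-448, 448] stands in for a real FP8 cast (448 is the largest finite E4M3 value), the group size and all function names are hypothetical, and the abstract does not specify how MOSS actually chooses its local exponents.

```python
import math

QMAX = 448  # largest finite FP8 E4M3 value; integer rounding is a stand-in for an FP8 cast


def two_level_quantize(x, group_size):
    """Quantize x with one float global scale and one power-of-two scale per group."""
    amax = max(abs(v) for v in x)
    global_scale = amax / QMAX if amax > 0 else 1.0  # high-precision global scale
    q, exps = [], []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        g_amax = max(abs(v) for v in group)
        # Local scale 2**e with e <= 0, the smallest power of two that still
        # keeps the group's largest magnitude inside [-QMAX, QMAX].
        e = math.ceil(math.log2(g_amax / amax)) if g_amax > 0 else 0
        step = global_scale * 2.0 ** e
        exps.append(e)
        q.extend(max(-QMAX, min(QMAX, round(v / step))) for v in group)
    return q, global_scale, exps


def dequantize(q, global_scale, exps, group_size):
    """Invert the mapping: value = code * global_scale * 2**local_exponent."""
    return [qv * global_scale * 2.0 ** exps[i // group_size] for i, qv in enumerate(q)]


# Hypothetical activations: one large-magnitude group and one small-magnitude group.
x = [100.0, -50.0, 25.0, 12.5, 0.5, -0.25, 0.125, 0.0]
q, global_scale, exps = two_level_quantize(x, group_size=4)
x_hat = dequantize(q, global_scale, exps, group_size=4)
print(exps)  # [0, -7]
```

Because each local scale is a power of two, it can be stored as a small integer exponent, and applying it amounts to an exponent shift rather than a full multiply; the small-magnitude group here gets a step 128 times finer than the global scale alone would give it.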

6. Rethinking Parameter Sharing as Graph Coloring for Structured Compression

ArXiv ID: 2511.06786

Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang

Abstract: Modern deep models have massive parameter sizes, leading to high inference-time memory usage that limits practical deployment. Parameter sharing, a form of structured compression, effectively reduces redundancy, but existing approaches remain heuristic, restricted to adjacent layers and lacking a systematic analysis of cross-layer sharing. However, extending sharing across multiple layers leads to an exponentially expanding configuration space, making exhaustive search computationally infeasible and forming a critical bottleneck for parameter sharing. We recast parameter sharing from a group-theoretic perspective as introducing structural symmetries in the model's parameter space. A sharing configuration can be described by a coloring function $\alpha:L\rightarrow C$ (L: layer indices and C: sharing classes), which determines inter-layer sharing groups while preserving structural symmetry. To determine the coloring function, we propose a second-order geometric criterion based on Taylor expansion and the Hessian spectrum. By projecting perturbations onto the Hessian's low-curvature eigensubspace, the criterion provides an analytic rule for selecting sharing groups that minimize performance impact, yielding a principled and scalable configuration procedure. Across diverse architectures and tasks, Geo-Sharing consistently outperforms state-of-the-art heuristic sharing strategies, achieving higher compression ratios with smaller accuracy degradation.

Comment: Model Compression and Efficiency: cross-layer parameter sharing cast as graph coloring with Hessian-based geometric criterion (structured compression).

Relevance: 10 Novelty: 8
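
The coloring function $\alpha:L\rightarrow C$ from the abstract above can be made concrete with a small example. This is a sketch of the bookkeeping only: the particular assignment, the per-layer parameter count, and the helper names are hypothetical, and the Hessian-based criterion the paper uses to choose the coloring is not reproduced here.

```python
from collections import defaultdict


def sharing_groups(alpha):
    """Group layer indices by color; layers in one group share a parameter set."""
    groups = defaultdict(list)
    for layer in sorted(alpha):
        groups[alpha[layer]].append(layer)
    return dict(groups)


def stored_params(alpha, params_per_layer):
    """Parameters kept in memory: one copy per sharing class, not per layer."""
    return len(set(alpha.values())) * params_per_layer


# Hypothetical coloring of a 6-layer model into 3 sharing classes.
# Non-adjacent layers (e.g. 0, 2, 5) are allowed to share.
alpha = {0: "a", 1: "b", 2: "a", 3: "c", 4: "b", 5: "a"}
groups = sharing_groups(alpha)
dense = len(alpha) * 1_000_000
shared = stored_params(alpha, 1_000_000)
num_colorings = 3 ** len(alpha)  # candidate maps from 6 layers to 3 classes
print(groups, dense, shared, num_colorings)
```

Even this toy setting has 3^6 = 729 candidate assignments, which illustrates why the abstract describes the cross-layer configuration space as growing exponentially with depth.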

7. PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference

ArXiv ID: 2511.04805

Authors: Yushu Zhao, Zheng Wang, Minjia Zhang

Abstract: Mixture-of-Experts (MoE) models have shown strong potential in scaling language models efficiently by activating only a small subset of experts per input. However, their widespread deployment remains limited due to the high memory overhead associated with storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies, yet they often suffer from performance drop at high compression ratios. In this paper, we introduce PuzzleMoE, a training-free MoE compression method that achieves both high accuracy and efficient inference through two key innovations: First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. It uses a dual-mask to capture both shared and expert-specific parameters. Second, to avoid the overhead of storing binary masks and signs, PuzzleMoE introduces a bit-packed encoding scheme that reuses underutilized exponent bits, enabling efficient MoE inference on GPUs. Extensive experiments demonstrate that PuzzleMoE can compress MoE models by up to 50% while maintaining accuracy across various tasks. Specifically, it outperforms prior MoE compression methods by up to 16.7% on MMLU at 50% compression ratio, and achieves up to 1.28x inference speedup.

Comment: Matches Model Compression and Architecture: training-free MoE compression via sparse expert merging and bit-packed inference.

Relevance: 10 Novelty: 8

8. Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

ArXiv ID: 2511.07419

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

Abstract: Sparse Mixture-of-Experts (MoE) models have been widely adopted in recent large language models since they can efficiently scale up model capability without increasing the inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embedding can effectively reduce the gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieve better generalization. Moreover, RoMA demonstrates the advantage of unifying the task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.

Comment: Model Architecture (MoE): aligns routing weights with task manifolds via manifold regularization, improving generalization with lightweight router fine-tuning.

Relevance: 10 Novelty: 8

9. Next-Latent Prediction Transformers Learn Compact World Models

ArXiv ID: 2511.05963

Authors: Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S. Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, John Langford

Abstract: Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc look ups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics -- a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies -- world modeling, reasoning, planning, and language modeling -- NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.

Comment: Representation Learning/Architecture: Next-Latent Prediction objective induces compact belief-state latents and transition dynamics in Transformers.

Relevance: 9 Novelty: 9

10. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer

ArXiv ID: 2511.06675

Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Philippe von Wurstemberger

Abstract: Besides the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimization method for the training of deep neural networks in artificial intelligence (AI) systems. Despite the popularity and the success of Adam, it remains an open research problem to provide a rigorous convergence analysis for Adam even for the class of strongly convex stochastic optimization problems (SOPs). In one of the main results of this work we establish convergence rates for Adam in terms of the number of gradient steps (convergence rate 1/2 w.r.t. the size of the learning rate), the size of the mini-batches (convergence rate 1 w.r.t. the size of the mini-batches), and the size of the second moment parameter of Adam (convergence rate 1 w.r.t. the distance of the second moment parameter to 1) for the class of strongly convex SOPs. In a further main result of this work, which we refer to as the Adam symmetry theorem, we illustrate the optimality of the established convergence rates by proving for a special class of simple quadratic strongly convex SOPs that Adam converges as the number of gradient steps increases to infinity to the solution of the SOP (the unique minimizer of the strongly convex objective function) if and only if the random variables in the SOP (the data in the SOP) are symmetrically distributed. In particular, in the standard case where the random variables in the SOP are not symmetrically distributed we disprove that Adam converges to the minimizer of the SOP as the number of Adam steps increases to infinity. We also complement the conclusions of our convergence analysis and the Adam symmetry theorem by several numerical simulations that indicate the sharpness of the established convergence rates and that illustrate the practical appearance of the phenomena revealed in the Adam symmetry theorem.

Comment: Optimization theory: rigorous convergence rates and the Adam symmetry theorem for the stochastic Adam optimizer on strongly convex problems.

Relevance: 9 Novelty: 9

11. Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

ArXiv ID: 2511.07378

Authors: Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen

Abstract: The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a recursive self-training scheme can progressively extend the range of solvable problem lengths. To our knowledge, we provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT, significantly going beyond prior art confined in $\mathsf{TC}^0$, unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails. Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.

Comment: Matches Representation Learning/Architecture theory: provable chain-of-thought length generalization in transformers via attention concentration.

Relevance: 9 Novelty: 9

12. A Fully Polynomial-Time Algorithm for Robustly Learning Halfspaces over the Hypercube

ArXiv ID: 2511.07244

Authors: Gautam Chandrasekaran, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: We give the first fully polynomial-time algorithm for learning halfspaces with respect to the uniform distribution on the hypercube in the presence of contamination, where an adversary may corrupt some fraction of examples and labels arbitrarily. We achieve an error guarantee of $\eta^{O(1)}+\epsilon$ where $\eta$ is the noise rate. Such a result was not known even in the agnostic setting, where only labels can be adversarially corrupted. All prior work over the last two decades has a superpolynomial dependence in $1/\epsilon$ or succeeds only with respect to continuous marginals (such as log-concave densities). Previous analyses rely heavily on various structural properties of continuous distributions such as anti-concentration. Our approach avoids these requirements and makes use of a new algorithm for learning Generalized Linear Models (GLMs) with only a polylogarithmic dependence on the activation function's Lipschitz constant. More generally, our framework shows that supervised learning with respect to discrete distributions is not as difficult as previously thought.

Comment: Learning Theory: fully polynomial-time robust algorithm for learning halfspaces over the hypercube under contamination.

Relevance: 9 Novelty: 9

13. Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas?

ArXiv ID: 2511.07308

Authors: Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail I. Katsnelson, Dmitry Vetrov

Abstract: Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.

Comment: Training dynamics theory: maps SGD with weight decay in scale-invariant nets to thermodynamic variables, informing hyperparameter design.

Relevance: 9 Novelty: 8

14. Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin

ArXiv ID: 2511.06077

Authors: Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang

Abstract: Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy -- train on shorter windows, infer on much longer ones -- so the model generalizes to 10k histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end long-sequence recommendation to the 10k regime.

Comment: Model Architecture and Efficiency: replaces self-attention with stacked target-to-history cross-attention for linear complexity; batching and length extrapolation for 10k sequences.

Relevance: 9 Novelty: 8
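
Sketch (not from the paper): the abstract attributes the linear cost of STCA to using the target as the only query against the history. The toy Python below illustrates that idea under loud assumptions: a single head, no learned projections, history vectors doubling as keys and values, and a plain residual update between layers. The names `cross_attend`, `stacked_cross_attention`, and `score_count` are hypothetical.

```python
import math

def cross_attend(target, history):
    """One target-to-history attention step; the target is the only query."""
    d = len(target)
    scores = [sum(t * h for t, h in zip(target, item)) / math.sqrt(d)
              for item in history]
    m = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Weighted sum of history vectors (assumption: keys double as values).
    return [sum(w * item[j] for w, item in zip(weights, history))
            for j in range(d)]

def stacked_cross_attention(target, history, num_layers=2):
    """Stack several target-to-history steps with a residual update (assumed)."""
    x = list(target)
    for _ in range(num_layers):
        ctx = cross_attend(x, history)
        x = [a + b for a, b in zip(x, ctx)]
    return x

def score_count(history_len, num_layers):
    """Attention scores computed: linear in history length per layer."""
    return history_len * num_layers
```

For a 10k history and two layers this computes 20,000 scores, whereas self-attention over the history would compute 10,000 x 10,000 scores per layer.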

15. MobileLLM-Pro Technical Report

ArXiv ID: 2511.06719

Authors: Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi, Adithya Sagar

Abstract: Efficient on-device language models around 1 billion parameters are essential for powering low-latency AI applications on mobile and wearable devices. However, achieving strong performance in this model class, while supporting long context windows and practical deployment remains a significant challenge. We introduce MobileLLM-Pro, a 1-billion-parameter language model optimized for on-device deployment. MobileLLM-Pro achieves state-of-the-art results across 11 standard benchmarks, significantly outperforming both Gemma 3-1B and Llama 3.2-1B, while supporting context windows of up to 128,000 tokens and showing only minor performance regressions at 4-bit quantization. These improvements are enabled by four core innovations: (1) implicit positional distillation, a novel technique that effectively instills long-context capabilities through knowledge distillation; (2) a specialist model merging framework that fuses multiple domain experts into a compact model without parameter growth; (3) simulation-driven data mixing using utility estimation; and (4) 4-bit quantization-aware training with self-distillation. We release our model weights and code to support future research in efficient on-device language models.

Comment: Compression/Efficiency: on-device LLM with implicit positional distillation for long context, specialist model merging without parameter growth, and 4-bit QAT.

Relevance: 9 Novelty: 8

16. The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models

ArXiv ID: 2511.07237

Authors: Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen

Abstract: Large-scale models are at the forefront of time series (TS) forecasting, dominated by two paradigms: fine-tuning text-based Large Language Models (LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both approaches share a foundational assumption that scaling up model capacity and data volume leads to improved performance. However, we observe a scaling paradox in TS models, revealing a puzzling phenomenon that larger models do NOT achieve better performance. Through extensive experiments on two model families across four scales (100M to 1.7B parameters) and diverse data (up to 6B observations), we rigorously confirm that the scaling paradox is a pervasive issue. We then diagnose its root cause by analyzing internal representations, identifying a phenomenon we call few-layer dominance: only a small subset of layers are functionally important, while the majority are redundant, under-utilized, and can even distract training. Based on this discovery, we propose a practical method to automatically identify and retain only these dominant layers. In our models, retaining only 21% of the parameters achieves up to a 12% accuracy improvement and a 2.7$\times$ inference speedup. We validate the universality of our method on 8 prominent SOTA models (LLM4TS and TSFMs, 90M to 6B), showing that retaining less than 30% of layers achieves comparable or superior accuracy in over 95% of tasks.

Comment: Compression/Efficiency + Representation: discovers few-layer dominance in TS models and proposes retaining dominant layers, yielding large parameter reduction and speedups.

Relevance: 9 Novelty: 8

17. Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels

ArXiv ID: 2511.05585

Authors: Yong-Ming Tian, Shuang Liang, Shao-Qun Zhang, Feng-Lei Fan

Abstract: While deep learning has achieved remarkable success across a wide range of applications, its theoretical understanding of representation learning remains limited. Deep neural kernels provide a principled framework to interpret over-parameterized neural networks by mapping hierarchical feature transformations into kernel spaces, thereby combining the expressive power of deep architectures with the analytical tractability of kernel methods. Recent advances, particularly neural tangent kernels (NTKs) derived by gradient inner products, have established connections between infinitely wide neural networks and nonparametric Bayesian inference. However, the existing NTK paradigm has been predominantly confined to the infinite-width regime, while overlooking the representational role of network depth. To address this gap, we propose a depth-induced NTK kernel based on a shortcut-related architecture, which converges to a Gaussian process as the network depth approaches infinity. We theoretically analyze the training invariance and spectrum properties of the proposed kernel, which stabilizes the kernel dynamics and mitigates degeneration. Experimental results further underscore the effectiveness of our proposed method. Our findings significantly extend the existing landscape of the neural kernel theory and provide an in-depth understanding of deep learning and the scaling law.

Comment: Representation Learning Theory: proposes a depth-induced NTK capturing depth effects beyond infinite-width NTK, with analysis of spectrum and training invariance.

Relevance: 9 Novelty: 8

18. LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

ArXiv ID: 2511.07129

Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi

Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs to improve performance on diverse tasks, but usually require labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters to identify the most relevant adapters and determine their contributions on the fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines by a margin of up to 3.6% on some tasks while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.

Comment: Compression/Efficiency: training-free instance-level dynamic selection and merging of multiple LoRA adapters at inference time.

Relevance: 9 Novelty: 8

19. Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

ArXiv ID: 2511.07384

Authors: Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum

Abstract: Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments, on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.

Comment: Model Architecture/Efficiency: retrofits recurrence into pretrained LMs with a recurrence curriculum to decouple test-time compute from parameters/training compute.

Relevance: 9 Novelty: 8

20. TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning

ArXiv ID: 2511.06859

Authors: Qifeng Lei, Zhiyong Yang, Qianqian Xu, Cong Hua, Peisong Wen, Qingming Huang

Abstract: Efficiently fine-tuning pre-trained models for downstream tasks is a key challenge in the era of foundation models. Parameter-efficient fine-tuning (PEFT) presents a promising solution, achieving performance comparable to full fine-tuning by updating only a small number of adaptation weights per layer. Traditional PEFT methods typically rely on a single expert, where the adaptation weight is a low-rank matrix. However, for complex tasks, the data's inherent diversity poses a significant challenge for such models, as a single adaptation weight cannot adequately capture the features of all samples. To address this limitation, we explore how to integrate multiple small adaptation experts into a compact structure that outperforms a single large adapter. Specifically, we propose Tucker Adaptation (TuckA), a method with four key properties: (i) We use Tucker decomposition to create a compact 3D tensor where each slice naturally serves as an expert. The low-rank nature of this decomposition ensures that the number of parameters scales efficiently as more experts are added. (ii) We introduce a hierarchical strategy that organizes these experts into groups at different granularities, allowing the model to capture both local and global data patterns. (iii) We develop an efficient batch-level routing mechanism, which reduces the router's parameter size by a factor of $L$ compared to routing at every adapted layer (where $L$ is the number of adapted layers). (iv) We propose data-aware initialization to achieve loss-free expert load balancing based on theoretical analysis. Extensive experiments on benchmarks in natural language understanding, image classification, and mathematical reasoning speak to the efficacy of TuckA, offering a new and effective solution to the PEFT problem.

Comment: Matches Model Architecture and Compression/Efficiency: Tucker low-rank PEFT with hierarchical tensor experts and efficient routing (MoE-like).

Relevance: 9 Novelty: 8
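
Sketch (not from the paper): the abstract states that the Tucker form keeps parameter growth small as experts are added. The count below assumes the textbook Tucker layout, one core tensor plus one factor matrix per mode, with the expert index as the third mode. The paper's exact parameterization may differ, and all function names and rank values here are hypothetical.

```python
def tucker_adapter_params(d_out, d_in, num_experts, r_out, r_in, r_exp):
    """Parameters of a Tucker-factored (d_out x d_in x num_experts) tensor."""
    core = r_out * r_in * r_exp                     # small core tensor
    factors = d_out * r_out + d_in * r_in + num_experts * r_exp
    return core + factors

def separate_lora_params(d_out, d_in, num_experts, rank):
    """Baseline for comparison: one independent rank-`rank` LoRA per expert."""
    return num_experts * rank * (d_out + d_in)
```

Under these assumptions, adding one expert to the Tucker form costs only `r_exp` extra parameters (one new row in the expert-mode factor), while the independent-adapter baseline grows by `rank * (d_out + d_in)`.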

21. P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats

ArXiv ID: 2511.06838

Authors: Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah

Abstract: The substantial memory bandwidth and computational demand of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing units (NPUs) with DRAM-based processing-in-memory (PIM) for LLM acceleration. However, existing high-precision (e.g., FP16) PIM compute units incur significant area and power overhead in DRAM technology, limiting the effective computation throughput. In this paper, we introduce P3-LLM, a novel NPU-PIM integrated accelerator for LLM inference using hybrid numerical formats. Our approach is threefold: First, we propose a flexible mixed-precision quantization scheme, which leverages hybrid numerical formats to quantize different LLM operands with high compression efficiency and minimal accuracy loss. Second, we architect an efficient PIM accelerator co-design for P3-LLM, featuring lightweight compute units to support our hybrid numerical formats. The enhanced PIM compute units significantly boost the computation throughput under iso-area constraints. Third, we optimize the low-precision dataflow of different LLM modules by applying operator fusion to minimize the overhead of runtime dequantization. Our evaluation on a diverse set of representative LLMs and tasks demonstrates that P3-LLM achieves state-of-the-art quantization accuracy in terms of both KV-cache-only quantization and weight-activation quantization. Combining the proposed quantization scheme with PIM architecture co-design, P3-LLM yields an average of $4.9\times$, $2.0\times$, and $3.4\times$ speedups over the state-of-the-art LLM accelerators HBM-PIM, Ecco, and Pimba, respectively. Our quantization code is available at https://github.com/yc2367/P3-LLM.git

Comment: Matches HPC and Compression/Efficiency: NPU–PIM co-design with mixed-precision quantization and operator fusion for LLM inference.

Relevance: 9 Novelty: 8

22. Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration

ArXiv ID: 2511.06313

Authors: Stef Cuyckens, Xiaoling Yi, Robin Geens, Joren Dumoulin, Martin Wiesner, Chao Fang, Marian Verhelst

Abstract: Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8x8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design both on MAC and system level and compare it to the SotA. Our integrated system achieves an energy efficiency of 657, 1438-1675, and 4065 GOPS/W, respectively, for MXINT8, MXFP8/6, and MXFP4, with a throughput of 64, 256, and 512 GOPS.

Comment: High Performance Computing/Efficiency: precision-scalable microscaling datapaths with optimized reduction tree and NPU integration for mixed-precision MACs.

Relevance: 9 Novelty: 8
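
Sketch (not from the paper): for readers unfamiliar with Microscaling, the core idea is that a block of elements shares one power-of-two scale while each element is stored in a narrow format. The Python below is a simplified illustration with 8-bit integer codes. The block size of 32 follows the OCP MX specification, but the scale selection and integer encoding here are simplifying assumptions and not the specification's exact MXINT8 encoding, nor the paper's datapath.

```python
import math

BLOCK_SIZE = 32  # MX formats share one scale across a block of 32 elements

def quantize_block(values):
    """Return (shared_exponent, int_codes) for one block of floats."""
    max_abs = max(abs(v) for v in values)
    # Assumption: pick the smallest power-of-two scale that keeps all codes
    # within [-127, 127].
    exponent = math.ceil(math.log2(max_abs / 127)) if max_abs > 0 else 0
    scale = 2.0 ** exponent
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return exponent, codes

def dequantize_block(exponent, codes):
    """Reconstruct floats from the shared exponent and integer codes."""
    scale = 2.0 ** exponent
    return [c * scale for c in codes]
```

With this choice of scale, the reconstruction error of each element is at most half the shared scale.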

23. Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

ArXiv ID: 2511.05541

Authors: Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Abstract: Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences". In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.

Comment: Representation Learning: advances Sparse Autoencoders with a temporal contrastive loss to disentangle semantic vs. syntactic features for interpretability.

Relevance: 9 Novelty: 8

ArXiv ID: 2511.06767

Authors: Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan

Abstract: Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, nonlinear operations in models significantly contribute to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96 times end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches, all while maintaining high model accuracy -- and even substantially boosting accuracy under ultra-low-bit quantization.

Comment: Compression/Efficiency/HPC: quantization-enabled circuit sharing for nonlinear ops in Transformers on FPGAs, reducing latency and resources.

Relevance: 9 Novelty: 7

25. MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression

ArXiv ID: 2511.06842

Authors: Lionel Levine, Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh

Abstract: We introduce MI-to-Mid Distilled Compression (M2M-DC), a two-scale, shape-safe compression framework that interleaves information-guided block pruning with progressive inner slicing and staged knowledge distillation (KD). First, M2M-DC ranks residual (or inverted-residual) blocks by a label-aware mutual information (MI) signal and removes the least informative units (structured prune-after-training). It then alternates short KD phases with stage-coherent, residual-safe channel slicing: (i) stage "planes" (co-slicing conv2 out-channels with the downsample path and next-stage inputs), and (ii) an optional mid-channel trim (conv1 out / bn1 / conv2 in). This targets complementary redundancy (whole computational motifs and within-stage width) while preserving residual shape invariants. On CIFAR-100, M2M-DC yields a clean accuracy-compute frontier. For ResNet-18, we obtain 85.46% Top-1 with 3.09M parameters and 0.0139 GMacs (72% params, 63% GMacs vs. teacher; mean final 85.29% over three seeds). For ResNet-34, we reach 85.02% Top-1 with 5.46M params and 0.0195 GMacs (74% / 74% vs. teacher; mean final 84.62%). Extending to inverted-residuals, MobileNetV2 achieves a mean final 68.54% Top-1 at 1.71M params (27%) and 0.0186 conv GMacs (24%), improving over the teacher's 66.03% by +2.5 points across three seeds. Because M2M-DC exposes only a thin, architecture-aware interface (blocks, stages, and downsample/skip wiring), it generalizes across residual CNNs and extends to inverted-residual families with minor legalization rules. The result is a compact, practical recipe for deployment-ready models that match or surpass teacher accuracy at a fraction of the compute.

Comment: Model Compression and Efficiency: structured block pruning guided by mutual information plus progressive channel slicing and KD.

Relevance: 9 Novelty: 7

26. CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems

ArXiv ID: 2511.06265

Authors: Mohammad Helal Uddin, Sai Krishna Ghanta, Liam Seymour, Sabur Baidya

Abstract: Deep learning algorithms are becoming an essential component of many artificial intelligence (AI) driven applications, many of which run on resource-constrained and energy-constrained systems. Although different techniques for compressing neural network models have been proposed for efficient deployment of these algorithms, neural pruning is one of the fastest and most effective methods, providing a high compression gain at minimal cost. To harness enhanced performance gain with respect to model complexity, we propose a novel neural network pruning approach utilizing Hessian-vector products that approximate crucial curvature information in the loss function, which significantly reduces the computational demands. By employing a power iteration method, our algorithm effectively identifies and preserves the essential information, ensuring a balanced trade-off between model accuracy and computational efficiency. Herein, we introduce CAMP-HiVe, a cyclic pair merging-based pruning method with Hessian-vector approximation that iteratively consolidates weight pairs, combining significant and less significant weights, thus effectively streamlining the model while preserving its performance. This dynamic, adaptive framework allows for real-time adjustment of weight significance, ensuring that only the most critical parameters are retained. Our experimental results demonstrate that our proposed method achieves significant reductions in computational requirements while maintaining high performance across different neural network architectures, e.g., ResNet18, ResNet56, and MobileNetv2, on standard benchmark datasets, e.g., CIFAR10, CIFAR-100, and ImageNet, and it outperforms the existing state-of-the-art neural pruning methods.

Comment: Model Compression and Efficiency: pruning via Hessian-vector approximation and cyclic pair merging for resource-constrained deployment.

Relevance: 9 Novelty: 7
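
Sketch (not from the paper): the abstract combines Hessian-vector products with power iteration to obtain curvature information without forming the Hessian. The toy below shows only that generic combination on a quadratic loss with a known Hessian `A`, using a finite-difference Hessian-vector product. It does not reproduce the pair-merging procedure, and the names `grad`, `hvp`, and `top_eigenvalue` are hypothetical.

```python
import math

A = [[2.0, 0.0], [0.0, 1.0]]  # toy Hessian of the loss 0.5 * w^T A w

def grad(w):
    """Gradient of the toy quadratic loss, i.e. A @ w."""
    return [sum(A[i][j] * w[j] for j in range(len(w))) for i in range(len(A))]

def hvp(w, v, eps=1e-4):
    """Finite-difference Hessian-vector product: (g(w + eps*v) - g(w)) / eps."""
    g0 = grad(w)
    g1 = grad([wi + eps * vi for wi, vi in zip(w, v)])
    return [(b - a) / eps for a, b in zip(g0, g1)]

def top_eigenvalue(w, num_iters=50):
    """Power iteration that touches the Hessian only through hvp calls."""
    v = [1.0] * len(w)
    for _ in range(num_iters):
        hv = hvp(w, v)
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    hv = hvp(w, v)
    return sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
```

Each iteration costs two gradient evaluations, which is the source of the savings over building the full Hessian.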

27. Learning to Focus: Focal Attention for Selective and Scalable Transformers

ArXiv ID: 2511.06818

Authors: Dhananjay Ram, Wei Xia, Stefano Soatto

Abstract: Attention is a core component of the transformer architecture, whether in encoder-only, decoder-only, or encoder-decoder models. However, standard softmax attention often produces a noisy probability distribution, which can impair effective feature selection at every layer of these models, particularly for long contexts. We propose Focal Attention, a simple yet effective modification that sharpens the attention distribution by controlling the softmax temperature, either as a fixed hyperparameter or as a learnable parameter during training. This sharpening enables the model to concentrate on the most relevant tokens while suppressing irrelevant ones. Empirically, Focal Attention scales more favorably than the standard transformer with respect to model size, training data, and context length. Across diverse benchmarks, it achieves the same accuracy with up to 42% fewer parameters or 33% less training data. On long-context tasks, it delivers substantial relative improvements ranging from 17% to 82%, demonstrating its effectiveness in real-world applications.

Comment: Model Architecture/Efficiency: Focal Attention sharpens softmax via temperature control (fixed or learnable), improving scaling and long-context performance.

Relevance: 9 Novelty: 7
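
Sketch (not from the paper): the sharpening effect of a lower softmax temperature can be seen on a toy score vector. The code assumes the temperature simply divides the attention scores before the softmax; how the paper combines it with the usual 1/sqrt(d) scaling, and how the learnable variant is parameterized, is not stated in the abstract.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over scores divided by a temperature (lower means sharper)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats; lower entropy means a more peaked distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

scores = [2.0, 1.0, 0.5, 0.1]
baseline = softmax(scores, temperature=1.0)
sharpened = softmax(scores, temperature=0.5)
```

The sharpened distribution puts more mass on the top-scoring token and has lower entropy than the baseline.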

28. Understanding the role of depth in the neural tangent kernel for overparameterized neural networks

ArXiv ID: 2511.07272

Authors: William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi

Abstract: Overparameterized fully-connected neural networks have been shown to behave like kernel models when trained with gradient descent, under mild conditions on the width, the learning rate, and the parameter initialization. In the limit of infinitely large widths and small learning rate, the kernel that is obtained allows the output of the learned model to be represented with a closed-form solution. This closed-form solution hinges on the invertibility of the limiting kernel, a property that often holds on real-world datasets. In this work, we analyze the sensitivity of large ReLU networks to increasing depths by characterizing the corresponding limiting kernel. Our theoretical results demonstrate that the normalized limiting kernel approaches the matrix of ones. In contrast, they show the corresponding closed-form solution approaches a fixed limit on the sphere. We empirically evaluate the order of magnitude in network depth required to observe this convergent behavior, and we describe the essential properties that enable the generalization of our results to other kernels.

Comment: Matches Representation Learning/training dynamics: analysis of NTK behavior with increasing depth in overparameterized networks.

Relevance: 9 Novelty: 7

29. Minimum Width of Deep Narrow Networks for Universal Approximation

ArXiv ID: 2511.06837

Authors: Xiao-Song Yang, Qi Zhou, Xuan Zhou

Abstract: Determining the minimum width of fully connected neural networks has become a fundamental problem in recent theoretical studies of deep neural networks. In this paper, we study the lower bounds and upper bounds of the minimum width required for fully connected neural networks in order to have universal approximation capability, which is important in network design and training. We show that $w_{min}\leq\max(2d_x+1, d_y)$ for networks with ELU and SELU activations, and that the upper bound of this inequality is attained when $d_y=2d_x$, where $d_x$, $d_y$ denote the input and output dimensions, respectively. Besides, we show that $d_x+1\leq w_{min}\leq d_x+d_y$ for networks with LeakyReLU, ELU, CELU, SELU, and Softplus, by proving that ReLU can be approximated by these activation functions. In addition, in the case that the activation function is injective or can be uniformly approximated by a sequence of injective functions (e.g., ReLU), we present a new proof of the inequality $w_{min}\ge d_y+\mathbf{1}_{d_x<d_y\leq2d_x}$ by constructing a more intuitive example via a new geometric approach based on the Poincaré-Miranda theorem.

Comment: Model Architecture: theoretical bounds on minimum width for universal approximation in deep narrow networks across activations.

Relevance: 9 Novelty: 7
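
Sketch (not from the paper): the three bounds quoted in the abstract can be evaluated for concrete dimensions. The helper below only transcribes those inequalities; the function names are hypothetical, and which bound applies depends on the activation conditions stated in the abstract.

```python
def upper_bound_elu_selu(d_x, d_y):
    """w_min <= max(2*d_x + 1, d_y), stated for ELU and SELU networks."""
    return max(2 * d_x + 1, d_y)

def sandwich_bounds(d_x, d_y):
    """d_x + 1 <= w_min <= d_x + d_y, stated for LeakyReLU, ELU, CELU, SELU, Softplus."""
    return d_x + 1, d_x + d_y

def lower_bound_injective(d_x, d_y):
    """w_min >= d_y + 1[d_x < d_y <= 2*d_x], for injective-type activations."""
    return d_y + (1 if d_x < d_y <= 2 * d_x else 0)
```

For example, with d_x = 3 and d_y = 6 the lower bound is 6 + 1 = 7 and the ELU/SELU upper bound is max(7, 6) = 7, which matches the abstract's remark that the upper bound is attained when d_y = 2 d_x.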

30. Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

ArXiv ID: 2511.05745

Authors: Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM explanations, their practical adoption faces a fundamental challenge: better interpretability demands that SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation, thereby reducing computation. In a well-designed MoE, each expert should focus on learning a distinct set of features. However, we identify a critical limitation in MoE-SAE: Experts often fail to specialize, which means they frequently learn overlapping or identical features. To deal with it, we propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling. Experiments demonstrate a 24% lower reconstruction error and a 99% reduction in feature redundancy compared to existing MoE-SAE methods. This work bridges the interpretability-efficiency gap in LLM analysis, allowing transparent model inspection without compromising computational feasibility.

Comment: Model Architecture and Sparsity: multi-expert sparse autoencoder with multiple expert activation and feature scaling to reduce redundancy and improve specialization.

Relevance: 9 Novelty: 7

31. Rank-1 LoRAs Encode Interpretable Reasoning Signals

ArXiv ID: 2511.06739

Authors: Jake Ward, Paul Riechers, Adam Shai

Abstract: Reasoning models leverage inference-time compute to significantly enhance the performance of language models on difficult logical tasks, and have become a dominating paradigm in frontier LLMs. Despite their wide adoption, the mechanisms underpinning the enhanced performance of these reasoning models are not well understood. In this work, we show that the majority of new capabilities in reasoning models can be elicited by small, single-rank changes to base model parameters, with many of these changes being interpretable. Specifically, we use a rank-1 LoRA to create a minimal parameter adapter for Qwen-2.5-32B-Instruct which recovers 73-90% of reasoning-benchmark performance compared to a full parameter finetune. We find that the activations of this LoRA are as interpretable as MLP neurons, and fire for reasoning-specific behaviors. Finally, we train a sparse autoencoder on the entire activation state of this LoRA and identify fine-grained and monosemantic features. Our findings highlight that reasoning performance can arise largely from minimal changes to base model parameters, and explore what these changes affect. More broadly, our work shows that parameter-efficient training methods can be used as a targeted lens for uncovering fundamental insights about language model behavior and dynamics.

Comment: Model Compression and Efficiency: exploits low-rank (rank-1) LoRA adapters; Representation Learning: analyzes interpretable features via sparse autoencoders.

Relevance: 9 Novelty: 7
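
Sketch (not from the paper): a rank-1 adapter adds an outer product `b a^T` to a weight matrix, so its effect on an input is a single scalar `a . x` times a fixed direction `b`. That scalar is the kind of per-token "activation" that can be inspected. The code below is a generic rank-1 update on a toy 2x2 matrix; the LoRA scaling factor is omitted, and which matrices the paper adapts is not stated in the abstract.

```python
def rank1_lora_forward(W, a, b, x):
    """Compute (W + b a^T) x and return the output with the scalar activation."""
    activation = sum(ai * xi for ai, xi in zip(a, x))       # a . x
    base = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    out = [yi + activation * bi for yi, bi in zip(base, b)]
    return out, activation

def rank1_param_count(d_out, d_in):
    """A rank-1 adapter stores one vector per side of the adapted matrix."""
    return d_out + d_in

W = [[1.0, 0.0], [0.0, 1.0]]
y, act = rank1_lora_forward(W, a=[1.0, 1.0], b=[0.5, -0.5], x=[2.0, 3.0])
```

Here `act` is 5.0 and `y` is `[4.5, 0.5]`: the identity output `[2.0, 3.0]` shifted along `b` by the activation.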

32. Diversified Flow Matching with Translation Identifiability

ArXiv ID: 2511.05558

Authors: Sagar Shrestha, Xiao Fu

Abstract: Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide the transport trajectory information -- yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces diversified flow matching (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function's velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.

Comment: Generative modeling/representation: ODE-based diversified flow matching with translation identifiability guarantees.

Relevance: 8 Novelty: 8

33. TNT: Improving Chunkwise Training for Test-Time Memorization

ArXiv ID: 2511.07343

Authors: Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni

Abstract: Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies to enable massive context parallelization. Stage two is a brief fine-tuning phase where only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration, while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.

Comment: High Performance Computing: training paradigm enabling massive context parallelization for RNNs via hierarchical memory and chunk decoupling.

Relevance: 8 Novelty: 8

34. Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees

ArXiv ID: 2511.06598

Authors: Mohammad Shirzadi, Ali Safarpoor Dehkordi, Ahad N. Zehmakan

Abstract: Message passing is the core operation in graph neural networks, where each node updates its embeddings by aggregating information from its neighbors. However, in deep architectures, this process often leads to diminished expressiveness. A popular solution is to use residual connections, where the input from the current (or initial) layer is added to aggregated neighbor information to preserve embeddings across layers. Following a recent line of research, we investigate an adaptive residual scheme in which different nodes have varying residual strengths. We prove that this approach prevents oversmoothing; particularly, we show that the Dirichlet energy of the embeddings remains bounded away from zero. This is the first theoretical guarantee not only for the adaptive setting, but also for static residual connections (where residual strengths are shared across nodes) with activation functions. Furthermore, extensive experiments show that this adaptive approach outperforms standard and state-of-the-art message passing mechanisms, especially on heterophilic graphs. To improve the time complexity of our approach, we introduce a variant in which residual strengths are not learned but instead set heuristically, a choice that performs as well as the learnable version.

Comment: Model Architecture: adaptive initial residual connections in GNNs with theoretical guarantees preventing oversmoothing (Dirichlet energy bounded away from zero).

Relevance: 8 Novelty: 8
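
To make the oversmoothing claim in the abstract above concrete, here is a minimal NumPy sketch that compares plain propagation with an initial residual connection using per-node strengths. It is a simplification under stated assumptions: weight matrices and activation functions are omitted, the Dirichlet energy is taken as trace(H^T (I - A_hat) H) (one common definition, which may differ from the paper's), and the graph, features, and `alpha_adaptive` values are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def dirichlet_energy(H, A_hat):
    """trace(H^T (I - A_hat) H); close to zero when embeddings are oversmoothed."""
    L = np.eye(A_hat.shape[0]) - A_hat
    return float(np.trace(H.T @ L @ H))

def propagate(H0, A_hat, alpha, num_layers):
    """Per node i: H_i <- (1 - alpha_i) * (A_hat @ H)_i + alpha_i * (H0)_i."""
    H = H0.copy()
    for _ in range(num_layers):
        H = (1.0 - alpha)[:, None] * (A_hat @ H) + alpha[:, None] * H0
    return H

# Hypothetical toy graph: a cycle on 6 nodes.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
A_hat = normalized_adjacency(A)

H0 = np.eye(n)[:, :2]                       # two simple one-hot input features
alpha_adaptive = np.linspace(0.1, 0.5, n)   # assumed per-node residual strengths

e_plain = dirichlet_energy(propagate(H0, A_hat, np.zeros(n), 64), A_hat)
e_res = dirichlet_energy(propagate(H0, A_hat, alpha_adaptive, 64), A_hat)
print("energy without residual:", e_plain)
print("energy with adaptive initial residual:", e_res)
```

In this toy setting the energy without a residual term collapses toward zero after many layers, while the per-node initial residual keeps it clearly positive, which is the qualitative behavior the abstract describes.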

35. DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning

ArXiv ID: 2511.06477

Authors: Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba

Abstract: Recently, optimizers that explicitly treat weights as matrices, rather than flattened vectors, have demonstrated their effectiveness. This perspective naturally leads to structured approximations of the Fisher matrix as preconditioners, where the matrix view induces a Kronecker-factorized form that enables memory-efficient representation. However, constructing such approximations both efficiently and accurately remains an open challenge, since obtaining the optimal factorization is resource-intensive and practical methods therefore rely on heuristic design choices. In this work, we introduce a novel approach that leverages projector-splitting integrators to construct effective preconditioners. Our optimizer, DyKAF (Dynamical Kronecker Approximation of the Fisher Matrix), consistently improves the Fisher matrix approximation quality. Experiments on large language model pre-training and fine-tuning demonstrate that DyKAF outperforms existing optimizers across a range of evaluation metrics.

Comment: Efficiency/HPC: optimizer with dynamic Kronecker approximation of Fisher for scalable gradient preconditioning.

Relevance: 8 Novelty: 8

36. Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

ArXiv ID: 2511.07372

Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki

Abstract: Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from the Chain-of-Thoughts (CoTs) solving mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.

Comment: Representation Learning/Training Dynamics: provides provable benefits of curriculum post-training and test-time scaling for Transformer tree reasoning, reducing sample complexity.

Relevance: 8 Novelty: 8

37. A Provably-Correct and Robust Convex Model for Smooth Separable NMF

ArXiv ID: 2511.07109

Authors: Junjun Pan, Valentin Leplat, Michael Ng, Nicolas Gillis

Abstract: Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for nonnegative data, with applications such as hyperspectral unmixing and topic modeling. NMF is a difficult problem in general (NP-hard), and its solutions are typically not unique. To address these two issues, additional constraints or assumptions are often used. In particular, separability assumes that the basis vectors in the NMF are equal to some columns of the input matrix. In that case, the problem is referred to as separable NMF (SNMF) and can be solved in polynomial-time with robustness guarantees, while identifying a unique solution. However, in real-world scenarios, due to noise or variability, multiple data points may lie near the basis vectors, which SNMF does not leverage. In this work, we rely on the smooth separability assumption, which assumes that each basis vector is close to multiple data points. We explore the properties of the corresponding problem, referred to as smooth SNMF (SSNMF), and examine how it relates to SNMF and orthogonal NMF. We then propose a convex model for SSNMF and show that it provably recovers the sought-after factors, even in the presence of noise. We finally adapt an existing fast gradient method to solve this convex model for SSNMF, and show that it compares favorably with state-of-the-art methods on both synthetic and hyperspectral datasets.

Comment: Representation Learning: convex, provably-correct model for smooth separable NMF with robustness guarantees.

Relevance: 8 Novelty: 8

38. Physics-Informed Design of Input Convex Neural Networks for Consistency Optimal Transport Flow Matching

ArXiv ID: 2511.06042

Authors: Fanghui Song, Zhongjian Wang, Jiebao Sun

Abstract: We propose a consistency model based on the optimal-transport flow. A physics-informed design of partially input-convex neural networks (PICNN) plays a central role in constructing the flow field that emulates the displacement interpolation. During the training stage, we couple the Hamilton-Jacobi (HJ) residual in the OT formulation with the original flow matching loss function. Our approach avoids inner optimization subproblems that are present in previous one-step OFM approaches. During the prediction stage, our approach supports both one-step (Brenier-map) and multi-step ODE sampling from the same learned potential, leveraging the straightness of the OT flow. We validate scalability and performance on standard OT benchmarks.

Comment: Model Architecture and Efficiency: physics-informed PICNN for OT flow matching with HJ residual; supports one-step and ODE sampling from the same potential.

Relevance: 8 Novelty: 8

39. C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning

ArXiv ID: 2511.07396

Authors: Antonios Valkanas, Soumyasundar Pal, Pavel Rumiantsev, Yingxue Zhang, Mark Coates

Abstract: Large language models (LLMs) have achieved impressive results on complex reasoning tasks, but their high inference cost remains a major barrier to real-world deployment. A promising solution is to use cascaded inference, where small, cheap models handle easy queries, and only the hardest examples are escalated to more powerful models. However, existing cascade methods typically rely on supervised training with labeled data, offer no theoretical generalization guarantees, and provide limited control over test-time computational cost. We introduce C3PO (Cost Controlled Cascaded Prediction Optimization), a self-supervised framework for optimizing LLM cascades under probabilistic cost constraints. By focusing on minimizing regret with respect to the most powerful model (MPM), C3PO avoids the need for labeled data by constructing a cascade using only unlabeled model outputs. It leverages conformal prediction to bound the probability that inference cost exceeds a user-specified budget. We provide theoretical guarantees on both cost control and generalization error, and show that our optimization procedure is effective even with small calibration sets. Empirically, C3PO achieves state-of-the-art performance across a diverse set of reasoning benchmarks including GSM8K, MATH-500, BigBench-Hard and AIME, outperforming strong LLM cascading baselines in both accuracy and cost-efficiency. Our results demonstrate that principled, label-free cascade optimization can enable scalable LLM deployment.

Comment: High Performance Computing/Efficiency: cascaded LLM inference with probabilistic cost constraints and conformal guarantees; self-supervised optimization.

Relevance: 8 Novelty: 8

40. On the Convergence and Stability of Distributed Sub-model Training

ArXiv ID: 2511.06132

Authors: Yuyang Deng, Fuli Qiao, Mehrdad Mahdavi

Abstract: As learning models continue to grow in size, enabling on-device local training of these models has emerged as a critical challenge in federated learning. A popular solution is sub-model training, where the server distributes only randomly sampled sub-models to the edge clients, and clients update only these small models. However, this random sampling of sub-models may not yield satisfactory convergence. In this paper, motivated by the success of SGD with shuffling, we propose distributed shuffled sub-model training, where the full model is partitioned into several sub-models in advance. At each round, the server shuffles those sub-models and sends each of them to clients; at the end of the local updating period, clients send back the updated sub-models, and the server averages them. We establish the convergence rate of this algorithm. We also study the generalization of distributed sub-model training via stability analysis, and find that sub-model training can improve generalization by amplifying the stability of the training process. Extensive experiments also validate our theoretical findings.
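
The round structure described in the abstract above can be sketched on a toy least-squares problem. This is only an illustration under assumptions: the model is a linear regressor, the number of clients equals the number of sub-models (so the server's averaging step reduces to writing each returned block back), and names such as `local_update`, `blocks`, and the learning rate are hypothetical rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_blocks, samples = 8, 4, 200
blocks = np.array_split(np.arange(dim), num_blocks)  # fixed partition into sub-models
w_true = rng.normal(size=dim)

# Each client holds its own local dataset (noise-free for simplicity).
clients = []
for _ in range(num_blocks):
    X = rng.normal(size=(samples, dim))
    clients.append((X, X @ w_true))

def global_loss(w):
    """Mean squared error averaged over all clients."""
    return float(np.mean([np.mean((X @ w - y) ** 2) for X, y in clients]))

def local_update(w, X, y, idx, lr=0.05, steps=5):
    """Run a few gradient steps that touch only the coordinates in idx."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w[idx] -= lr * grad[idx]
    return w[idx]

w = np.zeros(dim)
initial_loss = global_loss(w)
for _ in range(30):
    assignment = rng.permutation(num_blocks)  # server shuffles sub-models over clients
    new_w = w.copy()
    for client_id, block_id in enumerate(assignment):
        X, y = clients[client_id]
        idx = blocks[block_id]
        new_w[idx] = local_update(w, X, y, idx)  # client returns its updated block
    w = new_w                                    # server assembles the full model
final_loss = global_loss(w)
print("loss before:", initial_loss, "after:", final_loss)
```

With more clients than sub-models, several clients would return the same block and the server would average those copies instead of writing a single one back.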

Comment: HPC/Distributed training: shuffled sub-model training with convergence and stability (generalization) analysis.