Previous Day 2025-11-10
Monthly Overview 2025-11
Next Day 2025-11-12

Personalized Daily ArXiv Papers 2025-11-11

[gpt-5] Prompt Completion Total
Token 77291 63654 140945
Cost $0.1 $0.64 $0.73

Total arXiv papers: 738

Total scanned papers: 446

Total relevant papers: 54

Table of contents with paper titles:

  1. MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference Authors: Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

  2. Route Experts by Sequence, not by Token Authors: Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You

  3. Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving Authors: Hui Zeng, Daming Zhao, Pengfei Yang, Wenxuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai

  4. How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy Authors: Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

  5. MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling Authors: Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

  6. Rethinking Parameter Sharing as Graph Coloring for Structured Compression Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang

  7. PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference Authors: Yushu Zhao, Zheng Wang, Minjia Zhang

  8. Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

  9. Next-Latent Prediction Transformers Learn Compact World Models Authors: Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S. Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, John Langford

  10. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Philippe von Wurstemberger

  11. Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization Authors: Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen

  12. A Fully Polynomial-Time Algorithm for Robustly Learning Halfspaces over the Hypercube Authors: Gautam Chandrasekaran, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

  13. Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas? Authors: Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail I. Katsnelson, Dmitry Vetrov

  14. Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin Authors: Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang

  15. MobileLLM-Pro Technical Report Authors: Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi, Adithya Sagar

  16. The Few Govern the Many:Unveiling Few-Layer Dominance for Time Series Models Authors: Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen

  17. Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels Authors: Yong-Ming Tian, Shuang Liang, Shao-Qun Zhang, Feng-Lei Fan

  18. LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi

  19. Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence Authors: Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum

  20. TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning Authors: Qifeng Lei, Zhiyong Yang, Qianqian Xu, Cong Hua, Peisong Wen, Qingming Huang

  21. P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats Authors: Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah

  22. Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration Authors: Stef Cuyckens, Xiaoling Yi, Robin Geens, Joren Dumoulin, Martin Wiesner, Chao Fang, Marian Verhelst

  23. Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability Authors: Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

  24. QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations Authors: Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan

  25. MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression Authors: Lionel Levine, Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh

  26. CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems Authors: Mohammad Helal Uddin, Sai Krishna Ghanta, Liam Seymour, Sabur Baidya

  27. Learning to Focus: Focal Attention for Selective and Scalable Transformers Authors: Dhananjay Ram, Wei Xia, Stefano Soatto

  28. Understanding the role of depth in the neural tangent kernel for overparameterized neural networks Authors: William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi

  29. Minimum Width of Deep Narrow Networks for Universal Approximation Authors: Xiao-Song Yang, Qi Zhou, Xuan Zhou

  30. Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder Authors: Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

  31. Rank-1 LoRAs Encode Interpretable Reasoning Signals Authors: Jake Ward, Paul Riechers, Adam Shai

  32. Diversified Flow Matching with Translation Identifiability Authors: Sagar Shrestha, Xiao Fu

  33. TNT: Improving Chunkwise Training for Test-Time Memorization Authors: Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni

  34. Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees Authors: Mohammad Shirzadi, Ali Safarpoor Dehkordi, Ahad N. Zehmakan

  35. DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning Authors: Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba

  36. Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki

  37. A Provably-Correct and Robust Convex Model for Smooth Separable NMF Authors: Junjun Pan, Valentin Leplat, Michael Ng, Nicolas Gillis

  38. Physics-Informed Design of Input Convex Neural Networks for Consistency Optimal Transport Flow Matching Authors: Fanghui Song, Zhongjian Wang, Jiebao Sun

  39. C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning Authors: Antonios Valkanas, Soumyasundar Pal, Pavel Rumiantsev, Yingxue Zhang, Mark Coates

  40. On the Convergence and Stability of Distributed Sub-model Training Authors: Yuyang Deng, Fuli Qiao, Mehrdad Mahdavi

  41. How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation Authors: Zinuo You, Jin Zheng, John Cartlidge

  42. Rep2Text: Decoding Full Text from a Single LLM Token Representation Authors: Haiyan Zhao, Zirui He, Fan Yang, Ali Payani, Mengnan Du

  43. From Kernels to Attention: A Transformer Framework for Density and Score Estimation Authors: Vasily Ilin, Peter Sushko

  44. First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation Authors: Dmytro Vitel, Anshuman Chhabra

  45. Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily Authors: Asela Hevapathige, Asiri Wijesinghe, Ahad N. Zehmakan

  46. Sampling and Loss Weights in Multi-Domain Training Authors: Mahdi Salmani, Pratik Worah, Meisam Razaviyayn, Vahab Mirrokni

  47. Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks Authors: Dian Jin, Yancheng Yuan, Xiaoming Tao

  48. An Efficient Gradient-Aware Error-Bounded Lossy Compressor for Federated Learning Authors: Zhijing Ye, Sheng Di, Jiamin Wang, Zhiqing Zhong, Zhaorui Zhang, Xiaodong Yu

  49. Mixtures of SubExperts for Large Language Continual Learning Authors: Haeyong Kang

  50. Non-Negative Stiefel Approximating Flow: Orthogonalish Matrix Optimization for Interpretable Embeddings Authors: Brian B. Avants (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA), Nicholas J. Tustison (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA), James R Stone (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA)

  51. Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding Authors: Qian Ma, Ruoxiang Xu, Yongqiang Cai

  52. Walsh-Hadamard Neural Operators for Solving PDEs with Discontinuous Coefficients Authors: Giorrgio M. Cavallazzi, Miguel Perex Cuadrado, Alfredo Pinelli

  53. Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention Authors: Wenjie Hu, Sidun Liu, Peng Qiao, Zhenglun Sun, Yong Dou

  54. Recursive Dynamics in Fast-Weights Homeostatic Reentry Networks: Toward Reflective Intelligence Authors: B. G. Chae


1. MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

ArXiv ID: 2511.06010

Authors: Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

Abstract: The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.

Comment: Compression/Efficiency and HPC: Shared KV Attention transforming memory-bound KV cache ops to compute-bound GEMMs with MoE-inspired sparse attention and disaggregated infrastructure.

Relevance: 10 Novelty: 9


2. Route Experts by Sequence, not by Token

ArXiv ID: 2511.06494

Authors: Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You

Abstract: Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.

Comment: Model Architecture: MoE routing innovation (sequence-level TopK) enabling dynamic expert allocation under fixed budget, improving efficiency at high sparsity.

Relevance: 10 Novelty: 8


3. Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

ArXiv ID: 2511.06029

Authors: Hui Zeng, Daming Zhao, Pengfei Yang, Wenxuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai

Abstract: Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention} (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increases throughput by up to 2.56x.

Comment: Compression/Efficiency: adaptive layer- and time-aware KV cache pruning with relevance-aware retention for long-form LLM reasoning.

Relevance: 10 Novelty: 8


4. How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy

ArXiv ID: 2511.06044

Authors: Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

Abstract: Attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information and its expressivity has been proved to be powerful. Nevertheless, the quadratic complexity restricts its practicability. Although several researches have provided attention mechanism in sparse form, they are lack of theoretical analysis about the expressivity of their mechanism while reducing complexity. In this paper, we put forward Random Batch Attention (RBA), a linear self-attention mechanism, which has theoretical support of the ability to maintain its expressivity. Random Batch Attention has several significant strengths as follows: (1) Random Batch Attention has linear time complexity. Other than this, it can be implemented in parallel on a new dimension, which contributes to much memory saving. (2) Random Batch Attention mechanism can improve most of the existing models by replacing their attention mechanisms, even many previously improved attention mechanisms. (3) Random Batch Attention mechanism has theoretical explanation in convergence, as it comes from Random Batch Methods on computation mathematics. Experiments on large graphs have proved advantages mentioned above. Also, the theoretical modeling of self-attention mechanism is a new tool for future research on attention-mechanism analysis.

Comment: Model Architecture/Efficiency: Random Batch Attention, a linear-time self-attention with theoretical expressivity and parallelization benefits.

Relevance: 10 Novelty: 8


5. MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

ArXiv ID: 2511.05811

Authors: Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

Abstract: Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while achieving up to 34% higher training throughput.

Comment: Model Compression and Efficiency: FP8 training with microscaling and automatic scaling for throughput and numerical stability.

Relevance: 10 Novelty: 8


6. Rethinking Parameter Sharing as Graph Coloring for Structured Compression

ArXiv ID: 2511.06786

Authors: Boyang Zhang, Daning Cheng, Yunquan Zhang

Abstract: Modern deep models have massive parameter sizes, leading to high inference-time memory usage that limits practical deployment. Parameter sharing, a form of structured compression, effectively reduces redundancy, but existing approaches remain heuristic-restricted to adjacent layers and lacking a systematic analysis for cross-layer sharing. However, extending sharing across multiple layers leads to an exponentially expanding configuration space, making exhaustive search computationally infeasible and forming a critical bottleneck for parameter sharing. We recast parameter sharing from a group-theoretic perspective as introducing structural symmetries in the model's parameter space. A sharing configuration can be described by a coloring function $\alpha:L\rightarrow C$ (L: layer indices and C: sharing classes), which determines inter-layer sharing groups while preserving structural symmetry. To determine the coloring function, we propose a second-order geometric criterion based on Taylor expansion and the Hessian spectrum. By projecting perturbations onto the Hessian's low-curvature eigensubspace, the criterion provides an analytic rule for selecting sharing groups that minimize performance impact, yielding a principled and scalable configuration procedure. Across diverse architectures and tasks, Geo-Sharing consistently outperforms state-of-the-art heuristic sharing strategies, achieving higher compression ratios with smaller accuracy degradation.

Comment: Model Compression and Efficiency: cross-layer parameter sharing cast as graph coloring with Hessian-based geometric criterion (structured compression).

Relevance: 10 Novelty: 8


7. PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference

ArXiv ID: 2511.04805

Authors: Yushu Zhao, Zheng Wang, Minjia Zhang

Abstract: Mixture-of-Experts (MoE) models have shown strong potential in scaling language models efficiently by activating only a small subset of experts per input. However, their widespread deployment remains limited due to the high memory overhead associated with storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies, yet they often suffer from performance drop at high compression ratios. In this paper, we introduce PuzzleMoE, a training-free MoE compression method that achieves both high accuracy and efficient inference through two key innovations: First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. It uses a dual-mask to capture both shared and expert-specific parameters. Second, to avoid the overhead of storing binary masks and signs, PuzzleMoE introduces a bit-packed encoding scheme that reuses underutilized exponent bits, enabling efficient MoE inference on GPUs. Extensive experiments demonstrate that PuzzleMoE can compress MoE models by up to 50% while maintaining accuracy across various tasks. Specifically, it outperforms prior MoE compression methods by up to 16.7% on MMLU at 50% compression ratio, and achieves up to 1.28\times inference speedup.

Comment: Matches Model Compression and Architecture: training-free MoE compression via sparse expert merging and bit-packed inference.

Relevance: 10 Novelty: 8


8. Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

ArXiv ID: 2511.07419

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

Abstract: Sparse Mixture-of-Experts (MoE) have been widely adopted in recent large language models since it can efficiently scale up the model capability without increasing the inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embedding can effectively reduce the gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieve better generalization. Moreover, RoMA demonstrates the advantage of unifying the task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.

Comment: Model Architecture (MoE): aligns routing weights with task manifolds via manifold regularization, improving generalization with lightweight router fine-tuning.

Relevance: 10 Novelty: 8


9. Next-Latent Prediction Transformers Learn Compact World Models

ArXiv ID: 2511.05963

Authors: Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S. Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, John Langford

Abstract: Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc look ups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics -- a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies -- world modeling, reasoning, planning, and language modeling -- NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.

Comment: Representation Learning/Architecture: Next-Latent Prediction objective induces compact belief-state latents and transition dynamics in Transformers.

Relevance: 9 Novelty: 9


10. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer

ArXiv ID: 2511.06675

Authors: Steffen Dereich, Thang Do, Arnulf Jentzen, Philippe von Wurstemberger

Abstract: Beside the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimization method for the training of deep neural networks in artificial intelligence (AI) systems. Despite the popularity and the success of Adam it remains an \emph{open research problem} to provide a rigorous convergence analysis for Adam even for the class of strongly convex SOPs. In one of the main results of this work we establish convergence rates for Adam in terms of the number of gradient steps (convergence rate \nicefrac{1}{2} w.r.t. the size of the learning rate), the size of the mini-batches (convergence rate 1 w.r.t. the size of the mini-batches), and the size of the second moment parameter of Adam (convergence rate 1 w.r.t. the distance of the second moment parameter to 1) for the class of strongly convex SOPs. In a further main result of this work, which we refer to as \emph{Adam symmetry theorem}, we illustrate the optimality of the established convergence rates by proving for a special class of simple quadratic strongly convex SOPs that Adam converges as the number of gradient steps increases to infinity to the solution of the SOP (the unique minimizer of the strongly convex objective function) if and \emph{only} if the random variables in the SOP (the data in the SOP) are \emph{symmetrically distributed}. In particular, in the standard case where the random variables in the SOP are not symmetrically distributed we \emph{disprove} that Adam converges to the minimizer of the SOP as the number of Adam steps increases to infinity. We also complement the conclusions of our convergence analysis and the Adam symmetry theorem by several numerical simulations that indicate the sharpness of the established convergence rates and that illustrate the practical appearance of the phenomena revealed in the \emph{Adam symmetry theorem}.

Comment: Optimization theory: rigorous convergence rates and the Adam symmetry theorem for SGD-Adam on strongly convex problems.

Relevance: 9 Novelty: 9


11. Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

ArXiv ID: 2511.07378

Authors: Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen

Abstract: The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a recursive self-training scheme can progressively extend the range of solvable problem lengths. To our knowledge, we provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT, significantly going beyond prior art confined in $\mathsf{TC}^0$, unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails. Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.

Comment: Matches Representation Learning/Architecture theory: provable chain-of-thought length generalization in transformers via attention concentration.

Relevance: 9 Novelty: 9


12. A Fully Polynomial-Time Algorithm for Robustly Learning Halfspaces over the Hypercube

ArXiv ID: 2511.07244

Authors: Gautam Chandrasekaran, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: We give the first fully polynomial-time algorithm for learning halfspaces with respect to the uniform distribution on the hypercube in the presence of contamination, where an adversary may corrupt some fraction of examples and labels arbitrarily. We achieve an error guarantee of $\eta^{O(1)}+\epsilon$ where $\eta$ is the noise rate. Such a result was not known even in the agnostic setting, where only labels can be adversarially corrupted. All prior work over the last two decades has a superpolynomial dependence in $1/\epsilon$ or succeeds only with respect to continuous marginals (such as log-concave densities). Previous analyses rely heavily on various structural properties of continuous distributions such as anti-concentration. Our approach avoids these requirements and makes use of a new algorithm for learning Generalized Linear Models (GLMs) with only a polylogarithmic dependence on the activation function's Lipschitz constant. More generally, our framework shows that supervised learning with respect to discrete distributions is not as difficult as previously thought.

Comment: Learning Theory: fully polynomial-time robust algorithm for learning halfspaces over the hypercube under contamination.

Relevance: 9 Novelty: 9


13. Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas?

ArXiv ID: 2511.07308

Authors: Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail I. Katsnelson, Dmitry Vetrov

Abstract: Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.

Comment: Training dynamics theory: maps SGD with weight decay in scale-invariant nets to thermodynamic variables, informing hyperparameter design.

Relevance: 9 Novelty: 8


14. Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin

ArXiv ID: 2511.06077

Authors: Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang

Abstract: Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy -- train on shorter windows, infer on much longer ones -- so the model generalizes to 10k histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end long-sequence recommendation to the 10k regime.

Comment: Model Architecture and Efficiency: replaces self-attention with stacked target-to-history cross-attention for linear complexity; batching and length extrapolation for 10k sequences.

Relevance: 9 Novelty: 8


15. MobileLLM-Pro Technical Report

ArXiv ID: 2511.06719

Authors: Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi, Adithya Sagar

Abstract: Efficient on-device language models around 1 billion parameters are essential for powering low-latency AI applications on mobile and wearable devices. However, achieving strong performance in this model class, while supporting long context windows and practical deployment remains a significant challenge. We introduce MobileLLM-Pro, a 1-billion-parameter language model optimized for on-device deployment. MobileLLM-Pro achieves state-of-the-art results across 11 standard benchmarks, significantly outperforming both Gemma 3-1B and Llama 3.2-1B, while supporting context windows of up to 128,000 tokens and showing only minor performance regressions at 4-bit quantization. These improvements are enabled by four core innovations: (1) implicit positional distillation, a novel technique that effectively instills long-context capabilities through knowledge distillation; (2) a specialist model merging framework that fuses multiple domain experts into a compact model without parameter growth; (3) simulation-driven data mixing using utility estimation; and (4) 4-bit quantization-aware training with self-distillation. We release our model weights and code to support future research in efficient on-device language models.

Comment: Compression/Efficiency: on-device LLM with implicit positional distillation for long context, specialist model merging without parameter growth, and 4-bit QAT.

Relevance: 9 Novelty: 8


16. The Few Govern the Many:Unveiling Few-Layer Dominance for Time Series Models

ArXiv ID: 2511.07237

Authors: Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen

Abstract: Large-scale models are at the forefront of time series (TS) forecasting, dominated by two paradigms: fine-tuning text-based Large Language Models (LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both approaches share a foundational assumption that scaling up model capacity and data volume leads to improved performance. However, we observe a \textit{\textbf{scaling paradox}} in TS models, revealing a puzzling phenomenon that larger models do \emph{NOT} achieve better performance. Through extensive experiments on two model families across four scales (100M to 1.7B parameters) and diverse data (up to 6B observations), we rigorously confirm that the scaling paradox is a pervasive issue. We then diagnose its root cause by analyzing internal representations, identifying a phenomenon we call \textit{few-layer dominance}: only a small subset of layers are functionally important, while the majority are redundant, under-utilized, and can even distract training. Based on this discovery, we propose a practical method to automatically identify and retain only these dominant layers. In our models, retaining only 21\% of the parameters achieves up to a 12\% accuracy improvement and a 2.7$\times$ inference speedup. We validate the universality of our method on 8 prominent SOTA models (LLM4TS and TSFMs, 90M to 6B), showing that retaining less than 30\% of layers achieves comparable or superior accuracy in over 95\% of tasks.

Comment: Compression/Efficiency + Representation: discovers few-layer dominance in TS models and proposes retaining dominant layers, yielding large parameter reduction and speedups.

Relevance: 9 Novelty: 8


17. Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels

ArXiv ID: 2511.05585

Authors: Yong-Ming Tian, Shuang Liang, Shao-Qun Zhang, Feng-Lei Fan

Abstract: While deep learning has achieved remarkable success across a wide range of applications, its theoretical understanding of representation learning remains limited. Deep neural kernels provide a principled framework to interpret over-parameterized neural networks by mapping hierarchical feature transformations into kernel spaces, thereby combining the expressive power of deep architectures with the analytical tractability of kernel methods. Recent advances, particularly neural tangent kernels (NTKs) derived by gradient inner products, have established connections between infinitely wide neural networks and nonparametric Bayesian inference. However, the existing NTK paradigm has been predominantly confined to the infinite-width regime, while overlooking the representational role of network depth. To address this gap, we propose a depth-induced NTK kernel based on a shortcut-related architecture, which converges to a Gaussian process as the network depth approaches infinity. We theoretically analyze the training invariance and spectrum properties of the proposed kernel, which stabilizes the kernel dynamics and mitigates degeneration. Experimental results further underscore the effectiveness of our proposed method. Our findings significantly extend the existing landscape of the neural kernel theory and provide an in-depth understanding of deep learning and the scaling law.

Comment: Representation Learning Theory: proposes a depth-induced NTK capturing depth effects beyond infinite-width NTK, with analysis of spectrum and training invariance.

Relevance: 9 Novelty: 8


18. LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

ArXiv ID: 2511.07129

Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi

Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models.However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks upto a margin of 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.

Comment: Compression/Efficiency: training-free instance-level dynamic selection and merging of multiple LoRA adapters at inference time.

Relevance: 9 Novelty: 8


19. Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

ArXiv ID: 2511.07384

Authors: Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum

Abstract: Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments, on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.

Comment: Model Architecture/Efficiency: retrofits recurrence into pretrained LMs with a recurrence curriculum to decouple test-time compute from parameters/training compute.

Relevance: 9 Novelty: 8


20. TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning

ArXiv ID: 2511.06859

Authors: Qifeng Lei, Zhiyong Yang, Qianqian Xu, Cong Hua, Peisong Wen, Qingming Huang

Abstract: Efficiently fine-tuning pre-trained models for downstream tasks is a key challenge in the era of foundation models. Parameter-efficient fine-tuning (PEFT) presents a promising solution, achieving performance comparable to full fine-tuning by updating only a small number of adaptation weights per layer. Traditional PEFT methods typically rely on a single expert, where the adaptation weight is a low-rank matrix. However, for complex tasks, the data's inherent diversity poses a significant challenge for such models, as a single adaptation weight cannot adequately capture the features of all samples. To address this limitation, we explore how to integrate multiple small adaptation experts into a compact structure to defeat a large adapter. Specifically, we propose Tucker Adaptation (TuckA), a method with four key properties: (i) We use Tucker decomposition to create a compact 3D tensor where each slice naturally serves as an expert. The low-rank nature of this decomposition ensures that the number of parameters scales efficiently as more experts are added. (ii) We introduce a hierarchical strategy that organizes these experts into groups at different granularities, allowing the model to capture both local and global data patterns. (iii) We develop an efficient batch-level routing mechanism, which reduces the router's parameter size by a factor of $L$ compared to routing at every adapted layer (where $L$ is the number of adapted layers) (iv) We propose data-aware initialization to achieve loss-free expert load balancing based on theoretical analysis. Extensive experiments on benchmarks in natural language understanding, image classification, and mathematical reasoning speak to the efficacy of TuckA, offering a new and effective solution to the PEFT problem.

Comment: Matches Model Architecture and Compression/Efficiency: Tucker low-rank PEFT with hierarchical tensor experts and efficient routing (MoE-like).

Relevance: 9 Novelty: 8


21. P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats

ArXiv ID: 2511.06838

Authors: Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah

Abstract: The substantial memory bandwidth and computational demand of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing units (NPUs) with DRAM-based processing-in-memory (PIM) for LLM acceleration. However, existing high-precision (e.g., FP16) PIM compute units incur significant area and power overhead in DRAM technology, limiting the effective computation throughput. In this paper, we introduce P3-LLM, a novel NPU-PIM integrated accelerator for LLM inference using hybrid numerical formats. Our approach is threefold: First, we propose a flexible mixed-precision quantization scheme, which leverages hybrid numerical formats to quantize different LLM operands with high compression efficiency and minimal accuracy loss. Second, we architect an efficient PIM accelerator co-design for P3-LLM, featuring lightweight compute units to support our hybrid numerical formats. The enhanced PIM compute units significantly boost the computation throughput under iso-area constraints. Third, we optimize the low-precision dataflow of different LLM modules by applying operator fusion to minimize the overhead of runtime dequantization. Our evaluation on a diverse set of representative LLMs and tasks demonstrates that P3-LLM achieves state-of-the-art quantization accuracy in terms of both KV-cache-only quantization and weight-activation quantization. Combining the proposed quantization scheme with PIM architecture co-design, P3-LLM yields an average of $4.9\times$, $2.0\times$, and $3.4\times$ speedups over the state-of-the-art LLM accelerators HBM-PIM, Ecco, and Pimba, respectively. Our quantization code is available at https://github.com/yc2367/P3-LLM.git

Comment: Matches HPC and Compression/Efficiency: NPU–PIM co-design with mixed-precision quantization and operator fusion for LLM inference.

Relevance: 9 Novelty: 8


22. Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration

ArXiv ID: 2511.06313

Authors: Stef Cuyckens, Xiaoling Yi, Robin Geens, Joren Dumoulin, Martin Wiesner, Chao Fang, Marian Verhelst

Abstract: Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8x8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design both on MAC and system level and compare it to the SotA. Our integrated system achieves an energy efficiency of 657, 1438-1675, and 4065 GOPS/W, respectively, for MXINT8, MXFP8/6, and MXFP4, with a throughput of 64, 256, and 512 GOPS.

Comment: High Performance Computing/Efficiency: precision-scalable microscaling datapaths with optimized reduction tree and NPU integration for mixed-precision MACs.

Relevance: 9 Novelty: 8


23. Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

ArXiv ID: 2511.05541

Authors: Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Abstract: Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences". In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.

Comment: Representation Learning: advances Sparse Autoencoders with a temporal contrastive loss to disentangle semantic vs. syntactic features for interpretability.

Relevance: 9 Novelty: 8


24. QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations

ArXiv ID: 2511.06767

Authors: Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan

Abstract: Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, nonlinear operations in models significantly contribute to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96 times end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches, all while maintaining high model accuracy -- and even substantially boosting accuracy under ultra-low-bit quantization.

Comment: Compression/Efficiency/HPC: quantization-enabled circuit sharing for nonlinear ops in Transformers on FPGAs, reducing latency and resources.

Relevance: 9 Novelty: 7


25. MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression

ArXiv ID: 2511.06842

Authors: Lionel Levine, Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh

Abstract: We introduce MI-to-Mid Distilled Compression (M2M-DC), a two-scale, shape-safe compression framework that interleaves information-guided block pruning with progressive inner slicing and staged knowledge distillation (KD). First, M2M-DC ranks residual (or inverted-residual) blocks by a label-aware mutual information (MI) signal and removes the least informative units (structured prune-after-training). It then alternates short KD phases with stage-coherent, residual-safe channel slicing: (i) stage "planes" (co-slicing conv2 out-channels with the downsample path and next-stage inputs), and (ii) an optional mid-channel trim (conv1 out / bn1 / conv2 in). This targets complementary redundancy, whole computational motifs and within-stage width while preserving residual shape invariants. On CIFAR-100, M2M-DC yields a clean accuracy-compute frontier. For ResNet-18, we obtain 85.46% Top-1 with 3.09M parameters and 0.0139 GMacs (72% params, 63% GMacs vs. teacher; mean final 85.29% over three seeds). For ResNet-34, we reach 85.02% Top-1 with 5.46M params and 0.0195 GMacs (74% / 74% vs. teacher; mean final 84.62%). Extending to inverted-residuals, MobileNetV2 achieves a mean final 68.54% Top-1 at 1.71M params (27%) and 0.0186 conv GMacs (24%), improving over the teacher's 66.03% by +2.5 points across three seeds. Because M2M-DC exposes only a thin, architecture-aware interface (blocks, stages, and down sample/skip wiring), it generalizes across residual CNNs and extends to inverted-residual families with minor legalization rules. The result is a compact, practical recipe for deployment-ready models that match or surpass teacher accuracy at a fraction of the compute.

Comment: Model Compression and Efficiency: structured block pruning guided by mutual information plus progressive channel slicing and KD.

Relevance: 9 Novelty: 7


26. CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems

ArXiv ID: 2511.06265

Authors: Mohammad Helal Uddin, Sai Krishna Ghanta, Liam Seymour, Sabur Baidya

Abstract: Deep learning algorithms are becoming an essential component of many artificial intelligence (AI) driven applications, many of which run on resource-constrained and energy-constrained systems. For efficient deployment of these algorithms, although different techniques for the compression of neural network models are proposed, neural pruning is one of the fastest and effective methods, which can provide a high compression gain with minimal cost. To harness enhanced performance gain with respect to model complexity, we propose a novel neural network pruning approach utilizing Hessian-vector products that approximate crucial curvature information in the loss function, which significantly reduces the computation demands. By employing a power iteration method, our algorithm effectively identifies and preserves the essential information, ensuring a balanced trade-off between model accuracy and computational efficiency. Herein, we introduce CAMP-HiVe, a cyclic pair merging-based pruning with Hessian Vector approximation by iteratively consolidating weight pairs, combining significant and less significant weights, thus effectively streamlining the model while preserving its performance. This dynamic, adaptive framework allows for real-time adjustment of weight significance, ensuring that only the most critical parameters are retained. Our experimental results demonstrate that our proposed method achieves significant reductions in computational requirements while maintaining high performance across different neural network architectures, e.g., ResNet18, ResNet56, and MobileNetv2, on standard benchmark datasets, e.g., CIFAR10, CIFAR-100, and ImageNet, and it outperforms the existing state-of-the-art neural pruning methods.

Comment: Model Compression and Efficiency: pruning via Hessian-vector approximation and cyclic pair merging for resource-constrained deployment.

Relevance: 9 Novelty: 7


27. Learning to Focus: Focal Attention for Selective and Scalable Transformers

ArXiv ID: 2511.06818

Authors: Dhananjay Ram, Wei Xia, Stefano Soatto

Abstract: Attention is a core component of transformer architecture, whether encoder-only, decoder-only, or encoder-decoder model. However, the standard softmax attention often produces noisy probability distribution, which can impair effective feature selection at every layer of these models, particularly for long contexts. We propose Focal Attention, a simple yet effective modification that sharpens the attention distribution by controlling the softmax temperature, either as a fixed hyperparameter or as a learnable parameter during training. This sharpening enables the model to concentrate on the most relevant tokens while suppressing irrelevant ones. Empirically, Focal Attention scales more favorably than standard transformer with respect to model size, training data, and context length. Across diverse benchmarks, it achieves the same accuracy with up to 42% fewer parameters or 33% less training data. On long-context tasks, it delivers substantial relative improvements ranging from 17% to 82%, demonstrating its effectiveness in real world applications.

Comment: Model Architecture/Efficiency: Focal Attention sharpens softmax via temperature control (fixed or learnable), improving scaling and long-context performance.

Relevance: 9 Novelty: 7


28. Understanding the role of depth in the neural tangent kernel for overparameterized neural networks

ArXiv ID: 2511.07272

Authors: William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi

Abstract: Overparameterized fully-connected neural networks have been shown to behave like kernel models when trained with gradient descent, under mild conditions on the width, the learning rate, and the parameter initialization. In the limit of infinitely large widths and small learning rate, the kernel that is obtained allows to represent the output of the learned model with a closed-form solution. This closed-form solution hinges on the invertibility of the limiting kernel, a property that often holds on real-world datasets. In this work, we analyze the sensitivity of large ReLU networks to increasing depths by characterizing the corresponding limiting kernel. Our theoretical results demonstrate that the normalized limiting kernel approaches the matrix of ones. In contrast, they show the corresponding closed-form solution approaches a fixed limit on the sphere. We empirically evaluate the order of magnitude in network depth required to observe this convergent behavior, and we describe the essential properties that enable the generalization of our results to other kernels.

Comment: Matches Representation Learning/training dynamics: analysis of NTK behavior with increasing depth in overparameterized networks.

Relevance: 9 Novelty: 7


29. Minimum Width of Deep Narrow Networks for Universal Approximation

ArXiv ID: 2511.06837

Authors: Xiao-Song Yang, Qi Zhou, Xuan Zhou

Abstract: Determining the minimum width of fully connected neural networks has become a fundamental problem in recent theoretical studies of deep neural networks. In this paper, we study the lower bounds and upper bounds of the minimum width required for fully connected neural networks in order to have universal approximation capability, which is important in network design and training. We show that $w_{min}\leq\max(2d_x+1, d_y)$ for networks with ELU, SELU, and the upper bound of this inequality is attained when $d_y=2d_x$, where $d_x$, $d_y$ denote the input and output dimensions, respectively. Besides, we show that $d_x+1\leq w_{min}\leq d_x+d_y$ for networks with LeakyReLU, ELU, CELU, SELU, Softplus, by proving that ReLU can be approximated by these activation functions. In addition, in the case that the activation function is injective or can be uniformly approximated by a sequence of injective functions (e.g., ReLU), we present a new proof of the inequality $w_{min}\ge d_y+\mathbf{1}_{d_x<d_y\leq2d_x}$ by constructing a more intuitive example via a new geometric approach based on Poincar$\acute{\text{e}}$-Miranda Theorem.

Comment: Model Architecture: theoretical bounds on minimum width for universal approximation in deep narrow networks across activations.

Relevance: 9 Novelty: 7


30. Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

ArXiv ID: 2511.05745

Authors: Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM explanations, their practical adoption faces a fundamental challenge: better interpretability demands that SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation, thereby reducing computation. In a well-designed MoE, each expert should focus on learning a distinct set of features. However, we identify a \textit{critical limitation} in MoE-SAE: Experts often fail to specialize, which means they frequently learn overlapping or identical features. To deal with it, we propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling. Experiments demonstrate a 24\% lower reconstruction error and a 99\% reduction in feature redundancy compared to existing MoE-SAE methods. This work bridges the interpretability-efficiency gap in LLM analysis, allowing transparent model inspection without compromising computational feasibility.

Comment: Model Architecture and Sparsity: multi-expert sparse autoencoder with multiple expert activation and feature scaling to reduce redundancy and improve specialization.

Relevance: 9 Novelty: 7


31. Rank-1 LoRAs Encode Interpretable Reasoning Signals

ArXiv ID: 2511.06739

Authors: Jake Ward, Paul Riechers, Adam Shai

Abstract: Reasoning models leverage inference-time compute to significantly enhance the performance of language models on difficult logical tasks, and have become a dominating paradigm in frontier LLMs. Despite their wide adoption, the mechanisms underpinning the enhanced performance of these reasoning models are not well understood. In this work, we show that the majority of new capabilities in reasoning models can be elicited by small, single-rank changes to base model parameters, with many of these changes being interpretable. Specifically, we use a rank-1 LoRA to create a minimal parameter adapter for Qwen-2.5-32B-Instruct which recovers 73-90% of reasoning-benchmark performance compared to a full parameter finetune. We find that the activations of this LoRA are as interpretable as MLP neurons, and fire for reasoning-specific behaviors. Finally, we train a sparse autoencoder on the entire activation state of this LoRA and identify fine-grained and monosemantic features. Our findings highlight that reasoning performance can arise largely from minimal changes to base model parameters, and explore what these changes affect. More broadly, our work shows that parameter-efficient training methods can be used as a targeted lens for uncovering fundamental insights about language model behavior and dynamics.

Comment: Model Compression and Efficiency: exploits low-rank (rank-1) LoRA adapters; Representation Learning: analyzes interpretable features via sparse autoencoders.

Relevance: 9 Novelty: 7


32. Diversified Flow Matching with Translation Identifiability

ArXiv ID: 2511.05558

Authors: Sagar Shrestha, Xiao Fu

Abstract: Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide the transport trajectory information -- yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces diversified flow matching (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function's velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.

Comment: Generative modeling/representation: ODE-based diversified flow matching with translation identifiability guarantees.

Relevance: 8 Novelty: 8


33. TNT: Improving Chunkwise Training for Test-Time Memorization

ArXiv ID: 2511.07343

Authors: Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni

Abstract: Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies to enable massive context parallelization. Stage two is a brief fine-tuning phase where only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed-up to 17 times faster than the most accurate baseline configuration - while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.

Comment: High Performance Computing: training paradigm enabling massive context parallelization for RNNs via hierarchical memory and chunk decoupling.

Relevance: 8 Novelty: 8


34. Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees

ArXiv ID: 2511.06598

Authors: Mohammad Shirzadi, Ali Safarpoor Dehkordi, Ahad N. Zehmakan

Abstract: Message passing is the core operation in graph neural networks, where each node updates its embeddings by aggregating information from its neighbors. However, in deep architectures, this process often leads to diminished expressiveness. A popular solution is to use residual connections, where the input from the current (or initial) layer is added to aggregated neighbor information to preserve embeddings across layers. Following a recent line of research, we investigate an adaptive residual scheme in which different nodes have varying residual strengths. We prove that this approach prevents oversmoothing; particularly, we show that the Dirichlet energy of the embeddings remains bounded away from zero. This is the first theoretical guarantee not only for the adaptive setting, but also for static residual connections (where residual strengths are shared across nodes) with activation functions. Furthermore, extensive experiments show that this adaptive approach outperforms standard and state-of-the-art message passing mechanisms, especially on heterophilic graphs. To improve the time complexity of our approach, we introduce a variant in which residual strengths are not learned but instead set heuristically, a choice that performs as well as the learnable version.

Comment: Model Architecture: adaptive initial residual connections in GNNs with theoretical guarantees preventing oversmoothing (Dirichlet energy bounded away from zero).

Relevance: 8 Novelty: 8


35. DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning

ArXiv ID: 2511.06477

Authors: Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba

Abstract: Recently, optimizers that explicitly treat weights as matrices, rather than flattened vectors, have demonstrated their effectiveness. This perspective naturally leads to structured approximations of the Fisher matrix as preconditioners, where the matrix view induces a Kronecker-factorized form that enables memory-efficient representation. However, constructing such approximations both efficiently and accurately remains an open challenge, since obtaining the optimal factorization is resource-intensive and practical methods therefore rely on heuristic design choices. In this work, we introduce a novel approach that leverages projector-splitting integrators to construct effective preconditioners. Our optimizer, DyKAF (Dynamical Kronecker Approximation of the Fisher Matrix), consistently improves the Fisher matrix approximation quality. Experiments on large language model pre-training and fine-tuning demonstrate that DyKAF outperforms existing optimizers across a range of evaluation metrics.

Comment: Matches Efficiency/HPC: optimizer with dynamic Kronecker approximation of Fisher for scalable gradient preconditioning.

Relevance: 8 Novelty: 8


36. Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

ArXiv ID: 2511.07372

Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki

Abstract: Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from the Chain-of-Thoughts (CoTs) solving mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.

Comment: Representation Learning/Training Dynamics: provides provable benefits of curriculum post-training and test-time scaling for Transformer tree reasoning, reducing sample complexity.

Relevance: 8 Novelty: 8


37. A Provably-Correct and Robust Convex Model for Smooth Separable NMF

ArXiv ID: 2511.07109

Authors: Junjun Pan, Valentin Leplat, Michael Ng, Nicolas Gillis

Abstract: Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for nonnegative data, with applications such as hyperspectral unmixing and topic modeling. NMF is a difficult problem in general (NP-hard), and its solutions are typically not unique. To address these two issues, additional constraints or assumptions are often used. In particular, separability assumes that the basis vectors in the NMF are equal to some columns of the input matrix. In that case, the problem is referred to as separable NMF (SNMF) and can be solved in polynomial-time with robustness guarantees, while identifying a unique solution. However, in real-world scenarios, due to noise or variability, multiple data points may lie near the basis vectors, which SNMF does not leverage. In this work, we rely on the smooth separability assumption, which assumes that each basis vector is close to multiple data points. We explore the properties of the corresponding problem, referred to as smooth SNMF (SSNMF), and examine how it relates to SNMF and orthogonal NMF. We then propose a convex model for SSNMF and show that it provably recovers the sought-after factors, even in the presence of noise. We finally adapt an existing fast gradient method to solve this convex model for SSNMF, and show that it compares favorably with state-of-the-art methods on both synthetic and hyperspectral datasets.

Comment: Representation Learning: convex, provably-correct model for smooth separable NMF with robustness guarantees.

Relevance: 8 Novelty: 8


38. Physics-Informed Design of Input Convex Neural Networks for Consistency Optimal Transport Flow Matching

ArXiv ID: 2511.06042

Authors: Fanghui Song, Zhongjian Wang, Jiebao Sun

Abstract: We propose a consistency model based on the optimal-transport flow. A physics-informed design of partially input-convex neural networks (PICNN) plays a central role in constructing the flow field that emulates the displacement interpolation. During the training stage, we couple the Hamilton-Jacobi (HJ) residual in the OT formulation with the original flow matching loss function. Our approach avoids inner optimization subproblems that are present in previous one-step OFM approaches. During the prediction stage, our approach supports both one-step (Brenier-map) and multi-step ODE sampling from the same learned potential, leveraging the straightness of the OT flow. We validate scalability and performance on standard OT benchmarks.

Comment: Model Architecture and Efficiency: physics-informed PICNN for OT flow matching with HJ residual; supports one-step and ODE sampling from the same potential.

Relevance: 8 Novelty: 8


39. C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning

ArXiv ID: 2511.07396

Authors: Antonios Valkanas, Soumyasundar Pal, Pavel Rumiantsev, Yingxue Zhang, Mark Coates

Abstract: Large language models (LLMs) have achieved impressive results on complex reasoning tasks, but their high inference cost remains a major barrier to real-world deployment. A promising solution is to use cascaded inference, where small, cheap models handle easy queries, and only the hardest examples are escalated to more powerful models. However, existing cascade methods typically rely on supervised training with labeled data, offer no theoretical generalization guarantees, and provide limited control over test-time computational cost. We introduce C3PO (Cost Controlled Cascaded Prediction Optimization), a self-supervised framework for optimizing LLM cascades under probabilistic cost constraints. By focusing on minimizing regret with respect to the most powerful model (MPM), C3PO avoids the need for labeled data by constructing a cascade using only unlabeled model outputs. It leverages conformal prediction to bound the probability that inference cost exceeds a user-specified budget. We provide theoretical guarantees on both cost control and generalization error, and show that our optimization procedure is effective even with small calibration sets. Empirically, C3PO achieves state-of-the-art performance across a diverse set of reasoning benchmarks including GSM8K, MATH-500, BigBench-Hard and AIME, outperforming strong LLM cascading baselines in both accuracy and cost-efficiency. Our results demonstrate that principled, label-free cascade optimization can enable scalable LLM deployment.

Comment: High Performance Computing/Efficiency: cascaded LLM inference with probabilistic cost constraints and conformal guarantees; self-supervised optimization.

Relevance: 8 Novelty: 8


40. On the Convergence and Stability of Distributed Sub-model Training

ArXiv ID: 2511.06132

Authors: Yuyang Deng, Fuli Qiao, Mehrdad Mahdavi

Abstract: As learning models continue to grow in size, enabling on-device local training of these models has emerged as a critical challenge in federated learning. A popular solution is sub-model training, where the server only distributes randomly sampled sub-models to the edge clients, and clients only update these small models. However, those random sampling of sub-models may not give satisfying convergence performance. In this paper, observing the success of SGD with shuffling, we propose a distributed shuffled sub-model training, where the full model is partitioned into several sub-models in advance, and the server shuffles those sub-models, sends each of them to clients at each round, and by the end of local updating period, clients send back the updated sub-models, and server averages them. We establish the convergence rate of this algorithm. We also study the generalization of distributed sub-model training via stability analysis, and find that the sub-model training can improve the generalization via amplifying the stability of training process. The extensive experiments also validate our theoretical findings.

Comment: HPC/Distributed training: shuffled sub-model training with convergence and stability (generalization) analysis.

Relevance: 8 Novelty: 7


41. How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation

ArXiv ID: 2511.06443

Authors: Zinuo You, Jin Zheng, John Cartlidge

Abstract: Existing graph neural networks typically rely on heuristic choices for hidden dimensions and propagation depths, which often lead to severe information loss during propagation, known as over-squashing. To address this issue, we propose Channel Capacity Constrained Estimation (C3E), a novel framework that formulates the selection of hidden dimensions and depth as a nonlinear programming problem grounded in information theory. Through modeling spectral graph neural networks as communication channels, our approach directly connects channel capacity to hidden dimensions, propagation depth, propagation mechanism, and graph structure. Extensive experiments on nine public datasets demonstrate that hidden dimensions and depths estimated by C3E can mitigate over-squashing and consistently improve representation learning. Experimental results show that over-squashing occurs due to the cumulative compression of information in representation matrices. Furthermore, our findings show that increasing hidden dimensions indeed mitigate information compression, while the role of propagation depth is more nuanced, uncovering a fundamental balance between information compression and representation complexity.

Comment: Representation/Architecture analysis: information-theoretic estimation of GNN width/depth to mitigate over-squashing via channel capacity.

Relevance: 8 Novelty: 7


42. Rep2Text: Decoding Full Text from a Single LLM Token Representation

ArXiv ID: 2511.06571

Authors: Haiyan Zhao, Zirui He, Fan Yang, Ali Payani, Mengnan Du

Abstract: Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we address a fundamental question: to what extent can the original input text be recovered from a single last-token representation within an LLM? We propose Rep2Text, a novel framework for decoding full text from last-token representations. Rep2Text employs a trainable adapter that projects a target model's internal representations into the embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments on various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B) demonstrate that, on average, over half of the information in 16-token sequences can be recovered from this compressed representation while maintaining strong semantic integrity and coherence. Furthermore, our analysis reveals an information bottleneck effect: longer sequences exhibit decreased token-level recovery while preserving strong semantic integrity. Besides, our framework also demonstrates robust generalization to out-of-distribution medical data.

Comment: Representation Learning: decodes inputs from a single last-token representation and analyzes information bottleneck in LLM internals.

Relevance: 8 Novelty: 7


43. From Kernels to Attention: A Transformer Framework for Density and Score Estimation

ArXiv ID: 2511.05924

Authors: Vasily Ilin, Peter Sushko

Abstract: We introduce a unified attention-based framework for joint score and density estimation. Framing the problem as a sequence-to-sequence task, we develop a permutation- and affine-equivariant transformer that estimates both the probability density $f(x)$ and its score $\nabla_x \log f(x)$ directly from i.i.d. samples. Unlike traditional score-matching methods that require training a separate model for each distribution, our approach learns a single distribution-agnostic operator that generalizes across densities and sample sizes. The architecture employs cross-attention to connect observed samples with arbitrary query points, enabling generalization beyond the training data, while built-in symmetry constraints ensure equivariance to permutation and affine transformations. Analytically, we show that the attention weights can recover classical kernel density estimation (KDE), and verify it empirically, establishing a principled link between classical KDE and the transformer architecture. Empirically, the model achieves substantially lower error and better scaling than KDE and score-debiased KDE (SD-KDE), while exhibiting better runtime scaling. Together, these results establish transformers as general-purpose, data-adaptive operators for nonparametric density and score estimation.

Comment: Model Architecture and Representation Learning: transformer with permutation/affine equivariance linking attention to KDE for density/score estimation.

Relevance: 8 Novelty: 7


44. First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

ArXiv ID: 2511.04715

Authors: Dmytro Vitel, Anshuman Chhabra

Abstract: Identifying how training samples influence/impact Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, owing to the large model sizes of today consisting of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we propose theoretical and empirical evidence demonstrating how the cancellation effect is unreliable, and that middle attention layers are better estimators for influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without undertaking model retraining, and propose a new metric known as the Noise Detection Rate (NDR) that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.

Comment: Representation Learning/Training Dynamics: analyzes which layers best estimate data influence and proposes improved cross-layer aggregation.

Relevance: 8 Novelty: 7


45. Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily

ArXiv ID: 2511.06608

Authors: Asela Hevapathige, Asiri Wijesinghe, Ahad N. Zehmakan

Abstract: Graph Neural Networks (GNNs) have achieved significant success in addressing node classification tasks. However, the effectiveness of traditional GNNs degrades on heterophilic graphs, where connected nodes often belong to different labels or properties. While recent work has introduced mechanisms to improve GNN performance under heterophily, certain key limitations still exist. Most existing models apply a fixed aggregation depth across all nodes, overlooking the fact that nodes may require different propagation depths based on their local homophily levels and neighborhood structures. Moreover, many methods are tailored to either homophilic or heterophilic settings, lacking the flexibility to generalize across both regimes. To address these challenges, we develop a theoretical framework that links local structural and label characteristics to information propagation dynamics at the node level. Our analysis shows that optimal aggregation depth varies across nodes and is critical for preserving class-discriminative information. Guided by this insight, we propose a novel adaptive-depth GNN architecture that dynamically selects node-specific aggregation depths using theoretically grounded metrics. Our method seamlessly adapts to both homophilic and heterophilic patterns within a unified model. Extensive experiments demonstrate that our approach consistently enhances the performance of standard GNN backbones across diverse benchmarks.

Comment: Model Architecture: adaptive-depth GNN selecting node-specific aggregation depths to handle varying homophily.

Relevance: 8 Novelty: 7


46. Sampling and Loss Weights in Multi-Domain Training

ArXiv ID: 2511.06913

Authors: Mahdi Salmani, Pratik Worah, Meisam Razaviyayn, Vahab Mirrokni

Abstract: In the training of large deep neural networks, there is a need for vast amounts of training data. To meet this need, data is collected from multiple domains, such as Wikipedia and GitHub. These domains are heterogeneous in both data quality and the diversity of information they provide. This raises the question of how much we should rely on each domain. Several methods have attempted to address this issue by assigning sampling weights to each data domain using heuristics or approximations. As a first step toward a deeper understanding of the role of data mixing, this work revisits the problem by studying two kinds of weights: sampling weights, which control how much each domain contributes in a batch, and loss weights, which scale the loss from each domain during training. Through a rigorous study of linear regression, we show that these two weights play complementary roles. First, they can reduce the variance of gradient estimates in iterative methods such as stochastic gradient descent (SGD). Second, they can improve generalization performance by reducing the generalization gap. We provide both theoretical and empirical support for these claims. We further study the joint dynamics of sampling weights and loss weights, examining how they can be combined to capture both contributions.

Comment: Training Dynamics: theoretical analysis of sampling vs. loss weights in multi-domain training to reduce gradient variance and generalization gap.

Relevance: 8 Novelty: 7


47. Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks

ArXiv ID: 2511.06696

Authors: Dian Jin, Yancheng Yuan, Xiaoming Tao

Abstract: Pretrained equivariant graph neural networks based on spherical harmonics offer efficient and accurate alternatives to computationally expensive ab-initio methods, yet adapting them to new tasks and chemical environments still requires fine-tuning. Conventional parameter-efficient fine-tuning (PEFT) techniques, such as Adapters and LoRA, typically break symmetry, making them incompatible with those equivariant architectures. ELoRA, recently proposed, is the first equivariant PEFT method. It achieves improved parameter efficiency and performance on many benchmarks. However, the relatively high degrees of freedom it retains within each tensor order can still perturb pretrained feature distributions and ultimately degrade performance. To address this, we present Magnitude-Modulated Equivariant Adapter (MMEA), a novel equivariant fine-tuning method which employs lightweight scalar gating to modulate feature magnitudes on a per-order and per-multiplicity basis. We demonstrate that MMEA preserves strict equivariance and, across multiple benchmarks, consistently improves energy and force predictions to state-of-the-art levels while training fewer parameters than competing approaches. These results suggest that, in many practical scenarios, modulating channel magnitudes is sufficient to adapt equivariant models to new chemical environments without breaking symmetry, pointing toward a new paradigm for equivariant PEFT design.

Comment: Model Architecture + Compression/Efficiency: proposes an equivariant PEFT adapter with per-order scalar gating that preserves symmetry for equivariant GNNs.

Relevance: 8 Novelty: 7


48. An Efficient Gradient-Aware Error-Bounded Lossy Compressor for Federated Learning

ArXiv ID: 2511.05770

Authors: Zhijing Ye, Sheng Di, Jiamin Wang, Zhiqing Zhong, Zhaorui Zhang, Xiaodong Yu

Abstract: Federated learning (FL) enables collaborative model training without exposing clients' private data, but its deployment is often constrained by the communication cost of transmitting gradients between clients and the central server, especially under system heterogeneity where low-bandwidth clients bottleneck overall performance. Lossy compression of gradient data can mitigate this overhead, and error-bounded lossy compression (EBLC) is particularly appealing for its fine-grained utility-compression tradeoff. However, existing EBLC methods (e.g., SZ), originally designed for smooth scientific data with strong spatial locality, rely on generic predictors such as Lorenzo and interpolation for entropy reduction to improve compression ratio. Gradient tensors, in contrast, exhibit low smoothness and weak spatial correlation, rendering these predictors ineffective and leading to poor compression ratios. To address this limitation, we propose an EBLC framework tailored for FL gradient data to achieve high compression ratios while preserving model accuracy. The core of it is an innovative prediction mechanism that exploits temporal correlations across FL training rounds and structural regularities within convolutional kernels to reduce residual entropy. The predictor is compatible with standard quantizers and entropy coders and comprises (1) a cross-round magnitude predictor based on a normalized exponential moving average, and (2) a sign predictor that leverages gradient oscillation and kernel-level sign consistency. Experiments show that this new EBLC yields up to 1.53x higher compression ratios than SZ3 with lower accuracy loss. Integrated into a real-world FL framework, APPFL, it reduces end-to-end communication time by 76.1%-96.2% under various constrained-bandwidth scenarios, demonstrating strong scalability for real-world FL deployments.

Comment: Matches Compression/Efficiency for distributed training: gradient-aware error-bounded lossy compression tailored to FL communication.

Relevance: 8 Novelty: 7


49. Mixtures of SubExperts for Large Language Continual Learning

ArXiv ID: 2511.06237

Authors: Haeyong Kang

Abstract: Adapting Large Language Models (LLMs) to a continuous stream of tasks is a critical yet challenging endeavor. While Parameter-Efficient Fine-Tuning (PEFT) methods have become a standard for this, they face a fundamental dilemma in continual learning. Reusing a single set of PEFT parameters for new tasks often leads to catastrophic forgetting of prior knowledge. Conversely, allocating distinct parameters for each task prevents forgetting but results in a linear growth of the model's size and fails to facilitate knowledge transfer between related tasks. To overcome these limitations, we propose a novel adaptive PEFT method referred to as \textit{Mixtures of SubExperts (MoSEs)}, a novel continual learning framework designed for minimal forgetting and efficient scalability. MoSEs integrate a sparse Mixture of SubExperts into the transformer layers, governed by a task-specific routing mechanism. This architecture allows the model to isolate and protect knowledge within dedicated SubExperts, thereby minimizing parameter interference and catastrophic forgetting. Crucially, the router can adaptively select and combine previously learned sparse parameters for new tasks, enabling effective knowledge transfer while ensuring that the model's capacity grows sublinearly. We evaluate MoSEs on the comprehensive TRACE benchmark datasets. Our experiments demonstrate that MoSEs significantly outperform conventional continual learning approaches in both knowledge retention and scalability to new tasks, achieving state-of-the-art performance with substantial memory and computational savings.

Comment: Matches Model Architecture: sparse Mixture-of-SubExperts for PEFT-based continual learning in LLMs.

Relevance: 8 Novelty: 7


50. Non-Negative Stiefel Approximating Flow: Orthogonalish Matrix Optimization for Interpretable Embeddings

ArXiv ID: 2511.06425

Authors: Brian B. Avants (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA), Nicholas J. Tustison (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA), James R Stone (Department of Radiology, Medical Imaging University of Virginia, Charlottesville, VA)

Abstract: Interpretable representation learning is a central challenge in modern machine learning, particularly in high-dimensional settings such as neuroimaging, genomics, and text analysis. Current methods often struggle to balance the competing demands of interpretability and model flexibility, limiting their effectiveness in extracting meaningful insights from complex data. We introduce Non-negative Stiefel Approximating Flow (NSA-Flow), a general-purpose matrix estimation framework that unifies ideas from sparse matrix factorization, orthogonalization, and constrained manifold learning. NSA-Flow enforces structured sparsity through a continuous balance between reconstruction fidelity and column-wise decorrelation, parameterized by a single tunable weight. The method operates as a smooth flow near the Stiefel manifold with proximal updates for non-negativity and adaptive gradient control, yielding representations that are simultaneously sparse, stable, and interpretable. Unlike classical regularization schemes, NSA-Flow provides an intuitive geometric mechanism for manipulating sparsity at the level of global structure while simplifying latent features. We demonstrate that the NSA-Flow objective can be optimized smoothly and integrates seamlessly with existing pipelines for dimensionality reduction while improving interpretability and generalization in both simulated and real biomedical data. Empirical validation on the Golub leukemia dataset and in Alzheimer's disease demonstrate that the NSA-Flow constraints can maintain or improve performance over related methods with little additional methodological effort. NSA-Flow offers a scalable, general-purpose tool for interpretable ML, applicable across data science domains.

Comment: Matches Representation Learning with sparsity: interpretable embeddings via non-negative, near-Stiefel constrained factorization.

Relevance: 8 Novelty: 7


51. Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding

ArXiv ID: 2511.06376

Authors: Qian Ma, Ruoxiang Xu, Yongqiang Cai

Abstract: Numerous studies have demonstrated that the Transformer architecture possesses the capability for in-context learning (ICL). In scenarios involving function approximation, context can serve as a control parameter for the model, endowing it with the universal approximation property (UAP). In practice, context is represented by tokens from a finite set, referred to as a vocabulary, which is the case considered in this paper, \emph{i.e.}, vocabulary in-context learning (VICL). We demonstrate that VICL in single-layer Transformers, without positional encoding, does not possess the UAP; however, it is possible to achieve the UAP when positional encoding is included. Several sufficient conditions for the positional encoding are provided. Our findings reveal the benefits of positional encoding from an approximation theory perspective in the context of ICL.

Comment: Model Architecture/Representation Learning: shows positional encoding enables universal approximation for vocabulary in-context learning in Transformers with theoretical conditions.

Relevance: 8 Novelty: 7


52. Walsh-Hadamard Neural Operators for Solving PDEs with Discontinuous Coefficients

ArXiv ID: 2511.07347

Authors: Giorrgio M. Cavallazzi, Miguel Perex Cuadrado, Alfredo Pinelli

Abstract: Neural operators have emerged as powerful tools for learning solution operators of partial differential equations (PDEs). However, standard spectral methods based on Fourier transforms struggle with problems involving discontinuous coefficients due to the Gibbs phenomenon and poor representation of sharp interfaces. We introduce the Walsh-Hadamard Neural Operator (WHNO), which leverages Walsh-Hadamard transforms-a spectral basis of rectangular wave functions naturally suited for piecewise constant fields-combined with learnable spectral weights that transform low-sequency Walsh coefficients to capture global dependencies efficiently. We validate WHNO on three problems: steady-state Darcy flow (preliminary validation), heat conduction with discontinuous thermal conductivity, and the 2D Burgers equation with discontinuous initial conditions. In controlled comparisons with Fourier Neural Operators (FNO) under identical conditions, WHNO demonstrates superior accuracy with better preservation of sharp solution features at material interfaces. Critically, we discover that weighted ensemble combinations of WHNO and FNO achieve substantial improvements over either model alone: for both heat conduction and Burgers equation, optimal ensembles reduce mean squared error by 35-40 percent and maximum error by up to 25 percent compared to individual models. This demonstrates that Walsh-Hadamard and Fourier representations capture complementary aspects of discontinuous PDE solutions, with WHNO excelling at sharp interfaces while FNO captures smooth features effectively.

Comment: Model Architecture: introduces a Walsh-Hadamard Neural Operator, a new spectral operator architecture complementary to FNOs for PDEs with discontinuities.

Relevance: 8 Novelty: 7


53. Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention

ArXiv ID: 2511.06294

Authors: Wenjie Hu, Sidun Liu, Peng Qiao, Zhenglun Sun, Yong Dou

Abstract: Recent advances in Transformer-based Neural Operators have enabled significant progress in data-driven solvers for Partial Differential Equations (PDEs). Most current research has focused on reducing the quadratic complexity of attention to address the resulting low training and inference efficiency. Among these works, Transolver stands out as a representative method that introduces Physics-Attention to reduce computational costs. Physics-Attention projects grid points into slices for slice attention, then maps them back through deslicing. However, we observe that Physics-Attention can be reformulated as a special case of linear attention, and that the slice attention may even hurt the model performance. Based on these observations, we argue that its effectiveness primarily arises from the slice and deslice operations rather than interactions between slices. Building on this insight, we propose a two-step transformation to redesign Physics-Attention into a canonical linear attention, which we call Linear Attention Neural Operator (LinearNO). Our method achieves state-of-the-art performance on six standard PDE benchmarks, while reducing the number of parameters by an average of 40.0% and computational cost by 36.2%. Additionally, it delivers superior performance on two challenging, industrial-level datasets: AirfRANS and Shape-Net Car.

Comment: Model Architecture and Efficiency: reformulates Physics-Attention as linear attention and proposes a linear-attention neural operator with reduced complexity.

Relevance: 8 Novelty: 7


54. Recursive Dynamics in Fast-Weights Homeostatic Reentry Networks: Toward Reflective Intelligence

ArXiv ID: 2511.06798

Authors: B. G. Chae

Abstract: This study introduces the Fast-Weights Homeostatic Reentry Layer (FH-RL), a neural mechanism that integrates fast-weight associative memory, homeostatic regularization, and learned reentrant feedback to approximate self-referential computation in neural networks. Unlike standard transformer architectures that operate in a purely feedforward manner during inference, FH-RL enables internal recurrence without external looping, allowing prior latent states to be dynamically re-entered into the ongoing computation stream. We conduct controlled experiments sweeping the reentry gain $\gamma$ and evaluate emergent internal dynamics using three novel metrics: the Information Reentry Ratio (IRR), Eigen-Spectrum Recursion Index (ESRI), and Representational Drift Periodicity (RDP). Results show that reentry quantity increases proportionally with~$\gamma$, while the learned feedback matrix $W_r$ remains bounded and becomes more structured at moderate gains. Critically, a stable reflective band emerges around $\gamma \approx 0.10-0.20$, where internal feedback is maximally expressive yet spectrally stable: IRR rises smoothly, ESRI remains near zero, and RDP exhibits consistent low-frequency cycles. These findings provide quantitative evidence that reflective, thought-like internal processing can arise from a principled balance between feedback amplification and homeostatic regulation, linking modern fast-weight architectures to theories of cortical reentry and recursive cognition.

Comment: Model Architecture: fast-weights with homeostatic reentry enabling internal recurrence and controlled reflective dynamics.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.