Previous Day 2026-01-05
Monthly Overview 2026-01
Next Day 2026-01-07

Personalized Daily ArXiv Papers 2026-01-06

[gpt-5] Prompt Completion Total
Token 53208 51365 104573
Cost $0.07 $0.51 $0.58

Total arXiv papers: 820

Total scanned papers: 502

Total relevant papers: 35

Table of contents with paper titles:

  1. Value-guided action planning with JEPA world models Authors: Matthieu Destrade, Oumayma Bounou, Quentin Le Lidec, Jean Ponce, Yann LeCun

  2. A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory Authors: Itay Safran

  3. Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts Authors: Ruofeng Yang, Yongcan Li, Bo Jiang, Cheng Chen, Shuai Li

  4. Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts Authors: Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

  5. Making MoE based LLM inference resilient with Tarragon Authors: Songyu Zhang, Aaron Tam, Myungjin Lee, Shixiong Qi, K. K. Ramakrishnan

  6. LinMU: Multimodal Understanding Made Linear Authors: Hongjie Wang, Niraj K. Jha

  7. T3C: Test-Time Tensor Compression with Consistency Guarantees Authors: Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi

  8. Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction Authors: Haoyu Zhou, Ping Xue, Tianfan Fu, Hao Zhang

  9. Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD Authors: Ze Peng, Jian Zhang, Yisen Wang, Lei Qi, Yinghuan Shi, Yang Gao

  10. Context-Free Recognition with Transformers Authors: Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

  11. Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space Authors: Changhoon Song, Seungchan Ko, Youngjoon Hong

  12. FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness Authors: Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu

  13. Attention Needs to Focus: A Unified Perspective on Attention Allocation Authors: Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

  14. Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale Authors: Shengji Tang, Weihao Lin, Jingqi Ye, Hao Li, Bo Zhang, Shuyue Hu, Tao Chen, Wangli Ouyang, Lei Bai, Peng Ye

  15. SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines Authors: Itai Morad, Nir Shlezinger, Yonina C. Eldar

  16. Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission Authors: Emrah Mete, Emin Erkan Korkmaz

  17. Heterogeneous Low-Bandwidth Pre-Training of LLMs Authors: Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, Eugene Belilovsky

  18. MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance Authors: Hamad Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences), Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences)

  19. Accelerating Decentralized Optimization via Overlapping Local Steps Authors: Yijie Zhou, Shi Pu

  20. Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments Authors: Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

  21. Output Embedding Centering for Stable LLM Pretraining Authors: Felix Stollenwerk, Anna Lokrantz, Niclas Hertzberg

  22. ELLA: Efficient Lifelong Learning for Adapters in Large Language Models Authors: Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, Kaushik Roy

  23. Energy-Aware Routing to Large Reasoning Models Authors: Austin R. Ellis-Mohr, Max Hartman, Lav R. Varshney

  24. Neural Networks on Symmetric Spaces of Noncompact Type Authors: Xuan Son Nguyen, Shuo Yang, Aymeric Histace

  25. Sparse Bayesian Message Passing under Structural Uncertainty Authors: Yoonhyuk Choi, Jiho Choi, Chanran Kim, Yumin Lee, Hawon Shin, Yeowon Jeon, Minjeong Kim, Jiwoo Kang

  26. Bayesian Subspace Gradient Estimation for Zeroth-Order Optimization of Large Language Models Authors: Jian Feng, Zhihong Huang

  27. Intention Collapse: Intention-Level Metrics for Reasoning in Language Models Authors: Patricio Vera

  28. MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs Authors: Xingsheng Chen, Regina Zhang, Bo Gao, Xingwei He, Xiaofeng Liu, Pietro Lio, Kwok-Yan Lam, Siu-Ming Yiu

  29. Towards a Principled Muon under $\mu\mathsf{P}$: Ensuring Spectral Conditions throughout Training Authors: John Zhao

  30. Deep Deterministic Nonlinear ICA via Total Correlation Minimization with Matrix-Based Entropy Functional Authors: Qiang Li, Shujian Yu, Liang Ma, Chen Ma, Jingyu Liu, Tulay Adali, Vince D. Calhoun

  31. Context Collapse: In-Context Learning and Model Collapse Authors: Josef Ott

  32. Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting Authors: Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma

  33. Gradient-Free Approaches is a Key to an Efficient Interaction with Markovian Stochasticity Authors: Boris Prokhorov, Semyon Chebykin, Alexander Gasnikov, Aleksandr Beznosikov

  34. Accelerating Storage-Based Training for Graph Neural Networks Authors: Myung-Hwan Jang, Jeong-Min Park, Yunyong Ko, Sang-Wook Kim

  35. Deep Clustering with Associative Memories Authors: Bishwajit Saha, Dmitry Krotov, Mohammed J. Zaki, Parikshit Ram


1. Value-guided action planning with JEPA world models

ArXiv ID: 2601.00844

Authors: Matthieu Destrade, Oumayma Bounou, Quentin Le Lidec, Jean Ponce, Yann LeCun

Abstract: Building deep learning models that can reason about their environment requires capturing its underlying dynamics. Joint-Embedded Predictive Architectures (JEPA) provide a promising framework to model such dynamics by learning representations and predictors through a self-supervised prediction objective. However, their ability to support effective action planning remains limited. We propose an approach to enhance planning with JEPA world models by shaping their representation space so that the negative goal-conditioned value function for a reaching cost in a given environment is approximated by a distance (or quasi-distance) between state embeddings. We introduce a practical method to enforce this constraint during training and show that it leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.

Comment: Author match


2. A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory

ArXiv ID: 2601.01417

Authors: Itay Safran

Abstract: We consider the problem of exact computation of the maximum function over $d$ real inputs using ReLU neural networks. We prove a depth hierarchy, wherein width $\Omega\big(d^{1+\frac{1}{2^{k-2}-1}}\big)$ is necessary to represent the maximum for any depth $3\le k\le \log_2(\log_2(d))$. This is the first unconditional super-linear lower bound for this fundamental operator at depths $k\ge3$, and it holds even if the depth scales with $d$. Our proof technique is based on a combinatorial argument and associates the non-differentiable ridges of the maximum with cliques in a graph induced by the first hidden layer of the computing network, utilizing Tur\'an's theorem from extremal graph theory to show that a sufficiently narrow network cannot capture the non-linearities of the maximum. This suggests that despite its simple nature, the maximum function possesses an inherent complexity that stems from the geometric structure of its non-differentiable hyperplanes, and provides a novel approach for proving lower bounds for deep neural networks.

Comment: Theoretical Architecture: depth hierarchy lower bounds for computing max with ReLUs via extremal graph theory.

Relevance: 10 Novelty: 9


3. Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts

ArXiv ID: 2601.01475

Authors: Ruofeng Yang, Yongcan Li, Bo Jiang, Cheng Chen, Shuai Li

Abstract: Recently, diffusion models have achieved a great performance with a small dataset of size $n$ and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality $n^{-1/D}$ with the data dimension $D$. Since images are usually a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with Gaussian latent and achieve a $1/\sqrt{n}$ bound. Though this modeling reflects the multi-manifold property, the Gaussian latent can not capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of $K$ linear subspaces, and each subspace admits a mixture of Gaussian latent ($n_k$ modals with dimension $d_k$). With this modeling, the corresponding score function naturally has a mixture of expert (MoE) structure, captures the multi-modal information, and contains nonlinear property. We first conduct real-world experiments to show that the generation results of MoE-latent MoG NN are much better than MoE-latent Gaussian score. Furthermore, MoE-latent MoG NN achieves a comparable performance with MoE-latent Unet with $10 \times$ parameters. These results indicate that the MoLR-MoG modeling is reasonable and suitable for real-world data. After that, based on such MoE-latent MoG score, we provide a $R^4\sqrt{\Sigma_{k=1}^Kn_k}\sqrt{\Sigma_{k=1}^Kn_kd_k}/\sqrt{n}$ estimation error, which escapes the curse of dimensionality by using data structure. Finally, we study the optimization process and prove the convergence guarantee under the MoLR-MoG modeling. Combined with these results, under a setting close to real-world data, this work explains why diffusion models only require a small training sample and enjoy a fast optimization process to achieve a great performance.

Comment: Model Architecture + Representation Learning: diffusion models with MoLR-MoG latent leading to MoE-structured score; provides estimation and convergence guarantees.

Relevance: 10 Novelty: 9


4. Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

ArXiv ID: 2601.02144

Authors: Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

Abstract: Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.

Comment: Model Architecture (MoE): kNN-augmented expert routing using retrieval-based assignment and confidence mixing to improve robustness under shift.

Relevance: 10 Novelty: 8


5. Making MoE based LLM inference resilient with Tarragon

ArXiv ID: 2601.01310

Authors: Songyu Zhang, Aaron Tam, Myungjin Lee, Shixiong Qi, K. K. Ramakrishnan

Abstract: Mixture-of-Experts (MoE) models are increasingly used to serve LLMs at scale, but failures become common as deployment scale grows. Existing systems exhibit poor failure resilience: even a single worker failure triggers a coarse-grained, service-wide restart, discarding accumulated progress and halting the entire inference pipeline during recovery--an approach clearly ill-suited for latency-sensitive, LLM services. We present Tarragon, a resilient MoE inference framework that confines the failures impact to individual workers while allowing the rest of the pipeline to continue making forward progress. Tarragon exploits the natural separation between the attention and expert computation in MoE-based transformers, treating attention workers (AWs) and expert workers (EWs) as distinct failure domains. Tarragon introduces a reconfigurable datapath to mask failures by rerouting requests to healthy workers. On top of this datapath, Tarragon implements a self-healing mechanism that relaxes the tightly synchronized execution of existing MoE frameworks. For stateful AWs, Tarragon performs asynchronous, incremental KV cache checkpointing with per-request restoration, and for stateless EWs, it leverages residual GPU memory to deploy shadow experts. These together keep recovery cost and recomputation overhead extremely low. Our evaluation shows that, compared to state-of-the-art MegaScale-Infer, Tarragon reduces failure-induced stalls by 160-213x (from ~64 s down to 0.3-0.4 s) while preserving performance when no failures occur.

Comment: HPC/MoE Systems: resilient MoE inference via reconfigurable datapath, KV-cache checkpointing, and shadow experts for fault tolerance.

Relevance: 10 Novelty: 8


6. LinMU: Multimodal Understanding Made Linear

ArXiv ID: 2601.01322

Authors: Hongjie Wang, Niraj K. Jha

Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the VLM with the M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them using LoRA adapters, while regressing on hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of teacher models, yet reduces Time-To-First-Token (TTFT) by up to 2.7$\times$ and improves token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of the two branches of the M-MATE block. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, thus opening up avenues for long-context VLMs that can deal with high-resolution images and long videos.

Comment: Efficiency/Architecture: replaces quadratic attention with dual-branch linear-complexity module (bidirectional SSM + local window attention) and a 3-stage distillation pipeline for VLMs.

Relevance: 10 Novelty: 8


7. T3C: Test-Time Tensor Compression with Consistency Guarantees

ArXiv ID: 2601.01299

Authors: Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi

Abstract: We present T3C, a train-once, test-time budget-conditioned compression framework that exposes rank and precision as a controllable deployment knob. T3C combines elastic tensor factorization (maintained up to a maximal rank) with rank-tied mixed-precision quantization and a lightweight controller that maps a latency/energy/size budget token to per-layer rank/bit assignments; the policy snaps to hardware-aligned profiles and is monotone in the budget. A fast, layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training, yielding a practical reliability signal with negligible overhead. On ImageNet-1k, T3C shifts the vision Pareto frontier: for ResNet-50 at matched accuracy (\leq 0.5% drop), p50 latency is 1.18ms with a 38MB model, outperforming PTQ-8b (1.44ms, 88MB); for ViT-B/16, T3C reaches 2.30ms p50 with 59MB, improving over strong PTQ/QAT baselines. A single T3C checkpoint therefore provides predictable, certificate-backed accuracy-latency-size trade-offs on demand across devices.

Comment: Compression/Efficiency: test-time controllable low-rank factorization + mixed-precision quantization with a consistency certificate on logit drift.

Relevance: 10 Novelty: 8


8. Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction

ArXiv ID: 2601.02213

Authors: Haoyu Zhou, Ping Xue, Tianfan Fu, Hao Zhang

Abstract: Deploying 3D graph neural networks (GNNs) that are equivariant to 3D rotations (the group SO(3)) on edge devices is challenging due to their high computational cost. This paper addresses the problem by compressing and accelerating an SO(3)-equivariant GNN using low-bit quantization techniques. Specifically, we introduce three innovations for quantized equivariant transformers: (1) a magnitude-direction decoupled quantization scheme that separately quantizes the norm and orientation of equivariant (vector) features, (2) a branch-separated quantization-aware training strategy that treats invariant and equivariant feature channels differently in an attention-based $SO(3)$-GNN, and (3) a robustness-enhancing attention normalization mechanism that stabilizes low-precision attention computations. Experiments on the QM9 and rMD17 molecular benchmarks demonstrate that our 8-bit models achieve accuracy on energy and force predictions comparable to full-precision baselines with markedly improved efficiency. We also conduct ablation studies to quantify the contribution of each component to maintain accuracy and equivariance under quantization, using the Local error of equivariance (LEE) metric. The proposed techniques enable the deployment of symmetry-aware GNNs in practical chemistry applications with 2.37--2.73x faster inference and 4x smaller model size, without sacrificing accuracy or physical symmetry.

Comment: Compression/Efficiency: low-bit quantization for SO(3)-equivariant GNNs with magnitude-direction decoupling and branch-separated QAT.

Relevance: 10 Novelty: 7


9. Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD

ArXiv ID: 2601.01465

Authors: Ze Peng, Jian Zhang, Yisen Wang, Lei Qi, Yinghuan Shi, Yang Gao

Abstract: Information-theoretic (IT) generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent so that one can exploit the properties of data and algorithm to derive tighter bounds. However, we observe that although the flatness bias is crucial for SGD's generalization, these bounds fail to capture the improved generalization under better flatness and are also numerically loose. This is caused by the inadequate leverage of SGD's flatness bias in existing IT bounds. This paper derives a more flatness-leveraging IT bound for the flatness-favoring SGD. The bound indicates the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show our bound not only correctly reflects the better generalization when flatness is improved, but is also numerically much tighter. This is achieved by a flexible technique called "omniscient trajectory". When applied to Gradient Descent's minimax excess risk on convex-Lipschitz-Bounded problems, it improves representative IT bounds' $\Omega(1)$ rates to $O(1/\sqrt{n})$. It also implies a by-pass of memorization-generalization trade-offs.

Comment: Representation Learning Theory: information-theoretic generalization bounds leveraging flatness to tighten SGD generalization and improve rates.

Relevance: 9 Novelty: 8


10. Context-Free Recognition with Transformers

ArXiv ID: 2601.01754

Authors: Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

Abstract: Transformers excel on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs (Merrill et al., 2022). Merrill & Sabharwal (2024) show that $\mathcal{O}(\log n)$ looping layers (w.r.t. input length $n$) allows transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with $\mathcal{O}(\log n)$ looping layers and $\mathcal{O}(n^6)$ padding tokens can recognize all CFLs. However, training and inference with $\mathcal{O}(n^6)$ padding tokens is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(n^3)$ padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: While general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

Comment: Model Architecture Theory: shows looped transformers with O(log n) iterations and padding can recognize CFLs, advancing formal capacity understanding.

Relevance: 9 Novelty: 8


11. Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space

ArXiv ID: 2601.01295

Authors: Changhoon Song, Seungchan Ko, Youngjoon Hong

Abstract: Universal approximation theorems show that neural networks can approximate any continuous function; however, the number of parameters may grow exponentially with the ambient dimension, so these results do not fully explain the practical success of deep models on high-dimensional data. Barron space theory addresses this: if a target function belongs to a Barron space, a two-layer network with $n$ parameters achieves an $O(n^{-1/2})$ approximation error in $L^2$. Yet classical Barron spaces $\mathscr{B}^{s+1}$ still require stronger regularity than Sobolev spaces $H^s$, and existing depth-sensitive results often assume constraints such as $sL \le 1/2$. In this paper, we introduce a log-weighted Barron space $\mathscr{B}^{\log}$, which requires a strictly weaker assumption than $\mathscr{B}^s$ for any $s>0$. For this new function space, we first study embedding properties and carry out a statistical analysis via the Rademacher complexity. Then we prove that functions in $\mathscr{B}^{\log}$ can be approximated by deep ReLU networks with explicit depth dependence. We then define a family $\mathscr{B}^{s,\log}$, establish approximation bounds in the $H^1$ norm, and identify maximal depth scales under which these rates are preserved. Our results clarify how depth reduces regularity requirements for efficient representation, offering a more precise explanation for the performance of deep architectures beyond the classical Barron setting, and for their stable use in high-dimensional problems used today.

Comment: Theoretical Representation Learning: new log-weighted Barron spaces and depth-sensitive ReLU approximation bounds (Sobolev metrics).

Relevance: 9 Novelty: 8


12. FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness

ArXiv ID: 2601.01332

Authors: Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu

Abstract: Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC)-for example through iterative sampling-can allow smaller models to rival or surpass much larger ones at lower overall cost. We introduce TTC-aware training, where an intermediate checkpoint and a corresponding TTC configuration can together match or exceed the accuracy of a fully trained model while requiring substantially fewer training FLOPs. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and TTC configuration to minimize training compute without sacrificing accuracy. To make this practical, we develop an efficient TTC evaluation method that avoids exhaustive search, and we formalize a break-even bound that identifies when increased inference compute compensates for reduced training compute. Experiments demonstrate up to 92\% reductions in training FLOPs while maintaining and sometimes remarkably improving accuracy. These results highlight a new perspective for balancing training and inference compute in model development, enabling faster deployment cycles and more frequent model refreshes. Codes will be publicly released.

Comment: Efficiency: TTC-aware training and early stopping to trade training FLOPs for test-time compute with a theoretical break-even bound.

Relevance: 9 Novelty: 8


13. Attention Needs to Focus: A Unified Perspective on Attention Allocation

ArXiv ID: 2601.00919

Authors: Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

Abstract: The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.

Comment: Model Architecture: proposes Lazy Attention (positional discrimination + Elastic-Softmax) to fix attention collapse/sink and induce sparse attention.

Relevance: 9 Novelty: 8


14. Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale

ArXiv ID: 2601.01330

Authors: Shengji Tang, Weihao Lin, Jingqi Ye, Hao Li, Bo Zhang, Shuyue Hu, Tao Chen, Wangli Ouyang, Lei Bai, Peng Ye

Abstract: Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs' collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks;(3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs' collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro with only 47% costs by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).

Comment: Conditional/Dynamic Networks: large-scale LLM routing and adaptive aggregation framework (mixture-of-models) with task-aware switching.

Relevance: 9 Novelty: 7


15. SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines

ArXiv ID: 2601.01484

Authors: Itai Morad, Nir Shlezinger, Yonina C. Eldar

Abstract: Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: $(i)$ when the teacher provides the exact Bayes Class Probabilities (BCPs); and $(ii)$ supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.

Comment: Compression/Theory: SGD-based KD analysis with Bayesian teachers; shows variance reduction and guidance on BCP noise.

Relevance: 9 Novelty: 7


16. Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission

ArXiv ID: 2601.02253

Authors: Emrah Mete, Emin Erkan Korkmaz

Abstract: The rapid proliferation of Deep Learning is increasingly constrained by its heavy reliance on high-performance hardware, particularly Graphics Processing Units (GPUs). These specialized accelerators are not only prohibitively expensive and energy-intensive but also suffer from significant supply scarcity, limiting the ubiquity of Artificial Intelligence (AI) deployment on edge devices. The core of this inefficiency stems from the standard artificial perceptron's dependence on intensive matrix multiplications. However, biological nervous systems achieve unparalleled efficiency without such arithmetic intensity; synaptic signal transmission is regulated by physical ion channel limits and chemical neurotransmitter levels rather than a process that can be analogous to arithmetic multiplication. Inspired by this biological mechanism, we propose Neuro-Channel Networks (NCN), a novel multiplication-free architecture designed to decouple AI from expensive hardware dependencies. In our model, weights are replaced with Channel Widths that physically limit the signal magnitude, while a secondary parameter acts as a Neurotransmitter to regulate Signal Transmission based on sign logic. The forward pass relies exclusively on addition, subtraction, and bitwise operations (minimum, sign), eliminating floating-point multiplication entirely. In this proof-of-concept study, we demonstrate that NCNs can solve non-linearly separable problems like XOR and the Majority function with 100% accuracy using standard backpropagation, proving their capability to form complex decision boundaries without multiplicative weights. This architecture offers a highly efficient alternative for next-generation neuromorphic hardware, paving the way for running complex models on commodity CPUs or ultra-low-power chips without relying on costly GPU clusters.

Comment: Model Architecture + Efficiency: proposes a multiplication-free network replacing weights with channel-widths and sign-gated transmission to eliminate multiplications.

Relevance: 9 Novelty: 7


17. Heterogeneous Low-Bandwidth Pre-Training of LLMs

ArXiv ID: 2601.02360

Authors: Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, Eugene Belilovsky

Abstract: Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters-especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas-especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.

Comment: HPC + Efficiency: heterogeneous distributed pre-training combining SparseLoCo with activation/activation-gradient compression and subspace pipeline parallelism.

Relevance: 9 Novelty: 7


18. MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance

ArXiv ID: 2601.01260

Authors: Hamad Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences), Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences)

Abstract: The deployment of large language models (LLMs) in real-world clinical applications is constrained by the fundamental trade-off between computational cost and the efficiency of linear-time models. To address this, we propose an LLM-based MambaFormer hybrid Mixture-of-Experts (MoE) framework for efficient medical question-answering (QA) and clinical assistance. The MambaFormer employs a lightweight gating mechanism that performs token-level dynamic routing to a customized Transformer expert (ET5) for short, complex queries or to a State Space Model expert (EMamba) for long, high-throughput sequences. The customized EMamba and ET5 models are tailored to accommodate input sequence dimensionality, embedding structure, sequence length, and target-specific output heads, and are fine-tuned through transfer learning on a new, custom-designed DentalQA dataset. Moreover, intelligent routing decisions are driven by the contextual complexity of token embeddings, normalized sequence length, and domain-aware features, thereby enforcing a Pareto-optimal trade-off between inference latency and prediction accuracy. Furthermore, a novel utility-guided multi-objective loss jointly optimizes decisions, router parameters, routing behavior, expert utilization, and computational cost by adaptively regulating token-level expert activation. Finally, the proposed MambaFormer is cross-validated (holdout) for medical QA on the new, custom-designed DentalQA and PubMedQA datasets and compared with state-of-the-art techniques. The proposed MambaFormer outperforms (BERTScore = 0.9180) with ultra-low latency (0.077 s), delivering a 24.4 speedup over T5-Large and establishing a scalable solution for resource-constrained clinical deployment.

Comment: Model Architecture: hybrid MoE with token-level dynamic routing between Transformer and SSM (Mamba) experts plus utility-guided routing loss for efficiency/accuracy trade-offs.

Relevance: 9 Novelty: 7


19. Accelerating Decentralized Optimization via Overlapping Local Steps

ArXiv ID: 2601.01493

Authors: Yijie Zhou, Shi Pu

Abstract: Decentralized optimization has emerged as a critical paradigm for distributed learning, enabling scalable training while preserving data privacy through peer-to-peer collaboration. However, existing methods often suffer from communication bottlenecks due to frequent synchronization between nodes. We present Overlapping Local Decentralized SGD (OLDSGD), a novel approach to accelerate decentralized training by computation-communication overlapping, significantly reducing network idle time. With a deliberately designed update, OLDSGD preserves the same average update as Local SGD while avoiding communication-induced stalls. Theoretically, we establish non-asymptotic convergence rates for smooth non-convex objectives, showing that OLDSGD retains the same iteration complexity as standard Local Decentralized SGD while improving per-iteration runtime. Empirical results demonstrate OLDSGD's consistent improvements in wall-clock time convergence under different levels of communication delays. With minimal modifications to existing frameworks, OLDSGD offers a practical solution for faster decentralized learning without sacrificing theoretical guarantees.

Comment: HPC/Distributed Training: overlaps computation and communication in decentralized SGD (OLDSGD) with convergence guarantees to reduce wall-clock time.

Relevance: 9 Novelty: 7


20. Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

ArXiv ID: 2601.01075

Authors: Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

Abstract: Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce 'Flow Equivariant World Models', a framework in which both self-motion and external object motion are unified as one-parameter Lie group 'flows'. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures -- particularly when there are predictable world dynamics outside the agent's current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry-guided, embodied intelligence. Project link: https://flowequivariantworldmodels.github.io.

Comment: Model Architecture + Representation Learning: introduces Lie-group flow equivariance to build symmetry-structured world models with stable long-horizon latents.

Relevance: 8 Novelty: 7


21. Output Embedding Centering for Stable LLM Pretraining

ArXiv ID: 2601.02031

Authors: Felix Stollenwerk, Anna Lokrantz, Niclas Hertzberg

Abstract: Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs for large learning rates at the end of training is output logit divergence. The most widely used mitigation strategy, z-loss, merely addresses the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings' geometry and identify its cause. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and prove that it suppresses output logit divergence. OEC can be implemented in two different ways, as a deterministic operation called {\mu}-centering, or a regularization method called {\mu}-loss. Our experiments show that both variants outperform z-loss in terms of training stability and learning rate sensitivity. In particular, they ensure that training converges even for large learning rates when z-loss fails. Furthermore, we find that {\mu}-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.

Comment: Training Dynamics: output embedding centering (μ-centering/μ-loss) to prevent output logit divergence and stabilize transformer pretraining.

Relevance: 8 Novelty: 7


22. ELLA: Efficient Lifelong Learning for Adapters in Large Language Models

ArXiv ID: 2601.02232

Authors: Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, Kaushik Roy

Abstract: Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to $9.6\%$ and a $35\times$ smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model's zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.

Comment: Representation Learning/Efficiency: selective subspace de-correlation via anisotropic shrinkage regularization for continual adapters with constant compute/memory.

Relevance: 8 Novelty: 7


23. Energy-Aware Routing to Large Reasoning Models

ArXiv ID: 2601.00823

Authors: Austin R. Ellis-Mohr, Max Hartman, Lav R. Varshney

Abstract: Large reasoning models (LRMs) have heterogeneous inference energy costs based on which model is used and how much it reasons. To reduce energy, it is important to choose the right LRM and operate it in the right way. As a result, the performance of systems that dispatch tasks to different individual LRMs depend on the balance between mean energy provisioning and stochastic fluctuations. The critical regime is the unique operating point at which neither auxiliary energy nor baseline energy is systematically wasted. Increasing baseline supply shifts the system toward persistent over-supply and baseline-energy waste, while reducing supply induces persistent reliance on auxiliary energy. Yet in this regime, performance remains volatility-limited and so a second-order characterization provides further insights that we develop. Here, performance is governed by how variability is absorbed across time, models, and execution choices. This perspective highlights variance-aware routing and dispatch as a principled design axis, and provides a theoretical basis for developing energy-aware model routing policies. Routing behavior is characterized when dispatch policies are based on training-compute and inference-compute scaling laws for LRMs.

Comment: Efficiency/Systems: variance-aware, energy-aware routing among large reasoning models using compute scaling laws.

Relevance: 8 Novelty: 7


24. Neural Networks on Symmetric Spaces of Noncompact Type

ArXiv ID: 2601.01097

Authors: Xuan Son Nguyen, Shuo Yang, Aymeric Histace

Abstract: Recent works have demonstrated promising performances of neural networks on hyperbolic spaces and symmetric positive definite (SPD) manifolds. These spaces belong to a family of Riemannian manifolds referred to as symmetric spaces of noncompact type. In this paper, we propose a novel approach for developing neural networks on such spaces. Our approach relies on a unified formulation of the distance from a point to a hyperplane on the considered spaces. We show that some existing formulations of the point-to-hyperplane distance can be recovered by our approach under specific settings. Furthermore, we derive a closed-form expression for the point-to-hyperplane distance in higher-rank symmetric spaces of noncompact type equipped with G-invariant Riemannian metrics. The derived distance then serves as a tool to design fully-connected (FC) layers and an attention mechanism for neural networks on the considered spaces. Our approach is validated on challenging benchmarks for image classification, electroencephalogram (EEG) signal classification, image generation, and natural language inference.

Comment: Model Architecture: designs FC layers and attention mechanisms on symmetric spaces (noncompact Riemannian manifolds).

Relevance: 8 Novelty: 7


25. Sparse Bayesian Message Passing under Structural Uncertainty

ArXiv ID: 2601.01207

Authors: Yoonhyuk Choi, Jiho Choi, Chanran Kim, Yumin Lee, Hawon Shin, Yeowon Jeon, Minjeong Kim, Jiwoo Kang

Abstract: Semi-supervised learning on real-world graphs is frequently challenged by heterophily, where the observed graph is unreliable or label-disassortative. Many existing graph neural networks either rely on a fixed adjacency structure or attempt to handle structural noise through regularization. In this work, we explicitly capture structural uncertainty by modeling a posterior distribution over signed adjacency matrices, allowing each edge to be positive, negative, or absent. We propose a sparse signed message passing network that is naturally robust to edge noise and heterophily, which can be interpreted from a Bayesian perspective. By combining (i) posterior marginalization over signed graph structures with (ii) sparse signed message aggregation, our approach offers a principled way to handle both edge noise and heterophily. Experimental results demonstrate that our method outperforms strong baseline models on heterophilic benchmarks under both synthetic and real-world structural noise.

Comment: Sparsity/Bayesian Architecture: posterior over signed adjacency and sparse signed message passing for robust GNNs under heterophily.

Relevance: 8 Novelty: 7


26. Bayesian Subspace Gradient Estimation for Zeroth-Order Optimization of Large Language Models

ArXiv ID: 2601.01452

Authors: Jian Feng, Zhihong Huang

Abstract: Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations, but existing methods rely on one-step gradient estimates from random perturbations. We introduce Bayesian Subspace Zeroth-Order optimization (BSZO), a ZO optimizer that applies Kalman filtering to combine finite-difference information across multiple perturbation directions. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adjust perturbation scales. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/\gamma$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms MeZO, MeZO-Adam, and HiZOO across various tasks, achieving up to 6.67\% absolute average improvement on OPT-13B while keeping memory usage close to inference-only baselines (1.00$\times$--1.08$\times$ of MeZO).

Comment: Efficiency: zeroth-order fine-tuning for LLMs via Bayesian subspace gradient estimation (Kalman filtering) to reduce memory.

Relevance: 8 Novelty: 7


27. Intention Collapse: Intention-Level Metrics for Reasoning in Language Models

ArXiv ID: 2601.01011

Authors: Patricio Vera

Abstract: Every act of language generation compresses a rich internal state into a single token sequence. We call this process intention collapse: a many-to-one projection from a high dimensional intention space I into an external language space L. We formalize intention collapse for contemporary language models, define three simple, model agnostic intention metrics (intention entropy Hint, effective dimensionality dimeff, and latent knowledge recoverability Recov), and propose an empirical agenda for studying how inference time computation shapes internal intentions before they are verbalized. We also report a first small scale experiment. Using a 4 bit Mistral 7B model on 200 GSM8K problems, we compare a direct answer baseline, a chain of thought (CoT) regime, and a babble control. CoT raises accuracy from 5.5 percent to 53 percent, sharply reduces pre collapse intention entropy (from 1.42 to 0.37 bits), and shows higher global effective dimensionality than the other regimes despite producing fewer tokens than babble. At the same time, Hint has little item level predictive power, and a linear probe on I achieves AUROC 0.65 in the CoT regime but only about chance in the baseline regime, where it collapses to the majority class. These preliminary results indicate that intention level metrics can distinguish inference regimes and expose latent information that is partly lost during collapse, while also revealing important limitations of our current proxies

Comment: Representation Learning: proposes intention-level metrics (entropy, effective dimensionality, recoverability) to study inference-time computation and internal representations in LMs.

Relevance: 8 Novelty: 7


28. MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs

ArXiv ID: 2601.00920

Authors: Xingsheng Chen, Regina Zhang, Bo Gao, Xingwei He, Xiaofeng Liu, Pietro Lio, Kwok-Yan Lam, Siu-Ming Yiu

Abstract: Time series prediction plays a pivotal role across diverse domains such as finance, healthcare, energy systems, and environmental modeling. However, existing approaches often struggle to balance efficiency, scalability, and accuracy, particularly when handling long-range dependencies and irregularly sampled data. To address these challenges, we propose MODE, a unified framework that integrates Low-Rank Neural Ordinary Differential Equations (Neural ODEs) with an Enhanced Mamba architecture. As illustrated in our framework, the input sequence is first transformed by a Linear Tokenization Layer and then processed through multiple Mamba Encoder blocks, each equipped with an Enhanced Mamba Layer that employs Causal Convolution, SiLU activation, and a Low-Rank Neural ODE enhancement to efficiently capture temporal dynamics. This low-rank formulation reduces computational overhead while maintaining expressive power. Furthermore, a segmented selective scanning mechanism, inspired by pseudo-ODE dynamics, adaptively focuses on salient subsequences to improve scalability and long-range sequence modeling. Extensive experiments on benchmark datasets demonstrate that MODE surpasses existing baselines in both predictive accuracy and computational efficiency. Overall, our contributions include: (1) a unified and efficient architecture for long-term time series modeling, (2) integration of Mamba's selective scanning with low-rank Neural ODEs for enhanced temporal representation, and (3) substantial improvements in efficiency and scalability enabled by low-rank approximation and dynamic selective scanning.

Comment: Model Architecture/Efficiency: integrates Mamba SSM with low-rank Neural ODEs and segmented selective scanning for long-range time series with reduced complexity.

Relevance: 8 Novelty: 7


29. Towards a Principled Muon under $\mu\mathsf{P}$: Ensuring Spectral Conditions throughout Training

ArXiv ID: 2601.01306

Authors: John Zhao

Abstract: The $\mu$-parameterization ($\mu$P) provides a principled foundation for large language model (LLM) training by prescribing width-independent learning dynamics, which in turn enables predictable scaling behavior and robust hyperparameter transfer across model sizes. A central requirement of $\mu$P is the satisfaction of certain spectral conditions on weight matrices, which ensure consistent feature learning and optimization behavior as model width grows. While these conditions are well understood in theory, guaranteeing their validity in practical training for matrix-based optimizers such as Muon is still under studied. Existing works that study Muon under $\mu$P exhibit important limitations: they either do not ensure that the spectral conditions hold throughout the entire training horizon, or require repeated spectral normalization (or Newton-Schulz iterations) applied to both weights and updates, leading to significant computational overhead and reduced practicality. In this work, we show how to reliably guarantee the spectral conditions required by $\mu$P for Muon during the entire training process. Our key insight is that for moderately large models, maintaining spectral control at the level of optimizer updates alone is sufficient to preserve $\mu$P-compatible scaling, eliminating the need for explicit spectral normalization of the weights. Based on this principle, we develop a variant of Muon, namely Muon++, that satisfies spectral condition throughout the training process. Our results bridge the gap between the theoretical promises of $\mu$P and the practical deployment of matrix-based optimizers in long-horizon training. We also take the first step towards an adaptive spectral condition by incorporating data-dependent effects, making it better suited for long-horizon LLM training.

Comment: Training Dynamics/Optimization: ensures μP spectral conditions throughout training for Muon (Muon++), aligning optimizer updates with μP scaling for large models.

Relevance: 8 Novelty: 7


30. Deep Deterministic Nonlinear ICA via Total Correlation Minimization with Matrix-Based Entropy Functional

ArXiv ID: 2601.00904

Authors: Qiang Li, Shujian Yu, Liang Ma, Chen Ma, Jingyu Liu, Tulay Adali, Vince D. Calhoun

Abstract: Blind source separation, particularly through independent component analysis (ICA), is widely utilized across various signal processing domains for disentangling underlying components from observed mixed signals, owing to its fully data-driven nature that minimizes reliance on prior assumptions. However, conventional ICA methods rely on an assumption of linear mixing, limiting their ability to capture complex nonlinear relationships and to maintain robustness in noisy environments. In this work, we present deep deterministic nonlinear independent component analysis (DDICA), a novel deep neural network-based framework designed to address these limitations. DDICA leverages a matrix-based entropy function to directly optimize the independence criterion via stochastic gradient descent, bypassing the need for variational approximations or adversarial schemes. This results in a streamlined training process and improved resilience to noise. We validated the effectiveness and generalizability of DDICA across a range of applications, including simulated signal mixtures, hyperspectral image unmixing, modeling of primary visual receptive fields, and resting-state functional magnetic resonance imaging (fMRI) data analysis. Experimental results demonstrate that DDICA effectively separates independent components with high accuracy across a range of applications. These findings suggest that DDICA offers a robust and versatile solution for blind source separation in diverse signal processing tasks.

Comment: Representation Learning: deep deterministic nonlinear ICA minimizing total correlation via matrix-based entropy functional; avoids variational/adversarial schemes.

Relevance: 8 Novelty: 7


31. Context Collapse: In-Context Learning and Model Collapse

ArXiv ID: 2601.00923

Authors: Josef Ott

Abstract: This thesis investigates two key phenomena in large language models (LLMs): in-context learning (ICL) and model collapse. We study ICL in a linear transformer with tied weights trained on linear regression tasks, and show that minimising the in-context loss leads to a phase transition in the learned parameters. Above a critical context length, the solution develops a skew-symmetric component. We prove this by reducing the forward pass of the linear transformer under weight tying to preconditioned gradient descent, and then analysing the optimal preconditioner. This preconditioner includes a skew-symmetric component, which induces a rotation of the gradient direction. For model collapse, we use martingale and random walk theory to analyse simplified settings - linear regression and Gaussian fitting - under both replacing and cumulative data regimes. We strengthen existing results by proving almost sure convergence, showing that collapse occurs unless the data grows sufficiently fast or is retained over time. Finally, we introduce the notion of context collapse: a degradation of context during long generations, especially in chain-of-thought reasoning. This concept links the dynamics of ICL with long-term stability challenges in generative models.

Comment: Training Dynamics: analyzes ICL via reduction to preconditioned GD in linear transformers and studies (model/context) collapse with probabilistic guarantees.

Relevance: 8 Novelty: 7


32. Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

ArXiv ID: 2601.02151

Authors: Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma

Abstract: Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model's internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as "Confident Conflicts" tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.

Comment: Training dynamics/representation learning: entropy-gated fine-tuning to mitigate forgetting by suppressing confident-conflict gradients.

Relevance: 8 Novelty: 7


33. Gradient-Free Approaches is a Key to an Efficient Interaction with Markovian Stochasticity

ArXiv ID: 2601.01160

Authors: Boris Prokhorov, Semyon Chebykin, Alexander Gasnikov, Aleksandr Beznosikov

Abstract: This paper deals with stochastic optimization problems involving Markovian noise with a zero-order oracle. We present and analyze a novel derivative-free method for solving such problems in strongly convex smooth and non-smooth settings with both one-point and two-point feedback oracles. Using a randomized batching scheme, we show that when mixing time $\tau$ of the underlying noise sequence is less than the dimension of the problem $d$, the convergence estimates of our method do not depend on $\tau$. This observation provides an efficient way to interact with Markovian stochasticity: instead of invoking the expensive first-order oracle, one should use the zero-order oracle. Finally, we complement our upper bounds with the corresponding lower bounds. This confirms the optimality of our results.

Comment: Optimization/training algorithms: derivative-free method for Markovian noise with mixing-time–independent rates (algorithmic efficiency).

Relevance: 8 Novelty: 7


34. Accelerating Storage-Based Training for Graph Neural Networks

ArXiv ID: 2601.01473

Authors: Myung-Hwan Jang, Jeong-Min Park, Yunyong Ko, Sang-Wook Kim

Abstract: Graph neural networks (GNNs) have achieved breakthroughs in various real-world downstream tasks due to their powerful expressiveness. As the scale of real-world graphs has been continuously growing, \textit{a storage-based approach to GNN training} has been studied, which leverages external storage (e.g., NVMe SSDs) to handle such web-scale graphs on a single machine. Although such storage-based GNN training methods have shown promising potential in large-scale GNN training, we observed that they suffer from a severe bottleneck in data preparation since they overlook a critical challenge: \textit{how to handle a large number of small storage I/Os}. To address the challenge, in this paper, we propose a novel storage-based GNN training framework, named \textsf{AGNES}, that employs a method of \textit{block-wise storage I/O processing} to fully utilize the I/O bandwidth of high-performance storage devices. Moreover, to further enhance the efficiency of each storage I/O, \textsf{AGNES} employs a simple yet effective strategy, \textit{hyperbatch-based processing} based on the characteristics of real-world graphs. Comprehensive experiments on five real-world graphs reveal that \textsf{AGNES} consistently outperforms four state-of-the-art methods, by up to 4.1$\times$ faster than the best competitor. Our code is available at https://github.com/Bigdasgit/agnes-kdd26.

Comment: HPC/Systems: storage-based GNN training with block-wise I/O and hyperbatch processing to utilize NVMe bandwidth for large-scale training.

Relevance: 8 Novelty: 7


35. Deep Clustering with Associative Memories

ArXiv ID: 2601.00963

Authors: Bishwajit Saha, Dmitry Krotov, Mohammed J. Zaki, Parikshit Ram

Abstract: Deep clustering - joint representation learning and latent space clustering - is a well studied problem especially in computer vision and text processing under the deep learning framework. While the representation learning is generally differentiable, clustering is an inherently discrete optimization task, requiring various approximations and regularizations to fit in a standard differentiable pipeline. This leads to a somewhat disjointed representation learning and clustering. In this work, we propose a novel loss function utilizing energy-based dynamics via Associative Memories to formulate a new deep clustering method, DCAM, which ties together the representation learning and clustering aspects more intricately in a single objective. Our experiments showcase the advantage of DCAM, producing improved clustering quality for various architecture choices (convolutional, residual or fully-connected) and data modalities (images or text).

Comment: Representation learning: deep clustering objective using energy-based associative memories coupling representation and clustering.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.