Personalized Daily ArXiv Papers 2026-01-06

[gpt-5]	Prompt	Completion	Total
Token	53208	51365	104573
Cost	$0.07	$0.51	$0.58

Total arXiv papers: 820

Total scanned papers: 502

Total relevant papers: 35

Table of contents with paper titles:

Value-guided action planning with JEPA world models Authors: Matthieu Destrade, Oumayma Bounou, Quentin Le Lidec, Jean Ponce, Yann LeCun
A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory Authors: Itay Safran
Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts Authors: Ruofeng Yang, Yongcan Li, Bo Jiang, Cheng Chen, Shuai Li
Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts Authors: Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang
Making MoE based LLM inference resilient with Tarragon Authors: Songyu Zhang, Aaron Tam, Myungjin Lee, Shixiong Qi, K. K. Ramakrishnan
LinMU: Multimodal Understanding Made Linear Authors: Hongjie Wang, Niraj K. Jha
T3C: Test-Time Tensor Compression with Consistency Guarantees Authors: Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi
Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction Authors: Haoyu Zhou, Ping Xue, Tianfan Fu, Hao Zhang
Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD Authors: Ze Peng, Jian Zhang, Yisen Wang, Lei Qi, Yinghuan Shi, Yang Gao
Context-Free Recognition with Transformers Authors: Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill
Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space Authors: Changhoon Song, Seungchan Ko, Youngjoon Hong
FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness Authors: Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu
Attention Needs to Focus: A Unified Perspective on Attention Allocation Authors: Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao
Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale Authors: Shengji Tang, Weihao Lin, Jingqi Ye, Hao Li, Bo Zhang, Shuyue Hu, Tao Chen, Wangli Ouyang, Lei Bai, Peng Ye
SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines Authors: Itai Morad, Nir Shlezinger, Yonina C. Eldar
Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission Authors: Emrah Mete, Emin Erkan Korkmaz
Heterogeneous Low-Bandwidth Pre-Training of LLMs Authors: Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, Eugene Belilovsky
MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance Authors: Hamad Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences), Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences)
Accelerating Decentralized Optimization via Overlapping Local Steps Authors: Yijie Zhou, Shi Pu
Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments Authors: Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller
Output Embedding Centering for Stable LLM Pretraining Authors: Felix Stollenwerk, Anna Lokrantz, Niclas Hertzberg
ELLA: Efficient Lifelong Learning for Adapters in Large Language Models Authors: Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, Kaushik Roy
Energy-Aware Routing to Large Reasoning Models Authors: Austin R. Ellis-Mohr, Max Hartman, Lav R. Varshney
Neural Networks on Symmetric Spaces of Noncompact Type Authors: Xuan Son Nguyen, Shuo Yang, Aymeric Histace
Sparse Bayesian Message Passing under Structural Uncertainty Authors: Yoonhyuk Choi, Jiho Choi, Chanran Kim, Yumin Lee, Hawon Shin, Yeowon Jeon, Minjeong Kim, Jiwoo Kang
Bayesian Subspace Gradient Estimation for Zeroth-Order Optimization of Large Language Models Authors: Jian Feng, Zhihong Huang
Intention Collapse: Intention-Level Metrics for Reasoning in Language Models Authors: Patricio Vera
MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs Authors: Xingsheng Chen, Regina Zhang, Bo Gao, Xingwei He, Xiaofeng Liu, Pietro Lio, Kwok-Yan Lam, Siu-Ming Yiu
Towards a Principled Muon under $\mu\mathsf{P}$: Ensuring Spectral Conditions throughout Training Authors: John Zhao
Deep Deterministic Nonlinear ICA via Total Correlation Minimization with Matrix-Based Entropy Functional Authors: Qiang Li, Shujian Yu, Liang Ma, Chen Ma, Jingyu Liu, Tulay Adali, Vince D. Calhoun
Context Collapse: In-Context Learning and Model Collapse Authors: Josef Ott
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting Authors: Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma
Gradient-Free Approaches is a Key to an Efficient Interaction with Markovian Stochasticity Authors: Boris Prokhorov, Semyon Chebykin, Alexander Gasnikov, Aleksandr Beznosikov
Accelerating Storage-Based Training for Graph Neural Networks Authors: Myung-Hwan Jang, Jeong-Min Park, Yunyong Ko, Sang-Wook Kim
Deep Clustering with Associative Memories Authors: Bishwajit Saha, Dmitry Krotov, Mohammed J. Zaki, Parikshit Ram

1. Value-guided action planning with JEPA world models

ArXiv ID: 2601.00844

Authors: Matthieu Destrade, Oumayma Bounou, Quentin Le Lidec, Jean Ponce, Yann LeCun

Abstract: Building deep learning models that can reason about their environment requires capturing its underlying dynamics. Joint-Embedded Predictive Architectures (JEPA) provide a promising framework to model such dynamics by learning representations and predictors through a self-supervised prediction objective. However, their ability to support effective action planning remains limited. We propose an approach to enhance planning with JEPA world models by shaping their representation space so that the negative goal-conditioned value function for a reaching cost in a given environment is approximated by a distance (or quasi-distance) between state embeddings. We introduce a practical method to enforce this constraint during training and show that it leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.

Comment: Author match

2. A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory

ArXiv ID: 2601.01417

Authors: Itay Safran

Abstract: We consider the problem of exact computation of the maximum function over $d$ real inputs using ReLU neural networks. We prove a depth hierarchy, wherein width $\Omega\big(d^{1+\frac{1}{2^{k-2}-1}}\big)$ is necessary to represent the maximum for any depth $3\le k\le \log_2(\log_2(d))$. This is the first unconditional super-linear lower bound for this fundamental operator at depths $k\ge3$, and it holds even if the depth scales with $d$. Our proof technique is based on a combinatorial argument and associates the non-differentiable ridges of the maximum with cliques in a graph induced by the first hidden layer of the computing network, utilizing Tur\'an's theorem from extremal graph theory to show that a sufficiently narrow network cannot capture the non-linearities of the maximum. This suggests that despite its simple nature, the maximum function possesses an inherent complexity that stems from the geometric structure of its non-differentiable hyperplanes, and provides a novel approach for proving lower bounds for deep neural networks.

Comment: Theoretical Architecture: depth hierarchy lower bounds for computing max with ReLUs via extremal graph theory.

Relevance: 10 Novelty: 9

ArXiv ID: 2601.01475

Authors: Ruofeng Yang, Yongcan Li, Bo Jiang, Cheng Chen, Shuai Li

Abstract: Recently, diffusion models have achieved a great performance with a small dataset of size $n$ and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality $n^{-1/D}$ with the data dimension $D$. Since images are usually a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with Gaussian latent and achieve a $1/\sqrt{n}$ bound. Though this modeling reflects the multi-manifold property, the Gaussian latent can not capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of $K$ linear subspaces, and each subspace admits a mixture of Gaussian latent ($n_k$ modals with dimension $d_k$). With this modeling, the corresponding score function naturally has a mixture of expert (MoE) structure, captures the multi-modal information, and contains nonlinear property. We first conduct real-world experiments to show that the generation results of MoE-latent MoG NN are much better than MoE-latent Gaussian score. Furthermore, MoE-latent MoG NN achieves a comparable performance with MoE-latent Unet with $10 \times$ parameters. These results indicate that the MoLR-MoG modeling is reasonable and suitable for real-world data. After that, based on such MoE-latent MoG score, we provide a $R^4\sqrt{\Sigma_{k=1}^Kn_k}\sqrt{\Sigma_{k=1}^Kn_kd_k}/\sqrt{n}$ estimation error, which escapes the curse of dimensionality by using data structure. Finally, we study the optimization process and prove the convergence guarantee under the MoLR-MoG modeling. Combined with these results, under a setting close to real-world data, this work explains why diffusion models only require a small training sample and enjoy a fast optimization process to achieve a great performance.

Comment: Model Architecture + Representation Learning: diffusion models with MoLR-MoG latent leading to MoE-structured score; provides estimation and convergence guarantees.

Relevance: 10 Novelty: 9

4. Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

ArXiv ID: 2601.02144

Authors: Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

Abstract: Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.

Comment: Model Architecture (MoE): kNN-augmented expert routing using retrieval-based assignment and confidence mixing to improve robustness under shift.

Relevance: 10 Novelty: 8

5. Making MoE based LLM inference resilient with Tarragon

ArXiv ID: 2601.01310

Authors: Songyu Zhang, Aaron Tam, Myungjin Lee, Shixiong Qi, K. K. Ramakrishnan

Abstract: Mixture-of-Experts (MoE) models are increasingly used to serve LLMs at scale, but failures become common as deployment scale grows. Existing systems exhibit poor failure resilience: even a single worker failure triggers a coarse-grained, service-wide restart, discarding accumulated progress and halting the entire inference pipeline during recovery--an approach clearly ill-suited for latency-sensitive, LLM services. We present Tarragon, a resilient MoE inference framework that confines the failures impact to individual workers while allowing the rest of the pipeline to continue making forward progress. Tarragon exploits the natural separation between the attention and expert computation in MoE-based transformers, treating attention workers (AWs) and expert workers (EWs) as distinct failure domains. Tarragon introduces a reconfigurable datapath to mask failures by rerouting requests to healthy workers. On top of this datapath, Tarragon implements a self-healing mechanism that relaxes the tightly synchronized execution of existing MoE frameworks. For stateful AWs, Tarragon performs asynchronous, incremental KV cache checkpointing with per-request restoration, and for stateless EWs, it leverages residual GPU memory to deploy shadow experts. These together keep recovery cost and recomputation overhead extremely low. Our evaluation shows that, compared to state-of-the-art MegaScale-Infer, Tarragon reduces failure-induced stalls by 160-213x (from ~64 s down to 0.3-0.4 s) while preserving performance when no failures occur.

Comment: HPC/MoE Systems: resilient MoE inference via reconfigurable datapath, KV-cache checkpointing, and shadow experts for fault tolerance.

Relevance: 10 Novelty: 8

6. LinMU: Multimodal Understanding Made Linear

ArXiv ID: 2601.01322

Authors: Hongjie Wang, Niraj K. Jha

Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the VLM with the M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them using LoRA adapters, while regressing on hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of teacher models, yet reduces Time-To-First-Token (TTFT) by up to 2.7$\times$ and improves token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of the two branches of the M-MATE block. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, thus opening up avenues for long-context VLMs that can deal with high-resolution images and long videos.

Comment: Efficiency/Architecture: replaces quadratic attention with dual-branch linear-complexity module (bidirectional SSM + local window attention) and a 3-stage distillation pipeline for VLMs.

Relevance: 10 Novelty: 8

7. T3C: Test-Time Tensor Compression with Consistency Guarantees

ArXiv ID: 2601.01299

Authors: Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi

Abstract: We present T3C, a train-once, test-time budget-conditioned compression framework that exposes rank and precision as a controllable deployment knob. T3C combines elastic tensor factorization (maintained up to a maximal rank) with rank-tied mixed-precision quantization and a lightweight controller that maps a latency/energy/size budget token to per-layer rank/bit assignments; the policy snaps to hardware-aligned profiles and is monotone in the budget. A fast, layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training, yielding a practical reliability signal with negligible overhead. On ImageNet-1k, T3C shifts the vision Pareto frontier: for ResNet-50 at matched accuracy (\leq 0.5% drop), p50 latency is 1.18ms with a 38MB model, outperforming PTQ-8b (1.44ms, 88MB); for ViT-B/16, T3C reaches 2.30ms p50 with 59MB, improving over strong PTQ/QAT baselines. A single T3C checkpoint therefore provides predictable, certificate-backed accuracy-latency-size trade-offs on demand across devices.

Comment: Compression/Efficiency: test-time controllable low-rank factorization + mixed-precision quantization with a consistency certificate on logit drift.

Relevance: 10 Novelty: 8

8. Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction

ArXiv ID: 2601.02213

Authors: Haoyu Zhou, Ping Xue, Tianfan Fu, Hao Zhang

Abstract: Deploying 3D graph neural networks (GNNs) that are equivariant to 3D rotations (the group SO(3)) on edge devices is challenging due to their high computational cost. This paper addresses the problem by compressing and accelerating an SO(3)-equivariant GNN using low-bit quantization techniques. Specifically, we introduce three innovations for quantized equivariant transformers: (1) a magnitude-direction decoupled quantization scheme that separately quantizes the norm and orientation of equivariant (vector) features, (2) a branch-separated quantization-aware training strategy that treats invariant and equivariant feature channels differently in an attention-based $SO(3)$-GNN, and (3) a robustness-enhancing attention normalization mechanism that stabilizes low-precision attention computations. Experiments on the QM9 and rMD17 molecular benchmarks demonstrate that our 8-bit models achieve accuracy on energy and force predictions comparable to full-precision baselines with markedly improved efficiency. We also conduct ablation studies to quantify the contribution of each component to maintain accuracy and equivariance under quantization, using the Local error of equivariance (LEE) metric. The proposed techniques enable the deployment of symmetry-aware GNNs in practical chemistry applications with 2.37--2.73x faster inference and 4x smaller model size, without sacrificing accuracy or physical symmetry.

Comment: Compression/Efficiency: low-bit quantization for SO(3)-equivariant GNNs with magnitude-direction decoupling and branch-separated QAT.

Relevance: 10 Novelty: 7

9. Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD

ArXiv ID: 2601.01465

Authors: Ze Peng, Jian Zhang, Yisen Wang, Lei Qi, Yinghuan Shi, Yang Gao

Abstract: Information-theoretic (IT) generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent so that one can exploit the properties of data and algorithm to derive tighter bounds. However, we observe that although the flatness bias is crucial for SGD's generalization, these bounds fail to capture the improved generalization under better flatness and are also numerically loose. This is caused by the inadequate leverage of SGD's flatness bias in existing IT bounds. This paper derives a more flatness-leveraging IT bound for the flatness-favoring SGD. The bound indicates the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show our bound not only correctly reflects the better generalization when flatness is improved, but is also numerically much tighter. This is achieved by a flexible technique called "omniscient trajectory". When applied to Gradient Descent's minimax excess risk on convex-Lipschitz-Bounded problems, it improves representative IT bounds' $\Omega(1)$ rates to $O(1/\sqrt{n})$. It also implies a by-pass of memorization-generalization trade-offs.

Comment: Representation Learning Theory: information-theoretic generalization bounds leveraging flatness to tighten SGD generalization and improve rates.

Relevance: 9 Novelty: 8

10. Context-Free Recognition with Transformers

ArXiv ID: 2601.01754

Authors: Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

Abstract: Transformers excel on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs (Merrill et al., 2022). Merrill & Sabharwal (2024) show that $\mathcal{O}(\log n)$ looping layers (w.r.t. input length $n$) allows transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with $\mathcal{O}(\log n)$ looping layers and $\mathcal{O}(n^6)$ padding tokens can recognize all CFLs. However, training and inference with $\mathcal{O}(n^6)$ padding tokens is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(n^3)$ padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: While general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

Comment: Model Architecture Theory: shows looped transformers with O(log n) iterations and padding can recognize CFLs, advancing formal capacity understanding.

Relevance: 9 Novelty: 8

11. Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space

ArXiv ID: 2601.01295

Authors: Changhoon Song, Seungchan Ko, Youngjoon Hong

Abstract: Universal approximation theorems show that neural networks can approximate any continuous function; however, the number of parameters may grow exponentially with the ambient dimension, so these results do not fully explain the practical success of deep models on high-dimensional data. Barron space theory addresses this: if a target function belongs to a Barron space, a two-layer network with $n$ parameters achieves an $O(n^{-1/2})$ approximation error in $L^2$. Yet classical Barron spaces $\mathscr{B}^{s+1}$ still require stronger regularity than Sobolev spaces $H^s$, and existing depth-sensitive results often assume constraints such as $sL \le 1/2$. In this paper, we introduce a log-weighted Barron space $\mathscr{B}^{\log}$, which requires a strictly weaker assumption than $\mathscr{B}^s$ for any $s>0$. For this new function space, we first study embedding properties and carry out a statistical analysis via the Rademacher complexity. Then we prove that functions in $\mathscr{B}^{\log}$ can be approximated by deep ReLU networks with explicit depth dependence. We then define a family $\mathscr{B}^{s,\log}$, establish approximation bounds in the $H^1$ norm, and identify maximal depth scales under which these rates are preserved. Our results clarify how depth reduces regularity requirements for efficient representation, offering a more precise explanation for the performance of deep architectures beyond the classical Barron setting, and for their stable use in high-dimensional problems used today.

Comment: Theoretical Representation Learning: new log-weighted Barron spaces and depth-sensitive ReLU approximation bounds (Sobolev metrics).

Relevance: 9 Novelty: 8

12. FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness

ArXiv ID: 2601.01332

Authors: Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu

Abstract: Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC)-for example through iterative sampling-can allow smaller models to rival or surpass much larger ones at lower overall cost. We introduce TTC-aware training, where an intermediate checkpoint and a corresponding TTC configuration can together match or exceed the accuracy of a fully trained model while requiring substantially fewer training FLOPs. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and TTC configuration to minimize training compute without sacrificing accuracy. To make this practical, we develop an efficient TTC evaluation method that avoids exhaustive search, and we formalize a break-even bound that identifies when increased inference compute compensates for reduced training compute. Experiments demonstrate up to 92\% reductions in training FLOPs while maintaining and sometimes remarkably improving accuracy. These results highlight a new perspective for balancing training and inference compute in model development, enabling faster deployment cycles and more frequent model refreshes. Codes will be publicly released.

Comment: Efficiency: TTC-aware training and early stopping to trade training FLOPs for test-time compute with a theoretical break-even bound.

Relevance: 9 Novelty: 8

13. Attention Needs to Focus: A Unified Perspective on Attention Allocation

ArXiv ID: 2601.00919

Authors: Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

Abstract: The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.

Comment: Model Architecture: proposes Lazy Attention (positional discrimination + Elastic-Softmax) to fix attention collapse/sink and induce sparse attention.

Relevance: 9 Novelty: 8

14. Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale

ArXiv ID: 2601.01330

Authors: Shengji Tang, Weihao Lin, Jingqi Ye, Hao Li, Bo Zhang, Shuyue Hu, Tao Chen, Wangli Ouyang, Lei Bai, Peng Ye

Abstract: Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs' collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks;(3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs' collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro with only 47% costs by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).

Comment: Conditional/Dynamic Networks: large-scale LLM routing and adaptive aggregation framework (mixture-of-models) with task-aware switching.

Relevance: 9 Novelty: 7

15. SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines

ArXiv ID: 2601.01484

Authors: Itai Morad, Nir Shlezinger, Yonina C. Eldar

Abstract: Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: $(i)$ when the teacher provides the exact Bayes Class Probabilities (BCPs); and $(ii)$ supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.

Comment: Compression/Theory: SGD-based KD analysis with Bayesian teachers; shows variance reduction and guidance on BCP noise.

Relevance: 9 Novelty: 7

16. Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission

ArXiv ID: 2601.02253

Authors: Emrah Mete, Emin Erkan Korkmaz

Abstract: The rapid proliferation of Deep Learning is increasingly constrained by its heavy reliance on high-performance hardware, particularly Graphics Processing Units (GPUs). These specialized accelerators are not only prohibitively expensive and energy-intensive but also suffer from significant supply scarcity, limiting the ubiquity of Artificial Intelligence (AI) deployment on edge devices. The core of this inefficiency stems from the standard artificial perceptron's dependence on intensive matrix multiplications. However, biological nervous systems achieve unparalleled efficiency without such arithmetic intensity; synaptic signal transmission is regulated by physical ion channel limits and chemical neurotransmitter levels rather than a process that can be analogous to arithmetic multiplication. Inspired by this biological mechanism, we propose Neuro-Channel Networks (NCN), a novel multiplication-free architecture designed to decouple AI from expensive hardware dependencies. In our model, weights are replaced with Channel Widths that physically limit the signal magnitude, while a secondary parameter acts as a Neurotransmitter to regulate Signal Transmission based on sign logic. The forward pass relies exclusively on addition, subtraction, and bitwise operations (minimum, sign), eliminating floating-point multiplication entirely. In this proof-of-concept study, we demonstrate that NCNs can solve non-linearly separable problems like XOR and the Majority function with 100% accuracy using standard backpropagation, proving their capability to form complex decision boundaries without multiplicative weights. This architecture offers a highly efficient alternative for next-generation neuromorphic hardware, paving the way for running complex models on commodity CPUs or ultra-low-power chips without relying on costly GPU clusters.

Comment: Model Architecture + Efficiency: proposes a multiplication-free network replacing weights with channel-widths and sign-gated transmission to eliminate multiplications.

Relevance: 9 Novelty: 7

17. Heterogeneous Low-Bandwidth Pre-Training of LLMs

ArXiv ID: 2601.02360

Authors: Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, Eugene Belilovsky

Abstract: Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters-especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas-especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.

Comment: HPC + Efficiency: heterogeneous distributed pre-training combining SparseLoCo with activation/activation-gradient compression and subspace pipeline parallelism.

Relevance: 9 Novelty: 7

18. MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance

ArXiv ID: 2601.01260

Authors: Hamad Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences), Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering, Applied Sciences)

Abstract: The deployment of large language models (LLMs) in real-world clinical applications is constrained by the fundamental trade-off between computational cost and the efficiency of linear-time models. To address this, we propose an LLM-based MambaFormer hybrid Mixture-of-Experts (MoE) framework for efficient medical question-answering (QA) and clinical assistance. The MambaFormer employs a lightweight gating mechanism that performs token-level dynamic routing to a customized Transformer expert (ET5) for short, complex queries or to a State Space Model expert (EMamba) for long, high-throughput sequences. The customized EMamba and ET5 models are tailored to accommodate input sequence dimensionality, embedding structure, sequence length, and target-specific output heads, and are fine-tuned through transfer learning on a new, custom-designed DentalQA dataset. Moreover, intelligent routing decisions are driven by the contextual complexity of token embeddings, normalized sequence length, and domain-aware features, thereby enforcing a Pareto-optimal trade-off between inference latency and prediction accuracy. Furthermore, a novel utility-guided multi-objective loss jointly optimizes decisions, router parameters, routing behavior, expert utilization, and computational cost by adaptively regulating token-level expert activation. Finally, the proposed MambaFormer is cross-validated (holdout) for medical QA on the new, custom-designed DentalQA and PubMedQA datasets and compared with state-of-the-art techniques. The proposed MambaFormer outperforms (BERTScore = 0.9180) with ultra-low latency (0.077 s), delivering a 24.4 speedup over T5-Large and establishing a scalable solution for resource-constrained clinical deployment.

Comment: Model Architecture: hybrid MoE with token-level dynamic routing between Transformer and SSM (Mamba) experts plus utility-guided routing loss for efficiency/accuracy trade-offs.

Relevance: 9 Novelty: 7

19. Accelerating Decentralized Optimization via Overlapping Local Steps

ArXiv ID: 2601.01493

Authors: Yijie Zhou, Shi Pu

Abstract: Decentralized optimization has emerged as a critical paradigm for distributed learning, enabling scalable training while preserving data privacy through peer-to-peer collaboration. However, existing methods often suffer from communication bottlenecks due to frequent synchronization between nodes. We present Overlapping Local Decentralized SGD (OLDSGD), a novel approach to accelerate decentralized training by computation-communication overlapping, significantly reducing network idle time. With a deliberately designed update, OLDSGD preserves the same average update as Local SGD while avoiding communication-induced stalls. Theoretically, we establish non-asymptotic convergence rates for smooth non-convex objectives, showing that OLDSGD retains the same iteration complexity as standard Local Decentralized SGD while improving per-iteration runtime. Empirical results demonstrate OLDSGD's consistent improvements in wall-clock time convergence under different levels of communication delays. With minimal modifications to existing frameworks, OLDSGD offers a practical solution for faster decentralized learning without sacrificing theoretical guarantees.

Comment: HPC/Distributed Training: overlaps computation and communication in decentralized SGD (OLDSGD) with convergence guarantees to reduce wall-clock time.

Relevance: 9 Novelty: 7

20. Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

ArXiv ID: 2601.01075

Authors: Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

Abstract: Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce 'Flow Equivariant World Models', a framework in which both self-motion and external object motion are unified as one-parameter Lie group 'flows'. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures -- particularly when there are predictable world dynamics outside the agent's current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry-guided, embodied intelligence. Project link: https://flowequivariantworldmodels.github.io.

Comment: Model Architecture + Representation Learning: introduces Lie-group flow equivariance to build symmetry-structured world models with stable long-horizon latents.