Personalized Daily ArXiv Papers 2026-01-09

[gpt-5]	Prompt	Completion	Total
Token	58390	49511	107901
Cost	$0.07	$0.5	$0.57

Total arXiv papers: 529

Total scanned papers: 326

Total relevant papers: 29

Table of contents with paper titles:

Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics Authors: Oshri Naparstek
CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters Authors: Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia
ADEPT: Adaptive Dynamic Early-Exit Process for Transformers Authors: Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi
A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems Authors: Qi Wu, Chao Fang, Jiayuan Chen, Ye Lin, Yueqi Zhang, Yichuan Bai, Yuan Du, Li Du
The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models Authors: Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu
When Models Manipulate Manifolds: The Geometry of a Counting Task Authors: Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, Joshua Batson
Robust Reasoning as a Symmetry-Protected Topological Phase Authors: Ilmo Sung
RelayLLM: Efficient Reasoning via Collaborative Decoding Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64 Authors: Bugra Kilictas, Faruk Alpay
Excess Description Length of Learning Generalizable Predictors Authors: Elizabeth Donoway, Hailey Joren, Fabien Roger, Jan Leike
Token-Level LLM Collaboration via FusionRoute Authors: Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning Authors: Feihu Jin, Shipeng Cen, Ying Tan
Paradoxical noise preference in RNNs Authors: Noah Eckstein, Manoj Srinivasan
Controllable LLM Reasoning via Sparse Autoencoder-Based Steering Authors: Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng
Bridging Distance and Spectral Positional Encodings via Anchor-Based Diffusion Geometry Approximation Authors: Zimo Yan, Zheng Xie, Runfan Duan, Chang Liu, Wumei Du
Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models Authors: Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong
Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds Authors: Sifan Yang, Wenhao Yang, Wei Jiang, Lijun Zhang
Discontinuous Galerkin finite element operator network for solving non-smooth PDEs Authors: Kapil Chawla, Youngjoon Hong, Jae Yong Lee, Sanghyun Lee
An Algebraic Representation Theorem for Linear GENEOs in Geometric Machine Learning Authors: Francesco Conti, Patrizio Frosini, Nicola Quercioli
Density Matrix RNN (DM-RNN): A Quantum Information Theoretic Framework for Modeling Musical Context and Polyphony Authors: Joonwon Seo, Mariana Montiel
Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis Authors: Wang Cai, Yilin Wen, Jinchang Hou, Du Su, Guoqiu Wang, Zhonghou Lv, Chenfu Bao, Yunfang Wu
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection Authors: Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models Authors: Brady Steele, Micah Katz
MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training Authors: Irfan Ullah, Young-Koo Lee
Aligned explanations in neural networks Authors: Corentin Lobet, Francesca Chiaromonte
Not All Steps are Informative: On the Linearity of LLMs' RLVR Training Authors: Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao
Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models Authors: Xukai Liu, Ye Liu, Jipeng Zhang, Yanghai Zhang, Kai Zhang, Qi Liu
Layer-wise Positional Bias in Short-Context Language Modeling Authors: Maryam Rahimi, Mahdi Nouri, Yadollah Yaghoobzadeh

1. Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics

ArXiv ID: 2601.04854

Authors: Oshri Naparstek

Abstract: Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that mature over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.

Comment: Model Architecture: continuous-token maturation with delayed discretization for autoregressive generation, enabling stable deterministic decoding.

Relevance: 10 Novelty: 9
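
Sketch: A toy illustration of the "mature, then discretize" loop described in the abstract, not the paper's model. The update rule `toy_update`, the fixed point `TARGET`, the convergence tolerance, and the two-entry `EMBEDDINGS` table are all hypothetical stand-ins; the abstract does not specify the actual dynamics or the convergence test.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mature_token(x, update, embeddings, tol=1e-4, max_steps=200):
    """Evolve a continuous vector until it stops moving, then decode with argmax.

    Returns (token_id, steps_used). The stopping rule (largest coordinate
    change below `tol`) is an assumption made for this sketch.
    """
    steps = 0
    for steps in range(1, max_steps + 1):
        new_x = update(x)
        delta = max(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    # Hard decoding: pick the embedding with the largest inner product.
    scores = [dot(x, e) for e in embeddings]
    return max(range(len(scores)), key=scores.__getitem__), steps


# Hypothetical stand-in for the learned dynamics: a deterministic contraction
# that moves x halfway toward a fixed point on every step.
TARGET = [0.1, 0.9]


def toy_update(x):
    return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, TARGET)]


EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0]]
token_id, steps = mature_token([0.5, 0.5], toy_update, EMBEDDINGS)
print(token_id, steps)
```

The starting vector is equidistant from both embeddings, so the decoded token is determined entirely by where the deterministic dynamics settle rather than by sampling.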

2. CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

ArXiv ID: 2601.04885

Authors: Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia

Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

Comment: Model Architecture: demographic-aware Mixture of Adapters with routing to separate cultural modes and mitigate gradient interference.

Relevance: 10 Novelty: 8

3. ADEPT: Adaptive Dynamic Early-Exit Process for Transformers

ArXiv ID: 2601.03700

Authors: Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi

Abstract: The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting inference earlier, they apply either to only the first token in the generation phase or at the prompt level in the prefill phase. Thus, the Key-Value (KV) cache for skipped layers remains a bottleneck for subsequent token generation, limiting the benefits of early exit. We introduce ADEPT (Adaptive Dynamic Early-exit Process for Transformers), a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. The proposed adaptive token-level early-exit mechanism adjusts computation dynamically based on token complexity, optimizing efficiency without compromising performance. ADEPT further enhances the KV generation procedure by decoupling sequential dependencies in skipped layers, making token-level early exit more practical. Experimental results demonstrate that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.

Comment: Model Efficiency: adaptive token-level early exit in both prefill and generation with KV-cache decoupling for transformers.

Relevance: 10 Novelty: 8
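
Sketch: A minimal illustration of token-level early exit using a generic confidence threshold. The exit criterion (top softmax probability at or above `threshold`) is an assumption; the abstract only says computation adapts to "token complexity" and does not state the rule. The KV-cache decoupling for skipped layers is not modeled here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit(per_layer_logits, threshold=0.9):
    """Return (exit_layer, token_id) for a single token.

    `per_layer_logits[k]` holds the vocabulary logits a hypothetical
    intermediate prediction head would produce after layer k.
    """
    if not per_layer_logits:
        raise ValueError("need logits from at least one layer")
    best = 0
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return layer, best  # confident enough: skip the remaining layers
    # No layer was confident: fall back to the final layer's prediction.
    return len(per_layer_logits) - 1, best


easy_token = [[0.1, 0.2, 0.0], [5.0, 0.0, 0.0], [6.0, 0.0, 0.0]]
hard_token = [[0.1, 0.2, 0.0], [0.5, 0.4, 0.0], [1.0, 0.8, 0.0]]
print(early_exit(easy_token))
print(early_exit(hard_token))
```

In this toy input the easy token exits after the second layer while the hard token runs through all three, which is the per-token compute saving the abstract targets.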

4. A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems

ArXiv ID: 2601.03992

Authors: Qi Wu, Chao Fang, Jiayuan Chen, Ye Lin, Yueqi Zhang, Yichuan Bai, Yuan Du, Li Du

Abstract: Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert parameters across multiple NDP units simultaneously towards edge low-batch scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across NDP units and GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve 2.41x on average and up to 2.56x speedup in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.

Comment: High-Performance Computing/Efficiency: systems-level framework for efficient MoE inference on GPU-NDP systems, combining tensor parallelism, load-balancing-aware scheduling, and dataset-free prefetching.

Relevance: 10 Novelty: 8

5. The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

ArXiv ID: 2601.03425

Authors: Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu

Abstract: Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model's natural optimization path, thereby limiting training efficiency and performance.

Comment: Strongly matches Model Architecture (Mixture-of-Experts analysis uncovering a domain-invariant ‘Standing Committee’; direct MoE focus).

Relevance: 10 Novelty: 8

6. When Models Manipulate Manifolds: The Geometry of a Counting Task

ArXiv ID: 2601.04480

Authors: Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, Joshua Batson

Abstract: Language models can perceive visual properties of text despite receiving only sequences of tokens; we mechanistically investigate how Claude 3.5 Haiku accomplishes one such task: linebreaking in fixed-width text. We find that character counts are represented on low-dimensional curved manifolds discretized by sparse feature families, analogous to biological place cells. Accurate predictions emerge from a sequence of geometric transformations: token lengths are accumulated into character count manifolds, attention heads twist these manifolds to estimate distance to the line boundary, and the decision to break the line is enabled by arranging estimates orthogonally to create a linear decision boundary. We validate our findings through causal interventions and discover visual illusions: character sequences that hijack the counting mechanism. Our work demonstrates the rich sensory processing of early layers, the intricacy of attention algorithms, and the importance of combining feature-based and geometric views of interpretability.

Comment: Representation Learning/Training Dynamics: mechanistic interpretability revealing low-dimensional counting manifolds and attention geometry in transformers.

Relevance: 9 Novelty: 9

7. Robust Reasoning as a Symmetry-Protected Topological Phase

ArXiv ID: 2601.05240

Authors: Ilmo Sung

Abstract: Large language models suffer from "hallucinations": logical inconsistencies induced by semantic noise. We propose that current architectures operate in a "Metric Phase," where causal order is vulnerable to spontaneous symmetry breaking. Here, we identify robust inference as an effective Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding, replacing fragile geometric interpolation with robust topological invariants. Empirically, we demonstrate a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, our Holonomic Network reveals a macroscopic "mass gap," maintaining invariant fidelity below a critical noise threshold. Furthermore, in a variable-binding task on $S_{10}$ ($3.6 \times 10^6$ states) representing symbolic manipulation, we demonstrate holonomic generalization: the topological model maintains perfect fidelity extrapolating $100\times$ beyond training ($L=50 \to 5000$), consistent with a theoretically indefinite causal horizon, whereas Transformers lose logical coherence. Ablation studies indicate this protection emerges strictly from non-Abelian gauge symmetry. This provides strong evidence for a new universality class for logical reasoning, linking causal stability to the topology of the semantic manifold.

Comment: Model Architecture: proposes a Holonomic Network with non-Abelian gauge symmetry, framing robust reasoning as a symmetry-protected topological phase.

Relevance: 9 Novelty: 9

8. RelayLLM: Efficient Reasoning via Collaborative Decoding

ArXiv ID: 2601.05167

Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang

Abstract: The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.

Comment: Model Compression/Efficiency: token-level collaborative decoding with dynamic routing to an LLM to cut compute cost.

Relevance: 9 Novelty: 8
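
Sketch: A minimal decoding loop in which a small model hands off individual tokens to a large model by emitting a special command. The command string `CALL_TOKEN`, the callables `small_step` and `large_step`, and the choice of one large-model token per call are all assumptions for illustration; the abstract does not give the command format or the hand-off length, and the two-stage training is not shown.

```python
CALL_TOKEN = "<call_llm>"  # hypothetical command emitted by the small model
EOS = "<eos>"


def relay_decode(small_step, large_step, prompt, max_tokens=32):
    """Generate tokens with the small model, relaying to the large one on demand.

    `small_step(tokens)` and `large_step(tokens)` each return the next token
    given the sequence so far. Returns (generated_tokens, large_model_calls).
    """
    tokens = list(prompt)
    large_calls = 0
    while len(tokens) - len(prompt) < max_tokens:
        tok = small_step(tokens)
        if tok == CALL_TOKEN:
            # The small model asked for help: take this one token from the LLM.
            tok = large_step(tokens)
            large_calls += 1
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens[len(prompt):], large_calls


def scripted(script):
    """Build a stub model that replays a fixed list of outputs."""
    it = iter(script)
    return lambda tokens: next(it)


small = scripted(["the", CALL_TOKEN, "is", EOS])
large = scripted(["answer"])
generated, calls = relay_decode(small, large, prompt=["Q:"])
print(generated, calls)
```

The ratio `calls / len(generated)` corresponds to the share of tokens produced by the large model, the quantity the abstract reports as 1.07%.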

9. Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64

ArXiv ID: 2601.03324

Authors: Bugra Kilictas, Faruk Alpay

Abstract: The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall," the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.

Comment: High-Performance Computing/Efficiency: systems-level memory layout and SIMD kernel design (virtual tensor core) to overcome memory wall for LLM inference on ARM64.

Relevance: 9 Novelty: 8

10. Excess Description Length of Learning Generalizable Predictors

ArXiv ID: 2601.04728

Authors: Elizabeth Donoway, Hailey Joren, Fabien Roger, Jan Leike

Abstract: Understanding whether fine-tuning elicits latent capabilities or teaches new ones is a fundamental question for language model evaluation and safety. We develop a formal information-theoretic framework for quantifying how much predictive structure fine-tuning extracts from the train dataset and writes into a model's parameters. Our central quantity, Excess Description Length (EDL), is defined via prequential coding and measures the gap between the bits required to encode training labels sequentially using an evolving model (trained online) and the residual encoding cost under the final trained model. We establish that EDL is non-negative in expectation, converges to surplus description length in the infinite-data limit, and provides bounds on expected generalization gain. Through a series of toy models, we clarify common confusions about information in learning: why random labels yield EDL near zero, how a single example can eliminate many bits of uncertainty about the underlying rule(s) that describe the data distribution, why structure learned on rare inputs contributes proportionally little to expected generalization, and how format learning creates early transients distinct from capability acquisition. This framework provides rigorous foundations for the empirical observation that capability elicitation and teaching exhibit qualitatively distinct scaling signatures.

Comment: Matches Representation Learning/Training Dynamics: information-theoretic framework (Excess Description Length) quantifying capability acquisition and generalization.

Relevance: 9 Novelty: 8
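
Sketch: A toy computation of the gap the abstract describes, namely prequential code length under an online-updated model minus the code length under the final model. A Laplace-smoothed Bernoulli estimator stands in for the fine-tuned model and the labels ignore inputs entirely; both are assumptions chosen to keep the example self-contained, and the paper's formal definition may differ in detail.

```python
import math


def prequential_bits(labels):
    """Bits to encode binary labels one by one, updating the model after each."""
    ones = 0
    bits = 0.0
    for seen, y in enumerate(labels):
        p_one = (ones + 1) / (seen + 2)  # Laplace estimate from labels seen so far
        p = p_one if y == 1 else 1.0 - p_one
        bits += -math.log2(p)
        ones += y
    return bits


def final_model_bits(labels):
    """Bits to encode the same labels with the model fitted on all of them."""
    n = len(labels)
    p_one = (sum(labels) + 1) / (n + 2)
    return sum(-math.log2(p_one if y == 1 else 1.0 - p_one) for y in labels)


def excess_description_length(labels):
    return prequential_bits(labels) - final_model_bits(labels)


structured = [1] * 8
print(prequential_bits(structured))
print(final_model_bits(structured))
print(excess_description_length(structured))
```

For the all-ones sequence the prequential cost telescopes to log2(n + 1) bits, so the two code lengths can be checked by hand.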

11. Token-Level LLM Collaboration via FusionRoute

ArXiv ID: 2601.05106

Authors: Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

Comment: Matches Model Architecture/Efficiency: token-level routing with a trainable complementary generator; theoretical limits of expert-only routing (MoE-like).

Relevance: 9 Novelty: 8
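
Sketch: One decoding step of "select an expert, then add a complementary logit," as the abstract describes. The function name, the argmax selection over `router_scores`, and all numeric values are hypothetical; the router's architecture and training are not shown.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_route_step(expert_logits, router_scores, complementary_logits):
    """Pick one expert and refine its next-token distribution by logit addition.

    expert_logits[k]     : vocabulary logits from expert k at this step
    router_scores[k]     : router preference for expert k (assumed: take argmax)
    complementary_logits : router's own logit vector, added to the chosen expert
    """
    chosen = max(range(len(router_scores)), key=router_scores.__getitem__)
    fused = [e + c for e, c in zip(expert_logits[chosen], complementary_logits)]
    return chosen, softmax(fused)


experts = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
chosen, probs = fusion_route_step(
    experts,
    router_scores=[0.2, 0.8],
    complementary_logits=[0.0, 0.0, 3.0],
)
next_token = max(range(len(probs)), key=probs.__getitem__)
print(chosen, next_token)
```

Here neither expert alone would pick token 2, but the added logit makes it the top choice, a small instance of the abstract's point that expert-only routing restricts the reachable policies.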

12. Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

ArXiv ID: 2601.04890

Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid

Abstract: Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both Adam and Muon optimizers, where they yield improvements in downstream evaluations that match the improvement from switching from Adam to Muon.

Comment: Matches Training Dynamics/Architecture: learnable per-matrix/row/column multipliers to free WD-noise equilibrium scale, improving optimization.

Relevance: 9 Novelty: 8
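
Sketch: How scalar, per-row, and per-column multipliers rescale a weight matrix in the forward pass. Combining all three in a single product is an assumption made for compactness; the abstract introduces the scalar and the row/column variants separately and does not say how they are parameterized or trained.

```python
def effective_weight(W, row_mult, col_mult, scalar=1.0):
    """Return the matrix with entries scalar * row_mult[i] * W[i][j] * col_mult[j]."""
    return [
        [scalar * r * w * c for w, c in zip(row, col_mult)]
        for r, row in zip(row_mult, W)
    ]


W = [[1.0, 2.0], [3.0, 4.0]]
rows = [2.0, 1.0]
cols = [1.0, 0.5]
print(effective_weight(W, rows, cols))

# Multiplying every row multiplier by k and dividing every column multiplier
# by k leaves the effective weight unchanged, so the parameterization is
# redundant along that direction.
k = 4.0
rescaled = effective_weight(W, [r * k for r in rows], [c / k for c in cols])
print(rescaled)
```

Because the multipliers carry the scale separately from W, weight decay acting on W no longer pins the effective scale of each row or column.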

13. Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning

ArXiv ID: 2601.04710

Authors: Feihu Jin, Shipeng Cen, Ying Tan

Abstract: Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method could seamlessly integrate into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.

Comment: Strong match to Model Compression and Efficiency (memory-efficient LLM fine-tuning via prior-informed ZO gradient estimation with theory).

Relevance: 9 Novelty: 8
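
Sketch: The conventional two-point zeroth-order gradient estimator with a Gaussian perturbation, i.e. the baseline the abstract improves on. The paper's prior-informed guiding vector is not reproduced because the abstract does not specify how it is computed; `zo_gradient` and its parameters are illustrative names.

```python
import random


def zo_gradient(f, x, eps=1e-3, rng=None):
    """Estimate the gradient of f at x from two forward evaluations.

    Draws z ~ N(0, I) and returns (g, z) with
        g = (f(x + eps*z) - f(x - eps*z)) / (2*eps) * z.
    No backpropagation is needed, only calls to f.
    """
    if rng is None:
        rng = random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * zi for xi, zi in zip(x, z)]
    x_minus = [xi - eps * zi for xi, zi in zip(x, z)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * zi for zi in z], z


def sum_of_squares(x):
    return sum(v * v for v in x)


x0 = [1.0, -2.0, 0.5]
g, z = zo_gradient(sum_of_squares, x0)
# For a quadratic, the central difference is exact, so g equals the true
# gradient 2*x0 projected onto the random direction z.
directional = sum(2.0 * xi * zi for xi, zi in zip(x0, z))
print(g)
print([directional * zi for zi in z])
```

The estimate only carries information along the single random direction z, which is the source of the high variance the abstract attributes to random perturbations.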

14. Paradoxical noise preference in RNNs

ArXiv ID: 2601.04539

Authors: Noah Eckstein, Manoj Srinivasan

Abstract: In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biological variability and regularize learning. The expectation is that removing the noise at test time should preserve or improve performance. Contrary to this intuition, we find that continuous-time recurrent neural networks (CTRNNs) often perform best at a nonzero noise level, specifically, the same level used during training. This noise preference typically arises when noise is injected inside the neural activation function; networks trained with noise injected outside the activation function perform best with zero noise. Through analyses of simple function approximation, maze navigation, and single neuron regulator tasks, we show that the phenomenon stems from noise-induced shifts of fixed points (stationary distributions) in the underlying stochastic dynamics of the RNNs. These fixed point shifts are noise-level dependent and bias the network outputs when the noise is removed, degrading performance. Analytical and numerical results show that the bias arises when neural states operate near activation function nonlinearities, where noise is asymmetrically attenuated, and that performance optimization incentivizes operation near these nonlinearities. Thus, networks can overfit to the stochastic training environment itself rather than just to the input-output data. The phenomenon is distinct from stochastic resonance, wherein nonzero noise enhances signal processing. Our findings reveal that training noise can become an integral part of the computation learned by recurrent networks, with implications for understanding neural population dynamics and for the design of robust artificial RNNs.

Comment: Matches Training Dynamics: reveals noise-level-dependent fixed-point shifts in CTRNNs and noise as integral to computation.

Relevance: 9 Novelty: 7
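
Sketch: A one-unit illustration of why noise placed inside a saturating activation shifts the mean output while noise added outside does not. It uses a static tanh unit and symmetric two-point noise (plus or minus `sigma`) instead of a CTRNN with Gaussian noise; both simplifications are assumptions made so the effect can be computed exactly.

```python
import math


def mean_output_inside(a, sigma):
    """Mean of tanh(a + xi) when xi is +sigma or -sigma with equal probability."""
    return 0.5 * (math.tanh(a + sigma) + math.tanh(a - sigma))


def mean_output_outside(a, sigma):
    """Mean of tanh(a) + xi for the same noise; the zero-mean noise cancels."""
    return 0.5 * ((math.tanh(a) + sigma) + (math.tanh(a) - sigma))


a, sigma = 1.0, 0.5
print(math.tanh(a))
print(mean_output_inside(a, sigma))
print(mean_output_outside(a, sigma))
```

With noise inside the nonlinearity the average output sits below the noise-free value, so a network tuned under that noise is biased once the noise is removed, in line with the fixed-point shift the abstract describes.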

15. Controllable LLM Reasoning via Sparse Autoencoder-Based Steering

ArXiv ID: 2601.03595

Authors: Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng

Abstract: Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g., backtracking, cross-verification) during the reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are autonomously selected by LRMs themselves. However, such autonomous selection often produces inefficient or even erroneous reasoning paths. To make reasoning more reliable and flexible, it is important to develop methods for controlling reasoning strategies. Existing methods struggle to control fine-grained reasoning strategies due to conceptual entanglement in LRMs' hidden states. To address this, we leverage Sparse Autoencoders (SAEs) to decompose strategy-entangled hidden states into a disentangled feature space. To identify the few strategy-specific features from the vast pool of SAE features, we propose SAE-Steering, an efficient two-stage feature identification pipeline. SAE-Steering first recalls features that amplify the logits of strategy-specific keywords, filtering out over 99% of features, and then ranks the remaining features by their control effectiveness. Using the identified strategy-specific features as control vectors, SAE-Steering outperforms existing methods by over 15% in control effectiveness. Furthermore, controlling reasoning strategies can redirect LRMs from erroneous paths to correct ones, achieving a 7% absolute accuracy improvement.

Comment: Strongly matches Representation Learning and Sparsity (Sparse Autoencoders to disentangle and steer reasoning strategies).

Relevance: 9 Novelty: 7
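
Sketch: A toy version of the two-stage "recall, then rank" pipeline followed by a steering step. Scoring recall by the dot product between a feature's decoder direction and a keyword's unembedding vector, the externally supplied `control_score`, and the additive steering rule are all assumptions; the abstract names the stages but not their formulas.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recall_features(decoder_dirs, keyword_unembed, keep):
    """Stage 1: keep the features whose direction most raises the keyword logit."""
    boost = {fid: dot(d, keyword_unembed) for fid, d in decoder_dirs.items()}
    return sorted(boost, key=boost.get, reverse=True)[:keep]


def rank_features(candidates, control_score):
    """Stage 2: order the recalled features by a measured control effectiveness."""
    return sorted(candidates, key=control_score, reverse=True)


def steer(hidden, direction, strength):
    """Add a scaled feature direction to a hidden state."""
    return [h + strength * d for h, d in zip(hidden, direction)]


decoder_dirs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}  # hypothetical SAE features
keyword_unembed = [1.0, 0.2]  # hypothetical unembedding row for a strategy keyword
measured = {0: 0.3, 1: 0.9}   # hypothetical effectiveness scores

candidates = recall_features(decoder_dirs, keyword_unembed, keep=2)
ranked = rank_features(candidates, measured.get)
steered = steer([0.5, 0.5], decoder_dirs[ranked[0]], strength=2.0)
print(candidates, ranked, steered)
```

The cheap recall stage discards most features before the more expensive effectiveness measurement is run, which is the motivation the abstract gives for splitting the pipeline in two.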

16. Bridging Distance and Spectral Positional Encodings via Anchor-Based Diffusion Geometry Approximation

ArXiv ID: 2601.04517

Authors: Zimo Yan, Zheng Xie, Runfan Duan, Chang Liu, Wumei Du

Abstract: Molecular graph learning benefits from positional signals that capture both local neighborhoods and global topology. Two widely used families are spectral encodings derived from Laplacian or diffusion operators and anchor-based distance encodings built from shortest-path information, yet their precise relationship is poorly understood. We interpret distance encodings as a low-rank surrogate of diffusion geometry and derive an explicit trilateration map that reconstructs truncated diffusion coordinates from transformed anchor distances and anchor spectral positions, with pointwise and Frobenius-gap guarantees on random regular graphs. On DrugBank molecular graphs using a shared GNP-based DDI prediction backbone, a distance-driven Nyström scheme closely recovers diffusion geometry, and both Laplacian and distance encodings substantially outperform a no-encoding baseline.

Comment: Representation Learning: connects spectral/diffusion positional encodings to anchor-based distance via low-rank/Nyström approximation with theoretical guarantees.

Relevance: 8 Novelty: 8

17. Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

ArXiv ID: 2601.03969

Authors: Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

Abstract: Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift, where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.

Comment: Model Efficiency: training-time Dynamic Outlier Truncation to suppress redundant reasoning tokens and improve cost–accuracy trade-off.

Relevance: 8 Novelty: 8
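
Illustration (not from the paper): a rough sketch of the idea of truncating only the extreme length tail within fully correct rollout groups. The abstract does not state the outlier criterion, so the rule used here (length above mean + z * std of the group) and the names `truncation_limit` and `z` are assumptions.

```python
import numpy as np


def truncation_limit(lengths, correct, z=1.5):
    """Return a token cap for one rollout group, or None if nothing is truncated.

    Assumption: an outlier is a response longer than mean + z * std of the
    group's lengths. The paper's actual criterion is not given in the abstract.
    """
    lengths = np.asarray(lengths, dtype=float)
    if not all(correct):
        # Groups with any incorrect rollout are left untouched.
        return None
    cap = lengths.mean() + z * lengths.std()
    # Only report a cap when some response actually exceeds it.
    return int(cap) if lengths.max() > cap else None


# Four short correct answers and one very long correct answer.
cap = truncation_limit([100, 110, 95, 105, 900], [True] * 5)
```

In this toy group only the 900-token response exceeds the cap; a group containing a wrong answer, or one with uniform lengths, yields `None`.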

18. Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds

ArXiv ID: 2601.04907

Authors: Sifan Yang, Wenhao Yang, Wei Jiang, Lijun Zhang

Abstract: We investigate distributed online convex optimization with compressed communication, where $n$ learners connected by a network collaboratively minimize a sequence of global loss functions using only local information and compressed data from neighbors. Prior work has established regret bounds of $O(\max\{\omega^{-2}\rho^{-4}n^{1/2},\omega^{-4}\rho^{-8}\}n\sqrt{T})$ and $O(\max\{\omega^{-2}\rho^{-4}n^{1/2},\omega^{-4}\rho^{-8}\}n\ln{T})$ for convex and strongly convex functions, respectively, where $\omega\in(0,1]$ is the compression quality factor ($\omega=1$ means no compression) and $\rho<1$ is the spectral gap of the communication matrix. However, these regret bounds suffer from a quadratic or even quartic dependence on $\omega^{-1}$. Moreover, the super-linear dependence on $n$ is also undesirable. To overcome these limitations, we propose a novel algorithm that achieves improved regret bounds of $\tilde{O}(\omega^{-1/2}\rho^{-1}n\sqrt{T})$ and $\tilde{O}(\omega^{-1}\rho^{-2}n\ln{T})$ for convex and strongly convex functions, respectively. The primary idea is to design a two-level blocking update framework incorporating two novel ingredients: an online gossip strategy and an error compensation scheme, which collaborate to achieve a better consensus among learners. Furthermore, we establish the first lower bounds for this problem, justifying the optimality of our results with respect to both $\omega$ and $T$. Additionally, we consider the bandit feedback scenario, and extend our method with the classic gradient estimators to enhance existing regret bounds.

Comment: Matches High-Performance/Distributed Training: improved algorithms and lower bounds for compressed communication in distributed online convex optimization.

Relevance: 8 Novelty: 8
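
Illustration (not from the paper): a worked comparison of the leading factors in the convex-case bounds, for the hypothetical values $\omega = 10^{-2}$, $\rho = 1/2$, $n = 16$. Constants and the logarithmic factors hidden by $O$ and $\tilde{O}$ are ignored, so this only indicates how the dependence on $\omega$ and $\rho$ scales.

```latex
% Prior bound: factor multiplying n\sqrt{T}
\max\{\omega^{-2}\rho^{-4}n^{1/2},\ \omega^{-4}\rho^{-8}\}
  = \max\{10^{4}\cdot 16\cdot 4,\ 10^{8}\cdot 256\}
  = \max\{6.4\times 10^{5},\ 2.56\times 10^{10}\}
  = 2.56\times 10^{10}

% Improved bound: factor multiplying n\sqrt{T}
\omega^{-1/2}\rho^{-1} = 10\cdot 2 = 20
```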

19. Discontinuous Galerkin finite element operator network for solving non-smooth PDEs

ArXiv ID: 2601.03668

Authors: Kapil Chawla, Youngjoon Hong, Jae Yong Lee, Sanghyun Lee

Abstract: We introduce Discontinuous Galerkin Finite Element Operator Network (DG-FEONet), a data-free operator learning framework that combines the strengths of the discontinuous Galerkin (DG) method with neural networks to solve parametric partial differential equations (PDEs) with discontinuous coefficients and non-smooth solutions. Unlike traditional operator learning models such as DeepONet and Fourier Neural Operator, which require large paired datasets and often struggle near sharp features, our approach minimizes the residual of a DG-based weak formulation using the Symmetric Interior Penalty Galerkin (SIPG) scheme. DG-FEONet predicts element-wise solution coefficients via a neural network, enabling data-free training without the need for precomputed input-output pairs. We provide theoretical justification through convergence analysis and validate the model's performance on a series of one- and two-dimensional PDE problems, demonstrating accurate recovery of discontinuities, strong generalization across parameter space, and reliable convergence rates. Our results highlight the potential of combining local discretization schemes with machine learning to achieve robust, singularity-aware operator approximation in challenging PDE settings.

Comment: DG-FEONet: hybrid DG-based neural operator trained via residual minimization; an operator-learning architecture with data-free training and robustness to discontinuities.

Relevance: 8 Novelty: 8

20. An Algebraic Representation Theorem for Linear GENEOs in Geometric Machine Learning

ArXiv ID: 2601.03910

Authors: Francesco Conti, Patrizio Frosini, Nicola Quercioli

Abstract: Geometric and Topological Deep Learning are rapidly growing research areas that enhance machine learning through the use of geometric and topological structures. Within this framework, Group Equivariant Non-Expansive Operators (GENEOs) have emerged as a powerful class of operators for encoding symmetries and designing efficient, interpretable neural architectures. Originally introduced in Topological Data Analysis, GENEOs have since found applications in Deep Learning as tools for constructing equivariant models with reduced parameter complexity. GENEOs provide a unifying framework bridging Geometric and Topological Deep Learning and include the operator computing persistence diagrams as a special case. Their theoretical foundations rely on group actions, equivariance, and compactness properties of operator spaces, grounding them in algebra and geometry while enabling both mathematical rigor and practical relevance. While a previous representation theorem characterized linear GENEOs acting on data of the same type, many real-world applications require operators between heterogeneous data spaces. In this work, we address this limitation by introducing a new representation theorem for linear GENEOs acting between different perception pairs, based on generalized T-permutant measures. Under mild assumptions on the data domains and group actions, our result provides a complete characterization of such operators. We also prove the compactness and convexity of the space of linear GENEOs. We further demonstrate the practical impact of this theory by applying the proposed framework to improve the performance of autoencoders, highlighting the relevance of GENEOs in modern machine learning applications.

Comment: Strongly matches Model Architecture theory (representation theorem for equivariant operators/GENEOs enabling efficient, interpretable architectures).

Relevance: 8 Novelty: 8

21. Density Matrix RNN (DM-RNN): A Quantum Information Theoretic Framework for Modeling Musical Context and Polyphony

ArXiv ID: 2601.04592

Authors: Joonwon Seo, Mariana Montiel

Abstract: Classical Recurrent Neural Networks (RNNs) summarize musical context into a deterministic hidden state vector, imposing an information bottleneck that fails to capture the inherent ambiguity in music. We propose the Density Matrix RNN (DM-RNN), a novel theoretical architecture utilizing the Density Matrix. This allows the model to maintain a statistical ensemble of musical interpretations (a mixed state), capturing both classical probabilities and quantum coherences. We rigorously define the temporal dynamics using Quantum Channels (CPTP maps). Crucially, we detail a parameterization strategy based on the Choi-Jamiolkowski isomorphism, ensuring the learned dynamics remain physically valid (CPTP) by construction. We introduce an analytical framework using Von Neumann Entropy to quantify musical uncertainty and Quantum Mutual Information (QMI) to measure entanglement between voices. The DM-RNN provides a mathematically rigorous framework for modeling complex, ambiguous musical structures.

Comment: Model Architecture: DM-RNN with density-matrix state and CPTP dynamics; rigorous parameterization and information-theoretic analysis of representations.

Relevance: 8 Novelty: 8
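
Illustration (not from the paper): a minimal sketch of the Von Neumann entropy $S(\rho) = -\mathrm{Tr}(\rho \ln \rho)$ that the abstract uses to quantify uncertainty, computed from the eigenvalues of a density matrix. The two 2x2 example states and the function name are hypothetical.

```python
import numpy as np


def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -Tr(rho ln rho), computed from the eigenvalues of rho (in nats)."""
    eigvals = np.linalg.eigvalsh(rho)      # rho is Hermitian, so eigvalsh applies
    eigvals = eigvals[eigvals > eps]       # 0 * ln 0 is taken as 0
    return float(-np.sum(eigvals * np.log(eigvals)))


# A pure state (one definite interpretation) and a maximally mixed state
# (two equally likely interpretations).
pure = np.array([[1.0, 0.0], [0.0, 0.0]])
mixed = np.eye(2) / 2

s_pure = von_neumann_entropy(pure)    # zero entropy: no uncertainty
s_mixed = von_neumann_entropy(mixed)  # ln 2: maximal uncertainty for two outcomes
```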

22. Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis

ArXiv ID: 2601.04262

Authors: Wang Cai, Yilin Wen, Jinchang Hou, Du Su, Guoqiu Wang, Zhonghou Lv, Chenfu Bao, Yunfang Wu

Abstract: Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of "high-conflict" heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.

Comment: Matches Model Architecture and Efficiency: head-level diagnosis with conflict-aware sparse fine-tuning that selectively updates Transformer heads.