Personalized Daily ArXiv Papers 2026-01-12

[gpt-5]	Prompt	Completion	Total
Token	31134	29714	60848
Cost	$0.04	$0.3	$0.34

Total arXiv papers: 391

Total scanned papers: 228

Total relevant papers: 22

Table of contents with paper titles:

FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching Authors: Hongyaoxing Gul, Lijuan Hu, Shuzi Niu, Fangfang Liu
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs Authors: Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, Shen Li
Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding Authors: Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng
mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations Authors: Yongyi Yang, Jianyang Gao
Transformer Is Inherently a Causal Learner Authors: Xinyue Wang, Stephen Wang, Biwei Huang
Do Sparse Autoencoders Identify Reasoning Features in Language Models? Authors: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer Authors: Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin
Bi-Orthogonal Factor Decomposition for Vision Transformers Authors: Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez
Manifold limit for the training of shallow graph convolutional neural networks Authors: Johanna Tengler, Christoph Brune, José A. Iglesias
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning Authors: Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis Authors: Hector Zenil
Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer Authors: Zhi Wang, Zhongbin Wu, Yanni Li, Bing Liu, Guangxi Li, Yuping Wang
DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis Authors: Rui An, Haohao Qu, Wenqi Fan, Xuequn Shang, Qing Li
Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation Authors: Takito Sawada, Akinori Iwata, Masahiro Okuda
Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning Authors: Pingchuan Ma, Qixin Zhang, Shuai Wang, Dacheng Tao
Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts Authors: Wei Zhou, Hong Huang, Ruize Shi, Bang Liu
Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces Authors: Pattarawat Chormai, Ali Hashemi, Klaus-Robert Müller, Grégoire Montavon
DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models Authors: Eliatan Niktab, Hardip Patel
Tracing Moral Foundations in Large Language Models Authors: Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani
Circular Reasoning: Understanding Self-Reinforcing Loops in Large Reasoning Models Authors: Zenghao Duan, Liang Pang, Zihao Wei, Wenbin Duan, Yuxin Tian, Shicheng Xu, Jingcheng Deng, Zhiyi Yin, Xueqi Cheng
Poisson Hyperplane Processes with Rectified Linear Units Authors: Shufei Ge, Shijia Wang, Lloyd Elliott
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang

1. FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching

ArXiv ID: 2601.05684

Authors: Hongyaoxing Gul, Lijuan Hu, Shuzi Niu, Fangfang Liu

Abstract: Traditional post-training quantization (PTQ) is considered an effective approach to reduce model size and accelerate inference of large-scale language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to determine a compromise rank for diverse data and layers in large models, failing to exploit their full potential. Additionally, the current SVD-based low-rank approximation compounds the computational overhead. In this work, we thoroughly analyze the varying effectiveness of low-rank approximation across different layers in representative models. Accordingly, we introduce Flexible Low-Rank Quantization (FLRQ), a novel solution designed to quickly identify the accuracy-optimal ranks and aggregate them to achieve minimal storage combinations. FLRQ comprises two powerful components, Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC). R1-FLR applies the R1-Sketch with Gaussian projection for the fast low-rank approximation, enabling outlier-aware rank extraction for each layer. Meanwhile, BLC aims at minimizing the low-rank quantization error under the scaling and clipping strategy through an iterative method. FLRQ demonstrates strong effectiveness and robustness in comprehensive experiments, achieving state-of-the-art performance in both quantization quality and algorithm efficiency.

Comment: Matches Model Compression and Efficiency: flexible low-rank quantization with sketching and clipping-optimized approximation for LLMs.

Relevance: 10 Novelty: 8
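
The abstract does not spell out how the R1-Sketch works, so the following is only a generic illustration of a rank-1 approximation obtained from a single Gaussian projection (a one-vector randomized range finder). The function names `rank1_sketch` and `frobenius_residual` are hypothetical and are not taken from the paper; outlier handling, rank selection, and clipping are not modeled.

```python
import math
import random


def rank1_sketch(A, rng):
    """Return (u, v) such that the outer product u v^T approximates A.

    Generic randomized rank-1 sketch, NOT the paper's R1-Sketch:
    project A onto one Gaussian direction, normalize, and project back.
    """
    m, n = len(A), len(A[0])
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]                     # Gaussian test vector
    y = [sum(A[i][j] * g[j] for j in range(n)) for i in range(m)]   # y = A g
    norm = math.sqrt(sum(val * val for val in y))
    if norm == 0.0:                                                 # A g vanished, e.g. A is all zeros
        return [0.0] * m, [0.0] * n
    u = [val / norm for val in y]                                   # unit vector in the range of A
    v = [sum(u[i] * A[i][j] for i in range(m)) for j in range(n)]   # v = A^T u
    return u, v


def frobenius_residual(A, u, v):
    """Frobenius norm of A - u v^T."""
    return math.sqrt(sum((A[i][j] - u[i] * v[j]) ** 2
                         for i in range(len(A)) for j in range(len(A[0]))))


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = [1.0, -2.0, 0.5], [3.0, 0.0, -1.0, 2.0]
    A = [[ai * bj for bj in b] for ai in a]   # exactly rank-1 matrix
    u, v = rank1_sketch(A, rng)
    print(frobenius_residual(A, u, v))        # close to zero for a rank-1 input
```

Because u v^T equals the orthogonal projection of A onto the span of u, the residual never exceeds the Frobenius norm of A, and it vanishes (up to rounding) when A is exactly rank 1. Only matrix-vector products are needed, which is the usual reason a sketch is cheaper than a full SVD.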

2. MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs

ArXiv ID: 2601.05296

Authors: Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, Shen Li

Abstract: The pervasive "memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads, driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movement that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materialization, and (ii) co-designed kernels with smart activation checkpointing to reduce the memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.

Comment: High Performance Computing/MoE Systems: co-designed token dispatch, buffer elimination, and activation checkpointing to break the MoE memory wall and accelerate training.

Relevance: 10 Novelty: 8

3. Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding

ArXiv ID: 2601.05724

Authors: Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng

Abstract: Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.

Comment: Matches Efficiency: lossless hierarchical speculative decoding improving verification and acceptance rate without altering distribution.

Relevance: 9 Novelty: 8
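
For context, the token-wise verification that the abstract contrasts with sequence-level methods can be sketched as below. This is the standard speculative-sampling accept/reject rule, not HSD itself; the function name `verify_tokenwise` and the list-based distributions are illustrative assumptions, and the bonus token usually sampled after a fully accepted draft is omitted.

```python
import random


def verify_tokenwise(draft_tokens, q_dists, p_dists, rng):
    """Baseline token-wise verification (standard speculative sampling).

    draft_tokens[i] was sampled from the draft distribution q_dists[i];
    p_dists[i] is the target distribution at the same position.
    Returns the accepted prefix, plus one corrected token if a draft
    token is rejected.
    """
    out = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0);
        # random.choices normalizes the weights internally.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        break  # everything after the first rejection is discarded
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.3, 0.2]   # target distribution (toy values)
    q = [0.2, 0.2, 0.6]   # draft distribution (toy values)
    draft = rng.choices(range(3), weights=q)[0]
    print(verify_tokenwise([draft], [q], [p], rng))
```

The rule is lossless in the sense that each emitted token is distributed exactly according to the target: accepted mass contributes min(p, q) and the residual resampling contributes max(p - q, 0), which sum to p.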

4. mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations

ArXiv ID: 2601.05732

Authors: Yongyi Yang, Jianyang Gao

Abstract: Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn–Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff–von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.

Comment: Model Architecture/Efficiency: reparameterizes hyper-connections to exactly enforce doubly stochastic mixing (via Birkhoff–von Neumann), eliminating Sinkhorn iterations and improving stability/speed.

Relevance: 9 Novelty: 8
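
The construction named in the abstract, a doubly stochastic matrix built as a convex combination of permutation matrices, can be illustrated in a few lines. This is a minimal sketch under two assumptions that are not confirmed by the abstract: the convex weights come from a softmax over free logits, and all n! permutations are enumerated (only practical for a small number of residual streams). The function name `doubly_stochastic` is hypothetical.

```python
import itertools
import math


def doubly_stochastic(logits, n):
    """Build an n x n doubly stochastic matrix from n! unconstrained logits.

    Assumed parameterization: softmax(logits) gives convex weights over
    all permutation matrices of size n (Birkhoff-von Neumann direction:
    any such combination is doubly stochastic).
    """
    perms = list(itertools.permutations(range(n)))
    if len(logits) != len(perms):
        raise ValueError("expected one logit per permutation")
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    M = [[0.0] * n for _ in range(n)]
    for w, perm in zip(weights, perms):
        for row, col in enumerate(perm):         # permutation matrix has a 1 at (row, perm[row])
            M[row][col] += w
    return M


if __name__ == "__main__":
    M = doubly_stochastic([0.3, -1.2, 2.0, 0.0, 0.5, -0.7], 3)
    print([sum(row) for row in M])                               # row sums
    print([sum(M[r][c] for r in range(3)) for c in range(3)])    # column sums
```

Every row and column sums to the total weight, which is 1 up to floating-point rounding, so no iterative normalization is needed to stay on the Birkhoff polytope.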

5. Transformer Is Inherently a Causal Learner

ArXiv ID: 2601.05647

Authors: Xinyue Wang, Stephen Wang, Biwei Huang

Abstract: We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.

Comment: Representation Learning/Causality: shows autoregressive transformers’ gradients recover time-delayed causal graphs, with theory and scalable extraction method.

Relevance: 9 Novelty: 8

6. Do Sparse Autoencoders Identify Reasoning Features in Language Models?

ArXiv ID: 2601.05679

Authors: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi

Abstract: We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). Starting from features selected using standard contrastive activation methods, we introduce a falsification-oriented framework that combines causal token injection experiments and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that identified reasoning features are highly sensitive to token-level interventions. Injecting a small number of feature-associated tokens into non-reasoning text is sufficient to elicit strong activation for 59% to 94% of features, indicating reliance on lexical artifacts. For the remaining features that are not explained by simple token triggers, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields minimal changes or slight degradations in benchmark performance. Together, these results suggest that SAE features identified by contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.

Comment: Representation Learning: falsification-oriented analysis of Sparse Autoencoders, combining causal token injection and LLM-guided tests to assess whether SAE features encode genuine reasoning.

Relevance: 9 Novelty: 8

7. Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

ArXiv ID: 2601.05770

Authors: Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

Abstract: Algorithm extraction aims to synthesize executable programs directly from models trained on specific algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, extending this paradigm to Transformer is hindered by superposition, where entangled features encoded in overlapping directions obstruct the extraction of symbolic expressions. In this work, we propose the Discrete Transformer, an architecture explicitly engineered to bridge the gap between continuous representations and discrete symbolic logic. By enforcing a strict functional disentanglement, which constrains Numerical Attention to information routing and Numerical MLP to element-wise arithmetic, and employing temperature-annealed sampling, our method effectively facilitates the extraction of human-readable programs. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains. Moreover, our analysis of the annealing process shows that the efficient discrete search undergoes a clear phase transition from exploration to exploitation. We further demonstrate that our method enables fine-grained control over synthesized programs by imposing inductive biases. Collectively, these findings establish the Discrete Transformer as a robust framework for demonstration-free algorithm discovery, offering a rigorous pathway toward Transformer interpretability.

Comment: Model Architecture: introduces a Discrete Transformer with enforced functional disentanglement (routing vs arithmetic) and annealed sampling to enable program extraction, boosting interpretability.

Relevance: 9 Novelty: 8

8. Bi-Orthogonal Factor Decomposition for Vision Transformers

ArXiv ID: 2601.05328

Authors: Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez

Abstract: Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena. (i) Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum. (ii) Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization. (iii) DINOv2's superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content. Overall, BFD exposes how tokens interact through attention and which informational factors (positional or semantic) mediate their communication, yielding practical insights into vision transformer mechanisms.

Comment: Representation Learning/Architecture Analysis: introduces Bi-orthogonal Factor Decomposition to disentangle positional vs content factors in attention via ANOVA+SVD, yielding insights into token interactions.

Relevance: 9 Novelty: 8

9. Manifold limit for the training of shallow graph convolutional neural networks

ArXiv ID: 2601.06025

Authors: Johanna Tengler, Christoph Brune, José A. Iglesias

Abstract: We study the discrete-to-continuum consistency of the training of shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled point clouds under a manifold assumption. Graph convolution is defined spectrally via the graph Laplacian, whose low-frequency spectrum approximates that of the Laplace-Beltrami operator of the underlying smooth manifold, and shallow GCNNs of possibly infinite width are linear functionals on the space of measures on the parameter space. From this functional-analytic perspective, graph signals are seen as spatial discretizations of functions on the manifold, which leads to a natural notion of training data consistent across graph resolutions. To enable convergence results, the continuum parameter space is chosen as a weakly compact product of unit balls, with Sobolev regularity imposed on the output weight and bias, but not on the convolutional parameter. The corresponding discrete parameter spaces inherit the corresponding spectral decay, and are additionally restricted by a frequency cutoff adapted to the informative spectral window of the graph Laplacians. Under these assumptions, we prove Γ-convergence of regularized empirical risk minimization functionals and corresponding convergence of their global minimizers, in the sense of weak convergence of the parameter measures and uniform convergence of the functions over compact sets. This provides a formalization of mesh and sample independence for the training of such networks.

Comment: Representation Learning/Training Theory: proves Γ-convergence for training shallow GCNNs under manifold assumptions, formalizing mesh/sample independence.

Relevance: 8 Novelty: 8

10. PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

ArXiv ID: 2601.05593

Authors: Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

Abstract: We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.

Comment: Model Architecture + HPC/Test-time compute: introduces a conditional/message-passing architecture to massively parallelize reasoning and scale test-time compute beyond context limits.

Relevance: 8 Novelty: 8

11. On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis

ArXiv ID: 2601.05280

Authors: Hector Zenil

Abstract: We formalise recursive self-training in Large Language Models (LLMs) and Generative AI as a discrete-time dynamical system and prove that, as training data become increasingly self-generated ($\alpha_t \to 0$), the system undergoes inevitably degenerative dynamics. We derive two fundamental failure modes: (1) Entropy Decay, where finite sampling effects cause a monotonic loss of distributional diversity (mode collapse), and (2) Variance Amplification, where the loss of external grounding causes the model's representation of truth to drift as a random walk, bounded only by the support diameter. We show these behaviours are not contingent on architecture but are consequences of distributional learning on finite samples. We further argue that Reinforcement Learning with imperfect verifiers suffers similar semantic collapse. To overcome these limits, we propose a path involving symbolic regression and program synthesis guided by Algorithmic Probability. The Coding Theorem Method (CTM) allows for identifying generative mechanisms rather than mere correlations, escaping the data-processing inequality that binds standard statistical learning. We conclude that while purely distributional learning leads to model collapse, hybrid neurosymbolic approaches offer a coherent framework for sustained self-improvement.

Comment: Representation Learning/Training dynamics theory: formalizes recursive self-training in LLMs and proves degenerative behaviors (entropy decay, variance amplification), arguing for neurosymbolic synthesis.

Relevance: 8 Novelty: 8
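
The first failure mode in the abstract, loss of diversity from finite sampling when training data are fully self-generated, can be seen in a toy simulation. This is only an illustrative sketch, not the paper's formal model: the "model" is an empirical categorical distribution refit each generation on samples from its predecessor, and the function name `support_sizes` and all parameter values are assumptions. It tracks support size rather than entropy, since support size can never grow along a single run.

```python
import random
from collections import Counter


def support_sizes(num_categories, sample_size, generations, rng):
    """Track how many categories survive recursive self-training.

    Each generation draws `sample_size` samples from the current model and
    refits the model to the empirical frequencies of those samples.
    A category that is missed once has probability zero forever after.
    """
    population = list(range(num_categories))
    weights = [1] * num_categories               # start from a uniform model
    sizes = [len(population)]
    for _ in range(generations):
        samples = rng.choices(population, weights=weights, k=sample_size)
        counts = Counter(samples)
        population = sorted(counts)              # only observed categories remain
        weights = [counts[c] for c in population]
        sizes.append(len(population))
    return sizes


if __name__ == "__main__":
    print(support_sizes(50, 30, 40, random.Random(0)))
```

The printed sequence is non-increasing by construction, which mirrors the mode-collapse argument: without fresh external data there is no mechanism for a lost category to reappear.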

12. Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer

ArXiv ID: 2601.05623

Authors: Zhi Wang, Zhongbin Wu, Yanni Li, Bing Liu, Guangxi Li, Yuping Wang

Abstract: Existing research on continual learning (CL) of a sequence of tasks focuses mainly on dealing with catastrophic forgetting (CF) to balance the learning plasticity of new tasks and the memory stability of old tasks. However, an ideal CL agent should not only be able to overcome CF, but also encourage positive forward and backward knowledge transfer (KT), i.e., using the learned knowledge from previous tasks for the new task learning (namely FKT), and improving the previous tasks' performance with the knowledge of the new task (namely BKT). To this end, this paper first models CL as an optimization problem in which each sequential learning task aims to achieve its optimal performance under the constraint that both FKT and BKT should be positive. It then proposes a novel Enhanced Task Continual Learning (ETCL) method, which achieves forgetting-free and positive KT. Furthermore, the bounds that can lead to negative FKT and BKT are estimated theoretically. Based on the bounds, a new strategy for online task similarity detection is also proposed to facilitate positive KT. To overcome CF, ETCL learns a set of task-specific binary masks to isolate a sparse sub-network for each task while preserving the performance of a dense network for the task. At the beginning of a new task learning, ETCL tries to align the new task's gradient with that of the sub-network of the previous most similar task to ensure positive FKT. By using a new bi-objective optimization strategy and an orthogonal gradient projection method, ETCL updates only the weights of previous similar tasks at the classification layer to achieve positive BKT. Extensive evaluations demonstrate that the proposed ETCL markedly outperforms strong baselines on dissimilar, similar, and mixed task sequences.

Comment: Matches Model Architecture and Sparsity: task-specific binary masks (sparse sub-networks) with gradient alignment/projection for continual learning.