Personalized Daily ArXiv Papers 2026-01-12
| [gpt-5] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 31134 | 29714 | 60848 |
| Cost | $0.04 | $0.3 | $0.34 |
Total arXiv papers: 391
Total scanned papers: 228
Total relevant papers: 22
Table of contents with paper titles:
-
FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching Authors: Hongyaoxing Gul, Lijuan Hu, Shuzi Niu, Fangfang Liu
-
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs Authors: Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, Shen Li
-
Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding Authors: Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng
-
mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations Authors: Yongyi Yang, Jianyang Gao
-
Transformer Is Inherently a Causal Learner Authors: Xinyue Wang, Stephen Wang, Biwei Huang
-
Do Sparse Autoencoders Identify Reasoning Features in Language Models? Authors: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
-
Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer Authors: Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin
-
Bi-Orthogonal Factor Decomposition for Vision Transformers Authors: Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez
-
Manifold limit for the training of shallow graph convolutional neural networks Authors: Johanna Tengler, Christoph Brune, Jos\'e A. Iglesias
-
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning Authors: Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
-
On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis Authors: Hector Zenil
-
Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer Authors: Zhi Wang, Zhongbin Wu, Yanni Li, Bing Liu, Guangxi Li, Yuping Wang
-
DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis Authors: Rui An, Haohao Qu, Wenqi Fan, Xuequn Shang, Qing Li
-
Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation Authors: Takito Sawada, Akinori Iwata, Masahiro Okuda
-
Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning Authors: Pingchuan Ma, Qixin Zhang, Shuai Wang, Dacheng Tao
-
Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts Authors: Wei Zhou, Hong Huang, Ruize Shi, Bang Liu
-
Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces Authors: Pattarawat Chormai, Ali Hashemi, Klaus-Robert M\"uller, Gr\'egoire Montavon
-
DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models Authors: Eliatan Niktab, Hardip Patel
-
Tracing Moral Foundations in Large Language Models Authors: Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani
-
Circular Reasoning: Understanding Self-Reinforcing Loops in Large Reasoning Models Authors: Zenghao Duan, Liang Pang, Zihao Wei, Wenbin Duan, Yuxin Tian, Shicheng Xu, Jingcheng Deng, Zhiyi Yin, Xueqi Cheng
-
Poisson Hyperplane Processes with Rectified Linear Units Authors: Shufei Ge, Shijia Wang, Lloyd Elliott
-
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang
1. FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching
ArXiv ID: 2601.05684
Authors: Hongyaoxing Gul, Lijuan Hu, Shuzi Niu, Fangfang Liu
Abstract: Traditional post-training quantization (PTQ) is considered an effective approach to reduce model size and accelerate inference of large-scale language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to determine a compromise rank for diverse data and layers in large models, failing to exploit their full potential. Additionally, the current SVD-based low-rank approximation compounds the computational overhead. In this work, we thoroughly analyze the varying effectiveness of low-rank approximation across different layers in representative models. Accordingly, we introduce \underline{F}lexible \underline{L}ow-\underline{R}ank \underline{Q}uantization (FLRQ), a novel solution designed to quickly identify the accuracy-optimal ranks and aggregate them to achieve minimal storage combinations. FLRQ comprises two powerful components, Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC). R1-FLR applies the R1-Sketch with Gaussian projection for the fast low-rank approximation, enabling outlier-aware rank extraction for each layer. Meanwhile, BLC aims at minimizing the low-rank quantization error under the scaling and clipping strategy through an iterative method. FLRQ demonstrates strong effectiveness and robustness in comprehensive experiments, achieving state-of-the-art performance in both quantization quality and algorithm efficiency.
Comment: Matches Model Compression and Efficiency: flexible low-rank quantization with sketching and clipping-optimized approximation for LLMs.
Relevance: 10 Novelty: 8
2. MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
ArXiv ID: 2601.05296
Authors: Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, Shen Li
Abstract: The pervasive "memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads -- driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movements that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materializing, and (ii) co-designed kernels with smart activation checkpoint to mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
Comment: High Performance Computing/MoE Systems: co-designed token dispatch, buffer elimination, and activation checkpointing to break the MoE memory wall and accelerate training.
Relevance: 10 Novelty: 8
3. Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
ArXiv ID: 2601.05724
Authors: Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng
Abstract: Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
Comment: Matches Efficiency: lossless hierarchical speculative decoding improving verification and acceptance rate without altering distribution.
Relevance: 9 Novelty: 8
4. mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
ArXiv ID: 2601.05732
Authors: Yongyi Yang, Jianyang Gao
Abstract: Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn--Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff--von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.
Comment: Model Architecture/Efficiency: reparameterizes hyper-connections to exactly enforce doubly stochastic mixing (via Birkhoff–von Neumann), eliminating Sinkhorn iterations and improving stability/speed.
Relevance: 9 Novelty: 8
5. Transformer Is Inherently a Causal Learner
ArXiv ID: 2601.05647
Authors: Xinyue Wang, Stephen Wang, Biwei Huang
Abstract: We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.
Comment: Representation Learning/Causality: shows autoregressive transformers’ gradients recover time-delayed causal graphs, with theory and scalable extraction method.
Relevance: 9 Novelty: 8
6. Do Sparse Autoencoders Identify Reasoning Features in Language Models?
ArXiv ID: 2601.05679
Authors: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
Abstract: We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). Starting from features selected using standard contrastive activation methods, we introduce a falsification-oriented framework that combines causal token injection experiments and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that identified reasoning features are highly sensitive to token-level interventions. Injecting a small number of feature-associated tokens into non-reasoning text is sufficient to elicit strong activation for 59% to 94% of features, indicating reliance on lexical artifacts. For the remaining features that are not explained by simple token triggers, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields minimal changes or slight degradations in benchmark performance. Together, these results suggest that SAE features identified by contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.
Comment: Representation Learning: falsification-oriented analysis of Sparse Autoencoders, combining causal token injection and LLM-guided tests to assess whether SAE features encode genuine reasoning.
Relevance: 9 Novelty: 8
7. Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer
ArXiv ID: 2601.05770
Authors: Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin
Abstract: Algorithm extraction aims to synthesize executable programs directly from models trained on specific algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, extending this paradigm to Transformer is hindered by superposition, where entangled features encoded in overlapping directions obstruct the extraction of symbolic expressions. In this work, we propose the Discrete Transformer, an architecture explicitly engineered to bridge the gap between continuous representations and discrete symbolic logic. By enforcing a strict functional disentanglement, which constrains Numerical Attention to information routing and Numerical MLP to element-wise arithmetic, and employing temperature-annealed sampling, our method effectively facilitates the extraction of human-readable programs. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains. Moreover, our analysis of the annealing process shows that the efficient discrete search undergoes a clear phase transition from exploration to exploitation. We further demonstrate that our method enables fine-grained control over synthesized programs by imposing inductive biases. Collectively, these findings establish the Discrete Transformer as a robust framework for demonstration-free algorithm discovery, offering a rigorous pathway toward Transformer interpretability.
Comment: Model Architecture: introduces a Discrete Transformer with enforced functional disentanglement (routing vs arithmetic) and annealed sampling to enable program extraction, boosting interpretability.
Relevance: 9 Novelty: 8
8. Bi-Orthogonal Factor Decomposition for Vision Transformers
ArXiv ID: 2601.05328
Authors: Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez
Abstract: Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena.(i) Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum. (ii) Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization. (iii) DINOv2's superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content. Overall, BFD exposes how tokens interact through attention and which informational factors - positional or semantic - mediate their communication, yielding practical insights into vision transformer mechanisms.
Comment: Representation Learning/Architecture Analysis: introduces Bi-orthogonal Factor Decomposition to disentangle positional vs content factors in attention via ANOVA+SVD, yielding insights into token interactions.
Relevance: 9 Novelty: 8
9. Manifold limit for the training of shallow graph convolutional neural networks
ArXiv ID: 2601.06025
Authors: Johanna Tengler, Christoph Brune, Jos\'e A. Iglesias
Abstract: We study the discrete-to-continuum consistency of the training of shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled point clouds under a manifold assumption. Graph convolution is defined spectrally via the graph Laplacian, whose low-frequency spectrum approximates that of the Laplace-Beltrami operator of the underlying smooth manifold, and shallow GCNNs of possibly infinite width are linear functionals on the space of measures on the parameter space. From this functional-analytic perspective, graph signals are seen as spatial discretizations of functions on the manifold, which leads to a natural notion of training data consistent across graph resolutions. To enable convergence results, the continuum parameter space is chosen as a weakly compact product of unit balls, with Sobolev regularity imposed on the output weight and bias, but not on the convolutional parameter. The corresponding discrete parameter spaces inherit the corresponding spectral decay, and are additionally restricted by a frequency cutoff adapted to the informative spectral window of the graph Laplacians. Under these assumptions, we prove $\Gamma$-convergence of regularized empirical risk minimization functionals and corresponding convergence of their global minimizers, in the sense of weak convergence of the parameter measures and uniform convergence of the functions over compact sets. This provides a formalization of mesh and sample independence for the training of such networks.
Comment: Representation Learning/Training Theory: proves Γ-convergence for training shallow GCNNs under manifold assumptions, formalizing mesh/sample independence.
Relevance: 8 Novelty: 8
10. PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
ArXiv ID: 2601.05593
Authors: Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
Abstract: We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
Comment: Model Architecture + HPC/Test-time compute: introduces a conditional/message-passing architecture to massively parallelize reasoning and scale test-time compute beyond context limits.
Relevance: 8 Novelty: 8
11. On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis
ArXiv ID: 2601.05280
Authors: Hector Zenil
Abstract: We formalise recursive self-training in Large Language Models (LLMs) and Generative AI as a discrete-time dynamical system and prove that, as training data become increasingly self-generated ($\alpha_t \to 0$), the system undergoes inevitably degenerative dynamics. We derive two fundamental failure modes: (1) Entropy Decay, where finite sampling effects cause a monotonic loss of distributional diversity (mode collapse), and (2) Variance Amplification, where the loss of external grounding causes the model's representation of truth to drift as a random walk, bounded only by the support diameter. We show these behaviours are not contingent on architecture but are consequences of distributional learning on finite samples. We further argue that Reinforcement Learning with imperfect verifiers suffers similar semantic collapse. To overcome these limits, we propose a path involving symbolic regression and program synthesis guided by Algorithmic Probability. The Coding Theorem Method (CTM) allows for identifying generative mechanisms rather than mere correlations, escaping the data-processing inequality that binds standard statistical learning. We conclude that while purely distributional learning leads to model collapse, hybrid neurosymbolic approaches offer a coherent framework for sustained self-improvement.
Comment: Representation Learning/Training dynamics theory: formalizes recursive self-training in LLMs and proves degenerative behaviors (entropy decay, variance amplification), arguing for neurosymbolic synthesis.
Relevance: 8 Novelty: 8
12. Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer
ArXiv ID: 2601.05623
Authors: Zhi Wang, Zhongbin Wu, Yanni Li, Bing Liu, Guangxi Li, Yuping Wang
Abstract: Existing research on continual learning (CL) of a sequence of tasks focuses mainly on dealing with catastrophic forgetting (CF) to balance the learning plasticity of new tasks and the memory stability of old tasks. However, an ideal CL agent should not only be able to overcome CF, but also encourage positive forward and backward knowledge transfer (KT), i.e., using the learned knowledge from previous tasks for the new task learning (namely FKT), and improving the previous tasks' performance with the knowledge of the new task (namely BKT). To this end, this paper first models CL as an optimization problem in which each sequential learning task aims to achieve its optimal performance under the constraint that both FKT and BKT should be positive. It then proposes a novel Enhanced Task Continual Learning (ETCL) method, which achieves forgetting-free and positive KT. Furthermore, the bounds that can lead to negative FKT and BKT are estimated theoretically. Based on the bounds, a new strategy for online task similarity detection is also proposed to facilitate positive KT. To overcome CF, ETCL learns a set of task-specific binary masks to isolate a sparse sub-network for each task while preserving the performance of a dense network for the task. At the beginning of a new task learning, ETCL tries to align the new task's gradient with that of the sub-network of the previous most similar task to ensure positive FKT. By using a new bi-objective optimization strategy and an orthogonal gradient projection method, ETCL updates only the weights of previous similar tasks at the classification layer to achieve positive BKT. Extensive evaluations demonstrate that the proposed ETCL markedly outperforms strong baselines on dissimilar, similar, and mixed task sequences.
Comment: Matches Model Architecture and Sparsity: task-specific binary masks (sparse sub-networks) with gradient alignment/projection for continual learning.
Relevance: 8 Novelty: 7
13. DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis
ArXiv ID: 2601.05527
Authors: Rui An, Haohao Qu, Wenqi Fan, Xuequn Shang, Qing Li
Abstract: Accurate and efficient multivariate time series (MTS) analysis is increasingly critical for a wide range of intelligent applications. Within this realm, Transformers have emerged as the predominant architecture due to their strong ability to capture pairwise dependencies. However, Transformer-based models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment in long-term and large-scale MTS modeling. Recently, Mamba has emerged as a promising linear-time alternative with high expressiveness. Nevertheless, directly applying vanilla Mamba to MTS remains suboptimal due to three key limitations: (i) the lack of explicit cross-variate modeling, (ii) difficulty in disentangling the entangled intra-series temporal dynamics and inter-series interactions, and (iii) insufficient modeling of latent time-lag interaction effects. These issues constrain its effectiveness across diverse MTS tasks. To address these challenges, we propose DeMa, a dual-path delay-aware Mamba backbone. DeMa preserves Mamba's linear-complexity advantage while substantially improving its suitability for MTS settings. Specifically, DeMa introduces three key innovations: (i) it decomposes the MTS into intra-series temporal dynamics and inter-series interactions; (ii) it develops a temporal path with a Mamba-SSD module to capture long-range dynamics within each individual series, enabling series-independent, parallel computation; and (iii) it designs a variate path with a Mamba-DALA module that integrates delay-aware linear attention to model cross-variate dependencies. Extensive experiments on five representative tasks, long- and short-term forecasting, data imputation, anomaly detection, and series classification, demonstrate that DeMa achieves state-of-the-art performance while delivering remarkable computational efficiency.
Comment: Matches Model Architecture and Efficiency: dual-path, delay-aware Mamba backbone with linear-time modules for sequence modeling.
Relevance: 8 Novelty: 7
14. Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation
ArXiv ID: 2601.05599
Authors: Takito Sawada, Akinori Iwata, Masahiro Okuda
Abstract: Convolutional Neural Networks (CNNs) are known to exhibit a strong texture bias, favoring local patterns over global shape information--a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this gap, we propose a data-driven metric that quantifies the shape-texture balance of a dataset by computing the Structural Similarity Index (SSIM) between each image's luminance channel and its L0-smoothed counterpart. Building on this metric, we further introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results show that this approach consistently improves classification accuracy on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical, requiring training only the final classification layer.
Comment: Matches Representation Learning and Architecture: quantifies dataset shape-texture balance and induces shape bias via max-pool dilation.
Relevance: 8 Novelty: 7
15. Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning
ArXiv ID: 2601.05474
Authors: Pingchuan Ma, Qixin Zhang, Shuai Wang, Dacheng Tao
Abstract: Recently, differentiable causal discovery has emerged as a promising approach to improve the accuracy and efficiency of existing methods. However, when applied to high-dimensional data or data with latent confounders, these methods, often based on off-the-shelf continuous optimization algorithms, struggle with the vast search space, the complexity of the objective function, and the nontrivial nature of graph-theoretical constraints. As a result, there has been a surge of interest in leveraging super-structures to guide the optimization process. Nonetheless, learning an appropriate super-structure at the right level of granularity, and doing so efficiently across various settings, presents significant challenges. In this paper, we propose ALVGL, a novel and general enhancement to the differentiable causal discovery pipeline. ALVGL employs a sparse and low-rank decomposition to learn the precision matrix of the data. We design an ADMM procedure to optimize this decomposition, identifying components in the precision matrix that are most relevant to the underlying causal structure. These components are then combined to construct a super-structure that is provably a superset of the true causal graph. This super-structure is used to initialize a standard differentiable causal discovery method with a more focused search space, thereby improving both optimization efficiency and accuracy. We demonstrate the versatility of ALVGL by instantiating it across a range of structural causal models, including both Gaussian and non-Gaussian settings, with and without unmeasured confounders. Extensive experiments on synthetic and real-world datasets show that ALVGL not only achieves state-of-the-art accuracy but also significantly improves optimization efficiency, making it a reliable and effective solution for differentiable causal discovery.
Comment: Matches Efficiency and Low-rank/Sparsity: sparse+low-rank precision decomposition with ADMM to constrain and accelerate differentiable causal discovery.
Relevance: 8 Novelty: 7
16. Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts
ArXiv ID: 2601.05537
Authors: Wei Zhou, Hong Huang, Ruize Shi, Bang Liu
Abstract: Heterogeneous Graph Neural Networks(HGNNs) have advanced mainly through better encoders, yet their decoding/projection stage still relies on a single shared linear head, assuming it can map rich node embeddings to labels. We call this the Linear Projection Bottleneck: in heterogeneous graphs, contextual diversity and long-tail shifts make a global head miss fine semantics, overfit hub nodes, and underserve tail nodes. While Mixture-of-Experts(MoE) could help, naively applying it clashes with structural imbalance and risks expert collapse. We propose a Heterogeneous-aware Orthogonal Prototype Experts framework named HOPE, a plug-and-play replacement for the standard prediction head. HOPE uses learnable prototype-based routing to assign instances to experts by similarity, letting expert usage follow the natural long-tail distribution, and adds expert orthogonalization to encourage diversity and prevent collapse. Experiments on four real datasets show consistent gains across SOTA HGNN backbones with minimal overhead.
Comment: Model Architecture (MoE): heterogeneous-aware prototype-routed expert head with orthogonalization to overcome linear projection bottleneck in HGNNs.
Relevance: 8 Novelty: 7
17. Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces
ArXiv ID: 2601.05913
Authors: Pattarawat Chormai, Ali Hashemi, Klaus-Robert M\"uller, Gr\'egoire Montavon
Abstract: Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address the scenario in which only a few classes and their associated intermediate concepts are relevant to distill. This scenario is common in practice, yet few existing distillation methods explicitly focus on the relevant subtask. To address this gap, we introduce 'SubDistill', a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Experiments on CIFAR-100 and ImageNet with Convolutional and Transformer models demonstrate that SubDistill outperforms existing layer-wise distillation techniques on a representative set of subtasks. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.
Comment: Model Compression/Efficiency: subtask-focused knowledge distillation that transfers only relevant subspaces/layer components from teacher to student.
Relevance: 8 Novelty: 7
18. DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models
ArXiv ID: 2601.05531
Authors: Eliatan Niktab, Hardip Patel
Abstract: Tokenization sits at the boundary between high-throughput genomic input and GPU compute, posing challenges in both algorithm design and system throughput. Overlapping k-mer tokenization can introduce information leakage under masked language modeling (MLM) and may degrade downstream accuracy. Single-nucleotide tokenization avoids leakage and preserves per-base fidelity, but it greatly increases sequence length for attention-based architectures. Non-overlapping k-mers and byte-pair encoding (BPE) provide compression and avoid leakage, at the cost of boundary sensitivity or reduced interpretability. Empirically, the choice of tokenization interacts strongly with model architecture and task requirements. At the system level, however, standard string tokenizers and host-bound vocabulary lookups dominate wall-clock time once inputs reach billions of bases, regardless of the tokenization algorithm. We present DNATok, a high-performance, GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline using pinned memory and architectural parallelism. DNATok is vocabulary-agnostic: it accelerates single-nucleotide, non-overlapping k-mer, and BPE tokenization, and integrates as a drop-in systems layer beneath genomic foundation models. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput. End-to-end streaming reaches 1.27-1.84e8 tokens/s depending on configuration, effectively removing tokenization as a bottleneck for production-scale training and inference.
Comment: High Performance Computing/System Efficiency: GPU-first tokenizer with LUT-based streaming and overlapped H2D/compute removes tokenization bottlenecks for foundation models.
Relevance: 8 Novelty: 7
19. Tracing Moral Foundations in Large Language Models
ArXiv ID: 2601.05437
Authors: Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani
Abstract: Large language models (LLMs) often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial ``moral mimicry.'' Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed within two instruction-tuned LLMs: Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that both models represent and distinguish moral foundations in a structured, layer-dependent way that aligns with human judgments. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.
Comment: Representation Learning/Mechanistic Interpretability: layer-wise concept analysis with SAEs and causal steering reveals structured, partly disentangled moral representations in LLMs.
Relevance: 8 Novelty: 7
20. Circular Reasoning: Understanding Self-Reinforcing Loops in Large Reasoning Models
ArXiv ID: 2601.05693
Authors: Zenghao Duan, Liang Pang, Zihao Wei, Wenbin Duan, Yuxin Tian, Shicheng Xu, Jingcheng Deng, Zhiyi Yin, Xueqi Cheng
Abstract: Despite the success of test-time scaling, Large Reasoning Models (LRMs) frequently encounter repetitive loops that lead to computational waste and inference failure. In this paper, we identify a distinct failure mode termed Circular Reasoning. Unlike traditional model degeneration, this phenomenon manifests as a self-reinforcing trap where generated content acts as a logical premise for its own recurrence, compelling the reiteration of preceding text. To systematically analyze this phenomenon, we introduce LoopBench, a dataset designed to capture two distinct loop typologies: numerical loops and statement loops. Mechanistically, we characterize circular reasoning as a state collapse exhibiting distinct boundaries, where semantic repetition precedes textual repetition. We reveal that reasoning impasses trigger the loop onset, which subsequently persists as an inescapable cycle driven by a self-reinforcing V-shaped attention mechanism. Guided by these findings, we employ the Cumulative Sum (CUSUM) algorithm to capture these precursors for early loop prediction. Experiments across diverse LRMs validate its accuracy and elucidate the stability of long-chain reasoning.
Comment: Representation Learning/Training Dynamics: characterizes circular reasoning loops in LRMs and links to attention dynamics; proposes CUSUM-based early loop detection.
Relevance: 8 Novelty: 7
21. Poisson Hyperplane Processes with Rectified Linear Units
ArXiv ID: 2601.05586
Authors: Shufei Ge, Shijia Wang, Lloyd Elliott
Abstract: Neural networks have shown state-of-the-art performances in various classification and regression tasks. Rectified linear units (ReLU) are often used as activation functions for the hidden layers in a neural network model. In this article, we establish the connection between the Poisson hyperplane processes (PHP) and two-layer ReLU neural networks. We show that the PHP with a Gaussian prior is an alternative probabilistic representation to a two-layer ReLU neural network. In addition, we show that a two-layer neural network constructed by PHP is scalable to large-scale problems via the decomposition propositions. Finally, we propose an annealed sequential Monte Carlo algorithm for Bayesian inference. Our numerical experiments demonstrate that our proposed method outperforms the classic two-layer ReLU neural network. The implementation of our proposed model is available at https://github.com/ShufeiGe/Pois_Relu.git.
Comment: Model Architecture/Theory: establishes a probabilistic PHP representation equivalent to two-layer ReLU networks, with scalable decomposition and Bayesian inference.
Relevance: 8 Novelty: 7
22. The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
ArXiv ID: 2601.06002
Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang
Abstract: Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from human or non-Long-CoT LLMs imitation. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.
Comment: Representation Learning/Training Dynamics: analyzes structure of long CoT reasoning and proposes Mole-Syn to synthesize effective reasoning trajectories for stable learning.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.