Personalized Daily ArXiv Papers 2026-01-28
| [gpt-5] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 44932 | 40485 | 85417 |
| Cost | $0.06 | $0.4 | $0.46 |
Total arXiv papers: 623
Total scanned papers: 334
Total relevant papers: 34
Table of contents with paper titles:
-
Explicit Multi-head Attention for Inter-head Interaction in Large Language Models Authors: Runyu Peng, Yunhua Zhou, Demin Song, Kai Lv, Bo Wang, Qipeng Guo, Xipeng Qiu
-
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep Authors: Chen Chen, Lai Wei
-
Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective Authors: Fangzhou Wu (Richard), Sandeep Silwal (Richard), Qiuyi (Richard), Zhang
-
LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation Authors: Hongyaoxing Gu, Lijuan Hu, Liye Yu, Haowei Li, Fangfang Liu
-
StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths Authors: Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron
-
FloydNet: A Learning Paradigm for Global Relational Reasoning Authors: Jingcheng Yu, Mingliang Zeng, Qiwei Ye
-
Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining Authors: Yunwei Ren, Yatin Dandi, Florent Krzakala, Jason D. Lee
-
Self-Supervised Weight Templates for Scalable Vision Model Initialization Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
-
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching Authors: Runjia Zeng, Qifan Wang, Qiang Guan, Ruixiang Tang, Lifu Huang, Zhenting Wang, Xueling Zhang, Cheng Han, Dongfang Liu
-
Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers Authors: Bohan Hou, Hongyi Jin, Guanjie Wang, Jinqi Chen, Yaxing Cai, Lijie Yang, Zihao Ye, Yaoyao Ding, Ruihang Lai, Tianqi Chen
-
The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence Authors: Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
-
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability Authors: Shawn Im, Changdae Oh, Zhen Fang, Sharon Li
-
EPAS: Efficient Training with Progressive Activation Sharing Authors: Rezaul Karim, Maryam Dialameh, Yang Liu, Boxing Chen, Walid Ahmed
-
SONIC: Spectral Oriented Neural Invariant Convolutions Authors: Gijs Joppe Moens, Regina Beets-Tan, Eduardo H. P. Pooch
-
Revisiting Parameter Server in LLM Post-Training Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Chaoyi Ruan, Min Lin, Jialin Li
-
Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder Authors: Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen
-
How Is Uncertainty Propagated in Knowledge Distillation? Authors: Ziyao Cui, Jian Pei
-
GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs Authors: Wei Huang, Anda Cheng, Yinggui Wang
-
Is Finer Better? The Limits of Microscaling Formats in Large Language Models Authors: Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang
-
On the Expressiveness of State Space Models via Temporal Logics Authors: Eric Alsmann, Lowejatan Noori, Martin Lange
-
To Grok Grokking: Provable Grokking in Ridge Regression Authors: Mingyue Xu, Gal Vardi, Itay Safran
-
The Effect of Architecture During Continual Learning Authors: Allyson Hahn, Krishnan Raghavan
-
Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model Authors: Zhiyu An, Wan Du
-
A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy Authors: Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O'Brien, Ashwinee Panda, Ryan Lagasse
-
Smooth embeddings in contracting recurrent networks driven by regular dynamics: A synthesis for neural representation Authors: Vikas N. O'Reilly-Shah, Alessandro Maria Selvitella
-
Residual Tokens Enhance Masked Autoencoders for Speech Modeling Authors: Samir Sadok, St\'ephane Lathuili`ere, Xavier Alameda-Pineda
-
Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes Authors: Yifan Wang, Jichen Zheng, Jingyuan Sun, Yunhao Zhang, Chunyu Ye, Jixing Li, Chengqing Zong, Shaonan Wang
-
SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems Authors: Saeed Nasehi Basharzad, Farhana Choudhury, Egemen Tanin
-
Revisiting Incremental Stochastic Majorization-Minimization Algorithms with Applications to Mixture of Experts Authors: TrungKhang Tran, TrungTin Nguyen, Gersende Fort, Tung Doan, Hien Duy Nguyen, Binh T. Nguyen, Florence Forbes, Christopher Drovandi
-
Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget Authors: Harsh Vardhan, Arya Mazumdar
-
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning Authors: Ren Zhuang, Ben Wang, Shuifa Sun
-
ASEHybrid: When Geometry Matters Beyond Homophily in Graph Neural Networks Authors: Shalima Binta Manir, Tim Oates
-
Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise Authors: Hongxu Chen, Ke Wei, Xiaoming Yuan, Luo Luo
-
Fixed Aggregation Features Can Rival GNNs Authors: Celia Rubio-Madrigal, Rebekka Burkholz
1. Explicit Multi-head Attention for Inter-head Interaction in Large Language Models
ArXiv ID: 2601.19611
Authors: Runyu Peng, Yunhua Zhou, Demin Song, Kai Lv, Bo Wang, Qipeng Guo, Xipeng Qiu
Abstract: In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them using low-rank "virtual heads". This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop for Olympiad-level mathematical benchmarks.
Comment: Model Architecture & Efficiency: explicit multi-head attention with head-level linear composition and normalization; enables KV-cache compression via low-rank virtual heads.
Relevance: 10 Novelty: 9
2. Post-LayerNorm Is Back: Stable, ExpressivE, and Deep
ArXiv ID: 2601.19895
Authors: Chen Chen, Lai Wei
Abstract: Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
Comment: Strong match to Model Architecture and training stability: Post-LN Transformer with Highway-style connections enabling stable ultra-deep training and improved depth scaling.
Relevance: 10 Novelty: 9
3. Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective
ArXiv ID: 2601.18999
Authors: Fangzhou Wu (Richard), Sandeep Silwal (Richard), Qiuyi (Richard), Zhang
Abstract: KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to 6.92$\times$ in cache hit rate, 11.96$\times$ reduction in latency, 14.06$\times$ reduction in time-to-first-token (TTFT), and 77.4% increase in throughput over the state-of-the-art methods. Our code is available at https://github.com/fzwark/KVRouting.
Comment: High Performance Computing & Efficiency: unified model for KV-cache eviction and query routing with randomized eviction and learning-based routing; theoretical guarantees and large speedups.
Relevance: 10 Novelty: 8
4. LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation
ArXiv ID: 2601.19675
Authors: Hongyaoxing Gu, Lijuan Hu, Liye Yu, Haowei Li, Fangfang Liu
Abstract: Post-training quantization (PTQ) enables effective model compression while preserving relatively high accuracy. Current weight-only PTQ methods primarily focus on the challenging sub-3-bit regime, where approaches often suffer significant accuracy degradation, typically requiring fine-tuning to achieve competitive performance. In this work, we revisit the fundamental characteristics of weight quantization and analyze the challenges in quantizing the residual matrix under low-rank approximation. We propose LoPRo, a novel fine-tuning-free PTQ algorithm that enhances residual matrix quantization by applying block-wise permutation and Walsh-Hadamard transformations to rotate columns of similar importance, while explicitly preserving the quantization accuracy of the most salient column blocks. Furthermore, we introduce a mixed-precision fast low-rank decomposition based on rank-1 sketch (R1SVD) to further minimize quantization costs. Experiments demonstrate that LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization, achieving accuracy comparable to fine-tuning baselines. Specifically, LoPRo achieves state-of-the-art quantization accuracy on LLaMA-2 and LLaMA-3 series models while delivering up to a 4$\times$ speedup. In the MoE model Mixtral-8x7B, LoPRo completes quantization within 2.5 hours, simultaneously reducing perplexity by 0.4$\downarrow$ and improving accuracy by 8\%$\uparrow$. Moreover, compared to other low-rank quantization methods, LoPRo achieves superior accuracy with a significantly lower rank, while maintaining high inference efficiency and minimal additional latency.
Comment: Compression/Efficiency: fine-tuning-free post-training quantization with low-rank decomposition and permuted block-wise rotations (2–3 bit regime).
Relevance: 10 Novelty: 8
5. StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths
ArXiv ID: 2601.19320
Authors: Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron
Abstract: Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra-low bitwidths remains challenging. Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low-bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE as the latter arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT at 2-4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead against standard QAT techniques. Our code is available at https://github.com/microsoft/StableQAT.
Comment: Strong match to Model Compression/Efficiency: a theoretically grounded surrogate for ultra-low-bit Quantization-Aware Training that generalizes STE and stabilizes training.
Relevance: 10 Novelty: 8
6. FloydNet: A Learning Paradigm for Global Relational Reasoning
ArXiv ID: 2601.19094
Authors: Jingcheng Yu, Mingliang Zeng, Qiwei Ye
Abstract: Developing models capable of complex, multi-step reasoning is a central goal in artificial intelligence. While representing problems as graphs is a powerful approach, Graph Neural Networks (GNNs) are fundamentally constrained by their message-passing mechanism, which imposes a local bottleneck that limits global, holistic reasoning. We argue that dynamic programming (DP), which solves problems by iteratively refining a global state, offers a more powerful and suitable learning paradigm. We introduce FloydNet, a new architecture that embodies this principle. In contrast to local message passing, FloydNet maintains a global, all-pairs relationship tensor and learns a generalized DP operator to progressively refine it. This enables the model to develop a task-specific relational calculus, providing a principled framework for capturing long-range dependencies. Theoretically, we prove that FloydNet achieves 3-WL (2-FWL) expressive power, and its generalized form aligns with the k-FWL hierarchy. FloydNet demonstrates state-of-the-art performance across challenging domains: it achieves near-perfect scores (often >99\%) on the CLRS-30 algorithmic benchmark, finds exact optimal solutions for the general Traveling Salesman Problem (TSP) at rates significantly exceeding strong heuristics, and empirically matches the 3-WL test on the BREC benchmark. Our results establish this learned, DP-style refinement as a powerful and practical alternative to message passing for high-level graph reasoning.
Comment: Model Architecture: replaces local message passing with a learned DP-style global refinement operator; proven expressivity (3-WL/2-FWL).
Relevance: 9 Novelty: 9
7. Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining
ArXiv ID: 2601.19756
Authors: Yunwei Ren, Yatin Dandi, Florent Krzakala, Jason D. Lee
Abstract: The empirical success of deep learning is often attributed to deep networks' ability to exploit hierarchical structure in data, constructing increasingly complex features across layers. Yet despite substantial progress in deep learning theory, most optimization results sill focus on networks with only two or three layers, leaving the theoretical understanding of hierarchical learning in genuinely deep models limited. This leads to a natural question: can we prove that deep networks, trained by gradient-based methods, can efficiently exploit hierarchical structure? In this work, we consider Random Hierarchy Models -- a hierarchical context-free grammar introduced by arXiv:2307.02129 and conjectured to separate deep and shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class. Our proof builds on a general observation: if intermediate layers can receive clean signal from the labels and the relevant features are weakly identifiable, then layerwise training each individual layer suffices to hierarchically learn the target function.
Comment: Deep Learning Theory: provable hierarchical learning in deep conv nets on Random Hierarchy Models via layerwise training (shallow-to-deep chaining).
Relevance: 9 Novelty: 9
8. Self-Supervised Weight Templates for Scalable Vision Model Initialization
ArXiv ID: 2601.19694
Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
Abstract: The increasing scale and complexity of modern model parameters underscore the importance of pre-trained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on \textsc{classification}, \textsc{detection}, \textsc{segmentation} and \textsc{generation} tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.
Comment: Model Compression/Efficiency & Architecture: Tucker-factorized shared weight template with size-specific scalers enables scalable initialization across depths/widths; includes width-wise stochastic scaling.
Relevance: 9 Novelty: 8
9. TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching
ArXiv ID: 2601.19739
Authors: Runjia Zeng, Qifan Wang, Qiang Guan, Ruixiang Tang, Lifu Huang, Zhenting Wang, Xueling Zhang, Cheng Han, Dongfang Liu
Abstract: Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior arts offer various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: https://runjia.tech/iclr_tokenseek/
Comment: Model Compression and Efficiency/HPC: instance-aware token seeking/ditching to cut activation memory during fine-tuning with large savings.
Relevance: 9 Novelty: 8
10. Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers
ArXiv ID: 2601.19092
Authors: Bohan Hou, Hongyi Jin, Guanjie Wang, Jinqi Chen, Yaxing Cai, Lijie Yang, Zihao Ye, Yaoyao Ding, Ruihang Lai, Tianqi Chen
Abstract: Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach can bring performance close to hand-tuned kernels on across latest GPU devices and multi-device environments and accelerator backends.
Comment: High Performance Computing/Systems: unified layout abstraction and compiler DSL for distribution, tiling, and sharding across device meshes and memory hierarchies.
Relevance: 9 Novelty: 8
11. The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence
ArXiv ID: 2601.19597
Authors: Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
Abstract: While InfoNCE powers modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment--uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying "uniformity" as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.
Comment: Representation Learning Theory: measure-theoretic analysis of contrastive learning geometry beyond alignment–uniformity, including multimodal divergence effects.
Relevance: 9 Novelty: 8
12. How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
ArXiv ID: 2601.19208
Authors: Shawn Im, Changdae Oh, Zhen Fang, Sharon Li
Abstract: Semantic associations such as the link between "bird" and "flew" are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem shines light on interpreting the learned associations in transformers.
Comment: Representation Learning/Mechanistic Interpretability: closed-form early-training weight characterizations in Transformers via gradient leading terms.
Relevance: 9 Novelty: 8
13. EPAS: Efficient Training with Progressive Activation Sharing
ArXiv ID: 2601.19089
Authors: Rezaul Karim, Maryam Dialameh, Yang Liu, Boxing Chen, Walid Ahmed
Abstract: We present a novel method for Efficient training with Progressive Activation Sharing (EPAS). This method bridges progressive training paradigm with the phenomenon of redundant QK (or KV ) activations across deeper layers of transformers. EPAS gradually grows a sharing region during training by switching decoder layers to activation sharing mode. This results in throughput increase due to reduced compute. To utilize deeper layer redundancy, the sharing region starts from the deep end of the model and grows towards the shallow end. The EPAS trained models allow for variable region lengths of activation sharing for different compute budgets during inference. Empirical evaluations with QK activation sharing in LLaMA models ranging from 125M to 7B parameters show up to an 11.1% improvement in training throughput and up to a 29% improvement in inference throughput while maintaining similar loss curve to the baseline models. Furthermore, applying EPAS in continual pretraining to transform TinyLLaMA into an attention-sharing model yields up to a 10% improvement in average accuracy over state-of-the-art methods, emphasizing the significance of progressive training in cross layer activation sharing models.
Comment: Efficiency/HPC: progressive activation (QK/KV) sharing across Transformer layers to boost training and inference throughput with controllable sharing at inference.
Relevance: 9 Novelty: 8
14. SONIC: Spectral Oriented Neural Invariant Convolutions
ArXiv ID: 2601.19884
Authors: Gijs Joppe Moens, Regina Beets-Tan, Eduardo H. P. Pooch
Abstract: Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
Comment: Strong match to Model Architecture: continuous, orientation-aware spectral parameterization of convolutional operators with global receptive fields and resolution adaptivity.
Relevance: 9 Novelty: 8
15. Revisiting Parameter Server in LLM Post-Training
ArXiv ID: 2601.19362
Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Chaoyi Ruan, Min Lin, Jialin Li
Abstract: Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose \textbf{On-Demand Communication (ODC)}, which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36\% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. Our implementation of ODC and integration with FSDP is open-sourced at https://github.com/sail-sg/odc.
Comment: Systems-level innovation for distributed LLM training: replaces collective ops with point-to-point in FSDP (On-Demand Communication) to handle workload imbalance—fits the HPC/distributed training criterion.
Relevance: 9 Novelty: 8
16. Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder
ArXiv ID: 2601.19179
Authors: Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen
Abstract: Autoencoders have long been considered a nonlinear extension of Principal Component Analysis (PCA). Prior studies have demonstrated that linear autoencoders (LAEs) can recover the ordered, axis-aligned principal components of PCA by incorporating non-uniform $\ell_2$ regularization or by adjusting the loss function. However, these approaches become insufficient in the nonlinear setting, as the remaining variance cannot be properly captured independently of the nonlinear mapping. In this work, we propose a novel autoencoder framework that integrates non-uniform variance regularization with an isometric constraint. This design serves as a natural generalization of PCA, enabling the model to preserve key advantages, such as ordered representations and variance retention, while remaining effective for nonlinear dimensionality reduction tasks.
Comment: Model Architecture & Representation Learning: proposes an autoencoder with non-uniform variance regularization and isometric constraint to recover ordered latent components (PCA generalization).
Relevance: 9 Novelty: 7
17. How Is Uncertainty Propagated in Knowledge Distillation?
ArXiv ID: 2601.18909
Authors: Ziyao Cui, Jian Pei
Abstract: Knowledge distillation transfers behavior from a teacher to a student model, but the process is inherently stochastic: teacher outputs, student training, and student inference can all be random. Collapsing these uncertainties to a single point estimate can distort what is learned. We systematically study how uncertainty propagates through knowledge distillation across three representative model classes--linear regression, feed-forward neural networks, and large language models (LLMs)--and propose simple corrections. We distinguish inter-student uncertainty (variance across independently distilled students) from intra-student uncertainty (variance of a single student's predictive distribution), showing that standard single-response knowledge distillation suppresses intra-student variance while leaving substantial inter-student variability. To address these mismatches, we introduce two variance-aware strategies: averaging multiple teacher responses, which reduces noise at rate $O(1/k)$, and variance-weighting, which combines teacher and student estimates via inverse-variance weighting to yield a minimum-variance estimator. We provide formal guarantees in linear regression, validate the methods in neural networks, and demonstrate empirical gains in LLM distillation, including reduced systematic noise and hallucination. These results reframe knowledge distillation as an uncertainty transformation and show that variance-aware distillation produces more stable students that better reflect teacher uncertainty.
Comment: Model Compression and Efficiency: variance-aware knowledge distillation (multi-response averaging and inverse-variance weighting) with formal analysis of uncertainty propagation.
Relevance: 9 Novelty: 7
18. GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs
ArXiv ID: 2601.19503
Authors: Wei Huang, Anda Cheng, Yinggui Wang
Abstract: Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. Meanwhile, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers. Only elements with the same sign are merged to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight downstream datasets. Including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner has achieved a parameter reduction of 40% with only a 0.99% decrease in accuracy. Our code is publicly available.
Comment: Model Compression and Efficiency: gradient-guided layer pruning and merging for LLMs enabling efficient fine-tuning and inference.
Relevance: 9 Novelty: 7
19. Is Finer Better? The Limits of Microscaling Formats in Large Language Models
ArXiv ID: 2601.19026
Authors: Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang
Abstract: Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report the emergence of a surprising behavior associated with microscaling quantization, whereas the output of a quantized model degrades as block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several Large Language Models and identify the conditions driving the anomalous behavior. Theoretically, we lay down a framework showing remarkable agreement with experimental data from pretrained model distributions and ideal ones. Overall, we show that the anomaly is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. Based on these insights, we propose the use of FP8 unsigned E5M3 (UE5M3) as a novel hardware-friendly format for the scales in FP4 microscaling data types. We demonstrate that UE5M3 achieves comparable performance to the conventional FP8 unsigned E4M3 scales while obviating the need of global scaling operations on weights and activations.
Comment: Strong match to Model Compression/Efficiency: analyzes limits of microscaling quantization and proposes a hardware-friendly FP8 UE5M3 scale format for FP4 data types.
Relevance: 9 Novelty: 7
20. On the Expressiveness of State Space Models via Temporal Logics
ArXiv ID: 2601.19467
Authors: Eric Alsmann, Lowejatan Noori, Martin Lange
Abstract: We investigate the expressive power of state space models (SSM), which have recently emerged as a potential alternative to transformer architectures in large language models. Building on recent work, we analyse SSM expressiveness through fragments and extensions of linear temporal logic over finite traces. Our results show that the expressive capabilities of SSM vary substantially depending on the underlying gating mechanism. We further distinguish between SSM operating over fixed-width arithmetic (quantised models), whose expressive power remains within regular languages, and SSM with unbounded precision, which can capture counting properties and non-regular languages. In addition, we provide a systematic comparison between these different SSM variants and known results on transformers, thereby clarifying how the two architectures relate in terms of expressive power.
Comment: Strong match to Model Architecture theory: expressiveness analysis of State Space Models via temporal logic, including quantized vs unbounded precision and comparison to transformers.
Relevance: 9 Novelty: 7
21. To Grok Grokking: Provable Grokking in Ridge Regression
ArXiv ID: 2601.19791
Authors: Mingyue Xu, Gal Vardi, Itay Safran
Abstract: We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
Comment: Representation Learning: theoretical training-dynamics analysis of grokking with provable bounds on generalization delay in ridge regression.
Relevance: 8 Novelty: 8
22. The Effect of Architecture During Continual Learning
ArXiv ID: 2601.19766
Authors: Allyson Hahn, Krishnan Raghavan
Abstract: Continual learning is a challenge for models with static architecture, as they fail to adapt to when data distributions evolve across tasks. We introduce a mathematical framework that jointly models architecture and weights in a Sobolev space, enabling a rigorous investigation into the role of neural network architecture in continual learning and its effect on the forgetting loss. We derive necessary conditions for the continual learning solution and prove that learning only model weights is insufficient to mitigate catastrophic forgetting under distribution shifts. Consequently, we prove that by learning the architecture and weights simultaneously at each task, we can reduce catastrophic forgetting. To learn weights and architecture simultaneously, we formulate continual learning as a bilevel optimization problem: the upper level selects an optimal architecture for a given task, while the lower level computes optimal weights via dynamic programming over all tasks. To solve the upper level problem, we introduce a derivative-free direct search algorithm to determine the optimal architecture. Once found, we must transfer knowledge from the current architecture to the optimal one. However, the optimal architecture will result in a weights parameter space different from the current architecture (i.e., dimensions of weights matrices will not match). To bridge the dimensionality gap, we develop a low-rank transfer mechanism to map knowledge across architectures of mismatched dimensions. Empirical studies across regression and classification problems, including feedforward, convolutional, and graph neural networks, demonstrate that learning the optimal architecture and weights simultaneously yields substantially improved performance (up to two orders of magnitude), reduced forgetting, and enhanced robustness to noise compared with static architecture approaches.
Comment: Model Architecture/Representation Learning: joint optimization of architecture and weights to mitigate forgetting; bilevel formulation with low-rank knowledge transfer.
Relevance: 8 Novelty: 8
23. Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model
ArXiv ID: 2601.18858
Authors: Zhiyu An, Wan Du
Abstract: Compositional generalization-the ability to interpret novel combinations of familiar components-remains a persistent challenge for neural networks. Behavioral evaluations reveal when models fail but offer limited insight into why failures arise at the representational level. We introduce Homomorphism Error (HE), a structural metric that quantifies deviations from approximate homomorphisms between the expression algebra and a model's hidden-state space. We instantiate HE for two compositional operators in SCAN-style tasks: modifier HE for unary composition and sequence HE for binary composition, measured by learning representation-level operators that predict composed representations from their parts. Across controlled experiments with small decoder-only Transformers, HE predicts out-of-distribution (OOD) compositional generalization under noise injection, achieving R^2 = 0.73 correlation between modifier HE and OOD accuracy. Ablations show that model depth has minimal effect on either HE or OOD accuracy, training data coverage exhibits threshold effects (insufficient coverage sharply increases HE and degrades OOD performance), and randomly inserted noise tokens systematically increase HE. Finally, we test if HE-regularized training improves OOD accuracy. Experiment shows that explicitly enforcing low modifier HE during training significantly reduces modifier HE (p = 1.1x10-4) and sequence HE (p = 0.001) and yields a statistically significant improvement in OOD accuracy (p = 0.023). Together, these results indicate the potential of HE to be both a diagnostic and an actionable training signal for improving compositional generalization. Code to reproduce our experiments is open-sourced.
Comment: Representation Learning: introduces a structural metric (Homomorphism Error) on Transformer hidden states and uses it as a training regularizer to improve compositional generalization.
Relevance: 8 Novelty: 7
24. A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
ArXiv ID: 2601.18939
Authors: Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O'Brien, Ashwinee Panda, Ryan Lagasse
Abstract: Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning, remaining effective even in situations when little data is available
Comment: Representation Learning/Efficient adaptation: isolates behavior-specific neurons via sparse autoencoders and updates only a small neuron subset (sparse, neuron-level fine-tuning).
Relevance: 8 Novelty: 7
25. Smooth embeddings in contracting recurrent networks driven by regular dynamics: A synthesis for neural representation
ArXiv ID: 2601.19019
Authors: Vikas N. O'Reilly-Shah, Alessandro Maria Selvitella
Abstract: Recurrent neural networks trained for time-series prediction often develop latent trajectories that preserve qualitative structure of the dynamical systems generating their inputs. Recent empirical work has documented topology-preserving latent organization in trained recurrent models, and recent theoretical results in reservoir computing establish conditions under which the synchronization map is an embedding. Here we synthesize these threads into a unified account of when contracting recurrent networks yield smooth, topology-preserving internal representations for a broad and biologically relevant class of inputs: regular dynamics on invariant circles and tori. Our contribution is an integrated framework that assembles (i) generalized synchronization and embedding guarantees for contracting reservoirs, (ii) regularity mechanisms ensuring differentiability of the synchronization map under mild constraints, and (iii) a base-system viewpoint in which the invariant manifold generating the input stream is treated as the driving system. In this regular setting, the conditions commonly viewed as restrictive in chaotic-attractor analyses become mild and readily satisfied by standard contractive architectures. The framework clarifies how representational content in recurrent circuits is inherently historical: the network state encodes finite windows of input history rather than instantaneous stimuli. By consolidating disparate empirical and theoretical results under common assumptions, the synthesis yields concrete, testable expectations about when prediction-trained recurrent circuits should (or should not) form smooth latent embeddings and how required state dimension scales with the intrinsic dimension of the driving dynamics.
Comment: Representation Learning: theoretical synthesis showing when contracting RNNs learn smooth, topology-preserving embeddings of regular dynamics; implications for state dimension and training dynamics.
Relevance: 8 Novelty: 7
26. Residual Tokens Enhance Masked Autoencoders for Speech Modeling
ArXiv ID: 2601.19399
Authors: Samir Sadok, St\'ephane Lathuili`ere, Xavier Alameda-Pineda
Abstract: Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments the supervised attributes-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, emotion etc). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
Comment: Model Architecture & Representation Learning: masked autoencoder augmented with residual trainable tokens to capture unlabeled factors in speech.
Relevance: 8 Novelty: 7
27. Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes
ArXiv ID: 2601.19723
Authors: Yifan Wang, Jichen Zheng, Jingyuan Sun, Yunhao Zhang, Chunyu Ye, Jixing Li, Chengqing Zong, Shaonan Wang
Abstract: Large language models (LLMs) increasingly exhibit human-like linguistic behaviors and internal representations that they could serve as computational simulators of language cognition. We ask whether LLMs can be systematically manipulated to reproduce language-production impairments characteristic of aphasia following focal brain lesions. Such models could provide scalable proxies for testing rehabilitation hypotheses, and offer a controlled framework for probing the functional organization of language. We introduce a clinically grounded, component-level framework that simulates aphasia by selectively perturbing functional components in LLMs, and apply it to both modular Mixture-of-Experts models and dense Transformers using a unified intervention interface. Our pipeline (i) identifies subtype-linked components for Broca's and Wernicke's aphasia, (ii) interprets these components with linguistic probing tasks, and (iii) induces graded impairments by progressively perturbing the top-k subtype-linked components, evaluating outcomes with Western Aphasia Battery (WAB) subtests summarized by Aphasia Quotient (AQ). Across architectures and lesioning strategies, subtype-targeted perturbations yield more systematic, aphasia-like regressions than size-matched random perturbations, and MoE modularity supports more localized and interpretable phenotype-to-component mappings. These findings suggest that modular LLMs, combined with clinically informed component perturbations, provide a promising platform for simulating aphasic language production and studying how distinct language functions degrade under targeted disruptions.
Comment: Representation Learning and Model Architecture: component-level lesioning of MoE and dense Transformers to probe functional organization and interpretability of internal modules.
Relevance: 8 Novelty: 7
28. SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems
ArXiv ID: 2601.19395
Authors: Saeed Nasehi Basharzad, Farhana Choudhury, Egemen Tanin
Abstract: Real-world Vehicle Routing Problems (RWVRPs) require solving complex, sequence-dependent challenges at scale with constraints such as delivery time window, replenishment or recharging stops, asymmetric travel cost, etc. While recent neural methods achieve strong results on large-scale classical VRP benchmarks, they struggle to address RWVRPs because their strategies overlook sequence dependencies and underutilize edge-level information, which are precisely the characteristics that define the complexity of RWVRPs. We present SEAFormer, a novel transformer that incorporates both node-level and edge-level information in decision-making through two key innovations. First, our Clustered Proximity Attention (CPA) exploits locality-aware clustering to reduce the complexity of attention from $O(n^2)$ to $O(n)$ while preserving global perspective, allowing SEAFormer to efficiently train on large instances. Second, our lightweight edge-aware module captures pairwise features through residual fusion, enabling effective incorporation of edge-based information and faster convergence. Extensive experiments across four RWVRP variants with various scales demonstrate that SEAFormer achieves superior results over state-of-the-art methods. Notably, SEAFormer is the first neural method to solve 1,000+ node RWVRPs effectively, while also achieving superior performance on classic VRPs, making it a versatile solution for both research benchmarks and real-world applications.
Comment: Model Architecture and Efficiency: transformer with Clustered Proximity Attention reducing attention complexity from O(n^2) to O(n) and edge-aware module for decision making.
Relevance: 8 Novelty: 7
29. Revisiting Incremental Stochastic Majorization-Minimization Algorithms with Applications to Mixture of Experts
ArXiv ID: 2601.19811
Authors: TrungKhang Tran, TrungTin Nguyen, Gersende Fort, Tung Doan, Hien Duy Nguyen, Binh T. Nguyen, Florence Forbes, Christopher Drovandi
Abstract: Processing high-volume, streaming data is increasingly common in modern statistics and machine learning, where batch-mode algorithms are often impractical because they require repeated passes over the full dataset. This has motivated incremental stochastic estimation methods, including the incremental stochastic Expectation-Maximization (EM) algorithm formulated via stochastic approximation. In this work, we revisit and analyze an incremental stochastic variant of the Majorization-Minimization (MM) algorithm, which generalizes incremental stochastic EM as a special case. Our approach relaxes key EM requirements, such as explicit latent-variable representations, enabling broader applicability and greater algorithmic flexibility. We establish theoretical guarantees for the incremental stochastic MM algorithm, proving consistency in the sense that the iterates converge to a stationary point characterized by a vanishing gradient of the objective. We demonstrate these advantages on a softmax-gated mixture of experts (MoE) regression problem, for which no stochastic EM algorithm is available. Empirically, our method consistently outperforms widely used stochastic optimizers, including stochastic gradient descent, root mean square propagation, adaptive moment estimation, and second-order clipped stochastic optimization. These results support the development of new incremental stochastic algorithms, given the central role of softmax-gated MoE architectures in contemporary deep neural networks for heterogeneous data modeling. Beyond synthetic experiments, we also validate practical effectiveness on two real-world datasets, including a bioinformatics study of dent maize genotypes under drought stress that integrates high-dimensional proteomics with ecophysiological traits, where incremental stochastic MM yields stable gains in predictive performance.
Comment: Mixture-of-Experts: incremental stochastic MM algorithm with convergence guarantees for softmax-gated MoE training on streaming data.
Relevance: 8 Novelty: 7
30. Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget
ArXiv ID: 2601.18950
Authors: Harsh Vardhan, Arya Mazumdar
Abstract: Distributed high dimensional mean estimation is a common aggregation routine used often in distributed optimization methods. Most of these applications call for a communication-constrained setting where vectors, whose mean is to be estimated, have to be compressed before sharing. One could independently encode and decode these to achieve compression, but that overlooks the fact that these vectors are often close to each other. To exploit these similarities, recently Suresh et al., 2022, Jhunjhunwala et al., 2021, Jiang et al, 2023, proposed multiple correlation-aware compression schemes. However, in most cases, the correlations have to be known for these schemes to work. Moreover, a theoretical analysis of graceful degradation of these correlation-aware compression schemes with increasing dissimilarity is limited to only the $\ell_2$-error in the literature. In this paper, we propose four different collaborative compression schemes that agnostically exploit the similarities among vectors in a distributed setting. Our schemes are all simple to implement and computationally efficient, while resulting in big savings in communication. The analysis of our proposed schemes show how the $\ell_2$, $\ell_\infty$ and cosine estimation error varies with the degree of similarity among vectors.
Comment: HPC/Distributed training: collaborative compressors for communication-efficient distributed mean estimation with error analyses beyond l2.
Relevance: 8 Novelty: 7
31. The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
ArXiv ID: 2601.18832
Authors: Ren Zhuang, Ben Wang, Shuifa Sun
Abstract: Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@$k$ curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1--1.3 times.
Comment: Efficiency: memory-bounded test-time search with chunk-wise KV cache resets and geometric regularization to improve long-context reasoning coverage.
Relevance: 8 Novelty: 7
32. ASEHybrid: When Geometry Matters Beyond Homophily in Graph Neural Networks
ArXiv ID: 2601.18912
Authors: Shalima Binta Manir, Tim Oates
Abstract: Standard message-passing graph neural networks (GNNs) often struggle on graphs with low homophily, yet homophily alone does not explain this behavior, as graphs with similar homophily levels can exhibit markedly different performance and some heterophilous graphs remain easy for vanilla GCNs. Recent work suggests that label informativeness (LI), the mutual information between labels of adjacent nodes, provides a more faithful characterization of when graph structure is useful. In this work, we develop a unified theoretical framework that connects curvature-guided rewiring and positional geometry through the lens of label informativeness, and instantiate it in a practical geometry-aware architecture, ASEHybrid. Our analysis provides a necessary-and-sufficient characterization of when geometry-aware GNNs can improve over feature-only baselines: such gains are possible if and only if graph structure carries label-relevant information beyond node features. Theoretically, we relate adjusted homophily and label informativeness to the spectral behavior of label signals under Laplacian smoothing, show that degree-based Forman curvature does not increase expressivity beyond the one-dimensional Weisfeiler--Lehman test but instead reshapes information flow, and establish convergence and Lipschitz stability guarantees for a curvature-guided rewiring process. Empirically, we instantiate ASEHybrid using Forman curvature and Laplacian positional encodings and conduct controlled ablations on Chameleon, Squirrel, Texas, Tolokers, and Minesweeper, observing gains precisely on label-informative heterophilous benchmarks where graph structure provides label-relevant information beyond node features, and no meaningful improvement in high-baseline regimes.
Comment: Matches Model Architecture and Representation Learning: geometry-aware GNN with theoretical characterization (label informativeness) and curvature-guided rewiring.
Relevance: 8 Novelty: 7
33. Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise
ArXiv ID: 2601.19730
Authors: Hongxu Chen, Ke Wei, Xiaoming Yuan, Luo Luo
Abstract: The empirical evidence indicates that stochastic optimization with heavy-tailed gradient noise is more appropriate to characterize the training of machine learning models than that with standard bounded gradient variance noise. Most existing works on this phenomenon focus on the convergence of optimization errors, while the analysis for generalization bounds under the heavy-tailed gradient noise remains limited. In this paper, we develop a general framework for establishing generalization bounds under heavy-tailed noise. Specifically, we introduce a truncation argument to achieve the generalization error bound based on the algorithmic stability under the assumption of bounded $p$th centered moment with $p\in(1,2]$. Building on this framework, we further provide the stability and generalization analysis for several popular stochastic algorithms under heavy-tailed noise, including clipped and normalized stochastic gradient descent, as well as their mini-batch and momentum variants.
Comment: Matches Representation Learning/Training Dynamics: stability and generalization bounds for nonconvex optimization under heavy-tailed gradient noise across SGD variants.
Relevance: 8 Novelty: 7
34. Fixed Aggregation Features Can Rival GNNs
ArXiv ID: 2601.19449
Authors: Celia Rubio-Madrigal, Rebekka Burkholz
Abstract: Graph neural networks (GNNs) are widely believed to excel at node representation learning through trainable neighborhood aggregations. We challenge this view by introducing Fixed Aggregation Features (FAFs), a training-free approach that transforms graph learning tasks into tabular problems. This simple shift enables the use of well-established tabular methods, offering strong interpretability and the flexibility to deploy diverse classifiers. Across 14 benchmarks, well-tuned multilayer perceptrons trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks -- often using only mean aggregation. The only exceptions are the Roman Empire and Minesweeper datasets, which typically require unusually deep GNNs. To explain the theoretical possibility of non-trainable aggregations, we connect our findings to Kolmogorov-Arnold representations and discuss when mean aggregation can be sufficient. In conclusion, our results call for (i) richer benchmarks benefiting from learning diverse neighborhood aggregations, (ii) strong tabular baselines as standard, and (iii) employing and advancing tabular models for graph data to gain new insights into related tasks.
Comment: Matches Representation Learning/Architecture: fixed (non-trainable) neighborhood aggregation features rival GNNs; theoretical links to Kolmogorov–Arnold representations challenge prevailing assumptions.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.