Personalized Daily ArXiv Papers 2026-01-28

[gpt-5]	Prompt	Completion	Total
Token	44932	40485	85417
Cost	$0.06	$0.4	$0.46

Total arXiv papers: 623

Total scanned papers: 334

Total relevant papers: 34

Table of contents with paper titles:

Explicit Multi-head Attention for Inter-head Interaction in Large Language Models Authors: Runyu Peng, Yunhua Zhou, Demin Song, Kai Lv, Bo Wang, Qipeng Guo, Xipeng Qiu
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep Authors: Chen Chen, Lai Wei
Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective Authors: Fangzhou Wu, Sandeep Silwal, Qiuyi (Richard) Zhang
LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation Authors: Hongyaoxing Gu, Lijuan Hu, Liye Yu, Haowei Li, Fangfang Liu
StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths Authors: Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron
FloydNet: A Learning Paradigm for Global Relational Reasoning Authors: Jingcheng Yu, Mingliang Zeng, Qiwei Ye
Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining Authors: Yunwei Ren, Yatin Dandi, Florent Krzakala, Jason D. Lee
Self-Supervised Weight Templates for Scalable Vision Model Initialization Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching Authors: Runjia Zeng, Qifan Wang, Qiang Guan, Ruixiang Tang, Lifu Huang, Zhenting Wang, Xueling Zhang, Cheng Han, Dongfang Liu
Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers Authors: Bohan Hou, Hongyi Jin, Guanjie Wang, Jinqi Chen, Yaxing Cai, Lijie Yang, Zihao Ye, Yaoyao Ding, Ruihang Lai, Tianqi Chen
The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence Authors: Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability Authors: Shawn Im, Changdae Oh, Zhen Fang, Sharon Li
EPAS: Efficient Training with Progressive Activation Sharing Authors: Rezaul Karim, Maryam Dialameh, Yang Liu, Boxing Chen, Walid Ahmed
SONIC: Spectral Oriented Neural Invariant Convolutions Authors: Gijs Joppe Moens, Regina Beets-Tan, Eduardo H. P. Pooch
Revisiting Parameter Server in LLM Post-Training Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Chaoyi Ruan, Min Lin, Jialin Li
Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder Authors: Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen
How Is Uncertainty Propagated in Knowledge Distillation? Authors: Ziyao Cui, Jian Pei
GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs Authors: Wei Huang, Anda Cheng, Yinggui Wang
Is Finer Better? The Limits of Microscaling Formats in Large Language Models Authors: Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang
On the Expressiveness of State Space Models via Temporal Logics Authors: Eric Alsmann, Lowejatan Noori, Martin Lange
To Grok Grokking: Provable Grokking in Ridge Regression Authors: Mingyue Xu, Gal Vardi, Itay Safran
The Effect of Architecture During Continual Learning Authors: Allyson Hahn, Krishnan Raghavan
Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model Authors: Zhiyu An, Wan Du
A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy Authors: Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O'Brien, Ashwinee Panda, Ryan Lagasse
Smooth embeddings in contracting recurrent networks driven by regular dynamics: A synthesis for neural representation Authors: Vikas N. O'Reilly-Shah, Alessandro Maria Selvitella
Residual Tokens Enhance Masked Autoencoders for Speech Modeling Authors: Samir Sadok, Stéphane Lathuilière, Xavier Alameda-Pineda
Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes Authors: Yifan Wang, Jichen Zheng, Jingyuan Sun, Yunhao Zhang, Chunyu Ye, Jixing Li, Chengqing Zong, Shaonan Wang
SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems Authors: Saeed Nasehi Basharzad, Farhana Choudhury, Egemen Tanin
Revisiting Incremental Stochastic Majorization-Minimization Algorithms with Applications to Mixture of Experts Authors: TrungKhang Tran, TrungTin Nguyen, Gersende Fort, Tung Doan, Hien Duy Nguyen, Binh T. Nguyen, Florence Forbes, Christopher Drovandi
Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget Authors: Harsh Vardhan, Arya Mazumdar
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning Authors: Ren Zhuang, Ben Wang, Shuifa Sun
ASEHybrid: When Geometry Matters Beyond Homophily in Graph Neural Networks Authors: Shalima Binta Manir, Tim Oates
Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise Authors: Hongxu Chen, Ke Wei, Xiaoming Yuan, Luo Luo
Fixed Aggregation Features Can Rival GNNs Authors: Celia Rubio-Madrigal, Rebekka Burkholz

1. Explicit Multi-head Attention for Inter-head Interaction in Large Language Models

ArXiv ID: 2601.19611

Authors: Runyu Peng, Yunhua Zhou, Demin Song, Kai Lv, Bo Wang, Qipeng Guo, Xipeng Qiu

Abstract: In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them using low-rank "virtual heads". This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop for Olympiad-level mathematical benchmarks.

Comment: Model Architecture & Efficiency: explicit multi-head attention with head-level linear composition and normalization; enables KV-cache compression via low-rank virtual heads.

Relevance: 10 Novelty: 9
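
To make the head-level linear composition described in the MEA abstract above more concrete, here is a minimal plain-Python sketch. It is not the authors' implementation: the function names, the mixing matrix `mix`, the toy vectors, and the choice to normalize each recombined head to zero mean and unit variance are all assumptions made for illustration.

```python
from math import sqrt

def compose_heads(heads, mix):
    """Recombine per-head vectors: out[i] = sum_j mix[i][j] * heads[j]."""
    dim = len(heads[0])
    return [
        [sum(w * head[d] for w, head in zip(row, heads)) for d in range(dim)]
        for row in mix
    ]

def normalize_head(vec, eps=1e-6):
    """Normalize one head to zero mean and unit variance (assumed stand-in
    for the head-level normalization mentioned in the abstract)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / sqrt(var + eps) for v in vec]

# Two cached key heads (hypothetical values), each of dimension 4.
cached_keys = [[1.0, 0.0, 2.0, 1.0], [0.0, 3.0, 1.0, 1.0]]

# A 4x2 mixing matrix (hypothetical; learned in practice) that expands
# the two stored heads into four "virtual" heads.
mix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]

virtual_keys = [normalize_head(h) for h in compose_heads(cached_keys, mix)]
```

Only the two rows of `cached_keys` would need to be stored; the four virtual heads are rebuilt on demand, which is the shape of the 50% KV-cache saving the abstract reports.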

2. Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

ArXiv ID: 2601.19895

Authors: Chen Chen, Lai Wei

Abstract: Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.

Comment: Strong match to Model Architecture and training stability: Post-LN Transformer with Highway-style connections enabling stable ultra-deep training and improved depth scaling.

Relevance: 10 Novelty: 9

3. Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

ArXiv ID: 2601.18999

Authors: Fangzhou Wu, Sandeep Silwal, Qiuyi (Richard) Zhang

Abstract: KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to 6.92× in cache hit rate, 11.96× reduction in latency, 14.06× reduction in time-to-first-token (TTFT), and 77.4% increase in throughput over the state-of-the-art methods. Our code is available at https://github.com/fzwark/KVRouting.

Comment: High Performance Computing & Efficiency: unified model for KV-cache eviction and query routing with randomized eviction and learning-based routing; theoretical guarantees and large speedups.

Relevance: 10 Novelty: 8
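
The abstract above does not spell out its randomized eviction rule. As a rough illustration of what "provably competitive randomized eviction" can mean, the sketch below implements the classical randomized marking algorithm from online paging. It is a stand-in, not the paper's algorithm, and the class name, the `seed` parameter, and the string keys (standing for cached prefixes) are hypothetical.

```python
import random

class MarkingCache:
    """Classical randomized marking cache (a stand-in, not the paper's policy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.marked = set()      # entries used in the current phase
        self.unmarked = set()    # entries eligible for eviction
        self.rng = random.Random(seed)
        self.hits = 0
        self.misses = 0

    def access(self, key):
        """Return True on a cache hit, False on a miss."""
        if key in self.marked:
            self.hits += 1
            return True
        if key in self.unmarked:
            self.hits += 1
            self.unmarked.remove(key)
            self.marked.add(key)
            return True
        self.misses += 1
        if len(self.marked) + len(self.unmarked) >= self.capacity:
            if not self.unmarked:
                # Every entry is marked: start a new phase by unmarking all.
                self.unmarked, self.marked = self.marked, set()
            # Evict a uniformly random unmarked entry.
            victim = self.rng.choice(sorted(self.unmarked))
            self.unmarked.remove(victim)
        self.marked.add(key)
        return False

demo = MarkingCache(capacity=2)
trace = [demo.access(prefix) for prefix in ["a", "b", "a", "c"]]
```

Unlike LRU, the victim is drawn at random among entries not used in the current phase, so an adversarial arrival order cannot reliably force a miss on every request.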

4. LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation

ArXiv ID: 2601.19675

Authors: Hongyaoxing Gu, Lijuan Hu, Liye Yu, Haowei Li, Fangfang Liu

Abstract: Post-training quantization (PTQ) enables effective model compression while preserving relatively high accuracy. Current weight-only PTQ methods primarily focus on the challenging sub-3-bit regime, where approaches often suffer significant accuracy degradation, typically requiring fine-tuning to achieve competitive performance. In this work, we revisit the fundamental characteristics of weight quantization and analyze the challenges in quantizing the residual matrix under low-rank approximation. We propose LoPRo, a novel fine-tuning-free PTQ algorithm that enhances residual matrix quantization by applying block-wise permutation and Walsh-Hadamard transformations to rotate columns of similar importance, while explicitly preserving the quantization accuracy of the most salient column blocks. Furthermore, we introduce a mixed-precision fast low-rank decomposition based on rank-1 sketch (R1SVD) to further minimize quantization costs. Experiments demonstrate that LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization, achieving accuracy comparable to fine-tuning baselines. Specifically, LoPRo achieves state-of-the-art quantization accuracy on LLaMA-2 and LLaMA-3 series models while delivering up to a 4× speedup. In the MoE model Mixtral-8x7B, LoPRo completes quantization within 2.5 hours, simultaneously reducing perplexity by 0.4 and improving accuracy by 8%. Moreover, compared to other low-rank quantization methods, LoPRo achieves superior accuracy with a significantly lower rank, while maintaining high inference efficiency and minimal additional latency.

Comment: Compression/Efficiency: fine-tuning-free post-training quantization with low-rank decomposition and permuted block-wise rotations (2–3 bit regime).

Relevance: 10 Novelty: 8
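
The LoPRo abstract above relies on block-wise Walsh-Hadamard rotations. The sketch below shows only that building block: an orthonormal fast Walsh-Hadamard transform applied per block, which spreads an isolated outlier across the block and so shrinks the range a quantizer has to cover. The permutation step, the low-rank decomposition, and the handling of salient blocks are not modeled, and the function names and toy row are assumptions.

```python
from math import sqrt

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / sqrt(n)
    return [x * scale for x in out]

def rotate_blocks(row, block_size):
    """Apply the transform independently to each consecutive block of `row`."""
    rotated = []
    for start in range(0, len(row), block_size):
        rotated.extend(fwht(row[start:start + block_size]))
    return rotated

# A toy weight row with one outlier per block of four.
row = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, -4.0, 0.0]
rotated = rotate_blocks(row, block_size=4)
# rotated == [4.0, 4.0, 4.0, 4.0, -2.0, -2.0, 2.0, 2.0]
```

The largest magnitude in each block is halved while the squared norm is unchanged, and applying `rotate_blocks` a second time recovers the original row because the normalized transform is its own inverse.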

5. StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths

ArXiv ID: 2601.19320

Authors: Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron

Abstract: Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra-low bitwidths remains challenging. Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low-bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE as the latter arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT at 2-4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead against standard QAT techniques. Our code is available at https://github.com/microsoft/StableQAT.

Comment: Strong match to Model Compression/Efficiency: a theoretically grounded surrogate for ultra-low-bit Quantization-Aware Training that generalizes STE and stabilizes training.

Relevance: 10 Novelty: 8
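
The StableQAT abstract above mentions a surrogate derived from a Fourier analysis of the rounding operator but does not give its form. The sketch below only illustrates the underlying textbook fact: `x - round(x)` is a sawtooth with Fourier series `sum_k (-1)^(k+1) sin(2*pi*k*x) / (pi*k)`. Using the derivative of the truncated series as a backward-pass surrogate is an assumption made here for illustration and is not claimed to be the paper's surrogate family; the function names are hypothetical.

```python
from math import cos, pi, sin

def round_fourier(x, terms):
    """Truncated Fourier approximation of round(x) using `terms` harmonics."""
    saw = sum(
        (-1) ** (k + 1) * sin(2 * pi * k * x) / (pi * k)
        for k in range(1, terms + 1)
    )
    return x - saw

def round_fourier_grad(x, terms):
    """Derivative of round_fourier with respect to x."""
    return 1.0 - 2.0 * sum(
        (-1) ** (k + 1) * cos(2 * pi * k * x) for k in range(1, terms + 1)
    )

# With zero harmonics the approximation is the identity, so its gradient
# is the constant 1 used by the straight-through estimator (STE).
ste_like = round_fourier_grad(0.3, terms=0)

# With many harmonics the approximation approaches true rounding
# away from the half-integer jump points.
approx = round_fourier(1.3, terms=200)
```

In this toy construction the number of harmonics controls how far the smooth gradient departs from the STE, which is one way to read the abstract's statement that the STE arises as a special case.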

6. FloydNet: A Learning Paradigm for Global Relational Reasoning

ArXiv ID: 2601.19094

Authors: Jingcheng Yu, Mingliang Zeng, Qiwei Ye

Abstract: Developing models capable of complex, multi-step reasoning is a central goal in artificial intelligence. While representing problems as graphs is a powerful approach, Graph Neural Networks (GNNs) are fundamentally constrained by their message-passing mechanism, which imposes a local bottleneck that limits global, holistic reasoning. We argue that dynamic programming (DP), which solves problems by iteratively refining a global state, offers a more powerful and suitable learning paradigm. We introduce FloydNet, a new architecture that embodies this principle. In contrast to local message passing, FloydNet maintains a global, all-pairs relationship tensor and learns a generalized DP operator to progressively refine it. This enables the model to develop a task-specific relational calculus, providing a principled framework for capturing long-range dependencies. Theoretically, we prove that FloydNet achieves 3-WL (2-FWL) expressive power, and its generalized form aligns with the k-FWL hierarchy. FloydNet demonstrates state-of-the-art performance across challenging domains: it achieves near-perfect scores (often >99%) on the CLRS-30 algorithmic benchmark, finds exact optimal solutions for the general Traveling Salesman Problem (TSP) at rates significantly exceeding strong heuristics, and empirically matches the 3-WL test on the BREC benchmark. Our results establish this learned, DP-style refinement as a powerful and practical alternative to message passing for high-level graph reasoning.

Comment: Model Architecture: replaces local message passing with a learned DP-style global refinement operator; proven expressivity (3-WL/2-FWL).

Relevance: 9 Novelty: 9
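
As background for the FloydNet abstract above, this is the classical Floyd-Warshall algorithm: a hand-written DP that refines an all-pairs table by routing through one intermediate node at a time. Reading FloydNet as a learned generalization of this fixed min-plus update is an assumption based on the model's name and the abstract's wording; the code shows only the classical algorithm.

```python
INF = float("inf")

def floyd_warshall(dist):
    """All-pairs shortest paths; dist[i][j] is the edge weight or INF."""
    n = len(dist)
    d = [row[:] for row in dist]  # work on a copy of the all-pairs table
    for k in range(n):            # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d

# Directed toy graph: 0 -> 1 (4), 1 -> 2 (1), 0 -> 2 (10).
graph = [
    [0, 4, 10],
    [INF, 0, 1],
    [INF, INF, 0],
]
shortest = floyd_warshall(graph)
# shortest[0][2] == 5, because the path 0 -> 1 -> 2 beats the direct edge.
```

Each entry `(i, j)` is updated from the pair of entries `(i, k)` and `(k, j)`, so information flows over the whole pairwise table rather than only between neighboring nodes as in message passing.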

7. Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining

ArXiv ID: 2601.19756

Authors: Yunwei Ren, Yatin Dandi, Florent Krzakala, Jason D. Lee

Abstract: The empirical success of deep learning is often attributed to deep networks' ability to exploit hierarchical structure in data, constructing increasingly complex features across layers. Yet despite substantial progress in deep learning theory, most optimization results still focus on networks with only two or three layers, leaving the theoretical understanding of hierarchical learning in genuinely deep models limited. This leads to a natural question: can we prove that deep networks, trained by gradient-based methods, can efficiently exploit hierarchical structure? In this work, we consider Random Hierarchy Models -- a hierarchical context-free grammar introduced by arXiv:2307.02129 and conjectured to separate deep and shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class. Our proof builds on a general observation: if intermediate layers can receive clean signal from the labels and the relevant features are weakly identifiable, then layerwise training each individual layer suffices to hierarchically learn the target function.

Comment: Deep Learning Theory: provable hierarchical learning in deep conv nets on Random Hierarchy Models via layerwise training (shallow-to-deep chaining).

Relevance: 9 Novelty: 9

8. Self-Supervised Weight Templates for Scalable Vision Model Initialization

ArXiv ID: 2601.19694

Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng

Abstract: The increasing scale and complexity of modern model parameters underscore the importance of pre-trained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on classification, detection, segmentation, and generation tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.

Comment: Model Compression/Efficiency & Architecture: Tucker-factorized shared weight template with size-specific scalers enables scalable initialization across depths/widths; includes width-wise stochastic scaling.

Relevance: 9 Novelty: 8

9. TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching

ArXiv ID: 2601.19739

Authors: Runjia Zeng, Qifan Wang, Qiang Guan, Ruixiang Tang, Lifu Huang, Zhenting Wang, Xueling Zhang, Cheng Han, Dongfang Liu

Abstract: Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior arts offer various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: https://runjia.tech/iclr_tokenseek/

Comment: Model Compression and Efficiency/HPC: instance-aware token seeking/ditching to cut activation memory during fine-tuning with large savings.

Relevance: 9 Novelty: 8

10. Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

ArXiv ID: 2601.19092

Authors: Bohan Hou, Hongyi Jin, Guanjie Wang, Jinqi Chen, Yaxing Cai, Lijie Yang, Zihao Ye, Yaoyao Ding, Ruihang Lai, Tianqi Chen

Abstract: Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach can bring performance close to hand-tuned kernels across the latest GPU devices, multi-device environments, and accelerator backends.

Comment: High Performance Computing/Systems: unified layout abstraction and compiler DSL for distribution, tiling, and sharding across device meshes and memory hierarchies.

Relevance: 9 Novelty: 8

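11. The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence

ArXiv ID: 2601.19597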

Authors: Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi

Abstract: While InfoNCE powers modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment--uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying "uniformity" as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.

Comment: Representation Learning Theory: measure-theoretic analysis of contrastive learning geometry beyond alignment–uniformity, including multimodal divergence effects.

Relevance: 9 Novelty: 8
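
For readers who want the objective behind the contrastive-learning abstract above (arXiv 2601.19597) in concrete form, here is a minimal sketch of the standard one-directional InfoNCE loss on a square similarity matrix. The function name, the `temperature` default, and the toy matrices are assumptions; the paper's measure-theoretic analysis is not reproduced.

```python
from math import exp, log

def info_nce(sim, temperature=1.0):
    """Mean InfoNCE loss; sim[i][i] is the positive pair for row i."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + log(sum(exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# No alignment: every pair is equally similar, so the loss equals log(2).
uniform = info_nce([[0.0, 0.0], [0.0, 0.0]])

# Strong alignment: positives score far above negatives, so the loss is near 0.
aligned = info_nce([[5.0, 0.0], [0.0, 5.0]])
```

The symmetric multimodal variant discussed in the abstract would average this quantity over both matching directions, that is, over the rows of the matrix and over the rows of its transpose.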

12. How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

ArXiv ID: 2601.19208

Authors: Shawn Im, Changdae Oh, Zhen Fang, Sharon Li

Abstract: Semantic associations such as the link between "bird" and "flew" are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem shines light on interpreting the learned associations in transformers.

Comment: Representation Learning/Mechanistic Interpretability: closed-form early-training weight characterizations in Transformers via gradient leading terms.

Relevance: 9 Novelty: 8

13. EPAS: Efficient Training with Progressive Activation Sharing

ArXiv ID: 2601.19089

Authors: Rezaul Karim, Maryam Dialameh, Yang Liu, Boxing Chen, Walid Ahmed

Abstract: We present a novel method for Efficient training with Progressive Activation Sharing (EPAS). This method bridges the progressive training paradigm with the phenomenon of redundant QK (or KV) activations across deeper layers of transformers. EPAS gradually grows a sharing region during training by switching decoder layers to activation sharing mode. This results in a throughput increase due to reduced compute. To utilize deeper layer redundancy, the sharing region starts from the deep end of the model and grows towards the shallow end. The EPAS-trained models allow for variable region lengths of activation sharing for different compute budgets during inference. Empirical evaluations with QK activation sharing in LLaMA models ranging from 125M to 7B parameters show up to an 11.1% improvement in training throughput and up to a 29% improvement in inference throughput while maintaining a similar loss curve to the baseline models. Furthermore, applying EPAS in continual pretraining to transform TinyLLaMA into an attention-sharing model yields up to a 10% improvement in average accuracy over state-of-the-art methods, emphasizing the significance of progressive training in cross-layer activation sharing models.

Comment: Efficiency/HPC: progressive activation (QK/KV) sharing across Transformer layers to boost training and inference throughput with controllable sharing at inference.

Relevance: 9 Novelty: 8
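
The EPAS abstract above describes a sharing region that starts at the deep end of the model and grows toward the shallow end during training. A minimal sketch of such a schedule follows. The linear growth rule, the function name, and every parameter are assumptions, since the abstract does not give the actual schedule.

```python
def shared_layers(step, total_steps, num_layers, max_shared):
    """Layer indices in sharing mode at `step`, growing from the deepest layer upward.

    Assumes a linear schedule that reaches `max_shared` layers at `total_steps`.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    count = int(progress * max_shared)
    return list(range(num_layers - count, num_layers))

# A 12-layer model in which at most the 6 deepest layers may share activations.
start = shared_layers(0, 100, num_layers=12, max_shared=6)     # []
halfway = shared_layers(50, 100, num_layers=12, max_shared=6)  # [9, 10, 11]
end = shared_layers(100, 100, num_layers=12, max_shared=6)     # [6, 7, 8, 9, 10, 11]
```

At inference, the same function could be called with a smaller or larger `max_shared` to trade accuracy for throughput, which mirrors the variable region lengths mentioned in the abstract.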

14. SONIC: Spectral Oriented Neural Invariant Convolutions

ArXiv ID: 2601.19884

Authors: Gijs Joppe Moens, Regina Beets-Tan, Eduardo H. P. Pooch

Abstract: Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.

Comment: Strong match to Model Architecture: continuous, orientation-aware spectral parameterization of convolutional operators with global receptive fields and resolution adaptivity.

Relevance: 9 Novelty: 8

15. Revisiting Parameter Server in LLM Post-Training

ArXiv ID: 2601.19362

Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Chaoyi Ruan, Min Lin, Jialin Li

Abstract: Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. Our implementation of ODC and integration with FSDP is open-sourced at https://github.com/sail-sg/odc.

Comment: Systems-level innovation for distributed LLM training: replaces collective ops with point-to-point in FSDP (On-Demand Communication) to handle workload imbalance—fits the HPC/distributed training criterion.

Relevance: 9 Novelty: 8
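
A toy cost model can show why the barrier frequency discussed in the ODC abstract above matters under imbalanced workloads. It is purely illustrative: communication cost is ignored, each "step" stands in for one synchronization point, and the function names and timings are hypothetical.

```python
def time_with_per_step_barrier(work):
    """work[d][s] is the compute time of device d at step s.

    All devices wait for the slowest one after every step.
    """
    steps = len(work[0])
    return sum(max(device[s] for device in work) for s in range(steps))

def time_with_single_barrier(work):
    """Devices run all their steps independently and sync once at the end."""
    return max(sum(device) for device in work)

# Two devices whose sequence lengths are imbalanced in opposite ways.
work = [[3.0, 1.0], [1.0, 3.0]]

frequent = time_with_per_step_barrier(work)  # 3.0 + 3.0 = 6.0
single = time_with_single_barrier(work)      # max(4.0, 4.0) = 4.0
```

Both devices carry the same total work, yet frequent barriers make each step cost as much as its slowest participant; a single barrier lets the imbalance average out.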

16. Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder

ArXiv ID: 2601.19179

Authors: Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen

Abstract: Autoencoders have long been considered a nonlinear extension of Principal Component Analysis (PCA). Prior studies have demonstrated that linear autoencoders (LAEs) can recover the ordered, axis-aligned principal components of PCA by incorporating non-uniform $\ell_2$ regularization or by adjusting the loss function. However, these approaches become insufficient in the nonlinear setting, as the remaining variance cannot be properly captured independently of the nonlinear mapping. In this work, we propose a novel autoencoder framework that integrates non-uniform variance regularization with an isometric constraint. This design serves as a natural generalization of PCA, enabling the model to preserve key advantages, such as ordered representations and variance retention, while remaining effective for nonlinear dimensionality reduction tasks.

Comment: Model Architecture & Representation Learning: proposes an autoencoder with non-uniform variance regularization and isometric constraint to recover ordered latent components (PCA generalization).

Relevance: 9 Novelty: 7

17. How Is Uncertainty Propagated in Knowledge Distillation?

ArXiv ID: 2601.18909

Authors: Ziyao Cui, Jian Pei

Abstract: Knowledge distillation transfers behavior from a teacher to a student model, but the process is inherently stochastic: teacher outputs, student training, and student inference can all be random. Collapsing these uncertainties to a single point estimate can distort what is learned. We systematically study how uncertainty propagates through knowledge distillation across three representative model classes--linear regression, feed-forward neural networks, and large language models (LLMs)--and propose simple corrections. We distinguish inter-student uncertainty (variance across independently distilled students) from intra-student uncertainty (variance of a single student's predictive distribution), showing that standard single-response knowledge distillation suppresses intra-student variance while leaving substantial inter-student variability. To address these mismatches, we introduce two variance-aware strategies: averaging multiple teacher responses, which reduces noise at rate $O(1/k)$, and variance-weighting, which combines teacher and student estimates via inverse-variance weighting to yield a minimum-variance estimator. We provide formal guarantees in linear regression, validate the methods in neural networks, and demonstrate empirical gains in LLM distillation, including reduced systematic noise and hallucination. These results reframe knowledge distillation as an uncertainty transformation and show that variance-aware distillation produces more stable students that better reflect teacher uncertainty.

Comment: Model Compression and Efficiency: variance-aware knowledge distillation (multi-response averaging and inverse-variance weighting) with formal analysis of uncertainty propagation.

Relevance: 9 Novelty: 7
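
The two corrections named in the abstract rest on standard statistics, which can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the Gaussian noise model, and the idea of treating teacher responses as independent scalar draws are all assumptions made here for the example.

```python
import random
import statistics


def inverse_variance_combine(est_a, var_a, est_b, var_b):
    """Combine two independent, unbiased estimates of the same quantity."""
    # Each estimate is weighted by the reciprocal of its variance.
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    combined_var = 1.0 / (w_a + w_b)  # never larger than min(var_a, var_b)
    return combined, combined_var


def variance_of_mean(k, sigma=1.0, trials=4000, seed=0):
    """Empirical variance of the mean of k noisy draws.

    The draws stand in for hypothetical teacher responses (an assumption).
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(k))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)
```

Under these assumptions, `variance_of_mean(4)` should come out at roughly a quarter of `variance_of_mean(1)`, matching the $O(1/k)$ rate, and `inverse_variance_combine` returns a variance no larger than the smaller of its two inputs.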

18. GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs

ArXiv ID: 2601.19503

Authors: Wei Huang, Anda Cheng, Yinggui Wang

Abstract: Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. Meanwhile, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers. Only elements with the same sign are merged to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight downstream datasets, including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner achieves a parameter reduction of 40% with only a 0.99% decrease in accuracy. Our code is publicly available.

Comment: Model Compression and Efficiency: gradient-guided layer pruning and merging for LLMs enabling efficient fine-tuning and inference.

Relevance: 9 Novelty: 7

19. Is Finer Better? The Limits of Microscaling Formats in Large Language Models

ArXiv ID: 2601.19026

Authors: Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang

Abstract: Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report the emergence of a surprising behavior associated with microscaling quantization, whereby the output of a quantized model degrades as block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several Large Language Models and identify the conditions driving the anomalous behavior. Theoretically, we lay down a framework showing remarkable agreement with experimental data from pretrained model distributions and ideal ones. Overall, we show that the anomaly is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. Based on these insights, we propose the use of FP8 unsigned E5M3 (UE5M3) as a novel hardware-friendly format for the scales in FP4 microscaling data types. We demonstrate that UE5M3 achieves comparable performance to the conventional FP8 unsigned E4M3 scales while obviating the need for global scaling operations on weights and activations.

Comment: Strong match to Model Compression/Efficiency: analyzes limits of microscaling quantization and proposes a hardware-friendly FP8 UE5M3 scale format for FP4 data types.

Relevance: 9 Novelty: 7
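
The baseline expectation the abstract refers to, that smaller blocks represent tensor elements better, can be sketched with generic per-block absmax scaling. This is an illustration only and not the FP4/FP8 microscaling formats studied in the paper: the function names, the symmetric integer grid with 7 levels, and the unquantized floating-point scales are all assumptions made here. The paper's reported anomaly arises when the scales themselves are quantized with limited dynamic range, which this sketch deliberately omits.

```python
def quantize_blockwise(values, block_size, levels=7):
    """Quantize then dequantize using one absmax scale per block.

    Elements are rounded to a symmetric integer grid in [-levels, levels].
    The scale is kept in full floating point (an assumption of this sketch).
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        scale = absmax / levels if absmax > 0 else 1.0
        for v in block:
            q = max(-levels, min(levels, round(v / scale)))
            out.append(q * scale)
    return out


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical tensor slice: two small values next to two large ones.
x = [0.1, 0.3, 8.0, 3.0]
err_block4 = mean_squared_error(x, quantize_blockwise(x, block_size=4))
err_block2 = mean_squared_error(x, quantize_blockwise(x, block_size=2))
```

With one block of four, the large entries set the scale and the small ones round to zero; with blocks of two, the small entries get their own scale, so `err_block2` is lower than `err_block4` for this input.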

20. On the Expressiveness of State Space Models via Temporal Logics

ArXiv ID: 2601.19467

Authors: Eric Alsmann, Lowejatan Noori, Martin Lange

Abstract: We investigate the expressive power of state space models (SSM), which have recently emerged as a potential alternative to transformer architectures in large language models. Building on recent work, we analyse SSM expressiveness through fragments and extensions of linear temporal logic over finite traces. Our results show that the expressive capabilities of SSM vary substantially depending on the underlying gating mechanism. We further distinguish between SSM operating over fixed-width arithmetic (quantised models), whose expressive power remains within regular languages, and SSM with unbounded precision, which can capture counting properties and non-regular languages. In addition, we provide a systematic comparison between these different SSM variants and known results on transformers, thereby clarifying how the two architectures relate in terms of expressive power.

Comment: Strong match to Model Architecture theory: expressiveness analysis of State Space Models via temporal logic, including quantized vs unbounded precision and comparison to transformers.

Relevance: 9 Novelty: 7

21. To Grok Grokking: Provable Grokking in Ridge Regression

ArXiv ID: 2601.19791

Authors: Mingyue Xu, Gal Vardi, Itay Safran

Abstract: We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.

Comment: Representation Learning: theoretical training-dynamics analysis of grokking with provable bounds on generalization delay in ridge regression.

Relevance: 8 Novelty: 8
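
The training setup named in the abstract, gradient descent with weight decay on an over-parameterized linear regression, can be sketched on a toy instance. This is a minimal illustration under assumptions made here (one sample, two parameters, hypothetical function names and hyperparameters); it only checks that the iterates approach the ridge minimizer and does not reproduce the grokking stages or the paper's bounds.

```python
def ridge_gd(x, y, lam, lr, steps):
    """Gradient descent on 0.5*(w.x - y)^2 + 0.5*lam*||w||^2 for one sample."""
    w = [0.0] * len(x)
    for _ in range(steps):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        # The data-fit gradient is residual * x; weight decay contributes lam * w.
        w = [wi - lr * (residual * xi + lam * wi) for wi, xi in zip(w, x)]
    return w


def ridge_closed_form(x, y, lam):
    """Minimizer of the same objective: w = x * y / (||x||^2 + lam)."""
    norm_sq = sum(xi * xi for xi in x)
    return [xi * y / (norm_sq + lam) for xi in x]


# One sample, two parameters: more parameters than data points.
w_gd = ridge_gd([1.0, 1.0], 2.0, lam=0.1, lr=0.1, steps=500)
w_star = ridge_closed_form([1.0, 1.0], 2.0, lam=0.1)
```

After 500 steps with these settings, `w_gd` agrees with `w_star` to within numerical precision.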

22. The Effect of Architecture During Continual Learning

ArXiv ID: 2601.19766

Authors: Allyson Hahn, Krishnan Raghavan

Abstract: Continual learning is a challenge for models with static architecture, as they fail to adapt when data distributions evolve across tasks. We introduce a mathematical framework that jointly models architecture and weights in a Sobolev space, enabling a rigorous investigation into the role of neural network architecture in continual learning and its effect on the forgetting loss. We derive necessary conditions for the continual learning solution and prove that learning only model weights is insufficient to mitigate catastrophic forgetting under distribution shifts. Consequently, we prove that by learning the architecture and weights simultaneously at each task, we can reduce catastrophic forgetting. To learn weights and architecture simultaneously, we formulate continual learning as a bilevel optimization problem: the upper level selects an optimal architecture for a given task, while the lower level computes optimal weights via dynamic programming over all tasks. To solve the upper level problem, we introduce a derivative-free direct search algorithm to determine the optimal architecture. Once found, we must transfer knowledge from the current architecture to the optimal one. However, the optimal architecture will result in a weight parameter space different from that of the current architecture (i.e., dimensions of weight matrices will not match). To bridge the dimensionality gap, we develop a low-rank transfer mechanism to map knowledge across architectures of mismatched dimensions. Empirical studies across regression and classification problems, including feedforward, convolutional, and graph neural networks, demonstrate that learning the optimal architecture and weights simultaneously yields substantially improved performance (up to two orders of magnitude), reduced forgetting, and enhanced robustness to noise compared with static architecture approaches.

Comment: Model Architecture/Representation Learning: joint optimization of architecture and weights to mitigate forgetting; bilevel formulation with low-rank knowledge transfer.

Relevance: 8 Novelty: 8

23. Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model

ArXiv ID: 2601.18858

Authors: Zhiyu An, Wan Du

Abstract: Compositional generalization--the ability to interpret novel combinations of familiar components--remains a persistent challenge for neural networks. Behavioral evaluations reveal when models fail but offer limited insight into why failures arise at the representational level. We introduce Homomorphism Error (HE), a structural metric that quantifies deviations from approximate homomorphisms between the expression algebra and a model's hidden-state space. We instantiate HE for two compositional operators in SCAN-style tasks: modifier HE for unary composition and sequence HE for binary composition, measured by learning representation-level operators that predict composed representations from their parts. Across controlled experiments with small decoder-only Transformers, HE predicts out-of-distribution (OOD) compositional generalization under noise injection, achieving R^2 = 0.73 correlation between modifier HE and OOD accuracy. Ablations show that model depth has minimal effect on either HE or OOD accuracy, training data coverage exhibits threshold effects (insufficient coverage sharply increases HE and degrades OOD performance), and randomly inserted noise tokens systematically increase HE. Finally, we test whether HE-regularized training improves OOD accuracy. Experiments show that explicitly enforcing low modifier HE during training significantly reduces modifier HE (p = $1.1 \times 10^{-4}$) and sequence HE (p = 0.001) and yields a statistically significant improvement in OOD accuracy (p = 0.023). Together, these results indicate the potential of HE to be both a diagnostic and an actionable training signal for improving compositional generalization. Code to reproduce our experiments is open-sourced.

Comment: Representation Learning: introduces a structural metric (Homomorphism Error) on Transformer hidden states and uses it as a training regularizer to improve compositional generalization.