Personalized Daily ArXiv Papers 2026-01-30

[gpt-5]	Prompt	Completion	Total
Token	94363	68277	162640
Cost	$0.12	$0.68	$0.80

Total arXiv papers: 770

Total scanned papers: 451

Total relevant papers: 68

Table of contents with paper titles:

L$^3$: Large Lookup Layers Authors: Albert Tseng, Christopher De Sa
HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning Authors: Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng
Scaling Embeddings Outperforms Scaling Experts in Language Models Authors: Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai
HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs Authors: Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang
Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves Authors: Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn, Georg Groh, Kristian Kersting
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling Authors: Yuchen Yang, Yaru Zhao, Pu Yang, Shaowei Wang, Zhi-Hua Zhou
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation Authors: Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang
ECO: Quantized Training without Full-Precision Master Weights Authors: Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni
Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold Authors: Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello
L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts Authors: Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama
Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication Authors: Paul Tarau
High-dimensional learning dynamics of multi-pass Stochastic Gradient Descent in multi-index models Authors: Zhou Fan, Leda Wang
Perceptrons and localization of attention's mean-field landscape Authors: Antonio Álvarez-López, Borjan Geshkovski, Domènec Ruiz-Balet
PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training Authors: Shenghao Yang, Zhichao Wang, Oleg Balabanov, N. Benjamin Erichson, Michael W. Mahoney
DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training Authors: Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo
Can Local Learning Match Self-Supervised Backpropagation? Authors: Wu S. Zihan, Ariane Delrocq, Wulfram Gerstner, Guillaume Bellec
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips Authors: Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang
Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts Authors: Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen
Value-Based Pre-Training with Downstream Feedback Authors: Shuqi Ke, Giulia Fanti
Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization Authors: Jialuo He, Huangxun Chen
Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning Authors: Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan
The Depth Delusion: Why Transformers Should Be Wider, Not Deeper Authors: Md Muhtasim Munif Fahim, Md Rezaul Karim
A Separable Architecture for Continuous Token Representation in Language Models Authors: Reza T. Batley, Sourav Saha
LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models Authors: Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen
Clustering in Deep Stochastic Transformers Authors: Lev Fedorov, Michaël E. Sander, Romuald Elie, Pierre Marion, Mathieu Laurière
Soft Quantization: Model Compression Via Weight Coupling Authors: Daniel T. Bernstein, Luca Di Carlo, David Schwab
GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Chi Wang, Yuehao Wang, Jing Xiong, Liliang Ren, Bo Peng, Qingmei Wang, Xiaoran Shang, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu
Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers Authors: Evandro S. Ortigossa, Eran Segal
Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data Authors: Grzegorz Stefanski, Alberto Presta, Michal Byra
Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling Authors: Binglei Lou, Haoran Wu, Yao Lai, Jiayi Nie, Can Xiao, Xuan Guo, Rika Antonova, Robert Mullins, Aaron Zhao
Fast and Geometrically Grounded Lorentz Neural Networks Authors: Robert van der Klis, Ricardo Chávez Torres, Max van Spengler, Yuhui Ding, Thomas Hofmann, Pascal Mettes
Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu
Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
LoRA and Privacy: When Random Projections Help (and When They Don't) Authors: Yaxi Hu, Johanna Düngler, Bernhard Schölkopf, Amartya Sanyal
Representation Unlearning: Forgetting through Information Compression Authors: Antonio Almudévar, Alfonso Ortega
Procedural Pretraining: Warming Up Language Models with Abstract Data Authors: Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney
CORDS: Continuous Representations of Discrete Structures Authors: Tin Hadži Veljković, Erik Bekkers, Michael Tiemann, Jan-Willem van de Meent
TRACE: Trajectory Recovery for Continuous Mechanism Evolution in Causal Representation Learning Authors: Shicheng Fan, Kun Zhang, Lu Cheng
Order-Optimal Sample Complexity of Rectified Flows Authors: Hari Krishna Sahoo, Mudit Gaur, Vaneet Aggarwal
Bridging Functional and Representational Similarity via Usable Information Authors: Antonio Almudévar, Alfonso Ortega
$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval Authors: Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See
Dynamics Reveals Structure: Challenging the Linear Propagation Assumption Authors: Hoyeon Chang, Bálint Mucsányi, Seong Joon Oh
Identifiable Equivariant Networks are Layerwise Equivariant Authors: Vahid Shahverdi, Giovanni Luca Marchetti, Georg Bökman, Kathlén Kohn
From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning Authors: Haoran Tang, Rajiv Khanna
Grounding and Enhancing Informativeness and Utility in Dataset Distillation Authors: Shaobo Wang, Yantai Yang, Guo Chen, Peiru Li, Kaixin Li, Yufa Zhou, Zhaorun Chen, Linfeng Zhang
Multi-Modal Time Series Prediction via Mixture of Modulated Experts Authors: Lige Zhang, Ali Maatouk, Jialin Chen, Leandros Tassiulas, Rex Ying
Hebbian Learning with Global Direction Authors: Wenjia Hua, Kejie Zhao, Luziwei Leng, Ran Cheng, Yuxin Ma, Qinghai Guo
KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices Authors: Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Danilo Mandic
XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision Authors: Alexandre Myara, Nicolas Bourriez, Thomas Boyer, Thomas Lemercier, Ihab Bendidi, Auguste Genovesio
Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning Authors: Anders Gjølbye, Ida Kargaard, Emma Kargaard, Lars Kai Hansen
FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer Authors: Chenrui Xu, Wenjing Yan, Ying-Jun Angela Zhang
Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving Authors: Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou
MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference Authors: Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian
How Expressive Are Graph Neural Networks in the Presence of Node Identifiers? Authors: Arie Soeteman, Michael Benedikt, Martin Grohe, Balder ten Cate
Amortized Spectral Kernel Discovery via Prior-Data Fitted Network Authors: Kaustubh Sharma, Srijan Tiwari, Ojasva Nema, Parikshit Pareek
Learning the Mechanism of Catastrophic Forgetting: A Perspective from Gradient Similarity Authors: Mutian Yang, Zisen Zhan, Yutong Chen, Haolin Li, Kaiwen Wang, Kaili Zheng, Yuguang Wang, Qi Wang, Jiandong Gao, Ji Wu
Is Parameter Isolation Better for Prompt-Based Continual Learning? Authors: Jiangyang Li, Chenhao Ding, Songlin Dong, Qiang Wang, Jianchao Zhao, Yuhang He, Yihong Gong
Effective LoRA Adapter Routing using Task Representations Authors: Akash Dhasade, Anne-Marie Kermarrec, Igor Pavlovic, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos
LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them? Authors: J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang, Oleg Poliannikov
Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle Authors: Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí
Flow Perturbation++: Multi-Step Unbiased Jacobian Estimation for High-Dimensional Boltzmann Sampling Authors: Xin Peng, Ang Gao
Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability Authors: Sergi Masip, Gido M. van de Ven, Javier Ferrando, Tinne Tuytelaars
FlexCausal: Flexible Causal Disentanglement via Structural Flow Priors and Manifold-Aware Interventions Authors: Yutao Jin, Yuang Tao, Junyong Zhai
MAR: Efficient Large Language Models via Module-aware Architecture Refinement Authors: Junhong Cai, Guiqin Wang, Kejie Zhao, Jianxiong Tang, Xiang Wang, Luziwei Leng, Ran Cheng, Yuxin Ma, Qinghai Guo
CCMamba: Selective State-Space Models for Higher-Order Graph Learning on Combinatorial Complexes Authors: Jiawen Chen, Qi Shao, Mingtong Zhou, Duxin Chen, Wenwu Yu
Making Foundation Models Probabilistic via Singular Value Ensembles Authors: Mehmet Ozgur Turkoglu, Dominik J. Mühlematter, Alexander Becker, Konrad Schindler, Helge Aasen

1. L$^3$: Large Lookup Layers

ArXiv ID: 2601.21461

Authors: Albert Tseng, Christopher De Sa

Abstract: Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L$^3$), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L$^3$ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L$^3$ has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L$^3$ by training transformers with up to 2.6B active parameters and find that L$^3$ strongly outperforms both dense models and iso-sparse MoEs in both language modeling and downstream tasks.

Comment: Model Architecture & Sparsity: proposes Large Lookup Layers as a systems-friendly sparse alternative to MoE with static token-based routing and embedding allocation; enables CPU-offloaded inference.

Relevance: 10 Novelty: 9

2. HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

ArXiv ID: 2601.21626

Authors: Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng

Abstract: Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.

Comment: Model Compression and Efficiency: low-bit PTQ via Hessian conditioning with learnable rotations to reduce curvature sensitivity.

Relevance: 10 Novelty: 8
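
Illustrative sketch (not from the paper): rotation-before-quantization schemes such as the one described in the HeRo-Q abstract rely on the identity (xR)(RᵀW) = xW for an orthogonal R, so the full-precision output is unchanged while the values that get rounded are different. The example below assumes a random orthogonal matrix in place of the learned rotation-compression matrix, and `quantize_uniform` is a hypothetical per-tensor quantizer introduced only for illustration.

```python
import numpy as np

def quantize_uniform(w, bits=4):
    """Hypothetical symmetric per-tensor uniform quantizer (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
W = rng.standard_normal((d_in, d_out))   # stand-in weight matrix
x = rng.standard_normal((4, d_in))       # stand-in activations

# Random orthogonal rotation from a QR decomposition.
# Assumption: the paper learns its matrix; this one is random.
R, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))

y_ref = x @ W                    # original full-precision output
y_rot = (x @ R) @ (R.T @ W)      # rotated activations times rotated weights

# Quantize the rotated weights and measure the output error against y_ref.
y_rot_q = (x @ R) @ quantize_uniform(R.T @ W)
rot_error = np.linalg.norm(y_rot_q - y_ref)
```

Because `y_rot` matches `y_ref`, any difference in `rot_error` between candidate rotations comes from how each rotation redistributes weight values before rounding.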

3. Scaling Embeddings Outperforms Scaling Experts in Language Models

ArXiv ID: 2601.21204

Authors: Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai

Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.

Comment: Model architecture and efficiency: proposes scaling embeddings as an alternative to MoE sparsity scaling; includes system optimizations/speculative decoding; directly targets MoE/LLM scaling.

Relevance: 10 Novelty: 8

4. HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

ArXiv ID: 2601.20745

Authors: Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang

Abstract: As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.

Comment: Model Compression and Efficiency: introduces a Hessian-guided, differentiable QAT with temperature annealing for ultra-low-bit LLMs, improving optimization over STE-based methods.

Relevance: 10 Novelty: 8
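
Illustrative sketch (not from the paper): one way to picture a temperature-controlled softmax relaxation of ternary rounding is a softmax-weighted average over the levels {-1, 0, 1}. The logit form used here (negative squared distance divided by the temperature) and the function name `soft_ternary` are assumptions; the abstract does not give the exact parameterization or the Hessian-driven annealing schedule.

```python
import numpy as np

LEVELS = np.array([-1.0, 0.0, 1.0])  # ternary grid

def soft_ternary(w, temperature):
    """Softmax-weighted average of the ternary levels.

    Logits are negative squared distances to each level divided by the
    temperature, so a small temperature approaches hard rounding.
    """
    w = np.asarray(w, dtype=float)
    logits = -((w[..., None] - LEVELS) ** 2) / temperature
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ LEVELS

w = np.array([-0.9, -0.2, 0.1, 0.7])
soft = soft_ternary(w, temperature=1.0)    # smooth, values stay off the grid
hard = soft_ternary(w, temperature=1e-3)   # close to round-to-nearest
```

Annealing the temperature from large to small moves each tensor from the smooth regime toward the hard one; per the abstract, Hestia sets that pace per tensor from a Hessian trace signal, which this sketch does not model.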

5. Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves

ArXiv ID: 2601.21582

Authors: Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn, Georg Groh, Kristian Kersting

Abstract: Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.

Comment: Model Architecture: proposes depth-recurrent attention mixtures combining depth attention and sparse expert attention (MoE) to scale latent reasoning efficiently.

Relevance: 10 Novelty: 8

6. ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

ArXiv ID: 2601.21198

Authors: Yuchen Yang, Yaru Zhao, Pu Yang, Shaowei Wang, Zhi-Hua Zhou

Abstract: While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than the state-of-the-art systems.

Comment: HPC/Systems + MoE: lossless compression and cache-affinity scheduling for on-device MoE serving with provable performance, shifting I/O to compute-centric.

Relevance: 10 Novelty: 8

7. ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

ArXiv ID: 2601.21420

Authors: Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang

Abstract: Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio $R$ before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to $R^2\times$ and KV cache by $R\times$. At $R=2$, empirical measurements show prefill speedups reaching 175% and decoding speedups up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE, demonstrating that adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models.

Comment: Model Architecture/Compression: MoE with adaptive token-to-concept compression for implicit compute allocation; reduces attention/KV cache and improves efficiency.

Relevance: 10 Novelty: 8
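
Illustrative arithmetic (not from the paper): the $R^2\times$ and $R\times$ figures in the ConceptMoE abstract follow from attention-map cost being quadratic in sequence length and KV-cache size being linear. The constants $c_{\text{attn}}$ and $c_{\text{kv}}$ below are hypothetical placeholders introduced only for this derivation.

```latex
% n: original sequence length, R: compression ratio (symbols from the abstract)
% c_attn, c_kv: hypothetical per-unit cost constants introduced for illustration
\begin{align*}
\text{attention map cost:}\quad & c_{\text{attn}}\, n^2 \;\longrightarrow\; c_{\text{attn}} \left(\tfrac{n}{R}\right)^2 = \frac{c_{\text{attn}}\, n^2}{R^2} \\
\text{KV cache size:}\quad & c_{\text{kv}}\, n \;\longrightarrow\; c_{\text{kv}}\, \frac{n}{R} \\
R = 2:\quad & \text{attention map cost} \div 4, \qquad \text{KV cache} \div 2
\end{align*}
```

These are upper bounds ("up to"), presumably because only the layers that operate on the compressed sequence see the shorter length.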

8. ECO: Quantized Training without Full-Precision Master Weights

ArXiv ID: 2601.22101

Authors: Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni

Abstract: Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.

Comment: Compression/Efficiency: quantized training without full-precision master weights via error-compensating optimizer; theory and SMoE applicability.

Relevance: 10 Novelty: 8
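
Illustrative sketch (not the paper's update rule): the abstract describes injecting the quantization error into the optimizer momentum. The example below is a generic error-feedback construction for momentum SGD under assumed choices: a uniform rounding grid stands in for FP8/INT4, and the error is scaled by `1 / (lr * beta)` so that the next step's momentum term re-applies what rounding discarded. The function names and the scaling are assumptions.

```python
import numpy as np

def quantize(w, step=0.25):
    """Round-to-nearest onto a uniform grid (stand-in for FP8/INT4 quantization)."""
    return np.round(w / step) * step

def error_feedback_step(w_q, m, grad, lr=0.1, beta=0.9, step=0.25):
    """One momentum-SGD step that stores only quantized weights.

    The rounding error is folded into the momentum buffer, scaled so that the
    next step's beta * m term re-applies what rounding discarded.
    """
    m = beta * m + grad
    w_full = w_q - lr * m          # transient high-precision value, not stored
    w_new = quantize(w_full, step)
    err = w_full - w_new           # what rounding threw away
    m = m - err / (lr * beta)      # inject the quantization error into momentum
    return w_new, m

w_q = quantize(np.array([0.6, -1.3]))   # starts on the grid at [0.5, -1.25]
m = np.zeros(2)
w_q, m = error_feedback_step(w_q, m, grad=np.array([0.4, -0.7]))
```

In this toy run the first update is smaller than half a grid step, so the weights do not move, but the lost update stays in `m`. A following step can then cross the rounding threshold, whereas dropping the error would leave the weights stuck.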

9. Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

ArXiv ID: 2601.21686

Authors: Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello

Abstract: Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $11.9$ points on C4 perplexity and $5.4\%$ on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.

Comment: Model Compression and Efficiency: KV-cache low-rank projection learned on the Stiefel manifold by minimizing decoder-layer output error with rank allocation profiles.

Relevance: 10 Novelty: 8
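
Illustrative sketch (not from the paper): the SVD-style baseline that the StiefAttention abstract contrasts itself with can be pictured as projecting one head's cached keys onto an orthonormal basis and recording the reconstruction error per candidate rank. The random matrix `K` is a stand-in for real cached keys, and fitting the basis on `K` itself rather than on calibration data is an assumption. StiefAttention instead learns the basis against decoder-layer output error, which this sketch does not do.

```python
import numpy as np

def svd_projection(K, rank):
    """Orthonormal basis (head_dim x rank) from the top right-singular vectors of K."""
    _, _, vt = np.linalg.svd(K, full_matrices=False)
    return vt[:rank].T

rng = np.random.default_rng(0)
n_tokens, head_dim = 128, 16
K = rng.standard_normal((n_tokens, head_dim))   # stand-in for one head's cached keys

errors = {}
for rank in (4, 8, 16):
    P = svd_projection(K, rank)
    K_low = K @ P            # what would be stored: n_tokens x rank
    K_hat = K_low @ P.T      # reconstruction used at attention time
    errors[rank] = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

The `errors` dictionary plays the role of a simple error-rank profile: given an error budget, one would pick the smallest rank whose relative error fits under it.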

10. L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

ArXiv ID: 2601.21349

Authors: Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

Abstract: Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.

Comment: Matches Model Architecture: MoE routing improved via low-rank latent routing space and Lipschitz-controlled scoring geometry.

Relevance: 10 Novelty: 8

11. Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

ArXiv ID: 2601.19915

Authors: Paul Tarau

Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry-Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, including relations between commutative vs. non-commutative sequencing and single-token vs. multi-token prediction choices. We show that a neural architecture equivalent to multiplicative RNNs arises naturally from a proof-theoretic interpretation of next-token prediction as nested intuitionistic implication. We also present a practical low-rank neural realization and position the model relative to Transformers and state-space models. Keywords: logic-based derivation of neural architectures, intuitionistic implicational logic, token-as-operator neural models, state-space models, alternatives to transformer-based foundational models.

Comment: Model Architecture: logic-derived Arrow Language Model interpreting next-token prediction as nested intuitionistic implication with low-rank realization.

Relevance: 9 Novelty: 9

12. High-dimensional learning dynamics of multi-pass Stochastic Gradient Descent in multi-index models

ArXiv ID: 2601.21093

Authors: Zhou Fan, Leda Wang

Abstract: We study the learning dynamics of a multi-pass, mini-batch Stochastic Gradient Descent (SGD) procedure for empirical risk minimization in high-dimensional multi-index models with isotropic random data. In an asymptotic regime where the sample size $n$ and data dimension $d$ increase proportionally, for any sub-linear batch size $\kappa \asymp n^\alpha$ where $\alpha \in [0,1)$, and for a commensurate "critical" scaling of the learning rate, we provide an asymptotically exact characterization of the coordinate-wise dynamics of SGD. This characterization takes the form of a system of dynamical mean-field equations, driven by a scalar Poisson jump process that represents the asymptotic limit of SGD sampling noise. We develop an analogous characterization of the Stochastic Modified Equation (SME) which provides a Gaussian diffusion approximation to SGD. Our analyses imply that the limiting dynamics for SGD are the same for any batch size scaling $\alpha \in [0,1)$, and that under a commensurate scaling of the learning rate, dynamics of SGD, SME, and gradient flow are mutually distinct, with those of SGD and SME coinciding in the special case of a linear model. We recover a known dynamical mean-field characterization of gradient flow in a limit of small learning rate, and of one-pass/online SGD in a limit of increasing sample size $n/d \to \infty$.

Comment: Training dynamics: asymptotically exact mean-field characterization of multi-pass mini-batch SGD vs SME vs gradient flow in high dimensions.

Relevance: 9 Novelty: 8

13. Perceptrons and localization of attention's mean-field landscape

ArXiv ID: 2601.21366

Authors: Antonio Álvarez-López, Borjan Geshkovski, Domènec Ruiz-Balet

Abstract: The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.

Comment: Model Architecture theory: mean-field analysis of Transformer attention/perceptron blocks showing atomic localization of critical points.

Relevance: 9 Novelty: 8

14. PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training

ArXiv ID: 2601.22137

Authors: Shenghao Yang, Zhichao Wang, Oleg Balabanov, N. Benjamin Erichson, Michael W. Mahoney

Abstract: Matrix functions such as square root, inverse roots, and orthogonalization play a central role in preconditioned gradient methods for neural network training. This has motivated the development of iterative algorithms that avoid explicit eigendecompositions and rely primarily on matrix multiplications, making them well suited for modern GPU accelerators. We present PRISM (Polynomial-fitting and Randomized Iterative Sketching for Matrix functions computation), a general framework for accelerating iterative algorithms for computing matrix functions. PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via a sketched least-squares problem, adapting to the instance at hand with minimal overhead. We apply PRISM to accelerate Newton-Schulz-like iterations for matrix square roots and orthogonalization, which are core primitives in machine learning. Unlike prior methods, PRISM requires no explicit spectral bounds or singular value estimates; and it adapts automatically to the evolving spectrum. Empirically, PRISM accelerates training when integrated into Shampoo and Muon optimizers.

Comment: Systems/efficiency: algorithmic framework (adaptive polynomial fitting + randomized sketching) to accelerate matrix functions used in optimizers (Shampoo/Muon), enabling faster large-model training.

Relevance: 9 Novelty: 8
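For reference on entry 14, here is a minimal sketch of the classical cubic Newton-Schulz iteration for orthogonalization, the kind of matrix-multiplication-only primitive the abstract says PRISM accelerates. It is not PRISM itself: there is no polynomial fitting or sketching, and the function name, the fixed coefficients (1.5 and -0.5), and the iteration count are illustrative assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(a, num_iters=30):
    """Approximate the orthogonal polar factor of `a` using only matmuls."""
    # Divide by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = a / np.linalg.norm(a)
    for _ in range(num_iters):
        # Cubic Newton-Schulz step: each singular value s maps to 1.5*s - 0.5*s**3.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

# Small full-rank example matrix (hypothetical data).
a = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]])
q = newton_schulz_orthogonalize(a)
```

With fixed coefficients, convergence slows when some singular values are very small after scaling. Per the abstract, PRISM instead adapts the polynomial to the current spectrum at each iteration.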

15. DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

ArXiv ID: 2601.21824

Authors: Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo

Abstract: Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.

Comment: HPC/systems: deterministic attention scheduling (backward pass DAG scheduling) to regain throughput for reproducible LLM training.

Relevance: 9 Novelty: 8
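To make the "critical path length" objective in entry 15 concrete, the sketch below computes the longest dependency chain of a small task DAG. The task names, durations, and the serialized-reduction dependency are hypothetical, and the model ignores resource limits; it illustrates the quantity being minimized, not DASH's actual schedules.

```python
from graphlib import TopologicalSorter

def critical_path_length(durations, predecessors):
    """Length of the longest dependency chain in a task DAG."""
    finish = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(predecessors).static_order():
        start = max((finish[p] for p in predecessors.get(task, ())), default=0.0)
        finish[task] = start + durations[task]
    return max(finish.values())

# Two tiles: each computes a partial gradient, then adds it to a shared buffer.
# Determinism is modeled by forcing reduce_1 to wait for reduce_0.
durations = {"compute_0": 2.0, "compute_1": 2.0, "reduce_0": 1.0, "reduce_1": 1.0}
serialized = {
    "reduce_0": {"compute_0"},
    "reduce_1": {"compute_1", "reduce_0"},
}
length = critical_path_length(durations, serialized)
```

Here the chain compute_0, reduce_0, reduce_1 determines the result, so reordering which tile reduces first cannot help in this symmetric toy case. With unequal compute times the ordering starts to matter, and choosing that ordering is the kind of decision a scheduler has to make.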

16. Can Local Learning Match Self-Supervised Backpropagation?

ArXiv ID: 2601.21683

Authors: Wu S. Zihan, Ariane Delrocq, Wulfram Gerstner, Guillaume Bellec

Abstract: While end-to-end self-supervised learning with backpropagation (global BP-SSL) has become central for training modern AI systems, theories of local self-supervised learning (local-SSL) have struggled to build functional representations in deep neural networks. To establish a link between global and local rules, we first develop a theory for deep linear networks: we identify conditions for local-SSL algorithms (like Forward-forward or CLAPP) to implement exactly the same weight update as a global BP-SSL. Starting from the theoretical insights, we then develop novel variants of local-SSL algorithms to approximate global BP-SSL in deep non-linear convolutional neural networks. Variants that improve the similarity between gradient updates of local-SSL with those of global BP-SSL also show better performance on image datasets (CIFAR-10, STL-10, and Tiny ImageNet). The best local-SSL rule with the CLAPP loss function matches the performance of a comparable global BP-SSL with InfoNCE or CPC-like loss functions, and improves upon state-of-the-art for local SSL on these benchmarks.

Comment: Representation learning/training dynamics: theoretical equivalence conditions between local SSL and global BP-SSL and practical local-SSL variants matching global SSL.

Relevance: 9 Novelty: 8

17. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

ArXiv ID: 2601.20309

Authors: Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang

Abstract: Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.

Comment: High Performance Computing: SLO-aware rotary scheduling (RotaSched) and DuplexKV memory co-design on Superchips for responsive LLM serving.

Relevance: 9 Novelty: 8

18. Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts

ArXiv ID: 2601.21690

Authors: Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen

Abstract: Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without original data or expensive re-computation. Despite empirical successes, a unified theory for its effectiveness under heterogeneous finetuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Moreover, the lack of hyperparameter transparency in open-source fine-tuned models makes it difficult to predict merged-model performance, leaving practitioners without guidance on how to fine-tune merge-friendly experts. To address these two challenges, we employ $L_2$-Stability theory under heterogeneous hyperparameter environments to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$. This pioneering analysis yields two key contributions: (i) \textit{A unified theoretical framework} is provided to explain existing merging algorithms, revealing how they optimize specific terms in our bound, thus offering a strong theoretical foundation for empirical observations. (ii) \textit{Actionable recommendations} are proposed for practitioners to strategically fine-tune expert models, enabling the construction of merge-friendly models within the pretraining-to-finetuning pipeline. Extensive experiments on the ResNet/ViT family across 20/8 visual classification tasks, involving thousands of finetuning models, robustly confirm the impact of different hyperparameters on the generalization of $\boldsymbol{x}_{avg}$ predicted by our theoretical results.

Comment: Model Architecture/Training Theory: unified generalization framework via L2-stability for parameter-space model merging across heterogeneous experts, with actionable merging guidance.

Relevance: 9 Novelty: 8

19. Value-Based Pre-Training with Downstream Feedback

ArXiv ID: 2601.22108

Authors: Shuqi Ke, Giulia Fanti

Abstract: Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B--7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.

Comment: Representation/Training Dynamics: value-based continued pretraining steers SSL using downstream-gradient alignment to maximize gradient value per step.

Relevance: 9 Novelty: 8
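The gradient-alignment idea in entry 19 can be illustrated with a small selection rule: score each candidate pretraining task by how well its loss gradient aligns with a downstream gradient, and pick the best. Using cosine similarity and a hard argmax is an assumption made for this sketch, as are the function name and toy gradients; the abstract describes a learned lightweight task designer, which this does not reproduce.

```python
import numpy as np

def pick_aligned_task(task_grads, downstream_grad):
    """Return the index of the best-aligned candidate task and all cosine scores."""
    g = downstream_grad / np.linalg.norm(downstream_grad)
    # Cosine similarity between each candidate's gradient and the downstream gradient.
    scores = np.array([t @ g / np.linalg.norm(t) for t in task_grads])
    return int(scores.argmax()), scores

# Toy flattened gradients for three candidate tasks (hypothetical values).
task_grads = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
downstream_grad = np.array([0.0, 2.0])
best, scores = pick_aligned_task(task_grads, downstream_grad)
```

Note that the downstream gradient only ranks the candidates; the model update itself would still use the selected pretraining loss, consistent with the abstract's statement that the model is never updated on downstream labels.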

20. Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization

ArXiv ID: 2601.20301

Authors: Jialuo He, Huangxun Chen

Abstract: Sharpness-Aware Minimization (SAM) has recently emerged as an effective technique for improving DNN robustness to input variations. However, its interplay with the compactness requirements of on-device DNN deployments remains less explored. Simply pruning a SAM-trained model can undermine robustness, since flatness in the continuous parameter space does not necessarily translate to robustness under the discrete structural changes induced by pruning. Conversely, applying SAM after pruning may be fundamentally constrained by architectural limitations imposed by an early, robustness-agnostic pruning pattern. To address this gap, we propose Compression-aware ShArpness Minimization (C-SAM), a framework that shifts sharpness-aware learning from parameter perturbations to mask perturbations. By explicitly perturbing pruning masks during training, C-SAM promotes a flatter loss landscape with respect to model structure, enabling the discovery of pruning patterns that simultaneously optimize model compactness and robustness to input variations. Extensive experiments on CelebA-HQ, Flowers-102, and CIFAR-10-C across ResNet-18, GoogLeNet, and MobileNet-V2 show that C-SAM consistently achieves higher certified robustness than strong baselines, with improvements of up to 42%, while maintaining task accuracy comparable to the corresponding unpruned models.

Comment: Compression/Efficiency & Robustness: sharpness-aware training over pruning masks (structure perturbations) to co-optimize compactness and robustness.

Relevance: 9 Novelty: 8

21. Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning

ArXiv ID: 2601.20326

Authors: Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan

Abstract: KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: \textbf{(i) Chain-of-Embedding}, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and \textbf{(ii) Fast/Slow Thinking Switching}, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to $5.7\times$ with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.

Comment: Efficiency/Cache: repurposes KV cache as lightweight representation for chain-of-embedding and fast/slow reasoning switching, reducing tokens at inference.

Relevance: 9 Novelty: 8

22. Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

ArXiv ID: 2601.22156

Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu

Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.

Comment: Model Architecture/Efficiency: distills Transformers into RNN-attention hybrids (HALO/HypeNet) with improved long-context efficiency and length generalization.

Relevance: 9 Novelty: 8

23. Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

ArXiv ID: 2601.21996

Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan

Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

Comment: Representation Learning/Training Dynamics: influence-function-based mechanistic data attribution linking training samples to interpretable circuits and ICL heads.

Relevance: 9 Novelty: 8

24. The Depth Delusion: Why Transformers Should Be Wider, Not Deeper

ArXiv ID: 2601.20994

Authors: Md Muhtasim Munif Fahim, Md Rezaul Karim

Abstract: Neural scaling laws describe how language model loss decreases with parameters and data, but treat architecture as interchangeable--a billion parameters could arise from a shallow-wide model (10 layers & 8,192 hidden dimension) or a deep-narrow one (80 layers & 2,048 hidden dimension). We propose architecture-conditioned scaling laws decomposing this dependence, finding that optimal depth scales as D ~ C^0.12 while optimal width scales as W ~ C^0.34, meaning width should grow 2.8x faster than depth. We discover a critical depth phenomenon: beyond D_crit ~ W^0.44 (sublinear in W), adding layers increases loss despite adding parameters--the Depth Delusion. Empirically, we validate these findings across 30 transformer architectures spanning 17M to 7B parameters, each trained on representative high-compute samples, achieving R^2 = 0.922. Our central finding: at 7B scale, a 64-layer model (6.38B params) underperforms a 32-layer model (6.86B params) by 0.12 nats, despite being significantly deeper. This demonstrates that optimal depth-width tradeoffs persist at the production scale.

Comment: Model Architecture/Scaling Laws: architecture-conditioned scaling revealing critical depth and advocating width-over-depth tradeoffs.

Relevance: 9 Novelty: 8
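The exponents quoted in entry 24 can be turned into concrete growth factors. The sketch below uses only the exponents stated in the abstract (0.12, 0.34, 0.44); the proportionality constants are not given there, so only ratios are computed, and the variable and function names are illustrative.

```python
# Exponents as reported in the abstract: D ~ C^0.12, W ~ C^0.34, D_crit ~ W^0.44.
depth_exp, width_exp, crit_exp = 0.12, 0.34, 0.44

def growth_factors(compute_multiplier):
    """Growth of optimal depth and width when compute is scaled by the given factor."""
    return compute_multiplier ** depth_exp, compute_multiplier ** width_exp

# Effect of a 10x compute increase on optimal depth and width.
depth_growth, width_growth = growth_factors(10.0)

# Ratio of the two exponents, which is the source of the "2.8x faster" statement.
exponent_ratio = width_exp / depth_exp

# Growth of the critical depth when width doubles.
crit_growth_per_width_doubling = 2.0 ** crit_exp
```

Under these exponents, a 10x compute increase raises optimal depth by roughly 1.32x and optimal width by roughly 2.19x, and doubling the width raises the critical depth by roughly 1.36x.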

25. A Separable Architecture for Continuous Token Representation in Language Models

ArXiv ID: 2601.22040

Authors: Reza T. Batley, Sourav Saha

Abstract: Transformer scaling law analyses typically treat parameters as interchangeable; an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator to replace the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with $1.47$ to $2.11 \times$ more parameters.

Comment: Model Architecture/Efficiency: replaces embedding tables with a continuous token generator (separable architecture) improving parametric efficiency.

Relevance: 9 Novelty: 8

26. LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models

ArXiv ID: 2601.21623

Authors: Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen

Abstract: Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.

Comment: Model Compression and Efficiency: adaptive look-ahead mixed-precision inference selecting small subsets for high precision to control rounding error in Transformers.

Relevance: 9 Novelty: 8

27. Clustering in Deep Stochastic Transformers

ArXiv ID: 2601.21942

Authors: Lev Fedorov, Michaël E. Sander, Romuald Elie, Pierre Marion, Mathieu Laurière

Abstract: Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a \emph{common} matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.

Comment: Representation Learning/Theory: stochastic analysis of deep Transformer token dynamics; interacting-particle limit prevents collapse.

Relevance: 9 Novelty: 8

28. Soft Quantization: Model Compression Via Weight Coupling

ArXiv ID: 2601.21219

Authors: Daniel T. Bernstein, Luca Di Carlo, David Schwab

Abstract: We show that introducing short-range attractive couplings between the weights of a neural network during training provides a novel avenue for model quantization. These couplings rapidly induce the discretization of a model's weight distribution, and they do so in a mixed-precision manner despite only relying on two additional hyperparameters. We demonstrate that, within an appropriate range of hyperparameters, our "soft quantization" scheme outperforms histogram-equalized post-training quantization on ResNet-20/CIFAR-10. Soft quantization provides both a new pipeline for the flexible compression of machine learning models and a new tool for investigating the trade-off between compression and generalization in high-dimensional loss landscapes.

Comment: Compression/quantization: training-time weight coupling induces mixed-precision discretization; a novel route to quantization beyond standard PTQ.

Relevance: 9 Novelty: 7

29. GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization

ArXiv ID: 2601.22095

Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Chi Wang, Yuehao Wang, Jing Xiong, Liliang Ren, Bo Peng, Qingmei Wang, Xiaoran Shang, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

Abstract: The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.

Comment: Model Architecture: Transformer normalization innovation (GeoNorm) unifying pre-/post-norm via geodesic updates with negligible overhead.

Relevance: 9 Novelty: 7

30. Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers

ArXiv ID: 2601.21641

Authors: Evandro S. Ortigossa, Eran Segal

Abstract: Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Token segments allow each expert to model intra-segment interactions directly, naturally aligning with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.

Comment: Model Architecture: MoE innovation with segment-wise routing for time-series Transformers, aligning conditional sparsity with temporal locality.

Relevance: 9 Novelty: 7
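The difference between token-wise and segment-wise routing in entry 30 can be shown in a few lines. This is a minimal sketch assuming top-1 routing, a single linear router over the flattened segment, and a sequence length divisible by the segment length; all of these are assumptions for illustration rather than details from the abstract.

```python
import numpy as np

def segment_route(x, router_w, segment_len):
    """Assign one expert per contiguous segment; return a per-token expert index."""
    num_tokens, dim = x.shape
    assert num_tokens % segment_len == 0, "sketch assumes equal-length segments"
    # Each row concatenates `segment_len` consecutive time steps.
    segments = x.reshape(num_tokens // segment_len, segment_len * dim)
    logits = segments @ router_w              # shape: (num_segments, num_experts)
    segment_expert = logits.argmax(axis=1)    # top-1 expert per segment
    # Broadcast the segment decision back to every token in that segment.
    return np.repeat(segment_expert, segment_len)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))               # 8 time steps, 4 features
router_w = rng.standard_normal((2 * 4, 3))    # segments of 2 steps, 3 experts
assignment = segment_route(x, router_w, segment_len=2)
```

Because the routing decision is made once per segment, both time steps in a segment always reach the same expert, which is what lets that expert model intra-segment interactions.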

31. Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data

ArXiv ID: 2601.22141

Authors: Grzegorz Stefanski, Alberto Presta, Michal Byra

Abstract: In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.

Comment: Model Compression/Sparsity: adaptive pruning discovers routed, specialized subnetworks ('adaptive tickets') for heterogeneous data.

Relevance: 9 Novelty: 7

32. Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

ArXiv ID: 2601.20706

Authors: Binglei Lou, Haoran Wu, Yao Lai, Jiayi Nie, Can Xiao, Xuan Guo, Rika Antonova, Robert Mullins, Aaron Zhao

Abstract: Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent nm technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.

Comment: High Performance Computing: systems-level NPU design and instruction/memory optimizations tailored to diffusion-LLM sampling workloads.

Relevance: 8 Novelty: 8

33. Fast and Geometrically Grounded Lorentz Neural Networks

ArXiv ID: 2601.21529

Authors: Robert van der Klis, Ricardo Chávez Torres, Max van Spengler, Yuhui Ding, Thomas Hofmann, Pascal Mettes

Abstract: Hyperbolic space is quickly gaining traction as a promising geometry for hierarchical and robust representation learning. A core open challenge is the development of a mathematical formulation of hyperbolic neural networks that is both efficient and captures the key properties of hyperbolic space. The Lorentz model of hyperbolic space has been shown to enable both fast forward and backward propagation. However, we prove that, with the current formulation of Lorentz linear layers, the hyperbolic norms of the outputs scale logarithmically with the number of gradient descent steps, nullifying the key advantage of hyperbolic geometry. We propose a new Lorentz linear layer grounded in the well-known "distance-to-hyperplane" formulation. We prove that our formulation results in the usual linear scaling of output hyperbolic norms with respect to the number of gradient descent steps. Our new formulation, together with further algorithmic efficiencies through Lorentzian activation functions and a new caching strategy, results in neural networks fully abiding by hyperbolic geometry while simultaneously bridging the computation gap to Euclidean neural networks. Code available at: https://github.com/robertdvdk/hyperbolic-fully-connected.

Comment: Model architecture: new Lorentz linear layer with geometric guarantees plus efficient activations/caching for hyperbolic NNs, improving representation learning in non-Euclidean space.

Relevance: 8 Novelty: 8
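As background for entry 33, the sketch below shows the basic objects of the Lorentz (hyperboloid) model with curvature -1: lifting a Euclidean vector onto the hyperboloid, the Lorentzian inner product, and the geodesic distance to the origin. Reading "hyperbolic norm" as distance to the origin is an assumption here, the function names are illustrative, and none of this is the paper's proposed layer.

```python
import numpy as np

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the point (sqrt(1 + |v|^2), v) on the hyperboloid."""
    return np.concatenate(([np.sqrt(1.0 + v @ v)], v))

def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time components plus the spatial dot product."""
    return -x[0] * y[0] + x[1:] @ y[1:]

def hyperbolic_norm(x):
    """Geodesic distance from the hyperboloid origin (1, 0, ..., 0), curvature -1."""
    return np.arccosh(x[0])

v = np.array([3.0, 4.0])
x = lift_to_hyperboloid(v)
```

Every lifted point satisfies `lorentz_inner(x, x) == -1` up to rounding, and its distance to the origin equals `arcsinh(|v|)`, so it grows only logarithmically in the Euclidean norm of `v` for large `v`.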

34. Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

ArXiv ID: 2601.22132

Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu

Abstract: Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches (routing and cascading) treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely-used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.

Comment: Model Compression and Efficiency: token-budgeted LLM–SLM collaboration via hint prefixes and learned hint-length routing for cost-efficient inference.

Relevance: 8 Novelty: 8
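
Illustrative sketch for the Shepherding entry above: the hint mechanism amounts to truncating an LLM response to a prefix and letting the SLM continue from it. The function names, the whitespace tokenization, and the prompt layout below are assumptions made for illustration, not the authors' implementation, and the two-stage predictor that chooses the hint length is not modeled.

```python
def take_hint(llm_response: str, fraction: float) -> str:
    """Return roughly the first `fraction` of the response (whitespace tokens)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    tokens = llm_response.split()
    n_hint = int(len(tokens) * fraction)  # round down to a whole token count
    return " ".join(tokens[:n_hint])


def build_slm_prompt(question: str, hint: str) -> str:
    """Hypothetical prompt layout: the SLM continues the answer from the hint."""
    if not hint:
        return f"Question: {question}\nAnswer:"
    return f"Question: {question}\nAnswer: {hint}"


full = "First add 3 and 4 to get 7 then multiply by 2 to get 14"
hint = take_hint(full, 0.3)
prompt = build_slm_prompt("What is (3 + 4) * 2?", hint)
```

In this toy view, a fraction of 0 means the SLM answers alone and a fraction of 1 means paying for the full LLM response, so the all-or-nothing baselines sit at the two endpoints.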

35. Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

ArXiv ID: 2601.21590

Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar

Abstract: Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.

Comment: Model Compression and Efficiency: training-free distribution sharpening via scaled low-temperature token sampling to match RL post-training gains without MCMC.

Relevance: 8 Novelty: 8
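
Illustrative sketch for the power sampling entry above: for a single next-token distribution, raising the probabilities to a power `alpha` and renormalizing is the same as sampling at temperature `1 / alpha`. The code checks only this token-level identity. The helper names and the value of `alpha` are assumptions, and the trajectory-dependent scaling factor described in the abstract, which is what relates token-level sharpening to the sequence-level power distribution, is not modeled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def power_distribution(probs, alpha):
    """Raise each probability to `alpha` and renormalize."""
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


logits = [2.0, 1.0, 0.1]
alpha = 4.0  # illustrative exponent, not a value from the paper

base = softmax(logits)
sharpened = power_distribution(base, alpha)
low_temp = softmax(logits, temperature=1.0 / alpha)

# The two routes agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(sharpened, low_temp))
```

At the sequence level the two notions differ, because the power distribution renormalizes over whole sequences rather than over each token separately.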

36. LoRA and Privacy: When Random Projections Help (and When They Don't)

ArXiv ID: 2601.21719

Authors: Yaxi Hu, Johanna Düngler, Bernhard Schölkopf, Amartya Sanyal

Abstract: We introduce the (Wishart) projection mechanism, a randomized map of the form $S \mapsto M f(S)$ with $M \sim W_d(\tfrac{1}{r} I_d, r)$, and study its differential privacy properties. For vector-valued queries $f$, we prove non-asymptotic DP guarantees without any additive noise, showing that Wishart randomness alone can suffice. For matrix-valued queries, however, we establish a sharp negative result: in the noise-free setting, the mechanism is not DP, and we demonstrate its vulnerability by implementing a near-perfect membership inference attack (AUC $> 0.99$). We then analyze a noisy variant and prove privacy amplification due to randomness and low-rank projection, in both large- and small-rank regimes, yielding stronger privacy guarantees than additive noise alone. Finally, we show that LoRA-style updates are an instance of the matrix-valued mechanism, implying that LoRA is not inherently private despite its built-in randomness, but that low-rank fine-tuning can be more private than full fine-tuning at the same noise level. Preliminary experiments suggest that tighter accounting enables lower noise and improved accuracy in practice.

Comment: Low-Rank/Compression + Privacy theory: DP analysis of Wishart/projection mechanisms; shows LoRA randomness is not inherently private and when low-rank helps with DP.

Relevance: 8 Novelty: 8
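
Illustrative sketch for the Wishart projection entry above: assuming the scale matrix in the abstract is read as $(1/r) I_d$ with $r$ degrees of freedom, a draw of $M$ can be written as $M = \frac{1}{r} G G^\top$, where $G$ is a $d \times r$ matrix of independent standard normals, so that $E[M] = I_d$. The function names and the example query vector are hypothetical, and the code only samples and applies the map; it says nothing about the privacy guarantees themselves.

```python
import random


def sample_wishart_projection(d, r, rng):
    """Draw M = (1/r) G G^T, with G a d-by-r matrix of standard normals."""
    g = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(d)]
    return [
        [sum(g[i][k] * g[j][k] for k in range(r)) / r for j in range(d)]
        for i in range(d)
    ]


def matvec(m, v):
    """Apply the random map to a vector-valued query output f(S)."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


rng = random.Random(0)
M = sample_wishart_projection(d=4, r=2, rng=rng)
query_output = [1.0, 0.0, -1.0, 2.0]  # stand-in for f(S)
released = matvec(M, query_output)
```

With $r < d$ the sampled matrix has rank at most $r$, which is the low-rank projection regime the abstract connects to LoRA-style updates.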

37. Representation Unlearning: Forgetting through Information Compression

ArXiv ID: 2601.21564

Authors: Antonio Almudévar, Alfonso Ortega

Abstract: Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model's representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.

Comment: Representation Unlearning: imposes an information bottleneck in representation space to forget while retaining utility, with variational objectives.

Relevance: 8 Novelty: 8

38. Procedural Pretraining: Warming Up Language Models with Abstract Data

ArXiv ID: 2601.21725

Authors: Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data, respectively. Third, we explore the mechanisms behind these gains and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.

Comment: Training Dynamics/Efficiency: procedural pretraining on abstract data to induce algorithmic structure and accelerate LLM pretraining with less data.

Relevance: 8 Novelty: 8

39. CORDS: Continuous Representations of Discrete Structures

ArXiv ID: 2601.21583

Authors: Tin Hadži Veljković, Erik Bekkers, Michael Tiemann, Jan-Willem van de Meent

Abstract: Many learning problems require predicting sets of objects when the number of objects is not known beforehand. Examples include object detection, molecular modeling, and scientific inference tasks such as astrophysical source detection. Existing methods often rely on padded representations or must explicitly infer the set size, which often poses challenges. We present a novel strategy for addressing this challenge by casting prediction of variable-sized sets as a continuous inference problem. Our approach, CORDS (Continuous Representations of Discrete Structures), provides an invertible mapping that transforms a set of spatial objects into continuous fields: a density field that encodes object locations and count, and a feature field that carries their attributes over the same support. Because the mapping is invertible, models operate entirely in field space while remaining exactly decodable to discrete sets. We evaluate CORDS across molecular generation and regression, object detection, simulation-based inference, and a mathematical task involving recovery of local maxima, demonstrating robust handling of unknown set sizes with competitive accuracy.

Comment: Representation Learning/Set Modeling: invertible continuous fields (density/feature) for variable-sized sets enabling exact decoding.

Relevance: 8 Novelty: 8

40. TRACE: Trajectory Recovery for Continuous Mechanism Evolution in Causal Representation Learning

ArXiv ID: 2601.21135

Authors: Shicheng Fan, Kun Zhang, Lu Cheng

Abstract: Temporal causal representation learning methods assume that causal mechanisms switch instantaneously between discrete domains, yet real-world systems often exhibit continuous mechanism transitions. For example, a vehicle's dynamics evolve gradually through a turning maneuver, and human gait shifts smoothly from walking to running. We formalize this setting by modeling transitional mechanisms as convex combinations of finitely many atomic mechanisms, governed by time-varying mixing coefficients. Our theoretical contributions establish that both the latent causal variables and the continuous mixing trajectory are jointly identifiable. We further propose TRACE, a Mixture-of-Experts framework where each expert learns one atomic mechanism during training, enabling recovery of mechanism trajectories at test time. This formulation generalizes to intermediate mechanism states never observed during training. Experiments on synthetic and real-world data demonstrate that TRACE recovers mixing trajectories with up to 0.99 correlation, substantially outperforming discrete-switching baselines.

Comment: Representation Learning with MoE: identifiable continuous mechanism trajectories via MoE experts for causal representation learning.

Relevance: 8 Novelty: 8

41. Order-Optimal Sample Complexity of Rectified Flows

ArXiv ID: 2601.20250

Authors: Hari Krishna Sahoo, Mudit Gaur, Vaneet Aggarwal

Abstract: Recently, flow-based generative models have shown superior efficiency compared to diffusion models. In this paper, we study rectified flow models, which constrain transport trajectories to be linear from the base distribution to the data distribution. This structural restriction greatly accelerates sampling, often enabling high-quality generation with a single Euler step. Under standard assumptions on the neural network classes used to parameterize the velocity field and data distribution, we prove that rectified flows achieve sample complexity $\tilde{O}(\varepsilon^{-2})$. This improves on the best known $O(\varepsilon^{-4})$ bounds for flow matching models and matches the optimal rate for mean estimation. Our analysis exploits the particular structure of rectified flows: because the model is trained with a squared loss along linear paths, the associated hypothesis class admits a sharply controlled localized Rademacher complexity. This yields the improved, order-optimal sample complexity and provides a theoretical explanation for the strong empirical performance of rectified flow models.

Comment: Representation learning/theory: proves order-optimal sample complexity for rectified flows in generative modeling.

Relevance: 8 Novelty: 8
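
Illustrative sketch for the rectified flow entry above: along a linear path $x_t = (1 - t) x_0 + t x_1$ the velocity is the constant $x_1 - x_0$, which is why a single Euler step can cover the whole trajectory. The code uses the exact velocity of one sample pair, with hypothetical function names and made-up points; a trained model only approximates this velocity field, so one-step sampling is exact here but approximate in practice.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_step(x, v, dt):
    """One explicit Euler step of dx/dt = v."""
    return [xi + dt * vi for xi, vi in zip(x, v)]


x0 = [0.0, 1.0]   # base sample (made-up values)
x1 = [2.0, -1.0]  # data sample (made-up values)

v = velocity_target(x0, x1)
x_mid = interpolate(x0, x1, 0.5)
x_end = euler_step(x0, v, 1.0)  # a single step with dt = 1 lands on x1
```

The squared loss mentioned in the abstract is the regression of a model's predicted velocity at `interpolate(x0, x1, t)` onto `velocity_target(x0, x1)`.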

42. Bridging Functional and Representational Similarity via Usable Information

ArXiv ID: 2601.21568

Authors: Antonio Almudévar, Alfonso Ortega

Abstract: We present a unified framework for quantifying the similarity between representations through the lens of usable information, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we prove that reconstruction-based metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a rigid observer may be identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.

Comment: Representation Learning Theory: unifies functional and representational similarity via usable information linking stitching, CKA/RSA, and reconstruction.

Relevance: 8 Novelty: 8

43. $\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

ArXiv ID: 2601.20844

Authors: Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

Abstract: This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of "distances" or "similarities," including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the ${m\choose k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.

Comment: Representation/Compression Theory: tight bounds on minimal embeddable dimension for top-k retrieval under common similarities, informing embedding design.

Relevance: 8 Novelty: 8
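
Illustrative sketch for the top-$k$ retrieval entry above: in the centroid setting, a subset is embedded as the mean of its elements' embeddings, and retrieval succeeds if the top-scoring elements under the inner product are exactly the subset. The checker below, its function names, and both toy embeddings are assumptions for illustration. The one-hot case uses $m$ dimensions, far more than the bounds studied in the paper, and the 1-D case is a deliberately failing example.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(coords) / n for coords in zip(*vectors)]


def top_k(query, embeddings, k):
    """Indices of the k embeddings with the largest inner product."""
    order = sorted(range(len(embeddings)),
                   key=lambda i: dot(query, embeddings[i]), reverse=True)
    return order[:k]


def centroid_retrieves_subset(embeddings, subset):
    """True if querying with the subset's centroid returns exactly the subset."""
    query = centroid([embeddings[i] for i in subset])
    return set(top_k(query, embeddings, len(subset))) == set(subset)


m = 5
one_hot = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
all_pairs_ok = all(centroid_retrieves_subset(one_hot, s)
                   for s in combinations(range(m), 2))

line = [[1.0], [2.0], [3.0]]  # 1-D embeddings: too small for this subset
line_ok = centroid_retrieves_subset(line, (0, 1))
```

With one-hot embeddings every member scores $1/k$ and every non-member scores 0, so retrieval always succeeds; on the line, the largest element dominates every query, so the subset `(0, 1)` cannot be recovered.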

44. Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

ArXiv ID: 2601.21601

Authors: Hoyeon Chang, Bálint Mucsányi, Seong Joon Oh

Abstract: Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.

Comment: Matches Representation Learning: theoretical analysis of first-order update propagation and constraints (bilinearity vs negation) on feature maps.

Relevance: 8 Novelty: 8

45. Identifiable Equivariant Networks are Layerwise Equivariant

ArXiv ID: 2601.21645

Authors: Vahid Shahverdi, Giovanni Luca Marchetti, Georg Bökman, Kathlén Kohn

Abstract: We investigate the relation between end-to-end equivariance and layerwise equivariance in deep neural networks. We prove the following: For a network whose end-to-end function is equivariant with respect to group actions on the input and output spaces, there is a parameter choice yielding the same end-to-end function such that its layers are equivariant with respect to some group actions on the latent spaces. Our result assumes that the parameters of the model are identifiable in an appropriate sense. This identifiability property has been established in the literature for a large class of networks, to which our results apply immediately, while it is conjectural for others. The theory we develop is grounded in an abstract formalism, and is therefore architecture-agnostic. Overall, our results provide a mathematical explanation for the emergence of equivariant structures in the weights of neural networks during training -- a phenomenon that is consistently observed in practice.

Comment: Matches Model Architecture/Theory: identifiability-based proof linking end-to-end equivariance to layerwise equivariance.

Relevance: 8 Novelty: 8

46. From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning

ArXiv ID: 2601.22028

Authors: Haoran Tang, Rajiv Khanna

Abstract: Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide first theoretical insights that relate representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks, inspiring future work that reshapes the representation space to remove forget concepts.

Comment: Representation Learning: contrastive latent regularizer to reduce forget–retain entanglement for LLM unlearning (explicit representation shaping).