Personalized Daily ArXiv Papers 2026-02-20

[gpt-5]	Prompt	Completion	Total
Token	45341	39660	85001
Cost	$0.06	$0.4	$0.45

Total arXiv papers: 513

Total scanned papers: 331

Total relevant papers: 25

Table of contents with paper titles:

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression Authors: Akira Sakai, Yuma Ichikawa
A Theoretical Framework for Modular Learning of Robust Generative Models Authors: Corinna Cortes, Mehryar Mohri, Yutao Zhong
On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking Authors: Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang
VP-VAE: Rethinking Vector Quantization via Adaptive Vector Perturbation Authors: Linwei Zhai, Han Ding, Mingzhi Lin, Cui Zhao, Fei Wang, Ge Wang, Wang Zhi, Wei Xi
One-step Language Modeling via Continuous Denoising Authors: Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim
Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding Authors: Rahul Thomas, Teo Kitanovski, Micah Goldblum, Arka Pal
Canonicalizing Multimodal Contrastive Representation Learning Authors: Sharut Gupta, Sanyam Kansal, Stefanie Jegelka, Phillip Isola, Vikas Garg
Early-Warning Signals of Grokking via Loss-Landscape Geometry Authors: Yongzhong Xu
Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization Authors: Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, Sijia Liu
Learning with Boolean threshold functions Authors: Veit Elser, Manish Krishan Lal
Arcee Trinity Large Technical Report Authors: Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Fern, Aria Harley, Conner Stewart, Colin Kealty, Maziyar Panahi, Simon Kirsten, Anushka Deshpande, Anneketh Vij, Arthur Bresnu, Pranav Veldurthi, Raghav Ravishankar, Hardik Bishnoi, DatologyAI Team, Arcee AI Team, Prime Intellect Team, Mark McQuade, Johannes Hagemann, Lucas Atkins
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking Authors: Yongzhong Xu
LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights Authors: Kasun Dewage, Marianna Pensky, Suranadi De Silva, Shankadeep Mondal
The Anxiety of Influence: Bloom Filters in Transformer Attention Heads Authors: Peter Balogh
2Mamba2Furious: Linear in Complexity, Competitive in Accuracy Authors: Gabriel Mongaras, Eric C. Larson
Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning Authors: Hien Dang, Pratik Patil, Alessandro Rinaldo
Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees Authors: Itamar Hadad, Guy Katz, Shahaf Bassan
Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum Authors: Minxin Zhang, Yuxuan Liu, Hayden Scheaffer
Revisiting Weight Regularization for Low-Rank Continual Learning Authors: Yaoyue Zheng, Yin Zhang, Joost van de Weijer, Gido M van de Ven, Shaoyi Du, Xuetao Zhang, Zhiqiang Tian
Efficient Remote Prefix Fetching with GPU-native Media ASICs Authors: Liang Mi, Weijun Wang, Jinghan Chen, Ting Cao, Haipeng Dai, Yunxin Liu
ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment Authors: Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao
Training Large Reasoning Models Efficiently via Progressive Thought Encoding Authors: Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao
Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature Authors: Angelo Porrello, Pietro Buzzega, Felix Dangel, Thomas Sommariva, Riccardo Salami, Lorenzo Bonicelli, Simone Calderara
Discovering Universal Activation Directions for PII Leakage in Language Models Authors: Leo Marchyok, Zachary Coalson, Sungho Keum, Sooel Son, Sanghyun Hong
Be Wary of Your Time Series Preprocessing Authors: Sofiane Ennadir, Tianze Wang, Oleg Smirnov, Sahar Asadi, Lele Cao

1. Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

ArXiv ID: 2602.17063

Authors: Akira Sakai, Yuma Ichikawa

Abstract: Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.

Comment: Model Compression and Efficiency: analyzes sign persistence as a bottleneck for sub-bit compression; introduces gap-based init and outward-drift regularizer to reduce effective sign flips.

Relevance: 10 Novelty: 8

2. A Theoretical Framework for Modular Learning of Robust Generative Models

ArXiv ID: 2602.17554

Authors: Corinna Cortes, Mehryar Mohri, Yutao Zhong

Abstract: Training large-scale generative models is resource-intensive and relies heavily on heuristic dataset weighting. We address two fundamental questions: Can we train Large Language Models (LLMs) modularly-combining small, domain-specific experts to match monolithic performance-and can we do so robustly for any data mixture, eliminating heuristic tuning? We present a theoretical framework for modular generative modeling where a set of pre-trained experts are combined via a gating mechanism. We define the space of normalized gating functions, $G_{1}$, and formulate the problem as a minimax game to find a single robust gate that minimizes divergence to the worst-case data mixture. We prove the existence of such a robust gate using Kakutani's fixed-point theorem and show that modularity acts as a strong regularizer, with generalization bounds scaling with the lightweight gate's complexity. Furthermore, we prove that this modular approach can theoretically outperform models retrained on aggregate data, with the gap characterized by the Jensen-Shannon Divergence. Finally, we introduce a scalable Stochastic Primal-Dual algorithm and a Structural Distillation method for efficient inference. Empirical results on synthetic and real-world datasets confirm that our modular architecture effectively mitigates gradient conflict and can robustly outperform monolithic baselines.

Comment: Matches: Model Architecture (Mixture-of-Experts) — theoretical framework and minimax robust gating with generalization bounds and modularity.

Relevance: 10 Novelty: 8

3. On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

ArXiv ID: 2602.16849

Authors: Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang

Abstract: We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the "winner" determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.

Comment: Representation Learning/Training Dynamics: mechanistic theory of modular addition, Fourier features, lottery-ticket selection, and grokking stages.

Relevance: 9 Novelty: 9

4. VP-VAE: Rethinking Vector Quantization via Adaptive Vector Perturbation

ArXiv ID: 2602.17133

Authors: Linwei Zhai, Han Ding, Mingzhi Lin, Cui Zhao, Fei Wang, Ge Wang, Wang Zhi, Wei Xi

Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental to modern generative modeling, yet they often suffer from training instability and "codebook collapse" due to the inherent coupling of representation learning and discrete codebook optimization. In this paper, we propose VP-VAE (Vector Perturbation VAE), a novel paradigm that decouples representation learning from discretization by eliminating the need for an explicit codebook during training. Our key insight is that, from the neural network's viewpoint, performing quantization primarily manifests as injecting a structured perturbation in latent space. Accordingly, VP-VAE replaces the non-differentiable quantizer with distribution-consistent and scale-adaptive latent perturbations generated via Metropolis--Hastings sampling. This design enables stable training without a codebook while making the model robust to inference-time quantization error. Moreover, under the assumption of approximately uniform latent variables, we derive FSP (Finite Scalar Perturbation), a lightweight variant of VP-VAE that provides a unified theoretical explanation and a practical improvement for FSQ-style fixed quantizers. Extensive experiments on image and audio benchmarks demonstrate that VP-VAE and FSP improve reconstruction fidelity and achieve substantially more balanced token usage, while avoiding the instability inherent to coupled codebook training.

Comment: Model Architecture/Representation Learning: replaces codebook in VQ-VAE with adaptive latent perturbations (VP-VAE) and introduces FSP for fixed quantizers, stabilizing training and improving token usage.

Relevance: 9 Novelty: 8

5. One-step Language Modeling via Continuous Denoising

ArXiv ID: 2602.16813

Authors: Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim

Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.

Comment: Model Architecture and Efficiency: flow-based continuous denoising for language modeling with few-/one-step generation via distilled flow map; challenges discrete diffusion assumptions.

Relevance: 9 Novelty: 8

6. Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

ArXiv ID: 2602.16994

Authors: Rahul Thomas, Teo Kitanovski, Micah Goldblum, Arka Pal

Abstract: Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applies a verification algorithm that accepts a subset of these. While prior work has proposed various verification algorithms for i.i.d rollouts, their relative performance under matched settings remains unclear. In this work, we firstly present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes, and find that Traversal Verification dominates consistently, with OT-based methods lagging far behind. Our analysis uncovers that this occurs because OT-based methods achieve high multi-token acceptance near the root of the draft tree, while multi-token gains are most impactful deeper in the draft tree, where draft and target distributions diverge. Based on this insight, we propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point. We show that delayed tree expansion preserves the target distribution and improves on root-node i.i.d rollouts. Further, we develop a dynamic neural selector that estimates the expected block efficiency of optimal-transport-based verification methods from draft and target features, enabling context-dependent expansion decisions. Our neural selector allows OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across a wide range of models, datasets, and sampling settings.

Comment: Model Efficiency: algorithmic improvements to speculative decoding (delayed tree expansion, dynamic selector) for faster LLM sampling.

Relevance: 9 Novelty: 8

7. Canonicalizing Multimodal Contrastive Representation Learning

ArXiv ID: 2602.17584

Authors: Sharut Gupta, Sanyam Kansal, Stefanie Jegelka, Phillip Isola, Vikas Garg

Abstract: As models and data scale, independently trained networks often induce analogous notions of similarity. But, matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality, but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (with encoders $(f, g)$ and $(\widetilde{f},\widetilde{g})$) -- trained on different distributions and with different architectures -- does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global mean shift), i.e., there exists an orthogonal map $Q$ where $Q^\top Q = I$ such that $\widetilde{f}(x)\approx Q f(x)$ for paired images $x$. Strikingly, the same $Q$ simultaneously aligns the text encoders i.e., $\widetilde{g}(y)\approx Q g(y)$ for texts $y$. Theoretically, we prove that if the multimodal kernel agrees across models on a small anchor set i.e. $\langle f(x), g(y)\rangle \approx \langle \widetilde{f}(x), \widetilde{g}(y)\rangle$, then the two models must be related by a single orthogonal map $Q$ and the same $Q$ maps images and text across models. More broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations. Our project page: https://canonical-multimodal.github.io/

Comment: Matches: Representation Learning — discovers orthogonal canonical maps aligning independent multimodal contrastive encoders; theory and practice.

Relevance: 9 Novelty: 8

8. Early-Warning Signals of Grokking via Loss-Landscape Geometry

ArXiv ID: 2602.16967

Authors: Yongzhong Xu

Abstract: Grokking -- the abrupt transition from memorization to generalization after prolonged training -- has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect -- a curvature measure derived from non-commuting gradient updates -- rises well before generalization, with lead times following a superlinear power law (alpha approximately 1.18 for SCAN, approximately 1.13 for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity accelerates grokking (roughly 32% on SCAN, roughly 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it. The three task families form a spectrum of causal sensitivity -- modular arithmetic is rigid, Dyck is responsive, SCAN is intermediate -- yet suppression delays or prevents grokking in all cases, establishing necessity as a universal finding. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.

Comment: Representation Learning/Training dynamics: proposes commutator-defect curvature as a causal early-warning signal for grokking in transformers.

Relevance: 9 Novelty: 8

9. Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization

ArXiv ID: 2602.17155

Authors: Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, Sijia Liu

Abstract: Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon optimizer in the ZO setting. Extensive experiments on large language models (LLMs) and vision transformers (ViTs) demonstrate that ZO-Muon significantly accelerates convergence and achieves a win-win improvement in accuracy and query/runtime efficiency. Notably, compared to the popular MeZO baseline, ZO-Muon requires only 24.7% of the queries to reach the same SST-2 performance for LLM fine-tuning, and improves accuracy by 25.1% on ViT-B fine-tuning on CIFAR-100.

Comment: Model Compression and Efficiency: ZO fine-tuning via subspace projection and spectral gradient orthogonalization (low-rank Muon) improving query/memory efficiency.

Relevance: 9 Novelty: 8

10. Learning with Boolean threshold functions

ArXiv ID: 2602.17493

Authors: Veit Elser, Manish Krishan Lal

Abstract: We develop a method for training neural networks on Boolean data in which the values at all nodes are strictly $\pm 1$, and the resulting models are typically equivalent to networks whose nonzero weights are also $\pm 1$. The method replaces loss minimization with a nonconvex constraint formulation. Each node implements a Boolean threshold function (BTF), and training is expressed through a divide-and-concur decomposition into two complementary constraints: one enforces local BTF consistency between inputs, weights, and output; the other imposes architectural concurrence, equating neuron outputs with downstream inputs and enforcing weight equality across training-data instantiations of the network. The reflect-reflect-relax (RRR) projection algorithm is used to reconcile these constraints. Each BTF constraint includes a lower bound on the margin. When this bound is sufficiently large, the learned representations are provably sparse and equivalent to networks composed of simple logical gates with $\pm 1$ weights. Across a range of tasks -- including multiplier-circuit discovery, binary autoencoding, logic-network inference, and cellular automata learning -- the method achieves exact solutions or strong generalization in regimes where standard gradient-based methods struggle. These results demonstrate that projection-based constraint satisfaction provides a viable and conceptually distinct foundation for learning in discrete neural systems, with implications for interpretability and efficient inference.

Comment: Model Architecture/Representation: trains Boolean-threshold networks via projection-based constraint satisfaction, yielding sparse ±1-weight logical circuits.

Relevance: 9 Novelty: 8

11. Arcee Trinity Large Technical Report

ArXiv ID: 2602.17004

Authors: Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Fern, Aria Harley, Conner Stewart, Colin Kealty, Maziyar Panahi, Simon Kirsten, Anushka Deshpande, Anneketh Vij, Arthur Bresnu, Pranav Veldurthi, Raghav Ravishankar, Hardik Bishnoi, DatologyAI Team, Arcee AI Team, Prime Intellect Team, Mark McQuade, Johannes Hagemann, Lucas Atkins

Abstract: We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini, with Trinity Nano having 6B total parameters with 1B activated per token, Trinity Mini having 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy titled Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.

Comment: Model Architecture (MoE): introduces a sparse MoE LLM with sigmoid routing and a new MoE load-balancing method (SMEBU); High-Performance Training details at scale.

Relevance: 9 Novelty: 7

12. Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

ArXiv ID: 2602.16746

Authors: Yongzhong Xu

Abstract: Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokking reflects escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation. All findings replicate across this learning-rate range, a qualitatively different slow regime (lr=5e-5, wd=0.1, 3 layers), and three random seeds, though alignment dynamics differ quantitatively between regimes. Causal intervention experiments establish that orthogonal gradient flow is necessary but not sufficient for grokking: suppressing it prevents generalization with a monotonic dose-response across four operations, while artificially boosting curvature defects has no effect.

Comment: Matches: Representation Learning — geometric analysis of training dynamics (low-dimensional subspace, curvature) explaining grokking in Transformers.

Relevance: 9 Novelty: 7

13. LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights

ArXiv ID: 2602.17510

Authors: Kasun Dewage, Marianna Pensky, Suranadi De Silva, Shankadeep Mondal

Abstract: We introduce CRAFT (Cross-layer Rank Adaptation via Frozen Tucker), a parameter-efficient fine-tuning (PEFT) method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices on the resulting frozen Tucker factors. Existing tensor-based PEFT methods decompose gradient updates: LoTR applies Tucker decomposition with shared factor matrices, while SuperLoRA groups and reshapes $\Delta W$ across layers before applying Tucker decomposition. Separately, methods like PiSSA apply SVD to pre-trained weights but operate independently per layer. CRAFT bridges these two lines of work: it performs full Tucker decomposition via Higher-Order SVD (HOSVD) directly on pre-trained weights organized as cross-layer 3D tensors, freezes all resulting factors, and adapts the model through lightweight trainable transformations applied to each factor matrix. Experiments on the GLUE benchmark using RoBERTa-base and RoBERTa-large demonstrate that CRAFT achieves competitive performance with existing methods while requiring only 41K Tucker adaptation parameters--a count independent of model dimension and depth at fixed Tucker ranks.

Comment: Matches: Model Compression and Efficiency — PEFT via cross-layer Tucker decomposition of pre-trained attention weights (low-rank/tensor).

Relevance: 9 Novelty: 7

14. The Anxiety of Influence: Bloom Filters in Transformer Attention Heads

ArXiv ID: 2602.17526

Authors: Peter Balogh

Abstract: Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, medium, and large; Pythia-160M) and show that they form a spectrum of membership-testing strategies. Two heads (L0H1 and L0H5 in GPT-2 small) function as high-precision membership filters with false positive rates of 0-4\% even at 180 unique context tokens -- well above the $d_\text{head} = 64$ bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula $p \approx (1 - e^{-kn/m})^k$ with $R^2 = 1.0$ and fitted capacity $m \approx 5$ bits, saturating by $n \approx 20$ unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix-attention head after confound controls revealed its apparent capacity curve was a sequence-length artifact. Together, the three genuine membership-testing heads form a multi-resolution system concentrated in early layers (0-1), taxonomically distinct from induction and previous-token heads, with false positive rates that decay monotonically with embedding distance -- consistent with distance-sensitive Bloom filters. These heads generalize broadly: they respond to any repeated token type, not just repeated names, with 43\% higher generalization than duplicate-token-only heads. Ablation reveals these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles. The reclassification of L3H0 through confound controls strengthens rather than weakens the case: the surviving heads withstand the scrutiny that eliminated a false positive in our own analysis.

Comment: Representation Learning/Mechanistic interpretability: shows specific transformer heads act as membership testers (Bloom filter-like) and analyzes capacity/behavior.

Relevance: 9 Novelty: 7

15. 2Mamba2Furious: Linear in Complexity, Competitive in Accuracy

ArXiv ID: 2602.17363

Authors: Gabriel Mongaras, Eric C. Larson

Abstract: Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and results in reduced accuracy compared to softmax attention. To bridge the accuracy gap between softmax attention and linear attention, we manipulate Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. From this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention, yet much more memory efficient for long context lengths. We also investigate elements to Mamba-2 that help surpass softmax attention accuracy. Code is provided for all our experiments

Comment: Model Architecture and Efficiency: modified Mamba-2 (linear attention/SSM) to approach softmax accuracy while maintaining linear complexity.

Relevance: 9 Novelty: 7

16. Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

ArXiv ID: 2602.17565

Authors: Hien Dang, Pratik Patil, Alessandro Rinaldo

Abstract: Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in unconstrained setting in which the mixing weight $\xi$ may be outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $\lambda > 0$ at which the teacher ridge risk $R(\lambda)$ is nonstationary (i.e., $R'(\lambda) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $\xi^\star(\lambda)$ for any value of $\lambda$ and show that it obeys the sign rule: $\operatorname{sign}(\xi^\star(\lambda))=-\operatorname{sign}(R'(\lambda))$. In particular, $\xi^\star(\lambda)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge but their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $\xi^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.

Comment: Representation Learning/Training Dynamics: rigorous theory for self-distillation in ridge regression with closed-form optimal mixing and precise asymptotics; one-shot tuning method.

Relevance: 8 Novelty: 8

17. Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

ArXiv ID: 2602.16823

Authors: Itamar Hadad, Guy Katz, Shahaf Bassan

Abstract: Automated circuit discovery is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with provable guarantees. We focus on three types of guarantees: (1) input domain robustness, ensuring the circuit agrees with the model across a continuous input region; (2) robust patching, certifying circuit alignment under continuous patching perturbations; and (3) minimality, formalizing and capturing a wide array of various notions of succinctness. Interestingly, we uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms. Finally, we conduct experiments with state-of-the-art verifiers on various vision models, showing that our algorithms yield circuits with substantially stronger robustness guarantees than standard circuit discovery methods, establishing a principled foundation for provable circuit discovery.

Comment: Representation Learning: mechanistic interpretability with algorithms and provable robustness/minimality guarantees for circuit discovery.

Relevance: 8 Novelty: 8

18. Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum

ArXiv ID: 2602.17080

Authors: Minxin Zhang, Yuxuan Liu, Hayden Scheaffer

Abstract: Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers' matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.

Comment: High Performance Computing/Optimization: introduces NAMO/NAMO-D integrating orthogonalized momentum with Adam-style noise adaptation, with convergence guarantees for large-model training.