Personalized Daily ArXiv Papers 2026-02-18

[gpt-5]	Prompt	Completion	Total
Token	37457	34563	72020
Cost	$0.05	$0.35	$0.39

Total arXiv papers: 441

Total scanned papers: 249

Total relevant papers: 28

Table of contents with paper titles:

A unified theory of feature learning in RNNs and DNNs Authors: Jan P. Bauer, Kirsten Fischer, Moritz Helias, Agostina Palmigiano
Approximation Theory for Lipschitz Continuous Transformers Authors: Takashi Furuya, Davide Murari, Carola-Bibiane Schönlieb
ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns Authors: Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng
Avey-B Authors: Devang Acharya, Mohammad Hammoud
COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Baher Mohammad, Stamatios Lefkimmiatis
1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization Authors: Sohir Maskey, Constantin Eichenberg, Johannes Messner, Douglas Orr
Uniform error bounds for quantized dynamical models Authors: Abdelkader Metakalard (CRAN, SYNALP), Fabien Lauer (SYNALP, LORIA), Kevin Colin (CRAN), Marion Gilson (CRAN)
Logit Distance Bounds Representational Similarity Authors: Beatrix M. B. Nielsen, Emanuele Marconato, Luigi Gresele, Andrea Dittadi, Simon Buchholz
How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning Authors: Hongxuan Wu, Yukun Zhang, Xueqing Zhou
Continuous-Time Piecewise-Linear Recurrent Neural Networks Authors: Alena Brändle, Lukas Eisenmann, Florian Götz, Daniel Durstewitz
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers Authors: Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie
Beyond ReLU: Bifurcation, Oversmoothing, and Topological Priors Authors: Erkan Turan, Gaspard Abel, Maysam Behmanesh, Emery Pierson, Maks Ovsjanikov
PolyNODE: Variable-dimension Neural ODEs on M-polyfolds Authors: Per Åhag, Alexander Friedrich, Fredrik Ohlsson, Viktor Vigren Näslund
The Information Geometry of Softmax: Probing and Steering Authors: Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch
The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova
Panini: Continual Learning in Token Space via Structured Memory Authors: Shreyas Rajesh, Pavan Holur, Mehmet Yigit Turali, Chenda Duan, Vwani Roychowdhury
Size Transferability of Graph Transformers with Convolutional Positional Encodings Authors: Javier Porras-Valenzuela, Zhiyang Wang, Alejandro Ribeiro
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing Authors: Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad
Universal priors: solving empirical Bayes via Bayesian inference and pretraining Authors: Nick Cannella, Anzo Teh, Yanjun Han, Yury Polyanskiy
Spanning the Visual Analogy Space with a Weight Basis of LoRAs Authors: Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik
The Equalizer: Introducing Shape-Gain Decomposition in Neural Audio Codecs Authors: Samir Sadok, Laurent Girin, Xavier Alameda-Pineda
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts Authors: Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel, Rodrigo Toro Icarte
Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields Authors: Tianyu Xiong, Skylar Wurster, Han-Wei Shen
Neural-POD: A Plug-and-Play Neural Operator Framework for Infinite-Dimensional Functional Nonlinear Proper Orthogonal Decomposition Authors: Changhong Mou, Binghang Lu, Guang Lin
Fast and Effective On-policy Distillation from Reasoning Prefixes Authors: Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman
Complex-Valued Unitary Representations as Classification Heads for Improved Uncertainty Quantification in Deep Neural Networks Authors: Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari
FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations Authors: Zhihao Shu, Md Musfiqur Rahman Sanim, Hangyu Zheng, Kunxiong Zhu, Miao Yin, Gagan Agrawal, Wei Niu
Functional Central Limit Theorem for Stochastic Gradient Descent Authors: Kessang Flamand, Victor-Emmanuel Brunel

1. A unified theory of feature learning in RNNs and DNNs

ArXiv ID: 2602.15593

Authors: Jan P. Bauer, Kirsten Fischer, Moritz Helias, Agostina Palmigiano

Abstract: Recurrent and deep neural networks (RNNs/DNNs) are cornerstone architectures in machine learning. Remarkably, RNNs differ from DNNs only by weight sharing, as can be shown through unrolling in time. How does this structural similarity fit with the distinct functional properties these networks exhibit? To address this question, we here develop a unified mean-field theory for RNNs and DNNs in terms of representational kernels, describing fully trained networks in the feature learning (μP) regime. This theory casts training as Bayesian inference over sequences and patterns, directly revealing the functional implications induced by the RNNs' weight sharing. In DNN-typical tasks, we identify a phase transition when the learning signal overcomes the noise due to randomness in the weights: below this threshold, RNNs and DNNs behave identically; above it, only RNNs develop correlated representations across timesteps. For sequential tasks, the RNNs' weight sharing furthermore induces an inductive bias that aids generalization by interpolating unsupervised time steps. Overall, our theory offers a way to connect architectural structure to functional biases.

Comment: Representation learning/training dynamics: unified mean-field theory linking RNNs and DNNs via representational kernels in the μP regime.

Relevance: 10 Novelty: 9

2. Approximation Theory for Lipschitz Continuous Transformers

ArXiv ID: 2602.15503

Authors: Takashi Furuya, Davide Murari, Carola-Bibiane Schönlieb

Abstract: Stability and robustness are critical for deploying Transformers in safety-sensitive settings. A principled way to enforce such behavior is to constrain the model's Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz continuous Transformer architectures.

Comment: Model Architecture/Theory: constructs Lipschitz-continuous Transformer blocks via gradient-flow Euler steps and proves universal approximation under Lipschitz constraints.

Relevance: 10 Novelty: 9
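
The abstract's idea of realizing an MLP block as an explicit Euler step of a negative gradient flow can be illustrated with a generic construction. The sketch below is an assumption-laden illustration, not the paper's architecture: the function name `gradient_flow_step`, the ReLU choice, and the step-size rule are all chosen here for the example, and the attention block is not covered. It uses the standard fact that a gradient step on a convex function with an L-Lipschitz gradient is non-expansive when the step size is at most 2/L.

```python
import numpy as np

def gradient_flow_step(x, W, b, h):
    """One explicit Euler step of x' = -W^T relu(W x + b).

    The right-hand side is minus the gradient of the convex potential
    f(x) = 0.5 * sum(relu(W x + b) ** 2), whose gradient is
    ||W||_2^2-Lipschitz.
    """
    return x - h * W.T @ np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
b = rng.normal(size=16)

# Hypothetical step-size choice: any h <= 2 / ||W||_2^2 keeps the step
# non-expansive, i.e. 1-Lipschitz in the Euclidean norm.
h = 1.0 / np.linalg.norm(W, 2) ** 2

x, y = rng.normal(size=8), rng.normal(size=8)
out_gap = np.linalg.norm(gradient_flow_step(x, W, b, h) - gradient_flow_step(y, W, b, h))
ratio = out_gap / np.linalg.norm(x - y)
print(f"output gap / input gap = {ratio:.3f}")
```

Stacking such steps keeps the composition 1-Lipschitz, which is the sense in which stability holds by construction in this toy setting.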

3. ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns

ArXiv ID: 2602.15521

Authors: Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng

Abstract: Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning that converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches that use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.

Comment: MoE Architecture: training-free dense-to-MoE conversion using GLU activation patterns to form shared and routed experts without breaking activation regularities.

Relevance: 10 Novelty: 8

4. Avey-B

ArXiv ID: 2602.15814

Authors: Devang Acharya, Mohammad Hammoud

Abstract: Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.

Comment: Model Architecture: proposes an attention-free encoder-only alternative with decoupled static/dynamic parameterizations, stability-oriented normalization, and neural compression for efficient long-context encoding.

Relevance: 10 Novelty: 8

5. COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression

ArXiv ID: 2602.15200

Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Baher Mohammad, Stamatios Lefkimmiatis

Abstract: Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but existing approaches often suffer from iterative dictionary and coefficient updates. We propose COMPOT (Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers), a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed-form Procrustes updates for the dictionary and analytical single-step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one-shot dynamic allocation strategy that adaptively redistributes layer-wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines, while remaining fully compatible with post-training quantization for extreme compression. Code is available at https://github.com/mts-ai/COMPOT.

Comment: Model Compression and Efficiency: training-free sparse factorization for Transformer compression using orthogonal dictionaries with closed-form Procrustes updates and one-shot dynamic layer-wise budget allocation.

Relevance: 10 Novelty: 8
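
The two closed-form ingredients named in the abstract, a Procrustes update for an orthogonal dictionary and single-step sparse coding, can be sketched on a toy matrix. This is a minimal illustration of the textbook versions of those two steps, not the COMPOT implementation: the function names, the identity initialization, the per-column sparsity `k`, and the absence of any calibration data or layer-wise budget allocation are all assumptions made for the example.

```python
import numpy as np

def sparse_code(D, W, k):
    """For orthogonal D, the best k-sparse code of each column of W is
    D^T W with all but the k largest-magnitude entries zeroed (k >= 1)."""
    Z = D.T @ W
    drop = np.argsort(np.abs(Z), axis=0)[:-k, :]  # smallest n-k entries per column
    S = Z.copy()
    np.put_along_axis(S, drop, 0.0, axis=0)
    return S

def procrustes_update(W, S):
    """Orthogonal Procrustes: argmin over orthogonal D of ||W - D S||_F
    is U V^T, where W S^T = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(W @ S.T)
    return U @ Vt

def rel_err(W, D, S):
    return np.linalg.norm(W - D @ S) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))   # toy stand-in for a weight matrix
k = 3                          # hypothetical sparsity per column

D0 = np.eye(8)                 # assumed initialization
S0 = sparse_code(D0, W, k)
err0 = rel_err(W, D0, S0)

D1 = procrustes_update(W, S0)  # closed-form dictionary step
err1 = rel_err(W, D1, S0)

S1 = sparse_code(D1, W, k)     # closed-form coding step
err2 = rel_err(W, D1, S1)
print(err0, err1, err2)
```

Because each step exactly minimizes the reconstruction error with the other factor held fixed, the error cannot increase from one step to the next in this toy setting.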

6. 1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization

ArXiv ID: 2602.15563

Authors: Sohir Maskey, Constantin Eichenberg, Johannes Messner, Douglas Orr

Abstract: Quantization-aware training (QAT) is an effective method to drastically reduce the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, the optimal choice of quantization format and bit-width presents a challenge in practice. The full design space of quantization is not fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with 1-bit quantized weights.

Comment: Model Compression and Efficiency: low-bit QAT with k-means weight quantization; demonstrates efficient 1-bit weight regimes under fixed memory budgets.

Relevance: 10 Novelty: 7
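
To make the contrast between k-means and integer-grid quantization concrete, the sketch below quantizes a synthetic Gaussian weight vector to four levels (2 bits) both ways and compares reconstruction error. It is only an illustration of plain 1-D Lloyd iterations under assumptions chosen here (quantile initialization, a min-max uniform grid as the integer-style baseline, synthetic weights); it does not reproduce the paper's QAT setup or its downstream evaluations.

```python
import numpy as np

def kmeans_quantize(w, n_levels, n_iter=25):
    """1-D Lloyd's algorithm: returns (codebook, integer codes)."""
    flat = w.ravel()
    # Assumed initialization: evenly spaced quantiles of the weights.
    codebook = np.quantile(flat, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(n_levels):
            members = flat[codes == j]
            if members.size:                 # leave empty clusters unchanged
                codebook[j] = members.mean()
    codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, codes.reshape(w.shape)

def uniform_quantize(w, n_levels):
    """Baseline: nearest level on an evenly spaced min-max grid."""
    grid = np.linspace(w.min(), w.max(), n_levels)
    codes = np.abs(w.ravel()[:, None] - grid[None, :]).argmin(axis=1)
    return grid, codes.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                    # synthetic stand-in for weights

codebook, codes = kmeans_quantize(w, n_levels=4)
grid, grid_codes = uniform_quantize(w, n_levels=4)

mse_kmeans = np.mean((w - codebook[codes]) ** 2)
mse_uniform = np.mean((w - grid[grid_codes]) ** 2)
print(f"k-means MSE: {mse_kmeans:.4f}, uniform-grid MSE: {mse_uniform:.4f}")
```

At inference time the stored tensor is just `codes` plus the small `codebook`, so dequantization is a table lookup.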

7. Uniform error bounds for quantized dynamical models

ArXiv ID: 2602.15586

Authors: Abdelkader Metakalard (CRAN, SYNALP), Fabien Lauer (SYNALP, LORIA), Kevin Colin (CRAN), Marion Gilson (CRAN)

Abstract: This paper provides statistical guarantees on the accuracy of dynamical models learned from dependent data sequences. Specifically, we develop uniform error bounds that apply to quantized models and imperfect optimization algorithms commonly used in practical contexts for system identification, and in particular hybrid system identification. Two families of bounds are obtained: slow-rate bounds via a block decomposition and fast-rate, variance-adaptive, bounds via a novel spaced-point strategy. The bounds scale with the number of bits required to encode the model and thus translate hardware constraints into interpretable statistical complexities.

Comment: Compression/quantization theory: uniform error bounds for quantized dynamical models, with complexity scaling in bits (hardware–statistical link).

Relevance: 9 Novelty: 8

8. Logit Distance Bounds Representational Similarity

ArXiv ID: 2602.15438

Authors: Beatrix M. B. Nielsen, Emanuele Marconato, Luigi Gresele, Andrea Dittadi, Simon Buchholz

Abstract: For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models' identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher's predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher's linearly recoverable concepts.

Comment: Representation learning theory: establishes a logit-distance that bounds linear representational dissimilarity; implications for distillation beyond KL.

Relevance: 9 Novelty: 8

9. How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

ArXiv ID: 2602.15580

Authors: Hongxuan Wu, Yukun Zhang, Xueqing Zhou

Abstract: When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation -- and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce PID Flow, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent modal transduction pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82% of the final prediction, and cross-modal synergy remains below 2%. This trajectory is highly stable across model variants (layer-wise correlations > 0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image→Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost -- effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.

Comment: Representation learning analysis: layer-wise PID quantifies vision, language, and synergy flows in multimodal Transformers; training dynamics insights.

Relevance: 9 Novelty: 8

10. Continuous-Time Piecewise-Linear Recurrent Neural Networks

ArXiv ID: 2602.15649

Authors: Alena Brändle, Lukas Eisenmann, Florian Götz, Daniel Durstewitz

Abstract: In dynamical systems reconstruction (DSR) we aim to recover the dynamical system (DS) underlying observed time series. Specifically, we aim to learn a generative surrogate model which approximates the underlying, data-generating DS, and recreates its long-term properties ('climate statistics'). In scientific and medical areas, in particular, these models need to be mechanistically tractable -- through their mathematical analysis we would like to obtain insight into the recovered system's workings. Piecewise-linear (PL), ReLU-based RNNs (PLRNNs) have a strong track-record in this regard, representing SOTA DSR models while allowing mathematical insight by virtue of their PL design. However, all current PLRNN variants are discrete-time maps. This is in disaccord with the assumed continuous-time nature of most physical and biological processes, and makes it hard to accommodate data arriving at irregular temporal intervals. Neural ODEs are one solution, but they do not reach the DSR performance of PLRNNs and often lack their tractability. Here we develop theory for continuous-time PLRNNs (cPLRNNs): We present a novel algorithm for training and simulating such models, bypassing numerical integration by efficiently exploiting their PL structure. We further demonstrate how important topological objects like equilibria or limit cycles can be determined semi-analytically in trained models. We compare cPLRNNs to both their discrete-time cousins as well as Neural ODEs on DSR benchmarks, including systems with discontinuities which come with hard thresholds.

Comment: Model Architecture: introduces continuous-time piecewise-linear RNNs with a training/simulation algorithm that exploits PL structure, improving tractability and efficiency over Neural ODEs.

Relevance: 9 Novelty: 8

11. On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

ArXiv ID: 2602.15322

Authors: Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie

Abstract: Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.

Comment: High Performance Computing/Optimization: masked adaptive updates (Magma) provide a simple, efficient optimizer improving LLM pretraining with curvature-regularization effects.

Relevance: 9 Novelty: 8
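
The "masked variant of RMSProp" mentioned in the abstract can be pictured as an ordinary RMSProp step in which a random subset of coordinates is skipped. The sketch below is a guess at the simplest such variant and is not Magma: the Bernoulli mask, the `keep_prob` value, the lack of rescaling, and the decision to keep updating the second-moment estimate on masked coordinates are all assumptions made for the example, and the momentum-gradient alignment modulation is not modelled.

```python
import numpy as np

def masked_rmsprop_step(param, grad, v, rng, lr=1e-2, beta=0.99,
                        eps=1e-8, keep_prob=0.5):
    """RMSProp step where each coordinate is updated with probability keep_prob."""
    v = beta * v + (1.0 - beta) * grad ** 2      # dense second-moment estimate
    update = lr * grad / (np.sqrt(v) + eps)      # standard RMSProp update
    mask = rng.random(param.shape) < keep_prob   # assumed Bernoulli mask
    return param - mask * update, v

# Toy problem: minimize 0.5 * ||x||^2, whose gradient is x itself.
rng = np.random.default_rng(0)
x = rng.normal(size=10)
v = np.zeros_like(x)
initial_loss = 0.5 * np.sum(x ** 2)

for _ in range(2000):
    x, v = masked_rmsprop_step(x, x, v, rng)

final_loss = 0.5 * np.sum(x ** 2)
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

The toy run only shows that the masked update still makes progress; it says nothing about the curvature-dependent regularization the paper analyzes.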

12. Beyond ReLU: Bifurcation, Oversmoothing, and Topological Priors

ArXiv ID: 2602.15634

Authors: Erkan Turan, Gaspard Abel, Maysam Behmanesh, Emery Pierson, Maks Ovsjanikov

Abstract: Graph Neural Networks (GNNs) learn node representations through iterative network-based message-passing. While powerful, deep GNNs suffer from oversmoothing, where node features converge to a homogeneous, non-informative state. We re-frame this problem of representational collapse from a bifurcation theory perspective, characterizing oversmoothing as convergence to a stable "homogeneous fixed point." Our central contribution is the theoretical discovery that this undesired stability can be broken by replacing standard monotone activations (e.g., ReLU) with a class of functions. Using Lyapunov-Schmidt reduction, we analytically prove that this substitution induces a bifurcation that destabilizes the homogeneous state and creates a new pair of stable, non-homogeneous patterns that provably resist oversmoothing. Our theory predicts a precise, nontrivial scaling law for the amplitude of these emergent patterns, which we quantitatively validate in experiments. Finally, we demonstrate the practical utility of our theory by deriving a closed-form, bifurcation-aware initialization and showing its utility in real benchmark experiments.

Comment: Model Architecture/Theory: introduces a class of non-monotone activations to induce bifurcations that mitigate GNN oversmoothing, with initialization derived from theory.

Relevance: 9 Novelty: 8

13. PolyNODE: Variable-dimension Neural ODEs on M-polyfolds

ArXiv ID: 2602.15128

Authors: Per Åhag, Alexander Friedrich, Fredrik Ohlsson, Viktor Vigren Näslund

Abstract: Neural ordinary differential equations (NODEs) are geometric deep learning models based on dynamical systems and flows generated by vector fields on manifolds. Despite numerous successful applications, particularly within the flow matching paradigm, all existing NODE models are fundamentally constrained to fixed-dimensional dynamics by the intrinsic nature of the manifold's dimension. In this paper, we extend NODEs to M-polyfolds (spaces that can simultaneously accommodate varying dimensions and a notion of differentiability) and introduce PolyNODEs, the first variable-dimensional flow-based model in geometric deep learning. As an example application, we construct explicit M-polyfolds featuring dimensional bottlenecks and PolyNODE autoencoders based on parametrised vector fields that traverse these bottlenecks. We demonstrate experimentally that our PolyNODE models can be trained to solve reconstruction tasks in these spaces, and that latent representations of the input can be extracted and used to solve downstream classification tasks. The code used in our experiments is publicly available at https://github.com/turbotage/PolyNODE .

Comment: Model Architecture: extends Neural ODEs to variable-dimension flows on M-polyfolds (PolyNODE), enabling dimensional bottlenecks and new autoencoder designs.

Relevance: 9 Novelty: 8

14. The Information Geometry of Softmax: Probing and Steering

ArXiv ID: 2602.15293

Authors: Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch

Abstract: This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.

Comment: Representation learning and geometry: information geometry of softmax representations with a principled steering method (dual steering).

Relevance: 9 Novelty: 7

15. The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

ArXiv ID: 2602.15799

Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

Abstract: Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: we show this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods, and we hope will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.

Comment: Training Dynamics: geometric analysis reveals curvature-driven alignment collapse under fine-tuning, with an instability condition and quartic scaling law.

Relevance: 8 Novelty: 8

16. Panini: Continual Learning in Token Space via Structured Memory

ArXiv ID: 2602.15156

Authors: Shreyas Rajesh, Pavan Holur, Mehmet Yigit Turali, Chenda Duan, Vwani Roychowdhury

Abstract: Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) -- an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%-7% higher than other competitive baselines, while using 2-30x fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time -- as achieved by the GSW framework -- yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.

Comment: Continual learning/structured memory: non-parametric continual learning with a fixed base model, storing documents as Generative Semantic Workspaces (entity- and event-aware networks of QA pairs) and retrieving inference chains instead of verbatim chunks.

Relevance: 8 Novelty: 8

17. Size Transferability of Graph Transformers with Convolutional Positional Encodings

ArXiv ID: 2602.15239

Authors: Javier Porras-Valenzuela, Zhiyang Wang, Alejandro Ribeiro

Abstract: Transformers have achieved remarkable success across domains, motivating the rise of Graph Transformers (GTs) as attention-based architectures for graph-structured data. A key design choice in GTs is the use of Graph Neural Network (GNN)-based positional encodings to incorporate structural information. In this work, we study GTs through the lens of manifold limit models for graph sequences and establish a theoretical connection between GTs with GNN positional encodings and Manifold Neural Networks (MNNs). Building on transferability results for GNNs under manifold convergence, we show that GTs inherit transferability guarantees from their positional encodings. In particular, GTs trained on small graphs provably generalize to larger graphs under mild assumptions. We complement our theory with extensive experiments on standard graph benchmarks, demonstrating that GTs exhibit scalable behavior on par with GNNs. To further show the efficiency in a real-world scenario, we implement GTs for shortest path distance estimation over terrains to better illustrate the efficiency of the transferable GTs. Our results provide new insights into the understanding of GTs and suggest practical directions for efficient training of GTs in large-scale settings.

Comment: Model Architecture/Theory: links Graph Transformers with GNN positional encodings to manifold neural networks, establishing size transferability guarantees.

Relevance: 8 Novelty: 8
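
As a rough illustration of why convolutional positional encodings can carry over between graph sizes, the sketch below applies one polynomial graph filter, whose coefficients do not depend on the number of nodes, to two ring graphs of different sizes. This is a minimal NumPy sketch under stated assumptions, not the authors' code: the function names, the normalized adjacency as shift operator, the ring graphs, the coefficients, and the cosine input signal are all hypothetical choices made for illustration.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (assumed shift operator)."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]

def conv_positional_encoding(adj, signal, coeffs):
    """Polynomial graph filter: sum_k coeffs[k] * S^k @ signal."""
    shift = normalized_adjacency(adj)
    out = np.zeros_like(signal, dtype=float)
    power = signal.astype(float)
    for h in coeffs:
        out += h * power
        power = shift @ power  # next power of the shift applied to the signal
    return out

def ring_graph(n):
    """Adjacency matrix of an undirected ring with n >= 3 nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

def sample_signal(n):
    """One cosine period sampled at n evenly spaced node positions."""
    t = np.arange(n) / n
    return np.cos(2 * np.pi * t)[:, None]

# The same filter coefficients are reused on both graph sizes.
coeffs = [0.5, 0.3, 0.2]
pe_small = conv_positional_encoding(ring_graph(20), sample_signal(20), coeffs)
pe_large = conv_positional_encoding(ring_graph(200), sample_signal(200), coeffs)
```

In this toy setting the two encodings take similar values at corresponding positions because the filter has no size-dependent parameters. The paper's guarantees concern manifold limits of graph sequences and are not reproduced by this sketch.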

18. CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

ArXiv ID: 2602.15823

Authors: Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad

Abstract: A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing the capability constraint via a Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly, even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.

Comment: High Performance Computing/Optimization: second-order constrained LLM editing using K-FAC and matrix-free low-curvature projections to preserve capabilities.

Relevance: 8 Novelty: 8
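
To make the idea of a matrix-free projector concrete, here is a minimal NumPy sketch of one way Kronecker structure can be exploited. It assumes the layer curvature is approximated, as in K-FAC, by a Kronecker product of an output-side factor and an input-side factor. The function name `low_curvature_projection`, the hard `threshold` rule, and the synthetic covariance factors are assumptions for illustration. The paper's actual projector and constraint may differ.

```python
import numpy as np

def low_curvature_projection(update, act_cov, grad_cov, threshold):
    """Project a layer update (out x in) onto eigen-directions of
    kron(grad_cov, act_cov) whose eigenvalue is at most `threshold`,
    without ever forming the (out*in) x (out*in) projector."""
    eig_a, vec_a = np.linalg.eigh(act_cov)    # input-side factor, (in, in)
    eig_g, vec_g = np.linalg.eigh(grad_cov)   # output-side factor, (out, out)
    # Eigenvalues of a Kronecker product are all pairwise products.
    curvature = np.outer(eig_g, eig_a)        # (out, in)
    keep = curvature <= threshold
    # Coordinates of the update in the Kronecker eigenbasis.
    rotated = vec_g.T @ update @ vec_a
    # Zero out high-curvature coordinates and rotate back.
    return vec_g @ (rotated * keep) @ vec_a.T

# Synthetic stand-ins for the two covariance factors (hypothetical data).
rng = np.random.default_rng(0)
n_in, n_out = 4, 3
acts = rng.normal(size=(64, n_in))
grads = rng.normal(size=(64, n_out))
act_cov = acts.T @ acts / 64
grad_cov = grads.T @ grads / 64

update = rng.normal(size=(n_out, n_in))
threshold = np.median(
    np.outer(np.linalg.eigvalsh(grad_cov), np.linalg.eigvalsh(act_cov))
)
projected = low_curvature_projection(update, act_cov, grad_cov, threshold)
```

The sketch only needs eigendecompositions of the two small factors, so its cost scales with the layer dimensions rather than with their product squared. That is the general motivation for avoiding an explicit projection matrix.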

19. Universal priors: solving empirical Bayes via Bayesian inference and pretraining

ArXiv ID: 2602.15136

Authors: Nick Cannella, Anzo Teh, Yanjun Han, Yury Polyanskiy

Abstract: We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(\frac{1}{n})$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains the phenomenon of length generalization, in which the test sequence length exceeds the training length, as the model performs Bayesian inference using a generalized posterior.

Comment: Matches Representation Learning: provides a theoretical account (universal priors, posterior contraction) for adaptation and length generalization in pretrained transformers, offering foundational insights into training/generalization dynamics.

Relevance: 8 Novelty: 8
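
For readers less familiar with the Poisson empirical Bayes setup, the sketch below computes the quantity an EB estimator targets: the posterior mean of the rate under a prior, which for the Poisson model equals (x + 1) f(x + 1) / f(x) with f the marginal pmf (the identity behind Robbins' estimator). It also shows how regret is measured when an estimator tuned to one prior is used under another. This is a per-observation toy under stated assumptions: the discrete priors, function names, and truncation at `x_max` are hypothetical, and the paper's estimator conditions on the whole sequence rather than a single observation.

```python
import math
import numpy as np

def poisson_marginal(x, atoms, weights):
    """Marginal pmf f_G(x) = sum_k w_k * Poisson(x; theta_k)."""
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    pmf = np.exp(-atoms) * atoms ** x / float(math.factorial(x))
    return float(weights @ pmf)

def bayes_estimator(x, atoms, weights):
    """Posterior mean E[theta | X = x] under the discrete prior G."""
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    posterior = weights * np.exp(-atoms) * atoms ** x
    posterior /= posterior.sum()
    return float(posterior @ atoms)

def robbins_form(x, atoms, weights):
    """Same posterior mean via (x + 1) * f_G(x + 1) / f_G(x)."""
    num = poisson_marginal(x + 1, atoms, weights)
    return (x + 1) * num / poisson_marginal(x, atoms, weights)

def mismatch_regret(train_prior, test_prior, x_max=60):
    """Excess squared error, under the test prior, of using the estimator
    that is Bayes-optimal for the training prior (sum truncated at x_max)."""
    total = 0.0
    for x in range(x_max):
        gap = bayes_estimator(x, *train_prior) - bayes_estimator(x, *test_prior)
        total += poisson_marginal(x, *test_prior) * gap ** 2
    return total

# Hypothetical two-atom priors given as (atoms, weights).
train_prior = ([1.0, 4.0], [0.5, 0.5])
test_prior = ([2.0, 6.0], [0.7, 0.3])
regret = mismatch_regret(train_prior, test_prior)
```

A fixed single-observation estimator like this one incurs positive regret whenever the test prior differs from the training prior. The abstract's claim is that training under a suitable universal prior lets the sequence-level estimator drive this regret down at the stated rate.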

20. Spanning the Visual Analogy Space with a Weight Basis of LoRAs

ArXiv ID: 2602.15727

Authors: Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik

Abstract: Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet $\{\mathbf{a}, \mathbf{a}', \mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives, informally, choosing a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
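
The "point in a space of LoRAs" idea can be sketched as a weighted sum of low-rank updates. The NumPy snippet below is a minimal illustration, not the LoRWeB implementation: the softmax weighting, the random linear stand-in for the encoder, the tensor shapes, and all names are assumptions introduced here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mix_lora_basis(down, up, logits):
    """Combine K basis LoRAs into a single weight delta.

    down:   (K, r, d_in)  down-projection factors
    up:     (K, d_out, r) up-projection factors
    logits: (K,)          per-basis scores from an encoder
    """
    coeffs = softmax(logits)
    # delta = sum_k coeffs[k] * up[k] @ down[k]
    delta = np.einsum("k,kor,kri->oi", coeffs, up, down)
    return delta, coeffs

rng = np.random.default_rng(0)
num_basis, rank, d_in, d_out, d_embed = 3, 1, 8, 6, 5
down = rng.normal(size=(num_basis, rank, d_in))
up = rng.normal(size=(num_basis, d_out, rank))

# Hypothetical stand-in for the encoder: a random linear map applied to
# an embedding of the analogy pair (a, a').
pair_embedding = rng.normal(size=d_embed)
encoder_weights = rng.normal(size=(num_basis, d_embed))
logits = encoder_weights @ pair_embedding

delta, coeffs = mix_lora_basis(down, up, logits)
```

Because the coefficients depend on the input pair, each analogy task gets its own effective adapter. The rank of the combined update is at most the number of basis modules times the per-module rank.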

Comment: Low-rank architecture innovation: learnable basis of LoRA modules with dynamic composition for conditional specialization (aligns with low-rank/architecture efficiency).