Personalized Daily ArXiv Papers 2026-02-19

[gpt-5]	Prompt	Completion	Total
Token	40330	37487	77817
Cost	$0.05	$0.37	$0.43

Total arXiv papers: 500

Total scanned papers: 305

Total relevant papers: 23

Table of contents with paper titles:

Fast KV Compaction via Attention Matching Authors: Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim
Beyond SGD, Without SVD: Proximal Subspace Iteration LoRA with Diagonal Fractional K-FAC Authors: Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horváth, Martin Takáč
MoE-Spec: Expert Budgeting for Efficient Speculative Decoding Authors: Bradley McDanel, Steven Li, Sruthikesh Surineni, Harshit Khaitan
From Growing to Looping: A Unified View of Iterative Computation in LLMs Authors: Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer
Surgical Activation Steering via Generative Causal Mediation Authors: Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell
Synthesis and Verification of Transformer Programs Authors: Hongjian Jiang, Matthew Hague, Philipp Rümmer, Anthony Widjaja Lin
Optimizer choice matters for the emergence of Neural Collapse Authors: Jim Zhao, Tin Sum Cheng, Wojciech Masarczyk, Aurelien Lucchi
FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving Authors: Chia-chi Hsieh, Zan Zong, Xinyang Chen, Jianjiang Li, Jidong Zhai, Lijie Wen
Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks Authors: Jayadev Billa
Beyond Learning: A Training-Free Alternative to Model Adaptation Authors: Namkyung Yoon, Kyeonghyun Yoo, Wooyong Jung, Sanghong Kim, Hwangnam Kim
Neighborhood Stability as a Measure of Nearest Neighbor Searchability Authors: Thomas Vecchiato, Sebastian Bruch
On the Power of Source Screening for Learning Shared Feature Extractors Authors: Leo (Muxing) Wang, Connor Mclaughlin, Lili Su
Are Object-Centric Representations Better At Compositional Generalization? Authors: Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr, Stefan Bauer, Andrea Dittadi
Subtractive Modulative Network with Learnable Periodic Activations Authors: Tiou Wang, Zhuoqian Yang, Markus Flierl, Mathieu Salzmann, Sabine Süsstrunk
Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models Authors: Pranav Bhandari, Usman Naseem, Mehwish Nasim
Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective Authors: Yunhao Liu, Zian Jia, Xinyu Gao, Kanjun Xu, Yun Xiong
The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks Authors: Eitan Gronich, Gal Vardi
LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization Authors: Idil Bilge Altun, Mert Onur Cakiroglu, Elham Buxton, Mehmet Dalkilic, Hasan Kurban
CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill Authors: Bradley McDanel, Steven Li, Harshit Khaitan
FEKAN: Feature-Enriched Kolmogorov-Arnold Networks Authors: Sidharth S. Menon, Ameya D. Jagtap
Geometric Neural Operators via Lie Group-Constrained Latent Dynamics Authors: Jiaquan Zhang, Fachrina Dewi Puspitasari, Songbo Zhang, Yibei Liu, Kuien Liu, Caiyan Qin, Fan Mo, Peng Wang, Yang Yang, Chaoning Zhang
HAWX: A Hardware-Aware FrameWork for Fast and Scalable ApproXimation of DNNs Authors: Samira Nazari, Mohammad Saeed Almasi, Mahdi Taheri, Ali Azarpeyvand, Ali Mokhtari, Ali Mahani, Christian Herglotz
Distributed physics-informed neural networks via domain decomposition for fast flow reconstruction Authors: Yixiao Qian, Jiaxu Liu, Zewei Xia, Song Chen, Chao Xu, Shengze Cai

1. Fast KV Compaction via Attention Matching

ArXiv ID: 2602.16284

Authors: Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim

Abstract: Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.

Comment: Matches Compression/Efficiency: fast KV-cache compaction via attention matching with per-head preservation and closed-form subproblems enabling strong quality-time tradeoffs.

Relevance: 10 Novelty: 9
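
As a rough illustration of the objective named in the abstract above (reproduce attention outputs and preserve attention mass per KV head), the sketch below measures how far a compact cache drifts from the full cache for one query. It is not the authors' method: the function names are hypothetical, and treating "attention mass" as the softmax normalizer is an assumption made here for illustration.

```python
import math


def attention(query, keys, values):
    """Return (output vector, attention mass) for one query on one KV head.

    Assumption: "mass" is the softmax normalizer sum_j exp(q . k_j / sqrt(d)).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = [math.exp(s) for s in scores]
    mass = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * value[i] for w, value in zip(weights, values)) / mass
        for i in range(dim)
    ]
    return output, mass


def matching_error(query, full_kv, compact_kv):
    """Compare a compact (keys, values) pair against the full cache.

    Returns (L2 distance between attention outputs, absolute log-mass gap).
    """
    full_out, full_mass = attention(query, *full_kv)
    comp_out, comp_mass = attention(query, *compact_kv)
    out_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(full_out, comp_out)))
    mass_err = abs(math.log(full_mass) - math.log(comp_mass))
    return out_err, mass_err
```

A compaction routine would aim to drive both numbers toward zero over a set of representative queries; how the compact keys and values are actually constructed is what the paper contributes and is not shown here.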

2. Beyond SGD, Without SVD: Proximal Subspace Iteration LoRA with Diagonal Fractional K-FAC

ArXiv ID: 2602.16456

Authors: Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horváth, Martin Takáč

Abstract: Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. In this work, we address the gap between training with full steps with low-rank projections (SVDLoRA) and LoRA fine-tuning. We propose LoRSum, a memory-efficient subroutine that closes this gap for gradient descent by casting LoRA optimization as a proximal sub-problem and solving it efficiently with alternating least squares updates, which we prove to be an implicit block power method. We recover several recently proposed preconditioning methods for LoRA as special cases, and show that LoRSum can also be used for updating a low-rank momentum. In order to address full steps with preconditioned gradient descent, we propose a scaled variant of LoRSum that uses structured metrics such as K-FAC and Shampoo, and we show that storing the diagonal of these metrics still allows them to perform well while remaining memory-efficient. Experiments on a synthetic task, CIFAR-100, and language-model fine-tuning on GLUE, SQuAD v2, and WikiText-103, show that our method can match or improve LoRA baselines given modest compute overhead, while avoiding full-matrix SVD projections and retaining LoRA-style parameter efficiency.

Comment: Compression/Efficiency: advances LoRA optimization via proximal subspace iteration (LoRSum) and memory-efficient preconditioning (diagonal K-FAC/Shampoo) without full SVD.

Relevance: 10 Novelty: 8
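
The abstract's remark that alternating least squares acts as an implicit power method can be seen in the simplest case, a rank-1 fit. The sketch below is generic textbook ALS for approximating a matrix by an outer product u v^T, not LoRSum itself; the function name and the all-ones initialization are assumptions chosen for illustration.

```python
def als_rank1(matrix, iterations=20):
    """Fit matrix ~= u v^T by alternating least squares (rank 1).

    Each half-step is the exact least-squares solution for one factor with
    the other held fixed. Up to scaling, u <- M v and v <- M^T u are the
    power-iteration updates on M M^T and M^T M.
    Assumes the matrix is nonzero and the all-ones start is not orthogonal
    to the top right singular vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    u = [0.0] * rows
    for _ in range(iterations):
        vv = sum(x * x for x in v)
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) / vv for i in range(rows)]
        uu = sum(x * x for x in u)
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) / uu for j in range(cols)]
    return u, v


def outer(u, v):
    """Return the outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]
```

No SVD is computed at any point: the factors converge toward the dominant singular pair through repeated matrix-vector products alone, which is the property the paper builds on at higher rank.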

3. MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

ArXiv ID: 2602.16052

Authors: Bradley McDanel, Steven Li, Sruthikesh Surineni, Harshit Khaitan

Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10-30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.

Comment: Mixture-of-Experts + Efficiency: verification-time expert budgeting for MoE speculative decoding to cap expert capacity and improve throughput without retraining.

Relevance: 10 Novelty: 8
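
The per-layer capacity limit described in the abstract above can be pictured with a small sketch. The scoring rule (total router weight summed over draft tokens), the renormalization of surviving weights, and the name `budget_experts` are all assumptions made for illustration; the paper may rank and handle dropped experts differently.

```python
from collections import defaultdict


def budget_experts(routing, budget):
    """Keep at most `budget` experts for one layer of a draft tree.

    routing: one dict per draft token, mapping expert id -> router weight.
    Returns (kept expert ids, per-token routing restricted to kept experts).
    Assumption: experts are ranked by total router weight across tokens, and
    each token's surviving weights are renormalized to sum to 1.
    """
    totals = defaultdict(float)
    for token_routes in routing:
        for expert, weight in token_routes.items():
            totals[expert] += weight

    # Highest total weight first; ties broken by expert id for determinism.
    ranked = sorted(totals, key=lambda e: (-totals[e], e))
    kept = set(ranked[:budget])

    pruned = []
    for token_routes in routing:
        kept_routes = {e: w for e, w in token_routes.items() if e in kept}
        norm = sum(kept_routes.values())
        if norm > 0:
            pruned.append({e: w / norm for e, w in kept_routes.items()})
        else:
            pruned.append({})  # every expert this token wanted was dropped
    return kept, pruned
```

With a fixed budget, the number of expert weight loads per layer no longer grows with the size of the draft tree, which is the decoupling of speculation depth from memory cost that the abstract describes.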

4. From Growing to Looping: A Unified View of Iterative Computation in LLMs

ArXiv ID: 2602.16490

Authors: Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer

Abstract: Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to 2x, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.

Comment: Model Architecture: unifies and analyzes looping and depth growth to induce iterative computation in LLMs; shows composability and inference-time looping benefits.

Relevance: 9 Novelty: 8
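
Inference-time looping of a middle block, as described in the abstract above, amounts to a small change in the forward pass. The sketch below uses plain callables as stand-ins for layers; `run_with_looping` and its parameters are hypothetical names, and a real model would pass hidden-state tensors rather than numbers.

```python
def run_with_looping(layers, start, end, loops, hidden):
    """Run a layer stack, repeating layers[start:end] `loops` times.

    layers[:start] and layers[end:] run once each; the middle block is
    reused across depth. loops=1 reproduces the ordinary forward pass.
    """
    for layer in layers[:start]:
        hidden = layer(hidden)
    for _ in range(loops):
        for layer in layers[start:end]:
            hidden = layer(hidden)
    for layer in layers[end:]:
        hidden = layer(hidden)
    return hidden
```

Because the looped block reuses existing weights, the effective depth can be raised at inference without adding parameters.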

5. Surgical Activation Steering via Generative Causal Mediation

ArXiv ID: 2602.16080

Authors: Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell

Abstract: Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.

Comment: Representation Learning/Architecture Control: uses generative causal mediation to localize and steer sparse attention heads for long-form behaviors via targeted activation interventions.

Relevance: 9 Novelty: 8

6. Synthesis and Verification of Transformer Programs

ArXiv ID: 2602.16473

Authors: Hongjian Jiang, Matthew Hague, Philipp Rümmer, Anthony Widjaja Lin

Abstract: C-RASP is a simple programming language that was recently shown to capture concepts expressible by transformers. In this paper, we develop new algorithmic techniques for automatically verifying C-RASPs. To this end, we establish a connection to the verification of synchronous dataflow programs in Lustre, which enables us to exploit state-of-the-art model checkers utilizing highly optimized SMT-solvers. Our second contribution addresses learning a C-RASP program in the first place. To this end, we provide a new algorithm for learning a C-RASP from examples using local search. We demonstrate efficacy of our implementation for benchmarks of C-RASPs in the literature, in particular in connection to the following applications: (1) transformer program optimization, and (2) constrained learning of transformer programs (based on a partial specification).

Comment: Matches Model Architecture/Analysis: formal verification and synthesis of Transformer programs (C-RASP) via SMT-backed model checking and learning.

Relevance: 9 Novelty: 8

7. Optimizer choice matters for the emergence of Neural Collapse

ArXiv ID: 2602.16642

Authors: Jim Zhao, Tin Sum Cheng, Wojciech Masarczyk, Aurelien Lucchi

Abstract: Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics. Also, we show the accelerating effect of momentum on NC (beyond convergence of train loss) when trained with SGD, being the first result concerning momentum in the context of NC. Finally, we conduct extensive empirical experiments consisting of 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, confirming our theoretical results. This work provides the first theoretical explanation for optimizer-dependent emergence of NC and highlights the overlooked role of weight-decay coupling in shaping the implicit biases of optimizers.

Comment: Representation Learning/Training Dynamics: provides theoretical and empirical analysis of optimizer-dependent Neural Collapse and the role of weight-decay coupling and momentum.

Relevance: 9 Novelty: 8
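
The distinction between coupled and decoupled weight decay that the abstract above turns on can be written out for a single scalar weight. This is a minimal sketch of the two SignGD update rules in their standard form, not code from the paper; the function names are chosen here for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signgd_coupled(w, grad, lr, wd):
    """Coupled decay (Adam-style L2): the decay term is added to the
    gradient first, so it passes through the sign and loses its magnitude."""
    return w - lr * sign(grad + wd * w)


def signgd_decoupled(w, grad, lr, wd):
    """Decoupled decay (AdamW-style): the decay acts on the weight directly,
    outside the sign, so its pull grows in proportion to |w|."""
    return w - lr * (sign(grad) + wd * w)
```

For the same weight, gradient, learning rate, and decay coefficient, the two rules can move in different directions: with w = 2.0, grad = -0.1, and wd = 0.5, the coupled rule steps toward zero while the decoupled rule's two terms cancel.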

8. FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

ArXiv ID: 2602.16603

Authors: Chia-chi Hsieh, Zan Zong, Xinyang Chen, Jianjiang Li, Jidong Zhai, Lijie Wen

Abstract: The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6x compared to state-of-the-art systems while satisfying heterogeneous SLOs.

Comment: Matches High Performance Computing: operator-level preemption and event-driven scheduling for LLM serving to mitigate HoL blocking and optimize TTFT-goodput.

Relevance: 9 Novelty: 8

9. Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks

ArXiv ID: 2602.15997

Authors: Jayadev Billa

Abstract: Capability emergence during neural network training remains mechanistically opaque. We track five geometric measures across five model scales (405K-85M parameters), 120+ emergence events in eight algorithmic tasks, and three Pythia language models (160M-2.8B). We find: (1) training begins with a universal representation collapse to task-specific floors that are scale-invariant across a 210X parameter range (e.g., modular arithmetic collapses to RANKME ~ 2.0 regardless of model size); (2) collapse propagates top-down through layers (32/32 task X model consistency), contradicting bottom-up feature-building intuition; (3) a geometric hierarchy in which representation geometry leads emergence (75-100% precursor rate for hard tasks), while the local learning coefficient is synchronous (0/24 precursor) and Hessian measures lag. We also delineate prediction limits: geometric measures encode coarse task difficulty but not fine-grained timing (within-class concordance 27%; when task ordering reverses across scales, prediction fails at 26%). On Pythia, global geometric patterns replicate but per-task precursor signals do not -- the precursor relationship requires task-training alignment that naturalistic pre-training does not provide. Our contribution is the geometric anatomy of emergence and its boundary conditions, not a prediction tool.

Comment: Representation Learning/Training Dynamics: geometric analysis of capability emergence, scale-invariant representation collapse, and top-down layer reorganization across model scales.

Relevance: 9 Novelty: 7
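
The abstract above reports collapse to RANKME ~ 2.0. Assuming this refers to the RankMe effective-rank estimate (the exponential of the Shannon entropy of the normalized singular values of a representation matrix), a value near 2.0 means the representations effectively occupy about two directions. The sketch below takes precomputed singular values as input; the function name and the epsilon value are assumptions, and the paper's exact measurement protocol is not given in this digest.

```python
import math


def rankme(singular_values, eps=1e-7):
    """Effective rank as exp(entropy) of the normalized singular values.

    Assumes non-negative singular values with a positive sum. The small
    eps keeps log() finite when a singular value is exactly zero.
    """
    total = sum(singular_values)
    probs = [s / total + eps for s in singular_values]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Two equal singular values give a score close to 2, and a single dominant one gives a score close to 1, regardless of how many near-zero singular values follow.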

10. Beyond Learning: A Training-Free Alternative to Model Adaptation

ArXiv ID: 2602.16189

Authors: Namkyung Yoon, Kyeonghyun Yoo, Wooyong Jung, Sanghong Kim, Hwangnam Kim

Abstract: Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each language model has a local module inside that is suitable for a specific function. First, this work identifies a set of modules showing consistent and local activation changes under an inference workload through activation-based analysis. Subsequently, we transplant an internal module that is properly activated for a specific task into the target model, leading to immediate and measurable functional changes without additional training or fine-tuning. To experimentally demonstrate the effectiveness of the transplant technique, we quantify the relationship between transplant strength and performance improvement under different conditions for two language models. In the cross-generation setting, we find that transplanting activation-selected modules can substantially improve the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. Moreover, in transplant experiments between a base model and its instruction-tuned counterpart, transplantation improves the underperforming model toward the stronger baseline, yielding up to about 2.33 times the target baseline with gap-based recovery reaching up to 100% in the best case. These results show that meaningful capacity transfer can be realized through the implantation of highly localized modules implied by language models. Overall, this work provides empirical evidence for task-localized modularity in language models and presents a new research area: model transplantation.

Comment: Matches Model Architecture and Efficiency: training-free module transplantation via activation-selected internal modules for immediate capability transfer.

Relevance: 8 Novelty: 8

11. Neighborhood Stability as a Measure of Nearest Neighbor Searchability

ArXiv ID: 2602.16673

Authors: Thomas Vecchiato, Sebastian Bruch

Abstract: Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset -- what we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. First is Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not distances, making them applicable to various distance functions including inner product.

Comment: HPC/Efficiency for ANN: introduces neighborhood stability measures (clustering-NSM, point-NSM) to predict searchability and ANNS accuracy from nearest-neighbor structure.