Personalized Daily ArXiv Papers 2026-03-05

[gpt-5]	Prompt	Completion	Total
Token	49819	46514	96333
Cost	$0.06	$0.47	$0.53

Total arXiv papers: 655

Total scanned papers: 385

Total relevant papers: 26

Table of contents with paper titles:

Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget Authors: Peter Balogh
Why Are Linear RNNs More Parallelizable? Authors: William Merrill, Hongjian Jiang, Yanhong Li, Ashish Sabharwal
Spectral Surgery: Training-Free Refinement of LoRA via Gradient-Guided Singular Value Reweighting Authors: Zailong Tian, Yanzhe Chen, Zhuoheng Han, Lizi Liao
Dissecting Quantization Error: A Concentration-Alignment Perspective Authors: Marco Federici, Boris van Breugel, Paul Whatmough, Markus Nagel
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer Authors: Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li
Data-Aware Random Feature Kernel for Transformers Authors: Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau
Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning Authors: Achleshwar Luthra, Yash Salunkhe, Tomer Galanti
PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters Authors: Yinghong Yu, Guangyuan Li, Jiancheng Yang
Solving adversarial examples requires solving exponential misalignment Authors: Alessandro Salvatore, Stanislav Fort, Surya Ganguli
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs Authors: Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas
SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training Authors: Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Gil Avraham, Yan Zuo, Violetta Shevchenko, Alexander Long
NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training Authors: Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avraham, Alexander Long
Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts Authors: Sanae Lotfi, Lucas Caccia, Alessandro Sordoni, Jordan T. Ash, Miroslav Dudik
Stable and Steerable Sparse Autoencoders with Weight Regularization Authors: Piotr Jedryszek, Oliver M. Crook
EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs Authors: Yuhao Chen, Bin Shan, Xin Ye, Cheng Chen
Riemannian Optimization in Modular Systems Authors: Christian Pehle, Jean-Jacques Slotine
Semi-Supervised Generative Learning via Latent Space Distribution Matching Authors: Kwong Yu Chong, Long Feng
Surprisal-Rényi Free Energy Authors: Shion Matsumoto, Raul Castillo, Benjamin Prada, Ankur Arjun Mali
stratum: A System Infrastructure for Massive Agent-Centric ML Workloads Authors: Arnab Phani, Elias Strauss, Sebastian Schelter
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs Authors: Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen
StructLens: A Structural Lens for Language Models via Maximum Spanning Trees Authors: Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Towards Improved Sentence Representations using Token Graphs Authors: Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Zorah Lähner, Moshe Eliasof
Controlling Chat Style in Language Models via Single-Direction Editing Authors: Zhenyu Xu, Victor S. Sheng
Efficient Refusal Ablation in LLM through Optimal Transport Authors: Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob
Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory Authors: Zhenting Wang, Huancheng Chen, Jiayun Wang, Wei Wei
Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization Authors: Furkan Mumcu, Yasin Yilmaz

1. Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget

ArXiv ID: 2603.03459

Authors: Peter Balogh

Abstract: We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d+1$ parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that nonlinearity need cannot be predicted from token identity: cross-corpus correlation is zero ($r < 0.05$). The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution where most MLP computations are near-linear, achieving 25-56% linear routing at <1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat baseline with gating and no layer exceeds 3.7% all-linear cost. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B's full 32-layer sweep reveals one layer that narrowly beats baseline. As a proof of concept, we progressively replace middle-layer MLPs with frozen linear matrices: 5 of 24 layers linearize at zero cost. With a full training budget, 4 linearized layers yield a 10.2% perplexity improvement -- and a two-phase gated approach pushes this to 17.3%, beating a vanilla fine-tuning control and confirming that the nonlinear MLPs at these layers were actively harmful.

Comment: Conditional routing that replaces Transformer MLPs with linear surrogates when possible—dynamic networks/efficiency and architectural analysis.

Relevance: 10 Novelty: 9
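
Sketch (illustrative, not from the paper): one way to read "a gate with $d+1$ parameters" is a weight vector plus a bias that scores each hidden state and chooses between the full MLP and a linear surrogate. The hard 0.5 threshold, the toy ReLU MLP, the identity surrogate, and every name below are assumptions made for illustration only.

```python
import math

def gate_score(x, w, b):
    """Sigmoid of w.x + b: the gate has len(w) + 1 = d + 1 parameters."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def full_mlp(x, W_in, W_out):
    """Toy two-layer ReLU MLP standing in for the transformer MLP."""
    hidden = [max(0.0, h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)

def routed_mlp(x, gate_w, gate_b, W_lin, W_in, W_out, threshold=0.5):
    """Use the linear surrogate when the gate fires, else the full MLP."""
    if gate_score(x, gate_w, gate_b) > threshold:
        return matvec(W_lin, x), "linear"
    return full_mlp(x, W_in, W_out), "mlp"

d = 3
gate_w, gate_b = [2.0, 0.0, 0.0], 0.0          # hypothetical gate parameters
W_lin = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_in = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]
W_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out_a, path_a = routed_mlp([1.0, 0.5, -0.5], gate_w, gate_b, W_lin, W_in, W_out)
out_b, path_b = routed_mlp([-1.0, 0.5, 0.5], gate_w, gate_b, W_lin, W_in, W_out)
```

Here the first input is routed to the linear path and the second to the full MLP. The decision depends only on the hidden state, which fits the abstract's finding that routing is contextual rather than tied to token identity.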

2. Why Are Linear RNNs More Parallelizable?

ArXiv ID: 2603.03612

Authors: William Merrill, Hongjian Jiang, Yanhong Li, Ashish Sabharwal

Abstract: The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.

Comment: Strong match to Model Architecture and High-Performance Computing theory by characterizing LRNNs’ parallelizability via complexity classes and expressivity trade-offs.

Relevance: 10 Novelty: 9
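
Sketch (illustrative, not from the paper): the standard reason a linear recurrence parallelizes is that its steps are affine maps, and composing affine maps is associative, so all prefixes can be computed with a logarithmic number of rounds. The code below shows this for a scalar recurrence $h_t = a_t h_{t-1} + b_t$. The scalar setting and all function names are assumptions, and the paper's circuit-complexity results are not reproduced here.

```python
def combine(first, second):
    """Compose two affine maps h -> a*h + b, applying `first` then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def sequential_states(steps, h0=0.0):
    """Reference: run h_t = a_t * h_{t-1} + b_t one step at a time."""
    states, h = [], h0
    for a, b in steps:
        h = a * h + b
        states.append(h)
    return states

def scan_states(steps, h0=0.0):
    """Hillis-Steele inclusive scan: each round could run in parallel over i."""
    prefix = list(steps)
    offset, rounds = 1, 0
    while offset < len(prefix):
        prefix = [
            combine(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
        rounds += 1
    # prefix[i] now holds the composed map for steps 0..i; apply it to h0.
    return [a * h0 + b for a, b in prefix], rounds

steps = [(0.5, 1.0), (2.0, -1.0), (1.0, 0.5), (-1.0, 2.0), (0.25, 0.0)]
reference = sequential_states(steps, h0=3.0)
parallel, rounds = scan_states(steps, h0=3.0)
```

A nonlinear update such as $h_t = \tanh(a_t h_{t-1} + b_t)$ has no comparably compact composed form, which gives an informal intuition for the barrier the abstract describes.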

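3. Spectral Surgery: Training-Free Refinement of LoRA via Gradient-Guided Singular Value Reweighting

ArXiv ID: 2603.03995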

Authors: Zailong Tian, Yanzhe Chen, Zhuoheng Han, Lizi Liao

Abstract: Low-Rank Adaptation (LoRA) improves downstream performance by restricting task updates to a low-rank parameter subspace, yet how this limited capacity is allocated within a trained adapter remains unclear. Through a geometric and empirical study across multiple tasks and backbones, we find that trained LoRA updates often exhibit an inefficient spectrum: task effects concentrate in a small subset of singular directions, while many remaining components are neutral or detrimental, motivating post-hoc refinement within the learned subspace. We propose Spectral Surgery, a training-free refinement that decomposes a LoRA update with SVD, estimates per-component sensitivity using gradients on a small calibration set, and reweights singular values under a magnitude constraint while keeping the learned directions fixed. Across Llama-3.1-8B and Qwen3-8B on four benchmarks, Spectral Surgery yields consistent gains (up to +4.4 points on CommonsenseQA and +2.4 pass@1 on HumanEval) by adjusting only $\approx 1{,}000$ scalar coefficients. These results demonstrate that SVD-structured, low-cost parameter editing can serve as a practical route to improving trained LoRA adapters in a purely post-hoc manner.

Comment: Model Compression and Efficiency: low-rank LoRA refinement via SVD-based singular value reweighting; training-free parameter editing.

Relevance: 10 Novelty: 8

4. Dissecting Quantization Error: A Concentration-Alignment Perspective

ArXiv ID: 2603.04359

Authors: Marco Federici, Boris van Breugel, Paul Whatmough, Markus Nagel

Abstract: Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, function-preserving transforms (e.g. rotations, Hadamard transform, channel-wise scaling) have been successfully applied to reduce post-training quantization error, yet a principled explanation remains elusive. We analyze linear-layer quantization via the signal-to-quantization-noise ratio (SQNR), showing that for uniform integer quantization at a fixed bit width, SQNR decomposes into (i) the concentration of weights and activations (capturing spread and outliers), and (ii) the alignment of their dominant variation directions. This reveals an actionable insight: beyond concentration - the focus of most prior transforms (e.g. rotations or Hadamard) - improving alignment between weight and activation can further reduce quantization error. Motivated by this, we introduce block Concentration-Alignment Transforms (CAT), a lightweight linear transformation that uses a covariance estimate from a small calibration set to jointly improve concentration and alignment, approximately maximizing SQNR. Experiments across several LLMs show that CAT consistently matches or outperforms prior transform-based quantization methods at 4-bit precision, confirming the insights gained in our framework.

Comment: Provides a principled SQNR-based theory of quantization error (concentration+alignment) and introduces CAT transforms—model compression/quantization.

Relevance: 10 Novelty: 8
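
Sketch (illustrative, not from the paper): the concentration effect can be seen at the level of a single tensor, where one outlier stretches the uniform quantization grid and lowers SQNR for everything else. The code uses symmetric max-abs scaling at 4 bits, which is an assumption. It does not implement the paper's layer-level decomposition, the alignment term, or CAT.

```python
import math

def quantize_uniform(values, bits=4):
    """Symmetric uniform integer quantization with one max-abs scale."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def sqnr_db(values, bits=4):
    """10 * log10(signal power / quantization noise power)."""
    quantized = quantize_uniform(values, bits)
    signal = sum(v * v for v in values)
    noise = sum((v - q) ** 2 for v, q in zip(values, quantized))
    return 10.0 * math.log10(signal / noise)

# 1000 evenly spread values in [-1, 1], then the same values plus one outlier.
spread = [-1.0 + 2.0 * i / 999 for i in range(1000)]
with_outlier = spread + [8.0]

sqnr_spread = sqnr_db(spread)
sqnr_outlier = sqnr_db(with_outlier)
```

With the outlier present, most of the small values collapse onto a few grid points, so `sqnr_outlier` comes out well below `sqnr_spread`. Function-preserving transforms such as rotations aim to remove this kind of spread before quantizing.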

5. ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

ArXiv ID: 2603.03583

Authors: Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li

Abstract: Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-$K$ selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.

Comment: Matches Model Architecture and Efficiency: tokenizer-free hierarchical byte-level LM with compression-driven segmentation and Top-K selection for a static compute graph.

Relevance: 10 Novelty: 8

6. Data-Aware Random Feature Kernel for Transformers

ArXiv ID: 2603.04127

Authors: Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau

Abstract: Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel with positive random features drawn from an isotropic distribution. In pretrained models, however, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes unless one retrains the model or uses a large feature budget. Importance sampling can address this by adapting the sampling distribution to the input geometry, but complex data-dependent proposal distributions are often intractable. We show that by data aligning the softmax kernel, we obtain an attention mechanism which can both admit a tractable minimal-variance proposal distribution for importance sampling, and exhibits better training stability. Motivated by this finding, we introduce DARKFormer, a Data-Aware Random-feature Kernel transformer that features a data-aligned kernel geometry. DARKFormer learns the random-projection covariance, efficiently realizing an importance-sampled positive random-feature estimator for its data-aligned kernel. Empirically, DARKFormer narrows the performance gap with exact softmax attention, particularly in finetuning regimes where pretrained representations are anisotropic. By combining random-feature efficiency with data-aware kernels, DARKFormer advances kernel-based attention in resource-constrained settings.

Comment: Matches Compression/Efficiency and Model Architecture: data-aware random-feature attention (learned covariance) enabling importance-sampled linear attention (DARKFormer).

Relevance: 10 Novelty: 8
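
Sketch (illustrative, not from the paper): this is the isotropic baseline the abstract starts from, namely Performer-style positive random features with $w_i \sim \mathcal{N}(0, I)$, for which $\exp(q \cdot k) = \mathbb{E}_w[\exp(w \cdot q - \|q\|^2/2)\,\exp(w \cdot k - \|k\|^2/2)]$. DARKFormer's learned covariance and data-aligned kernel are not implemented. The vectors, feature count, and seed are arbitrary assumptions.

```python
import math
import random

def positive_random_features(x, projections):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m), with w_i ~ N(0, I)."""
    half_sq_norm = 0.5 * sum(xi * xi for xi in x)
    m = len(projections)
    return [
        math.exp(sum(wi * xi for wi, xi in zip(w, x)) - half_sq_norm) / math.sqrt(m)
        for w in projections
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
dim, num_features = 2, 20000
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

q, k = [0.3, -0.2], [0.1, 0.4]
exact = math.exp(sum(qi * ki for qi, ki in zip(q, k)))   # softmax kernel exp(q.k)
phi_q = positive_random_features(q, projections)
phi_k = positive_random_features(k, projections)
estimate = sum(a * b for a, b in zip(phi_q, phi_k))      # Monte Carlo estimate
```

Because the kernel factorizes as a dot product of feature maps, attention can be computed as $\phi(Q)(\phi(K)^\top V)$ in time linear in sequence length. The variance of this estimator grows with the norms of $q$ and $k$, which is the problem a data-aware sampling distribution is meant to reduce.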

7. Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning

ArXiv ID: 2603.03530

Authors: Achleshwar Luthra, Yash Salunkhe, Tomer Galanti

Abstract: Frozen self-supervised representations often transfer well with only a few labels across many semantic tasks. We argue that a single geometric quantity, directional CDNV (decision-axis variance), sits at the core of two favorable behaviors: strong few-shot transfer within a task, and low interference across many tasks. We show that both emerge when variability along class-separating directions is small. First, we prove sharp non-asymptotic multiclass generalization bounds for downstream classification whose leading term is the directional CDNV. The bounds include finite-shot corrections that cleanly separate intrinsic decision-axis variability from centroid-estimation error. Second, we link decision-axis collapse to multitask geometry: for independent balanced labelings, small directional CDNV across tasks forces the corresponding decision axes to be nearly orthogonal, helping a single representation support many tasks with minimal interference. Empirically, across SSL objectives, directional CDNV collapses during pretraining even when classical CDNV remains large, and our bounds closely track few-shot error at practical shot sizes. Additionally, on synthetic multitask data, we verify that SSL learns representations whose induced decision axes are nearly orthogonal. The code and project page of the paper are available at https://dlfundamentals.github.io/directional-neural-collapse/.

Comment: Matches Representation Learning/Theory: directional neural collapse (decision-axis variance) explains few-shot transfer with sharp bounds and multitask geometry.

Relevance: 10 Novelty: 8

8. PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

ArXiv ID: 2603.04165

Authors: Yinghong Yu, Guangyuan Li, Jiancheng Yang

Abstract: Large-scale 2D foundation models exhibit strong transferable representations, yet extending them to 3D volumetric data typically requires retraining, adapters, or architectural redesign. We introduce PlaneCycle, a training-free, adapter-free operator for architecture-agnostic 2D-to-3D lifting of foundation models. PlaneCycle reuses the original pretrained 2D backbone by cyclically distributing spatial aggregation across orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. The method introduces no additional parameters and is applicable to arbitrary 2D networks. Using pretrained DINOv3 models, we evaluate PlaneCycle on six 3D classification and three 3D segmentation benchmarks. Without any training, the lifted models exhibit intrinsic 3D fusion capability and, under linear probing, outperform slice-wise 2D baselines and strong 3D counterparts, approaching the performance of fully trained models. With full fine-tuning, PlaneCycle matches standard 3D architectures, highlighting its potential as a seamless and practical 2D-to-3D lifting operator. These results demonstrate that 3D capability can be unlocked from pretrained 2D foundation models without structural modification or retraining. Code is available at https://github.com/HINTLab/PlaneCycle.

Comment: Model Architecture/Efficiency: training-free, adapter-free 2D-to-3D lifting operator (PlaneCycle) enabling 3D fusion while reusing 2D backbones

Relevance: 10 Novelty: 8

9. Solving adversarial examples requires solving exponential misalignment

ArXiv ID: 2603.03507

Authors: Alessandro Salvatore, Stanislav Fort, Surya Ganguli

Abstract: Adversarial attacks - input perturbations imperceptible to humans that fool neural networks - remain both a persistent failure mode in machine learning, and a phenomenon with mysterious origins. To shed light, we define and analyze a network's perceptual manifold (PM) for a class concept as the space of all inputs confidently assigned to that class by the network. We find, strikingly, that the dimensionalities of neural network PMs are orders of magnitude higher than those of natural human concepts. Since volume typically grows exponentially with dimension, this suggests exponential misalignment between machines and humans, with exponentially many inputs confidently assigned to concepts by machines but not humans. Furthermore, this provides a natural geometric hypothesis for the origin of adversarial examples: because a network's PM fills such a large region of input space, any input will be very close to any class concept's PM. Our hypothesis thus suggests that adversarial robustness cannot be attained without dimensional alignment of machine and human PMs, and therefore makes strong predictions: both robust accuracy and distance to any PM should be negatively correlated with the PM dimension. We confirmed these predictions across 18 different networks of varying robust accuracy. Crucially, we find even the most robust networks are still exponentially misaligned, and only the few PMs whose dimensionality approaches that of human concepts exhibit alignment to human perception. Our results connect the fields of alignment and adversarial examples, and suggest the curse of high dimensionality of machine PMs is a major impediment to adversarial robustness.

Comment: Representation Learning/Theory: introduces perceptual manifold dimensionality as a geometric account of adversarial vulnerability and robustness.

Relevance: 9 Novelty: 9

10. Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs

ArXiv ID: 2603.03415

Authors: Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas

Abstract: In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, the farther the shift, the sparser the representations. This sparsity-difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design Sparsity-Guided Curriculum In-Context Learning (SG-ICL), a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: https://github.com/MingyuJ666/sparsityLLM.

Comment: Finds a robust sparsity–difficulty relation in LLM hidden states and exploits it for curriculum ICL—representation learning/training dynamics.

Relevance: 9 Novelty: 8

11. SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training

ArXiv ID: 2603.03592

Authors: Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Gil Avraham, Yan Zuo, Violetta Shevchenko, Alexander Long

Abstract: Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers where the activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recovers classical convergence rates when relaxed to standard training. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.

Comment: Matches High Performance Computing/Distributed Training: integrity verification for pipeline parallel training with convergence guarantees in untrusted settings.

Relevance: 9 Novelty: 8
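
Sketch (illustrative, not from the paper): the abstract mentions EMA-based monitoring of inter-stage communication without giving the statistic or the decision rule. The code below is a hypothetical monitor that tracks an EMA of the activation L2 norm and flags a message whose norm deviates too far from it. The choice of statistic, the decay `beta`, the relative `tolerance`, and the class name are all assumptions, not SENTINEL's actual procedure.

```python
class EmaMonitor:
    """Track an EMA of a scalar statistic and flag large relative deviations."""

    def __init__(self, beta=0.9, tolerance=0.5):
        self.beta = beta            # EMA decay (assumed value)
        self.tolerance = tolerance  # allowed relative deviation (assumed value)
        self.ema = None

    def check(self, value):
        """Return True if `value` looks consistent; only then update the EMA."""
        if self.ema is None:
            self.ema = value
            return True
        ok = abs(value - self.ema) <= self.tolerance * abs(self.ema)
        if ok:
            self.ema = self.beta * self.ema + (1.0 - self.beta) * value
        return ok

def l2_norm(vector):
    """Euclidean norm of a list of floats."""
    return sum(v * v for v in vector) ** 0.5

monitor = EmaMonitor()
honest = [[1.0, 2.0, 2.0], [1.1, 2.0, 1.9], [0.9, 2.1, 2.0]]
corrupted = [10.0, -20.0, 20.0]   # hypothetical tampered activation
flags = [monitor.check(l2_norm(a)) for a in honest + [corrupted]]
```

Rejected values are not folded into the EMA, so a single corrupted message cannot shift the baseline. A check of this kind costs one norm and a few scalar operations per message, which is consistent with verification "without computation duplication".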

12. NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training

ArXiv ID: 2603.03597

Authors: Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avraham, Alexander Long

Abstract: The rapid progress of large language models (LLMs) is increasingly constrained by memory and deployment costs, motivating compression methods for practical deployment. Many state-of-the-art compression pipelines leverage the low-rank structure of trained weight matrices, a phenomenon often associated with the properties of popular optimizers such as Adam. In this context, Muon is a recently proposed optimizer that improves LLM pretraining via full-rank update steps, but its induced weight-space structure has not been characterized yet. In this work, we report a surprising empirical finding: despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. Motivated by this insight, we propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure. Across billion-parameter-scale models, we show that NuMuon increases weight compressibility and improves post-compression model quality under state-of-the-art LLM compression pipelines while retaining Muon's favorable convergence behavior.

Comment: Compression/Efficiency + Training Dynamics: optimizer with nuclear-norm-constrained updates to induce low-rank weight structure for better LLM compressibility

Relevance: 9 Novelty: 8

13. Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts

ArXiv ID: 2603.03535

Authors: Sanae Lotfi, Lucas Caccia, Alessandro Sordoni, Jordan T. Ash, Miroslav Dudik

Abstract: While large language models (LLMs) fine-tuned with lightweight adapters achieve strong performance across diverse tasks, their performance on individual tasks depends on the fine-tuning strategy. Fusing independently trained models with different strengths has shown promise for multi-task learning through three main strategies: ensembling, which combines outputs from independent models; merging, which fuses model weights via parameter averaging; and routing, which integrates models in an input-dependent fashion. However, many design decisions in these approaches remain understudied, and the relative benefits of more sophisticated ensembling, merging and routing techniques are not fully understood. We empirically evaluate their trade-offs, addressing two key questions: What are the advantages of going beyond uniform ensembling or merging? And does the flexibility of routing justify its complexity? Our findings indicate that non-uniform ensembling and merging improve performance, but routing offers even greater gains. To mitigate the computational cost of routing, we analyze expert selection techniques, showing that clustering and greedy subset selection can maintain reasonable performance with minimal overhead. These insights advance our understanding of model fusion for multi-task learning.

Comment: Systematic study of ensembling/merging/routing among parameter-efficient experts—experts/routing (MoE-style) for multi-task efficiency.

Relevance: 9 Novelty: 7

14. Stable and Steerable Sparse Autoencoders with Weight Regularization

ArXiv ID: 2603.04198

Authors: Piotr Jedryszek, Oliver M. Crook

Abstract: Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we studied weight regularization by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, it dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increased the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success becomes better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.

Comment: Matches Representation Learning and Sparsity: stability/steerability of sparse autoencoders via L2/L1 weight regularization, tied init, and unit-norm decoders.

Relevance: 9 Novelty: 7

15. EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

ArXiv ID: 2603.03681

Authors: Yuhao Chen, Bin Shan, Xin Ye, Cheng Chen

Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the exponential growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME dataset, EvoPrune achieves 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.

Comment: Model Compression/Efficiency: early-stage visual token pruning inside the encoder (layer-wise, similarity/diversity/attention-guided) for MLLMs

Relevance: 9 Novelty: 7

16. Riemannian Optimization in Modular Systems

ArXiv ID: 2603.03610

Authors: Christian Pehle, Jean-Jacques Slotine

Abstract: Understanding how systems built out of modular components can be jointly optimized is an important problem in biology, engineering, and machine learning. The backpropagation algorithm is one such solution and has been instrumental in the success of neural networks. Despite its empirical success, a strong theoretical understanding of it is lacking. Here, we combine tools from Riemannian geometry, optimal control theory, and theoretical physics to advance this understanding. We make three key contributions: First, we revisit the derivation of backpropagation as a constrained optimization problem and combine it with the insight that Riemannian gradient descent trajectories can be understood as the minimum of an action. Second, we introduce a recursively defined layerwise Riemannian metric that exploits the modular structure of neural networks and can be efficiently computed using the Woodbury matrix identity, avoiding the $O(n^3)$ cost of full metric inversion. Third, we develop a framework of composable "Riemannian modules" whose convergence properties can be quantified using nonlinear contraction theory, providing algorithmic stability guarantees of order $O(\kappa^2 L/(\xi \mu \sqrt{n}))$ where $\kappa$ and $L$ are Lipschitz constants, $\mu$ is the mass matrix scale, and $\xi$ bounds the condition number. Our layerwise metric approach provides a practical alternative to natural gradient descent. While we focus here on studying neural networks, our approach more generally applies to the study of systems made of modules that are optimized over time, as it occurs in biology during both evolution and development.

Comment: Proposes layerwise Riemannian metrics and composable modules with contraction guarantees—principled optimization/training dynamics for neural architectures.

Relevance: 8 Novelty: 8
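
Sketch (illustrative, not from the paper): the Woodbury identity avoids a full $O(n^3)$ inversion when a matrix is an easily inverted base plus a low-rank update. The rank-1 case (Sherman-Morrison) with a diagonal base reads $(D + uv^\top)^{-1} = D^{-1} - \frac{D^{-1} u v^\top D^{-1}}{1 + v^\top D^{-1} u}$. The diagonal-plus-rank-1 structure and the names below are assumptions, since the abstract does not give the form of the paper's layerwise metric.

```python
def sherman_morrison_diag(diag, u, v):
    """Inverse of diag(d) + u v^T via the rank-1 Woodbury identity, in O(n^2)."""
    n = len(diag)
    inv_d = [1.0 / d for d in diag]
    a_inv_u = [inv_d[i] * u[i] for i in range(n)]   # D^{-1} u
    v_a_inv = [v[j] * inv_d[j] for j in range(n)]   # v^T D^{-1}
    denom = 1.0 + sum(v[i] * a_inv_u[i] for i in range(n))
    return [
        [(inv_d[i] if i == j else 0.0) - a_inv_u[i] * v_a_inv[j] / denom
         for j in range(n)]
        for i in range(n)
    ]

def matmul(A, B):
    """Dense matrix product on nested lists, used only to verify the inverse."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

diag, u, v = [2.0, 4.0, 5.0], [1.0, 0.0, 2.0], [1.0, 1.0, 0.0]
M = [[(diag[i] if i == j else 0.0) + u[i] * v[j] for j in range(3)] for i in range(3)]
M_inv = sherman_morrison_diag(diag, u, v)
product = matmul(M, M_inv)   # should be close to the identity matrix
```

The identity requires the denominator $1 + v^\top D^{-1} u$ to be nonzero. For a rank-$k$ update the same idea needs only a $k \times k$ inverse.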

17. Semi-Supervised Generative Learning via Latent Space Distribution Matching

ArXiv ID: 2603.04223

Authors: Kwong Yu Chong, Long Feng

Abstract: We introduce Latent Space Distribution Matching (LSDM), a novel framework for semi-supervised generative modeling of conditional distributions. LSDM operates in two stages: (i) learning a low-dimensional latent space from both paired and unpaired data, and (ii) performing joint distribution matching in this space via the 1-Wasserstein distance, using only paired data. This two-step approach minimizes an upper bound on the 1-Wasserstein distance between joint distributions, reducing reliance on scarce paired samples while enabling fast one-step generation. Theoretically, we establish non-asymptotic error bounds and demonstrate a key benefit of unpaired data: enhanced geometric fidelity in generated outputs. Furthermore, by extending the scope of its two core steps, LSDM provides a coherent statistical perspective that connects to a broad class of latent-space approaches. Notably, Latent Diffusion Models (LDMs) can be viewed as a variant of LSDM, in which joint distribution matching is achieved indirectly via score matching. Consequently, our results also provide theoretical insights into the consistency of LDMs. Empirical evaluations on real-world image tasks, including class-conditional generation and image super-resolution, demonstrate the effectiveness of LSDM in leveraging unpaired data to enhance generation quality.

Comment: Latent Space Distribution Matching with Wasserstein bounds; connects to LDMs—representation learning/generative modeling theory.

Relevance: 8 Novelty: 8
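
Illustration (not from the paper): LSDM's second stage matches distributions under the 1-Wasserstein distance. As a minimal sketch of what that distance measures, the snippet below computes it for two equal-size one-dimensional empirical samples, where it reduces to the mean absolute difference between sorted values. This is a standard 1D fact, not the paper's estimator, which operates on joint distributions in a learned latent space; the function name `wasserstein1_1d` is hypothetical.

```python
def wasserstein1_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1D empirical samples.

    In one dimension the optimal transport plan pairs the i-th smallest
    point of xs with the i-th smallest point of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Shifting every point by 1 moves each unit of mass a distance of 1.
shifted = wasserstein1_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])

# The distance ignores sample order and vanishes for identical samples.
same = wasserstein1_1d([2.0, 0.0, 1.0], [0.0, 1.0, 2.0])
```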

18. Surprisal-Rényi Free Energy

ArXiv ID: 2603.03405

Authors: Shion Matsumoto, Raul Castillo, Benjamin Prada, Ankur Arjun Mali

Abstract: The forward and reverse Kullback-Leibler (KL) divergences arise as limiting objectives in learning and inference yet induce markedly different inductive biases that cannot be explained at the level of expectations alone. In this work, we introduce the Surprisal-Rényi Free Energy (SRFE), a log-moment-based functional of the likelihood ratio that lies outside the class of $f$-divergences. We show that SRFE recovers forward and reverse KL divergences as singular endpoint limits and derive local expansions around both limits in which the variance of the log-likelihood ratio appears as a first-order correction. This reveals an explicit mean-variance tradeoff governing departures from KL-dominated regimes. We further establish a Gibbs-type variational characterization of SRFE as the unique minimizer of a weighted sum of KL divergences and prove that SRFE directly controls large deviations of excess code-length via Chernoff-type bounds, yielding a precise Minimum Description Length interpretation. Together, these results identify SRFE as a variance- and tail-sensitive free-energy functional that clarifies the geometric and large-deviation structure underlying forward and reverse KL limits, without unifying or subsuming distinct learning frameworks.

Comment: Matches Representation Learning/Training Objectives: introduces Surprisal-Rényi Free Energy interpolating KLs with variance/tail sensitivity and MDL interpretation.

Relevance: 8 Novelty: 8

19. stratum: A System Infrastructure for Massive Agent-Centric ML Workloads

ArXiv ID: 2603.03589

Authors: Arnab Phani, Elias Strauss, Sebastian Schelter

Abstract: Recent advances in large language models (LLMs) transform how machine learning (ML) pipelines are developed and evaluated. LLMs enable a new type of workload, agentic pipeline search, in which autonomous or semi-autonomous agents generate, validate, and optimize complete ML pipelines. These agents predominantly operate over popular Python ML libraries and exhibit highly exploratory behavior. This results in thousands of executions for data profiling, pipeline generation, and iterative refinement of pipeline stages. However, the existing Python-based ML ecosystem is built around libraries such as Pandas and scikit-learn, which are designed for human-centric, interactive, sequential workflows and remain constrained by Python's interpretive execution model, library-level isolation, and limited runtime support for executing large numbers of pipelines. Meanwhile, many high-performance ML systems proposed by the systems community either target narrow workload classes or require specialized programming models, which limits their integration with the Python ML ecosystem and makes them largely ill-suited for LLM-based agents. This growing mismatch exposes a fundamental systems challenge in supporting agentic pipeline search at scale. We therefore propose stratum, a unified system infrastructure that decouples pipeline execution from planning and reasoning during agentic pipeline search. Stratum integrates seamlessly with existing Python libraries, compiles batches of pipelines into optimized execution graphs, and efficiently executes them across heterogeneous backends, including a novel Rust-based runtime. We present stratum's architectural vision along with an early prototype, discuss key design decisions, and outline open challenges and research directions. Finally, preliminary experiments show that stratum can speed up large-scale agentic pipeline search by up to 16.6x.

Comment: High Performance Computing: unified system infrastructure compiling and executing large batches of agent-generated ML pipelines efficiently.