Personalized Daily ArXiv Papers 2026-03-30

Model	Metric	Usage (Prompt)	Usage (Completion)	Usage (Total)	Papers (Total arXiv)	Papers (Scanned)	Papers (Relevant)
`gpt-5.4`	Tokens	97913	3995	101908	421	240	24
`gpt-5.4`	Cost	$0.24	$0.06	$0.30	421	240	24

Table of contents with paper titles:

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory Authors: Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee
Beyond identifiability: Learning causal representations with few environments and finite samples Authors: Inbeom Lee, Tongtong Jin, Bryon Aragam
When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models Authors: Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo
A Compression Perspective on Simplicity Bias Authors: Tom Marty, Eric Elmoznino, Leo Gagnon, Tejas Kasetty, Mizu Nishikawa-Toomey, Sarthak Mittal, Guillaume Lajoie, Dhanya Sridhar
Identifying Connectivity Distributions from Neural Dynamics Using Flows Authors: Timothy Doyeon Kim, Ulises Pereira-Obilinovic, Yiliu Wang, Eric Shea-Brown, Uygar Sümbül
Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation Authors: Yiming Ren, Yujiu Yang, Junjie Wang
Optimization Trade-offs in Asynchronous Federated Learning: A Stochastic Networks Approach Authors: Abdelkrim Alahyane (LAAS-SARA), Céline Comte (CNRS, LAAS-SARA), Matthieu Jonckheere (CNRS, LAAS-SARA)
On the Expressive Power of Contextual Relations in Transformers Authors: Demián Fraiman
Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer Authors: Yulun Wu, Sravan Kumar Ankireddy, Samuel Sharpe, Nikita Seleznev, Dehao Yuan, Hyeji Kim, Nam H. Nguyen
Contrastive Conformal Sets Authors: Yahya Alkhatib, Wee Peng Tay
On associative neural networks for sparse patterns with huge capacities Authors: Matthias Löwe, Franck Vermet
Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics Authors: Peter Balogh
Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference Authors: Konstantinos Papaioannou, Thaleia Dimitra Doudali
Second-Order, First-Class: A Composable Stack for Curvature-Aware Training Authors: Mikalai Korbit, Mario Zanon
Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy Authors: Wooseong Jeong, Wonyoung Lee, Kuk-Jin Yoon
AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation Authors: Hyeongyu Kim, Geonhui Han, Dosik Hwang
Finding Distributed Object-Centric Properties in Self-Supervised Transformers Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja
From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs Authors: Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang
Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones Authors: Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni
Label-Free Cross-Task LoRA Merging with Null-Space Compression Authors: Wonyoung Lee, Wooseong Jeong, Kuk-Jin Yoon
DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease Authors: Runsheng Bai, Chengyu Zhang, Yangdong Deng
Make Geometry Matter for Spatial Reasoning Authors: Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang
ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction Authors: David Hagerman, Roman Naeem, Erik Brorsson, Fredrik Kahl, Lennart Svensson
SAHMM-VAE: A Source-Wise Adaptive Hidden Markov Prior Variational Autoencoder for Unsupervised Blind Source Separation Authors: Yuan-Hao Wei

1. Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

ArXiv ID: 2603.26554

Authors: Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee

Abstract: Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon and SGD on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and moreover Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of Muon and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.

Comment: Provides sharp theory for why spectral optimizers like Muon outperform SGD, analyzing optimizer dynamics and capacity scaling in a tractable associative-memory setting.

Relevance: 10 Novelty: 9
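
Sketch: To make the contrast in the abstract concrete, the minimal example below compares a plain SGD step with a spectral step that replaces the gradient matrix by an approximately orthogonal matrix sharing its singular vectors. It assumes the classical cubic Newton-Schulz iteration for the orthogonalization, which is not necessarily the variant or the coefficients used by Muon or by the paper. The names `orthogonalize`, `sgd_step`, and `spectral_step` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(grad, steps=20):
    """Approximate the polar factor U V^T of grad (assumed full rank).

    Uses the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X
    after scaling by the Frobenius norm so all singular values start in (0, 1].
    """
    norm = sum(v * v for row in grad for v in row) ** 0.5
    if norm == 0.0:
        return [row[:] for row in grad]
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


def sgd_step(weights, grad, lr):
    """Plain gradient step: W <- W - lr * G."""
    return [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(weights, grad)]


def spectral_step(weights, grad, lr):
    """Spectral step: W <- W - lr * orthogonalize(G)."""
    return sgd_step(weights, orthogonalize(grad), lr)
```

Every singular value of the orthogonalized update is close to 1, so weak gradient directions are stepped as far as strong ones. This is one informal way to read the signal amplification mentioned in the abstract; the sketch does not reproduce the paper's analysis.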

2. Beyond identifiability: Learning causal representations with few environments and finite samples

ArXiv ID: 2603.25796

Authors: Inbeom Lee, Tongtong Jin, Bryon Aragam

Abstract: We provide explicit, finite-sample guarantees for learning causal representations from data with a sublinear number of environments. Causal representation learning seeks to provide a rigorous foundation for the general representation learning problem by bridging causal models with latent factor models in order to learn interpretable representations with causal semantics. Despite a blossoming theory of identifiability in causal representation learning, estimation and finite-sample bounds are less well understood. We show that causal representations can be learned with only a logarithmic number of unknown, multi-node interventions, and that the intervention targets need not be carefully designed in advance. Through a careful perturbation analysis, we provide a new analysis of this problem that guarantees consistent recovery of (a) the latent causal graph, (b) the mixing matrix and representations, and (c) unknown intervention targets.

Comment: Finite-sample guarantees for causal representation learning with few environments directly match representation-learning theory and identifiability-focused structure learning.

Relevance: 9 Novelty: 9

3. When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

ArXiv ID: 2603.26556

Authors: Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

Abstract: Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.

Comment: Hybrid-KDA plus generation-focused distillation directly targets KV-cache reduction and efficient long-context inference, with clear architectural and distillation-design insight beyond perplexity-only evaluation.

Relevance: 9 Novelty: 8

4. A Compression Perspective on Simplicity Bias

ArXiv ID: 2603.25839

Authors: Tom Marty, Eric Elmoznino, Leo Gagnon, Tejas Kasetty, Mizu Nishikawa-Toomey, Sarthak Mittal, Guillaume Lajoie, Dhanya Sridhar

Abstract: Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features -- from simple spurious shortcuts to complex features -- only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.

Comment: Explains neural-network simplicity bias through an MDL compression framework that predicts feature-selection dynamics across data regimes, a strong representation-learning theory result.

Relevance: 9 Novelty: 8
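
Sketch: The trade-off described in the abstract can be illustrated with a toy two-part code: total description length is the bits needed to describe the hypothesis plus the bits needed to encode the labels given that hypothesis. The example below assumes two hypothetical predictors with made-up costs and error rates (a cheap "shortcut" and an expensive "complex" feature) and an idealized label code of `n * H(error_rate)` bits. None of these numbers or names come from the paper.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def two_part_length(model_bits, error_rate, n):
    """Two-part code length: model cost plus idealized cost of n labels."""
    return model_bits + n * binary_entropy(error_rate)


# Hypothetical hypotheses: name -> (description cost in bits, error rate).
HYPOTHESES = {
    "shortcut": (50.0, 0.20),
    "complex": (2000.0, 0.01),
}


def select(n):
    """Return the hypothesis with the shortest two-part code for n examples."""
    return min(HYPOTHESES, key=lambda name: two_part_length(*HYPOTHESES[name], n))
```

With these assumed numbers, `select(1000)` prefers the shortcut and `select(10000)` prefers the complex feature; the switch happens at roughly 3,000 examples, where the savings in label cost start to outweigh the extra 1,950 bits of model cost.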

5. Identifying Connectivity Distributions from Neural Dynamics Using Flows

ArXiv ID: 2603.26506

Authors: Timothy Doyeon Kim, Ulises Pereira-Obilinovic, Yiliu Wang, Eric Shea-Brown, Uygar Sümbül

Abstract: Connectivity structure shapes neural computation, but inferring this structure from population recordings is degenerate: multiple connectivity structures can generate identical dynamics. Recent work uses low-rank recurrent neural networks (lrRNNs) to infer low-dimensional latent dynamics and connectivity structure from observed activity, enabling a mechanistic interpretation of the dynamics. However, standard approaches for training lrRNNs can recover spurious structures irrelevant to the underlying dynamics. We first characterize the identifiability of connectivity structures in lrRNNs and determine conditions under which a unique solution exists. Then, to find such solutions, we develop an inference framework based on maximum entropy and continuous normalizing flows (CNFs), trained via flow matching. Instead of estimating a single connectivity matrix, our method learns the maximally unbiased distribution over connection weights consistent with observed dynamics. This approach captures complex yet necessary distributions such as heavy-tailed connectivity found in empirical data. We validate our method on synthetic datasets with connectivity structures that generate multistable attractors, limit cycles, and ring attractors, and demonstrate its applicability in recordings from rat frontal cortex during decision-making. Our framework shifts circuit inference from recovering connectivity to identifying which connectivity structures are computationally required, and which are artifacts of underconstrained inference.

Comment: Studies identifiability of low-rank RNN connectivity and proposes flow-based maximum-entropy inference of connectivity distributions from dynamics, squarely in representation/mechanistic theory.

Relevance: 9 Novelty: 8

6. Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

ArXiv ID: 2603.26330

Authors: Yiming Ren, Yujiu Yang, Junjie Wang

Abstract: Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.

Comment: Introduces an input-adaptive cross-depth aggregation mechanism to preserve depth-wise representations during VLM fine-tuning, directly targeting architectural/training dynamics.

Relevance: 9 Novelty: 7

7. Optimization Trade-offs in Asynchronous Federated Learning: A Stochastic Networks Approach

ArXiv ID: 2603.26231

Authors: Abdelkrim Alahyane (LAAS-SARA), Céline Comte (CNRS, LAAS-SARA), Matthieu Jonckheere (CNRS, LAAS-SARA)

Abstract: Synchronous federated learning scales poorly due to the straggler effect. Asynchronous algorithms increase the update throughput by processing updates upon arrival, but they introduce two fundamental challenges: gradient staleness, which degrades convergence, and bias toward faster clients under heterogeneous data distributions. Although algorithms such as AsyncSGD and Generalized AsyncSGD mitigate this bias via client-side task queues, most existing analyses neglect the underlying queueing dynamics and lack closed-form characterizations of the update throughput and gradient staleness. To close this gap, we develop a stochastic queueing-network framework for Generalized AsyncSGD that jointly models random computation times at the clients and the central server, as well as random uplink and downlink communication delays. Leveraging product-form network theory, we derive a closed-form expression for the update throughput, alongside closed-form upper bounds for both the communication round complexity and the expected wall-clock time required to reach an ε-stationary point. These results formally characterize the trade-off between gradient staleness and wall-clock convergence speed. We further extend the framework to quantify energy consumption under stochastic timing, revealing an additional trade-off between convergence speed and energy efficiency. Building on these analytical results, we propose gradient-based optimization strategies to jointly optimize routing and concurrency. Experiments on EMNIST demonstrate reductions of 29%–46% in convergence time and 36%–49% in energy consumption compared to AsyncSGD.

Comment: Derives closed-form throughput, staleness, and convergence trade-offs for asynchronous federated optimization using a queueing-network model, a strong training-systems contribution.

Relevance: 8 Novelty: 8

8. On the Expressive Power of Contextual Relations in Transformers

ArXiv ID: 2603.25860

Authors: Demián Fraiman

Abstract: Transformer architectures have achieved remarkable empirical success in modeling contextual relationships in natural language, yet a precise mathematical characterization of their expressive power remains incomplete. In this work, we introduce a measure-theoretic framework for contextual representations in which texts are modeled as probability measures over a semantic embedding space, and contextual relations between words are represented as coupling measures between them. Within this setting, we introduce Sinkhorn Transformer, a transformer-like architecture. Our main result is a universal approximation theorem: any continuous coupling function between probability measures that encodes the semantic relation coupling measure can be uniformly approximated by a Sinkhorn Transformer with appropriate parameters.

Comment: Studies transformer expressive power with a universal approximation theorem for contextual relations, offering foundational architectural theory rather than task packaging.

Relevance: 8 Novelty: 8
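
Sketch: The abstract represents contextual relations as couplings, that is, joint measures with prescribed marginals. The example below shows standard Sinkhorn normalization, which turns a matrix of pairwise scores into such a coupling by alternately rescaling rows and columns. How the paper's Sinkhorn Transformer uses this step is not stated in the abstract, so the function name `sinkhorn_coupling`, the temperature `eps`, and the iteration count are assumptions made for illustration.

```python
import math


def sinkhorn_coupling(scores, row_marginal, col_marginal, eps=1.0, iters=200):
    """Return a positive matrix whose row/column sums match the given marginals.

    scores: list of rows of pairwise similarity scores (higher = more related).
    row_marginal, col_marginal: positive weights that each sum to 1.
    """
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    u = [1.0] * len(row_marginal)
    v = [1.0] * len(col_marginal)
    for _ in range(iters):
        # Rescale rows so that row sums match row_marginal.
        u = [a / sum(k * vj for k, vj in zip(row, v))
             for a, row in zip(row_marginal, kernel)]
        # Rescale columns so that column sums match col_marginal.
        v = [b / sum(k * ui for k, ui in zip(col, u))
             for b, col in zip(col_marginal, zip(*kernel))]
    return [[ui * k * vj for k, vj in zip(row, v)] for ui, row in zip(u, kernel)]
```

Unlike a row-wise softmax, which only normalizes each row, the result here respects both marginals at once, which is what makes it a coupling between the two measures.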

9. Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer

ArXiv ID: 2603.26097

Authors: Yulun Wu, Sravan Kumar Ankireddy, Samuel Sharpe, Nikita Seleznev, Dehao Yuan, Hyeji Kim, Nam H. Nguyen

Abstract: Efficiently aggregating spatial or temporal horizons to acquire compact representations has become a unifying principle in modern deep learning models, yet learning data-adaptive representations for long-horizon sequence data, especially continuous sequences like time series, remains an open challenge. While fixed-size patching has improved scalability and performance, discovering variable-sized, data-driven patches end-to-end often forces models to rely on soft discretization, specific backbones, or heuristic rules. In this work, we propose Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and its downstream sequence backbone model using reinforcement learning. By formulating patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG), ReinPatch bypasses the need for continuous relaxations and performs dynamic patching policy optimization in a natural manner. Moreover, our method allows strict enforcement of a desired compression rate, freeing the downstream backbone to scale efficiently, and naturally supports multi-level hierarchical modeling. We evaluate ReinPatch on time-series forecasting datasets, where it demonstrates compelling performance compared to state-of-the-art data-driven patching strategies. Furthermore, our detached design allows the patching module to be extracted as a standalone foundation patcher, providing the community with visual and empirical insights into the segmentation behaviors preferred by a purely performance-driven neural patching strategy.

Comment: Learns dynamic patch boundaries end-to-end via reinforcement learning, a genuine adaptive tokenization mechanism for compact sequence representations.

Relevance: 8 Novelty: 8

10. Contrastive Conformal Sets

ArXiv ID: 2603.26261

Authors: Yahya Alkhatib, Wee Peng Tay

Abstract: Contrastive learning produces coherent semantic feature embeddings by encouraging positive samples to cluster closely while separating negative samples. However, existing contrastive learning methods lack principled guarantees on coverage within the semantic feature space. We extend conformal prediction to this setting by introducing minimum-volume covering sets equipped with learnable generalized multi-norm constraints. We propose a method that constructs conformal sets guaranteeing user-specified coverage of positive samples while maximizing negative sample exclusion. We establish theoretically that volume minimization serves as a proxy for negative exclusion, enabling our approach to operate effectively even when negative pairs are unavailable. The positive inclusion guarantee inherits the distribution-free coverage property of conformal prediction, while negative exclusion is maximized through learned set geometry optimized on a held-out training split. Experiments on simulated and real-world image datasets demonstrate improved inclusion-exclusion trade-offs compared to standard distance-based conformal baselines.

Comment: Representation-learning method giving conformal coverage guarantees directly in contrastive embedding space.

Relevance: 8 Novelty: 8
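
Sketch: For orientation, the example below shows the kind of standard distance-based split-conformal baseline that the abstract compares against, not the learned multi-norm sets the paper proposes. It assumes Euclidean distance between an anchor embedding and its positive as the nonconformity score; the function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the requested
    coverage level 1 - alpha.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def conformal_set(anchor, candidates, threshold):
    """Keep the candidates whose distance to the anchor is within the threshold."""
    return [c for c in candidates if math.dist(anchor, c) <= threshold]
```

A typical use would compute `calibration_scores` as anchor-to-positive distances on held-out pairs, call `conformal_threshold(scores, alpha=0.1)` for 90% target coverage, and then build the set around each new anchor. The resulting set is a ball, which is the fixed geometry the paper aims to improve on.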

11. On associative neural networks for sparse patterns with huge capacities

ArXiv ID: 2603.26217

Authors: Matthias Löwe, Franck Vermet

Abstract: Generalized Hopfield models with higher-order or exponential interaction terms are known to have substantially larger storage capacities than the classical quadratic model. On the other hand, associative memories for sparse patterns, such as the Willshaw and Amari models, already outperform the classical Hopfield model in the sparse regime. In this paper we combine these two mechanisms. We introduce higher-order versions of sparse associative memory models and study their storage capacities. For fixed interaction order n, we obtain storage capacities of polynomial order in the system size. When the interaction order is allowed to grow logarithmically with the number of neurons, this yields super-polynomial capacities. We also discuss an analogue in the Gripon–Berrou architecture which was formulated for non-sparse messages (see \cite{griponc}). Our results show that the capacity increase caused by higher-order interactions persists in the sparse setting, although the precise storage scale depends on the underlying architecture.

Comment: Theory of sparse associative memories showing higher-order interactions yield super-polynomial storage capacity.

Relevance: 8 Novelty: 8

12. Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics

ArXiv ID: 2603.25975

Authors: Peter Balogh

Abstract: We show that they do. Schank's conceptual dependency theory proposed that all events decompose into primitive operations -- ATRANS, PTRANS, MTRANS, and others -- hand-coded from linguistic intuition. Can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder's wake-sleep library learning to event state transformations. Given events as before/after world state pairs, our system finds operator compositions explaining each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping directly to Schank's: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators ("mail" = ATRANS + PTRANS) and novel emotional state operators absent from Schank's taxonomy. We validate on synthetic events and real-world commonsense data from the ATOMIC knowledge graph. On synthetic data, discovered operators achieve Bayesian MDL within 4% of Schank's hand-coded primitives while explaining 100% of events vs. Schank's 81%. On ATOMIC, results are more dramatic: Schank's primitives explain only 10% of naturalistic events, while the discovered library explains 100%. Dominant operators are not physical-action primitives but mental and emotional state changes -- CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) -- none in Schank's original taxonomy. These results provide the first empirical evidence that event primitives can be derived from compression pressure, that Schank's core primitives are information-theoretically justified, and that the complete inventory is substantially richer than proposed -- with mental/emotional operators dominating in naturalistic data.

Comment: Uses wake-sleep compression to rediscover primitive event operators, strongly matching representation structure via compression-induced feature decomposition.

Relevance: 8 Novelty: 8

13. Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

ArXiv ID: 2603.26498

Authors: Konstantinos Papaioannou, Thaleia Dimitra Doudali

Abstract: Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like rocks, images like pebbles, and text like sand. We design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation. RPS-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation. Evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces, on average, time-to-first-token (TTFT) by 54% overall, and by 78.5% for latency-critical requests, compared to current systems. RPS-Serve delivers LLM-like responsiveness for MLLMs, with modality-aware scheduling and by making the most efficient use of the available resources.
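
Sketch: The classify, prioritize, and age loop described in the abstract can be illustrated with a toy priority rule. The example below is not RPS-Serve's actual policy; the base priorities, the linear aging rule, the `aging_rate` value, and all names are assumptions chosen only to show how aging lets a long-waiting video ("rock") eventually overtake newly arrived text ("sand").

```python
from dataclasses import dataclass

# Hypothetical base priorities: lower value is served first.
BASE_PRIORITY = {"text": 0.0, "image": 1.0, "video": 2.0}


@dataclass
class Request:
    req_id: str
    modality: str   # "text", "image", or "video"
    arrival: float  # arrival time in seconds


def effective_priority(req, now, aging_rate=0.1):
    """Base priority of the modality, lowered linearly with waiting time."""
    return BASE_PRIORITY[req.modality] - aging_rate * (now - req.arrival)


def pick_next(queue, now, aging_rate=0.1):
    """Pick the request with the lowest effective priority; ties go to the oldest."""
    return min(queue, key=lambda r: (effective_priority(r, now, aging_rate), r.arrival))
```

With these assumed values, a video that has waited 10 seconds still loses to a text request that arrived 1 second ago, but after 40 seconds of waiting the video is served first, so it cannot starve.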

Comment: Modality-aware scheduling for MLLM inference is a concrete large-scale serving systems contribution, introducing a new scheduler to control head-of-line blocking across heterogeneous request sizes.