Personalized Daily ArXiv Papers 2026-03-27

Model	Metric	Usage (Prompt)	Usage (Completion)	Usage (Total)	Papers (Total arXiv)	Papers (Scanned)	Papers (Relevant)
`gpt-5.4`	Tokens	116876	4798	121674	540	324	33
`gpt-5.4`	Cost	$0.29	$0.07	$0.36	540	324	33

Table of contents with paper titles:

Labeled Compression Schemes for Concept Classes of Finite Functions Authors: Benchong Li
Perturbation: A simple and efficient adversarial tracer for representation learning in language models Authors: Joshua Rozner, Cory Shain
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen
Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic Authors: Anand Swaroop
Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding Authors: Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu
Manifold Generalization Provably Proceeds Memorization in Diffusion Models Authors: Zebang Shen, Ya-Ping Hsieh, Niao He
MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning Authors: Andrea Manzoni
StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation Authors: Zhiyuan Chen, Yuxuan Zhong, Fan Wang, Bo Yu, Pengtao Shao, Shaoshan Liu, Ning Ding
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct Authors: Christopher Ackerman, Nina Panickssery
The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations Authors: Long Zhang, Dai-jun Lin, Wei-neng Chen
Identification of NMF by choosing maximum-volume basis vectors Authors: Qianqian Qi, Zhongming Chen, Peter G. M. van der Heijden
Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes Authors: Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun
Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method Authors: Arthur Jacot
Deep Neural Regression Collapse Authors: Akshay Rangamani, Altay Unal
DVM: Real-Time Kernel Generation for Dynamic AI Models Authors: Jingzhi Fang, Xiong Gao, Renwei Zhang, Zichun Ye, Lei Chen, Jie Zhao, Chengnuo Huang, Hui Xu, Xuefeng Jin
Minimal Sufficient Representations for Self-interpretable Deep Neural Networks Authors: Zhiyao Tan, Liu Li, Huazhen Lin
Likelihood hacking in probabilistic program synthesis Authors: Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton
AVO: Agentic Variation Operators for Autonomous Evolutionary Search Authors: Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, Yu-Jung Chen, Hanfeng Chen, Aditya Kane, Ronny Krashinsky, Ming-Yu Liu, Vinod Grover, Luis Ceze, Roger Bringmann, John Tran, Wei Liu, Fung Xie, Michael Lightstone, Humphrey Shi
The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation Authors: Mingyi Liu
Kirchhoff-Inspired Neural Networks for Evolving High-Order Perception Authors: Tongfei Chen, Jingying Yang, Linlin Yang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Juan Zhang, Guodong Guo, Baochang Zhang
A Theory of LLM Information Susceptibility Authors: Zhuo-Yang Song, Hua Xing Zhu
Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness Authors: Yunrui Yu, Hang Su, Jun Zhu
The Diminishing Returns of Early-Exit Decoding in Modern LLMs Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang
Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score Authors: Jimyung Hong, Jaehyung Kim
Resolving gradient pathology in physics-informed epidemiological models Authors: Nickson Golooba, Woldegebriel Assefa Woldegerima
Self-Distillation for Multi-Token Prediction Authors: Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep Authors: Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification Authors: Han Sun, Qin Li, Peixin Wang, Min Zhang
Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu
Sparse Autoencoders for Interpretable Medical Image Representation Learning Authors: Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis
Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL Authors: Igor Jankowski
Upper Entropy for 2-Monotone Lower Probabilities Authors: Tuan-Anh Vu, Sébastien Destercke, Frédéric Pichon

1. Labeled Compression Schemes for Concept Classes of Finite Functions

ArXiv ID: 2603.23561

Authors: Benchong Li

Abstract: The sample compression conjecture states that each concept class of VC dimension d has a compression scheme of size d. In this paper, for any concept class of finite functions, we present a labeled sample compression scheme of size equal to its VC dimension d. That is, the long-standing open sample compression conjecture is resolved.

Comment: Foundational learning theory result resolving the sample compression conjecture for concept classes of finite functions.

Relevance: 8 Novelty: 10

2. Perturbation: A simple and efficient adversarial tracer for representation learning in language models

ArXiv ID: 2603.23821

Authors: Joshua Rozner, Cory Shain

Abstract: Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al. 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.

Comment: Introduces a new perturbation-based method to trace learned linguistic representations via transfer from single-example adversarial fine-tuning.

Relevance: 9 Novelty: 8

3. MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

ArXiv ID: 2603.23516

Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen

Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.

Comment: Introduces Memory Sparse Attention with scalable sparse attention, document-wise RoPE, and KV-cache compression for end-to-end long-context training and inference efficiency.

Relevance: 9 Novelty: 8

4. Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic

ArXiv ID: 2603.23784

Authors: Anand Swaroop

Abstract: Grokking, the phenomenon where validation accuracy of neural networks on modular addition of two integers rises long after training data has been memorized, has been characterized in previous works as producing sinusoidal input weight distributions in transformers and multi-layer perceptrons (MLPs). We find empirically that ReLU MLPs in our experimental setting instead learn near-binary square wave input weights, where intermediate-valued weights appear exclusively near sign-change boundaries, alongside output weight distributions whose dominant Fourier phases satisfy a phase-sum relation $\phi_{\mathrm{out}} = \phi_a + \phi_b$; this relation holds even when the model is trained on noisy data and fails to grok. We extract the frequency and phase of each neuron's weights via DFT and construct an idealized MLP: input weights are replaced by perfect binary square waves and output weights by cosines, both parametrized by the frequencies, phases, and amplitudes extracted from the dominant Fourier components of the real model weights. This idealized model achieves 95.5% accuracy when the frequencies and phases are extracted from the weights of a model trained on noisy data that itself achieves only 0.23% accuracy. This suggests that grokking does not discover the correct algorithm, but rather sharpens an algorithm substantially encoded during memorization, progressively binarizing the input weights into cleaner square waves and aligning the output weights, until generalization becomes possible.

Comment: Mechanistic training-dynamics study of grokking showing latent algorithmic structure emerges before generalization in ReLU MLPs.

Relevance: 9 Novelty: 8
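
The DFT extraction step described in the abstract above can be illustrated with a minimal sketch. It covers only the input-weight side (dominant frequency and phase, then an idealized binary square wave). The function names, the toy modulus `p = 23`, and the tanh-shaped toy weights are assumptions made for illustration and are not taken from the paper.

```python
import cmath
import math

def dominant_fourier_component(weights):
    """Return (frequency, phase, amplitude) of the strongest non-DC DFT component."""
    p = len(weights)
    best = (0, 0.0, 0.0)
    for f in range(1, p // 2 + 1):
        # Plain O(p) DFT coefficient at frequency f.
        coeff = sum(w * cmath.exp(-2j * math.pi * f * x / p)
                    for x, w in enumerate(weights))
        amplitude = 2 * abs(coeff) / p
        if amplitude > best[2]:
            best = (f, cmath.phase(coeff), amplitude)
    return best

def square_wave(p, frequency, phase):
    """Idealized binary square wave: the sign of cos(2*pi*f*x/p + phase)."""
    return [1.0 if math.cos(2 * math.pi * frequency * x / p + phase) >= 0 else -1.0
            for x in range(p)]

# Toy "neuron": a softened square wave over inputs 0..p-1 (hypothetical data).
p = 23
weights = [math.tanh(4 * math.cos(2 * math.pi * 5 * x / p + 0.68)) for x in range(p)]

freq, phase, amp = dominant_fourier_component(weights)
ideal = square_wave(p, freq, phase)

# Fraction of inputs where the idealized wave has the same sign as the toy weights.
agreement = sum((w >= 0) == (s > 0) for w, s in zip(weights, ideal)) / p
print(freq, round(phase, 3), round(amp, 3), agreement)
```

The abstract also describes replacing output weights with cosines built from the same extracted parameters; that half is omitted here to keep the sketch short.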

5. Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

ArXiv ID: 2603.23914

Authors: Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework for large vision-language models that improves memory efficiency during decoding, addressing the challenges posed by the large number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) we introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling either higher batch sizes and faster batch inference while preserving model output quality, or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.

Comment: Memory-efficient inference via low-rank multi-head attention compaction and token-specific decompression for KV storage.

Relevance: 9 Novelty: 8

6. Manifold Generalization Provably Proceeds Memorization in Diffusion Models

ArXiv ID: 2603.23792

Authors: Zebang Shen, Ya-Ping Hsieh, Niao He

Abstract: Diffusion models often generate novel samples even when the learned score is only coarse, a phenomenon not accounted for by the standard view of diffusion training as density estimation. In this paper, we show that, under the manifold hypothesis, this behavior can instead be explained by coarse scores capturing the geometry of the data while discarding the fine-scale distributional structure of the population measure $\mu_{\mathrm{data}}$. Concretely, whereas estimating the full data distribution $\mu_{\mathrm{data}}$ supported on a $k$-dimensional manifold is known to require the classical minimax rate $\tilde{\mathcal{O}}(N^{-1/k})$, we prove that diffusion models trained with coarse scores can exploit the regularity of the manifold support and attain a near-parametric rate toward a different target distribution. This target distribution has density uniformly comparable to that of $\mu_{\mathrm{data}}$ throughout any $\tilde{\mathcal{O}}\bigl(N^{-\beta/(4k)}\bigr)$-neighborhood of the manifold, where $\beta$ denotes the manifold regularity. Our guarantees therefore depend only on the smoothness of the underlying support, and are especially favorable when the data density itself is irregular, for instance non-differentiable. In particular, when the manifold is sufficiently smooth, we obtain that generalization, formalized as the ability to generate novel, high-fidelity samples, occurs at a statistical rate strictly faster than that required to estimate the full population distribution $\mu_{\mathrm{data}}$.

Comment: Provides theory that diffusion models can learn manifold geometry and generalize before memorizing the full data distribution, a strong representation-learning insight.

Relevance: 8 Novelty: 9

7. MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

ArXiv ID: 2603.24044

Authors: Andrea Manzoni

Abstract: Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated ("cold"). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.

Comment: Routing-guided LoRA for MoE fine-tuning directly studies expert skew and adapts only highly routed experts, a clear architecture-mechanism and efficient adaptation contribution.

Relevance: 9 Novelty: 7
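
The selection step in the MoE-Sieve abstract above (profile routing counts on a calibration set, keep the top-k most-routed experts per layer) might look roughly like the sketch below. The `routing_log` format, the function name, and the toy counts are hypothetical; attaching LoRA adapters to the selected experts is framework-specific and is not shown.

```python
from collections import Counter

def select_hot_experts(routing_log, num_experts, fraction=0.25):
    """Pick the most-routed experts per layer from calibration routing counts.

    routing_log: iterable of (layer_index, expert_index) pairs, one per routed
    token (a hypothetical format chosen for this sketch).
    Returns {layer_index: sorted list of selected expert indices}.
    """
    k = max(1, round(num_experts * fraction))
    counts = {}
    for layer, expert in routing_log:
        counts.setdefault(layer, Counter())[expert] += 1
    selected = {}
    for layer, counter in counts.items():
        # Rank by count (descending); ties are broken by expert index.
        ranked = sorted(range(num_experts), key=lambda e: (-counter[e], e))
        selected[layer] = sorted(ranked[:k])
    return selected

# Toy calibration log: 8 experts, 2 layers, heavily skewed routing.
toy_log = (
    [(0, 3)] * 50 + [(0, 5)] * 30 + [(0, 1)] * 5 + [(0, 7)] * 2
    + [(1, 0)] * 40 + [(1, 2)] * 35 + [(1, 6)] * 3
)
hot = select_hot_experts(toy_log, num_experts=8, fraction=0.25)
print(hot)  # {0: [3, 5], 1: [0, 2]}
```

With 8 experts and a 25% budget, two experts per layer are kept; only those would receive adapters under the scheme the abstract describes.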

8. StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation

ArXiv ID: 2603.23571

Authors: Zhiyuan Chen, Yuxuan Zhong, Fan Wang, Bo Yu, Pengtao Shao, Shaoshan Liu, Ning Ding

Abstract: Effective navigation intelligence relies on long-term memory to support both immediate generalization and sustained adaptation. However, existing approaches face a dilemma: modular systems rely on explicit mapping but lack flexibility, while Transformer-based end-to-end models are constrained by fixed context windows, limiting persistent memory across extended interactions. We introduce StateLinFormer, a linear-attention navigation model trained with a stateful memory mechanism that preserves recurrent memory states across consecutive training segments instead of reinitializing them at each batch boundary. This training paradigm effectively approximates learning on infinitely long sequences, enabling the model to achieve long-horizon memory retention. Experiments across both MAZE and ProcTHOR environments demonstrate that StateLinFormer significantly outperforms its stateless linear-attention counterpart and standard Transformer baselines with fixed context windows. Notably, as interaction length increases, persistent stateful training substantially improves context-dependent adaptation, suggesting an enhancement in the model's In-Context Learning (ICL) capabilities for navigation tasks.

Comment: Stateful training for linear attention preserves recurrent memory across batch boundaries, directly targeting architecture/training dynamics for long-horizon sequence learning.

Relevance: 9 Novelty: 7
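
To make the idea of preserving recurrent memory state across segment boundaries concrete, here is a small sketch using unnormalized causal linear attention in pure Python. It shows only the forward-pass effect of carrying state versus resetting it; the function name, the toy tensors, and the omission of any feature map, normalizer, or gradient handling are simplifying assumptions, not details from the paper.

```python
def linear_attention_segment(queries, keys, values, state=None):
    """Run causal, unnormalized linear attention over one segment.

    state is a d_k x d_v matrix (list of lists) summarizing all earlier tokens;
    None means start from zeros. Returns (outputs, final_state).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    if state is None:
        state = [[0.0] * d_v for _ in range(d_k)]
    else:
        state = [row[:] for row in state]  # copy so the caller's state is untouched
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Recurrent update: state += outer(k, v)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Readout: output = q @ state
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs, state

# Toy sequence of four tokens with 2-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]

full_out, _ = linear_attention_segment(q, k, v)

# Stateful: split into two segments and carry the state across the boundary.
first_out, carried = linear_attention_segment(q[:2], k[:2], v[:2])
second_stateful, _ = linear_attention_segment(q[2:], k[2:], v[2:], state=carried)

# Stateless: reinitialize the state at the segment boundary.
second_stateless, _ = linear_attention_segment(q[2:], k[2:], v[2:])

print(first_out + second_stateful == full_out)
print(second_stateless == full_out[2:])
```

Carrying the state reproduces the outputs of a single pass over the whole sequence, while resetting it discards everything seen before the boundary, which is the gap the abstract attributes to stateless training.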

9. Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

ArXiv ID: 2410.02064

Authors: Christopher Ackerman, Nina Panickssery

Abstract: It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

Comment: Mechanistic interpretability of residual-stream directions controlling self-authorship judgments in Llama-3 aligns with representation structure and causal circuit analysis.