Personalized Daily ArXiv Papers 2025-12-19

[gpt-5]	Prompt	Completion	Total
Token	40828	38231	79059
Cost	$0.05	$0.38	$0.43

Total arXiv papers: 470

Total scanned papers: 297

Total relevant papers: 31

Table of contents with paper titles:

How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models Authors: Ali Ghodsi
Provably Extracting the Features from a General Superposition Authors: Allen Liu
In-Context Multi-Operator Learning with DeepOSets Authors: Shao-Ting Chiu, Aditya Nambiar, Ali Syed, Jonathan W. Siegel, Ulisses Braga-Neto
Random matrix theory of sparse neuronal networks with heterogeneous timescales Authors: Thiparat Chotibut, Oleg Evnin, Weerawit Horinouchi
DEER: Draft with Diffusion, Verify with Autoregressive Models Authors: Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu
AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines Authors: Dimitrios Danopoulos, Enrico Lupi, Chang Sun, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini
Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models Authors: Caner Erden
NRGPT: An Energy-based Alternative for GPT Authors: Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov
In-Context Algebra Authors: Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau
Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference Authors: Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee
Time-Frequency Analysis for Neural Networks Authors: Ahmed Abdeljawad, Elena Cordero
MEPIC: Memory Efficient Position Independent Caching for LLM Serving Authors: Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong, Shan Yizhou, Wei Zhang, Wang Lan, Ying Xiong, Yong Zhang, Zhenan Fan
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving Authors: Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, Jiayi Yao, Qizheng Zhang, Ganesh Ananthanarayanan, Junchen Jiang
Batch Normalization-Free Fully Integer Quantized Neural Networks via Progressive Tandem Learning Authors: Pengfei Sun, Wenyu Jiang, Piew Yoong Chee, Paul Devos, Dick Botteldooren
CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity Authors: Jinhao Zhang, Yunquan Zhang, Daning Chen
From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts? Authors: Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger
SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference Authors: Jeff Smith
Geometric Laplace Neural Operator Authors: Hao Tang, Jiongyu Zhu, Zimeng Feng, Hao Li, Chao Li
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers Authors: Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks
Muon is Provably Faster with Momentum Variance Reduction Authors: Xun Qian, Hussein Rammal, Dmitry Kovalev, Peter Richtárik
On the Universal Representation Property of Spiking Neural Networks Authors: Shayan Hundrieser, Philipp Tuchel, Insung Kong, Johannes Schmidt-Hieber
Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants Authors: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt
KOSS: Kalman-Optimal Selective State Spaces for Long-Term Sequence Modeling Authors: Lei Wang, Xin Tan, Mingwei Wang, Ying Zhang
Tiny Recursive Control: Iterative Reasoning for Efficient Optimal Control Authors: Amit Jain, Richard Linares
Soft Geometric Inductive Bias for Object Centric Dynamics Authors: Hampus Linander, Conor Heins, Alexander Tschantz, Marco Perin, Christopher Buckley
AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs Authors: Anshul Kumar, Gagan Raj Gupta, Manisha Chawla
In-Context Semi-Supervised Learning Authors: Jiashuo Fan, Paul Rosu, Aaron T. Wang, Michael Li, Lawrence Carin, Xiang Cheng
Cartesian-nj: Extending e3nn to Irreducible Cartesian Tensor Product and Contraction Authors: Zemin Xu, Chenyu Wu, Wenbo Xie, Daiqian Xie, P. Hu
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models Authors: Mikel Williams-Lekuona, Georgina Cosma
How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference Authors: Jian Tian, Shuailong Li, Yang Cao, Wenbo Cui, Minghan Zhu, Wenkang Wu, Jianming Zhang, Yanpeng Wang, Zhiwen Xiao, Zhenyu Hou, Dou Shen

1. How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

ArXiv ID: 2512.15115

Authors: Ali Ghodsi

Abstract: Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.

Comment: Matches Model Architecture and Representation Learning—unified attention/SSM framework with head-count and gradient propagation theory.

Relevance: 10 Novelty: 9
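
To make the "effective interaction operator" view in the abstract above concrete, here is a minimal sketch for the simplest implicit case: a scalar linear SSM. The recurrence, the parameter names `a`, `b`, `c`, and the helper functions are illustrative assumptions, not the paper's construction. Unrolling h_t = a*h_{t-1} + b*x_t, y_t = c*h_t gives a causal operator with entries W[i][j] = c * a**(i-j) * b, and the last row of W shows how the sensitivity of the final output to earlier inputs shrinks with distance when |a| < 1.

```python
def ssm_recurrence(x, a, b, c):
    """Run the scalar linear SSM h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def ssm_interaction_operator(n, a, b, c):
    """Causal operator induced by the recurrence: W[i][j] = c*a**(i-j)*b for j <= i."""
    return [[c * a ** (i - j) * b if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def apply_operator(W, x):
    """Compute y_i = sum_j W[i][j] * x_j."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


# Toy input and a stable system (|a| < 1); all values are arbitrary.
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
a, b, c = 0.8, 1.0, 2.0

W = ssm_interaction_operator(len(x), a, b, c)
y_rec = ssm_recurrence(x, a, b, c)
y_op = apply_operator(W, x)

# The recurrence and the explicit operator agree up to rounding.
max_gap = max(abs(u - v) for u, v in zip(y_rec, y_op))

# d y_{n-1} / d x_j equals W[n-1][j]; it decays geometrically with the lag.
sensitivity = W[-1]
```

In this toy setting the operator is fixed by three numbers, whereas the abstract's attention-style class lets W depend on the input through scalar coefficients; the sketch only illustrates the state-space side of that comparison.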

2. Provably Extracting the Features from a General Superposition

ArXiv ID: 2512.15987

Authors: Allen Liu

Abstract: It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function \[ f(x)=\sum_{i=1}^n a_i\,\sigma_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R} \rightarrow \mathbb{R}$ is an arbitrary response function and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e. $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results -- we allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and general response functions $\sigma_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.

Comment: Matches Representation Learning—provable recovery of features from general superposition with efficient query algorithm.

Relevance: 10 Novelty: 9
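
The query model in the abstract above can be sketched as a small black-box oracle. This only sets up a toy instance of $f(x)=\sum_i a_i\,\sigma_i(v_i^\top x)$ in the overcomplete regime ($n > d$); it does not implement the paper's Fourier-space search, and the specific directions, weights, response functions, and the helper `make_oracle` are all hypothetical choices for illustration.

```python
import math


def make_oracle(directions, weights, responses):
    """Return a black-box f(x) = sum_i a_i * sigma_i(<v_i, x>)."""
    def f(x):
        total = 0.0
        for a, v, sigma in zip(weights, directions, responses):
            projection = sum(vk * xk for vk, xk in zip(v, x))
            total += a * sigma(projection)
        return total
    return f


# Three unit feature directions in a 2-dimensional space, so n = 3 > d = 2.
d = 2
angles = [0.0, math.pi / 3, 2 * math.pi / 3]
directions = [(math.cos(t), math.sin(t)) for t in angles]

# Arbitrary weights a_i and a different response sigma_i per feature.
weights = [1.0, -0.5, 2.0]
responses = [math.tanh, lambda z: max(z, 0.0), lambda z: z * z]

f = make_oracle(directions, weights, responses)
n = len(directions)

# A learner in this setting sees only values such as this one, never the v_i.
value = f((1.0, 0.0))
```

The recovery problem is then to identify the hidden `directions` from query values alone, which is what makes the case with more features than dimensions hard.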

3. In-Context Multi-Operator Learning with DeepOSets

ArXiv ID: 2512.16074

Authors: Shao-Ting Chiu, Aditya Nambiar, Ali Syed, Jonathan W. Siegel, Ulisses Braga-Neto

Abstract: In-context Learning (ICL) is the remarkable capability displayed by some machine learning models to learn from examples in a prompt, without any further weight updates. ICL had originally been thought to emerge from the self-attention mechanism in autoregressive transformer architectures. DeepOSets is a non-autoregressive, non-attention based neural architecture that combines set learning via the DeepSets architecture with operator learning via Deep Operator Networks (DeepONets). In a previous study, DeepOSets was shown to display ICL capabilities in supervised learning problems. In this paper, we show that the DeepOSets architecture, with the appropriate modifications, is a multi-operator in-context learner that can recover the solution operator of a new PDE, not seen during training, from example pairs of parameter and solution placed in a user prompt, without any weight updates. Furthermore, we show that DeepOSets is a universal uniform approximator over a class of continuous operators, which we believe is the first result of its kind in the literature of scientific machine learning. This means that a single DeepOSets architecture exists that approximates in-context any continuous operator in the class to any fixed desired degree of accuracy, given an appropriate number of examples in the prompt. Experiments with Poisson and reaction-diffusion forward and inverse boundary-value problems demonstrate the ability of the proposed model to use in-context examples to predict accurately the solutions corresponding to parameter queries for PDEs not seen during training.

Comment: Matches Model Architecture with a non-attention, non-autoregressive design (DeepOSets) that exhibits in-context learning and provides a universal operator-approximation theory.

Relevance: 9 Novelty: 9
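
As a rough illustration of how a set encoder and an operator-style readout can fit together without attention, the sketch below pools the prompt examples with a DeepSets-style mean and feeds the result to a DeepONet-style branch/trunk inner product. Everything here is an assumption made for illustration: the feature map `phi`, the fixed (untrained) `branch` and `trunk` maps, and the use of scalars in place of discretized parameter and solution functions. It is not the DeepOSets architecture from the paper; it only shows why the prediction cannot depend on the order of the in-context examples.

```python
import math


def phi(pair):
    """Hypothetical per-example feature map for one (parameter, solution) pair."""
    u, s = pair
    return [u, s, u * s, math.tanh(u - s)]


def encode_prompt(examples):
    """DeepSets-style encoder: mean-pool per-example features over the prompt."""
    feats = [phi(p) for p in examples]
    return [sum(col) / len(feats) for col in zip(*feats)]


def branch(embedding, query_param):
    """Hypothetical fixed map from (prompt embedding, query) to coefficients."""
    z = embedding + [query_param]
    return [sum(z), z[0] - z[-1], 0.5 * z[1]]


def trunk(y):
    """Hypothetical basis functions evaluated at the output location y."""
    return [1.0, y, math.sin(y)]


def predict(examples, query_param, y):
    """DeepONet-style readout: inner product of branch and trunk outputs."""
    b = branch(encode_prompt(examples), query_param)
    return sum(bk * tk for bk, tk in zip(b, trunk(y)))


# Toy prompt of (parameter, solution) pairs; values are arbitrary.
prompt = [(0.1, 0.4), (0.7, -0.2), (1.3, 0.9)]
out_a = predict(prompt, query_param=0.5, y=0.25)
out_b = predict(list(reversed(prompt)), query_param=0.5, y=0.25)
```

Reordering the prompt leaves the output unchanged up to rounding, because the examples enter only through a symmetric pooling step; a trained model would replace the fixed maps with learned networks.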

4. Random matrix theory of sparse neuronal networks with heterogeneous timescales

ArXiv ID: 2512.12767

Authors: Thiparat Chotibut, Oleg Evnin, Weerawit Horinouchi

Abstract: Training recurrent neuronal networks consisting of excitatory (E) and inhibitory (I) units with additive noise for working memory computation slows and diversifies inhibitory timescales, leading to improved task performance that is attributed to emergent marginally stable equilibria [PNAS 122 (2025) e2316745122]. Yet the link between trained network characteristics and their roles in shaping desirable dynamical landscapes remains unexplored. Here, we investigate the Jacobian matrices describing the dynamics near these equilibria and show that they are sparse, non-Hermitian rectangular-block matrices modified by heterogeneous synaptic decay timescales and activation-function gains. We specify a random matrix ensemble that faithfully captures the spectra of trained Jacobian matrices, arising from the inhibitory core - excitatory periphery network motif (pruned E weights, broadly distributed I weights) observed post-training. An analytic theory of this ensemble is developed using statistical field theory methods: a Hermitized resolvent representation of the spectral density processed with a supersymmetry-based treatment in the style of Fyodorov and Mirlin. In this manner, an analytic description of the spectral edge is obtained, relating statistical parameters of the Jacobians (sparsity, weight variances, E/I ratio, and the distributions of timescales and gains) to near-critical features of the equilibria essential for robust working memory computation.

Comment: Representation Learning/Theory: random matrix analysis of sparse E/I networks’ Jacobians links sparsity, timescales, and gains to spectral edge and dynamics.