Personalized Daily ArXiv Papers 2025-10-16

[gpt-5]	Prompt	Completion	Total
Token	58671	60505	119176
Cost	$0.07	$0.61	$0.68

Total arXiv papers: 539

Total scanned papers: 301

Total relevant papers: 30

Table of contents with paper titles:

Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models Authors: Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi
Chimera: State Space Models Beyond Sequences Authors: Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu
On efficiently computable functions, deep networks and sparse compositionality Authors: Tomaso Poggio
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps Authors: Do Tien Hai, Trung Nguyen Mai, TrungTin Nguyen, Nhat Ho, Binh T. Nguyen, Christopher Drovandi
Dr.LLM: Dynamic Layer Routing in LLMs Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
NOSA: Native and Offloadable Sparse Attention Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory Authors: Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet
MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics Authors: Bowei Guo, Shengkun Tang, Cong Zeng, Zhiqiang Shen
DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation Authors: Yakun Song, Xiaobin Zhuang, Jiawei Chen, Zhikang Niu, Guanrou Yang, Chenpeng Du, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization Authors: Ziyi Han, Huanyu Wang, Zeyu Zhang, Xiangxiang Dai, Xutong Liu, John C. S. Lui
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent Authors: Jiaqi Li, Zhipeng Lou, Johannes Schmidt-Hieber, Wei Biao Wu
Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors Authors: Quentin Fruytier, Akshay Malhotra, Shahab Hamidi-Rad, Aditya Sant, Aryan Mokhtari, Sujay Sanghavi
KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems Authors: Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen
Axial Neural Networks for Dimension-Free Foundation Models Authors: Hyunsu Kim, Jonggeon Park, Joan Bruna, Hongseok Yang, Juho Lee
Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning Authors: Junsoo Oh, Wei Huang, Taiji Suzuki
K-Merge: Online Continual Merging of Adapters for On-device Large Language Models Authors: Donald Shenaj, Ondrej Bohdal, Taha Ceritli, Mete Ozay, Pietro Zanuttigh, Umberto Michieli
CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression Authors: Dayin Gou, Sanghyun Byun, Nilesh Malpeddi, Gabrielle De Micheli, Prathamesh Vaste, Jacob Song, Woo Seong Chung
A Function Centric Perspective On Flat and Sharp Minima Authors: Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing Authors: Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, Chao Zhang
Z0-Inf: Zeroth Order Approximation for Data Influence Authors: Narine Kokhlikyan, Kamalika Chaudhuri, Saeed Mahloujifar
Cautious Weight Decay Authors: Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu
Y-shaped Generative Flows Authors: Arip Asadulaev, Semyon Semenov, Abduragim Shtanchaev, Eric Moulines, Fakhri Karray, Martin Takac
Influence Dynamics and Stagewise Data Attribution Authors: Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland
Rethinking Knowledge Distillation: A Data Dependent Regulariser With a Negative Asymmetric Payoff Authors: Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis
Your VAR Model is Secretly an Efficient and Explainable Generative Classifier Authors: Yi-Chung Chen, David I. Inouye, Jing Gao
SG-XDEAT: Sparsity-Guided Cross-Dimensional and Cross-Encoding Attention with Target-Aware Conditioning in Tabular Learning Authors: Chih-Chuan Cheng, Yi-Ju Tseng
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance Authors: Patrick Haller, Mark Ibrahim, Polina Kirichenko, Levent Sagun, Samuel J. Bell
SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression Authors: Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng
Deep Attention-guided Adaptive Subsampling Authors: Sharath M Shankaranarayana, Soumava Kumar Roy, Prasad Sudhakar, Chandan Aladahalli
Learning Latent Energy-Based Models via Interacting Particle Langevin Dynamics Authors: Joanna Marks, Tim Y. J. Wang, O. Deniz Akyildiz

1. Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

ArXiv ID: 2510.11789

Authors: Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi

Abstract: We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending only on the smoothness $\beta$ of the activation, and crucially independent of token count, ambient dimension, or rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.

Comment: Model Architecture theory: dimension-free minimax rates for learning pairwise interactions in attention-style models.

Relevance: 10 Novelty: 9
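
Note: as a quick numerical reading of the rate quoted in the abstract above, the sketch below evaluates $M^{-\frac{2\beta}{2\beta+1}}$ for a few smoothness values. The function name `minimax_rate` and the chosen values of $\beta$ and $M$ are illustrative assumptions, not taken from the paper.

```python
def minimax_rate(num_samples: int, beta: float) -> float:
    """Evaluate M^(-2*beta / (2*beta + 1)), the rate quoted in the abstract."""
    if num_samples <= 0 or beta <= 0:
        raise ValueError("num_samples and beta must be positive")
    exponent = 2.0 * beta / (2.0 * beta + 1.0)
    return num_samples ** (-exponent)


if __name__ == "__main__":
    # Illustrative values only. Larger beta (a smoother activation) pushes the
    # exponent toward 1, so the rate approaches M^(-1).
    for beta in (0.5, 1.0, 2.0, 8.0):
        print(f"beta={beta}: rate={minimax_rate(10_000, beta):.3e}")
```

Token count, ambient dimension, and rank do not appear as arguments, which is the sense in which the abstract calls the rate dimension-free.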

2. Chimera: State Space Models Beyond Sequences

ArXiv ID: 2510.12111

Authors: Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu

Abstract: Transformer-based deep learning methods have become the standard approach for modeling diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires inductive biases--such as position embeddings in sequences and images, or random walks in graphs--to incorporate topology. However, designing such task-specific biases requires significant effort and can introduce side effects that hinder generalization. We introduce Chimera, a unified model that directly incorporates data topology in a principled way, removing the need for domain-specific biases. The key idea is that state space models--which naturally do not require position embeddings--can be generalized to capture any graph topology. Our experiments show that Chimera achieves strong performance across language, vision, and graph domains, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark. We further propose algorithmic optimizations to improve Chimera's efficiency: (1) for Directed Acyclic Graphs, Chimera can be implemented as a linear-time recurrence; (2) for general graphs, a simple mathematical relaxation achieves Transformer's quadratic complexity without domain-specific heuristics. These results validate Chimera's core contribution and support the idea that data topology is a powerful inductive bias across modalities.

Comment: Model Architecture: extends state space models to arbitrary data topology; Efficiency: linear-time recurrence on DAGs and quadratic-time relaxation for general graphs.

Relevance: 10 Novelty: 9
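
Note: the abstract's claim that a DAG admits a linear-time recurrence can be pictured with a generic sketch: visit nodes in topological order and let each node's state aggregate its parents' states. This is a toy scalar recurrence with hypothetical parameters `a` and `b` and a hypothetical `parents` input format; it is not Chimera's actual parameterization, which the abstract does not specify.

```python
from graphlib import TopologicalSorter


def dag_recurrence(parents, inputs, a=0.5, b=1.0):
    """Toy scalar state-space recurrence over a DAG.

    parents: dict mapping each node to a list of its parent nodes.
    inputs:  dict mapping each node to a scalar input x_v.
    State update (an assumption made for illustration):
        h_v = a * sum(h_u for u in parents[v]) + b * x_v
    Each edge is read once, so the cost is O(nodes + edges).
    """
    order = TopologicalSorter(parents).static_order()  # parents come first
    state = {}
    for node in order:
        incoming = sum(state[p] for p in parents.get(node, ()))
        state[node] = a * incoming + b * inputs[node]
    return state


if __name__ == "__main__":
    # A diamond-shaped DAG: s -> l, s -> r, l -> t, r -> t.
    diamond = {"s": [], "l": ["s"], "r": ["s"], "t": ["l", "r"]}
    print(dag_recurrence(diamond, {node: 1.0 for node in diamond}))
```

On a chain graph this reduces to the familiar sequential update `h_t = a * h_{t-1} + b * x_t`, which is why no position embedding is needed to encode order.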

3. On efficiently computable functions, deep networks and sparse compositionality

ArXiv ID: 2510.11942

Authors: Tomaso Poggio

Abstract: We show that efficient Turing computability at any fixed input/output precision implies the existence of compositionally sparse (bounded-fan-in, polynomial-size) DAG representations and of corresponding neural approximants achieving the target precision. Concretely: if $f:[0,1]^d\to\mathbb{R}^m$ is computable in time polynomial in the bit-depths, then for every pair of precisions $(n,m_{\mathrm{out}})$ there exists a bounded-fan-in Boolean circuit of size and depth $\mathrm{poly}(n+m_{\mathrm{out}})$ computing the discretized map; replacing each gate by a constant-size neural emulator yields a deep network of size/depth $\mathrm{poly}(n+m_{\mathrm{out}})$ that achieves accuracy $\varepsilon=2^{-m_{\mathrm{out}}}$. We also relate these constructions to compositional approximation rates \cite{MhaskarPoggio2016b,poggio_deep_shallow_2017,Poggio2017,Poggio2023HowDS} and to optimization viewed as hierarchical search over sparse structures.

Comment: Model Architecture and Representation Learning: theory linking efficient Turing computability to compositionally sparse DAGs and corresponding deep neural approximants.

Relevance: 10 Novelty: 9
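
Note: the step "replacing each gate by a constant-size neural emulator" can be made concrete with one standard construction: on inputs in {0, 1}, a single ReLU unit reproduces AND, and OR/XOR follow from it. The sketch below composes such gates into a one-bit full adder. The gate formulas and the adder are illustrative assumptions; the paper's own emulators may differ.

```python
def relu(z: float) -> float:
    return max(0.0, z)


# Each gate uses at most one ReLU unit and has fan-in 2 (exact on {0, 1}).
def and_gate(a: float, b: float) -> float:
    return relu(a + b - 1.0)


def or_gate(a: float, b: float) -> float:
    return a + b - relu(a + b - 1.0)


def xor_gate(a: float, b: float) -> float:
    return a + b - 2.0 * relu(a + b - 1.0)


def full_adder(a: float, b: float, carry_in: float):
    """One-bit full adder built as a small bounded-fan-in circuit of ReLU gates."""
    partial = xor_gate(a, b)
    total = xor_gate(partial, carry_in)
    carry_out = or_gate(and_gate(a, b), and_gate(partial, carry_in))
    return total, carry_out


if __name__ == "__main__":
    for bits in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
        print(bits, full_adder(*map(float, bits)))
```

Because every gate costs a constant number of units, a circuit with a polynomial number of gates maps to a network of polynomial size, which is the counting argument the abstract relies on.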

4. Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps

ArXiv ID: 2510.12744

Authors: Do Tien Hai, Trung Nguyen Mai, TrungTin Nguyen, Nhat Ho, Binh T. Nguyen, Christopher Drovandi

Abstract: We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE's convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $\epsilon$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training.

Comment: Directly targets Model Architecture: Mixture-of-Experts (softmax-gated) with identifiability theory, finite-sample MLE rates, and consistent expert-number selection.

Relevance: 10 Novelty: 9
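
Note: obstacle (i) in the abstract, non-identifiability of gating parameters up to common translations, follows from the shift invariance of softmax. The sketch below checks this numerically for a linear gate with logits `slopes[k] . x + intercepts[k]`. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def softmax_gates(x, slopes, intercepts):
    """Gate weights softmax_k(slopes[k] . x + intercepts[k]) for one input x."""
    logits = [
        sum(s_i * x_i for s_i, x_i in zip(slope, x)) + intercept
        for slope, intercept in zip(slopes, intercepts)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def translate(slopes, intercepts, shift_vec, shift_const):
    """Add the same vector to every slope and the same constant to every intercept."""
    new_slopes = [[s_i + t_i for s_i, t_i in zip(slope, shift_vec)] for slope in slopes]
    new_intercepts = [c + shift_const for c in intercepts]
    return new_slopes, new_intercepts


if __name__ == "__main__":
    x = [0.3, -1.2]
    slopes = [[1.0, 0.0], [0.5, -0.5], [-2.0, 1.0]]
    intercepts = [0.0, 0.1, -0.4]
    shifted = translate(slopes, intercepts, shift_vec=[4.0, -3.0], shift_const=7.0)
    print(softmax_gates(x, slopes, intercepts))
    print(softmax_gates(x, *shifted))  # same gate weights up to rounding
```

Every logit moves by the same amount, so the gate weights are unchanged and the data cannot distinguish the two parameter sets.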

5. Dr.LLM: Dynamic Layer Routing in LLMs

ArXiv ID: 2510.12773

Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh

Abstract: Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.

Comment: Model Architecture and Efficiency: adaptive-depth dynamic layer routing (skip/execute/repeat) with supervised routers for budget-aware inference.

Relevance: 10 Novelty: 8
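
Note: the skip/execute/repeat control flow described in the abstract can be sketched as a plain loop over layers. Everything here is a toy assumption: the layers are ordinary functions, the routers are hand-written rules rather than trained MLPs, and "repeat" is taken to mean running a block twice, which the abstract does not state.

```python
SKIP, EXECUTE, REPEAT = "skip", "execute", "repeat"


def routed_forward(hidden, layers, routers):
    """Apply each layer zero, one, or two times according to its router.

    layers:  list of callables hidden -> hidden (stand-ins for frozen blocks).
    routers: list of callables hidden -> one of SKIP, EXECUTE, REPEAT.
    Returns the final hidden value and the number of layer executions.
    """
    executed = 0
    for layer, router in zip(layers, routers):
        action = router(hidden)
        if action == SKIP:
            continue
        if action not in (EXECUTE, REPEAT):
            raise ValueError(f"unknown routing action: {action!r}")
        runs = 2 if action == REPEAT else 1  # assumption: repeat = run twice
        for _ in range(runs):
            hidden = layer(hidden)
            executed += 1
    return hidden, executed


if __name__ == "__main__":
    layers = [lambda h: h + 1, lambda h: h + 10, lambda h: h + 100]
    routers = [lambda h: SKIP, lambda h: EXECUTE, lambda h: REPEAT]
    print(routed_forward(0, layers, routers))
```

The base layers are never modified, which mirrors the abstract's point that routing is retrofitted onto frozen weights; the execution count is the quantity a compute budget would constrain.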

6. NOSA: Native and Offloadable Sparse Attention

ArXiv ID: 2510.13602

Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu

Abstract: Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).

Comment: Model compression and efficiency: trainable sparse attention with explicit locality enabling KV cache offloading and reduced transfers, improving decoding throughput and memory use.

Relevance: 10 Novelty: 8
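
Note: the link between selection locality and KV transfer cost in the abstract can be illustrated with a toy count. The residency model is an assumption made for this sketch (the GPU holds exactly the previous step's selected tokens), and `count_kv_transfers` is a hypothetical helper, not part of NOSA.

```python
def count_kv_transfers(selections):
    """Count KV entries fetched from CPU memory under a toy residency model.

    selections: list of sets of token indices chosen at each decoding step.
    Assumption: the GPU keeps only the previous step's selection, so just the
    newly selected tokens have to be transferred at each step.
    """
    resident = set()
    transferred = 0
    for selected in selections:
        transferred += len(selected - resident)
        resident = set(selected)
    return transferred


if __name__ == "__main__":
    # Adjacent steps share most tokens (high locality).
    local = [{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}]
    # Adjacent steps share no tokens (no locality).
    scattered = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}]
    print(count_kv_transfers(local), count_kv_transfers(scattered))
```

Both schedules attend to four tokens per step, so the attention computation is the same size; only the overlap between consecutive selections changes the transfer volume.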

7. Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory

ArXiv ID: 2510.12077

Authors: Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet

Abstract: We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases, linearly correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.

Comment: Compression/Efficiency theory: extends MDL to singular models; LLC-based complexity predicts quantization/low-rank compressibility.