Previous Day 2025-12-18
Monthly Overview 2025-12
Next Day 2025-12-22

Personalized Daily ArXiv Papers 2025-12-19

[gpt-5] Prompt Completion Total
Token 40828 38231 79059
Cost $0.05 $0.38 $0.43

Total arXiv papers: 470

Total scanned papers: 297

Total relevant papers: 31

Table of contents with paper titles:

  1. How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models Authors: Ali Ghodsi

  2. Provably Extracting the Features from a General Superposition Authors: Allen Liu

  3. In-Context Multi-Operator Learning with DeepOSets Authors: Shao-Ting Chiu, Aditya Nambiar, Ali Syed, Jonathan W. Siegel, Ulisses Braga-Neto

  4. Random matrix theory of sparse neuronal networks with heterogeneous timescales Authors: Thiparat Chotibut, Oleg Evnin, Weerawit Horinouchi

  5. DEER: Draft with Diffusion, Verify with Autoregressive Models Authors: Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

  6. AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines Authors: Dimitrios Danopoulos, Enrico Lupi, Chang Sun, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini

  7. Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models Authors: Caner Erden

  8. NRGPT: An Energy-based Alternative for GPT Authors: Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov

  9. In-Context Algebra Authors: Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau

  10. Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference Authors: Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee

  11. Time-Frequency Analysis for Neural Networks Authors: Ahmed Abdeljawad, Elena Cordero

  12. MEPIC: Memory Efficient Position Independent Caching for LLM Serving Authors: Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong, Shan Yizhou, Wei Zhang, Wang Lan, Ying Xiong, Yong Zhang, Zhenan Fan

  13. EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving Authors: Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, Jiayi Yao, Qizheng Zhang, Ganesh Ananthanarayanan, Junchen Jiang

  14. Batch Normalization-Free Fully Integer Quantized Neural Networks via Progressive Tandem Learning Authors: Pengfei Sun, Wenyu Jiang, Piew Yoong Chee, Paul Devos, Dick Botteldooren

  15. CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity Authors: Jinhao Zhang, Yunquan Zhang, Daning Chen

  16. From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts? Authors: Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

  17. SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference Authors: Jeff Smith

  18. Geometric Laplace Neural Operator Authors: Hao Tang, Jiongyu Zhu, Zimeng Feng, Hao Li, Chao Li

  19. Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers Authors: Adam Karvonen, James Chua, Cl\'ement Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks

  20. Muon is Provably Faster with Momentum Variance Reduction Authors: Xun Qian, Hussein Rammal, Dmitry Kovalev, Peter Richt\'arik

  21. On the Universal Representation Property of Spiking Neural Networks Authors: Shayan Hundrieser, Philipp Tuchel, Insung Kong, Johannes Schmidt-Hieber

  22. Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants Authors: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt

  23. KOSS: Kalman-Optimal Selective State Spaces for Long-Term Sequence Modeling Authors: Lei Wang, Xin Tan, Mingwei Wang, Ying Zhang

  24. Tiny Recursive Control: Iterative Reasoning for Efficient Optimal Control Authors: Amit Jain, Richard Linares

  25. Soft Geometric Inductive Bias for Object Centric Dynamics Authors: Hampus Linander, Conor Heins, Alexander Tschantz, Marco Perin, Christopher Buckley

  26. AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs Authors: Anshul Kumar, Gagan Raj Gupta, Manisha Chawla

  27. In-Context Semi-Supervised Learning Authors: Jiashuo Fan, Paul Rosu, Aaron T. Wang, Michael Li, Lawrence Carin, Xiang Cheng

  28. Cartesian-nj: Extending e3nn to Irreducible Cartesian Tensor Product and Contracion Authors: Zemin Xu, Chenyu Wu, Wenbo Xie, Daiqian Xie, P. Hu

  29. Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models Authors: Mikel Williams-Lekuona, Georgina Cosma

  30. How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra

  31. Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference Authors: Jian Tian, Shuailong Li, Yang Cao, Wenbo Cui, Minghan Zhu, Wenkang Wu, Jianming Zhang, Yanpeng Wang, Zhiwen Xiao, Zhenyu Hou, Dou Shen


1. How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

ArXiv ID: 2512.15115

Authors: Ali Ghodsi

Abstract: Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.

Comment: Matches Model Architecture and Representation Learning—unified attention/SSM framework with head-count and gradient propagation theory.

Relevance: 10 Novelty: 9


2. Provably Extracting the Features from a General Superposition

ArXiv ID: 2512.15987

Authors: Allen Liu

Abstract: It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function [ f(x)=\sum_{i=1}^n a_i\,\sigma_i(v_i^\top x), ] where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R} \rightarrow \mathbb{R}$ is an arbitrary response function and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e. $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results -- we allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and general response functions $\sigma_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.

Comment: Matches Representation Learning—provable recovery of features from general superposition with efficient query algorithm.

Relevance: 10 Novelty: 9


3. In-Context Multi-Operator Learning with DeepOSets

ArXiv ID: 2512.16074

Authors: Shao-Ting Chiu, Aditya Nambiar, Ali Syed, Jonathan W. Siegel, Ulisses Braga-Neto

Abstract: In-context Learning (ICL) is the remarkable capability displayed by some machine learning models to learn from examples in a prompt, without any further weight updates. ICL had originally been thought to emerge from the self-attention mechanism in autoregressive transformer architectures. DeepOSets is a non-autoregressive, non-attention based neural architecture that combines set learning via the DeepSets architecture with operator learning via Deep Operator Networks (DeepONets). In a previous study, DeepOSets was shown to display ICL capabilities in supervised learning problems. In this paper, we show that the DeepOSets architecture, with the appropriate modifications, is a multi-operator in-context learner that can recover the solution operator of a new PDE, not seen during training, from example pairs of parameter and solution placed in a user prompt, without any weight updates. Furthermore, we show that DeepOSets is a universal uniform approximator over a class of continuous operators, which we believe is the first result of its kind in the literature of scientific machine learning. This means that a single DeepOSets architecture exists that approximates in-context any continuous operator in the class to any fixed desired degree accuracy, given an appropriate number of examples in the prompt. Experiments with Poisson and reaction-diffusion forward and inverse boundary-value problems demonstrate the ability of the proposed model to use in-context examples to predict accurately the solutions corresponding to parameter queries for PDEs not seen during training.

Comment: Matches Model Architecture with a non-attention, non-autoregressive design (DeepOSets) that exhibits in-context learning and provides a universal operator-approximation theory.

Relevance: 9 Novelty: 9


4. Random matrix theory of sparse neuronal networks with heterogeneous timescales

ArXiv ID: 2512.12767

Authors: Thiparat Chotibut, Oleg Evnin, Weerawit Horinouchi

Abstract: Training recurrent neuronal networks consisting of excitatory (E) and inhibitory (I) units with additive noise for working memory computation slows and diversifies inhibitory timescales, leading to improved task performance that is attributed to emergent marginally stable equilibria [PNAS 122 (2025) e2316745122]. Yet the link between trained network characteristics and their roles in shaping desirable dynamical landscapes remains unexplored. Here, we investigate the Jacobian matrices describing the dynamics near these equilibria and show that they are sparse, non-Hermitian rectangular-block matrices modified by heterogeneous synaptic decay timescales and activation-function gains. We specify a random matrix ensemble that faithfully captures the spectra of trained Jacobian matrices, arising from the inhibitory core - excitatory periphery network motif (pruned E weights, broadly distributed I weights) observed post-training. An analytic theory of this ensemble is developed using statistical field theory methods: a Hermitized resolvent representation of the spectral density processed with a supersymmetry-based treatment in the style of Fyodorov and Mirlin. In this manner, an analytic description of the spectral edge is obtained, relating statistical parameters of the Jacobians (sparsity, weight variances, E/I ratio, and the distributions of timescales and gains) to near-critical features of the equilibria essential for robust working memory computation.

Comment: Representation Learning/Theory: random matrix analysis of sparse E/I networks’ Jacobians links sparsity, timescales, and gains to spectral edge and dynamics.

Relevance: 9 Novelty: 8


5. DEER: Draft with Diffusion, Verify with Autoregressive Models

ArXiv ID: 2512.15176

Authors: Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Abstract: Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafters can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc, will be available at https://czc726.github.io/DEER/

Comment: Efficiency/HPC: speculative decoding with diffusion drafter (parallel) and AR verifier significantly increases acceptance length and end-to-end speedup.

Relevance: 9 Novelty: 8


6. AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines

ArXiv ID: 2512.15946

Authors: Dimitrios Danopoulos, Enrico Lupi, Chang Sun, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini

Abstract: Efficient AI inference on AMD's Versal AI Engine (AIE) is challenging due to tightly coupled VLIW execution, explicit datapaths, and local memory management. Prior work focused on first-generation AIE kernel optimizations, without tackling full neural network execution across the 2D array. In this work, we present AIE4ML, the first comprehensive framework for converting AI models automatically into optimized firmware targeting the AIE-ML generation devices, also with forward compatibility for the newer AIE-MLv2 architecture. At the single-kernel level, we attain performance close to the architectural peak. At the graph and system levels, we provide a structured parallelization method that can scale across the 2D AIE-ML fabric and exploit its dedicated memory tiles to stay entirely on-chip throughout the model execution. As a demonstration, we designed a generalized and highly efficient linear-layer implementation with intrinsic support for fused bias addition and ReLU activation. Also, as our framework necessitates the generation of multi-layer implementations, our approach systematically derives deterministic, compact, and topology-optimized placements tailored to the physical 2D grid of the device through a novel graph placement and search algorithm. Finally, the framework seamlessly accepts quantized models imported from high-level tools such as hls4ml or PyTorch while preserving bit-exactness. In layer scaling benchmarks, we achieve up to 98.6% efficiency relative to the single-kernel baseline, utilizing 296 of 304 AIE tiles (97.4%) of the device with entirely on-chip data movement. With evaluations across real-world model topologies, we demonstrate that AIE4ML delivers GPU-class throughput under microsecond latency constraints, making it a practical companion for ultra-low-latency environments such as trigger systems in particle physics experiments.

Comment: High Performance Computing/Systems: end-to-end compiler maps NN graphs onto AMD AIE-ML 2D arrays with on-chip memory placement and quantization support.

Relevance: 9 Novelty: 8


7. Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models

ArXiv ID: 2512.15973

Authors: Caner Erden

Abstract: We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) through the integration of reinforcement learning and online matrix perturbation theory. While traditional low-rank approximations often rely on static rank assumptions--limiting their flexibility across diverse input contexts--our method dynamically selects ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation lies in an RL agent that formulates rank selection as a sequential policy optimization problem, where the reward function strictly balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of full decomposition during inference. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (L > 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: https://github.com/canererden/DR_RL_Project

Comment: Compression/Efficiency: adaptive low-rank factorization of MHSA via RL and online matrix perturbation to balance fidelity and latency at inference.

Relevance: 9 Novelty: 8


8. NRGPT: An Energy-based Alternative for GPT

ArXiv ID: 2512.16762

Authors: Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov

Abstract: Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although they don't necessarily lead to the best performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, doing so only during very long training.

Comment: Model Architecture/Generative modeling: reframes GPT as an energy-based model with inference as energy landscape exploration, with theory and experiments.

Relevance: 9 Novelty: 8


9. In-Context Algebra

ArXiv ID: 2512.16902

Authors: Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau

Abstract: We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.

Comment: Representation Learning/Mechanistic interpretability: isolates learned mechanisms (copying, identity recognition, cancellation) in transformers trained on variable algebra.

Relevance: 9 Novelty: 8


10. Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference

ArXiv ID: 2512.16391

Authors: Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee

Abstract: Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.

Comment: Matches Model Compression and Efficiency via training-free sparse attention (Top-k selection/reuse across layers) with substantial prefill/decode speedups.

Relevance: 9 Novelty: 8


11. Time-Frequency Analysis for Neural Networks

ArXiv ID: 2512.15992

Authors: Ahmed Abdeljawad, Elena Cordero

Abstract: We develop a quantitative approximation theory for shallow neural networks using tools from time-frequency analysis. Working in weighted modulation spaces $M^{p,q}m(\mathbf{R}^{d})$, we prove dimension-independent approximation rates in Sobolev norms $W^{n,r}(\Omega)$ for networks whose units combine standard activations with localized time-frequency windows. Our main result shows that for $f \in M^{p,q}_m(\mathbf{R}^{d})$ one can achieve [ |f - f_N|$ using weighted modulation dictionaries, and derive consequences for Feichtinger's algebra, Fourier-Lebesgue spaces, and Barron spaces. Numerical experiments in one and two dimensions confirm that modulation-based networks achieve substantially better Sobolev approximation than standard ReLU networks, consistent with the theoretical estimates.}(\Omega)} \lesssim N^{-1/2}\,|f|_{M^{p,q}_m(\mathbf{R}^{d})}, ] on bounded domains, with explicit control of all constants. We further obtain global approximation theorems on $\mathbf{R}^{d

Comment: Matches Model Architecture and Theory by introducing time-frequency-windowed units and proving dimension-independent Sobolev approximation rates for shallow nets.

Relevance: 9 Novelty: 8


12. MEPIC: Memory Efficient Position Independent Caching for LLM Serving

ArXiv ID: 2512.16822

Authors: Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong, Shan Yizhou, Wei Zhang, Wang Lan, Ying Xiong, Yong Zhang, Zhenan Fan

Abstract: Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems, repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching partially alleviates some of these costs by reusing KV cache for previously processed tokens, but limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. However, because these operations vary across queries, KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from token- to block-level so only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without any model changes.

Comment: Matches High-Performance Serving Efficiency—position-independent KV caching with paged sharing and RoPE fusion.

Relevance: 9 Novelty: 8


13. EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving

ArXiv ID: 2512.14946

Authors: Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, Jiayi Yao, Qizheng Zhang, Ganesh Ananthanarayanan, Junchen Jiang

Abstract: Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. To this end, EVICPRESS's profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.

Comment: Model Compression and Efficiency: jointly optimizes KV-cache lossy compression and multi-tier eviction to minimize latency while preserving quality for LLM serving.

Relevance: 9 Novelty: 7


14. Batch Normalization-Free Fully Integer Quantized Neural Networks via Progressive Tandem Learning

ArXiv ID: 2512.16476

Authors: Pengfei Sun, Wenyu Jiang, Piew Yoong Chee, Paul Devos, Dick Botteldooren

Abstract: Quantised neural networks (QNNs) shrink models and reduce inference energy through low-bit arithmetic, yet most still depend on a running statistics batch normalisation (BN) layer, preventing true integer-only deployment. Prior attempts remove BN by parameter folding or tailored initialisation; while helpful, they rarely recover BN's stability and accuracy and often impose bespoke constraints. We present a BN-free, fully integer QNN trained via a progressive, layer-wise distillation scheme that slots into existing low-bit pipelines. Starting from a pretrained BN-enabled teacher, we use layer-wise targets and progressive compensation to train a student that performs inference exclusively with integer arithmetic and contains no BN operations. On ImageNet with AlexNet, the BN-free model attains competitive Top-1 accuracy under aggressive quantisation. The procedure integrates directly with standard quantisation workflows, enabling end-to-end integer-only inference for resource-constrained settings such as edge and embedded devices.

Comment: Compression/Efficiency: trains BN-free fully integer quantized networks via progressive layer-wise distillation enabling integer-only inference.

Relevance: 9 Novelty: 7


15. CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity

ArXiv ID: 2512.16282

Authors: Jinhao Zhang, Yunquan Zhang, Daning Chen

Abstract: Current mainstream post-training quantization methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmic heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs including LLaMA and Qwen ,in terms of perplexity (PPL) and downstream task performance.

Comment: Matches Model Compression/Efficiency via per-layer heterogeneous PTQ selection guided by CKA (quantization).

Relevance: 9 Novelty: 7


16. From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

ArXiv ID: 2512.15134

Authors: Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

Abstract: A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.

Comment: Matches Representation Learning—evaluates SAEs/sparse probes for disentanglement and independent steering of concepts.

Relevance: 9 Novelty: 7


17. SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference

ArXiv ID: 2512.15742

Authors: Jeff Smith

Abstract: Kolmogorov-Arnold Networks (KANs) face a fundamental memory wall: their learned basis functions create parameter counts that impose extreme bandwidth demands, hindering deployment in memory-constrained environments. We show that Vision KANs exhibit a holographic topology, where information is distributed across the interference of splines rather than localized to specific edges. Consequently, traditional pruning fails (10% sparsity degrades mAP from 85.23% to 45%, a $\sim$40-point drop). To address this, we present SHARe-KAN, a framework utilizing Gain-Shape-Bias Vector Quantization to exploit functional redundancy while preserving the dense topology. Coupled with LUTHAM, a hardware-aware compiler with static memory planning, we achieve $88\times$ runtime memory reduction (1.13 GB $\to$ 12.91 MB) and match uncompressed baseline accuracy on PASCAL VOC. Profiling on NVIDIA Ampere architecture confirms $>90\%$ L2 cache residency, demonstrating that the workload is decoupled from DRAM bandwidth constraints inherent to spline-based architectures.

Comment: Compression/Efficiency: gain–shape–bias vector quantization and hardware-aware compiler reduce KAN runtime memory by 88× while preserving accuracy.

Relevance: 8 Novelty: 8


18. Geometric Laplace Neural Operator

ArXiv ID: 2512.16409

Authors: Hao Tang, Jiongyu Zhu, Zimeng Feng, Hao Li, Chao Li

Abstract: Neural operators have emerged as powerful tools for learning mappings between function spaces, enabling efficient solutions to partial differential equations across varying inputs and domains. Despite the success, existing methods often struggle with non-periodic excitations, transient responses, and signals defined on irregular or non-Euclidean geometries. To address this, we propose a generalized operator learning framework based on a pole-residue decomposition enriched with exponential basis functions, enabling expressive modeling of aperiodic and decaying dynamics. Building on this formulation, we introduce the Geometric Laplace Neural Operator (GLNO), which embeds the Laplace spectral representation into the eigen-basis of the Laplace-Beltrami operator, extending operator learning to arbitrary Riemannian manifolds without requiring periodicity or uniform grids. We further design a grid-invariant network architecture (GLNONet) that realizes GLNO in practice. Extensive experiments on PDEs/ODEs and real-world datasets demonstrate our robust performance over other state-of-the-art models.

Comment: Model Architecture: introduces a Laplace spectral neural operator on Riemannian manifolds via pole–residue decomposition and a grid-invariant implementation (GLNONet).

Relevance: 8 Novelty: 8


19. Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

ArXiv ID: 2512.15674

Authors: Adam Karvonen, James Chua, Cl\'ement Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks

Abstract: Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.

Comment: Representation Learning: trains models (Activation Oracles) to verbalize information from LLM activations across tasks, improving white-box interpretability.

Relevance: 8 Novelty: 8


20. Muon is Provably Faster with Momentum Variance Reduction

ArXiv ID: 2512.16598

Authors: Xun Qian, Hussein Rammal, Dmitry Kovalev, Peter Richt\'arik

Abstract: Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen Non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum by momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which captures Muon, Scion and other specific Non-Euclidean LMO-based methods as special cases, and at the same time works with a more general smoothness assumption which better captures the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways. All of them improve the convergence rate from ${\cal O} (\frac{1}{K^{1/4}})$ to ${\cal O} (\frac{1}{K^{1/3}})$. Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that verify the superior performance of our proposed algorithms in terms of iteration complexity.

Comment: HPC/Optimization: adds momentum variance reduction to LMO-based optimizers (Muon/Scion) within Gluon framework, with provable faster rates.

Relevance: 8 Novelty: 8


21. On the Universal Representation Property of Spiking Neural Networks

ArXiv ID: 2512.16872

Authors: Shayan Hundrieser, Philipp Tuchel, Insung Kong, Johannes Schmidt-Hieber

Abstract: Inspired by biology, spiking neural networks (SNNs) process information via discrete spikes over time, offering an energy-efficient alternative to the classical computing paradigm and classical artificial neural networks (ANNs). In this work, we analyze the representational power of SNNs by viewing them as sequence-to-sequence processors of spikes, i.e., systems that transform a stream of input spikes into a stream of output spikes. We establish the universal representation property for a natural class of spike train functions. Our results are fully quantitative, constructive, and near-optimal in the number of required weights and neurons. The analysis reveals that SNNs are particularly well-suited to represent functions with few inputs, low temporal complexity, or compositions of such functions. The latter is of particular interest, as it indicates that deep SNNs can efficiently capture composite functions via a modular design. As an application of our results, we discuss spike train classification. Overall, these results contribute to a rigorous foundation for understanding the capabilities and limitations of spike-based neuromorphic systems.

Comment: Matches Representation Learning criterion via a rigorous, quantitative universal representation property for SNNs (sequence-to-sequence spike functions) and analysis of architectural depth/composition.

Relevance: 8 Novelty: 8


22. Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

ArXiv ID: 2512.15712

Authors: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt

Abstract: Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.

Comment: Matches Representation Learning using a sparse concept bottleneck over activations and end-to-end training to predict behavior—scalable interpretability via learned sparse codes.

Relevance: 8 Novelty: 8


23. KOSS: Kalman-Optimal Selective State Spaces for Long-Term Sequence Modeling

ArXiv ID: 2512.16723

Authors: Lei Wang, Xin Tan, Mingwei Wang, Ying Zhang

Abstract: Recent selective state space models (SSMs), such as Mamba and Mamba-2, have demonstrated strong performance in sequence modeling owing to input-dependent selection mechanisms. However, these mechanisms lack theoretical grounding and cannot support context-aware selection from latent state dynamics. To address these limitations, we propose KOSS, a Kalman-optimal Selective State Space model that formulates selection as latent state uncertainty minimization. Derived from estimation theory, KOSS adopts a continuous-time latent update driven by a Kalman gain that dynamically modulates information propagation based on content and context, enabling a closed-loop, context-aware selectivity mechanism. To ensure stable computation and near-linear scalability, KOSS employs global spectral differentiation for frequency-domain derivative estimation, along with a segment-wise scan for hardware-efficient processing. On a selective copying task with distractors, KOSS achieves over 79\% accuracy while baselines drop below 20\%, demonstrating robust context-aware selection. Furthermore, across nine long-term forecasting benchmarks, KOSS reduces MSE by 2.92--36.23\% and consistently outperforms state-of-the-art models in both accuracy and stability. To assess real-world applicability, a case study on secondary surveillance radar (SSR) tracking confirms KOSS's robustness under irregular intervals and noisy conditions and demonstrates its effectiveness in real-world applications. Finally, supplementary experiments verify Kalman gain convergence and the frequency response of spectral differentiation, providing theoretical support for the proposed closed-loop design.

Comment: Matches Model Architecture—selective SSM with Kalman-optimal selection and scalable computation mechanisms.

Relevance: 8 Novelty: 8


24. Tiny Recursive Control: Iterative Reasoning for Efficient Optimal Control

ArXiv ID: 2512.16824

Authors: Amit Jain, Richard Linares

Abstract: Neural network controllers increasingly demand millions of parameters, and language model approaches push into the billions. For embedded aerospace systems with strict power and latency constraints, this scaling is prohibitive. We present Tiny Recursive Control (TRC), a neural architecture based on a counterintuitive principle: capacity can emerge from iteration depth rather than parameter count. TRC applies compact networks (approximately 1.5M parameters) repeatedly through a two-level hierarchical latent structure, refining control sequences by simulating trajectories and correcting based on tracking error. Because the same weights process every refinement step, adding iterations increases computation without increasing memory. We evaluate TRC on nonlinear control problems including oscillator stabilization and powered descent with fuel constraints. Across these domains, TRC achieves near-optimal control costs while requiring only millisecond-scale inference on GPU and under 10~MB memory, two orders of magnitude smaller than language model baselines. These results demonstrate that recursive reasoning, previously confined to discrete tasks, transfers effectively to continuous control synthesis.

Comment: Model Architecture/Efficiency: iterative weight sharing and hierarchical latent refinement trade iteration depth for parameter count, enabling compact controllers.

Relevance: 8 Novelty: 7


25. Soft Geometric Inductive Bias for Object Centric Dynamics

ArXiv ID: 2512.15493

Authors: Hampus Linander, Conor Heins, Alexander Tschantz, Marco Perin, Christopher Buckley

Abstract: Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object-centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2d rigid body dynamics with static obstacles, where we train for next-step predictions autoregressively. For long-horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non-equivariant baseline models. The approach complements recent soft-equivariance ideas and aligns with the view that simple, well-chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes.

Comment: Model Architecture/Representation: geometric algebra neural networks provide soft equivariance as an inductive bias for object-centric dynamics modeling.

Relevance: 8 Novelty: 7


26. AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs

ArXiv ID: 2512.15764

Authors: Anshul Kumar, Gagan Raj Gupta, Manisha Chawla

Abstract: Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict the training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It uses a combination of Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, and an epsilon-greedy exploration strategy. This lets the method explore different blocks in early training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.

Comment: Efficiency: adaptive gradient-guided layer/block selection for fine-tuning SLMs reduces memory/compute while matching full fine-tuning accuracy.

Relevance: 8 Novelty: 7


27. In-Context Semi-Supervised Learning

ArXiv ID: 2512.15934

Authors: Jiashuo Fan, Paul Rosu, Aaron T. Wang, Michael Li, Lawrence Carin, Xiang Cheng

Abstract: There has been significant recent interest in understanding the capacity of Transformers for in-context learning (ICL), yet most theory focuses on supervised settings with explicitly labeled pairs. In practice, Transformers often perform well even when labels are sparse or absent, suggesting crucial structure within unlabeled contextual demonstrations. We introduce and study in-context semi-supervised learning (IC-SSL), where a small set of labeled examples is accompanied by many unlabeled points, and show that Transformers can leverage the unlabeled context to learn a robust, context-dependent representation. This representation enables accurate predictions and markedly improves performance in low-label regimes, offering foundational insights into how Transformers exploit unlabeled context for representation learning within the ICL framework.

Comment: Representation Learning/ICL theory: shows Transformers exploit unlabeled context to learn context-dependent representations in semi-supervised in-context learning.

Relevance: 8 Novelty: 7


28. Cartesian-nj: Extending e3nn to Irreducible Cartesian Tensor Product and Contracion

ArXiv ID: 2512.16882

Authors: Zemin Xu, Chenyu Wu, Wenbo Xie, Daiqian Xie, P. Hu

Abstract: Equivariant atomistic machine learning models have brought substantial gains in both extrapolation capability and predictive accuracy. Depending on the basis of the space, two distinct types of irreducible representations are utilized. From architectures built upon spherical tensors (STs) to more recent formulations employing irreducible Cartesian tensors (ICTs), STs have remained dominant owing to their compactness, elegance, and theoretical completeness. Nevertheless, questions have persisted regarding whether ST constructions are the only viable design principle, motivating continued development of Cartesian networks. In this work, we introduce the Cartesian-3j and Cartesian-nj symbol, which serve as direct analogues of the Wigner-3j and Wigner-nj symbol defined for tensor coupling. These coefficients enable the combination of any two ICTs into a new ICT. Building on this foundation, we extend e3nn to support irreducible Cartesian tensor product, and we release the resulting Python package as cartnn. Within this framework, we implement Cartesian counterparts of MACE, NequIP, and Allegro, allowing the first systematic comparison of Cartesian and spherical models to assess whether Cartesian formulations may offer advantages under specific conditions. Using TACE as a representative example, we further examine whether architectures constructed from irreducible Cartesian tensor product and contraction(ICTP and ICTC) are conceptually well-founded in Cartesian space and whether opportunities remain for improving their design.

Comment: Model Architecture: extends equivariant networks to irreducible Cartesian tensor product/contraction with a library enabling Cartesian counterparts to spherical models.

Relevance: 8 Novelty: 7


29. Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

ArXiv ID: 2512.15372

Authors: Mikel Williams-Lekuona, Georgina Cosma

Abstract: Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.

Comment: Matches Model Compression and Efficiency via conditional/dynamic compute (early-exit) while preserving cross-modal embedding compatibility through dual-path training.

Relevance: 8 Novelty: 7


30. How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness

ArXiv ID: 2512.15634

Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra

Abstract: Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.

Comment: Matches Model Compression and Efficiency via systematic study of Low-Rank Adaptation (LoRA) rank trade-offs and effects on representations/generalization.

Relevance: 8 Novelty: 7


31. Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

ArXiv ID: 2512.16134

Authors: Jian Tian, Shuailong Li, Yang Cao, Wenbo Cui, Minghan Zhu, Wenkang Wu, Jianming Zhang, Yanpeng Wang, Zhiwen Xiao, Zhenyu Hou, Dou Shen

Abstract: The evolution of Large Language Model (LLM) serving towards complex, distributed architectures--specifically the P/D-separated, large-scale DP+EP paradigm--introduces distinct scheduling challenges. Unlike traditional deployments where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units for both Prefill and Decode phases. Deployed on a production H800 cluster serving Deepseek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.

Comment: Matches High Performance Computing with a scheduling mechanism (staggered batching and load-aware allocation) to co-optimize TTFT and throughput in DP+EP inference.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.