Personalized Daily ArXiv Papers 2025-12-23

[gpt-5]	Prompt	Completion	Total
Token	57773	53182	110955
Cost	$0.07	$0.53	$0.60

Total arXiv papers: 761

Total scanned papers: 466

Total relevant papers: 35

Table of contents with paper titles:

KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction Authors: Aomufei Yuan, Zhiming Wang, Ruijie Miao, Dayu Wang, Yuxuan Tian, Zihan Wang, Yebo Peng, Yuhan Wu, Bairen Yi, Xin Liu, Tong Yang
Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap Authors: Zijun Wang, Yijiahao Qi, Hanqiu Chen, Zishen Wan, Gongjin Sun, Dongyang Li, Shuyi Pei, Cong Hao
Sprecher Networks: A Parameter-Efficient Kolmogorov-Arnold Architecture Authors: Christian Hägg, Kathlén Kohn, Giovanni Luca Marchetti, Boris Shapiro
Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing Authors: Wentao Liu, Yuhao Hu, Ruiting Zhou, Baochun Li, Ne Wang
Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA Authors: Allison Li, Kristjan Greenewald, Thomas Parnell, Navid Azizan
On the Convergence Rate of LoRA Gradient Descent Authors: Siqiao Mu, Diego Klabjan
CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs Authors: Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee
From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers Authors: Ryotaro Kawata, Yujin Song, Alberto Bietti, Naoki Nishikawa, Taiji Suzuki, Samuel Vaiter, Denny Wu
MoE Pathfinder: Trajectory-driven Expert Pruning Authors: Xican Yang, Yuanhe Tian, Yan Song
When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics Authors: Yizhou Zhang
The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation Authors: Huiqi Deng, Qihan Ren, Zhuofan Chen, Zhenyuan Cui, Wen Shen, Peng Zhang, Hongbin Pei, Quanshi Zhang
Approximation and learning with compositional tensor trains Authors: Martin Eigel, Charles Miranda, Anthony Nouy, David Sommer
On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension Reduction Authors: Shuntuo Xu, Zhou Yu, Jian Huang
Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs Authors: Rupanshu Soi, Rohan Yadav, Fredrik Kjolstad, Alex Aiken, Maryam Mehri Dehnavi, Michael Garland, Michael Bauer
Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling Authors: Sutashu Tomonaga, Kenji Doya, Noboru Murata
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning Authors: Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen
LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer Authors: Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala, Matyas Bohacek
SAP: Syntactic Attention Pruning for Transformer-based Language Models Authors: Tzu-Yun Lee, Ding-Yong Hong, Jan-Jan Wu
Towards Minimal Fine-Tuning of VLMs Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models Authors: Zhiquan Tan, Yinrong Hong
Symplectic Reservoir Representation of Legendre Dynamics Authors: Robert Simon Fong, Gouhei Tanaka, Kazuyuki Aihara
An Inverse Scattering Inspired Fourier Neural Operator for Time-Dependent PDE Learning Authors: Rixin Yu
Research Program: Theory of Learning in Dynamical Systems Authors: Elad Hazan, Shai Shalev Shwartz, Nathan Srebro
When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models Authors: Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma, Kevin Zhu
IPCV: Information-Preserving Compression for MLLM Visual Encoders Authors: Yuan Chen, Zichen Wen, Yuzhou Wu, Xuyang Liu, Shuang Chen, Junpeng Ma, Weijia Li, Conghui He, Linfeng Zhang
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies Authors: Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu
Phase-space entropy at acquisition reflects downstream learnability Authors: Xiu-Cheng Wang, Jun-Jie Zhanga, Nan Cheng, Long-Gang Pang, Taijiao Du, Deyu Meng
KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning Authors: Eric Zimmermann, Harley Wiltzer, Justin Szeto, David Alvarez-Melis, Lester Mackey
Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability Authors: Ge Yan, Tuomas Oikarinen, Tsui-Wei (Lily) Weng
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez
Large Language Models as Discounted Bayesian Filters Authors: Jensen Zhang, Jing Yang, Keze Wang
Binary Kernel Logistic Regression: a sparsity-inducing formulation and a convergent decomposition training algorithm Authors: Antonio Consolo, Andrea Manno, Edoardo Amaldi
On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning Authors: Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
A Logical View of GNN-Style Computation and the Role of Activation Functions Authors: Pablo Barceló, Floris Geerts, Matthias Lanzinger, Klara Pakhomenko, Jan Van den Bussche
The Best of Both Worlds: Hybridizing Neural Operators and Solvers for Stable Long-Horizon Inference Authors: Rajyasri Roy, Dibyajyoti Nayak, Somdatta Goswami

1. KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

ArXiv ID: 2512.17917

Authors: Aomufei Yuan, Zhiming Wang, Ruijie Miao, Dayu Wang, Yuxuan Tian, Zihan Wang, Yebo Peng, Yuhan Wu, Bairen Yi, Xin Liu, Tong Yang

Abstract: As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging "less important" tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model's information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of KV Cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy (~2% accuracy loss) using merely 25% of KV Cache budget.

Comment: Model Compression and Efficiency: reversible KV-cache compression via sketch-based token reconstruction enabling large context with small memory budget.

Relevance: 10 Novelty: 8
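
Illustration for entry 1 (KVReviver): the abstract does not say which sketch KVReviver uses, so the following is only a generic Count-Sketch-style example of how evicted vectors can be folded into a small structure and approximately recovered later. The function names, the random hash tables, and the median estimator are assumptions for illustration, not the paper's method.

```python
import numpy as np

def make_hashes(num_tokens, depth, width, seed=0):
    """Random bucket and sign tables, one row per sketch row (illustrative)."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, width, size=(depth, num_tokens))
    signs = rng.choice([-1.0, 1.0], size=(depth, num_tokens))
    return buckets, signs

def compress(vectors, token_ids, buckets, signs, width):
    """Fold the selected token vectors into a (depth, width, dim) sketch."""
    depth = buckets.shape[0]
    sketch = np.zeros((depth, width, vectors.shape[1]))
    for t in token_ids:
        for d in range(depth):
            sketch[d, buckets[d, t]] += signs[d, t] * vectors[t]
    return sketch

def reconstruct(sketch, t, buckets, signs):
    """Estimate token t's vector as the per-coordinate median across rows."""
    depth = sketch.shape[0]
    estimates = np.stack(
        [signs[d, t] * sketch[d, buckets[d, t]] for d in range(depth)]
    )
    return np.median(estimates, axis=0)

# Toy usage: 8 cached key vectors of dimension 4, three of them "evicted".
rng = np.random.default_rng(1)
keys = rng.normal(size=(8, 4))
buckets, signs = make_hashes(num_tokens=8, depth=3, width=16)
sketch = compress(keys, [1, 3, 5], buckets, signs, width=16)
approx = reconstruct(sketch, 3, buckets, signs)
print("max abs reconstruction error:", np.abs(approx - keys[3]).max())
```

Reconstruction is exact when a token shares no bucket with another compressed token and degrades as collisions increase, which is the usual memory/accuracy trade-off of sketch structures.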

2. Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap

ArXiv ID: 2512.18126

Authors: Zijun Wang, Yijiahao Qi, Hanqiu Chen, Zishen Wan, Gongjin Sun, Dongyang Li, Shuyi Pei, Cong Hao

Abstract: Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization, which jointly inflate serving latency. We present a serving design that targets these bottlenecks through an algorithm-system co-design. First, we replace dense agent interaction graphs with a hierarchical tree topology that induces structured sparsity in inter-agent communication. Second, we introduce a runtime adaptive mechanism that selectively terminates or skips downstream agent invocations using semantic agreement and confidence signals from intermediate outputs. Third, we pipeline agent execution by overlapping incremental prefilling with decoding across dependency-related agents, improving utilization and reducing inference latency. Across representative tasks, this approach substantially reduces end-to-end latency (up to 90%) while maintaining comparable accuracy (within $\pm$1%) relative to dense-connectivity MoA baselines, and can improve accuracy in certain settings.

Comment: Model Architecture (MoE/MoA) and Efficiency: tree-structured routing, adaptive pruning, and prefill–decode overlap for low-latency serving.

Relevance: 10 Novelty: 8

3. Sprecher Networks: A Parameter-Efficient Kolmogorov-Arnold Architecture

ArXiv ID: 2512.19367

Authors: Christian Hägg, Kathlén Kohn, Giovanni Luca Marchetti, Boris Shapiro

Abstract: We present Sprecher Networks (SNs), a family of trainable neural architectures inspired by the classical Kolmogorov-Arnold-Sprecher (KAS) construction for approximating multivariate continuous functions. Distinct from Multi-Layer Perceptrons (MLPs) with fixed node activations and Kolmogorov-Arnold Networks (KANs) featuring learnable edge activations, SNs utilize shared, learnable splines (monotonic and general) within structured blocks incorporating explicit shift parameters and mixing weights. Our approach directly realizes Sprecher's specific 1965 sum of shifted splines formula in its single-layer variant and extends it to deeper, multi-layer compositions. We further enhance the architecture with optional lateral mixing connections that enable intra-block communication between output dimensions, providing a parameter-efficient alternative to full attention mechanisms. Beyond parameter efficiency with $O(LN + LG)$ scaling (where $G$ is the knot count of the shared splines) versus MLPs' $O(LN^2)$, SNs admit a sequential evaluation strategy that reduces peak forward-intermediate memory from $O(N^2)$ to $O(N)$ (treating batch size as constant), making much wider architectures feasible under memory constraints. We demonstrate empirically that composing these blocks into deep networks leads to highly parameter and memory-efficient models, discuss theoretical motivations, and compare SNs with related architectures (MLPs, KANs, and networks with learnable node activations).

Comment: Model Architecture and Efficiency: KAS-inspired Sprecher Networks with shared learnable splines and O(LN+LG) scaling; reduced memory via sequential eval.

Relevance: 10 Novelty: 8
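
Illustration for entry 3 (Sprecher Networks): a back-of-the-envelope comparison of the two scaling laws quoted in the abstract, $O(LN^2)$ for MLPs versus $O(LN + LG)$ for SNs. Only the leading-order terms are counted and all constant factors are set to 1, which is an assumption; the paper's exact parameter counts will differ.

```python
def mlp_params(num_layers, width):
    # Leading-order count: one dense width x width matrix per layer.
    return num_layers * width * width

def sn_params(num_layers, width, knots):
    # Leading-order count suggested by O(LN + LG): per-layer vectors of
    # size N plus shared splines with G knots. Constants are set to 1.
    return num_layers * width + num_layers * knots

L, N, G = 4, 1024, 64
print(mlp_params(L, N))    # 4194304
print(sn_params(L, N, G))  # 4352
```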

4. Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing

ArXiv ID: 2512.18674

Authors: Wentao Liu, Yuhao Hu, Ruiting Zhou, Baochun Li, Ne Wang

Abstract: Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1) a Similar Prompts Searching (SPS) algorithm to predict expert activation patterns based on semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm to ensure service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it across multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold start latency by 47% compared to state-of-the-art baselines.

Comment: High Performance Computing/Systems for MoE: heterogeneous CPU/GPU expert placement, serverless offloading, and optimization for cost/latency.

Relevance: 10 Novelty: 8
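
Illustration for entry 4 (Remoe): the abstract names the Longest Processing Time (LPT) algorithm as one ingredient of its optimization framework. The following is the classical LPT greedy heuristic for balancing jobs across workers; how Remoe applies it to memory and replica planning is not described in the abstract, so the job times and worker count are made-up values.

```python
import heapq

def lpt_schedule(job_times, num_workers):
    """Greedy LPT: place each job, longest first, on the least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for job in sorted(job_times, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(job)
        heapq.heappush(heap, (load + job, w))
    makespan = max(sum(jobs) for jobs in assignment.values())
    return assignment, makespan

assignment, makespan = lpt_schedule([7, 5, 4, 3, 2], num_workers=2)
print(assignment)  # {0: [7, 3], 1: [5, 4, 2]}
print(makespan)    # 11
```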

5. Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA

ArXiv ID: 2512.17910

Authors: Allison Li, Kristjan Greenewald, Thomas Parnell, Navid Azizan

Abstract: Modern large language model (LLM) systems increasingly rely on multi-turn pipelines that are composed of multiple task-specific adapters, yet existing serving frameworks remain inefficient, incurring substantial recomputation overhead when switching between adapters. We present the first LLM serving engine that supports cross-model prefix cache reuse between base and adapted models via Activated LoRA (aLoRA), enabling efficient and fine-grained adapter switching during inference. Our design extends the vLLM framework by introducing base-aligned block hashing and activation-aware masking within the model execution path, permitting cache reuse across models while preserving compatibility with existing serving engine optimizations. Integrated into a production-grade inference stack, this approach supports dynamic adapter activation without excessive key-value tensor recomputation. Evaluation across representative multi-turn, multi-adapter pipelines demonstrates up to 58x end-to-end latency reduction and over 100x time-to-first-token improvement relative to standard LoRA baselines, with benefits that scale with model size and sequence length and manifest across all stages of the request lifecycle. This work bridges parameter-efficient model adaptation with high-performance serving, providing the first complete realization of cross-model KV-cache reuse in modern LLM inference engines.

Comment: Matches HPC/Efficiency: cross-model KV-cache reuse and Activated LoRA enable efficient multi-adapter LLM serving.

Relevance: 10 Novelty: 8

6. On the Convergence Rate of LoRA Gradient Descent

ArXiv ID: 2512.18248

Authors: Siqiao Mu, Diego Klabjan

Abstract: The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two "adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the original LoRA gradient descent algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the "Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate $O(\frac{1}{\log T})$, where $T$ is the number of iterations.

Comment: Matches Compression/Efficiency Theory: non-asymptotic convergence analysis for LoRA (low-rank adaptation) gradient descent.

Relevance: 10 Novelty: 8
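
Illustration for entry 6 (LoRA convergence): step (i) of the abstract refers to the outer product of the stacked adapter matrices. One plausible reading, assumed here, is that stacking $B$ and $A^\top$ into a single matrix $Z$ makes the LoRA update $BA$ appear as an off-diagonal block of $ZZ^\top$. The shapes and variable names below are illustrative; the paper's exact reparametrization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
B = rng.normal(size=(m, r))   # "up" adapter
A = rng.normal(size=(r, n))   # "down" adapter

Z = np.vstack([B, A.T])       # stacked adapters, shape (m + n, r)
outer = Z @ Z.T               # shape (m + n, m + n)

# Block structure: [[B B^T, B A], [A^T B^T, A^T A]].
update = outer[:m, m:]        # top-right block
assert np.allclose(update, B @ A)
```

Working with $ZZ^\top$ turns the bilinear dependence on $(A, B)$ into a quadratic dependence on a single variable $Z$.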

7. CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

ArXiv ID: 2512.17970

Authors: Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee

Abstract: Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.

Comment: Matches Compression/Efficiency and HPC: efficient GEMM kernel for codebook quantized LLMs eliminating dequantization via precomputed partial sums.

Relevance: 10 Novelty: 8
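
Illustration for entry 7 (CodeGEMM): a NumPy sketch of the idea described in the abstract, replacing dequantization with precomputed centroid-activation inner products that are gathered by code index. It assumes a single shared codebook and a single activation vector. The variable names (`psum`, `codes`, `codebook`) and sizes are hypothetical, and nothing here reflects the actual GPU kernel design.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, group, num_codes = 4, 8, 2, 16
num_groups = in_dim // group

codebook = rng.normal(size=(num_codes, group))                  # centroids
codes = rng.integers(0, num_codes, size=(out_dim, num_groups))  # one index per weight group
x = rng.normal(size=in_dim)                                     # activation vector

# Baseline: rebuild the dense weights from centroids, then multiply.
W = codebook[codes].reshape(out_dim, in_dim)
y_dequant = W @ x

# Codebook-centric: compute every centroid . activation-slice product once...
psum = x.reshape(num_groups, group) @ codebook.T                # (num_groups, num_codes)
# ...then gather the partial sums by code index and accumulate per output row.
y_gather = psum[np.arange(num_groups), codes].sum(axis=1)

assert np.allclose(y_dequant, y_gather)
```

The table has `num_groups * num_codes` entries regardless of `out_dim`, so the cost of building it is amortized over all output rows.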

8. From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

ArXiv ID: 2512.18634

Authors: Ryotaro Kawata, Yujin Song, Alberto Bietti, Naoki Nishikawa, Taiji Suzuki, Samuel Vaiter, Denny Wu

Abstract: Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work, we study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Focusing on a minimal trigger-output prediction task (copying the token immediately following a special trigger upon its second occurrence), we present a rigorous analysis of gradient-based training of a single-layer transformer. In both the infinite and finite sample regimes, we prove a transition in the learned mechanism: if input sequences exhibit sufficient diversity, measured by a low "max-sum" ratio of trigger-to-trigger distances, the trained model implements an induction head and generalizes to unseen contexts; by contrast, when this ratio is large, the model resorts to a positional shortcut and fails to generalize out-of-distribution (OOD). We also reveal a trade-off between the pretraining context length and OOD generalization, and derive the optimal pretraining distribution that minimizes computational cost per sample. Finally, we validate our theoretical predictions with controlled synthetic experiments, demonstrating that broadening context distributions robustly induces induction heads and enables OOD generalization. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.

Comment: Representation Learning/Training Dynamics in Transformers: theoretical analysis of shortcut vs induction head selection driven by data diversity.

Relevance: 9 Novelty: 9
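
Illustration for entry 8 (shortcut vs. induction head): a minimal generator for the trigger-output task as the abstract describes it, where the target is the token that followed the first occurrence of a special trigger and the query position is the trigger's second occurrence. The trigger id, the uniform sampling of positions, and placing the second trigger at the end of the sequence are assumptions; the paper's data distribution is more specific.

```python
import random

TRIGGER = 0  # hypothetical special token id

def make_example(seq_len, vocab_size, rng):
    """Return (tokens, label): sequence ends at the second trigger,
    label is the token right after the first trigger."""
    # Ordinary tokens are drawn from 1..vocab_size-1 so they never equal TRIGGER.
    tokens = [rng.randrange(1, vocab_size) for _ in range(seq_len)]
    first = rng.randrange(0, seq_len - 2)  # leaves room for the output token
    tokens[first] = TRIGGER
    tokens[-1] = TRIGGER                   # second occurrence, at the end
    label = tokens[first + 1]
    return tokens, label

rng = random.Random(0)
tokens, label = make_example(seq_len=10, vocab_size=20, rng=rng)
print(tokens, "->", label)
```

Varying where `first` falls changes the trigger-to-trigger distance, which is the quantity whose diversity the abstract ties to induction-head formation.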

9. MoE Pathfinder: Trajectory-driven Expert Pruning

ArXiv ID: 2512.18425

Authors: Xican Yang, Yuanhe Tian, Yan Song

Abstract: Mixture-of-experts (MoE) architectures used in large language models (LLMs) achieve state-of-the-art performance across diverse tasks yet face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has thus emerged as a promising solution to reduce computational overhead and simplify the deployment of MoE models. However, existing expert pruning approaches conventionally rely on local importance metrics and often apply uniform layer-wise pruning, leveraging only partial evaluation signals and overlooking the heterogeneous contributions of experts across layers. To address these limitations, we propose an expert pruning approach based on the trajectory of activated experts across layers, which treats MoE as a weighted computation graph and casts expert selection as a global optimal path planning problem. Within this framework, we integrate complementary importance signals from reconstruction error, routing probabilities, and activation strength at the trajectory level, which naturally yields non-uniform expert retention across layers. Experiments show that our approach achieves superior pruning performance on nearly all tasks compared with most existing approaches.

Comment: Model Architecture + Compression: MoE expert pruning via global trajectory/path planning using multi-signal importance, yielding non-uniform layerwise retention.

Relevance: 10 Novelty: 7

10. When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics

ArXiv ID: 2512.18209

Authors: Yizhou Zhang

Abstract: Empirical power-law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution-Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse-grained dynamical description of training. Within GRSD, power-law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process. In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse-grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log-shift invariance of renormalized shell couplings. We further show that power-law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power-law form.

Comment: Matches Representation Learning/Training Dynamics: provides theoretical conditions for power-law spectral dynamics via GRSD renormalization.

Relevance: 9 Novelty: 8

11. The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation

ArXiv ID: 2512.18607

Authors: Huiqi Deng, Qihan Ren, Zhuofan Chen, Zhenyuan Cui, Wen Shen, Peng Zhang, Hongbin Pei, Quanshi Zhang

Abstract: Understanding what kinds of cooperative structures deep neural networks (DNNs) can represent remains a fundamental yet insufficiently understood problem. In this work, we treat interactions as the fundamental units of such structure and investigate a largely unexplored question: how DNNs encode interactions under different levels of contextual complexity, and how these microscopic interaction patterns shape macroscopic representation capacity. To quantify this complexity, we use multi-order interactions [57], where each order reflects the amount of contextual information required to evaluate the joint interaction utility of a variable pair. This formulation enables a stratified analysis of cooperative patterns learned by DNNs. Building on this formulation, we develop a comprehensive study of interaction structure in DNNs. (i) We empirically discover a universal interaction bottleneck: across architectures and tasks, DNNs easily learn low-order and high-order interactions but consistently under-represent mid-order ones. (ii) We theoretically explain this bottleneck by proving that mid-order interactions incur the highest contextual variability, yielding large gradient variance and making them intrinsically difficult to learn. (iii) We further modulate the bottleneck by introducing losses that steer models toward emphasizing interactions of selected orders. Finally, we connect microscopic interaction structures with macroscopic representational behavior: low-order-emphasized models exhibit stronger generalization and robustness, whereas high-order-emphasized models demonstrate greater structural modeling and fitting capability. Together, these results uncover an inherent representational bias in modern DNNs and establish interaction order as a powerful lens for interpreting and guiding deep representations.

Comment: Matches Representation Learning/Training Dynamics: discovers and explains an interaction-order bottleneck and provides modulation losses.

Relevance: 9 Novelty: 8
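
Illustration for entry 11 (interaction bottleneck): the abstract does not spell out the definition behind its multi-order interactions. The sketch below assumes the form commonly used in this line of work: for a variable pair $(i, j)$ and order $m$, average $\Delta f(i, j, S) = f(S \cup \{i, j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S)$ over all contexts $S$ of size $m$ drawn from the remaining variables. The toy set functions are hypothetical and stand in for a network evaluated on masked inputs.

```python
from itertools import combinations

def multi_order_interaction(f, n, i, j, m):
    """Average of delta_f(i, j, S) over all contexts S of size m
    chosen from the n - 2 variables other than i and j."""
    others = [k for k in range(n) if k not in (i, j)]
    total, count = 0.0, 0
    for context in combinations(others, m):
        S = frozenset(context)
        delta = f(S | {i, j}) - f(S | {i}) - f(S | {j}) + f(S)
        total += delta
        count += 1
    return total / count

def pair_and(S):
    # Pays off only when variables 0 and 1 are both present.
    return 1.0 if {0, 1} <= S else 0.0

def additive(S):
    # No cooperation between variables: the value is just the set size.
    return float(len(S))

for m in range(3):  # with n = 4 there are 2 other variables, so m is 0, 1, or 2
    print(m, multi_order_interaction(pair_and, 4, 0, 1, m),
          multi_order_interaction(additive, 4, 0, 1, m))
```

The pure pairwise function yields an interaction of 1 at every order, and the additive function yields 0, matching the intuition that the measure isolates joint effects under contexts of a given size.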

12. Approximation and learning with compositional tensor trains

ArXiv ID: 2512.18059

Authors: Martin Eigel, Charles Miranda, Anthony Nouy, David Sommer

Abstract: We introduce compositional tensor trains (CTTs) for the approximation of multivariate functions, a class of models obtained by composing low-rank functions in the tensor-train format. This format can encode standard approximation tools, such as (sparse) polynomials, deep neural networks (DNNs) with fixed width, or tensor networks with arbitrary permutation of the inputs, or more general affine coordinate transformations, with similar complexities. This format can be viewed as a DNN with width exponential in the input dimension and structured weight matrices. Compared to DNNs, this format enables controlled compression at the layer level using efficient tensor algebra. On the optimization side, we derive a layerwise algorithm inspired by natural gradient descent, which makes it possible to exploit efficient low-rank tensor algebra. This relies on low-rank estimations of Gram matrices and tensor-structured random sketching. Viewing the format as a discrete dynamical system, we also derive an optimization algorithm inspired by numerical methods in optimal control. Numerical experiments on regression tasks demonstrate the expressivity of the new format and the relevance of the proposed optimization algorithms. Overall, CTTs combine the expressivity of compositional models with the algorithmic efficiency of tensor algebra, offering a scalable alternative to standard deep neural networks.

Comment: Matches Model Architecture and Compression/Efficiency: compositional tensor-train networks enable low-rank structured layers with tensor-algebra-based optimization and controllable layer-wise compression.

Relevance: 9 Novelty: 8

13. On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension Reduction

ArXiv ID: 2512.18971

Authors: Shuntuo Xu, Zhou Yu, Jian Huang

Abstract: Identifying low-dimensional sufficient structures in nonlinear sufficient dimension reduction (SDR) has long been a fundamental yet challenging problem. Most existing methods lack theoretical guarantees of exhaustiveness in identifying lower dimensional structures, either at the population level or at the sample level. We tackle this issue by proposing a new method, generative sufficient dimension reduction (GenSDR), which leverages modern generative models. We show that GenSDR is able to fully recover the information contained in the central $\sigma$-field at both the population and sample levels. In particular, at the sample level, we establish a consistency property for the GenSDR estimator from the perspective of conditional distributions, capitalizing on the distributional learning capabilities of deep generative models. Moreover, by incorporating an ensemble technique, we extend GenSDR to accommodate scenarios with non-Euclidean responses, thereby substantially broadening its applicability. Extensive numerical results demonstrate the outstanding empirical performance of GenSDR and highlight its strong potential for addressing a wide range of complex, real-world tasks.

Comment: Matches Representation Learning: generative sufficient dimension reduction with population/sample-level exhaustiveness guarantees for recovering central sigma-field.

Relevance: 9 Novelty: 8

14. Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

ArXiv ID: 2512.18134

Authors: Rupanshu Soi, Rohan Yadav, Fredrik Kjolstad, Alex Aiken, Maryam Mehri Dehnavi, Michael Garland, Michael Bauer

Abstract: GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.

Comment: High Performance Computing: joint optimization of software pipelining and warp specialization via constraint solving, delivering provably optimal GPU schedules (e.g., for Flash Attention).

Relevance: 9 Novelty: 8

15. Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling

ArXiv ID: 2512.18965

Authors: Sutashu Tomonaga, Kenji Doya, Noboru Murata

Abstract: Structured State Space Models (SSMs), which are at the heart of the recently popular Mamba architecture, are powerful tools for sequence modeling. However, their theoretical foundation relies on a complex, multi-stage process of continuous-time modeling and subsequent discretization, which can obscure intuition. We introduce a direct, first-principles framework for constructing discrete-time SSMs that is both flexible and modular. Our approach is based on a novel lag operator, which geometrically derives the discrete-time recurrence by measuring how the system's basis functions "slide" and change from one timestep to the next. The resulting state matrices are computed via a single inner product involving this operator, offering a modular design space for creating novel SSMs by flexibly combining different basis functions and time-warping schemes. To validate our approach, we demonstrate that a specific instance exactly recovers the recurrence of the influential HiPPO model. Numerical simulations confirm our derivation, providing new theoretical tools for designing flexible and robust sequence models.

Comment: Model Architecture: first-principles discrete-time SSM construction via a lag operator; connects to HiPPO and offers modular design space for sequence models.

Relevance: 9 Novelty: 7

16. MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning

ArXiv ID: 2512.19206

Authors: Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen

Abstract: Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization for value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.

Comment: Model Compression and Efficiency: query-aware mixed-precision KV cache quantization for long-context reasoning.

Relevance: 9 Novelty: 7
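
Illustration (not from the paper): the MixKVQ abstract mentions per-token quantization of the value cache. The sketch below shows one common form of that idea, asymmetric uniform quantization with a separate scale and offset for each token row. The function names, the 4-bit setting, and the tensor shapes are assumptions chosen for the example; the authors' query-aware selection of key channels is not modeled here.

```python
import numpy as np

def quantize_per_token(values, n_bits=4):
    """Quantize each token row with its own scale and offset (hypothetical helper)."""
    levels = 2 ** n_bits - 1
    v_min = values.min(axis=-1, keepdims=True)
    v_max = values.max(axis=-1, keepdims=True)
    # Guard against a zero range when a row is constant.
    scale = np.maximum(v_max - v_min, 1e-8) / levels
    codes = np.clip(np.round((values - v_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, v_min

def dequantize_per_token(codes, scale, v_min):
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(np.float32) * scale + v_min

rng = np.random.default_rng(0)
value_cache = rng.standard_normal((6, 64)).astype(np.float32)  # assumed (tokens, head_dim)
codes, scale, v_min = quantize_per_token(value_cache, n_bits=4)
restored = dequantize_per_token(codes, scale, v_min)
print("max abs error:", float(np.abs(restored - value_cache).max()))
```

Because each row carries its own range, a token with unusually large values does not widen the quantization step of the other tokens; the rounding error per element stays within half of that row's step size.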

17. LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer

ArXiv ID: 2512.18930

Authors: Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala, Matyas Bohacek

Abstract: Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining to brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.

Comment: Representation Learning: uses Sparse Autoencoders to learn disentangled stylistic concepts enabling interpretable, controllable steering.

Relevance: 9 Novelty: 7

18. SAP: Syntactic Attention Pruning for Transformer-based Language Models

ArXiv ID: 2512.19125

Authors: Tzu-Yun Lee, Ding-Yong Hong, Jan-Jan Wu

Abstract: This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads with a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.

Comment: Matches Model Compression/Efficiency: prunes Transformer attention heads using syntax-informed criteria.

Relevance: 9 Novelty: 7

19. Towards Minimal Fine-Tuning of VLMs

ArXiv ID: 2512.19219

Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee

Abstract: We introduce Image-LoRA, a lightweight parameter-efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.

Comment: Matches Compression/Efficiency: Image-LoRA restricts low-rank adaptation to visual-token spans and selects influential heads to minimize trainable parameters/FLOPs.

Relevance: 9 Novelty: 7
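
Illustration (not from the paper): the Image-LoRA abstract describes applying a low-rank update only to the value path and only for tokens in the visual span. A minimal single-head sketch of that restriction follows. All names, shapes, and the position of the visual span are assumptions made for the example, and the head selection and selection-size normalization steps from the abstract are left out.

```python
import numpy as np

def value_projection_visual_lora(x, w_v, lora_a, lora_b, visual_mask, scaling=1.0):
    """Frozen value projection plus a low-rank update on visual-token rows only."""
    out = x @ w_v  # frozen base projection, shape (seq_len, d_model)
    # The adapter matmuls run only over the visual rows, so their cost
    # scales with the number of visual tokens rather than the full sequence.
    out[visual_mask] += scaling * (x[visual_mask] @ lora_a) @ lora_b
    return out

rng = np.random.default_rng(0)
seq_len, d_model, rank = 8, 16, 2
x = rng.standard_normal((seq_len, d_model))
w_v = rng.standard_normal((d_model, d_model))
lora_a = rng.standard_normal((d_model, rank))   # hypothetical adapter factors
lora_b = rng.standard_normal((rank, d_model))
visual_mask = np.zeros(seq_len, dtype=bool)
visual_mask[2:6] = True                         # assumed visual-token span

adapted = value_projection_visual_lora(x, w_v, lora_a, lora_b, visual_mask)
base = x @ w_v
print("text rows unchanged:", np.allclose(adapted[~visual_mask], base[~visual_mask]))
```

Text-token rows pass through the frozen projection untouched, which is consistent with the abstract's point that the adapter cost tracks the visual-token fraction.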

20. A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models

ArXiv ID: 2512.18730

Authors: Zhiquan Tan, Yinrong Hong

Abstract: Large language models (LLMs) trained via KL-regularized reinforcement learning demonstrate strong instruction following, self-correction, and reasoning abilities. Yet their theoretical underpinnings remain limited. We exploit the closed-form energy-based model (EBM) structure of the optimal KL-regularized policy to provide a unified variational analysis of LLMs. For instruction-tuned models, under natural assumptions on reward potentials and pretraining symmetry, we prove that the transition kernel satisfies detailed balance with respect to a scalar potential encoding response quality. This yields monotonic KL convergence to a high-quality stationary distribution, bounded hitting times to superior states, and exponential mixing governed by the spectral gap. For reasoning models trained with verifiable rewards (RLVR), we show the objective is equivalent to expected KL minimization toward an optimal reasoning distribution, with the suboptimality gap reducing to the Bernoulli KL between target and current accuracies along the natural gradient flow. This helps explain empirical entropy-accuracy trade-offs.

Comment: Representation Learning/Training Dynamics: provides a unified EBM-based theoretical analysis of RL-tuned LMs (instruction/RLVR).

Relevance: 8 Novelty: 8
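
Illustration (not from the paper): the closed-form structure that the abstract refers to is the standard solution of the KL-regularized objective. The notation below (reference policy \pi_{\mathrm{ref}}, reward r, temperature \beta, partition function Z) is assumed for this note and may differ from the authors' symbols. The last line gives the generic definition of the Bernoulli KL; which accuracy plays the role of p and which of q is not specified in the abstract.

```latex
% KL-regularized objective for a prompt x, with temperature \beta > 0
\[
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]

% Its maximizer is an energy-based reweighting of the reference policy
\[
\pi^{*}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\!\big( r(x, y') / \beta \big)
\]

% KL divergence between two Bernoulli distributions with success probabilities p and q
\[
\mathrm{KL}\big( \mathrm{Ber}(p) \,\|\, \mathrm{Ber}(q) \big)
  = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}
\]
```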

21. Symplectic Reservoir Representation of Legendre Dynamics

ArXiv ID: 2512.19409

Authors: Robert Simon Fong, Gouhei Tanaka, Kazuyuki Aihara

Abstract: Modern learning systems act on internal representations of data, yet how these representations encode underlying physical or statistical structure is often left implicit. In physics, conservation laws of Hamiltonian systems such as symplecticity guarantee long-term stability, and recent work has begun to hard-wire such constraints into learning models at the loss or output level. Here we ask a different question: what would it mean for the representation itself to obey a symplectic conservation law in the sense of Hamiltonian mechanics? We express this symplectic constraint through Legendre duality: the pairing between primal and dual parameters, which becomes the structure that the representation must preserve. We formalize Legendre dynamics as stochastic processes whose trajectories remain on Legendre graphs, so that the evolving primal-dual parameters stay Legendre dual. We show that this class includes linear time-invariant Gaussian process regression and Ornstein-Uhlenbeck dynamics. Geometrically, we prove that the maps that preserve all Legendre graphs are exactly symplectomorphisms of cotangent bundles of the form "cotangent lift of a base diffeomorphism followed by an exact fibre translation". Dynamically, this characterization leads to the design of a Symplectic Reservoir (SR), a reservoir-computing architecture that is a special case of recurrent neural network and whose recurrent core is generated by Hamiltonian systems that are at most linear in the momentum. Our main theorem shows that every SR update has this normal form and therefore transports Legendre graphs to Legendre graphs, preserving Legendre duality at each time step. Overall, SR implements a geometrically constrained, Legendre-preserving representation map, injecting symplectic geometry and Hamiltonian mechanics directly at the representational level.

Comment: Matches Representation Learning and Model Architecture: symplectic reservoir computing preserves Legendre duality via Hamiltonian dynamics, imposing geometric structure on representations.

Relevance: 8 Novelty: 8

22. An Inverse Scattering Inspired Fourier Neural Operator for Time-Dependent PDE Learning

ArXiv ID: 2512.19439

Authors: Rixin Yu

Abstract: Learning accurate and stable time-advancement operators for nonlinear partial differential equations (PDEs) remains challenging, particularly for chaotic, stiff, and long-horizon dynamical systems. While neural operator methods such as the Fourier Neural Operator (FNO) and Koopman-inspired extensions achieve good short-term accuracy, their long-term stability is often limited by unconstrained latent representations and cumulative rollout errors. In this work, we introduce an inverse scattering inspired Fourier Neural Operator (IS-FNO), motivated by the reversibility and spectral evolution structure underlying the classical inverse scattering transform. The proposed architecture enforces a near-reversible pairing between lifting and projection maps through an explicitly invertible neural transformation, and models latent temporal evolution using exponential Fourier layers that naturally encode linear and nonlinear spectral dynamics. We systematically evaluate IS-FNO against baseline FNO and Koopman-based models on a range of benchmark PDEs, including the Michelson-Sivashinsky and Kuramoto-Sivashinsky equations (in one and two dimensions), as well as the integrable Korteweg-de Vries and Kadomtsev-Petviashvili equations. The results demonstrate that IS-FNO achieves lower short-term errors and substantially improved long-horizon stability in non-stiff regimes. For integrable systems, reduced IS-FNO variants that embed analytical scattering structure retain competitive long-term accuracy despite limited model capacity. Overall, this work shows that incorporating physical structure -- particularly reversibility and spectral evolution -- into neural operator design significantly enhances robustness and long-term predictive fidelity for nonlinear PDE dynamics.

Comment: Matches Model Architecture: inverse-scattering-inspired Fourier Neural Operator with invertible lifting and exponential Fourier evolution improves long-horizon stability.

Relevance: 8 Novelty: 8

23. Research Program: Theory of Learning in Dynamical Systems

ArXiv ID: 2512.19410

Authors: Elad Hazan, Shai Shalev Shwartz, Nathan Srebro

Abstract: Modern learning systems increasingly interact with data that evolve over time and depend on hidden internal state. We ask a basic question: when is such a dynamical system learnable from observations alone? This paper proposes a research program for understanding learnability in dynamical systems through the lens of next-token prediction. We argue that learnability in dynamical systems should be studied as a finite-sample question, and be based on the properties of the underlying dynamics rather than the statistical properties of the resulting sequence. To this end, we give a formulation of learnability for stochastic processes induced by dynamical systems, focusing on guarantees that hold uniformly at every time step after a finite burn-in period. This leads to a notion of dynamic learnability which captures how the structure of a system, such as stability, mixing, observability, and spectral properties, governs the number of observations required before reliable prediction becomes possible. We illustrate the framework in the case of linear dynamical systems, showing that accurate prediction can be achieved after finite observation without system identification, by leveraging improper methods based on spectral filtering. We survey the relationship between learning in dynamical systems and classical PAC, online, and universal prediction theories, and suggest directions for studying nonlinear and controlled systems.

Comment: Matches Representation Learning/Training Dynamics: research program and finite-sample learnability framework for dynamical systems via spectral filtering.

Relevance: 8 Novelty: 8

24. When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

ArXiv ID: 2512.18934

Authors: Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma, Kevin Zhu

Abstract: Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), we observe a striking inversion on subsequent tasks: quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16's performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention - increasing NLU retention after Math training from 45% to 65% across all precision levels - with INT8 consistently achieving the optimal balance between learning plasticity and knowledge retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models. These findings challenge the conventional wisdom that higher precision is always preferable, suggesting instead that INT8 quantization offers both computational efficiency and superior continual learning dynamics. Our results provide practical guidelines for deploying compressed models in continual learning scenarios: small replay buffers (1-2%) suffice for NLU tasks, while Math and Code benefit from moderate buffers (5-10%), with quantized models requiring less replay than FP16 to achieve comparable retention. Code is available at https://github.com/Festyve/LessIsMore.

Comment: Model Compression and Efficiency: systematic study of INT8/INT4 quantization effects on LLM continual learning dynamics, highlighting regularization benefits.