Personalized Daily ArXiv Papers 2025-12-04

[gpt-5]	Prompt	Completion	Total
Token	55685	45543	101228
Cost	$0.07	$0.46	$0.53

Total arXiv papers: 534

Total scanned papers: 318

Total relevant papers: 31

Table of contents with paper titles:

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$ Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang
Understanding and Harnessing Sparsity in Unified Multimodal Models Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Reversing Large Language Models for Efficient Training and Fine-Tuning Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber
Enforcing Orderedness to Improve Feature Consistency Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang
A note on the impossibility of conditional PAC-efficient reasoning in large language models Authors: Hao Zeng
When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling Authors: Garrett G. Wen, Hong Hu, Yue M. Lu, Zhou Fan, Theodor Misiakiewicz
Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing Authors: Roman Rausch, David Jansen, Sukhbinder Singh, Román Orús
A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention Authors: Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai
Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang
Convergence for Discrete Parameter Updates Authors: Paul Wilson, Fabio Zanasi, George Constantinides
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum
Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles Authors: Yizhou Zhang, Lun Du
Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding Authors: Duy-Tung Pham, An The Nguyen, Viet-Hoang Tran, Nhan-Phu Chung, Xin T. Tong, Tan M. Nguyen, Thieu N. Vo
Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics Authors: Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis
Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting Authors: Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen
Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction Authors: Janis Keck, Lukas Silvester Barth, Fatemeh (Hannaneh) Fahimi, Parvaneh Joharinad, Jürgen Jost
Model Recovery at the Edge under Resource Constraints for Physical AI Authors: Bin Xu, Ayan Banerjee, Sandeep K. S. Gupta
AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning Authors: Kohei Yamamoto, Kosuke Okusa
Optical Context Compression Is Just (Bad) Autoencoding Authors: Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling Authors: Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks Authors: Zhiyuan He
From monoliths to modules: Decomposing transducers for efficient world modelling Authors: Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas
Learning Network Sheaves for AI-native Semantic Communication Authors: Enrico Grimaldi, Mario Edoardo Pandolfo, Gabriele D'Acunto, Sergio Barbarossa, Paolo Di Lorenzo
Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models Authors: Xiwen Wei, Mustafa Munir, Radu Marculescu
Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization Authors: Lakshmi Jayalal, Sheetal Kalyani
Better World Models Can Lead to Better Post-Training Performance Authors: Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval Authors: Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda
Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions Authors: Hong Yang, Devroop Kar, Qi Yu, Alex Ororbia, Travis Desell

1. Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

ArXiv ID: 2512.03324

Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying

Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.

Comment: Model Compression and Efficiency: learned token retention gates for KV-cache eviction under memory budgets, improving long-context inference.

Relevance: 10 Novelty: 9
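
The eviction policy described in the abstract can be illustrated with a small sketch. It is not the TRIM-KV implementation: the exponential decay form, the `CachedToken` fields, and the `insert_with_budget` helper are assumptions made for illustration, and the learned gate (one per layer and head in the paper) is replaced by hand-picked scores.

```python
from dataclasses import dataclass


@dataclass
class CachedToken:
    position: int      # decoding step at which the token entered the cache
    base_score: float  # retention score assigned at creation (a learned gate in the paper)
    decay: float       # hypothetical per-step decay factor in (0, 1]


def retention(token: CachedToken, now: int) -> float:
    """Current retention score: the creation-time score, decayed by token age."""
    return token.base_score * token.decay ** (now - token.position)


def insert_with_budget(cache: list, token: CachedToken, budget: int) -> list:
    """Append a token, then evict the lowest-scoring tokens until the cache fits the budget."""
    cache = cache + [token]
    now = token.position
    while len(cache) > budget:
        victim = min(cache, key=lambda t: retention(t, now))
        cache.remove(victim)
    return cache


def demo() -> list:
    """Run four decoding steps with a budget of two cached tokens."""
    cache = []
    # (base_score, decay): a slowly decaying "sink"-like token, two fast-decaying
    # fillers, and a recent token. All values are made up.
    specs = [(0.9, 1.0), (0.5, 0.5), (0.6, 0.5), (0.4, 0.9)]
    for step, (score, decay) in enumerate(specs):
        cache = insert_with_budget(cache, CachedToken(step, score, decay), budget=2)
    return [t.position for t in cache]
```

In this toy run the non-decaying first token and the most recent token survive, which loosely mirrors the sink-token and sliding-window behaviour the abstract says emerges from learned scores.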

2. Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$

ArXiv ID: 2512.02901

Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang

Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.

Comment: Matches Model Compression and Efficiency: universal conversion to widely-linear complex form with phase-aware 2-bit quantization and multiplication-free accumulation for efficient LLM inference.

Relevance: 10 Novelty: 9
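
A rough sketch of quantizing complex weights to the fourth roots of unity, with repeated quantization of the residual, is given below. It only illustrates the general idea named in the abstract. The single real scale per stage, the nearest-phase assignment, and every function name are assumptions, not the Fairy2i procedure.

```python
CODEBOOK = (1 + 0j, -1 + 0j, 1j, -1j)  # fourth roots of unity: +1, -1, +i, -i


def nearest_root(w: complex) -> complex:
    """Pick the codebook entry closest in phase to w (largest real inner product)."""
    return max(CODEBOOK, key=lambda q: (w * q.conjugate()).real)


def quantize_stage(weights):
    """One stage: a code per weight plus one real scale minimizing squared error for those codes."""
    codes = [nearest_root(w) for w in weights]
    scale = sum((w * q.conjugate()).real for w, q in zip(weights, codes)) / len(weights)
    return codes, scale


def residual_quantize(weights, stages):
    """Quantize, subtract the reconstruction, then quantize what is left, `stages` times."""
    residual = list(weights)
    terms = []
    for _ in range(stages):
        codes, scale = quantize_stage(residual)
        terms.append((codes, scale))
        residual = [r - scale * q for r, q in zip(residual, codes)]
    return terms, residual


def squared_error(residual):
    """Total squared magnitude of the remaining quantization error."""
    return sum(abs(r) ** 2 for r in residual)
```

Multiplying an activation by +1, -1, +i, or -i only flips signs or swaps real and imaginary parts, so products with such codes reduce to additions and subtractions plus one scale per stage.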

3. Understanding and Harnessing Sparsity in Unified Multimodal Models

ArXiv ID: 2512.02351

Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan

Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.

Comment: Model Compression/Efficiency and MoE Architecture: training-free pruning probe of unified multimodal models and MoE adaptation enabling sparse activation in generation.

Relevance: 10 Novelty: 8

4. UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

ArXiv ID: 2512.03383

Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu

Abstract: Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.

Comment: Strong match to Compression/Efficiency: unified quantization + low-rank compression with configurable pruning and kernel-level optimizations.

Relevance: 10 Novelty: 8

5. Reversing Large Language Models for Efficient Training and Fine-Tuning

ArXiv ID: 2512.02056

Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber

Abstract: Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, LLMs are often fine-tuned to address a specific task, starting from the weights of a pre-trained foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Unlike standard baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, removing the need to store activations. This property drastically reduces memory consumption, allowing larger batch sizes to be processed with the same available memory and thereby improving throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.

Comment: HPC/memory optimization and architecture: reversible LLMs that reconstruct hidden states during backprop instead of storing activations, plus conversion of existing models.

Relevance: 10 Novelty: 8
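
The memory saving from reversibility can be seen in a generic additive-coupling block, sketched below. This is the classic reversible-residual construction, swapped in here for illustration. It is not the symmetric or symplectic scheme the paper derives, and `f` and `g` are arbitrary stand-ins for sub-layers.

```python
import math


def f(x):
    """Stand-in for one sub-block (for example attention); any deterministic function works."""
    return [math.tanh(v) for v in x]


def g(x):
    """Stand-in for a second sub-block (for example an MLP)."""
    return [0.5 * math.sin(v) for v in x]


def forward(x1, x2):
    """Additive coupling: each half of the state is updated using only the other half."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Recover the block inputs from its outputs, so the inputs need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2
```

During backpropagation a framework would call something like `inverse` layer by layer to rebuild hidden states on the fly, trading extra compute for activation memory.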

6. Enforcing Orderedness to Improve Feature Consistency

ArXiv ID: 2512.02194

Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang

Abstract: Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.

Comment: Representation Learning: sparse autoencoders with strict latent ordering resolve permutation non-identifiability in sparse dictionary learning, improving feature consistency.

Relevance: 10 Novelty: 8

7. A note on the impossibility of conditional PAC-efficient reasoning in large language models

ArXiv ID: 2512.03057

Authors: Hao Zeng

Abstract: We prove an impossibility result for conditional Probably Approximately Correct (PAC)-efficient reasoning in large language models. While recent work has established marginal PAC efficiency guarantees for composite models that switch between expensive expert models and cheaper fast models, we show that conditional (pointwise) guarantees are impossible in the distribution-free setting. Specifically, for non-atomic input spaces, any algorithm achieving conditional PAC efficiency must be trivial in the sense that it defers to the expert model with probability at least $1-\alpha$ for almost every input.

Comment: Model Architecture (conditional routing) and Efficiency theory: impossibility result for pointwise PAC-efficient defer-to-expert schemes.

Relevance: 9 Novelty: 9

8. When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling

ArXiv ID: 2512.03325

Authors: Garrett G. Wen, Hong Hu, Yue M. Lu, Zhou Fan, Theodor Misiakiewicz

Abstract: A major effort in modern high-dimensional statistics has been devoted to the analysis of linear predictors trained on nonlinear feature embeddings via empirical risk minimization (ERM). Gaussian equivalence theory (GET) has emerged as a powerful universality principle in this context: it states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates, which are more amenable to analysis. Despite its remarkable successes, numerical experiments show that this equivalence can fail even for simple embeddings -- such as polynomial maps -- under general scaling regimes. We investigate this breakdown in the setting of random feature (RF) models in the quadratic scaling regime, where both the number of features and the sample size grow quadratically with the data dimension. We show that when the target function depends on a low-dimensional projection of the data, such as generalized linear models, GET yields incorrect predictions. To capture the correct asymptotics, we introduce a Conditional Gaussian Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model. This hybrid model retains the tractability of the Gaussian framework and accurately describes RF models in the quadratic scaling regime. We derive sharp asymptotics for the training and test errors in this setting, which continue to agree with numerical simulations even when GET fails. Our analysis combines general results on CLT for Wiener chaos expansions and a careful two-phase Lindeberg swapping argument. Beyond RF models and quadratic scaling, our work hints at a rich landscape of universality phenomena in high-dimensional ERM.

Comment: Representation Learning/Theory: shows failure of Gaussian equivalence for random features at quadratic scaling; introduces Conditional Gaussian Equivalent model with sharp asymptotics.

Relevance: 9 Novelty: 9

9. Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing

ArXiv ID: 2512.03062

Authors: Roman Rausch, David Jansen, Sukhbinder Singh, Román Orús

Abstract: Large Language Models (LLMs) are very demanding in terms of their computational resources. Low-rank decompositions of LLM weights, e.g. via Singular Value Decomposition (SVD), are a promising approach for LLM compression, but present several practical hurdles, e.g. selecting appropriate layer-wise ranks and removing their parameter redundancy. In this work, we present two physics-inspired improvements to SVD LLM compression: (1) FermiGrad, a gradient-descent algorithm that determines globally optimal layer-wise ranks by relaxing the discrete singular-value truncation into a continuous optimization using the Fermi function; (2) PivGa, an additional lossless compression of the low-rank factors that exploits the intrinsic gauge freedom in their parametrization.

Comment: Compression/efficiency: SVD-based LLM compression with globally optimized rank selection (FermiGrad) and lossless gauge fixing (PivGa).

Relevance: 10 Novelty: 7
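
The relaxation mentioned in the abstract replaces a hard "keep the first r singular values" cut with a smooth Fermi-function mask. The sketch below shows only that mask. How FermiGrad parameterizes and optimizes it across layers is not stated in the abstract, so the cutoff `mu`, the `temperature`, and the helper names are assumptions.

```python
import math


def fermi(index: int, mu: float, temperature: float) -> float:
    """Fermi function 1 / (exp((index - mu) / T) + 1), evaluated without overflow."""
    z = (index - mu) / temperature
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def soft_truncate(singular_values, mu, temperature):
    """Scale each singular value (sorted in descending order) by a smooth keep-weight."""
    return [s * fermi(i, mu, temperature) for i, s in enumerate(singular_values)]


def soft_rank(n, mu, temperature):
    """Continuous surrogate for the number of singular values kept out of n."""
    return sum(fermi(i, mu, temperature) for i in range(n))
```

As the temperature goes to zero the mask approaches a hard step at `mu`, so a continuous `mu` can be tuned by gradient descent and then rounded to an integer rank.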

10. A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention

ArXiv ID: 2512.03494

Authors: Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai

Abstract: Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling; however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.

Comment: Model Compression/Efficiency: native Top-k sparse attention for both training and decoding, with analysis (entropy view) and approximation fidelity study.

Relevance: 10 Novelty: 7
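
Exact Top-k decoding, as the abstract describes it, keeps only the keys most similar to the query and attends over those. A minimal single-query, single-head sketch follows. The scaled dot-product score and the function names are standard choices assumed here, not details taken from the report.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k best-scoring keys; every other key is ignored entirely.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]
    return out, sorted(keep)
```

Setting `k` to the number of keys recovers full attention. The sort over all scores is the cost that approximate selectors, such as the indexer mentioned in the abstract, try to avoid.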

11. Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

ArXiv ID: 2512.02185

Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang

Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.

Comment: Model Compression and Efficiency: structured pruning for reasoning LLMs using decode-only gradient importance with self-generated CoT calibration and progressive regeneration aligned to decode-time behavior.

Relevance: 9 Novelty: 8

12. Convergence for Discrete Parameter Updates

ArXiv ID: 2512.04051

Authors: Paul Wilson, Fabio Zanasi, George Constantinides

Abstract: Modern deep learning models require immense computational resources, motivating research into low-precision training. Quantised training addresses this by representing training components in low-bit integers, but typically relies on discretising real-valued updates. We introduce an alternative approach where the update rule itself is discrete, avoiding the quantisation of continuous updates by design. We establish convergence guarantees for a general class of such discrete schemes, and present a multinomial update rule as a concrete example, supported by empirical evaluation. This perspective opens new avenues for efficient training, particularly for models with inherently discrete structure.

Comment: Training efficiency: discrete update rules with convergence guarantees for low-precision training—core to compression/efficiency.

Relevance: 9 Novelty: 8
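
To make "the update rule itself is discrete" concrete, here is one possible toy rule: an integer parameter moves by -1, 0, or +1, sampled so that the expected move matches a clipped gradient step. This is an illustrative assumption only. The abstract does not specify the paper's multinomial rule, and the names `step_probabilities` and `discrete_update` are hypothetical.

```python
import random


def step_probabilities(grad: float, lr: float):
    """Probabilities of moving -1, 0, +1 so the expected move is -lr * grad, clipped to [-1, 1]."""
    target = max(-1.0, min(1.0, -lr * grad))
    p_up = max(target, 0.0)
    p_down = max(-target, 0.0)
    return p_down, 1.0 - p_up - p_down, p_up


def discrete_update(param: int, grad: float, lr: float, rng: random.Random) -> int:
    """Sample an integer step; the parameter never leaves the integer grid."""
    p_down, p_stay, p_up = step_probabilities(grad, lr)
    step = rng.choices((-1, 0, 1), weights=(p_down, p_stay, p_up))[0]
    return param + step
```

No real-valued update is ever computed and then rounded: the randomness is in which grid step gets taken, which is the contrast with quantized training that the abstract draws.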

13. PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

ArXiv ID: 2512.04025

Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang

Abstract: Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA

Comment: Efficient attention: Pyramid Sparse Attention introduces multi-level pooled KV for fine-grained sparsity with hardware-friendly kernels.

Relevance: 9 Novelty: 8

14. CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

ArXiv ID: 2512.02551

Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to Nvidia's state-of-the-art closed-source libraries, i.e., cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2

Comment: High Performance Computing: RL-driven kernel synthesis/optimization for HGEMM outperforming cuBLAS/cuBLASLt, enabling faster core operations for large-scale training/inference.

Relevance: 9 Novelty: 8

15. Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles

ArXiv ID: 2512.02409

Authors: Yizhou Zhang, Lun Du

Abstract: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that static pruning induces a bounded operator and therefore cannot change the spectral tail exponent; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes time-dependent data curation, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.

Comment: Efficiency/Training Dynamics: theoretical analysis of data curation via operator spectra; shows limits of static pruning and acceleration via time-dependent reweighting.

Relevance: 9 Novelty: 8

16. Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding

ArXiv ID: 2512.03058

Authors: Duy-Tung Pham, An The Nguyen, Viet-Hoang Tran, Nhan-Phu Chung, Xin T. Tong, Tan M. Nguyen, Thieu N. Vo

Abstract: This paper investigates the dynamical properties of tokens in pre-trained Transformer models and explores their application to improving Transformers. To this end, we analyze the dynamical system governing the continuous-time limit of the pre-trained model and characterize the asymptotic behavior of its solutions. Specifically, we characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We provide sufficient conditions, based on these parameters, to identify scenarios where tokens either converge to zero or diverge to infinity. Unlike prior works, our conditions are broader in scope and more applicable to real-world models. Furthermore, we investigate how different forms of positional encoding -- specifically absolute and rotary -- affect these dynamical regimes. Empirical evidence reveals that the convergence scenario adversely impacts model performance. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding. These findings support theoretical foundations and design principles for improving Transformer models.

Comment: Model Architecture and Training Dynamics: theoretical analysis of self-attention token dynamics and positional encodings; proposes refinements to mitigate collapse.

Relevance: 9 Novelty: 8
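
Illustration: a minimal sketch of continuous-time token dynamics, assuming the commonly used form dx_i/dt = sum_j A_ij V x_j with A the row-softmax of <Q x_i, K x_j>. The paper's exact system, its conditions, and its positional-encoding variants are not reproduced here, and `attention_flow` is a hypothetical helper. Identical tokens are a degenerate case chosen so the outcome is exact: the dynamics reduce to dx/dt = V x.

```python
import numpy as np

def attention_flow(X0, Q, K, V, dt=0.01, steps=100):
    """Forward-Euler integration of dx_i/dt = sum_j A_ij V x_j.

    X0 has shape (n_tokens, d); A is the row-softmax of <Q x_i, K x_j>.
    """
    X = X0.copy()
    for _ in range(steps):
        scores = (X @ Q.T) @ (X @ K.T).T
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        A = np.exp(scores)
        A /= A.sum(axis=1, keepdims=True)
        X = X + dt * (A @ X @ V.T)
    return X

d = 4
I = np.eye(d)
X0 = np.tile(np.array([1.0, -2.0, 0.5, 3.0]), (5, 1))  # five identical tokens

shrunk = attention_flow(X0, I, I, -I)  # V = -I: tokens contract toward zero
grown = attention_flow(X0, I, I, I)    # V = +I: token norms grow
print(np.linalg.norm(shrunk[0]), np.linalg.norm(grown[0]))
```

With V = -I each Euler step scales the tokens by (1 - dt), and with V = +I by (1 + dt), which gives toy instances of the convergent and divergent regimes.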

17. Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics

ArXiv ID: 2512.04006

Authors: Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis

Abstract: Cross-entropy (CE) training loss dominates deep learning practice, yet existing theory often relies on simplifications, either replacing it with squared loss or restricting to convex models, that miss essential behavior. CE and squared loss generate fundamentally different dynamics, and convex linear models cannot capture the complexities of non-convex optimization. We provide an in-depth characterization of multi-class CE optimization dynamics beyond the convex regime by analyzing a canonical two-layer linear neural network with standard-basis vectors as inputs: the simplest non-convex extension for which the implicit bias remained unknown. This model coincides with the unconstrained features model used to study neural collapse, making our work the first to prove that gradient flow on CE converges to the neural collapse geometry. We construct an explicit Lyapunov function that establishes global convergence, despite the presence of spurious critical points in the non-convex landscape. A key insight underlying our analysis is an inconspicuous finding: Hadamard Initialization diagonalizes the softmax operator, freezing the singular vectors of the weight matrices and reducing the dynamics entirely to their singular values. This technique opens a pathway for analyzing CE training dynamics well beyond the specific setting considered here.

Comment: Matches Representation Learning: provides a theoretical characterization of cross-entropy training dynamics and neural collapse; Hadamard initialization diagonalizes softmax to make dynamics tractable.

Relevance: 9 Novelty: 8

18. Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting

ArXiv ID: 2512.02061

Authors: Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen

Abstract: Multivariate time series forecasting is widely used in domains such as industry, transportation, and finance. However, the dominant frequencies in time series may shift with the evolving spectral distribution of the data. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in a frequency coverage imbalance. Specifically, too few experts can lead to the overlooking of critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data's frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of experts. Additionally, to prevent noise introduction from direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency domain features, further optimizing the feature representation. The experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.

Comment: Strong match to Model Architecture: Mixture-of-Experts with adaptive expert count driven by frequency-domain cues.

Relevance: 9 Novelty: 7
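
Illustration: a minimal sketch of Gaussian band-pass decomposition in the frequency domain, the smooth alternative to hard band truncation that the abstract mentions. It is not Ada-MoGE itself: the fixed band centers, the shared width `sigma`, and the helper `gaussian_band_decompose` are assumptions for illustration, and the adaptive choice of the number of experts is not reproduced.

```python
import numpy as np

def gaussian_band_decompose(x, centers, sigma):
    """Split a 1-D series into bands using Gaussian masks on its real FFT.

    centers and sigma are given in cycles per sample.
    """
    n = len(x)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n)  # 0 ... 0.5 cycles per sample
    bands = []
    for c in centers:
        # Smooth Gaussian weighting instead of a hard cut-off at band edges.
        mask = np.exp(-0.5 * ((freqs - c) / sigma) ** 2)
        bands.append(np.fft.irfft(spec * mask, n=n))
    return np.stack(bands)

n = 256
t = np.arange(n)
low = np.sin(2 * np.pi * 8 * t / n)    # 8 cycles over the window
high = np.sin(2 * np.pi * 64 * t / n)  # 64 cycles over the window
bands = gaussian_band_decompose(low + high, centers=[8 / n, 64 / n], sigma=4 / n)
print(bands.shape)  # (2, 256)
```

Because the two test frequencies are many widths apart, each band recovers its own sinusoid almost exactly; overlapping bands would share energy smoothly rather than splitting it at a sharp edge.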

19. Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction

ArXiv ID: 2512.03899

Authors: Janis Keck, Lukas Silvester Barth, Fatemeh (Hannaneh) Fahimi, Parvaneh Joharinad, J\"urgen Jost

Abstract: Fuzzy simplicial sets have become an object of interest in dimensionality reduction and manifold learning, most prominently through their role in UMAP. However, their definition through tools from algebraic topology without a clear probabilistic interpretation detaches them from commonly used theoretical frameworks in those areas. In this work we introduce a framework that explains fuzzy simplicial sets as marginals of probability measures on simplicial sets. In particular, this perspective shows that the fuzzy weights of UMAP arise from a generative model that samples Vietoris-Rips filtrations at random scales, yielding cumulative distribution functions of pairwise distances. More generally, the framework connects fuzzy simplicial sets to probabilistic models on the face poset, clarifies the relation between Kullback-Leibler divergence and fuzzy cross-entropy in this setting, and recovers standard t-norms and t-conorms via Boolean operations on the underlying simplicial sets. We then show how new embedding methods may be derived from this framework and illustrate this on an example where we generalize UMAP using \v{C}ech filtrations with triplet sampling. In summary, this probabilistic viewpoint provides a unified theoretical foundation for fuzzy simplicial sets, clarifies the role of UMAP within this framework, and enables the systematic derivation of new dimensionality reduction methods.

Comment: Representation Learning theory: probabilistic foundations for fuzzy simplicial sets (UMAP), linking to generative models and enabling new embedding methods.

Relevance: 8 Novelty: 8

20. Model Recovery at the Edge under Resource Constraints for Physical AI

ArXiv ID: 2512.02283

Authors: Bin Xu, Ayan Banerjee, Sandeep K. S. Gupta

Abstract: Model Recovery (MR) enables safe, explainable decision making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is hindered by the iterative nature of neural ordinary differential equations (NODEs), which are inefficient on FPGAs. Memory and energy consumption are the main concerns when applying MR on edge devices for real-time operation. We propose MERINDA, a novel FPGA-accelerated MR framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy at fixed accuracy, highlighting MERINDA's suitability for resource-constrained, real-time MCAS.

Comment: HPC/Efficiency: NODE-equivalent parallelizable neural architecture enabling memory- and energy-efficient FPGA deployment for model recovery.