Personalized Daily ArXiv Papers 2025-12-04
| [gpt-5] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 55685 | 45543 | 101228 |
| Cost | $0.07 | $0.46 | $0.53 |
Total arXiv papers: 534
Total scanned papers: 318
Total relevant papers: 31
Table of contents with paper titles:
-
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
-
Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in ${\pm 1, \pm i}$ Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang
-
Understanding and Harnessing Sparsity in Unified Multimodal Models Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan
-
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
-
Reversing Large Language Models for Efficient Training and Fine-Tuning Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber
-
Enforcing Orderedness to Improve Feature Consistency Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang
-
A note on the impossibility of conditional PAC-efficient reasoning in large language models Authors: Hao Zeng
-
When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling Authors: Garrett G. Wen, Hong Hu, Yue M. Lu, Zhou Fan, Theodor Misiakiewicz
-
Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing Authors: Roman Rausch, David Jansen, Sukhbinder Singh, Rom\'an Or\'us
-
A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention Authors: Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai
-
Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang
-
Convergence for Discrete Parameter Updates Authors: Paul Wilson, Fabio Zanasi, George Constantinides
-
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
-
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum
-
Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles Authors: Yizhou Zhang, Lun Du
-
Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding Authors: Duy-Tung Pham, An The Nguyen, Viet-Hoang Tran, Nhan-Phu Chung, Xin T. Tong, Tan M. Nguyen, Thieu N. Vo
-
Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics Authors: Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis
-
Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting Authors: Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen
-
Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction Authors: Janis Keck (Hannaneh), Lukas Silvester Barth (Hannaneh), Fatemeh (Hannaneh), Fahimi, Parvaneh Joharinad, J\"urgen Jost
-
Model Recovery at the Edge under Resource Constraints for Physical AI Authors: Bin Xu, Ayan Banerjee, Sandeep K. S. Gupta
-
AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning Authors: Kohei Yamamoto, Kosuke Okusa
-
Optical Context Compression Is Just (Bad) Autoencoding Authors: Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
-
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling Authors: Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
-
Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks Authors: Zhiyuan He
-
From monoliths to modules: Decomposing transducers for efficient world modelling Authors: Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas
-
Learning Network Sheaves for AI-native Semantic Communication Authors: Enrico Grimaldi, Mario Edoardo Pandolfo, Gabriele D'Acunto, Sergio Barbarossa, Paolo Di Lorenzo
-
Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models Authors: Xiwen Wei, Mustafa Munir, Radu Marculescu
-
Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization Authors: Lakshmi Jayalal, Sheetal Kalyani
-
Better World Models Can Lead to Better Post-Training Performance Authors: Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
-
Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval Authors: Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda
-
Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions Authors: Hong Yang, Devroop Kar, Qi Yu, Alex Ororbia, Travis Desell
1. Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
ArXiv ID: 2512.03324
Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
Comment: Model Compression and Efficiency: learned token retention gates for KV-cache eviction under memory budgets, improving long-context inference.
Relevance: 10 Novelty: 9
2. Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in ${\pm 1, \pm i}$
ArXiv ID: 2512.02901
Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang
Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
Comment: Matches Model Compression and Efficiency: universal conversion to widely-linear complex form with phase-aware 2-bit quantization and multiplication-free accumulation for efficient LLM inference.
Relevance: 10 Novelty: 9
3. Understanding and Harnessing Sparsity in Unified Multimodal Models
ArXiv ID: 2512.02351
Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan
Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at \href{https://github.com/Shwai-He/SparseUnifiedModel}{this link}.
Comment: Model Compression/Efficiency and MoE Architecture: training-free pruning probe of unified multimodal models and MoE adaptation enabling sparse activation in generation.
Relevance: 10 Novelty: 8
4. UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
ArXiv ID: 2512.03383
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Abstract: Deploying large language model (LLM) models on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
Comment: Strong match to Compression/Efficiency: unified quantization + low-rank compression with configurable pruning and kernel-level optimizations.
Relevance: 10 Novelty: 8
5. Reversing Large Language Models for Efficient Training and Fine-Tuning
ArXiv ID: 2512.02056
Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber
Abstract: Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Different from standard, baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, relieving the need to store activations. This property allows for a drastic reduction in memory consumption, allowing for the processing of larger batch sizes for the same available memory, thereby offering improved throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.
Comment: HPC/memory optimization and architecture: reversible LLMs enabling activation recomputation-free backprop and conversion of existing models.
Relevance: 10 Novelty: 8
6. Enforcing Orderedness to Improve Feature Consistency
ArXiv ID: 2512.02194
Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang
Abstract: Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.
Comment: Representation Learning: sparse autoencoders with strict latent ordering resolve permutation non-identifiability in sparse dictionary learning, improving feature consistency.
Relevance: 10 Novelty: 8
7. A note on the impossibility of conditional PAC-efficient reasoning in large language models
ArXiv ID: 2512.03057
Authors: Hao Zeng
Abstract: We prove an impossibility result for conditional Probably Approximately Correct (PAC)-efficient reasoning in large language models. While recent work has established marginal PAC efficiency guarantees for composite models that switch between expensive expert models and cheaper fast models, we show that conditional (pointwise) guarantees are impossible in the distribution-free setting. Specifically, for non-atomic input spaces, any algorithm achieving conditional PAC efficiency must be trivial in the sense that it defers to the expert model with probability at least $1-\alpha$ for almost every input.
Comment: Model Architecture (conditional routing) and Efficiency theory: impossibility result for pointwise PAC-efficient defer-to-expert schemes.
Relevance: 9 Novelty: 9
8. When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling
ArXiv ID: 2512.03325
Authors: Garrett G. Wen, Hong Hu, Yue M. Lu, Zhou Fan, Theodor Misiakiewicz
Abstract: A major effort in modern high-dimensional statistics has been devoted to the analysis of linear predictors trained on nonlinear feature embeddings via empirical risk minimization (ERM). Gaussian equivalence theory (GET) has emerged as a powerful universality principle in this context: it states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates, which are more amenable to analysis. Despite its remarkable successes, numerical experiments show that this equivalence can fail even for simple embeddings -- such as polynomial maps -- under general scaling regimes. We investigate this breakdown in the setting of random feature (RF) models in the quadratic scaling regime, where both the number of features and the sample size grow quadratically with the data dimension. We show that when the target function depends on a low-dimensional projection of the data, such as generalized linear models, GET yields incorrect predictions. To capture the correct asymptotics, we introduce a Conditional Gaussian Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model. This hybrid model retains the tractability of the Gaussian framework and accurately describes RF models in the quadratic scaling regime. We derive sharp asymptotics for the training and test errors in this setting, which continue to agree with numerical simulations even when GET fails. Our analysis combines general results on CLT for Wiener chaos expansions and a careful two-phase Lindeberg swapping argument. Beyond RF models and quadratic scaling, our work hints at a rich landscape of universality phenomena in high-dimensional ERM.
Comment: Representation Learning/Theory: shows failure of Gaussian equivalence for random features at quadratic scaling; introduces Conditional Gaussian Equivalent model with sharp asymptotics.
Relevance: 9 Novelty: 9
9. Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing
ArXiv ID: 2512.03062
Authors: Roman Rausch, David Jansen, Sukhbinder Singh, Rom\'an Or\'us
Abstract: Large Language Models (LLMs) are very demanding in terms of their computational resources. Low-rank decompositions of LLM weights, e.g. via Singular Value Decomposition (SVD), is a promising approach for LLM compression, but presents several practical hurdles, e.g. selecting appropriate layer-wise ranks and getting rid of its parameter redundancy. In this work, we present two physics-inspired improvements to SVD LLM compression: (1) \textbf{FermiGrad}, a gradient-descent algorithm that determines globally optimal layer-wise ranks by relaxing the discrete singular-value truncation into a continuous optimization using the Fermi function; (2) \textbf{PivGa}, an additional \textit{lossless} compression of the low-rank factors that exploits the intrinsic gauge freedom in their parametrization.
Comment: Compression/efficiency: SVD-based LLM compression with globally optimized rank selection (FermiGrad) and lossless gauge fixing (PivGa).
Relevance: 10 Novelty: 7
10. A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention
ArXiv ID: 2512.03494
Authors: Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai
Abstract: Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling, however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
Comment: Model Compression/Efficiency: native Top-k sparse attention for both training and decoding, with analysis (entropy view) and approximation fidelity study.
Relevance: 10 Novelty: 7
11. Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
ArXiv ID: 2512.02185
Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang
Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
Comment: Model Compression and Efficiency: structured pruning for reasoning LLMs using decode-only gradient importance with self-generated CoT calibration and progressive regeneration aligned to decode-time behavior.
Relevance: 9 Novelty: 8
12. Convergence for Discrete Parameter Updates
ArXiv ID: 2512.04051
Authors: Paul Wilson, Fabio Zanasi, George Constantinides
Abstract: Modern deep learning models require immense computational resources, motivating research into low-precision training. Quantised training addresses this by representing training components in low-bit integers, but typically relies on discretising real-valued updates. We introduce an alternative approach where the update rule itself is discrete, avoiding the quantisation of continuous updates by design. We establish convergence guarantees for a general class of such discrete schemes, and present a multinomial update rule as a concrete example, supported by empirical evaluation. This perspective opens new avenues for efficient training, particularly for models with inherently discrete structure.
Comment: Training efficiency: discrete update rules with convergence guarantees for low-precision training—core to compression/efficiency.
Relevance: 9 Novelty: 8
13. PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
ArXiv ID: 2512.04025
Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
Abstract: Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
Comment: Efficient attention: Pyramid Sparse Attention introduces multi-level pooled KV for fine-grained sparsity with hardware-friendly kernels.
Relevance: 9 Novelty: 8
14. CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
ArXiv ID: 2512.02551
Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum
Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used {\it torch.matmul} to state-of-the-art Nvidia's closed-source libraries, i.e., {\it cuBLAS}, {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0\% over {\it torch.matmul} on average; +19.2\% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8\% over {\it cuBLASLt-heuristic}, which queries {\it cuBLASLt} library and selects the algorithm based on the heuristic's suggestion; and +11.4\% over the most competitive {\it cuBLASLt-AutoTuning} model, which selects the fastest algorithm from up to 100 candidates from {\it cuBLASLt}'s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7\%, +26.0\%, +22.4\%, and +15.9\% for {\it torch.matmul}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning} respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
Comment: High Performance Computing: RL-driven kernel synthesis/optimization for HGEMM outperforming cuBLAS/cuBLASLt, enabling faster core operations for large-scale training/inference.
Relevance: 9 Novelty: 8
15. Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles
ArXiv ID: 2512.02409
Authors: Yizhou Zhang, Lun Du
Abstract: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that \textbf{static pruning induces a bounded operator and therefore cannot change the spectral tail exponent}; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes \textbf{time-dependent data curation}, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.
Comment: Efficiency/Training Dynamics: theoretical analysis of data curation via operator spectra; shows limits of static pruning and acceleration via time-dependent reweighting.
Relevance: 9 Novelty: 8
16. Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding
ArXiv ID: 2512.03058
Authors: Duy-Tung Pham, An The Nguyen, Viet-Hoang Tran, Nhan-Phu Chung, Xin T. Tong, Tan M. Nguyen, Thieu N. Vo
Abstract: This paper investigates the dynamical properties of tokens in pre-trained Transformer models and explores their application to improving Transformers. To this end, we analyze the dynamical system governing the continuous-time limit of the pre-trained model and characterize the asymptotic behavior of its solutions. Specifically, we characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We provide sufficient conditions, based on these parameters, to identify scenarios where tokens either converge to zero or diverge to infinity. Unlike prior works, our conditions are broader in scope and more applicable to real-world models. Furthermore, we investigate how different forms of positional encoding -- specifically absolute and rotary -- affect these dynamical regimes. Empirical evidence reveals that the convergence scenario adversely impacts model performance. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding. These findings support theoretical foundations and design principles for improving Transformer models.
Comment: Model Architecture and Training Dynamics: theoretical analysis of self-attention token dynamics and positional encodings; proposes refinements to mitigate collapse.
Relevance: 9 Novelty: 8
17. Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics
ArXiv ID: 2512.04006
Authors: Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis
Abstract: Cross-entropy (CE) training loss dominates deep learning practice, yet existing theory often relies on simplifications, either replacing it with squared loss or restricting to convex models, that miss essential behavior. CE and squared loss generate fundamentally different dynamics, and convex linear models cannot capture the complexities of non-convex optimization. We provide an in-depth characterization of multi-class CE optimization dynamics beyond the convex regime by analyzing a canonical two-layer linear neural network with standard-basis vectors as inputs: the simplest non-convex extension for which the implicit bias remained unknown. This model coincides with the unconstrained features model used to study neural collapse, making our work the first to prove that gradient flow on CE converges to the neural collapse geometry. We construct an explicit Lyapunov function that establishes global convergence, despite the presence of spurious critical points in the non-convex landscape. A key insight underlying our analysis is an inconspicuous finding: Hadamard Initialization diagonalizes the softmax operator, freezing the singular vectors of the weight matrices and reducing the dynamics entirely to their singular values. This technique opens a pathway for analyzing CE training dynamics well beyond our specific setting considered here.
Comment: Matches Representation Learning: provides a theoretical characterization of cross-entropy training dynamics and neural collapse; Hadamard initialization diagonalizes softmax to make dynamics tractable.
Relevance: 9 Novelty: 8
18. Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting
ArXiv ID: 2512.02061
Authors: Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen
Abstract: Multivariate time series forecasts are widely used, such as industrial, transportation and financial forecasts. However, the dominant frequencies in time series may shift with the evolving spectral distribution of the data. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in frequency coverage imbalance issue. Specifically, too few experts can lead to the overlooking of critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data's frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of experts. Additionally, to prevent noise introduction from direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency domain features, further optimizing the feature representation. The experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.
Comment: Strong match to Model Architecture: Mixture-of-Experts with adaptive expert count driven by frequency-domain cues.
Relevance: 9 Novelty: 7
19. Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction
ArXiv ID: 2512.03899
Authors: Janis Keck (Hannaneh), Lukas Silvester Barth (Hannaneh), Fatemeh (Hannaneh), Fahimi, Parvaneh Joharinad, J\"urgen Jost
Abstract: Fuzzy simplicial sets have become an object of interest in dimensionality reduction and manifold learning, most prominently through their role in UMAP. However, their definition through tools from algebraic topology without a clear probabilistic interpretation detaches them from commonly used theoretical frameworks in those areas. In this work we introduce a framework that explains fuzzy simplicial sets as marginals of probability measures on simplicial sets. In particular, this perspective shows that the fuzzy weights of UMAP arise from a generative model that samples Vietoris-Rips filtrations at random scales, yielding cumulative distribution functions of pairwise distances. More generally, the framework connects fuzzy simplicial sets to probabilistic models on the face poset, clarifies the relation between Kullback-Leibler divergence and fuzzy cross-entropy in this setting, and recovers standard t-norms and t-conorms via Boolean operations on the underlying simplicial sets. We then show how new embedding methods may be derived from this framework and illustrate this on an example where we generalize UMAP using \v{C}ech filtrations with triplet sampling. In summary, this probabilistic viewpoint provides a unified probabilistic theoretical foundation for fuzzy simplicial sets, clarifies the role of UMAP within this framework, and enables the systematic derivation of new dimensionality reduction methods.
Comment: Representation Learning theory: probabilistic foundations for fuzzy simplicial sets (UMAP), linking to generative models and enabling new embedding methods.
Relevance: 8 Novelty: 8
20. Model Recovery at the Edge under Resource Constraints for Physical AI
ArXiv ID: 2512.02283
Authors: Bin Xu, Ayan Banerjee, Sandeep K. S. Gupta
Abstract: Model Recovery (MR) enables safe, explainable decision making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is hindered by the iterative nature of neural ordinary differential equations (NODEs), which are inefficient on FPGAs. Memory and energy consumption are the main concerns when applying MR on edge devices for real-time operation. We propose MERINDA, a novel FPGA-accelerated MR framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy at fixed accuracy, highlighting MERINDA's suitability for resource-constrained, real-time MCAS.
Comment: HPC/Efficiency: NODE-equivalent parallelizable neural architecture enabling memory- and energy-efficient FPGA deployment for model recovery.
Relevance: 8 Novelty: 7
21. AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning
ArXiv ID: 2512.03637
Authors: Kohei Yamamoto, Kosuke Okusa
Abstract: Transformer-based audio SSL (self-supervised learning) models often treat spectrograms as images, applying convolutional patchification with heavy temporal downsampling. This lowers the effective Nyquist frequency and introduces aliasing, while na\"ive low-pass filtering removes task-relevant high-frequency cues. In this study, we present Aliasing-aware Patch Embedding (AaPE), a drop-in patch stem that mitigates aliasing while preserving high-frequency information. AaPE augments standard patch tokens with features produced by a band-limited complex sinusoidal kernel using a two-sided exponential window that dynamically targets alias-prone bands. Frequency and decay parameters of the kernel are estimated from the input, enabling parallel, adaptive subband analysis whose outputs are fused with the standard patch tokens. AaPE integrates seamlessly into the masked teacher-student self-supervised learning. In addition, we combine a multi-mask strategy with a contrastive objective to enforce consistency across diverse mask patterns, stabilizing training. Pre-training on AudioSet followed by fine-tuning evaluation across diverse downstream benchmarks, which spanned categories, such as environmental sounds and other common audio domains. This approach yields state-of-the-art performance on a subset of tasks and competitive results across the remainder. Complementary linear probing evaluation mirrors this pattern, yielding clear gains on several benchmarks and strong performance elsewhere. The collective analysis of these results indicates that AaPE serves to mitigate the effects of aliasing without discarding of informative high-frequency content.
Comment: Model Architecture and Representation Learning: aliasing-aware patch embedding for audio SSL that preserves high-frequency cues while mitigating downsampling aliasing.
Relevance: 8 Novelty: 7
22. Optical Context Compression Is Just (Bad) Autoencoding
ArXiv ID: 2512.03643
Authors: Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
Abstract: DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
Comment: Matches Compression/Efficiency: critical analysis of context compression representations for LMs with alternative encoders.
Relevance: 8 Novelty: 7
23. VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
ArXiv ID: 2512.02902
Authors: Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
Abstract: Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
Comment: Matches Efficiency/Architecture: minimal Feature Token Modulation and low-rank updates (FLA) for robust VLA adaptation.
Relevance: 8 Novelty: 7
24. Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks
ArXiv ID: 2512.02677
Authors: Zhiyuan He
Abstract: Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems, those requiring the resolution of nested hierarchical structures. While prior research has extensively studied length generalization (a model's ability to handle longer sequences than seen during training), we investigate a distinct and underexplored limitation: depth generalization. Here, depth refers to the number of nested levels in a hierarchical problem, such as the layers of parentheses in a mathematical expression or the nesting of logical clauses in a Boolean formula. Our work reveals that standard transformer architectures struggle with problems involving deeper recursion than encountered during training, even when they perform well on longer but non-nested sequences. This limitation stems from their inability to maintain stack-like behavior, the capacity to track and resolve multiple levels of nested dependencies. Through systematic analysis, we demonstrate how this architectural constraint leads to rapid performance decay as the depth of the recursion increases. To address this challenge, we develop a novel looped locate-and-replace pipeline that decomposes recursive problems into manageable subcomponents. The approach employs two specialized models: a locator that identifies solvable subexpressions and a replacer that evaluates these components while preserving the overall structure. We evaluated this method in three carefully designed domains: Boolean algebra, recursive arithmetic, and propositional logic, each with a controllable depth of recursion. We show that our method effectively alleviates the performance decay when tested on out-of-distribution recursion depth.
Comment: Matches Representation Learning/Training Dynamics: analysis of depth generalization limits in Transformers with a decomposition pipeline.
Relevance: 8 Novelty: 7
25. From monoliths to modules: Decomposing transducers for efficient world modelling
ArXiv ID: 2512.02193
Authors: Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas
Abstract: World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. Although realistic world models often have high computational demands, efficient modelling is usually possible by exploiting the fact that real-world scenarios tend to involve subcomponents that interact in a modular manner. In this paper, we explore this idea by developing a framework for decomposing complex world models represented by transducers, a class of models generalising POMDPs. Whereas the composition of transducers is well understood, our results clarify how to invert this process, deriving sub-transducers operating on distinct input-output subspaces, enabling parallelizable and interpretable alternatives to monolithic world modelling that can support distributed inference. Overall, these results lay a groundwork for bridging the structural transparency demanded by AI safety and the computational efficiency required for real-world inference.
Comment: Systems-level innovation: decomposing transducers into sub-transducers for parallelizable and interpretable world modeling (distributed inference).
Relevance: 8 Novelty: 7
26. Learning Network Sheaves for AI-native Semantic Communication
ArXiv ID: 2512.03248
Authors: Enrico Grimaldi, Mario Edoardo Pandolfo, Gabriele D'Acunto, Sergio Barbarossa, Paolo Di Lorenzo
Abstract: Recent advances in AI call for a paradigm shift from bit-centric communication to goal- and semantics-oriented architectures, paving the way for AI-native 6G networks. In this context, we address a key open challenge: enabling heterogeneous AI agents to exchange compressed latent-space representations while mitigating semantic noise and preserving task-relevant meaning. We cast this challenge as learning both the communication topology and the alignment maps that govern information exchange among agents, yielding a learned network sheaf equipped with orthogonal maps. This learning process is further supported by a semantic denoising end compression module that constructs a shared global semantic space and derives sparse, structured representations of each agent's latent space. This corresponds to a nonconvex dictionary learning problem solved iteratively with closed-form updates. Experiments with mutiple AI agents pre-trained on real image data show that the semantic denoising and compression facilitates AI agents alignment and the extraction of semantic clusters, while preserving high accuracy in downstream task. The resulting communication network provides new insights about semantic heterogeneity across agents, highlighting the interpretability of our methodology.
Comment: Representation learning with sparsity: learns network sheaves and sparse, structured dictionaries for semantic communication alignment.
Relevance: 8 Novelty: 7
27. Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
ArXiv ID: 2512.03125
Authors: Xiwen Wei, Mustafa Munir, Radu Marculescu
Abstract: Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Codes will be publicly available: https://github.com/Christina200/MoDE-official.git
Comment: Model Architecture: modality-decoupled experts (MoDE) to mitigate gradient conflict in unified multimodal models for continual learning (MoE-like decoupling).
Relevance: 8 Novelty: 7
28. Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization
ArXiv ID: 2512.03393
Authors: Lakshmi Jayalal, Sheetal Kalyani
Abstract: Recovering jointly sparse signals in the multiple measurement vectors (MMV) setting is a fundamental problem in machine learning, but traditional methods like multiple measurement vectors orthogonal matching pursuit (M-OMP) and multiple measurement vectors FOCal Underdetermined System Solver (M-FOCUSS) often require careful parameter tuning or prior knowledge of the sparsity of the signal and/or noise variance. We introduce a novel tuning-free framework that leverages Implicit Regularization (IR) from overparameterization to overcome this limitation. Our approach reparameterizes the estimation matrix into factors that decouple the shared row-support from individual vector entries. We show that the optimization dynamics inherently promote the desired row-sparse structure by applying gradient descent to a standard least-squares objective on these factors. We prove that with a sufficiently small and balanced initialization, the optimization dynamics exhibit a "momentum-like" effect, causing the norms of rows in the true support to grow significantly faster than others. This formally guarantees that the solution trajectory converges towards an idealized row-sparse solution. Additionally, empirical results demonstrate that our approach achieves performance comparable to established methods without requiring any prior information or tuning.
Comment: Sparsity/Implicit Regularization: tuning-free structured sparse recovery in MMV via overparameterized factorization and provable row-sparsity emergence.
Relevance: 8 Novelty: 7
29. Better World Models Can Lead to Better Post-Training Performance
ArXiv ID: 2512.03400
Authors: Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
Abstract: In this work we study how explicit world-modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2x2x2 Rubik's Cube and ask: (1) how does explicitly pretraining a world model affect the model's latent representations, and (2) how does world-model quality affect the model's performance after reinforcement learning post-training? We compare standard next-token prediction to two explicit world-modeling strategies -- (i) state-prediction pretraining and (ii) a joint state-prediction + next-token objective -- and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post-training. We evaluate the representation quality with linear probes and causal interventions. We find that explicit world-modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains for GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post-training for sequence-planning tasks.
Comment: Representation Learning: explicit world-modeling (state prediction) objectives sharpen latent state representations in Transformers and improve post-training dynamics.
Relevance: 8 Novelty: 7
30. Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
ArXiv ID: 2512.03276
Authors: Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda
Abstract: Training vision language models (VLMs) aims to align visual representations from a vision encoder with the textual representations of a pretrained large language model (LLM). However, many VLMs exhibit reduced factual recall performance compared to their LLM backbones, raising the question of how effective multimodal fine-tuning is at extending existing mechanisms within the LLM to visual inputs. We argue that factual recall based on visual inputs requires VLMs to solve a two-hop problem: (1) forming entity representations from visual inputs, and (2) recalling associated factual knowledge based on these entity representations. By benchmarking 14 VLMs with various architectures (LLaVA, Native, Cross-Attention), sizes (7B-124B parameters), and training setups on factual recall tasks against their original LLM backbone models, we find that 11 of 14 models exhibit factual recall degradation. We select three models with high and two models with low performance degradation, and use attribution patching, activation patching, and probing to show that degraded VLMs struggle to use the existing factual recall circuit of their LLM backbone, because they resolve the first hop too late in the computation. In contrast, high-performing VLMs resolve entity representations early enough to reuse the existing factual recall mechanism. Finally, we demonstrate two methods to recover performance: patching entity representations from the LLM backbone into the VLM, and prompting with chain-of-thought reasoning. Our results highlight that the speed of early entity resolution critically determines how effective VLMs are in using preexisting LLM mechanisms. More broadly, our work illustrates how mechanistic analysis can explain and unveil systematic failures in multimodal alignment.
Comment: Representation Learning: mechanistic analysis shows delayed entity resolution hinders reuse of LLM factual recall circuits in VLMs; proposes fixes.
Relevance: 8 Novelty: 7
31. Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions
ArXiv ID: 2512.04034
Authors: Hong Yang, Devroop Kar, Qi Yu, Alex Ororbia, Travis Desell
Abstract: Why do state-of-the-art OOD detection methods exhibit catastrophic failure when models are trained on single-domain datasets? We provide the first theoretical explanation for this phenomenon through the lens of information theory. We prove that supervised learning on single-domain data inevitably produces domain feature collapse -- representations where I(x_d; z) = 0, meaning domain-specific information is completely discarded. This is a fundamental consequence of information bottleneck optimization: models trained on single domains (e.g., medical images) learn to rely solely on class-specific features while discarding domain features, leading to catastrophic failure when detecting out-of-domain samples (e.g., achieving only 53% FPR@95 on MNIST). We extend our analysis using Fano's inequality to quantify partial collapse in practical scenarios. To validate our theory, we introduce Domain Bench, a benchmark of single-domain datasets, and demonstrate that preserving I(x_d; z) > 0 through domain filtering (using pretrained representations) resolves the failure mode. While domain filtering itself is conceptually straightforward, its effectiveness provides strong empirical evidence for our information-theoretic framework. Our work explains a puzzling empirical phenomenon, reveals fundamental limitations of supervised learning in narrow domains, and has broader implications for transfer learning and when to fine-tune versus freeze pretrained models.
Comment: Representation Learning Theory: information-theoretic account of domain feature collapse (I(x_d; z)=0) and simple preservation strategy for OOD.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.