Personalized Daily ArXiv Papers 2025-09-03

[gpt-4o]	Prompt	Completion	Total
Token	71187	9933	81120
Cost	$0.18	$0.1	$0.28

Total arXiv papers: 1163

Total scanned papers: 753

Total relevant papers: 59

Table of contents with paper titles:

Relative Trajectory Balance is equivalent to Trust-PCL Authors: Tristan Deleu, Padideh Nouri, Yoshua Bengio, Doina Precup
LongCat-Flash Technical Report Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhua Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su
Globally aware optimization with resurgence Authors: Wei Bu
Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching Authors: An B. Vuong, Michael T. McCann, Javier E. Santos, Yen Ting Lin
The Price of Sparsity: Sufficient Conditions for Sparse Recovery using Sparse and Sparsified Measurements Authors: Youssef Chaabouni, David Gamarnik
MoPEQ: Mixture of Mixed Precision Quantized Experts Authors: Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani
Universal Properties of Activation Sparsity in Modern Large Language Models Authors: Filip Szatkowski, Patryk B\k{e}dkowski, Alessio Devoto, Jan Dubi\'nski, Pasquale Minervini, Miko{\l}aj Pi\'orczy\'nski, Simone Scardapane, Bartosz W\'ojcik
ZeroQAT: Your Quantization-aware Training but Efficient Authors: Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan
MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper Authors: Runjia Zeng, Guangyan Sun, Qifan Wang, Tong Geng, Sohail Dianat, Xiaotian Han, Raghuveer Rao, Xueling Zhang, Cheng Han, Lifu Huang, Dongfang Liu
DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers Authors: Aman Sharma, Saeed Najafi, Parsa Farinneya, Benyamin Jamialahmadi, Marzieh S. Tahaei, Yuhe Fan, Mehdi Rezagholizadeh, Boxing Chen, Aref Jafari
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache Authors: Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin
Metis: Training Large Language Models with Advanced Low-Bit Quantization Authors: Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang
Geometric origin of adversarial vulnerability in deep learning Authors: Yixiong Ren, Wenkang Du, Jianhui Zhou, Haiping Huang
Beyond Universal Approximation Theorems: Algorithmic Uniform Approximation by Neural Networks Trained with Noisy Data Authors: Anastasis Kratsios, Tin Sum Cheng, Daniel Roy
REFRAG: Rethinking RAG based Decoding Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition Authors: Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Yongseok Soh, Jesmin Jahan Tithi, Fabrizio Petrini, Jee Choi
LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving Authors: Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo
Learnable Loss Geometries with Mirror Descent for Scalable and Convergent Meta-Learning Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis
Fisher information flow in artificial neural networks Authors: Maximilian Weimar, Lukas M. Rachbauer, Ilya Starshynov, Daniele Faccio, Linara Adilova, Dorian Bouchet, Stefan Rotter
Causal Interpretation of Sparse Autoencoder Features in Vision Authors: Sangyu Han, Yearim Kim, Nojun Kwak
Cache Management for Mixture-of-Experts LLMs -- extended version Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon
MODE: Mixture of Document Experts for RAG Authors: Rahul Anand
Graph Contrastive Learning versus Untrained Baselines: The Role of Dataset Size Authors: Smayan Khanna, Doruk Efe G\"okmen, Risi Kondor, Vincenzo Vitelli
Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling Authors: Sachin Goyal, David Lopez-Paz, Kartik Ahuja
Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs Authors: Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, Pan Li
DaCe AD: Unifying High-Performance Automatic Differentiation for Machine Learning and Scientific Computing Authors: Afif Boudaoud, Alexandru Calotoiu, Marcin Copik, Torsten Hoefler
Theory Foundation of Physics-Enhanced Residual Learning Authors: Shixiao Liang, Wang Chen, Keke Long, Peng Zhang, Xiaopeng Li, Jintao Ke
Accelerating PDE Solvers with Equation-Recast Neural Operator Preconditioning Authors: Qiyun Cheng, Md Hossain Sahadath, Huihua Yang, Shaowu Pan, Wei Ji
Crystal Structure Prediction with a Geometric Permutation-Invariant Loss Function Authors: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin
Beyond Ensembles: Simulating All-Atom Protein Dynamics in a Learned Latent Space Authors: Aditya Sengar, Ali Hariri, Pierre Vandergheynst, Patrick Barth
Towards Comprehensive Information-theoretic Multi-view Learning Authors: Long Shi, Yunshan Ye, Wenjie Wang, Tao Lei, Yu Zhao, Gang Kou, Badong Chen
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons Authors: Chongwen Zhao, Kaizhu Huang
An Efficient GNNs-to-KANs Distillation via Self-Attention Dynamic Sampling with Potential for Consumer Electronics Edge Deployment Authors: Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li
Low Power Approximate Multiplier Architecture for Deep Neural Networks Authors: Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu
Preconditioned Regularized Wasserstein Proximal Sampling Authors: Hong Ye Tan, Stanley Osher, Wuchen Li
Differentiable Expectation-Maximisation and Applications to Gaussian Mixture Model Optimal Transport Authors: Samuel Bo\"it\'e, Eloi Tanguy, Julie Delon, Agn`es Desolneux, R\'emi Flamary
An Explainable Gaussian Process Auto-encoder for Tabular Data Authors: Wei Zhang, Brian Barr, John Paisley
Learning residue level protein dynamics with multiscale Gaussians Authors: Mihir Bafna, Bowen Jing, Bonnie Berger
Quantum Circuits for Quantum Convolutions: A Quantum Convolutional Autoencoder Authors: Javier Orduz, Pablo Rivas, Erich Baker
VASSO: Variance Suppression for Sharpness-Aware Minimization Authors: Bingcong Li, Yilang Zhang, Georgios B. Giannakis
Conditional-$t^3$VAE: Equitable Latent Space Allocation for Fair Generation Authors: Aymene Mohammed Bouayed, Samuel Deslauriers-Gauthier, Adrian Iaccovelli, David Naccache
Unsupervised Training of Vision Transformers with Synthetic Negatives Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
Efficient Transformer-Inspired Variants of Physics-Informed Deep Operator Networks Authors: Zhi-Feng Wei, Wenqian Chen, Panos Stinis
MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs Authors: Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, Jaegul Choo
Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective Authors: Shijie Wang, Li Zhang, Xinyan Liang, Yuhua Qian, Shen Hu
Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning Authors: Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea M{\o}rch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Martens, Julien Fauqueur
Counterfactual Sensitivity for Faithful Reasoning in Language Models Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
VISP: Volatility Informed Stochastic Projection for Adaptive Regularization Authors: Tanvir Islam
Equivariant U-Shaped Neural Operators for the Cahn-Hilliard Phase-Field Model Authors: Xiao Xue, M. F. P. ten Eikelder, Tianyue Yang, Yiqing Li, Kan He, Shuo Wang, Peter V. Coveney
Memory Limitations of Prompt Tuning in Transformers Authors: Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan
Causal representation learning from network data Authors: Jifan Zhang, Michelle M. Li, Elena Zheleva
Teaching AI to Remember: Insights from Brain-Inspired Replay in Continual Learning Authors: Jina Kim
GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping Authors: Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Reza Rawassizadeh
Double Descent and Overparameterization in Particle Physics Data Authors: Matthias Vigl, Lukas Heinrich
Optimized Weight Initialization on the Stiefel Manifold for Deep ReLU Neural Networks Authors: Hyungu Lee, Taehyeong Kim, Hayoung Choi
Generalization vs. Memorization in Autoregressive Deep Learning: Or, Examining Temporal Decay of Gradient Coherence Authors: James Amarel, Nicolas Hengartner, Robyn Miller, Kamaljeet Singh, Siddharth Mansingh, Arvind Mohan, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde
Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination Authors: Terry Jingchen Zhang, Gopal Dev, Ning Wang, Nicole Ni, Wenyuan Jiang, Yinya Huang, Bernhard Sch\"olkopf, Mrinmaya Sachan, Zhijing Jin

1. Relative Trajectory Balance is equivalent to Trust-PCL

ArXiv ID: 2509.01632

Authors: Tristan Deleu, Padideh Nouri, Yoshua Bengio, Doina Precup

Abstract: Recent progress in generative modeling has highlighted the importance of Reinforcement Learning (RL) for fine-tuning, with KL-regularized methods in particular proving to be highly effective for both autoregressive and diffusion models. Complementing this line of work, the Relative Trajectory Balance (RTB) objective was recently introduced in the context of Generative Flow Networks (GFlowNets) to serve the same role of improving fine-tuning in sequential generative models. Building on prior work linking GFlowNets and maximum-entropy RL, we establish in this paper an equivalence between RTB and Trust-PCL, an off-policy RL method with KL regularization. This equivalence situates RTB within the broader theoretical landscape of KL-regularized RL, and clarifies its relationship to earlier methods. Leveraging this insight, we revisit an illustrative example from the RTB paper and show that KL-regularized RL methods achieve comparable performance, offering an alternative perspective to what was previously reported.

Comment: Author match

2. LongCat-Flash Technical Report

ArXiv ID: 2509.01322

Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhua Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su

Abstract: We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of \$0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat

Comment: The paper introduces LongCat-Flash, a Mixture-of-Experts language model, focusing on architectural innovations and efficiency improvements.

Relevance: 10 Novelty: 8

3. Globally aware optimization with resurgence

ArXiv ID: 2509.01329

Authors: Wei Bu

Abstract: Modern optimization faces a fundamental challenge: local gradient-based methods provide no global information about the objective function $L$ landscape, often leading to suboptimal convergence and sensitivity to initialization. We introduce a novel optimization framework that leverages resurgence theory from complex analysis to extract global structural information from divergent asymptotic series. Our key insight is that the factorially divergent perturbative expansions of parameter space partition functions encode precise information about all critical objective function value in the landscape through their Borel transform singularities. The algorithm works by computing the statistical mechanical partition function $Z(g) = \int e^{-L(\theta)/g} d\theta$ for small coupling $g\ll 1$, extracting its asymptotic series coefficients, and identifying Borel plane singularities that correspond one-to-one with critical objective function values. These target values provide global guidance to local optimizers, enabling principled learning rate adaptation and escape from suboptimal regions. Unlike heuristic adaptive methods, targets are theoretically grounded in the geometry of the optimization landscape.

Comment: The paper introduces a novel optimization framework using resurgence theory, which is relevant to emerging trends in optimization.

Relevance: 9 Novelty: 9

4. Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching

ArXiv ID: 2509.00336

Authors: An B. Vuong, Michael T. McCann, Javier E. Santos, Yen Ting Lin

Abstract: Diffusion models are commonly interpreted as learning the score function, i.e., the gradient of the log-density of noisy data. However, this assumption implies that the target of learning is a conservative vector field, which is not enforced by the neural network architectures used in practice. We present numerical evidence that trained diffusion networks violate both integral and differential constraints required of true score functions, demonstrating that the learned vector fields are not conservative. Despite this, the models perform remarkably well as generative mechanisms. To explain this apparent paradox, we advocate a new theoretical perspective: diffusion training is better understood as flow matching to the velocity field of a Wasserstein Gradient Flow (WGF), rather than as score learning for a reverse-time stochastic differential equation. Under this view, the "probability flow" arises naturally from the WGF framework, eliminating the need to invoke reverse-time SDE theory and clarifying why generative sampling remains successful even when the neural vector field is not a true score. We further show that non-conservative errors from neural approximation do not necessarily harm density transport. Our results advocate for adopting the WGF perspective as a principled, elegant, and theoretically grounded framework for understanding diffusion generative models.

Comment: The paper reinterprets diffusion models through a new theoretical perspective, relevant to emerging trends in generative models.

Relevance: 9 Novelty: 9

5. The Price of Sparsity: Sufficient Conditions for Sparse Recovery using Sparse and Sparsified Measurements

ArXiv ID: 2509.01809

Authors: Youssef Chaabouni, David Gamarnik

Abstract: We consider the problem of recovering the support of a sparse signal using noisy projections. While extensive work has been done on the dense measurement matrix setting, the sparse setting remains less explored. In this work, we establish sufficient conditions on the sample size for successful sparse recovery using sparse measurement matrices. Bringing together our result with previously known necessary conditions, we discover that, in the regime where $ds/p \rightarrow +\infty$, sparse recovery in the sparse setting exhibits a phase transition at an information-theoretic threshold of $n_{\text{INF}}^{\text{SP}} = \Theta\left(s\log\left(p/s\right)/\log\left(ds/p\right)\right)$, where $p$ denotes the signal dimension, $s$ the number of non-zero components of the signal, and $d$ the expected number of non-zero components per row of measurement. This expression makes the price of sparsity explicit: restricting each measurement to $d$ non-zeros inflates the required sample size by a factor of $\log{s}/\log\left(ds/p\right)$, revealing a precise trade-off between sampling complexity and measurement sparsity. Additionally, we examine the effect of sparsifying an originally dense measurement matrix on sparse signal recovery. We prove in the regime of $s = \alpha p$ and $d = \psi p$ with $\alpha, \psi \in \left(0,1\right)$ and $\psi$ small that a sample of size $n^{\text{Sp-ified}}_{\text{INF}} = \Theta\left(p / \psi^2\right)$ is sufficient for recovery, subject to a certain uniform integrability conjecture, the proof of which is work in progress.

Comment: The paper provides theoretical insights into sparse recovery using sparse measurement matrices, relevant to model compression and sparsity.

Relevance: 9 Novelty: 8

6. MoPEQ: Mixture of Mixed Precision Quantized Experts

ArXiv ID: 2509.02512

Authors: Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani

Abstract: Large Language and Vision Models using a Mixture-of-Experts (MoE) architecture pose significant challenges for deployment due to their computational and memory demands. Mixed Precision Quantization assigns different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. In this work, we propose a Post Training Quantization algorithm, MoPEQ, that assigns optimal bit width to each expert. Our method balances accuracy and model size by analyzing each expert's sensitivity using Hessian trace approximation instead of relying on the activation frequency of the expert. This per-expert granularity approach clusters similar experts to maintain model performance while reducing memory requirements. The experimental results on VLMEvalKit benchmark datasets using State-of-the-art VLMs Deepseek-VL2 -tiny, -small, -base, and MolmoE models demonstrate that our mixed precision quantized MoEs achieve competitive accuracy with substantial improvements in memory footprint compared to uniform-precision baseline methods. We perform a comprehensive study to analyze the impact of expert activation frequency and sensitivity using Hessian trace approximation at both layer-wise and model-wide expert precision allocation of 2, 3, and 4 bits to provide a thorough understanding of mixed precision quantization of VLM-MoEs.

Comment: The paper introduces a mixed precision quantization method for MoE architectures, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

7. Universal Properties of Activation Sparsity in Modern Large Language Models

ArXiv ID: 2509.00454

Authors: Filip Szatkowski, Patryk B\k{e}dkowski, Alessio Devoto, Jan Dubi\'nski, Pasquale Minervini, Miko{\l}aj Pi\'orczy\'nski, Simone Scardapane, Bartosz W\'ojcik

Abstract: Input-dependent activation sparsity is a notable property of deep learning models, which has been extensively studied in networks with ReLU activations and is associated with efficiency, robustness, and interpretability. However, the approaches developed for ReLU-based models depend on exact zero activations and do not transfer directly to modern large language models~(LLMs), which have abandoned ReLU in favor of other activation functions. As a result, current work on activation sparsity in LLMs is fragmented, model-specific, and lacks consensus on which components to target. We propose a general framework to assess sparsity robustness and present a systematic study of the phenomenon in the FFN layers of modern LLMs, including diffusion LLMs. Our findings reveal universal patterns of activation sparsity in LLMs, provide insights into this phenomenon, and offer practical guidelines for exploiting it in model design and acceleration.

Comment: The paper studies activation sparsity in LLMs, providing insights into model efficiency and behavior, which is relevant to representation learning and model compression.

Relevance: 9 Novelty: 8

8. ZeroQAT: Your Quantization-aware Training but Efficient

ArXiv ID: 2509.00031

Authors: Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan

Abstract: Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. While quantization-aware training (QAT) provides a principled solution, its reliance on backpropagation incurs prohibitive data, time, and memory costs, limiting its practicality. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework. ZeroQAT leverages forward-only gradient estimation to eliminate the need for backpropagation, significantly reducing computational and memory overhead while retaining the benefits of end-to-end optimization. Moreover, ZeroQAT jointly learns quantized weights, weight clipping thresholds, and equivalent transformations to mitigate quantization error and handle activation outliers. Experiments demonstrate that ZeroQAT achieves the efficiency of PTQ while retaining the accuracy of QAT, offering a practical solution for high-quality low-bit quantization of LLMs.

Comment: The paper introduces ZeroQAT, a novel quantization-aware training framework, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

9. MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper

ArXiv ID: 2509.00996

Authors: Runjia Zeng, Guangyan Sun, Qifan Wang, Tong Geng, Sohail Dianat, Xiaotian Han, Raghuveer Rao, Xueling Zhang, Cheng Han, Lifu Huang, Dongfang Liu

Abstract: Considering deep neural networks as manifold mappers, the pretrain-then-fine-tune paradigm can be interpreted as a two-stage process: pretrain establishes a broad knowledge base, and fine-tune adjusts the model parameters to activate specific neural pathways to align with the target manifold. Although prior fine-tuning approaches demonstrate success, their rigid parameter space limits their ability to dynamically activate appropriate neural pathways, rendering them ill-equipped to adapt flexibly to the diverse and evolving data distributions. In light of this view, we propose a novel approach, Mixture of Expert Prompt Tuning (MEPT), as an effective and efficient manifold-mapping framework. MEPT leverages the Mixture of Experts architecture by integrating multiple prompt experts to adaptively learn diverse and non-stationary data distributions. Empirical evaluations demonstrate that MEPT outperforms several state-of-the-art parameter efficient baselines on SuperGLUE, achieving notable improvements in mean accuracy (e.g., 1.94%) while significantly reducing activated prompts by 79.25%. The effectiveness of MEPT is further supported by theoretical insights from manifold learning and validated through neural activation pathway visualization results. Our code is avaliable at https://github.com/runtsang/MEPT.

Comment: The paper introduces Mixture of Expert Prompt Tuning (MEPT), which is relevant to the Mixture-of-Experts architecture.

Relevance: 9 Novelty: 8

10. DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers

ArXiv ID: 2509.00925

Authors: Aman Sharma, Saeed Najafi, Parsa Farinneya, Benyamin Jamialahmadi, Marzieh S. Tahaei, Yuhe Fan, Mehdi Rezagholizadeh, Boxing Chen, Aref Jafari

Abstract: Transformers achieve state-of-the-art results across many tasks, but their uniform application of quadratic self-attention to every token at every layer makes them computationally expensive. We introduce DTRNet (Dynamic Token Routing Network), an improved Transformer architecture that allows tokens to dynamically skip the quadratic cost of cross-token mixing while still receiving lightweight linear updates. By preserving the MLP module and reducing the attention cost for most tokens to linear, DTRNet ensures that every token is explicitly updated while significantly lowering overall computation. This design offers an efficient and effective alternative to standard dense attention. Once trained, DTRNet blocks routes only ~10% of tokens through attention at each layer while maintaining performance comparable to a full Transformer. It consistently outperforms routing-based layer skipping methods such as MoD and D-LLM in both accuracy and memory at matched FLOPs, while routing fewer tokens to full attention. Its efficiency gains, scales with sequence length, offering significant reduction in FLOPs for long-context inputs. By decoupling token updates from attention mixing, DTRNet substantially reduces the quadratic share of computation, providing a simple, efficient, and scalable alternative to Transformers.

Comment: The paper introduces DTRNet, a dynamic token routing network to reduce quadratic costs in Transformers, relevant to model architecture.

Relevance: 9 Novelty: 8

11. KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache

ArXiv ID: 2509.00579

Authors: Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin

Abstract: Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV) cache, which can scale to multiple gigabytes as sequence length and batch size increase. In this paper, we present KVComp, a generic and efficient KV cache management framework optimized for long-text generation that synergistically works with both latency-critical and throughput-critical inference systems. KVComp employs novel lossy compression techniques specifically designed for KV cache data characteristics, featuring careful co-design of compression algorithms and system architecture. Our approach maintains compatibility with the growing nature of KV cache while preserving high computational efficiency. Experimental results show that KVComp achieves on average 47\% and up to 83\% higher memory reduction rate compared to existing methods with little/no model accuracy degradation. Furthermore, KVComp achieves extremely high execution throughput, effectively reducing decompression overhead and, in some cases, even accelerating the matrix-vector multiplication operation and outperform cuBLAS-based attention kernels with less data movement.

Comment: The paper introduces a lossy compression framework for KV cache in LLMs, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

12. Metis: Training Large Language Models with Advanced Low-Bit Quantization

ArXiv ID: 2509.00404

Authors: Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang

Abstract: This work identifies anisotropic parameter distributions as a fundamental barrier to training large language models (LLMs) with low-bit quantization: a few dominant singular values create wide numerical ranges that conflict with the inherent bias of block-wise quantization. This bias disproportionately preserves high-magnitude values while discarding smaller ones, causing training instability and low model performance. This work introduces Metis, a training framework that combines (i) spectral decomposition with random embedding to efficiently disentangle dominant from long-tail components, compressing broad distributions into quantization-friendly narrow ranges; (ii) adaptive learning rates in the spectral domain to amplify underrepresented directions and better capture diverse features critical for performance; and (iii) a dual-range regularizer that jointly constrains numerical precision and parameter range distribution, ensuring stable, unbiased low-bit training. With Metis, FP8 training surpasses FP32 baselines, and FP4 training achieves accuracy comparable to FP32, paving the way for robust and scalable LLM training under advanced low-bit quantization. The code implementation for Metis is available at: https://github.com/typename-yyf/Metis-quantization.

Comment: The paper introduces Metis, a framework for training large language models with low-bit quantization, addressing model compression through quantization and efficiency improvements.

Relevance: 9 Novelty: 8

13. Geometric origin of adversarial vulnerability in deep learning

ArXiv ID: 2509.01235

Authors: Yixiong Ren, Wenkang Du, Jianhui Zhou, Haiping Huang

Abstract: How to balance training accuracy and adversarial robustness has become a challenge since the birth of deep learning. Here, we introduce a geometry-aware deep learning framework that leverages layer-wise local training to sculpt the internal representations of deep neural networks. This framework promotes intra-class compactness and inter-class separation in feature space, leading to manifold smoothness and adversarial robustness against white or black box attacks. The performance can be explained by an energy model with Hebbian coupling between elements of the hidden representation. Our results thus shed light on the physics of learning in the direction of alignment between biological and artificial intelligence systems. Using the current framework, the deep network can assimilate new information into existing knowledge structures while reducing representation interference.

Comment: The paper presents a geometry-aware deep learning framework for improving adversarial robustness, focusing on representation learning and training dynamics.

Relevance: 9 Novelty: 8

14. Beyond Universal Approximation Theorems: Algorithmic Uniform Approximation by Neural Networks Trained with Noisy Data

ArXiv ID: 2509.00924

Authors: Anastasis Kratsios, Tin Sum Cheng, Daniel Roy

Abstract: At its core, machine learning seeks to train models that reliably generalize beyond noisy observations; however, the theoretical vacuum in which state-of-the-art universal approximation theorems (UATs) operate isolates them from this goal, as they assume noiseless data and allow network parameters to be chosen freely, independent of algorithmic realism. This paper bridges that gap by introducing an architecture-specific randomized training algorithm that constructs a uniform approximator from $N$ noisy training samples on the $d$-dimensional cube $[0,1]^d$. Our trained neural networks attain the minimax-optimal quantity of \textit{trainable} (non-random) parameters, subject to logarithmic factors which vanish under the idealized noiseless sampling assumed in classical UATs. Additionally, our trained models replicate key behaviours of real-world neural networks, absent in standard UAT constructions, by: (1) exhibiting sub-linear parametric complexity when fine-tuning on structurally related and favourable out-of-distribution tasks, (2) exactly interpolating the training data, and (3) maintaining reasonable Lipschitz regularity (after the initial clustering attention layer). These properties bring state-of-the-art UATs closer to practical machine learning, shifting the central open question from algorithmic implementability with noisy samples to whether stochastic gradient descent can achieve comparable guarantees.

Comment: The paper extends universal approximation theorems to noisy data, providing theoretical insights into neural network behavior, relevant to representation learning.

Relevance: 9 Novelty: 8

15. REFRAG: Rethinking RAG based Decoding

ArXiv ID: 2509.01092

Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG require specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85 the time-to-first-token acceleration (3.75 improvement to previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.

Comment: The paper proposes REFRAG, an efficient decoding framework for retrieval-augmented generation (RAG) that exploits sparsity structure to improve latency, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

16. ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition

ArXiv ID: 2509.00280

Authors: Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Yongseok Soh, Jesmin Jahan Tithi, Fabrizio Petrini, Jee Choi

Abstract: Tensor decomposition (TD) is essential for analyzing high-dimensional sparse data, yet its irregular computations and memory-access patterns pose major performance challenges on modern parallel processors. Prior works rely on expert-designed sparse tensor formats that fail to adapt to irregular tensor shapes and/or highly variable data distributions. We present the reinforcement-learned adaptive tensor encoding (ReLATE) framework, a novel learning-augmented method that automatically constructs efficient sparse tensor representations without labeled training samples. ReLATE employs an autonomous agent that discovers optimized tensor encodings through direct interaction with the TD environment, leveraging a hybrid model-free and model-based algorithm to learn from both real and imagined actions. Moreover, ReLATE introduces rule-driven action masking and dynamics-informed action filtering mechanisms that ensure functionally correct tensor encoding with bounded execution time, even during early learning stages. By automatically adapting to both irregular tensor shapes and data distributions, ReLATE generates sparse tensor representations that consistently outperform expert-designed formats across diverse sparse tensor data sets, achieving up to 2X speedup compared to the best sparse format, with a geometric-mean speedup of 1.4-1.46X.

Comment: The paper presents ReLATE, a framework for learning efficient sparse tensor encodings, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

17. LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving

ArXiv ID: 2509.01229

Authors: Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo

Abstract: Quantization is a critical technique for accelerating LLM inference by reducing memory footprint and improving computational efficiency. Among various schemes, 4-bit weight and 8-bit activation quantization (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 GEMM kernels fall short in practice due to inefficient dequantization on CUDA Cores, which cannot keep pace with the high throughput of Tensor Cores. In this paper, we present LiquidGEMM, a hardware-efficient W4A8 GEMM kernel for efficient LLM serving. LiquidGEMM designs two key techniques: LiquidQuant, a hardware-efficient quantization method that enables fast, overflow-safe dequantization using just two arithmetic instructions per four elements; and an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and MMA across warp groups without software synchronization or redundant memory traffic. Experimental results show that LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup. Compared to various quantized GEMM kernels in NVIDIA TensorRT-LLM, LiquidGEMM delivers 1.12-1.63x performance gains, and achieves up to 1.63x system-level speedup.

Comment: The paper presents LiquidGEMM, a hardware-efficient GEMM kernel for LLM serving, focusing on quantization and efficiency, which aligns with model compression.

Relevance: 9 Novelty: 8

18. Learnable Loss Geometries with Mirror Descent for Scalable and Convergent Meta-Learning

ArXiv ID: 2509.02418

Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis

Abstract: Utilizing task-invariant knowledge acquired from related tasks as prior information, meta-learning offers a principled approach to learning a new task with limited data records. Sample-efficient adaptation of this prior information is a major challenge facing meta-learning, and plays an important role because it facilitates training the sought task-specific model with just a few optimization steps. Past works deal with this challenge through preconditioning that speeds up convergence of the per-task training. Though effective in representing locally quadratic loss curvatures, simple linear preconditioning can be hardly potent with complex loss geometries. Instead of relying on a quadratic distance metric, the present contribution copes with complex loss metrics by learning a versatile distance-generating function, which induces a nonlinear mirror map to effectively capture and optimize a wide range of loss geometries. With suitable parameterization, this generating function is effected by an expressive neural network that is provably a valid distance. Analytical results establish convergence of not only the proposed method, but also all meta-learning approaches based on preconditioning. To attain gradient norm less than $\epsilon$, the convergence rate of $\mathcal{O}(\epsilon^{-2})$ is on par with standard gradient-based meta-learning methods. Numerical tests on few-shot learning datasets demonstrate the superior empirical performance of the novel algorithm, as well as its rapid per-task convergence, which markedly reduces the number of adaptation steps, hence also accommodating large-scale meta-learning models.

Comment: The paper presents a novel approach to meta-learning by learning a versatile distance-generating function, which aligns with representation learning and training dynamics in neural networks.

Relevance: 9 Novelty: 8

19. Fisher information flow in artificial neural networks

ArXiv ID: 2509.02407

Authors: Maximilian Weimar, Lukas M. Rachbauer, Ilya Starshynov, Daniele Faccio, Linara Adilova, Dorian Bouchet, Stefan Rotter

Abstract: The estimation of continuous parameters from measured data plays a central role in many fields of physics. A key tool in understanding and improving such estimation processes is the concept of Fisher information, which quantifies how information about unknown parameters propagates through a physical system and determines the ultimate limits of precision. With Artificial Neural Networks (ANNs) gradually becoming an integral part of many measurement systems, it is essential to understand how they process and transmit parameter-relevant information internally. Here, we present a method to monitor the flow of Fisher information through an ANN performing a parameter estimation task, tracking it from the input to the output layer. We show that optimal estimation performance corresponds to the maximal transmission of Fisher information, and that training beyond this point results in information loss due to overfitting. This provides a model-free stopping criterion for network training-eliminating the need for a separate validation dataset. To demonstrate the practical relevance of our approach, we apply it to a network trained on data from an imaging experiment, highlighting its effectiveness in a realistic physical setting.

Comment: The paper introduces a method to monitor Fisher information flow in neural networks, providing insights into how networks process information, relevant to representation learning.

Relevance: 9 Novelty: 8

20. Causal Interpretation of Sparse Autoencoder Features in Vision

ArXiv ID: 2509.00749

Authors: Sangyu Han, Yearim Kim, Nojun Kwak

Abstract: Understanding what sparse auto-encoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature's activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with-but does not cause-the feature's firing. We propose Causal Feature Explanation (CaFE), which leverages Effective Receptive Field (ERF). We consider each activation of an SAE feature to be a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a "roaring face" feature that requires the co-occurrence of eyes and nose, rather than merely an open mouth). Patch insertion tests confirm that CaFE more effectively recovers or suppresses feature activations than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.

Comment: The paper proposes a method for causal interpretation of sparse autoencoder features, aligning with representation learning and feature learning.

Relevance: 9 Novelty: 8

21. Cache Management for Mixture-of-Experts LLMs -- extended version

ArXiv ID: 2509.02408

Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management, since they typically involve billions of parameters. To this end, architectures based on Mixture-of-Experts have been proposed, which aim to reduce the size of the parameters that are activated when producing a token. This raises the equally critical issue of efficiently managing the limited cache of the system, in that frequently used experts should be stored in the fast cache rather than in the slower secondary memory. In this work, we introduce and study a new paging problem that models expert management optimization. Our formulation captures both the layered architecture of LLMs and the requirement that experts are cached efficiently. We first present lower bounds on the competitive ratio of both deterministic and randomized algorithms, which show that under mild assumptions, LRU-like policies have good theoretical competitive performance. We then propose a layer-based extension of LRU that is tailored to the problem at hand. Extensive simulations on both synthetic datasets and actual traces of MoE usage show that our algorithm outperforms policies for the classic paging problem, such as the standard LRU.

Comment: The paper discusses cache management for Mixture-of-Experts LLMs, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 7

22. MODE: Mixture of Document Experts for RAG

ArXiv ID: 2509.00100

Authors: Rahul Anand

Abstract: Retrieval-Augmented Generation (RAG) often relies on large vector databases and cross-encoders tuned for large-scale corpora, which can be excessive for small, domain-specific collections. We present MODE (Mixture of Document Experts), a lightweight alternative that replaces fine-grained nearest-neighbor search with cluster-and-route retrieval. Documents are embedded, grouped into semantically coherent clusters, and represented by cached centroids. At query time, we route to the top centroid(s) and retrieve context only within those clusters, eliminating external vector-database infrastructure and reranking while keeping latency low. On HotpotQA and SQuAD corpora with 100-500 chunks, MODE matches or exceeds a dense-retrieval baseline in answer quality while reducing end-to-end retrieval time. Ablations show that cluster granularity and multi-cluster routing control the recall/precision trade-off, and that tighter clusters improve downstream accuracy. MODE offers a practical recipe for small and medium corpora where simplicity, speed, and topical focus matter.

Comment: The paper introduces a Mixture of Document Experts (MODE) for retrieval-augmented generation, which aligns with the interest in Mixture-of-Experts architectures.

Relevance: 9 Novelty: 7

23. Graph Contrastive Learning versus Untrained Baselines: The Role of Dataset Size

ArXiv ID: 2509.01541

Authors: Smayan Khanna, Doruk Efe G\"okmen, Risi Kondor, Vincenzo Vitelli

Abstract: Graph Contrastive Learning (GCL) has emerged as a leading paradigm for self- supervised learning on graphs, with strong performance reported on standardized datasets and growing applications ranging from genomics to drug discovery. We ask a basic question: does GCL actually outperform untrained baselines? We find that GCL's advantage depends strongly on dataset size and task difficulty. On standard datasets, untrained Graph Neural Networks (GNNs), simple multilayer perceptrons, and even handcrafted statistics can rival or exceed GCL. On the large molecular dataset ogbg-molhiv, we observe a crossover: GCL lags at small scales but pulls ahead beyond a few thousand graphs, though this gain eventually plateaus. On synthetic datasets, GCL accuracy approximately scales with the logarithm of the number of graphs and its performance gap (compared with untrained GNNs) varies with respect to task complexity. Moving forward, it is crucial to identify the role of dataset size in benchmarks and applications, as well as to design GCL algorithms that avoid performance plateaus.

Comment: The paper discusses Graph Contrastive Learning, which is a form of representation learning, and examines its performance relative to untrained baselines, providing insights into training dynamics and dataset size effects.

Relevance: 9 Novelty: 7

24. Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling

ArXiv ID: 2509.01649

Authors: Sachin Goyal, David Lopez-Paz, Kartik Ahuja

Abstract: In the past year, distillation has seen a renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on new paradigms that are key to modern LLMs, such as test-time scaling and in-context learning, remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models that exhibit remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning capabilities, particularly the one modeled via induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox of a bigram model, which helps us isolate the common principal factor behind our observations. Finally, using these insights, we shed light on various design choices for pretraining that should help practitioners going forward.

Comment: The paper explores the effects of distillation on LLM pretraining, particularly on test-time scaling and in-context learning, providing insights into LLM behavior.

Relevance: 9 Novelty: 7

25. Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs

ArXiv ID: 2509.00096

Authors: Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, Pan Li

Abstract: Neural network pruning has emerged as a promising approach for deploying LLMs in low-resource scenarios while preserving downstream task performance. However, for the first time, we reveal that such pruning disrupts LLMs' internal activation features crucial for lie detection, where probing classifiers (typically small logistic regression models) trained on these features assess the truthfulness of LLM-generated statements. This discovery raises a crucial open question: how can we prune LLMs without sacrificing these critical lie detection capabilities? Our investigation further reveals that naively adjusting layer-wise pruning sparsity based on importance inadvertently removes crucial weights, failing to improve lie detection performance despite its reliance on the most crucial LLM layer. To address this issue, we propose Truthful Pruning aligned by Layer-wise Outliers (TPLO), which places greater emphasis on layers with more activation outliers and stronger discriminative features simultaneously. This preserves LLMs' original performance while retaining critical features of inner states needed for robust lie detection. Moreover, we introduce a prompting rule to enrich the TruthfulQA benchmark for better calibrating LLM pruning. Empirical results show that our approach improves the hallucination detection for pruned LLMs (achieving 88% accuracy at 50% sparsity) and enhances their performance on TruthfulQA.

Comment: The paper addresses pruning in LLMs while preserving truthfulness, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 7

26. DaCe AD: Unifying High-Performance Automatic Differentiation for Machine Learning and Scientific Computing

ArXiv ID: 2509.02197

Authors: Afif Boudaoud, Alexandru Calotoiu, Marcin Copik, Torsten Hoefler

Abstract: Automatic differentiation (AD) is a set of techniques that systematically applies the chain rule to compute the gradients of functions without requiring human intervention. Although the fundamentals of this technology were established decades ago, it is experiencing a renaissance as it plays a key role in efficiently computing gradients for backpropagation in machine learning algorithms. AD is also crucial for many applications in scientific computing domains, particularly emerging techniques that integrate machine learning models within scientific simulations and schemes. Existing AD frameworks have four main limitations: limited support of programming languages, requiring code modifications for AD compatibility, limited performance on scientific computing codes, and a naive store-all solution for forward-pass data required for gradient calculations. These limitations force domain scientists to manually compute the gradients for large problems. This work presents DaCe AD, a general, efficient automatic differentiation engine that requires no code modifications. DaCe AD uses a novel ILP-based algorithm to optimize the trade-off between storing and recomputing to achieve maximum performance within a given memory constraint. We showcase the generality of our method by applying it to NPBench, a suite of HPC benchmarks with diverse scientific computing patterns, where we outperform JAX, a Python framework with state-of-the-art general AD capabilities, by more than 92 times on average without requiring any code changes.

Comment: The paper presents DaCe AD, an efficient automatic differentiation engine, which is relevant to emerging trends in computational efficiency.

Relevance: 8 Novelty: 8

27. Theory Foundation of Physics-Enhanced Residual Learning

ArXiv ID: 2509.00348

Authors: Shixiao Liang, Wang Chen, Keke Long, Peng Zhang, Xiaopeng Li, Jintao Ke

Abstract: Intensive studies have been conducted in recent years to integrate neural networks with physics models to balance model accuracy and interpretability. One recently proposed approach, named Physics-Enhanced Residual Learning (PERL), is to use learning to estimate the residual between the physics model prediction and the ground truth. Numeral examples suggested that integrating such residual with physics models in PERL has three advantages: (1) a reduction in the number of required neural network parameters; (2) faster convergence rates; and (3) fewer training samples needed for the same computational precision. However, these numerical results lack theoretical justification and cannot be adequately explained. This paper aims to explain these advantages of PERL from a theoretical perspective. We investigate a general class of problems with Lipschitz continuity properties. By examining the relationships between the bounds to the loss function and residual learning structure, this study rigorously proves a set of theorems explaining the three advantages of PERL. Several numerical examples in the context of automated vehicle trajectory prediction are conducted to illustrate the proposed theorems. The results confirm that, even with significantly fewer training samples, PERL consistently achieves higher accuracy than a pure neural network. These results demonstrate the practical value of PERL in real world autonomous driving applications where corner case data are costly or hard to obtain. PERL therefore improves predictive performance while reducing the amount of data required.

Comment: The paper provides a theoretical foundation for physics-enhanced residual learning, which is relevant to AI for science and emerging trends.

Relevance: 8 Novelty: 8

28. Accelerating PDE Solvers with Equation-Recast Neural Operator Preconditioning

ArXiv ID: 2509.01416

Authors: Qiyun Cheng, Md Hossain Sahadath, Huihua Yang, Shaowu Pan, Wei Ji

Abstract: The computational overhead of traditional numerical solvers for partial differential equations (PDEs) remains a critical bottleneck for large-scale parametric studies and design optimization. We introduce a Minimal-Data Parametric Neural Operator Preconditioning (MD-PNOP) framework, which establishes a new paradigm for accelerating parametric PDE solvers while strictly preserving physical constraints. The key idea is to recast the residual from parameter deviation as additional source term, where any trained neural operator can be used to refine the solution in an offline fashion. This directly addresses the fundamental extrapolation limitation of neural operators, enabling extrapolative generalization of any neural operator trained at a single parameter setting across a wide range of configurations without any retraining. The neural operator predictions are then embedded into iterative PDE solvers as improved initial guesses, thereby reducing convergence iterations without sacrificing accuracy. Unlike purely data-driven approaches, MD-PNOP guarantees that the governing equations remain fully enforced, eliminating concerns about loss of physics or interpretability. The framework is architecture-agnostic and is demonstrated using both Deep Operator Networks (DeepONet) and Fourier Neural Operators (FNO) for Boltzmann transport equation solvers in neutron transport applications. We demonstrated that neural operators trained on a single set of constant parameters successfully accelerate solutions with heterogeneous, sinusoidal, and discontinuous parameter distributions. Besides, MD-PNOP consistently achieves ~50% reduction in computational time while maintaining full order fidelity for fixed-source, single-group eigenvalue, and multigroup coupled eigenvalue problems.

Comment: The paper introduces a new paradigm for accelerating PDE solvers with neural operator preconditioning, which is relevant to AI for science and emerging trends.

Relevance: 8 Novelty: 8

29. Crystal Structure Prediction with a Geometric Permutation-Invariant Loss Function

ArXiv ID: 2509.00832

Authors: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin

Abstract: Crystalline structure prediction remains an open challenge in materials design. Despite recent advances in computational materials science, accurately predicting the three-dimensional crystal structures of organic materials--an essential first step for designing materials with targeted properties--remains elusive. In this work, we address the problem of molecular assembly, where a set $\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Existing state-of-the-art models typically rely on computationally expensive, iterative flow-matching approaches. We propose a novel loss function that correctly captures key geometric molecular properties while maintaining permutation invariance over $\mathcal{S}$. We achieve this via a differentiable linear assignment scheme based on the Sinkhorn algorithm. Remarkably, we show that even a simple regression using our method {\em SinkFast} significantly outperforms more complex flow-matching approaches on the COD-Cluster17 benchmark, a curated subset of the Crystallography Open Database (COD).

Comment: The paper proposes a novel geometric permutation-invariant loss function for crystal structure prediction, which is relevant to AI for Science with a focus on foundational research in molecular modeling.

Relevance: 8 Novelty: 8

30. Beyond Ensembles: Simulating All-Atom Protein Dynamics in a Learned Latent Space

ArXiv ID: 2509.02196

Authors: Aditya Sengar, Ali Hariri, Pierre Vandergheynst, Patrick Barth

Abstract: Simulating the long-timescale dynamics of biomolecules is a central challenge in computational science. While enhanced sampling methods can accelerate these simulations, they rely on pre-defined collective variables that are often difficult to identify. A recent generative model, LD-FPG, demonstrated that this problem could be bypassed by learning to sample the static equilibrium ensemble as all-atom deformations from a reference structure, establishing a powerful method for all-atom ensemble generation. However, while this approach successfully captures a system's probable conformations, it does not model the temporal evolution between them. Here we extend LD-FPG with a temporal propagator that operates within the learned latent space and compare three classes: (i) score-guided Langevin dynamics, (ii) Koopman-based linear operators, and (iii) autoregressive neural networks. Within a unified encoder-propagator-decoder framework, we evaluate long-horizon stability, backbone and side-chain ensemble fidelity, and functional free-energy landscapes. Autoregressive neural networks deliver the most robust long rollouts; score-guided Langevin best recovers side-chain thermodynamics when the score is well learned; and Koopman provides an interpretable, lightweight baseline that tends to damp fluctuations. These results clarify the trade-offs among propagators and offer practical guidance for latent-space simulators of all-atom protein dynamics.

Comment: The paper extends a generative model for simulating protein dynamics, relevant to AI for Science with a focus on foundational research in molecular modeling.

Relevance: 8 Novelty: 8

31. Towards Comprehensive Information-theoretic Multi-view Learning

ArXiv ID: 2509.02084

Authors: Long Shi, Yunshan Ye, Wenjie Wang, Tao Lei, Yu Zhao, Gang Kou, Badong Chen

Abstract: Information theory has inspired numerous advancements in multi-view learning. Most multi-view methods incorporating information-theoretic principles rely an assumption called multi-view redundancy which states that common information between views is necessary and sufficient for down-stream tasks. This assumption emphasizes the importance of common information for prediction, but inherently ignores the potential of unique information in each view that could be predictive to the task. In this paper, we propose a comprehensive information-theoretic multi-view learning framework named CIML, which discards the assumption of multi-view redundancy. Specifically, CIML considers the potential predictive capabilities of both common and unique information based on information theory. First, the common representation learning maximizes Gacs-Korner common information to extract shared features and then compresses this information to learn task-relevant representations based on the Information Bottleneck (IB). For unique representation learning, IB is employed to achieve the most compressed unique representation for each view while simultaneously minimizing the mutual information between unique and common representations, as well as among different unique representations. Importantly, we theoretically prove that the learned joint representation is predictively sufficient for the downstream task. Extensive experimental results have demonstrated the superiority of our model over several state-of-art methods. The code is released on CIML.

Comment: The paper proposes a new information-theoretic framework for multi-view learning, which is relevant to representation learning.