Personalized Daily ArXiv Papers 2025-09-03
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 71187 | 9933 | 81120 |
| Cost | $0.18 | $0.1 | $0.28 |
Total arXiv papers: 1163
Total scanned papers: 753
Total relevant papers: 59
Table of contents with paper titles:
-
Relative Trajectory Balance is equivalent to Trust-PCL Authors: Tristan Deleu, Padideh Nouri, Yoshua Bengio, Doina Precup
-
LongCat-Flash Technical Report Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhua Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su
-
Globally aware optimization with resurgence Authors: Wei Bu
-
Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching Authors: An B. Vuong, Michael T. McCann, Javier E. Santos, Yen Ting Lin
-
The Price of Sparsity: Sufficient Conditions for Sparse Recovery using Sparse and Sparsified Measurements Authors: Youssef Chaabouni, David Gamarnik
-
MoPEQ: Mixture of Mixed Precision Quantized Experts Authors: Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani
-
Universal Properties of Activation Sparsity in Modern Large Language Models Authors: Filip Szatkowski, Patryk B\k{e}dkowski, Alessio Devoto, Jan Dubi\'nski, Pasquale Minervini, Miko{\l}aj Pi\'orczy\'nski, Simone Scardapane, Bartosz W\'ojcik
-
ZeroQAT: Your Quantization-aware Training but Efficient Authors: Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan
-
MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper Authors: Runjia Zeng, Guangyan Sun, Qifan Wang, Tong Geng, Sohail Dianat, Xiaotian Han, Raghuveer Rao, Xueling Zhang, Cheng Han, Lifu Huang, Dongfang Liu
-
DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers Authors: Aman Sharma, Saeed Najafi, Parsa Farinneya, Benyamin Jamialahmadi, Marzieh S. Tahaei, Yuhe Fan, Mehdi Rezagholizadeh, Boxing Chen, Aref Jafari
-
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache Authors: Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin
-
Metis: Training Large Language Models with Advanced Low-Bit Quantization Authors: Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang
-
Geometric origin of adversarial vulnerability in deep learning Authors: Yixiong Ren, Wenkang Du, Jianhui Zhou, Haiping Huang
-
Beyond Universal Approximation Theorems: Algorithmic Uniform Approximation by Neural Networks Trained with Noisy Data Authors: Anastasis Kratsios, Tin Sum Cheng, Daniel Roy
-
REFRAG: Rethinking RAG based Decoding Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
-
ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition Authors: Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Yongseok Soh, Jesmin Jahan Tithi, Fabrizio Petrini, Jee Choi
-
LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving Authors: Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo
-
Learnable Loss Geometries with Mirror Descent for Scalable and Convergent Meta-Learning Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis
-
Fisher information flow in artificial neural networks Authors: Maximilian Weimar, Lukas M. Rachbauer, Ilya Starshynov, Daniele Faccio, Linara Adilova, Dorian Bouchet, Stefan Rotter
-
Causal Interpretation of Sparse Autoencoder Features in Vision Authors: Sangyu Han, Yearim Kim, Nojun Kwak
-
Cache Management for Mixture-of-Experts LLMs -- extended version Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon
-
MODE: Mixture of Document Experts for RAG Authors: Rahul Anand
-
Graph Contrastive Learning versus Untrained Baselines: The Role of Dataset Size Authors: Smayan Khanna, Doruk Efe G\"okmen, Risi Kondor, Vincenzo Vitelli
-
Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling Authors: Sachin Goyal, David Lopez-Paz, Kartik Ahuja
-
Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs Authors: Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, Pan Li
-
DaCe AD: Unifying High-Performance Automatic Differentiation for Machine Learning and Scientific Computing Authors: Afif Boudaoud, Alexandru Calotoiu, Marcin Copik, Torsten Hoefler
-
Theory Foundation of Physics-Enhanced Residual Learning Authors: Shixiao Liang, Wang Chen, Keke Long, Peng Zhang, Xiaopeng Li, Jintao Ke
-
Accelerating PDE Solvers with Equation-Recast Neural Operator Preconditioning Authors: Qiyun Cheng, Md Hossain Sahadath, Huihua Yang, Shaowu Pan, Wei Ji
-
Crystal Structure Prediction with a Geometric Permutation-Invariant Loss Function Authors: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin
-
Beyond Ensembles: Simulating All-Atom Protein Dynamics in a Learned Latent Space Authors: Aditya Sengar, Ali Hariri, Pierre Vandergheynst, Patrick Barth
-
Towards Comprehensive Information-theoretic Multi-view Learning Authors: Long Shi, Yunshan Ye, Wenjie Wang, Tao Lei, Yu Zhao, Gang Kou, Badong Chen
-
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons Authors: Chongwen Zhao, Kaizhu Huang
-
An Efficient GNNs-to-KANs Distillation via Self-Attention Dynamic Sampling with Potential for Consumer Electronics Edge Deployment Authors: Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li
-
Low Power Approximate Multiplier Architecture for Deep Neural Networks Authors: Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu
-
Preconditioned Regularized Wasserstein Proximal Sampling Authors: Hong Ye Tan, Stanley Osher, Wuchen Li
-
Differentiable Expectation-Maximisation and Applications to Gaussian Mixture Model Optimal Transport Authors: Samuel Bo\"it\'e, Eloi Tanguy, Julie Delon, Agn`es Desolneux, R\'emi Flamary
-
An Explainable Gaussian Process Auto-encoder for Tabular Data Authors: Wei Zhang, Brian Barr, John Paisley
-
Learning residue level protein dynamics with multiscale Gaussians Authors: Mihir Bafna, Bowen Jing, Bonnie Berger
-
Quantum Circuits for Quantum Convolutions: A Quantum Convolutional Autoencoder Authors: Javier Orduz, Pablo Rivas, Erich Baker
-
VASSO: Variance Suppression for Sharpness-Aware Minimization Authors: Bingcong Li, Yilang Zhang, Georgios B. Giannakis
-
Conditional-$t^3$VAE: Equitable Latent Space Allocation for Fair Generation Authors: Aymene Mohammed Bouayed, Samuel Deslauriers-Gauthier, Adrian Iaccovelli, David Naccache
-
Unsupervised Training of Vision Transformers with Synthetic Negatives Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
-
Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
-
Efficient Transformer-Inspired Variants of Physics-Informed Deep Operator Networks Authors: Zhi-Feng Wei, Wenqian Chen, Panos Stinis
-
MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
-
Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs Authors: Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, Jaegul Choo
-
Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective Authors: Shijie Wang, Li Zhang, Xinyan Liang, Yuhua Qian, Shen Hu
-
Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning Authors: Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea M{\o}rch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Martens, Julien Fauqueur
-
Counterfactual Sensitivity for Faithful Reasoning in Language Models Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
-
VISP: Volatility Informed Stochastic Projection for Adaptive Regularization Authors: Tanvir Islam
-
Equivariant U-Shaped Neural Operators for the Cahn-Hilliard Phase-Field Model Authors: Xiao Xue, M. F. P. ten Eikelder, Tianyue Yang, Yiqing Li, Kan He, Shuo Wang, Peter V. Coveney
-
Memory Limitations of Prompt Tuning in Transformers Authors: Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan
-
Causal representation learning from network data Authors: Jifan Zhang, Michelle M. Li, Elena Zheleva
-
Teaching AI to Remember: Insights from Brain-Inspired Replay in Continual Learning Authors: Jina Kim
-
GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping Authors: Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Reza Rawassizadeh
-
Double Descent and Overparameterization in Particle Physics Data Authors: Matthias Vigl, Lukas Heinrich
-
Optimized Weight Initialization on the Stiefel Manifold for Deep ReLU Neural Networks Authors: Hyungu Lee, Taehyeong Kim, Hayoung Choi
-
Generalization vs. Memorization in Autoregressive Deep Learning: Or, Examining Temporal Decay of Gradient Coherence Authors: James Amarel, Nicolas Hengartner, Robyn Miller, Kamaljeet Singh, Siddharth Mansingh, Arvind Mohan, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde
-
Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination Authors: Terry Jingchen Zhang, Gopal Dev, Ning Wang, Nicole Ni, Wenyuan Jiang, Yinya Huang, Bernhard Sch\"olkopf, Mrinmaya Sachan, Zhijing Jin
1. Relative Trajectory Balance is equivalent to Trust-PCL
ArXiv ID: 2509.01632
Authors: Tristan Deleu, Padideh Nouri, Yoshua Bengio, Doina Precup
Abstract: Recent progress in generative modeling has highlighted the importance of Reinforcement Learning (RL) for fine-tuning, with KL-regularized methods in particular proving to be highly effective for both autoregressive and diffusion models. Complementing this line of work, the Relative Trajectory Balance (RTB) objective was recently introduced in the context of Generative Flow Networks (GFlowNets) to serve the same role of improving fine-tuning in sequential generative models. Building on prior work linking GFlowNets and maximum-entropy RL, we establish in this paper an equivalence between RTB and Trust-PCL, an off-policy RL method with KL regularization. This equivalence situates RTB within the broader theoretical landscape of KL-regularized RL, and clarifies its relationship to earlier methods. Leveraging this insight, we revisit an illustrative example from the RTB paper and show that KL-regularized RL methods achieve comparable performance, offering an alternative perspective to what was previously reported.
Comment: Author match
2. LongCat-Flash Technical Report
ArXiv ID: 2509.01322
Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhua Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su
Abstract: We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of \$0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat
Comment: The paper introduces LongCat-Flash, a Mixture-of-Experts language model, focusing on architectural innovations and efficiency improvements.
Relevance: 10 Novelty: 8
3. Globally aware optimization with resurgence
ArXiv ID: 2509.01329
Authors: Wei Bu
Abstract: Modern optimization faces a fundamental challenge: local gradient-based methods provide no global information about the objective function $L$ landscape, often leading to suboptimal convergence and sensitivity to initialization. We introduce a novel optimization framework that leverages resurgence theory from complex analysis to extract global structural information from divergent asymptotic series. Our key insight is that the factorially divergent perturbative expansions of parameter space partition functions encode precise information about all critical objective function value in the landscape through their Borel transform singularities. The algorithm works by computing the statistical mechanical partition function $Z(g) = \int e^{-L(\theta)/g} d\theta$ for small coupling $g\ll 1$, extracting its asymptotic series coefficients, and identifying Borel plane singularities that correspond one-to-one with critical objective function values. These target values provide global guidance to local optimizers, enabling principled learning rate adaptation and escape from suboptimal regions. Unlike heuristic adaptive methods, targets are theoretically grounded in the geometry of the optimization landscape.
Comment: The paper introduces a novel optimization framework using resurgence theory, which is relevant to emerging trends in optimization.
Relevance: 9 Novelty: 9
4. Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching
ArXiv ID: 2509.00336
Authors: An B. Vuong, Michael T. McCann, Javier E. Santos, Yen Ting Lin
Abstract: Diffusion models are commonly interpreted as learning the score function, i.e., the gradient of the log-density of noisy data. However, this assumption implies that the target of learning is a conservative vector field, which is not enforced by the neural network architectures used in practice. We present numerical evidence that trained diffusion networks violate both integral and differential constraints required of true score functions, demonstrating that the learned vector fields are not conservative. Despite this, the models perform remarkably well as generative mechanisms. To explain this apparent paradox, we advocate a new theoretical perspective: diffusion training is better understood as flow matching to the velocity field of a Wasserstein Gradient Flow (WGF), rather than as score learning for a reverse-time stochastic differential equation. Under this view, the "probability flow" arises naturally from the WGF framework, eliminating the need to invoke reverse-time SDE theory and clarifying why generative sampling remains successful even when the neural vector field is not a true score. We further show that non-conservative errors from neural approximation do not necessarily harm density transport. Our results advocate for adopting the WGF perspective as a principled, elegant, and theoretically grounded framework for understanding diffusion generative models.
Comment: The paper reinterprets diffusion models through a new theoretical perspective, relevant to emerging trends in generative models.
Relevance: 9 Novelty: 9
5. The Price of Sparsity: Sufficient Conditions for Sparse Recovery using Sparse and Sparsified Measurements
ArXiv ID: 2509.01809
Authors: Youssef Chaabouni, David Gamarnik
Abstract: We consider the problem of recovering the support of a sparse signal using noisy projections. While extensive work has been done on the dense measurement matrix setting, the sparse setting remains less explored. In this work, we establish sufficient conditions on the sample size for successful sparse recovery using sparse measurement matrices. Bringing together our result with previously known necessary conditions, we discover that, in the regime where $ds/p \rightarrow +\infty$, sparse recovery in the sparse setting exhibits a phase transition at an information-theoretic threshold of $n_{\text{INF}}^{\text{SP}} = \Theta\left(s\log\left(p/s\right)/\log\left(ds/p\right)\right)$, where $p$ denotes the signal dimension, $s$ the number of non-zero components of the signal, and $d$ the expected number of non-zero components per row of measurement. This expression makes the price of sparsity explicit: restricting each measurement to $d$ non-zeros inflates the required sample size by a factor of $\log{s}/\log\left(ds/p\right)$, revealing a precise trade-off between sampling complexity and measurement sparsity. Additionally, we examine the effect of sparsifying an originally dense measurement matrix on sparse signal recovery. We prove in the regime of $s = \alpha p$ and $d = \psi p$ with $\alpha, \psi \in \left(0,1\right)$ and $\psi$ small that a sample of size $n^{\text{Sp-ified}}_{\text{INF}} = \Theta\left(p / \psi^2\right)$ is sufficient for recovery, subject to a certain uniform integrability conjecture, the proof of which is work in progress.
Comment: The paper provides theoretical insights into sparse recovery using sparse measurement matrices, relevant to model compression and sparsity.
Relevance: 9 Novelty: 8
6. MoPEQ: Mixture of Mixed Precision Quantized Experts
ArXiv ID: 2509.02512
Authors: Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani
Abstract: Large Language and Vision Models using a Mixture-of-Experts (MoE) architecture pose significant challenges for deployment due to their computational and memory demands. Mixed Precision Quantization assigns different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. In this work, we propose a Post Training Quantization algorithm, MoPEQ, that assigns optimal bit width to each expert. Our method balances accuracy and model size by analyzing each expert's sensitivity using Hessian trace approximation instead of relying on the activation frequency of the expert. This per-expert granularity approach clusters similar experts to maintain model performance while reducing memory requirements. The experimental results on VLMEvalKit benchmark datasets using State-of-the-art VLMs Deepseek-VL2 -tiny, -small, -base, and MolmoE models demonstrate that our mixed precision quantized MoEs achieve competitive accuracy with substantial improvements in memory footprint compared to uniform-precision baseline methods. We perform a comprehensive study to analyze the impact of expert activation frequency and sensitivity using Hessian trace approximation at both layer-wise and model-wide expert precision allocation of 2, 3, and 4 bits to provide a thorough understanding of mixed precision quantization of VLM-MoEs.
Comment: The paper introduces a mixed precision quantization method for MoE architectures, relevant to model compression and efficiency.
Relevance: 9 Novelty: 8
7. Universal Properties of Activation Sparsity in Modern Large Language Models
ArXiv ID: 2509.00454
Authors: Filip Szatkowski, Patryk B\k{e}dkowski, Alessio Devoto, Jan Dubi\'nski, Pasquale Minervini, Miko{\l}aj Pi\'orczy\'nski, Simone Scardapane, Bartosz W\'ojcik
Abstract: Input-dependent activation sparsity is a notable property of deep learning models, which has been extensively studied in networks with ReLU activations and is associated with efficiency, robustness, and interpretability. However, the approaches developed for ReLU-based models depend on exact zero activations and do not transfer directly to modern large language models~(LLMs), which have abandoned ReLU in favor of other activation functions. As a result, current work on activation sparsity in LLMs is fragmented, model-specific, and lacks consensus on which components to target. We propose a general framework to assess sparsity robustness and present a systematic study of the phenomenon in the FFN layers of modern LLMs, including diffusion LLMs. Our findings reveal universal patterns of activation sparsity in LLMs, provide insights into this phenomenon, and offer practical guidelines for exploiting it in model design and acceleration.
Comment: The paper studies activation sparsity in LLMs, providing insights into model efficiency and behavior, which is relevant to representation learning and model compression.
Relevance: 9 Novelty: 8
8. ZeroQAT: Your Quantization-aware Training but Efficient
ArXiv ID: 2509.00031
Authors: Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan
Abstract: Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. While quantization-aware training (QAT) provides a principled solution, its reliance on backpropagation incurs prohibitive data, time, and memory costs, limiting its practicality. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework. ZeroQAT leverages forward-only gradient estimation to eliminate the need for backpropagation, significantly reducing computational and memory overhead while retaining the benefits of end-to-end optimization. Moreover, ZeroQAT jointly learns quantized weights, weight clipping thresholds, and equivalent transformations to mitigate quantization error and handle activation outliers. Experiments demonstrate that ZeroQAT achieves the efficiency of PTQ while retaining the accuracy of QAT, offering a practical solution for high-quality low-bit quantization of LLMs.
Comment: The paper introduces ZeroQAT, a novel quantization-aware training framework, which is relevant to model compression and efficiency.
Relevance: 9 Novelty: 8
9. MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper
ArXiv ID: 2509.00996
Authors: Runjia Zeng, Guangyan Sun, Qifan Wang, Tong Geng, Sohail Dianat, Xiaotian Han, Raghuveer Rao, Xueling Zhang, Cheng Han, Lifu Huang, Dongfang Liu
Abstract: Considering deep neural networks as manifold mappers, the pretrain-then-fine-tune paradigm can be interpreted as a two-stage process: pretrain establishes a broad knowledge base, and fine-tune adjusts the model parameters to activate specific neural pathways to align with the target manifold. Although prior fine-tuning approaches demonstrate success, their rigid parameter space limits their ability to dynamically activate appropriate neural pathways, rendering them ill-equipped to adapt flexibly to the diverse and evolving data distributions. In light of this view, we propose a novel approach, Mixture of Expert Prompt Tuning (MEPT), as an effective and efficient manifold-mapping framework. MEPT leverages the Mixture of Experts architecture by integrating multiple prompt experts to adaptively learn diverse and non-stationary data distributions. Empirical evaluations demonstrate that MEPT outperforms several state-of-the-art parameter efficient baselines on SuperGLUE, achieving notable improvements in mean accuracy (e.g., 1.94%) while significantly reducing activated prompts by 79.25%. The effectiveness of MEPT is further supported by theoretical insights from manifold learning and validated through neural activation pathway visualization results. Our code is avaliable at https://github.com/runtsang/MEPT.
Comment: The paper introduces Mixture of Expert Prompt Tuning (MEPT), which is relevant to the Mixture-of-Experts architecture.
Relevance: 9 Novelty: 8
10. DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers
ArXiv ID: 2509.00925
Authors: Aman Sharma, Saeed Najafi, Parsa Farinneya, Benyamin Jamialahmadi, Marzieh S. Tahaei, Yuhe Fan, Mehdi Rezagholizadeh, Boxing Chen, Aref Jafari
Abstract: Transformers achieve state-of-the-art results across many tasks, but their uniform application of quadratic self-attention to every token at every layer makes them computationally expensive. We introduce DTRNet (Dynamic Token Routing Network), an improved Transformer architecture that allows tokens to dynamically skip the quadratic cost of cross-token mixing while still receiving lightweight linear updates. By preserving the MLP module and reducing the attention cost for most tokens to linear, DTRNet ensures that every token is explicitly updated while significantly lowering overall computation. This design offers an efficient and effective alternative to standard dense attention. Once trained, DTRNet blocks routes only ~10% of tokens through attention at each layer while maintaining performance comparable to a full Transformer. It consistently outperforms routing-based layer skipping methods such as MoD and D-LLM in both accuracy and memory at matched FLOPs, while routing fewer tokens to full attention. Its efficiency gains, scales with sequence length, offering significant reduction in FLOPs for long-context inputs. By decoupling token updates from attention mixing, DTRNet substantially reduces the quadratic share of computation, providing a simple, efficient, and scalable alternative to Transformers.
Comment: The paper introduces DTRNet, a dynamic token routing network to reduce quadratic costs in Transformers, relevant to model architecture.
Relevance: 9 Novelty: 8
11. KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
ArXiv ID: 2509.00579
Authors: Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin
Abstract: Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV) cache, which can scale to multiple gigabytes as sequence length and batch size increase. In this paper, we present KVComp, a generic and efficient KV cache management framework optimized for long-text generation that synergistically works with both latency-critical and throughput-critical inference systems. KVComp employs novel lossy compression techniques specifically designed for KV cache data characteristics, featuring careful co-design of compression algorithms and system architecture. Our approach maintains compatibility with the growing nature of KV cache while preserving high computational efficiency. Experimental results show that KVComp achieves on average 47\% and up to 83\% higher memory reduction rate compared to existing methods with little/no model accuracy degradation. Furthermore, KVComp achieves extremely high execution throughput, effectively reducing decompression overhead and, in some cases, even accelerating the matrix-vector multiplication operation and outperform cuBLAS-based attention kernels with less data movement.
Comment: The paper introduces a lossy compression framework for KV cache in LLMs, relevant to model compression and efficiency.
Relevance: 9 Novelty: 8
12. Metis: Training Large Language Models with Advanced Low-Bit Quantization
ArXiv ID: 2509.00404
Authors: Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang
Abstract: This work identifies anisotropic parameter distributions as a fundamental barrier to training large language models (LLMs) with low-bit quantization: a few dominant singular values create wide numerical ranges that conflict with the inherent bias of block-wise quantization. This bias disproportionately preserves high-magnitude values while discarding smaller ones, causing training instability and low model performance. This work introduces Metis, a training framework that combines (i) spectral decomposition with random embedding to efficiently disentangle dominant from long-tail components, compressing broad distributions into quantization-friendly narrow ranges; (ii) adaptive learning rates in the spectral domain to amplify underrepresented directions and better capture diverse features critical for performance; and (iii) a dual-range regularizer that jointly constrains numerical precision and parameter range distribution, ensuring stable, unbiased low-bit training. With Metis, FP8 training surpasses FP32 baselines, and FP4 training achieves accuracy comparable to FP32, paving the way for robust and scalable LLM training under advanced low-bit quantization. The code implementation for Metis is available at: https://github.com/typename-yyf/Metis-quantization.
Comment: The paper introduces Metis, a framework for training large language models with low-bit quantization, addressing model compression through quantization and efficiency improvements.
Relevance: 9 Novelty: 8
13. Geometric origin of adversarial vulnerability in deep learning
ArXiv ID: 2509.01235
Authors: Yixiong Ren, Wenkang Du, Jianhui Zhou, Haiping Huang
Abstract: How to balance training accuracy and adversarial robustness has become a challenge since the birth of deep learning. Here, we introduce a geometry-aware deep learning framework that leverages layer-wise local training to sculpt the internal representations of deep neural networks. This framework promotes intra-class compactness and inter-class separation in feature space, leading to manifold smoothness and adversarial robustness against white or black box attacks. The performance can be explained by an energy model with Hebbian coupling between elements of the hidden representation. Our results thus shed light on the physics of learning in the direction of alignment between biological and artificial intelligence systems. Using the current framework, the deep network can assimilate new information into existing knowledge structures while reducing representation interference.
Comment: The paper presents a geometry-aware deep learning framework for improving adversarial robustness, focusing on representation learning and training dynamics.
Relevance: 9 Novelty: 8
14. Beyond Universal Approximation Theorems: Algorithmic Uniform Approximation by Neural Networks Trained with Noisy Data
ArXiv ID: 2509.00924
Authors: Anastasis Kratsios, Tin Sum Cheng, Daniel Roy
Abstract: At its core, machine learning seeks to train models that reliably generalize beyond noisy observations; however, the theoretical vacuum in which state-of-the-art universal approximation theorems (UATs) operate isolates them from this goal, as they assume noiseless data and allow network parameters to be chosen freely, independent of algorithmic realism. This paper bridges that gap by introducing an architecture-specific randomized training algorithm that constructs a uniform approximator from $N$ noisy training samples on the $d$-dimensional cube $[0,1]^d$. Our trained neural networks attain the minimax-optimal quantity of \textit{trainable} (non-random) parameters, subject to logarithmic factors which vanish under the idealized noiseless sampling assumed in classical UATs. Additionally, our trained models replicate key behaviours of real-world neural networks, absent in standard UAT constructions, by: (1) exhibiting sub-linear parametric complexity when fine-tuning on structurally related and favourable out-of-distribution tasks, (2) exactly interpolating the training data, and (3) maintaining reasonable Lipschitz regularity (after the initial clustering attention layer). These properties bring state-of-the-art UATs closer to practical machine learning, shifting the central open question from algorithmic implementability with noisy samples to whether stochastic gradient descent can achieve comparable guarantees.
Comment: The paper extends universal approximation theorems to noisy data, providing theoretical insights into neural network behavior, relevant to representation learning.
Relevance: 9 Novelty: 8
15. REFRAG: Rethinking RAG based Decoding
ArXiv ID: 2509.01092
Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG require specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85 the time-to-first-token acceleration (3.75 improvement to previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.
Comment: The paper proposes REFRAG, an efficient decoding framework for retrieval-augmented generation (RAG) that exploits sparsity structure to improve latency, which aligns with model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 8
16. ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition
ArXiv ID: 2509.00280
Authors: Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Yongseok Soh, Jesmin Jahan Tithi, Fabrizio Petrini, Jee Choi
Abstract: Tensor decomposition (TD) is essential for analyzing high-dimensional sparse data, yet its irregular computations and memory-access patterns pose major performance challenges on modern parallel processors. Prior works rely on expert-designed sparse tensor formats that fail to adapt to irregular tensor shapes and/or highly variable data distributions. We present the reinforcement-learned adaptive tensor encoding (ReLATE) framework, a novel learning-augmented method that automatically constructs efficient sparse tensor representations without labeled training samples. ReLATE employs an autonomous agent that discovers optimized tensor encodings through direct interaction with the TD environment, leveraging a hybrid model-free and model-based algorithm to learn from both real and imagined actions. Moreover, ReLATE introduces rule-driven action masking and dynamics-informed action filtering mechanisms that ensure functionally correct tensor encoding with bounded execution time, even during early learning stages. By automatically adapting to both irregular tensor shapes and data distributions, ReLATE generates sparse tensor representations that consistently outperform expert-designed formats across diverse sparse tensor data sets, achieving up to 2X speedup compared to the best sparse format, with a geometric-mean speedup of 1.4-1.46X.
Comment: The paper presents ReLATE, a framework for learning efficient sparse tensor encodings, which aligns with model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 8
17. LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
ArXiv ID: 2509.01229
Authors: Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, Xiang Luo, Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo
Abstract: Quantization is a critical technique for accelerating LLM inference by reducing memory footprint and improving computational efficiency. Among various schemes, 4-bit weight and 8-bit activation quantization (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 GEMM kernels fall short in practice due to inefficient dequantization on CUDA Cores, which cannot keep pace with the high throughput of Tensor Cores. In this paper, we present LiquidGEMM, a hardware-efficient W4A8 GEMM kernel for efficient LLM serving. LiquidGEMM designs two key techniques: LiquidQuant, a hardware-efficient quantization method that enables fast, overflow-safe dequantization using just two arithmetic instructions per four elements; and an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and MMA across warp groups without software synchronization or redundant memory traffic. Experimental results show that LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup. Compared to various quantized GEMM kernels in NVIDIA TensorRT-LLM, LiquidGEMM delivers 1.12-1.63x performance gains, and achieves up to 1.63x system-level speedup.
Comment: The paper presents LiquidGEMM, a hardware-efficient GEMM kernel for LLM serving, focusing on quantization and efficiency, which aligns with model compression.
Relevance: 9 Novelty: 8
18. Learnable Loss Geometries with Mirror Descent for Scalable and Convergent Meta-Learning
ArXiv ID: 2509.02418
Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis
Abstract: Utilizing task-invariant knowledge acquired from related tasks as prior information, meta-learning offers a principled approach to learning a new task with limited data records. Sample-efficient adaptation of this prior information is a major challenge facing meta-learning, and plays an important role because it facilitates training the sought task-specific model with just a few optimization steps. Past works deal with this challenge through preconditioning that speeds up convergence of the per-task training. Though effective in representing locally quadratic loss curvatures, simple linear preconditioning can be hardly potent with complex loss geometries. Instead of relying on a quadratic distance metric, the present contribution copes with complex loss metrics by learning a versatile distance-generating function, which induces a nonlinear mirror map to effectively capture and optimize a wide range of loss geometries. With suitable parameterization, this generating function is effected by an expressive neural network that is provably a valid distance. Analytical results establish convergence of not only the proposed method, but also all meta-learning approaches based on preconditioning. To attain gradient norm less than $\epsilon$, the convergence rate of $\mathcal{O}(\epsilon^{-2})$ is on par with standard gradient-based meta-learning methods. Numerical tests on few-shot learning datasets demonstrate the superior empirical performance of the novel algorithm, as well as its rapid per-task convergence, which markedly reduces the number of adaptation steps, hence also accommodating large-scale meta-learning models.
Comment: The paper presents a novel approach to meta-learning by learning a versatile distance-generating function, which aligns with representation learning and training dynamics in neural networks.
Relevance: 9 Novelty: 8
19. Fisher information flow in artificial neural networks
ArXiv ID: 2509.02407
Authors: Maximilian Weimar, Lukas M. Rachbauer, Ilya Starshynov, Daniele Faccio, Linara Adilova, Dorian Bouchet, Stefan Rotter
Abstract: The estimation of continuous parameters from measured data plays a central role in many fields of physics. A key tool in understanding and improving such estimation processes is the concept of Fisher information, which quantifies how information about unknown parameters propagates through a physical system and determines the ultimate limits of precision. With Artificial Neural Networks (ANNs) gradually becoming an integral part of many measurement systems, it is essential to understand how they process and transmit parameter-relevant information internally. Here, we present a method to monitor the flow of Fisher information through an ANN performing a parameter estimation task, tracking it from the input to the output layer. We show that optimal estimation performance corresponds to the maximal transmission of Fisher information, and that training beyond this point results in information loss due to overfitting. This provides a model-free stopping criterion for network training-eliminating the need for a separate validation dataset. To demonstrate the practical relevance of our approach, we apply it to a network trained on data from an imaging experiment, highlighting its effectiveness in a realistic physical setting.
Comment: The paper introduces a method to monitor Fisher information flow in neural networks, providing insights into how networks process information, relevant to representation learning.
Relevance: 9 Novelty: 8
20. Causal Interpretation of Sparse Autoencoder Features in Vision
ArXiv ID: 2509.00749
Authors: Sangyu Han, Yearim Kim, Nojun Kwak
Abstract: Understanding what sparse auto-encoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature's activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with-but does not cause-the feature's firing. We propose Causal Feature Explanation (CaFE), which leverages Effective Receptive Field (ERF). We consider each activation of an SAE feature to be a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a "roaring face" feature that requires the co-occurrence of eyes and nose, rather than merely an open mouth). Patch insertion tests confirm that CaFE more effectively recovers or suppresses feature activations than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.
Comment: The paper proposes a method for causal interpretation of sparse autoencoder features, aligning with representation learning and feature learning.
Relevance: 9 Novelty: 8
21. Cache Management for Mixture-of-Experts LLMs -- extended version
ArXiv ID: 2509.02408
Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management, since they typically involve billions of parameters. To this end, architectures based on Mixture-of-Experts have been proposed, which aim to reduce the size of the parameters that are activated when producing a token. This raises the equally critical issue of efficiently managing the limited cache of the system, in that frequently used experts should be stored in the fast cache rather than in the slower secondary memory. In this work, we introduce and study a new paging problem that models expert management optimization. Our formulation captures both the layered architecture of LLMs and the requirement that experts are cached efficiently. We first present lower bounds on the competitive ratio of both deterministic and randomized algorithms, which show that under mild assumptions, LRU-like policies have good theoretical competitive performance. We then propose a layer-based extension of LRU that is tailored to the problem at hand. Extensive simulations on both synthetic datasets and actual traces of MoE usage show that our algorithm outperforms policies for the classic paging problem, such as the standard LRU.
Comment: The paper discusses cache management for Mixture-of-Experts LLMs, which is relevant to model architecture and efficiency.
Relevance: 9 Novelty: 7
22. MODE: Mixture of Document Experts for RAG
ArXiv ID: 2509.00100
Authors: Rahul Anand
Abstract: Retrieval-Augmented Generation (RAG) often relies on large vector databases and cross-encoders tuned for large-scale corpora, which can be excessive for small, domain-specific collections. We present MODE (Mixture of Document Experts), a lightweight alternative that replaces fine-grained nearest-neighbor search with cluster-and-route retrieval. Documents are embedded, grouped into semantically coherent clusters, and represented by cached centroids. At query time, we route to the top centroid(s) and retrieve context only within those clusters, eliminating external vector-database infrastructure and reranking while keeping latency low. On HotpotQA and SQuAD corpora with 100-500 chunks, MODE matches or exceeds a dense-retrieval baseline in answer quality while reducing end-to-end retrieval time. Ablations show that cluster granularity and multi-cluster routing control the recall/precision trade-off, and that tighter clusters improve downstream accuracy. MODE offers a practical recipe for small and medium corpora where simplicity, speed, and topical focus matter.
Comment: The paper introduces a Mixture of Document Experts (MODE) for retrieval-augmented generation, which aligns with the interest in Mixture-of-Experts architectures.
Relevance: 9 Novelty: 7
23. Graph Contrastive Learning versus Untrained Baselines: The Role of Dataset Size
ArXiv ID: 2509.01541
Authors: Smayan Khanna, Doruk Efe G\"okmen, Risi Kondor, Vincenzo Vitelli
Abstract: Graph Contrastive Learning (GCL) has emerged as a leading paradigm for self- supervised learning on graphs, with strong performance reported on standardized datasets and growing applications ranging from genomics to drug discovery. We ask a basic question: does GCL actually outperform untrained baselines? We find that GCL's advantage depends strongly on dataset size and task difficulty. On standard datasets, untrained Graph Neural Networks (GNNs), simple multilayer perceptrons, and even handcrafted statistics can rival or exceed GCL. On the large molecular dataset ogbg-molhiv, we observe a crossover: GCL lags at small scales but pulls ahead beyond a few thousand graphs, though this gain eventually plateaus. On synthetic datasets, GCL accuracy approximately scales with the logarithm of the number of graphs and its performance gap (compared with untrained GNNs) varies with respect to task complexity. Moving forward, it is crucial to identify the role of dataset size in benchmarks and applications, as well as to design GCL algorithms that avoid performance plateaus.
Comment: The paper discusses Graph Contrastive Learning, which is a form of representation learning, and examines its performance relative to untrained baselines, providing insights into training dynamics and dataset size effects.
Relevance: 9 Novelty: 7
24. Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling
ArXiv ID: 2509.01649
Authors: Sachin Goyal, David Lopez-Paz, Kartik Ahuja
Abstract: In the past year, distillation has seen a renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on new paradigms that are key to modern LLMs, such as test-time scaling and in-context learning, remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models that exhibit remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning capabilities, particularly the one modeled via induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox of a bigram model, which helps us isolate the common principal factor behind our observations. Finally, using these insights, we shed light on various design choices for pretraining that should help practitioners going forward.
Comment: The paper explores the effects of distillation on LLM pretraining, particularly on test-time scaling and in-context learning, providing insights into LLM behavior.
Relevance: 9 Novelty: 7
25. Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs
ArXiv ID: 2509.00096
Authors: Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, Pan Li
Abstract: Neural network pruning has emerged as a promising approach for deploying LLMs in low-resource scenarios while preserving downstream task performance. However, for the first time, we reveal that such pruning disrupts LLMs' internal activation features crucial for lie detection, where probing classifiers (typically small logistic regression models) trained on these features assess the truthfulness of LLM-generated statements. This discovery raises a crucial open question: how can we prune LLMs without sacrificing these critical lie detection capabilities? Our investigation further reveals that naively adjusting layer-wise pruning sparsity based on importance inadvertently removes crucial weights, failing to improve lie detection performance despite its reliance on the most crucial LLM layer. To address this issue, we propose Truthful Pruning aligned by Layer-wise Outliers (TPLO), which places greater emphasis on layers with more activation outliers and stronger discriminative features simultaneously. This preserves LLMs' original performance while retaining critical features of inner states needed for robust lie detection. Moreover, we introduce a prompting rule to enrich the TruthfulQA benchmark for better calibrating LLM pruning. Empirical results show that our approach improves the hallucination detection for pruned LLMs (achieving 88% accuracy at 50% sparsity) and enhances their performance on TruthfulQA.
Comment: The paper addresses pruning in LLMs while preserving truthfulness, which aligns with model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 7
26. DaCe AD: Unifying High-Performance Automatic Differentiation for Machine Learning and Scientific Computing
ArXiv ID: 2509.02197
Authors: Afif Boudaoud, Alexandru Calotoiu, Marcin Copik, Torsten Hoefler
Abstract: Automatic differentiation (AD) is a set of techniques that systematically applies the chain rule to compute the gradients of functions without requiring human intervention. Although the fundamentals of this technology were established decades ago, it is experiencing a renaissance as it plays a key role in efficiently computing gradients for backpropagation in machine learning algorithms. AD is also crucial for many applications in scientific computing domains, particularly emerging techniques that integrate machine learning models within scientific simulations and schemes. Existing AD frameworks have four main limitations: limited support of programming languages, requiring code modifications for AD compatibility, limited performance on scientific computing codes, and a naive store-all solution for forward-pass data required for gradient calculations. These limitations force domain scientists to manually compute the gradients for large problems. This work presents DaCe AD, a general, efficient automatic differentiation engine that requires no code modifications. DaCe AD uses a novel ILP-based algorithm to optimize the trade-off between storing and recomputing to achieve maximum performance within a given memory constraint. We showcase the generality of our method by applying it to NPBench, a suite of HPC benchmarks with diverse scientific computing patterns, where we outperform JAX, a Python framework with state-of-the-art general AD capabilities, by more than 92 times on average without requiring any code changes.
Comment: The paper presents DaCe AD, an efficient automatic differentiation engine, which is relevant to emerging trends in computational efficiency.
Relevance: 8 Novelty: 8
27. Theory Foundation of Physics-Enhanced Residual Learning
ArXiv ID: 2509.00348
Authors: Shixiao Liang, Wang Chen, Keke Long, Peng Zhang, Xiaopeng Li, Jintao Ke
Abstract: Intensive studies have been conducted in recent years to integrate neural networks with physics models to balance model accuracy and interpretability. One recently proposed approach, named Physics-Enhanced Residual Learning (PERL), is to use learning to estimate the residual between the physics model prediction and the ground truth. Numeral examples suggested that integrating such residual with physics models in PERL has three advantages: (1) a reduction in the number of required neural network parameters; (2) faster convergence rates; and (3) fewer training samples needed for the same computational precision. However, these numerical results lack theoretical justification and cannot be adequately explained. This paper aims to explain these advantages of PERL from a theoretical perspective. We investigate a general class of problems with Lipschitz continuity properties. By examining the relationships between the bounds to the loss function and residual learning structure, this study rigorously proves a set of theorems explaining the three advantages of PERL. Several numerical examples in the context of automated vehicle trajectory prediction are conducted to illustrate the proposed theorems. The results confirm that, even with significantly fewer training samples, PERL consistently achieves higher accuracy than a pure neural network. These results demonstrate the practical value of PERL in real world autonomous driving applications where corner case data are costly or hard to obtain. PERL therefore improves predictive performance while reducing the amount of data required.
Comment: The paper provides a theoretical foundation for physics-enhanced residual learning, which is relevant to AI for science and emerging trends.
Relevance: 8 Novelty: 8
28. Accelerating PDE Solvers with Equation-Recast Neural Operator Preconditioning
ArXiv ID: 2509.01416
Authors: Qiyun Cheng, Md Hossain Sahadath, Huihua Yang, Shaowu Pan, Wei Ji
Abstract: The computational overhead of traditional numerical solvers for partial differential equations (PDEs) remains a critical bottleneck for large-scale parametric studies and design optimization. We introduce a Minimal-Data Parametric Neural Operator Preconditioning (MD-PNOP) framework, which establishes a new paradigm for accelerating parametric PDE solvers while strictly preserving physical constraints. The key idea is to recast the residual from parameter deviation as additional source term, where any trained neural operator can be used to refine the solution in an offline fashion. This directly addresses the fundamental extrapolation limitation of neural operators, enabling extrapolative generalization of any neural operator trained at a single parameter setting across a wide range of configurations without any retraining. The neural operator predictions are then embedded into iterative PDE solvers as improved initial guesses, thereby reducing convergence iterations without sacrificing accuracy. Unlike purely data-driven approaches, MD-PNOP guarantees that the governing equations remain fully enforced, eliminating concerns about loss of physics or interpretability. The framework is architecture-agnostic and is demonstrated using both Deep Operator Networks (DeepONet) and Fourier Neural Operators (FNO) for Boltzmann transport equation solvers in neutron transport applications. We demonstrated that neural operators trained on a single set of constant parameters successfully accelerate solutions with heterogeneous, sinusoidal, and discontinuous parameter distributions. Besides, MD-PNOP consistently achieves ~50% reduction in computational time while maintaining full order fidelity for fixed-source, single-group eigenvalue, and multigroup coupled eigenvalue problems.
Comment: The paper introduces a new paradigm for accelerating PDE solvers with neural operator preconditioning, which is relevant to AI for science and emerging trends.
Relevance: 8 Novelty: 8
29. Crystal Structure Prediction with a Geometric Permutation-Invariant Loss Function
ArXiv ID: 2509.00832
Authors: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin
Abstract: Crystalline structure prediction remains an open challenge in materials design. Despite recent advances in computational materials science, accurately predicting the three-dimensional crystal structures of organic materials--an essential first step for designing materials with targeted properties--remains elusive. In this work, we address the problem of molecular assembly, where a set $\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Existing state-of-the-art models typically rely on computationally expensive, iterative flow-matching approaches. We propose a novel loss function that correctly captures key geometric molecular properties while maintaining permutation invariance over $\mathcal{S}$. We achieve this via a differentiable linear assignment scheme based on the Sinkhorn algorithm. Remarkably, we show that even a simple regression using our method {\em SinkFast} significantly outperforms more complex flow-matching approaches on the COD-Cluster17 benchmark, a curated subset of the Crystallography Open Database (COD).
Comment: The paper proposes a novel geometric permutation-invariant loss function for crystal structure prediction, which is relevant to AI for Science with a focus on foundational research in molecular modeling.
Relevance: 8 Novelty: 8
30. Beyond Ensembles: Simulating All-Atom Protein Dynamics in a Learned Latent Space
ArXiv ID: 2509.02196
Authors: Aditya Sengar, Ali Hariri, Pierre Vandergheynst, Patrick Barth
Abstract: Simulating the long-timescale dynamics of biomolecules is a central challenge in computational science. While enhanced sampling methods can accelerate these simulations, they rely on pre-defined collective variables that are often difficult to identify. A recent generative model, LD-FPG, demonstrated that this problem could be bypassed by learning to sample the static equilibrium ensemble as all-atom deformations from a reference structure, establishing a powerful method for all-atom ensemble generation. However, while this approach successfully captures a system's probable conformations, it does not model the temporal evolution between them. Here we extend LD-FPG with a temporal propagator that operates within the learned latent space and compare three classes: (i) score-guided Langevin dynamics, (ii) Koopman-based linear operators, and (iii) autoregressive neural networks. Within a unified encoder-propagator-decoder framework, we evaluate long-horizon stability, backbone and side-chain ensemble fidelity, and functional free-energy landscapes. Autoregressive neural networks deliver the most robust long rollouts; score-guided Langevin best recovers side-chain thermodynamics when the score is well learned; and Koopman provides an interpretable, lightweight baseline that tends to damp fluctuations. These results clarify the trade-offs among propagators and offer practical guidance for latent-space simulators of all-atom protein dynamics.
Comment: The paper extends a generative model for simulating protein dynamics, relevant to AI for Science with a focus on foundational research in molecular modeling.
Relevance: 8 Novelty: 8
31. Towards Comprehensive Information-theoretic Multi-view Learning
ArXiv ID: 2509.02084
Authors: Long Shi, Yunshan Ye, Wenjie Wang, Tao Lei, Yu Zhao, Gang Kou, Badong Chen
Abstract: Information theory has inspired numerous advancements in multi-view learning. Most multi-view methods incorporating information-theoretic principles rely an assumption called multi-view redundancy which states that common information between views is necessary and sufficient for down-stream tasks. This assumption emphasizes the importance of common information for prediction, but inherently ignores the potential of unique information in each view that could be predictive to the task. In this paper, we propose a comprehensive information-theoretic multi-view learning framework named CIML, which discards the assumption of multi-view redundancy. Specifically, CIML considers the potential predictive capabilities of both common and unique information based on information theory. First, the common representation learning maximizes Gacs-Korner common information to extract shared features and then compresses this information to learn task-relevant representations based on the Information Bottleneck (IB). For unique representation learning, IB is employed to achieve the most compressed unique representation for each view while simultaneously minimizing the mutual information between unique and common representations, as well as among different unique representations. Importantly, we theoretically prove that the learned joint representation is predictively sufficient for the downstream task. Extensive experimental results have demonstrated the superiority of our model over several state-of-art methods. The code is released on CIML.
Comment: The paper proposes a new information-theoretic framework for multi-view learning, which is relevant to representation learning.
Relevance: 8 Novelty: 7
32. Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
ArXiv ID: 2509.01631
Authors: Chongwen Zhao, Kaizhu Huang
Abstract: Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation, a technique known as "Jailbreak." While some studies have achieved defenses against jailbreak attacks by modifying output distributions or detecting harmful content, the exact rationale still remains elusive. In this work, we present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. Unlike existing approaches, our method projects the model's internal representation into a more consistent and interpretable vocabulary space. We then show that adjusting the activation of safety-related neurons can effectively control the model's behavior with a mean ASR higher than 97%. Building on this insight, we propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness against jailbreaks. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms all four baseline defenses. These findings offer a new perspective on understanding and defending against jailbreak attacks.
Comment: The paper presents a neuron-level interpretability method for LLMs, relevant to theoretical insights into LLM behavior.
Relevance: 8 Novelty: 7
33. An Efficient GNNs-to-KANs Distillation via Self-Attention Dynamic Sampling with Potential for Consumer Electronics Edge Deployment
ArXiv ID: 2509.00560
Authors: Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li
Abstract: Knowledge distillation (KD) is crucial for deploying deep learning models in resource-constrained edge environments, particularly within the consumer electronics sector, including smart home devices, wearable technology, and mobile terminals. These applications place higher demands on model compression and inference speed, necessitating the transfer of knowledge from Graph Neural Networks (GNNs) to more efficient Multi-Layer Perceptron (MLP) models. However, due to their fixed activation functions and fully connected architecture, MLPs face challenges in rapidly capturing the complex neighborhood dependencies learned by GNNs, thereby limiting their performance in edge environments. To address these limitations, this paper introduces an innovative from GNNs to Kolmogorov-Arnold Networks (KANs) knowledge distillation framework-Self Attention Dynamic Sampling Distillation (SA-DSD). This study improved Fourier KAN (FR-KAN) and replaced MLP with the improved FR-KAN+ as the student model. Through the incorporation of learnable frequency bases and phase-shift mechanisms, along with algorithmic optimization, FR-KAN significantly improves its nonlinear fitting capability while effectively reducing computational complexity. Building on this, a margin-level sampling probability matrix, based on teacher-student prediction consistency, is constructed, and an adaptive weighted loss mechanism is designed to mitigate performance degradation in the student model due to the lack of explicit neighborhood aggregation. Extensive experiments conducted on six real-world datasets demonstrate that SA-DSD achieves performance improvements of 3.05%-3.62% over three GNN teacher models and 15.61% over the FR-KAN+ model. Moreover, when compared with key benchmark models, SA-DSD achieves a 16.96x reduction in parameter count and a 55.75% decrease in inference time.
Comment: The paper discusses knowledge distillation and model compression techniques, which are relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
34. Low Power Approximate Multiplier Architecture for Deep Neural Networks
ArXiv ID: 2509.00764
Authors: Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu
Abstract: This paper proposes an low power approximate multiplier architecture for deep neural network (DNN) applications. A 4:2 compressor, introducing only a single combination error, is designed and integrated into an 8x8 unsigned multiplier. This integration significantly reduces the usage of exact compressors while preserving low error rates. The proposed multiplier is employed within a custom convolution layer and evaluated on neural network tasks, including image recognition and denoising. Hardware evaluation demonstrates that the proposed design achieves up to 30.24% energy savings compared to the best among existing multipliers. In image denoising, the custom approximate convolution layer achieves improved Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) compared to other approximate designs. Additionally, when applied to handwritten digit recognition, the model maintains high classification accuracy. These results demonstrate that the proposed architecture offers a favorable balance between energy efficiency and computational precision, making it suitable for low-power AI hardware implementations.
Comment: The paper proposes a low power approximate multiplier architecture, which is relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
35. Preconditioned Regularized Wasserstein Proximal Sampling
ArXiv ID: 2509.01685
Authors: Hong Ye Tan, Stanley Osher, Wuchen Li
Abstract: We consider sampling from a Gibbs distribution by evolving finitely many particles. We propose a preconditioned version of a recently proposed noise-free sampling method, governed by approximating the score function with the numerically tractable score of a regularized Wasserstein proximal operator. This is derived by a Cole--Hopf transformation on coupled anisotropic heat equations, yielding a kernel formulation for the preconditioned regularized Wasserstein proximal. The diffusion component of the proposed method is also interpreted as a modified self-attention block, as in transformer architectures. For quadratic potentials, we provide a discrete-time non-asymptotic convergence analysis and explicitly characterize the bias, which is dependent on regularization and independent of step-size. Experiments demonstrate acceleration and particle-level stability on various log-concave and non-log-concave toy examples to Bayesian total-variation regularized image deconvolution, and competitive/better performance on non-convex Bayesian neural network training when utilizing variable preconditioning matrices.
Comment: The paper proposes a preconditioned sampling method with connections to transformer architectures, relevant to model architecture and efficiency.
Relevance: 8 Novelty: 7
36. Differentiable Expectation-Maximisation and Applications to Gaussian Mixture Model Optimal Transport
ArXiv ID: 2509.02109
Authors: Samuel Bo\"it\'e, Eloi Tanguy, Julie Delon, Agn`es Desolneux, R\'emi Flamary
Abstract: The Expectation-Maximisation (EM) algorithm is a central tool in statistics and machine learning, widely used for latent-variable models such as Gaussian Mixture Models (GMMs). Despite its ubiquity, EM is typically treated as a non-differentiable black box, preventing its integration into modern learning pipelines where end-to-end gradient propagation is essential. In this work, we present and compare several differentiation strategies for EM, from full automatic differentiation to approximate methods, assessing their accuracy and computational efficiency. As a key application, we leverage this differentiable EM in the computation of the Mixture Wasserstein distance $\mathrm{MW}_2$ between GMMs, allowing $\mathrm{MW}_2$ to be used as a differentiable loss in imaging and machine learning tasks. To complement our practical use of $\mathrm{MW}_2$, we contribute a novel stability result which provides theoretical justification for the use of $\mathrm{MW}_2$ with EM, and also introduce a novel unbalanced variant of $\mathrm{MW}_2$. Numerical experiments on barycentre computation, colour and style transfer, image generation, and texture synthesis illustrate the versatility and effectiveness of the proposed approach in different settings.
Comment: The paper presents differentiable EM for Gaussian Mixture Models, which is relevant to representation learning and model architecture.
Relevance: 8 Novelty: 7
37. An Explainable Gaussian Process Auto-encoder for Tabular Data
ArXiv ID: 2509.00884
Authors: Wei Zhang, Brian Barr, John Paisley
Abstract: Explainable machine learning has attracted much interest in the community where the stakes are high. Counterfactual explanations methods have become an important tool in explaining a black-box model. The recent advances have leveraged the power of generative models such as an autoencoder. In this paper, we propose a novel method using a Gaussian process to construct the auto-encoder architecture for generating counterfactual samples. The resulting model requires fewer learnable parameters and thus is less prone to overfitting. We also introduce a novel density estimator that allows for searching for in-distribution samples. Furthermore, we introduce an algorithm for selecting the optimal regularization rate on density estimator while searching for counterfactuals. We experiment with our method in several large-scale tabular datasets and compare with other auto-encoder-based methods. The results show that our method is capable of generating diversified and in-distribution counterfactual samples.
Comment: The paper proposes a novel Gaussian process auto-encoder, which is relevant to model architecture innovations, particularly in autoencoders.
Relevance: 8 Novelty: 7
38. Learning residue level protein dynamics with multiscale Gaussians
ArXiv ID: 2509.01038
Authors: Mihir Bafna, Bowen Jing, Bonnie Berger
Abstract: Many methods have been developed to predict static protein structures, however understanding the dynamics of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the in silico gold standard, its high computational cost limits scalability. We present DynaProt, a lightweight, SE(3)-invariant framework that predicts rich descriptors of protein dynamics directly from static structures. By casting the problem through the lens of multivariate Gaussians, DynaProt estimates dynamics at two complementary scales: (1) per-residue marginal anisotropy as $3 \times 3$ covariance matrices capturing local flexibility, and (2) joint scalar covariances encoding pairwise dynamic coupling across residues. From these dynamics outputs, DynaProt achieves high accuracy in predicting residue-level flexibility (RMSF) and, remarkably, enables reasonable reconstruction of the full covariance matrix for fast ensemble generation. Notably, it does so using orders of magnitude fewer parameters than prior methods. Our results highlight the potential of direct protein dynamics prediction as a scalable alternative to existing methods.
Comment: The paper presents a framework for predicting protein dynamics, which is relevant to foundational research in AI for Science, particularly in molecular modeling.
Relevance: 8 Novelty: 7
39. Quantum Circuits for Quantum Convolutions: A Quantum Convolutional Autoencoder
ArXiv ID: 2509.00637
Authors: Javier Orduz, Pablo Rivas, Erich Baker
Abstract: Quantum machine learning deals with leveraging quantum theory with classic machine learning algorithms. Current research efforts study the advantages of using quantum mechanics or quantum information theory to accelerate learning time or convergence. Other efforts study data transformations in the quantum information space to evaluate robustness and performance boosts. This paper focuses on processing input data using randomized quantum circuits that act as quantum convolutions producing new representations that can be used in a convolutional network. Experimental results suggest that the performance is comparable to classic convolutional neural networks, and in some instances, using quantum convolutions can accelerate convergence.
Comment: The paper discusses quantum convolutional autoencoders, which is relevant to model architecture innovations, particularly in autoencoders.
Relevance: 8 Novelty: 7
40. VASSO: Variance Suppression for Sharpness-Aware Minimization
ArXiv ID: 2509.02433
Authors: Bingcong Li, Yilang Zhang, Georgios B. Giannakis
Abstract: Sharpness-aware minimization (SAM) has well-documented merits in enhancing generalization of deep neural network models. Accounting for sharpness in the loss function geometry, where neighborhoods of flat minima' heighten generalization ability, SAM seeksflat valleys' by minimizing the maximum loss provoked by an adversarial perturbation within the neighborhood. Although critical to account for sharpness of the loss function, in practice SAM suffers from over-friendly adversaries,' which can curtail the outmost level of generalization. To avoid suchfriendliness,' the present contribution fosters stabilization of adversaries through variance suppression (VASSO). VASSO offers a general approach to provably stabilize adversaries. In particular, when integrating VASSO with SAM, improved generalizability is numerically validated on extensive vision and language tasks. Once applied on top of a computationally efficient SAM variant, VASSO offers a desirable generalization-computation tradeoff.
Comment: The paper introduces VASSO for sharpness-aware minimization, which is relevant to representation learning and model optimization.
Relevance: 8 Novelty: 7
41. Conditional-$t^3$VAE: Equitable Latent Space Allocation for Fair Generation
ArXiv ID: 2509.02154
Authors: Aymene Mohammed Bouayed, Samuel Deslauriers-Gauthier, Adrian Iaccovelli, David Naccache
Abstract: Variational Autoencoders (VAEs) with global priors mirror the training set's class frequency in latent space, underrepresenting tail classes and reducing generative fairness on imbalanced datasets. While $t^3$VAE improves robustness via heavy-tailed Student's t-distribution priors, it still allocates latent volume proportionally to the class frequency.In this work, we address this issue by explicitly enforcing equitable latent space allocation across classes. To this end, we propose Conditional-$t^3$VAE, which defines a per-class \mbox{Student's t} joint prior over latent and output variables, preventing dominance by majority classes. Our model is optimized using a closed-form objective derived from the $\gamma$-power divergence. Moreover, for class-balanced generation, we derive an equal-weight latent mixture of Student's t-distributions. On SVHN-LT, CIFAR100-LT, and CelebA, Conditional-$t^3$VAE consistently achieves lower FID scores than both $t^3$VAE and Gaussian-based VAE baselines, particularly under severe class imbalance. In per-class F1 evaluations, Conditional-$t^3$VAE also outperforms the conditional Gaussian VAE across all highly imbalanced settings. While Gaussian-based models remain competitive under mild imbalance ratio ($\rho \lesssim 3$), our approach substantially improves generative fairness and diversity in more extreme regimes.
Comment: The paper proposes Conditional-$t^3$VAE, which is relevant to model architecture innovations, particularly in variational autoencoders.
Relevance: 8 Novelty: 7
42. Unsupervised Training of Vision Transformers with Synthetic Negatives
ArXiv ID: 2509.02024
Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
Abstract: This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous works explored synthetic hard negatives but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.
Comment: The paper explores the use of synthetic negatives in self-supervised learning for vision transformers, contributing to representation learning.
Relevance: 8 Novelty: 7
43. Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives
ArXiv ID: 2509.02029
Authors: Nikolaos Giakoumoglou, Andreas Floros, Kleanthis Marios Papadopoulos, Tania Stathaki
Abstract: This paper does not introduce a new method per se. Instead, we build on existing self-supervised learning approaches for vision, drawing inspiration from the adage "fake it till you make it". While contrastive self-supervised learning has achieved remarkable success, it typically relies on vast amounts of real-world data and carefully curated hard negatives. To explore alternatives to these requirements, we investigate two forms of "faking it" in vision transformers. First, we study the potential of generative models for unsupervised representation learning, leveraging synthetic data to augment sample diversity. Second, we examine the feasibility of generating synthetic hard negatives in the representation space, creating diverse and challenging contrasts. Our framework - dubbed Syn2Co - combines both approaches and evaluates whether synthetically enhanced training can lead to more robust and transferable visual representations on DeiT-S and Swin-T architectures. Our findings highlight the promise and limitations of synthetic data in self-supervised learning, offering insights for future work in this direction.
Comment: The paper investigates synthetic data and hard negatives in self-supervised learning for vision transformers, relevant to representation learning.
Relevance: 8 Novelty: 7
44. Efficient Transformer-Inspired Variants of Physics-Informed Deep Operator Networks
ArXiv ID: 2509.01679
Authors: Zhi-Feng Wei, Wenqian Chen, Panos Stinis
Abstract: Operator learning has emerged as a promising tool for accelerating the solution of partial differential equations (PDEs). The Deep Operator Networks (DeepONets) represent a pioneering framework in this area: the "vanilla" DeepONet is valued for its simplicity and efficiency, while the modified DeepONet achieves higher accuracy at the cost of increased training time. In this work, we propose a series of Transformer-inspired DeepONet variants that introduce bidirectional cross-conditioning between the branch and trunk networks in DeepONet. Query-point information is injected into the branch network and input-function information into the trunk network, enabling dynamic dependencies while preserving the simplicity and efficiency of the "vanilla" DeepONet in a non-intrusive manner. Experiments on four PDE benchmarks -- advection, diffusion-reaction, Burgers', and Korteweg-de Vries equations -- show that for each case, there exists a variant that matches or surpasses the accuracy of the modified DeepONet while offering improved training efficiency. Moreover, the best-performing variant for each equation aligns naturally with the equation's underlying characteristics, suggesting that the effectiveness of cross-conditioning depends on the characteristics of the equation and its underlying physics. To ensure robustness, we validate the effectiveness of our variants through a range of rigorous statistical analyses, among them the Wilcoxon Two One-Sided Test, Glass's Delta, and Spearman's rank correlation.
Comment: The paper proposes Transformer-inspired variants of Deep Operator Networks, contributing to model architecture innovations.
Relevance: 8 Novelty: 7
45. MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
ArXiv ID: 2509.02480
Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
Abstract: Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary due to the faster growth of LLM sizes compared to GPU memory. To this end, multi-tier host memory or disk offloading techniques are proposed by state of art. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies result in significant I/O overheads in the critical path of training, resulting in slower iterations. To this end, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed for optimizing LLM training on resource-constrained setups by mitigating I/O bottlenecks. We make several key observations that drive the design of MLP-Offload, such as I/O overheads during the update dominate the iteration time; I/O bandwidth of the third-level remote storage tier remains unutilized; and, contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion to mitigate I/O bottlenecks during the backward and update phases. Evaluations on models up to 280B parameters shows that MLP-Offload achieves 2.5$\times$ faster iterations compared to the state-of-the-art LLM training runtimes.
Comment: The paper introduces MLP-Offload, a novel offloading engine for LLM training, which addresses efficiency and memory constraints. It is relevant to model compression and efficiency improvements.
Relevance: 8 Novelty: 7
46. Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs
ArXiv ID: 2509.00707
Authors: Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, Jaegul Choo
Abstract: Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this pheonomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
Comment: The paper proposes a novel decoding strategy for masked diffusion models, which is relevant to model architecture and representation learning.
Relevance: 8 Novelty: 7
47. Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective
ArXiv ID: 2509.02281
Authors: Shijie Wang, Li Zhang, Xinyan Liang, Yuhua Qian, Shen Hu
Abstract: Multimodal learning typically utilizes multimodal joint loss to integrate different modalities and enhance model performance. However, this joint learning strategy can induce modality imbalance, where strong modalities overwhelm weaker ones and limit exploitation of individual information from each modality and the inter-modality interaction information.Existing strategies such as dynamic loss weighting, auxiliary objectives and gradient modulation mitigate modality imbalance based on joint loss. These methods remain fundamentally reactive, detecting and correcting imbalance after it arises, while leaving the competitive nature of the joint loss untouched. This limitation drives us to explore a new strategy for multimodal imbalance learning that does not rely on the joint loss, enabling more effective interactions between modalities and better utilization of information from individual modalities and their interactions. In this paper, we introduce Unidirectional Dynamic Interaction (UDI), a novel strategy that abandons the conventional joint loss in favor of a proactive, sequential training scheme. UDI first trains the anchor modality to convergence, then uses its learned representations to guide the other modality via unsupervised loss. Furthermore, the dynamic adjustment of modality interactions allows the model to adapt to the task at hand, ensuring that each modality contributes optimally. By decoupling modality optimization and enabling directed information flow, UDI prevents domination by any single modality and fosters effective cross-modal feature learning. Our experimental results demonstrate that UDI outperforms existing methods in handling modality imbalance, leading to performance improvement in multimodal learning tasks.
Comment: The paper proposes a new strategy for multimodal learning that addresses modality imbalance, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
48. Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning
ArXiv ID: 2509.02401
Authors: Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea M{\o}rch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Martens, Julien Fauqueur
Abstract: Large language model (LLM) agents are increasingly deployed in structured biomedical data environments, yet they often produce fluent but overconfident outputs when reasoning over complex multi-table data. We introduce an uncertainty-aware agent for query-conditioned multi-table summarization that leverages two complementary signals: (i) retrieval uncertainty--entropy over multiple table-selection rollouts--and (ii) summary uncertainty--combining self-consistency and perplexity. Summary uncertainty is incorporated into reinforcement learning (RL) with Group Relative Policy Optimization (GRPO), while both retrieval and summary uncertainty guide inference-time filtering and support the construction of higher-quality synthetic datasets. On multi-omics benchmarks, our approach improves factuality and calibration, nearly tripling correct and useful claims per summary (3.0(\rightarrow)8.4 internal; 3.6(\rightarrow)9.9 cancer multi-omics) and substantially improving downstream survival prediction (C-index 0.32(\rightarrow)0.63). These results demonstrate that uncertainty can serve as a control signal--enabling agents to abstain, communicate confidence, and become more reliable tools for complex structured-data environments.
Comment: The paper introduces an uncertainty-aware agent for structured reasoning, which is relevant to large language models and emerging trends in AI research.
Relevance: 8 Novelty: 7
49. Counterfactual Sensitivity for Faithful Reasoning in Language Models
ArXiv ID: 2509.01544
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Abstract: Large language models (LLMs) often produce correct answers while relying on flawed or irrelevant reasoning traces, undermining their trustworthiness in high-stakes domains. We propose Counterfactual Sensitivity Regularization (CSR), a lightweight training objective that enforces dependence between intermediate reasoning and final outputs. CSR introduces automated, operator-level counterfactual interventions (e.g., swapping "+" with "-") during training and penalizes models that preserve the same answer under logically invalid traces. This requires only one additional forward pass per sample. To measure faithfulness, we introduce Counterfactual Outcome Sensitivity (COS), which quantifies the impact of such perturbations on model predictions. Across structured reasoning tasks - arithmetic (GSM8K), logical deduction (PrOntoQA), and planning (Blocks World) - CSR improves faithfulness by up to 70 percentage points over standard fine-tuning and process supervision, with only minor accuracy loss. The learned sensitivity generalizes to larger models and synergizes with inference-time methods such as self-consistency. A pilot study on HellaSwag further demonstrates that extending CSR with semantic perturbations can enhance faithfulness in commonsense reasoning.
Comment: The paper proposes a novel training objective for improving reasoning faithfulness in language models, which is relevant to large language models.
Relevance: 8 Novelty: 7
50. VISP: Volatility Informed Stochastic Projection for Adaptive Regularization
ArXiv ID: 2509.01903
Authors: Tanvir Islam
Abstract: We propose VISP: Volatility Informed Stochastic Projection, an adaptive regularization method that leverages gradient volatility to guide stochastic noise injection in deep neural networks. Unlike conventional techniques that apply uniform noise or fixed dropout rates, VISP dynamically computes volatility from gradient statistics and uses it to scale a stochastic projection matrix. This mechanism selectively regularizes inputs and hidden nodes that exhibit higher gradient volatility while preserving stable representations, thereby mitigating overfitting. Extensive experiments on MNIST, CIFAR-10, and SVHN demonstrate that VISP consistently improves generalization performance over baseline models and fixed-noise alternatives. In addition, detailed analyses of the evolution of volatility, the spectral properties of the projection matrix, and activation distributions reveal that VISP not only stabilizes the internal dynamics of the network but also fosters a more robust feature representation.
Comment: The paper introduces an adaptive regularization method using gradient volatility, which is relevant to representation learning and model architecture.
Relevance: 8 Novelty: 7
51. Equivariant U-Shaped Neural Operators for the Cahn-Hilliard Phase-Field Model
ArXiv ID: 2509.01293
Authors: Xiao Xue, M. F. P. ten Eikelder, Tianyue Yang, Yiqing Li, Kan He, Shuo Wang, Peter V. Coveney
Abstract: Phase separation in binary mixtures, governed by the Cahn-Hilliard equation, plays a central role in interfacial dynamics across materials science and soft matter. While numerical solvers are accurate, they are often computationally expensive and lack flexibility across varying initial conditions and geometries. Neural operators provide a data-driven alternative by learning solution operators between function spaces, but current architectures often fail to capture multiscale behavior and neglect underlying physical symmetries. Here we show that an equivariant U-shaped neural operator (E-UNO) can learn the evolution of the phase-field variable from short histories of past dynamics, achieving accurate predictions across space and time. The model combines global spectral convolution with a multi-resolution U-shaped architecture and regulates translation equivariance to align with the underlying physics. E-UNO outperforms standard Fourier neural operator and U-shaped neural operator baselines, particularly on fine-scale and high-frequency structures. By encoding symmetry and scale hierarchy, the model generalizes better, requires less training data, and yields physically consistent dynamics. This establishes E-UNO as an efficient surrogate for complex phase-field systems.
Comment: The paper introduces an equivariant U-shaped neural operator, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
52. Memory Limitations of Prompt Tuning in Transformers
ArXiv ID: 2509.00421
Authors: Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan
Abstract: Despite the empirical success of prompt tuning in adapting pretrained language models to new tasks, theoretical analyses of its capabilities remain limited. Existing theoretical work primarily addresses universal approximation properties, demonstrating results comparable to standard weight tuning. In this paper, we explore a different aspect of the theory of transformers: the memorization capability of prompt tuning. We provide two principal theoretical contributions. First, we prove that the amount of information memorized by a transformer cannot scale faster than linearly with the prompt length. Second, and more importantly, we present the first formal proof of a phenomenon empirically observed in large language models: performance degradation in transformers with extended contexts. We rigorously demonstrate that transformers inherently have limited memory, constraining the amount of information they can retain, regardless of the context size. This finding offers a fundamental understanding of the intrinsic limitations of transformer architectures, particularly their ability to handle long sequences.
Comment: The paper explores the memory limitations of prompt tuning in transformers, relevant to model architecture insights.
Relevance: 8 Novelty: 7
53. Causal representation learning from network data
ArXiv ID: 2509.01916
Authors: Jifan Zhang, Michelle M. Li, Elena Zheleva
Abstract: Causal disentanglement from soft interventions is identifiable under the assumptions of linear interventional faithfulness and availability of both observational and interventional data. Previous research has looked into this problem from the perspective of i.i.d. data. Here, we develop a framework, GraCE-VAE, for non-i.i.d. settings, in which structured context in the form of network data is available. GraCE-VAE integrates discrepancy-based variational autoencoders with graph neural networks to jointly recover the true latent causal graph and intervention effects. We show that the theoretical results of identifiability from i.i.d. data hold in our setup. We also empirically evaluate GraCE-VAE against state-of-the-art baselines on three genetic perturbation datasets to demonstrate the impact of leveraging structured context for causal disentanglement.
Comment: The paper introduces a framework for causal representation learning from network data, which is relevant to representation learning.
Relevance: 8 Novelty: 7
54. Teaching AI to Remember: Insights from Brain-Inspired Replay in Continual Learning
ArXiv ID: 2509.00047
Authors: Jina Kim
Abstract: Artificial neural networks (ANNs) continue to face challenges in continual learning, particularly due to catastrophic forgetting, the loss of previously learned knowledge when acquiring new tasks. Inspired by memory consolidation in the human brain, we investigate the internal replay mechanism proposed by~\citep{brain_inspired_replay1}, which reactivates latent representations of prior experiences during learning. As internal replay was identified as the most influential component among the brain-inspired mechanisms in their framework, it serves as the central focus of our in-depth investigation. Using the CIFAR-100 dataset in a class-incremental setting, we evaluate the effectiveness of internal replay, both in isolation and in combination with Synaptic Intelligence (SI). Our experiments show that internal replay significantly mitigates forgetting, especially when paired with SI, but at the cost of reduced initial task accuracy, highlighting a trade-off between memory stability and learning plasticity. Further analyses using log-likelihood distributions, reconstruction errors, silhouette scores, and UMAP projections reveal that internal replay increases representational overlap in latent space, potentially limiting task-specific differentiation. These results underscore the limitations of current brain-inspired methods and suggest future directions for balancing retention and adaptability in continual learning systems.
Comment: The paper explores brain-inspired replay mechanisms in continual learning, focusing on representation learning and training dynamics in neural networks.
Relevance: 8 Novelty: 7
55. GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping
ArXiv ID: 2509.01842
Authors: Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Reza Rawassizadeh
Abstract: Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and Feed-Forward layer matrices). We found that different components converge at varying rates during fine-tuning. GradES tracks the magnitude of gradients in backpropagation for these matrices during training. When a projection matrix's gradients fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slow converging matrices to continue learning. By strategically freezing parameters when their gradients converge, GradES speeds up training time by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy.
Comment: The paper introduces GradES, a gradient-based early stopping method for transformers, which is relevant to model architecture and efficiency improvements.
Relevance: 8 Novelty: 7
56. Double Descent and Overparameterization in Particle Physics Data
ArXiv ID: 2509.01397
Authors: Matthias Vigl, Lukas Heinrich
Abstract: Recently, the benefit of heavily overparameterized models has been observed in machine learning tasks: models with enough capacity to easily cross the \emph{interpolation threshold} improve in generalization error compared to the classical bias-variance tradeoff regime. We demonstrate this behavior for the first time in particle physics data and explore when and where `double descent' appears and under which circumstances overparameterization results in a performance gain.
Comment: The paper explores overparameterization and double descent in particle physics data, providing insights into training dynamics and model behavior.
Relevance: 8 Novelty: 7
57. Optimized Weight Initialization on the Stiefel Manifold for Deep ReLU Neural Networks
ArXiv ID: 2509.00362
Authors: Hyungu Lee, Taehyeong Kim, Hayoung Choi
Abstract: Stable and efficient training of ReLU networks with large depth is highly sensitive to weight initialization. Improper initialization can cause permanent neuron inactivation dying ReLU and exacerbate gradient instability as network depth increases. Methods such as He, Xavier, and orthogonal initialization preserve variance or promote approximate isometry. However, they do not necessarily regulate the pre-activation mean or control activation sparsity, and their effectiveness often diminishes in very deep architectures. This work introduces an orthogonal initialization specifically optimized for ReLU by solving an optimization problem on the Stiefel manifold, thereby preserving scale and calibrating the pre-activation statistics from the outset. A family of closed-form solutions and an efficient sampling scheme are derived. Theoretical analysis at initialization shows that prevention of the dying ReLU problem, slower decay of activation variance, and mitigation of gradient vanishing, which together stabilize signal and gradient flow in deep architectures. Empirically, across MNIST, Fashion-MNIST, multiple tabular datasets, few-shot settings, and ReLU-family activations, our method outperforms previous initializations and enables stable training in deep networks.
Comment: The paper introduces an optimized weight initialization method on the Stiefel manifold for deep ReLU networks, addressing issues like dying ReLU and gradient instability, which is relevant to representation learning and training dynamics.
Relevance: 8 Novelty: 7
58. Generalization vs. Memorization in Autoregressive Deep Learning: Or, Examining Temporal Decay of Gradient Coherence
ArXiv ID: 2509.00024
Authors: James Amarel, Nicolas Hengartner, Robyn Miller, Kamaljeet Singh, Siddharth Mansingh, Arvind Mohan, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde
Abstract: Foundation models trained as autoregressive PDE surrogates hold significant promise for accelerating scientific discovery through their capacity to both extrapolate beyond training regimes and efficiently adapt to downstream tasks despite a paucity of examples for fine-tuning. However, reliably achieving genuine generalization - a necessary capability for producing novel scientific insights and robustly performing during deployment - remains a critical challenge. Establishing whether or not these requirements are met demands evaluation metrics capable of clearly distinguishing genuine model generalization from mere memorization. We apply the influence function formalism to systematically characterize how autoregressive PDE surrogates assimilate and propagate information derived from diverse physical scenarios, revealing fundamental limitations of standard models and training routines in addition to providing actionable insights regarding the design of improved surrogates.
Comment: The paper explores generalization vs. memorization in autoregressive models, providing insights into model behavior, which is relevant to representation learning.
Relevance: 8 Novelty: 7
59. Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
ArXiv ID: 2509.00072
Authors: Terry Jingchen Zhang, Gopal Dev, Ning Wang, Nicole Ni, Wenyuan Jiang, Yinya Huang, Bernhard Sch\"olkopf, Mrinmaya Sachan, Zhijing Jin
Abstract: Capability evaluation of large language models (LLMs) is increasingly shadowed by rising concerns of data contamination that cast doubts on whether static benchmarks measure genuine reasoning or mere memorization. We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers, harnessing the natural temporal structure of research publications where performance decay after knowledge cutoffs may indicate potential contamination. We evaluated 4 frontier model represented by 2 models of different knowledge cutoff dates per family on 1,643 multi-step reasoning questions synthesized from 20,277 arXiv papers stratified over 26 months, covering at least 6 months before and after all cutoff dates. Our results consistently showed a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We further performed a comparative analysis with previous longitudinal studies that reported significant post-cutoff performance decay using directly retrieved questions based on public data. we hypothesize that the multi-step reasoning required by our synthesis pipeline offered additional complexity that goes deeper than shallow memorization, which effectively serves a mitigation strategy against benchmark contamination. We fully open source our code and dataset to aid reproducibility and advocate for a paradigm shift that prioritize reasoning-driven synthesis to construct benchmarks over simply collecting newly released questions periodically.
Comment: The paper addresses benchmark contamination in LLMs, focusing on reasoning-driven synthesis, which is relevant to foundational research in LLM behavior.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.