Personalized Daily ArXiv Papers 2025-05-27

[gpt-4o]	Prompt	Completion	Total
Token	109786	15021	124807
Cost	$0.27	$0.15	$0.42

Total arXiv papers: 1691

Total scanned papers: 1027

Total relevant papers: 84

Table of contents with paper titles:

PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training Authors: Matan Haroush, Daniel Soudry
AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping Authors: Haonan Dong, Wenhao Zhu, Guojie Song, Liang Wang
LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning Authors: Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu
Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer Authors: Guodong Du, Zitao Fang, Jing Li, Junlin Li, Runhua Jiang, Shuyang Yu, Yifei Guo, Yangneng Chen, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Honghai Liu, Min Zhang
MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention Authors: Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
Feature Preserving Shrinkage on Bayesian Neural Networks via the R2D2 Prior Authors: Tsai Hor Chan, Dora Yan Zhang, Guosheng Yin, Lequan Yu
ResSVD: Residual Compensated SVD for Large Language Model Compression Authors: Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang
On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts Authors: Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo
Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments Authors: Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng
$\mu$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts Authors: Toshiaki Koike-Akino, Jing Liu, Ye Wang
RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis
ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning Authors: Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Hongjian Fang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models Authors: Shashata Sawmya, Micah Adler, Nir Shavit
Understanding Transformer from the Perspective of Associative Memory Authors: Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi
FP4 All the Way: Fully Quantized Training of LLMs Authors: Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry
Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior Authors: Florian Eichin, Yupei Du, Philipp Mondorf, Barbara Plank, Michael A. Hedderich
Shifting AI Efficiency From Model-Centric to Data-Centric Compression Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Unfolding AlphaFold's Bayesian Roots in Probability Kinematics Authors: Thomas Hamelryck, Kanti V. Mardia
Exact Expressive Power of Transformers with Padding Authors: William Merrill, Ashish Sabharwal
The Coverage Principle: A Framework for Understanding Compositional Generalization Authors: Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers Authors: Kevin Xu, Issei Sato
Operator Learning for Schrödinger Equation: Unitarity, Error Bounds, and Time Generalization Authors: Yash Patel, Unique Subedi, Ambuj Tewari
Foundations of Top-$k$ Decoding For Language Models Authors: Georgy Noarov, Soham Mallick, Tao Wang, Sunay Joshi, Yan Sun, Yangxinyu Xie, Mengxin Yu, Edgar Dobriban
When fractional quasi p-norms concentrate Authors: Ivan Y. Tyukin, Bogdan Grechuk, Evgeny M. Mirkes, Alexander N. Gorban
I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts Authors: Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long
Error Optimization: Overcoming Exponential Signal Decay in Deep Predictive Coding Networks Authors: Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester
Multiple Descents in Deep Learning as a Sequence of Order-Chaos Transitions Authors: Wenbo Wei, Nicholas Chong Jia Le, Choy Heng Lai, Ling Feng
Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes Authors: Itamar Harel, Yonathan Wolanowsky, Gal Vardi, Nathan Srebro, Daniel Soudry
Token-Importance Guided Direct Preference Optimization Authors: Yang Ning, Lin Hai, Liu Yibo, Tian Baoliang, Liu Guoqing, Zhang Haijun
SwarmThinkers: Learning Physically Consistent Atomic KMC Transitions at Scale Authors: Qi Li, Kun Li, Haozhi Han, Honghui Shang, Xinfu He, Yunquan Zhang, Hong An, Ting Cao, Mao Yang
Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate Authors: Liangwei Nathan Zheng, Wei Emma Zhang, Mingyu Guo, Miao Xu, Olaf Maennel, Weitong Chen
A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models Authors: Utkarsh Sahu, Zhisheng Qi, Yongjia Lei, Ryan A. Rossi, Franck Dernoncourt, Nesreen K. Ahmed, Mahantesh M Halappanavar, Yao Ma, Yu Wang
MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE Authors: Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang
FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models Authors: Hao Kang, Zichun Yu, Chenyan Xiong
Equivariant Representation Learning for Symmetry-Aware Inference with Guarantees Authors: Daniel Ordoñez-Apraez, Alek Fröhlich, Vladimir Kostić, Karim Lounici, Vivien Brandt, Massimiliano Pontil
Lorentz Local Canonicalization: How to Make Any Network Lorentz-Equivariant Authors: Jonas Spinner, Luigi Favaro, Peter Lippmann, Sebastian Pitz, Gerrit Gerhartz, Tilman Plehn, Fred A. Hamprecht
Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks Authors: Gavin McCracken, Gabriela Moisescu-Pareja, Vincent Letourneau, Doina Precup, Jonathan Love
Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling Authors: Andrea Ceni, Alessio Gravina, Claudio Gallicchio, Davide Bacciu, Carola-Bibiane Schonlieb, Moshe Eliasof
On the Mechanisms of Weak-to-Strong Generalization: A Theoretical Perspective Authors: Behrad Moniri, Hamed Hassani
Chordless Structure: A Pathway to Simple and Expressive GNNs Authors: Hongxu Pan, Shuxian Hu, Mo Zhou, Zhibin Wang, Rong Gu, Chen Tian, Kun Yang, Sheng Zhong
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality Authors: Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models Authors: Louis Béthune, David Vigouroux, Yilun Du, Rufin VanRullen, Thomas Serre, Victor Boutin
Advanced long-term earth system forecasting by learning the small-scale nature Authors: Hao Wu, Yuan Gao, Ruiqi Shu, Kun Wang, Ruijian Gou, Chuhan Wu, Xinliang Liu, Juncai He, Shuhao Cao, Junfeng Fang, Xingjian Shi, Feng Tao, Qi Song, Shengxuan Ji, Yanfei Xiang, Yuze Sun, Jiahao Li, Fan Xu, Huanshuo Dong, Haixin Wang, Fan Zhang, Penghao Zhao, Xian Wu, Qingsong Wen, Deliang Chen, Xiaomeng Huang
Convexified Message-Passing Graph Neural Networks Authors: Saar Cohen, Noa Agmon, Uri Shaham
SeMe: Training-Free Language Model Merging via Semantic Alignment Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
Paying Alignment Tax with Contrastive Learning Authors: Buse Sibel Korkmaz, Rahul Nair, Elizabeth M. Daly, Antonio del Rio Chanona
AmorLIP: Efficient Language-Image Pretraining via Amortization Authors: Haotian Sun, Yitong Li, Yuchen Zhuang, Niao He, Hanjun Dai, Bo Dai
Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning Authors: Wenbo He, Zhijian Ou
FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models Authors: Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li
Variational Deep Learning via Implicit Regularization Authors: Jonathan Wenger, Beau Coker, Juraj Marusic, John P. Cunningham
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? Authors: Hongzheng Yang, Yongqiang Chen, Zeyu Qin, Tongliang Liu, Chaowei Xiao, Kun Zhang, Bo Han
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models Authors: Hao Chen, Haoze Li, Zhiqing Xiao, Lirong Gao, Qi Zhang, Xiaomeng Hu, Ningtao Wang, Xing Fu, Junbo Zhao
HD-PiSSA: High-Rank Distributed Orthogonal Adaptation Authors: Yiding Wang, Fauxu meng, Xuefeng Zhang, Fan Jiang, Pingzhi Tang, Muhan Zhang
Learning Optimal Multimodal Information Bottleneck Representations Authors: Qilong Wu, Yiyang Shao, Jun Wang, Xiaobo Sun
Latent Mamba Operator for Partial Differential Equations Authors: Karn Tiwari, Niladri Dutta, N M Anoop Krishnan, Prathosh A P
Do Large Language Models (Really) Need Statistical Foundations? Authors: Weijie Su
ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining Authors: Melis Ilayda Bal, Volkan Cevher, Michael Muehlebach
Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? Authors: Zexi Li, Xiangzhu Wang, William F. Shen, Meghdad Kurmanji, Xinchi Qiu, Dongqi Cai, Chao Wu, Nicholas D. Lane
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs Authors: Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
Task Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need? Authors: Waleed Reda, Abhinav Jangda, Krishna Chintalapudi
Statistical inference for Linear Stochastic Approximation with Markovian Noise Authors: Sergey Samsonov, Marina Sheshukova, Eric Moulines, Alexey Naumov
When Models Don't Collapse: On the Consistency of Iterative MLE Authors: Daniel Barzilai, Ohad Shamir
Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing Authors: Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang
Revisiting Glorot Initialization for Long-Range Linear Recurrences Authors: Noga Bar, Mariia Seleznova, Yotam Alexander, Gitta Kutyniok, Raja Giryes
Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law Authors: Frederik Kunstner, Francis Bach
Mitigating Deceptive Alignment via Self-Monitoring Authors: Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang
Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation Authors: Kaichao Jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu, Qi Chu, Yunfeng Diao
Efficient Data Selection at Scale via Influence Distillation Authors: Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni
AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings Authors: Konstantin Dobler, Desmond Elliott, Gerard de Melo
Logic Gate Neural Networks are Good for Verification Authors: Fabian Kresse, Emily Yu, Christoph H. Lampert, Thomas A. Henzinger
Hierarchical-embedding autoencoder with a predictor (HEAP) as efficient architecture for learning long-term evolution of complex multi-scale physical systems Authors: Alexander Khrabry, Edward Startsev, Andrew Powis, Igor Kaganovich
Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings Authors: Sarang Patil, Ashish Parmanand Pandey, Ioannis Koutis, Mengjia Xu
ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation Authors: Jian Liang, Wenke Huang, Xianda Guo, Guancheng Wan, Bo Du, Mang Ye
Model Stitching by Functional Latent Alignment Authors: Ioannis Athanasiadis, Anmar Karmush, Michael Felsberg
TabPFN: One Model to Rule Them All? Authors: Qiong Zhang, Yan Shuo Tan, Qinglong Tian, Pengfei Li
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models Authors: Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression Authors: Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li
Mind The Gap: Deep Learning Doesn't Learn Deeply Authors: Lucas Saldyt, Subbarao Kambhampati
On the Role of Label Noise in the Feature Learning Process Authors: Andi Han, Wei Huang, Zhanpeng Zhou, Gang Niu, Wuyang Chen, Junchi Yan, Akiko Takeda, Taiji Suzuki
PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning Authors: Yisu Wang, Ruilong Wu, Xinjiao Li, Dirk Kutscher
Scalable Gaussian Processes with Low-Rank Deep Kernel Decomposition Authors: Yunqin Zhu, Henry Shaowu Yuchi, Yao Xie
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs Authors: Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye
Hamiltonian Theory and Computation of Optimal Probability Density Control in High Dimensions Authors: Nathan Gaby, Xiaojing Ye

1. PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training

ArXiv ID: 2505.18313

Authors: Matan Haroush, Daniel Soudry

Abstract: Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models (LLMs) with billions of parameters. Existing low-rank gradient estimators such as GaLoRE and FLORA compress gradients and optimizer tensors by projecting weight gradients onto a rank-r subspace, enabling LLM training on consumer hardware. Yet, these methods are either biased or subject to high estimator variance. Moreover, the optimizer state based on the first and second moment estimates expressed in the previous subspace becomes misaligned whenever the projection is updated, leading to instabilities during training. We propose PLUMAGE: Probabilistic Low-rank Unbiased Minimum vAriance Gradient Estimator. PLUMAGE is a drop-in replacement for existing low-rank gradient estimators. It does not introduce new hyperparameters beyond the chosen rank r and the update interval. In addition, we resolve optimizer state misalignment issues to prevent spurious weight updates and enhance training stability. We empirically demonstrate that PLUMAGE shrinks the full-rank optimization's gap over the pre-training evaluation loss by 33% on average across models and the average training loss across the GLUE benchmark by 28% within a similar computational and memory footprint as GaLoRE.

Comment: The paper proposes a new low-rank gradient estimator for efficient large model training, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
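
To make the bias/variance distinction in the PLUMAGE abstract concrete, the sketch below contrasts two generic rank-r gradient compressors: a deterministic top-r SVD projection, and a random Gaussian projection P with E[P P^T] = I. Neither is the PLUMAGE estimator. The function names, shapes, and sample count are illustrative assumptions.

```python
import numpy as np

def top_r_projection(grad, r):
    """Deterministic compressor: project onto the top-r left singular vectors."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    p = u[:, :r]                      # (m, r) orthonormal basis
    return p @ (p.T @ grad)           # biased: the same tail directions are always dropped

def mean_random_projection(grad, r, n_samples, rng):
    """Average of P P^T G over Gaussian P with entries N(0, 1/r), so E[P P^T] = I."""
    m = grad.shape[0]
    p = rng.normal(scale=1.0 / np.sqrt(r), size=(n_samples, m, r))
    mean_ppt = np.einsum("nik,njk->ij", p, p) / n_samples
    return mean_ppt @ grad            # by linearity, this is the mean of the estimates

rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 6))        # stand-in for a weight gradient (assumed shape)
biased_err = np.linalg.norm(grad - top_r_projection(grad, r=2))
unbiased_err = np.linalg.norm(grad - mean_random_projection(grad, 2, 20000, rng))
print(f"top-r error: {biased_err:.3f}, averaged random-projection error: {unbiased_err:.3f}")
```

The top-r error stays fixed no matter how often the projection is applied, while the averaged random projection approaches the true gradient. That is the sense in which the second compressor is unbiased, at the cost of per-sample variance.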

2. AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping

ArXiv ID: 2505.18738

Authors: Haonan Dong, Wenhao Zhu, Guojie Song, Liang Wang

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method validated across NLP and CV domains. However, LoRA faces an inherent low-rank bottleneck: narrowing its performance gap with full fine-tuning requires increasing the rank of its parameter matrix, resulting in significant parameter overhead. Recent linear LoRA variants have attempted to enhance expressiveness by introducing additional linear mappings; however, their composition remains inherently linear and fails to fundamentally improve LoRA's representational capacity. To address this limitation, we propose AuroRA, which incorporates an Adaptive Nonlinear Layer (ANL) between two linear projectors to capture fixed and learnable nonlinearities. This combination forms an MLP-like structure with a compressed rank, enabling flexible and precise approximation of diverse target functions while theoretically guaranteeing lower approximation errors and bounded gradients. Extensive experiments on 22 datasets and 6 pretrained models demonstrate that AuroRA: (I) not only matches or surpasses full fine-tuning performance with only 6.18% ~ 25% of LoRA's parameters but also (II) outperforms state-of-the-art PEFT methods by up to 10.88% in both NLP and CV tasks, and (III) exhibits robust performance across various rank configurations.

Comment: The paper proposes AuroRA, which addresses the low-rank bottleneck in LoRA, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
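
The structural idea in the AuroRA abstract, a nonlinearity placed between the two low-rank projectors, can be sketched as follows. This is a minimal illustration that uses a fixed tanh as an assumed stand-in. The paper's ANL is described as combining fixed and learnable nonlinearities, which is not reproduced here, and all names and dimensions are hypothetical.

```python
import numpy as np

def lora_delta(x, a, b):
    """Standard LoRA update path: B (A x), which is linear in x."""
    return b @ (a @ x)

def nonlinear_lora_delta(x, a, b, act=np.tanh):
    """Low-rank path with a nonlinearity between the projectors: B act(A x)."""
    return b @ act(a @ x)

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 12, 2
a = rng.normal(size=(rank, d_in)) / np.sqrt(d_in)   # down-projection
b = rng.normal(size=(d_out, rank))                  # up-projection
x, y = rng.normal(size=d_in), rng.normal(size=d_in)

# Additivity f(x + y) = f(x) + f(y) holds for the linear path only.
linear_gap = np.linalg.norm(
    lora_delta(x + y, a, b) - lora_delta(x, a, b) - lora_delta(y, a, b)
)
nonlinear_gap = np.linalg.norm(
    nonlinear_lora_delta(x + y, a, b)
    - nonlinear_lora_delta(x, a, b)
    - nonlinear_lora_delta(y, a, b)
)
print(f"linear gap: {linear_gap:.2e}, nonlinear gap: {nonlinear_gap:.2e}")
```

Both paths use the same rank-2 parameter budget. Only the second escapes the class of rank-2 linear maps.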

3. LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning

ArXiv ID: 2505.18724

Authors: Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu

Abstract: Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods.

Comment: The paper introduces a novel fine-tuning method for quantized LLMs, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
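
The merging problem named in the LoTA-QAF abstract can be illustrated on a toy uniform 4-bit grid. A real-valued delta has to be rounded back onto the grid when merged, whereas a delta expressed in whole grid steps from {-1, 0, +1} merges by integer addition. The grid, scale, and helper names below are assumptions for illustration. The paper's TA construction and t-SignSGD are not reproduced.

```python
import numpy as np

def dequantize(codes, scale):
    """Map integer codes on a uniform 4-bit grid back to real-valued weights."""
    return scale * codes.astype(np.float64)

def merge_float_delta(codes, scale, delta):
    """Merge a real-valued delta: the result must be rounded back onto the grid."""
    return np.clip(np.rint(codes + delta / scale), 0, 15).astype(np.int64)

def merge_ternary_delta(codes, ternary):
    """Merge a delta of -1, 0, or +1 grid steps: integer addition, no rounding."""
    return np.clip(codes + ternary, 0, 15)

rng = np.random.default_rng(0)
scale = 0.1
codes = rng.integers(1, 15, size=(4, 4))        # codes in [1, 14], so +/-1 never clips
float_delta = rng.normal(scale=0.05, size=(4, 4))
ternary = rng.integers(-1, 2, size=(4, 4))      # values in {-1, 0, +1}

target_f = dequantize(codes, scale) + float_delta
merged_f = dequantize(merge_float_delta(codes, scale, float_delta), scale)
float_merge_err = np.abs(merged_f - target_f).max()

target_t = dequantize(codes, scale) + scale * ternary
merged_t = dequantize(merge_ternary_delta(codes, ternary), scale)
ternary_merge_err = np.abs(merged_t - target_t).max()
print(f"float merge error: {float_merge_err:.4f}, ternary merge error: {ternary_merge_err:.1e}")
```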

4. Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer

ArXiv ID: 2505.18713

Authors: Guodong Du, Zitao Fang, Jing Li, Junlin Li, Runhua Jiang, Shuyang Yu, Yifei Guo, Yangneng Chen, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Honghai Liu, Min Zhang

Abstract: Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: https://github.com/duguodong7/NPS-Pruning.

Comment: The paper introduces a novel pruning strategy for fine-tuned models, focusing on neural parameter search within low-rank subspaces, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8
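
The task-vector preprocessing mentioned in the abstract above (the difference between fine-tuned and pre-trained weights, which is then pruned and added back) can be sketched as below. Plain magnitude pruning is used as an assumed stand-in. It is not the NPS-Pruning search over low-rank subspaces, and the weight vectors are synthetic.

```python
import numpy as np

def task_vector(finetuned, pretrained):
    """Difference between fine-tuned and pre-trained parameters."""
    return finetuned - pretrained

def magnitude_prune(vec, keep_ratio):
    """Keep the largest-magnitude entries and zero out the rest."""
    k = max(1, int(round(keep_ratio * vec.size)))
    threshold = np.sort(np.abs(vec).ravel())[-k]
    return np.where(np.abs(vec) >= threshold, vec, 0.0)

rng = np.random.default_rng(0)
pretrained = rng.normal(size=1000)                           # synthetic base weights
finetuned = pretrained + rng.normal(scale=0.01, size=1000)   # synthetic fine-tuned weights

tau = task_vector(finetuned, pretrained)
tau_pruned = magnitude_prune(tau, keep_ratio=0.1)
slim_model = pretrained + tau_pruned                         # base plus sparse task vector
kept = int(np.count_nonzero(tau_pruned))
print(f"kept {kept} of {tau.size} task-vector entries")
```

Only the sparse task vector needs to be stored alongside the shared pre-trained checkpoint, which is where the storage saving comes from.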

5. MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

ArXiv ID: 2505.18698

Authors: Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano

Abstract: Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.

Comment: MonarchAttention presents a novel approach to sub-quadratic attention approximation in transformers, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8
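
As a rough reading of the complexity figures in the MonarchAttention abstract, the snippet below compares leading-order operation counts for dense attention (N^2 d) against the stated Theta(N sqrt(N) d) bound. The head dimension d = 64 and the reading of 4K and 16K as 4096 and 16384 are assumptions. Constants hidden by the Theta notation and hardware effects are ignored, so these ratios are not predictions of wall-time speed-up.

```python
import math

def dense_attention_ops(n, d):
    """Leading-order operation count for dense attention: N^2 * d."""
    return n * n * d

def monarch_attention_ops(n, d):
    """Leading-order count taken from the abstract's Theta(N * sqrt(N) * d) bound."""
    return n * math.isqrt(n) * d

for n in (256, 4096, 16384):
    ratio = dense_attention_ops(n, 64) / monarch_attention_ops(n, 64)
    print(f"N={n}: op-count ratio ~ {ratio:.0f}x")
```

The ratio is simply sqrt(N): 16x, 64x, and 128x for the three sequence lengths.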

6. Feature Preserving Shrinkage on Bayesian Neural Networks via the R2D2 Prior

ArXiv ID: 2505.18280

Authors: Tsai Hor Chan, Dora Yan Zhang, Guosheng Yin, Lequan Yu

Abstract: Bayesian neural networks (BNNs) treat neural network weights as random variables, which aim to provide posterior uncertainty estimates and avoid overfitting by performing inference on the posterior weights. However, the selection of appropriate prior distributions remains a challenging task, and BNNs may suffer from catastrophic inflated variance or poor predictive performance when poor choices are made for the priors. Existing BNN designs apply different priors to weights, while the behaviours of these priors make it difficult to sufficiently shrink noisy signals or they are prone to overshrinking important signals in the weights. To alleviate this problem, we propose a novel R2D2-Net, which imposes the R^2-induced Dirichlet Decomposition (R2D2) prior to the BNN weights. The R2D2-Net can effectively shrink irrelevant coefficients towards zero, while preventing key features from over-shrinkage. To approximate the posterior distribution of weights more accurately, we further propose a variational Gibbs inference algorithm that combines the Gibbs updating procedure and gradient-based optimization. This strategy enhances stability and consistency in estimation when the variational objective involving the shrinkage parameters is non-convex. We also analyze the evidence lower bound (ELBO) and the posterior concentration rates from a theoretical perspective. Experiments on both natural and medical image classification and uncertainty estimation tasks demonstrate satisfactory performance of our method.

Comment: The paper proposes a novel R2D2-Net for Bayesian neural networks, focusing on feature-preserving shrinkage, which aligns with representation learning and model compression through sparsity and shrinkage methods.

Relevance: 9 Novelty: 8

7. ResSVD: Residual Compensated SVD for Large Language Model Compression

ArXiv ID: 2505.20112

Authors: Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.

Comment: The paper presents ResSVD, a new SVD-based method for LLM compression, focusing on reducing truncation loss and selective layer compression, which is relevant to model compression.

Relevance: 9 Novelty: 8
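
The "residual matrix from truncation" and "truncation loss" in the ResSVD abstract refer to standard truncated-SVD quantities, sketched below on a random matrix. How the paper uses the residual to compensate is not specified in the abstract and is not reproduced here. The matrix size and rank are arbitrary assumptions.

```python
import numpy as np

def truncate_svd(w, rank):
    """Best rank-`rank` approximation of w in Frobenius norm (truncated SVD)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 24))              # stand-in for a weight matrix
rank = 8

w_r = truncate_svd(w, rank)
residual = w - w_r                         # what truncation throws away
truncation_loss = np.linalg.norm(residual) # Frobenius norm of the residual

# The loss equals the root-sum-square of the discarded singular values.
s = np.linalg.svd(w, compute_uv=False)
expected_loss = np.sqrt(np.sum(s[rank:] ** 2))
print(f"truncation loss: {truncation_loss:.4f} (from singular values: {expected_loss:.4f})")
```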

8. On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

ArXiv ID: 2505.18455

Authors: Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo

Abstract: The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In the paper, we study the convergence rates of the maximum likelihood estimator of gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model, in the sense that we make precise by formulating a novel analytic notion of distinguishability. Under distinguishability of the pre-trained and prompt models, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower due to their dependence on the prompt convergence rate to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.

Comment: The paper provides theoretical insights into the softmax-contaminated mixture of experts model, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8

9. Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments

ArXiv ID: 2505.19699

Authors: Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng

Abstract: Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client's personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at https://github.com/Wings-Of-Disaster/Mosaic.

Comment: The paper presents a data-free knowledge distillation framework using Mixture-of-Experts, which is relevant to model architecture and compression.

Relevance: 9 Novelty: 8

10. $\mu$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts

ArXiv ID: 2505.18451

Authors: Toshiaki Koike-Akino, Jing Liu, Ye Wang

Abstract: To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can be executed for every prompt adaptively, yet achieving reduced complexity at inference. We formulate it as a mixture of micro-experts, called $\mu$-MoE. Several experiments demonstrate that $\mu$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.

Comment: The paper introduces a test-time pruning method as a micro-grained mixture-of-experts, relevant to model compression and architecture.

Relevance: 9 Novelty: 8
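
Per-prompt activation-aware pruning, as described in the abstract above, can be illustrated with a Wanda-style score |W_ij| * ||X_j||_2 that is recomputed from each prompt's activations. The scoring rule is an assumed stand-in, and the mask here is unstructured per row, whereas the abstract refers to structured sparsity. All shapes and names are hypothetical.

```python
import numpy as np

def prompt_aware_mask(weight, activations, keep_per_row):
    """Score weights by |W_ij| * ||X_j||_2 and keep the top entries of each output row."""
    col_norm = np.linalg.norm(activations, axis=0)        # (d_in,) norm per input feature
    score = np.abs(weight) * col_norm                     # broadcast across rows
    top = np.argsort(-score, axis=1)[:, :keep_per_row]    # indices of the kept weights
    mask = np.zeros(weight.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 10))
# Two synthetic prompts that excite different halves of the input features.
prompt_a = rng.normal(size=(5, 10)) * np.repeat([3.0, 0.1], 5)
prompt_b = rng.normal(size=(7, 10)) * np.repeat([0.1, 3.0], 5)

mask_a = prompt_aware_mask(weight, prompt_a, keep_per_row=4)
mask_b = prompt_aware_mask(weight, prompt_b, keep_per_row=4)
print("weights kept for both prompts:", int((mask_a & mask_b).sum()))
```

The same weight matrix yields a different active subset for each prompt, which is the sense in which the pruned sub-networks act like prompt-selected experts.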

11. RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models

ArXiv ID: 2505.18877

Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis

Abstract: Low-Rank Adaptation (LoRA) lowers the computational and memory overhead of fine-tuning large models by updating a low-dimensional subspace of the pre-trained weight matrix. Albeit efficient, LoRA exhibits suboptimal convergence and noticeable performance degradation, due to inconsistent and imbalanced weight updates induced by its nonunique low-rank factorizations. To overcome these limitations, this article identifies the optimal low-rank factorization per step that minimizes an upper bound on the loss. The resultant refactored low-rank adaptation (RefLoRA) method promotes a flatter loss landscape, along with consistent and balanced weight updates, thus speeding up stable convergence. Extensive experiments evaluate RefLoRA on natural language understanding, and commonsense reasoning tasks with popular large language models including DeBERTaV3, LLaMA-7B, LLaMA2-7B and LLaMA3-8B. The numerical tests corroborate that RefLoRA converges faster, outperforms various benchmarks, and enjoys negligible computational overhead compared to state-of-the-art LoRA variants.

Comment: RefLoRA proposes a method for efficient fine-tuning of large models using low-rank adaptation, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
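
The nonuniqueness mentioned in the abstract can be illustrated with a small sketch. It is not the RefLoRA algorithm: it only shows that two factorizations of the same low-rank update (here related by a scalar `c`, a choice made for illustration) give different weights after one gradient step on the factors. The helper names and the identity gradient `G` are assumptions.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(r) for r in zip(*X)]

def sgd_step_on_factors(B, A, G, lr):
    """One gradient step on the factors of W = B @ A, given G = dLoss/dW."""
    grad_B = matmul(G, transpose(A))   # dLoss/dB = G A^T
    grad_A = matmul(transpose(B), G)   # dLoss/dA = B^T G
    B_new = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    A_new = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    return matmul(B_new, A_new)

B = [[1.0], [2.0]]            # 2 x 1 factor
A = [[3.0, 4.0]]              # 1 x 2 factor
c = 4.0                       # any nonzero rescaling leaves B @ A unchanged
B_scaled = [[c * b for b in row] for row in B]
A_scaled = [[a / c for a in row] for row in A]
G = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gradient w.r.t. W

same_product = matmul(B, A) == matmul(B_scaled, A_scaled)
W_after = sgd_step_on_factors(B, A, G, lr=0.01)
W_after_scaled = sgd_step_on_factors(B_scaled, A_scaled, G, lr=0.01)
```

Both factorizations start from the same product, yet `W_after` and `W_after_scaled` differ, because the rescaling shifts gradient magnitude from one factor to the other.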

12. ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning

ArXiv ID: 2505.18232

Authors: Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Hongjian Fang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao

Abstract: The deployment of Large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.

Comment: ELDeR introduces a novel paradigm for pruning LLMs, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

13. The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models

ArXiv ID: 2505.19440

Authors: Shashata Sawmya, Micah Adler, Nir Shavit

Abstract: This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.

Comment: The paper studies the emergence of interpretable features in LLMs using sparse autoencoders, which aligns with representation learning and provides insights into how deep networks encode information.

Relevance: 9 Novelty: 8

14. Understanding Transformer from the Perspective of Associative Memory

ArXiv ID: 2505.19488

Authors: Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi

Abstract: In this paper, we share our reflections and insights on understanding Transformer architectures through the lens of associative memory--a classic psychological concept inspired by human cognition. We start with the basics of associative memory (think simple linear attention) and then dive into two dimensions: Memory Capacity: How much can a Transformer really remember, and how well? We introduce retrieval SNR to measure this and use a kernel perspective to mathematically reveal why Softmax Attention is so effective. We also show how FFNs can be seen as a type of associative memory, leading to insights on their design and potential improvements. Memory Update: How do these memories learn and evolve? We present a unified framework for understanding how different Transformer variants (like DeltaNet and Softmax Attention) update their "knowledge base". This leads us to tackle two provocative questions: 1. Are Transformers fundamentally limited in what they can express, and can we break these barriers? 2. If a Transformer had infinite context, would it become infinitely intelligent? We want to demystify Transformer architecture, offering a clearer understanding of existing designs. This exploration aims to provide fresh insights and spark new avenues for Transformer innovation.

Comment: The paper provides insights into Transformer architectures through the lens of associative memory, which aligns with model architecture analysis and offers theoretical insights.

Relevance: 9 Novelty: 8
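
For readers unfamiliar with the associative-memory view, the following sketch shows the basic linear case the abstract alludes to: associations are written as outer products of value and key, and read back by multiplying the memory with a query. The function names are illustrative, and nothing here reproduces the paper's retrieval-SNR or kernel analysis.

```python
def outer(v, k):
    return [[vi * kj for kj in k] for vi in v]

def write(memory, key, value):
    """Add the association key -> value as the outer product value * key^T."""
    update = outer(value, key)
    return [[m + d for m, d in zip(rm, rd)] for rm, rd in zip(memory, update)]

def read(memory, query):
    """Retrieve by multiplying the memory matrix with the query vector."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]

memory = [[0.0, 0.0, 0.0] for _ in range(2)]   # value dim 2, key dim 3
memory = write(memory, key=[1.0, 0.0, 0.0], value=[5.0, -1.0])
memory = write(memory, key=[0.0, 1.0, 0.0], value=[2.0, 7.0])

exact = read(memory, [1.0, 0.0, 0.0])   # orthogonal keys: clean recall
noisy = read(memory, [1.0, 0.5, 0.0])   # overlapping query: crosstalk
```

With orthogonal keys the stored value comes back exactly; a query that overlaps a second key picks up a share of that key's value, which is the kind of interference a capacity measure has to account for.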

15. FP4 All the Way: Fully Quantized Training of LLMs

ArXiv ID: 2505.19115

Authors: Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

Abstract: We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .

Comment: The paper demonstrates fully quantized training of LLMs, which is relevant to model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
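
The block format and the two rounding modes described in the abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: the shared block scale is kept in full precision rather than encoded in E4M3, ties in round-to-nearest are broken upward, and the function names are assumptions.

```python
import random

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16

def round_to_grid(x, stochastic, rng):
    """Round x (already scaled into [-6, 6]) to the E2M1 grid, keeping the sign."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])
    # Neighbouring grid points with lo <= mag <= hi.
    lo = max(g for g in E2M1_GRID if g <= mag)
    hi = min(g for g in E2M1_GRID if g >= mag)
    if hi == lo:
        return sign * lo
    if stochastic:
        # Round up with probability equal to the fractional position (unbiased).
        p_up = (mag - lo) / (hi - lo)
        return sign * (hi if rng.random() < p_up else lo)
    return sign * (hi if mag - lo >= hi - mag else lo)  # ties go up in this sketch

def quantize_block(block, stochastic=False, rng=random):
    """Quantize one block with a shared scale; returns dequantized values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]   # full precision here, not E4M3
    return [round_to_grid(v / scale, stochastic, rng) * scale for v in block]

rng = random.Random(0)
block = [0.1 * i for i in range(-8, 8)]                          # 16 values
forward_vals = quantize_block(block)                             # round-to-nearest
backward_vals = quantize_block(block, stochastic=True, rng=rng)  # stochastic
```

Stochastic rounding is unbiased in expectation, which is the usual reason to prefer it for gradients, while round-to-nearest gives lower per-element error.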

16. Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

ArXiv ID: 2505.20076

Authors: Florian Eichin, Yupei Du, Philipp Mondorf, Barbara Plank, Michael A. Hedderich

Abstract: Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

Comment: The paper presents a unified framework for model, data, and training attribution, which is relevant to understanding model behavior and training dynamics.

Relevance: 9 Novelty: 8

17. Shifting AI Efficiency From Model-Centric to Data-Centric Compression

ArXiv ID: 2505.19147

Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang

Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.

Comment: The paper discusses a shift from model-centric to data-centric compression, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

18. Unfolding AlphaFold's Bayesian Roots in Probability Kinematics

ArXiv ID: 2505.19763

Authors: Thomas Hamelryck, Kanti V. Mardia

Abstract: We present a novel theoretical interpretation of AlphaFold1. The seminal breakthrough of AlphaFold1 in protein structure prediction by deep learning relied on a learned potential energy function, in contrast to the later end-to-end architectures of AlphaFold2 and AlphaFold3. While this potential was originally justified by referring to physical potentials of mean force (PMFs), we reinterpret AlphaFold1's potential as an instance of probability kinematics - also known as Jeffrey conditioning - a principled but underrecognised generalization of conventional Bayesian updating. Probability kinematics accommodates uncertain or soft evidence in the form of updated probabilities over a partition. This perspective reveals AlphaFold1's potential as a form of generalized Bayesian updating, rather than a thermodynamic potential. To confirm our probabilistic framework's scope and precision, we analyze a synthetic 2D model in which an angular random walk prior is updated with evidence on distances via probability kinematics, mirroring AlphaFold1's approach. This theoretical contribution connects AlphaFold1 to a broader class of well-justified Bayesian methods, allowing precise quantification, surpassing merely qualitative heuristics based on PMFs. More broadly, given the achievements of AlphaFold1, probability kinematics holds considerable promise for probabilistic deep learning, as it allows for the formulation of complex models from a few simpler components.

Comment: The paper provides a novel theoretical interpretation of AlphaFold1 using probability kinematics, which is foundational research in AI for Science.

Relevance: 9 Novelty: 8

19. Exact Expressive Power of Transformers with Padding

ArXiv ID: 2505.18948

Authors: William Merrill, Ashish Sabharwal

Abstract: Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding converge to precisely the class $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, padded transformers converge to the class $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought.

Comment: The paper analyzes the expressive power of transformers with padding, contributing to the understanding of model architecture.

Relevance: 9 Novelty: 8

20. The Coverage Principle: A Framework for Understanding Compositional Generalization

ArXiv ID: 2505.20278

Authors: Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo

Abstract: Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has a strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and the training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interoperability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a mechanism-based taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionality. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.

Comment: The paper introduces the coverage principle for understanding compositional generalization, which is relevant to representation learning and emerging trends.

Relevance: 9 Novelty: 8

21. To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers

ArXiv ID: 2505.19245

Authors: Kevin Xu, Issei Sato

Abstract: Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks and to theoretically enhance expressivity by recursively increasing the number of computational steps. However, their comparative capabilities are still not well understood. In this paper, we provide a formal analysis of their respective strengths and limitations. We show that Looped Transformers can efficiently simulate parallel computations for deterministic tasks, which we formalize as evaluation over directed acyclic graphs. In contrast, CoT with stochastic decoding excels at approximate inference for compositional structures, namely self-reducible problems. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical cues for choosing between reasoning paradigms.

Comment: The paper provides a formal comparison between Chain-of-Thought and Looped Transformers, contributing to the understanding of model architecture.

Relevance: 9 Novelty: 8

22. Operator Learning for Schrödinger Equation: Unitarity, Error Bounds, and Time Generalization

ArXiv ID: 2505.18288

Authors: Yash Patel, Unique Subedi, Ambuj Tewari

Abstract: We consider the problem of learning the evolution operator for the time-dependent Schrödinger equation, where the Hamiltonian may vary with time. Existing neural network-based surrogates often ignore fundamental properties of the Schrödinger equation, such as linearity and unitarity, and lack theoretical guarantees on prediction error or time generalization. To address this, we introduce a linear estimator for the evolution operator that preserves a weak form of unitarity. We establish both upper and lower bounds on the prediction error that hold uniformly over all sufficiently smooth initial wave functions. Additionally, we derive time generalization bounds that quantify how the estimator extrapolates beyond the time points seen during training. Experiments across real-world Hamiltonians -- including hydrogen atoms, ion traps for qubit design, and optical lattices -- show that our estimator achieves relative errors $10^{-2}$ to $10^{-3}$ times smaller than state-of-the-art methods such as the Fourier Neural Operator and DeepONet.

Comment: The paper addresses operator learning for the Schrödinger equation with theoretical guarantees, relevant to foundational research in AI for science.

Relevance: 9 Novelty: 8

23. Foundations of Top-$k$ Decoding For Language Models

ArXiv ID: 2505.19371

Authors: Georgy Noarov, Soham Mallick, Tao Wang, Sunay Joshi, Yan Sun, Yangxinyu Xie, Mengxin Yu, Edgar Dobriban

Abstract: Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider Bregman decoders obtained by minimizing a separable Bregman divergence (for both the primal and dual cases) with a sparsity-inducing $\ell_0$ regularization. Despite the combinatorial nature of the objective, we show how to optimize it efficiently for a large class of divergences. We show that the optimal decoding strategies are greedy, and further that the loss function is discretely convex in $k$, so that binary search provably and efficiently finds the optimal $k$. We show that top-$k$ decoding arises as a special case for the KL divergence, and identify new decoding strategies that have distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).

Comment: The paper provides a theoretical framework for top-k decoding in LLMs, which is relevant to theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8
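
The standard top-$k$ procedure described in the first sentence of the abstract (keep the $k$ largest probabilities, renormalize, sample) can be sketched as below. This covers only plain top-$k$, not the Bregman decoders the paper introduces; the function names are illustrative.

```python
import random

def top_k_distribution(probs, k):
    """Zero out all but the k largest probabilities and renormalize to sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]

def sample_top_k(probs, k, rng=random):
    """Sample a token index from the truncated, renormalized distribution."""
    truncated = top_k_distribution(probs, k)
    return rng.choices(range(len(probs)), weights=truncated, k=1)[0]

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
truncated = top_k_distribution(probs, k=2)   # mass only on indices 0 and 1
token = sample_top_k(probs, k=2, rng=random.Random(0))
```

After truncation with `k=2`, the two surviving tokens keep their 2:1 ratio and all other tokens have probability zero.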

24. When fractional quasi p-norms concentrate

ArXiv ID: 2505.19635

Authors: Ivan Y. Tyukin, Bogdan Grechuk, Evgeny M. Mirkes, Alexander N. Gorban

Abstract: Concentration of distances in high dimension is an important factor for the development and design of stable and reliable data analysis algorithms. In this paper, we address the fundamental long-standing question about the concentration of distances in high dimension for fractional quasi $p$-norms, $p\in(0,1)$. The topic has been at the centre of various theoretical and empirical controversies. Here we, for the first time, identify conditions when fractional quasi $p$-norms concentrate and when they don't. We show that contrary to some earlier suggestions, for broad classes of distributions, fractional quasi $p$-norms admit exponential and uniform in $p$ concentration bounds. For these distributions, the results effectively rule out previously proposed approaches to alleviate concentration by "optimal" setting the values of $p$ in $(0,1)$. At the same time, we specify conditions and the corresponding families of distributions for which one can still control concentration rates by appropriate choices of $p$. We also show that in an arbitrarily small vicinity of a distribution from a large class of distributions for which uniform concentration occurs, there are uncountably many other distributions featuring anti-concentration properties. Importantly, this behavior enables devising relevant data encoding or representation schemes favouring or discouraging distance concentration. The results shed new light on this long-standing problem and resolve the tension around the topic in both theory and empirical evidence reported in the literature.

Comment: The paper addresses the concentration of fractional quasi p-norms, relevant to emerging trends in theoretical work.

Relevance: 9 Novelty: 8

25. I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts

ArXiv ID: 2505.19190

Authors: Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long

Abstract: Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.

Comment: The paper introduces I2MoE, a framework for interpretable multimodal interaction-aware mixture-of-experts, relevant to model architecture.

Relevance: 9 Novelty: 8

26. Error Optimization: Overcoming Exponential Signal Decay in Deep Predictive Coding Networks

ArXiv ID: 2505.20137

Authors: Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester

Abstract: Predictive Coding (PC) offers a biologically plausible alternative to backpropagation for neural network training, yet struggles with deeper architectures. This paper identifies the root cause: an inherent signal decay problem where gradients attenuate exponentially with depth, becoming computationally negligible due to numerical precision constraints. To address this fundamental limitation, we introduce Error Optimization (EO), a novel reparameterization that preserves PC's theoretical properties while eliminating signal decay. By optimizing over prediction errors rather than states, EO enables signals to reach all layers simultaneously and without attenuation, converging orders of magnitude faster than standard PC. Experiments across multiple architectures and datasets demonstrate that EO matches backpropagation's performance even for deeper models where conventional PC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling biologically-inspired learning to deeper architectures on digital hardware and beyond.

Comment: The paper introduces Error Optimization to address signal decay in deep predictive coding networks, providing theoretical insights into training dynamics.

Relevance: 9 Novelty: 8

27. Multiple Descents in Deep Learning as a Sequence of Order-Chaos Transitions

ArXiv ID: 2505.20030

Authors: Wenbo Wei, Nicholas Chong Jia Le, Choy Heng Lai, Ling Feng

Abstract: We observe a novel 'multiple-descent' phenomenon during the training of LSTMs, in which the test loss goes through multiple long cycles of rising and falling after the model is overtrained. By carrying out asymptotic stability analysis of the models, we find that the cycles in test loss are closely associated with the phase transition process between order and chaos, and the local optimal epochs are consistently at the critical transition point between the two phases. More importantly, the global optimal epoch occurs at the first transition from order to chaos, where the 'width' of the 'edge of chaos' is the widest, allowing the best exploration of better weight configurations for learning.

Comment: The paper observes a novel 'multiple-descent' phenomenon in LSTM training, providing insights into training dynamics and phase transitions.

Relevance: 9 Novelty: 8

28. Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes

ArXiv ID: 2505.19087

Authors: Itamar Harel, Yonathan Wolanowsky, Gal Vardi, Nathan Srebro, Daniel Soudry

Abstract: We analyze the generalization gap (gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $\theta_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $\beta^{-1}$, i.e. gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $\beta^{-1}$-variance Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by $\sqrt{(\beta\mathbb{E} L (\theta_0) + \log(1/\delta))/N}$ with probability $1-\delta$ over the dataset, where $N$ is the sample size, and $\mathbb{E} L (\theta_0) =O(1)$ with standard initialization scaling. In contrast to previous guarantees, we have no dependence on either training time or reliance on mixing, nor a dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the marginal distribution divergence from initialization remains bounded, as implied by a generalized second law of thermodynamics.

Comment: The paper analyzes the generalization gap in Langevin dynamics, providing theoretical insights into training dynamics and generalization.

Relevance: 9 Novelty: 8
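
To get a feel for how the stated bound scales, the expression from the abstract can be evaluated directly. This sketch uses the formula exactly as quoted and ignores any constant factors or extra terms the full theorem may carry; the function name and the example values are assumptions.

```python
import math

def generalization_gap_bound(beta, expected_init_loss, delta, n_samples):
    """Evaluate sqrt((beta * E[L(theta_0)] + log(1/delta)) / N) as quoted."""
    numerator = beta * expected_init_loss + math.log(1.0 / delta)
    return math.sqrt(numerator / n_samples)

# Hypothetical values: inverse temperature 100, initial loss 1, 99% confidence.
bound_n = generalization_gap_bound(100.0, 1.0, 0.01, 10_000)
bound_4n = generalization_gap_bound(100.0, 1.0, 0.01, 40_000)
bound_colder = generalization_gap_bound(400.0, 1.0, 0.01, 10_000)
```

Quadrupling the sample size halves the value, and raising $\beta$ (lowering the temperature) loosens it, with no term for training time or dimension.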

29. Token-Importance Guided Direct Preference Optimization

ArXiv ID: 2505.19653

Authors: Yang Ning, Lin Hai, Liu Yibo, Tian Baoliang, Liu Guoqing, Zhang Haijun

Abstract: Ensuring that large language models (LLMs) generate outputs aligned with human preferences is important for safe and effective AI interactions. While Direct Preference Optimization (DPO) employs an implicit reward function to optimize the policy model, it and its related variants overlook the differential importance of individual tokens and are sensitive to judgment noise in preference datasets during generation. Although recent methods attempt to assess the importance weights of tokens via probability prediction or simplistic weighting schemes, these evaluation methods are prone to biases and still cannot fully address these issues. To solve this problem, we propose the Token-Importance Guided Direct Preference Optimization (TI-DPO), which introduces two key innovations: the gradient-based token-importance weights that dynamically prioritize critical tokens, and a triple loss that explicitly guides model outputs to approach human-preferred responses and stay away from non-preferred responses. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.

Comment: The paper introduces a novel method for optimizing LLM outputs by focusing on token importance, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8
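
As background for the baseline this entry builds on, here is a sketch of the standard sequence-level DPO loss for a single preference pair. It is not TI-DPO: it contains neither the token-importance weights nor the triple loss. The argument names and the default `beta` are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin) = -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: lower loss.
loss_better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Every token contributes equally to the sequence log-probabilities in this loss, which is the uniform treatment the abstract argues against.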

30. SwarmThinkers: Learning Physically Consistent Atomic KMC Transitions at Scale

ArXiv ID: 2505.20094

Authors: Qi Li, Kun Li, Haozhi Han, Honghui Shang, Xinfu He, Yunquan Zhang, Hong An, Ting Cao, Mao Yang

Abstract: Can a scientific simulation system be physically consistent, interpretable by design, and scalable across regimes--all at once? Despite decades of progress, this trifecta remains elusive. Classical methods like Kinetic Monte Carlo ensure thermodynamic accuracy but scale poorly; learning-based methods offer efficiency but often sacrifice physical consistency and interpretability. We present SwarmThinkers, a reinforcement learning framework that recasts atomic-scale simulation as a physically grounded swarm intelligence system. Each diffusing particle is modeled as a local decision-making agent that selects transitions via a shared policy network trained under thermodynamic constraints. A reweighting mechanism fuses learned preferences with transition rates, preserving statistical fidelity while enabling interpretable, step-wise decision making. Training follows a centralized-training, decentralized-execution paradigm, allowing the policy to generalize across system sizes, concentrations, and temperatures without retraining. On a benchmark simulating radiation-induced Fe-Cu alloy precipitation, SwarmThinkers is the first system to achieve full-scale, physically consistent simulation on a single A100 GPU, previously attainable only via OpenKMC on a supercomputer. It delivers up to 4963x (3185x on average) faster computation with 485x lower memory usage. By treating particles as decision-makers, not passive samplers, SwarmThinkers marks a paradigm shift in scientific simulation--one that unifies physical consistency, interpretability, and scalability through agent-driven intelligence.

Comment: SwarmThinkers introduces a novel reinforcement learning framework for atomic-scale simulation, relevant to AI for Science with a focus on foundational research.

Relevance: 8 Novelty: 9

31. Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate

ArXiv ID: 2505.19525

Authors: Liangwei Nathan Zheng, Wei Emma Zhang, Mingyu Guo, Miao Xu, Olaf Maennel, Weitong Chen

Abstract: Effectively managing missing modalities is a fundamental challenge in real-world multimodal learning scenarios, where data incompleteness often results from systematic collection errors or sensor failures. Sparse Mixture-of-Experts (SMoE) architectures have the potential to naturally handle multimodal data, with individual experts specializing in different modalities. However, existing SMoE approaches often lack the ability to handle missing modalities, leading to performance degradation and poor generalization in real-world applications. We propose Conf-SMoE, which introduces a two-stage imputation module to handle the missing-modality problem in the SMoE architecture, and we characterize expert collapse through theoretical analysis backed by strong empirical evidence. Inspired by our theoretical analysis, Conf-SMoE proposes a novel expert gating mechanism by detaching the softmax routing score to a task confidence score w.r.t. the ground truth. This naturally relieves expert collapse without introducing an additional load-balancing loss function. We show that the insights on expert collapse align with other gating mechanisms such as Gaussian and Laplacian gates. We also evaluate the proposed method on four real-world datasets under three experimental settings to comprehensively analyze Conf-SMoE's modality fusion and resistance to missing modalities.

Comment: The paper proposes a novel gating mechanism for Sparse Mixture-of-Experts (SMoE) architectures, which aligns with the core topic of model architecture innovations.

Relevance: 9 Novelty: 7

32. A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models

ArXiv ID: 2505.19286

Authors: Utkarsh Sahu, Zhisheng Qi, Yongjia Lei, Ryan A. Rossi, Franck Dernoncourt, Nesreen K. Ahmed, Mahantesh M Halappanavar, Yao Ma, Yu Wang

Abstract: Large language models have been extensively studied as neural knowledge bases for their knowledge access, editability, reasoning, and explainability. However, few works focus on the structural patterns of their knowledge. Motivated by this gap, we investigate these structural patterns from a graph perspective. We quantify the knowledge of LLMs at both the triplet and entity levels, and analyze how it relates to graph structural properties such as node degree. Furthermore, we uncover knowledge homophily, where topologically close entities exhibit similar levels of knowledgeability, which further motivates us to develop graph machine learning models to estimate an entity's knowledge based on its local neighbors. These models further enable valuable knowledge checking by selecting triplets less known to LLMs. Empirical results show that using the selected triplets for fine-tuning leads to superior performance.

Comment: The paper investigates structural patterns of knowledge in LLMs from a graph perspective, which aligns with foundational research in understanding LLM behavior.

Relevance: 9 Novelty: 7

33. MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

ArXiv ID: 2505.19645

Authors: Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang

Abstract: Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand the tradeoffs involved in SD, we develop a reliable model based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric 'target efficiency' that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.

Comment: The paper explores speculative decoding for accelerating sparse MoE models, which is relevant to model architecture and compression.

Relevance: 9 Novelty: 7

34. FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

ArXiv ID: 2505.20225

Authors: Hao Kang, Zichun Yu, Chenyan Xiong

Abstract: Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.

Comment: The paper introduces FLAME-MoE, a research platform for MoE models, which is relevant to model architecture.

Relevance: 9 Novelty: 7
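
The "64 experts with top-8 gating" configuration mentioned in the FLAME-MoE abstract can be illustrated with a minimal routing sketch. This is not the FLAME-MoE implementation: the function name `top_k_gate`, the random router logits, and the choice to apply the softmax after top-k selection are all assumptions made for illustration, and the two shared experts (which would be applied to every token regardless of routing) are not modeled.

```python
import math
import random

NUM_EXPERTS = 64  # from the abstract
TOP_K = 8         # from the abstract


def top_k_gate(router_logits, k=TOP_K):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: weights are a softmax restricted to the selected experts.
    """
    # Indices of the k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical router output for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gate = top_k_gate(logits)
```

Only the 8 selected experts would run for this token, which is where the "fraction of the model per token" saving comes from; the remaining 56 experts receive zero weight.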

35. Equivariant Representation Learning for Symmetry-Aware Inference with Guarantees

ArXiv ID: 2505.19809

Authors: Daniel Ordoñez-Apraez, Alek Fröhlich, Vladimir Kostić, Karim Lounici, Vivien Brandt, Massimiliano Pontil

Abstract: In many real-world applications of regression, conditional probability estimation, and uncertainty quantification, exploiting symmetries rooted in physics or geometry can dramatically improve generalization and sample efficiency. While geometric deep learning has made significant empirical advances by incorporating group-theoretic structure, less attention has been given to statistical learning guarantees. In this paper, we introduce an equivariant representation learning framework that simultaneously addresses regression, conditional probability estimation, and uncertainty quantification while providing first-of-its-kind non-asymptotic statistical learning guarantees. Grounded in operator and group representation theory, our framework approximates the spectral decomposition of the conditional expectation operator, building representations that are both equivariant and disentangled along independent symmetry subgroups. Empirical evaluations on synthetic datasets and real-world robotics applications confirm the potential of our approach, matching or outperforming existing equivariant baselines in regression while additionally providing well-calibrated parametric uncertainty estimates.

Comment: The paper introduces an equivariant representation learning framework with statistical learning guarantees, relevant to representation learning and symmetry-aware inference.

Relevance: 8 Novelty: 8

36. Lorentz Local Canonicalization: How to Make Any Network Lorentz-Equivariant

ArXiv ID: 2505.20280

Authors: Jonas Spinner, Luigi Favaro, Peter Lippmann, Sebastian Pitz, Gerrit Gerhartz, Tilman Plehn, Fred A. Hamprecht

Abstract: Lorentz-equivariant neural networks are becoming the leading architectures for high-energy physics. Current implementations rely on specialized layers, limiting architectural choices. We introduce Lorentz Local Canonicalization (LLoCa), a general framework that renders any backbone network exactly Lorentz-equivariant. Using equivariantly predicted local reference frames, we construct LLoCa-transformers and graph networks. We adapt a recent approach to geometric message passing to the non-compact Lorentz group, allowing propagation of space-time tensorial features. Data augmentation emerges from LLoCa as a special choice of reference frame. Our models surpass state-of-the-art accuracy on relevant particle physics tasks, while being $4\times$ faster and using $5$-$100\times$ fewer FLOPs.

Comment: The paper introduces a framework for making any network Lorentz-equivariant, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8
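
To make the idea of canonicalization in the LLoCa abstract concrete, here is a small sketch of the underlying symmetry, not of the paper's method. It assumes a boost along the x-axis only and uses a particle's rest frame as a hand-picked canonical frame, whereas the abstract describes equivariantly predicted local reference frames. The names `minkowski_sq`, `boost_x`, and `canonicalize` and the example four-momentum are hypothetical.

```python
import math


def minkowski_sq(p):
    """Minkowski square E^2 - |p|^2 with signature (+, -, -, -)."""
    e, px, py, pz = p
    return e * e - px * px - py * py - pz * pz


def boost_x(p, beta):
    """Boost a four-vector along the x-axis with velocity beta (units of c)."""
    gamma = 1.0 / math.sqrt(1.0 - beta * beta)
    e, px, py, pz = p
    return (gamma * (e - beta * px), gamma * (px - beta * e), py, pz)


def canonicalize(p):
    """Express p in its own rest frame (assumes momentum along x, mass > 0)."""
    return boost_x(p, p[1] / p[0])


p = (5.0, 3.0, 0.0, 0.0)          # hypothetical four-momentum, m^2 = 16
p_other = boost_x(p, -0.3)        # the same particle seen by another observer
canon_a = canonicalize(p)
canon_b = canonicalize(p_other)   # agrees with canon_a up to rounding
```

Because both observers arrive at the same canonical coordinates, any network applied to those coordinates produces observer-independent outputs, which is the general reason canonicalization can make an arbitrary backbone respect a symmetry.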

37. Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks

ArXiv ID: 2505.18266

Authors: Gavin McCracken, Gabriela Moisescu-Pareja, Vincent Letourneau, Doina Precup, Jonathan Love

Abstract: We propose a testable universality hypothesis, asserting that seemingly disparate neural network solutions observed in the simple task of modular addition are unified under a common abstract algorithm. While prior work interpreted variations in neuron-level representations as evidence for distinct algorithms, we demonstrate - through multi-level analyses spanning neurons, neuron clusters, and entire networks - that multilayer perceptrons and transformers universally implement the abstract algorithm we call the approximate Chinese Remainder Theorem. Crucially, we introduce approximate cosets and show that neurons activate exclusively on them. Furthermore, our theory works for deep neural networks (DNNs). It predicts that universally learned solutions in DNNs with trainable embeddings or more than one hidden layer require only O(log n) features, a result we empirically confirm. This work thus provides the first theory-backed interpretation of multilayer networks solving modular addition. It advances generalizable interpretability and opens a testable universality hypothesis for group multiplication beyond modular addition.

Comment: The paper proposes a universality hypothesis for neural networks solving modular addition, relevant to representation learning and interpretability.

Relevance: 8 Novelty: 8
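
For readers unfamiliar with the Chinese Remainder Theorem referenced in the abstract above, the sketch below shows the classical, exact version applied to modular addition. It is not the paper's "approximate" variant and says nothing about how networks implement it. The moduli `(3, 5, 7)` and all function names are assumptions chosen for illustration.

```python
from math import prod

MODULI = (3, 5, 7)   # pairwise coprime, so n = 105
N = prod(MODULI)


def to_residues(x):
    """Represent x by its remainders modulo each small modulus."""
    return tuple(x % m for m in MODULI)


def add_residues(ra, rb):
    """Add two numbers component-wise in the residue representation."""
    return tuple((a + b) % m for a, b, m in zip(ra, rb, MODULI))


def from_residues(res):
    """Reconstruct the unique value in [0, N) with the given residues."""
    total = 0
    for r, m in zip(res, MODULI):
        partial = N // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % N


def mod_add_via_crt(a, b):
    """Compute (a + b) mod N using only arithmetic in the small moduli."""
    return from_residues(add_residues(to_residues(a), to_residues(b)))
```

Addition modulo 105 is reduced here to three independent additions modulo 3, 5, and 7. Since the product of the moduli must reach n, the number of small moduli needed grows roughly logarithmically in n, which gives some intuition for the O(log n) feature count stated in the abstract.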

38. Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling

ArXiv ID: 2505.18728

Authors: Andrea Ceni, Alessio Gravina, Claudio Gallicchio, Davide Bacciu, Carola-Bibiane Schonlieb, Moshe Eliasof

Abstract: The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph State-Space Models (GSSMs). However, existing GSSMs operate by applying SSM modules to sequences extracted from graphs, often compromising core properties such as permutation equivariance, message-passing compatibility, and computational efficiency. In this paper, we introduce a new perspective by embedding the key principles of modern SSM computation directly into the Message-Passing Neural Network framework, resulting in a unified methodology for both static and temporal graphs. Our approach, MP-SSM, enables efficient, permutation-equivariant, and long-range information propagation while preserving the architectural simplicity of message passing. Crucially, MP-SSM enables an exact sensitivity analysis, which we use to theoretically characterize information flow and evaluate issues like vanishing gradients and over-squashing in the deep regime. Furthermore, our design choices allow for a highly optimized parallel implementation akin to modern SSMs. We validate MP-SSM across a wide range of tasks, including node classification, graph property prediction, long-range benchmarks, and spatiotemporal forecasting, demonstrating both its versatility and strong empirical performance.

Comment: The paper introduces a new method for graph learning by embedding state-space model principles into message-passing networks, relevant to model architecture innovations.

Relevance: 8 Novelty: 8

39. On the Mechanisms of Weak-to-Strong Generalization: A Theoretical Perspective

ArXiv ID: 2505.18346

Authors: Behrad Moniri, Hamed Hassani

Abstract: Weak-to-strong generalization, where a student model trained on imperfect labels generated by a weaker teacher nonetheless surpasses that teacher, has been widely observed, but the mechanisms that enable it have remained poorly understood. In this paper, through a theoretical analysis of simple models, we uncover three core mechanisms that can drive this phenomenon. First, by analyzing ridge regression, we study the interplay between the teacher and student regularization and prove that a student can compensate for a teacher's under-regularization and achieve lower test error. We also analyze the role of the parameterization regime of the models. Second, by analyzing weighted ridge regression, we show that a student model whose regularization structure is more aligned with the target can outperform its teacher. Third, in a nonlinear multi-index setting, we demonstrate that a student can learn easy, task-specific features from the teacher while leveraging its own broader pre-training to learn hard-to-learn features that the teacher cannot capture.

Comment: The paper provides a theoretical perspective on weak-to-strong generalization, offering insights into training dynamics and representation learning.

Relevance: 8 Novelty: 8

40. Chordless Structure: A Pathway to Simple and Expressive GNNs

ArXiv ID: 2505.19188

Authors: Hongxu Pan, Shuxian Hu, Mo Zhou, Zhibin Wang, Rong Gu, Chen Tian, Kun Yang, Sheng Zhong

Abstract: Researchers have proposed various methods of incorporating more structured information into the design of Graph Neural Networks (GNNs) to enhance their expressiveness. However, these methods are either computationally expensive or lacking in provable expressiveness. In this paper, we observe that the chords increase the complexity of the graph structure while contributing little useful information in many cases. In contrast, chordless structures are more efficient and effective for representing the graph. Therefore, when leveraging the information of cycles, we choose to omit the chords. Accordingly, we propose a Chordless Structure-based Graph Neural Network (CSGNN) and prove that its expressiveness is strictly more powerful than the k-hop GNN (KPGNN) with polynomial complexity. Experimental results on real-world datasets demonstrate that CSGNN outperforms existing GNNs across various graph tasks while incurring lower computational costs and achieving better performance than the GNNs of 3-WL expressiveness.

Comment: The paper introduces a new GNN architecture based on chordless structures, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

41. Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

ArXiv ID: 2505.18227

Authors: Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

Abstract: In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.

Comment: The paper discusses token reduction in generative models, positioning it as a fundamental principle in generative modeling, which is relevant to model architecture and emerging trends.

Relevance: 8 Novelty: 8

42. Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models

ArXiv ID: 2505.18230

Authors: Louis Béthune, David Vigouroux, Yilun Du, Rufin VanRullen, Thomas Serre, Victor Boutin

Abstract: What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold -- requiring a Riemannian metric to describe the space's local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) -- a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics -- shortest paths that follow the data manifold's intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.

Comment: The paper proposes a method for deriving Riemannian metrics from energy-based models, which is a novel approach in representation learning.

Relevance: 8 Novelty: 8

43. Advanced long-term earth system forecasting by learning the small-scale nature

ArXiv ID: 2505.19432

Authors: Hao Wu, Yuan Gao, Ruiqi Shu, Kun Wang, Ruijian Gou, Chuhan Wu, Xinliang Liu, Juncai He, Shuhao Cao, Junfeng Fang, Xingjian Shi, Feng Tao, Qi Song, Shengxuan Ji, Yanfei Xiang, Yuze Sun, Jiahao Li, Fan Xu, Huanshuo Dong, Haixin Wang, Fan Zhang, Penghao Zhao, Xian Wu, Qingsong Wen, Deliang Chen, Xiaomeng Huang

Abstract: Reliable long-term forecasting of Earth system dynamics is heavily hampered by instabilities in current AI models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. We present Triton, an AI framework designed to address this fundamental challenge. Inspired by the practice of refining grids to explicitly resolve small scales in numerical models, Triton employs a hierarchical architecture that processes information across multiple resolutions to mitigate spectral bias and explicitly model cross-scale dynamics. We demonstrate Triton's superior performance on challenging forecast tasks, achieving stable year-long global temperature forecasts, skillful Kuroshio eddy predictions up to 120 days, and high-fidelity turbulence simulations preserving fine-scale structures, all without external forcing, and significantly surpassing baseline AI models in long-term stability and accuracy. By effectively suppressing high-frequency error accumulation, Triton offers a promising pathway towards trustworthy AI-driven simulation for climate and earth system science.

Comment: Triton addresses spectral bias in AI models for Earth system forecasting, relevant to AI for Science with a focus on foundational research.

Relevance: 8 Novelty: 8

44. Convexified Message-Passing Graph Neural Networks

ArXiv ID: 2505.18289

Authors: Saar Cohen, Noa Agmon, Uri Shaham

Abstract: Graph Neural Networks (GNNs) have become prominent methods for graph representation learning, demonstrating strong empirical results on diverse graph prediction tasks. In this paper, we introduce Convexified Message Passing Graph Neural Networks (CGNNs), a novel and general framework that combines the power of message-passing GNNs with the tractability of convex optimization. By mapping their nonlinear filters into a reproducing kernel Hilbert space, CGNNs transform training into a convex optimization problem, which can be solved efficiently and optimally by projected gradient methods. This convexity further allows the statistical properties of CGNNs to be analyzed accurately and rigorously. For two-layer CGNNs, we establish rigorous generalization guarantees, showing convergence to the performance of the optimal GNN. To scale to deeper architectures, we adopt a principled layer-wise training strategy. Experiments on benchmark datasets show that CGNNs significantly exceed the performance of leading GNN models, achieving 10 to 40 percent higher accuracy in most cases, underscoring their promise as a powerful and principled method with strong theoretical foundations. In rare cases where improvements are not quantitatively substantial, the convex models either slightly exceed or match the baselines, stressing their robustness and wide applicability. Though over-parameterization is often employed to enhance performance in nonconvex models, we show that our CGNNs framework yields shallow convex models that can surpass these models in both accuracy and resource efficiency.

Comment: The paper introduces Convexified Message-Passing Graph Neural Networks, which provides a novel framework combining GNNs with convex optimization, aligning with representation learning.

Relevance: 8 Novelty: 8

45. SeMe: Training-Free Language Model Merging via Semantic Alignment

ArXiv ID: 2505.20144

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Abstract: Despite the remarkable capabilities of Language Models (LMs) across diverse tasks, no single model consistently outperforms others, necessitating efficient methods to combine their strengths without expensive retraining. Existing model merging techniques, such as parameter averaging and task-guided fusion, often rely on data-dependent computations or fail to preserve internal knowledge, limiting their robustness and scalability. We introduce SeMe (Semantic-based Merging), a novel, data-free, and training-free approach that leverages latent semantic alignment to merge LMs at a fine-grained, layer-wise level. Unlike prior work, SeMe not only preserves model behaviors but also explicitly stabilizes internal knowledge, addressing a critical gap in LM fusion. Through extensive experiments across diverse architectures and tasks, we demonstrate that SeMe outperforms existing methods in both performance and efficiency while eliminating reliance on external data. Our work establishes a new paradigm for knowledge-aware model merging and provides insights into the semantic structure of LMs, paving the way for more scalable and interpretable model composition.

Comment: The paper introduces a novel, data-free, and training-free approach for merging language models using semantic alignment, which provides insights into the semantic structure of LMs.