This is a remedial run for missed papers from 05/16/2025 to 05/18/2025.

Results generated on 05/26/2025.

Personalized Daily ArXiv Papers 2025-05-19

[gpt-4o]	Prompt	Completion	Total
Token	93333	12929	106262
Cost	$0.23	$0.13	$0.36
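
As a rough check of the cost row, the figures are consistent with rates of $2.50 per million prompt tokens and $10.00 per million completion tokens. Those rates are an assumption inferred from the numbers in the table; the report itself does not state them.

```python
# Assumed rates in USD per 1M tokens; inferred from the table, not stated in the report.
PROMPT_RATE = 2.50
COMPLETION_RATE = 10.00


def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Cost in USD for a token count at a given per-million-token rate."""
    return tokens * rate_per_million / 1_000_000


prompt_tokens, completion_tokens = 93_333, 12_929
prompt_cost = cost_usd(prompt_tokens, PROMPT_RATE)
completion_cost = cost_usd(completion_tokens, COMPLETION_RATE)
total_cost = prompt_cost + completion_cost

print(prompt_tokens + completion_tokens)  # total token count
print(f"${prompt_cost:.2f} ${completion_cost:.2f} ${total_cost:.2f}")
```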

Total arXiv papers: 889

Total scanned papers: 889

Total relevant papers: 78

Table of contents with paper titles:

Search-Based Correction of Reasoning Chains for Language Models Authors: Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio
On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo
RanDeS: Randomized Delta Superposition for Multi-Model Compression Authors: Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan
TDFormer: A Top-Down Attention-Controlled Spiking Transformer Authors: Zizheng Zhu, Yingchao Yu, Zeqi Zheng, Zhaofei Yu, Yaochu Jin
Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning Authors: Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, Randall Balestriero
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang
MINGLE: Mixtures of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging Authors: Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Harnessing the Universal Geometry of Embeddings Authors: Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris
Unsupervised Invariant Risk Minimization Authors: Yotam Norman, Ron Meir
Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization Authors: Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab
Approximation theory for 1-Lipschitz ResNets Authors: Davide Murari, Takashi Furuya, Carola-Bibiane Schönlieb
Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis
MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production Authors: Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis Authors: Akarsh Kumar, Jeff Clune, Joel Lehman, Kenneth O. Stanley
Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks Authors: Chiyue Wei, Bowen Duan, Cong Guo, Jingyang Zhang, Qingyue Song, Hai "Helen" Li, Yiran Chen
Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych
Training NTK to Generalize with KARE Authors: Johannes Schwab, Bryan Kelly, Semyon Malamud, Teng Andrea Xu
Neural Thermodynamics I: Entropic Forces in Deep and Universal Representation Learning Authors: Liu Ziyin, Yizhou Xu, Isaac Chuang
Addition is almost all you need: Compressing neural networks with double binary factorization Authors: Vladimír Boža, Vladimír Macko
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts
InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models Authors: Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, Congkai Xie, Aofan Liu, Ming Li, Jiannong Cao, Yuan Xie, Ngai Wong, Hongxia Yang
An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts Authors: Yu-Ting Lee, Hui-Ying Shih, Fu-Chieh Chang, Pei-Yuan Wu
Dynamic Base model Shift for Delta Compression Authors: Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang
On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms Authors: Jacob Trauger, Ambuj Tewari
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems Authors: Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai
Model Merging in Pre-training of Large Language Models Authors: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu
Foundation model for mass spectrometry proteomics Authors: Justin Sanders, Melih Yilmaz, Jacob H. Russell, Wout Bittremieux, William E. Fondrie, Nicholas M. Riley, Sewoong Oh, William Stafford Noble
Redefining Neural Operators in $d+1$ Dimensions Authors: Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang
SpikeX: Exploring Accelerator Architecture and Network-Hardware Co-Optimization for Sparse Spiking Neural Networks Authors: Boxun Xu, Richard Boone, Peng Li
SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache Authors: Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Cheng Long, Jie Zhang
Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures Authors: Zhiheng Chen, Ruofan Wu, Guanhua Fang
SepPrune: Structured Pruning for Efficient Deep Speech Separation Authors: Yuqi Li, Kai Li, Xin Yin, Zhifei Yang, Junhao Dong, Zeyu Dong, Chuanguang Yang, Yingli Tian, Yao Lu
Memory-Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation Authors: Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang
LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades Authors: Yanan Li, Fanxu Meng, Muhan Zhang, Shiai Zhu, Shangguang Wang, Mengwei Xu
AdaDim: Dimensionality Adaptation for SSL Representational Dynamics Authors: Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib
Structured Representation Authors: Arun Kumar, Paul Schrater
SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization Authors: Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang
When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability Authors: Alberto Bassi, Carlo Albert, Aurelien Lucchi, Marco Baity-Jesi, Emanuele Francazi
What Can We Learn From MIMO Graph Convolutions? Authors: Andreas Roth, Thomas Liebig
Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum Authors: Wenquan Lu, Jiaqi Zhang, Hugues Van Assel, Randall Balestriero
A Local Polyak-Lojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models Authors: Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, Rene Vidal
Attention on the Sphere Authors: Boris Bonev, Max Rietmann, Andrea Paris, Alberto Carpentieri, Thorsten Kurth
Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers Authors: Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff
Adaptive parameter-efficient fine-tuning via Hessian-informed subset selection Authors: Shiyun Xu, Zhiqi Bu
SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization Authors: Haodong Yang, Lei Wang, Md Zakir Hossain
Revisiting Stochastic Approximation and Stochastic Gradient Descent Authors: Rajeeva Laxman Karandikar, Bhamidi Visweswara Rao, Mathukumalli Vidyasagar
Flash Invariant Point Attention Authors: Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann
Where You Place the Norm Matters: From Prejudiced to Neutral Initializations Authors: Emanuele Francazi, Francesco Pinto, Aurelien Lucchi, Marco Baity-Jesi
CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs Authors: David Méndez, Gianpaolo Bontempo, Elisa Ficarra, Roberto Confalonieri, Natalia Díaz-Rodríguez
Efficient Optimization with Orthogonality Constraint: a Randomized Riemannian Submanifold Method Authors: Andi Han, Pierre-Louis Poirion, Akiko Takeda
Understanding Nonlinear Implicit Bias via Region Counts in Input Space Authors: Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang
Efficient Federated Class-Incremental Learning of Pre-Trained Models via Task-agnostic Low-rank Residual Adaptation Authors: Feng Yu, Jia Hu, Geyong Min
FlashBias: Fast Computation of Attention with Bias Authors: Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long
Adversarially Robust Spiking Neural Networks with Sparse Connectivity Authors: Mathias Schmolli, Maximilian Baronig, Robert Legenstein, Ozan Özdenizci
Relational Graph Transformer Authors: Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec
Hyperbolic Residual Quantization: Discrete Representations for Data with Latent Hierarchies Authors: Piotr Piękos, Subhradeep Kayal, Alexandros Karatzoglou
MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection Authors: Pouya Shaeri, Ariane Middel
PoE-World: Compositional World Modeling with Products of Programmatic Experts Authors: Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis
AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections Authors: Xin Yu, Yujia Wang, Jinghui Chen, Lingzhou Xue
Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations Authors: Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths
SchoenbAt: Rethinking Attention with Polynomial basis Authors: Yuhan Guo, Lizhong Ding, Yuwan Yang, Xuewei Guo
S-Crescendo: A Nested Transformer Weaving Framework for Scalable Nonlinear System in S-Domain Representation Authors: Junlang Huang, Hao Chen, Li Luo, Yong Cai, Lexin Zhang, Tianhao Ma, Yitian Zhang, Zhong Guan
On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm Authors: Huan Li, Yiming Dong, Zhouchen Lin
Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD Authors: Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, Christoph H. Lampert
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
PhiNet v2: A Mask-Free Brain-Inspired Vision Foundation Model from Video Authors: Makoto Yamada, Kian Ming A. Chai, Ayoub Rhim, Satoki Ishikawa, Mohammad Sabokrou, Yao-Hung Hubert Tsai
msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML Authors: Zhaolan Huang, Emmanuel Baccelli
MergeBench: A Benchmark for Merging Domain-Specialized LLMs Authors: Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
A Classical View on Benign Overfitting: The Role of Sample Size Authors: Junhyung Park, Patrick Bloebaum, Shiva Prasad Kasiviswanathan
Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets Authors: Ahmet Bilican, M. Akın Yılmaz, A. Murat Tekalp, R. Gökberk Cinbiş
WaLRUS: Wavelets for Long-range Representation Using SSMs Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk
SAINT: Attention-Based Modeling of Sub-Action Dependencies in Multi-Action Policies Authors: Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab
STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference Authors: Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables Authors: Yu Gui, Cong Ma, Zongming Ma
Graph Representational Learning: When Does More Expressivity Hurt Generalization? Authors: Sohir Maskey, Raffaele Paolino, Fabian Jogl, Gitta Kutyniok, Johannes F. Lutzeyer
Do different prompting methods yield a common task representation in language models? Authors: Guy Davidson, Todd M. Gureckis, Brenden M. Lake, Adina Williams
On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study Authors: Shuai Yang, Qi Yang, Luoxi Tang, Jeremy Blackburn, Zhaohan Xi

1. Search-Based Correction of Reasoning Chains for Language Models

ArXiv ID: 2505.11824

Authors: Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio

Abstract: Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity, enabling modeling of all possible truth assignments rather than assuming correctness throughout. To efficiently explore this expanded space, we introduce Search Corrector, a discrete search algorithm over boolean-valued veracity assignments. It efficiently performs otherwise intractable inference in the posterior distribution over veracity assignments by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time correction method facilitates supervised fine-tuning of an Amortized Corrector by providing pseudo-labels for veracity. The Amortized Corrector generalizes self-correction, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that Search Corrector reliably identifies errors in logical (ProntoQA) and mathematical reasoning (GSM8K) benchmarks. The Amortized Corrector achieves comparable zero-shot accuracy and improves final answer accuracy by up to 25%.

Comment: Author match

2. On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

ArXiv ID: 2505.10860

Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Abstract: Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.

Comment: The paper provides a theoretical study of MoE architectures, specifically focusing on shared experts and normalized sigmoid gating, which is highly relevant to model architecture.

Relevance: 10 Novelty: 8
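
The two features named in the abstract above, a shared expert and normalized sigmoid gating, can be illustrated with a minimal NumPy sketch. It assumes one common formulation: each routed expert gets an independent sigmoid score, the top-k scores are renormalized to sum to one, and an always-active shared expert is added to the routed output. The function names, the linear toy experts, and `top_k=2` are hypothetical, and the paper's exact definitions may differ.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def moe_forward(x, gate_w, routed_experts, shared_expert, top_k=2):
    """x: (d,), gate_w: (n_experts, d), experts: callables mapping (d,) -> (d,)."""
    scores = sigmoid(gate_w @ x)               # independent score per expert, in (0, 1)
    top = np.argsort(scores)[-top_k:]          # indices of the top_k scores
    weights = scores[top] / scores[top].sum()  # normalize over the selected experts
    routed = sum(w * routed_experts[i](x) for w, i in zip(weights, top))
    return shared_expert(x) + routed, weights  # shared expert is always active


# Toy setup: every expert is a random linear map (illustrative only).
rng = np.random.default_rng(0)
d, n_experts = 4, 6
gate_w = rng.normal(size=(n_experts, d))
expert_mats = rng.normal(size=(n_experts, d, d))
routed_experts = [lambda v, W=W: W @ v for W in expert_mats]
shared_mat = rng.normal(size=(d, d))
shared_expert = lambda v: shared_mat @ v

x = rng.normal(size=d)
y, weights = moe_forward(x, gate_w, routed_experts, shared_expert)
print(y.shape, weights.sum())
```

Unlike softmax gating, the sigmoid scores do not compete with each other before selection; the normalization is applied only to the experts that were chosen.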

3. RanDeS: Randomized Delta Superposition for Multi-Model Compression

ArXiv ID: 2505.11204

Authors: Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan

Abstract: From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.

Comment: The paper presents a novel approach to multi-model compression using randomized transformations, aligning with the core topic of model compression.

Relevance: 9 Novelty: 8
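
The compress-and-retrieve idea in the abstract above can be sketched with a toy example. Seeded random sign flips stand in here for the random orthogonal transformations (a diagonal matrix of ±1 entries is orthogonal and its own inverse); the paper's actual transformations may differ, and all names and sizes below are hypothetical.

```python
import numpy as np


def sign_vector(seed, dim):
    """Seeded random ±1 diagonal; regenerated from the seed, so nothing is stored."""
    return np.random.default_rng(seed).choice([-1.0, 1.0], size=dim)


def compress(deltas, seeds):
    """Superpose all deltas after applying each model's seeded transform."""
    dim = deltas[0].shape[0]
    return sum(sign_vector(s, dim) * d for s, d in zip(seeds, deltas))


def retrieve(superposed, seed):
    """Apply the inverse transform (the same signs): target delta plus interference."""
    return sign_vector(seed, superposed.shape[0]) * superposed


# Toy deltas that share a strong common component, i.e. they are correlated.
rng = np.random.default_rng(0)
dim, n_models = 10_000, 8
common = 2.0 * rng.normal(size=dim)
deltas = [common + rng.normal(size=dim) for _ in range(n_models)]
seeds = list(range(100, 100 + n_models))

superposed = compress(deltas, seeds)
k = 3
rand_interference = np.linalg.norm(retrieve(superposed, seeds[k]) - deltas[k])
plain_interference = np.linalg.norm(sum(deltas) - deltas[k])  # plain summation
print(rand_interference < plain_interference)
```

With plain summation the other models' deltas add up coherently, whereas after the random transforms they behave like uncorrelated noise, so the interference norm grows more slowly with the number of models.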

4. TDFormer: A Top-Down Attention-Controlled Spiking Transformer

ArXiv ID: 2505.15840

Authors: Zizheng Zhu, Yingchao Yu, Zeqi Zheng, Zhaofei Yu, Yaochu Jin

Abstract: Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks with each running for one time step, where the parameters are shared, and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to effectively represent temporal information. As a result, each time step cannot fully leverage information from previous time steps, seriously limiting the model's performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure plays a role from two perspectives: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated in different time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.

Comment: The paper introduces TDFormer, a novel spiking transformer model, aligning with the core topic of model architecture.

Relevance: 9 Novelty: 8

5. Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning

ArXiv ID: 2505.12477

Authors: Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, Randall Balestriero

Abstract: Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.

Comment: The paper compares joint embedding and reconstruction in self-supervised learning, providing insights into representation learning paradigms.

Relevance: 9 Novelty: 8

6. Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

ArXiv ID: 2505.11254

Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang

Abstract: The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average performance increase of 36 percentage points, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.

Comment: The paper proposes a method for sparse attention inference in transformers, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8

7. MINGLE: Mixtures of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging

ArXiv ID: 2505.11883

Authors: Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

Abstract: Continual model merging integrates independently fine-tuned models sequentially without access to original training data, providing a scalable and efficient solution to continual learning. However, current methods still face critical challenges, notably parameter interference among tasks and limited adaptability to evolving test distributions. The former causes catastrophic forgetting of integrated tasks, while the latter hinders effective adaptation to new tasks. To address these, we propose MINGLE, a novel framework for test-time continual model merging, which leverages test-time adaptation using a small set of unlabeled test samples from the current task to dynamically guide the merging process. MINGLE employs a mixture-of-experts architecture composed of parameter-efficient, low-rank experts, enabling efficient adaptation and improving robustness to distribution shifts. To mitigate catastrophic forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations. This suppresses activations on old task inputs and preserves model behavior on past tasks. To further balance stability and adaptability, we design an Adaptive Relaxation Strategy, which dynamically adjusts the constraint strength based on interference signals captured during test-time adaptation. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, reduces forgetting significantly, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders.

Comment: The paper introduces a novel MoE framework for continual model merging, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8
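
The abstract above describes restricting gating updates to subspaces orthogonal to prior task representations. A minimal sketch of that general idea is a null-space projection, shown below under simplifying assumptions: the gate is a single linear layer, prior-task representations are given as a small feature matrix, and the constraint is applied exactly rather than adaptively relaxed. The names and sizes are hypothetical, and this is not the paper's implementation.

```python
import numpy as np


def null_space_projector(prior_feats, rel_tol=1e-10):
    """Projector onto the orthogonal complement of the span of prior-task features.

    prior_feats: (n_samples, d) matrix of old-task representations.
    """
    _, s, vt = np.linalg.svd(prior_feats, full_matrices=False)
    basis = vt[s > rel_tol * s.max()]          # rows spanning the prior-feature subspace
    return np.eye(prior_feats.shape[1]) - basis.T @ basis


rng = np.random.default_rng(0)
d, n_experts = 16, 4
prior_feats = rng.normal(size=(5, d))          # 5 old-task inputs
P = null_space_projector(prior_feats)

gate_w = rng.normal(size=(n_experts, d))
raw_update = rng.normal(size=(n_experts, d))   # e.g. a gradient step for the gate
constrained_update = raw_update @ P            # drop components along old-task directions

old_logits_before = prior_feats @ gate_w.T
old_logits_after = prior_feats @ (gate_w + constrained_update).T
new_input = rng.normal(size=d)
new_logit_change = constrained_update @ new_input
print(np.allclose(old_logits_before, old_logits_after))
```

Because the projected update annihilates every old-task feature, gate logits on old inputs are unchanged, while inputs with a component outside that subspace can still be routed differently.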

8. Harnessing the Universal Geometry of Embeddings

ArXiv ID: 2505.12540

Authors: Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris

Abstract: We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.

Comment: The paper presents a method for translating text embeddings without paired data, relevant to representation learning and foundational research in embeddings.

Relevance: 9 Novelty: 8

9. Unsupervised Invariant Risk Minimization

ArXiv ID: 2505.12506

Authors: Yotam Norman, Ron Meir

Abstract: We propose a novel unsupervised framework for Invariant Risk Minimization (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that disentangles environment-invariant and environment-dependent latent factors. Our approach is based on a novel "unsupervised" structural causal model and supports environment-conditioned sample-generation and intervention. Empirical evaluations on a synthetic dataset and modified versions of MNIST demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.

Comment: The paper proposes an unsupervised framework for invariant risk minimization, relevant to representation learning and foundational research.

Relevance: 9 Novelty: 8

10. Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

ArXiv ID: 2505.11695

Authors: Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab

Abstract: We introduce Qronos -- a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos not only explicitly corrects errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.

Comment: Qronos is a new post-training quantization algorithm, which is relevant to model compression.

Relevance: 9 Novelty: 8
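
The abstract above mentions using the Cholesky decomposition to solve least-squares problems. The sketch below shows only that generic building block, solving the normal equations with a Cholesky factor; it is not the Qronos algorithm, and the function name and optional ridge term are hypothetical.

```python
import numpy as np


def lstsq_cholesky(X, y, ridge=0.0):
    """Solve min_w ||X w - y||^2 + ridge * ||w||^2 through the normal equations."""
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    L = np.linalg.cholesky(gram)               # gram = L @ L.T, L lower-triangular
    # np.linalg.solve is a general solver; a dedicated triangular solver
    # (e.g. scipy.linalg.solve_triangular) would exploit the structure of L.
    z = np.linalg.solve(L, X.T @ y)            # forward solve:  L z = X^T y
    return np.linalg.solve(L.T, z)             # backward solve: L^T w = z


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)

w_chol = lstsq_cholesky(X, y)
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_chol, w_ref))
```

The normal-equations route requires the Gram matrix to be positive definite, which holds when `X` has full column rank or when a positive `ridge` is added.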

11. Approximation theory for 1-Lipschitz ResNets

ArXiv ID: 2505.12003

Authors: Davide Murari, Takashi Furuya, Carola-Bibiane Schönlieb

Abstract: 1-Lipschitz neural networks are fundamental for generative modelling, inverse problems, and robust classifiers. In this paper, we focus on 1-Lipschitz residual networks (ResNets) based on explicit Euler steps of negative gradient flows and study their approximation capabilities. Leveraging the Restricted Stone-Weierstrass Theorem, we first show that these 1-Lipschitz ResNets are dense in the set of scalar 1-Lipschitz functions on any compact domain when width and depth are allowed to grow. We also show that these networks can exactly represent scalar piecewise affine 1-Lipschitz functions. We then prove a stronger statement: by inserting norm-constrained linear maps between the residual blocks, the same density holds when the hidden width is fixed. Because every layer obeys simple norm constraints, the resulting models can be trained with off-the-shelf optimisers. This paper provides the first universal approximation guarantees for 1-Lipschitz ResNets, laying a rigorous foundation for their practical use.

Comment: The paper provides universal approximation guarantees for 1-Lipschitz ResNets, contributing to model architecture analysis and theoretical insights.

Relevance: 9 Novelty: 8
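
The residual blocks described in the abstract above, explicit Euler steps of negative gradient flows, can be checked numerically in a small sketch. It assumes a ReLU activation, a weight matrix rescaled to spectral norm 1, and step size h = 1; under those assumptions the step x - h Wᵀ relu(W x + b) is the gradient step of a convex function with a 1-Lipschitz gradient, so it should be non-expansive. The parameterisation is illustrative and may differ from the paper's.

```python
import numpy as np


def residual_step(x, W, b, h=1.0):
    """One explicit Euler step of the negative gradient flow of
    g(x) = 0.5 * sum(relu(W x + b)**2), i.e. x - h * W^T relu(W x + b)."""
    return x - h * W.T @ np.maximum(W @ x + b, 0.0)


rng = np.random.default_rng(0)
d, width = 8, 32
W = rng.normal(size=(width, d))
W /= np.linalg.norm(W, 2)                      # rescale so the spectral norm is 1
b = rng.normal(size=width)

# Empirical Lipschitz ratios on random input pairs.
ratios = []
for _ in range(1000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    out_gap = np.linalg.norm(residual_step(x, W, b) - residual_step(y, W, b))
    ratios.append(out_gap / np.linalg.norm(x - y))
max_ratio = max(ratios)
print(max_ratio <= 1.0 + 1e-9)
```

Composing such steps keeps the whole network 1-Lipschitz, since a composition of non-expansive maps is non-expansive.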

12. Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

ArXiv ID: 2505.13499

Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

Abstract: We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final test loss while using 42% fewer parameters. On GPT-2, our framework achieves a 5.6% reduction in final test loss, demonstrating scalability to larger models. To the best of our knowledge, this is the first work that applies optimal control theory to both the training and architecture of Transformers. It offers a new foundation for systematic, theory-driven improvements and moves beyond costly trial-and-error approaches.

Comment: The paper applies optimal control theory to Transformers, offering theoretical insights into architecture design and training, which aligns with model architecture innovations.

Relevance: 9 Novelty: 8

13. MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

ArXiv ID: 2505.11432

Authors: Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu

Abstract: We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.

Comment: The paper presents MegaScale-MoE, a system for efficient training of MoE models, which aligns with model architecture and efficiency innovations.

Relevance: 9 Novelty: 8

14. Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis

ArXiv ID: 2505.11581

Authors: Akarsh Kumar, Jeff Clune, Joel Lehman, Kenneth O. Stanley

Abstract: Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representations? While the representational optimist assumes it must, this position paper challenges that view. We compare neural networks evolved through an open-ended search process to networks trained via conventional stochastic gradient descent (SGD) on the simple task of generating a single image. This minimal setup offers a unique advantage: each hidden neuron's full functional behavior can be easily visualized as an image, thus revealing how the network's output behavior is internally constructed neuron by neuron. The result is striking: while both networks produce the same output behavior, their internal representations differ dramatically. The SGD-trained networks exhibit a form of disorganization that we term fractured entangled representation (FER). Interestingly, the evolved networks largely lack FER, even approaching a unified factored representation (UFR). In large models, FER may be degrading core model capacities like generalization, creativity, and (continual) learning. Therefore, understanding and mitigating FER could be critical to the future of representation learning.

Comment: The paper challenges the assumption that better performance implies better internal representations, aligning with the representation learning criterion.

Relevance: 9 Novelty: 8

15. Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks

ArXiv ID: 2505.10909

Authors: Chiyue Wei, Bowen Duan, Cong Guo, Jingyang Zhang, Qingyue Song, Hai "Helen" Li, Yiran Chen

Abstract: Spiking Neural Networks (SNNs) are gaining attention for their energy efficiency and biological plausibility, utilizing 0-1 activation sparsity through spike-driven computation. While existing SNN accelerators exploit this sparsity to skip zero computations, they often overlook the unique distribution patterns inherent in binary activations. In this work, we observe that particular patterns exist in spike activations, which we can utilize to reduce the substantial computation of SNN models. Based on these findings, we propose a novel pattern-based hierarchical sparsity framework, termed Phi, to optimize computation. Phi introduces a two-level sparsity hierarchy: Level 1 exhibits vector-wise sparsity by representing activations with pre-defined patterns, allowing for offline pre-computation with weights and significantly reducing most runtime computation. Level 2 features element-wise sparsity by complementing the Level 1 matrix, using a highly sparse matrix to further reduce computation while maintaining accuracy. We present an algorithm-hardware co-design approach. Algorithmically, we employ a k-means-based pattern selection method to identify representative patterns and introduce a pattern-aware fine-tuning technique to enhance Level 2 sparsity. Architecturally, we design Phi, a dedicated hardware architecture that efficiently processes the two levels of Phi sparsity on the fly. Extensive experiments demonstrate that Phi achieves a $3.45\times$ speedup and a $4.93\times$ improvement in energy efficiency compared to state-of-the-art SNN accelerators, showcasing the effectiveness of our framework in optimizing SNN computation.

Comment: The paper proposes a framework for optimizing spiking neural networks using hierarchical sparsity, aligning with the model compression criterion.

Relevance: 9 Novelty: 8
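
The two-level decomposition described in the Phi abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the Hamming-distance pattern match, and the toy data are all assumptions made for illustration. It only shows why a binary spike vector can be written as a pre-defined pattern (whose product with the weights is computed offline) plus a sparse correction with entries in {-1, +1}.

```python
def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def matvec_row(vec, weights):
    """Dense reference product: vec (length n) times weights (n x m)."""
    cols = len(weights[0])
    return [sum(v * weights[i][j] for i, v in enumerate(vec)) for j in range(cols)]


def decompose(spikes, patterns):
    """Level 1: index of the closest pattern. Level 2: sparse residual {position: +1 or -1}."""
    idx = min(range(len(patterns)), key=lambda k: hamming(spikes, patterns[k]))
    residual = {i: s - p for i, (s, p) in enumerate(zip(spikes, patterns[idx])) if s != p}
    return idx, residual


def phi_like_product(spikes, patterns, pattern_outputs, weights):
    """Start from the pre-computed pattern output, then add or subtract the few residual rows."""
    idx, residual = decompose(spikes, patterns)
    out = list(pattern_outputs[idx])
    for i, sign in residual.items():
        for j, w in enumerate(weights[i]):
            out[j] += sign * w
    return out


# Toy data (hypothetical): two patterns, a 4x2 weight matrix.
patterns = [[1, 1, 0, 0], [0, 0, 1, 1]]
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pattern_outputs = [matvec_row(p, weights) for p in patterns]  # done offline
spikes = [1, 1, 1, 0]
result = phi_like_product(spikes, patterns, pattern_outputs, weights)
```

In this toy case the spike vector differs from the first pattern in a single position, so only one weight row is touched at runtime instead of three.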

16. Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

ArXiv ID: 2505.11004

Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych

Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amounts of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.

Comment: The paper investigates in-context learning in large-scale transformer models, providing insights into training dynamics and interpretability, relevant to representation learning and LLM behavior.

Relevance: 9 Novelty: 8

17. Training NTK to Generalize with KARE

ArXiv ID: 2505.11347

Authors: Johannes Schwab, Bryan Kelly, Semyon Malamud, Teng Andrea Xu

Abstract: The performance of the data-dependent neural tangent kernel (NTK; Jacot et al. (2018)) associated with a trained deep neural network (DNN) often matches or exceeds that of the full network. This implies that DNN training via gradient descent implicitly performs kernel learning by optimizing the NTK. In this paper, we propose instead to optimize the NTK explicitly. Rather than minimizing empirical risk, we train the NTK to minimize its generalization error using the recently developed Kernel Alignment Risk Estimator (KARE; Jacot et al. (2020)). Our simulations and real data experiments show that NTKs trained with KARE consistently match or significantly outperform the original DNN and the DNN-induced NTK (the after-kernel). These results suggest that explicitly trained kernels can outperform traditional end-to-end DNN optimization in certain settings, challenging the conventional dominance of DNNs. We argue that explicit training of NTK is a form of over-parametrized feature learning.

Comment: The paper proposes optimizing the neural tangent kernel explicitly, which aligns with representation learning and training dynamics.

Relevance: 9 Novelty: 8

18. Neural Thermodynamics I: Entropic Forces in Deep and Universal Representation Learning

ArXiv ID: 2505.12387

Authors: Liu Ziyin, Yizhou Xu, Isaac Chuang

Abstract: With the rapid discovery of emergent phenomena in deep learning and large language models, explaining and understanding their cause has become an urgent need. Here, we propose a rigorous entropic-force theory for understanding the learning dynamics of neural networks trained with stochastic gradient descent (SGD) and its variants. Building on the theory of parameter symmetries and an entropic loss landscape, we show that representation learning is crucially governed by emergent entropic forces arising from stochasticity and discrete-time updates. These forces systematically break continuous parameter symmetries and preserve discrete ones, leading to a series of gradient balance phenomena that resemble the equipartition property of thermal systems. These phenomena, in turn, (a) explain the universal alignment of neural representations between AI models and lead to a proof of the Platonic Representation Hypothesis, and (b) reconcile the seemingly contradictory observations of sharpness- and flatness-seeking behavior of deep learning optimization. Our theory and experiments demonstrate that a combination of entropic forces and symmetry breaking is key to understanding emergent phenomena in deep learning.

Comment: The paper proposes a theory for understanding learning dynamics in neural networks, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8

19. Addition is almost all you need: Compressing neural networks with double binary factorization

ArXiv ID: 2505.11076

Authors: Vladimír Boža, Vladimír Macko

Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code available at: https://github.com/usamec/double_binary

Comment: The paper introduces a novel method for model compression using Double Binary Factorization, which aligns with the model compression criterion focusing on sparsity, pruning, and quantization.

Relevance: 9 Novelty: 8
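
To make the DBF abstract's point about the intermediate dimension concrete, here is a minimal sketch. It assumes a weight matrix of shape rows x cols approximated by a (rows x mid) sign matrix times a (mid x cols) sign matrix; the placement of the three scaling vectors and all names below are assumptions for illustration, not the paper's exact formulation. The storage estimate counts only the sign bits and ignores the scaling vectors.

```python
def bits_per_weight(rows, cols, mid):
    """Sign bits stored by the two binary factors, divided by the number of original weights."""
    return (rows * mid + mid * cols) / (rows * cols)


def sign_matvec(signs, vec):
    """Multiply a +1/-1 matrix by a vector using only additions and subtractions."""
    out = []
    for row in signs:
        acc = 0.0
        for s, v in zip(row, vec):
            acc = acc + v if s > 0 else acc - v
        out.append(acc)
    return out


def dbf_like_forward(x, A, B, a, m, b):
    """Compute a * (A @ (m * (B @ (b * x)))) with elementwise scalings (assumed layout)."""
    h = sign_matvec(B, [bi * xi for bi, xi in zip(b, x)])
    h = [mi * hi for mi, hi in zip(m, h)]
    y = sign_matvec(A, h)
    return [ai * yi for ai, yi in zip(a, y)]


# Toy 2x2 example (hypothetical values).
A = [[1, -1], [1, 1]]
B = [[1, 1], [1, -1]]
y = dbf_like_forward([3.0, 1.0], A, B, a=[2.0, 1.0], m=[0.5, 0.5], b=[1.0, 1.0])
```

Under this counting, a square 4096 x 4096 layer with mid = 2048 costs about 1 bit per weight and mid = 4096 about 2 bits per weight, and any value in between is reachable by changing mid.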

20. Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

ArXiv ID: 2505.11770

Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks--including symbol manipulation, knowledge retrieval, and instruction following--we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.

Comment: The paper explores internal causal mechanisms in language models to predict out-of-distribution behaviors, offering theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8

21. InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models

ArXiv ID: 2505.11574

Authors: Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, Congkai Xie, Aofan Liu, Ming Li, Jiannong Cao, Yuan Xie, Ngai Wong, Hongxia Yang

Abstract: Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME. However, the substantial computational demands of these tasks pose significant challenges for real-world deployment. Model quantization has emerged as a promising approach to reduce memory footprint and inference latency by representing weights and activations with lower bit-widths. In this work, we conduct a comprehensive study of mainstream quantization methods (e.g., AWQ, GPTQ, SmoothQuant) on the most popular open-sourced models (e.g., Qwen2.5, LLaMA3 series), and reveal that quantization can degrade mathematical reasoning accuracy by up to 69.81%. To better understand this degradation, we develop an automated assignment and judgment pipeline that qualitatively categorizes failures into four error types and quantitatively identifies the most impacted reasoning capabilities. Building on these findings, we employ an automated data-curation pipeline to construct a compact "Silver Bullet" dataset. Training a quantized model on as few as 332 carefully selected examples for just 3-5 minutes on a single GPU is enough to restore its reasoning accuracy to match that of the full-precision baseline.

Comment: The paper focuses on quantization in LLMs, which is relevant to model compression, specifically addressing the degradation in mathematical reasoning accuracy due to quantization.

Relevance: 9 Novelty: 7

22. An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts

ArXiv ID: 2505.11924

Authors: Yu-Ting Lee, Hui-Ying Shih, Fu-Chieh Chang, Pei-Yuan Wu

Abstract: We provide an explanation for the performance gains of intrinsic self-correction, a process where a language model iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building around this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap in the inner products of the prompt-induced shifts and the unembeddings of the top-100 most toxic tokens vs. those of the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance a language model's capability of latent concept recognition. Our analysis offers insights into the underlying mechanism of self-correction by characterizing how prompting works explainably. For reproducibility, our code is available.

Comment: The paper provides an explanation for intrinsic self-correction in language models, which is relevant to understanding LLM behavior and interpretability.

Relevance: 9 Novelty: 7

23. Dynamic Base model Shift for Delta Compression

ArXiv ID: 2505.11344

Authors: Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang

Abstract: Transformer-based models with the pretrain-finetune paradigm bring about significant progress, along with the heavy storage and deployment costs of finetuned models on multiple tasks. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights) through pruning or quantization. However, existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task, which may cause significant performance degradation, especially when the compression rate is extremely high. To tackle this issue, we investigate the impact of different base models on the performance of delta compression and find that the pre-trained base model can hardly be optimal. To this end, we propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression. Specifically, we adjust two parameters, which respectively determine the magnitude of the base model shift and the overall scale of delta compression, to boost the compression performance on each task. Through low-cost learning of these two parameters, our DBMS can maintain most of the finetuned model's performance even under an extremely high compression ratio setting, significantly surpassing existing methods. Moreover, our DBMS is orthogonal and can be integrated with a variety of other methods, and it has been evaluated across different types of models including language, vision transformer, and multi-modal models.

Comment: The paper discusses delta compression in transformer models, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 7
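
For readers unfamiliar with the baseline that the DBMS abstract starts from, the following sketch shows plain delta compression by magnitude pruning: store only the largest entries of (finetuned - base) and add them back to the base weights at load time. It does not implement the paper's base model shift or its two learned parameters; the function names and the flat-list weight layout are assumptions for illustration.

```python
def compress_delta(finetuned, base, keep):
    """Keep the `keep` largest-magnitude entries of the delta as a sparse {index: value} map."""
    delta = [f - b for f, b in zip(finetuned, base)]
    top = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:keep]
    return {i: delta[i] for i in top}


def reconstruct(base, sparse_delta):
    """Approximate the finetuned weights from the shared base plus the sparse delta."""
    out = list(base)
    for i, d in sparse_delta.items():
        out[i] += d
    return out


# Toy weights (hypothetical values).
base = [1.0, 2.0, 3.0, 4.0]
finetuned = [1.5, 2.0, 2.0, 4.25]
sparse = compress_delta(finetuned, base, keep=2)
approx = reconstruct(base, sparse)
```

Here the smallest nonzero change (0.25 on the last weight) is dropped, which is the kind of loss that grows as the compression ratio increases.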

24. On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms

ArXiv ID: 2505.11183

Authors: Jacob Trauger, Ambuj Tewari

Abstract: Probabilistic next-token prediction trained using cross-entropy loss is the basis of most large language models. Given a sequence of previous values, next-token prediction assigns a probability to each possible next value in the vocabulary. There are many ways to use next-token prediction to output token sequences. This paper examines a few of these algorithms (greedy, lookahead, random sampling, and temperature-scaled random sampling) and studies their consistency with respect to various goals encoded as loss functions. Although consistency of surrogate losses with respect to a target loss function is a well researched topic, we are the first to study it in the context of LLMs (to the best of our knowledge). We find that, so long as next-token prediction converges to its true probability distribution, random sampling is consistent with outputting sequences that mimic sampling from the true probability distribution. For the other goals, such as minimizing the 0-1 loss on the entire sequence, we show no polynomial-time algorithm is optimal for all probability distributions and all decoding algorithms studied are only optimal for a subset of probability distributions. When analyzing these results, we see that there is a dichotomy created between the goals of information retrieval and creative generation for the decoding algorithms. This shows that choosing the correct decoding algorithm based on the desired goal is extremely important and many of the ones used are lacking theoretical grounding in numerous scenarios.

Comment: The paper studies next-token prediction in LLMs, relevant to foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 7
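
Three of the decoding rules named in the abstract above (greedy, random sampling, and temperature-scaled random sampling) can be sketched for a single next-token distribution as follows. The temperature rule used here, raising each probability to the power 1/T and renormalizing, is the common definition and is assumed rather than taken from the paper; lookahead decoding is omitted.

```python
import random


def greedy(probs):
    """Index of the most probable next token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def temperature_scale(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a distribution: p_i^(1/T), renormalized."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def sample(probs, rng, temperature=1.0):
    """Draw one token index; temperature 1.0 is plain random sampling."""
    scaled = temperature_scale(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


# Toy next-token distribution over a 3-token vocabulary (hypothetical values).
probs = [0.2, 0.5, 0.3]
rng = random.Random(0)
token = sample(probs, rng, temperature=0.5)
```

With temperature 1.0 the sampler draws from the model's own distribution, which is the setting the abstract identifies as consistent with mimicking the true distribution. Lower temperatures move the behavior toward greedy decoding.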

25. MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

ArXiv ID: 2505.11415

Authors: Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai

Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.

Comment: The paper introduces MoE-CAP, a benchmark for sparse MoE systems, which aligns with model architecture and efficiency innovations.

Relevance: 9 Novelty: 7

26. Model Merging in Pre-training of Large Language Models

ArXiv ID: 2505.12082

Authors: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu

Abstract: Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.

Comment: The paper investigates model merging techniques in the pre-training of large language models, covering both dense and Mixture-of-Experts (MoE) architectures, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 7
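
As a rough illustration of what merging checkpoints means in the abstract above, the sketch below averages the parameters of several checkpoints, with optional per-checkpoint weights. The paper studies several merging strategies and hyperparameters; this uniform or weighted average, the function name, and the dict-of-lists checkpoint layout are assumptions chosen only to show the basic operation.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of checkpoints, each a {parameter_name: list of floats} dict."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("need one weight per checkpoint")
    merged = {}
    for name, values in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(values))
        ]
    return merged


# Two toy checkpoints from the same run (hypothetical values).
ckpt_a = {"layer.weight": [1.0, 3.0]}
ckpt_b = {"layer.weight": [3.0, 5.0]}
merged = merge_checkpoints([ckpt_a, ckpt_b])
```

Passing explicit weights, for example to favor later checkpoints, is one of the knobs a merging strategy could vary.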

27. Foundation model for mass spectrometry proteomics

ArXiv ID: 2505.10848

Authors: Justin Sanders, Melih Yilmaz, Jacob H. Russell, Wout Bittremieux, William E. Fondrie, Nicholas M. Riley, Sewoong Oh, William Stafford Noble

Abstract: Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.

Comment: The paper proposes a foundation model for mass spectrometry proteomics, which is relevant to AI for science and foundational model research.

Relevance: 8 Novelty: 8

28. Redefining Neural Operators in $d+1$ Dimensions

ArXiv ID: 2505.11766

Authors: Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

Abstract: Neural Operators have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely validated on universally approximating various operators. Although recent advancements following this definition have developed effective modules to better approximate the kernel function defined on the original domain (with $d$ dimensions, $d=1, 2, 3...$), the unclarified evolving mechanism in the embedding spaces blocks our view to design neural operators that can fully capture the target system evolution. Drawing on recent breakthroughs in quantum simulation of partial differential equations (PDEs), we elucidate the linear evolution process in neural operators. Based on that, we redefine neural operators on a new $d+1$ dimensional domain. Within this framework, we implement our proposed Schrödingerised Kernel Neural Operator (SKNO) aligning better with the $d+1$ dimensional evolution. In experiments, our $d+1$ dimensional evolving linear block performs far better than others. Also, we test SKNO's SOTA performance on various benchmark tests and also the zero-shot super-resolution task. In addition, we analyse the impact of different lifting and recovering operators on the prediction within the redefined NO framework, reflecting the alignment between our model and the underlying $d+1$ dimensional evolution.

Comment: The paper redefines neural operators in a new dimensional framework, which is relevant to emerging trends in representation learning and model architecture.

Relevance: 8 Novelty: 8

29. SpikeX: Exploring Accelerator Architecture and Network-Hardware Co-Optimization for Sparse Spiking Neural Networks

ArXiv ID: 2505.12292

Authors: Boxun Xu, Richard Boone, Peng Li

Abstract: Spiking Neural Networks (SNNs) are promising biologically plausible models of computation which utilize a spiking binary activation function similar to that of biological neurons. SNNs are well positioned to process spatiotemporal data, and are advantageous in ultra-low power and real-time processing. Despite a large body of work on conventional artificial neural network accelerators, much less attention has been given to efficient SNN hardware accelerator design. In particular, SNNs exhibit inherent unstructured spatial and temporal firing sparsity, an opportunity yet to be fully explored for great hardware processing efficiency. In this work, we propose a novel systolic-array SNN accelerator architecture, called SpikeX, to take on the challenges and opportunities stemming from unstructured sparsity while taking into account the unique characteristics of spike-based computation. By developing an efficient dataflow targeting expensive multi-bit weight data movements, SpikeX reduces memory access and increases data sharing and hardware utilization for computations spanning across both time and space, thereby significantly improving energy efficiency and inference latency. Furthermore, recognizing the importance of SNN network and hardware co-design, we develop a co-optimization methodology facilitating not only hardware-aware SNN training but also hardware accelerator architecture search, allowing joint network weight parameter optimization and accelerator architectural reconfiguration. This end-to-end network/accelerator co-design approach offers a significant reduction of 15.1x-150.87x in energy-delay-product (EDP) without compromising model accuracy.

Comment: The paper proposes a novel SNN accelerator architecture, relevant to model architecture and efficiency improvements.

Relevance: 8 Novelty: 8

30. SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache

ArXiv ID: 2505.10951

Authors: Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Cheng Long, Jie Zhang

Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, enabling more accurate and context-aware reasoning. We observe that for different queries, it could retrieve similar subgraphs as prompts, and thus we propose SubGCache, which aims to reduce inference latency by reusing computation across queries with similar structural prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of the representative subgraph. For each query with its retrieved subgraph within a cluster, it reuses the pre-computed KV cache of the representative subgraph of the cluster without computing the KV tensors again for saving computation. Experiments on two new datasets across multiple LLM backbones and graph-based RAG frameworks demonstrate that SubGCache consistently reduces inference latency with comparable and even improved generation quality, achieving up to 6.68$\times$ reduction in time-to-first-token (TTFT).

Comment: The paper introduces SubGCache, which is related to model compression through KV cache optimization, aligning with the core topic of model compression.