Previous Day 2025-05-16
Monthly Overview 2025-05
Next Day 2025-05-20

This is a remedial run for missed papers from 05/16/2025 to 05/18/2025.

Results generated on 05/26/2025.

Personalized Daily ArXiv Papers 2025-05-19

[gpt-4o] Prompt Completion Total
Token 93333 12929 106262
Cost $0.23 $0.13 $0.36

Total arXiv papers: 889

Total scanned papers: 889

Total relevant papers: 78

Table of contents with paper titles:

  1. Search-Based Correction of Reasoning Chains for Language Models Authors: Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio

  2. On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

  3. RanDeS: Randomized Delta Superposition for Multi-Model Compression Authors: Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan

  4. TDFormer: A Top-Down Attention-Controlled Spiking Transformer Authors: Zizheng Zhu, Yingchao Yu, Zeqi Zheng, Zhaofei Yu, Yaochu Jin

  5. Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning Authors: Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, Randall Balestriero

  6. Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang

  7. MINGLE: Mixtures of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging Authors: Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

  8. Harnessing the Universal Geometry of Embeddings Authors: Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris

  9. Unsupervised Invariant Risk Minimization Authors: Yotam Norman, Ron Meir

  10. Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization Authors: Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab

  11. Approximation theory for 1-Lipschitz ResNets Authors: Davide Murari, Takashi Furuya, Carola-Bibiane Schönlieb

  12. Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

  13. MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production Authors: Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu

  14. Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis Authors: Akarsh Kumar, Jeff Clune, Joel Lehman, Kenneth O. Stanley

  15. Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks Authors: Chiyue Wei, Bowen Duan, Cong Guo, Jingyang Zhang, Qingyue Song, Hai "Helen" Li, Yiran Chen

  16. Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych

  17. Training NTK to Generalize with KARE Authors: Johannes Schwab, Bryan Kelly, Semyon Malamud, Teng Andrea Xu

  18. Neural Thermodynamics I: Entropic Forces in Deep and Universal Representation Learning Authors: Liu Ziyin, Yizhou Xu, Isaac Chuang

  19. Addition is almost all you need: Compressing neural networks with double binary factorization Authors: Vladimír Boža, Vladimír Macko

  20. Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

  21. InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models Authors: Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, Congkai Xie, Aofan Liu, Ming Li, Jiannong Cao, Yuan Xie, Ngai Wong, Hongxia Yang

  22. An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts Authors: Yu-Ting Lee, Hui-Ying Shih, Fu-Chieh Chang, Pei-Yuan Wu

  23. Dynamic Base model Shift for Delta Compression Authors: Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang

  24. On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms Authors: Jacob Trauger, Ambuj Tewari

  25. MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems Authors: Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai

  26. Model Merging in Pre-training of Large Language Models Authors: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu

  27. Foundation model for mass spectrometry proteomics Authors: Justin Sanders, Melih Yilmaz, Jacob H. Russell, Wout Bittremieux, William E. Fondrie, Nicholas M. Riley, Sewoong Oh, William Stafford Noble

  28. Redefining Neural Operators in $d+1$ Dimensions Authors: Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

  29. SpikeX: Exploring Accelerator Architecture and Network-Hardware Co-Optimization for Sparse Spiking Neural Networks Authors: Boxun Xu, Richard Boone, Peng Li

  30. SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache Authors: Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Cheng Long, Jie Zhang

  31. Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures Authors: Zhiheng Chen, Ruofan Wu, Guanhua Fang

  32. SepPrune: Structured Pruning for Efficient Deep Speech Separation Authors: Yuqi Li, Kai Li, Xin Yin, Zhifei Yang, Junhao Dong, Zeyu Dong, Chuanguang Yang, Yingli Tian, Yao Lu

  33. Memory-Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation Authors: Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang

  34. LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades Authors: Yanan Li, Fanxu Meng, Muhan Zhang, Shiai Zhu, Shangguang Wang, Mengwei Xu

  35. AdaDim: Dimensionality Adaptation for SSL Representational Dynamics Authors: Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

  36. Structured Representation Authors: Arun Kumar, Paul Schrater

  37. SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization Authors: Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang

  38. When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability Authors: Alberto Bassi, Carlo Albert, Aurelien Lucchi, Marco Baity-Jesi, Emanuele Francazi

  39. What Can We Learn From MIMO Graph Convolutions? Authors: Andreas Roth, Thomas Liebig

  40. Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum Authors: Wenquan Lu, Jiaqi Zhang, Hugues Van Assel, Randall Balestriero

  41. A Local Polyak-Lojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models Authors: Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, Rene Vidal

  42. Attention on the Sphere Authors: Boris Bonev, Max Rietmann, Andrea Paris, Alberto Carpentieri, Thorsten Kurth

  43. Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers Authors: Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff

  44. Adaptive parameter-efficient fine-tuning via Hessian-informed subset selection Authors: Shiyun Xu, Zhiqi Bu

  45. SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization Authors: Haodong Yang, Lei Wang, Md Zakir Hossain

  46. Revisiting Stochastic Approximation and Stochastic Gradient Descent Authors: Rajeeva Laxman Karandikar, Bhamidi Visweswara Rao, Mathukumalli Vidyasagar

  47. Flash Invariant Point Attention Authors: Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann

  48. Where You Place the Norm Matters: From Prejudiced to Neutral Initializations Authors: Emanuele Francazi, Francesco Pinto, Aurelien Lucchi, Marco Baity-Jesi

  49. CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs Authors: David Méndez, Gianpaolo Bontempo, Elisa Ficarra, Roberto Confalonieri, Natalia Díaz-Rodríguez

  50. Efficient Optimization with Orthogonality Constraint: a Randomized Riemannian Submanifold Method Authors: Andi Han, Pierre-Louis Poirion, Akiko Takeda

  51. Understanding Nonlinear Implicit Bias via Region Counts in Input Space Authors: Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang

  52. Efficient Federated Class-Incremental Learning of Pre-Trained Models via Task-agnostic Low-rank Residual Adaptation Authors: Feng Yu, Jia Hu, Geyong Min

  53. FlashBias: Fast Computation of Attention with Bias Authors: Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long

  54. Adversarially Robust Spiking Neural Networks with Sparse Connectivity Authors: Mathias Schmolli, Maximilian Baronig, Robert Legenstein, Ozan Özdenizci

  55. Relational Graph Transformer Authors: Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec

  56. Hyperbolic Residual Quantization: Discrete Representations for Data with Latent Hierarchies Authors: Piotr Piękos, Subhradeep Kayal, Alexandros Karatzoglou

  57. MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection Authors: Pouya Shaeri, Ariane Middel

  58. PoE-World: Compositional World Modeling with Products of Programmatic Experts Authors: Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis

  59. AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections Authors: Xin Yu, Yujia Wang, Jinghui Chen, Lingzhou Xue

  60. Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations Authors: Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths

  61. SchoenbAt: Rethinking Attention with Polynomial basis Authors: Yuhan Guo, Lizhong Ding, Yuwan Yang, Xuewei Guo

  62. S-Crescendo: A Nested Transformer Weaving Framework for Scalable Nonlinear System in S-Domain Representation Authors: Junlang Huang, Hao Chen, Li Luo, Yong Cai, Lexin Zhang, Tianhao Ma, Yitian Zhang, Zhong Guan

  63. On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm Authors: Huan Li, Yiming Dong, Zhouchen Lin

  64. Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD Authors: Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, Christoph H. Lampert

  65. Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian

  66. PhiNet v2: A Mask-Free Brain-Inspired Vision Foundation Model from Video Authors: Makoto Yamada, Kian Ming A. Chai, Ayoub Rhim, Satoki Ishikawa, Mohammad Sabokrou, Yao-Hung Hubert Tsai

  67. msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML Authors: Zhaolan Huang, Emmanuel Baccelli

  68. MergeBench: A Benchmark for Merging Domain-Specialized LLMs Authors: Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao

  69. SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen

  70. A Classical View on Benign Overfitting: The Role of Sample Size Authors: Junhyung Park, Patrick Bloebaum, Shiva Prasad Kasiviswanathan

  71. Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets Authors: Ahmet Bilican, M. Akın Yılmaz, A. Murat Tekalp, R. Gökberk Cinbiş

  72. WaLRUS: Wavelets for Long-range Representation Using SSMs Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk

  73. SAINT: Attention-Based Modeling of Sub-Action Dependencies in Multi-Action Policies Authors: Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab

  74. STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference Authors: Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang

  75. Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables Authors: Yu Gui, Cong Ma, Zongming Ma

  76. Graph Representational Learning: When Does More Expressivity Hurt Generalization? Authors: Sohir Maskey, Raffaele Paolino, Fabian Jogl, Gitta Kutyniok, Johannes F. Lutzeyer

  77. Do different prompting methods yield a common task representation in language models? Authors: Guy Davidson, Todd M. Gureckis, Brenden M. Lake, Adina Williams

  78. On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study Authors: Shuai Yang, Qi Yang, Luoxi Tang, Jeremy Blackburn, Zhaohan Xi


1. Search-Based Correction of Reasoning Chains for Language Models

ArXiv ID: 2505.11824

Authors: Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio

Abstract: Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity, enabling modeling of all possible truth assignments rather than assuming correctness throughout. To efficiently explore this expanded space, we introduce Search Corrector, a discrete search algorithm over boolean-valued veracity assignments. It efficiently performs otherwise intractable inference in the posterior distribution over veracity assignments by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time correction method facilitates supervised fine-tuning of an Amortized Corrector by providing pseudo-labels for veracity. The Amortized Corrector generalizes self-correction, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that Search Corrector reliably identifies errors in logical (ProntoQA) and mathematical reasoning (GSM8K) benchmarks. The Amortized Corrector achieves comparable zero-shot accuracy and improves final answer accuracy by up to 25%.

Comment: Author match


2. On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

ArXiv ID: 2505.10860

Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Abstract: Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.

Comment: The paper provides a theoretical study of MoE architectures, specifically focusing on shared experts and normalized sigmoid gating, which is highly relevant to model architecture.

Relevance: 10 Novelty: 8


3. RanDeS: Randomized Delta Superposition for Multi-Model Compression

ArXiv ID: 2505.11204

Authors: Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan

Abstract: From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.

Comment: The paper presents a novel approach to multi-model compression using randomized transformations, aligning with the core topic of model compression.

Relevance: 9 Novelty: 8


4. TDFormer: A Top-Down Attention-Controlled Spiking Transformer

ArXiv ID: 2505.15840

Authors: Zizheng Zhu, Yingchao Yu, Zeqi Zheng, Zhaofei Yu, Yaochu Jin

Abstract: Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks with each running for one time step, where the parameters are shared, and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to effectively represent temporal information. As a result, each time step cannot fully leverage information from previous time steps, seriously limiting the model's performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure plays a role from two perspectives: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated in different time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.

Comment: The paper introduces TDFormer, a novel spiking transformer model, aligning with the core topic of model architecture.

Relevance: 9 Novelty: 8


5. Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning

ArXiv ID: 2505.12477

Authors: Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, Randall Balestriero

Abstract: Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.

Comment: The paper compares joint embedding and reconstruction in self-supervised learning, providing insights into representation learning paradigms.

Relevance: 9 Novelty: 8


6. Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

ArXiv ID: 2505.11254

Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang

Abstract: The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36%pt performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.

Comment: The paper proposes a method for sparse attention inference in transformers, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


7. MINGLE: Mixtures of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging

ArXiv ID: 2505.11883

Authors: Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

Abstract: Continual model merging integrates independently fine-tuned models sequentially without access to original training data, providing a scalable and efficient solution to continual learning. However, current methods still face critical challenges, notably parameter interference among tasks and limited adaptability to evolving test distributions. The former causes catastrophic forgetting of integrated tasks, while the latter hinders effective adaptation to new tasks. To address these, we propose MINGLE, a novel framework for test-time continual model merging, which leverages test-time adaptation using a small set of unlabeled test samples from the current task to dynamically guide the merging process. MINGLE employs a mixture-of-experts architecture composed of parameter-efficient, low-rank experts, enabling efficient adaptation and improving robustness to distribution shifts. To mitigate catastrophic forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations. This suppresses activations on old task inputs and preserves model behavior on past tasks. To further balance stability and adaptability, we design an Adaptive Relaxation Strategy, which dynamically adjusts the constraint strength based on interference signals captured during test-time adaptation. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, reduces forgetting significantly, and consistently surpasses previous state-of-the-art methods by 7-9\% on average across diverse task orders.

Comment: The paper introduces a novel MoE framework for continual model merging, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


8. Harnessing the Universal Geometry of Embeddings

ArXiv ID: 2505.12540

Authors: Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris

Abstract: We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.

Comment: The paper presents a method for translating text embeddings without paired data, relevant to representation learning and foundational research in embeddings.

Relevance: 9 Novelty: 8


9. Unsupervised Invariant Risk Minimization

ArXiv ID: 2505.12506

Authors: Yotam Norman, Ron Meir

Abstract: We propose a novel unsupervised framework for \emph{Invariant Risk Minimization} (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that disentangles environment-invariant and environment-dependent latent factors. Our approach is based on a novel ``unsupervised'' structural causal model and supports environment-conditioned sample-generation and intervention. Empirical evaluations on synthetic dataset and modified versions of MNIST demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.

Comment: The paper proposes an unsupervised framework for invariant risk minimization, relevant to representation learning and foundational research.

Relevance: 9 Novelty: 8


10. Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

ArXiv ID: 2505.11695

Authors: Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab

Abstract: We introduce Qronos -- a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos not only explicitly corrects errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.

Comment: Qronos is a new post-training quantization algorithm, which is relevant to model compression.

Relevance: 9 Novelty: 8


11. Approximation theory for 1-Lipschitz ResNets

ArXiv ID: 2505.12003

Authors: Davide Murari, Takashi Furuya, Carola-Bibiane Schönlieb

Abstract: 1-Lipschitz neural networks are fundamental for generative modelling, inverse problems, and robust classifiers. In this paper, we focus on 1-Lipschitz residual networks (ResNets) based on explicit Euler steps of negative gradient flows and study their approximation capabilities. Leveraging the Restricted Stone-Weierstrass Theorem, we first show that these 1-Lipschitz ResNets are dense in the set of scalar 1-Lipschitz functions on any compact domain when width and depth are allowed to grow. We also show that these networks can exactly represent scalar piecewise affine 1-Lipschitz functions. We then prove a stronger statement: by inserting norm-constrained linear maps between the residual blocks, the same density holds when the hidden width is fixed. Because every layer obeys simple norm constraints, the resulting models can be trained with off-the-shelf optimisers. This paper provides the first universal approximation guarantees for 1-Lipschitz ResNets, laying a rigorous foundation for their practical use.

Comment: The paper provides universal approximation guarantees for 1-Lipschitz ResNets, contributing to model architecture analysis and theoretical insights.

Relevance: 9 Novelty: 8


12. Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

ArXiv ID: 2505.13499

Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

Abstract: We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final test loss while using 42% fewer parameters. On GPT-2, our framework achieves a 5.6% reduction in final test loss, demonstrating scalability to larger models. To the best of our knowledge, this is the first work that applies optimal control theory to both the training and architecture of Transformers. It offers a new foundation for systematic, theory-driven improvements and moves beyond costly trial-and-error approaches.

Comment: The paper applies optimal control theory to Transformers, offering theoretical insights into architecture design and training, which aligns with model architecture innovations.

Relevance: 9 Novelty: 8


13. MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

ArXiv ID: 2505.11432

Authors: Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu

Abstract: We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.

Comment: The paper presents MegaScale-MoE, a system for efficient training of MoE models, which aligns with model architecture and efficiency innovations.

Relevance: 9 Novelty: 8


14. Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis

ArXiv ID: 2505.11581

Authors: Akarsh Kumar, Jeff Clune, Joel Lehman, Kenneth O. Stanley

Abstract: Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representations? While the representational optimist assumes it must, this position paper challenges that view. We compare neural networks evolved through an open-ended search process to networks trained via conventional stochastic gradient descent (SGD) on the simple task of generating a single image. This minimal setup offers a unique advantage: each hidden neuron's full functional behavior can be easily visualized as an image, thus revealing how the network's output behavior is internally constructed neuron by neuron. The result is striking: while both networks produce the same output behavior, their internal representations differ dramatically. The SGD-trained networks exhibit a form of disorganization that we term fractured entangled representation (FER). Interestingly, the evolved networks largely lack FER, even approaching a unified factored representation (UFR). In large models, FER may be degrading core model capacities like generalization, creativity, and (continual) learning. Therefore, understanding and mitigating FER could be critical to the future of representation learning.

Comment: The paper challenges the assumption that better performance implies better internal representations, aligning with the representation learning criterion.

Relevance: 9 Novelty: 8


15. Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks

ArXiv ID: 2505.10909

Authors: Chiyue Wei, Bowen Duan, Cong Guo, Jingyang Zhang, Qingyue Song, Hai "Helen" Li, Yiran Chen

Abstract: Spiking Neural Networks (SNNs) are gaining attention for their energy efficiency and biological plausibility, utilizing 0-1 activation sparsity through spike-driven computation. While existing SNN accelerators exploit this sparsity to skip zero computations, they often overlook the unique distribution patterns inherent in binary activations. In this work, we observe that particular patterns exist in spike activations, which we can utilize to reduce the substantial computation of SNN models. Based on these findings, we propose a novel \textbf{pattern-based hierarchical sparsity} framework, termed \textbf{\textit{Phi}}, to optimize computation. \textit{Phi} introduces a two-level sparsity hierarchy: Level 1 exhibits vector-wise sparsity by representing activations with pre-defined patterns, allowing for offline pre-computation with weights and significantly reducing most runtime computation. Level 2 features element-wise sparsity by complementing the Level 1 matrix, using a highly sparse matrix to further reduce computation while maintaining accuracy. We present an algorithm-hardware co-design approach. Algorithmically, we employ a k-means-based pattern selection method to identify representative patterns and introduce a pattern-aware fine-tuning technique to enhance Level 2 sparsity. Architecturally, we design \textbf{\textit{Phi}}, a dedicated hardware architecture that efficiently processes the two levels of \textit{Phi} sparsity on the fly. Extensive experiments demonstrate that \textit{Phi} achieves a $3.45\times$ speedup and a $4.93\times$ improvement in energy efficiency compared to state-of-the-art SNN accelerators, showcasing the effectiveness of our framework in optimizing SNN computation.

Comment: The paper proposes a framework for optimizing spiking neural networks using hierarchical sparsity, aligning with the model compression criterion.

Relevance: 9 Novelty: 8


16. Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

ArXiv ID: 2505.11004

Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych

Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amount of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.

Comment: The paper investigates in-context learning in large-scale transformer models, providing insights into training dynamics and interpretability, relevant to representation learning and LLM behavior.

Relevance: 9 Novelty: 8


17. Training NTK to Generalize with KARE

ArXiv ID: 2505.11347

Authors: Johannes Schwab, Bryan Kelly, Semyon Malamud, Teng Andrea Xu

Abstract: The performance of the data-dependent neural tangent kernel (NTK; Jacot et al. (2018)) associated with a trained deep neural network (DNN) often matches or exceeds that of the full network. This implies that DNN training via gradient descent implicitly performs kernel learning by optimizing the NTK. In this paper, we propose instead to optimize the NTK explicitly. Rather than minimizing empirical risk, we train the NTK to minimize its generalization error using the recently developed Kernel Alignment Risk Estimator (KARE; Jacot et al. (2020)). Our simulations and real data experiments show that NTKs trained with KARE consistently match or significantly outperform the original DNN and the DNN- induced NTK (the after-kernel). These results suggest that explicitly trained kernels can outperform traditional end-to-end DNN optimization in certain settings, challenging the conventional dominance of DNNs. We argue that explicit training of NTK is a form of over-parametrized feature learning.

Comment: The paper proposes optimizing the neural tangent kernel explicitly, which aligns with representation learning and training dynamics.

Relevance: 9 Novelty: 8


18. Neural Thermodynamics I: Entropic Forces in Deep and Universal Representation Learning

ArXiv ID: 2505.12387

Authors: Liu Ziyin, Yizhou Xu, Isaac Chuang

Abstract: With the rapid discovery of emergent phenomena in deep learning and large language models, explaining and understanding their cause has become an urgent need. Here, we propose a rigorous entropic-force theory for understanding the learning dynamics of neural networks trained with stochastic gradient descent (SGD) and its variants. Building on the theory of parameter symmetries and an entropic loss landscape, we show that representation learning is crucially governed by emergent entropic forces arising from stochasticity and discrete-time updates. These forces systematically break continuous parameter symmetries and preserve discrete ones, leading to a series of gradient balance phenomena that resemble the equipartition property of thermal systems. These phenomena, in turn, (a) explain the universal alignment of neural representations between AI models and lead to a proof of the Platonic Representation Hypothesis, and (b) reconcile the seemingly contradictory observations of sharpness- and flatness-seeking behavior of deep learning optimization. Our theory and experiments demonstrate that a combination of entropic forces and symmetry breaking is key to understanding emergent phenomena in deep learning.

Comment: The paper proposes a theory for understanding learning dynamics in neural networks, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8


19. Addition is almost all you need: Compressing neural networks with double binary factorization

ArXiv ID: 2505.11076

Authors: Vladimír Boža, Vladimír Macko

Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code available at: https://github.com/usamec/double_binary

Comment: The paper introduces a novel method for model compression using Double Binary Factorization, which aligns with the model compression criterion focusing on sparsity, pruning, and quantization.

Relevance: 9 Novelty: 8


20. Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

ArXiv ID: 2505.11770

Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks--including symbol manipulation, knowledge retrieval, and instruction following--we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.

Comment: The paper explores internal causal mechanisms in language models to predict out-of-distribution behaviors, offering theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8


21. InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models

ArXiv ID: 2505.11574

Authors: Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, Congkai Xie, Aofan Liu, Ming Li, Jiannong Cao, Yuan Xie, Ngai Wong, Hongxia Yang

Abstract: Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME. However, the substantial computational demands of these tasks pose significant challenges for real-world deployment. Model quantization has emerged as a promising approach to reduce memory footprint and inference latency by representing weights and activations with lower bit-widths. In this work, we conduct a comprehensive study of mainstream quantization methods(e.g., AWQ, GPTQ, SmoothQuant) on the most popular open-sourced models (e.g., Qwen2.5, LLaMA3 series), and reveal that quantization can degrade mathematical reasoning accuracy by up to 69.81%. To better understand this degradation, we develop an automated assignment and judgment pipeline that qualitatively categorizes failures into four error types and quantitatively identifies the most impacted reasoning capabilities. Building on these findings, we employ an automated data-curation pipeline to construct a compact "Silver Bullet" datasets. Training a quantized model on as few as 332 carefully selected examples for just 3-5 minutes on a single GPU is enough to restore its reasoning accuracy to match that of the full-precision baseline.

Comment: The paper focuses on quantization in LLMs, which is relevant to model compression, specifically addressing the degradation in mathematical reasoning accuracy due to quantization.

Relevance: 9 Novelty: 7


22. An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts

ArXiv ID: 2505.11924

Authors: Yu-Ting Lee, Hui-Ying Shih, Fu-Chieh Chang, Pei-Yuan Wu

Abstract: We provide an explanation for the performance gains of intrinsic self-correction, a process where a language model iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building around this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap in the inner products of the prompt-induced shifts and the unembeddings of the top-100 most toxic tokens vs. those of the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance a language model's capability of latent concept recognition. Our analysis offers insights into the underlying mechanism of self-correction by characterizing how prompting works explainably. For reproducibility, our code is available.

Comment: The paper provides an explanation for intrinsic self-correction in language models, which is relevant to understanding LLM behavior and interpretability.

Relevance: 9 Novelty: 7


23. Dynamic Base model Shift for Delta Compression

ArXiv ID: 2505.11344

Authors: Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang

Abstract: Transformer-based models with the pretrain-finetune paradigm bring about significant progress, along with the heavy storage and deployment costs of finetuned models on multiple tasks. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights) through pruning or quantization. However, existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task, which may causes significant performance degradation, especially when the compression rate is extremely high. To tackle this issue, we investigate the impact of different base models on the performance of delta compression and find that the pre-trained base model can hardly be optimal. To this end, we propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression. Specifically, we adjust two parameters, which respectively determine the magnitude of the base model shift and the overall scale of delta compression, to boost the compression performance on each task. Through low-cost learning of these two parameters, our DBMS can maintain most of the finetuned model's performance even under an extremely high compression ratio setting, significantly surpassing existing methods. Moreover, our DBMS is orthogonal and can be integrated with a variety of other methods, and it has been evaluated across different types of models including language, vision transformer, and multi-modal models.

Comment: The paper discusses delta compression in transformer models, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 7


24. On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms

ArXiv ID: 2505.11183

Authors: Jacob Trauger, Ambuj Tewari

Abstract: Probabilistic next-token prediction trained using cross-entropy loss is the basis of most large language models. Given a sequence of previous values, next-token prediction assigns a probability to each possible next value in the vocabulary. There are many ways to use next-token prediction to output token sequences. This paper examines a few of these algorithms (greedy, lookahead, random sampling, and temperature-scaled random sampling) and studies their consistency with respect to various goals encoded as loss functions. Although consistency of surrogate losses with respect to a target loss function is a well researched topic, we are the first to study it in the context of LLMs (to the best of our knowledge). We find that, so long as next-token prediction converges to its true probability distribution, random sampling is consistent with outputting sequences that mimic sampling from the true probability distribution. For the other goals, such as minimizing the 0-1 loss on the entire sequence, we show no polynomial-time algorithm is optimal for all probability distributions and all decoding algorithms studied are only optimal for a subset of probability distributions. When analyzing these results, we see that there is a dichotomy created between the goals of information retrieval and creative generation for the decoding algorithms. This shows that choosing the correct decoding algorithm based on the desired goal is extremely important and many of the ones used are lacking theoretical grounding in numerous scenarios.

Comment: The paper studies next-token prediction in LLMs, relevant to foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 7


25. MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

ArXiv ID: 2505.11415

Authors: Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai

Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third-a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics-Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU)-to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.

Comment: The paper introduces MoE-CAP, a benchmark for sparse MoE systems, which aligns with model architecture and efficiency innovations.

Relevance: 9 Novelty: 7


26. Model Merging in Pre-training of Large Language Models

ArXiv ID: 2505.12082

Authors: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu

Abstract: Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.

Comment: The paper investigates model merging techniques in the pre-training of large language models, with a focus on Mixture-of-Experts (MoE) architectures, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 7


27. Foundation model for mass spectrometry proteomics

ArXiv ID: 2505.10848

Authors: Justin Sanders, Melih Yilmaz, Jacob H. Russell, Wout Bittremieux, William E. Fondrie, Nicholas M. Riley, Sewoong Oh, William Stafford Noble

Abstract: Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.

Comment: The paper proposes a foundation model for mass spectrometry proteomics, which is relevant to AI for science and foundational model research.

Relevance: 8 Novelty: 8


28. Redefining Neural Operators in $d+1$ Dimensions

ArXiv ID: 2505.11766

Authors: Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

Abstract: Neural Operators have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely validated on universally approximating various operators. Although recent advancements following this definition have developed effective modules to better approximate the kernel function defined on the original domain (with $d$ dimensions, $d=1, 2, 3...$), the unclarified evolving mechanism in the embedding spaces blocks our view to design neural operators that can fully capture the target system evolution. Drawing on recent breakthroughs in quantum simulation of partial differential equations (PDEs), we elucidate the linear evolution process in neural operators. Based on that, we redefine neural operators on a new $d+1$ dimensional domain. Within this framework, we implement our proposed Schr\"odingerised Kernel Neural Operator (SKNO) aligning better with the $d+1$ dimensional evolution. In experiments, our $d+1$ dimensional evolving linear block performs far better than others. Also, we test SKNO's SOTA performance on various benchmark tests and also the zero-shot super-resolution task. In addition, we analyse the impact of different lifting and recovering operators on the prediction within the redefined NO framework, reflecting the alignment between our model and the underlying $d+1$ dimensional evolution.

Comment: The paper redefines neural operators in a new dimensional framework, which is relevant to emerging trends in representation learning and model architecture.

Relevance: 8 Novelty: 8


29. SpikeX: Exploring Accelerator Architecture and Network-Hardware Co-Optimization for Sparse Spiking Neural Networks

ArXiv ID: 2505.12292

Authors: Boxun Xu, Richard Boone, Peng Li

Abstract: Spiking Neural Networks (SNNs) are promising biologically plausible models of computation which utilize a spiking binary activation function similar to that of biological neurons. SNNs are well positioned to process spatiotemporal data, and are advantageous in ultra-low power and real-time processing. Despite a large body of work on conventional artificial neural network accelerators, much less attention has been given to efficient SNN hardware accelerator design. In particular, SNNs exhibit inherent unstructured spatial and temporal firing sparsity, an opportunity yet to be fully explored for great hardware processing efficiency. In this work, we propose a novel systolic-array SNN accelerator architecture, called SpikeX, to take on the challenges and opportunities stemming from unstructured sparsity while taking into account the unique characteristics of spike-based computation. By developing an efficient dataflow targeting expensive multi-bit weight data movements, SpikeX reduces memory access and increases data sharing and hardware utilization for computations spanning across both time and space, thereby significantly improving energy efficiency and inference latency. Furthermore, recognizing the importance of SNN network and hardware co-design, we develop a co-optimization methodology facilitating not only hardware-aware SNN training but also hardware accelerator architecture search, allowing joint network weight parameter optimization and accelerator architectural reconfiguration. This end-to-end network/accelerator co-design approach offers a significant reduction of 15.1x-150.87x in energy-delay-product(EDP) without comprising model accuracy.

Comment: The paper proposes a novel SNN accelerator architecture, relevant to model architecture and efficiency improvements.

Relevance: 8 Novelty: 8


30. SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache

ArXiv ID: 2505.10951

Authors: Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Cheng Long, Jie Zhang

Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, enhancing more accurate and context-aware reasoning. We observe that for different queries, it could retrieve similar subgraphs as prompts, and thus we propose SubGCache, which aims to reduce inference latency by reusing computation across queries with similar structural prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of the representative subgraph. For each query with its retrieved subgraph within a cluster, it reuses the pre-computed KV cache of the representative subgraph of the cluster without computing the KV tensors again for saving computation. Experiments on two new datasets across multiple LLM backbones and graph-based RAG frameworks demonstrate that SubGCache consistently reduces inference latency with comparable and even improved generation quality, achieving up to 6.68$\times$ reduction in time-to-first-token (TTFT).

Comment: The paper introduces SubGCache, which is related to model compression through KV cache optimization, aligning with the core topic of model compression.

Relevance: 8 Novelty: 7


31. Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures

ArXiv ID: 2505.11918

Authors: Zhiheng Chen, Ruofan Wu, Guanhua Fang

Abstract: The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the under standing of pre-trained large language models. However, most recent works have been focusing on studying supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, at the same time exhibit reasonable robustness to distribution shifts. Theoretically, we prove that transformers can approximate both the EM algorithm and a core component of spectral methods (cubic tensor power iterations). These results bridge the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.

Comment: The paper explores transformers in unsupervised learning, specifically Gaussian Mixture Models, which aligns with the core topic of model architecture and representation learning.

Relevance: 8 Novelty: 7


32. SepPrune: Structured Pruning for Efficient Deep Speech Separation

ArXiv ID: 2505.12079

Authors: Yuqi Li, Kai Li, Xin Yin, Zhifei Yang, Junhao Dong, Zeyu Dong, Chuanguang Yang, Yingli Tian, Yao Lu

Abstract: Although deep learning has substantially advanced speech separation in recent years, most existing studies continue to prioritize separation quality while overlooking computational efficiency, an essential factor for low-latency speech processing in real-time applications. In this paper, we propose SepPrune, the first structured pruning framework specifically designed to compress deep speech separation models and reduce their computational cost. SepPrune begins by analyzing the computational structure of a given model to identify layers with the highest computational burden. It then introduces a differentiable masking strategy to enable gradient-driven channel selection. Based on the learned masks, SepPrune prunes redundant channels and fine-tunes the remaining parameters to recover performance. Extensive experiments demonstrate that this learnable pruning paradigm yields substantial advantages for channel pruning in speech separation models, outperforming existing methods. Notably, a model pruned with SepPrune can recover 85% of the performance of a pre-trained model (trained over hundreds of epochs) with only one epoch of fine-tuning, and achieves convergence 36$\times$ faster than training from scratch. Code is available at https://github.com/itsnotacie/SepPrune.

Comment: The paper presents SepPrune, a structured pruning framework for speech separation, aligning with the core topic of model compression.

Relevance: 8 Novelty: 7


33. Memory-Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation

ArXiv ID: 2505.11235

Authors: Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang

Abstract: Driven by the relentless growth in model parameters, which renders full fine-tuning prohibitively expensive for large-scale deployment, parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for rapidly adapting large models to a wide range of downstream tasks. Among the PEFT family, orthogonal fine-tuning and its variants have demonstrated remarkable performance by preserving hyperspherical energy, which encodes pairwise angular similarity between neurons. However, these methods are inherently memory-inefficient due to the need to store intermediate activations from multiple full-dimensional sparse matrices. To address this limitation, we propose Memory-efficient Orthogonal Fine-Tuning (MOFT) with principal subspace adaptation. Specifically, we first establish a theoretical condition under which orthogonal transformations within a low-rank subspace preserve hyperspherical energy. Based on this insight, we constrain orthogonal fine-tuning to the principal subspace defined by the top-r components obtained through singular value decomposition and impose an additional constraint on the projection matrix to satisfy the preservation condition. To enhance MOFT's flexibility across tasks, we relax strict orthogonality by introducing two learnable scaling vectors. Extensive experiments on 37 diverse tasks and four models across NLP and CV demonstrate that MOFT consistently outperforms key baselines while significantly reducing the memory footprint of orthogonal fine-tuning.

Comment: The paper proposes MOFT for memory-efficient orthogonal fine-tuning, aligning with the core topic of model compression and efficiency.

Relevance: 8 Novelty: 7


34. LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades

ArXiv ID: 2505.13515

Authors: Yanan Li, Fanxu Meng, Muhan Zhang, Shiai Zhu, Shangguang Wang, Mengwei Xu

Abstract: As Large Language Models (LLMs) are frequently updated, LoRA weights trained on earlier versions quickly become obsolete. The conventional practice of retraining LoRA weights from scratch on the latest model is costly, time-consuming, and environmentally detrimental, particularly as the diversity of LLMs and downstream tasks expands. This motivates a critical question: "How can we efficiently leverage existing LoRA weights to adapt to newer model versions?" To address this, we propose LoRASuite, a modular approach tailored specifically to various types of LLM updates. First, we compute a transfer matrix utilizing known parameters from both old and new LLMs. Next, we allocate corresponding layers and attention heads based on centered kernel alignment and cosine similarity metrics, respectively. A subsequent small-scale, skillful fine-tuning step ensures numerical stability. Experimental evaluations demonstrate that LoRASuite consistently surpasses small-scale vanilla LoRA methods. Notably, on backbone LLMs such as MiniCPM and Qwen, LoRASuite even exceeds the performance of full-scale LoRA retraining, with average improvements of +1.4 and +6.6 points on math tasks, respectively. Additionally, LoRASuite significantly reduces memory consumption by 5.5 GB and computational time by 78.23%.

Comment: The paper introduces LoRASuite for efficient adaptation of LoRA weights, aligning with the core topic of model compression and efficiency.

Relevance: 8 Novelty: 7


35. AdaDim: Dimensionality Adaptation for SSL Representational Dynamics

ArXiv ID: 2505.12576

Authors: Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

Abstract: A key factor in effective Self-Supervised learning (SSL) is preventing dimensional collapse, which is where higher-dimensional representation spaces span a lower-dimensional subspace. Therefore, SSL optimization strategies involve guiding a model to produce representations ($R$) with a higher dimensionality. Dimensionality is either optimized through a dimension-contrastive approach that encourages feature decorrelation or through a sample-contrastive method that promotes a uniform spread of sample representations. Both families of SSL algorithms also utilize a projection head that maps $R$ into a lower-dimensional embedding space $Z$. Recent work has characterized the projection head as a filter of irrelevant features from the SSL objective by reducing mutual information, $I(R;Z)$. Therefore, the current literature's view is that a good SSL representation space should have a high $H(R)$ and a low $I(R;Z)$. However, this view of the problem is lacking in terms of an understanding of the underlying training dynamics that influences both terms, as well as how the values of $H(R)$ and $I(R;Z)$ arrived at the end of training reflect the downstream performance of an SSL model. We address both gaps in the literature by demonstrating that increases in $H(R)$ due to feature decorrelation at the start of training lead to a higher $I(R;Z)$, while increases in $H(R)$ due to samples distributing uniformly in a high-dimensional space at the end of training cause $I(R;Z)$ to plateau or decrease. Furthermore, our analysis shows that the best performing SSL models do not have the highest $H(R)$ nor the lowest $I(R;Z)$, but arrive at an optimal intermediate point for both. We develop a method called AdaDim to exploit these observed training dynamics by adaptively weighting between losses based on feature decorrelation and uniform sample spread.

Comment: The paper proposes AdaDim for SSL representational dynamics, aligning with the core topic of representation learning.

Relevance: 8 Novelty: 7


36. Structured Representation

ArXiv ID: 2505.12143

Authors: Arun Kumar, Paul Schrater

Abstract: Invariant representations are core to representation learning, yet a central challenge remains: uncovering invariants that are stable and transferable without suppressing task-relevant signals. This raises fundamental questions, requiring further inquiry, about the appropriate level of abstraction at which such invariants should be defined, and which aspects of a system they should characterize. Interpretation of the environment relies on abstract knowledge structures to make sense of the current state, which leads to interactions, essential drivers of learning and knowledge acquisition. We posit that interpretation operates at the level of higher-order relational knowledge; hence, invariant structures must be where knowledge resides, specifically, as partitions defined by the closure of relational paths within an abstract knowledge space. These partitions serve as the core invariant representations, forming the structural substrate where knowledge is stored and learning occurs. On the other hand, inter-partition connectors enable the deployment of these knowledge partitions encoding task-relevant transitions. Thus, invariant partitions provide the foundational primitives of structured representation. We formalize the computational foundations for structured representation of the invariant partitions based on closed semiring, a relational algebraic structure.

Comment: The paper discusses structured representation and invariant partitions, aligning with the core topic of representation learning.

Relevance: 8 Novelty: 7


37. SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

ArXiv ID: 2505.11166

Authors: Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang

Abstract: Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named $\textbf{S}$h$\textbf{o}$rt-to-$\textbf{Lo}$ng $\textbf{P}$reference $\textbf{O}$ptimization ($\textbf{SoLoPO}$), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

Comment: The paper proposes a framework for improving long-context capabilities in LLMs, which aligns with foundational research in large language models.

Relevance: 8 Novelty: 7


38. When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability

ArXiv ID: 2505.12096

Authors: Alberto Bassi, Carlo Albert, Aurelien Lucchi, Marco Baity-Jesi, Emanuele Francazi

Abstract: Understanding the statistical properties of deep neural networks (DNNs) at initialization is crucial for elucidating both their trainability and the intrinsic architectural biases they encode prior to data exposure. Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. Concurrently, untrained DNNs were found to exhibit an initial-guessing bias (IGB), in which large regions of the input space are assigned to a single class. In this work, we derive a theoretical proof establishing the correspondence between IGB and previous MF theories, thereby connecting a network prejudice toward specific classes with the conditions for fast and accurate learning. This connection yields the counter-intuitive conclusion: the initialization that optimizes trainability is necessarily biased, rather than neutral. Furthermore, we extend the MF/IGB framework to multi-node activation functions, offering practical guidelines for designing initialization schemes that ensure stable optimization in architectures employing max- and average-pooling layers.

Comment: The paper connects initial-guessing bias with trainability in neural networks, providing insights into training dynamics, which is relevant to representation learning.

Relevance: 8 Novelty: 7


39. What Can We Learn From MIMO Graph Convolutions?

ArXiv ID: 2505.11346

Authors: Andreas Roth, Thomas Liebig

Abstract: Most graph neural networks (GNNs) utilize approximations of the general graph convolution derived in the graph Fourier domain. While GNNs are typically applied in the multi-input multi-output (MIMO) case, the approximations are performed in the single-input single-output (SISO) case. In this work, we first derive the MIMO graph convolution through the convolution theorem and approximate it directly in the MIMO case. We find the key MIMO-specific property of the graph convolution to be operating on multiple computational graphs, or equivalently, applying distinct feature transformations for each pair of nodes. As a localized approximation, we introduce localized MIMO graph convolutions (LMGCs), which generalize many linear message-passing neural networks. For almost every choice of edge weights, we prove that LMGCs with a single computational graph are injective on multisets, and the resulting representations are linearly independent when more than one computational graph is used. Our experimental results confirm that an LMGC can combine the benefits of various methods.

Comment: The paper explores MIMO graph convolutions, providing insights into graph neural network architectures, which is relevant to model architecture.

Relevance: 8 Novelty: 7


40. Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum

ArXiv ID: 2505.12191

Authors: Wenquan Lu, Jiaqi Zhang, Hugues Van Assel, Randall Balestriero

Abstract: Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining a SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.

Comment: The paper presents a self-supervised learning framework for noise robustness, which is relevant to representation learning.

Relevance: 8 Novelty: 7


41. A Local Polyak-Lojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models

ArXiv ID: 2505.11664

Authors: Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, Rene Vidal

Abstract: Most prior work on the convergence of gradient descent (GD) for overparameterized neural networks relies on strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (large, spectral, balanced). Recent efforts to relax these assumptions focus on two-layer linear networks trained with the squared loss. In this work, we derive a linear convergence rate for training two-layer linear neural networks with GD for general losses and under relaxed assumptions on the step size, width, and initialization. A key challenge in deriving this result is that classical ingredients for deriving convergence rates for nonconvex problems, such as the Polyak-{\L}ojasiewicz (PL) condition and Descent Lemma, do not hold globally for overparameterized neural networks. Here, we prove that these two conditions hold locally with local constants that depend on the weights. Then, we provide bounds on these local constants, which depend on the initialization of the weights, the current loss, and the global PL and smoothness constants of the non-overparameterized model. Based on these bounds, we derive a linear convergence rate for GD. Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size, as verified through our numerical experiments.

Comment: The paper provides a theoretical analysis of gradient descent for overparameterized linear models, which is relevant to representation learning and training dynamics in neural networks.

Relevance: 8 Novelty: 7


42. Attention on the Sphere

ArXiv ID: 2505.11157

Authors: Boris Bonev, Max Rietmann, Andrea Paris, Alberto Carpentieri, Thorsten Kurth

Abstract: We introduce a generalized attention mechanism for spherical domains, enabling Transformer architectures to natively process data defined on the two-dimensional sphere - a critical need in fields such as atmospheric physics, cosmology, and robotics, where preserving spherical symmetries and topology is essential for physical accuracy. By integrating numerical quadrature weights into the attention mechanism, we obtain a geometrically faithful spherical attention that is approximately rotationally equivariant, providing strong inductive biases and leading to better performance than Cartesian approaches. To further enhance both scalability and model performance, we propose neighborhood attention on the sphere, which confines interactions to geodesic neighborhoods. This approach reduces computational complexity and introduces the additional inductive bias for locality, while retaining the symmetry properties of our method. We provide optimized CUDA kernels and memory-efficient implementations to ensure practical applicability. The method is validated on three diverse tasks: simulating shallow water equations on the rotating sphere, spherical image segmentation, and spherical depth estimation. Across all tasks, our spherical Transformers consistently outperform their planar counterparts, highlighting the advantage of geometric priors for learning on spherical domains.

Comment: The paper introduces a generalized attention mechanism for spherical domains, which is relevant to model architecture, specifically in the context of transformers.

Relevance: 8 Novelty: 7


43. Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers

ArXiv ID: 2505.11040

Authors: Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff

Abstract: Recent advances in transformer architectures deeply enhance long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining a single-level LSH-based clustering with uniform residual sampling. However,such a sampling limits crucial keys' capturing, which in turn raises the overall perplexity. In this paper, we propose a pre-scoring mechanism to assist HyperAttention to prioritize significant keys. Specifically, we introduce three scoring methods: K-means clustering, K-median clustering, and leverage score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention's original uniform residual sampling entirely, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 (131k token context) reduce perplexity from 12 to 8.3, which outperforms standard HyperAttention. Moreover, when running on the Vision-Transformer (ViT), our method shows that it can guarantee similar accuracy compared with LevAttention, and will surpass LevAttention given specific parameters. Although this method introduces computational overhead, its combination with HyperAttention remains 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving Transformer's efficiency.

Comment: The paper proposes a pre-scoring mechanism to improve attention in transformers, which is relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


44. Adaptive parameter-efficient fine-tuning via Hessian-informed subset selection

ArXiv ID: 2505.12579

Authors: Shiyun Xu, Zhiqi Bu

Abstract: Parameter-efficient fine-tuning (PEFT) is a highly effective approach for adapting large pre-trained models to downstream tasks with minimal computational overhead. At the core, PEFT methods freeze most parameters and only trains a small subset (say $<0.1\%$ of total parameters). Notably, different PEFT methods select different subsets, resulting in varying levels of performance. This variation prompts a key question: how to effectively select the most influential subset to train? We formulate the subset selection as a multi-task problem: maximizing the performance and minimizing the number of trainable parameters. We leverage a series of transformations -- including $\epsilon$-constraint method and second-order Taylor approximation -- to arrive at the classical 0-1 knapsack problem, which we solve through the lens of Pareto optimality. Consequently, we propose AdaPEFT, a Hessian-informed PEFT that adapts to various tasks and models, in which the selected subset empirically transfers across training horizons and model sizes.

Comment: The paper proposes a Hessian-informed subset selection for parameter-efficient fine-tuning, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


45. SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization

ArXiv ID: 2505.12433

Authors: Haodong Yang, Lei Wang, Md Zakir Hossain

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method that injects two trainable low-rank matrices (A and B) into frozen pretrained models. While efficient, LoRA constrains updates to a fixed low-rank subspace (Delta W = BA), which can limit representational capacity and hinder downstream performance. We introduce Subspace Recomposition in Low-Rank Adaptation (SRLoRA) via importance-based fusion and reinitialization, a novel approach that enhances LoRA's expressiveness without compromising its lightweight structure. SRLoRA assigns importance scores to each LoRA pair (a column of B and the corresponding row of A), and dynamically recomposes the subspace during training. Less important pairs are fused into the frozen backbone, freeing capacity to reinitialize new pairs along unused principal directions derived from the pretrained weight's singular value decomposition. This mechanism enables continual subspace refreshment and richer adaptation over time, without increasing the number of trainable parameters. We evaluate SRLoRA on both language and vision tasks, including the GLUE benchmark and various image classification datasets. SRLoRA consistently achieves faster convergence and improved accuracy over standard LoRA, demonstrating its generality, efficiency, and potential for broader PEFT applications.

Comment: The paper introduces a novel approach to enhance LoRA's expressiveness, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


46. Revisiting Stochastic Approximation and Stochastic Gradient Descent

ArXiv ID: 2505.11343

Authors: Rajeeva Laxman Karandikar, Bhamidi Visweswara Rao, Mathukumalli Vidyasagar

Abstract: In this paper, we take a fresh look at stochastic approximation (SA) and Stochastic Gradient Descent (SGD). We derive new sufficient conditions for the convergence of SA. In particular, the "noise" or measurement error need not have a finite second moment, and under suitable conditions, not even a finite mean. By adapting this method of proof, we also derive sufficient conditions for the convergence of zero-order SGD, wherein the stochastic gradient is computed using only two function evaluations, and no gradient computations. The sufficient conditions derived here are the weakest to date, thus leading to a considerable expansion of the applicability of SA and SGD theory.

Comment: The paper provides new theoretical insights into stochastic approximation and SGD, which are foundational topics in optimization and machine learning.

Relevance: 8 Novelty: 7


47. Flash Invariant Point Attention

ArXiv ID: 2505.11580

Authors: Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann

Abstract: Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at https://github.com/flagshippioneering/flash_ipa.

Comment: The paper introduces FlashIPA, a reformulation of IPA for efficient computation, relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


48. Where You Place the Norm Matters: From Prejudiced to Neutral Initializations

ArXiv ID: 2505.11312

Authors: Emanuele Francazi, Francesco Pinto, Aurelien Lucchi, Marco Baity-Jesi

Abstract: Normalization layers, such as Batch Normalization and Layer Normalization, are central components in modern neural networks, widely adopted to improve training stability and generalization. While their practical effectiveness is well documented, a detailed theoretical understanding of how normalization affects model behavior, starting from initialization, remains an important open question. In this work, we investigate how both the presence and placement of normalization within hidden layers influence the statistical properties of network predictions before training begins. In particular, we study how these choices shape the distribution of class predictions at initialization, which can range from unbiased (Neutral) to highly concentrated (Prejudiced) toward a subset of classes. Our analysis shows that normalization placement induces systematic differences in the initial prediction behavior of neural networks, which in turn shape the dynamics of learning. By linking architectural choices to prediction statistics at initialization, our work provides a principled understanding of how normalization can influence early training behavior and offers guidance for more controlled and interpretable network design.

Comment: The paper investigates the impact of normalization placement on network initialization, relevant to model architecture analysis.

Relevance: 8 Novelty: 7


49. CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs

ArXiv ID: 2505.11060

Authors: David Méndez, Gianpaolo Bontempo, Elisa Ficarra, Roberto Confalonieri, Natalia Díaz-Rodríguez

Abstract: Deep vision models often rely on biases learned from spurious correlations in datasets. To identify these biases, methods that interpret high-level, human-understandable concepts are more effective than those relying primarily on low-level features like heatmaps. A major challenge for these concept-based methods is the lack of image annotations indicating potentially bias-inducing concepts, since creating such annotations requires detailed labeling for each dataset and concept, which is highly labor-intensive. We present CUBIC (Concept embeddings for Unsupervised Bias IdentifiCation), a novel method that automatically discovers interpretable concepts that may bias classifier behavior. Unlike existing approaches, CUBIC does not rely on predefined bias candidates or examples of model failures tied to specific biases, as such information is not always available. Instead, it leverages image-text latent space and linear classifier probes to examine how the latent representation of a superclass label$\unicode{x2014}$shared by all instances in the dataset$\unicode{x2014}$is influenced by the presence of a given concept. By measuring these shifts against the normal vector to the classifier's decision boundary, CUBIC identifies concepts that significantly influence model predictions. Our experiments demonstrate that CUBIC effectively uncovers previously unknown biases using Vision-Language Models (VLMs) without requiring the samples in the dataset where the classifier underperforms or prior knowledge of potential biases.

Comment: The paper introduces CUBIC, a method for unsupervised bias identification using VLMs, relevant to representation learning and foundational research.

Relevance: 8 Novelty: 7


50. Efficient Optimization with Orthogonality Constraint: a Randomized Riemannian Submanifold Method

ArXiv ID: 2505.12378

Authors: Andi Han, Pierre-Louis Poirion, Akiko Takeda

Abstract: Optimization with orthogonality constraints frequently arises in various fields such as machine learning. Riemannian optimization offers a powerful framework for solving these problems by equipping the constraint set with a Riemannian manifold structure and performing optimization intrinsically on the manifold. This approach typically involves computing a search direction in the tangent space and updating variables via a retraction operation. However, as the size of the variables increases, the computational cost of the retraction can become prohibitively high, limiting the applicability of Riemannian optimization to large-scale problems. To address this challenge and enhance scalability, we propose a novel approach that restricts each update on a random submanifold, thereby significantly reducing the per-iteration complexity. We introduce two sampling strategies for selecting the random submanifolds and theoretically analyze the convergence of the proposed methods. We provide convergence results for general nonconvex functions and functions that satisfy Riemannian Polyak-Lojasiewicz condition as well as for stochastic optimization settings. Additionally, we demonstrate how our approach can be generalized to quotient manifolds derived from the orthogonal manifold. Extensive experiments verify the benefits of the proposed method, across a wide variety of problems.

Comment: The paper proposes a randomized Riemannian submanifold method for optimization with orthogonality constraints, relevant to foundational research in optimization methods.

Relevance: 8 Novelty: 7


51. Understanding Nonlinear Implicit Bias via Region Counts in Input Space

ArXiv ID: 2505.11370

Authors: Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang

Abstract: One explanation for the strong generalization ability of neural networks is implicit bias. Yet, the definition and mechanism of implicit bias in non-linear contexts remains little understood. In this work, we propose to characterize implicit bias by the count of connected regions in the input space with the same predicted label. Compared with parameter-dependent metrics (e.g., norm or normalized margin), region count can be better adapted to nonlinear, overparameterized models, because it is determined by the function mapping and is invariant to reparametrization. Empirically, we found that small region counts align with geometrically simple decision boundaries and correlate well with good generalization performance. We also observe that good hyper-parameter choices such as larger learning rates and smaller batch sizes can induce small region counts. We further establish the theoretical connections and explain how larger learning rate can induce small region counts in neural networks.

Comment: The paper explores implicit bias in neural networks, which is relevant to representation learning as it provides insights into how deep networks encode information.

Relevance: 8 Novelty: 7


52. Efficient Federated Class-Incremental Learning of Pre-Trained Models via Task-agnostic Low-rank Residual Adaptation

ArXiv ID: 2505.12318

Authors: Feng Yu, Jia Hu, Geyong Min

Abstract: Federated Parameter-Efficient Fine-Tuning (FedPEFT) reduces communication and computation costs in federated fine-tuning of pre-trained models by updating only a small subset of model parameters. However, existing approaches assume static data distributions, failing to adequately address real-world scenarios where new classes continually emerge, particularly in Federated Class Incremental Learning (FCIL). FCIL faces two key challenges: catastrophic forgetting and performance degradation caused by non-IID data across clients. Unlike current methods that maintain separate task-specific components or suffer from aggregation noise during parameter aggregation, we propose Federated Task-agnostic Low-rank Residual Adaptation (Fed-TaLoRA), a novel parameter-efficient approach for fine-tuning in resource-constrained FCIL scenarios. Specifically, we fine-tune only shared task-agnostic LoRA parameters across sequential tasks, effectively mitigating catastrophic forgetting while enabling efficient knowledge transfer among clients. Based on a theoretical analysis of aggregation, we develop a novel residual weight update mechanism that ensures accurate knowledge consolidation with minimal overhead. Our methodological innovations are attributed to three key strategies: task-agnostic adaptation, post-aggregation model calibration, and strategic placement of LoRA modules. Extensive experiments on multiple benchmark datasets demonstrate that Fed-TaLoRA consistently outperforms state-of-the-art methods in diverse data heterogeneity scenarios while substantially reducing resource requirements.

Comment: The paper introduces a novel approach for federated class-incremental learning, focusing on low-rank residual adaptation, which is relevant to model compression.

Relevance: 8 Novelty: 7


53. FlashBias: Fast Computation of Attention with Bias

ArXiv ID: 2505.12044

Authors: Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long

Abstract: Attention mechanism has emerged as a foundation module of modern deep learning models and has also empowered many milestones in various domains. Moreover, FlashAttention with IO-aware speedup resolves the efficiency issue of standard attention, further promoting its practicality. Beyond canonical attention, attention with bias also widely exists, such as relative position bias in vision and language models and pair representation bias in AlphaFold. In these works, prior knowledge is introduced as an additive bias term of attention weights to guide the learning process, which has been proven essential for model performance. Surprisingly, despite the common usage of attention with bias, its targeted efficiency optimization is still absent, which seriously hinders its wide applications in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalization. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for AlphaFold, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy.

Comment: FlashBias focuses on efficient computation of attention with bias, which is relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


54. Adversarially Robust Spiking Neural Networks with Sparse Connectivity

ArXiv ID: 2505.15833

Authors: Mathias Schmolli, Maximilian Baronig, Robert Legenstein, Ozan Özdenizci

Abstract: Deployment of deep neural networks in resource-constrained embedded systems requires innovative algorithmic solutions to facilitate their energy and memory efficiency. To further ensure the reliability of these systems against malicious actors, recent works have extensively studied adversarial robustness of existing architectures. Our work focuses on the intersection of adversarial robustness, memory- and energy-efficiency in neural networks. We introduce a neural network conversion algorithm designed to produce sparse and adversarially robust spiking neural networks (SNNs) by leveraging the sparse connectivity and weights from a robustly pretrained artificial neural network (ANN). Our approach combines the energy-efficient architecture of SNNs with a novel conversion algorithm, leading to state-of-the-art performance with enhanced energy and memory efficiency through sparse connectivity and activations. Our models are shown to achieve up to 100x reduction in the number of weights to be stored in memory, with an estimated 8.6x increase in energy efficiency compared to dense SNNs, while maintaining high performance and robustness against adversarial threats.

Comment: The paper introduces a method for creating sparse and adversarially robust spiking neural networks, which is relevant to model compression and sparsity.

Relevance: 8 Novelty: 7


55. Relational Graph Transformer

ArXiv ID: 2505.10960

Authors: Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec

Abstract: Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.

Comment: The paper introduces a novel graph transformer architecture specifically designed for relational tables, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


56. Hyperbolic Residual Quantization: Discrete Representations for Data with Latent Hierarchies

ArXiv ID: 2505.12404

Authors: Piotr Piękos, Subhradeep Kayal, Alexandros Karatzoglou

Abstract: Hierarchical data arise in countless domains, from biological taxonomies and organizational charts to legal codes and knowledge graphs. Residual Quantization (RQ) is widely used to generate discrete, multitoken representations for such data by iteratively quantizing residuals in a multilevel codebook. However, its reliance on Euclidean geometry can introduce fundamental mismatches that hinder modeling of hierarchical branching, necessary for faithful representation of hierarchical data. In this work, we propose Hyperbolic Residual Quantization (HRQ), which embeds data natively in a hyperbolic manifold and performs residual quantization using hyperbolic operations and distance metrics. By adapting the embedding network, residual computation, and distance metric to hyperbolic geometry, HRQ imparts an inductive bias that aligns naturally with hierarchical branching. We claim that HRQ in comparison to RQ can generate more useful for downstream tasks discrete hierarchical representations for data with latent hierarchies. We evaluate HRQ on two tasks: supervised hierarchy modeling using WordNet hypernym trees, where the model is supervised to learn the latent hierarchy - and hierarchy discovery, where, while latent hierarchy exists in the data, the model is not directly trained or evaluated on a task related to the hierarchy. Across both scenarios, HRQ hierarchical tokens yield better performance on downstream tasks compared to Euclidean RQ with gains of up to $20\%$ for the hierarchy modeling task. Our results demonstrate that integrating hyperbolic geometry into discrete representation learning substantially enhances the ability to capture latent hierarchies.

Comment: The paper introduces Hyperbolic Residual Quantization for hierarchical data, which aligns with the representation learning criterion.

Relevance: 8 Novelty: 7


57. MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection

ArXiv ID: 2505.11416

Authors: Pouya Shaeri, Ariane Middel

Abstract: Modern neural networks often activate all neurons for every input, leading to unnecessary computation and inefficiency. We introduce Matrix-Interpolated Dropout Layer (MID-L), a novel module that dynamically selects and activates only the most informative neurons by interpolating between two transformation paths via a learned, input-dependent gating vector. Unlike conventional dropout or static sparsity methods, MID-L employs a differentiable Top-k masking strategy, enabling per-input adaptive computation while maintaining end-to-end differentiability. MID-L is model-agnostic and integrates seamlessly into existing architectures. Extensive experiments on six benchmarks, including MNIST, CIFAR-10, CIFAR-100, SVHN, UCI Adult, and IMDB, show that MID-L achieves up to average 55\% reduction in active neurons, 1.7$\times$ FLOPs savings, and maintains or exceeds baseline accuracy. We further validate the informativeness and selectivity of the learned neurons via Sliced Mutual Information (SMI) and observe improved robustness under overfitting and noisy data conditions. Additionally, MID-L demonstrates favorable inference latency and memory usage profiles, making it suitable for both research exploration and deployment on compute-constrained systems. These results position MID-L as a general-purpose, plug-and-play dynamic computation layer, bridging the gap between dropout regularization and efficient inference.

Comment: The paper introduces a novel dropout layer for efficient neural network computation, which aligns with the model compression criterion.

Relevance: 8 Novelty: 7


58. PoE-World: Compositional World Modeling with Products of Programmatic Experts

ArXiv ID: 2505.10819

Authors: Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis

Abstract: Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.

Comment: The paper introduces a novel program synthesis method for world modeling using products of programmatic experts, which aligns with representation learning and model architecture innovations.

Relevance: 8 Novelty: 7


59. AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections

ArXiv ID: 2505.12455

Authors: Xin Yu, Yujia Wang, Jinghui Chen, Lingzhou Xue

Abstract: Low-Rank Adaptation (LoRA) has emerged as an effective technique for reducing memory overhead in fine-tuning large language models. However, it often suffers from sub-optimal performance compared with full fine-tuning since the update is constrained in the low-rank space. Recent variants such as LoRA-Pro attempt to mitigate this by adjusting the gradients of the low-rank matrices to approximate the full gradient. However, LoRA-Pro's solution is not unique, and different solutions can lead to significantly varying performance in ablation studies. Besides, to incorporate momentum or adaptive optimization design, approaches like LoRA-Pro must first compute the equivalent gradient, causing a higher memory cost close to full fine-tuning. A key challenge remains in integrating momentum properly into the low-rank space with lower memory cost. In this work, we propose AltLoRA, an alternating projection method that avoids the difficulties in gradient approximation brought by the joint update design, meanwhile integrating momentum without higher memory complexity. Our theoretical analysis provides convergence guarantees and further shows that AltLoRA enables stable feature learning and robustness to transformation invariance. Extensive experiments across multiple tasks demonstrate that AltLoRA outperforms LoRA and its variants, narrowing the gap toward full fine-tuning while preserving superior memory efficiency.

Comment: The paper proposes AltLoRA, an improvement in low-rank adaptation for LLMs, which aligns with model compression and efficiency.

Relevance: 8 Novelty: 7


60. Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

ArXiv ID: 2505.11615

Authors: Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths

Abstract: Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer's residual streams using appropriately constructed "steering vectors." These modifications to internal neural activations, a form of representation engineering, offer an effective and targeted means of influencing model behavior without retraining or fine-tuning the model. But how can such steering vectors be systematically identified? We propose a principled approach for uncovering steering vectors by aligning latent representations elicited through behavioral methods (specifically, Markov chain Monte Carlo with LLMs) with their neural counterparts. To evaluate this approach, we focus on extracting latent risk preferences from LLMs and steering their risk-related outputs using the aligned representations as steering vectors. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.

Comment: The paper proposes a method for steering risk preferences in LLMs by aligning behavioral and neural representations, which aligns with representation learning and LLM behavior insights.

Relevance: 8 Novelty: 7


61. SchoenbAt: Rethinking Attention with Polynomial basis

ArXiv ID: 2505.12252

Authors: Yuhan Guo, Lizhong Ding, Yuwan Yang, Xuewei Guo

Abstract: Kernelized attention extends the attention mechanism by modeling sequence correlations through kernel functions, making significant progresses in optimizing attention. Under the guarantee of harmonic analysis theory, kernel functions can be expanded with basis functions, inspiring random feature-based approaches to enhance the efficiency of kernelized attention while maintaining predictive performance. However, current random feature-based works are limited to the Fourier basis expansions under Bochner's theorem. We propose Schoenberg's theorem-based attention (SchoenbAt), which approximates dot-product kernelized attention with the polynomial basis under Schoenberg's theorem via random Maclaurin features and applies a two-stage regularization to constrain the input space and restore the output scale, acting as a drop-in replacement of dot-product kernelized attention. Our theoretical proof of the unbiasedness and concentration error bound of SchoenbAt supports its efficiency and accuracy as a kernelized attention approximation, which is also empirically validated under various random feature dimensions. Evaluations on real-world datasets demonstrate that SchoenbAt significantly enhances computational speed while preserving competitive performance in terms of precision, outperforming several efficient attention methods.

Comment: The paper proposes SchoenbAt, a new attention mechanism using polynomial basis, which aligns with model architecture innovations.

Relevance: 8 Novelty: 7


62. S-Crescendo: A Nested Transformer Weaving Framework for Scalable Nonlinear System in S-Domain Representation

ArXiv ID: 2505.11843

Authors: Junlang Huang, Hao Chen, Li Luo, Yong Cai, Lexin Zhang, Tianhao Ma, Yitian Zhang, Zhong Guan

Abstract: Simulation of high-order nonlinear system requires extensive computational resources, especially in modern VLSI backend design where bifurcation-induced instability and chaos-like transient behaviors pose challenges. We present S-Crescendo - a nested transformer weaving framework that synergizes S-domain with neural operators for scalable time-domain prediction in high-order nonlinear networks, alleviating the computational bottlenecks of conventional solvers via Newton-Raphson method. By leveraging the partial-fraction decomposition of an n-th order transfer function into first-order modal terms with repeated poles and residues, our method bypasses the conventional Jacobian matrix-based iterations and efficiently reduces computational complexity from cubic $O(n^3)$ to linear $O(n)$.The proposed architecture seamlessly integrates an S-domain encoder with an attention-based correction operator to simultaneously isolate dominant response and adaptively capture higher-order non-linearities. Validated on order-1 to order-10 networks, our method achieves up to 0.99 test-set ($R^2$) accuracy against HSPICE golden waveforms and accelerates simulation by up to 18(X), providing a scalable, physics-aware framework for high-dimensional nonlinear modeling.

Comment: The paper introduces a novel nested transformer framework for scalable nonlinear system simulation, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


63. On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

ArXiv ID: 2505.11840

Authors: Huan Li, Yiming Dong, Zhouchen Lin

Abstract: As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^KE\left[|\nabla f(x^k)|1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by $\ell_1$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E\left[|\nabla f(x)|_1\right]\geq\sqrt{\frac{2d}{\pi}}E\left[|\nabla f(x)|_2\right]$ when each element of $\nabla f(x)$ is generated from Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $|\nabla f(x)|_1=\varTheta(\sqrt{d})|\nabla f(x)|_2$. Both support that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K}\sum)$ convergence rate of SGD.}^KE\left[|\nabla f(x^k)|_2\right]\leq O(\frac{C}{K^{1/4}

Comment: The paper provides a theoretical analysis of the convergence rate of AdamW, which is relevant to understanding training dynamics in neural networks.

Relevance: 8 Novelty: 7


64. Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD

ArXiv ID: 2505.12128

Authors: Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, Christoph H. Lampert

Abstract: Matrix factorization mechanisms for differentially private training have emerged as a promising approach to improve model utility under privacy constraints. In practical settings, models are typically trained over multiple epochs, requiring matrix factorizations that account for repeated participation. Existing theoretical upper and lower bounds on multi-epoch factorization error leave a significant gap. In this work, we introduce a new explicit factorization method, Banded Inverse Square Root (BISR), which imposes a banded structure on the inverse correlation matrix. This factorization enables us to derive an explicit and tight characterization of the multi-epoch error. We further prove that BISR achieves asymptotically optimal error by matching the upper and lower bounds. Empirically, BISR performs on par with state-of-the-art factorization methods, while being simpler to implement, computationally efficient, and easier to analyze.

Comment: The paper introduces a new matrix factorization method for differentially private training, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


65. Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought

ArXiv ID: 2505.12514

Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate ``thinking tokens'' before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps where $n$ is the number of vertices ($D<n$). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to sequential search that requires many more steps and may be trapped into local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.

Comment: The paper provides a theoretical perspective on continuous chain-of-thoughts in LLMs, which is relevant to the large language models criterion.

Relevance: 8 Novelty: 7


66. PhiNet v2: A Mask-Free Brain-Inspired Vision Foundation Model from Video

ArXiv ID: 2505.11129

Authors: Makoto Yamada, Kian Ming A. Chai, Ayoub Rhim, Satoki Ishikawa, Mohammad Sabokrou, Yao-Hung Hubert Tsai

Abstract: Recent advances in self-supervised learning (SSL) have revolutionized computer vision through innovative architectures and learning objectives, yet they have not fully leveraged insights from biological visual processing systems. Recently, a brain-inspired SSL model named PhiNet was proposed; it is based on a ResNet backbone and operates on static image inputs with strong augmentation. In this paper, we introduce PhiNet v2, a novel Transformer-based architecture that processes temporal visual input (that is, sequences of images) without relying on strong augmentation. Our model leverages variational inference to learn robust visual representations from continuous input streams, similar to human visual processing. Through extensive experimentation, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision foundation models, while maintaining the ability to learn from sequential input without strong data augmentation. This work represents a significant step toward more biologically plausible computer vision systems that process visual information in a manner more closely aligned with human cognitive processes.

Comment: The paper introduces a brain-inspired vision model, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


67. msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML

ArXiv ID: 2505.11483

Authors: Zhaolan Huang, Emmanuel Baccelli

Abstract: AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU's tiny memory budget e.g., 128kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.

Comment: The paper introduces a novel technique for optimizing CNNs for TinyML, relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


68. MergeBench: A Benchmark for Merging Domain-Specialized LLMs

ArXiv ID: 2505.10833

Authors: Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao

Abstract: Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. We open source our code at \href{https://github.com/uiuctml/MergeBench}{https://github.com/uiuctml/MergeBench}.

Comment: The paper introduces a benchmark for merging domain-specialized LLMs, which is relevant to model architecture and LLMs.

Relevance: 8 Novelty: 7


69. SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

ArXiv ID: 2505.11594

Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen

Abstract: The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.

Comment: The paper explores low-bit attention for inference and training, relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


70. A Classical View on Benign Overfitting: The Role of Sample Size

ArXiv ID: 2505.11621

Authors: Junhyung Park, Patrick Bloebaum, Shiva Prasad Kasiviswanathan

Abstract: Benign overfitting is a phenomenon in machine learning where a model perfectly fits (interpolates) the training data, including noisy examples, yet still generalizes well to unseen data. Understanding this phenomenon has attracted considerable attention in recent years. In this work, we introduce a conceptual shift, by focusing on almost benign overfitting, where models simultaneously achieve both arbitrarily small training and test errors. This behavior is characteristic of neural networks, which often achieve low (but non-zero) training error while still generalizing well. We hypothesize that this almost benign overfitting can emerge even in classical regimes, by analyzing how the interaction between sample size and model complexity enables larger models to achieve both good training fit but still approach Bayes-optimal generalization. We substantiate this hypothesis with theoretical evidence from two case studies: (i) kernel ridge regression, and (ii) least-squares regression using a two-layer fully connected ReLU neural network trained via gradient flow. In both cases, we overcome the strong assumptions often required in prior work on benign overfitting. Our results on neural networks also provide the first generalization result in this setting that does not rely on any assumptions about the underlying regression function or noise, beyond boundedness. Our analysis introduces a novel proof technique based on decomposing the excess risk into estimation and approximation errors, interpreting gradient flow as an implicit regularizer, that helps avoid uniform convergence traps. This analysis idea could be of independent interest.

Comment: The paper provides theoretical insights into benign overfitting, which is relevant to understanding training dynamics in neural networks.

Relevance: 8 Novelty: 7


71. Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets

ArXiv ID: 2505.12532

Authors: Ahmet Bilican, M. Akın Yılmaz, A. Murat Tekalp, R. Gökberk Cinbiş

Abstract: Efficiently adapting large foundation models is critical, especially with tight compute and memory budgets. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA offer limited granularity and effectiveness in few-parameter regimes. We propose Wavelet Fine-Tuning (WaveFT), a novel PEFT method that learns highly sparse updates in the wavelet domain of residual matrices. WaveFT allows precise control of trainable parameters, offering fine-grained capacity adjustment and excelling with remarkably low parameter count, potentially far fewer than LoRA's minimum -- ideal for extreme parameter-efficient scenarios. In order to demonstrate the effect of the wavelet transform, we compare WaveFT with a special case, called SHiRA, that entails applying sparse updates directly in the weight domain. Evaluated on personalized text-to-image generation using Stable Diffusion XL as baseline, WaveFT significantly outperforms LoRA and other PEFT methods, especially at low parameter counts; achieving superior subject fidelity, prompt alignment, and image diversity.

Comment: The paper introduces a novel parameter-efficient fine-tuning method using wavelets, which aligns with the model compression criterion.

Relevance: 8 Novelty: 7


72. WaLRUS: Wavelets for Long-range Representation Using SSMs

ArXiv ID: 2505.12161

Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk

Abstract: State-Space Models (SSMs) have proven to be powerful tools for modeling long-range dependencies in sequential data. While the recent method known as HiPPO has demonstrated strong performance, and formed the basis for machine learning models S4 and Mamba, it remains limited by its reliance on closed-form solutions for a few specific, well-behaved bases. The SaFARi framework generalized this approach, enabling the construction of SSMs from arbitrary frames, including non-orthogonal and redundant ones, thus allowing an infinite diversity of possible "species" within the SSM family. In this paper, we introduce WaLRUS (Wavelets for Long-range Representation Using SSMs), a new implementation of SaFARi built from Daubechies wavelets.

Comment: The paper introduces a new implementation of state-space models using wavelets, which aligns with model architecture innovations.

Relevance: 8 Novelty: 7


73. SAINT: Attention-Based Modeling of Sub-Action Dependencies in Multi-Action Policies

ArXiv ID: 2505.12109

Authors: Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab

Abstract: The combinatorial structure of many real-world action spaces leads to exponential growth in the number of possible actions, limiting the effectiveness of conventional reinforcement learning algorithms. Recent approaches for combinatorial action spaces impose factorized or sequential structures over sub-actions, failing to capture complex joint behavior. We introduce the Sub-Action Interaction Network using Transformers (SAINT), a novel policy architecture that represents multi-component actions as unordered sets and models their dependencies via self-attention conditioned on the global state. SAINT is permutation-invariant, sample-efficient, and compatible with standard policy optimization algorithms. In 15 distinct combinatorial environments across three task domains, including environments with nearly 17 million joint actions, SAINT consistently outperforms strong baselines.

Comment: The paper presents a novel policy architecture using Transformers, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


74. STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference

ArXiv ID: 2505.12359

Authors: Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang

Abstract: Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved performance.

Comment: The paper introduces a token reduction framework for efficient inference in large vision-language models, which aligns with the model compression criterion.

Relevance: 8 Novelty: 7


75. Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables

ArXiv ID: 2505.12473

Authors: Yu Gui, Cong Ma, Zongming Ma

Abstract: Multi-modal contrastive learning as a self-supervised representation learning technique has achieved great success in foundation model training, such as CLIP~\citep{radford2021learning}. In this paper, we study the theoretical properties of the learned representations from multi-modal contrastive learning beyond linear representations and specific data distributions. Our analysis reveals that, enabled by temperature optimization, multi-modal contrastive learning not only maximizes mutual information between modalities but also adapts to intrinsic dimensions of data, which can be much lower than user-specified dimensions for representation vectors. Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance.

Comment: The paper studies multi-modal contrastive learning, which aligns with the representation learning criterion.

Relevance: 8 Novelty: 7


76. Graph Representational Learning: When Does More Expressivity Hurt Generalization?

ArXiv ID: 2505.11298

Authors: Sohir Maskey, Raffaele Paolino, Fabian Jogl, Gitta Kutyniok, Johannes F. Lutzeyer

Abstract: Graph Neural Networks (GNNs) are powerful tools for learning on structured data, yet the relationship between their expressivity and predictive performance remains unclear. We introduce a family of premetrics that capture different degrees of structural similarity between graphs and relate these similarities to generalization, and consequently, the performance of expressive GNNs. By considering a setting where graph labels are correlated with structural features, we derive generalization bounds that depend on the distance between training and test graphs, model complexity, and training set size. These bounds reveal that more expressive GNNs may generalize worse unless their increased complexity is balanced by a sufficiently large training set or reduced distance between training and test graphs. Our findings relate expressivity and generalization, offering theoretical insights supported by empirical results.

Comment: The paper provides theoretical insights into the expressivity and generalization of Graph Neural Networks, which relates to representation learning and model architecture analysis.

Relevance: 8 Novelty: 7


77. Do different prompting methods yield a common task representation in language models?

ArXiv ID: 2505.12075

Authors: Guy Davidson, Todd M. Gureckis, Brenden M. Lake, Adina Williams

Abstract: Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar representations of the task? An improved understanding of task representation mechanisms would offer interpretability insights and may aid in steering models. We study this through \textit{function vectors} (FVs), recently proposed as a mechanism to extract few-shot ICL task representations. We generalize FVs to alternative task presentations, focusing on short textual instruction prompts, and successfully extract instruction function vectors that promote zero-shot task accuracy. We find evidence that demonstration- and instruction-based function vectors leverage different model components, and offer several controls to dissociate their contributions to task performance. Our results suggest that different task promptings forms do not induce a common task representation through FVs but elicit different, partly overlapping mechanisms. Our findings offer principled support to the practice of combining instructions and task demonstrations, imply challenges in universally monitoring task inference across presentation forms, and encourage further examinations of LLM task inference mechanisms.

Comment: The paper investigates task representation in language models through different prompting methods, offering insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 7


78. On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study

ArXiv ID: 2505.11839

Authors: Shuai Yang, Qi Yang, Luoxi Tang, Jeremy Blackburn, Zhaohan Xi

Abstract: Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate 11 datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.

Comment: The paper provides a decompositional study on counterfactual reasoning in LLMs, which aligns with the interest in theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: