Previous Day 2025-05-28
Monthly Overview 2025-05
Next Day 2025-05-30

Personalized Daily ArXiv Papers 2025-05-29

[gpt-4o] Prompt Completion Total
Token 69404 9085 78489
Cost $0.17 $0.09 $0.26

Total arXiv papers: 854

Total scanned papers: 510

Total relevant papers: 50

Table of contents with paper titles:

  1. An Augmentation-Aware Theory for Self-Supervised Contrastive Learning Authors: Jingyi Cui, Hongwei Wen, Yisen Wang

  2. Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning Authors: Maosen Zhao, Pengtao Chen, Chong Yu, Yan Wen, Xudong Tan, Tao Chen

  3. RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers Authors: Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang

  4. Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks Authors: Julia Nakhleh, Robert D. Nowak

  5. EvidenceMoE: A Physics-Guided Mixture-of-Experts with Evidential Critics for Advancing Fluorescence Light Detection and Ranging in Scattering Media Authors: Ismail Erbas, Ferhat Demirkiran, Karthik Swaminathan, Naigang Wang, Navid Ibtehaj Nizam, Stefan T. Radev, Kaoutar El Maghraoui, Xavier Intes, Vikas Pandey

  6. LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing Authors: Ruijie Zhang, Ziyue Liu, Zhengyang Wang, Zheng Zhang

  7. Curse of High Dimensionality Issue in Transformer for Long-context Modeling Authors: Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan

  8. Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement Authors: Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

  9. Almost Linear Convergence under Minimal Score Assumptions: Quantized Transition Diffusion Authors: Xunpeng Huang, Yingyu Lin, Nikki Lijing Kuang, Hanze Dong, Difan Zou, Yian Ma, Tong Zhang

  10. Scaling Reasoning without Attention Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong

  11. Self-Organizing Visual Prototypes for Non-Parametric Representation Learning Authors: Thalles Silva, Helio Pedrini, Ad\'in Ram\'irez Rivera

  12. Weakly-Supervised Contrastive Learning for Imprecise Class Labels Authors: Zi-Hao Zhou, Jun-Jie Wang, Tong Wei, Min-Ling Zhang

  13. Learning Shared Representations from Unpaired Data Authors: Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham

  14. On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling Authors: Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara

  15. Learning in Compact Spaces with Approximately Normalized Transformers Authors: J\"org K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

  16. CellCLAT: Preserving Topology and Trimming Redundancy in Self-Supervised Cellular Contrastive Learning Authors: Bin Qin, Qirui Ji, Jiangmeng Li, Yupeng Wang, Xuesong Wu, Jianwen Cao, Fanjiang Xu

  17. Benignity of loss landscape with weight decay requires both large overparametrization and initialization Authors: Etienne Boursier, Matthew Bowditch, Matthias Englert, Ranko Lazic

  18. Multiclass Loss Geometry Matters for Generalization of Gradient Descent in Separable Classification Authors: Matan Schliserman, Tomer Koren

  19. FCOS: A Two-Stage Recoverable Model Pruning Framework for Automatic Modulation Recognition Authors: Yao Lu, Tengfei Ma, Zeyu Wang, Zhuangzhi Chen, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui

  20. PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective Authors: Tim Tsz-Kit Lau, Qi Long, Weijie Su

  21. Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu

  22. The quest for the GRAph Level autoEncoder (GRALE) Authors: Paul Krzakala, Gabriel Melo, Charlotte Laclau, Florence d'Alch\'e-Buc, R\'emi Flamary

  23. Mitigating Overthinking in Large Reasoning Models via Manifold Steering Authors: Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong

  24. Efficient Diffusion Models for Symmetric Manifolds Authors: Oren Mangoubi, Neil He, Nisheeth K. Vishnoi

  25. Relevance-driven Input Dropout: an Explanation-guided Regularization Technique Authors: Shreyas Gururaj, Lars Gr\"une, Wojciech Samek, Sebastian Lapuschkin, Leander Weber

  26. Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning Authors: Erxin Yu, Jing Li, Ming Liao, Qi Zhu, Boyang Xue, Minghui Xu, Baojun Wang, Lanqing Hong, Fei Mi, Lifeng Shang

  27. Enhancing Vision Transformer Explainability Using Artificial Astrocytes Authors: Nicolas Echevarrieta-Catalan, Ana Ribas-Rodriguez, Francisco Cedron, Odelia Schwartz, Vanessa Aguiar-Pulido

  28. Sherlock: Self-Correcting Reasoning in Vision-Language Models Authors: Yi Ding, Ruqi Zhang

  29. EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models Authors: Feng Jiang, Zihao Zheng, Xiuping Cui, Maoliang Li, JIayu Chen, Xiang Chen

  30. Understanding (Un)Reliability of Steering Vectors in Language Models Authors: Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov

  31. A Closer Look at Multimodal Representation Collapse Authors: Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu

  32. Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling Authors: M\'onika Farsang, Ramin Hasani, Radu Grosu

  33. Geometric Hyena Networks for Large-scale Equivariant Learning Authors: Artem Moskalev, Mangal Prakash, Junjie Xu, Tianyu Cui, Rui Liao, Tommaso Mansi

  34. R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

  35. Taming Transformer Without Using Learning Rate Warmup Authors: Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao

  36. AI Mathematician: Towards Fully Automated Frontier Mathematical Research Authors: Yuanhang Liu, Yanxing Huang, Yanqiao Wang, Peng Li, Yang Liu

  37. Sparsification and Reconstruction from the Perspective of Representation Geometry Authors: Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu

  38. Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

  39. Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization Authors: Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey

  40. Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning Authors: Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, Hinrich Sch\"utze

  41. Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design Authors: Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

  42. Born a Transformer -- Always a Transformer? Authors: Yana Veitsman, Mayank Jobanputra, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn

  43. TuneComp: Joint Fine-tuning and Compression for Large Foundation Models Authors: Xiangyu Chen (Perry), Jing Liu (Perry), Ye Wang (Perry), Matthew Brand (Perry), Pu (Perry), Wang, Toshiaki Koike-Akino

  44. The Resurrection of the ReLU Authors: Co\c{s}ku Can Horuz, Geoffrey Kasenbacher, Saya Higuchi, Sebastian Kairat, Jendrik Stoltz, Moritz Pesl, Bernhard A. Moser, Christoph Linse, Thomas Martinetz, Sebastian Otte

  45. Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning Authors: Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Anton van den Hengel, Damien Teney

  46. Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models Authors: Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh

  47. Estimating the Effects of Sample Training Orders for Large Language Models without Retraining Authors: Hao Yang, Haoxuan Li, Mengyue Yang, Xu Chen, Mingming Gong

  48. Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents Authors: Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci

  49. From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs Authors: Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O'Brien, Kevin Zhu, Vasu Sharma

  50. One Rank at a Time: Cascading Error Dynamics in Sequential Learning Authors: Mahtab Alizadeh Vandchali (Jasper), Fangshuo (Jasper), Liao, Anastasios Kyrillidis


1. An Augmentation-Aware Theory for Self-Supervised Contrastive Learning

ArXiv ID: 2505.22196

Authors: Jingyi Cui, Hongwei Wen, Yisen Wang

Abstract: Self-supervised contrastive learning has emerged as a powerful tool in machine learning and computer vision to learn meaningful representations from unlabeled data. Meanwhile, its empirical success has encouraged many theoretical studies to reveal the learning mechanisms. However, in the existing theoretical research, the role of data augmentation is still under-exploited, especially the effects of specific augmentation types. To fill in the blank, we for the first time propose an augmentation-aware error bound for self-supervised contrastive learning, showing that the supervised risk is bounded not only by the unsupervised risk, but also explicitly by a trade-off induced by data augmentation. Then, under a novel semantic label assumption, we discuss how certain augmentation methods affect the error bound. Lastly, we conduct both pixel- and representation-level experiments to verify our proposed theoretical results.

Comment: The paper provides a theoretical framework for self-supervised contrastive learning, which aligns with representation learning.

Relevance: 9 Novelty: 8


2. Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning

ArXiv ID: 2505.21591

Authors: Maosen Zhao, Pengtao Chen, Chong Yu, Yan Wen, Xudong Tan, Tao Chen

Abstract: Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed in diffusion models. However, achieving 4-bit quantization remains challenging. Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance. Inspired by the success of floating-point (FP) quantization in large language models, we explore low-bit FP quantization for diffusion models and identify key challenges: the failure of signed FP quantization to handle asymmetric activation distributions, the insufficient consideration of temporal complexity in the denoising process during fine-tuning, and the misalignment between fine-tuning loss and quantization error. To address these challenges, we propose the mixup-sign floating-point quantization (MSFP) framework, first introducing unsigned FP quantization in model quantization, along with timestep-aware LoRA (TALoRA) and denoising-factor loss alignment (DFA), which ensure precise and stable fine-tuning. Extensive experiments show that we are the first to achieve superior performance in 4-bit FP quantization for diffusion models, outperforming existing PTQ fine-tuning methods in 4-bit INT quantization.

Comment: The paper explores 4-bit FP quantization for diffusion models, addressing challenges in model quantization, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8


3. RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

ArXiv ID: 2505.21847

Authors: Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang

Abstract: We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact signifying as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs to our best knowledge, and we believe that it represents an auspicious direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.

Comment: The paper proposes a structural reparameterization method for Vision Transformers, which is relevant to model compression and architecture.

Relevance: 9 Novelty: 8


4. Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks

ArXiv ID: 2505.21791

Authors: Julia Nakhleh, Robert D. Nowak

Abstract: Overparameterized neural networks can interpolate a given dataset in many different ways, prompting the fundamental question: which among these solutions should we prefer, and what explicit regularization strategies will provably yield these solutions? This paper addresses the challenge of finding the sparsest interpolating ReLU network -- i.e., the network with the fewest nonzero parameters or neurons -- a goal with wide-ranging implications for efficiency, generalization, interpretability, theory, and model compression. Unlike post hoc pruning approaches, we propose a continuous, almost-everywhere differentiable training objective whose global minima are guaranteed to correspond to the sparsest single-hidden-layer ReLU networks that fit the data. This result marks a conceptual advance: it recasts the combinatorial problem of sparse interpolation as a smooth optimization task, potentially enabling the use of gradient-based training methods. Our objective is based on minimizing $\ell^p$ quasinorms of the weights for $0 < p < 1$, a classical sparsity-promoting strategy in finite-dimensional settings. However, applying these ideas to neural networks presents new challenges: the function class is infinite-dimensional, and the weights are learned using a highly nonconvex objective. We prove that, under our formulation, global minimizers correspond exactly to sparsest solutions. Our work lays a foundation for understanding when and how continuous sparsity-inducing objectives can be leveraged to recover sparse networks through training.

Comment: The paper addresses the challenge of finding the sparsest interpolating ReLU network using a novel training objective, which is relevant to model compression and sparsity.

Relevance: 9 Novelty: 8


5. EvidenceMoE: A Physics-Guided Mixture-of-Experts with Evidential Critics for Advancing Fluorescence Light Detection and Ranging in Scattering Media

ArXiv ID: 2505.21532

Authors: Ismail Erbas, Ferhat Demirkiran, Karthik Swaminathan, Naigang Wang, Navid Ibtehaj Nizam, Stefan T. Radev, Kaoutar El Maghraoui, Xavier Intes, Vikas Pandey

Abstract: Fluorescence LiDAR (FLiDAR), a Light Detection and Ranging (LiDAR) technology employed for distance and depth estimation across medical, automotive, and other fields, encounters significant computational challenges in scattering media. The complex nature of the acquired FLiDAR signal, particularly in such environments, makes isolating photon time-of-flight (related to target depth) and intrinsic fluorescence lifetime exceptionally difficult, thus limiting the effectiveness of current analytical and computational methodologies. To overcome this limitation, we present a Physics-Guided Mixture-of-Experts (MoE) framework tailored for specialized modeling of diverse temporal components. In contrast to the conventional MoE approaches our expert models are informed by underlying physics, such as the radiative transport equation governing photon propagation in scattering media. Central to our approach is EvidenceMoE, which integrates Evidence-Based Dirichlet Critics (EDCs). These critic models assess the reliability of each expert's output by providing per-expert quality scores and corrective feedback. A Decider Network then leverages this information to fuse expert predictions into a robust final estimate adaptively. We validate our method using realistically simulated Fluorescence LiDAR (FLiDAR) data for non-invasive cancer cell depth detection generated from photon transport models in tissue. Our framework demonstrates strong performance, achieving a normalized root mean squared error (NRMSE) of 0.030 for depth estimation and 0.074 for fluorescence lifetime.

Comment: The paper presents EvidenceMoE, a Physics-Guided Mixture-of-Experts framework, which is relevant to model architecture innovations and MoE.

Relevance: 9 Novelty: 8


6. LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing

ArXiv ID: 2505.21732

Authors: Ruijie Zhang, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Abstract: Training foundation models such as ViTs and LLMs requires tremendous computing cost. Low-rank matrix or tensor factorization offers a parameter-efficient alternative, but often downgrades performance due to the restricted parameter space. In this work, we introduce {\textbf{Latent Crossing (LaX)}} -- a simple yet effective plug-and-play module that enhances the capacity of low-rank models by enabling information flow across low-rank subspaces. We extensively validate the benefits of LaX on pre-training tasks with ViT-Base/Large and LLaMA-like models ranging from 60M to 1B parameters. LaX boosts low-rank model performance to match or exceed the full-rank baselines while using 2-3(\times) fewer parameters. When equipped with low-rank adapters (i.e., LoRA) for fine-tuning LLaMA-7/13B, LaX consistently improves performance on arithmetic and common sense reasoning tasks with negligible cost.

Comment: The paper introduces Latent Crossing, a module to enhance low-rank models, relevant to model compression and low-rank approaches.

Relevance: 9 Novelty: 8


7. Curse of High Dimensionality Issue in Transformer for Long-context Modeling

ArXiv ID: 2505.22107

Authors: Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan

Abstract: Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to \textit{redundant} attention computations: while attention weights are often \textit{sparse}, all tokens consume \textit{equal} computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a \textit{supervised learning task}, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a \textit{group coding strategy}, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose \textit{Dynamic Group Attention} (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.Code is available at https://github.com/bolixinyu/DynamicGroupAttention.

Comment: The paper addresses the computational inefficiencies in transformers for long-context modeling, proposing a dynamic group attention mechanism, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


8. Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement

ArXiv ID: 2505.21535

Authors: Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

Abstract: While transformers excel across vision and language pretraining tasks, their reliance on attention mechanisms poses challenges for inference efficiency, especially on edge and embedded accelerators with limited parallelism and memory bandwidth. Hinted by the observed redundancy of attention at inference time, we hypothesize that though the model learns complicated token dependency through pretraining, the inference-time sequence-to-sequence mapping in each attention layer is actually ''simple'' enough to be represented with a much cheaper function. In this work, we explore FAR, a Function-preserving Attention Replacement framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules, exemplified by an LSTM. FAR optimize a multi-head LSTM architecture with a block-wise distillation objective and a global structural pruning framework to achieve a family of efficient LSTM-based models from pretrained transformers. We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships and the token-to-token correlation learned in the transformer's attention module.

Comment: The paper explores replacing attention mechanisms in transformers with more efficient modules, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


9. Almost Linear Convergence under Minimal Score Assumptions: Quantized Transition Diffusion

ArXiv ID: 2505.21892

Authors: Xunpeng Huang, Yingyu Lin, Nikki Lijing Kuang, Hanze Dong, Difan Zou, Yian Ma, Tong Zhang

Abstract: Continuous diffusion models have demonstrated remarkable performance in data generation across various domains, yet their efficiency remains constrained by two critical limitations: (1) the local adjacency structure of the forward Markov process, which restricts long-range transitions in the data space, and (2) inherent biases introduced during the simulation of time-inhomogeneous reverse denoising processes. To address these challenges, we propose Quantized Transition Diffusion (QTD), a novel approach that integrates data quantization with discrete diffusion dynamics. Our method first transforms the continuous data distribution $p_$ into a discrete one $q_$ via histogram approximation and binary encoding, enabling efficient representation in a structured discrete latent space. We then design a continuous-time Markov chain (CTMC) with Hamming distance-based transitions as the forward process, which inherently supports long-range movements in the original data space. For reverse-time sampling, we introduce a \textit{truncated uniformization} technique to simulate the reverse CTMC, which can provably provide unbiased generation from $q_$ under minimal score assumptions. Through a novel KL dynamic analysis of the reverse CTMC, we prove that QTD can generate samples with $O(d\ln^2(d/\epsilon))$ score evaluations in expectation to approximate the $d$--dimensional target distribution $p_$ within an $\epsilon$ error tolerance. Our method not only establishes state-of-the-art inference efficiency but also advances the theoretical foundations of diffusion-based generative modeling by unifying discrete and continuous diffusion paradigms.

Comment: The paper introduces Quantized Transition Diffusion, a novel approach integrating data quantization with discrete diffusion dynamics, advancing theoretical foundations of diffusion-based generative modeling.

Relevance: 9 Novelty: 8


10. Scaling Reasoning without Attention

ArXiv ID: 2505.22425

Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong

Abstract: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce \ourmodel, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the \textsc{PromptCoT} synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, \ourmodel-7B outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6\% on AIME 24, 0.6\% on AIME 25, and 3.0\% on Livecodebench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.

Comment: The paper introduces an attention-free language model, addressing architectural inefficiencies in LLMs, which aligns with interests in model architecture innovations.

Relevance: 9 Novelty: 8


11. Self-Organizing Visual Prototypes for Non-Parametric Representation Learning

ArXiv ID: 2505.21533

Authors: Thalles Silva, Helio Pedrini, Ad\'in Ram\'irez Rivera

Abstract: We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.

Comment: The paper introduces a novel training technique for unsupervised visual feature learning, focusing on non-parametric representation learning, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8


12. Weakly-Supervised Contrastive Learning for Imprecise Class Labels

ArXiv ID: 2505.22028

Authors: Zi-Hao Zhou, Jun-Jie Wang, Tong Wei, Min-Ling Zhang

Abstract: Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of ``continuous semantic similarity'' to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The implementation code is available at https://github.com/Speechless-10308/WSC.

Comment: The paper proposes a weakly-supervised contrastive learning framework, which is relevant to representation learning, focusing on semantic similarity and graph-theoretic approaches.

Relevance: 9 Novelty: 7


13. Learning Shared Representations from Unpaired Data

ArXiv ID: 2505.21524

Authors: Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham

Abstract: Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code IS publicly available at https://github.com/shaham-lab/SUE.

Comment: The paper explores learning shared representations from unpaired data, contributing to foundational research in representation learning.

Relevance: 9 Novelty: 7


14. On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

ArXiv ID: 2505.22491

Authors: Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara

Abstract: The dominant paradigm for training large-scale vision and language models is He initialization and a single global learning rate (\textit{standard parameterization}, SP). Despite its practical success, standard parametrization remains poorly understood from a theoretical perspective: Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates. However, empirically optimal learning rates consistently decay much slower than theoretically predicted. By carefully studying neural network training dynamics, we demonstrate that this discrepancy is not fully explained by finite-width phenomena such as catapult effects or a lack of alignment between weights and incoming activations. We instead show that the apparent contradiction can be fundamentally resolved by taking the loss function into account: In contrast to Mean Squared Error (MSE) loss, we prove that under cross-entropy (CE) loss, an intermediate \textit{controlled divergence} regime emerges, where logits diverge but loss, gradients, and activations remain stable. Stable training under large learning rates enables persistent feature evolution at scale in all hidden layers, which is crucial for the practical success of SP. In experiments across optimizers (SGD, Adam), architectures (MLPs, GPT) and data modalities (vision, language), we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scalings for standard initialization.

Comment: The paper analyzes the effectiveness of large learning rates under standard width scaling, which is relevant to training dynamics in neural networks.

Relevance: 8 Novelty: 8


15. Learning in Compact Spaces with Approximately Normalized Transformers

ArXiv ID: 2505.22014

Authors: J\"org K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

Abstract: In deep learning, regularization and normalization are common solutions for challenges such as overfitting, numerical instabilities, and the increasing variance in the residual stream. An alternative approach is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic but approximate normalization (anTransformer). Our approach constrains the norm of parameters and normalizes all representations via scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. When applied to GPT training, we observe a 40% faster convergence compared to models with QK normalization, with less than 3% additional runtime. Deriving scaling laws for anGPT, we found our method enables training with larger batch sizes and fewer hyperparameters, while matching the favorable scaling characteristics of classic GPT architectures.

Comment: The paper introduces a novel normalization method for Transformers, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


16. CellCLAT: Preserving Topology and Trimming Redundancy in Self-Supervised Cellular Contrastive Learning

ArXiv ID: 2505.21587

Authors: Bin Qin, Qirui Ji, Jiangmeng Li, Yupeng Wang, Xuesong Wu, Jianwen Cao, Fanjiang Xu

Abstract: Self-supervised topological deep learning (TDL) represents a nascent but underexplored area with significant potential for modeling higher-order interactions in simplicial complexes and cellular complexes to derive representations of unlabeled graphs. Compared to simplicial complexes, cellular complexes exhibit greater expressive power. However, the advancement in self-supervised learning for cellular TDL is largely hindered by two core challenges: \textit{extrinsic structural constraints} inherent to cellular complexes, and intrinsic semantic redundancy in cellular representations. The first challenge highlights that traditional graph augmentation techniques may compromise the integrity of higher-order cellular interactions, while the second underscores that topological redundancy in cellular complexes potentially diminish task-relevant information. To address these issues, we introduce Cellular Complex Contrastive Learning with Adaptive Trimming (CellCLAT), a twofold framework designed to adhere to the combinatorial constraints of cellular complexes while mitigating informational redundancy. Specifically, we propose a parameter perturbation-based augmentation method that injects controlled noise into cellular interactions without altering the underlying cellular structures, thereby preserving cellular topology during contrastive learning. Additionally, a cellular trimming scheduler is employed to mask gradient contributions from task-irrelevant cells through a bi-level meta-learning approach, effectively removing redundant topological elements while maintaining critical higher-order semantics. We provide theoretical justification and empirical validation to demonstrate that CellCLAT achieves substantial improvements over existing self-supervised graph learning methods, marking a significant attempt in this domain.

Comment: The paper introduces a novel framework for self-supervised topological deep learning, which is relevant to representation learning.

Relevance: 8 Novelty: 7


17. Benignity of loss landscape with weight decay requires both large overparametrization and initialization

ArXiv ID: 2505.22578

Authors: Etienne Boursier, Matthew Bowditch, Matthias Englert, Ranko Lazic

Abstract: The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes benign -- i.e., free of spurious local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely in this regime, almost all constant activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary via the example of orthogonal data. Finally, we demonstrate that such loss landscape results primarily hold relevance in the large initialization regime. In contrast, for small initializations -- corresponding to the feature learning regime -- optimization can still converge to spurious local minima, despite the global benignity of the landscape.

Comment: The paper provides theoretical insights into the loss landscape with weight decay, relevant to training dynamics in neural networks.

Relevance: 8 Novelty: 7


18. Multiclass Loss Geometry Matters for Generalization of Gradient Descent in Separable Classification

ArXiv ID: 2505.22359

Authors: Matan Schliserman, Tomer Koren

Abstract: We study the generalization performance of unregularized gradient methods for separable linear classification. While previous work mostly deal with the binary case, we focus on the multiclass setting with $k$ classes and establish novel population risk bounds for Gradient Descent for loss functions that decay to zero. In this setting, we show risk bounds that reveal that convergence rates are crucially influenced by the geometry of the loss template, as formalized by Wang and Scott (2024), rather than of the loss function itself. Particularly, we establish risk upper bounds that holds for any decay rate of the loss whose template is smooth with respect to the $p$-norm. In the case of exponentially decaying losses, our results indicates a contrast between the $p=\infty$ case, where the risk exhibits a logarithmic dependence on $k$, and $p=2$ where the risk scales linearly with $k$. To establish this separation formally, we also prove a lower bound in the latter scenario, demonstrating that the polynomial dependence on $k$ is unavoidable. Central to our analysis is a novel bound on the Rademacher complexity of low-noise vector-valued linear predictors with a loss template smooth w.r.t.~general $p$-norms.

Comment: The paper provides theoretical insights into the generalization of gradient descent in multiclass classification, which is relevant to representation learning.

Relevance: 8 Novelty: 7


19. FCOS: A Two-Stage Recoverable Model Pruning Framework for Automatic Modulation Recognition

ArXiv ID: 2505.21571

Authors: Yao Lu, Tengfei Ma, Zeyu Wang, Zhuangzhi Chen, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui

Abstract: With the rapid development of wireless communications and the growing complexity of digital modulation schemes, traditional manual modulation recognition methods struggle to extract reliable signal features and meet real-time requirements in modern scenarios. Recently, deep learning based Automatic Modulation Recognition (AMR) approaches have greatly improved classification accuracy. However, their large model sizes and high computational demands hinder deployment on resource-constrained devices. Model pruning provides a general approach to reduce model complexity, but existing weight, channel, and layer pruning techniques each present a trade-off between compression rate, hardware acceleration, and accuracy preservation. To this end, in this paper, we introduce FCOS, a novel Fine-to-COarse two-Stage pruning framework that combines channel-level pruning with layer-level collapse diagnosis to achieve extreme compression, high performance and efficient inference. In the first stage of FCOS, hierarchical clustering and parameter fusion are applied to channel weights to achieve channel-level pruning. Then a Layer Collapse Diagnosis (LaCD) module uses linear probing to identify layer collapse and removes the collapsed layers due to high channel compression ratio. Experiments on multiple AMR benchmarks demonstrate that FCOS outperforms existing channel and layer pruning methods. Specifically, FCOS achieves 95.51% FLOPs reduction and 95.31% parameter reduction while still maintaining performance close to the original ResNet56, with only a 0.46% drop in accuracy on Sig2019-12. Code is available at https://github.com/yaolu-zjut/FCOS.

Comment: The paper introduces a novel model pruning framework, which aligns with the model compression criterion.

Relevance: 8 Novelty: 7


20. PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective

ArXiv ID: 2505.21799

Authors: Tim Tsz-Kit Lau, Qi Long, Weijie Su

Abstract: The ever-growing scale of deep learning models and datasets underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.

Comment: The paper introduces a new class of matrix-gradient optimizers, which is relevant to model training dynamics and optimization.

Relevance: 8 Novelty: 7


21. Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates

ArXiv ID: 2505.22608

Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu

Abstract: This paper presents a novel approach for speech foundation models compression that tightly integrates model pruning and parameter update into a single stage. Highly compact layer-level tied self-pinching gates each containing only a single learnable threshold are jointly trained with uncompressed models and used in fine-grained neuron level pruning. Experiments conducted on the LibriSpeech-100hr corpus suggest that our approach reduces the number of parameters of wav2vec2.0-base and HuBERT-large models by 65% and 60% respectively, while incurring no statistically significant word error rate (WER) increase on the test-clean dataset. Compared to previously published methods on the same task, our approach not only achieves the lowest WER of 7.05% on the test-clean dataset under a comparable model compression ratio of 4.26x, but also operates with at least 25% less model compression time.

Comment: The paper presents a novel approach for speech model compression, which aligns with the model compression criterion.

Relevance: 8 Novelty: 7


22. The quest for the GRAph Level autoEncoder (GRALE)

ArXiv ID: 2505.22109

Authors: Paul Krzakala, Gabriel Melo, Charlotte Laclau, Florence d'Alch\'e-Buc, R\'emi Flamary

Abstract: Although graph-based learning has attracted a lot of attention, graph representation learning is still a challenging task whose resolution may impact key application fields such as chemistry or biology. To this end, we introduce GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained using an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module, which is trained jointly with the encoder and decoder. The proposed attention-based architecture relies on Evoformer, the core component of AlphaFold, which we extend to support both graph encoding and decoding. We show, in numerical experiments on simulated and molecular data, that GRALE enables a highly general form of pre-training, applicable to a wide range of downstream tasks, from classification and regression to more complex tasks such as graph interpolation, editing, matching, and prediction.

Comment: The paper introduces a novel graph autoencoder, which is relevant to model architecture and representation learning.

Relevance: 8 Novelty: 7


23. Mitigating Overthinking in Large Reasoning Models via Manifold Steering

ArXiv ID: 2505.22411

Authors: Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong

Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first showcase that the tendency of overthinking can be effectively captured by a single direction in the model's activation space and the issue can be eased by intervening the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noises introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.

Comment: The paper introduces Manifold Steering to mitigate overthinking in large reasoning models, which involves mechanistic interpretability and activation space analysis, aligning with representation learning insights.

Relevance: 8 Novelty: 7


24. Efficient Diffusion Models for Symmetric Manifolds

ArXiv ID: 2505.21640

Authors: Oren Mangoubi, Neil He, Nisheeth K. Vishnoi

Abstract: We introduce a framework for designing efficient diffusion models for $d$-dimensional symmetric-space Riemannian manifolds, including the torus, sphere, special orthogonal group and unitary group. Existing manifold diffusion models often depend on heat kernels, which lack closed-form expressions and require either $d$ gradient evaluations or exponential-in-$d$ arithmetic operations per training step. We introduce a new diffusion model for symmetric manifolds with a spatially-varying covariance, allowing us to leverage a projection of Euclidean Brownian motion to bypass heat kernel computations. Our training algorithm minimizes a novel efficient objective derived via Ito's Lemma, allowing each step to run in $O(1)$ gradient evaluations and nearly-linear-in-$d$ ($O(d^{1.19})$) arithmetic operations, reducing the gap between diffusions on symmetric manifolds and Euclidean space. Manifold symmetries ensure the diffusion satisfies an "average-case" Lipschitz condition, enabling accurate and efficient sample generation. Empirically, our model outperforms prior methods in training speed and improves sample quality on synthetic datasets on the torus, special orthogonal group, and unitary group.

Comment: The paper introduces efficient diffusion models for symmetric manifolds, focusing on algorithmic efficiency and training speed, which aligns with model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 7


25. Relevance-driven Input Dropout: an Explanation-guided Regularization Technique

ArXiv ID: 2505.21595

Authors: Shreyas Gururaj, Lars Gr\"une, Wojciech Samek, Sebastian Lapuschkin, Leander Weber

Abstract: Overfitting is a well-known issue extending even to state-of-the-art (SOTA) Machine Learning (ML) models, resulting in reduced generalization, and a significant train-test performance gap. Mitigation measures include a combination of dropout, data augmentation, weight decay, and other regularization techniques. Among the various data augmentation strategies, occlusion is a prominent technique that typically focuses on randomly masking regions of the input during training. Most of the existing literature emphasizes randomness in selecting and modifying the input features instead of regions that strongly influence model decisions. We propose Relevance-driven Input Dropout (RelDrop), a novel data augmentation method which selectively occludes the most relevant regions of the input, nudging the model to use other important features in the prediction process, thus improving model generalization through informed regularization. We further conduct qualitative and quantitative analyses to study how Relevance-driven Input Dropout (RelDrop) affects model decision-making. Through a series of experiments on benchmark datasets, we demonstrate that our approach improves robustness towards occlusion, results in models utilizing more features within the region of interest, and boosts inference time generalization performance. Our code is available at https://github.com/Shreyas-Gururaj/LRP_Relevance_Dropout.

Comment: The paper introduces Relevance-driven Input Dropout, a novel data augmentation method for improving model generalization, which aligns with representation learning insights.

Relevance: 8 Novelty: 7


26. Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning

ArXiv ID: 2505.22591

Authors: Erxin Yu, Jing Li, Ming Liao, Qi Zhu, Boyang Xue, Minghui Xu, Baojun Wang, Lanqing Hong, Fei Mi, Lifeng Shang

Abstract: Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model's (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs' mathematical reasoning through error generalization.

Comment: The paper presents a framework for improving LLMs' mathematical reasoning through error generalization, which is relevant to foundational research in LLMs.

Relevance: 8 Novelty: 7


27. Enhancing Vision Transformer Explainability Using Artificial Astrocytes

ArXiv ID: 2505.21513

Authors: Nicolas Echevarrieta-Catalan, Ana Ribas-Rodriguez, Francisco Cedron, Odelia Schwartz, Vanessa Aguiar-Pulido

Abstract: Machine learning models achieve high precision, but their decision-making processes often lack explainability. Furthermore, as model complexity increases, explainability typically decreases. Existing efforts to improve explainability primarily involve developing new eXplainable artificial intelligence (XAI) techniques or incorporating explainability constraints during training. While these approaches yield specific improvements, their applicability remains limited. In this work, we propose the Vision Transformer with artificial Astrocytes (ViTA). This training-free approach is inspired by neuroscience and enhances the reasoning of a pretrained deep neural network to generate more human-aligned explanations. We evaluated our approach employing two well-known XAI techniques, Grad-CAM and Grad-CAM++, and compared it to a standard Vision Transformer (ViT). Using the ClickMe dataset, we quantified the similarity between the heatmaps produced by the XAI techniques and a (human-aligned) ground truth. Our results consistently demonstrate that incorporating artificial astrocytes enhances the alignment of model explanations with human perception, leading to statistically significant improvements across all XAI techniques and metrics utilized.

Comment: The paper proposes a novel approach to enhance Vision Transformer explainability using artificial astrocytes, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


28. Sherlock: Self-Correcting Reasoning in Vision-Language Models

ArXiv ID: 2505.22651

Authors: Yi Ding, Ruqi Zhang

Abstract: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.

Comment: The paper introduces a self-correction framework for reasoning in vision-language models, which is relevant to foundational research in model architecture.

Relevance: 8 Novelty: 7


29. EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models

ArXiv ID: 2505.21567

Authors: Feng Jiang, Zihao Zheng, Xiuping Cui, Maoliang Li, JIayu Chen, Xiang Chen

Abstract: With the development of Embodied Artificial intelligence, the end-to-end control policy such as Vision-Language-Action (VLA) model has become the mainstream. Existing VLA models faces expensive computing/storage cost, which need to be optimized. Quantization is considered as the most effective method which can not only reduce the memory cost but also achieve computation acceleration. However, we find the token alignment of VLA models hinders the application of existing quantization methods. To address this, we proposed an optimized framework called EaqVLA, which apply encoding-aligned quantization to VLA models. Specifically, we propose an complete analysis method to find the misalignment in various granularity. Based on the analysis results, we propose a mixed precision quantization with the awareness of encoding alignment. Experiments shows that the porposed EaqVLA achieves better quantization performance (with the minimal quantization loss for end-to-end action control and xxx times acceleration) than existing quantization methods.

Comment: The paper proposes a quantization framework for Vision-Language-Action models, which is relevant to model compression.

Relevance: 8 Novelty: 7


30. Understanding (Un)Reliability of Steering Vectors in Language Models

ArXiv ID: 2505.22637

Authors: Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov

Abstract: Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective steering. Finally, we observe that datasets where positive and negative activations are better separated are more steerable. Our results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.

Comment: The paper studies the reliability of steering vectors in language models, which is relevant to foundational research in LLMs.

Relevance: 8 Novelty: 7


31. A Closer Look at Multimodal Representation Collapse

ArXiv ID: 2505.22483

Authors: Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu

Abstract: We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.

Comment: The paper investigates modality collapse in multimodal representation learning, which is relevant to representation learning.

Relevance: 8 Novelty: 7


32. Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling

ArXiv ID: 2505.21717

Authors: M\'onika Farsang, Ramin Hasani, Radu Grosu

Abstract: We present LrcSSM, a \textit{nonlinear} recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing the state-transition matrix to be diagonal and learned at every step, the full sequence can be solved in parallel with a single prefix-scan, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Lastly, for network depth $L$, as the forward and backward passes cost $\Theta(T\,D\,L)$ FLOPs, with its low sequential depth and parameter count $\Theta(D\,L)$, the model follows the compute-optimal scaling law regime ($\beta \approx 0.42$) recently observed for Mamba, outperforming quadratic-attention Transformers at equal compute while avoiding the memory overhead of FFT-based long convolutions. We show that on a series of long-range forecasting tasks, LrcSSM outperforms LRU, S5 and Mamba.

Comment: The paper presents a new recurrent model LrcSSM for efficient sequence modeling, which aligns with the core topic of model architecture innovations.

Relevance: 8 Novelty: 7


33. Geometric Hyena Networks for Large-scale Equivariant Learning

ArXiv ID: 2505.22560

Authors: Artem Moskalev, Mangal Prakash, Junjie Xu, Tianyu Cui, Rui Liao, Tommaso Mansi

Abstract: Processing global geometric context while preserving equivariance is crucial when modeling biological, chemical, and physical systems. Yet, this is challenging due to the computational demands of equivariance and global context at scale. Standard methods such as equivariant self-attention suffer from quadratic complexity, while local methods such as distance-based message passing sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, we introduce Geometric Hyena, the first equivariant long-convolutional model for geometric systems. Geometric Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on all-atom property prediction of large RNA molecules and full protein molecular dynamics, Geometric Hyena outperforms existing equivariant models while requiring significantly less memory and compute that equivariant self-attention. Notably, our model processes the geometric context of 30k tokens 20x faster than the equivariant transformer and allows 72x longer context within the same budget.

Comment: The paper introduces Geometric Hyena, an equivariant long-convolutional model for geometric systems, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


34. R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

ArXiv ID: 2505.21600

Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

Abstract: Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.

Comment: The paper introduces R2R, a neural token routing method for efficient LLM inference, which is relevant to large language models and efficiency.

Relevance: 8 Novelty: 7


35. Taming Transformer Without Using Learning Rate Warmup

ArXiv ID: 2505.21910

Authors: Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao

Abstract: Scaling Transformer to a large scale without using some technical tricks such as learning rate warump and using an obviously lower learning rate is an extremely challenging task, and is increasingly gaining more attention. In this paper, we provide a theoretical analysis for the process of training Transformer and reveal the rationale behind the model crash phenomenon in the training process, termed \textit{spectral energy concentration} of ${\bW_q}^{\top} \bW_k$, which is the reason for a malignant entropy collapse, where ${\bW_q}$ and $\bW_k$ are the projection matrices for the query and the key in Transformer, respectively. To remedy this problem, motivated by \textit{Weyl's Inequality}, we present a novel optimization strategy, \ie, making the weight updating in successive steps smooth -- if the ratio $\frac{\sigma_{1}(\nabla \bW_t)}{\sigma_{1}(\bW_{t-1})}$ is larger than a threshold, we will automatically bound the learning rate to a weighted multiple of $\frac{\sigma_{1}(\bW_{t-1})}{\sigma_{1}(\nabla \bW_t)}$, where $\nabla \bW_t$ is the updating quantity in step $t$. Such an optimization strategy can prevent spectral energy concentration to only a few directions, and thus can avoid malignant entropy collapse which will trigger the model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformers without using learning rate warmup.

Comment: The paper provides a theoretical analysis of training Transformers without learning rate warmup, which aligns with model architecture insights.

Relevance: 8 Novelty: 7


36. AI Mathematician: Towards Fully Automated Frontier Mathematical Research

ArXiv ID: 2505.22451

Authors: Yuanhang Liu, Yanxing Huang, Yanqiao Wang, Peng Li, Yang Liu

Abstract: Large Reasoning Models (LRMs) have made significant progress in mathematical capabilities in recent times. However, these successes have been primarily confined to competition-level problems. In this work, we propose AI Mathematician (AIM) framework, which harnesses the reasoning strength of LRMs to support frontier mathematical research. We have identified two critical challenges of mathematical research compared to competition, {\it the intrinsic complexity of research problems} and {\it the requirement of procedural rigor}. To address these challenges, AIM incorporates two core strategies: an exploration mechanism to foster longer solution paths, and the pessimistic reasonable verification method to ensure reliability. This early version of AIM already exhibits strong capability in tackling research-level tasks. We conducted extensive experiments across several real-world mathematical topics and obtained promising results. AIM is able to autonomously construct substantial portions of proofs and uncover non-trivial insights within each research area. These findings highlight the potential of LRMs in mathematical discovery and suggest that LRM-based agent systems could significantly accelerate mathematical research in the future.

Comment: The paper proposes an AI framework for mathematical research using LRMs, which is relevant to foundational research in LLMs.

Relevance: 8 Novelty: 7


37. Sparsification and Reconstruction from the Perspective of Representation Geometry

ArXiv ID: 2505.22506

Authors: Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu

Abstract: Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vector from language models? What is the relationship between this organizational paradigm and feature disentanglement as well as reconstruction performance? To address these questions, we propose the SAEMA, which validates the stratified structure of the representation by observing the variability of the rank of the symmetric semipositive definite (SSPD) matrix corresponding to the modal tensor unfolded along the latent tensor with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structures, we define local and global representations, demonstrating that they amplify inter-feature distinctions by merging similar semantic features and introducing additional dimensionality. Furthermore, we intervene the global representation from an optimization perspective, proving a significant causal relationship between their separability and the reconstruction performance. This study explains the principles of sparsity from the perspective of representational geometry and demonstrates the impact of changes in representational structure on reconstruction performance. Particularly emphasizes the necessity of understanding representations and incorporating representational constraints, providing empirical references for developing new interpretable tools and improving SAEs. The code is available at \hyperlink{https://github.com/wenjie1835/SAERepGeo}{https://github.com/wenjie1835/SAERepGeo}.

Comment: The paper explores sparse autoencoders and representation geometry, which is relevant to representation learning.

Relevance: 8 Novelty: 7


38. Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

ArXiv ID: 2505.22038

Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

Abstract: Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average.

Comment: The paper introduces a novel token pruning method for vision-language models, which is relevant to model compression through pruning techniques.

Relevance: 8 Novelty: 7


39. Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization

ArXiv ID: 2505.21895

Authors: Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey

Abstract: Low-Rank Adaptation (LoRA) has become a standard approach for parameter-efficient fine-tuning, offering substantial reductions in trainable parameters by modeling updates as the product of two low-rank matrices. While effective, the low-rank constraint inherently limits representational capacity, often resulting in reduced performance compared to full-rank fine-tuning. Recent work by Ji et al. (2025) has addressed this limitation by applying a fixed-frequency sinusoidal transformation to low-rank adapters, increasing their stable rank without introducing additional parameters. This raises a crucial question: can the same sine-activated technique be successfully applied within the context of Post-Training Quantization to retain benefits even after model compression? In this paper, we investigate this question by extending the sinusoidal transformation framework to quantized LoRA adapters. We develop a theoretical analysis showing that the stable rank of a quantized adapter is tightly linked to that of its full-precision counterpart, motivating the use of such rank-enhancing functions even under quantization. Our results demonstrate that the expressivity gains from a sinusoidal non-linearity persist after quantization, yielding highly compressed adapters with negligible loss in performance. We validate our approach across a range of fine-tuning tasks for language, vision and text-to-image generation achieving significant memory savings while maintaining competitive accuracy.

Comment: The paper investigates the application of sinusoidal transformations in low-rank adapters and their impact on post-training quantization, relevant to model compression and low-rank approaches.

Relevance: 8 Novelty: 7


40. Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning

ArXiv ID: 2505.22355

Authors: Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, Hinrich Sch\"utze

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods achieve performance comparable to Full Fine-Tuning (FFT) while requiring significantly fewer computing resources, making it the go-to choice for researchers. We find that although PEFT can achieve competitive results on some benchmarks, its performance falls short of FFT in complex tasks, such as reasoning and instruction-based fine-tuning. In this paper, we compare the characteristics of PEFT and FFT in terms of representational capacity and robustness based on optimization theory. We theoretically demonstrate that PEFT is a strict subset of FFT. By providing theoretical upper bounds for PEFT, we show that the limited parameter space constrains the model's representational ability, making it more susceptible to perturbations. Experiments on 15 datasets encompassing classification, generation, reasoning, instruction fine-tuning tasks and 11 adversarial test sets validate our theories. We hope that these results spark further research beyond the realms of well established PEFT. The source code is in the anonymous Github repository\footnote{https://github.com/misonsky/PEFTEval}.

Comment: The paper provides a theoretical comparison between parameter-efficient and full fine-tuning, relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


41. Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

ArXiv ID: 2505.22179

Authors: Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

Abstract: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.

Comment: The paper explores the integration of speculative decoding and quantization, relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


42. Born a Transformer -- Always a Transformer?

ArXiv ID: 2505.21785

Authors: Yana Veitsman, Mayank Jobanputra, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn

Abstract: Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. [2024]. We use the recently proposed C-RASP framework for studying length generalization [Huang et al., 2025b] to provide guarantees for each of our settings. Empirically, we observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained Transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain Transformer capabilities, but does not overcome fundamental length-generalization limits.

Comment: The paper investigates the limitations of transformers in sequence-to-sequence tasks, relevant to model architecture and emerging trends.

Relevance: 8 Novelty: 7


43. TuneComp: Joint Fine-tuning and Compression for Large Foundation Models

ArXiv ID: 2505.21835

Authors: Xiangyu Chen (Perry), Jing Liu (Perry), Ye Wang (Perry), Matthew Brand (Perry), Pu (Perry), Wang, Toshiaki Koike-Akino

Abstract: To reduce model size during post-training, compression methods, including knowledge distillation, low-rank approximation, and pruning, are often applied after fine-tuning the model. However, sequential fine-tuning and compression sacrifices performance, while creating a larger than necessary model as an intermediate step. In this work, we aim to reduce this gap, by directly constructing a smaller model while guided by the downstream task. We propose to jointly fine-tune and compress the model by gradually distilling it to a pruned low-rank structure. Experiments demonstrate that joint fine-tuning and compression significantly outperforms other sequential compression methods.

Comment: The paper proposes a joint fine-tuning and compression method, which aligns with the model compression criterion.

Relevance: 8 Novelty: 7


44. The Resurrection of the ReLU

ArXiv ID: 2505.22074

Authors: Co\c{s}ku Can Horuz, Geoffrey Kasenbacher, Saya Higuchi, Sebastian Kairat, Jendrik Stoltz, Moritz Pesl, Bernhard A. Moser, Christoph Linse, Thomas Martinetz, Sebastian Otte

Abstract: Modeling sophisticated activation functions within deep learning architectures has evolved into a distinct research direction. Functions such as GELU, SELU, and SiLU offer smooth gradients and improved convergence properties, making them popular choices in state-of-the-art models. Despite this trend, the classical ReLU remains appealing due to its simplicity, inherent sparsity, and other advantageous topological characteristics. However, ReLU units are prone to becoming irreversibly inactive - a phenomenon known as the dying ReLU problem - which limits their overall effectiveness. In this work, we introduce surrogate gradient learning for ReLU (SUGAR) as a novel, plug-and-play regularizer for deep architectures. SUGAR preserves the standard ReLU function during the forward pass but replaces its derivative in the backward pass with a smooth surrogate that avoids zeroing out gradients. We demonstrate that SUGAR, when paired with a well-chosen surrogate function, substantially enhances generalization performance over convolutional network architectures such as VGG-16 and ResNet-18, providing sparser activations while effectively resurrecting dead ReLUs. Moreover, we show that even in modern architectures like Conv2NeXt and Swin Transformer - which typically employ GELU - substituting these with SUGAR yields competitive and even slightly superior performance. These findings challenge the prevailing notion that advanced activation functions are necessary for optimal performance. Instead, they suggest that the conventional ReLU, particularly with appropriate gradient handling, can serve as a strong, versatile revived classic across a broad range of deep learning vision models.

Comment: The paper introduces a novel regularizer for ReLU activation functions, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


45. Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning

ArXiv ID: 2505.22308

Authors: Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Anton van den Hengel, Damien Teney

Abstract: Pretraining on large, semantically rich datasets is key for developing language models. Surprisingly, recent studies have shown that even synthetic data, generated procedurally through simple semantic-free algorithms, can yield some of the same benefits as natural language pretraining. It is unclear what specific capabilities such simple synthetic data instils in a model, where these capabilities reside in the architecture, and how they manifest within its weights. In this short paper, we identify several beneficial forms of procedural data, together with specific algorithmic reasoning skills that improve in small transformers. Our core finding is that different procedural rules instil distinct but complementary inductive structures in the model. With extensive ablations and partial-transfer experiments, we discover that these structures reside in different parts of the model. Attention layers often carry the most transferable information, but some pretraining rules impart useful structure to MLP blocks instead. Most interestingly, the structures induced by multiple rules can be composed to jointly reinforce multiple capabilities. These results suggest an exciting possibility of disentangling the acquisition of knowledge from reasoning in language models, with the goal of improving their robustness and data efficiency.

Comment: The paper explores modular structures in transformers pretrained on procedural data, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


46. Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models

ArXiv ID: 2505.21765

Authors: Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh

Abstract: While recent success of large reasoning models (LRMs) significantly advanced LLMs' reasoning capability by optimizing the final answer accuracy using reinforcement learning, they may also drastically increase the output length due to overthinking, characterized by unnecessarily complex reasoning paths that waste computation and potentially degrade the performance. We hypothesize that such inefficiencies stem from LRMs' limited capability to dynamically select the proper modular reasoning strategies, termed thinking patterns at the right position. To investigate this hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting beneficial patterns that improve the answer while removing detrimental ones. Empirical analysis confirms that our optimized thinking paths yield more concise yet sufficiently informative trajectories, enhancing reasoning efficiency by reducing attention FLOPs by up to 47% while maintaining accuracy for originally correct responses. Moreover, a non-trivial portion of originally incorrect responses are transformed into correct ones, achieving a 15.6% accuracy improvement with reduced length. Motivated by the improvement brought by the optimized thinking paths, we apply a preference optimization technique supported by a pairwise dataset contrasting suboptimal and optimal reasoning paths. Experimental evaluations across multiple mathematical reasoning benchmarks reveal that our method notably reduces computational overhead while simultaneously improving reasoning accuracy, achieving up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.

Comment: The paper focuses on optimizing thinking dynamics in large reasoning models, which aligns with the interest in theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


47. Estimating the Effects of Sample Training Orders for Large Language Models without Retraining

ArXiv ID: 2505.22042

Authors: Hao Yang, Haoxuan Li, Mengyue Yang, Xu Chen, Mingming Gong

Abstract: The order of training samples plays a crucial role in large language models (LLMs), significantly impacting both their external performance and internal learning dynamics. Traditional methods for investigating this effect generally require retraining the model with various sample orders, which is computationally infeasible for LLMs. In this work, we improve traditional methods by designing a retraining-free framework. By approximating Adam optimizer updates with first- and second-order Taylor expansions and utilizing random projection methods to store intermediate checkpoints, our framework can efficiently estimate model parameters for arbitrary training sample orders. Next, we apply our framework to two downstream research problems: (1) Training curriculum design for LLMs -- we base our retraining-free framework to propose a novel curriculum learning strategy that augments curriculum proposals with estimated model performances, enabling more informed sample scheduling. (2) LLMs' memorization and generalization effect analysis -- we use our retraining-free framework to estimate how the positions of training samples influence LLMs' capacity for memorization and generalization. We conduct extensive experiments to validate the effectiveness of our retraining-free framework in reproducing the true model performances, and further demonstrate its potential in optimizing LLM training curricula and analyzing the memorization and generalization effects of LLMs.

Comment: The paper presents a retraining-free framework for estimating the effects of sample training orders in LLMs, contributing to theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


48. Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents

ArXiv ID: 2505.22655

Authors: Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci

Abstract: Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: Underspecification uncertainties, for when users do not provide all information or define the exact task at the first go, interactive learning, to ask follow-up questions and reduce the uncertainty about the current context, and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.

Comment: The paper argues for a reassessment of uncertainty quantification in LLMs, proposing new research directions, which aligns with theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


49. From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

ArXiv ID: 2505.21800

Authors: Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O'Brien, Kevin Zhu, Vasu Sharma

Abstract: Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.

Comment: The paper explores multidimensional representations of propositional facts in LLMs, contributing to theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


50. One Rank at a Time: Cascading Error Dynamics in Sequential Learning

ArXiv ID: 2505.22602

Authors: Mahtab Alizadeh Vandchali (Jasper), Fangshuo (Jasper), Liao, Anastasios Kyrillidis

Abstract: Sequential learning -- where complex tasks are broken down into simpler, hierarchical components -- has emerged as a paradigm in AI. This paper views sequential learning through the lens of low-rank linear regression, focusing specifically on how errors propagate when learning rank-1 subspaces sequentially. We present an analysis framework that decomposes the learning process into a series of rank-1 estimation problems, where each subsequent estimation depends on the accuracy of previous steps. Our contribution is a characterization of the error propagation in this sequential process, establishing bounds on how errors -- e.g., due to limited computational budgets and finite precision -- affect the overall model accuracy. We prove that these errors compound in predictable ways, with implications for both algorithmic design and stability guarantees.

Comment: The paper provides an analysis of error dynamics in sequential learning through low-rank linear regression, which relates to model compression and low-rank approaches.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.