Personalized Daily ArXiv Papers 2025-05-29

[gpt-4o]	Prompt	Completion	Total
Token	69404	9085	78489
Cost	$0.17	$0.09	$0.26

Total arXiv papers: 854

Total scanned papers: 510

Total relevant papers: 50

Table of contents with paper titles:

An Augmentation-Aware Theory for Self-Supervised Contrastive Learning Authors: Jingyi Cui, Hongwei Wen, Yisen Wang
Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning Authors: Maosen Zhao, Pengtao Chen, Chong Yu, Yan Wen, Xudong Tan, Tao Chen
RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers Authors: Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang
Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks Authors: Julia Nakhleh, Robert D. Nowak
EvidenceMoE: A Physics-Guided Mixture-of-Experts with Evidential Critics for Advancing Fluorescence Light Detection and Ranging in Scattering Media Authors: Ismail Erbas, Ferhat Demirkiran, Karthik Swaminathan, Naigang Wang, Navid Ibtehaj Nizam, Stefan T. Radev, Kaoutar El Maghraoui, Xavier Intes, Vikas Pandey
LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing Authors: Ruijie Zhang, Ziyue Liu, Zhengyang Wang, Zheng Zhang
Curse of High Dimensionality Issue in Transformer for Long-context Modeling Authors: Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement Authors: Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang
Almost Linear Convergence under Minimal Score Assumptions: Quantized Transition Diffusion Authors: Xunpeng Huang, Yingyu Lin, Nikki Lijing Kuang, Hanze Dong, Difan Zou, Yian Ma, Tong Zhang
Scaling Reasoning without Attention Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong
Self-Organizing Visual Prototypes for Non-Parametric Representation Learning Authors: Thalles Silva, Helio Pedrini, Adín Ramírez Rivera
Weakly-Supervised Contrastive Learning for Imprecise Class Labels Authors: Zi-Hao Zhou, Jun-Jie Wang, Tong Wei, Min-Ling Zhang
Learning Shared Representations from Unpaired Data Authors: Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham
On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling Authors: Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara
Learning in Compact Spaces with Approximately Normalized Transformers Authors: Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock
CellCLAT: Preserving Topology and Trimming Redundancy in Self-Supervised Cellular Contrastive Learning Authors: Bin Qin, Qirui Ji, Jiangmeng Li, Yupeng Wang, Xuesong Wu, Jianwen Cao, Fanjiang Xu
Benignity of loss landscape with weight decay requires both large overparametrization and initialization Authors: Etienne Boursier, Matthew Bowditch, Matthias Englert, Ranko Lazic
Multiclass Loss Geometry Matters for Generalization of Gradient Descent in Separable Classification Authors: Matan Schliserman, Tomer Koren
FCOS: A Two-Stage Recoverable Model Pruning Framework for Automatic Modulation Recognition Authors: Yao Lu, Tengfei Ma, Zeyu Wang, Zhuangzhi Chen, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui
PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective Authors: Tim Tsz-Kit Lau, Qi Long, Weijie Su
Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu
The quest for the GRAph Level autoEncoder (GRALE) Authors: Paul Krzakala, Gabriel Melo, Charlotte Laclau, Florence d'Alché-Buc, Rémi Flamary
Mitigating Overthinking in Large Reasoning Models via Manifold Steering Authors: Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong
Efficient Diffusion Models for Symmetric Manifolds Authors: Oren Mangoubi, Neil He, Nisheeth K. Vishnoi
Relevance-driven Input Dropout: an Explanation-guided Regularization Technique Authors: Shreyas Gururaj, Lars Grüne, Wojciech Samek, Sebastian Lapuschkin, Leander Weber
Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning Authors: Erxin Yu, Jing Li, Ming Liao, Qi Zhu, Boyang Xue, Minghui Xu, Baojun Wang, Lanqing Hong, Fei Mi, Lifeng Shang
Enhancing Vision Transformer Explainability Using Artificial Astrocytes Authors: Nicolas Echevarrieta-Catalan, Ana Ribas-Rodriguez, Francisco Cedron, Odelia Schwartz, Vanessa Aguiar-Pulido
Sherlock: Self-Correcting Reasoning in Vision-Language Models Authors: Yi Ding, Ruqi Zhang
EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models Authors: Feng Jiang, Zihao Zheng, Xiuping Cui, Maoliang Li, Jiayu Chen, Xiang Chen
Understanding (Un)Reliability of Steering Vectors in Language Models Authors: Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov
A Closer Look at Multimodal Representation Collapse Authors: Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu
Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling Authors: Mónika Farsang, Ramin Hasani, Radu Grosu
Geometric Hyena Networks for Large-scale Equivariant Learning Authors: Artem Moskalev, Mangal Prakash, Junjie Xu, Tianyu Cui, Rui Liao, Tommaso Mansi
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Taming Transformer Without Using Learning Rate Warmup Authors: Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao
AI Mathematician: Towards Fully Automated Frontier Mathematical Research Authors: Yuanhang Liu, Yanxing Huang, Yanqiao Wang, Peng Li, Yang Liu
Sparsification and Reconstruction from the Perspective of Representation Geometry Authors: Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization Authors: Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey
Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning Authors: Yongkang Liu, Xingle Xu, Ercong Nie, Zijing Wang, Shi Feng, Daling Wang, Qian Li, Hinrich Schütze
Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design Authors: Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
Born a Transformer -- Always a Transformer? Authors: Yana Veitsman, Mayank Jobanputra, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn
TuneComp: Joint Fine-tuning and Compression for Large Foundation Models Authors: Xiangyu Chen, Jing Liu, Ye Wang, Matthew Brand, Pu (Perry) Wang, Toshiaki Koike-Akino
The Resurrection of the ReLU Authors: Coşku Can Horuz, Geoffrey Kasenbacher, Saya Higuchi, Sebastian Kairat, Jendrik Stoltz, Moritz Pesl, Bernhard A. Moser, Christoph Linse, Thomas Martinetz, Sebastian Otte
Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning Authors: Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Anton van den Hengel, Damien Teney
Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models Authors: Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh
Estimating the Effects of Sample Training Orders for Large Language Models without Retraining Authors: Hao Yang, Haoxuan Li, Mengyue Yang, Xu Chen, Mingming Gong
Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents Authors: Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs Authors: Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O'Brien, Kevin Zhu, Vasu Sharma
One Rank at a Time: Cascading Error Dynamics in Sequential Learning Authors: Mahtab Alizadeh Vandchali, Fangshuo (Jasper) Liao, Anastasios Kyrillidis

1. An Augmentation-Aware Theory for Self-Supervised Contrastive Learning

ArXiv ID: 2505.22196

Authors: Jingyi Cui, Hongwei Wen, Yisen Wang

Abstract: Self-supervised contrastive learning has emerged as a powerful tool in machine learning and computer vision to learn meaningful representations from unlabeled data. Meanwhile, its empirical success has encouraged many theoretical studies to reveal the learning mechanisms. However, in the existing theoretical research, the role of data augmentation is still under-exploited, especially the effects of specific augmentation types. To fill in the blank, we for the first time propose an augmentation-aware error bound for self-supervised contrastive learning, showing that the supervised risk is bounded not only by the unsupervised risk, but also explicitly by a trade-off induced by data augmentation. Then, under a novel semantic label assumption, we discuss how certain augmentation methods affect the error bound. Lastly, we conduct both pixel- and representation-level experiments to verify our proposed theoretical results.

Comment: The paper provides a theoretical framework for self-supervised contrastive learning, which aligns with representation learning.

Relevance: 9 Novelty: 8

2. Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning

ArXiv ID: 2505.21591

Authors: Maosen Zhao, Pengtao Chen, Chong Yu, Yan Wen, Xudong Tan, Tao Chen

Abstract: Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed in diffusion models. However, achieving 4-bit quantization remains challenging. Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance. Inspired by the success of floating-point (FP) quantization in large language models, we explore low-bit FP quantization for diffusion models and identify key challenges: the failure of signed FP quantization to handle asymmetric activation distributions, the insufficient consideration of temporal complexity in the denoising process during fine-tuning, and the misalignment between fine-tuning loss and quantization error. To address these challenges, we propose the mixup-sign floating-point quantization (MSFP) framework, first introducing unsigned FP quantization in model quantization, along with timestep-aware LoRA (TALoRA) and denoising-factor loss alignment (DFA), which ensure precise and stable fine-tuning. Extensive experiments show that we are the first to achieve superior performance in 4-bit FP quantization for diffusion models, outperforming existing PTQ fine-tuning methods in 4-bit INT quantization.

Comment: The paper explores 4-bit FP quantization for diffusion models, addressing challenges in model quantization, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
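
The signed-versus-unsigned point in the abstract above can be illustrated with a toy experiment. The sketch below is not the authors' MSFP implementation: the E2M1 and E2M2 layouts, the bias of 1, the absence of inf/NaN codes, and the helper names `fp_grid` and `quantize` are all assumptions made for illustration. It only shows that, on purely nonnegative values, spending the sign bit on an extra mantissa bit gives a finer 4-bit grid.

```python
import numpy as np

def fp_grid(exp_bits, man_bits, bias=1, signed=True):
    """Enumerate the values of a tiny FP format (assumed: no inf/NaN codes)."""
    vals = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                vals.append(frac * 2.0 ** (1 - bias))        # subnormal values
            else:
                vals.append((1 + frac) * 2.0 ** (e - bias))  # normal values
    vals = np.unique(np.array(vals))
    if signed:
        vals = np.unique(np.concatenate([-vals, vals]))
    return vals

def quantize(x, grid):
    """Scale x so its largest magnitude hits the grid maximum, then round to nearest."""
    scale = np.abs(x).max() / grid.max()
    idx = np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

# Hypothetical one-sided (asymmetric) activations.
x = np.linspace(0.0, 6.0, 1001)

signed_e2m1 = fp_grid(2, 1, signed=True)     # 1 sign + 2 exponent + 1 mantissa bits
unsigned_e2m2 = fp_grid(2, 2, signed=False)  # 0 sign + 2 exponent + 2 mantissa bits

mse_signed = float(np.mean((x - quantize(x, signed_e2m1)) ** 2))
mse_unsigned = float(np.mean((x - quantize(x, unsigned_e2m2)) ** 2))
print(f"signed E2M1 MSE:   {mse_signed:.4f}")
print(f"unsigned E2M2 MSE: {mse_unsigned:.4f}")
```

Half of the signed grid is never used on this input, which is why the unsigned format reaches a lower error at the same bit-width.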

3. RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

ArXiv ID: 2505.21847

Authors: Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang

Abstract: We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact growing as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents an auspicious direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.

Comment: The paper proposes a structural reparameterization method for Vision Transformers, which is relevant to model compression and architecture.

Relevance: 9 Novelty: 8
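
The linear pathway described in the RePaViT abstract above rests on a simple algebraic identity, sketched below. This is a minimal illustration under stated assumptions, not the released RePaViT code: the layer sizes, the number of active channels `k`, the tanh-approximated GELU, and the function names are all hypothetical, and any normalization or shortcut handling in the real model is ignored. Channels that skip the activation pass through two consecutive linear maps, which can be multiplied into one matrix after training.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

rng = np.random.default_rng(0)
d, h, k = 8, 32, 8  # model width, hidden width, active (non-idle) channels: all hypothetical
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=h)
W2, b2 = rng.normal(size=(h, d)), rng.normal(size=d)
x = rng.normal(size=(5, d))

def ffn_train(x):
    """Training-time form: the last h - k hidden channels stay idle (no activation)."""
    hidden = x @ W1 + b1
    hidden = np.concatenate([gelu(hidden[:, :k]), hidden[:, k:]], axis=1)
    return hidden @ W2 + b2

# Reparameterization: fold the idle pathway into a single d x d linear map.
W_lin = W1[:, k:] @ W2[k:, :]
b_lin = b1[k:] @ W2[k:, :] + b2

def ffn_reparam(x):
    """Inference-time form: a narrow nonlinear branch plus one merged linear branch."""
    return gelu(x @ W1[:, :k] + b1[:k]) @ W2[:k, :] + x @ W_lin + b_lin

y_train = ffn_train(x)
y_reparam = ffn_reparam(x)
print(np.allclose(y_train, y_reparam))
```

In this toy setting the merged form replaces two `d x (h - k)` products with one `d x d` product, which is where a latency saving would come from when `h - k` is much larger than `d`.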

4. Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks

ArXiv ID: 2505.21791

Authors: Julia Nakhleh, Robert D. Nowak

Abstract: Overparameterized neural networks can interpolate a given dataset in many different ways, prompting the fundamental question: which among these solutions should we prefer, and what explicit regularization strategies will provably yield these solutions? This paper addresses the challenge of finding the sparsest interpolating ReLU network -- i.e., the network with the fewest nonzero parameters or neurons -- a goal with wide-ranging implications for efficiency, generalization, interpretability, theory, and model compression. Unlike post hoc pruning approaches, we propose a continuous, almost-everywhere differentiable training objective whose global minima are guaranteed to correspond to the sparsest single-hidden-layer ReLU networks that fit the data. This result marks a conceptual advance: it recasts the combinatorial problem of sparse interpolation as a smooth optimization task, potentially enabling the use of gradient-based training methods. Our objective is based on minimizing $\ell^p$ quasinorms of the weights for $0 < p < 1$, a classical sparsity-promoting strategy in finite-dimensional settings. However, applying these ideas to neural networks presents new challenges: the function class is infinite-dimensional, and the weights are learned using a highly nonconvex objective. We prove that, under our formulation, global minimizers correspond exactly to sparsest solutions. Our work lays a foundation for understanding when and how continuous sparsity-inducing objectives can be leveraged to recover sparse networks through training.

Comment: The paper addresses the challenge of finding the sparsest interpolating ReLU network using a novel training objective, which is relevant to model compression and sparsity.

Relevance: 9 Novelty: 8
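
As a toy illustration of why $\ell^p$ penalties with $0 < p < 1$ favor sparsity, the sketch below compares two weight vectors with the same $\ell^1$ norm. It does not reproduce the paper's training objective for ReLU networks; the helper `lp_penalty` and the example vectors are assumptions chosen for illustration.

```python
import numpy as np

def lp_penalty(w, p):
    """Sum of |w_i|^p. For 0 < p < 1 this is the p-th power of the l^p quasinorm."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(np.abs(w) ** p))

sparse = [1.0, 0.0]  # one nonzero weight
dense = [0.5, 0.5]   # same l^1 norm, spread over two weights

for p in (2.0, 1.0, 0.5):
    print(f"p={p}: sparse={lp_penalty(sparse, p):.3f}, dense={lp_penalty(dense, p):.3f}")
```

With `p = 2` the penalty prefers the dense vector, with `p = 1` the two tie, and only with `p = 0.5` is the sparse vector strictly cheaper.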

5. EvidenceMoE: A Physics-Guided Mixture-of-Experts with Evidential Critics for Advancing Fluorescence Light Detection and Ranging in Scattering Media

ArXiv ID: 2505.21532

Authors: Ismail Erbas, Ferhat Demirkiran, Karthik Swaminathan, Naigang Wang, Navid Ibtehaj Nizam, Stefan T. Radev, Kaoutar El Maghraoui, Xavier Intes, Vikas Pandey

Abstract: Fluorescence LiDAR (FLiDAR), a Light Detection and Ranging (LiDAR) technology employed for distance and depth estimation across medical, automotive, and other fields, encounters significant computational challenges in scattering media. The complex nature of the acquired FLiDAR signal, particularly in such environments, makes isolating photon time-of-flight (related to target depth) and intrinsic fluorescence lifetime exceptionally difficult, thus limiting the effectiveness of current analytical and computational methodologies. To overcome this limitation, we present a Physics-Guided Mixture-of-Experts (MoE) framework tailored for specialized modeling of diverse temporal components. In contrast to conventional MoE approaches, our expert models are informed by underlying physics, such as the radiative transport equation governing photon propagation in scattering media. Central to our approach is EvidenceMoE, which integrates Evidence-Based Dirichlet Critics (EDCs). These critic models assess the reliability of each expert's output by providing per-expert quality scores and corrective feedback. A Decider Network then leverages this information to adaptively fuse expert predictions into a robust final estimate. We validate our method using realistically simulated FLiDAR data for non-invasive cancer cell depth detection generated from photon transport models in tissue. Our framework demonstrates strong performance, achieving a normalized root mean squared error (NRMSE) of 0.030 for depth estimation and 0.074 for fluorescence lifetime.

Comment: The paper presents EvidenceMoE, a Physics-Guided Mixture-of-Experts framework, which is relevant to model architecture innovations and MoE.

Relevance: 9 Novelty: 8

6. LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing

ArXiv ID: 2505.21732

Authors: Ruijie Zhang, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Abstract: Training foundation models such as ViTs and LLMs requires tremendous computing cost. Low-rank matrix or tensor factorization offers a parameter-efficient alternative, but often downgrades performance due to the restricted parameter space. In this work, we introduce Latent Crossing (LaX) -- a simple yet effective plug-and-play module that enhances the capacity of low-rank models by enabling information flow across low-rank subspaces. We extensively validate the benefits of LaX on pre-training tasks with ViT-Base/Large and LLaMA-like models ranging from 60M to 1B parameters. LaX boosts low-rank model performance to match or exceed the full-rank baselines while using 2-3$\times$ fewer parameters. When equipped with low-rank adapters (i.e., LoRA) for fine-tuning LLaMA-7/13B, LaX consistently improves performance on arithmetic and common sense reasoning tasks with negligible cost.

Comment: The paper introduces Latent Crossing, a module to enhance low-rank models, relevant to model compression and low-rank approaches.

Relevance: 9 Novelty: 8

7. Curse of High Dimensionality Issue in Transformer for Long-context Modeling

ArXiv ID: 2505.22107

Authors: Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan

Abstract: Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.

Comment: The paper addresses the computational inefficiencies in transformers for long-context modeling, proposing a dynamic group attention mechanism, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8
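
The idea of aggregating less important tokens, as described in the abstract above, can be sketched for a single query. This is a toy under loud assumptions and not the authors' DGA: importance is taken to be the raw attention score, groups are formed from consecutive runs of the score-sorted remainder, each group is replaced by the mean of its keys and values, and the names `grouped_attention`, `keep`, and `group_size` are hypothetical. Because it scores every token before grouping, it shows the shrinking of the attended set but not an actual compute saving.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def full_attention(q, K, V):
    return softmax(K @ q / np.sqrt(q.size)) @ V

def grouped_attention(q, K, V, keep, group_size):
    """Keep the `keep` highest-scoring tokens; mean-pool the rest in groups."""
    scores = K @ q / np.sqrt(q.size)
    order = np.argsort(-scores)
    top, rest = order[:keep], order[keep:]
    K_parts, V_parts = [K[top]], [V[top]]
    for i in range(0, rest.size, group_size):
        g = rest[i:i + group_size]
        K_parts.append(K[g].mean(axis=0, keepdims=True))  # one pooled key per group
        V_parts.append(V[g].mean(axis=0, keepdims=True))  # one pooled value per group
    K_c, V_c = np.concatenate(K_parts), np.concatenate(V_parts)
    return softmax(K_c @ q / np.sqrt(q.size)) @ V_c, K_c.shape[0]

rng = np.random.default_rng(0)
n, d = 64, 16
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))

out_full = full_attention(q, K, V)
out_all_kept, n_all = grouped_attention(q, K, V, keep=n, group_size=8)
out_grouped, n_grouped = grouped_attention(q, K, V, keep=8, group_size=8)
print(n_all, n_grouped)
```

With `keep=n` nothing is pooled and the result matches full attention; with `keep=8` the query attends over 8 individual tokens plus 7 pooled group tokens instead of 64.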

8. Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement

ArXiv ID: 2505.21535

Authors: Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

Abstract: While transformers excel across vision and language pretraining tasks, their reliance on attention mechanisms poses challenges for inference efficiency, especially on edge and embedded accelerators with limited parallelism and memory bandwidth. Prompted by the observed redundancy of attention at inference time, we hypothesize that though the model learns complicated token dependency through pretraining, the inference-time sequence-to-sequence mapping in each attention layer is actually "simple" enough to be represented with a much cheaper function. In this work, we explore FAR, a Function-preserving Attention Replacement framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules, exemplified by an LSTM. FAR optimizes a multi-head LSTM architecture with a block-wise distillation objective and a global structural pruning framework to achieve a family of efficient LSTM-based models from pretrained transformers. We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships and the token-to-token correlation learned in the transformer's attention module.

Comment: The paper explores replacing attention mechanisms in transformers with more efficient modules, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8

9. Almost Linear Convergence under Minimal Score Assumptions: Quantized Transition Diffusion

ArXiv ID: 2505.21892

Authors: Xunpeng Huang, Yingyu Lin, Nikki Lijing Kuang, Hanze Dong, Difan Zou, Yian Ma, Tong Zhang

Abstract: Continuous diffusion models have demonstrated remarkable performance in data generation across various domains, yet their efficiency remains constrained by two critical limitations: (1) the local adjacency structure of the forward Markov process, which restricts long-range transitions in the data space, and (2) inherent biases introduced during the simulation of time-inhomogeneous reverse denoising processes. To address these challenges, we propose Quantized Transition Diffusion (QTD), a novel approach that integrates data quantization with discrete diffusion dynamics. Our method first transforms the continuous data distribution $p_*$ into a discrete one $q_*$ via histogram approximation and binary encoding, enabling efficient representation in a structured discrete latent space. We then design a continuous-time Markov chain (CTMC) with Hamming distance-based transitions as the forward process, which inherently supports long-range movements in the original data space. For reverse-time sampling, we introduce a truncated uniformization technique to simulate the reverse CTMC, which can provably provide unbiased generation from $q_*$ under minimal score assumptions. Through a novel KL dynamic analysis of the reverse CTMC, we prove that QTD can generate samples with $O(d\ln^2(d/\epsilon))$ score evaluations in expectation to approximate the $d$-dimensional target distribution $p_*$ within an $\epsilon$ error tolerance. Our method not only establishes state-of-the-art inference efficiency but also advances the theoretical foundations of diffusion-based generative modeling by unifying discrete and continuous diffusion paradigms.

Comment: The paper introduces Quantized Transition Diffusion, a novel approach integrating data quantization with discrete diffusion dynamics, advancing theoretical foundations of diffusion-based generative modeling.

Relevance: 9 Novelty: 8
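
The histogram-plus-binary-encoding step in the QTD abstract above, and the reason Hamming-distance transitions allow long-range moves, can be sketched in one dimension. This is an illustration only: the uniform binning on `[0, 1)`, the 4-bit width, plain most-significant-bit-first binary codes, and the helper names `encode` and `decode` are assumptions, not details taken from the paper.

```python
import numpy as np

def encode(x, bits, lo=0.0, hi=1.0):
    """Map a scalar to its histogram bin, then to a binary code (MSB first)."""
    n_bins = 2 ** bits
    b = int(np.clip(np.floor((x - lo) / (hi - lo) * n_bins), 0, n_bins - 1))
    return np.array([(b >> i) & 1 for i in reversed(range(bits))])

def decode(code, lo=0.0, hi=1.0):
    """Map a binary code back to the center of its histogram bin."""
    b = int("".join(str(int(c)) for c in code), 2)
    return lo + (b + 0.5) * (hi - lo) / 2 ** len(code)

code = encode(0.03, bits=4)  # falls into bin 0
flipped = code.copy()
flipped[0] ^= 1              # flip only the most significant bit

hamming = int(np.sum(code != flipped))
jump = decode(flipped) - decode(code)
print(code, flipped, hamming, jump)
```

A single bit flip, the smallest possible Hamming move, shifts the decoded value across half of the data range, whereas a chain that only moves between adjacent bins would need eight steps to cover the same distance.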

10. Scaling Reasoning without Attention

ArXiv ID: 2505.22425

Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong

Abstract: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce \ourmodel, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, \ourmodel-7B outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on Livecodebench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.

Comment: The paper introduces an attention-free language model, addressing architectural inefficiencies in LLMs, which aligns with interests in model architecture innovations.

Relevance: 9 Novelty: 8

11. Self-Organizing Visual Prototypes for Non-Parametric Representation Learning

ArXiv ID: 2505.21533

Authors: Thalles Silva, Helio Pedrini, Adín Ramírez Rivera

Abstract: We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.

Comment: The paper introduces a novel training technique for unsupervised visual feature learning, focusing on non-parametric representation learning, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8

12. Weakly-Supervised Contrastive Learning for Imprecise Class Labels

ArXiv ID: 2505.22028

Authors: Zi-Hao Zhou, Jun-Jie Wang, Tong Wei, Min-Ling Zhang

Abstract: Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of "continuous semantic similarity" to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The implementation code is available at https://github.com/Speechless-10308/WSC.

Comment: The paper proposes a weakly-supervised contrastive learning framework, which is relevant to representation learning, focusing on semantic similarity and graph-theoretic approaches.

Relevance: 9 Novelty: 7

13. Learning Shared Representations from Unpaired Data

ArXiv ID: 2505.21524

Authors: Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham

Abstract: Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code is publicly available at https://github.com/shaham-lab/SUE.

Comment: The paper explores learning shared representations from unpaired data, contributing to foundational research in representation learning.

Relevance: 9 Novelty: 7
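
The spectral embedding of a random walk matrix, which the abstract above builds independently for each modality, can be sketched as follows. This is a generic diffusion-map-style construction and not the authors' SUE code: the Gaussian affinity, the `bandwidth` value, the random stand-in features, and the function name `random_walk_embedding` are assumptions, and the step that aligns the two embeddings across modalities is not shown.

```python
import numpy as np

def random_walk_embedding(X, n_components=2, bandwidth=1.0):
    """Leading nontrivial eigenvectors of P = D^{-1} W for a Gaussian affinity W."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    deg = W.sum(axis=1)
    # S = D^{-1/2} W D^{-1/2} is symmetric and shares its eigenvalues with P.
    S = W / np.sqrt(np.outer(deg, deg))
    evals, evecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    right = evecs / np.sqrt(deg)[:, None]  # right eigenvectors of P
    # Index 0 is the trivial constant eigenvector with eigenvalue 1; skip it.
    return evals[0], evals[1:n_components + 1], right[:, 1:n_components + 1]

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(40, 5))  # stand-in for one unimodal representation
text_feats = rng.normal(size=(30, 7))   # stand-in for another, with no pairing

top_img, evals_img, emb_img = random_walk_embedding(image_feats, bandwidth=2.0)
top_txt, evals_txt, emb_txt = random_walk_embedding(text_feats, bandwidth=2.0)
print(emb_img.shape, emb_txt.shape)
```

Each modality gets its own embedding from its own graph, with different sample counts and feature dimensions, which is what makes the construction usable on unpaired data.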

14. On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

ArXiv ID: 2505.22491

Authors: Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara

Abstract: The dominant paradigm for training large-scale vision and language models is He initialization and a single global learning rate (standard parameterization, SP). Despite its practical success, standard parameterization remains poorly understood from a theoretical perspective: Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates. However, empirically optimal learning rates consistently decay much slower than theoretically predicted. By carefully studying neural network training dynamics, we demonstrate that this discrepancy is not fully explained by finite-width phenomena such as catapult effects or a lack of alignment between weights and incoming activations. We instead show that the apparent contradiction can be fundamentally resolved by taking the loss function into account: In contrast to Mean Squared Error (MSE) loss, we prove that under cross-entropy (CE) loss, an intermediate controlled divergence regime emerges, where logits diverge but loss, gradients, and activations remain stable. Stable training under large learning rates enables persistent feature evolution at scale in all hidden layers, which is crucial for the practical success of SP. In experiments across optimizers (SGD, Adam), architectures (MLPs, GPT) and data modalities (vision, language), we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scalings for standard initialization.

Comment: The paper analyzes the effectiveness of large learning rates under standard width scaling, which is relevant to training dynamics in neural networks.

Relevance: 8 Novelty: 8
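
The contrast between CE and MSE loss in the abstract above has a simple loss-level intuition, sketched below for a single correctly classified example. This is not the paper's width-scaling analysis: the logit vector, the scale factors, and the function names are assumptions chosen for illustration. It only shows that when logits grow without bound, the CE loss and its gradient with respect to the logits stay bounded, while the MSE loss and gradient blow up.

```python
import numpy as np

def ce_loss_and_grad(z, y):
    """Cross-entropy on logits z with true class y; gradient taken w.r.t. the logits."""
    z_shift = z - z.max()  # shift for numerical stability
    log_probs = z_shift - np.log(np.exp(z_shift).sum())
    grad = np.exp(log_probs) - np.eye(z.size)[y]  # softmax(z) - onehot(y)
    return float(-log_probs[y]), grad

def mse_loss_and_grad(z, y):
    """Squared error between logits and the one-hot target."""
    diff = z - np.eye(z.size)[y]
    return float(0.5 * np.sum(diff ** 2)), diff

base_logits = np.array([2.0, -1.0, 0.5])  # class 0 is predicted and is the true class
results = {}
for scale in (1, 10, 100):
    ce, ce_grad = ce_loss_and_grad(scale * base_logits, y=0)
    mse, mse_grad = mse_loss_and_grad(scale * base_logits, y=0)
    results[scale] = (ce, np.linalg.norm(ce_grad), mse, np.linalg.norm(mse_grad))
    print(scale, results[scale])
```

The CE gradient is a difference of two probability vectors, so its norm can never exceed the square root of 2, whereas the MSE gradient grows linearly with the logit scale.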

15. Learning in Compact Spaces with Approximately Normalized Transformers

ArXiv ID: 2505.22014

Authors: Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

Abstract: In deep learning, regularization and normalization are common solutions for challenges such as overfitting, numerical instabilities, and the increasing variance in the residual stream. An alternative approach is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic but approximate normalization (anTransformer). Our approach constrains the norm of parameters and normalizes all representations via scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. When applied to GPT training, we observe a 40% faster convergence compared to models with QK normalization, with less than 3% additional runtime. Deriving scaling laws for anGPT, we found our method enables training with larger batch sizes and fewer hyperparameters, while matching the favorable scaling characteristics of classic GPT architectures.

Comment: The paper introduces a novel normalization method for Transformers, which aligns with the model architecture criterion.