Personalized Daily Arxiv Papers 02/28/2025

[gpt-4o]  Prompt  Completion  Total
Token     51173   7203        58376
Cost      $0.13   $0.07       $0.20

Total ArXiv papers: 557

Total scanned papers: 306

Total relevant papers: 41

Table of contents with paper titles:

  1. Learning with Exact Invariances in Polynomial Time Authors: Ashkan Soleymani, Behrooz Tahmasebi, Stefanie Jegelka, Patrick Jaillet

  2. Algebraic Machine Learning: Learning as computing an algebraic decomposition of a task Authors: Fernando Martin-Maroto, Nabil Abderrahaman, David Mendez, Gonzalo G. de Polavieja

  3. Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge Authors: Yan-Lun Chen, Yi-Ru Wei, Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

  4. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts Authors: Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

  5. LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM Authors: Yehonathan Refael, Iftach Arbel, Ofir Lindenbaum, Tom Tirer

  6. HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration Authors: Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh

  7. Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models Authors: Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, Taylor Webb

  8. Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking Authors: Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin

  9. Forward-Cooperation-Backward (FCB) learning in a Multi-Encoding Uni-Decoding neural network architecture Authors: Prasun Dutta, Koustab Ghosh, Rajat K. De

  10. Do Large Language Models Know How Much They Know? Authors: Gabriele Prato, Jerry Huang, Prasannna Parthasarathi, Shagun Sodhani, Sarath Chandar

  11. Self-Training Elicits Concise Reasoning in Large Language Models Authors: Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun

  12. Your contrastive learning problem is secretly a distribution alignment problem Authors: Zihao Chen, Chi-Heng Lin, Ran Liu, Jingyun Xiao, Eva L Dyer

  13. R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

  14. Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models Authors: Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz

  15. Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula Authors: Matteo Vilucchio, Yatin Dandi, Cedric Gerbelot, Florent Krzakala

  16. Identifiable Multi-View Causal Discovery Without Non-Gaussianity Authors: Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort, Aapo Hyvärinen

  17. Topological Autoencoders++: Fast and Accurate Cycle-Aware Dimensionality Reduction Authors: Mattéo Clémot, Julie Digne, Julien Tierny

  18. Scalable Signature Kernel Computations for Long Time Series via Local Neumann Series Expansions Authors: Matthew Tamayo-Rios, Alexander Schell, Rima Alaifari

  19. Extremely Greedy Equivalence Search Authors: Achille Nazaret, David Blei

  20. Tell me why: Visual foundation models as self-explainable classifiers Authors: Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis

  21. Obtaining Example-Based Explanations from Deep Neural Networks Authors: Genghua Dong, Henrik Boström, Michalis Vazirgiannis, Roman Bresson

  22. Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation Authors: Pavel Rumiantsev, Mark Coates

  23. Accurate and Scalable Graph Neural Networks via Message Invariance Authors: Zhihao Shi, Jie Wang, Zhiwei Zhuang, Xize Liang, Bin Li, Feng Wu

  24. Sanity Checking Causal Representation Learning on a Simple Real-World System Authors: Juan L. Gamella, Simon Bing, Jakob Runge

  25. Spectral Analysis of Representational Similarity with Limited Neurons Authors: Hyunmo Kang, Abdulkadir Canatar, SueYeon Chung

  26. Teasing Apart Architecture and Initial Weights as Sources of Inductive Bias in Neural Networks Authors: Gianluca Bencomo, Max Gupta, Ioana Marinescu, R. Thomas McCoy, Thomas L. Griffiths

  27. Erasing Without Remembering: Safeguarding Knowledge Forgetting in Large Language Models Authors: Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao

  28. Recommendations from Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization Authors: Suryanarayana Sankagiri, Jalal Etesami, Matthias Grossglauser

  29. Incremental Learning with Repetition via Pseudo-Feature Projection Authors: Benedikt Tscheschner, Eduardo Veas, Marc Masana

  30. Global Framework for Simultaneous Emulation Across the Nuclear Landscape Authors: Antoine Belley, Jose M. Munoz, Ronald F. Garcia Ruiz

  31. SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks Authors: Nikolay Blagoev, Lydia Yiyu Chen, Oğuzhan Ersoy

  32. LangProBe: a Language Programs Benchmark Authors: Shangyin Tan, Lakshya A Agrawal, Arnav Singhvi, Liheng Lai, Michael J Ryan, Dan Klein, Omar Khattab, Koushik Sen, Matei Zaharia

  33. Mixtraining: A Better Trade-Off Between Compute and Performance Authors: Zexin Li, Jiancheng Zhang, Yinglun Zhu, Cong Liu

  34. Self-rewarding correction for mathematical reasoning Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang

  35. Beyond Worst-Case Dimensionality Reduction for Sparse Vectors Authors: Sandeep Silwal, David P. Woodruff, Qiuyi Zhang

  36. Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models Authors: Susmit Agrawal, Deepika Vemuri, Sri Siddarth Chakaravarthy P, Vineeth N. Balasubramanian

  37. Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? Authors: Yudi Zhang, Lu Wang, Meng Fang, Yali Du, Chenghua Huang, Jun Wang, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

  38. Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation Authors: Yiwei Li, Ji Zhang, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

  39. SCU: An Efficient Machine Unlearning Scheme for Deep Learning Enabled Semantic Communications Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu

  40. NeoBERT: A Next-Generation BERT Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar

  41. Do Sparse Autoencoders Generalize? A Case Study of Answerability Authors: Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost


1. Learning with Exact Invariances in Polynomial Time

ArXiv ID: 2502.19758

Authors: Ashkan Soleymani, Behrooz Tahmasebi, Stefanie Jegelka, Patrick Jaillet

Abstract: We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with exact invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.

Comment: The paper provides a polynomial-time algorithm for learning with exact invariances, which is a cutting-edge theoretical contribution relevant to representation learning.

Relevance: 9 Novelty: 9
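
Group averaging, one of the traditional baselines named in the abstract, can be sketched for a small finite group as follows. This is a minimal illustration and not the paper's algorithm: the RBF base kernel, the cyclic-shift group, and the names `rbf`, `cyclic_shifts`, and `group_averaged_kernel` are all assumptions chosen for the example. Each averaged evaluation costs one base-kernel evaluation per group element, which is why this route stops being polynomial-time for large groups such as the full permutation group.

```python
import math
from typing import List, Sequence


def rbf(x: Sequence[float], y: Sequence[float], gamma: float = 0.1) -> float:
    """Plain RBF kernel; not invariant to cyclic shifts of a single argument."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def cyclic_shifts(x: Sequence[float]) -> List[List[float]]:
    """All elements of the orbit of x under the cyclic group (hypothetical choice of group)."""
    n = len(x)
    return [list(x[s:]) + list(x[:s]) for s in range(n)]


def group_averaged_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 0.1) -> float:
    """Average the base kernel over the orbit of y; the result is exactly shift-invariant."""
    orbit = cyclic_shifts(y)
    return sum(rbf(x, gy, gamma) for gy in orbit) / len(orbit)
```

Averaging over the orbit of one argument suffices here because the RBF kernel is unchanged when both arguments are shifted by the same amount.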


2. Algebraic Machine Learning: Learning as computing an algebraic decomposition of a task

ArXiv ID: 2502.19944

Authors: Fernando Martin-Maroto, Nabil Abderrahaman, David Mendez, Gonzalo G. de Polavieja

Abstract: Statistics and Optimization are foundational to modern Machine Learning. Here, we propose an alternative foundation based on Abstract Algebra, with mathematics that facilitates the analysis of learning. In this approach, the goal of the task and the data are encoded as axioms of an algebra, and a model is obtained where only these axioms and their logical consequences hold. Although this is not a generalizing model, we show that selecting specific subsets of its breakdown into algebraic atoms obtained via subdirect decomposition gives a model that generalizes. We validate this new learning principle on standard datasets such as MNIST, FashionMNIST, CIFAR-10, and medical images, achieving performance comparable to optimized multilayer perceptrons. Beyond data-driven tasks, the new learning principle extends to formal problems, such as finding Hamiltonian cycles from their specifications and without relying on search. This algebraic foundation offers a fresh perspective on machine intelligence, featuring direct learning from training data without the need for a validation dataset, scaling through model additivity, and asymptotic convergence to the underlying rule in the data.

Comment: The paper proposes a novel algebraic foundation for machine learning, which is a cutting-edge theoretical contribution and aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 9


3. Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge

ArXiv ID: 2502.20186

Authors: Yan-Lun Chen, Yi-Ru Wei, Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

Abstract: Large language models (LLMs) demonstrate strong task-specific capabilities through fine-tuning, but merging multiple fine-tuned models often leads to degraded performance due to overlapping instruction-following components. Task Arithmetic (TA), which combines task vectors derived from fine-tuning, enables multi-task learning and task forgetting but struggles to isolate task-specific knowledge from general instruction-following behavior. To address this, we propose Layer-Aware Task Arithmetic (LATA), a novel approach that assigns layer-specific weights to task vectors based on their alignment with instruction-following or task-specific components. By amplifying task-relevant layers and attenuating instruction-following layers, LATA improves task learning and forgetting performance while preserving overall model utility. Experiments on multiple benchmarks, including WikiText-2, GSM8K, and HumanEval, demonstrate that LATA outperforms existing methods in both multi-task learning and selective task forgetting, achieving higher task accuracy and alignment with minimal degradation in output quality. Our findings highlight the importance of layer-wise analysis in disentangling task-specific and general-purpose knowledge, offering a robust framework for efficient model merging and editing.

Comment: The paper proposes Layer-Aware Task Arithmetic (LATA) to disentangle task-specific and instruction-following knowledge in LLMs, which aligns with foundational insights into LLM behavior.

Relevance: 9 Novelty: 8
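
The task-arithmetic idea in the abstract above (a task vector is the difference between fine-tuned and base weights, and LATA scales it per layer) can be sketched as follows. This is only an illustration under stated assumptions: the per-layer scales are supplied by hand, whereas the paper derives them from each layer's alignment with instruction-following or task-specific components, which is not reproduced here. The names `task_vector` and `apply_layerwise` and the dictionary-of-lists weight format are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # layer name -> flattened parameters (toy format)


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_layerwise(base: Weights, tau: Weights, layer_scale: Dict[str, float]) -> Weights:
    """Add the task vector to the base model, scaled separately for each layer.

    Layers missing from layer_scale default to 1.0 (plain task arithmetic).
    A negative scale subtracts the task vector, which corresponds to forgetting.
    """
    return {
        name: [b + layer_scale.get(name, 1.0) * t for b, t in zip(base[name], tau[name])]
        for name in base
    }
```

Setting a layer's scale to 0.0 leaves that layer at its base weights, while a scale above 1.0 amplifies the task-specific update in that layer.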


4. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

ArXiv ID: 2502.19811

Authors: Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

Abstract: Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of an MoE layer can occupy 47% of the entire model execution time with popular models and frameworks. Therefore, existing methods suggest pipelining the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes notably impair computational efficiency, and the latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and for end-to-end execution, COMET delivers a $1.71\times$ speedup on average. COMET has been adopted in the production environment of clusters at the ten-thousand-GPU scale, achieving savings of millions of GPU hours.

Comment: The paper introduces COMET, a fine-grained communication-computation overlapping system for MoE, which aligns with architectural efficiency improvements in MoE systems.

Relevance: 9 Novelty: 8


5. LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM

ArXiv ID: 2502.19571

Authors: Yehonathan Refael, Iftach Arbel, Ofir Lindenbaum, Tom Tirer

Abstract: We study robust parameter-efficient fine-tuning (PEFT) techniques designed to improve accuracy and generalization while operating within strict computational and memory hardware constraints, specifically focusing on large language models (LLMs). Existing PEFT methods often lack robustness and fail to generalize effectively across diverse tasks, leading to suboptimal performance in real-world scenarios. To address this, we present a new highly computationally efficient framework called AdaZo-SAM, combining Adam and Sharpness-Aware Minimization (SAM) while requiring only a single-gradient computation in every iteration. This is achieved using a stochastic zeroth-order estimation to find SAM's ascent perturbation. We provide a convergence guarantee for AdaZo-SAM and show that it improves the generalization ability of state-of-the-art PEFT methods. Additionally, we design a low-rank gradient optimization method named LORENZA, which is a memory-efficient version of AdaZo-SAM. LORENZA utilizes a randomized SVD scheme to efficiently compute the subspace projection matrix and apply optimization steps onto the selected subspace. This technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, achieving the same reduced memory consumption as gradient-low-rank-projection methods. We provide a convergence analysis of LORENZA and demonstrate its merits for pre-training and fine-tuning LLMs.

Comment: The paper introduces a low-rank gradient optimization method (LORENZA) and provides theoretical insights into its efficiency for LLMs, aligning with the model compression criterion.

Relevance: 9 Novelty: 8
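
The abstract above describes finding SAM's ascent perturbation with a stochastic zeroth-order estimate so that each iteration needs only one gradient computation. A minimal sketch of that idea on a toy quadratic is given below. It is not the authors' AdaZo-SAM: the Adam update is replaced by plain gradient descent for brevity, the two-point finite-difference estimator is one common zeroth-order choice assumed here, and all names (`zo_ascent_direction`, `zo_sam_step`, `run`) and hyperparameter values are hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def zo_ascent_direction(loss: Callable[[Vector], float], w: Vector,
                        rng: random.Random, mu: float = 1e-3) -> Vector:
    """Unit-norm ascent direction from two loss evaluations along a random direction."""
    u = [rng.gauss(0.0, 1.0) for _ in w]
    plus = loss([wi + mu * ui for wi, ui in zip(w, u)])
    minus = loss([wi - mu * ui for wi, ui in zip(w, u)])
    slope = (plus - minus) / (2.0 * mu)  # finite-difference directional derivative
    g_hat = [slope * ui for ui in u]
    norm = math.sqrt(sum(g * g for g in g_hat))
    if norm == 0.0:
        return [0.0] * len(w)  # flat along u: skip the perturbation
    return [g / norm for g in g_hat]


def zo_sam_step(loss: Callable[[Vector], float], grad: Callable[[Vector], Vector],
                w: Vector, rng: random.Random, lr: float = 0.1, rho: float = 0.05) -> Vector:
    """One SAM-style step: perturb along the estimated ascent direction, then descend."""
    direction = zo_ascent_direction(loss, w, rng)
    perturbed = [wi + rho * di for wi, di in zip(w, direction)]
    g = grad(perturbed)  # the only gradient evaluation in this step
    return [wi - lr * gi for wi, gi in zip(w, g)]


# Toy problem (assumption): an ill-conditioned quadratic with a known gradient.
SCALES = [1.0, 4.0]


def quadratic_loss(w: Vector) -> float:
    return 0.5 * sum(a * wi * wi for a, wi in zip(SCALES, w))


def quadratic_grad(w: Vector) -> Vector:
    return [a * wi for a, wi in zip(SCALES, w)]


def run(steps: int = 200, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    w = [3.0, -2.0]
    for _ in range(steps):
        w = zo_sam_step(quadratic_loss, quadratic_grad, w, rng)
    return w
```

Standard SAM would call `grad` twice per step (once to find the perturbation, once at the perturbed point); here the first call is replaced by two cheaper loss evaluations.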


6. HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

ArXiv ID: 2502.19662

Authors: Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh

Abstract: Quantization is critical for realizing efficient inference of LLMs. Traditional quantization methods are hardware-agnostic, limited to bit-width constraints, and lacking circuit-level insights, such as timing and energy characteristics of Multiply-Accumulate (MAC) units. We introduce HALO, a versatile framework that adapts to various hardware through a Hardware-Aware Post-Training Quantization (PTQ) approach. By leveraging MAC unit properties, HALO minimizes critical-path delays and enables dynamic frequency scaling. Deployed on LLM accelerators like TPUs and GPUs, HALO achieves on average 270% performance gains and 51% energy savings, all with minimal accuracy drop.

Comment: The paper introduces a hardware-aware quantization framework (HALO) for LLM acceleration, aligning with the model compression criterion.

Relevance: 9 Novelty: 8


7. Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

ArXiv ID: 2502.20332

Authors: Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, Taylor Webb

Abstract: Many recent studies have found evidence for emergent reasoning capabilities in large language models, but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we perform a comprehensive study of the internal mechanisms that support abstract rule induction in an open-source language model (Llama3-70B). We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, symbol abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these abstract variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.

Comment: The paper identifies symbolic mechanisms in LLMs for abstract reasoning, aligning with the large language models criterion.

Relevance: 9 Novelty: 8


8. Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking

ArXiv ID: 2502.20129

Authors: Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin

Abstract: Chain-of-Thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. In this work, we (1) evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit, a subset of model components, responsible for tracking the world state, finding that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three realistic settings: skipping intermediate steps, introducing data noise, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSA), highlighting its resilience in challenging scenarios.

Comment: The paper provides a mechanistic study of state tracking in Transformers with Chain-of-Thought, offering insights into model behavior and architecture, aligning with foundational research.

Relevance: 9 Novelty: 8


9. Forward-Cooperation-Backward (FCB) learning in a Multi-Encoding Uni-Decoding neural network architecture

ArXiv ID: 2502.20113

Authors: Prasun Dutta, Koustab Ghosh, Rajat K. De

Abstract: The most popular technique to train a neural network is backpropagation. Recently, the Forward-Forward technique has also been introduced for certain learning tasks. However, in real life, human learning does not follow any of these techniques exclusively. The way a human learns is basically a combination of forward learning, backward propagation and cooperation. Humans start learning a new concept by themselves and try to refine their understanding hierarchically during which they might come across several doubts. The most common approach to doubt solving is a discussion with peers, which can be called cooperation. Cooperation/discussion/knowledge sharing among peers is one of the most important steps of learning that humans follow. However, there might still be a few doubts even after the discussion. Then the difference between the understanding of the concept and the original literature is identified and minimized over several revisions. Inspired by this, the paper introduces Forward-Cooperation-Backward (FCB) learning in a deep neural network framework mimicking the human nature of learning a new concept. A novel deep neural network architecture, called Multi Encoding Uni Decoding neural network model, has been designed which learns using the notion of FCB. A special lateral synaptic connection has also been introduced to realize cooperation. The models have been justified in terms of their performance in dimension reduction on four popular datasets. The ability to preserve the granular properties of data in low-rank embedding has been tested to justify the quality of dimension reduction. For downstream analyses, classification has also been performed. An experimental study on convergence analysis has been performed to establish the efficacy of the FCB learning strategy.

Comment: The paper introduces a novel learning paradigm (Forward-Cooperation-Backward) and a new architecture (Multi-Encoding Uni-Decoding) with lateral synaptic connections, which aligns with the 'Model Architecture' criterion for architectural innovations.

Relevance: 9 Novelty: 8


10. Do Large Language Models Know How Much They Know?

ArXiv ID: 2502.19573

Authors: Gabriele Prato, Jerry Huang, Prasannna Parthasarathi, Shagun Sodhani, Sarath Chandar

Abstract: Large Language Models (LLMs) have emerged as highly capable systems and are increasingly being integrated into various uses. However, the rapid pace of their deployment has outpaced a comprehensive understanding of their internal mechanisms and a delineation of their capabilities and limitations. A desired attribute of an intelligent system is its ability to recognize the scope of its own knowledge. To investigate whether LLMs embody this characteristic, we develop a benchmark designed to challenge these models to enumerate all information they possess on specific topics. This benchmark evaluates whether the models recall excessive, insufficient, or the precise amount of information, thereby indicating their awareness of their own knowledge. Our findings reveal that all tested LLMs, given sufficient scale, demonstrate an understanding of how much they know about specific topics. While different architectures exhibit varying rates of this capability's emergence, the results suggest that awareness of knowledge may be a generalizable attribute of LLMs. Further research is needed to confirm this potential and fully elucidate the underlying mechanisms.

Comment: The paper investigates whether LLMs can assess the scope of their own knowledge, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8


11. Self-Training Elicits Concise Reasoning in Large Language Models

ArXiv ID: 2502.20122

Authors: Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun

Abstract: Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence of their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning

Comment: The paper proposes methods to elicit concise reasoning in LLMs, which aligns with foundational research in LLM behavior and training dynamics.

Relevance: 9 Novelty: 8
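
The best-of-N selection step mentioned in the abstract above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: it assumes that "concise" means the shortest sampled reasoning path whose final answer matches the reference, that length is measured in whitespace-separated tokens, and that samples have already been generated. The names `select_concise` and `build_finetuning_set` are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[str, str]  # (reasoning text, final answer)
Problem = Tuple[str, str, List[Sample]]  # (question, reference answer, N samples)


def token_length(text: str) -> int:
    """Crude length proxy: whitespace-separated tokens (assumption, not a real tokenizer)."""
    return len(text.split())


def select_concise(samples: List[Sample], reference: str) -> Optional[Sample]:
    """Return the shortest sample whose answer matches the reference, or None."""
    correct = [s for s in samples if s[1].strip() == reference.strip()]
    if not correct:
        return None
    return min(correct, key=lambda s: token_length(s[0]))


def build_finetuning_set(problems: List[Problem]) -> List[Tuple[str, str]]:
    """Pair each question with its most concise correct reasoning path."""
    dataset = []
    for question, reference, samples in problems:
        best = select_concise(samples, reference)
        if best is not None:  # skip problems with no correct sample
            dataset.append((question, best[0]))
    return dataset
```

The resulting (question, reasoning) pairs would then serve as supervised fine-tuning data for the same model that produced the samples.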


12. Your contrastive learning problem is secretly a distribution alignment problem

ArXiv ID: 2502.20141

Authors: Zihao Chen, Chi-Heng Lin, Ran Liu, Jingyun Xiao, Eva L Dyer

Abstract: Despite the success of contrastive learning (CL) in vision and language, its theoretical foundations and mechanisms for building representations remain poorly understood. In this work, we build connections between noise contrastive estimation losses widely used in CL and distribution alignment with entropic optimal transport (OT). This connection allows us to develop a family of different losses and multistep iterative variants for existing CL methods. Intuitively, by using more information from the distribution of latents, our approach allows a more distribution-aware manipulation of the relationships within augmented sample sets. We provide theoretical insights and experimental evidence demonstrating the benefits of our approach for generalized contrastive alignment. Through this framework, it is possible to leverage tools in OT to build unbalanced losses to handle noisy views and customize the representation space by changing the constraints on alignment. By reframing contrastive learning as an alignment problem and leveraging existing optimization tools for OT, our work provides new insights and connections between different self-supervised learning models in addition to new tools that can be more easily adapted to incorporate domain knowledge into learning.

Comment: The paper reframes contrastive learning as a distribution alignment problem using optimal transport, providing theoretical insights into representation learning. This aligns closely with foundational research in representation learning.

Relevance: 9 Novelty: 8
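
One way to read the connection described in the abstract above is that the softmax inside an InfoNCE-style loss is a single row normalization of the Gibbs kernel exp(sim / tau), while entropic OT keeps alternating row and column normalizations (Sinkhorn iterations). The sketch below illustrates that reading on a small similarity matrix with positives on the diagonal. It is an assumption-laden illustration, not the paper's method: the function names, the unit-marginal convention, and the diagonal-mass loss are all choices made for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def gibbs_kernel(sim: Matrix, tau: float) -> Matrix:
    """Element-wise exp(sim / tau), the kernel used in entropic OT."""
    return [[math.exp(s / tau) for s in row] for row in sim]


def row_normalize(K: Matrix) -> Matrix:
    return [[k / sum(row) for k in row] for row in K]


def col_normalize(K: Matrix) -> Matrix:
    n_cols = len(K[0])
    col_sums = [sum(row[j] for row in K) for j in range(n_cols)]
    return [[row[j] / col_sums[j] for j in range(n_cols)] for row in K]


def sinkhorn(K: Matrix, n_iters: int) -> Matrix:
    """Alternate row and column normalization (the multistep variant)."""
    P = K
    for _ in range(n_iters):
        P = col_normalize(row_normalize(P))
    return P


def alignment_loss(P: Matrix) -> float:
    """Mean negative log of the mass a plan places on the positive (diagonal) pairs."""
    return -sum(math.log(P[i][i]) for i in range(len(P))) / len(P)


def info_nce(sim: Matrix, tau: float) -> float:
    """Standard InfoNCE with positives on the diagonal, written via log-sum-exp."""
    total = 0.0
    for i, row in enumerate(sim):
        log_denominator = math.log(sum(math.exp(s / tau) for s in row))
        total += -(row[i] / tau - log_denominator)
    return total / len(sim)
```

With these definitions, `alignment_loss(row_normalize(gibbs_kernel(sim, tau)))` equals `info_nce(sim, tau)`, and running `sinkhorn` for more iterations yields a plan whose rows and columns both sum to one.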


13. R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

ArXiv ID: 2502.20395

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

Abstract: In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), which limits LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights at test time by moving it toward the routing-weight vectors of correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.

Comment: The paper proposes a test-time re-routing method for multimodal mixture-of-experts (MoE), which aligns well with the model architecture criterion, particularly for MoE innovations.

Relevance: 9 Novelty: 8
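
The core move in the abstract above, shifting a test sample's routing weights toward those of nearby correctly predicted samples, can be sketched as follows. This is a deliberately simple interpolation and is not claimed to match any of the three R2-T2 strategies in the paper: the Euclidean neighbor search, the unweighted neighbor average, the single interpolation step, and the names `reroute`, `k`, and `step` are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reroute(weights: Sequence[float],
            test_emb: Sequence[float],
            ref_embs: Sequence[Sequence[float]],
            ref_weights: Sequence[Sequence[float]],
            ref_correct: Sequence[bool],
            k: int = 2,
            step: float = 0.5) -> List[float]:
    """Move routing weights toward the mean weights of the k nearest correct references."""
    # Keep only reference samples that the model predicted correctly.
    candidates = [
        (euclidean(test_emb, emb), w)
        for emb, w, ok in zip(ref_embs, ref_weights, ref_correct)
        if ok
    ]
    if not candidates:
        return list(weights)  # nothing to move toward
    candidates.sort(key=lambda pair: pair[0])
    neighbors = [w for _, w in candidates[:k]]
    target = [sum(col) / len(neighbors) for col in zip(*neighbors)]
    moved = [(1.0 - step) * w + step * t for w, t in zip(weights, target)]
    total = sum(moved)
    return [m / total for m in moved]  # renormalize so the weights sum to one
```

Only the routing-weight vector changes in this sketch; the experts and the rest of the model stay frozen, in line with the abstract's statement that no base-model parameters are trained.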


14. Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

ArXiv ID: 2502.19649

Authors: Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz

Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

Comment: This paper surveys Representation Engineering (RepE), a paradigm for controlling LLM behavior by manipulating internal representations, and proposes a unified framework that describes RepE as a pipeline of representation identification, operationalization, and control. It aligns closely with the 'Representation Learning' and 'Large Language Models' criteria.

Relevance: 9 Novelty: 8


15. Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula

ArXiv ID: 2502.20003

Authors: Matteo Vilucchio, Yatin Dandi, Cedric Gerbelot, Florent Krzakala

Abstract: The analytic characterization of the high-dimensional behavior of optimization for Generalized Linear Models (GLMs) with Gaussian data has been a central focus in statistics and probability in recent years. While convex cases, such as the LASSO, ridge regression, and logistic regression, have been extensively studied using a variety of techniques, the non-convex case remains far less understood despite its significance. A non-rigorous statistical physics framework has provided remarkable predictions for the behavior of high-dimensional optimization problems, but rigorously establishing their validity for non-convex problems has remained a fundamental challenge. In this work, we address this challenge by developing a systematic framework that rigorously proves replica-symmetric formulas for non-convex GLMs and precisely determines the conditions under which these formulas are valid. Remarkably, the rigorous replica-symmetric predictions align exactly with the conjectures made by physicists, and the so-called replicon condition. The originality of our approach lies in connecting two powerful theoretical tools: the Gaussian Min-Max Theorem, which we use to provide precise lower bounds, and Approximate Message Passing (AMP), which is shown to achieve these bounds algorithmically. We demonstrate the utility of this framework through significant applications: (i) proving the optimality of the Tukey loss over the more commonly used Huber loss under an $\varepsilon$-contaminated data model, (ii) establishing the optimality of negative regularization in high-dimensional non-convex regression, and (iii) characterizing the performance limits of linearized AMP algorithms. By rigorously validating statistical physics predictions in non-convex settings, we aim to open new pathways for analyzing increasingly complex optimization landscapes beyond the convex regime.

Comment: The paper rigorously validates statistical physics predictions for non-convex GLMs, aligning with the emerging trends criterion.

Relevance: 8 Novelty: 9


16. Identifiable Multi-View Causal Discovery Without Non-Gaussianity

ArXiv ID: 2502.20115

Authors: Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort, Aapo Hyvärinen

Abstract: We propose a novel approach to linear causal discovery in the framework of multi-view Structural Equation Models (SEM). Our proposed model relaxes the well-known assumption of non-Gaussian disturbances by alternatively assuming diversity of variances over views, making it more broadly applicable. We prove the identifiability of all the parameters of the model without any further assumptions on the structure of the SEM other than it being acyclic. We further propose an estimation algorithm based on recent advances in multi-view Independent Component Analysis (ICA). The proposed methodology is validated through simulations and application on real neuroimaging data, where it enables the estimation of causal graphs between brain regions.

Comment: The paper proposes a novel approach to causal discovery in multi-view SEMs, which aligns with representation learning and introduces theoretical advancements in causal modeling.

Relevance: 8 Novelty: 8


17. Topological Autoencoders++: Fast and Accurate Cycle-Aware Dimensionality Reduction

ArXiv ID: 2502.20215

Authors: Mattéo Clémot, Julie Digne, Julien Tierny

Abstract: This paper presents a novel topology-aware dimensionality reduction approach aiming at accurately visualizing the cyclic patterns present in high dimensional data. To that end, we build on the Topological Autoencoders (TopoAE) formulation. First, we provide a novel theoretical analysis of its associated loss and show that a zero loss indeed induces identical persistence pairs (in high and low dimensions) for the $0$-dimensional persistent homology (PH$^0$) of the Rips filtration. We also provide a counter example showing that this property no longer holds for a naive extension of TopoAE to PH$^d$ for $d\ge 1$. Based on this observation, we introduce a novel generalization of TopoAE to $1$-dimensional persistent homology (PH$^1$), called TopoAE++, for the accurate generation of cycle-aware planar embeddings, addressing the above failure case. This generalization is based on the notion of cascade distortion, a new penalty term favoring an isometric embedding of the $2$-chains filling persistent $1$-cycles, hence resulting in more faithful geometrical reconstructions of the $1$-cycles in the plane. We further introduce a novel, fast algorithm for the exact computation of PH for Rips filtrations in the plane, yielding improved runtimes over previously documented topology-aware methods. Our method also achieves a better balance between the topological accuracy, as measured by the Wasserstein distance, and the visual preservation of the cycles in low dimensions. Our C++ implementation is available at https://github.com/MClemot/TopologicalAutoencodersPlusPlus.

Comment: The paper proposes a novel topology-aware dimensionality reduction method with theoretical analysis, aligning with foundational research in representation learning.

Relevance: 8 Novelty: 8
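
Illustrative sketch (not from the paper, and unrelated to the authors' C++ implementation): the abstract's PH$^0$ of the Rips filtration can be read off a minimum spanning tree, since every connected component is born at scale 0 and dies at the length of the MST edge that merges it into another component. The function name `ph0_deaths` and the toy points are assumptions for illustration, as is the convention that an edge enters the filtration at its full length; the PH$^1$ cascade distortion term is not reproduced here.

```python
import numpy as np


def ph0_deaths(points):
    """Death times of 0-dimensional persistence classes of a Rips filtration.

    Computed with Kruskal's algorithm: each MST edge merges two components,
    and its length is the death time of one of them.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    edges = sorted((float(dist[i, j]), i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so one class dies here
            parent[ri] = rj
            deaths.append(length)
    return deaths


# Three points on a line at 0, 1 and 3: components merge at scales 1 and 2.
toy_points = np.array([[0.0], [1.0], [3.0]])
toy_deaths = ph0_deaths(toy_points)
```

Running the same function on the high-dimensional input and on its low-dimensional embedding, then comparing the two lists, gives a rough sense of what "identical persistence pairs" means for PH$^0$.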


18. Scalable Signature Kernel Computations for Long Time Series via Local Neumann Series Expansions

ArXiv ID: 2502.20392

Authors: Matthew Tamayo-Rios, Alexander Schell, Rima Alaifari

Abstract: The signature kernel is a recent state-of-the-art tool for analyzing high-dimensional sequential data, valued for its theoretical guarantees and strong empirical performance. In this paper, we present a novel method for efficiently computing the signature kernel of long, high-dimensional time series via dynamically truncated recursive local power series expansions. Building on the characterization of the signature kernel as the solution of a Goursat PDE, our approach employs tilewise Neumann-series expansions to derive rapidly converging power series approximations of the signature kernel that are locally defined on subdomains and propagated iteratively across the entire domain of the Goursat solution by exploiting the geometry of the time series. Algorithmically, this involves solving a system of interdependent local Goursat PDEs by recursively propagating boundary conditions along a directed graph via topological ordering, with dynamic truncation adaptively terminating each local power series expansion when coefficients fall below machine precision, striking an effective balance between computational cost and accuracy. This method achieves substantial performance improvements over state-of-the-art approaches for computing the signature kernel, providing (a) adjustable and superior accuracy, even for time series with very high roughness; (b) drastically reduced memory requirements; and (c) scalability to efficiently handle very long time series (e.g., with up to half a million points or more) on a single GPU. These advantages make our method particularly well-suited for rough-path-assisted machine learning, financial modeling, and signal processing applications that involve very long and highly volatile data.

Comment: The paper introduces a novel method for scalable signature kernel computations, which aligns with foundational research in efficiency and algorithmic breakthroughs.

Relevance: 8 Novelty: 8


19. Extremely Greedy Equivalence Search

ArXiv ID: 2502.19551

Authors: Achille Nazaret, David Blei

Abstract: The goal of causal discovery is to learn a directed acyclic graph from data. One of the most well-known methods for this problem is Greedy Equivalence Search (GES). GES searches for the graph by incrementally and greedily adding or removing edges to maximize a model selection criterion. It has strong theoretical guarantees on infinite data but can fail in practice on finite data. In this paper, we first identify some of the causes of GES's failure, finding that it can get blocked in local optima, especially in denser graphs. We then propose eXtremely Greedy Equivalent Search (XGES), which involves a new heuristic to improve the search strategy of GES while retaining its theoretical guarantees. In particular, XGES favors deleting edges early in the search over inserting edges, which reduces the possibility of the search ending in local optima. A further contribution of this work is an efficient algorithmic formulation of XGES (and GES). We benchmark XGES on simulated datasets with known ground truth. We find that XGES consistently outperforms GES in recovering the correct graphs, and it is 10 times faster. XGES implementations in Python and C++ are available at https://github.com/ANazaret/XGES.

Comment: The paper proposes an improvement to the Greedy Equivalence Search algorithm, which aligns with foundational research in model efficiency and algorithmic innovation.

Relevance: 8 Novelty: 8


20. Tell me why: Visual foundation models as self-explainable classifiers

ArXiv ID: 2502.19577

Authors: Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis

Abstract: Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEM) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.

Comment: The paper introduces a novel prototypical architecture for interpretability in visual foundation models, which aligns with representation learning and architectural insights.

Relevance: 8 Novelty: 7


21. Obtaining Example-Based Explanations from Deep Neural Networks

ArXiv ID: 2502.19768

Authors: Genghua Dong, Henrik Boström, Michalis Vazirgiannis, Roman Bresson

Abstract: Most techniques for explainable machine learning focus on feature attribution, i.e., values are assigned to the features such that their sum equals the prediction. Example attribution is another form of explanation that assigns weights to the training examples, such that their scalar product with the labels equals the prediction. The latter may provide valuable complementary information to feature attribution, in particular in cases where the features are not easily interpretable. Current example-based explanation techniques have targeted a few model types only, such as k-nearest neighbors and random forests. In this work, a technique for obtaining example-based explanations from deep neural networks (EBE-DNN) is proposed. The basic idea is to use the deep neural network to obtain an embedding, which is employed by a k-nearest neighbor classifier to form a prediction; the example attribution can hence straightforwardly be derived from the latter. Results from an empirical investigation show that EBE-DNN can provide highly concentrated example attributions, i.e., the predictions can be explained with few training examples, without reducing accuracy compared to the original deep neural network. Another important finding from the empirical investigation is that the choice of layer to use for the embeddings may have a large impact on the resulting accuracy.

Comment: The paper proposes example-based explanations for deep neural networks, which aligns with representation learning and interpretability.

Relevance: 8 Novelty: 7
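
Illustrative sketch (not the authors' code): the abstract's idea of example attribution can be shown with a plain k-nearest-neighbor vote over embeddings, where each training example receives a weight and the dot product of the weights with the labels equals the prediction. The function name `knn_example_attribution`, the uniform 1/k weighting, and the toy 2-D "embeddings" are assumptions for illustration; in EBE-DNN the embeddings would come from a chosen layer of the trained network.

```python
import numpy as np


def knn_example_attribution(train_emb, train_labels, query_emb, k=3):
    """Return (prediction, weights) for a uniform k-NN vote.

    weights has one entry per training example, and
    weights @ train_labels equals the prediction.
    """
    dists = np.linalg.norm(train_emb - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.zeros(len(train_emb))
    weights[nearest] = 1.0 / k  # each of the k neighbors gets an equal share
    prediction = float(weights @ train_labels)
    return prediction, weights


# Toy embeddings standing in for a hidden-layer representation (assumed data).
train_emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
train_labels = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
pred, attribution = knn_example_attribution(
    train_emb, train_labels, np.array([0.05, 0.05]), k=3
)
```

Here only three of the five training examples carry non-zero weight, which is the "concentrated attribution" property the abstract highlights.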


22. Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation

ArXiv ID: 2502.19657

Authors: Pavel Rumiantsev, Mark Coates

Abstract: Neural Architecture Search (NAS) is a powerful automatic alternative to manual design of a neural network. In the zero-shot version, a fast ranking function is used to compare architectures without training them. The outputs of the ranking functions often vary significantly due to different sources of randomness, including the evaluated architecture's weights' initialization or the batch of data used for calculations. A common approach to addressing the variation is to average a ranking function output over several evaluations. We propose taking into account the variation in a different manner, by viewing the ranking function output as a random variable representing a proxy performance metric. During the search process, we strive to construct a stochastic ordering of the performance metrics to determine the best architecture. Our experiments show that the proposed stochastic ordering can effectively boost performance of a search on standard benchmark search spaces.

Comment: The paper addresses variation in zero-shot NAS ranking functions, which is relevant to architectural optimization and efficiency.

Relevance: 8 Novelty: 7
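
Illustrative sketch (not from the paper): the abstract contrasts averaging a noisy ranking function with treating its output as a random variable. The abstract does not state which stochastic ordering or test the authors use, so the empirical probability of superiority below is an assumed stand-in, and the score lists are made-up numbers chosen so that the two decision rules disagree.

```python
import numpy as np


def prob_superior(scores_a, scores_b):
    """Empirical P(score_A > score_B) over all pairs, with ties counted as half."""
    a = np.asarray(scores_a, dtype=float)[:, None]
    b = np.asarray(scores_b, dtype=float)[None, :]
    return float(np.mean((a > b) + 0.5 * (a == b)))


# Repeated evaluations of a zero-shot proxy for two architectures (assumed data).
arch_a = [0.9, 1.0, 1.1, 1.0, 5.0]  # one outlier inflates the mean
arch_b = [1.2, 1.3, 1.25, 1.3, 1.2]

mean_pick = "A" if np.mean(arch_a) > np.mean(arch_b) else "B"
p_a_beats_b = prob_superior(arch_a, arch_b)
stochastic_pick = "A" if p_a_beats_b > 0.5 else "B"
```

Averaging prefers architecture A because of a single outlier, while the pairwise comparison prefers B, which wins 20 of the 25 pairings.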


23. Accurate and Scalable Graph Neural Networks via Message Invariance

ArXiv ID: 2502.19693

Authors: Zhihao Shi, Jie Wang, Zhiwei Zhuang, Xize Liang, Bin Li, Feng Wu

Abstract: Message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. For a sampled mini-batch of target nodes, the message passing process is divided into two parts: message passing between nodes within the batch (MP-IB) and message passing from nodes outside the batch to those within it (MP-OB). However, MP-OB recursively relies on higher-order out-of-batch neighbors, leading to an exponentially growing computational cost with respect to the number of layers. Due to the neighbor explosion, the whole message passing stores most nodes and edges on the GPU such that many GNNs are infeasible for large-scale graphs. To address this challenge, we propose an accurate and fast mini-batch approach for large graph transductive learning, namely topological compensation (TOP), which obtains the outputs of the whole message passing solely through MP-IB, without the costly MP-OB. The major pillar of TOP is a novel concept of message invariance, which defines message-invariant transformations to convert costly MP-OB into fast MP-IB. This ensures that the modified MP-IB has the same output as the whole message passing. Experiments demonstrate that TOP is significantly faster than existing mini-batch methods by an order of magnitude on vast graphs (millions of nodes and billions of edges) with limited accuracy degradation.

Comment: The paper introduces a novel concept of message invariance to address computational challenges in GNNs, which aligns with representation learning and training dynamics in neural networks.

Relevance: 8 Novelty: 7


24. Sanity Checking Causal Representation Learning on a Simple Real-World System

ArXiv ID: 2502.20099

Authors: Juan L. Gamella, Simon Bing, Jakob Runge

Abstract: We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors (the inputs to the experiment) are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We make all code and datasets publicly available at github.com/simonbing/CRLSanityCheck

Comment: The paper evaluates causal representation learning methods on a real-world system, highlighting reproducibility challenges, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


25. Spectral Analysis of Representational Similarity with Limited Neurons

ArXiv ID: 2502.19648

Authors: Hyunmo Kang, Abdulkadir Canatar, SueYeon Chung

Abstract: Measuring representational similarity between neural recordings and computational models is challenging due to constraints on the number of neurons that can be recorded simultaneously. In this work, we investigate how such limitations affect similarity measures, focusing on Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA). Leveraging tools from Random Matrix Theory, we develop a predictive spectral framework for these measures and demonstrate that finite neuron sampling systematically underestimates similarity due to eigenvector delocalization. To overcome this, we introduce a denoising method to infer population-level similarity, enabling accurate analysis even with small neuron samples. Our theory is validated on synthetic and real datasets, offering practical strategies for interpreting neural data under finite sampling constraints.

Comment: The paper provides a theoretical framework for representational similarity measures using Random Matrix Theory, which aligns with the representation learning criterion.

Relevance: 8 Novelty: 7
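
Illustrative sketch (not from the paper): linear Centered Kernel Alignment in its standard feature-space form, applied to a synthetic case where two representations are identical up to a rotation. Keeping only a subset of the "neurons" lowers the measured similarity, which is the finite-sampling effect the abstract describes. The sizes, the random data, and the name `linear_cka` are assumptions; the paper's spectral theory and denoising method are not reproduced here.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two (stimuli x units) response matrices."""
    X = X - X.mean(axis=0)  # center each unit across stimuli
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                   # 200 stimuli, 50 model units
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))   # random orthogonal matrix
Y = X @ Q                                        # same representation, rotated

cka_full = linear_cka(X, Y)          # all 50 "neurons" recorded
cka_sub = linear_cka(X, Y[:, :10])   # only 10 "neurons" recorded
```

Because linear CKA is invariant to orthogonal transformations, `cka_full` is 1 up to floating-point error, while `cka_sub` falls well below it even though the underlying population representations are identical.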


26. Teasing Apart Architecture and Initial Weights as Sources of Inductive Bias in Neural Networks

ArXiv ID: 2502.20237

Authors: Gianluca Bencomo, Max Gupta, Ioana Marinescu, R. Thomas McCoy, Thomas L. Griffiths

Abstract: Artificial neural networks can acquire many aspects of human knowledge from data, making them promising as models of human learning. But what those networks can learn depends upon their inductive biases -- the factors other than the data that influence the solutions they discover -- and the inductive biases of neural networks remain poorly understood, limiting our ability to draw conclusions about human learning from the performance of these systems. Cognitive scientists and machine learning researchers often focus on the architecture of a neural network as a source of inductive bias. In this paper we explore the impact of another source of inductive bias -- the initial weights of the network -- using meta-learning as a tool for finding initial weights that are adapted for specific problems. We evaluate four widely-used architectures -- MLPs, CNNs, LSTMs, and Transformers -- by meta-training 430 different models across three tasks requiring different biases and forms of generalization. We find that meta-learning can substantially reduce or entirely eliminate performance differences across architectures and data representations, suggesting that these factors may be less important as sources of inductive bias than is typically assumed. When differences are present, architectures and data representations that perform well without meta-learning tend to meta-train more effectively. Moreover, all architectures generalize poorly on problems that are far from their meta-training experience, underscoring the need for stronger inductive biases for robust generalization.

Comment: The paper explores the role of architecture and initial weights as sources of inductive bias, which aligns with the model architecture criterion.

Relevance: 8 Novelty: 7


27. Erasing Without Remembering: Safeguarding Knowledge Forgetting in Large Language Models

ArXiv ID: 2502.19982

Authors: Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao

Abstract: In this paper, we explore machine unlearning from a novel dimension, by studying how to safeguard model unlearning in large language models (LLMs). Our goal is to prevent unlearned models from recalling any related memory of the targeted knowledge. We begin by uncovering a surprisingly simple yet overlooked fact: existing methods typically erase only the exact expressions of the targeted knowledge, leaving paraphrased or related information intact. To rigorously measure such oversights, we introduce UGBench, the first benchmark tailored for evaluating the generalisation performance across 13 state-of-the-art methods. UGBench reveals that unlearned models can still recall paraphrased answers and retain target facts in intermediate layers. To address this, we propose PERMU, a perturbation-based method that significantly enhances the generalisation capabilities for safeguarding LLM unlearning. Experiments demonstrate that PERMU delivers up to a 50.13% improvement in unlearning while maintaining a 43.53% boost in robust generalisation. Our code can be found at https://github.com/MaybeLizzy/UGBench.

Comment: The paper explores machine unlearning in LLMs, introducing a benchmark (UGBench) and a perturbation-based method (PERMU) to enhance unlearning generalization. This aligns with the 'Large Language Models' criterion for theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


28. Recommendations from Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization

ArXiv ID: 2502.20033

Authors: Suryanarayana Sankagiri, Jalal Etesami, Matthias Grossglauser

Abstract: This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix factorization problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given an appropriate warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.

Comment: The paper provides theoretical analysis for nonconvex matrix factorization in sparse data regimes, which is relevant to foundational research in representation learning and efficiency.

Relevance: 8 Novelty: 7


29. Incremental Learning with Repetition via Pseudo-Feature Projection

ArXiv ID: 2502.19922

Authors: Benedikt Tscheschner, Eduardo Veas, Marc Masana

Abstract: Incremental Learning scenarios do not always represent real-world inference use-cases, which tend to have less strict task boundaries, and exhibit repetition of common classes and concepts in their continual data stream. To better represent these use-cases, new scenarios with partial repetition and mixing of tasks are proposed, where the repetition patterns are innate to the scenario and unknown to the strategy. We investigate how exemplar-free incremental learning strategies are affected by data repetition, and we adapt a series of state-of-the-art approaches to analyse and fairly compare them under both settings. Further, we also propose a novel method (Horde), able to dynamically adjust an ensemble of self-reliant feature extractors, and align them by exploiting class repetition. Our proposed exemplar-free method achieves competitive results in the classic scenario without repetition, and state-of-the-art performance in the one with repetition.

Comment: The paper introduces a novel exemplar-free incremental learning method (Horde) with dynamic feature extractor alignment, which aligns with 'Representation Learning' for insights into training dynamics and feature learning.

Relevance: 8 Novelty: 7


30. Global Framework for Simultaneous Emulation Across the Nuclear Landscape

ArXiv ID: 2502.20363

Authors: Antoine Belley, Jose M. Munoz, Ronald F. Garcia Ruiz

Abstract: We introduce a hierarchical framework that combines ab initio many-body calculations with a Bayesian neural network, developing emulators capable of accurately predicting nuclear properties across the nuclear chart, including multiple isotopes simultaneously. We benchmark our developments using the oxygen isotopic chain, achieving accurate results for ground-state energies and nuclear charge radii, while providing robust uncertainty quantification. Our framework enables global sensitivity analysis of nuclear binding energies and charge radii with respect to the low-energy constants that describe the nuclear force.

Comment: The paper introduces a hierarchical framework combining Bayesian neural networks with ab initio calculations for nuclear emulation, which aligns with 'AI for Science' for foundational research in generative paradigms.

Relevance: 8 Novelty: 7


31. SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks

ArXiv ID: 2502.19913

Authors: Nikolay Blagoev, Lydia Yiyu Chen, Oğuzhan Ersoy

Abstract: Data and pipeline parallelism are ubiquitous for training of Large Language Models (LLM) on distributed nodes. Driven by the need for cost-effective training, recent work explores efficient communication arrangement for end to end training. Motivated by LLM's resistance to layer skipping and layer reordering, in this paper, we explore stage (several consecutive layers) skipping in pipeline training, and challenge the conventional practice of sequential pipeline execution. We derive convergence and throughput constraints (guidelines) for pipelining with skipping and swapping pipeline stages. Based on these constraints, we propose SkipPipe, the first partial pipeline framework to reduce the end-to-end training time for LLMs while preserving the convergence. The core of SkipPipe is a path scheduling algorithm that optimizes the paths for individual microbatches and reduces idle time (due to microbatch collisions) on the distributed nodes, complying with the given stage skipping ratio. We extensively evaluate SkipPipe on LLaMa models from 500M to 8B parameters on up to 20 nodes. Our results show that SkipPipe reduces training iteration time by up to $55\%$ compared to full pipeline. Our partial pipeline training also improves resistance to layer omission during inference, experiencing a drop in perplexity of only $7\%$ when running only half the model. Our code is available at https://github.com/gensyn-ai/skippipe.

Comment: The paper proposes a novel pipeline training framework for LLMs, which aligns with foundational research in model efficiency and training dynamics.

Relevance: 8 Novelty: 7


32. LangProBe: a Language Programs Benchmark

ArXiv ID: 2502.20315

Authors: Shangyin Tan, Lakshya A Agrawal, Arnav Singhvi, Liheng Lai, Michael J Ryan, Dan Klein, Omar Khattab, Koushik Sen, Matei Zaharia

Abstract: Composing language models (LMs) into multi-step language programs and automatically optimizing their modular prompts is now a mainstream paradigm for building AI systems, but the tradeoffs in this space have only scarcely been studied before. We introduce LangProBe, the first large-scale benchmark for evaluating the architectures and optimization strategies for language programs, with over 2000 combinations of tasks, architectures, optimizers, and choices of LMs. Using LangProBe, we are the first to study the impact of program architectures and optimizers (and their compositions together and with different models) on tradeoffs of quality and cost. We find that optimized language programs offer strong cost--quality Pareto improvement over raw calls to models, but simultaneously demonstrate that human judgment (or empirical decisions) about which compositions to pursue is still necessary for best performance. We will open source the code and evaluation data for LangProBe.

Comment: The paper introduces a benchmark for evaluating language program architectures and optimization strategies, which is relevant to foundational research in LLMs and model architecture.

Relevance: 8 Novelty: 7


33. Mixtraining: A Better Trade-Off Between Compute and Performance

ArXiv ID: 2502.19513

Authors: Zexin Li, Jiancheng Zhang, Yinglun Zhu, Cong Liu

Abstract: Incorporating self-supervised learning (SSL) before standard supervised learning (SL) has become a widely used strategy to enhance model performance, particularly in data-limited scenarios. However, this approach introduces a trade-off between computation and performance: while SSL helps with representation learning, it requires a separate, often time-consuming training phase, increasing computational overhead and limiting efficiency in resource-constrained settings. To address these challenges, we propose MixTraining, a novel framework that interleaves several SSL and SL epochs within a unified mixtraining training phase, featuring a smooth transition between two learning objectives. MixTraining enhances synergy between SSL and SL for improved accuracy and consolidates shared computation steps to reduce computation overhead. MixTraining is versatile and applicable to both single-task and multi-task learning scenarios. Extensive experiments demonstrate that MixTraining offers a superior compute-performance trade-off compared to conventional pipelines, achieving an 8.81% absolute accuracy gain (18.89% relative accuracy gain) on the TinyImageNet dataset while accelerating training by up to 1.29x with the ViT-Tiny model.

Comment: The paper proposes a novel training framework combining SSL and SL, which aligns with foundational research in training dynamics and efficiency.

Relevance: 8 Novelty: 7


34. Self-rewarding correction for mathematical reasoning

ArXiv ID: 2502.19613

Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang

Abstract: We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-staged algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.

Comment: The paper proposes a self-rewarding correction mechanism for mathematical reasoning in LLMs, which aligns with foundational insights into LLM behavior and self-correction mechanisms.

Relevance: 8 Novelty: 7


35. Beyond Worst-Case Dimensionality Reduction for Sparse Vectors

ArXiv ID: 2502.19865

Authors: Sandeep Silwal, David P. Woodruff, Qiuyi Zhang

Abstract: We study beyond worst-case dimensionality reduction for $s$-sparse vectors. Our work is divided into two parts, each focusing on a different facet of beyond worst-case analysis: We first consider average-case guarantees. A folklore upper bound based on the birthday-paradox states: For any collection $X$ of $s$-sparse vectors in $\mathbb{R}^d$, there exists a linear map to $\mathbb{R}^{O(s^2)}$ which \emph{exactly} preserves the norm of $99\%$ of the vectors in $X$ in any $\ell_p$ norm (as opposed to the usual setting where guarantees hold for all vectors). We give lower bounds showing that this is indeed optimal in many settings: any oblivious linear map satisfying similar average-case guarantees must map to $\Omega(s^2)$ dimensions. The same lower bound also holds for a wide class of smooth maps, including `encoder-decoder schemes', where we compare the norm of the original vector to that of a smooth function of the embedding. These lower bounds reveal a separation result, as an upper bound of $O(s \log(d))$ is possible if we instead use arbitrary (possibly non-smooth) functions, e.g., via compressed sensing algorithms. Given these lower bounds, we specialize to sparse \emph{non-negative} vectors. For a dataset $X$ of non-negative $s$-sparse vectors and any $p \ge 1$, we can non-linearly embed $X$ to $O(s\log(|X|s)/\epsilon^2)$ dimensions while preserving all pairwise distances in $\ell_p$ norm up to $1\pm \epsilon$, with no dependence on $p$. Surprisingly, the non-negativity assumption enables much smaller embeddings than arbitrary sparse vectors, where the best known bounds suffer exponential dependence. Our map also guarantees \emph{exact} dimensionality reduction for $\ell_{\infty}$ by embedding into $O(s\log |X|)$ dimensions, which is tight. We show that both the non-linearity of $f$ and the non-negativity of $X$ are necessary, and provide downstream algorithmic improvements.

Comment: The paper provides theoretical insights into dimensionality reduction for sparse vectors, which is relevant to representation learning and sparsity but focuses on a specific mathematical framework.

Relevance: 7 Novelty: 8
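
The folklore birthday-paradox bound mentioned in the abstract above can be illustrated with a small sketch. It assumes the linear map is a random bucketing of coordinates: each of the d coordinates is sent to one of m buckets, and entries that land in the same bucket are summed. This construction and the names `make_bucket_map`, `embed`, `collision_free`, and `lp_norm` are illustrative assumptions, not taken from the paper. Under this assumption, a fixed s-sparse vector has a collision inside its support with probability at most s(s-1)/(2m), so choosing m on the order of s^2 keeps most vectors collision-free, and a collision-free vector keeps its norm exactly in every l_p norm.

```python
import random


def make_bucket_map(d, m, seed=0):
    """Assign each of the d coordinates to one of m buckets (data-oblivious)."""
    rng = random.Random(seed)
    return [rng.randrange(m) for _ in range(d)]


def embed(sparse_vec, bucket_map, m):
    """Linear map: sum the entries of a sparse vector (index -> value) per bucket."""
    out = [0] * m
    for index, value in sparse_vec.items():
        out[bucket_map[index]] += value
    return out


def lp_norm(values, p):
    """l_p norm of an iterable of numbers, for finite p >= 1."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def collision_free(sparse_vec, bucket_map):
    """True if no two support coordinates share a bucket."""
    buckets = [bucket_map[index] for index in sparse_vec]
    return len(set(buckets)) == len(buckets)


if __name__ == "__main__":
    d, s = 10_000, 8
    m = 50 * s * s  # O(s^2) buckets; the constant 50 is an arbitrary choice
    bucket_map = make_bucket_map(d, m, seed=1)

    rng = random.Random(2)
    preserved = 0
    trials = 1_000
    for _ in range(trials):
        support = rng.sample(range(d), s)
        vec = {i: rng.randint(1, 9) for i in support}
        if collision_free(vec, bucket_map):
            preserved += 1
    print(f"collision-free fraction: {preserved / trials:.3f}")
```

Integer entries are used in the demo so that the norm comparison is not blurred by floating-point summation order.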


36. Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models

ArXiv ID: 2502.20393

Authors: Susmit Agrawal, Deepika Vemuri, Sri Siddarth Chakaravarthy P, Vineeth N. Balasubramanian

Abstract: Concept-based methods have emerged as a promising direction to develop interpretable neural networks in standard supervised settings. However, most works that study them in incremental settings assume either a static concept set across all experiences or assume that each experience relies on a distinct set of concepts. In this work, we study concept-based models in a more realistic, dynamic setting where new classes may rely on older concepts in addition to introducing new concepts themselves. We show that concepts and classes form a complex web of relationships, which is susceptible to degradation and needs to be preserved and augmented across experiences. We introduce new metrics to show that existing concept-based models cannot preserve these relationships even when trained using methods to prevent catastrophic forgetting, since they cannot handle forgetting at concept, class, and concept-class relationship levels simultaneously. To address these issues, we propose a novel method - MuCIL - that uses multimodal concepts to perform classification without increasing the number of trainable parameters across experiences. The multimodal concepts are aligned to concepts provided in natural language, making them interpretable by design. Through extensive experimentation, we show that our approach obtains state-of-the-art classification performance compared to other concept-based models, achieving over 2$\times$ the classification performance in some cases. We also study the ability of our model to perform interventions on concepts, and show that it can localize visual concepts in input images, providing post-hoc interpretations.

Comment: The paper proposes MuCIL for interpretable models in incremental learning, which aligns with representation learning and interpretability but is more niche.

Relevance: 7 Novelty: 7


37. Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?

ArXiv ID: 2502.19557

Authors: Yudi Zhang, Lu Wang, Meng Fang, Yali Du, Chenghua Huang, Jun Wang, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

Abstract: Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). However, this approach neglects the potential to distill both data (output content) and reward signals (quality evaluations). Extracting reliable reward signals directly from teacher models is challenging, as LLMs are optimized for generation rather than evaluation, often resulting in biased or inconsistent assessments. To address this limitation, we propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses, enabling reward learning without explicit external evaluation. The reward model subsequently guides reinforcement learning (RL), allowing iterative refinement of the student model after an SFT warm-up phase. Experiments on GSM8K and MMLU-PRO demonstrate that our method consistently outperforms traditional SFT-based approaches, enabling student models to surpass the performance of their teachers. This work highlights the potential for scalable, efficient distillation through structured self-supervised reward learning, reducing dependence on external reward supervision.

Comment: The paper proposes a novel distillation pipeline for LLMs, focusing on reward learning and reinforcement learning, which partially aligns with foundational research in LLM behavior.

Relevance: 7 Novelty: 7


38. Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation

ArXiv ID: 2502.19830

Authors: Yiwei Li, Ji Zhang, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Abstract: Self-consistency improves reasoning by aggregating diverse stochastic samples, yet the dynamics behind its efficacy remain underexplored. We reframe self-consistency as a dynamic distributional alignment problem, revealing that decoding temperature not only governs sampling randomness but also actively shapes the latent answer distribution. Given that high temperatures require prohibitively large sample sizes to stabilize, while low temperatures risk amplifying biases, we propose a confidence-driven mechanism that dynamically calibrates temperature: sharpening the sampling distribution under uncertainty to align with high-probability modes, and promoting exploration when confidence is high. Experiments on mathematical reasoning tasks show this approach outperforms fixed-diversity baselines under limited samples, improving both average and best-case performance across varying initial temperatures without additional data or modules. This establishes self-consistency as a synchronization challenge between sampling dynamics and evolving answer distributions.

Comment: The paper reframes self-consistency in reasoning as a dynamic distributional alignment problem, which provides insights into LLM behavior but does not introduce foundational changes to LLMs.

Relevance: 7 Novelty: 6
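
As a rough illustration of the idea described in the abstract above, the sketch below combines majority-vote aggregation with a confidence-driven temperature update. It is only a minimal sketch under stated assumptions: confidence is taken to be the vote share of the most common answer, and the threshold, step size, and clamping bounds are hypothetical values chosen for illustration. The paper's actual calibration rule is not given in the abstract and may differ.

```python
from collections import Counter


def aggregate(answers):
    """Majority vote over sampled answers; returns (answer, vote share)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def adjust_temperature(temperature, confidence,
                       threshold=0.5, step=0.1, lo=0.1, hi=1.5):
    """Hypothetical update rule (all defaults are assumptions).

    Low confidence: lower the temperature to sharpen sampling.
    High confidence: raise the temperature to encourage exploration.
    """
    if confidence < threshold:
        temperature -= step
    else:
        temperature += step
    return min(hi, max(lo, temperature))


if __name__ == "__main__":
    samples = ["42", "42", "17", "42", "8"]
    answer, confidence = aggregate(samples)
    print(answer, confidence, adjust_temperature(0.7, confidence))
```

In practice the samples would come from repeated decoding of the same prompt, with the updated temperature applied to the next batch of samples.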


39. SCU: An Efficient Machine Unlearning Scheme for Deep Learning Enabled Semantic Communications

ArXiv ID: 2502.19785

Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu

Abstract: Deep learning (DL) enabled semantic communications leverage DL to train encoders and decoders (codecs) to extract and recover semantic information. However, most semantic training datasets contain personal private information. Such concerns call for enormous requirements for specified data erasure from semantic codecs when previous users hope to move their data from the semantic system. Existing machine unlearning solutions remove data contribution from trained models, yet usually in supervised sole model scenarios. These methods are infeasible in semantic communications that often need to jointly train unsupervised encoders and decoders. In this paper, we investigate the unlearning problem in DL-enabled semantic communications and propose a semantic communication unlearning (SCU) scheme to tackle the problem. SCU includes two key components. Firstly, we customize the joint unlearning method for semantic codecs, including the encoder and decoder, by minimizing mutual information between the learned semantic representation and the erased samples. Secondly, to compensate for semantic model utility degradation caused by unlearning, we propose a contrastive compensation method, which considers the erased data as the negative samples and the remaining data as the positive samples to retrain the unlearned semantic models contrastively. Theoretical analysis and extensive experimental results on three representative datasets demonstrate the effectiveness and efficiency of our proposed methods.

Comment: The paper proposes a machine unlearning scheme for semantic communications, focusing on mutual information minimization and contrastive compensation, which aligns partially with 'Model Compression' for efficiency-related innovations.

Relevance: 7 Novelty: 6


40. NeoBERT: A Next-Generation BERT

ArXiv ID: 2502.19587

Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar

Abstract: Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.

Comment: NeoBERT introduces architectural advancements for bidirectional models, focusing on pretraining and fine-tuning improvements. While it is relevant to model architecture, it lacks groundbreaking insights into foundational architectural innovations.

Relevance: 7 Novelty: 6


41. Do Sparse Autoencoders Generalize? A Case Study of Answerability

ArXiv ID: 2502.19964

Authors: Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost

Abstract: Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, and these features can often manifest differently in each context. We examine this through "answerability", a model's ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features demonstrate inconsistent transfer ability, and residual stream probes similarly show high variance out of distribution. Overall, this demonstrates the need for quantitative methods to predict feature generalization in SAE-based interpretability.

Comment: The paper focuses on sparse autoencoders (SAEs) and their generalization properties, which aligns with the representation learning criterion. However, the focus on 'answerability' datasets makes it slightly application-driven.

Relevance: 7 Novelty: 6


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jailbreaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thought, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
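
A response in this format can be checked with a few lines of Python. The sketch below is a minimal illustration, assuming each line is a standalone JSON object whose keys are exactly the four names listed above. The value types (string ID, integer scores) and the helper name `parse_response` are assumptions, since the prompt does not specify them. The sample line reuses the ID, comment text, and scores of paper 41 from this page.

```python
import json

# Keys required by the instruction above.
REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")

# Sample response line; the value types are an assumption.
SAMPLE = (
    '{"ARXIVID": "2502.19964", '
    '"COMMENT": "The paper focuses on sparse autoencoders (SAEs) and their '
    'generalization properties.", '
    '"RELEVANCE": 7, "NOVELTY": 6}\n'
)


def parse_response(text):
    """Parse a JSONL response into a list of dicts, one per non-empty line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        records.append(record)
    return records


if __name__ == "__main__":
    for record in parse_response(SAMPLE):
        print(record["ARXIVID"], record["RELEVANCE"], record["NOVELTY"])
```

Validating the keys up front makes a malformed model response fail loudly instead of silently producing an incomplete digest entry.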