Personalized Daily Arxiv Papers 02/28/2025

[gpt-4o]	Prompt	Completion	Total
Token	51173	7203	58376
Cost	$0.13	$0.07	$0.20

Total ArXiv papers: 557

Total scanned papers: 306

Total relevant papers: 41

Table of contents with paper titles:

Learning with Exact Invariances in Polynomial Time Authors: Ashkan Soleymani, Behrooz Tahmasebi, Stefanie Jegelka, Patrick Jaillet
Algebraic Machine Learning: Learning as computing an algebraic decomposition of a task Authors: Fernando Martin-Maroto, Nabil Abderrahaman, David Mendez, Gonzalo G. de Polavieja
Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge Authors: Yan-Lun Chen, Yi-Ru Wei, Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts Authors: Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu
LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM Authors: Yehonathan Refael, Iftach Arbel, Ofir Lindenbaum, Tom Tirer
HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration Authors: Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh
Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models Authors: Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, Taylor Webb
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking Authors: Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin
Forward-Cooperation-Backward (FCB) learning in a Multi-Encoding Uni-Decoding neural network architecture Authors: Prasun Dutta, Koustab Ghosh, Rajat K. De
Do Large Language Models Know How Much They Know? Authors: Gabriele Prato, Jerry Huang, Prasanna Parthasarathi, Shagun Sodhani, Sarath Chandar
Self-Training Elicits Concise Reasoning in Large Language Models Authors: Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun
Your contrastive learning problem is secretly a distribution alignment problem Authors: Zihao Chen, Chi-Heng Lin, Ran Liu, Jingyun Xiao, Eva L Dyer
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models Authors: Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula Authors: Matteo Vilucchio, Yatin Dandi, Cedric Gerbelot, Florent Krzakala
Identifiable Multi-View Causal Discovery Without Non-Gaussianity Authors: Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort, Aapo Hyvärinen
Topological Autoencoders++: Fast and Accurate Cycle-Aware Dimensionality Reduction Authors: Mattéo Clémot, Julie Digne, Julien Tierny
Scalable Signature Kernel Computations for Long Time Series via Local Neumann Series Expansions Authors: Matthew Tamayo-Rios, Alexander Schell, Rima Alaifari
Extremely Greedy Equivalence Search Authors: Achille Nazaret, David Blei
Tell me why: Visual foundation models as self-explainable classifiers Authors: Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis
Obtaining Example-Based Explanations from Deep Neural Networks Authors: Genghua Dong, Henrik Boström, Michalis Vazirgiannis, Roman Bresson
Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation Authors: Pavel Rumiantsev, Mark Coates
Accurate and Scalable Graph Neural Networks via Message Invariance Authors: Zhihao Shi, Jie Wang, Zhiwei Zhuang, Xize Liang, Bin Li, Feng Wu
Sanity Checking Causal Representation Learning on a Simple Real-World System Authors: Juan L. Gamella, Simon Bing, Jakob Runge
Spectral Analysis of Representational Similarity with Limited Neurons Authors: Hyunmo Kang, Abdulkadir Canatar, SueYeon Chung
Teasing Apart Architecture and Initial Weights as Sources of Inductive Bias in Neural Networks Authors: Gianluca Bencomo, Max Gupta, Ioana Marinescu, R. Thomas McCoy, Thomas L. Griffiths
Erasing Without Remembering: Safeguarding Knowledge Forgetting in Large Language Models Authors: Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao
Recommendations from Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization Authors: Suryanarayana Sankagiri, Jalal Etesami, Matthias Grossglauser
Incremental Learning with Repetition via Pseudo-Feature Projection Authors: Benedikt Tscheschner, Eduardo Veas, Marc Masana
Global Framework for Simultaneous Emulation Across the Nuclear Landscape Authors: Antoine Belley, Jose M. Munoz, Ronald F. Garcia Ruiz
SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks Authors: Nikolay Blagoev, Lydia Yiyu Chen, Oğuzhan Ersoy
LangProBe: a Language Programs Benchmark Authors: Shangyin Tan, Lakshya A Agrawal, Arnav Singhvi, Liheng Lai, Michael J Ryan, Dan Klein, Omar Khattab, Koushik Sen, Matei Zaharia
Mixtraining: A Better Trade-Off Between Compute and Performance Authors: Zexin Li, Jiancheng Zhang, Yinglun Zhu, Cong Liu
Self-rewarding correction for mathematical reasoning Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
Beyond Worst-Case Dimensionality Reduction for Sparse Vectors Authors: Sandeep Silwal, David P. Woodruff, Qiuyi Zhang
Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models Authors: Susmit Agrawal, Deepika Vemuri, Sri Siddarth Chakaravarthy P, Vineeth N. Balasubramanian
Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? Authors: Yudi Zhang, Lu Wang, Meng Fang, Yali Du, Chenghua Huang, Jun Wang, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation Authors: Yiwei Li, Ji Zhang, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
SCU: An Efficient Machine Unlearning Scheme for Deep Learning Enabled Semantic Communications Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu
NeoBERT: A Next-Generation BERT Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
Do Sparse Autoencoders Generalize? A Case Study of Answerability Authors: Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost

1. Learning with Exact Invariances in Polynomial Time

ArXiv ID: 2502.19758

Authors: Ashkan Soleymani, Behrooz Tahmasebi, Stefanie Jegelka, Patrick Jaillet

Abstract: We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with exact invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.
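
For context, the group-averaging baseline named in the abstract can be sketched as follows. This is a minimal illustration and not the paper's algorithm: the RBF kernel, the cyclic-shift group, and the function names are assumptions chosen for the example. Averaging needs one kernel evaluation per group element, which is the cost that grows with the size of the group.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Plain RBF kernel; its value is unchanged when x and y are shifted together."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def group_averaged_kernel(x, y, gamma=0.5):
    """Average the base kernel over all cyclic shifts of y (hypothetical group choice)."""
    shifts = [np.roll(y, s) for s in range(len(y))]
    return sum(rbf(x, gy, gamma) for gy in shifts) / len(shifts)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, -1.0, 2.0, 0.0])
k_plain = group_averaged_kernel(x, y)
# Shifting either argument should leave the averaged kernel unchanged.
k_shifted_x = group_averaged_kernel(np.roll(x, 1), y)
k_shifted_y = group_averaged_kernel(x, np.roll(y, 2))
```

The averaged kernel is invariant in both arguments because the base kernel does not change when the same shift is applied to both inputs.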

Comment: The paper provides a polynomial-time algorithm for learning with exact invariances, which is a cutting-edge theoretical contribution relevant to representation learning.

Relevance: 9 Novelty: 9

2. Algebraic Machine Learning: Learning as computing an algebraic decomposition of a task

ArXiv ID: 2502.19944

Authors: Fernando Martin-Maroto, Nabil Abderrahaman, David Mendez, Gonzalo G. de Polavieja

Abstract: Statistics and Optimization are foundational to modern Machine Learning. Here, we propose an alternative foundation based on Abstract Algebra, with mathematics that facilitates the analysis of learning. In this approach, the goal of the task and the data are encoded as axioms of an algebra, and a model is obtained where only these axioms and their logical consequences hold. Although this is not a generalizing model, we show that selecting specific subsets of its breakdown into algebraic atoms obtained via subdirect decomposition gives a model that generalizes. We validate this new learning principle on standard datasets such as MNIST, FashionMNIST, CIFAR-10, and medical images, achieving performance comparable to optimized multilayer perceptrons. Beyond data-driven tasks, the new learning principle extends to formal problems, such as finding Hamiltonian cycles from their specifications and without relying on search. This algebraic foundation offers a fresh perspective on machine intelligence, featuring direct learning from training data without the need for validation dataset, scaling through model additivity, and asymptotic convergence to the underlying rule in the data.

Comment: The paper proposes a novel algebraic foundation for machine learning, which is a cutting-edge theoretical contribution and aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 9

3. Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge

ArXiv ID: 2502.20186

Authors: Yan-Lun Chen, Yi-Ru Wei, Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

Abstract: Large language models (LLMs) demonstrate strong task-specific capabilities through fine-tuning, but merging multiple fine-tuned models often leads to degraded performance due to overlapping instruction-following components. Task Arithmetic (TA), which combines task vectors derived from fine-tuning, enables multi-task learning and task forgetting but struggles to isolate task-specific knowledge from general instruction-following behavior. To address this, we propose Layer-Aware Task Arithmetic (LATA), a novel approach that assigns layer-specific weights to task vectors based on their alignment with instruction-following or task-specific components. By amplifying task-relevant layers and attenuating instruction-following layers, LATA improves task learning and forgetting performance while preserving overall model utility. Experiments on multiple benchmarks, including WikiText-2, GSM8K, and HumanEval, demonstrate that LATA outperforms existing methods in both multi-task learning and selective task forgetting, achieving higher task accuracy and alignment with minimal degradation in output quality. Our findings highlight the importance of layer-wise analysis in disentangling task-specific and general-purpose knowledge, offering a robust framework for efficient model merging and editing.
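
The idea of scaling task vectors layer by layer can be sketched as below. This is an illustration under stated assumptions, not LATA itself: the layer names, the toy weights, and the per-layer coefficients are hypothetical, and the alignment-based scoring that the paper uses to choose those coefficients is not reproduced.

```python
import numpy as np

def task_vector(base, finetuned):
    """Per-layer difference between fine-tuned and base weights."""
    return {name: finetuned[name] - base[name] for name in base}

def merge_with_layer_weights(base, vectors, layer_weights):
    """Add each task vector to the base model, scaled by a per-layer weight."""
    merged = {name: w.copy() for name, w in base.items()}
    for tau in vectors:
        for name, delta in tau.items():
            merged[name] += layer_weights[name] * delta
    return merged

base = {"layer0": np.zeros(3), "layer1": np.zeros(3)}
math_model = {"layer0": np.ones(3), "layer1": 2 * np.ones(3)}
code_model = {"layer0": np.ones(3), "layer1": -np.ones(3)}

# Hypothetical weights: damp layer0 (assumed to carry instruction following),
# keep layer1 (assumed to carry task-specific knowledge) at full strength.
weights = {"layer0": 0.2, "layer1": 1.0}
vectors = [task_vector(base, math_model), task_vector(base, code_model)]
merged = merge_with_layer_weights(base, vectors, weights)
```

Setting every weight to 1.0 recovers plain task arithmetic, and negative weights would subtract a task vector, which is how task arithmetic expresses forgetting.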

Comment: The paper proposes Layer-Aware Task Arithmetic (LATA) to disentangle task-specific and instruction-following knowledge in LLMs, which aligns with foundational insights into LLM behavior.

Relevance: 9 Novelty: 8

4. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

ArXiv ID: 2502.19811

Authors: Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

Abstract: Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of an MoE layer can occupy 47% of the entire model's execution time with popular models and frameworks. Therefore, existing methods suggest pipelining the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes notably impair computational efficiency, and their latency concealment is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and for end-to-end execution, COMET delivers a $1.71\times$ speedup on average. COMET has been adopted in production clusters with GPUs at the ten-thousand scale, achieving savings of millions of GPU hours.

Comment: The paper introduces COMET, a fine-grained communication-computation overlapping system for MoE, which aligns with architectural efficiency improvements in MoE systems.

Relevance: 9 Novelty: 8

5. LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM

ArXiv ID: 2502.19571

Authors: Yehonathan Refael, Iftach Arbel, Ofir Lindenbaum, Tom Tirer

Abstract: We study robust parameter-efficient fine-tuning (PEFT) techniques designed to improve accuracy and generalization while operating within strict computational and memory hardware constraints, specifically focusing on large-language models (LLMs). Existing PEFT methods often lack robustness and fail to generalize effectively across diverse tasks, leading to suboptimal performance in real-world scenarios. To address this, we present a new highly computationally efficient framework called AdaZo-SAM, combining Adam and Sharpness-Aware Minimization (SAM) while requiring only a single-gradient computation in every iteration. This is achieved using a stochastic zeroth-order estimation to find SAM's ascent perturbation. We provide a convergence guarantee for AdaZo-SAM and show that it improves the generalization ability of state-of-the-art PEFT methods. Additionally, we design a low-rank gradient optimization method named LORENZA, which is a memory-efficient version of AdaZo-SAM. LORENZA utilizes a randomized SVD scheme to efficiently compute the subspace projection matrix and apply optimization steps onto the selected subspace. This technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, achieving the same reduced memory consumption as gradient-low-rank-projection methods. We provide a convergence analysis of LORENZA and demonstrate its merits for pre-training and fine-tuning LLMs.
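
A zeroth-order estimate of SAM's ascent perturbation, followed by a single gradient computation, might look like the sketch below. Several things here are assumptions made for brevity: the toy quadratic loss, the hyperparameter values, and the use of plain gradient descent in place of Adam. The low-rank projection used by LORENZA is omitted.

```python
import numpy as np

def loss(w):
    """Toy quadratic loss standing in for a training objective."""
    return 0.5 * float(np.sum(w ** 2))

def grad(w):
    """Exact gradient of the toy loss."""
    return w

def zo_sam_step(w, rng, lr=0.1, rho=0.05, mu=1e-3):
    # Zeroth-order estimate of the ascent direction: two loss evaluations, no gradient.
    u = rng.standard_normal(w.shape)
    slope = (loss(w + mu * u) - loss(w)) / mu
    g_hat = slope * u
    eps = rho * g_hat / (np.linalg.norm(g_hat) + 1e-12)
    # The only gradient computation in the step, taken at the perturbed point.
    return w - lr * grad(w + eps)

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 3.0])
start = loss(w)
for _ in range(50):
    w = zo_sam_step(w, rng)
end = loss(w)
```

Standard SAM needs two gradient computations per step (one to find the perturbation, one to update); here the first is replaced by a finite-difference estimate along a random direction.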

Comment: The paper introduces a low-rank gradient optimization method (LORENZA) and provides theoretical insights into its efficiency for LLMs, aligning with the model compression criterion.

Relevance: 9 Novelty: 8

6. HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

ArXiv ID: 2502.19662

Authors: Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh

Abstract: Quantization is critical for realizing efficient inference of LLMs. Traditional quantization methods are hardware-agnostic, limited to bit-width constraints, and lacking circuit-level insights, such as timing and energy characteristics of Multiply-Accumulate (MAC) units. We introduce HALO, a versatile framework that adapts to various hardware through a Hardware-Aware Post-Training Quantization (PTQ) approach. By leveraging MAC unit properties, HALO minimizes critical-path delays and enables dynamic frequency scaling. Deployed on LLM accelerators like TPUs and GPUs, HALO achieves on average 270% performance gains and 51% energy savings, all with minimal accuracy drop.

Comment: The paper introduces a hardware-aware quantization framework (HALO) for LLM acceleration, aligning with the model compression criterion.

Relevance: 9 Novelty: 8

7. Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

ArXiv ID: 2502.20332

Authors: Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, Taylor Webb

Abstract: Many recent studies have found evidence for emergent reasoning capabilities in large language models, but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we perform a comprehensive study of the internal mechanisms that support abstract rule induction in an open-source language model (Llama3-70B). We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, symbol abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these abstract variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.

Comment: The paper identifies symbolic mechanisms in LLMs for abstract reasoning, aligning with the large language models criterion.

Relevance: 9 Novelty: 8

8. Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking

ArXiv ID: 2502.20129

Authors: Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin

Abstract: Chain-of-Thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. In this work, we (1) evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit, a subset of model components, responsible for tracking the world state, finding that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three realistic settings: skipping intermediate steps, introducing data noise, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSA), highlighting its resilience in challenging scenarios.

Comment: The paper provides a mechanistic study of state tracking in Transformers with Chain-of-Thought, offering insights into model behavior and architecture, aligning with foundational research.

Relevance: 9 Novelty: 8

9. Forward-Cooperation-Backward (FCB) learning in a Multi-Encoding Uni-Decoding neural network architecture

ArXiv ID: 2502.20113

Authors: Prasun Dutta, Koustab Ghosh, Rajat K. De

Abstract: The most popular technique to train a neural network is backpropagation. Recently, the Forward-Forward technique has also been introduced for certain learning tasks. However, in real life, human learning does not follow any of these techniques exclusively. The way a human learns is basically a combination of forward learning, backward propagation and cooperation. Humans start learning a new concept by themselves and try to refine their understanding hierarchically during which they might come across several doubts. The most common approach to doubt solving is a discussion with peers, which can be called cooperation. Cooperation/discussion/knowledge sharing among peers is one of the most important steps of learning that humans follow. However, there might still be a few doubts even after the discussion. Then the difference between the understanding of the concept and the original literature is identified and minimized over several revisions. Inspired by this, the paper introduces Forward-Cooperation-Backward (FCB) learning in a deep neural network framework mimicking the human nature of learning a new concept. A novel deep neural network architecture, called Multi Encoding Uni Decoding neural network model, has been designed which learns using the notion of FCB. A special lateral synaptic connection has also been introduced to realize cooperation. The models have been justified in terms of their performance in dimension reduction on four popular datasets. The ability to preserve the granular properties of data in low-rank embedding has been tested to justify the quality of dimension reduction. For downstream analyses, classification has also been performed. An experimental study on convergence analysis has been performed to establish the efficacy of the FCB learning strategy.

Comment: The paper introduces a novel learning paradigm (Forward-Cooperation-Backward) and a new architecture (Multi-Encoding Uni-Decoding) with lateral synaptic connections, which aligns with the 'Model Architecture' criterion for architectural innovations.

Relevance: 9 Novelty: 8

10. Do Large Language Models Know How Much They Know?

ArXiv ID: 2502.19573

Authors: Gabriele Prato, Jerry Huang, Prasanna Parthasarathi, Shagun Sodhani, Sarath Chandar

Abstract: Large Language Models (LLMs) have emerged as highly capable systems and are increasingly being integrated into various uses. However, the rapid pace of their deployment has outpaced a comprehensive understanding of their internal mechanisms and a delineation of their capabilities and limitations. A desired attribute of an intelligent system is its ability to recognize the scope of its own knowledge. To investigate whether LLMs embody this characteristic, we develop a benchmark designed to challenge these models to enumerate all information they possess on specific topics. This benchmark evaluates whether the models recall excessive, insufficient, or the precise amount of information, thereby indicating their awareness of their own knowledge. Our findings reveal that all tested LLMs, given sufficient scale, demonstrate an understanding of how much they know about specific topics. While different architectures exhibit varying rates of this capability's emergence, the results suggest that awareness of knowledge may be a generalizable attribute of LLMs. Further research is needed to confirm this potential and fully elucidate the underlying mechanisms.

Comment: The paper investigates whether LLMs can assess the scope of their own knowledge, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8

11. Self-Training Elicits Concise Reasoning in Large Language Models

ArXiv ID: 2502.20122

Authors: Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun

Abstract: Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence of their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning
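
Selecting a concise training target from N sampled reasoning paths could be sketched as follows. The sample format, the whitespace word count used as a stand-in for token length, and the filter on a known reference answer are all assumptions for illustration; the released code linked above is the authoritative implementation.

```python
def shortest_correct(samples, gold_answer):
    """Return the shortest sampled reasoning path whose answer matches the reference."""
    correct = [s for s in samples if s["answer"] == gold_answer]
    if not correct:
        return None
    # Word count is a rough proxy for token count (assumption).
    return min(correct, key=lambda s: len(s["reasoning"].split()))

samples = [
    {"reasoning": "3 apples plus 4 apples is 7 apples, so the total is 7", "answer": "7"},
    {"reasoning": "3 + 4 = 7", "answer": "7"},
    {"reasoning": "3 + 4 = 8", "answer": "8"},
]
best = shortest_correct(samples, "7")
none_found = shortest_correct(samples, "9")
```

The selected paths would then serve as fine-tuning targets, so the model learns from its own shorter outputs.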

Comment: The paper proposes methods to elicit concise reasoning in LLMs, which aligns with foundational research in LLM behavior and training dynamics.

Relevance: 9 Novelty: 8

12. Your contrastive learning problem is secretly a distribution alignment problem

ArXiv ID: 2502.20141

Authors: Zihao Chen, Chi-Heng Lin, Ran Liu, Jingyun Xiao, Eva L Dyer

Abstract: Despite the success of contrastive learning (CL) in vision and language, its theoretical foundations and mechanisms for building representations remain poorly understood. In this work, we build connections between noise contrastive estimation losses widely used in CL and distribution alignment with entropic optimal transport (OT). This connection allows us to develop a family of different losses and multistep iterative variants for existing CL methods. Intuitively, by using more information from the distribution of latents, our approach allows a more distribution-aware manipulation of the relationships within augmented sample sets. We provide theoretical insights and experimental evidence demonstrating the benefits of our approach for generalized contrastive alignment. Through this framework, it is possible to leverage tools in OT to build unbalanced losses to handle noisy views and customize the representation space by changing the constraints on alignment. By reframing contrastive learning as an alignment problem and leveraging existing optimization tools for OT, our work provides new insights and connections between different self-supervised learning models in addition to new tools that can be more easily adapted to incorporate domain knowledge into learning.
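
One way to read the connection is that the softmax inside InfoNCE is a single row normalization of the kernel exp(similarity / temperature), while entropic OT alternates row and column normalizations (Sinkhorn iterations) until both marginals match. The sketch below illustrates that reading only; the temperature, batch size, uniform marginals, and function names are assumptions, and it does not reproduce the paper's losses.

```python
import numpy as np

def gibbs_kernel(za, zb, tau=0.5):
    """exp(cosine similarity / tau) between two batches of embeddings."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    return np.exp(za @ zb.T / tau)

def info_nce(kernel):
    """InfoNCE: negative log of the row-softmax probability of each positive pair."""
    row_plan = kernel / kernel.sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(np.diag(row_plan))))

def sinkhorn_plan(kernel, n_iters=200):
    """Alternate row and column scaling toward uniform marginals."""
    n = kernel.shape[0]
    plan = kernel.copy()
    for _ in range(n_iters):
        plan = plan / plan.sum(axis=1, keepdims=True) / n
        plan = plan / plan.sum(axis=0, keepdims=True) / n
    return plan

rng = np.random.default_rng(0)
za = rng.standard_normal((4, 8))
zb = za + 0.1 * rng.standard_normal((4, 8))  # second "view" of the same samples
kernel = gibbs_kernel(za, zb)
nce = info_nce(kernel)
plan = sinkhorn_plan(kernel)
```

Stopping after the first row normalization gives the usual contrastive probabilities; running more iterations gives a plan whose rows and columns both sum to 1/n.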

Comment: The paper reframes contrastive learning as a distribution alignment problem using optimal transport, providing theoretical insights into representation learning. This aligns closely with foundational research in representation learning.

Relevance: 9 Novelty: 8

13. R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

ArXiv ID: 2502.20395

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

Abstract: In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), which limits LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights at test time by moving it toward the routing-weight vectors of correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
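
The core move, shifting a test sample's routing weights toward those of nearby correctly predicted samples, can be sketched as a simple interpolation. This is a hypothetical simplification: the function name, the Euclidean neighbor search, the neighbor count, and the step size are assumptions, and none of the paper's three strategies or objectives is reproduced.

```python
import numpy as np

def reroute(test_embedding, test_weights, ref_embeddings, ref_weights, ref_correct,
            k=2, step=0.5):
    """Move routing weights toward those of nearby, correctly predicted reference samples."""
    good = np.flatnonzero(ref_correct)
    dists = np.linalg.norm(ref_embeddings[good] - test_embedding, axis=1)
    nearest = good[np.argsort(dists)[:k]]
    target = ref_weights[nearest].mean(axis=0)
    moved = (1 - step) * test_weights + step * target
    return moved / moved.sum()  # keep the weights on the probability simplex

ref_embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
ref_weights = np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]])
ref_correct = np.array([True, True, False])
new_weights = reroute(np.array([0.05, 0.0]), np.array([0.2, 0.8]),
                      ref_embeddings, ref_weights, ref_correct)
```

Only the routing weights change; the experts and the rest of the model stay frozen, which matches the abstract's point that no base-model parameters are trained.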

Comment: The paper proposes a test-time re-routing method for multimodal mixture-of-experts (MoE), which aligns well with the model architecture criterion, particularly for MoE innovations.

Relevance: 9 Novelty: 8

14. Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

ArXiv ID: 2502.19649

Authors: Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz

Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

Comment: This paper surveys Representation Engineering (RepE), a paradigm for controlling LLM behavior by manipulating internal representations. It aligns closely with the 'Representation Learning' and 'Large Language Models' criteria, offering a unified framework (representation identification, operationalization, and control) and a guide to best practices for this direction in LLM research.

Relevance: 9 Novelty: 8

15. Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula

ArXiv ID: 2502.20003

Authors: Matteo Vilucchio, Yatin Dandi, Cedric Gerbelot, Florent Krzakala

Abstract: The analytic characterization of the high-dimensional behavior of optimization for Generalized Linear Models (GLMs) with Gaussian data has been a central focus in statistics and probability in recent years. While convex cases, such as the LASSO, ridge regression, and logistic regression, have been extensively studied using a variety of techniques, the non-convex case remains far less understood despite its significance. A non-rigorous statistical physics framework has provided remarkable predictions for the behavior of high-dimensional optimization problems, but rigorously establishing their validity for non-convex problems has remained a fundamental challenge. In this work, we address this challenge by developing a systematic framework that rigorously proves replica-symmetric formulas for non-convex GLMs and precisely determines the conditions under which these formulas are valid. Remarkably, the rigorous replica-symmetric predictions align exactly with the conjectures made by physicists, and the so-called replicon condition. The originality of our approach lies in connecting two powerful theoretical tools: the Gaussian Min-Max Theorem, which we use to provide precise lower bounds, and Approximate Message Passing (AMP), which is shown to achieve these bounds algorithmically. We demonstrate the utility of this framework through significant applications: (i) proving the optimality of the Tukey loss over the more commonly used Huber loss under an $\varepsilon$-contaminated data model, (ii) establishing the optimality of negative regularization in high-dimensional non-convex regression, and (iii) characterizing the performance limits of linearized AMP algorithms. By rigorously validating statistical physics predictions in non-convex settings, we aim to open new pathways for analyzing increasingly complex optimization landscapes beyond the convex regime.
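
For readers less familiar with the two losses compared in application (i), the sketch below writes out their standard textbook forms. The tuning constants (delta = 1.0 and c = 4.685) are conventional defaults assumed here, not values taken from the paper.

```python
import numpy as np

def huber(r, delta=1.0):
    """Convex: quadratic near zero, linear growth for large residuals."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def tukey(r, c=4.685):
    """Non-convex: the penalty saturates at c**2 / 6 for residuals beyond c."""
    inside = (c ** 2 / 6) * (1 - (1 - (r / c) ** 2) ** 3)
    return np.where(np.abs(r) <= c, inside, c ** 2 / 6)

residuals = np.array([0.0, 0.5, 10.0, 100.0])
huber_values = huber(residuals)
tukey_values = tukey(residuals)
```

Because the Tukey penalty is bounded, a grossly contaminated point contributes the same amount no matter how far off it is, whereas the Huber penalty keeps growing. That bounded shape is also what makes the Tukey objective non-convex.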

Comment: The paper rigorously validates statistical physics predictions for non-convex GLMs, aligning with the emerging trends criterion.

Relevance: 8 Novelty: 9

16. Identifiable Multi-View Causal Discovery Without Non-Gaussianity

ArXiv ID: 2502.20115

Authors: Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort, Aapo Hyvärinen

Abstract: We propose a novel approach to linear causal discovery in the framework of multi-view Structural Equation Models (SEM). Our proposed model relaxes the well-known assumption of non-Gaussian disturbances by alternatively assuming diversity of variances over views, making it more broadly applicable. We prove the identifiability of all the parameters of the model without any further assumptions on the structure of the SEM other than it being acyclic. We further propose an estimation algorithm based on recent advances in multi-view Independent Component Analysis (ICA). The proposed methodology is validated through simulations and application on real neuroimaging data, where it enables the estimation of causal graphs between brain regions.

Comment: The paper proposes a novel approach to causal discovery in multi-view SEMs, which aligns with representation learning and introduces theoretical advancements in causal modeling.

Relevance: 8 Novelty: 8

17. Topological Autoencoders++: Fast and Accurate Cycle-Aware Dimensionality Reduction

ArXiv ID: 2502.20215

Authors: Mattéo Clémot, Julie Digne, Julien Tierny

Abstract: This paper presents a novel topology-aware dimensionality reduction approach aiming at accurately visualizing the cyclic patterns present in high-dimensional data. To that end, we build on the Topological Autoencoders (TopoAE) formulation. First, we provide a novel theoretical analysis of its associated loss and show that a zero loss indeed induces identical persistence pairs (in high and low dimensions) for the $0$-dimensional persistent homology (PH$^0$) of the Rips filtration. We also provide a counterexample showing that this property no longer holds for a naive extension of TopoAE to PH$^d$ for $d\ge 1$. Based on this observation, we introduce a novel generalization of TopoAE to $1$-dimensional persistent homology (PH$^1$), called TopoAE++, for the accurate generation of cycle-aware planar embeddings, addressing the above failure case. This generalization is based on the notion of cascade distortion, a new penalty term favoring an isometric embedding of the $2$-chains filling persistent $1$-cycles, hence resulting in more faithful geometrical reconstructions of the $1$-cycles in the plane. We further introduce a novel, fast algorithm for the exact computation of PH for Rips filtrations in the plane, yielding improved runtimes over previously documented topology-aware methods. Our method also achieves a better balance between the topological accuracy, as measured by the Wasserstein distance, and the visual preservation of the cycles in low dimensions. Our C++ implementation is available at https://github.com/MClemot/TopologicalAutoencodersPlusPlus.

Comment: The paper proposes a novel topology-aware dimensionality reduction method with theoretical analysis, aligning with foundational research in representation learning.

Relevance: 8 Novelty: 8
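
For readers unfamiliar with the PH$^0$ of the Rips filtration mentioned in the TopoAE++ abstract above, the following is a minimal sketch of how its finite persistence pairs can be computed. It relies on the standard fact that every point is born at scale 0 and that the death times are the edge lengths of a minimum spanning tree, found here with Kruskal's algorithm and union-find. The function name is hypothetical, the convention that an edge enters the filtration at its full length (rather than half of it) is an assumption, and this is not the authors' C++ implementation.

```python
import itertools
import math


def rips_ph0_deaths(points):
    """Death times of the finite PH^0 pairs of the Rips filtration.

    All components are born at 0; one component never dies and is omitted.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # This edge merges two components, so one of them dies here.
            parent[ri] = rj
            deaths.append(length)
    return deaths


if __name__ == "__main__":
    print(rips_ph0_deaths([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
```

Comparing the output for a point cloud and for its low-dimensional embedding gives a simple check of whether the two share the same PH$^0$ death times.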

18. Scalable Signature Kernel Computations for Long Time Series via Local Neumann Series Expansions

ArXiv ID: 2502.20392

Authors: Matthew Tamayo-Rios, Alexander Schell, Rima Alaifari

Abstract: The signature kernel is a recent state-of-the-art tool for analyzing high-dimensional sequential data, valued for its theoretical guarantees and strong empirical performance. In this paper, we present a novel method for efficiently computing the signature kernel of long, high-dimensional time series via dynamically truncated recursive local power series expansions. Building on the characterization of the signature kernel as the solution of a Goursat PDE, our approach employs tilewise Neumann-series expansions to derive rapidly converging power series approximations of the signature kernel that are locally defined on subdomains and propagated iteratively across the entire domain of the Goursat solution by exploiting the geometry of the time series. Algorithmically, this involves solving a system of interdependent local Goursat PDEs by recursively propagating boundary conditions along a directed graph via topological ordering, with dynamic truncation adaptively terminating each local power series expansion when coefficients fall below machine precision, striking an effective balance between computational cost and accuracy. This method achieves substantial performance improvements over state-of-the-art approaches for computing the signature kernel, providing (a) adjustable and superior accuracy, even for time series with very high roughness; (b) drastically reduced memory requirements; and (c) scalability to efficiently handle very long time series (e.g., with up to half a million points or more) on a single GPU. These advantages make our method particularly well-suited for rough-path-assisted machine learning, financial modeling, and signal processing applications that involve very long and highly volatile data.

Comment: The paper introduces a novel method for scalable signature kernel computations, which aligns with foundational research in efficiency and algorithmic breakthroughs.

Relevance: 8 Novelty: 8
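
The signature kernel abstract above builds on the Goursat PDE characterization, $\partial^2 k / \partial s\,\partial t = \langle \dot{x}(s), \dot{y}(t) \rangle\, k$ with $k(0,\cdot) = k(\cdot,0) = 1$. As a point of reference, here is a minimal sketch of a plain first-order finite-difference solver for that PDE on the grid of the two sampled paths. This is a simple baseline, not the tilewise Neumann-series method the paper proposes; the function name is hypothetical, and paths are assumed to be lists of equal-dimension tuples.

```python
def signature_kernel_fd(x, y):
    """First-order finite-difference estimate of the signature kernel k(x, y).

    x, y: sequences of points (tuples of floats) sampled along each path.
    """
    # Increments of each path between consecutive samples.
    dx = [[b - a for a, b in zip(p, q)] for p, q in zip(x, x[1:])]
    dy = [[b - a for a, b in zip(p, q)] for p, q in zip(y, y[1:])]

    # Goursat boundary conditions: k = 1 on the first row and first column.
    k = [[1.0] * len(y) for _ in x]

    for i, u in enumerate(dx):
        for j, v in enumerate(dy):
            inc = sum(a * b for a, b in zip(u, v))  # <dx_i, dy_j>
            # Integrate the PDE over one grid cell, freezing k at its corner (i, j).
            k[i + 1][j + 1] = k[i + 1][j] + k[i][j + 1] + (inc - 1.0) * k[i][j]
    return k[-1][-1]


if __name__ == "__main__":
    # One-dimensional straight path from 0 to 1, sampled at 201 points.
    path = [(i / 200.0,) for i in range(201)]
    print(signature_kernel_fd(path, path))
```

This scheme stores the full grid and needs a fine discretization for accuracy, which illustrates the memory and cost pressure on long time series that the abstract sets out to reduce.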

19. Extremely Greedy Equivalence Search

ArXiv ID: 2502.19551

Authors: Achille Nazaret, David Blei

Abstract: The goal of causal discovery is to learn a directed acyclic graph from data. One of the most well-known methods for this problem is Greedy Equivalence Search (GES). GES searches for the graph by incrementally and greedily adding or removing edges to maximize a model selection criterion. It has strong theoretical guarantees on infinite data but can fail in practice on finite data. In this paper, we first identify some of the causes of GES's failure, finding that it can get blocked in local optima, especially in denser graphs. We then propose eXtremely Greedy Equivalence Search (XGES), which involves a new heuristic to improve the search strategy of GES while retaining its theoretical guarantees. In particular, XGES favors deleting edges early in the search over inserting edges, which reduces the possibility of the search ending in local optima. A further contribution of this work is an efficient algorithmic formulation of XGES (and GES). We benchmark XGES on simulated datasets with known ground truth. We find that XGES consistently outperforms GES in recovering the correct graphs, and it is 10 times faster. XGES implementations in Python and C++ are available at https://github.com/ANazaret/XGES.

Comment: The paper proposes an improvement to the Greedy Equivalence Search algorithm, which aligns with foundational research in model efficiency and algorithmic innovation.

Relevance: 8 Novelty: 8

20. Tell me why: Visual foundation models as self-explainable classifiers

ArXiv ID: 2502.19577

Authors: Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis

Abstract: Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEMs) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.

Comment: The paper introduces a novel prototypical architecture for interpretability in visual foundation models, which aligns with representation learning and architectural insights.