Personalized Daily Arxiv Papers 02/04/2025

	Prompt	Completion	Total
Token	166289	15882	182171
Cost	$4.16	$1.59	$5.75

Total scanned papers: 464

Total relevant papers: 66

Table of contents with paper titles:

Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germán Pizarro, Matías Fuentes, Francisco Meza, Cristian Buc, Cristóbal Rojas
AdaSVD: Adaptive Singular Value Decomposition for Large Language Models Authors: Li Zhiteng, Xia Mingyuan, Zhang Jingyuan, Hui Zheng, Kong Linghe, Zhang Yulun, Yang Xiaokang
MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization Authors: JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan
Pushing the Limits of BFP on Narrow Precision LLM Inference Authors: Hui Wang, Yuan Cheng, Xiaomeng Han, Zhengpeng Zhao, Dawei Yang, Zhe Jiang
Neural Collapse Beyond the Unconstrainted Features Model: Landscape, Dynamics, and Generalization in the Mean-Field Regime Authors: Diyuan Wu, Marco Mondelli
Beyond Limited Data: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving Authors: Kefan Dong, Tengyu Ma
LLM Safety Alignment is Divergence Estimation in Disguise Authors: Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing
Nearly Lossless Adaptive Bit Switching Authors: Haiduo Huang, Zhenhua Liu, Tian Xia, Wenzhe Zhao, Pengju Ren
One-step full gradient suffices for low-rank fine-tuning, provably and efficiently Authors: Yuanhe Zhang, Fanghui Liu, Yudong Chen
An Invitation to Neuroalgebraic Geometry Authors: Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathlén Kohn
MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs Authors: Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu
Scaling Laws for Differentially Private Language Models Authors: Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, George Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang
Symmetric Pruning of Large Language Models Authors: Kai Yi, Peter Richtárik
Self-Supervised Learning Using Nonlinear Dependence Authors: M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth
Spectro-Riemannian Graph Neural Networks Authors: Karish Grover, Haiyang Yu, Xiang Song, Qi Zhu, Han Xie, Vassilis N. Ioannidis, Christos Faloutsos
Activation by Interval-wise Dropout: A Simple Way to Prevent Neural Networks from Plasticity Loss Authors: Sangyeon Park, Isaac Han, Seungwon Oh, Kyung-Joong Kim
TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao
Memory-Efficient Fine-Tuning of Transformers via Token Selection Authors: Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
RandLoRA: Full-rank parameter-efficient fine-tuning of large models Authors: Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad
Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective Authors: Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, Paris Perdikaris
A theoretical framework for overfitting in energy-based modeling Authors: Giovanni Catania, Aurélien Decelle, Cyril Furtlehner, Beatriz Seoane
A Comunication Framework for Compositional Generation Authors: Rafael Elberg, Mircea Petrache, Denis Parra
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
What is a Number, That a Large Language Model May Know It? Authors: Raja Marjieh, Veniamin Veselovsky, Thomas L. Griffiths, Ilia Sucholutsky
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods Authors: Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava
Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions Authors: Yihao Xue, Jiping Li, Baharan Mirzasoleiman
E2Former: A Linear-time Efficient and Equivariant Transformer for Scalable Molecular Modeling Authors: Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Jia Zhang, Mark Gerstein, Tao Qin
Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning Authors: Zhi Zhou, Tan Yuhao, Zenan Li, Yuan Yao, Lan-Zhe Guo, Xiaoxing Ma, Yu-Feng Li
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers Authors: Akiyoshi Tomihari, Issei Sato
LLM Program Optimization via Retrieval Augmented Search Authors: Sagnik Anupam, Alexander Shypula, Osbert Bastani
Efficient Reasoning with Hidden Thinking Authors: Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Demystifying MPNNs: Message Passing as Merely Efficient Matrix Multiplication Authors: Qin Jiang, Chengjia Wang, Michael Lones, Wei Pang
GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation Authors: Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
Masked Generative Nested Transformers with Decode Time Scaling Authors: Sahil Goyal, Debapriya Tula, Gagan Jain, Pradeep Shenoy, Prateek Jain, Sujoy Paul
HoP: Homeomorphic Polar Learning for Hard Constrained Optimization Authors: Ke Deng, Hanwen Zhang, Jin Lu, Haijian Sun
GRADIEND: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models Authors: Jonathan Drechsel, Steffen Herbold
eagle: early approximated gradient based learning rate estimator Authors: Takumi Fujimoto, Hiroaki Nishi
Redefining Machine Unlearning: A Conformal Prediction-Motivated Approach Authors: Yingdan Shi, Ren Wang
Compact Rule-Based Classifier Learning via Gradient Descent Authors: Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez
RIGNO: A Graph-based framework for robust and accurate operator learning for PDEs on arbitrary domains Authors: Sepehr Mousavi, Shizheng Wen, Levi Lingsch, Maximilian Herde, Bogdan Raonić, Siddhartha Mishra
Understanding Generalization in Physics Informed Models through Affine Variety Dimensions Authors: Takeshi Koshizuka, Issei Sato
Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing Authors: Saarthak Kapse, Robin Betz, Srinivasan Sivanandan
Understanding Oversmoothing in GNNs as Consensus in Opinion Dynamics Authors: Keqin Wang, Yulong Yang, Ishan Saha, Christine Allen-Blanchette
FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning Authors: Ankur Samanta, Rohan Gupta, Aditi Misra, Christian McIntosh Clarke, Jayakumar Rajadas
Process Reinforcement through Implicit Rewards Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
The role of positional encodings in the ARC benchmark Authors: Guilherme H. Bandeira Costa, Miguel Freire, Arlindo L. Oliveira
Lifelong Sequential Knowledge Editing without Model Degradation Authors: Akshat Gupta, Phudish Prateepamornkul, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli
The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models Authors: Zhiyuan Xu, Joseph Gardiner, Sana Belguith
Refining Adaptive Zeroth-Order Optimization at Ease Authors: Yao Shu, Qixin Zhang, Kun He, Zhongxiang Dai
A single-loop SPIDER-type stochastic subgradient method for expectation-constrained nonconvex nonsmooth optimization Authors: Wei Liu, Yangyang Xu
Trading Inference-Time Compute for Adversarial Robustness Authors: Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese
What is causal about causal models and representations? Authors: Frederik Hytting Jørgensen, Luigi Gresele, Sebastian Weichwald
Enhance Learning Efficiency of Oblique Decision Tree via Feature Concatenation Authors: Shen-Huan Lyu, Yi-Xiao He, Yanyan Wang, Zhihao Qu, Bin Tang, Baoliu Ye
Efficient rule induction by ignoring pointless rules Authors: Andrew Cropper, David M. Cerna
A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers Authors: Roman Tarasov, Petr Mokrov, Milena Gazdieva, Evgeny Burnaev, Alexander Korotin
Estimating LLM Uncertainty with Logits Authors: Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang
TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues Authors: Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Monica Sunkara, Yassine Benajiba, Yi Zhang
Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo, Bimal Bhattarai
Jackpot! Alignment as a Maximal Lottery Authors: Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, Kate Larson
Advanced Weakly-Supervised Formula Exploration for Neuro-Symbolic Mathematical Reasoning Authors: Yuxuan Wu, Hideki Nakayama
Locality-aware Surrogates for Gradient-based Black-box Optimization Authors: Ali Momeni, Stefan Uhlich, Arun Venkitaraman, Chia-Yu Hsieh, Andrea Bonetti, Ryoga Matsuo, Eisaku Ohbuchi, Lorenzo Servadei
Learning Sheaf Laplacian Optimizing Restriction Maps Authors: Leonardo Di Nino, Sergio Barbarossa, Paolo Di Lorenzo
No Foundations without Foundations -- Why semi-mechanistic models are essential for regulatory biology Authors: Luka Kovačević, Thomas Gaudelet, James Opzoomer, Hagen Triendl, John Whittaker, Caroline Uhler, Lindsay Edwards, Jake P. Taylor-King
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming Authors: Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
Fantastic Multi-Task Gradient Updates and How to Find Them In a Cone Authors: Negar Hassanpour, Muhammad Kamran Janjua, Kunlin Zhang, Sepehr Lavasani, Xiaowen Zhang, Chunhua Zhou, Chao Gao
Principal Components for Neural Network Initialization Authors: Nhan Phan, Thu Nguyen, Pål Halvorsen, Michael A. Riegler

1. Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method

ArXiv ID: 2501.19215

Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germán Pizarro, Matías Fuentes, Francisco Meza, Cristian Buc, Cristóbal Rojas

Abstract: We propose a novel method to evaluate the theoretical limits of Transformers, allowing us to prove the first lower bounds against one-layer softmax Transformers with infinite precision. We establish those bounds for three tasks that require advanced reasoning. The first task, Match3 (Sanford et al., 2023), requires looking at all triples of positions. The second and third tasks address compositionality-based reasoning: one is composition of functions (Peng et al., 2024) and the other is composition of binary relations. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. In an attempt to overcome these limitations, we introduce Strassen attention and prove that with this mechanism a one-layer Transformer can in principle solve all these tasks. We also show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard attention (Vaswani et al., 2017), higher-order attention (Sanford et al., 2023) and triangular attention (Bergen et al., 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.

Comment: This paper introduces 'Strassen attention' as a scalable mechanism addressing the limitations of current attention mechanisms, making significant contributions to Transformer architecture research.

Relevance: 10 Novelty: 9

2. AdaSVD: Adaptive Singular Value Decomposition for Large Language Models

ArXiv ID: 2502.01403

Authors: Li Zhiteng, Xia Mingyuan, Zhang Jingyuan, Hui Zheng, Kong Linghe, Zhang Yulun, Yang Xiaokang

Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices U and V^T. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. The code and models will be available at https://github.com/ZHITENGLI/AdaSVD.

Comment: The paper focuses on model compression techniques for LLMs using adaptive SVD, aligning closely with the relevance criteria of sparsity, low-rank approaches, and theoretical efficiency breakthroughs.

Relevance: 10 Novelty: 8
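
For readers unfamiliar with the baseline that AdaSVD builds on, the following is a minimal sketch of plain truncated-SVD weight compression. It is not the paper's adaComp or adaCR procedure; the function name, matrix sizes, and rank are hypothetical choices made only for illustration.

```python
import numpy as np

def truncated_svd_compress(W, rank):
    """Return factors (A, B) such that A @ B is the best rank-`rank` approximation of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # shape (m, rank); singular values folded into U
    B = Vt[:rank, :]             # shape (rank, n)
    return A, B

# Hypothetical 64 x 48 weight matrix, compressed to rank 8.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
A, B = truncated_svd_compress(W, rank=8)

params_before = W.size            # 64 * 48 stored values
params_after = A.size + B.size    # 8 * (64 + 48) stored values
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(params_before, params_after, rel_error)
```

The truncation error printed here is the quantity that, according to the abstract, existing SVD-based methods struggle to mitigate; a random matrix like this one is close to a worst case, since its singular values decay slowly.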

3. MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization

ArXiv ID: 2502.00425

Authors: JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan

Abstract: Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. Our MQuant effectively bridges the gap for efficient and accurate MLLM inference on resource-constrained devices. Code will be released.

Comment: Proposes a quantization framework (MQuant) for multimodal LLMs, addressing efficiency challenges. This directly aligns with model compression and quantization strategies.

Relevance: 10 Novelty: 8

4. Pushing the Limits of BFP on Narrow Precision LLM Inference

ArXiv ID: 2502.00026

Authors: Hui Wang, Yuan Cheng, Xiaomeng Han, Zhengpeng Zhao, Dawei Yang, Zhe Jiang

Abstract: The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic computational complexity. These nonlinear operations are predominantly executed using inefficient floating-point formats, which makes it challenging to optimize software efficiency and hardware overhead. In this paper, we delve into the limitations and potential of applying BFP to nonlinear operations. Given our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP version, overcomes nonlinear operation challenges with a pivot-focus strategy for diverse data and an adaptive grouping strategy for flexible exponent sharing. (ii) DH-LUT, a novel lookup table algorithm dedicated to accelerating nonlinear operations with DBFP format. (iii) An RTL-level DBFP-based engine is implemented to support DB-Attn, applicable to FPGA and ASIC. Results show that DB-Attn provides significant performance improvements with negligible accuracy loss, achieving 74% GPU speedup on Softmax of LLaMA and 10x low-overhead performance improvement over SOTA designs.

Comment: The paper proposes hardware-efficient optimizations using a BFP framework for LLMs, providing novel insights into compression techniques.

Relevance: 9 Novelty: 9
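
As background for the entry above, here is a minimal sketch of generic block floating point, in which every value in a block shares one exponent and keeps only a small signed integer mantissa. It does not implement the paper's DBFP, pivot-focus, or adaptive grouping strategies; the function names and the 8-bit mantissa width are assumptions for illustration.

```python
import math

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1
    _, shared_exp = math.frexp(max_abs)
    # One bit of the mantissa width is reserved for the sign.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=8):
    """Map shared exponent and integer mantissas back to floats."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]

block = [0.5, -1.25, 3.0, 0.01]
shared_exp, mantissas = bfp_quantize(block)
restored = bfp_dequantize(shared_exp, mantissas)
print(shared_exp, mantissas, restored)
```

In this example the smallest entry (0.01) is flushed to zero because it shares an exponent with 3.0, which illustrates why a single shared exponent can be a poor fit for blocks with a wide dynamic range.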

5. Neural Collapse Beyond the Unconstrainted Features Model: Landscape, Dynamics, and Generalization in the Mean-Field Regime

ArXiv ID: 2501.19104

Authors: Diyuan Wu, Marco Mondelli

Abstract: Neural Collapse is a phenomenon where the last-layer representations of a well-trained neural network converge to a highly structured geometry. In this paper, we focus on its first (and most basic) property, known as NC1: the within-class variability vanishes. While prior theoretical studies establish the occurrence of NC1 via the data-agnostic unconstrained features model, our work adopts a data-specific perspective, analyzing NC1 in a three-layer neural network, with the first two layers operating in the mean-field regime and followed by a linear layer. In particular, we establish a fundamental connection between NC1 and the loss landscape: we prove that points with small empirical loss and gradient norm (thus, close to being stationary) approximately satisfy NC1, and the closeness to NC1 is controlled by the residual loss and gradient norm. We then show that (i) gradient flow on the mean squared error converges to NC1 solutions with small empirical loss, and (ii) for well-separated data distributions, both NC1 and vanishing test loss are achieved simultaneously. This aligns with the empirical observation that NC1 emerges during training while models attain near-zero test error. Overall, our results demonstrate that NC1 arises from gradient training due to the properties of the loss landscape, and they show the co-occurrence of NC1 and small test error for certain data distributions.

Comment: Explores Neural Collapse, providing theoretical insights into training dynamics and representation learning. This is highly aligned with foundational research.

Relevance: 9 Novelty: 9
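
The NC1 property described above (vanishing within-class variability) can be checked numerically with a simple scatter ratio. The sketch below uses the ratio of within-class to between-class scatter traces, which is one common proxy and not necessarily the exact quantity analyzed in the paper; the function name and toy features are hypothetical.

```python
import numpy as np

def within_between_ratio(features, labels):
    """Within-class scatter divided by between-class scatter (traces); near 0 means collapse."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    within = 0.0
    between = 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += ((class_feats - class_mean) ** 2).sum()
        between += len(class_feats) * ((class_mean - global_mean) ** 2).sum()
    return within / between

labels = [0, 0, 1, 1]
collapsed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # every sample sits on its class mean
spread = [[1.2, 0.0], [0.8, 0.0], [0.0, 1.1], [0.0, 0.9]]     # samples scatter around class means

ratio_collapsed = within_between_ratio(collapsed, labels)
ratio_spread = within_between_ratio(spread, labels)
print(ratio_collapsed, ratio_spread)
```

Tracking such a ratio on last-layer features over the course of training is one way to observe whether representations are moving toward the collapsed geometry.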

6. Beyond Limited Data: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

ArXiv ID: 2502.00212

Authors: Kefan Dong, Tengyu Ma

Abstract: A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and finetuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal verifiers. With 19.8 billion tokens generated during the training in Lean, STP proves 26.3% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (61.1%, pass@3200), Proofnet-test (23.1%, pass@3200) and PutnamBench (8/644, pass@64).

Comment: The paper introduces an iterative self-play framework for theorem proving with LLMs, which aligns with foundational insights into training dynamics and use of LLMs for novel tasks.

Relevance: 9 Novelty: 8

7. LLM Safety Alignment is Divergence Estimation in Disguise

ArXiv ID: 2502.00657

Authors: Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing

Abstract: We propose a theoretical framework demonstrating that popular Large Language Model (LLM) alignment methods, including Reinforcement Learning from Human Feedback (RLHF) and alternatives, fundamentally function as divergence estimators between aligned (preferred or safe) and unaligned (less-preferred or harmful) distributions. This explains the separation phenomenon between safe and harmful prompts in the model hidden representation after alignment. Inspired by the theoretical results, we identify that some alignment methods are better than others in terms of separation and introduce a new method, KLDO, and further demonstrate the implication of our theories. We advocate for compliance-refusal datasets over preference datasets to enhance safety alignment, supported by both theoretical reasoning and empirical evidence. Additionally, to quantify safety separation, we leverage a distance metric in the representation space and statistically validate its efficacy as a statistically significant indicator of LLM resilience against jailbreak attacks.

Comment: The theoretical perspective connecting LLM safety alignment to divergence estimation offers a foundational insight into behavior and interpretability, which aligns well with LLM theoretical insights.

Relevance: 9 Novelty: 8

8. Nearly Lossless Adaptive Bit Switching

ArXiv ID: 2502.01199

Authors: Haiduo Huang, Zhenhua Liu, Tian Xia, Wenzhe Zhao, Pengju Ren

Abstract: Model quantization is widely applied for compressing and accelerating deep neural networks (DNNs). However, conventional Quantization-Aware Training (QAT) focuses on training DNNs with uniform bit-width. The bit-width settings vary across different hardware and transmission demands, which induces considerable training and storage costs. Hence, the scheme of one-shot joint training of multiple precisions is proposed to address this issue. Previous works either store a larger FP32 model to switch between different precision models for higher accuracy or store a smaller INT8 model but compromise accuracy due to using shared quantization parameters. In this paper, we introduce the Double Rounding quantization method, which fully utilizes the quantized representation range to accomplish nearly lossless bit-switching while reducing storage by using the highest integer precision instead of full precision. Furthermore, we observe a competitive interference among different precisions during one-shot joint training, primarily due to inconsistent gradients of quantization scales during backward propagation. To tackle this problem, we propose an Adaptive Learning Rate Scaling (ALRS) technique that dynamically adapts learning rates for various precisions to optimize the training process. Additionally, we extend our Double Rounding to one-shot mixed precision training and develop a Hessian-Aware Stochastic Bit-switching (HASB) strategy. Experimental results on ImageNet-1K classification demonstrate that our methods offer sufficient advantages over state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision settings. We also validate the feasibility of our method on detection and segmentation tasks, as well as on LLM tasks. Our code is available at https://github.com/haiduo/Double-Rounding.

Comment: Proposes nearly lossless bit-switching quantization and addresses inter-precision interference with theoretical contributions, relevant to model compression.

Relevance: 9 Novelty: 8

9. One-step full gradient suffices for low-rank fine-tuning, provably and efficiently

ArXiv ID: 2502.01235

Authors: Yuanhe Zhang, Fanghui Liu, Yudong Chen

Abstract: This paper studies how to improve the performance of Low-Rank Adaptation (LoRA) as guided by our theoretical analysis. Our first set of theoretical results shows that for random initialization and linear models, (i) LoRA will align to a certain singular subspace of the one-step gradient of full fine-tuning; (ii) preconditioners improve convergence in the high-rank case. These insights motivate us to focus on preconditioned LoRA using a specific spectral initialization strategy for aligning with certain subspaces. For both linear and nonlinear models, we prove that alignment and generalization guarantees can be directly achieved at initialization, and the subsequent linear convergence can also be established. Our analysis leads to the LoRA-One algorithm (using One-step gradient and preconditioning), a theoretically grounded algorithm that achieves significant empirical improvement over vanilla LoRA and its variants on several benchmarks. Our theoretical analysis, based on decoupling the learning dynamics and characterizing how spectral initialization contributes to feature learning, may be of independent interest for understanding matrix sensing and deep learning theory. The source code can be found at https://github.com/YuanheZ/LoRA-One.

Comment: This paper investigates low-rank fine-tuning and provides both theoretical and empirical insights into improving LoRA using spectral initialization. This is directly relevant to compression/efficiency breakthroughs.

Relevance: 9 Novelty: 8
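
To make the idea of a spectral initialization from a one-step gradient concrete, here is a rough sketch on a toy linear model. It initializes the low-rank factors from the top singular directions of the negative full fine-tuning gradient. The exact scaling and preconditioning used by LoRA-One are not stated in the abstract, so the square-root split of singular values, the step size, and all names below are assumptions.

```python
import numpy as np

def spectral_lora_init(grad, rank):
    """Factor the best rank-`rank` approximation of -grad as B @ A."""
    U, s, Vt = np.linalg.svd(-grad, full_matrices=False)
    root = np.sqrt(s[:rank])
    B = U[:, :rank] * root             # shape (d_in, rank)
    A = root[:, None] * Vt[:rank, :]   # shape (rank, d_out)
    return B, A

# Toy linear regression task whose ideal weight update has rank 4 (hypothetical setup).
rng = np.random.default_rng(0)
n, d_in, d_out, rank = 200, 16, 12, 4
X = rng.standard_normal((n, d_in))
W_pre = rng.standard_normal((d_in, d_out))   # stands in for pretrained weights
delta_true = rng.standard_normal((d_in, rank)) @ rng.standard_normal((rank, d_out))
Y = X @ (W_pre + delta_true)

def loss(W):
    """Mean over samples of half the squared error."""
    return 0.5 * np.mean(np.sum((X @ W - Y) ** 2, axis=1))

grad = X.T @ (X @ W_pre - Y) / n             # full fine-tuning gradient at W_pre
B, A = spectral_lora_init(grad, rank)

step = 0.1                                   # assumed step size
loss_before = loss(W_pre)
loss_after = loss(W_pre + step * (B @ A))
print(loss_before, loss_after)
```

Because the true update in this toy problem has rank 4, the negative gradient is itself rank 4, so the rank-4 factors reproduce it exactly and the initialized adapter already points in a descent direction.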

10. An Invitation to Neuroalgebraic Geometry

ArXiv ID: 2501.18915

Authors: Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathlén Kohn

Abstract: In this expository work, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, such as neural networks with polynomial activations, whose associated function spaces are semi-algebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.

Comment: This expository work creates a bridge between algebraic geometry and neural networks, focusing on theoretical underpinnings. Its foundational approach has strong potential for novel theoretical insights.

Relevance: 9 Novelty: 8

11. MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

ArXiv ID: 2502.00997

Authors: Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu

Abstract: The recent success of specialized Large Language Models (LLMs) in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.

Comment: Presents new methods for merging homogeneous and heterogeneous MoEs, directly aligning with the model architecture criterion, specifically innovations in Mixture-of-Experts.

Relevance: 9 Novelty: 8

12. Scaling Laws for Differentially Private Language Models

ArXiv ID: 2501.18914

Authors: Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, George Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang

Abstract: Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyper-parameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility tradeoffs and the optimal training configurations in many settings.

Comment: Examines scaling laws under differential privacy for LLMs, providing foundational insights into compute-privacy-utility tradeoffs, aligning well with the criterion for LLM theoretical contributions.

Relevance: 9 Novelty: 8

13. Symmetric Pruning of Large Language Models

ArXiv ID: 2501.18980

Authors: Kai Yi, Peter Richt\'arik

Abstract: Popular post-training pruning methods such as Wanda and RIA are known for their simple, yet effective, designs that have shown exceptional empirical performance. Wanda optimizes performance through calibrated activations during pruning, while RIA emphasizes the relative, rather than absolute, importance of weight elements. Despite their practical success, a thorough theoretical foundation explaining these outcomes has been lacking. This paper introduces new theoretical insights that redefine the standard minimization objective for pruning, offering a deeper understanding of the factors contributing to their success. Our study extends beyond these insights by proposing complementary strategies that consider both input activations and weight significance. We validate these approaches through rigorous experiments, demonstrating substantial enhancements over existing methods. Furthermore, we introduce a novel training-free fine-tuning approach $R^2$-DSnoT that incorporates relative weight importance and a regularized decision boundary within a dynamic pruning-and-growing framework, significantly outperforming strong baselines and establishing a new state of the art.

Comment: Addresses theoretical insights into pruning methods for Large Language Models (LLMs), directly matching the model compression topic and offering improvements to existing techniques.

Relevance: 9 Novelty: 8
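
For readers unfamiliar with the baseline named in the abstract above, here is a minimal sketch of the Wanda pruning score, |W_ij| times the L2 norm of input feature j over a calibration batch, with weights compared within each output row. It is not the paper's symmetric objective or $R^2$-DSnoT; the toy matrices and the helper names `wanda_scores` and `prune_per_output` are illustrative assumptions.

```python
import math

def wanda_scores(weights, activations):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2."""
    n_in = len(weights[0])
    # L2 norm of every input feature across the calibration samples
    col_norms = [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)]
    return [[abs(w) * col_norms[j] for j, w in enumerate(row)] for row in weights]

def prune_per_output(weights, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for w_row, s_row in zip(weights, scores):
        k = int(len(w_row) * sparsity)
        drop = set(sorted(range(len(w_row)), key=lambda j: s_row[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned

# Toy layer: 2 outputs, 4 inputs (values are made up for illustration)
W = [[0.9, -0.1, 0.5, 0.2],
     [0.1, 0.8, -0.3, 0.05]]
# Toy calibration batch: feature 1 is large, feature 2 is tiny
X = [[1.0, 10.0, 0.1, 1.0],
     [1.0, 12.0, 0.1, 1.0]]

pruned = prune_per_output(W, wanda_scores(W, X), sparsity=0.5)
```

In this toy case the small weight -0.1 survives because its input feature is large, while the larger weight 0.5 is dropped because its input is nearly silent; pure magnitude pruning would make the opposite choice.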

14. Self-Supervised Learning Using Nonlinear Dependence

ArXiv ID: 2501.18875

Authors: M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth

Abstract: Self-supervised learning has gained significant attention in contemporary applications, particularly due to the scarcity of labeled data. While existing SSL methodologies primarily address feature variance and linear correlations, they often neglect the intricate relations between samples and the nonlinear dependencies inherent in complex data. In this paper, we introduce Correlation-Dependence Self-Supervised Learning (CDSSL), a novel framework that unifies and extends existing SSL paradigms by integrating both linear correlations and nonlinear dependencies, encapsulating sample-wise and feature-wise interactions. Our approach incorporates the Hilbert-Schmidt Independence Criterion (HSIC) to robustly capture nonlinear dependencies within a Reproducing Kernel Hilbert Space, enriching representation learning. Experimental evaluations on diverse benchmarks demonstrate the efficacy of CDSSL in improving representation quality.

Comment: Presents a novel self-supervised technique leveraging nonlinear dependency, tying closely to representation learning and enriching feature encoding, which is foundational.

Relevance: 9 Novelty: 8
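
The HSIC term mentioned in the abstract above can be illustrated with the standard biased estimator, trace(K H L H) / (n - 1)^2, where K and L are Gram matrices and H is the centering matrix. This is a generic sketch under the assumption of Gaussian kernels with bandwidth 1, not the CDSSL loss itself; the function names are illustrative.

```python
import math

def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values] for a in values]

def center(gram):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_mean = [sum(r) / n for r in gram]
    col_mean = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - col_mean[j] + total for j in range(n)]
            for i in range(n)]

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    kc = center(gaussian_gram(x, sigma))
    lc = center(gaussian_gram(y, sigma))
    return sum(kc[i][j] * lc[j][i] for i in range(n) for j in range(n)) / (n - 1) ** 2

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_square = [v * v for v in x]      # depends on x, but linearly uncorrelated with it
y_const = [1.0] * len(x)           # carries no information about x

# Linear covariance between x and x^2 vanishes by symmetry
mean_y = sum(y_square) / len(x)
covariance = sum(a * (b - mean_y) for a, b in zip(x, y_square))

hsic_dependent = hsic(x, y_square)
hsic_constant = hsic(x, y_const)
```

The covariance is zero while HSIC stays positive, which is the kind of nonlinear dependence that variance and correlation terms alone would miss.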

15. Spectro-Riemannian Graph Neural Networks

ArXiv ID: 2502.00401

Authors: Karish Grover, Haiyang Yu, Xiang Song, Qi Zhu, Han Xie, Vassilis N. Ioannidis, Christos Faloutsos

Abstract: Can integrating spectral and curvature signals unlock new potential in graph representation learning? Non-Euclidean geometries, particularly Riemannian manifolds such as hyperbolic (negative curvature) and spherical (positive curvature), offer powerful inductive biases for embedding complex graph structures like scale-free, hierarchical, and cyclic patterns. Meanwhile, spectral filtering excels at processing signal variations across graphs, making it effective in homophilic and heterophilic settings. Leveraging both can significantly enhance the learned representations. To this end, we propose Spectro-Riemannian Graph Neural Networks (CUSP) - the first graph representation learning paradigm that unifies both CUrvature (geometric) and SPectral insights. CUSP is a mixed-curvature spectral GNN that learns spectral filters to optimize node embeddings in products of constant-curvature manifolds (hyperbolic, spherical, and Euclidean). Specifically, CUSP introduces three novel components: (a) Cusp Laplacian, an extension of the traditional graph Laplacian based on Ollivier-Ricci curvature, designed to capture the curvature signals better; (b) Cusp Filtering, which employs multiple Riemannian graph filters to obtain cues from various bands in the eigenspectrum; and (c) Cusp Pooling, a hierarchical attention mechanism combined with a curvature-based positional encoding to assess the relative importance of differently curved substructures in our graph. Empirical evaluation across eight homophilic and heterophilic datasets demonstrates the superiority of CUSP in node classification and link prediction tasks, with a gain of up to 5.3% over state-of-the-art models.

Comment: Introduces a novel graph neural network framework unifying spectral and curvature signals, which aligns with architectural innovation and foundational model design.

Relevance: 9 Novelty: 8

16. Activation by Interval-wise Dropout: A Simple Way to Prevent Neural Networks from Plasticity Loss

ArXiv ID: 2502.01342

Authors: Sangyeon Park, Isaac Han, Seungwon Oh, Kyung-Joong Kim

Abstract: Plasticity loss, a critical challenge in neural network training, limits a model's ability to adapt to new tasks or shifts in data distribution. This paper introduces AID (Activation by Interval-wise Dropout), a novel method inspired by Dropout, designed to address plasticity loss. Unlike Dropout, AID generates subnetworks by applying Dropout with different probabilities on each preactivation interval. Theoretical analysis reveals that AID regularizes the network, promoting behavior analogous to that of deep linear networks, which do not suffer from plasticity loss. We validate the effectiveness of AID in maintaining plasticity across various benchmarks, including continual learning tasks on standard image classification datasets such as CIFAR10, CIFAR100, and TinyImageNet. Furthermore, we show that AID enhances reinforcement learning performance in the Arcade Learning Environment benchmark.

Comment: The AID method introduces a novel dropout variation targeting training dynamics, aligning with foundational representation learning and training dynamics research.

Relevance: 9 Novelty: 8

17. TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs

ArXiv ID: 2501.19057

Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao

Abstract: Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout the training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimensions. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the training cost. TeZO can also be easily extended to the Adam variant while consuming less memory than MeZO-SGD, and requiring only about 35% of the memory of MeZO-Adam. Both comprehensive theoretical analysis and extensive experimental research have validated its efficiency, achieving SOTA-comparable results with lower overhead of time and memory.

Comment: Proposes a novel low-rank ZO estimator for temporal dimensions in fine-tuning LLMs. Directly relevant to model compression and parameter efficiency advances.

Relevance: 9 Novelty: 8
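
As background for the entry above, this is a minimal sketch of the plain two-point zeroth-order gradient estimate that MeZO-style methods build on: perturb the parameters along a random direction z, evaluate the loss twice, and scale z by the finite difference. It does not implement TeZO's CPD-based low-rank perturbations; the quadratic loss, step size, and seed are assumptions chosen only to keep the example self-contained.

```python
import random

def loss(theta):
    """Toy quadratic objective standing in for a fine-tuning loss."""
    return sum(t * t for t in theta)

def zo_gradient(loss_fn, theta, eps, rng):
    """Two-point estimate: ((L(theta + eps*z) - L(theta - eps*z)) / (2*eps)) * z."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss_fn([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss_fn([t - eps * zi for t, zi in zip(theta, z)])
    scale = (plus - minus) / (2 * eps)
    return [scale * zi for zi in z]

rng = random.Random(0)
theta = [1.0, -2.0, 3.0]
initial = loss(theta)

# Plain SGD driven only by forward evaluations of the loss
for _ in range(500):
    grad = zo_gradient(loss, theta, eps=1e-3, rng=rng)
    theta = [t - 0.01 * g for t, g in zip(theta, grad)]

final = loss(theta)
```

No backward pass is needed, so no activations have to be cached; the memory cost that low-rank variants target comes from the perturbation z and any optimizer state.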

18. Memory-Efficient Fine-Tuning of Transformers via Token Selection

ArXiv ID: 2501.18824

Authors: Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang

Abstract: Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is available at https://github.com/facebookresearch/tokentune.

Comment: Introduces TokenTune for memory-efficient fine-tuning of transformer models using token selection, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

19. RandLoRA: Full-rank parameter-efficient fine-tuning of large models

ArXiv ID: 2502.00987

Authors: Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad

Abstract: Low-Rank Adaptation (LoRA) and its variants have shown impressive results in reducing the number of trainable parameters and memory requirements of large transformer networks while maintaining fine-tuning performance. However, the low-rank nature of the weight update inherently limits the representation power of fine-tuned models, potentially compromising performance on complex tasks. This raises a critical question: when a performance gap between LoRA and standard fine-tuning is observed, is it due to the reduced number of trainable parameters or the rank deficiency? This paper aims to answer this question by introducing RandLoRA, a parameter-efficient method that performs full-rank updates using learned linear combinations of low-rank, non-trainable random matrices. Our method limits the number of trainable parameters by restricting optimization to diagonal scaling matrices applied to the fixed random matrices. This allows us to effectively overcome the low-rank limitations while maintaining parameter and memory efficiency during training. Through extensive experimentation across vision, language, and vision-language benchmarks, we systematically evaluate the limitations of LoRA and existing random basis methods. Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLoRA significantly reduces -- and sometimes eliminates -- the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.

Comment: RandLoRA proposes significant advancements in parameter-efficient methods by addressing the limitations of low-rank adaptations in fine-tuning using full-rank optimization. Relevant to compression and efficiency topics like low-rank approaches.

Relevance: 9 Novelty: 8
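
The idea in the abstract above, a full-rank update assembled from fixed low-rank random matrices with only diagonal scalings trained, can be sketched as follows. This is an illustration under simplifying assumptions (one scaling vector per basis, rank-1 bases, a 4 x 4 layer), not the authors' exact parameterization; `weight_update` and `rank` are hypothetical helper names.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rows, cols, found = len(m), len(m[0]), 0
    for c in range(cols):
        if found == rows:
            break
        pivot = max(range(found, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for i in range(found + 1, rows):
            factor = m[i][c] / m[found][c]
            for j in range(c, cols):
                m[i][j] -= factor * m[found][j]
        found += 1
    return found

rng = random.Random(0)
d, r, n_bases = 4, 1, 4

# Fixed, non-trainable random factors: B_i is d x r, A_i is r x d
bases = [([[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)],
          [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)])
         for _ in range(n_bases)]
# The only trainable parameters: one diagonal scaling (length r) per basis
scales = [[1.0] * r for _ in range(n_bases)]

def weight_update(bases, scales):
    """Delta W = sum_i B_i diag(s_i) A_i."""
    total = [[0.0] * d for _ in range(d)]
    for (b, a), s in zip(bases, scales):
        scaled_a = [[s[k] * a[k][j] for j in range(d)] for k in range(r)]
        term = matmul(b, scaled_a)
        total = [[total[i][j] + term[i][j] for j in range(d)] for i in range(d)]
    return total

single_rank = rank(weight_update(bases[:1], scales[:1]))
full_rank = rank(weight_update(bases, scales))
n_trainable = n_bases * r
```

A single term has rank r, while the sum of enough independent random terms reaches full rank, here with 4 trainable scalars instead of the 16 entries of a dense update.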

20. Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective

ArXiv ID: 2502.00604

Authors: Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, Paris Perdikaris

Abstract: Multi-task learning through composite loss functions is fundamental to modern deep learning, yet optimizing competing objectives remains challenging. We present new theoretical and practical approaches for addressing directional conflicts between loss terms, demonstrating their effectiveness in physics-informed neural networks (PINNs) where such conflicts are particularly challenging to resolve. Through theoretical analysis, we demonstrate how these conflicts limit first-order methods and show that second-order optimization naturally resolves them through implicit gradient alignment. We prove that SOAP, a recently proposed quasi-Newton method, efficiently approximates the Hessian preconditioner, enabling breakthrough performance in PINNs: state-of-the-art results on 10 challenging PDE benchmarks, including the first successful application to turbulent flows with Reynolds numbers up to 10,000, with 2-10x accuracy improvements over existing methods. We also introduce a novel gradient alignment score that generalizes cosine similarity to multiple gradients, providing a practical tool for analyzing optimization dynamics. Our findings establish frameworks for understanding and resolving gradient conflicts, with broad implications for optimization beyond scientific computing.

Comment: Breakthrough use of second-order optimization in physics-informed neural networks aligns with tackling gradient alignment and efficiency challenges, making it relevant to foundational optimization issues.

Relevance: 8 Novelty: 9

21. A theoretical framework for overfitting in energy-based modeling

ArXiv ID: 2501.19158

Authors: Giovanni Catania, Aur\'elien Decelle, Cyril Furtlehner, Beatriz Seoane

Abstract: We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models.

Comment: Offers a theoretical framework for overfitting in energy-based models and discusses learning dynamics related to spectral decomposition, which is highly aligned with foundational research in representation learning.

Relevance: 9 Novelty: 8

22. A Communication Framework for Compositional Generation

ArXiv ID: 2501.19182

Authors: Rafael Elberg, Mircea Petrache, Denis Parra

Abstract: Compositionality and compositional generalization--the ability to understand novel combinations of known concepts--are central characteristics of human language and are hypothesized to be essential for human cognition. In machine learning, the emergence of this property has been studied in a communication game setting, where independent agents (a sender and a receiver) converge to a shared encoding policy from a set of states to a space of discrete messages, where the receiver can correctly reconstruct the states observed by the sender using only the sender's messages. The use of communication games in generation tasks is still largely unexplored, with recent methods for compositional generation focusing mainly on the use of supervised guidance (either through class labels or text). In this work, we take the first steps to fill this gap, and we present a self-supervised generative communication game-based framework for creating compositional encodings in learned representations from pre-trained encoder-decoder models. In an Iterated Learning (IL) protocol involving a sender and a receiver, we apply alternating pressures for compression and diversity of encoded discrete messages, so that the protocol converges to an efficient but unambiguous encoding. Approximate message entropy regularization is used to favor compositional encodings. Our framework is based on rigorous justifications and proofs of defining and balancing the concepts of Efficiency, Unambiguity and Non-Holisticity in encoding. We test our method on the compositional image dataset Shapes3D, demonstrating robust performance in both reconstruction and compositionality metrics, surpassing other tested discrete message frameworks.

Comment: Explores compositional encodings in learned representations using a communication game framework, directly tying into representation learning and advancing insights on compositionality.

Relevance: 9 Novelty: 8

23. ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

ArXiv ID: 2502.01100

Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi

Abstract: We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.

Comment: Presents a framework for evaluating reasoning capabilities of LLMs under complexity scaling, providing theoretical insights into their limits, aligning with foundational LLM research.

Relevance: 9 Novelty: 8
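
To make the "logic grid puzzle as CSP" framing of the entry above concrete, here is a tiny brute-force sketch. The puzzle, its clues, and the use of (N!)^M as the search-space size for N houses and M attributes are illustrative assumptions, not taken from the ZebraLogic dataset or its generator.

```python
from itertools import permutations
from math import factorial

def search_space(n_houses, n_attributes):
    """Number of candidate assignments: one permutation per attribute."""
    return factorial(n_houses) ** n_attributes

def solve():
    names = ("Alice", "Bob", "Carol")
    pets = ("cat", "dog", "fish")
    solutions = []
    for name_order in permutations(names):
        for pet_order in permutations(pets):
            # Clue 1: Alice lives in the first house.
            if name_order[0] != "Alice":
                continue
            # Clue 2: the dog lives immediately to the right of Alice.
            if pet_order.index("dog") != name_order.index("Alice") + 1:
                continue
            # Clue 3: Carol owns the fish.
            if pet_order.index("fish") != name_order.index("Carol"):
                continue
            solutions.append((name_order, pet_order))
    return solutions

space = search_space(3, 2)
solutions = solve()
```

Exhaustive search is trivial at this size, but the candidate count grows factorially with the number of houses and exponentially with the number of attributes, which is what makes difficulty easy to dial up.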

24. What is a Number, That a Large Language Model May Know It?

ArXiv ID: 2502.01540

Authors: Raja Marjieh, Veniamin Veselovsky, Thomas L. Griffiths, Ilia Sucholutsky

Abstract: Numbers are a basic part of how humans represent and describe the world around them. As a consequence, learning effective representations of numbers is critical for the success of large language models as they become more integrated into everyday decisions. However, these models face a challenge: depending on context, the same sequence of digit tokens, e.g., 911, can be treated as a number or as a string. What kind of representations arise from this duality, and what are its downstream implications? Using a similarity-based prompting technique from cognitive science, we show that LLMs learn representational spaces that blend string-like and numerical representations. In particular, we show that elicited similarity judgments from these models over integer pairs can be captured by a combination of Levenshtein edit distance and numerical Log-Linear distance, suggesting an entangled representation. In a series of experiments we show how this entanglement is reflected in the latent embeddings, how it can be reduced but not entirely eliminated by context, and how it can propagate into a realistic decision scenario. These results shed light on a representational tension in transformer models that must learn what a number is from text input.

Comment: Examines numerical representation in LLMs and blends cognitive science approaches, highly relevant to foundational representation learning in LLMs.

Relevance: 9 Novelty: 8
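
The string-versus-number tension described in the abstract above can be seen with a small sketch that blends Levenshtein edit distance over digit strings with a distance on the log scale. The specific form |log x - log y| and the mixing weight `w` are assumptions for illustration; the paper's fitted combination may differ.

```python
import math

def levenshtein(a, b):
    """Classic edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def log_distance(x, y):
    """Distance between two positive numbers on a logarithmic scale."""
    return abs(math.log(x) - math.log(y))

def blended_distance(x, y, w=0.5):
    """Weighted mix of string-like and number-like dissimilarity."""
    return w * levenshtein(str(x), str(y)) + (1 - w) * log_distance(x, y)

string_far_number_close = (levenshtein("99", "100"), log_distance(99, 100))
string_close_number_far = (levenshtein("100", "900"), log_distance(100, 900))
```

The pair 99 and 100 is numerically adjacent yet three edits apart as strings, while 100 and 900 differ by one edit but by a factor of nine; a model whose similarity judgments track both signals would show the entanglement the abstract reports.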

25. A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

ArXiv ID: 2502.01618

Authors: Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava

Abstract: Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information are available at https://probabilistic-inference-scaling.github.io.

Comment: The paper introduces a novel inference-time scaling method using particle-based Monte Carlo techniques for LLMs, offering possible breakthroughs in efficiency and robustness, relevant to inference optimization.

Relevance: 9 Novelty: 8
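
A generic particle filter loop shows the contrast the abstract above draws between sampling and mode-seeking search: each partial solution is extended, scored, and then resampled in proportion to its weight rather than greedily keeping the top candidates. This sketch assumes softmax weights over a scalar reward; `extend` and `score` are hypothetical stand-ins for an LLM step and a reward model.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def resample(particles, scores, rng):
    """Draw len(particles) survivors with probability proportional to softmax(score)."""
    return rng.choices(particles, weights=softmax(scores), k=len(particles))

def particle_filter(extend, score, steps, n_particles, rng):
    particles = [()] * n_particles
    for _ in range(steps):
        particles = [extend(p, rng) for p in particles]                     # propose
        particles = resample(particles, [score(p) for p in particles], rng)  # reweight
    return particles

def extend(partial, rng):
    """Hypothetical generation step: append one random digit."""
    return partial + (rng.randint(0, 9),)

def score(partial):
    """Hypothetical reward: larger digit sums are preferred."""
    return float(sum(partial))

final = particle_filter(extend, score, steps=5, n_particles=8, rng=random.Random(0))
survivors = resample(["a", "b", "c"], [0.0, 1000.0, 0.0], random.Random(1))
```

Because survivors are drawn stochastically, a moderately scored particle can persist, which is the property that makes such a scheme less brittle to reward-model errors than keeping only the current best.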

26. Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

ArXiv ID: 2502.00620

Authors: Yihao Xue, Jiping Li, Baharan Mirzasoleiman

Abstract: Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.

Comment: Offers theoretical insights into weak-to-strong generalization using representation kernels, providing a fresh perspective relevant to representation learning.

Relevance: 8 Novelty: 8

27. E2Former: A Linear-time Efficient and Equivariant Transformer for Scalable Molecular Modeling

ArXiv ID: 2501.19216

Authors: Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Jia Zhang, Mark Gerstein, Tao Qin

Abstract: Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $ O(| \mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.

Comment: Introduces an equivariant transformer architecture (E2Former) for molecular modeling, presenting significant computational gains and addressing model efficiency. This aligns well with AI foundational architectural innovations.

Relevance: 8 Novelty: 8

28. Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning

ArXiv ID: 2502.00511

Authors: Zhi Zhou, Tan Yuhao, Zenan Li, Yuan Yao, Lan-Zhe Guo, Xiaoxing Ma, Yu-Feng Li

Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, single-shot inference often yields unreliable results for complex reasoning tasks, leading researchers to explore multiple reasoning paths through methods such as perplexity and self-consistency. In this paper, we present the first theoretical error decomposition analysis of these techniques, breaking down their error into estimation error and model error. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function, while self-consistency exhibits high estimation error due to a slow error convergence rate. To overcome these limitations, we propose Reasoning-Pruning Perplexity Consistency (RPC). This approach combines Perplexity Consistency, which seamlessly integrates LLM perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths to effectively prevent the degeneration of estimation error reduction. Theoretical analysis demonstrates that RPC not only accelerates the convergence rate of estimation error to an exponential level but also holds strong potential for further reducing model error. Extensive empirical evaluations on seven benchmark datasets confirm that RPC can significantly improve reasoning performance, sample efficiency, and confidence reliability.

Comment: The paper addresses reasoning performance along theoretical lines with new error decomposition techniques and methodology related to LLMs, aligning with the LLM behavior/interpretability criterion.

Relevance: 8 Novelty: 8
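
The two families of methods compared in the abstract above can be contrasted in a few lines: self-consistency counts how often each final answer appears among sampled reasoning paths, while a probability-based vote sums the model's own sequence probabilities per answer. This is a hedged illustration, not the paper's RPC method; the sample answers and log-probabilities are made up.

```python
import math
from collections import Counter, defaultdict

def self_consistency(answers):
    """Majority vote over the final answers of sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]

def probability_weighted_vote(samples):
    """Sum exp(log-probability) of each path, grouped by its final answer."""
    mass = defaultdict(float)
    for answer, logprob in samples:
        mass[answer] += math.exp(logprob)
    return max(mass, key=mass.get)

# (final answer, sequence log-probability) for four sampled paths
samples = [("42", -1.0), ("41", -3.0), ("41", -3.5), ("40", -6.0)]

majority = self_consistency([answer for answer, _ in samples])
weighted = probability_weighted_vote(samples)
```

With few samples the two rules can disagree: counting favors the answer that happened to be drawn twice, while the model's own probabilities favor a single high-confidence path.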

29. Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers

ArXiv ID: 2502.00213

Authors: Akiyoshi Tomihari, Issei Sato

Abstract: Transformer models are challenging to optimize with SGD and typically require adaptive optimizers such as Adam. However, the reasons behind the superior performance of Adam over SGD remain unclear. In this study, we investigate the optimization of transformer models by focusing on gradient heterogeneity, defined as the disparity in gradient norms among parameters. Our analysis shows that gradient heterogeneity hinders gradient-based optimization, including SGD, while sign-based optimization, a simplified variant of Adam, is less affected. We further examine gradient heterogeneity in transformer models and show that it is influenced by the placement of layer normalization. Additionally, we show that the momentum term in sign-based optimization is important for preventing the excessive growth of linear-head parameters in tasks with many classes. Experimental results from fine-tuning transformer models in both NLP and vision domains validate our theoretical analyses. This study provides insights into the optimization challenges of transformer models and offers guidance for designing future optimization algorithms. Code is available at https://github.com/tom4649/gradient-heterogeneity.

Comment: The paper provides insight into why Adam outperforms SGD in transformer training, contributing to foundational understanding of optimizer behavior in model training dynamics.

Relevance: 8 Novelty: 8
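
A toy comparison illustrates why sign-based updates are less sensitive to gradient heterogeneity than SGD, and why sign descent can be read as a simplified Adam: with both moment decay rates set to zero and no epsilon, Adam's update g / sqrt(g^2) reduces to sign(g). The gradient values and learning rate below are made up for illustration.

```python
def sgd_step(params, grads, lr):
    """Plain gradient descent: step size scales with gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]

def sign_step(params, grads, lr):
    """Sign descent: every coordinate moves by exactly lr."""
    return [p - lr * ((g > 0) - (g < 0)) for p, g in zip(params, grads)]

def adam_without_memory(params, grads, lr):
    """Adam with beta1 = beta2 = 0 and eps = 0: update is g / sqrt(g^2)."""
    return [p - lr * (0.0 if g == 0 else g / abs(g)) for p, g in zip(params, grads)]

params = [0.0, 0.0]
grads = [100.0, 0.01]   # two parameter groups with very different gradient scales
lr = 0.1

after_sgd = sgd_step(params, grads, lr)
after_sign = sign_step(params, grads, lr)
after_adam = adam_without_memory(params, grads, lr)
```

Under SGD one learning rate must serve coordinates whose updates differ by four orders of magnitude, so it is either unstable for the first or negligible for the second; the sign-based update moves both by the same amount.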

30. LLM Program Optimization via Retrieval Augmented Search

ArXiv ID: 2501.18916

Authors: Sagnik Anupam, Alexander Shypula, Osbert Bastani

Abstract: With the advent of large language models (LLMs), there has been a great deal of interest in applying them to solve difficult programming tasks. Recent work has demonstrated their potential at program optimization, a key challenge in programming languages research. We propose a blackbox adaptation method called Retrieval Augmented Search (RAS) that performs beam search over candidate optimizations; at each step, it retrieves in-context examples from a given training dataset of slow-fast program pairs to guide the LLM. Critically, we find that performing contextual retrieval based on an LLM-generated natural language description significantly outperforms retrieval based on the source code. In addition, we propose a method called AEGIS for improving interpretability by decomposing training examples into "atomic edits" that are significantly more incremental in nature. We show that RAS performs 1.8$\times$ better than prior state-of-the-art blackbox adaptation strategies, and that AEGIS performs 1.37$\times$ better while performing significantly smaller edits.

Comment: Proposes a blackbox adaptation method (Retrieval Augmented Search) that uses LLMs for program optimization, along with AEGIS, which decomposes training examples into more incremental "atomic edits"; relevant to LLM efficiency and interpretability.

Relevance: 8 Novelty: 8

31. Efficient Reasoning with Hidden Thinking

ArXiv ID: 2501.19201

Authors: Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

Abstract: Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as hidden llama), an efficient reasoning framework that leverages reasoning CoTs in a hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design a corresponding Heima Decoder with traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs. Experimental results across diverse reasoning MLLM benchmarks demonstrate that the Heima model achieves higher generation efficiency while maintaining or even improving zero-shot task accuracy. Moreover, the effective reconstruction of multimodal reasoning processes with the Heima Decoder validates both the robustness and interpretability of our approach.

Comment: The Heima framework introduces hidden latent reasoning representations for CoT reasoning in LLMs, aligning closely with efficiency breakthroughs in reasoning methods.

Relevance: 9 Novelty: 7

32. Demystifying MPNNs: Message Passing as Merely Efficient Matrix Multiplication

ArXiv ID: 2502.00140

Authors: Qin Jiang, Chengjia Wang, Michael Lones, Wei Pang

Abstract: While Graph Neural Networks (GNNs) have achieved remarkable success, their design largely relies on empirical intuition rather than theoretical understanding. In this paper, we present a comprehensive analysis of GNN behavior through three fundamental aspects: (1) we establish that \textbf{$k$-layer} Message Passing Neural Networks efficiently aggregate \textbf{$k$-hop} neighborhood information through iterative computation, (2) analyze how different loop structures influence neighborhood computation, and (3) examine behavior across structure-feature hybrid and structure-only tasks. For deeper GNNs, we demonstrate that gradient-related issues, rather than just over-smoothing, can significantly impact performance in sparse graphs. We also analyze how different normalization schemes affect model performance and how GNNs make predictions with uniform node features, providing a theoretical framework that bridges the gap between empirical success and theoretical understanding.

Comment: Provides theoretical analysis relating Message Passing Neural Networks to efficient matrix multiplication, contributing to foundational understanding of graph neural networks.

Relevance: 8 Novelty: 8
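
The $k$-layer / $k$-hop claim in entry 32 can be illustrated with a minimal sketch. It assumes plain sum-aggregation with self-loops and no learned weights or nonlinearities, so it is a simplification rather than the paper's analysis. The helper names (`matmul`, `message_passing`) and the toy path graph are hypothetical.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def message_passing(adj, feats, layers):
    """Apply `layers` rounds of sum-aggregation: feats <- adj @ feats."""
    for _ in range(layers):
        feats = matmul(adj, feats)
    return feats


# Toy path graph 0 - 1 - 2 - 3, with self-loops on the diagonal (assumption).
adj = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
# One-hot features: after k layers, row i is nonzero exactly at the nodes
# within k hops of node i.
feats = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

two_hop = message_passing(adj, feats, 2)
three_hop = message_passing(adj, feats, 3)
print([v > 0 for v in two_hop[0]])    # node 3 is not yet visible from node 0
print([v > 0 for v in three_hop[0]])  # node 3 becomes visible at 3 layers
```

With one-hot inputs the result after $k$ layers is simply the $k$-th power of the adjacency matrix, which is why its sparsity pattern tracks the $k$-hop neighborhood.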

33. GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation

ArXiv ID: 2502.01113

Authors: Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan

Abstract: Retrieval-augmented generation (RAG) has proven effective in integrating knowledge into large language models (LLMs). However, conventional RAGs struggle to capture complex relationships between pieces of knowledge, limiting their performance in intricate reasoning that requires integrating knowledge from multiple sources. Recently, graph-enhanced retrieval augmented generation (GraphRAG) builds graph structure to explicitly model these relationships, enabling more effective and efficient retrievers. Nevertheless, its performance is still hindered by the noise and incompleteness within the graph structure. To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage training process on large-scale datasets, comprising 60 knowledge graphs with over 14M triples and 700k documents. This results in impressive performance and generalizability for GFM-RAG, making it the first graph foundation model applicable to unseen datasets for retrieval without any fine-tuning required. Extensive experiments on three multi-hop QA datasets and seven domain-specific RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws, highlighting its potential for further improvement.

Comment: Proposes a novel graph-enhanced retrieval-augmented generation model that builds on foundational architecture concepts like graph neural networks, which aligns with model architecture relevance.

Relevance: 8 Novelty: 8

34. Masked Generative Nested Transformers with Decode Time Scaling

ArXiv ID: 2502.00382

Authors: Sahil Goyal, Debapriya Tula, Gagan Jain, Pradeep Shenoy, Prateek Jain, Sujoy Paul

Abstract: Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem - a bottleneck of inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iterations, which makes it computationally expensive. In this work, we aim to address this issue primarily through two key ideas - (a) not all parts of the generation process need equal compute, and we design a decode time model scaling schedule to utilize compute effectively, and (b) we can cache and reuse some of the computation. Combining these two ideas leads to using smaller models to process more tokens while large models process fewer tokens. These different-sized models do not increase the parameter size, as they share parameters. We rigorously experiment with ImageNet256$\times$256, UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost $3\times$ less compute than baseline, our model obtains competitive performance.

Comment: Focuses on decode-time scaling in nested transformers for visual generation tasks, aligning closely with the prompt’s interest in compute efficiency and transformer architecture innovations.

Relevance: 9 Novelty: 7

35. HoP: Homeomorphic Polar Learning for Hard Constrained Optimization

ArXiv ID: 2502.00304

Authors: Ke Deng, Hanwen Zhang, Jin Lu, Haijian Sun

Abstract: Constrained optimization demands highly efficient solvers, which promotes the development of learn-to-optimize (L2O) approaches. As a data-driven method, L2O leverages neural networks to efficiently produce approximate solutions. However, a significant challenge remains in ensuring both optimality and feasibility of neural networks' output. To tackle this issue, we introduce Homeomorphic Polar Learning (HoP) to solve star-convex hard-constrained optimization by embedding homeomorphic mapping in neural networks. The bijective structure enables end-to-end training without extra penalty or correction. We evaluate HoP across a variety of synthetic optimization tasks and real-world applications in wireless communications. In all cases, HoP achieves solutions closer to the optimum than existing L2O methods while strictly maintaining feasibility.

Comment: Proposes a constrained optimization approach embedding homeomorphic mapping into neural networks, relevant to efficiency and representation learning with a novel L2O formulation.

Relevance: 8 Novelty: 8

36. GRADIEND: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models

ArXiv ID: 2502.01406

Authors: Jonathan Drechsel, Steffen Herbold

Abstract: AI systems frequently exhibit and amplify social biases, including gender bias, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information. We show that our method can be used to debias transformer-based language models, while maintaining other capabilities. We demonstrate the effectiveness of our approach across multiple encoder-only based models and highlight its potential for broader applications.

Comment: Introduces an encoding-decoding mechanism for gender debiasing in transformer models, directly relevant to representation learning in foundational transformer-based research.

Relevance: 9 Novelty: 7

37. eagle: early approximated gradient based learning rate estimator

ArXiv ID: 2502.01036

Authors: Takumi Fujimoto, Hiroaki Nishi

Abstract: We propose the EAGLE update rule, a novel optimization method that accelerates loss convergence during the early stages of training by leveraging both current and previous step parameter and gradient values. The update algorithm estimates optimal parameters by computing the changes in parameters and gradients between consecutive training steps and leveraging the local curvature of the loss landscape derived from these changes. However, this update rule has potential instability, and to address that, we introduce an adaptive switching mechanism that dynamically selects between Adam and EAGLE update rules to enhance training stability. Experiments on standard benchmark datasets demonstrate that the EAGLE optimizer, which combines this novel update rule with the switching mechanism, achieves rapid training loss convergence with fewer epochs compared to conventional optimization methods.

Comment: The EAGLE optimizer introduces a novel optimization method featuring adaptive switching, relevant to training dynamics and efficiency.

Relevance: 8 Novelty: 8
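
The idea in entry 37 of deriving local curvature from changes in parameters and gradients between consecutive steps can be sketched in one dimension as a secant-style update. This is an assumption-laden illustration, not the EAGLE rule itself. The function `secant_step`, the toy quadratic loss, and the fallback to a plain gradient step (standing in for the Adam branch of the switching mechanism) are all hypothetical.

```python
def secant_step(theta_prev, grad_prev, theta, grad, lr=0.1, eps=1e-12):
    """Estimate curvature as (change in gradient) / (change in parameter)
    and jump to the minimum of the implied local quadratic.

    Falls back to a plain gradient step when the curvature estimate is
    unusable (no parameter change, or non-positive curvature).
    """
    d_theta = theta - theta_prev
    d_grad = grad - grad_prev
    if abs(d_theta) < eps or d_grad / d_theta <= 0.0:
        return theta - lr * grad          # fallback branch (assumption)
    curvature = d_grad / d_theta
    return theta - grad / curvature       # Newton-like step with secant curvature


def loss_grad(theta):
    """Gradient of the toy loss 3 * (theta - 2)^2, minimized at theta = 2."""
    return 6.0 * (theta - 2.0)


theta_prev = 0.0
theta = theta_prev - 0.1 * loss_grad(theta_prev)   # one ordinary gradient step
theta_next = secant_step(theta_prev, loss_grad(theta_prev), theta, loss_grad(theta))
print(theta_next)
```

On an exactly quadratic loss the secant curvature equals the true curvature, so a single such step lands on the minimizer up to floating-point error. On non-quadratic losses the estimate can be poor, which is the kind of instability a switching mechanism is meant to guard against.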

38. Redefining Machine Unlearning: A Conformal Prediction-Motivated Approach

ArXiv ID: 2501.19403

Authors: Yingdan Shi, Ren Wang

Abstract: Machine unlearning seeks to systematically remove specified data from a trained model, effectively achieving a state as though the data had never been encountered during training. While metrics such as Unlearning Accuracy (UA) and Membership Inference Attack (MIA) provide a baseline for assessing unlearning performance, they fall short of evaluating the completeness and reliability of forgetting. This is because the ground truth labels remain potential candidates within the scope of uncertainty quantification, leaving gaps in the evaluation of true forgetting. In this paper, we identify critical limitations in existing unlearning metrics and propose enhanced evaluation metrics inspired by conformal prediction. Our metrics can effectively capture the extent to which ground truth labels are excluded from the prediction set. Furthermore, we observe that many existing machine unlearning methods do not achieve satisfactory forgetting performance when evaluated with our new metrics. To address this, we propose an unlearning framework that integrates conformal prediction insights into Carlini & Wagner adversarial attack loss. Extensive experiments on the image classification task demonstrate that our enhanced metrics offer deeper insights into unlearning effectiveness, and that our unlearning framework significantly improves the forgetting quality of unlearning methods.

Comment: The paper redefines machine unlearning metrics with a novel conformal prediction approach and proposes an improved unlearning framework, contributing to more reliable evaluation of forgetting.

Relevance: 8 Novelty: 8
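
To make the notion in entry 38 of ground truth labels being "excluded from the prediction set" concrete, here is a minimal sketch of standard split conformal prediction. It is not the paper's proposed metrics. The function names, the toy calibration data, and the choice of score (one minus the predicted probability of a label) are assumptions for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split conformal threshold from calibration scores 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))  # finite-sample correction
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return {k for k, p in enumerate(probs) if 1.0 - p <= threshold}


# Hypothetical calibration data: 4 samples, 3 classes.
cal_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
cal_labels = [0, 1, 2, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

# A sample with true label 0, before and after a hypothetical unlearning step.
before = prediction_set([0.6, 0.3, 0.1], threshold)
after = prediction_set([0.05, 0.55, 0.4], threshold)
print(0 in before, 0 in after)
```

Under this view, a forgotten sample is one whose true label no longer appears in its prediction set, which is a stricter test than checking that the top-1 prediction changed.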

39. Compact Rule-Based Classifier Learning via Gradient Descent

ArXiv ID: 2502.01375

Authors: Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez

Abstract: Rule-based models play a crucial role in scenarios that require transparency and accountable decision-making. However, they primarily consist of discrete parameters and structures, which presents challenges for scalability and optimization. In this work, we introduce a new rule-based classifier trained using gradient descent, in which the user can control the maximum number and length of the rules. For numerical partitions, the user can also control the partitions used with fuzzy sets, which also helps keep the number of partitions small. We perform a series of exhaustive experiments on $40$ datasets to show how this classifier performs in terms of accuracy and rule base size. Then, we compare our results with a genetic search that fits an equivalent classifier and with other explainable and non-explainable state-of-the-art classifiers. Our results show how our method can obtain compact rule bases that use significantly fewer patterns than other rule-based methods and perform better than other explainable classifiers.

Comment: The proposed compact rule-based classifier is relevant for model efficiency and aligns with topics like sparsity and gradient-based optimization innovations.

Relevance: 8 Novelty: 7

40. RIGNO: A Graph-based framework for robust and accurate operator learning for PDEs on arbitrary domains

ArXiv ID: 2501.19205

Authors: Sepehr Mousavi, Shizheng Wen, Levi Lingsch, Maximilian Herde, Bogdan Raonić, Siddhartha Mishra

Abstract: Learning the solution operators of PDEs on arbitrary domains is challenging due to the diversity of possible domain shapes, in addition to the often intricate underlying physics. We propose an end-to-end graph neural network (GNN) based neural operator to learn PDE solution operators from data on point clouds in arbitrary domains. Our multi-scale model maps data between input/output point clouds by passing it through a downsampled regional mesh. Many novel elements are also incorporated to ensure resolution invariance and temporal continuity. Our model, termed RIGNO, is tested on a challenging suite of benchmarks, composed of various time-dependent and steady PDEs defined on a diverse set of domains. We demonstrate that RIGNO is significantly more accurate than neural operator baselines and robustly generalizes to unseen spatial resolutions and time instances.

Comment: RIGNO utilizes GNNs for operator learning in PDEs, a novel framework that aligns with emerging trends in neural operators for scientific modeling, suggesting relevance to AI for science.

Relevance: 8 Novelty: 7

41. Understanding Generalization in Physics Informed Models through Affine Variety Dimensions

ArXiv ID: 2501.18879

Authors: Takeshi Koshizuka, Issei Sato

Abstract: In recent years, physics-informed machine learning has gained significant attention for its ability to enhance statistical performance and sample efficiency by integrating physical structures into machine learning models. These structures, such as differential equations, conservation laws, and symmetries, serve as inductive biases that can improve the generalization capacity of the hybrid model. However, the mechanisms by which these physical structures enhance generalization capacity are not fully understood, limiting the ability to guarantee the performance of the models. In this study, we show that the generalization performance of linear regressors incorporating differential equation structures is determined by the dimension of the associated affine variety, rather than the number of parameters. This finding enables a unified analysis of various equations, including nonlinear ones. We introduce a method to approximate the dimension of the affine variety and provide experimental evidence to validate our theoretical insights.

Comment: By analyzing generalization in physics-informed machine learning models, this work advances theoretical understanding consistent with AI for foundational scientific modeling.

Relevance: 7 Novelty: 8

42. Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

ArXiv ID: 2502.00594

Authors: Saarthak Kapse, Robin Betz, Srinivasan Sivanandan

Abstract: State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from $L$ sequential steps to $\log(L)$ parallel steps with respect to the number of input tokens ($L$). In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2$\times$ reduction in the number of parallel steps in the SSM block. Our model offers up to $72.5\%$ speedup in inference speed compared to baseline Vision Mamba models on high resolution (2048$\times$2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection. Code is made available at https://github.com/insitro/FastVim

Comment: Introduces architectural enhancements for State Space Models in vision tasks, which aligns with the architectural innovation criterion. The technique improves efficiency but does not significantly advance foundational model theory.

Relevance: 8 Novelty: 7
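
The 2$\times$ reduction in parallel scan steps quoted in entry 42 follows from a short calculation, sketched below. It assumes a square $N \times N$ token grid and that pooling collapses one spatial dimension entirely, leaving $N$ tokens. Both the symbol $N$ and the numeric example are illustrative assumptions, not figures from the paper.

```latex
% Assumption: N x N token grid, so L = N^2 tokens before pooling.
\[
\underbrace{\log_2 L = \log_2 (N^2) = 2\log_2 N}_{\text{no pooling}}
\quad\longrightarrow\quad
\underbrace{\log_2 N}_{\text{after pooling one dimension}}
\]
% Hypothetical example: N = 128 gives 14 parallel steps versus 7.
```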

43. Understanding Oversmoothing in GNNs as Consensus in Opinion Dynamics

ArXiv ID: 2501.19089

Authors: Keqin Wang, Yulong Yang, Ishan Saha, Christine Allen-Blanchette

Abstract: In contrast to classes of neural networks where the learned representations become increasingly expressive with network depth, the learned representations in graph neural networks (GNNs) tend to become increasingly similar. This phenomenon, known as oversmoothing, is characterized by learned representations that cannot be reliably differentiated, leading to reduced predictive performance. In this paper, we propose an analogy between oversmoothing in GNNs and consensus or agreement in opinion dynamics. Through this analogy, we show that the message passing structure of recent continuous-depth GNNs is equivalent to a special case of opinion dynamics (i.e., linear consensus models) which has been theoretically proven to converge to consensus (i.e., oversmoothing) for all inputs. Using the understanding developed through this analogy, we design a new continuous-depth GNN model based on nonlinear opinion dynamics and prove that our model, which we call behavior-inspired message passing neural network (BIMP), circumvents oversmoothing for general inputs. Through extensive experiments, we show that BIMP is robust to oversmoothing and adversarial attack, and consistently outperforms competitive baselines on numerous benchmarks.

Comment: Connects GNN oversmoothing with opinion dynamics and proposes a new model that addresses oversmoothing issues. Relevant for understanding and innovating architectures.

Relevance: 8 Novelty: 7
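
The link in entry 43 between linear consensus and oversmoothing can be seen in a small simulation. This is a minimal sketch of generic linear consensus dynamics with scalar node features and an Euler discretization. It is not the BIMP model, and the function name, graph, step size, and initial values are assumptions.

```python
def consensus_step(adj, x, step):
    """One Euler step of linear consensus:
    x_i <- x_i + step * sum_j a_ij * (x_j - x_i)."""
    n = len(x)
    return [x[i] + step * sum(adj[i][j] * (x[j] - x[i]) for j in range(n))
            for i in range(n)]


# Toy undirected path graph 0 - 1 - 2 - 3 (no self-loops).
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
x = [1.0, 0.0, 0.0, -3.0]           # distinct initial node features
initial_mean = sum(x) / len(x)

for _ in range(200):                 # depth plays the role of integration time
    x = consensus_step(adj, x, step=0.25)

spread = max(x) - min(x)
print(spread, x[0], initial_mean)
```

On a connected undirected graph every node drifts to the average of the initial values, so node features become indistinguishable as depth grows. This is the behavior the abstract identifies with oversmoothing.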

44. FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning

ArXiv ID: 2502.01184

Authors: Ankur Samanta, Rohan Gupta, Aditi Misra, Christian McIntosh Clarke, Jayakumar Rajadas

Abstract: Molecular property prediction uses molecular structure to infer chemical properties. Chemically interpretable representations that capture meaningful intramolecular interactions enhance the usability and effectiveness of these predictions. However, existing methods often rely on atom-based or rule-based fragment tokenization, which can be chemically suboptimal and lack scalability. We introduce FragmentNet, a graph-to-sequence foundation model with an adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments while preserving structural connectivity. FragmentNet integrates VQVAE-GCN for hierarchical fragment embeddings, spatial positional encodings for graph serialization, global molecular descriptors, and a transformer. Pre-trained with Masked Fragment Modeling and fine-tuned on MoleculeNet tasks, FragmentNet outperforms models with similarly scaled architectures and datasets while rivaling larger state-of-the-art models requiring significantly more resources. This novel framework enables adaptive decomposition, serialization, and reconstruction of molecular graphs, facilitating fragment-based editing and visualization of property trends in learned embeddings - a powerful tool for molecular design and optimization.

Comment: Explores fragmentation in graph molecular representation learning with architectural innovations like VQVAE-GCN and transformers. Moderately related to innovative architectures.

Relevance: 7 Novelty: 8

45. Process Reinforcement through Implicit Rewards

ArXiv ID: 2502.01456

Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding

Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.

Comment: Focuses on dense process reinforcement in LLMs during reasoning tasks and introduces implicit reward techniques. Novel contribution to LLM optimization might be of interest.

Relevance: 7 Novelty: 8

46. The role of positional encodings in the ARC benchmark

ArXiv ID: 2502.00174

Authors: Guilherme H. Bandeira Costa, Miguel Freire, Arlindo L. Oliveira

Abstract: The Abstraction and Reasoning Corpus challenges AI systems to perform abstract reasoning with minimal training data, a task intuitive for humans but demanding for machine learning models. Using CodeT5+ as a case study, we demonstrate how limitations in positional encoding hinder reasoning and impact performance. This work further examines the role of positional encoding across transformer architectures, highlighting its critical influence on models of varying sizes and configurations. Comparing several strategies, we find that while 2D positional encoding and Rotary Position Embedding offer competitive performance, 2D encoding excels in data-constrained scenarios, emphasizing its effectiveness for ARC tasks.

Comment: Analyzes the role of positional encodings in transformer-based tasks. Directly related to foundational aspects of model architectures and improvements in encoder-decoder setups.

Relevance: 8 Novelty: 7

47. Lifelong Sequential Knowledge Editing without Model Degradation

ArXiv ID: 2502.01636

Authors: Akshat Gupta, Phudish Prateepamornkul, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli

Abstract: Prior work in parameter-modifying knowledge editing has shown that large-scale sequential editing leads to significant model degradation. In this paper, we study the reasons behind this and scale sequential knowledge editing to 10,000 sequential edits, while maintaining the downstream performance of the original model. We first show that locate-then-edit knowledge editing methods lead to overfitting on the edited facts. We also show that continuous knowledge editing using these methods leads to disproportionate growth in the norm of the edited matrix. We then provide a crucial insight into the inner workings of locate-then-edit methods. We show that norm-growth is a hidden trick employed by these methods that gives larger importance to the output activations produced from the edited layers. With this "importance hacking", the edited layers provide a much larger contribution to the model's output. To mitigate these issues, we present ENCORE - Early stopping and Norm-Constrained Robust knowledge Editing. ENCORE controls for overfitting and the disproportionate norm-growth to enable long-term sequential editing, where we are able to perform up to 10,000 sequential edits without loss of downstream performance. ENCORE is also 61% faster than MEMIT and 64% faster than AlphaEdit on Llama3-8B.

Comment: This paper introduces a method for long-term knowledge editing in large models, focusing on preventing model degradation and overfitting. It touches on an architecture-level improvement via norm-constrained methods, relevant to representation learning and model architecture.

Relevance: 8 Novelty: 7

48. The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models

ArXiv ID: 2502.01225

Authors: Zhiyuan Xu, Joseph Gardiner, Sana Belguith

Abstract: Large language models are typically trained on vast amounts of data during the pre-training phase, which may include some potentially harmful information. Fine-tuning attacks can exploit this by prompting the model to reveal such behaviours, leading to the generation of harmful content. In this paper, we focus on investigating the performance of the Chain of Thought based reasoning model, DeepSeek, when subjected to fine-tuning attacks. Specifically, we explore how fine-tuning manipulates the model's output, exacerbating the harmfulness of its responses while examining the interaction between the Chain of Thought reasoning and adversarial inputs. Through this study, we aim to shed light on the vulnerability of Chain of Thought enabled models to fine-tuning attacks and the implications for their safety and ethical deployment.

Comment: Examines vulnerabilities in Chain of Thought reasoning for safety alignment of LLMs, theoretically relevant to interpretability and robustness.

Relevance: 8 Novelty: 7

49. Refining Adaptive Zeroth-Order Optimization at Ease

ArXiv ID: 2502.01014

Authors: Yao Shu, Qixin Zhang, Kun He, Zhongxiang Dai

Abstract: Zeroth-order (ZO) optimization plays an essential role in scenarios where gradient information is inaccessible or unaffordable, such as black-box systems and resource-constrained environments. While existing adaptive methods such as ZO-AdaMM have shown promise, they are fundamentally limited by their underutilization of moment information during optimization, usually resulting in underperforming convergence. To overcome these limitations, this paper introduces Refined Adaptive Zeroth-Order Optimization (R-AdaZO). Specifically, we first show the untapped variance reduction effect of the first moment estimate on ZO gradient estimation, which improves the accuracy and stability of ZO updates. We then refine the second moment estimate based on these variance-reduced gradient estimates to better capture the geometry of the optimization landscape, enabling a more effective scaling of ZO updates. We present rigorous theoretical analysis to show (I) the first analysis of the variance reduction of the first moment estimate in ZO optimization, (II) the improved second moment estimates with a more accurate approximation of its variance-free ideal, (III) the first variance-aware convergence framework for adaptive ZO methods, which may be of independent interest, and (IV) the faster convergence of R-AdaZO than existing baselines like ZO-AdaMM. Our extensive experiments, including synthetic problems, black-box adversarial attack, and memory-efficient fine-tuning of large language models (LLMs), further verify the superior convergence of R-AdaZO, indicating that R-AdaZO offers an improved solution for real-world ZO optimization challenges.

Comment: This paper introduces a novel zeroth-order optimization method with theoretical variance-aware convergence analysis, which provides insights into efficiency optimization relevant to compression and training dynamics.

Relevance: 7 Novelty: 8
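
For readers unfamiliar with the setting of entry 49, the sketch below shows a generic two-point zeroth-order gradient estimate smoothed by a first-moment (exponential moving average) term. It is only meant to show where the first moment enters a ZO update. It does not implement R-AdaZO's refined second moment, and the function names, hyperparameters, and the toy objective are assumptions.

```python
import random


def zo_gradient(f, x, mu, rng):
    """Two-point estimate along a random Gaussian direction u:
    ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u. Uses function values only."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def zo_momentum_descent(f, x, steps, lr=0.02, beta=0.9, mu=1e-3, seed=0):
    """Descent on f using ZO estimates averaged by a first-moment term."""
    rng = random.Random(seed)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = zo_gradient(f, x, mu, rng)
        # The moving average damps the noise of individual ZO estimates.
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        x = [xi - lr * mi for xi, mi in zip(x, m)]
    return x


def sphere(x):
    """Toy black-box objective with its minimum at the origin."""
    return sum(v * v for v in x)


start = [2.0, -1.5, 1.0]
x_final = zo_momentum_descent(sphere, start, steps=500)
print(sphere(start), sphere(x_final))
```

Each single estimate is a noisy, rank-one view of the gradient, so averaging across steps is a natural place to look for variance reduction.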

50. A single-loop SPIDER-type stochastic subgradient method for expectation-constrained nonconvex nonsmooth optimization

ArXiv ID: 2501.19214

Authors: Wei Liu, Yangyang Xu

Abstract: Many real-world problems, such as those with fairness constraints, involve complex expectation constraints and large datasets, necessitating the design of efficient stochastic methods to solve them. Most existing research focuses on cases with no constraint or easy-to-project constraints or deterministic constraints. In this paper, we consider nonconvex nonsmooth stochastic optimization problems with expectation constraints, for which we build a novel exact penalty model. We first show the relationship between the penalty model and the original problem. Then on solving the penalty problem, we present a single-loop SPIDER-type stochastic subgradient method, which utilizes the subgradients of both the objective and constraint functions, as well as the constraint function value at each iteration. Under certain regularity conditions (weaker than Slater-type constraint qualification or strong feasibility assumed in existing works), we establish an iteration complexity result of $O(\epsilon^{-4})$ to reach a near-$\epsilon$ stationary point of the penalized problem in expectation, matching the lower bound for such tasks. Building on the exact penalization, an $(\epsilon,\epsilon)$-KKT point of the original problem is obtained. For a few scenarios, our complexity of either the objective sample subgradient or the constraint sample function values can be lower than the state-of-the-art results by a factor of $\epsilon^{-2}$. Moreover, on solving two fairness-constrained problems, our method is significantly (up to 466 times) faster than the state-of-the-art algorithms, including switching subgradient method and inexact proximal point methods.

Comment: The proposed stochastic subgradient method for constrained optimization involves novel penalty models, contributing to optimization efficiency and theoretical advances.

Relevance: 7 Novelty: 8

51. Trading Inference-Time Compute for Adversarial Robustness

ArXiv ID: 2501.18841

Authors: Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese

Abstract: We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.

Comment: The paper explores compute scaling for adversarial robustness, contributing to inference-time optimization and resilience, with implications for LLM efficiency.

Relevance: 8 Novelty: 7

52. What is causal about causal models and representations?

ArXiv ID: 2501.19335

Authors: Frederik Hytting Jørgensen, Luigi Gresele, Sebastian Weichwald

Abstract: Causal Bayesian networks are 'causal' models since they make predictions about interventional distributions. To connect such causal model predictions to real-world outcomes, we must determine which actions in the world correspond to which interventions in the model. For example, to interpret an action as an intervention on a treatment variable, the action will presumably have to a) change the distribution of treatment in a way that corresponds to the intervention, and b) not change other aspects, such as how the outcome depends on the treatment; while the marginal distributions of some variables may change as an effect. We introduce a formal framework to make such requirements for different interpretations of actions as interventions precise. We prove that the seemingly natural interpretation of actions as interventions is circular: Under this interpretation, every causal Bayesian network that correctly models the observational distribution is trivially also interventionally valid, and no action yields empirical data that could possibly falsify such a model. We prove an impossibility result: No interpretation exists that is non-circular and simultaneously satisfies a set of natural desiderata. Instead, we examine non-circular interpretations that may violate some desiderata and show how this may in turn enable the falsification of causal models. By rigorously examining how a causal Bayesian network could be a 'causal' model of the world instead of merely a mathematical object, our formal framework contributes to the conceptual foundations of causal representation learning, causal discovery, and causal abstraction, while also highlighting some limitations of existing approaches.

Comment: The paper rigorously investigates the conceptual foundations of causal models, connecting to causal representation learning and theory, but lacks direct application to neural architectures.

Relevance: 7 Novelty: 8

53. Enhance Learning Efficiency of Oblique Decision Tree via Feature Concatenation

ArXiv ID: 2502.00465

Authors: Shen-Huan Lyu, Yi-Xiao He, Yanyan Wang, Zhihao Qu, Bin Tang, Baoliu Ye

Abstract: Oblique Decision Tree (ODT) separates the feature space by linear projections, as opposed to the conventional Decision Tree (DT) that forces axis-parallel splits. ODT has been proven to have a stronger representation ability than DT, as it provides a way to create shallower tree structures while still approximating complex decision boundaries. However, its learning efficiency is still insufficient, since the linear projections cannot be transmitted to the child nodes, resulting in a waste of model parameters. In this work, we propose an enhanced ODT method with Feature Concatenation (FC-ODT), which enables in-model feature transformation to transmit the projections along the decision paths. Theoretically, we prove that our method enjoys a faster consistency rate w.r.t. the tree depth, indicating that our method possesses a significant advantage in generalization performance, especially for shallow trees. Experiments show that FC-ODT can outperform the other state-of-the-art decision trees with a limited tree depth.

Comment: Proposes advancements to Oblique Decision Trees (ODT), improving their efficiency and generalization. The focus on sparsity and representation alignment qualifies it for foundational relevance.

Relevance: 7 Novelty: 8

54. Efficient rule induction by ignoring pointless rules

ArXiv ID: 2502.01232

Authors: Andrew Cropper, David M. Cerna

Abstract: The goal of inductive logic programming (ILP) is to find a set of logical rules that generalises training examples and background knowledge. We introduce an ILP approach that identifies pointless rules. A rule is pointless if it contains a redundant literal or cannot discriminate against negative examples. We show that ignoring pointless rules allows an ILP system to soundly prune the hypothesis space. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce learning times by 99% whilst maintaining predictive accuracies.

Comment: The paper introduces a new approach for inductive logic programming rule induction, which aligns well with model efficiency innovations but is primarily focused on specific applications.

Relevance: 7 Novelty: 7

55. A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers

ArXiv ID: 2502.01310

Authors: Roman Tarasov, Petr Mokrov, Milena Gazdieva, Evgeny Burnaev, Alexander Korotin

Abstract: Neural network based Optimal Transport (OT) is a recent and fruitful direction in the generative modeling community. It finds its applications in various fields such as domain translation, image super-resolution, computational biology and others. Among the existing approaches to OT, of considerable interest are adversarial minimax solvers based on semi-dual formulations of OT problems. While promising, these methods lack theoretical investigation from a statistical learning perspective. Our work fills this gap by establishing upper bounds on the generalization error of an approximate OT map recovered by the minimax quadratic OT solver. Importantly, the bounds we derive depend solely on some standard statistical and mathematical properties of the considered functional classes (neural networks). While our analysis focuses on the quadratic OT, we believe that similar bounds could be derived for more general OT formulations, paving the promising direction for future research.

Comment: The paper investigates neural OT solvers theoretically, potentially offering insights relevant to representation learning and optimization, though slightly peripheral to foundational model advances.

Relevance: 7 Novelty: 7

56. Estimating LLM Uncertainty with Logits

ArXiv ID: 2502.00290

Authors: Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang

Abstract: In recent years, Large Language Models (LLMs) have seen remarkable advancements and have been extensively integrated across various fields. Despite their progress, LLMs are prone to hallucinations, producing responses that may not be dependable if the models lack sufficient grounding knowledge. To mitigate this issue, methods for estimating uncertainty have been adopted, with a focus on critical tokens as indicators of reliability. Nevertheless, probability-based approaches have shown limitations in assessing token-level reliability due to the erosion of evidence strength information acquired during training. In this paper, we introduce Logits-induced Token Uncertainty (LogU), a novel framework designed to estimate token-specific uncertainty in LLMs in real time, without the need for multiple sampling rounds. By leveraging evidence modeling for the implementation of LogU, we utilize the derived uncertainty measures to steer downstream tasks. Our experimental findings highlight the substantial effectiveness and potential of LogU, marking a significant advancement in addressing the challenge of model hallucinations.

Comment: Proposes a novel framework for estimating token-level uncertainty in LLMs using logits, addressing fundamental interpretability and reliability concerns, with some ties to representation learning.

Relevance: 7 Novelty: 7

57. TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

ArXiv ID: 2502.01630

Authors: Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Monica Sunkara, Yassine Benajiba, Yi Zhang

Abstract: Temporal reasoning in multi-session dialogues presents a significant challenge which has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents in this context. Specifically, the framework employs time-aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro-symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, rising from 29.83 on GPT-4o via standard prompting to 77.67 via our approach and highlighting its effectiveness in addressing temporal reasoning in multi-session dialogues.

Comment: Proposes a neuro-symbolic approach to enhance temporal reasoning in LLM agents for multi-session dialogues, touching on foundational interpretability improvements.

Relevance: 7 Novelty: 7

58. Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses

ArXiv ID: 2501.19018

Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo, Bimal Bhattarai

Abstract: The Tsetlin Machine (TM) architecture has recently demonstrated effectiveness in Machine Learning (ML), particularly within Natural Language Processing (NLP). It has been utilized to construct word embedding using conjunctive propositional clauses, thereby significantly enhancing our understanding and interpretation of machine-derived decisions. The previous approach performed the word embedding over a sequence of input words to consolidate the information into a cohesive and unified representation. However, that approach encounters scalability challenges as the input size increases. In this study, we introduce a novel approach incorporating two-phase training to discover contextual embeddings of input sequences. Specifically, this method encapsulates the knowledge for each input word within the dataset's vocabulary, subsequently constructing embeddings for a sequence of input words utilizing the extracted knowledge. This technique not only facilitates the design of a scalable model but also preserves interpretability. Our experimental findings revealed that the proposed method yields competitive performance compared to the previous approaches, demonstrating promising results in contrast to human-generated benchmarks. Furthermore, we applied the proposed approach to sentiment analysis on the IMDB dataset, where the TM embedding and the TM classifier, along with other interpretable classifiers, offered a transparent end-to-end solution with competitive performance.

Comment: This paper focuses on a novel approach to constructing scalable and interpretable embeddings using Tsetlin Machines, which ties to representation learning. However, the application on sentiment analysis leans it partially towards applied NLP.

Relevance: 7 Novelty: 7

59. Jackpot! Alignment as a Maximal Lottery

ArXiv ID: 2501.19266

Authors: Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, Kate Larson

Abstract: Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the majority \cite{ge2024axioms}. To overcome these issues, we propose the use of a probabilistic Social Choice rule called maximal lotteries as a replacement for RLHF. We show that a family of alignment techniques, namely Nash Learning from Human Feedback (NLHF) \cite{munos2023nash} and variants, approximate maximal lottery outcomes and thus inherit its beneficial properties. We confirm experimentally that our proposed methodology handles situations that arise when working with preferences more robustly than standard RLHF, including supporting the preferences of the majority, providing principled ways of handling non-transitivities in the preference data, and robustness to irrelevant alternatives. This results in systems that better incorporate human values and respect human intentions.

Comment: Applies maximal lottery-based probabilistic social choice to LLM alignment, which is novel in exploring the intersection of RLHF and social choice theory.

Relevance: 7 Novelty: 7

60. Advanced Weakly-Supervised Formula Exploration for Neuro-Symbolic Mathematical Reasoning

ArXiv ID: 2502.00629

Authors: Yuxuan Wu, Hideki Nakayama

Abstract: In recent years, neuro-symbolic methods have become a popular and powerful approach that augments artificial intelligence systems with the capability to perform abstract, logical, and quantitative deductions with enhanced precision and controllability. Recent studies successfully performed symbolic reasoning by leveraging various machine learning models to explicitly or implicitly predict intermediate labels that provide symbolic instructions. However, these intermediate labels are not always prepared for every task as a part of training data, and pre-trained models, represented by Large Language Models (LLMs), also do not consistently generate valid symbolic instructions with their intrinsic knowledge. On the other hand, existing work developed alternative learning techniques that allow the learning system to autonomously uncover optimal symbolic instructions. Nevertheless, their performance also exhibits limitations when faced with relatively huge search spaces or more challenging reasoning problems. In view of this, in this work, we put forward an advanced practice for neuro-symbolic reasoning systems to explore the intermediate labels with weak supervision from problem inputs and final outputs. Our experiments on the Mathematics dataset illustrated the effectiveness of our proposals from multiple aspects.

Comment: Proposes a neuro-symbolic framework and addresses reasoning with weak supervision, moderately relevant to representation learning but more niche.

Relevance: 7 Novelty: 7

61. Locality-aware Surrogates for Gradient-based Black-box Optimization

ArXiv ID: 2501.19161

Authors: Ali Momeni, Stefan Uhlich, Arun Venkitaraman, Chia-Yu Hsieh, Andrea Bonetti, Ryoga Matsuo, Eisaku Ohbuchi, Lorenzo Servadei

Abstract: In physics and engineering, many processes are modeled using non-differentiable black-box simulators, making the optimization of such functions particularly challenging. To address such cases, inspired by the Gradient Theorem, we propose locality-aware surrogate models for active model-based black-box optimization. We first establish a theoretical connection between gradient alignment and the minimization of a Gradient Path Integral Equation (GradPIE) loss, which enforces consistency of the surrogate's gradients in local regions of the design space. Leveraging this theoretical insight, we develop a scalable training algorithm that minimizes the GradPIE loss, enabling both offline and online learning while maintaining computational efficiency. We evaluate our approach on three real-world tasks - spanning automated in silico experiments such as coupled nonlinear oscillators, analog circuits, and optical systems - and demonstrate consistent improvements in optimization efficiency under limited query budgets. Our results offer dependable solutions for both offline and online optimization tasks where reliable gradient estimation is needed.

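Comment: Proposes locality-aware surrogate models, trained with a gradient-consistency (GradPIE) loss, for black-box optimization of non-differentiable simulators; relevant to optimization and training methodology but peripheral to representation learning.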

Relevance: 7 Novelty: 7

62. Learning Sheaf Laplacian Optimizing Restriction Maps

ArXiv ID: 2501.19207

Authors: Leonardo Di Nino, Sergio Barbarossa, Paolo Di Lorenzo

Abstract: The aim of this paper is to propose a novel framework to infer the sheaf Laplacian, including the topology of a graph and the restriction maps, from a set of data observed over the nodes of a graph. The proposed method is based on sheaf theory, which represents an important generalization of graph signal processing. The learning problem aims to find the sheaf Laplacian that minimizes the total variation of the observed data, where the variation over each edge is also locally minimized by optimizing the associated restriction maps. Compared to alternative methods based on semidefinite programming, our solution is significantly more numerically efficient, as all its fundamental steps are resolved in closed form. The method is numerically tested on data consisting of vectors defined over subspaces of varying dimensions at each node. We demonstrate how the resulting graph is influenced by two key factors: the cross-correlation and the dimensionality difference of the data residing on the graph's nodes.

Comment: Presents a new method for inferring sheaf Laplacians with potential implications for representation learning but remains focused on mathematical frameworks.

Relevance: 7 Novelty: 7

63. No Foundations without Foundations -- Why semi-mechanistic models are essential for regulatory biology

ArXiv ID: 2501.19178

Authors: Luka Kovačević, Thomas Gaudelet, James Opzoomer, Hagen Triendl, John Whittaker, Caroline Uhler, Lindsay Edwards, Jake P. Taylor-King

Abstract: Despite substantial efforts, deep learning has not yet delivered a transformative impact on elucidating regulatory biology, particularly in the realm of predicting gene expression profiles. Here, we argue that genuine "foundation models" of regulatory biology will remain out of reach unless guided by frameworks that integrate mechanistic insight with principled experimental design. We present one such ground-up, semi-mechanistic framework that unifies perturbation-based experimental designs across both in vitro and in vivo CRISPR screens, accounting for differentiating and non-differentiating cellular systems. By revealing previously unrecognised assumptions in published machine learning methods, our approach clarifies links with popular techniques such as variational autoencoders and structural causal models. In practice, this framework suggests a modified loss function that we demonstrate can improve predictive performance, and further suggests an error analysis that informs batching strategies. Ultimately, since cellular regulation emerges from innumerable interactions amongst largely uncharted molecular components, we contend that systems-level understanding cannot be achieved through structural biology alone. Instead, we argue that real progress will require a first-principles perspective on how experiments capture biological phenomena, how data are generated, and how these processes can be reflected in more faithful modelling architectures.

Comment: This paper provides a semi-mechanistic framework that involves variational autoencoders and structural causal models, which has ties to representation learning and causal abstraction, albeit primarily within the context of regulatory biology.

Relevance: 7 Novelty: 7

64. Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

ArXiv ID: 2501.18837

Authors: Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez

Abstract: Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

Comment: This paper develops constitutional classifiers for defending against jailbreaks using synthetic rule-based data, contributing insights into LLM reliability and interpretability.

Relevance: 7 Novelty: 7

65. Fantastic Multi-Task Gradient Updates and How to Find Them In a Cone

ArXiv ID: 2502.00217

Authors: Negar Hassanpour, Muhammad Kamran Janjua, Kunlin Zhang, Sepehr Lavasani, Xiaowen Zhang, Chunhua Zhou, Chao Gao

Abstract: Balancing competing objectives remains a fundamental challenge in multi-task learning (MTL), primarily due to conflicting gradients across individual tasks. A common solution relies on computing a dynamic gradient update vector that balances competing tasks as optimization progresses. Building on this idea, we propose ConicGrad, a principled, scalable, and robust MTL approach formulated as a constrained optimization problem. Our method introduces an angular constraint to dynamically regulate gradient update directions, confining them within a cone centered on the reference gradient of the overall objective. By balancing task-specific gradients without over-constraining their direction or magnitude, ConicGrad effectively resolves inter-task gradient conflicts. Moreover, our framework ensures computational efficiency and scalability to high-dimensional parameter spaces. We conduct extensive experiments on standard supervised learning and reinforcement learning MTL benchmarks, and demonstrate that ConicGrad achieves state-of-the-art performance across diverse tasks.

Comment: Proposes a new method (ConicGrad) for resolving gradient conflicts in multi-task learning, touching on optimization dynamics but not directly on foundational advances in representation learning or architecture design.

Relevance: 7 Novelty: 7

66. Principal Components for Neural Network Initialization

ArXiv ID: 2501.19114

Authors: Nhan Phan, Thu Nguyen, Pål Halvorsen, Michael A. Riegler

Abstract: Principal Component Analysis (PCA) is a commonly used tool for dimension reduction and denoising. Therefore, it is also widely used on the data prior to training a neural network. However, this approach can complicate the explanation of explainable AI (XAI) methods for the decision of the model. In this work, we analyze the potential issues with this approach and propose Principal Components-based Initialization (PCsInit), a strategy to incorporate PCA into the first layer of a neural network via initialization of the first layer in the network with the principal components, and its two variants PCsInit-Act and PCsInit-Sub. Explanations using these strategies are as direct and straightforward as for neural networks and are simpler than using PCA prior to training a neural network on the principal components. Moreover, as will be illustrated in the experiments, such training strategies can also allow further improvement of training via backpropagation.

Comment: Introduces PCA-based strategies for neural network initialization, which aligns with representation-related insights but lacks a transformative innovation.

Relevance: 7 Novelty: 6

Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research, avoiding purely application-driven work:

Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
Large Language Models (LLMs) - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords for Relevant Domains: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.

Hints on Irrelevant Domains: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.

Hints on Application Tasks: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Relevance 9-10 (Completely Relevant)
Focus: Fully aligned with the core topics with no deviation; score highest if the paper contains the topic keywords.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.
Relevance 5-6 (Borderline)
Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
Focus: Purely unrelated to our topics. Completely a different domain.
Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (This is usually a very rare concept rooted in fundamental research.)

Novelty Scoring

Novelty 9-10 (Breakthrough)
Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
Definition: Minimal originality, applying standard approaches without real innovation.
Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
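
The two score scales above can be read as simple lookup tables. The Python sketch below is only an illustration: the names RELEVANCE_BANDS, NOVELTY_BANDS, and band_label are hypothetical and not part of the original pipeline, and it assumes scores are whole numbers from 1 to 10. The band labels themselves are copied from the criteria above.

```python
# Hypothetical helper: maps a 1-10 score to the band label given in the
# scoring criteria. Names are illustrative, not from the original tooling.
RELEVANCE_BANDS = {
    (9, 10): "Completely Relevant",
    (7, 8): "Relevant",
    (5, 6): "Borderline",
    (3, 4): "Irrelevant",
    (1, 2): "Ignore",
}

NOVELTY_BANDS = {
    (9, 10): "Breakthrough",
    (7, 8): "Improvements",
    (5, 6): "Borderline",
    (3, 4): "Tangential",
    (1, 2): "Low",
}


def band_label(score, bands):
    """Return the label of the band containing `score`."""
    # Assumption: scores are integers in [1, 10]; bool is rejected explicitly
    # because it is a subclass of int in Python.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    for (low, high), label in bands.items():
        if low <= score <= high:
            return label
    raise ValueError(f"score out of range 1-10: {score}")


# The last entry in the list above is scored Relevance 7, Novelty 6.
print(band_label(7, RELEVANCE_BANDS), "/", band_label(6, NOVELTY_BANDS))
# prints: Relevant / Borderline
```

Keeping the two tables separate mirrors the requirement that relevance and novelty be scored independently.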

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

ARXIVID: should be the ArXiv ID.
COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
RELEVANCE: should be a score from 1-10.
NOVELTY: should be a score from 1-10.
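
A response in the JSONL format requested above can be checked mechanically before it is turned into a digest. The following Python sketch is one possible validator, not the pipeline's actual code: parse_response and REQUIRED_KEYS are hypothetical names, and it assumes the two scores are emitted as JSON integers (the instructions do not say whether they are numbers or strings). The sample lines reuse arXiv IDs and scores from the list above, with shortened comments.

```python
import json

# Field names taken from the instructions above.
REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def parse_response(text):
    """Parse a JSONL response and validate each record (hypothetical helper)."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            # Assumption: scores are JSON integers in the range 1-10.
            if isinstance(score, bool) or not isinstance(score, int):
                raise ValueError(f"line {line_no}: {key} is not an integer")
            if not 1 <= score <= 10:
                raise ValueError(f"line {line_no}: {key}={score} outside 1-10")
        records.append(record)
    return records


sample = (
    '{"ARXIVID": "2501.19114", "COMMENT": "PCA-based initialization.", '
    '"RELEVANCE": 7, "NOVELTY": 6}\n'
    '{"ARXIVID": "2502.00217", "COMMENT": "Gradient conflicts in MTL.", '
    '"RELEVANCE": 7, "NOVELTY": 7}\n'
)
papers = parse_response(sample)
print(len(papers), papers[0]["ARXIVID"])
# prints: 2 2501.19114
```

Raising on the first malformed line is a deliberate choice here, since a silently dropped record would make the reported paper counts unreliable.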