This is a remedial run for missed papers from 03/14/2025 to 03/16/2025.

Results generated on 03/24/2025.

Personalized Daily Arxiv Papers 3/17/2025

[gpt-4o]	Prompt	Completion	Total
Token	85432	13519	98951
Cost	$0.21	$0.14	$0.35

Total arXiv papers: 505

Total scanned papers: 505

Total relevant papers: 75

Table of contents with paper titles:

Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms Authors: Xiaojian Li, Yongkang Leng, Ruiqing Ding, Hangjie Mo, Shanlin Yang
When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective Authors: Alireza Mousavi-Hosseini, Clayton Sanford, Denny Wu, Murat A. Erdogdu
A Review of DeepSeek Models' Key Innovative Techniques Authors: Chengen Wang, Murat Kantarcioglu
SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders Authors: Qing Li, Jiahui Geng, Derui Zhu, Fengyu Cai, Chenyang Lyu, Fakhri Karray
PREAMBLE: Private and Efficient Aggregation of Block Sparse Vectors and Applications Authors: Hilal Asi, Vitaly Feldman, Hannah Keller, Guy N. Rothblum, Kunal Talwar
Multi-View Node Pruning for Accurate Graph Representation Authors: Jiseong Park, Hanjin Kim, Seojin Kim, Jueun Choi
Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization Authors: Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber
MoECollab: Democratizing LLM Development Through Collaborative Mixture of Experts Authors: Harshit
Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs Authors: Nir Ailon, Akhiad Bercovich, Omri Weinstein
ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang
Counterfactual Realizability Authors: Arvind Raghavan, Elias Bareinboim
Atlas: Multi-Scale Attention Improves Long Context Image Modeling Authors: Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala
Taming Knowledge Conflicts in Language Models Authors: Gaotang Li, Yuzhong Chen, Hanghang Tong
MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling Authors: Rachel S. Y. Teo, Tan M. Nguyen
Fuzzy Rule-based Differentiable Representation Learning Authors: Wei Zhang, Zhaohong Deng, Guanjin Wang, Kup-Sze Choi
Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation Authors: Byung Hyun Lee, Sungjin Lim, Se Young Chun
LLM-Driven Multi-step Translation from C to Rust using Static Analysis Authors: Tianyang Zhou, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran
HyperKAN: Hypergraph Representation Learning with Kolmogorov-Arnold Networks Authors: Xiangfei Fang, Boying Wang, Chengying Huan, Shaonan Ma, Heng Zhang, Chen Zhao
PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders Authors: Ahmed Frikha, Muhammad Reza Ar Razi, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, Xuebing Zhou
Probabilistic Neural Networks (PNNs) with t-Distributed Outputs: Adaptive Prediction Intervals Beyond Gaussian Assumptions Authors: Farhad Pourkamali-Anaraki
Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers Authors: Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian
From Dionysius Emerges Apollo -- Learning Patterns and Abstractions from Perceptual Sequences Authors: Shuchen Wu
Auditing language models for hidden objectives Authors: Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, Evan Hubinger
BriLLM: Brain-inspired Large Language Model Authors: Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong
Neural Tangent Kernel of Neural Networks with Loss Informed by Differential Operators Authors: Weiye Gan, Yicheng Li, Qian Lin, Zuoqiang Shi
FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization Authors: Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan
Discovering uncertainty: Gaussian constitutive neural networks with correlated weights Authors: Jeremy A. McCulloch, Ellen Kuhl
Spherical Tree-Sliced Wasserstein Distance Authors: Viet-Hoang Tran, Thanh T. Chu, Khoi N. M. Nguyen, Trang Pham, Tam Le, Tan M. Nguyen
Positivity sets of hinge functions Authors: Josef Schicho, Ayush Kumar Tewari, Audie Warren
Hybrid Learners Do Not Forget: A Brain-Inspired Neuro-Symbolic Approach to Continual Learning Authors: Amin Banayeeanzade, Mohammad Rostami
An Algebraic Approach to Moralisation and Triangulation of Probabilistic Graphical Models Authors: Antonio Lorenzin, Fabio Zanasi
Riemannian Geometric-based Meta Learning Authors: JuneYoung Park, YuMi Lee, Tae-Joon Kim, Jang-Hwan Choi
Combining Causal Models for More Accurate Abstractions of Neural Networks Authors: Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger
From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting Authors: Samuel Hurault, Matthieu Terris, Thomas Moreau, Gabriel Peyré
Towards Learning High-Precision Least Squares Algorithms with Sequence Models Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré
Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning Authors: Chen Shu, Mengke Li, Yiqun Zhang, Yang Lu, Bo Han, Yiu-ming Cheung, Hanzi Wang
FlowKac: An Efficient Neural Fokker-Planck solver using Temporal Normalizing flows and the Feynman Kac-Formula Authors: Naoufal El Bekri, Lucas Drumetz, Franck Vermet
Permutation Equivariant Neural Networks for Symmetric Tensors Authors: Edward Pearce-Crump
Unifying Perplexing Behaviors in Modified BP Attributions through Alignment Perspective Authors: Guanhua Zheng, Jitao Sang, Changsheng Xu
Context-Aware Rule Mining Using a Dynamic Transformer-Based Framework Authors: Jie Liu, Yiwei Zhang, Yuan Sheng, Yujia Lou, Haige Wang, Bohuan Yang
Implicit Bias-Like Patterns in Reasoning Models Authors: Messi H. J. Lee, Calvin K. Lai
Advanced Deep Learning Methods for Protein Structure Prediction and Design Authors: Tianyang Wang, Yichao Zhang, Ningyuan Deng, Xinyuan Song, Ziqian Bi, Zheyu Yao, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Li Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Lawrence KQ Yan, Hongming Tseng, Yan Zhong, Yunze Wang, Ziyuan Qin, Bowen Jing, Junjie Yang, Jun Zhou, Chia Xin Liang, Junhao Song
Class-Level Feature Selection Method Using Feature Weighted Growing Self-Organising Maps Authors: Andrew Starkey, Uduak Idio Akpan, Omaimah AL Hosni, Yaseen Pullissery
Weighted Graph Structure Learning with Attention Denoising for Node Classification Authors: Tingting Wang, Jiaxin Su, Haobing Liu, Ruobing Jiang
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages Authors: Matteo Farina, Massimiliano Mancini, Giovanni Iacca, Elisa Ricci
Asynchronous Sharpness-Aware Minimization For Fast and Accurate Deep Learning Authors: Junhyuk Jo, Jihyun Lim, Sunwoo Lee
Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning Authors: Xi Wang
Efficient and Privacy-Preserved Link Prediction via Condensed Graphs Authors: Yunbo Long, Liming Xu, Alexandra Brintrup
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection Authors: Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen
RTD-Lite: Scalable Topological Analysis for Comparing Weighted Graphs in Learning Tasks Authors: Eduard Tulchinskii, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov
Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs Authors: Milan Papež, Martin Rektoris, Václav Šmídl, Tomáš Pevný
Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium Authors: Kaizhao Liu, Qi Long, Zhekun Shi, Weijie J. Su, Jiancong Xiao
The Architecture and Evaluation of Bayesian Neural Networks Authors: Alisa Sheinkman, Sara Wade
Quantifying Interpretability in CLIP Models with Concept Consistency Authors: Avinash Madasu, Vasudev Lal, Phillip Howard
Make Optimization Once and for All with Fine-grained Guidance Authors: Mingjia Shi, Ruihan Lin, Xuxi Chen, Yuhao Zhou, Zezhen Ding, Pingzhi Li, Tong Wang, Kai Wang, Zhangyang Wang, Jiheng Zhang, Tianlong Chen
Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization Authors: Dmitry Kovalev
Understanding Flatness in Generative Models: Its Role and Benefits Authors: Taehwan Lee, Kyeongkook Seo, Jaejun Yoo, Sung Whan Yoon
Support Collapse of Deep Gaussian Processes with Polynomial Kernels for a Wide Regime of Hyperparameters Authors: Daryna Chernobrovkina, Steffen Grünewälder
Adaptive Stochastic Gradient Descents on Manifolds with an Application on Weighted Low-Rank Approximation Authors: Peiqi Yang, Conglong Xu, Hao Wu
Bayes and Biased Estimators Without Hyper-parameter Estimation: Comparable Performance to the Empirical-Bayes-Based Regularized Estimator Authors: Yue Ju, Bo Wahlberg, Håkan Hjalmarsson
FedOSAA: Improving Federated Learning with One-Step Anderson Acceleration Authors: Xue Feng, M. Paul Laiu, Thomas Strohmer
Revisiting FastMap: New Applications Authors: Ang Li
GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation Authors: Tao Feng, Yihang Sun, Jiaxuan You
Designing Neural Synthesizers for Low Latency Interaction Authors: Franco Caspe, Jordie Shier, Mark Sandler, Charalampos Saitis, Andrew McPherson
From Demonstrations to Rewards: Alignment Without Explicit Human Preferences Authors: Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor
Token-Level Uncertainty-Aware Objective for Language Model Post-Training Authors: Tingkai Liu, Ari S. Benjamin, Anthony M. Zador
Reasoning-Grounded Natural Language Explanations for Language Models Authors: Vojtech Cahlik, Rodrigo Alves, Pavel Kordik
Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations Authors: Xue Jiang, Xiulian Peng, Yuan Zhang, Yan Lu
PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vision Language Models Authors: Mayank Nautiyal, Stela Arranz Gheorghe, Kristiana Stefa, Li Ju, Ida-Maria Sintorn, Prashant Singh
Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning Authors: Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, Zach Wood-Doughty
Simulation-based Bayesian inference under model misspecification Authors: Ryan P. Kelly, David J. Warne, David T. Frazier, David J. Nott, Michael U. Gutmann, Christopher Drovandi
Enhanced Soups for Graph Neural Networks Authors: Joseph Zuber, Aishwarya Sarkar, Joseph Jennings, Ali Jannesari
Fast filtering of non-Gaussian models using Amortized Optimal Transport Maps Authors: Mohammad Al-Jarrah, Bamdad Hosseini, Amirhossein Taghvaei
Can LLMs Formally Reason as Abstract Interpreters for Program Analysis? Authors: Jacqueline L. Mitchell, Brian Hyeongseok Kim, Chenyu Zhou, Chao Wang
Empirical Privacy Variance Authors: Yuzheng Hu, Fan Wu, Ruicheng Xian, Yuhang Liu, Lydia Zakynthinou, Pritish Kamath, Chiyuan Zhang, David Forsyth

1. Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms

ArXiv ID: 2503.13530

Authors: Xiaojian Li, Yongkang Leng, Ruiqing Ding, Hangjie Mo, Shanlin Yang

Abstract: The human-like reasoning capabilities exhibited by Large Language Models (LLMs) challenge the traditional neural network theory's understanding of the flexibility of fixed-parameter systems. This paper proposes the "Cognitive Activation" theory, revealing the essence of LLMs' reasoning mechanisms from the perspective of dynamic systems: the model's reasoning ability stems from a chaotic process of dynamic information extraction in the parameter space. By introducing the Quasi-Lyapunov Exponent (QLE), we quantitatively analyze the chaotic characteristics of the model at different layers. Experiments show that the model's information accumulation follows a nonlinear exponential law, and the Multilayer Perceptron (MLP) accounts for a higher proportion in the final output than the attention mechanism. Further experiments indicate that minor initial value perturbations will have a substantial impact on the model's reasoning ability, confirming the theoretical analysis that large language models are chaotic systems. This research provides a chaos theory framework for the interpretability of LLMs' reasoning and reveals potential pathways for balancing creativity and reliability in model design.

Comment: The paper proposes a chaos theory framework for understanding LLM reasoning mechanisms, aligning closely with foundational research in LLM behavior.

Relevance: 10 Novelty: 9

2. When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective

ArXiv ID: 2503.11272

Authors: Alireza Mousavi-Hosseini, Clayton Sanford, Denny Wu, Murat A. Erdogdu

Abstract: Theoretical efforts to prove advantages of Transformers in comparison with classical architectures such as feedforward and recurrent neural networks have mostly focused on representational power. In this work, we take an alternative perspective and prove that even with infinite compute, feedforward and recurrent networks may suffer from larger sample complexity compared to Transformers, as the latter can adapt to a form of dynamic sparsity. Specifically, we consider a sequence-to-sequence data generating model on sequences of length $N$, in which the output at each position depends only on $q$ relevant tokens with $q \ll N$, and the positions of these tokens are described in the input prompt. We prove that a single-layer Transformer can learn this model if and only if its number of attention heads is at least $q$, in which case it achieves a sample complexity almost independent of $N$, while recurrent networks require $N^{\Omega(1)}$ samples on the same problem. If we simplify this model, recurrent networks may achieve a complexity almost independent of $N$, while feedforward networks still require $N$ samples. Consequently, our proposed sparse retrieval model illustrates a natural hierarchy in sample complexity across these architectures.

Comment: The paper provides theoretical insights into when Transformers outperform other architectures, which is highly relevant to foundational research in model architecture.

Relevance: 10 Novelty: 8
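
The data-generating model in this abstract (each output depends on only $q$ of the $N$ tokens, and the input itself says which ones) can be illustrated with a toy sampler. This is a minimal sketch, not the paper's construction: the function name `sample_sparse_retrieval`, the Gaussian token values, and the choice of "mean of the pointed-to values" as the target are all assumptions made for illustration.

```python
import random

def sample_sparse_retrieval(n, q, seed=0):
    """Draw one (inputs, outputs) pair from a toy q-sparse retrieval task.

    Hypothetical setup: every position holds a scalar value plus q pointers
    to other positions; the target at that position is the mean of the
    q pointed-to values, so it depends on q << n tokens only.
    """
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # The relevant positions are part of the input, as in the abstract.
    pointers = [rng.sample(range(n), q) for _ in range(n)]
    outputs = [sum(values[j] for j in ptr) / q for ptr in pointers]
    inputs = list(zip(values, pointers))
    return inputs, outputs

if __name__ == "__main__":
    inputs, outputs = sample_sparse_retrieval(n=16, q=3)
    value, pointers = inputs[0]
    print("position 0 reads positions", pointers, "-> target", outputs[0])
```

A learner that can follow the pointers only has to combine $q$ values per position, which gives some intuition for why the abstract ties the required number of attention heads to $q$ rather than to $N$.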

3. A Review of DeepSeek Models' Key Innovative Techniques

ArXiv ID: 2503.11486

Authors: Chengen Wang, Murat Kantarcioglu

Abstract: DeepSeek-V3 and DeepSeek-R1 are leading open-source Large Language Models (LLMs) for general-purpose tasks and reasoning, achieving performance comparable to state-of-the-art closed-source models from companies like OpenAI and Anthropic -- while requiring only a fraction of their training costs. Understanding the key innovative techniques behind DeepSeek's success is crucial for advancing LLM research. In this paper, we review the core techniques driving the remarkable effectiveness and efficiency of these models, including refinements to the transformer architecture, innovations such as Multi-Head Latent Attention and Mixture of Experts, Multi-Token Prediction, the co-design of algorithms, frameworks, and hardware, the Group Relative Policy Optimization algorithm, post-training with pure reinforcement learning and iterative training alternating between supervised fine-tuning and reinforcement learning. Additionally, we identify several open questions and highlight potential research opportunities in this rapidly advancing field.

Comment: The paper reviews techniques behind DeepSeek models, including innovations in transformers and Mixture of Experts, aligning closely with model architecture research.

Relevance: 10 Novelty: 8

4. SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

ArXiv ID: 2503.14530

Authors: Qing Li, Jiahui Geng, Derui Zhu, Fengyu Cai, Chenyang Lyu, Fakhri Karray

Abstract: Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.

Comment: SAUCE utilizes sparse autoencoders for selective concept unlearning, demonstrating theoretical innovations in sparse methods and aligning with foundational model compression research.

Relevance: 9 Novelty: 9

5. PREAMBLE: Private and Efficient Aggregation of Block Sparse Vectors and Applications

ArXiv ID: 2503.11897

Authors: Hilal Asi, Vitaly Feldman, Hannah Keller, Guy N. Rothblum, Kunal Talwar

Abstract: We revisit the problem of secure aggregation of high-dimensional vectors in a two-server system such as Prio. These systems are typically used to aggregate vectors such as gradients in private federated learning, where the aggregate itself is protected via noise addition to ensure differential privacy. Existing approaches require communication scaling with the dimensionality, and thus limit the dimensionality of vectors one can efficiently process in this setup. We propose PREAMBLE: Private Efficient Aggregation Mechanism for BLock-sparse Euclidean Vectors. PREAMBLE is a novel extension of distributed point functions that enables communication- and computation-efficient aggregation of block-sparse vectors, which are sparse vectors where the non-zero entries occur in a small number of clusters of consecutive coordinates. We then show that PREAMBLE can be combined with random sampling and privacy amplification by sampling results, to allow asymptotically optimal privacy-utility trade-offs for vector aggregation, at a fraction of the communication cost. When coupled with recent advances in numerical privacy accounting, our approach incurs a negligible overhead in noise variance, compared to the Gaussian mechanism used with Prio.

Comment: Introduces PREAMBLE for efficient aggregation of block-sparse vectors, aligning with model compression and sparsity criteria.

Relevance: 9 Novelty: 8

6. Multi-View Node Pruning for Accurate Graph Representation

ArXiv ID: 2503.11737

Authors: Jiseong Park, Hanjin Kim, Seojin Kim, Jueun Choi

Abstract: Graph pooling, which compresses a whole graph into a smaller coarsened graph, is an essential component of graph representation learning. To efficiently compress a given graph, graph pooling methods often drop their nodes with attention-based scoring with the task loss. However, this often results in simply removing nodes with lower degrees without consideration of their feature-level relevance to the given task. To fix this problem, we propose Multi-View Pruning (MVP), a graph pruning method based on a multi-view framework and reconstruction loss. Given a graph, MVP first constructs multiple graphs for different views either by utilizing the predefined modalities or by randomly partitioning the input features, to consider the importance of each node in diverse perspectives. Then, it learns the score for each node by considering both the reconstruction and the task loss. MVP can be incorporated with any hierarchical pooling framework to score the nodes. We validate MVP on multiple benchmark datasets by coupling it with two graph pooling methods, and show that it significantly improves the performance of the base graph pooling method, outperforming all baselines. Further analysis shows that both the encoding of multiple views and the consideration of reconstruction loss are the key to the success of MVP, and that it indeed identifies nodes that are less important according to domain knowledge.

Comment: Proposes a multi-view pruning method for graph representation learning, aligning with representation learning and sparsity criteria.

Relevance: 9 Novelty: 8

7. Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization

ArXiv ID: 2503.11891

Authors: Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber

Abstract: We analyze the landscape and training dynamics of diagonal linear networks in a linear regression task, with the network parameters being perturbed by small isotropic normal noise. The addition of such noise may be interpreted as a stochastic form of sharpness-aware minimization (SAM) and we prove several results that relate its action on the underlying landscape and training dynamics to the sharpness of the loss. In particular, the noise changes the expected gradient to force balancing of the weight matrices at a fast rate along the descent trajectory. In the diagonal linear model, we show that this equates to minimizing the average sharpness, as well as the trace of the Hessian matrix, among all possible factorizations of the same matrix. Further, the noise forces the gradient descent iterates towards a shrinkage-thresholding of the underlying true parameter, with the noise level explicitly regulating both the shrinkage factor and the threshold.

Comment: Analyzes training dynamics of diagonal linear networks with stochastic sharpness-aware minimization, aligning with representation learning and training dynamics criteria.

Relevance: 9 Novelty: 8
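
The setting in this abstract, gradient descent on a diagonal linear network whose parameters are perturbed by small isotropic Gaussian noise, can be sketched on a tiny regression problem. This is only an illustrative sketch under stated assumptions: the two-factor parameterization `u * v`, the rule "evaluate the gradient at the perturbed point and apply it to the unperturbed iterate", and all hyperparameters (`lr`, `sigma`, `steps`, the toy data) are choices made here, not details taken from the paper.

```python
import random

def half_mse(u, v, xs, ys):
    """Half mean squared error of the diagonal linear network <u * v, x>."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(ui * vi * xi for ui, vi, xi in zip(u, v, x))
        total += 0.5 * (pred - y) ** 2
    return total / len(xs)

def noisy_gd(u, v, xs, ys, lr=0.02, sigma=0.1, steps=3000, seed=0):
    """Gradient descent with gradients taken at noise-perturbed parameters."""
    rng = random.Random(seed)
    u, v = list(u), list(v)
    d, n = len(u), len(xs)
    for _ in range(steps):
        # Perturb both factors with isotropic Gaussian noise (assumed form).
        up = [ui + sigma * rng.gauss(0.0, 1.0) for ui in u]
        vp = [vi + sigma * rng.gauss(0.0, 1.0) for vi in v]
        gu, gv = [0.0] * d, [0.0] * d
        for x, y in zip(xs, ys):
            r = sum(a * b * xi for a, b, xi in zip(up, vp, x)) - y
            for i in range(d):
                gu[i] += r * x[i] * vp[i] / n
                gv[i] += r * x[i] * up[i] / n
        # Apply the perturbed-point gradient to the unperturbed iterate.
        u = [ui - lr * g for ui, g in zip(u, gu)]
        v = [vi - lr * g for vi, g in zip(v, gv)]
    return u, v

def demo():
    rng = random.Random(1)
    w_true = [1.0, 0.0]  # hypothetical sparse ground truth
    xs = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(50)]
    ys = [sum(w * xi for w, xi in zip(w_true, x)) for x in xs]
    u0, v0 = [2.0, 2.0], [0.1, 0.1]  # deliberately unbalanced start
    u, v = noisy_gd(u0, v0, xs, ys)
    return half_mse(u0, v0, xs, ys), half_mse(u, v, xs, ys), u, v

if __name__ == "__main__":
    initial, final, u, v = demo()
    print("loss before:", initial, "loss after:", final)
    print("u:", u, "v:", v)
```

Printing `u` and `v` at the end lets one check informally whether the two factors have moved toward similar magnitudes and whether the product `u[i] * v[i]` is shrunk relative to the true parameter, which is the qualitative behaviour the abstract describes.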

8. MoECollab: Democratizing LLM Development Through Collaborative Mixture of Experts

ArXiv ID: 2503.12592

Authors: Harshit

Abstract: Large Language Model (LLM) development has become increasingly centralized, limiting participation to well-resourced organizations. This paper introduces MoECollab, a novel framework leveraging Mixture of Experts (MoE) architecture to enable distributed, collaborative LLM development. By decomposing monolithic models into specialized expert modules coordinated by a trainable gating network, our framework allows diverse contributors to participate regardless of computational resources. We provide a complete technical implementation with mathematical foundations for expert dynamics, gating mechanisms, and integration strategies. Experiments on multiple datasets demonstrate that our approach achieves accuracy improvements of 3-7% over baseline models while reducing computational requirements by 34%. Expert specialization yields significant domain-specific gains, with improvements from 51% to 88% F1 score in general classification and from 23% to 44% accuracy in news categorization. We formalize the routing entropy optimization problem and demonstrate how proper regularization techniques lead to 14% higher expert utilization rates. These results validate MoECollab as an effective approach for democratizing LLM development through architecturally-supported collaboration.

Comment: Proposes MoECollab framework leveraging Mixture of Experts (MoE) architecture, aligning with architectural innovation and emerging trends.

Relevance: 9 Novelty: 8
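
The abstract refers to expert modules coordinated by a trainable gating network and to a routing entropy term. The generic mechanism can be sketched as follows; this is a standard softmax-gated mixture, not MoECollab's actual implementation, and the names `gate`, `routing_entropy`, and `moe_forward`, as well as the scalar-valued toy experts, are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(x, gate_weights):
    """Routing probabilities from a linear gating network (one row per expert)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    return softmax(logits)

def routing_entropy(probs):
    """Shannon entropy of the routing distribution; higher means more even use."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def moe_forward(x, gate_weights, experts):
    """Combine expert outputs, weighted by the gate's routing probabilities."""
    probs = gate(x, gate_weights)
    outputs = [expert(x) for expert in experts]
    mixed = sum(p * o for p, o in zip(probs, outputs))
    return mixed, probs

if __name__ == "__main__":
    experts = [lambda x: sum(x), lambda x: max(x)]  # toy scalar experts
    gate_weights = [[0.5, -0.5], [-0.5, 0.5]]
    mixed, probs = moe_forward([1.0, 2.0], gate_weights, experts)
    print("output:", mixed, "routing:", probs, "entropy:", routing_entropy(probs))
```

An entropy term like this is one common way to discourage the gate from collapsing onto a single expert; how the paper formalizes and regularizes it is not specified in the abstract.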

9. Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs

ArXiv ID: 2503.12211

Authors: Nir Ailon, Akhiad Bercovich, Omri Weinstein

Abstract: We propose a cheaper alternative bilinear operator to matrix-multiplication in deep neural networks (DNNs). Unlike many stubborn attempts to accelerate MatMuls in DNN inference, this operator is supported by capabilities of existing GPU hardware, most notably NVIDIA TensorCores. To our knowledge, this is the first GPU-native acceleration technique which does not decrease (in fact, increases) the number of trainable parameters of the network, mitigating the accuracy-loss of compression-based techniques. Hence, this operator is at the same time more expressive than MatMul, yet requires substantially fewer FLOPs to evaluate. We term this new operator Strassen-Tile (STL). The main idea behind STL$(X,W)$ is a local change-of-basis (learnable encoder) on weights and activation tiles, after which we perform batched elementwise products between tiles, and a final decoding transformation (inspired by algebraic pipelines from fast matrix and polynomial multiplication). We compare STL against two benchmarks. The first one is SoTA T2T-ViT on Imagenet-1K. Here we show that replacing all linear layers with STL and training from scratch, results in factor x2.7 reduction in FLOPs with a 0.5 accuracy improvement. Our second speed-accuracy comparison benchmark for pretrained LLMs is the most practical GPU-acceleration technique, 2:4 structured Sparsity. Finetuning TinyLlama with STL layers on the Slim Pajama dataset, achieves similar accuracy to 2:4, with x2.2 FLOP speedup compared to x1.7 of the latter. Finally, we discuss a group-theoretic approach for discovering universal encoders for STL, which could lead to fast black-box acceleration via approximate matrix-multiplication (AMM).

Comment: Proposes a GPU-efficient alternative to matrix multiplication in DNNs, aligning with model compression and efficiency criteria.

Relevance: 9 Novelty: 8

10. ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

ArXiv ID: 2503.12668

Authors: Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang

Abstract: Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it's feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU--achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced in https://github.com/liangyuwang/zo2.

Comment: Proposes ZO2 for zeroth-order fine-tuning of LLMs, aligning with model compression and efficiency criteria.

Relevance: 9 Novelty: 8
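
The abstract notes that zeroth-order methods estimate gradients from forward passes alone, using two forward evaluations per step. A minimal sketch of that generic idea is a two-point finite-difference step along a random direction, shown below. It does not model ZO2's CPU/GPU offloading or low-bit transfers; the toy `loss`, the function `zo_sgd_step`, and the hyperparameters are hypothetical.

```python
import random

def loss(params):
    """Hypothetical toy objective: squared distance to a fixed target."""
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))

def zo_sgd_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order step using only forward evaluations."""
    # Regenerating the direction from a seed means it never has to be stored
    # alongside the parameters, only the seed does.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Directional derivative estimated from the two forward passes.
    g = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [p - lr * g * zi for p, zi in zip(params, z)]

def run(steps=2000):
    params = [0.0, 0.0, 0.0]
    for step in range(steps):
        params = zo_sgd_step(params, loss, seed=step)
    return params

if __name__ == "__main__":
    final = run()
    print("final params:", final, "final loss:", loss(final))
```

No activations or backward graph are kept at any point, which is the memory saving the abstract builds on; the cost is a noisier update than a true gradient step.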

11. Counterfactual Realizability

ArXiv ID: 2503.11870

Authors: Arvind Raghavan, Elias Bareinboim

Abstract: It is commonly believed that, in a real-world environment, samples can only be drawn from observational and interventional distributions, corresponding to Layers 1 and 2 of the Pearl Causal Hierarchy. Layer 3, representing counterfactual distributions, is believed to be inaccessible by definition. However, Bareinboim, Forney, and Pearl (2015) introduced a procedure that allows an agent to sample directly from a counterfactual distribution, leaving open the question of what other counterfactual quantities can be estimated directly via physical experimentation. We resolve this by introducing a formal definition of realizability, the ability to draw samples from a distribution, and then developing a complete algorithm to determine whether an arbitrary counterfactual distribution is realizable given fundamental physical constraints, such as the inability to go back in time and subject the same unit to a different experimental condition. We illustrate the implications of this new framework for counterfactual data collection using motivating examples from causal fairness and causal reinforcement learning. While the baseline approach in these motivating settings typically follows an interventional or observational strategy, we show that a counterfactual strategy provably dominates both.

Comment: Counterfactual realizability in causal inference offers foundational contributions to causal reasoning and representation learning, a key area of interest.

Relevance: 9 Novelty: 8

12. Atlas: Multi-Scale Attention Improves Long Context Image Modeling

ArXiv ID: 2503.12355

Authors: Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala

Abstract: Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas, (i) multi-scale representations (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling in a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%) while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.

Comment: The paper introduces a novel multi-scale attention mechanism and architecture (Atlas), which aligns with foundational research in model architecture.

Relevance: 9 Novelty: 8

13. Taming Knowledge Conflicts in Language Models

ArXiv ID: 2503.10996

Authors: Gaotang Li, Yuzhong Chen, Hanghang Tong

Abstract: Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the "superposition of contextual information and parametric memory", where highly influential attention heads could simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JUICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JUICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JUICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JUICE in these settings.

Comment: The paper introduces a method to address knowledge conflicts in LLMs, which aligns with foundational research in LLM behavior.

Relevance: 9 Novelty: 8

14. MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling

ArXiv ID: 2503.11144

Authors: Rachel S. Y. Teo, Tan M. Nguyen

Abstract: Large-scale pre-training of deep models, followed by fine-tuning them, has become the cornerstone of natural language processing (NLP). The prevalence of data coupled with computational resources has led to large models with a considerable number of parameters. While the massive size of these models has led to remarkable success in many NLP tasks, a detriment is the expense required to retrain all the base model's parameters for the adaptation to each task or domain. Parameter Efficient Fine-Tuning (PEFT) provides an effective solution for this challenge by minimizing the number of parameters required to be fine-tuned while maintaining the quality of the model. While existing methods have achieved impressive results, they mainly focus on adapting a subset of parameters, weight reparameterization, and prompt engineering. In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction. We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model. It performs a conditional computation of a mixture of layers during fine-tuning to provide the model with more structural knowledge about the data. By providing an avenue for information exchange between layers, MoLEx enables the model to make a more well-informed prediction for the downstream task, leading to better fine-tuning results with the same number of effective parameters. As experts can be processed in parallel, MoLEx introduces minimal additional computational overhead. We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark as well as the End-to-End Challenge (E2E). The code is publicly available at https://github.com/rachtsy/molex.

Comment: The paper introduces a sparse mixture of layer experts for fine-tuning, which is highly relevant to foundational research in model architecture.

Relevance: 9 Novelty: 8

15. Fuzzy Rule-based Differentiable Representation Learning

ArXiv ID: 2503.13548

Authors: Wei Zhang, Zhaohong Deng, Guanjin Wang, Kup-Sze Choi

Abstract: Representation learning has emerged as a crucial focus in machine and deep learning, involving the extraction of meaningful and useful features and patterns from the input data, thereby enhancing the performance of various downstream tasks such as classification, clustering, and prediction. Current mainstream representation learning methods primarily rely on non-linear data mining techniques such as kernel methods and deep neural networks to extract abstract knowledge from complex datasets. However, most of these methods are black-box, lacking transparency and interpretability in the learning process, which constrains their practical utility. To this end, this paper introduces a novel representation learning method grounded in an interpretable fuzzy rule-based model. Specifically, it is built upon the Takagi-Sugeno-Kang fuzzy system (TSK-FS) to initially map input data to a high-dimensional fuzzy feature space through the antecedent part of the TSK-FS. Subsequently, a novel differentiable optimization method is proposed for learning the consequent part, which can preserve the model's interpretability and transparency while further exploring the nonlinear relationships within the data. This optimization method retains the essence of traditional optimization, with certain parts of the process parameterized, corresponding differentiable modules constructed, and a deep optimization process implemented. Consequently, this method not only enhances the model's performance but also ensures its interpretability. Moreover, a second-order geometry preservation method is introduced to further improve the robustness of the proposed method. Extensive experiments conducted on various benchmark datasets validate the superiority of the proposed method, highlighting its potential for advancing representation learning methodologies.

Comment: The paper introduces a novel representation learning method grounded in interpretable fuzzy rule-based models, aligning with the foundational research in representation learning.

Relevance: 9 Novelty: 8

16. Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation

ArXiv ID: 2503.12356

Authors: Byung Hyun Lee, Sungjin Lim, Se Young Chun

Abstract: Fine-tuning based concept erasing has demonstrated promising results in preventing generation of harmful content from text-to-image diffusion models by removing target concepts while preserving remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it locally appears in an image, leaving other regions intact. However, prior arts often compromise fidelity of the other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for the localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined only by several generation steps for concepts without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the region of the target concepts, even when target and remaining concepts coexist within an image. Extensive experiments demonstrated that GLoCE not only improves the image fidelity to text prompts after erasing the localized target concepts, but also outperforms prior arts in efficacy, specificity, and robustness by a large margin and can be extended to mass concept erasure.

Comment: The paper introduces a training-free low-rank adaptation method for concept erasure in diffusion models, aligning with model compression and efficiency research.

Relevance: 9 Novelty: 8

17. LLM-Driven Multi-step Translation from C to Rust using Static Analysis

ArXiv ID: 2503.12511

Authors: Tianyang Zhou, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran

Abstract: Translating software written in legacy languages to modern languages, such as C to Rust, has significant benefits in improving memory safety while maintaining high performance. However, manual translation is cumbersome, error-prone, and produces unidiomatic code. Large language models (LLMs) have demonstrated promise in producing idiomatic translations, but offer no correctness guarantees as they lack the ability to capture all the semantic differences between the source and target languages. To resolve this issue, we propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool using a two-step translation methodology: an "unidiomatic" step to translate C into Rust while preserving semantics, and an "idiomatic" step to refine the code to follow Rust's semantic standards. SACTOR utilizes information provided by static analysis of the source C program to address challenges such as pointer semantics and dependency resolution. To validate the correctness of the translated result from each step, we use end-to-end testing via the foreign function interface to embed our translated code segment into the original code. We evaluate the translation of 200 programs from two datasets and two case studies, comparing the performance of GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3 70B and DeepSeek-R1 in SACTOR. Our results demonstrate that SACTOR achieves high correctness and improved idiomaticity, with the best-performing model (DeepSeek-R1) reaching 93% and (GPT-4o, Claude 3.5, DeepSeek-R1) reaching 84% correctness (on each dataset, respectively), while producing more natural and Rust-compliant translations compared to existing methods.

Comment: The paper proposes a multi-step translation methodology for C-to-Rust using LLMs, aligning with foundational research in LLM-driven architecture innovations.

Relevance: 9 Novelty: 8

18. HyperKAN: Hypergraph Representation Learning with Kolmogorov-Arnold Networks

ArXiv ID: 2503.12365

Authors: Xiangfei Fang, Boying Wang, Chengying Huan, Shaonan Ma, Heng Zhang, Chen Zhao

Abstract: Hypergraph representation learning has garnered increasing attention across various domains due to its capability to model high-order relationships. Traditional methods often rely on hypergraph neural networks (HNNs) employing message passing mechanisms to aggregate vertex and hyperedge features. However, these methods are constrained by their dependence on hypergraph topology, leading to the challenge of imbalanced information aggregation, where high-degree vertices tend to aggregate redundant features, while low-degree vertices often struggle to capture sufficient structural features. To overcome the above challenges, we introduce HyperKAN, a novel framework for hypergraph representation learning that transcends the limitations of message-passing techniques. HyperKAN begins by encoding features for each vertex and then leverages Kolmogorov-Arnold Networks (KANs) to capture complex nonlinear relationships. By adjusting structural features based on similarity, our approach generates refined vertex representations that effectively address the challenge of imbalanced information aggregation. Experiments conducted on real-world datasets demonstrate that HyperKAN significantly outperforms state-of-the-art HNN methods, achieving nearly a 9% performance improvement on the Senate dataset.

Comment: The paper introduces HyperKAN for hypergraph representation learning, aligning with foundational research in representation learning.

Relevance: 9 Novelty: 8

19. PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

ArXiv ID: 2503.11232

Authors: Ahmed Frikha, Muhammad Reza Ar Razi, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, Xuebing Zhou

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but also pose significant privacy risks by memorizing and leaking Personally Identifiable Information (PII). Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to effectively prevent leakage. To address this challenge, we introduce PrivacyScalpel, a novel privacy-preserving framework that leverages LLM interpretability techniques to identify and mitigate PII leakage while maintaining performance. PrivacyScalpel comprises three key steps: (1) Feature Probing, which identifies layers in the model that encode PII-rich representations, (2) Sparse Autoencoding, where a k-Sparse Autoencoder (k-SAE) disentangles and isolates privacy-sensitive features, and (3) Feature-Level Interventions, which employ targeted ablation and vector steering to suppress PII leakage. Our empirical evaluation on Gemma2-2b and Llama2-7b, fine-tuned on the Enron dataset, shows that PrivacyScalpel significantly reduces email leakage from 5.15% to as low as 0.0%, while maintaining over 99.4% of the original model's utility. Notably, our method outperforms neuron-level interventions in privacy-utility trade-offs, demonstrating that acting on sparse, monosemantic features is more effective than manipulating polysemantic neurons. Beyond improving LLM privacy, our approach offers insights into the mechanisms underlying PII memorization, contributing to the broader field of model interpretability and secure AI deployment.

Comment: PrivacyScalpel introduces sparse autoencoders for privacy enhancement in LLMs, aligning with foundational research in sparsity and representation learning.

Relevance: 9 Novelty: 8
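
A toy sketch of two ideas named in the PrivacyScalpel abstract above: the top-k sparsity usually associated with a k-sparse autoencoder, and a feature-level ablation. This is not the authors' code. The function names `top_k_sparsify` and `ablate` are hypothetical, and it assumes the k-SAE keeps the k largest latent activations and zeroes the rest.

```python
def top_k_sparsify(activations, k):
    """Keep the k largest activations and set all others to zero."""
    if k <= 0:
        return [0.0] * len(activations)
    # Indices sorted by activation value, largest first; keep the first k.
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


def ablate(features, blocked):
    """Zero the features at the given indices (illustrative 'targeted ablation')."""
    return [0.0 if i in blocked else f for i, f in enumerate(features)]


latent = top_k_sparsify([0.1, 2.0, -1.0, 0.7], k=2)
cleaned = ablate(latent, blocked={1})  # pretend feature 1 is privacy-sensitive
```

In this toy run, `latent` keeps only the two largest entries, and `cleaned` additionally suppresses the feature assumed to be sensitive.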

20. Probabilistic Neural Networks (PNNs) with t-Distributed Outputs: Adaptive Prediction Intervals Beyond Gaussian Assumptions

ArXiv ID: 2503.12354

Authors: Farhad Pourkamali-Anaraki

Abstract: Traditional neural network regression models provide only point estimates, failing to capture predictive uncertainty. Probabilistic neural networks (PNNs) address this limitation by producing output distributions, enabling the construction of prediction intervals. However, the common assumption of Gaussian output distributions often results in overly wide intervals, particularly in the presence of outliers or deviations from normality. To enhance the adaptability of PNNs, we propose t-Distributed Neural Networks (TDistNNs), which generate t-distributed outputs, parameterized by location, scale, and degrees of freedom. The degrees of freedom parameter allows TDistNNs to model heavy-tailed predictive distributions, improving robustness to non-Gaussian data and enabling more adaptive uncertainty quantification. We develop a novel loss function tailored for the t-distribution and derive efficient gradient computations for seamless integration into deep learning frameworks. Empirical evaluations on synthetic and real-world data demonstrate that TDistNNs improve the balance between coverage and interval width. Notably, for identical architectures, TDistNNs consistently produce narrower prediction intervals than Gaussian-based PNNs while maintaining proper coverage. This work contributes a flexible framework for uncertainty estimation in neural networks tasked with regression, particularly suited to settings involving complex output distributions.

Comment: The paper introduces t-distributed outputs for PNNs, aligning with foundational research in representation learning and uncertainty quantification.

Relevance: 9 Novelty: 8
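
For readers unfamiliar with t-distributed outputs, the following is a minimal sketch of the standard negative log-likelihood of a location-scale Student-t distribution, with a Gaussian version for comparison. It is the textbook density only; the abstract does not give the paper's tailored loss, so this should not be read as the authors' loss function. The function names are hypothetical.

```python
import math


def student_t_nll(y, loc, scale, df):
    """Negative log-likelihood of y under a location-scale Student-t."""
    if scale <= 0 or df <= 0:
        raise ValueError("scale and df must be positive")
    z = (y - loc) / scale
    # Log of the normalizing constant of the t density.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return -(log_norm - (df + 1) / 2 * math.log1p(z * z / df))


def gaussian_nll(y, loc, scale):
    """Negative log-likelihood of y under a Gaussian, for comparison."""
    z = (y - loc) / scale
    return 0.5 * math.log(2 * math.pi) + math.log(scale) + 0.5 * z * z
```

With a small `df` the t loss grows only logarithmically in the residual, so a single outlier is penalized far less than under the Gaussian loss; as `df` grows, the two losses coincide.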

21. Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers

ArXiv ID: 2503.11108

Authors: Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian

Abstract: The key-value (KV) cache in autoregressive transformers presents a significant bottleneck during inference, which restricts the context length capabilities of large language models (LLMs). While previous work analyzes the fundamental space complexity barriers in the standard attention mechanism [Haris and Onak, 2025], our work generalizes the space complexity barrier result to the tensor attention version. Our theoretical contributions rely on a novel reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. In the low dimensional regime where $d = o(\log n)$, we analyze the theoretical bounds of the space complexity as well. Overall, our work provides a theoretical foundation for us to understand the compression-expressivity tradeoff in tensor attention mechanisms and offers more perspectives in developing more memory-efficient transformer architectures.

Comment: The paper analyzes KV cache compression limits in tensor attention, aligning with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8

22. From Dionysius Emerges Apollo -- Learning Patterns and Abstractions from Perceptual Sequences

ArXiv ID: 2503.10973

Authors: Shuchen Wu

Abstract: Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts -- a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation -- letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction as simple computational principles enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex, concrete to abstract.

Comment: The paper explores chunking and abstraction in sequence learning, which is relevant to representation learning and foundational insights into how models encode information.

Relevance: 9 Novelty: 8

23. Auditing language models for hidden objectives

ArXiv ID: 2503.10965

Authors: Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, Evan Hubinger

Abstract: We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.

Comment: The paper studies alignment audits for LLMs, which provides theoretical insights into model behavior and interpretability, aligning with foundational research.

Relevance: 9 Novelty: 8

24. BriLLM: Brain-inspired Large Language Model

ArXiv ID: 2503.11299

Authors: Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong

Abstract: This paper reports the first brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on the directed graph in terms of the neural network, and has the interpretability of all nodes on the graph of the whole model, instead of the traditional machine learning model that only has limited interpretability at the input and output ends. In the language model scenario, the token is defined as a node in the graph. A randomly shaped or user-defined signal flow flows between nodes on the principle of "least resistance" along paths. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long $n$-gram models when the model size is independent of the input and predicted length of the model. The model's working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we released the first BriLLM version in Chinese, with 4000 tokens, 32-dimensional node width, 16-token long sequence prediction ability, and language model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.

Comment: The paper introduces a brain-inspired large language model, which aligns with architectural innovations and foundational research.

Relevance: 9 Novelty: 8

25. Neural Tangent Kernel of Neural Networks with Loss Informed by Differential Operators

ArXiv ID: 2503.11029

Authors: Weiye Gan, Yicheng Li, Qian Lin, Zuoqiang Shi

Abstract: Spectral bias is a significant phenomenon in neural network training and can be explained by neural tangent kernel (NTK) theory. In this work, we develop the NTK theory for deep neural networks with physics-informed loss, providing insights into the convergence of NTK during initialization and training, and revealing its explicit structure. We find that, in most cases, the differential operators in the loss function do not induce a faster eigenvalue decay rate and stronger spectral bias. Some experimental results are also presented to verify the theory.

Comment: The paper develops NTK theory for physics-informed loss, providing foundational insights into training dynamics and spectral bias.

Relevance: 9 Novelty: 8

26. FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

ArXiv ID: 2503.12649

Authors: Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan

Abstract: Model merging has emerged as a promising approach for multi-task learning (MTL), offering a data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned foundation models, existing model merging methods face two key limitations: (i) They are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, (ii) They struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model in the pool to minimize a linear approximation of the objective function and then executes a local merging similar to the Frank-Wolfe update. The objective function is designed to capture the desired behavior of the target-merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique for existing merging methods, seamlessly integrating with them to further enhance accuracy performance. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead, unlike the linear overhead of data-informed merging methods. Compared with the state-of-the-art approaches, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed AdaMerging by 8.39% when merging 20 ViT models.

Comment: FW-Merging innovates in model merging via constrained optimization techniques, aligning significantly with foundational research in model efficiency and architecture-level improvements.

Relevance: 9 Novelty: 8
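
The select-then-merge loop described in the FW-Merging abstract above follows the shape of the classical Frank-Wolfe iteration. The sketch below shows only that classical iteration on a toy problem: candidates are plain 2-D vectors standing in for model checkpoints, the objective is a squared distance to a made-up target, and the step size is the textbook 2/(t+2). None of these choices come from the paper, and `frank_wolfe_merge` is a hypothetical name.

```python
def frank_wolfe_merge(candidates, grad_fn, steps=200):
    """Classical Frank-Wolfe over the convex hull of `candidates`.

    Returns the final point and the list of candidate indices selected.
    """
    x = list(candidates[0])
    picks = []
    for t in range(steps):
        g = grad_fn(x)
        # Linear minimization step: the candidate best aligned with -gradient.
        best = min(range(len(candidates)),
                   key=lambda i: sum(gi * ci for gi, ci in zip(g, candidates[i])))
        picks.append(best)
        gamma = 2.0 / (t + 2.0)  # textbook step size (an assumption here)
        s = candidates[best]
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x, picks


target = [0.5, 0.5]                             # toy stand-in for desired behavior
pool = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]     # third entry is off-target
merged, picks = frank_wolfe_merge(
    pool, lambda x: [xi - ti for xi, ti in zip(x, target)])
```

Each iteration touches one candidate at a time, which is the property the abstract relies on for its constant memory overhead claim. In this toy run the off-target third candidate is never selected.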

27. Discovering uncertainty: Gaussian constitutive neural networks with correlated weights

ArXiv ID: 2503.12679

Authors: Jeremy A. McCulloch, Ellen Kuhl

Abstract: When characterizing materials, it can be important to not only predict their mechanical properties, but also to estimate the probability distribution of these properties across a set of samples. Constitutive neural networks allow for the automated discovery of constitutive models that exactly satisfy physical laws given experimental testing data, but are only capable of predicting the mean stress response. Stochastic methods treat each weight as a random variable and are capable of learning their probability distributions. Bayesian constitutive neural networks combine both methods, but their weights lack physical interpretability and we must sample each weight from a probability distribution to train or evaluate the model. Here we introduce a more interpretable network with fewer parameters, simpler training, and the potential to discover correlated weights: Gaussian constitutive neural networks. We demonstrate the performance of our new Gaussian network on biaxial testing data, and discover a sparse and interpretable four-term model with correlated weights. Importantly, the discovered distributions of material parameters across a set of samples can serve as priors to discover better constitutive models for new samples with limited data. We anticipate that Gaussian constitutive neural networks are a natural first step towards generative constitutive models informed by physical laws and parameter uncertainty.

Comment: Gaussian constitutive neural networks enhance interpretability and tackle parameter uncertainty, showing foundational advancements in sparse/low-rank methods for AI.

Relevance: 9 Novelty: 8

28. Spherical Tree-Sliced Wasserstein Distance

ArXiv ID: 2503.11249

Authors: Viet-Hoang Tran, Thanh T. Chu, Khoi N. M. Nguyen, Trang Pham, Tam Le, Tan M. Nguyen

Abstract: Sliced Optimal Transport (OT) simplifies the OT problem in high-dimensional spaces by projecting supports of input measures onto one-dimensional lines and then exploiting the closed-form expression of the univariate OT to reduce the computational burden of OT. Recently, the Tree-Sliced method has been introduced to replace these lines with more intricate structures, known as tree systems. This approach enhances the ability to capture topological information of integration domains in Sliced OT while maintaining low computational cost. Inspired by this approach, in this paper, we present an adaptation of tree systems on OT problems for measures supported on a sphere. As a counterpart to the Radon transform variant on tree systems, we propose a novel spherical Radon transform with a new integration domain called spherical trees. By leveraging this transform and exploiting the spherical tree structures, we derive closed-form expressions for OT problems on the sphere. Consequently, we obtain an efficient metric for measures on the sphere, named Spherical Tree-Sliced Wasserstein (STSW) distance. We provide an extensive theoretical analysis to demonstrate the topology of spherical trees and the well-definedness and injectivity of our Radon transform variant, which leads to an orthogonally invariant distance between spherical measures. Finally, we conduct a wide range of numerical experiments, including gradient flows and self-supervised learning, to assess the performance of our proposed metric, comparing it to recent benchmarks.

Comment: Introduces the Spherical Tree-Sliced Wasserstein distance, which extends tree-sliced optimal transport to measures supported on the sphere, aligning well with foundational mathematical innovations.

Relevance: 8 Novelty: 9

29. Positivity sets of hinge functions

ArXiv ID: 2503.13512

Authors: Josef Schicho, Ayush Kumar Tewari, Audie Warren

Abstract: In this paper we investigate which subsets of the real plane are realisable as the set of points on which a one-layer ReLU neural network takes a positive value. In the case of cones we give a full characterisation of such sets. Furthermore, we give a necessary condition for any subset of $\mathbb R^d$. We give various examples of such one-layer neural networks.

Comment: The paper provides theoretical insights into the expressivity of one-layer ReLU neural networks related to their activation regions, targeting foundational architectural understanding.

Relevance: 9 Novelty: 7
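
To make the question in the abstract above concrete, here is a small illustration, assuming "one-layer ReLU network" means a weighted sum of ReLU units applied to affine functions of the input. The specific network is a hand-picked example, not one taken from the paper: f(x, y) = ReLU(x) + ReLU(y) - ReLU(x + y), which has no biases and is positive exactly on the two open quadrants where x and y have opposite signs (a non-convex cone).

```python
def relu(t):
    return max(0.0, t)


def one_layer_relu(point, weights, biases, outer, offset=0.0):
    """Evaluate offset + sum_i outer[i] * ReLU(weights[i] . point + biases[i])."""
    total = offset
    for w, b, a in zip(weights, biases, outer):
        pre = sum(wi * xi for wi, xi in zip(w, point)) + b
        total += a * relu(pre)
    return total


def f(point):
    # Hand-picked example: ReLU(x) + ReLU(y) - ReLU(x + y).
    return one_layer_relu(point,
                          weights=[(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
                          biases=[0.0, 0.0, 0.0],
                          outer=[1.0, 1.0, -1.0])


def in_positivity_set(point):
    return f(point) > 0
```

For instance, `in_positivity_set((1.0, -1.0))` holds while `in_positivity_set((1.0, 1.0))` does not, since the three terms cancel whenever both coordinates share a sign.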

30. Hybrid Learners Do Not Forget: A Brain-Inspired Neuro-Symbolic Approach to Continual Learning

ArXiv ID: 2503.12635

Authors: Amin Banayeeanzade, Mohammad Rostami

Abstract: Continual learning is crucial for creating AI agents that can learn and improve themselves autonomously. A primary challenge in continual learning is to learn new tasks without losing previously learned knowledge. Current continual learning methods primarily focus on enabling a neural network with mechanisms that mitigate forgetting effects. Inspired by the two distinct systems in the human brain, System 1 and System 2, we propose a Neuro-Symbolic Brain-Inspired Continual Learning (NeSyBiCL) framework that incorporates two subsystems to solve continual learning: A neural network model responsible for quickly adapting to the most recent task, together with a symbolic reasoner responsible for retaining previously acquired knowledge from previous tasks. Moreover, we design an integration mechanism between these components to facilitate knowledge transfer from the symbolic reasoner to the neural network. We also introduce two compositional continual learning benchmarks and demonstrate that NeSyBiCL is effective and leads to superior performance compared to continual learning methods that merely rely on neural architectures to address forgetting.

Comment: Introduces a neuro-symbolic approach to continual learning, aligning with architectural innovation and emerging trends.

Relevance: 8 Novelty: 8

31. An Algebraic Approach to Moralisation and Triangulation of Probabilistic Graphical Models

ArXiv ID: 2503.11820

Authors: Antonio Lorenzin, Fabio Zanasi

Abstract: Moralisation and triangulation are transformations that allow switching between different ways of factoring a probability distribution into a graphical model. Moralisation allows one to view a Bayesian network (a directed model) as a Markov network (an undirected model), whereas triangulation works in the opposite direction. We present a categorical framework where these transformations are modelled as functors between a category of Bayesian networks and one of Markov networks. The two kinds of network (the objects of these categories) are themselves represented as functors, from a 'syntax' domain to a 'semantics' codomain. Notably, moralisation and triangulation are definable inductively on such syntax, and operate as a form of functor pre-composition. This approach introduces a modular, algebraic perspective in the theory of probabilistic graphical models.

Comment: Proposes an algebraic approach to probabilistic graphical models, aligning with emerging trends and foundational research.

Relevance: 8 Novelty: 8
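
For readers unfamiliar with moralisation, the sketch below shows the classical graph-level operation that the abstract of 2503.11820 refers to: connect every pair of parents that share a child, then drop edge directions. It illustrates only this textbook construction, not the paper's categorical framework; the function name and the example graph are assumptions.

```python
from itertools import combinations


def moralise(parents):
    """Return the undirected edge set of the moral graph of a DAG.

    `parents` maps each node to the list of its parents.
    """
    edges = set()
    for child, pars in parents.items():
        # Keep every directed edge, forgetting its direction.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" every pair of parents that share this child.
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))
    return edges


# Hypothetical collider A -> C <- B.
collider = {"A": [], "B": [], "C": ["A", "B"]}
moral_edges = moralise(collider)
```

On the collider, moralisation adds the edge between A and B, so the resulting Markov network is a triangle on A, B, and C.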

32. Riemannian Geometric-based Meta Learning

ArXiv ID: 2503.10993

Authors: JuneYoung Park, YuMi Lee, Tae-Joon Kim, Jang-Hwan Choi

Abstract: Meta-learning, or "learning to learn," aims to enable models to quickly adapt to new tasks with minimal data. While traditional methods like Model-Agnostic Meta-Learning (MAML) optimize parameters in Euclidean space, they often struggle to capture complex learning dynamics, particularly in few-shot learning scenarios. To address this limitation, we propose Stiefel-MAML, which integrates Riemannian geometry by optimizing within the Stiefel manifold, a space that naturally enforces orthogonality constraints. By leveraging the geometric structure of the Stiefel manifold, we improve parameter expressiveness and enable more efficient optimization through Riemannian gradient calculations and retraction operations. We also introduce a novel kernel-based loss function defined on the Stiefel manifold, further enhancing the model's ability to explore the parameter space. Experimental results on benchmark datasets--including Omniglot, Mini-ImageNet, FC-100, and CUB--demonstrate that Stiefel-MAML consistently outperforms traditional MAML, achieving superior performance across various few-shot learning tasks. Our findings highlight the potential of Riemannian geometry to enhance meta-learning, paving the way for future research on optimizing over different geometric structures.

Comment: The Stiefel-MAML approach provides novel insights using Riemannian geometry for meta-learning, advancing foundational algorithmic methodologies for learning paradigms.

Relevance: 8 Novelty: 8
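
The abstract of 2503.10993 mentions Riemannian gradients and retraction operations on the Stiefel manifold (matrices with orthonormal columns). The sketch below shows one generic step of that kind: project a Euclidean gradient onto the tangent space, move along it, and retract with a Gram-Schmidt (QR-style) orthonormalisation. The choice of retraction, the helper names, and the numbers are assumptions; the abstract does not say which retraction Stiefel-MAML uses.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def tangent_project(x, g):
    """Project g onto the tangent space at x: g - x * sym(x^T g)."""
    xtg = matmul(transpose(x), g)
    p = len(xtg)
    sym = [[0.5 * (xtg[i][j] + xtg[j][i]) for j in range(p)] for i in range(p)]
    xs = matmul(x, sym)
    return [[gij - sij for gij, sij in zip(grow, srow)] for grow, srow in zip(g, xs)]


def qr_retract(y):
    """Orthonormalise the columns of y with modified Gram-Schmidt."""
    q_cols = []
    for col in transpose(y):
        v = list(col)
        for q in q_cols:
            dot = sum(a * b for a, b in zip(q, v))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        q_cols.append([vi / norm for vi in v])
    return transpose(q_cols)


def stiefel_step(x, grad, lr):
    """One Riemannian gradient step followed by a retraction."""
    rgrad = tangent_project(x, grad)
    moved = [[xij - lr * rij for xij, rij in zip(xrow, rrow)]
             for xrow, rrow in zip(x, rgrad)]
    return qr_retract(moved)


# Hypothetical 3x2 point on the manifold and an arbitrary Euclidean gradient.
X0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
GRAD = [[0.3, -0.2], [0.5, 0.1], [-0.4, 0.7]]
X1 = stiefel_step(X0, GRAD, lr=0.1)
GRAM = matmul(transpose(X1), X1)  # should be close to the 2x2 identity
```

The point of the retraction is that the updated matrix stays on the manifold, so orthogonality is maintained by construction rather than by a penalty term.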

33. Combining Causal Models for More Accurate Abstractions of Neural Networks

ArXiv ID: 2503.11429

Authors: Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger

Abstract: Mechanistic interpretability aims to reverse engineer neural networks by uncovering which high-level algorithms they implement. Causal abstraction provides a precise notion of when a network implements an algorithm, i.e., a causal model of the network contains low-level features that realize the high-level variables in a causal model of the algorithm. A typical problem in practical settings is that the algorithm is not an entirely faithful abstraction of the network, meaning it only partially captures the true reasoning process of a model. We propose a solution where we combine different simple high-level models to produce a more faithful representation of the network. Through learning this combination, we can model neural networks as being in different computational states depending on the input provided, which we show is more accurate to GPT 2-small fine-tuned on two toy tasks. We observe a trade-off between the strength of an interpretability hypothesis, which we define in terms of the number of inputs explained by the high-level models, and its faithfulness, which we define as the interchange intervention accuracy. Our method allows us to modulate between the two, providing the most accurate combination of models that describe the behavior of a neural network given a faithfulness level.

Comment: The combination of causal models for neural network abstractions offers foundational contributions towards mechanistic interpretability of models.

Relevance: 8 Novelty: 8

34. From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting

ArXiv ID: 2503.11615

Authors: Samuel Hurault, Matthieu Terris, Thomas Moreau, Gabriel Peyré

Abstract: Sampling from an unknown distribution, accessible only through discrete samples, is a fundamental problem at the core of generative AI. The current state-of-the-art methods follow a two-step process: first estimating the score function (the gradient of a smoothed log-distribution) and then applying a gradient-based sampling algorithm. The resulting distribution's correctness can be impacted by several factors: the generalization error due to a finite number of initial samples, the error in score matching, and the diffusion error introduced by the sampling algorithm. In this paper, we analyze the sampling process in a simple yet representative setting: sampling from Gaussian distributions using a Langevin diffusion sampler. We provide a sharp analysis of the Wasserstein sampling error that arises from the multiple sources of error throughout the pipeline. This allows us to rigorously track how the anisotropy of the data distribution (encoded by its power spectrum) interacts with key parameters of the end-to-end sampling method, including the noise amplitude, the step sizes in both score matching and diffusion, and the number of initial samples. Notably, we show that the Wasserstein sampling error can be expressed as a kernel-type norm of the data power spectrum, where the specific kernel depends on the method parameters. This result provides a foundation for further analysis of the tradeoffs involved in optimizing sampling accuracy, such as adapting the noise amplitude to the choice of step sizes.

Comment: The paper offers a fine-grained theoretical analysis of Langevin sampling methods, contributing to foundational understanding in generative sampling algorithms.

Relevance: 8 Novelty: 8
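
One of the error sources named in the abstract of 2503.11615, the discretisation error of the sampler, can be worked out exactly in one dimension. The sketch below assumes a zero-mean Gaussian target N(0, sigma2) with its exact score (so no score-matching or finite-sample error) and the unadjusted Langevin update x <- x + step * score(x) + sqrt(2 * step) * z. This is a toy calculation under those assumptions, not the paper's analysis; all names and values are illustrative.

```python
import math


def ula_stationary_variance(sigma2, step, iters=10_000, v0=0.0):
    """Iterate the variance recursion of unadjusted Langevin for N(0, sigma2).

    With score(x) = -x / sigma2 the update is linear in x, so
    Var[x] <- (1 - step / sigma2) ** 2 * Var[x] + 2 * step.
    """
    v = v0
    shrink = (1.0 - step / sigma2) ** 2
    for _ in range(iters):
        v = shrink * v + 2.0 * step
    return v


def closed_form_variance(sigma2, step):
    """Fixed point of the recursion, valid for 0 < step < 2 * sigma2."""
    return sigma2 / (1.0 - step / (2.0 * sigma2))


def w2_zero_mean_gaussians(var_a, var_b):
    """2-Wasserstein distance between N(0, var_a) and N(0, var_b) in 1D."""
    return abs(math.sqrt(var_a) - math.sqrt(var_b))


SIGMA2, STEP = 1.0, 0.1  # illustrative values
V_ULA = ula_stationary_variance(SIGMA2, STEP)
BIAS = w2_zero_mean_gaussians(SIGMA2, V_ULA)
```

Even with a perfect score, the sampler settles on a slightly inflated variance, and the resulting Wasserstein bias shrinks as the step size goes to zero.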

35. Towards Learning High-Precision Least Squares Algorithms with Sequence Models

ArXiv ID: 2503.12295

Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré

Abstract: This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.

Comment: The paper studies whether sequence models can learn numerical algorithms for least squares to near machine precision, which is relevant to foundational research on architectures and training procedures.

Relevance: 8 Novelty: 8
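
As a reference point for what the models in 2503.12295 are trained to imitate, the sketch below runs plain gradient descent on a tiny least-squares problem until it reaches floating-point accuracy. It is the classical numerical algorithm, not a learned model; the matrix, step size, and iteration count are assumptions chosen so that the iteration converges.

```python
def lstsq_gradient_descent(a, b, lr, iters):
    """Minimise 0.5 * ||a w - b||^2 with plain gradient descent."""
    n_features = len(a[0])
    w = [0.0] * n_features
    for _ in range(iters):
        # Residual a w - b for the current iterate.
        residual = [sum(aij * wj for aij, wj in zip(row, w)) - bi
                    for row, bi in zip(a, b)]
        # Gradient a^T (a w - b).
        grad = [sum(row[j] * r for row, r in zip(a, residual))
                for j in range(n_features)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical overdetermined system with three equations and two unknowns.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [1.0, 2.0, 4.0]
W = lstsq_gradient_descent(A, B, lr=0.5, iters=200)
```

For this system the normal equations give the exact minimiser (4/3, 7/3), and the error of the iterate halves at every step, so a few hundred iterations are enough to reach machine precision.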

36. Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning

ArXiv ID: 2503.11414

Authors: Chen Shu, Mengke Li, Yiqun Zhang, Yang Lu, Bo Han, Yiu-ming Cheung, Hanzi Wang

Abstract: In real-world datasets, the challenges of long-tailed distributions and noisy labels often coexist, posing obstacles to model training and performance. Existing studies on long-tailed noisy label learning (LTNLL) typically assume that the generation of noisy labels is independent of the long-tailed distribution, which may not be true from a practical perspective. In real-world situations, we observe that tail class samples are more likely to be mislabeled as head, exacerbating the original degree of imbalance. We call this phenomenon "tail-to-head (T2H)" noise. T2H noise severely degrades model performance by polluting the head classes and forcing the model to learn the tail samples as head. To address this challenge, we investigate the dynamic misleading process of the noisy labels and propose a novel method called Disentangling and Unlearning for Long-tailed and Label-noisy data (DULL). It first employs Inner-Feature Disentangling (IFD) to disentangle features internally. Based on this, Inner-Feature Partial Unlearning (IFPU) is then applied to weaken and unlearn incorrect feature regions correlated to wrong classes. This method prevents the model from being misled by noisy labels, enhancing the model's robustness against noise. To provide a controlled experimental environment, we further propose a new noise addition algorithm to simulate T2H noise. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed method.

Comment: The paper proposes disentangling and unlearning methods for noisy long-tailed data, aligning with foundational research in representation learning.

Relevance: 8 Novelty: 8

37. FlowKac: An Efficient Neural Fokker-Planck solver using Temporal Normalizing flows and the Feynman Kac-Formula

ArXiv ID: 2503.11427

Authors: Naoufal El Bekri, Lucas Drumetz, Franck Vermet

Abstract: Solving the Fokker-Planck equation for high-dimensional complex dynamical systems remains a pivotal yet challenging task due to the intractability of analytical solutions and the limitations of traditional numerical methods. In this work, we present FlowKac, a novel approach that reformulates the Fokker-Planck equation using the Feynman-Kac formula, allowing the solution to be queried at a given point via the expected values of stochastic paths. A key innovation of FlowKac lies in its adaptive stochastic sampling scheme, which significantly reduces the computational complexity while maintaining high accuracy. This sampling technique, coupled with a time-indexed normalizing flow designed for capturing time-evolving probability densities, enables robust sampling of collocation points, resulting in a flexible and mesh-free solver. This formulation mitigates the curse of dimensionality and enhances computational efficiency and accuracy, which is particularly crucial for applications that inherently require dimensions beyond the conventional three. We validate the robustness and scalability of our method through various experiments on a range of stochastic differential equations, demonstrating significant improvements over existing techniques.

Comment: The paper introduces a novel approach to solving the Fokker-Planck equation using temporal normalizing flows, which aligns with foundational research in representation learning and efficiency improvements.

Relevance: 8 Novelty: 8

38. Permutation Equivariant Neural Networks for Symmetric Tensors

ArXiv ID: 2503.11276

Authors: Edward Pearce-Crump

Abstract: Incorporating permutation equivariance into neural networks has proven to be useful in ensuring that models respect symmetries that exist in data. Symmetric tensors, which naturally appear in statistics, machine learning, and graph theory, are essential for many applications in physics, chemistry, and materials science, amongst others. However, existing research on permutation equivariant models has not explored symmetric tensors as inputs, and most prior work on learning from these tensors has focused on equivariance to Euclidean groups. In this paper, we present two different characterisations of all linear permutation equivariant functions between symmetric power spaces of $\mathbb{R}^n$. We show on two tasks that these functions are highly data efficient compared to standard MLPs and have potential to generalise well to symmetric tensors of different sizes.

Comment: The paper introduces permutation equivariant neural networks for symmetric tensors, which aligns with architectural innovations and foundational research.

Relevance: 8 Novelty: 8
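
The simplest instance of the setting in 2503.11276 is the first-order case, where the input is a plain vector in R^n. There, a well-known result is that linear permutation equivariant maps take the form x -> a * x + b * sum(x) * (1, ..., 1). The sketch below checks that property numerically; it does not reproduce the paper's characterisations for higher-order symmetric tensors, and the function names and values are assumptions.

```python
def equivariant_linear(x, a, b):
    """Apply the map x -> a * x + b * sum(x) * (1, ..., 1)."""
    total = sum(x)
    return [a * xi + b * total for xi in x]


def permute(x, perm):
    """Reorder x so that output position i holds x[perm[i]]."""
    return [x[i] for i in perm]


X = [1.0, -2.0, 0.5, 3.0]
PERM = [2, 0, 3, 1]

# Equivariance: permuting then mapping equals mapping then permuting.
LHS = equivariant_linear(permute(X, PERM), a=2.0, b=-0.5)
RHS = permute(equivariant_linear(X, a=2.0, b=-0.5), PERM)
```

The map has only two free parameters regardless of n, which gives some intuition for why equivariant layers can be more data efficient than unconstrained MLP layers.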

39. Unifying Perplexing Behaviors in Modified BP Attributions through Alignment Perspective

ArXiv ID: 2503.11160

Authors: Guanhua Zheng, Jitao Sang, Changsheng Xu

Abstract: Attributions aim to identify input pixels that are relevant to the decision-making process. A popular approach involves using modified backpropagation (BP) rules to reverse decisions, which improves interpretability compared to the original gradients. However, these methods lack a solid theoretical foundation and exhibit perplexing behaviors, such as reduced sensitivity to parameter randomization, raising concerns about their reliability and highlighting the need for theoretical justification. In this work, we present a unified theoretical framework for methods like GBP, RectGrad, LRP, and DTD, demonstrating that they achieve input alignment by combining the weights of activated neurons. This alignment improves the visualization quality and reduces sensitivity to weight randomization. Our contributions include: (1) Providing a unified explanation for multiple behaviors, rather than focusing on just one. (2) Accurately predicting novel behaviors. (3) Offering insights into decision-making processes, including layer-wise information changes and the relationship between attributions and model decisions.

Comment: The paper provides a unified theoretical framework for backpropagation attribution methods, aligning with foundational research in representation learning.

Relevance: 8 Novelty: 8

40. Context-Aware Rule Mining Using a Dynamic Transformer-Based Framework

ArXiv ID: 2503.11125

Authors: Jie Liu, Yiwei Zhang, Yuan Sheng, Yujia Lou, Haige Wang, Bohuan Yang

Abstract: This study proposes a dynamic rule data mining algorithm based on an improved Transformer architecture, aiming to improve the accuracy and efficiency of rule mining in a dynamic data environment. With the increase in data volume and complexity, traditional data mining methods struggle to cope with dynamic data with strong temporal and variable characteristics, so new algorithms are needed to capture the temporal regularity in the data. By improving the Transformer architecture and introducing a dynamic weight adjustment mechanism and a temporal dependency module, we enable the model to adapt to data changes and mine more accurate rules. Experimental results show that, compared with traditional rule mining algorithms, the improved Transformer model achieves significant improvements in rule mining accuracy, coverage, and stability. The contribution of each module to the algorithm's performance is further verified by ablation experiments, which confirm the importance of the temporal dependency and dynamic weight adjustment mechanisms in improving model performance. In addition, although the improved model faces certain challenges in computational efficiency, its advantages in accuracy and coverage enable it to perform well in processing complex dynamic data. Future research will focus on optimizing computational efficiency and combining more deep learning technologies to expand the application scope of the algorithm, especially in practical applications in the fields of finance, medical care, and intelligent recommendation.

Comment: Proposes an improved Transformer architecture with dynamic weight adjustment and temporal dependency modules, aligning with architectural innovation.