This is a remedial run for missed papers from 05/19/2025 to 05/19/2025.

Results generated on 05/26/2025.

Personalized Daily ArXiv Papers 2025-05-20

[gpt-4o]	Prompt	Completion	Total
Token	54622	7163	61785
Cost	$0.14	$0.07	$0.21

Total arXiv papers: 472

Total scanned papers: 472

Total relevant papers: 49

Table of contents with paper titles:

Mean Flows for One-step Generative Modeling Authors: Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He
Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures Authors: Tuan Thai, TrungTin Nguyen, Dat Do, Nhat Ho, Christopher Drovandi
Unpacking Positional Encoding in Transformers: A Spectral Analysis of Content-Position Coupling Authors: Zihan Gu, Han Zhang, Ruoyu Chen, Yue Hu, Hua Zhang
Dense Communication between Language Models Authors: Shiguang Wu, Yaqing Wang, Quanming Yao
Information Science Principles of Machine Learning: A Causal Chain Meta-Framework Based on Formalized Information Mapping Authors: Jianfeng Xu
Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds Authors: Ke Sun
Causality-Inspired Robustness for Nonlinear Models via Representation Learning Authors: Marin Šola, Peter Bühlmann, Xinwei Shen
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference Authors: Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao
Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression Authors: Xiaohui Wang, Peng Ye, Chenyu Huang, Shenghe Zheng, Bo Zhang, Wanli Ouyang, Tao Chen
Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks Authors: Francesco D'Amico, Dario Bocchi, Matteo Negri
Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs) Authors: Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, Peter Richtárik
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone Authors: Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu
Understanding Task Representations in Neural Networks via Bayesian Ablation Authors: Andrew Nam, Declan Campbell, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie
A3 : an Analytical Low-Rank Approximation Framework for Attention Authors: Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao
Exploring Federated Pruning for Large Language Models Authors: Pengxin Guo, Yinong Wang, Wei Li, Mengting Liu, Ming Li, Jinkai Zheng, Liangqiong Qu
Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks? Authors: Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Ronghua Li
Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers Authors: Andrew Nam, Henry Conklin, Yukang Yang, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie
Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference Authors: Shuqing Luo, Pingzhi Li, Jie Peng, Hanrui Wang, Yang, Zhao, Yu, Cao, Yu Cheng, Tianlong Chen
RGNMR: A Gauss-Newton method for robust matrix completion with theoretical guarantees Authors: Eilon Vaknin Laufer, Boaz Nadler
Learning (Approximately) Equivariant Networks via Constrained Optimization Authors: Andrei Manolache, Luiz F. O. Chamon, Mathias Niepert
Neural Functional: Learning Function to Scalar Maps for Neural PDE Surrogates Authors: Anthony Zhou, Amir Barati Farimani
ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data Authors: Yifeng Jiao, Yuchen Liu, Yu Zhang, Xin Guo, Yushuai Wu, Chen Jiang, Jiyang Li, Hongwei Zhang, Limei Han, Xin Gao, Yuan Qi, Yuan Cheng
Hardware-Adaptive and Superlinear-Capacity Memristor-based Associative Memory Authors: Chengping He, Mingrui Jiang, Keyi Shan, Szu-Hao Yang, Zefan Li, Shengbo Wang, Giacomo Pedretti, Jim Ignowski, Can Li
A Path to Universal Neural Cellular Automata Authors: Gabriel Béna, Maxence Faldor, Dan F. M. Goodman, Antoine Cully
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens Authors: Kaya Stechly, Karthik Valmeekam, Atharva Gundawar, Vardhan Palod, Subbarao Kambhampati
Sinusoidal Initialization, Time for a New Start Authors: Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
A Minimum Description Length Approach to Regularization in Neural Networks Authors: Matan Abudy, Orr Well, Emmanuel Chemla, Roni Katzir, Nur Lan
KHRONOS: a Kernel-Based Neural Architecture for Rapid, Resource-Efficient Scientific Computation Authors: Reza T. Batley, Sourav Saha
When majority rules, minority loses: bias amplification of gradient descent Authors: François Bachoc, Jérôme Bolte, Ryan Boustany, Jean-Michel Loubes
Self-Reinforced Graph Contrastive Learning Authors: Chou-Ying Hsieh, Chun-Fu Jang, Cheng-En Hsieh, Qian-Hui Chen, Sy-Yen Kuo
Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation Authors: Sungmin Cha, Kyunghyun Cho
Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles Authors: Xinzhu Liang, Joseph M. Lukens, Sanjaya Lohani, Brian T. Kirby, Thomas A. Searles, Xin Qiu, Kody J. H. Law
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models Authors: Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu
TSPulse: Dual Space Tiny Pre-Trained Models for Rapid Time-Series Analysis Authors: Vijay Ekambaram, Subodh Kumar, Arindam Jati, Sumanta Mukherjee, Tomoya Sakai, Pankaj Dayama, Wesley M. Gifford, Jayant Kalagnanam
Learning by solving differential equations Authors: Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Sourabh Medapati, Javier Gonzalvo
CALM-PDE: Continuous and Adaptive Convolutions for Latent Space Modeling of Time-dependent PDEs Authors: Jan Hagnberger, Daniel Musekamp, Mathias Niepert
On the Thinking-Language Modeling Gap in Large Language Models Authors: Chenxi Liu, Yongqiang Chen, Tongliang Liu, James Cheng, Bo Han, Kun Zhang
Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning Authors: Xiaoyu Yang, Jie Lu, En Yu
Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities Authors: Lili Zhang, Haomiaomiao Wang, Long Cheng, Libao Deng, Tomas Ward
Fractured Chain-of-Thought Reasoning Authors: Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong
Deep Unfolding with Kernel-based Quantization in MIMO Detection Authors: Zeyi Ren, Jingreng Lei, Yichen Jin, Ermo Hua, Qingfeng Lin, Chen Zhang, Bowen Zhou, Yik-Chung Wu
Parallel Layer Normalization for Universal Approximation Authors: Yunhao Ni, Yuhe Liu, Wenxin Sun, Yitong Tang, Yuxin Guo, Peilin Feng, Wenjun Wu, Lei Huang
Multi-head Temporal Latent Attention Authors: Keqi Deng, Philip C. Woodland
Identifiability of Nonnegative Tucker Decompositions -- Part I: Theory Authors: Subhayan Saha, Giovanni Barbarino, Nicolas Gillis
$μ$PC: Scaling Predictive Coding to 100+ Layer Networks Authors: Francesco Innocenti, El Mehdi Achour, Christopher L. Buckley
Efficient training for large-scale optical neural network using an evolutionary strategy and attention pruning Authors: Zhiwei Yang, Zeyang Fan, Yihang Lai, Qi Chen, Tian Zhang, Jian Dai, Kun Xu
Fine-tuning Quantized Neural Networks with Zeroth-order Optimization Authors: Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, Kaiyang Zhou
MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion Authors: Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations Authors: Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, Marcus K. Benna

1. Mean Flows for One-step Generative Modeling

ArXiv ID: 2505.13447

Authors: Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He

Abstract: We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.

Comment: Author match

2. Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures

ArXiv ID: 2505.13052

Authors: Tuan Thai, TrungTin Nguyen, Dat Do, Nhat Ho, Christopher Drovandi

Abstract: Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the theoretical understanding of model selection, especially concerning the optimal number of mixture components or experts, remains limited and poses significant challenges. These challenges primarily stem from the inclusion of covariates in both the Gaussian gating functions and expert networks, which introduces intrinsic interactions governed by partial differential equations with respect to their parameters. In this paper, we revisit the concept of dendrograms of mixing measures and introduce a novel extension to Gaussian-gated Gaussian MoE models that enables consistent estimation of the true number of mixture components and achieves the pointwise optimal convergence rate for parameter estimation in overfitted scenarios. Notably, this approach circumvents the need to train and compare a range of models with varying numbers of components, thereby alleviating the computational burden, particularly in high-dimensional or deep neural network settings. Experimental results on synthetic data demonstrate the effectiveness of the proposed method in accurately recovering the number of experts. It outperforms common criteria such as the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood, while achieving optimal convergence rates for parameter estimation and accurately approximating the regression function.

Comment: The paper introduces a novel method for model selection in Mixture of Experts, which is highly relevant to model architecture.

Relevance: 10 Novelty: 8

3. Unpacking Positional Encoding in Transformers: A Spectral Analysis of Content-Position Coupling

ArXiv ID: 2505.13027

Authors: Zihan Gu, Han Zhang, Ruoyu Chen, Yue Hu, Hua Zhang

Abstract: Positional encoding (PE) is essential for enabling Transformers to model sequential structure. However, the mechanisms by which different PE schemes couple token content and positional information-and how these mechanisms influence model dynamics-remain theoretically underexplored. In this work, we present a unified framework that analyzes PE through the spectral properties of Toeplitz and related matrices derived from attention logits. We show that multiplicative content-position coupling-exemplified by Rotary Positional Encoding (RoPE) via a Hadamard product with a Toeplitz matrix-induces spectral contraction, which theoretically improves optimization stability and efficiency. Guided by this theory, we construct synthetic tasks that contrast content-position dependent and content-position independent settings, and evaluate a range of PE methods. Our experiments reveal strong alignment with theory: RoPE consistently outperforms other methods on position-sensitive tasks and induces "single-head deposit" patterns in early layers, indicating localized positional processing. Further analyses show that modifying the method and timing of PE coupling, such as MLA in Deepseek-V3, can effectively mitigate this concentration. These results establish explicit content-relative mixing with relative-position Toeplitz signals as a key principle for effective PE design and provide new insight into how positional structure is integrated in Transformer architectures.

Comment: The paper provides a spectral analysis of positional encoding in Transformers, offering insights into model architecture and positional encoding mechanisms.

Relevance: 9 Novelty: 8

4. Dense Communication between Language Models

ArXiv ID: 2505.12741

Authors: Shiguang Wu, Yaqing Wang, Quanming Yao

Abstract: As higher-level intelligence emerges from the combination of modular components with lower-level intelligence, many works combines Large Language Models (LLMs) for collective intelligence. Such combination is achieved by building communications among LLMs. While current systems primarily facilitate such communication through natural language, this paper proposes a novel paradigm of direct dense vector communication between LLMs. Our approach eliminates the unnecessary embedding and de-embedding steps when LLM interact with another, enabling more efficient information transfer, fully differentiable optimization pathways, and exploration of capabilities beyond human heuristics. We use such stripped LLMs as vertexes and optimizable seq2seq modules as edges to construct LMNet, with similar structure as MLPs. By utilizing smaller pre-trained LLMs as vertexes, we train a LMNet that achieves comparable performance with LLMs in similar size with only less than 0.1% training cost. This offers a new perspective on scaling for general intelligence rather than training a monolithic LLM from scratch. Besides, the proposed method can be used for other applications, like customizing LLM with limited data, showing its versatility.

Comment: The paper introduces a novel paradigm of direct dense vector communication between LLMs, which aligns with the core topic of Large Language Models and offers a new perspective on scaling for general intelligence.

Relevance: 9 Novelty: 8

5. Information Science Principles of Machine Learning: A Causal Chain Meta-Framework Based on Formalized Information Mapping

ArXiv ID: 2505.13182

Authors: Jianfeng Xu

Abstract: [Objective] This study focuses on addressing the current lack of a unified formal theoretical framework in machine learning, as well as the deficiencies in interpretability and ethical safety assurance. [Methods] A formal information model is first constructed, utilizing sets of well-formed formulas to explicitly define the ontological states and carrier mappings of typical components in machine learning. Learnable and processable predicates, along with learning and processing functions, are introduced to analyze the logical deduction and constraint rules of the causal chains within models. [Results] A meta-framework for machine learning theory (MLT-MF) is established. Based on this framework, universal definitions for model interpretability and ethical safety are proposed. Furthermore, three key theorems are proved: the equivalence of model interpretability and information recoverability, the assurance of ethical safety, and the estimation of generalization error. [Limitations] The current framework assumes ideal conditions with noiseless information-enabling mappings and primarily targets model learning and processing logic in static scenarios. It does not yet address information fusion and conflict resolution across ontological spaces in multimodal or multi-agent systems. [Conclusions] This work overcomes the limitations of fragmented research and provides a unified theoretical foundation for systematically addressing the critical challenges currently faced in machine learning.

Comment: The paper presents a meta-framework for machine learning theory, addressing interpretability and ethical safety, which aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 8

6. Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds

ArXiv ID: 2505.13614

Authors: Ke Sun

Abstract: The high dimensional parameter space of modern deep neural networks -- the neuromanifold -- is endowed with a unique metric tensor defined by the Fisher information, estimating which is crucial for both theory and practical methods in deep learning. To analyze this tensor for classification networks, we return to a low dimensional space of probability distributions -- the core space -- and carefully analyze the spectrum of its Riemannian metric. We extend our discoveries there into deterministic bounds of the metric tensor on the neuromanifold. We introduce an unbiased random estimate of the metric tensor and its bounds based on Hutchinson's trace estimator. It can be evaluated efficiently through a single backward pass and can be used to estimate the diagonal, or block diagonal, or the full tensor. Its quality is guaranteed with a standard deviation bounded by the true value up to scaling.

Comment: The paper provides deterministic bounds and random estimates of metric tensors on neuromanifolds, contributing to theoretical insights in representation learning.

Relevance: 9 Novelty: 8

7. Causality-Inspired Robustness for Nonlinear Models via Representation Learning

ArXiv ID: 2505.12868

Authors: Marin Šola, Peter Bühlmann, Xinwei Shen

Abstract: Distributional robustness is a central goal of prediction algorithms due to the prevalent distribution shifts in real-world data. The prediction model aims to minimize the worst-case risk among a class of distributions, a.k.a., an uncertainty set. Causality provides a modeling framework with a rigorous robustness guarantee in the above sense, where the uncertainty set is data-driven rather than pre-specified as in traditional distributional robustness optimization. However, current causality-inspired robustness methods possess finite-radius robustness guarantees only in the linear settings, where the causal relationships among the covariates and the response are linear. In this work, we propose a nonlinear method under a causal framework by incorporating recent developments in identifiable representation learning and establish a distributional robustness guarantee. To our best knowledge, this is the first causality-inspired robustness method with such a finite-radius robustness guarantee in nonlinear settings. Empirical validation of the theoretical findings is conducted on both synthetic data and real-world single-cell data, also illustrating that finite-radius robustness is crucial.

Comment: The paper introduces a causality-inspired robustness method for nonlinear models via representation learning, contributing to theoretical insights in representation learning.

Relevance: 9 Novelty: 8

8. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

ArXiv ID: 2505.13109

Authors: Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao

Abstract: Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13$\times$ speedup compared to SOTA KV retrieval methods.

Comment: The paper introduces FreeKV, a framework for efficient KV cache retrieval in LLMs, which aligns with the interest in model compression and efficiency.

Relevance: 9 Novelty: 8

9. Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

ArXiv ID: 2505.13563

Authors: Xiaohui Wang, Peng Ye, Chenyu Huang, Shenghe Zheng, Bo Zhang, Wanli Ouyang, Tao Chen

Abstract: With the rise of the fine-tuned--pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 133x, (b) general NLP models (RoBERTa-base, T5-base) with up to 800x, (c) vision models (ViT-B/32, ViT-L/14) with up to 400x, and (d) multi-modal models (BEiT-3) with 40x compression ratio, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression.

Comment: The paper introduces UltraDelta, a data-free delta compression pipeline, which is relevant to model compression with a focus on sparsity and quantization.

Relevance: 9 Novelty: 8

10. Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks

ArXiv ID: 2505.13230

Authors: Francesco D'Amico, Dario Bocchi, Matteo Negri

Abstract: Scaling laws in deep learning - empirical power-law relationships linking model performance to resource growth - have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training or on the optimal training time given the model size. In this work, we uncover a richer picture by analyzing the entire training dynamics through the lens of spectral complexity norms. We identify two novel dynamical scaling laws that govern how performance evolves during training. These laws together recover the well-known test error scaling at convergence, offering a mechanistic explanation of generalization emergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a solvable model: a single-layer perceptron trained with binary cross-entropy. In this setting, we show that the growth of spectral complexity driven by the implicit bias mirrors the generalization behavior observed at fixed norm, allowing us to connect the performance dynamics to classical learning rules in the perceptron.

Comment: The paper uncovers dynamical scaling laws in learning curves, providing insights into representation learning and training dynamics.

Relevance: 9 Novelty: 8

11. Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)

ArXiv ID: 2505.13416

Authors: Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, Peter Richtárik

Abstract: Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a decade of $\sf Adam$'s dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called $\sf Gluon$, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of $\sf Muon$ and $\sf Scion$, and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported by Pethick et al. (2025). Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.

Comment: The paper introduces a new LMO-based optimization method for LLMs, addressing theoretical gaps and improving practical performance, which is relevant to LLM architecture and optimization.

Relevance: 9 Novelty: 8

12. A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

ArXiv ID: 2505.12781

Authors: Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu

Abstract: Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens--while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.

Comment: The paper introduces a low-rank approach for efficient knowledge distillation, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

13. Understanding Task Representations in Neural Networks via Bayesian Ablation

ArXiv ID: 2505.13742

Authors: Andrew Nam, Declan Campbell, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie

Abstract: Neural networks are powerful tools for cognitive modeling due to their flexibility and emergent properties. However, interpreting their learned representations remains challenging due to their sub-symbolic semantics. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units to infer their causal contributions to task performance. Using ideas from information theory, we propose a suite of tools and metrics to illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.

Comment: The paper introduces a novel probabilistic framework for interpreting latent task representations in neural networks, aligning with representation learning.

Relevance: 9 Novelty: 8

14. A3 : an Analytical Low-Rank Approximation Framework for Attention

ArXiv ID: 2505.12942

Authors: Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao

Abstract: Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose $\tt A^\tt 3$, a post-training low-rank approximation framework. $\tt A^\tt 3$ splits a Transformer layer into three functional components, namely $\tt QK$, $\tt OV$, and $\tt MLP$. For each component, $\tt A^\tt 3$ provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component's functional loss ($\it i.e.$, error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. In addition, it provides a new narrative in advancing the optimization problem from singular linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that $\tt A^\tt 3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also demonstrate the versatility of $\tt A^\tt 3$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.

Comment: The paper presents a low-rank approximation framework for attention in Transformers, relevant to model compression and architecture.

Relevance: 9 Novelty: 8

15. Exploring Federated Pruning for Large Language Models

ArXiv ID: 2505.13547

Authors: Pengxin Guo, Yinong Wang, Wei Li, Mengting Liu, Ming Li, Jinkai Zheng, Liangqiong Qu

Abstract: LLM pruning has emerged as a promising technology for compressing LLMs, enabling their deployment on resource-limited devices. However, current methodologies typically require access to public calibration samples, which can be challenging to obtain in privacy-sensitive domains. To address this issue, we introduce FedPrLLM, a comprehensive federated pruning framework designed for the privacy-preserving compression of LLMs. In FedPrLLM, each client only needs to calculate a pruning mask matrix based on its local calibration data and share it with the server to prune the global model. This approach allows for collaborative pruning of the global model with the knowledge of each client while maintaining local data privacy. Additionally, we conduct extensive experiments to explore various possibilities within the FedPrLLM framework, including different comparison groups, pruning strategies, and the decision to scale weights. Our extensive evaluation reveals that one-shot pruning with layer comparison and no weight scaling is the optimal choice within the FedPrLLM framework. We hope our work will help guide future efforts in pruning LLMs in privacy-sensitive fields. Our code is available at https://github.com/Pengxin-Guo/FedPrLLM.

Comment: The paper discusses federated pruning for LLMs, which aligns with the interest in model compression and efficiency.

Relevance: 9 Novelty: 7

16. Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks?

ArXiv ID: 2505.12871

Authors: Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Ronghua Li

Abstract: Low rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior upon training-time attacks remain underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA's low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA's training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA's low rank structure and its vulnerability against training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, while becomes more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations have corroborated our theoretical findings.

Comment: The paper investigates the robustness of Low Rank Adaptation (LoRA) in LLMs, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 7

17. Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers

ArXiv ID: 2505.13737

Authors: Andrew Nam, Henry Conklin, Yukang Yang, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie

Abstract: We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy - facilitating, interfering, or irrelevant - based on their impact on task performance. Unlike prior approaches in mechanistic interpretability, which are hypothesis-driven and require prompt templates or target labels, CHG applies directly to any dataset using standard next-token prediction. We evaluate CHG across multiple large language models (LLMs) in the Llama 3 model family and diverse tasks, including syntax, commonsense, and mathematical reasoning, and show that CHG scores yield causal - not merely correlational - insight, validated via ablation and causal mediation analyses. We also introduce contrastive CHG, a variant that isolates sub-circuits for specific task components. Our findings reveal that LLMs contain multiple sparse, sufficient sub-circuits, that individual head roles depend on interactions with others (low modularity), and that instruction following and in-context learning rely on separable mechanisms.

Comment: The paper introduces causal head gating for interpreting attention heads in transformers, which is relevant to model architecture and interpretability.

Relevance: 9 Novelty: 7

18. Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference

ArXiv ID: 2505.13345

Authors: Shuqing Luo, Pingzhi Li, Jie Peng, Hanrui Wang, Yang, Zhao, Yu, Cao, Yu Cheng, Tianlong Chen

Abstract: Mixture-of-experts (MoE) architectures could achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over $40\%$ runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them "collaborated", which comprises $2$ cases as intra- and inter-collaboration, depending on whether they are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. It motivates us to strategically optimize collaborative communication for accelerated MoE training and inference, dubbed Occult. Our designs are capable of either delivering exact results with reduced communication cost or controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that Occult can be faster than popular state-of-the-art inference or training frameworks (more than $1.5\times$ speed up across multiple tasks and models) with comparable or superior quality compared to the standard fine-tuning. Code is available at $\href{https://github.com/UNITES-Lab/Occult}{https://github.com/UNITES-Lab/Occult}$.

Comment: The paper addresses communication optimization in Mixture-of-Experts (MoE) architectures, relevant to model architecture.

Relevance: 9 Novelty: 7

19. RGNMR: A Gauss-Newton method for robust matrix completion with theoretical guarantees

ArXiv ID: 2505.12919

Authors: Eilon Vaknin Laufer, Boaz Nadler

Abstract: Recovering a low rank matrix from a subset of its entries, some of which may be corrupted, is known as the robust matrix completion (RMC) problem. Existing RMC methods have several limitations: they require a relatively large number of observed entries; they may fail under overparametrization, when their assumed rank is higher than the correct one; and many of them fail to recover even mildly ill-conditioned matrices. In this paper we propose a novel RMC method, denoted $\texttt{RGNMR}$, which overcomes these limitations. $\texttt{RGNMR}$ is a simple factorization-based iterative algorithm, which combines a Gauss-Newton linearization with removal of entries suspected to be outliers. On the theoretical front, we prove that under suitable assumptions, $\texttt{RGNMR}$ is guaranteed exact recovery of the underlying low rank matrix. Our theoretical results improve upon the best currently known for factorization-based methods. On the empirical front, we show via several simulations the advantages of $\texttt{RGNMR}$ over existing RMC methods, and in particular its ability to handle a small number of observed entries, overparameterization of the rank and ill-conditioned matrices.

Comment: The paper introduces a novel method for robust matrix completion with theoretical guarantees, relevant to model compression through low-rank approaches.

Relevance: 8 Novelty: 8

20. Learning (Approximately) Equivariant Networks via Constrained Optimization

ArXiv ID: 2505.13631

Authors: Andrei Manolache, Luiz F. O. Chamon, Mathias Niepert

Abstract: Equivariant neural networks are designed to respect symmetries through their architecture, boosting generalization and sample efficiency when those symmetries are present in the data distribution. Real-world data, however, often departs from perfect symmetry because of noise, structural variation, measurement bias, or other symmetry-breaking effects. Strictly equivariant models may struggle to fit the data, while unconstrained models lack a principled way to leverage partial symmetries. Even when the data is fully symmetric, enforcing equivariance can hurt training by limiting the model to a restricted region of the parameter space. Guided by homotopy principles, where an optimization problem is solved by gradually transforming a simpler problem into a complex one, we introduce Adaptive Constrained Equivariance (ACE), a constrained optimization approach that starts with a flexible, non-equivariant model and gradually reduces its deviation from equivariance. This gradual tightening smooths training early on and settles the model at a data-driven equilibrium, balancing between equivariance and non-equivariance. Across multiple architectures and tasks, our method consistently improves performance metrics, sample efficiency, and robustness to input perturbations compared with strictly equivariant models and heuristic equivariance relaxations.

Comment: The paper introduces a method for learning approximately equivariant networks, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

21. Neural Functional: Learning Function to Scalar Maps for Neural PDE Surrogates

ArXiv ID: 2505.13275

Authors: Anthony Zhou, Amir Barati Farimani

Abstract: Many architectures for neural PDE surrogates have been proposed in recent years, largely based on neural networks or operator learning. In this work, we derive and propose a new architecture, the Neural Functional, which learns function to scalar mappings. Its implementation leverages insights from operator learning and neural fields, and we show the ability of neural functionals to implicitly learn functional derivatives. For the first time, this allows for an extension of Hamiltonian mechanics to neural PDE surrogates by learning the Hamiltonian functional and optimizing its functional derivatives. We demonstrate that the Hamiltonian Neural Functional can be an effective surrogate model through improved stability and conserving energy-like quantities on 1D and 2D PDEs. Beyond PDEs, functionals are prevalent in physics; functional approximation and learning with its gradients may find other uses, such as in molecular dynamics or design optimization.

Comment: The paper introduces a new architecture for neural PDE surrogates, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

22. ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data

ArXiv ID: 2505.12638

Authors: Yifeng Jiao, Yuchen Liu, Yu Zhang, Xin Guo, Yushuai Wu, Chen Jiang, Jiyang Li, Hongwei Zhang, Limei Han, Xin Gao, Yuan Qi, Yuan Cheng

Abstract: The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.

Comment: The paper presents ChromFound, a foundation model for single-cell chromatin accessibility data, which is relevant to AI for Science as it offers a new framework for understanding disease risk variants.

Relevance: 8 Novelty: 8

23. Hardware-Adaptive and Superlinear-Capacity Memristor-based Associative Memory

ArXiv ID: 2505.12960

Authors: Chengping He, Mingrui Jiang, Keyi Shan, Szu-Hao Yang, Zefan Li, Shengbo Wang, Giacomo Pedretti, Jim Ignowski, Can Li

Abstract: Brain-inspired computing aims to mimic cognitive functions like associative memory, the ability to recall complete patterns from partial cues. Memristor technology offers promising hardware for such neuromorphic systems due to its potential for efficient in-memory analog computing. Hopfield Neural Networks (HNNs) are a classic model for associative memory, but implementations on conventional hardware suffer from efficiency bottlenecks, while prior memristor-based HNNs faced challenges with vulnerability to hardware defects due to offline training, limited storage capacity, and difficulty processing analog patterns. Here we introduce and experimentally demonstrate on integrated memristor hardware a new hardware-adaptive learning algorithm for associative memories that significantly improves defect tolerance and capacity, and naturally extends to scalable multilayer architectures capable of handling both binary and continuous patterns. Our approach achieves 3x effective capacity under 50% device faults compared to state-of-the-art methods. Furthermore, its extension to multilayer architectures enables superlinear capacity scaling ((\propto N^{1.49}\ for binary patterns) and effective recalling of continuous patterns (\propto N^{1.74}\ scaling), as compared to linear capacity scaling for previous HNNs. It also provides flexibility to adjust capacity by tuning hidden neurons for the same-sized patterns. By leveraging the massive parallelism of the hardware enabled by synchronous updates, it reduces energy by 8.8x and latency by 99.7% for 64-dimensional patterns over asynchronous schemes, with greater improvements at scale. This promises the development of more reliable memristor-based associative memory systems and enables new applications research due to the significantly improved capacity, efficiency, and flexibility.

Comment: The paper presents a memristor-based associative memory system, which is relevant to Model Architecture as it introduces a new hardware-adaptive learning algorithm for associative memories.

Relevance: 8 Novelty: 8

24. A Path to Universal Neural Cellular Automata

ArXiv ID: 2505.13058

Authors: Gabriel Béna, Maxence Faldor, Dan F. M. Goodman, Antoine Cully

Abstract: Cellular automata have long been celebrated for their ability to generate complex behaviors from simple, local rules, with well-known discrete models like Conway's Game of Life proven capable of universal computation. Recent advancements have extended cellular automata into continuous domains, raising the question of whether these systems retain the capacity for universal computation. In parallel, neural cellular automata have emerged as a powerful paradigm where rules are learned via gradient descent rather than manually designed. This work explores the potential of neural cellular automata to develop a continuous Universal Cellular Automaton through training by gradient descent. We introduce a cellular automaton model, objective functions and training strategies to guide neural cellular automata toward universal computation in a continuous setting. Our experiments demonstrate the successful training of fundamental computational primitives - such as matrix multiplication and transposition - culminating in the emulation of a neural network solving the MNIST digit classification task directly within the cellular automata state. These results represent a foundational step toward realizing analog general-purpose computers, with implications for understanding universal computation in continuous dynamics and advancing the automated discovery of complex cellular automata behaviors via machine learning.

Comment: The paper explores neural cellular automata for universal computation, which is relevant to emerging trends in foundational research.

Relevance: 8 Novelty: 8

25. Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

ArXiv ID: 2505.13775

Authors: Kaya Stechly, Karthik Valmeekam, Atharva Gundawar, Vardhan Palod, Subbarao Kambhampati

Abstract: Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens-often anthropomorphized as "thoughts" or reasoning traces and which are claimed to display behaviors like backtracking, self-verification etc.-actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver (in our case, A* search). By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences the former. We notice that, despite significant improvements on the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases can improve upon it and generalize more robustly on out-of-distribution tasks. These results challenge the assumption that intermediate tokens or "Chains of Thought" induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.

Comment: The paper challenges the effectiveness of intermediate tokens in reasoning models, which is relevant to representation learning and LLM behavior.

Relevance: 8 Novelty: 8

26. Sinusoidal Initialization, Time for a New Start

ArXiv ID: 2505.12909

Authors: Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

Abstract: Initialization plays a critical role in Deep Neural Network training, directly influencing convergence, stability, and generalization. Common approaches such as Glorot and He initializations rely on randomness, which can produce uneven weight distributions across layer connections. In this paper, we introduce the Sinusoidal initialization, a novel deterministic method that employs sinusoidal functions to construct structured weight matrices expressly to improve the spread and balance of weights throughout the network while simultaneously fostering a more uniform, well-conditioned distribution of neuron activation states from the very first forward pass. Because Sinusoidal initialization begins with weights and activations that are already evenly and efficiently utilized, it delivers consistently faster convergence, greater training stability, and higher final accuracy across a wide range of models, including convolutional neural networks, vision transformers, and large language models. On average, our experiments show an increase of 4.9% in final validation accuracy and 20.9% in convergence speed. By replacing randomness with structure, this initialization provides a stronger and more reliable foundation for Deep Learning systems.

Comment: The paper introduces Sinusoidal initialization, a novel deterministic method for neural network training, which is relevant to representation learning.

Relevance: 8 Novelty: 8

27. A Minimum Description Length Approach to Regularization in Neural Networks

ArXiv ID: 2505.13398

Authors: Matan Abudy, Orr Well, Emmanuel Chemla, Roni Katzir, Nur Lan

Abstract: State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.

Comment: The paper introduces a theoretically grounded regularization method using the Minimum Description Length principle, relevant to representation learning and model training dynamics.

Relevance: 8 Novelty: 8

28. KHRONOS: a Kernel-Based Neural Architecture for Rapid, Resource-Efficient Scientific Computation

ArXiv ID: 2505.13315

Authors: Reza T. Batley, Sourav Saha

Abstract: Contemporary models of high dimensional physical systems are constrained by the curse of dimensionality and a reliance on dense data. We introduce KHRONOS (Kernel Expansion Hierarchy for Reduced Order, Neural Optimized Surrogates), an AI framework for model based, model free and model inversion tasks. KHRONOS constructs continuously differentiable target fields with a hierarchical composition of per-dimension kernel expansions, which are tensorized into modes and then superposed. We evaluate KHRONOS on a canonical 2D, Poisson equation benchmark: across 16 to 512 degrees of freedom (DoFs), it obtained L2 square errors of 5e-4 down to 6e-10. This represents a 100 time gain over Kolmogorov Arnold Networks (which itself reports a 100 times improvement on MLPs/PINNs with 100 times fewer parameters) when controlling for the number of parameters. This also represents a 1e4 times improvement in L2 square error compared to standard linear FEM at comparable DoFs. Inference complexity is dominated by inner products, yielding sub-millisecond full-field predictions that scale to an arbitrary resolution. For inverse problems, KHRONOS facilitates rapid, iterative level set recovery in only a few forward evaluations, with sub-microsecond per sample latency. KHRONOS scalability, expressivity, and interpretability open new avenues in constrained edge computing, online control, computer vision, and beyond.

Comment: The paper introduces a new kernel-based neural architecture for scientific computation, which is relevant to AI for Science and model architecture.

Relevance: 8 Novelty: 8

29. When majority rules, minority loses: bias amplification of gradient descent

ArXiv ID: 2505.13122

Authors: François Bachoc, Jérôme Bolte, Ryan Boustany, Jean-Michel Loubes

Abstract: Despite growing empirical evidence of bias amplification in machine learning, its theoretical foundations remain poorly understood. We develop a formal framework for majority-minority learning tasks, showing how standard training can favor majority groups and produce stereotypical predictors that neglect minority-specific features. Assuming population and variance imbalance, our analysis reveals three key findings: (i) the close proximity between ``full-data'' and stereotypical predictors, (ii) the dominance of a region where training the entire model tends to merely learn the majority traits, and (iii) a lower bound on the additional training required. Our results are illustrated through experiments in deep learning for tabular and image classification tasks.

Comment: The paper provides theoretical insights into bias amplification in gradient descent, which is relevant to understanding training dynamics in neural networks.