Personalized Daily ArXiv Papers 2025-06-17

[gpt-4o]	Prompt	Completion	Total
Token	61027	8213	69240
Cost	$0.15	$0.08	$0.23

Total arXiv papers: 1037

Total scanned papers: 629

Total relevant papers: 47

Table of contents with paper titles:

Evolutionary Developmental Biology Can Serve as the Conceptual Foundation for a New Design Paradigm in Artificial Intelligence Authors: Zeki Doruk Erden, Boi Faltings
AlphaEvolve: A coding agent for scientific and algorithmic discovery Authors: Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, Matej Balog
Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models Authors: Zhenyu Liao, Michael W. Mahoney
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models Authors: Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao
Automatic Expert Discovery in LLM Upcycling via Sparse Interpolated Mixture-of-Experts Authors: Shengzhuang Chen, Ying Wei, Jonathan Richard Schwarz
What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers Authors: Pulkit Gopalani, Wei Hu
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs Authors: Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization Authors: Guanghui Song, Dongping Liao, Yiren Zhao, Kejiang Ye, Cheng-zhong Xu, Xitong Gao
Align-then-Unlearn: Embedding Alignment for LLM Unlearning Authors: Philipp Spohn, Leander Girrbach, Jessica Bader, Zeynep Akata
Training Neural Networks by Optimizing Neuron Positions Authors: Laura Erb, Tommaso Boccato, Alexandru Vasilache, Juergen Becker, Nicola Toschi
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining Authors: Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran
PDEfuncta: Spectrally-Aware Neural Representation for PDE Solution Modeling Authors: Minju Jo, Woojin Cho, Uvini Balasuriya Mudiyanselage, Seungjun Lee, Noseong Park, Kookjin Lee
Meta Pruning via Graph Metanetworks: A Meta Learning Framework for Network Pruning Authors: Yewei Liu, Xiyuan Wang, Muhan Zhang
Constant Bit-size Transformers Are Turing Complete Authors: Qian Li, Yuyi Wang
Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization Authors: Badr AlKhamissi, C. Nicolò De Sabbata, Zeming Chen, Martin Schrimpf, Antoine Bosselut
Contrastive Self-Supervised Learning As Neural Manifold Packing Authors: Guanming Zhang, David J. Heeger, Stefano Martiniani
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention Authors: MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jian Sun, Jiaqi Zhuang, Jiaren Cai, Jiayuan Song, Jin Zhu, Jingyang Li, Jinhao Tian, Jinli Liu, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kaiyi Feng, Ke Yang, Kecheng Xiao, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Li, Lin Zheng, Linge Du, Lingyu Yang, Lunbin Zeng, Minghui Yu, Mingliang Tao, Mingyuan Chi, Mozhi Zhang, Mujie Lin, Nan Hu, Nongyu Di, Peng Gao, Pengfei Li, Pengyu Zhao, Qibing Ren, Qidi Xu, Qile Li, Qin Wang, Rong Tian, Ruitao Leng, Shaoxiang Chen, Shaoyu Chen, Shengmin Shi, Shitong Weng, Shuchang Guan, Shuqi Yu, Sichen Li, Songquan Zhu, Tengfei Li, Tianchi Cai, Tianrun Liang, Weiyu Cheng, Weize Kong, Wenkai Li, Xiancai Chen, Xiangjun Song, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xinzhu Hou, Xuan Lu, Xun Zou, Xuyang Shen, Yan Gong, Yan Ma, Yang Wang, Yiqi Shi, Yiran Zhong, Yonghong Duan, Yongxiang Fu, Yongyi Hu, Yu Gao, Yuanxiang Fan, Yufeng Yang, Yuhao Li, Yulin Hu, Yunan Huang, Yunji Li, Yunzhi Xu, Yuxin Mao, Yuxuan Shi, Yuze Wenren, Zehan Li, Zelin Li, Zhanxu Tian, Zhengmao Zhu, Zhenhua Fan, Zhenzhen Wu, Zhichao Xu, Zhihang Yu, Zhiheng Lyu, Zhuo Jiang, Zibo Gao, Zijia Wu, Zijian Song, Zijun Sun
Variational Learning Finds Flatter Solutions at the Edge of Stability Authors: Avrajit Ghosh, Bai Cong, Rio Yokota, Saiprasad Ravishankar, Rongrong Wang, Molei Tao, Mohammad Emtiyaz Khan, Thomas Möllenhoff
Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources? Authors: Houyi Li, Ka Man Lo, Ziqi Wang, Zili Wang, Wenzhen Zheng, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Distributional Training Data Attribution Authors: Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse
QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm Authors: Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen
GeoRecon: Graph-Level Representation Learning for 3D Molecules via Reconstruction-Based Pretraining Authors: Shaoheng Yan, Zian Li, Muhan Zhang
Global Convergence of Adjoint-Optimized Neural PDEs Authors: Konstantin Riedl, Justin Sirignano, Konstantinos Spiliopoulos
GrokAlign: Geometric Characterisation and Acceleration of Grokking Authors: Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
SeqPE: Transformer with Sequential Position Encoding Authors: Huyang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe
Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence Authors: Yibo Yang, Sihao Liu, Chuan Rao, Bang An, Tiancheng Shen, Philip H. S. Torr, Ming-Hsuan Yang, Bernard Ghanem
Scaling Probabilistic Circuits via Monarch Matrices Authors: Honghua Zhang, Meihua Dang, Benjie Wang, Stefano Ermon, Nanyun Peng, Guy Van den Broeck
LCD: Advancing Extreme Low-Bit Clustering for Large Language Models via Knowledge Distillation Authors: Fangxin Liu, Ning Yang, Junping Zhao, Tao Yang, Haibing Guan, Li Jiang
Efficient Network Automatic Relevance Determination Authors: Hongwei Zhang, Ziqi Ye, Xinyuan Wang, Xin Guo, Zenglin Xu, Yuan Cheng, Zixin Hu, Yuan Qi
FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization Authors: Fangxin Liu, Zongwu Wang, JinHong Xia, Junping Zhao, Jian Liu, Haibing Guan, Li Jiang
Human-like Forgetting Curves in Deep Neural Networks Authors: Dylan Kline
Beyond Sin-Squared Error: Linear-Time Entrywise Uncertainty Quantification for Streaming PCA Authors: Syamantak Kumar, Shourya Pandey, Purnamrita Sarkar
Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model Authors: Chong Li, Yingzhuo Deng, Jiajun Zhang, Chengqing Zong
Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs Authors: Xintong Tang, Meiru Zhang, Shang Xiao, Junzhao Jin, Zihan Zhao, Liwei Li, Yang Zheng, Bangyi Wu
Model Merging for Knowledge Editing Authors: Zichuan Fu, Xian Wu, Guojing Li, Yingying Zhang, Yefeng Zheng, Tianshi Ming, Yejing Wang, Wanyu Wang, Xiangyu Zhao
MEraser: An Effective Fingerprint Erasure Approach for Large Language Models Authors: Jingxuan Zhang, Zhenhua Xu, Rui Hu, Wenpeng Xing, Xuhong Zhang, Meng Han
CALM: Consensus-Aware Localized Merging for Multi-Task Learning Authors: Kunda Yan, Min Zhang, Sen Cui, Zikun Qu, Bo Jiang, Feng Liu, Changshui Zhang
Flexible Realignment of Language Models Authors: Wenhong Zhu, Ruobing Xie, Weinan Zhang, Rui Wang
Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study Authors: Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong
TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices Authors: Mingxue Xu, Yao Lei Xu, Danilo P. Mandic
Multipole Attention for Efficient Long Context Reasoning Authors: Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
Because we have LLMs, we Can and Should Pursue Agentic Interpretability Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord
LARGO: Low-Rank Regulated Gradient Projection for Robust Parameter Efficient Fine-Tuning Authors: Haotian Zhang, Liu Liu, Baosheng Yu, Jiayan Qiu, Yanwei Ren, Xianglong Liu
Understanding Learning Invariance in Deep Linear Networks Authors: Hao Duan, Guido Montúfar
Why Do Some Inputs Break Low-Bit LLM Quantization? Authors: Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
Distinct Computations Emerge From Compositional Curricula in In-Context Learning Authors: Jin Hwa Lee, Andrew K. Lampinen, Aaditya K. Singh, Andrew M. Saxe
Training-free LLM Merging for Multi-task Learning Authors: Zichuan Fu, Xian Wu, Yejing Wang, Wanyu Wang, Shanshan Ye, Hongzhi Yin, Yi Chang, Yefeng Zheng, Xiangyu Zhao

1. Evolutionary Developmental Biology Can Serve as the Conceptual Foundation for a New Design Paradigm in Artificial Intelligence

ArXiv ID: 2506.12891

Authors: Zeki Doruk Erden, Boi Faltings

Abstract: Artificial intelligence (AI), propelled by advancements in machine learning, has made significant strides in solving complex tasks. However, the current neural network-based paradigm, while effective, is heavily constrained by inherent limitations, primarily a lack of structural organization and a progression of learning that displays undesirable properties. As AI research progresses without a unifying framework, it either tries to patch weaknesses heuristically or draws loosely from biological mechanisms without strong theoretical foundations. Meanwhile, the recent paradigm shift in evolutionary understanding -- driven primarily by evolutionary developmental biology (EDB) -- has been largely overlooked in AI literature, despite a striking analogy between the Modern Synthesis and contemporary machine learning, evident in their shared assumptions, approaches, and limitations upon careful analysis. Consequently, the principles of adaptation from EDB that reshaped our understanding of the evolutionary process can also form the foundation of a unifying conceptual framework for the next design philosophy in AI, going beyond mere inspiration and grounded firmly in biology's first principles. This article provides a detailed overview of the analogy between the Modern Synthesis and modern machine learning, and outlines the core principles of a new AI design paradigm based on insights from EDB. To exemplify our analysis, we also present two learning system designs grounded in specific developmental principles -- regulatory connections, somatic variation and selection, and weak linkage -- that resolve multiple major limitations of contemporary machine learning in an organic manner, while also providing deeper insights into the role of these mechanisms in biological evolution.

Comment: The paper discusses a new design paradigm for AI based on evolutionary developmental biology, which is relevant to emerging trends in AI.

Relevance: 9 Novelty: 9

2. AlphaEvolve: A coding agent for scientific and algorithmic discovery

ArXiv ID: 2506.13131

Authors: Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, Matej Balog

Abstract: In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using $48$ scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.

Comment: The paper introduces AlphaEvolve, an evolutionary coding agent for scientific and algorithmic discovery, which is relevant to AI for Science and emerging trends.

Relevance: 9 Novelty: 9
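
For context on the 48-multiplication result in the abstract above: applying Strassen's 2x2 scheme (7 multiplications) recursively to a 4x4 matrix split into 2x2 blocks uses 7 x 7 = 49 scalar multiplications, which is the baseline being improved. The sketch below is not AlphaEvolve's algorithm. It only counts the multiplications made by recursive Strassen, and all function names are illustrative.

```python
# Recursive Strassen multiplication for n x n matrices (n a power of two),
# counting scalar multiplications. Illustrative baseline only; this is
# NOT the 48-multiplication procedure reported in the abstract.

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    """Split a square matrix into four equally sized quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])

def strassen(A, B, counter):
    """Multiply A and B; counter[0] accumulates scalar multiplications."""
    if len(A) == 1:
        counter[0] += 1
        return [[A[0][0] * B[0][0]]]
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    # The seven Strassen block products.
    m1 = strassen(add(a11, a22), add(b11, b22), counter)
    m2 = strassen(add(a21, a22), b11, counter)
    m3 = strassen(a11, sub(b12, b22), counter)
    m4 = strassen(a22, sub(b21, b11), counter)
    m5 = strassen(add(a11, a12), b22, counter)
    m6 = strassen(sub(a21, a11), add(b11, b12), counter)
    m7 = strassen(sub(a12, a22), add(b21, b22), counter)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

def naive(A, B):
    """Schoolbook product, used only to check the result."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Calling strassen on two 4x4 matrices with counter = [0] leaves counter[0] at 49, against 64 for the schoolbook product.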

3. Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models

ArXiv ID: 2506.13139

Authors: Zhenyu Liao, Michael W. Mahoney

Abstract: Modern Machine Learning (ML) and Deep Neural Networks (DNNs) often operate on high-dimensional data and rely on overparameterized models, where classical low-dimensional intuitions break down. In particular, the proportional regime where the data dimension, sample size, and number of model parameters are all large and comparable, gives rise to novel and sometimes counterintuitive behaviors. This paper extends traditional Random Matrix Theory (RMT) beyond eigenvalue-based analysis of linear models to address the challenges posed by nonlinear ML models such as DNNs in this regime. We introduce the concept of High-dimensional Equivalent, which unifies and generalizes both Deterministic Equivalent and Linear Equivalent, to systematically address three technical challenges: high dimensionality, nonlinearity, and the need to analyze generic eigenspectral functionals. Leveraging this framework, we provide precise characterizations of the training and generalization performance of linear models, nonlinear shallow networks, and deep networks. Our results capture rich phenomena, including scaling laws, double descent, and nonlinear learning dynamics, offering a unified perspective on the theoretical understanding of deep learning in high dimensions.

Comment: The paper extends Random Matrix Theory to address challenges in deep learning, aligning with the emerging trends criterion.

Relevance: 9 Novelty: 9

4. MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models

ArXiv ID: 2506.12876

Authors: Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao

Abstract: The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the super large combinatorial space, we propose a novel update method by introducing a moving average tracker of loss residuals instead of vanilla loss. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.

Comment: The paper introduces a novel linear-space probabilistic framework for achieving (N:M)-sparsity in LLMs, which is relevant to model compression through sparsity and pruning.

Relevance: 9 Novelty: 8
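
The (N:M) constraint in the MaskPro abstract can be illustrated with a small sampler: for every group of M consecutive weights, draw N distinct positions to keep from that group's categorical distribution. This is a minimal sketch under assumptions, not the released MaskPro code. The function name, the flat list layout, and the requirement that all probabilities be strictly positive are choices made here for illustration, and the learning of the distribution itself is not shown.

```python
import random

def sample_nm_mask(probs, n, m, rng):
    """Sample a strict N:M binary mask.

    probs: flat list of positive scores, one per weight; each run of m
           consecutive entries is treated as one categorical distribution
           (assumption: all entries are strictly positive).
    Returns a 0/1 list with exactly n ones in every group of m.
    """
    assert len(probs) % m == 0 and 0 < n <= m
    mask = []
    for start in range(0, len(probs), m):
        weights = list(probs[start:start + m])
        candidates = list(range(m))
        keep = set()
        # N-way sampling without replacement: draw one index at a time
        # and remove it from the candidate pool before the next draw.
        for _ in range(n):
            pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
            keep.add(candidates.pop(pick))
            weights.pop(pick)
        mask.extend(1 if j in keep else 0 for j in range(m))
    return mask
```

With n=2 and m=4 this yields the 2:4 pattern: every block of four mask entries contains exactly two ones, whichever positions are drawn.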

5. Automatic Expert Discovery in LLM Upcycling via Sparse Interpolated Mixture-of-Experts

ArXiv ID: 2506.12597

Authors: Shengzhuang Chen, Ying Wei, Jonathan Richard Schwarz

Abstract: We present Sparse Interpolated Mixture-of-Experts (SIMoE) instruction-tuning, an end-to-end algorithm designed to fine-tune a dense pre-trained Large Language Model (LLM) into a MoE-style model that possesses capabilities in multiple specialized domains. During instruction-tuning, SIMoE automatically identifies multiple specialized experts under a specified sparsity constraint, with each expert representing a structurally sparse subset of the seed LLM's parameters that correspond to domain-specific knowledge within the data. SIMoE simultaneously learns an input-dependent expert merging strategy via a router network, leveraging rich cross-expert knowledge for superior downstream generalization that surpasses existing baselines. Empirically, SIMoE consistently achieves state-of-the-art performance on common instruction-tuning benchmarks while maintaining an optimal performance-compute trade-off compared to all baselines.

Comment: The paper presents a Sparse Interpolated Mixture-of-Experts (SIMoE) method, which is relevant to model architecture innovations, specifically MoE.

Relevance: 9 Novelty: 8

6. What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers

ArXiv ID: 2506.13688

Authors: Pulkit Gopalani, Wei Hu

Abstract: Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representational collapse. We validate that these identified phenomena, repetition bias and representation collapse, are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.

Comment: The paper investigates the training dynamics of Transformers, focusing on the abrupt learning phenomenon and representation collapse, which aligns with representation learning and insights into how deep networks encode information.

Relevance: 9 Novelty: 8
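
The "nearly parallel hidden states" described in the abstract above can be quantified with a simple diagnostic: the mean pairwise cosine similarity across token representations, which approaches 1 as the vectors become parallel. The sketch below is one plausible way to measure this, not necessarily the metric used in the paper. The function names and the plain-list input format are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(hidden_states):
    """Average cosine similarity over all distinct pairs of token vectors.

    hidden_states: list of equal-length vectors, one per token
    (assumption: at least two tokens, no all-zero vectors).
    """
    pairs = [(i, j)
             for i in range(len(hidden_states))
             for j in range(i + 1, len(hidden_states))]
    total = sum(cosine(hidden_states[i], hidden_states[j]) for i, j in pairs)
    return total / len(pairs)
```

Tracking this value per layer over training steps would show whether it stays near 1 during the plateau and drops when the loss falls.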

7. Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs

ArXiv ID: 2506.13727

Authors: Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Abstract: Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments. Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference. In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs. While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss. Our method is especially effective in extracting task-relevant subgraphs -- so-called "circuits" -- which can represent core functions (e.g., indirect object identification). Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). All in all, we gather these techniques as a uniform holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety. Our code is publicly available at https://github.com/erfanhatefi/SparC3.

Comment: The paper presents attribution-guided pruning for LLMs, focusing on model compression and interpretability, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
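
The unstructured pruning step mentioned in the abstract above can be sketched as follows: given one relevance score per weight, zero out the fraction of weights with the lowest scores. This sketch assumes the relevance scores are already available (for example from an LRP pass, which is not implemented here). The function name and the flat-list representation are hypothetical.

```python
def prune_by_relevance(weights, relevance, sparsity):
    """Zero out the least relevant weights.

    weights:   flat list of weight values
    relevance: flat list of attribution scores, same length (assumed given)
    sparsity:  fraction of weights to remove, between 0 and 1
    """
    assert len(weights) == len(relevance)
    num_pruned = int(len(weights) * sparsity)
    # Indices sorted from least to most relevant.
    order = sorted(range(len(weights)), key=lambda i: relevance[i])
    dropped = set(order[:num_pruned])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]
```

Note that the ranking uses relevance, not weight magnitude, so a large weight can be removed if its attribution score is low.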

8. Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

ArXiv ID: 2506.13541

Authors: Guanghui Song, Dongping Liao, Yiren Zhao, Kejiang Ye, Cheng-zhong Xu, Xitong Gao

Abstract: Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding "low-priority" tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-expert (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling proportional resource allocation without token discard; (2) weight-sharing across grouped attention projections to minimize parameter overhead; and (3) an auxiliary loss to ensure one-hot routing decisions for training-inference consistency in CLMs. Extensive evaluations across Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA's superiority over static baselines. On instruction-following and continued pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.

Comment: The paper introduces a novel mixture-of-expert (MoE) approach for dynamic token-wise KV optimization in transformers, which aligns with the model architecture and model compression criteria.

Relevance: 9 Novelty: 8

9. Align-then-Unlearn: Embedding Alignment for LLM Unlearning

ArXiv ID: 2506.13181

Authors: Philipp Spohn, Leander Girrbach, Jessica Bader, Zeynep Akata

Abstract: As large language models (LLMs) are trained on massive datasets, they have raised significant privacy and ethical concerns due to their potential to inadvertently retain sensitive information. Unlearning seeks to selectively remove specific data from trained models, such as personal information or copyrighted content. Current approaches targeting specific output sequences at the token level often fail to achieve complete forgetting and remain susceptible to prompt rephrasing. We propose Align-then-Unlearn, a novel framework that performs unlearning in the semantic embedding space rather than directly on output tokens. Align-then-Unlearn first augments the LLM with an embedding prediction module trained to anticipate future context representations. Unlearning is then achieved by fine-tuning the model to minimize the similarity between these predicted embeddings and a target embedding that represents the concept to be removed. Initial results show that Align-then-Unlearn effectively removes targeted knowledge with minimal degradation in overall model utility. These findings suggest that embedding-based unlearning offers a promising and robust approach to removing conceptual knowledge. Our code is available at https://github.com/ExplainableML/align-then-unlearn.

Comment: The paper proposes a novel framework for unlearning in LLMs using embedding alignment, which is relevant to foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8

10. Training Neural Networks by Optimizing Neuron Positions

ArXiv ID: 2506.13410

Authors: Laura Erb, Tommaso Boccato, Alexandru Vasilache, Juergen Becker, Nicola Toschi

Abstract: The high computational complexity and increasing parameter counts of deep neural networks pose significant challenges for deployment in resource-constrained environments, such as edge devices or real-time systems. To address this, we propose a parameter-efficient neural architecture where neurons are embedded in Euclidean space. During training, their positions are optimized and synaptic weights are determined as the inverse of the spatial distance between connected neurons. These distance-dependent wiring rules replace traditional learnable weight matrices and significantly reduce the number of parameters while introducing a biologically inspired inductive bias: connection strength decreases with spatial distance, reflecting the brain's embedding in three-dimensional space where connections tend to minimize wiring length. We validate this approach for both multi-layer perceptrons and spiking neural networks. Through a series of experiments, we demonstrate that these spatially embedded neural networks achieve a performance competitive with conventional architectures on the MNIST dataset. Additionally, the models maintain performance even at pruning rates exceeding 80% sparsity, outperforming traditional networks with the same number of parameters under similar conditions. Finally, the spatial embedding framework offers an intuitive visualization of the network structure.

Comment: The paper proposes a parameter-efficient neural architecture by optimizing neuron positions, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
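
The wiring rule in the abstract above (synaptic weight equal to the inverse of the distance between connected neurons) can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the 3-D coordinates, the small eps added to avoid division by zero, and the function names are choices made here, and biases are ignored in the parameter counts.

```python
import math

def distance_weights(pre_positions, post_positions, eps=1e-8):
    """Weight matrix with W[j][i] = 1 / (distance(pre_i, post_j) + eps).

    eps is an assumption added here to keep the weight finite when two
    neurons coincide.
    """
    return [[1.0 / (math.dist(p, q) + eps) for p in pre_positions]
            for q in post_positions]

def position_param_count(layer_sizes, dim=3):
    """Trainable coordinates when every neuron has a dim-dimensional position."""
    return sum(layer_sizes) * dim

def dense_param_count(layer_sizes):
    """Trainable weights in conventional fully connected layers (no biases)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
```

For a single 784-to-128 layer, position_param_count([784, 128]) gives 2736 coordinates, while dense_param_count([784, 128]) gives 100352 weights, which shows where the parameter reduction comes from.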

11. AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

ArXiv ID: 2506.13274

Authors: Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran

Abstract: Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes, etc. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose AdaLRS, a plug-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search via optimizing loss descent velocities. We provide experimental results to show that the optimization of training loss and loss descent velocity in foundation model pretraining are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, and base learning rate scheduler choices.

Comment: The paper proposes AdaLRS, an adaptive learning rate search algorithm for foundation model pretraining, which is relevant to foundational research in model training dynamics.

Relevance: 9 Novelty: 8

12. PDEfuncta: Spectrally-Aware Neural Representation for PDE Solution Modeling

ArXiv ID: 2506.12790

Authors: Minju Jo, Woojin Cho, Uvini Balasuriya Mudiyanselage, Seungjun Lee, Noseong Park, Kookjin Lee

Abstract: Scientific machine learning often involves representing complex solution fields that exhibit high-frequency features such as sharp transitions, fine-scale oscillations, and localized structures. While implicit neural representations (INRs) have shown promise for continuous function modeling, capturing such high-frequency behavior remains a challenge, especially when modeling multiple solution fields with a shared network. Prior work addressing spectral bias in INRs has primarily focused on single-instance settings, limiting scalability and generalization. In this work, we propose Global Fourier Modulation (GFM), a novel modulation technique that injects high-frequency information at each layer of the INR through Fourier-based reparameterization. This enables compact and accurate representation of multiple solution fields using low-dimensional latent vectors. Building upon GFM, we introduce PDEfuncta, a meta-learning framework designed to learn multi-modal solution fields and support generalization to new tasks. Through empirical studies on diverse scientific problems, we demonstrate that our method not only improves representational quality but also shows potential for forward and inverse inference tasks without the need for retraining.

Comment: The paper introduces Global Fourier Modulation for neural representations, which is relevant to representation learning and architectural innovations.

Relevance: 9 Novelty: 8

13. Meta Pruning via Graph Metanetworks: A Meta Learning Framework for Network Pruning

ArXiv ID: 2506.12041

Authors: Yewei Liu, Xiyuan Wang, Muhan Zhang

Abstract: Network pruning, aimed at reducing network size while preserving accuracy, has attracted significant research interest. Numerous pruning techniques have been proposed over time. They are becoming increasingly effective, but more complex and harder to interpret as well. Given the inherent complexity of neural networks, we argue that manually designing pruning criteria has reached a bottleneck. To address this, we propose a novel approach in which we "use a neural network to prune neural networks". More specifically, we introduce the newly developed idea of metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically and can transform a network that is hard to prune into one that is much easier to prune. Once the metanetwork is trained, pruning needs nothing more than a feedforward pass through the metanetwork and standard finetuning to reach state-of-the-art results. Our method achieved outstanding results on many popular and representative pruning tasks (including ResNet56 on CIFAR10, VGG19 on CIFAR100, ResNet50 on ImageNet). Our code is available at https://github.com/Yewei-Liu/MetaPruning.

Comment: The paper proposes a novel meta-learning framework for network pruning, which is relevant to model compression and introduces a new approach using metanetworks.

Relevance: 9 Novelty: 8

14. Constant Bit-size Transformers Are Turing Complete

ArXiv ID: 2506.12027

Authors: Qian Li, Yuyi Wang

Abstract: We prove that any Turing machine running on inputs of arbitrary length can be simulated by a constant bit-size transformer, as long as the context window is sufficiently long. This improves previous works, which require scaling up either the model's precision or the number of parameters on longer inputs. Furthermore, we prove that the complexity class SPACE$[s(n)]$ exactly characterizes the expressive power of a constant bit-size transformer with a context window of length $s(n)$. Our approach relies on simulating Post machines, a Turing-complete computational model. Post machines can be modeled as automata equipped with a queue, exhibiting computational behaviors naturally aligned with those of transformers. The behavioral similarity between transformers and Post machines may offer new insights into the mechanisms underlying the reasoning abilities of transformers.

Comment: The paper proves that constant bit-size transformers are Turing complete, providing theoretical insights into transformer models.

Relevance: 9 Novelty: 8
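
Illustration: the abstract describes Post machines as automata equipped with a queue. The sketch below is a toy queue automaton in Python, not the paper's construction. The rule table, the state names, the "#" end marker, and the choice of language (strings of the form a^n b^n) are all assumptions made for illustration. Each step reads one symbol from the front of the queue and may append symbols to the back.

```python
from collections import deque

# Hypothetical rule table: (state, symbol read from the queue front)
# -> (next state, symbols appended to the queue back).
# A missing entry means the machine rejects.
RULES = {
    ("start", "#"): ("accept", ""),     # nothing left to match
    ("start", "a"): ("skip_a", ""),     # consume one leading 'a'
    ("skip_a", "a"): ("skip_a", "a"),   # rotate remaining 'a's to the back
    ("skip_a", "b"): ("copy_b", ""),    # consume the matching 'b'
    ("copy_b", "b"): ("copy_b", "b"),   # rotate remaining 'b's to the back
    ("copy_b", "#"): ("start", "#"),    # end of pass, start the next one
}


def run_post_machine(word, max_steps=10_000):
    """Return True if the toy machine accepts `word` (a^n b^n), else False."""
    queue = deque(word + "#")  # "#" marks the end of the current pass
    state = "start"
    for _ in range(max_steps):
        if state == "accept":
            return True
        if not queue:
            return False
        symbol = queue.popleft()
        rule = RULES.get((state, symbol))
        if rule is None:
            return False
        state, appended = rule
        queue.extend(appended)
    raise RuntimeError("step budget exhausted")
```

Each pass removes one "a" and one "b", so the machine accepts exactly when the counts match and all "a"s precede all "b"s. The only storage is the queue itself.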

15. Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization

ArXiv ID: 2506.13331

Authors: Badr AlKhamissi, C. Nicolò De Sabbata, Zeming Chen, Martin Schrimpf, Antoine Bosselut

Abstract: Human intelligence emerges from the interaction of specialized brain networks, each dedicated to distinct cognitive functions such as language processing, logical reasoning, social understanding, and memory retrieval. Inspired by this biological observation, we introduce the Mixture of Cognitive Reasoners (MiCRo) architecture and training paradigm: a modular transformer-based language model with a training curriculum that encourages the emergence of functional specialization among different modules. Inspired by studies in neuroscience, we partition the layers of a pretrained transformer model into four expert modules, each corresponding to a well-studied cognitive brain network. Our Brain-Like model has three key benefits over the state of the art: First, the specialized experts are highly interpretable and functionally critical, where removing a module significantly impairs performance on domain-relevant benchmarks. Second, our model outperforms comparable baselines that lack specialization on seven reasoning benchmarks. And third, the model's behavior can be steered at inference time by selectively emphasizing certain expert modules (e.g., favoring social over logical reasoning), enabling fine-grained control over the style of its response. Our findings suggest that biologically inspired inductive biases involved in human cognition lead to significant modeling gains in interpretability, performance, and controllability.

Comment: The paper introduces the Mixture of Cognitive Reasoners architecture, which is relevant to model architecture and offers insights into modular reasoning with brain-like specialization.

Relevance: 9 Novelty: 8

16. Contrastive Self-Supervised Learning As Neural Manifold Packing

ArXiv ID: 2506.13717

Authors: Guanming Zhang, David J. Heeger, Stefano Martiniani

Abstract: Contrastive self-supervised learning based on point-wise comparisons has been widely studied for vision tasks. In the visual cortex of the brain, neuronal responses to distinct stimulus classes are organized into geometric structures known as neural manifolds. Accurate classification of stimuli can be achieved by effectively separating these manifolds, akin to solving a packing problem. We introduce Contrastive Learning As Manifold Packing (CLAMP), a self-supervised framework that recasts representation learning as a manifold packing problem. CLAMP introduces a loss function inspired by the potential energy of short-range repulsive particle systems, such as those encountered in the physics of simple liquids and jammed packings. In this framework, each class consists of sub-manifolds embedding multiple augmented views of a single image. The sizes and positions of the sub-manifolds are dynamically optimized by following the gradient of a packing loss. This approach yields interpretable dynamics in the embedding space that parallel jamming physics, and introduces geometrically meaningful hyperparameters within the loss function. Under the standard linear evaluation protocol, which freezes the backbone and trains only a linear classifier, CLAMP achieves competitive performance with state-of-the-art self-supervised models. Furthermore, our analysis reveals that neural manifolds corresponding to different categories emerge naturally and are effectively separated in the learned representation space, highlighting the potential of CLAMP to bridge insights from physics, neural science, and machine learning.

Comment: The paper introduces a novel framework for contrastive self-supervised learning, aligning with the representation learning criterion.

Relevance: 9 Novelty: 8
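
Illustration: the abstract describes a loss inspired by the potential energy of short-range repulsive particle systems, applied to sub-manifolds built from augmented views. The sketch below is one possible reading, not CLAMP's actual loss. It assumes each sub-manifold is summarized by a centroid and an RMS radius, and that overlapping pairs pay a harmonic soft-sphere penalty as commonly used in jamming simulations. The function name `packing_energy` and the `radius_scale` parameter are hypothetical.

```python
import numpy as np


def packing_energy(views, radius_scale=1.0):
    """Soft-sphere overlap energy between sub-manifolds (illustrative only).

    views: array of shape (n_images, n_views, dim) holding the embeddings of
    the augmented views of each image.
    """
    centers = views.mean(axis=1)                        # (n_images, dim)
    spread = ((views - centers[:, None, :]) ** 2).sum(axis=2).mean(axis=1)
    radii = radius_scale * np.sqrt(spread)              # RMS radius per image
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    contact = np.maximum(radii[:, None] + radii[None, :], 1e-12)
    # Short-range repulsion: zero unless two spheres overlap.
    overlap = np.clip(1.0 - dist / contact, 0.0, None)
    upper = np.triu_indices(len(centers), k=1)          # count each pair once
    return float((overlap[upper] ** 2).sum())
```

Under these assumptions, well-separated sub-manifolds contribute nothing, so the energy pushes apart only pairs that overlap. That is the sense in which the interaction is short-range.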

17. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

ArXiv ID: 2506.13585

Authors: MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jian Sun, Jiaqi Zhuang, Jiaren Cai, Jiayuan Song, Jin Zhu, Jingyang Li, Jinhao Tian, Jinli Liu, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kaiyi Feng, Ke Yang, Kecheng Xiao, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Li, Lin Zheng, Linge Du, Lingyu Yang, Lunbin Zeng, Minghui Yu, Mingliang Tao, Mingyuan Chi, Mozhi Zhang, Mujie Lin, Nan Hu, Nongyu Di, Peng Gao, Pengfei Li, Pengyu Zhao, Qibing Ren, Qidi Xu, Qile Li, Qin Wang, Rong Tian, Ruitao Leng, Shaoxiang Chen, Shaoyu Chen, Shengmin Shi, Shitong Weng, Shuchang Guan, Shuqi Yu, Sichen Li, Songquan Zhu, Tengfei Li, Tianchi Cai, Tianrun Liang, Weiyu Cheng, Weize Kong, Wenkai Li, Xiancai Chen, Xiangjun Song, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xinzhu Hou, Xuan Lu, Xun Zou, Xuyang Shen, Yan Gong, Yan Ma, Yang Wang, Yiqi Shi, Yiran Zhong, Yonghong Duan, Yongxiang Fu, Yongyi Hu, Yu Gao, Yuanxiang Fan, Yufeng Yang, Yuhao Li, Yulin Hu, Yunan Huang, Yunji Li, Yunzhi Xu, Yuxin Mao, Yuxuan Shi, Yuze Wenren, Zehan Li, Zelin Li, Zhanxu Tian, Zhengmao Zhu, Zhenhua Fan, Zhenzhen Wu, Zhichao Xu, Zhihang Yu, Zhiheng Lyu, Zhuo Jiang, Zibo Gao, Zijia Wu, Zijian Song, Zijun Sun

Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.

Comment: The paper introduces a hybrid Mixture-of-Experts architecture with a novel attention mechanism, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8
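
Worked arithmetic: the figures quoted in the abstract can be cross-checked with a few lines of Python. The only assumption is that "three weeks" means exactly 21 days of continuous use of all 512 GPUs. The abstract does not state this, so the implied hourly rate is a rough estimate.

```python
# Figures taken from the abstract.
total_params_b = 456.0    # total parameters, in billions
active_params_b = 45.9    # parameters activated per token, in billions
gpus = 512                # H800 GPUs used for RL training
cost_usd = 534_700        # quoted rental cost

# Assumption: "three weeks" = 21 full days on all GPUs.
weeks = 3
gpu_hours = gpus * weeks * 7 * 24

active_fraction = active_params_b / total_params_b   # about 10.1% of weights per token
implied_rate = cost_usd / gpu_hours                   # about $2.07 per GPU-hour

print(f"active fraction: {active_fraction:.2%}")
print(f"GPU-hours: {gpu_hours}")
print(f"implied rate: ${implied_rate:.2f} per GPU-hour")
```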

18. Variational Learning Finds Flatter Solutions at the Edge of Stability

ArXiv ID: 2506.12903

Authors: Avrajit Ghosh, Bai Cong, Rio Yokota, Saiprasad Ravishankar, Rongrong Wang, Molei Tao, Mohammad Emtiyaz Khan, Thomas Möllenhoff

Abstract: Variational Learning (VL) has recently gained popularity for training deep neural networks and is competitive with standard learning methods. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length and marginal likelihood, but there are few tools to unravel the implicit regularization at play. Here, we analyze the implicit regularization of VL through the Edge of Stability (EoS) framework. EoS has previously been used to show that gradient descent can find flat solutions and we extend this result to VL to show that it can find even flatter solutions. This is obtained by controlling the posterior covariance and the number of Monte Carlo samples from the posterior. These results are derived in a similar fashion as the standard EoS literature for deep learning, by first deriving a result for a quadratic problem and then extending it to deep neural networks. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to analyze the EoS dynamics in VL.

Comment: The paper analyzes the implicit regularization of variational learning through the Edge of Stability framework, contributing to understanding training dynamics in neural networks.

Relevance: 9 Novelty: 7
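
Illustration: the abstract says the result is first derived for a quadratic problem. The sketch below covers only the plain gradient descent baseline on a one-dimensional quadratic L(x) = 0.5 * sharpness * x^2. Here each step multiplies x by (1 - lr * sharpness), so the iterates shrink only when sharpness < 2 / lr. It does not reproduce the paper's variational threshold, and the function name is hypothetical.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on L(x) = 0.5 * sharpness * x**2 and return |x|."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x      # dL/dx for the quadratic
        x = x - lr * grad         # equivalent to x *= (1 - lr * sharpness)
    return abs(x)


lr = 0.1                          # stability threshold is 2 / lr = 20
below = gd_on_quadratic(sharpness=19.0, lr=lr)   # contracts toward 0
above = gd_on_quadratic(sharpness=21.0, lr=lr)   # oscillates and grows
```

This threshold is why a fixed learning rate limits the curvature of the solutions gradient descent can settle into. Per the abstract, the variational setting tightens it further through the posterior covariance and the number of Monte Carlo samples.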

19. Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?

ArXiv ID: 2506.12119

Authors: Houyi Li, Ka Man Lo, Ziqi Wang, Zili Wang, Wenzhen Zheng, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

Abstract: Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes the performance. Based on this, we subsequently find that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute and data resource. More importantly, this optimal region remains consistent across different model sizes. Although an additional amount of data turns out to be a trade-off for the enhanced performance, we show that this can be resolved via reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All models will be released publicly.

Comment: The paper investigates whether Mixture-of-Experts models can surpass dense LLMs under equal resources, relevant to model architecture and offering insights into MoE performance.

Relevance: 9 Novelty: 7

20. Distributional Training Data Attribution

ArXiv ID: 2506.12965

Authors: Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse

Abstract: Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming through introducing distributional training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. We demonstrate the practical significance of d-TDA in experiments, e.g. by identifying training examples that drastically change the distribution of some target measurement without necessarily changing the mean. Intriguingly, we also find that influence functions (IFs), a popular but poorly-understood data attribution tool, emerge naturally from our distributional framework as the limit to unrolled differentiation; without requiring restrictive convexity assumptions. This provides a new mathematical motivation for their efficacy in deep learning, and helps to characterise their limitations.

Comment: The paper introduces distributional training data attribution, providing new insights into how randomness affects model outputs, which is relevant to representation learning.

Relevance: 8 Novelty: 8

21. QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

ArXiv ID: 2506.12355

Authors: Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen

Abstract: The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming, hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown a lot of promise in code generation tasks, but struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator or utilize low-level primitives to exploit GPU performance. To address the above challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic and low-level implementation on GPU, and enhance LLMs' understanding of the attention operator. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, the performance of our methods significantly outshines that of vanilla LLMs, achieving a speed-up of up to 35.16x. Besides, our method not only surpasses human-optimized libraries (cuDNN and the official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts.

Comment: The paper introduces QiMeng-Attention, a self-optimizing paradigm for generating high-performance attention operators, which is relevant to model architecture and efficiency.

Relevance: 8 Novelty: 8

22. GeoRecon: Graph-Level Representation Learning for 3D Molecules via Reconstruction-Based Pretraining

ArXiv ID: 2506.13174

Authors: Shaoheng Yan, Zian Li, Muhan Zhang

Abstract: The pretraining-and-finetuning paradigm has driven significant advances across domains, such as natural language processing and computer vision, with representative pretraining paradigms such as masked language modeling and next-token prediction. However, in molecular representation learning, the task design remains largely limited to node-level denoising, which is effective at modeling local atomic environments, yet may be insufficient for capturing the global molecular structure required by graph-level property prediction tasks, such as energy estimation and molecular regression. In this work, we present GeoRecon, a novel graph-level pretraining framework that shifts the focus from individual atoms to the molecule as an integrated whole. GeoRecon introduces a graph-level reconstruction task: during pretraining, the model is trained to generate an informative graph representation capable of accurately guiding reconstruction of the molecular geometry. This encourages the model to learn coherent, global structural features rather than isolated atomic details. Without relying on additional supervision or external data, GeoRecon outperforms node-centric baselines on multiple molecular benchmarks (e.g., QM9, MD17), demonstrating the benefit of incorporating graph-level reconstruction for learning more holistic and geometry-aware molecular embeddings.

Comment: The paper introduces a novel graph-level pretraining framework for molecular representation learning, focusing on global structural features, which is relevant to AI for Science and representation learning.

Relevance: 8 Novelty: 8

23. Global Convergence of Adjoint-Optimized Neural PDEs

ArXiv ID: 2506.13633

Authors: Konstantin Riedl, Justin Sirignano, Konstantinos Spiliopoulos

Abstract: Many engineering and scientific fields have recently become interested in modeling terms in partial differential equations (PDEs) with neural networks. The resulting neural-network PDE model, being a function of the neural network parameters, can be calibrated to available data by optimizing over the PDE using gradient descent, where the gradient is evaluated in a computationally efficient manner by solving an adjoint PDE. These neural-network PDE models have emerged as an important research area in scientific machine learning. In this paper, we study the convergence of the adjoint gradient descent optimization method for training neural-network PDE models in the limit where both the number of hidden units and the training time tend to infinity. Specifically, for a general class of nonlinear parabolic PDEs with a neural network embedded in the source term, we prove convergence of the trained neural-network PDE solution to the target data (i.e., a global minimizer). The global convergence proof poses a unique mathematical challenge that is not encountered in finite-dimensional neural network convergence analyses due to (1) the neural network training dynamics involving a non-local neural network kernel operator in the infinite-width hidden layer limit where the kernel lacks a spectral gap for its eigenvalues and (2) the nonlinearity of the limit PDE system, which leads to a non-convex optimization problem, even in the infinite-width hidden layer limit (unlike in typical neural network training cases where the optimization problem becomes convex in the large neuron limit). The theoretical results are illustrated and empirically validated by numerical studies.

Comment: The paper studies the global convergence of adjoint-optimized neural PDEs, contributing to the theoretical understanding of neural network PDE models, which is relevant to AI for Science.

Relevance: 8 Novelty: 8

24. GrokAlign: Geometric Characterisation and Acceleration of Grokking

ArXiv ID: 2506.12284

Authors: Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

Abstract: A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena like delayed generalisation with the transition of a deep network from a linear to a feature learning regime, and emergent robustness with changes to the network's functional geometry, in particular the arrangement of the so-called linear regions in deep networks employing continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and demonstrate that aligning a network's Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks -- a method we introduce as GrokAlign -- which we show empirically to induce grokking much sooner than more conventional regularizers like weight decay. Moreover, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage: https://thomaswalker1.github.io/blog/grokalign.html and code: https://github.com/ThomasWalker1/grokalign

Comment: The paper provides insights into the training dynamics of deep networks, specifically grokking, which is relevant to representation learning.

Relevance: 8 Novelty: 8
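
Illustration: the abstract speaks of aligning a network's Jacobians with the training data in the sense of cosine similarity. The sketch below computes such a quantity for a one-hidden-layer ReLU network, whose input-output Jacobian at x is W2 · diag(mask) · W1. It is an assumption-laden reading, not the GrokAlign regularizer itself: the network shape, the per-row cosine, and both function names are hypothetical.

```python
import numpy as np


def relu_mlp_jacobian(x, W1, b1, W2):
    """Jacobian of f(x) = W2 @ relu(W1 @ x + b1) with respect to x."""
    mask = (W1 @ x + b1 > 0).astype(float)   # active ReLU units at x
    return W2 @ (mask[:, None] * W1)         # shape (n_outputs, n_inputs)


def jacobian_alignment(x, J, eps=1e-12):
    """Cosine similarity between each Jacobian row and the input x."""
    row_norms = np.linalg.norm(J, axis=1)
    return (J @ x) / (row_norms * np.linalg.norm(x) + eps)
```

A value of 1 for a row means that output is locally most sensitive along the direction of the input itself. A regularizer in this spirit would push these cosines upward over the training set.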

25. SeqPE: Transformer with Sequential Position Encoding

ArXiv ID: 2506.13277

Authors: Huyang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

Abstract: Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, particularly under context length extrapolation, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.

Comment: The paper introduces SeqPE, a novel position encoding framework for transformers, which is relevant to model architecture and efficiency.

Relevance: 8 Novelty: 8
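
Illustration: the abstract says each n-dimensional position index is represented as a symbolic sequence before being fed to a small sequential encoder. The sketch below covers only that first step under assumed conventions: fixed-width digits in a chosen base, with axes concatenated in order. The paper's actual tokenization may differ, and `position_to_tokens` is a hypothetical name.

```python
def position_to_tokens(position, base=10, width=4):
    """Turn an n-dimensional position index into a flat digit sequence.

    position: tuple of non-negative integers, one per axis.
    Each axis is written as `width` digits in `base`, most significant first.
    """
    tokens = []
    for index in position:
        if index < 0:
            raise ValueError("position indices must be non-negative")
        digits = []
        for _ in range(width):
            digits.append(index % base)
            index //= base
        if index:
            raise ValueError("index does not fit in the given width")
        tokens.extend(reversed(digits))
    return tokens
```

For example, `position_to_tokens((12, 7), width=3)` gives `[0, 1, 2, 0, 0, 7]`. Because the vocabulary is just the digits, the same encoder can be applied to 1D text positions and 2D image coordinates. Positions beyond the training range reuse symbols the encoder has already seen.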

26. Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence

ArXiv ID: 2506.13187

Authors: Yibo Yang, Sihao Liu, Chuan Rao, Bang An, Tiancheng Shen, Philip H. S. Torr, Ming-Hsuan Yang, Bernard Ghanem

Abstract: Conventional low-rank adaptation methods build adapters without considering data context, leading to sub-optimal fine-tuning performance and severe forgetting of inherent world knowledge. In this paper, we propose context-oriented decomposition adaptation (CorDA), a novel method that initializes adapters in a task-aware manner. Concretely, we develop context-oriented singular value decomposition, where we collect covariance matrices of input activations for each linear layer using sampled data from the target task, and apply SVD to the product of weight matrix and its corresponding covariance matrix. By doing so, the task-specific capability is compacted into the principal components. Thanks to the task awareness, our method enables two optional adaptation modes, knowledge-preserved mode (KPM) and instruction-previewed mode (IPM), providing flexibility to choose between freezing the principal components to preserve their associated knowledge or adapting them to better learn a new task. We further develop CorDA++ by deriving a metric that reflects the compactness of task-specific principal components, and then introducing dynamic covariance selection and dynamic rank allocation strategies based on the same metric. The two strategies provide each layer with the most representative covariance matrix and a proper rank allocation. Experimental results show that CorDA++ outperforms CorDA by a significant margin. CorDA++ in KPM not only achieves better fine-tuning performance than LoRA, but also mitigates the forgetting of pre-trained knowledge in both large language models and vision language models. For IPM, our method exhibits faster convergence, e.g., 4.5x speedup over QLoRA, and improves adaptation performance in various scenarios, outperforming strong baseline methods. Our method has been integrated into the PEFT library developed by Hugging Face.

Comment: The paper introduces a novel low-rank adaptation method for LLMs, which is relevant to model compression and efficiency.