Previous Day 2025-03-04
Monthly Overview 2025-03
Next Day 2025-03-06

Personalized Daily Arxiv Papers 03/05/2025

[gpt-4o] Prompt Completion Total
Token 49494 7131 56625
Cost $0.12 $0.07 $0.19

Total ArXiv papers: 643

Total scanned papers: 375

Total relevant papers: 35

Table of contents with paper titles:

  1. Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer Authors: Yujiao Yang, Jing Lian, Linhui Li

  2. Deep Learning is Not So Mysterious or Different Authors: Andrew Gordon Wilson

  3. Neural Manifolds and Cognitive Consistency: A New Approach to Memory Consolidation in Artificial Systems Authors: Phuong-Nam Nguyen

  4. A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff Authors: Hao Yu, Xiangyang Ji

  5. Identifying Sensitive Weights via Post-quantization Integral Authors: Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen

  6. Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees Authors: Emma Ceccherini, Ian Gallagher, Andrew Jones, Daniel Lawson

  7. Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization Authors: Leonid Berlyand, Theo Bourdais, Houman Owhadi, Yitzchak Shmalo

  8. CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging Authors: Zongzhen Yang, Binhang Qi, Hailong Sun, Wenrui Long, Ruobing Zhao, Xiang Gao

  9. Weak-to-Strong Generalization Even in Random Feature Networks, Provably Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro

  10. A Theory of Initialisation's Impact on Specialisation Authors: Devon Jarvis, Sebastian Lee, Cl\'ementine Carla Juliette Domin\'e, Andrew M Saxe, Stefano Sarao Mannelli

  11. An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition Authors: Qingsong Wang, Yunfei Qu, Chunfeng Cui, Deren Han

  12. Forgetting Transformer: Softmax Attention with a Forget Gate Authors: Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville

  13. Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression Authors: Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, \'Eric de la Clergerie, Beno\^it Sagot

  14. (How) Do Language Models Track State? Authors: Belinda Z. Li, Zifan Carl Guo, Jacob Andreas

  15. CrystalFramer: Rethinking the Role of Frames for SE(3)-Invariant Crystal Structure Modeling Authors: Yusei Ito, Tatsunori Taniai, Ryo Igarashi, Yoshitaka Ushiku, Kanta Ono

  16. Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs Authors: Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

  17. PaCA: Partial Connection Adaptation for Efficient Fine-Tuning Authors: Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon

  18. The Distributionally Robust Optimization Model of Sparse Principal Component Analysis Authors: Lei Wang, Xin Liu, Xiaojun Chen

  19. Weight transport through spike timing for robust local gradients Authors: Timo Gierlich, Andreas Baumbach, Akos F. Kungl, Kevin Max, Mihai A. Petrovici

  20. A Minimalist Example of Edge-of-Stability and Progressive Sharpening Authors: Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao

  21. Spike-and-Slab Posterior Sampling in High Dimensions Authors: Syamantak Kumar, Purnamrita Sarkar, Kevin Tian, Yusong Zhu

  22. Elliptic Loss Regularization Authors: Ali Hasan, Haoming Yang, Yuting Ng, Vahid Tarokh

  23. Mathematical Foundation of Interpretable Equivariant Surrogate Models Authors: Jacopo Joy Colombini, Filippo Bonchi, Francesco Giannini, Fosca Giannotti, Roberto Pellungrini, Patrizio Frosini

  24. DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models Authors: Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang

  25. Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis Authors: Long Cheng, Qichen Liao, Fan Wu, Junlin Mu, Tengfei Han, Zhe Qiu, Lianqiang Li, Tianyi Liu, Fangzheng Miao, Keming Gao, Liang Wang, Zhen Zhang, Qiande Yin

  26. Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach Authors: Zhihua Duan, Jialin Wang

  27. Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification Authors: Xiaohan Zhu, Nathan Srebro

  28. On the Relationship Between Double Descent of CNNs and Shape/Texture Bias Under Learning Process Authors: Shun Iwase, Shuya Takahashi, Nakamasa Inoue, Rio Yokota, Ryo Nakamura, Hirokatsu Kataoka

  29. Linear Representations of Political Perspective Emerge in Large Language Models Authors: Junsol Kim, James Evans, Aaron Schein

  30. MindBridge: Scalable and Cross-Model Knowledge Editing via Memory-Augmented Modality Authors: Shuaike Li, Kai Zhang, Qi Liu, Enhong Chen

  31. Sharpness-Aware Minimization: General Analysis and Improved Rates Authors: Dimitris Oikonomou, Nicolas Loizou

  32. VAEs and GANs: Implicitly Approximating Complex Distributions with Simple Base Distributions and Deep Neural Networks -- Principles, Necessity, and Limitations Authors: Yuan-Hao Wei

  33. Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks Authors: Chia-Wei Hsu, Nien-Ti Tsou, Yu-Cheng Chen, Yang Jeong Park, Ju Li

  34. Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale Authors: Jerome Ku, Eric Nguyen, David W. Romero, Garyk Brixi, Brandon Yang, Anton Vorontsov, Ali Taghibakhshi, Amy X. Lu, Dave P. Burke, Greg Brockman, Stefano Massaroli, Christopher R\'e, Patrick D. Hsu, Brian L. Hie, Stefano Ermon, Michael Poli

  35. Unnatural Languages Are Not Bugs but Features for LLMs Authors: Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J. Zico Kolter, Michael Qizhe Shieh


1. Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

ArXiv ID: 2503.02495

Authors: Yujiao Yang, Jing Lian, Linhui Li

Abstract: Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, expert in exist MoE paradigm works as an individual, thereby lacking high-quality expert interactions. Moreover, they have not been effectively extended to attention block, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes transformer into an equitant group of experts, and then implement dynamic routing on input data and experts. Our approach advances MoE design with three key innovations: (1) We conducted equitant expert decomposition on both MLP blocks and attention blocks based on matrix partition in tensor parallelism. (2) We developed two routing paradigms: patch wise data selection and expert selection, to apply routing across different levels. (3) We design the architecture of UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop parallel implementation of UoE's routing and computation operation, and optimize efficiency based on the hardware processing analysis. The experiments demonstrate that the model employed with UoE surpass Full Attention, state-of-art MoEs and efficient transformers in several tasks across image and natural language domains. The source codes are available at https://github.com/YujiaoYang-work/UoE.

Comment: The paper proposes Union-of-Experts (UoE), which advances the Mixture-of-Experts paradigm with architectural innovations, aligning closely with model architecture research.

Relevance: 10 Novelty: 8


2. Deep Learning is Not So Mysterious or Different

ArXiv ID: 2503.02113

Authors: Andrew Gordon Wilson

Abstract: Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality.

Comment: The paper provides a theoretical perspective on generalization phenomena in deep learning, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 9


3. Neural Manifolds and Cognitive Consistency: A New Approach to Memory Consolidation in Artificial Systems

ArXiv ID: 2503.01867

Authors: Phuong-Nam Nguyen

Abstract: We introduce a novel mathematical framework that unifies neural population dynamics, hippocampal sharp wave-ripple (SpWR) generation, and cognitive consistency constraints inspired by Heider's theory. Our model leverages low-dimensional manifold representations to capture structured neural drift and incorporates a balance energy function to enforce coherent synaptic interactions, effectively simulating the memory consolidation processes observed in biological systems. Simulation results demonstrate that our approach not only reproduces key features of SpWR events but also enhances network interpretability. This work paves the way for scalable neuromorphic architectures that bridge neuroscience and artificial intelligence, offering more robust and adaptive learning mechanisms for future intelligent systems.

Comment: The paper introduces a novel framework for memory consolidation inspired by neuroscience, which aligns with foundational research in representation learning and emerging trends.

Relevance: 9 Novelty: 9


4. A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff

ArXiv ID: 2503.02129

Authors: Hao Yu, Xiangyang Ji

Abstract: We propose a first near complete (that will make explicit sense in the main text) nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions (with some very mild conditions). In particular, it doens't require the boundness of loss function, as commonly assumed in the literature. Our theory goes beyond the bias-variance tradeoff, aligned with phenomenon typically encountered in deep learning. It is therefore sharp different with other existing nonasymptotic generalization error bounds for neural networks. More explicitly, we propose an explicit generalization error upper bound for multilayer neural networks with arbitrary Lipschitz activations $\sigma$ with $\sigma(0)=0$ and broad enough Lipschitz loss functions, without requiring either the width, depth or other hyperparameters of the neural network approaching infinity, a specific neural network architect (e.g. sparsity, boundness of some norms), a particular activation function, a particular optimization algorithm or boundness of the loss function, and with taking the approximation error into consideration. General Lipschitz activation can also be accommodated into our framework. A feature of our theory is that it also considers approximation errors. Furthermore, we show the near minimax optimality of our theory for multilayer ReLU networks for regression problems. Notably, our upper bound exhibits the famous double descent phenomenon for such networks, which is the most distinguished characteristic compared with other existing results. This work emphasizes a view that many classical results should be improved to embrace the unintuitive characteristics of deep learning to get a better understanding of it.

Comment: The paper introduces a nonasymptotic generalization theory for multilayer neural networks, addressing foundational aspects of generalization and double descent, which is highly relevant to understanding training dynamics.

Relevance: 9 Novelty: 8


5. Identifying Sensitive Weights via Post-quantization Integral

ArXiv ID: 2503.01901

Authors: Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen

Abstract: Serving Large Language Models (LLMs) is costly. However, post-training weight quantization can address this problem by both compressing their sizes for limited memory and saving bandwidth for acceleration. As not all weight dimensions are equally important, those methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on loss function and is used to preprocess original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric, and find that existing gradient and Hessian based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of local 2nd order approximation, \ie, gradient and Hessian term in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric to estimate posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced improvement of 2.66 perplexity gain on Llama 3.2 1B with QTIP.

Comment: The paper proposes a novel sensitivity metric (PQI) for post-training quantization, which is highly relevant to model compression and efficiency.

Relevance: 9 Novelty: 8


6. Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees

ArXiv ID: 2503.02859

Authors: Emma Ceccherini, Ian Gallagher, Andrew Jones, Daniel Lawson

Abstract: Stability for dynamic network embeddings ensures that nodes behaving the same at different times receive the same embedding, allowing comparison of nodes in the network across time. We present attributed unfolded adjacency spectral embedding (AUASE), a stable unsupervised representation learning framework for dynamic networks in which nodes are attributed with time-varying covariate information. To establish stability, we prove uniform convergence to an associated latent position model. We quantify the benefits of our dynamic embedding by comparing with state-of-the-art network representation learning methods on three real attributed networks. To the best of our knowledge, AUASE is the only attributed dynamic embedding that satisfies stability guarantees without the need for ground truth labels, which we demonstrate provides significant improvements for link prediction and node classification.

Comment: The paper focuses on unsupervised representation learning for dynamic networks, with a novel stability guarantee and theoretical contributions. This aligns with the representation learning criterion.

Relevance: 9 Novelty: 8


7. Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization

ArXiv ID: 2503.01922

Authors: Leonid Berlyand, Theo Bourdais, Houman Owhadi, Yitzchak Shmalo

Abstract: Deep neural networks (DNNs) have brought significant advancements in various applications in recent years, such as image recognition, speech recognition, and natural language processing. In particular, Vision Transformers (ViTs) have emerged as a powerful class of models in the field of deep learning for image classification. In this work, we propose a novel Random Matrix Theory (RMT)-based method for pruning pre-trained DNNs, based on the sparsification of weights and singular vectors, and apply it to ViTs. RMT provides a robust framework to analyze the statistical properties of large matrices, which has been shown to be crucial for understanding and optimizing the performance of DNNs. We demonstrate that our RMT-based pruning can be used to reduce the number of parameters of ViT models (trained on ImageNet) by 30-50\% with less than 1\% loss in accuracy. To our knowledge, this represents the state-of-the-art in pruning for these ViT models. Furthermore, we provide a rigorous mathematical underpinning of the above numerical studies, namely we proved a theorem for fully connected DNNs, and other more general DNN structures, describing how the randomness in the weight matrices of a DNN decreases as the weights approach a local or global minimum (during training). We verify this theorem through numerical experiments on fully connected DNNs, providing empirical support for our theoretical findings. Moreover, we prove a theorem that describes how DNN loss decreases as we remove randomness in the weight layers, and show a monotone dependence of the decrease in loss with the amount of randomness that we remove. Our results also provide significant RMT-based insights into the role of regularization during training and pruning.

Comment: The paper uses Random Matrix Theory for pruning DNNs, aligning with the model compression criterion and providing both theoretical and empirical contributions.

Relevance: 9 Novelty: 8


8. CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging

ArXiv ID: 2503.01874

Authors: Zongzhen Yang, Binhang Qi, Hailong Sun, Wenrui Long, Ruobing Zhao, Xiang Gao

Abstract: Model merging based on task vectors, i.e., the parameter differences between fine-tuned models and a shared base model, provides an efficient way to integrate multiple task-specific models into a multitask model without retraining. Recent works have endeavored to address the conflicts between task vectors, one of the significant challenges faced by model merging, through sparsification; however, two issues significantly limit their performance: high parameter overlap and unbalanced weight distribution. To address these issues, we propose a simple, yet effective framework called CABS (Conflict-Aware and Balanced Sparsification), consisting of Conflict-Aware Sparsification (CA) and Balanced Sparsification (BS). CA can reduce parameter overlap by applying masks during sequential pruning, ensuring that each task vector retains distinct, non-overlapping parameters. BS leverages $n$: $m$ pruning to preserve critical weights while maintaining an even distribution across layers. Our comprehensive experiments demonstrate that CABS outperforms state-of-the-art methods across diverse tasks and model sizes.

Comment: The paper introduces a sparsification framework (CABS) for model merging, which aligns with model compression and sparsity-related research.

Relevance: 9 Novelty: 8


9. Weak-to-Strong Generalization Even in Random Feature Networks, Provably

ArXiv ID: 2503.02877

Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro

Abstract: Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e. random features), is trained on the population, and a "strong" student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.

Comment: The paper explores weak-to-strong generalization in random feature networks, providing theoretical insights into training dynamics and generalization, which aligns well with foundational research in representation learning.

Relevance: 9 Novelty: 8


10. A Theory of Initialisation's Impact on Specialisation

ArXiv ID: 2503.02526

Authors: Devon Jarvis, Sebastian Lee, Cl\'ementine Carla Juliette Domin\'e, Andrew M Saxe, Stefano Sarao Mannelli

Abstract: Prior work has demonstrated a consistent tendency in neural networks engaged in continual learning tasks, wherein intermediate task similarity results in the highest levels of catastrophic interference. This phenomenon is attributed to the network's tendency to reuse learned features across tasks. However, this explanation heavily relies on the premise that neuron specialisation occurs, i.e. the emergence of localised representations. Our investigation challenges the validity of this assumption. Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition. More precisely, we show that weight imbalance and high weight entropy can favour specialised solutions. We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks. {Finally, we show that specialization by weight imbalance is beneficial on the commonly employed elastic weight consolidation regularisation technique.

Comment: The paper provides theoretical insights into the impact of initialization on neuron specialization, which is relevant to representation learning and training dynamics in neural networks.

Relevance: 9 Novelty: 8


11. An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition

ArXiv ID: 2503.02386

Authors: Qingsong Wang, Yunfei Qu, Chunfeng Cui, Deren Han

Abstract: Despite the remarkable success of low-rank estimation in data mining, its effectiveness diminishes when applied to data that inherently lacks low-rank structure. To address this limitation, in this paper, we focus on non-negative sparse matrices and aim to investigate the intrinsic low-rank characteristics of the rectified linear unit (ReLU) activation function. We first propose a novel nonlinear matrix decomposition framework incorporating a comprehensive regularization term designed to simultaneously promote useful structures in clustering and compression tasks, such as low-rankness, sparsity, and non-negativity in the resulting factors. This formulation presents significant computational challenges due to its multi-block structure, non-convexity, non-smoothness, and the absence of global gradient Lipschitz continuity. To address these challenges, we develop an accelerated alternating partial Bregman proximal gradient method (AAPB), whose distinctive feature lies in its capability to enable simultaneous updates of multiple variables. Under mild and theoretically justified assumptions, we establish both sublinear and global convergence properties of the proposed algorithm. Through careful selection of kernel generating distances tailored to various regularization terms, we derive corresponding closed-form solutions while maintaining the $L$-smooth adaptable property always holds for any $L\ge 1$. Numerical experiments, on graph regularized clustering and sparse NMF basis compression confirm the effectiveness of our model and algorithm.

Comment: The paper introduces a novel matrix decomposition framework with theoretical contributions to sparsity and low-rank methods, which aligns with model compression and representation learning.

Relevance: 9 Novelty: 8


12. Forgetting Transformer: Softmax Attention with a Forget Gate

ArXiv ID: 2503.02130

Authors: Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville

Abstract: An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.

Comment: The paper introduces a Forgetting Transformer with a novel attention mechanism, which aligns with foundational research in model architecture and transformer innovations.

Relevance: 9 Novelty: 8


13. Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

ArXiv ID: 2503.02812

Authors: Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, \'Eric de la Clergerie, Beno\^it Sagot

Abstract: Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Contrarily to many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves a 99% accuracy in the needle-in-a-haystack task with a x32 compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.

Comment: The paper introduces Q-Filters, a novel KV Cache compression method leveraging QK geometry, which aligns with the model compression criterion. It provides theoretical insights and demonstrates compatibility with FlashAttention, making it highly relevant.

Relevance: 9 Novelty: 8


14. (How) Do Language Models Track State?

ArXiv ID: 2503.02854

Authors: Belinda Z. Li, Zifan Carl Guo, Jacob Andreas

Abstract: Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.

Comment: The paper investigates how language models track state and identifies two distinct mechanisms, providing theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8


15. CrystalFramer: Rethinking the Role of Frames for SE(3)-Invariant Crystal Structure Modeling

ArXiv ID: 2503.02209

Authors: Yusei Ito, Tatsunori Taniai, Ryo Igarashi, Yoshitaka Ushiku, Kanta Ono

Abstract: Crystal structure modeling with graph neural networks is essential for various applications in materials informatics, and capturing SE(3)-invariant geometric features is a fundamental requirement for these networks. A straightforward approach is to model with orientation-standardized structures through structure-aligned coordinate systems, or"frames." However, unlike molecules, determining frames for crystal structures is challenging due to their infinite and highly symmetric nature. In particular, existing methods rely on a statically fixed frame for each structure, determined solely by its structural information, regardless of the task under consideration. Here, we rethink the role of frames, questioning whether such simplistic alignment with the structure is sufficient, and propose the concept of dynamic frames. While accommodating the infinite and symmetric nature of crystals, these frames provide each atom with a dynamic view of its local environment, focusing on actively interacting atoms. We demonstrate this concept by utilizing the attention mechanism in a recent transformer-based crystal encoder, resulting in a new architecture called CrystalFramer. Extensive experiments show that CrystalFramer outperforms conventional frames and existing crystal encoders in various crystal property prediction tasks.

Comment: The paper introduces dynamic frames for crystal structure modeling, which is a novel architectural concept in the context of SE(3)-invariant modeling.

Relevance: 8 Novelty: 8


16. Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

ArXiv ID: 2503.02597

Authors: Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

Abstract: Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of earlier modalities (e.g., images) to incorporate information from later modalities (e.g., text). To address this problem, we propose AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows AKI to achieve superior performance in 12 multimodal understanding benchmarks (+7.2% on average) without introducing additional parameters and increasing training time. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios. The code is publicly available at https://github.com/sony/aki, and we will release our AKI-4B model to encourage further advancements in MLLMs across various directions.

Comment: The paper proposes a novel attention mechanism (MMA) for multimodal LLMs, which aligns with architectural innovations in large models.

Relevance: 8 Novelty: 8


17. PaCA: Partial Connection Adaptation for Efficient Fine-Tuning

ArXiv ID: 2503.01905

Authors: Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon

Abstract: Prior parameter-efficient fine-tuning (PEFT) algorithms reduce memory usage and computational costs of fine-tuning large neural network models by training only a few additional adapter parameters, rather than the entire model. However, the reduction in computational costs due to PEFT does not necessarily translate to a reduction in training time; although the computational costs of the adapter layers are much smaller than the pretrained layers, it is well known that those two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid latency overhead, but during training, the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers in the model. PaCA not only enhances training speed by eliminating the time overhead due to the sequential processing of the adapter and pretrained layers but also reduces activation memory since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, PaCA enables training with 23% longer sequence and improves throughput by 16% on both NVIDIA A100 GPU and INTEL Gaudi2 HPU compared to LoRA. The code is available at https://github.com/WooSunghyeon/paca.

Comment: The paper proposes a parameter-efficient fine-tuning method (PaCA) that improves training speed and memory usage, aligning with the model compression criterion.

Relevance: 8 Novelty: 8


18. The Distributionally Robust Optimization Model of Sparse Principal Component Analysis

ArXiv ID: 2503.02494

Authors: Lei Wang, Xin Liu, Xiaojun Chen

Abstract: We consider sparse principal component analysis (PCA) under a stochastic setting where the underlying probability distribution of the random parameter is uncertain. This problem is formulated as a distributionally robust optimization (DRO) model based on a constructive approach to capturing uncertainty in the covariance matrix, which constitutes a nonsmooth constrained min-max optimization problem. We further prove that the inner maximization problem admits a closed-form solution, reformulating the original DRO model into an equivalent minimization problem on the Stiefel manifold. This transformation leads to a Riemannian optimization problem with intricate nonsmooth terms, a challenging formulation beyond the reach of existing algorithms. To address this issue, we devise an efficient smoothing manifold proximal gradient algorithm. We prove the Riemannian gradient consistency and global convergence of our algorithm to a stationary point of the nonsmooth minimization problem. Moreover, we establish the iteration complexity of our algorithm. Finally, numerical experiments are conducted to validate the effectiveness and scalability of our algorithm, as well as to highlight the necessity and rationality of adopting the DRO model for sparse PCA.

Comment: The paper addresses sparse PCA using a distributionally robust optimization model, which aligns with foundational research in sparse methods and optimization.

Relevance: 8 Novelty: 8


19. Weight transport through spike timing for robust local gradients

ArXiv ID: 2503.02642

Authors: Timo Gierlich, Andreas Baumbach, Akos F. Kungl, Kevin Max, Mihai A. Petrovici

Abstract: In both machine learning and in computational neuroscience, plasticity in functional neural networks is frequently expressed as gradient descent on a cost. Often, this imposes symmetry constraints that are difficult to reconcile with local computation, as is required for biological networks or neuromorphic hardware. For example, wake-sleep learning in networks characterized by Boltzmann distributions builds on the assumption of symmetric connectivity. Similarly, the error backpropagation algorithm is notoriously plagued by the weight transport problem between the representation and the error stream. Existing solutions such as feedback alignment tend to circumvent the problem by deferring to the robustness of these algorithms to weight asymmetry. However, they are known to scale poorly with network size and depth. We introduce spike-based alignment learning (SAL), a complementary learning rule for spiking neural networks, which uses spike timing statistics to extract and correct the asymmetry between effective reciprocal connections. Apart from being spike-based and fully local, our proposed mechanism takes advantage of noise. Based on an interplay between Hebbian and anti-Hebbian plasticity, synapses can thereby recover the true local gradient. This also alleviates discrepancies that arise from neuron and synapse variability -- an omnipresent property of physical neuronal networks. We demonstrate the efficacy of our mechanism using different spiking network models. First, we show how SAL can significantly improve convergence to the target distribution in probabilistic spiking networks as compared to Hebbian plasticity alone. Second, in neuronal hierarchies based on cortical microcircuits, we show how our proposed mechanism effectively enables the alignment of feedback weights to the forward pathway, thus allowing the backpropagation of correct feedback errors.

Comment: The paper introduces a novel spike-based learning rule for spiking neural networks, which aligns with foundational research in representation learning and training dynamics.

Relevance: 8 Novelty: 8


20. A Minimalist Example of Edge-of-Stability and Progressive Sharpening

ArXiv ID: 2503.02809

Authors: Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao

Abstract: Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. Besides, we connect our minimalist example to existing works by reconciling the existence of a well-behaved ``stable set" between minimalist and generalist analyses, and extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.

Comment: The paper explores Edge-of-Stability and Progressive Sharpening phenomena in optimization, providing theoretical insights into training dynamics, which is relevant to representation learning.

Relevance: 8 Novelty: 8


21. Spike-and-Slab Posterior Sampling in High Dimensions

ArXiv ID: 2503.02798

Authors: Syamantak Kumar, Purnamrita Sarkar, Kevin Tian, Yusong Zhu

Abstract: Posterior sampling with the spike-and-slab prior [MB88], a popular multimodal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard method for Bayesian sparse linear regression [CPS09, Roc18]. However, designing provable algorithms for performing this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection tasks either require strong assumptions about the signal-to-noise ratio (SNR) [YWJ16], only work when the measurement count grows at least linearly in the dimension [MW24], or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR, and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix $\mathbf{X} \in \mathbb{R}^{n\times d}$ and noisy observations $\mathbf{y} = \mathbf{X}\mathbf{\theta}^\star + \mathbf{\xi}$ of a signal $\mathbf{\theta}^\star$ drawn from a spike-and-slab prior $\pi$ with a Gaussian diffuse density and expected sparsity k, where $\mathbf{\xi} \sim \mathcal{N}(\mathbb{0}_n, \sigma^2\mathbf{I}_n)$. We give a polynomial-time high-accuracy sampler for the posterior $\pi(\cdot \mid \mathbf{X}, \mathbf{y})$, for any SNR $\sigma^{-1}$ > 0, as long as $n \geq k^3 \cdot \text{polylog}(d)$ and $X$ is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time $\approx nd$ in the same setting, as long as $n \geq k^5 \cdot \text{polylog}(d)$. To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when $\sigma = O(\frac{1}{k})$ is bounded.

Comment: The paper introduces provable algorithms for spike-and-slab posterior sampling in high dimensions, which aligns with foundational research in sparse methods and representation learning.

Relevance: 8 Novelty: 8


22. Elliptic Loss Regularization

ArXiv ID: 2503.02138

Authors: Ali Hasan, Haoming Yang, Yuting Ng, Vahid Tarokh

Abstract: Regularizing neural networks is important for anticipating model behavior in regions of the data space that are not well represented. In this work, we propose a regularization technique for enforcing a level of smoothness in the mapping between the data input space and the loss value. We specify the level of regularity by requiring that the loss of the network satisfies an elliptic operator over the data domain. To do this, we modify the usual empirical risk minimization objective such that we instead minimize a new objective that satisfies an elliptic operator over points within the domain. This allows us to use existing theory on elliptic operators to anticipate the behavior of the error for points outside the training set. We propose a tractable computational method that approximates the behavior of the elliptic operator while being computationally efficient. Finally, we analyze the properties of the proposed regularization to understand the performance on common problems of distribution shift and group imbalance. Numerical experiments confirm the utility of the proposed regularization technique.

Comment: The paper proposes a novel regularization technique based on elliptic operators, which aligns with foundational research in model regularization and generalization.

Relevance: 8 Novelty: 7


23. Mathematical Foundation of Interpretable Equivariant Surrogate Models

ArXiv ID: 2503.01942

Authors: Jacopo Joy Colombini, Filippo Bonchi, Francesco Giannini, Fosca Giannotti, Roberto Pellungrini, Patrizio Frosini

Abstract: This paper introduces a rigorous mathematical framework for neural network explainability, and more broadly for the explainability of equivariant operators called Group Equivariant Operators (GEOs) based on Group Equivariant Non-Expansive Operators (GENEOs) transformations. The central concept involves quantifying the distance between GEOs by measuring the non-commutativity of specific diagrams. Additionally, the paper proposes a definition of interpretability of GEOs according to a complexity measure that can be defined according to each user preferences. Moreover, we explore the formal properties of this framework and show how it can be applied in classical machine learning scenarios, like image classification with convolutional neural networks.

Comment: The paper proposes a mathematical framework for explainability in equivariant operators, which aligns with foundational research in model interpretability and architecture analysis.

Relevance: 8 Novelty: 7


24. DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

ArXiv ID: 2503.02175

Authors: Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang

Abstract: Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, are proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as Max-Min Diversity Problem (MMDP) where the goal is to select a subset such that the diversity among the selected {tokens} is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available $\href{https://github.com/vbdi/divprune}{\text{here}}$.

Comment: The paper proposes a diversity-based token pruning method for multimodal models, which aligns with the model compression criterion, particularly in reducing redundancy and improving efficiency.

Relevance: 8 Novelty: 7


25. Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis

ArXiv ID: 2503.01873

Authors: Long Cheng, Qichen Liao, Fan Wu, Junlin Mu, Tengfei Han, Zhe Qiu, Lianqiang Li, Tianyi Liu, Fangzheng Miao, Keming Gao, Liang Wang, Zhen Zhang, Qiande Yin

Abstract: Attention calculation is extremely time-consuming for long-sequence inference tasks, such as text or image/video generation, in large models. To accelerate this process, we developed a low-precision, mathematically-equivalent algorithm called PASA, based on Flash Attention. PASA introduces two novel techniques: online pseudo-average shifting and global recovering. These techniques enable the use of half-precision computation throughout the Flash Attention process without incurring overflow instability or unacceptable numerical accuracy loss. This algorithm enhances performance on memory-restricted AI hardware architectures, such as the Ascend Neural-network Processing Unit(NPU), by reducing data movement and increasing computational FLOPs. The algorithm is validated using both designed random benchmarks and real large models. We find that the large bias and amplitude of attention input data are critical factors contributing to numerical overflow ($>65504$ for half precision) in two different categories of large models (Qwen2-7B language models and Stable-Video-Diffusion multi-modal models). Specifically, overflow arises due to the large bias in the sequence dimension and the resonance mechanism between the query and key in the head dimension of the Stable-Video-Diffusion models. The resonance mechanism is defined as phase coincidence or 180-degree phase shift between query and key matrices. It will remarkably amplify the element values of attention score matrix. This issue also applies to the Qwen models. Additionally, numerical accuracy is assessed through root mean square error (RMSE) and by comparing the final generated texts and videos to those produced using high-precision attention.

Comment: The paper proposes a low-precision algorithm (PASA) for efficient attention computation in LLMs, which aligns with model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 7


26. Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach

ArXiv ID: 2503.01888

Authors: Zhihua Duan, Jialin Wang

Abstract: Integrating the structural inductive biases of Graph Neural Networks (GNNs) with the global contextual modeling capabilities of Transformers represents a pivotal challenge in graph representation learning. While GNNs excel at capturing localized topological patterns through message-passing mechanisms, their inherent limitations in modeling long-range dependencies and parallelizability hinder their deployment in large-scale scenarios. Conversely, Transformers leverage self-attention mechanisms to achieve global receptive fields but struggle to inherit the intrinsic graph structural priors of GNNs. This paper proposes a novel knowledge distillation framework that systematically transfers multiscale structural knowledge from GNN teacher models to Transformer student models, offering a new perspective on addressing the critical challenges in cross-architectural distillation. The framework effectively bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multiscale feature alignment. This work establishes a new paradigm for inheriting graph structural biases in Transformer architectures, with broad application prospects.

Comment: The paper proposes a knowledge distillation framework to transfer structural knowledge from GNNs to Transformers, aligning with architectural innovations and cross-architectural analysis.

Relevance: 8 Novelty: 7


27. Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification

ArXiv ID: 2503.02110

Authors: Xiaohan Zhu, Nathan Srebro

Abstract: We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. \citet{GL} previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it underegularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of \citeauthor{GL} for $\lambda=1$ and extending it to all other choices of $\lambda$.

Comment: This paper provides a theoretical analysis of regularization paths in MDL learning rules, which aligns with foundational research in representation learning and training dynamics.

Relevance: 8 Novelty: 7


28. On the Relationship Between Double Descent of CNNs and Shape/Texture Bias Under Learning Process

ArXiv ID: 2503.02302

Authors: Shun Iwase, Shuya Takahashi, Nakamasa Inoue, Rio Yokota, Ryo Nakamura, Hirokatsu Kataoka

Abstract: The double descent phenomenon, which deviates from the traditional bias-variance trade-off theory, attracts considerable research attention; however, the mechanism of its occurrence is not fully understood. On the other hand, in the study of convolutional neural networks (CNNs) for image recognition, methods are proposed to quantify the bias on shape features versus texture features in images, determining which features the CNN focuses on more. In this work, we hypothesize that there is a relationship between the shape/texture bias in the learning process of CNNs and epoch-wise double descent, and we conduct verification. As a result, we discover double descent/ascent of shape/texture bias synchronized with double descent of test error under conditions where epoch-wise double descent is observed. Quantitative evaluations confirm this correlation between the test errors and the bias values from the initial decrease to the full increase in test error. Interestingly, double descent/ascent of shape/texture bias is observed in some cases even in conditions without label noise, where double descent is thought not to occur. These experimental results are considered to contribute to the understanding of the mechanisms behind the double descent phenomenon and the learning process of CNNs in image recognition.

Comment: The paper investigates the double descent phenomenon in CNNs, which is relevant to understanding training dynamics and representation learning. It provides experimental insights into shape/texture bias.

Relevance: 8 Novelty: 7


29. Linear Representations of Political Perspective Emerge in Large Language Models

ArXiv ID: 2503.02080

Authors: Junsol Kim, James Evans, Aaron Schein

Abstract: Large language models (LLMs) have demonstrated the ability to generate text that realistically reflects a range of different subjective human perspectives. This paper studies how LLMs are seemingly able to reflect more liberal versus more conservative viewpoints among other political perspectives in American politics. We show that LLMs possess linear representations of political perspectives within activation space, wherein more similar perspectives are represented closer together. To do so, we probe the attention heads across the layers of three open transformer-based LLMs (\texttt{Llama-2-7b-chat}, \texttt{Mistral-7b-instruct}, \texttt{Vicuna-7b}). We first prompt models to generate text from the perspectives of different U.S.~lawmakers. We then identify sets of attention heads whose activations linearly predict those lawmakers' DW-NOMINATE scores, a widely-used and validated measure of political ideology. We find that highly predictive heads are primarily located in the middle layers, often speculated to encode high-level concepts and tasks. Using probes only trained to predict lawmakers' ideology, we then show that the same probes can predict measures of news outlets' slant from the activations of models prompted to simulate text from those news outlets. These linear probes allow us to visualize, interpret, and monitor ideological stances implicitly adopted by an LLM as it generates open-ended responses. Finally, we demonstrate that by applying linear interventions to these attention heads, we can steer the model outputs toward a more liberal or conservative stance. Overall, our research suggests that LLMs possess a high-level linear representation of American political ideology and that by leveraging recent advances in mechanistic interpretability, we can identify, monitor, and steer the subjective perspective underlying generated text.

Comment: The paper explores linear representations of political perspectives in LLMs, which aligns with interpretability and representation learning in LLMs. It provides novel insights into high-level representations.

Relevance: 8 Novelty: 7


30. MindBridge: Scalable and Cross-Model Knowledge Editing via Memory-Augmented Modality

ArXiv ID: 2503.02701

Authors: Shuaike Li, Kai Zhang, Qi Liu, Enhong Chen

Abstract: Knowledge editing is a technique for efficiently and accurately updating the knowledge of large language models (LLMs) to alleviate obsolescence and correct errors. However, most existing methods overfit to specific models, causing edited knowledge to be discarded during each LLM update and requiring frequent re-editing, which is particularly burdensome in today's rapidly evolving open-source community. To address this issue, we propose the problem of cross-model knowledge editing and introduce MindBridge, a scalable solution inspired by the low coupling between modality processing and LLMs in multi-modal models. MindBridge introduces the novel concept of memory modality, which encodes edited knowledge as an independent modality. It first performs LLM-agnostic pre-training of the memory modality and then integrates it with various LLMs. Extensive experiments on multiple LLMs and popular knowledge editing datasets demonstrate that MindBridge achieves superior performance even in editing tens of thousands of knowledge entries and can flexibly adapt to different LLMs. Our code is available at https://github.com/CrashBugger/MindBridge.

Comment: The paper proposes a scalable knowledge editing framework for LLMs, which aligns with foundational research in LLM behavior and architecture-level innovations.

Relevance: 8 Novelty: 7


31. Sharpness-Aware Minimization: General Analysis and Improved Rates

ArXiv ID: 2503.02225

Authors: Dimitris Oikonomou, Nicolas Loizou

Abstract: Sharpness-Aware Minimization (SAM) has emerged as a powerful method for improving generalization in machine learning models by minimizing the sharpness of the loss landscape. However, despite its success, several important questions regarding the convergence properties of SAM in non-convex settings are still open, including the benefits of using normalization in the update rule, the dependence of the analysis on the restrictive bounded variance assumption, and the convergence guarantees under different sampling strategies. To address these questions, in this paper, we provide a unified analysis of SAM and its unnormalized variant (USAM) under one single flexible update rule (Unified SAM), and we present convergence results of the new algorithm under a relaxed and more natural assumption on the stochastic noise. Our analysis provides convergence guarantees for SAM under different step size selections for non-convex problems and functions that satisfy the Polyak-Lojasiewicz (PL) condition (a non-convex generalization of strongly convex functions). The proposed theory holds under the arbitrary sampling paradigm, which includes importance sampling as special case, allowing us to analyze variants of SAM that were never explicitly considered in the literature. Experiments validate the theoretical findings and further demonstrate the practical effectiveness of Unified SAM in training deep neural networks for image classification tasks.

Comment: The paper provides a unified analysis of Sharpness-Aware Minimization (SAM) and introduces a new algorithm with theoretical guarantees, which aligns with representation learning and training dynamics in neural networks.

Relevance: 8 Novelty: 7


32. VAEs and GANs: Implicitly Approximating Complex Distributions with Simple Base Distributions and Deep Neural Networks -- Principles, Necessity, and Limitations

ArXiv ID: 2503.01898

Authors: Yuan-Hao Wei

Abstract: This tutorial focuses on the fundamental architectures of Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN), disregarding their numerous variations, to highlight their core principles. Both VAE and GAN utilize simple distributions, such as Gaussians, as a basis and leverage the powerful nonlinear transformation capabilities of neural networks to approximate arbitrarily complex distributions. The theoretical basis lies in that a linear combination of multiple Gaussians can almost approximate any probability distribution, while neural networks enable further refinement through nonlinear transformations. Both methods approximate complex data distributions implicitly. This implicit approximation is crucial because directly modeling high-dimensional distributions explicitly is often intractable. However, the choice of a simple latent prior, while computationally convenient, introduces limitations. In VAEs, the fixed Gaussian prior forces the posterior distribution to align with it, potentially leading to loss of information and reduced expressiveness. This restriction affects both the interpretability of the model and the quality of generated samples.

Comment: The paper discusses fundamental principles and limitations of VAEs and GANs, which are relevant to representation learning and foundational generative modeling.

Relevance: 8 Novelty: 6


33. Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks

ArXiv ID: 2503.02147

Authors: Chia-Wei Hsu, Nien-Ti Tsou, Yu-Cheng Chen, Yang Jeong Park, Ju Li

Abstract: Gradient-based optimization drives the unprecedented performance of modern deep neural network models across diverse applications. Adaptive algorithms have accelerated neural network training due to their rapid convergence rates; however, they struggle to find ``flat minima" reliably, resulting in suboptimal generalization compared to stochastic gradient descent (SGD). By revisiting various adaptive algorithms' mechanisms, we propose the Frankenstein optimizer, which combines their advantages. The proposed Frankenstein dynamically adjusts first- and second-momentum coefficients according to the optimizer's current state to directly maintain consistent learning dynamics and immediately reflect sudden gradient changes. Extensive experiments across several research domains such as computer vision, natural language processing, few-shot learning, and scientific simulations show that Frankenstein surpasses existing adaptive algorithms and SGD empirically regarding convergence speed and generalization performance. Furthermore, this research deepens our understanding of adaptive algorithms through centered kernel alignment analysis and loss landscape visualization during the learning process.

Comment: The paper proposes a novel optimizer (Frankenstein) and provides insights into optimization dynamics, which could be relevant to training dynamics in neural networks.

Relevance: 7 Novelty: 7


34. Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale

ArXiv ID: 2503.01868

Authors: Jerome Ku, Eric Nguyen, David W. Romero, Garyk Brixi, Brandon Yang, Anton Vorontsov, Ali Taghibakhshi, Amy X. Lu, Dave P. Burke, Greg Brockman, Stefano Massaroli, Christopher R\'e, Patrick D. Hsu, Brian L. Hie, Stefano Ermon, Michael Poli

Abstract: We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression, with input-dependent convolutions and attention offering complementary performance. Second, co-designing convolution operators and hardware-aware algorithms enables efficiency gains in regimes where previous alternative architectures struggle to surpass Transformers. At the 40 billion parameter scale, we train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids. On H100 GPUs and model width 4096, individual operators in the proposed multi-hybrid StripedHyena 2 architecture achieve two-fold throughput improvement over linear attention and state-space models. Multi-hybrids excel at sequence modeling over byte-tokenized data, as demonstrated by the Evo 2 line of models. We discuss the foundations that enable these results, including architecture design, overlap-add blocked kernels for tensor cores, and dedicated all-to-all and point-to-point context parallelism strategies.

Comment: The paper introduces convolutional multi-hybrid architectures, which is relevant to architectural innovations. However, the focus is on efficiency and hardware optimization rather than foundational insights.

Relevance: 7 Novelty: 7


35. Unnatural Languages Are Not Bugs but Features for LLMs

ArXiv ID: 2503.01926

Authors: Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J. Zico Kolter, Michael Qizhe Shieh

Abstract: Large Language Models (LLMs) have been observed to process non-human-readable text sequences, such as jailbreak prompts, often viewed as a bug for aligned LLMs. In this work, we present a systematic investigation challenging this perception, demonstrating that unnatural languages - strings that appear incomprehensible to humans but maintain semantic meanings for LLMs - contain latent features usable by models. Notably, unnatural languages possess latent features that can be generalized across different models and tasks during inference. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on-par with those trained on natural language, achieving 49.71 win rates in Length-controlled AlpacaEval 2.0 in average across various base models. In addition, through comprehensive analysis, we demonstrate that LLMs process unnatural languages by filtering noise and inferring contextual meaning from filtered words.

Comment: The paper explores the concept of unnatural languages in LLMs, which aligns with foundational research on LLM behavior and interpretability. However, it is more focused on empirical findings rather than theoretical breakthroughs.

Relevance: 7 Novelty: 6


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.