Personalized Daily Arxiv Papers 03/11/2025

[gpt-4o]	Prompt	Completion	Total
Token	77903	11057	88960
Cost	$0.18	$0.11	$0.3

Total ArXiv papers: 1080

Total scanned papers: 642

Total relevant papers: 55

Table of contents with paper titles:

Learning Decision Trees as Amortized Structure Inference Authors: Mohammed Mahfoud, Ghait Boukachab, Micha{\l} Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin
Denoising Hamiltonian Network for Physical Reasoning Authors: Congyue Deng, Brandon Y. Feng, Cecilia Garraffo, Alan Garbarz, Robin Walters, William T. Freeman, Leonidas Guibas, Kaiming He
MoFE: Mixture of Frozen Experts Architecture Authors: Jean Seo, Jaeyoon Kim, Hyopil Shin
A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications Authors: Siyuan Mu, Sen Lin
Analyzing the Role of Permutation Invariance in Linear Mode Connectivity Authors: Keyao Zhan, Puheng Li, Lei Wu
MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering Authors: Vinay Kumar Verma, Shreyas Sunil Kulkarni, Happy Mittal, Deepak Gupta
Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization Authors: Wei Liu, Zhiying Deng, Zhongyu Niu, Jun Wang, Haozhao Wang, Zhigang Zeng, Ruixuan Li
Seesaw: High-throughput LLM Inference via Model Re-sharding Authors: Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko
Learning Energy-Based Models by Self-normalising the Likelihood Authors: Hugo Senetaire, Paul Jeha, Pierre-Alexandre Mattei, Jes Frellsen
Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform Authors: Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen
Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization Authors: Ziqing Xu, Hancheng Min, Lachlan Ewen MacDonald, Jinqi Luo, Salma Tarmoun, Enrique Mallada, Rene Vidal
Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks Authors: Devon Jarvis, Richard Klein, Benjamin Rosman, Andrew M. Saxe
IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining Authors: Yixiao Li, Xianzhi Du, Ajay Jaiswal, Tao Lei, Tuo Zhao, Chong Wang, Jianyu Wang
Sample-aware Adaptive Structured Pruning for Large Language Models Authors: Jun Kong, Xinge Ma, Jin Wang, Xuejie Zhang
Task Vector Quantization for Memory-Efficient Model Merging Authors: Youngeun Kim, Seunghwan Lee, Aecheon Jung, Bogon Ryu, Sungeun Hong
How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders Authors: Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration Authors: Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He
eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference Authors: Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer
Towards Superior Quantization Accuracy: A Layer-sensitive Approach Authors: Feng Zhang, Yanbin Liu, Weihua Li, Jie Lv, Xiaodan Wang, Quan Bai
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers Authors: Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs Authors: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms Authors: Jiaming Song, Linqi Zhou
Characterizing Learning in Spiking Neural Networks with Astrocyte-Like Units Authors: Christopher S. Yang, Sylvester J. Gates III, Dulara De Zoysa, Jaehoon Choe, Wolfgang Losert, Corey B. Hart
Understanding the role of autoencoders for stiff dynamical systems using information theory Authors: Vijayamanikandan Vijayarangan, Harshavardhana A. Uranakara, Francisco E. Hern\'andez-P\'erez, Hong G. Im
Nearly Optimal Differentially Private ReLU Regression Authors: Meng Ding, Mingxi Lei, Shaowei Wang, Tianhang Zheng, Di Wang, Jinhui Xu
From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab
Uncertainty Quantification From Scaling Laws in Deep Neural Networks Authors: Ibrahim Elsharkawy, Yonatan Kahn, Benjamin Hooberman
Delusions of Large Language Models Authors: Hongshen Xu, Zixv yang, Zichen Zhu, Kunyao Lan, Zihan Wang, Mengyue Wu, Ziwei Ji, Lu Chen, Pascale Fung, Kai Yu
GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks Authors: Xiao Yue, Guangzhi Qu, Lige Gan
Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels Authors: Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Ricard Marxer
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation Authors: Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiandan Liang
SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs Authors: Samir Abdaljalil, Hasan Kurban, Parichit Sharma, Erchin Serpedin, Rachad Atat
Language Models Fail to Introspect About Their Knowledge of Language Authors: Siyuan Song, Jennifer Hu, Kyle Mahowald
Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention Authors: Matthew Shen, Aliyah Hsu, Abhineet Agarwal, Bin Yu
Using Subgraph GNNs for Node Classification:an Overlooked Potential Approach Authors: Qian Zeng, Xin Lin, Jingyi Gao, Yang Yu
Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs Authors: Gonzalo Mancera, Daniel de Alcala, Julian Fierrez, Ruben Tolosana, Aythami Morales
TPU-Gen: LLM-Driven Custom Tensor Processing Unit Generator Authors: Deepak Vungarala, Mohammed E. Elbtity, Sumiya Syed, Sakila Alam, Kartik Pandit, Arnob Ghosh, Ramtin Zand, Shaahin Angizi
Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals Authors: Hanze Li, Xiande Huang
BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling Authors: Li weile, Liu Xiao
Deep Cut-informed Graph Embedding and Clustering Authors: Zhiyuan Ning, Zaitian Wang, Ran Zhang, Ping Xu, Kunpeng Liu, Pengyang Wang, Chong Chen, Pengfei Wang, Yuanchun Zhou, Erik Cambria
TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation Authors: Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Peng Gao, Hongsheng Li
TokenButler: Token Importance is Predictable Authors: Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs Authors: Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun
AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning Authors: Hoang-Thang Ta, Anh Tran
Decision-Dependent Stochastic Optimization: The Role of Distribution Dynamics Authors: Zhiyu He, Saverio Bolognani, Florian D\"orfler, Michael Muehlebach
System 0/1/2/3: Quad-process theory for multi-timescale embodied collective cognitive systems Authors: Tadahiro Taniguchi, Yasushi Hirai, Masahiro Suzuki, Shingo Murata, Takato Horii, Kazutoshi Tanaka
NFIG: Autoregressive Image Generation with Next-Frequency Prediction Authors: Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Chi Zhang, Xuelong Li
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang
Emergent Abilities in Large Language Models: A Survey Authors: Leonardo Berti, Flavio Giorgi, Gjergji Kasneci
Lifelong Learning with Task-Specific Adaptation: Addressing the Stability-Plasticity Dilemma Authors: Ruiyu Wang, Sen Wang, Xinxin Zuo, Qiang Sun
What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces Authors: Jordi Armengol-Estap\'e, Quentin Carbonneaux, Tianjun Zhang, Aram H. Markosyan, Volker Seeker, Chris Cummins, Melanie Kambadur, Michael F. P. O'Boyle, Sida Wang, Gabriel Synnaeve, Hugh James Leather
Minion Gated Recurrent Unit for Continual Learning Authors: Abdullah M. Zyarah, Dhireesha Kudithipudi
Towards Experience Replay for Class-Incremental Learning in Fully-Binary Networks Authors: Yanis Basso-Bert, Anca Molnos, Romain Lemaire, William Guicquero, Antoine Dupret
Gender Encoding Patterns in Pretrained Language Model Representations Authors: Mahdi Zakizadeh, Mohammad Taher Pilehvar

1. Learning Decision Trees as Amortized Structure Inference

ArXiv ID: 2503.06985

Authors: Mohammed Mahfoud, Ghait Boukachab, Micha{\l} Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin

Abstract: Building predictive models for tabular data presents fundamental challenges, notably in scaling consistently, i.e., more resources translating to better performance, and generalizing systematically beyond the training data distribution. Designing decision tree models remains especially challenging given the intractably large search space, and most existing methods rely on greedy heuristics, while deep learning inductive biases expect a temporal or spatial structure not naturally present in tabular data. We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data, formulating decision tree construction as a sequential planning problem. We train a deep reinforcement learning (GFlowNet) policy to solve this problem, yielding a generative model that samples decision trees from the Bayesian posterior. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks derived from real-world data, robustness to distribution shifts, and anomaly detection, all while yielding interpretable models with shorter description lengths. Samples from the trained DT-GFN model can be ensembled to construct a random forest, and we further show that the performance of scales consistently in ensemble size, yielding ensembles of predictors that continue to generalize systematically.

Comment: Author match

2. Denoising Hamiltonian Network for Physical Reasoning

ArXiv ID: 2503.07596

Authors: Congyue Deng, Brandon Y. Feng, Cecilia Garraffo, Alan Garbarz, Robin Walters, William T. Freeman, Leonidas Guibas, Kaiming He

Abstract: Machine learning frameworks for physical problems must capture and enforce physical constraints that preserve the structure of dynamical systems. Many existing approaches achieve this by integrating physical operators into neural networks. While these methods offer theoretical guarantees, they face two key limitations: (i) they primarily model local relations between adjacent time steps, overlooking longer-range or higher-level physical interactions, and (ii) they focus on forward simulation while neglecting broader physical reasoning tasks. We propose the Denoising Hamiltonian Network (DHN), a novel framework that generalizes Hamiltonian mechanics operators into more flexible neural operators. DHN captures non-local temporal relationships and mitigates numerical integration errors through a denoising mechanism. DHN also supports multi-system modeling with a global conditioning mechanism. We demonstrate its effectiveness and flexibility across three diverse physical reasoning tasks with distinct inputs and outputs.

Comment: Author match

3. MoFE: Mixture of Frozen Experts Architecture

ArXiv ID: 2503.06491

Authors: Jean Seo, Jaeyoon Kim, Hyopil Shin

Abstract: We propose the Mixture of Frozen Experts (MoFE) architecture, which integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network (FFN) layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while still allowing for effective knowledge transfer from the expert models. This facilitates the creation of models proficient in multiple domains. We conduct experiments to evaluate the trade-offs between performance and efficiency, compare MoFE with other PEFT methodologies, assess the impact of domain expertise in the constituent models, and determine the optimal training strategy. The results show that, although there may be some trade-offs in performance, the efficiency gains are substantial, making MoFE a reasonable solution for real-world, resource-constrained environments.

Comment: The paper introduces the Mixture of Frozen Experts (MoFE) architecture, which is directly relevant to foundational research on Mixture-of-Experts and efficiency in model architectures.

Relevance: 10 Novelty: 8

4. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications

ArXiv ID: 2503.07137

Authors: Siyuan Mu, Sen Lin

Abstract: Artificial intelligence (AI) has achieved astonishing successes in many domains, especially with the recent breakthroughs in the development of foundational large models. These large models, leveraging their extensive training data, provide versatile solutions for a wide range of downstream tasks. However, as modern datasets become increasingly diverse and complex, the development of large AI models faces two major challenges: (1) the enormous consumption of computational resources and deployment difficulties, and (2) the difficulty in fitting heterogeneous and complex data, which limits the usability of the models. Mixture of Experts (MoE) models has recently attracted much attention in addressing these challenges, by dynamically selecting and activating the most relevant sub-models to process input data. It has been shown that MoEs can significantly improve model performance and efficiency with fewer resources, particularly excelling in handling large-scale, multimodal data. Given the tremendous potential MoE has demonstrated across various domains, it is urgent to provide a comprehensive summary of recent advancements of MoEs in many important fields. Existing surveys on MoE have their limitations, e.g., being outdated or lacking discussion on certain key areas, and we aim to address these gaps. In this paper, we first introduce the basic design of MoE, including gating functions, expert networks, routing mechanisms, training strategies, and system design. We then explore the algorithm design of MoE in important machine learning paradigms such as continual learning, meta-learning, multi-task learning, and reinforcement learning. Additionally, we summarize theoretical studies aimed at understanding MoE and review its applications in computer vision and natural language processing. Finally, we discuss promising future research directions.

Comment: This is a comprehensive survey on Mixture-of-Experts (MoE), directly aligning with the model architecture criterion. It provides a broad overview and insights into MoE, making it highly relevant.

Relevance: 10 Novelty: 7

5. Analyzing the Role of Permutation Invariance in Linear Mode Connectivity

ArXiv ID: 2503.06001

Authors: Keyao Zhan, Puheng Li, Lei Wu

Abstract: It was empirically observed in Entezari et al. (2021) that when accounting for the permutation invariance of neural networks, there is likely no loss barrier along the linear interpolation between two SGD solutions -- a phenomenon known as linear mode connectivity (LMC) modulo permutation. This phenomenon has sparked significant attention due to both its theoretical interest and practical relevance in applications such as model merging. In this paper, we provide a fine-grained analysis of this phenomenon for two-layer ReLU networks under a teacher-student setup. We show that as the student network width $m$ increases, the LMC loss barrier modulo permutation exhibits a {\bf double descent} behavior. Particularly, when $m$ is sufficiently large, the barrier decreases to zero at a rate $O(m^{-1/2})$. Notably, this rate does not suffer from the curse of dimensionality and demonstrates how substantial permutation can reduce the LMC loss barrier. Moreover, we observe a sharp transition in the sparsity of GD/SGD solutions when increasing the learning rate and investigate how this sparsity preference affects the LMC loss barrier modulo permutation. Experiments on both synthetic and MNIST datasets corroborate our theoretical predictions and reveal a similar trend for more complex network architectures.

Comment: The paper provides a theoretical analysis of linear mode connectivity and sparsity in neural networks, which aligns with representation learning and training dynamics.

Relevance: 9 Novelty: 8

ArXiv ID: 2503.06296

Authors: Vinay Kumar Verma, Shreyas Sunil Kulkarni, Happy Mittal, Deepak Gupta

Abstract: Question Answering (QA) and Visual Question Answering (VQA) are well-studied problems in the language and vision domain. One challenging scenario involves multiple sources of information, each of a different modality, where the answer to the question may exist in one or more sources. This scenario contains richer information but is highly complex to handle. In this work, we formulate a novel question-answer generation (QAG) framework in an environment containing multi-source, multimodal information. The answer may belong to any or all sources; therefore, selecting the most prominent answer source or an optimal combination of all sources for a given question is challenging. To address this issue, we propose a question-guided attention mechanism that learns attention across multiple sources and decodes this information for robust and unbiased answer generation. To learn attention within each source, we introduce an explicit alignment between questions and various information sources, which facilitates identifying the most pertinent parts of the source information relative to the question. Scalability in handling diverse questions poses a challenge. We address this by extending our model to a sparse mixture-of-experts (sparse-MoE) framework, enabling it to handle thousands of question types. Experiments on T5 and Flan-T5 using three datasets demonstrate the model's efficacy, supported by ablation studies.

Comment: The paper proposes a sparse Mixture-of-Experts framework for multi-source, multi-modal question answering, which aligns with foundational research on MoE and scalability.

Relevance: 9 Novelty: 8

7. Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization

ArXiv ID: 2503.06202

Authors: Wei Liu, Zhiying Deng, Zhongyu Niu, Jun Wang, Haozhao Wang, Zhigang Zeng, Ruixuan Li

Abstract: Extracting a small subset of crucial rationales from the full input is a key problem in explainability research. The most widely used fundamental criterion for rationale extraction is the maximum mutual information (MMI) criterion. In this paper, we first demonstrate that MMI suffers from diminishing marginal returns. Once part of the rationale has been identified, finding the remaining portions contributes only marginally to increasing the mutual information, making it difficult to use MMI to locate the rest. In contrast to MMI that aims to reproduce the prediction, we seek to identify the parts of the input that the network can actually utilize. This is achieved by comparing how different rationale candidates match the capability space of the weight matrix. The weight matrix of a neural network is typically low-rank, meaning that the linear combinations of its column vectors can only cover part of the directions in a high-dimensional space (high-dimension: the dimensions of an input vector). If an input is fully utilized by the network, {it generally matches these directions (e.g., a portion of a hypersphere), resulting in a representation with a high norm. Conversely, if an input primarily falls outside (orthogonal to) these directions}, its representation norm will approach zero, behaving like noise that the network cannot effectively utilize. Building on this, we propose using the norms of rationale candidates as an alternative objective to MMI. Through experiments on four text classification datasets and one graph classification dataset using three network architectures (GRUs, BERT, and GCN), we show that our method outperforms MMI and its improved variants in identifying better rationales. We also compare our method with a representative LLM (llama-3.1-8b-instruct) and find that our simple method gets comparable results to it and can sometimes even outperform it.

Comment: The paper critiques the MMI criterion and proposes a novel alternative for rationale extraction, which aligns with representation learning and interpretability.

Relevance: 9 Novelty: 8

8. Seesaw: High-throughput LLM Inference via Model Re-sharding

ArXiv ID: 2503.06433

Authors: Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko

Abstract: To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics inherent in the two stages of LLM inference-prefilling and decoding-render a single static parallelization strategy insufficient for the effective optimization of both stages. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that facilitates the dynamic reconfiguration of parallelization strategies across stages, thereby maximizing throughput at both phases. To mitigate re-sharding overhead and optimize computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. These approaches work synergistically to reduce the overhead caused by frequent stage transitions while ensuring maximum batching efficiency. Our evaluation demonstrates that Seesaw achieves a throughput increase of up to 1.78x (1.36x on average) compared to vLLM, the most widely used state-of-the-art LLM inference engine.

Comment: The paper introduces a dynamic re-sharding technique for LLM inference, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

9. Learning Energy-Based Models by Self-normalising the Likelihood

ArXiv ID: 2503.07021

Authors: Hugo Senetaire, Paul Jeha, Pierre-Alexandre Mattei, Jes Frellsen

Abstract: Training an energy-based model (EBM) with maximum likelihood is challenging due to the intractable normalisation constant. Traditional methods rely on expensive Markov chain Monte Carlo (MCMC) sampling to estimate the gradient of logartihm of the normalisation constant. We propose a novel objective called self-normalised log-likelihood (SNL) that introduces a single additional learnable parameter representing the normalisation constant compared to the regular log-likelihood. SNL is a lower bound of the log-likelihood, and its optimum corresponds to both the maximum likelihood estimate of the model parameters and the normalisation constant. We show that the SNL objective is concave in the model parameters for exponential family distributions. Unlike the regular log-likelihood, the SNL can be directly optimised using stochastic gradient techniques by sampling from a crude proposal distribution. We validate the effectiveness of our proposed method on various density estimation tasks as well as EBMs for regression. Our results show that the proposed method, while simpler to implement and tune, outperforms existing techniques.

Comment: The paper proposes a novel self-normalized log-likelihood objective for energy-based models, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8

10. Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform

ArXiv ID: 2503.06676

Authors: Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen

Abstract: With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.

Comment: The paper introduces a novel data-free delta compression method inspired by JPEG compression, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

11. Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization

ArXiv ID: 2503.06982

Authors: Ziqing Xu, Hancheng Min, Lachlan Ewen MacDonald, Jinqi Luo, Salma Tarmoun, Enrique Mallada, Rene Vidal

Abstract: Despite the empirical success of Low-Rank Adaptation (LoRA) in fine-tuning pre-trained models, there is little theoretical understanding of how first-order methods with carefully crafted initialization adapt models to new tasks. In this work, we take the first step towards bridging this gap by theoretically analyzing the learning dynamics of LoRA for matrix factorization (MF) under gradient flow (GF), emphasizing the crucial role of initialization. For small initialization, we theoretically show that GF converges to a neighborhood of the optimal solution, with smaller initialization leading to lower final error. Our analysis shows that the final error is affected by the misalignment between the singular spaces of the pre-trained model and the target matrix, and reducing the initialization scale improves alignment. To address this misalignment, we propose a spectral initialization for LoRA in MF and theoretically prove that GF with small spectral initialization converges to the fine-tuning task with arbitrary precision. Numerical experiments from MF and image classification validate our findings.

Comment: The paper provides theoretical insights into the learning dynamics of LoRA, which aligns with representation learning and low-rank adaptation in model compression.

Relevance: 9 Novelty: 8

12. Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks

ArXiv ID: 2503.06181

Authors: Devon Jarvis, Richard Klein, Benjamin Rosman, Andrew M. Saxe

Abstract: In spite of finite dimension ReLU neural networks being a consistent factor behind recent deep learning successes, a theory of feature learning in these models remains elusive. Currently, insightful theories still rely on assumptions including the linearity of the network computations, unstructured input data and architectural constraints such as infinite width or a single hidden layer. To begin to address this gap we establish an equivalence between ReLU networks and Gated Deep Linear Networks, and use their greater tractability to derive dynamics of learning. We then consider multiple variants of a core task reminiscent of multi-task learning or contextual control which requires both feature learning and nonlinearity. We make explicit that, for these tasks, the ReLU networks possess an inductive bias towards latent representations which are not strictly modular or disentangled but are still highly structured and reusable between contexts. This effect is amplified with the addition of more contexts and hidden layers. Thus, we take a step towards a theory of feature learning in finite ReLU networks and shed light on how structured mixed-selective latent representations can emerge due to a bias for node-reuse and learning speed.

Comment: The paper provides theoretical insights into feature learning in ReLU networks, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8

13. IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining

ArXiv ID: 2503.05920

Authors: Yixiao Li, Xianzhi Du, Ajay Jaiswal, Tao Lei, Tuo Zhao, Chong Wang, Jianyu Wang

Abstract: Recent advancements in large language models have intensified the need for efficient and deployable models within limited inference budgets. Structured pruning pipelines have shown promise in token efficiency compared to training target-size models from scratch. In this paper, we advocate incorporating enlarged model pretraining, which is often ignored in previous works, into pruning. We study the enlarge-and-prune pipeline as an integrated system to address two critical questions: whether it is worth pretraining an enlarged model even when the model is never deployed, and how to optimize the entire pipeline for better pruned models. We propose an integrated enlarge-and-prune pipeline, which combines enlarge model training, pruning, and recovery under a single cosine annealing learning rate schedule. This approach is further complemented by a novel iterative structured pruning method for gradual parameter removal. The proposed method helps to mitigate the knowledge loss caused by the rising learning rate in naive enlarge-and-prune pipelines and enable effective redistribution of model capacity among surviving neurons, facilitating smooth compression and enhanced performance. We conduct comprehensive experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining. It demonstrates the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also achieves superior performance of pruned models.

Comment: The paper proposes an integrated enlarge-and-prune pipeline for generative language model pretraining, which aligns with foundational research in model compression.

Relevance: 9 Novelty: 8

14. Sample-aware Adaptive Structured Pruning for Large Language Models

ArXiv ID: 2503.06184

Authors: Jun Kong, Xinge Ma, Jin Wang, Xuejie Zhang

Abstract: Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that the AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20\% pruning ratio, the model pruned with AdaPruner maintains 97\% of the performance of the unpruned model.

Comment: The paper proposes a structured pruning framework for LLMs, which aligns with the model compression criterion. The use of adaptive methods adds novelty to the pruning process.

Relevance: 9 Novelty: 8

15. Task Vector Quantization for Memory-Efficient Model Merging

ArXiv ID: 2503.06921

Authors: Youngeun Kim, Seunghwan Lee, Aecheon Jung, Bogon Ryu, Sungeun Hong

Abstract: Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to larger models and diverse tasks. In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. We observe that task vectors exhibit a narrow weight range, enabling low precision quantization (up to 4 bit) within existing task vector merging frameworks. To further mitigate quantization errors within ultra-low bit precision (e.g., 2 bit), we introduce Residual Task Vector Quantization, which decomposes the task vector into a base vector and offset component. We allocate bits based on quantization sensitivity, ensuring precision while minimizing error within a memory budget. Experiments on image classification and dense prediction show our method maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints.

Comment: The paper introduces a memory-efficient model merging method using task vector quantization, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

16. How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders

ArXiv ID: 2503.06394

Authors: Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi

Abstract: Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.

Comment: The paper uses sparse autoencoders to trace internal representations in LLMs, directly addressing representation learning and interpretability in LLMs.

Relevance: 9 Novelty: 8

17. ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

ArXiv ID: 2503.06881

Authors: Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He

Abstract: Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/ResMoE.

Comment: The paper introduces a compression method for Mixture-of-Experts models, which aligns with model compression and efficiency improvements.

Relevance: 9 Novelty: 8

18. eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference

ArXiv ID: 2503.06823

Authors: Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer

Abstract: In recent years, Mixture-of-Experts (MoE) has emerged as an effective approach for enhancing the capacity of deep neural network (DNN) with sub-linear computational costs. However, storing all experts on GPUs incurs significant memory overhead, increasing the monetary cost of MoE-based inference. To address this, we propose eMoE, a memory efficient inference system for MoE-based large language models (LLMs) by leveraging our observations from experiment measurements. eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. To reduce loading latency while maintaining accuracy, as we found using the same experts for subsequent prompts has minimal impact on perplexity, eMoE invokes the expert predictor every few prompts rather than for each prompt. In addition, it skips predictions for tasks less sensitive to routing accuracy. Finally, it has task-aware scheduling to minimize inference latency by considering Service Level Objectives (SLOs), task-specific output lengths, and expert loading latencies. Experimental results show that compared to existing systems, eMoE reduces memory consumption by up to 80% while maintaining accuracy and reduces inference latency by up to 17%. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.

Comment: The paper proposes a memory-efficient MoE inference system, directly aligning with the model architecture and efficiency criteria.

Relevance: 9 Novelty: 8

19. Towards Superior Quantization Accuracy: A Layer-sensitive Approach

ArXiv ID: 2503.06518

Authors: Feng Zhang, Yanbin Liu, Weihua Li, Jie Lv, Xiaodan Wang, Quan Bai

Abstract: Large Vision and Language Models have exhibited remarkable human-like intelligence in tasks such as natural language comprehension, problem-solving, logical reasoning, and knowledge retrieval. However, training and serving these models require substantial computational resources, posing a significant barrier to their widespread application and further research. To mitigate this challenge, various model compression techniques have been developed to reduce computational requirements. Nevertheless, existing methods often employ uniform quantization configurations, failing to account for the varying difficulties across different layers in quantizing large neural network models. This paper tackles this issue by leveraging layer-sensitivity features, such as activation sensitivity and weight distribution Kurtosis, to identify layers that are challenging to quantize accurately and allocate additional memory budget. The proposed methods, named SensiBoost and KurtBoost, respectively, demonstrate notable improvement in quantization accuracy, achieving up to 9% lower perplexity with only a 2% increase in memory budget on LLama models compared to the baseline.

Comment: This paper proposes a layer-sensitive approach to quantization, which directly aligns with the model compression criterion. The methods SensiBoost and KurtBoost provide novel insights into layer-specific quantization strategies, improving accuracy with minimal memory overhead.

Relevance: 9 Novelty: 8

20. InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

ArXiv ID: 2503.06692

Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang

Abstract: Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.

Comment: This paper introduces a novel paradigm for long-context reasoning in LLMs, addressing computational scaling and reasoning depth. It aligns with foundational research in LLMs by proposing a new iterative reasoning framework, which could have broader implications for model efficiency and architecture.

Relevance: 9 Novelty: 8

21. Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers

ArXiv ID: 2503.06183

Authors: Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello

Abstract: The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area- and power-constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero indices decompression operations required by our kernels, obtaining up to 1.9x extra speedup, at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), with less than 1.5% accuracy drop compared to a dense baseline.

Comment: The paper focuses on sparsity and pruning techniques for efficient DNNs on microcontrollers, aligning with the model compression criterion.

Relevance: 9 Novelty: 7

22. This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

ArXiv ID: 2503.05856

Authors: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic

Abstract: Mixture of large language model (LLMs) Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a $\textit{single}$ carefully-instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.

Comment: The paper evaluates the robustness of MoE architectures, directly aligning with the model architecture criterion.

Relevance: 9 Novelty: 7

23. Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

ArXiv ID: 2503.07154

Authors: Jiaming Song, Linqi Zhou

Abstract: Recent years have seen significant advancements in foundation models through generative pre-training, yet algorithmic innovation in this space has largely stagnated around autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation creates a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits the progress on multimodal intelligence. We argue that an inference-first perspective, which prioritizes scaling efficiency during inference time across sequence length and refinement steps, can inspire novel generative pre-training algorithms. Using Inductive Moment Matching (IMM) as a concrete example, we demonstrate how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm that achieves superior sample quality with over an order of magnitude greater inference efficiency.

Comment: The paper discusses inference-time scaling for generative pre-training algorithms, which aligns with foundational research on efficiency and generative paradigms.

Relevance: 8 Novelty: 8

24. Characterizing Learning in Spiking Neural Networks with Astrocyte-Like Units

ArXiv ID: 2503.06798

Authors: Christopher S. Yang, Sylvester J. Gates III, Dulara De Zoysa, Jaehoon Choe, Wolfgang Losert, Corey B. Hart

Abstract: Traditional artificial neural networks take inspiration from biological networks, using layers of neuron-like nodes to pass information for processing. More realistic models include spiking in the neural network, capturing the electrical characteristics more closely. However, a large proportion of brain cells are of the glial cell type, in particular astrocytes which have been suggested to play a role in performing computations. Here, we introduce a modified spiking neural network model with added astrocyte-like units in a neural network and asses their impact on learning. We implement the network as a liquid state machine and task the network with performing a chaotic time-series prediction task. We varied the number and ratio of neuron-like and astrocyte-like units in the network to examine the latter units effect on learning. We show that the combination of neurons and astrocytes together, as opposed to neural- and astrocyte-only networks, are critical for driving learning. Interestingly, we found that the highest learning rate was achieved when the ratio between astrocyte-like and neuron-like units was roughly 2 to 1, mirroring some estimates of the ratio of biological astrocytes to neurons. Our results demonstrate that incorporating astrocyte-like units which represent information across longer timescales can alter the learning rates of neural networks, and the proportion of astrocytes to neurons should be tuned appropriately to a given task.

Comment: The paper explores the impact of astrocyte-like units in spiking neural networks, which is an emerging trend in foundational research.

Relevance: 8 Novelty: 8

25. Understanding the role of autoencoders for stiff dynamical systems using information theory

ArXiv ID: 2503.06325

Authors: Vijayamanikandan Vijayarangan, Harshavardhana A. Uranakara, Francisco E. Hern\'andez-P\'erez, Hong G. Im

Abstract: Using the information theory, this study provides insights into how the construction of latent space of autoencoder (AE) using deep neural network (DNN) training finds a smooth low-dimensional manifold in the stiff dynamical system. Our recent study [1] reported that an autoencoder (AE) combined with neural ODE (NODE) as a surrogate reduced order model (ROM) for the integration of stiff chemically reacting systems led to a significant reduction in the temporal stiffness, and the behavior was attributed to the identification of a slow invariant manifold by the nonlinear projection of the AE. The present work offers fundamental understanding of the mechanism by employing concepts from information theory and better mixing. The learning mechanism of both the encoder and decoder are explained by plotting the evolution of mutual information and identifying two different phases. Subsequently, the density distribution is plotted for the physical and latent variables, which shows the transformation of the \emph{rare event} in the physical space to a \emph{highly likely} (more probable) event in the latent space provided by the nonlinear autoencoder. Finally, the nonlinear transformation leading to density redistribution is explained using concepts from information theory and probability.

Comment: The paper provides insights into how autoencoders encode information in stiff dynamical systems, aligning with representation learning.

Relevance: 8 Novelty: 8

26. Nearly Optimal Differentially Private ReLU Regression

ArXiv ID: 2503.06009

Authors: Meng Ding, Mingxi Lei, Shaowei Wang, Tianhang Zheng, Di Wang, Jinhui Xu

Abstract: In this paper, we investigate one of the most fundamental nonconvex learning problems, ReLU regression, in the Differential Privacy (DP) model. Previous studies on private ReLU regression heavily rely on stringent assumptions, such as constant bounded norms for feature vectors and labels. We relax these assumptions to a more standard setting, where data can be i.i.d. sampled from $O(1)$-sub-Gaussian distributions. We first show that when $\varepsilon = \tilde{O}(\sqrt{\frac{1}{N}})$ and there is some public data, it is possible to achieve an upper bound of $\Tilde{O}(\frac{d^2}{N^2 \varepsilon^2})$ for the excess population risk in $(\epsilon, \delta)$-DP, where $d$ is the dimension and $N$ is the number of data samples. Moreover, we relax the requirement of $\epsilon$ and public data by proposing and analyzing a one-pass mini-batch Generalized Linear Model Perceptron algorithm (DP-MBGLMtron). Additionally, using the tracing attack argument technique, we demonstrate that the minimax rate of the estimation error for $(\varepsilon, \delta)$-DP algorithms is lower bounded by $\Omega(\frac{d^2}{N^2 \varepsilon^2})$. This shows that DP-MBGLMtron achieves the optimal utility bound up to logarithmic factors. Experiments further support our theoretical results.

Comment: The paper investigates differentially private ReLU regression, which is a foundational topic in model efficiency and privacy. It provides theoretical insights into optimal utility bounds and relaxes prior assumptions, making it relevant to model compression and efficiency.

Relevance: 8 Novelty: 8

27. From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning

ArXiv ID: 2503.05919

Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab

Abstract: Finetuning provides a scalable and cost-effective means of customizing language models for specific tasks or response styles, with greater reliability than prompting or in-context learning. In contrast, the conventional wisdom is that injecting knowledge via finetuning results in brittle performance and poor generalization. We argue that the dichotomy of "task customization" (e.g., instruction tuning) and "knowledge injection" (e.g., teaching new facts) is a distinction without a difference. We instead identify concrete factors that explain the heterogeneous effectiveness observed with finetuning. To this end, we conduct a large-scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning. Our findings indicate that question-answer training data formats provide much stronger knowledge generalization than document/article-style training data, numerical information can be harder for finetuning to retain than categorical information, and models struggle to apply finetuned knowledge during multi-step reasoning even when trained on similar examples -- all factors that render "knowledge injection" to be especially difficult, even after controlling for considerations like data augmentation and information volume. On the other hand, our findings also indicate that it is not fundamentally more difficult to finetune information about a real-world event than information about what a model's writing style should be.

Comment: The paper studies finetuning for knowledge injection in LLMs, providing insights into the limitations and challenges of finetuning, which aligns with foundational research on LLM behavior.