Personalized Daily Arxiv Papers 03/11/2025

[gpt-4o]  Prompt  Completion  Total
Token     77903   11057       88960
Cost      $0.18   $0.11       $0.3

Total ArXiv papers: 1080

Total scanned papers: 642

Total relevant papers: 55

Table of contents with paper titles:

  1. Learning Decision Trees as Amortized Structure Inference Authors: Mohammed Mahfoud, Ghait Boukachab, Michał Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin

  2. Denoising Hamiltonian Network for Physical Reasoning Authors: Congyue Deng, Brandon Y. Feng, Cecilia Garraffo, Alan Garbarz, Robin Walters, William T. Freeman, Leonidas Guibas, Kaiming He

  3. MoFE: Mixture of Frozen Experts Architecture Authors: Jean Seo, Jaeyoon Kim, Hyopil Shin

  4. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications Authors: Siyuan Mu, Sen Lin

  5. Analyzing the Role of Permutation Invariance in Linear Mode Connectivity Authors: Keyao Zhan, Puheng Li, Lei Wu

  6. MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering Authors: Vinay Kumar Verma, Shreyas Sunil Kulkarni, Happy Mittal, Deepak Gupta

  7. Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization Authors: Wei Liu, Zhiying Deng, Zhongyu Niu, Jun Wang, Haozhao Wang, Zhigang Zeng, Ruixuan Li

  8. Seesaw: High-throughput LLM Inference via Model Re-sharding Authors: Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko

  9. Learning Energy-Based Models by Self-normalising the Likelihood Authors: Hugo Senetaire, Paul Jeha, Pierre-Alexandre Mattei, Jes Frellsen

  10. Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform Authors: Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen

  11. Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization Authors: Ziqing Xu, Hancheng Min, Lachlan Ewen MacDonald, Jinqi Luo, Salma Tarmoun, Enrique Mallada, Rene Vidal

  12. Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks Authors: Devon Jarvis, Richard Klein, Benjamin Rosman, Andrew M. Saxe

  13. IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining Authors: Yixiao Li, Xianzhi Du, Ajay Jaiswal, Tao Lei, Tuo Zhao, Chong Wang, Jianyu Wang

  14. Sample-aware Adaptive Structured Pruning for Large Language Models Authors: Jun Kong, Xinge Ma, Jin Wang, Xuejie Zhang

  15. Task Vector Quantization for Memory-Efficient Model Merging Authors: Youngeun Kim, Seunghwan Lee, Aecheon Jung, Bogon Ryu, Sungeun Hong

  16. How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders Authors: Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi

  17. ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration Authors: Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He

  18. eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference Authors: Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer

  19. Towards Superior Quantization Accuracy: A Layer-sensitive Approach Authors: Feng Zhang, Yanbin Liu, Weihua Li, Jie Lv, Xiaodan Wang, Quan Bai

  20. InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang

  21. Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers Authors: Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello

  22. This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs Authors: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic

  23. Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms Authors: Jiaming Song, Linqi Zhou

  24. Characterizing Learning in Spiking Neural Networks with Astrocyte-Like Units Authors: Christopher S. Yang, Sylvester J. Gates III, Dulara De Zoysa, Jaehoon Choe, Wolfgang Losert, Corey B. Hart

  25. Understanding the role of autoencoders for stiff dynamical systems using information theory Authors: Vijayamanikandan Vijayarangan, Harshavardhana A. Uranakara, Francisco E. Hernández-Pérez, Hong G. Im

  26. Nearly Optimal Differentially Private ReLU Regression Authors: Meng Ding, Mingxi Lei, Shaowei Wang, Tianhang Zheng, Di Wang, Jinhui Xu

  27. From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab

  28. Uncertainty Quantification From Scaling Laws in Deep Neural Networks Authors: Ibrahim Elsharkawy, Yonatan Kahn, Benjamin Hooberman

  29. Delusions of Large Language Models Authors: Hongshen Xu, Zixv yang, Zichen Zhu, Kunyao Lan, Zihan Wang, Mengyue Wu, Ziwei Ji, Lu Chen, Pascale Fung, Kai Yu

  30. GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks Authors: Xiao Yue, Guangzhi Qu, Lige Gan

  31. Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels Authors: Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Ricard Marxer

  32. SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation Authors: Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiandan Liang

  33. SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs Authors: Samir Abdaljalil, Hasan Kurban, Parichit Sharma, Erchin Serpedin, Rachad Atat

  34. Language Models Fail to Introspect About Their Knowledge of Language Authors: Siyuan Song, Jennifer Hu, Kyle Mahowald

  35. Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention Authors: Matthew Shen, Aliyah Hsu, Abhineet Agarwal, Bin Yu

  36. Using Subgraph GNNs for Node Classification: an Overlooked Potential Approach Authors: Qian Zeng, Xin Lin, Jingyi Gao, Yang Yu

  37. Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs Authors: Gonzalo Mancera, Daniel de Alcala, Julian Fierrez, Ruben Tolosana, Aythami Morales

  38. TPU-Gen: LLM-Driven Custom Tensor Processing Unit Generator Authors: Deepak Vungarala, Mohammed E. Elbtity, Sumiya Syed, Sakila Alam, Kartik Pandit, Arnob Ghosh, Ramtin Zand, Shaahin Angizi

  39. Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals Authors: Hanze Li, Xiande Huang

  40. BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling Authors: Li weile, Liu Xiao

  41. Deep Cut-informed Graph Embedding and Clustering Authors: Zhiyuan Ning, Zaitian Wang, Ran Zhang, Ping Xu, Kunpeng Liu, Pengyang Wang, Chong Chen, Pengfei Wang, Yuanchun Zhou, Erik Cambria

  42. TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation Authors: Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Peng Gao, Hongsheng Li

  43. TokenButler: Token Importance is Predictable Authors: Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah

  44. DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs Authors: Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun

  45. AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning Authors: Hoang-Thang Ta, Anh Tran

  46. Decision-Dependent Stochastic Optimization: The Role of Distribution Dynamics Authors: Zhiyu He, Saverio Bolognani, Florian Dörfler, Michael Muehlebach

  47. System 0/1/2/3: Quad-process theory for multi-timescale embodied collective cognitive systems Authors: Tadahiro Taniguchi, Yasushi Hirai, Masahiro Suzuki, Shingo Murata, Takato Horii, Kazutoshi Tanaka

  48. NFIG: Autoregressive Image Generation with Next-Frequency Prediction Authors: Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Chi Zhang, Xuelong Li

  49. Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

  50. Emergent Abilities in Large Language Models: A Survey Authors: Leonardo Berti, Flavio Giorgi, Gjergji Kasneci

  51. Lifelong Learning with Task-Specific Adaptation: Addressing the Stability-Plasticity Dilemma Authors: Ruiyu Wang, Sen Wang, Xinxin Zuo, Qiang Sun

  52. What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces Authors: Jordi Armengol-Estapé, Quentin Carbonneaux, Tianjun Zhang, Aram H. Markosyan, Volker Seeker, Chris Cummins, Melanie Kambadur, Michael F. P. O'Boyle, Sida Wang, Gabriel Synnaeve, Hugh James Leather

  53. Minion Gated Recurrent Unit for Continual Learning Authors: Abdullah M. Zyarah, Dhireesha Kudithipudi

  54. Towards Experience Replay for Class-Incremental Learning in Fully-Binary Networks Authors: Yanis Basso-Bert, Anca Molnos, Romain Lemaire, William Guicquero, Antoine Dupret

  55. Gender Encoding Patterns in Pretrained Language Model Representations Authors: Mahdi Zakizadeh, Mohammad Taher Pilehvar


1. Learning Decision Trees as Amortized Structure Inference

ArXiv ID: 2503.06985

Authors: Mohammed Mahfoud, Ghait Boukachab, Michał Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin

Abstract: Building predictive models for tabular data presents fundamental challenges, notably in scaling consistently, i.e., more resources translating to better performance, and generalizing systematically beyond the training data distribution. Designing decision tree models remains especially challenging given the intractably large search space, and most existing methods rely on greedy heuristics, while deep learning inductive biases expect a temporal or spatial structure not naturally present in tabular data. We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data, formulating decision tree construction as a sequential planning problem. We train a deep reinforcement learning (GFlowNet) policy to solve this problem, yielding a generative model that samples decision trees from the Bayesian posterior. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks derived from real-world data, robustness to distribution shifts, and anomaly detection, all while yielding interpretable models with shorter description lengths. Samples from the trained DT-GFN model can be ensembled to construct a random forest, and we further show that its performance scales consistently with ensemble size, yielding ensembles of predictors that continue to generalize systematically.

Comment: Author match


2. Denoising Hamiltonian Network for Physical Reasoning

ArXiv ID: 2503.07596

Authors: Congyue Deng, Brandon Y. Feng, Cecilia Garraffo, Alan Garbarz, Robin Walters, William T. Freeman, Leonidas Guibas, Kaiming He

Abstract: Machine learning frameworks for physical problems must capture and enforce physical constraints that preserve the structure of dynamical systems. Many existing approaches achieve this by integrating physical operators into neural networks. While these methods offer theoretical guarantees, they face two key limitations: (i) they primarily model local relations between adjacent time steps, overlooking longer-range or higher-level physical interactions, and (ii) they focus on forward simulation while neglecting broader physical reasoning tasks. We propose the Denoising Hamiltonian Network (DHN), a novel framework that generalizes Hamiltonian mechanics operators into more flexible neural operators. DHN captures non-local temporal relationships and mitigates numerical integration errors through a denoising mechanism. DHN also supports multi-system modeling with a global conditioning mechanism. We demonstrate its effectiveness and flexibility across three diverse physical reasoning tasks with distinct inputs and outputs.

Comment: Author match


3. MoFE: Mixture of Frozen Experts Architecture

ArXiv ID: 2503.06491

Authors: Jean Seo, Jaeyoon Kim, Hyopil Shin

Abstract: We propose the Mixture of Frozen Experts (MoFE) architecture, which integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network (FFN) layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while still allowing for effective knowledge transfer from the expert models. This facilitates the creation of models proficient in multiple domains. We conduct experiments to evaluate the trade-offs between performance and efficiency, compare MoFE with other PEFT methodologies, assess the impact of domain expertise in the constituent models, and determine the optimal training strategy. The results show that, although there may be some trade-offs in performance, the efficiency gains are substantial, making MoFE a reasonable solution for real-world, resource-constrained environments.

Comment: The paper introduces the Mixture of Frozen Experts (MoFE) architecture, which is directly relevant to foundational research on Mixture-of-Experts and efficiency in model architectures.

Relevance: 10 Novelty: 8


4. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications

ArXiv ID: 2503.07137

Authors: Siyuan Mu, Sen Lin

Abstract: Artificial intelligence (AI) has achieved astonishing successes in many domains, especially with the recent breakthroughs in the development of foundational large models. These large models, leveraging their extensive training data, provide versatile solutions for a wide range of downstream tasks. However, as modern datasets become increasingly diverse and complex, the development of large AI models faces two major challenges: (1) the enormous consumption of computational resources and deployment difficulties, and (2) the difficulty in fitting heterogeneous and complex data, which limits the usability of the models. Mixture of Experts (MoE) models have recently attracted much attention in addressing these challenges, by dynamically selecting and activating the most relevant sub-models to process input data. It has been shown that MoEs can significantly improve model performance and efficiency with fewer resources, particularly excelling in handling large-scale, multimodal data. Given the tremendous potential MoE has demonstrated across various domains, it is urgent to provide a comprehensive summary of recent advancements of MoEs in many important fields. Existing surveys on MoE have their limitations, e.g., being outdated or lacking discussion on certain key areas, and we aim to address these gaps. In this paper, we first introduce the basic design of MoE, including gating functions, expert networks, routing mechanisms, training strategies, and system design. We then explore the algorithm design of MoE in important machine learning paradigms such as continual learning, meta-learning, multi-task learning, and reinforcement learning. Additionally, we summarize theoretical studies aimed at understanding MoE and review its applications in computer vision and natural language processing. Finally, we discuss promising future research directions.

Comment: This is a comprehensive survey on Mixture-of-Experts (MoE), directly aligning with the model architecture criterion. It provides a broad overview and insights into MoE, making it highly relevant.

Relevance: 10 Novelty: 7
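
The abstract above describes MoE as dynamically selecting and activating the most relevant sub-models for each input. One common way to do that is top-k softmax gating, sketched below in plain Python. This is a generic illustration rather than anything specific to the survey: the function names `top_k_gate` and `moe_forward`, the toy experts, and the router scores are all hypothetical.

```python
import math

def top_k_gate(scores, k):
    """Keep the k largest gate scores and softmax-normalise them.

    Returns {expert_index: weight}; unselected experts get no entry,
    so they are never evaluated.
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, scores, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    weights = top_k_gate(scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy experts: hypothetical scalar functions standing in for sub-networks.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
scores = [1.0, 3.0, 0.5, 3.0]  # hypothetical router logits for one input
weights = top_k_gate(scores, k=2)
output = moe_forward(3.0, experts, scores, k=2)
```

With these scores only experts 1 and 3 run, each with weight 0.5; the other two are skipped entirely, which is where the compute saving of sparse routing comes from.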


5. Analyzing the Role of Permutation Invariance in Linear Mode Connectivity

ArXiv ID: 2503.06001

Authors: Keyao Zhan, Puheng Li, Lei Wu

Abstract: It was empirically observed in Entezari et al. (2021) that when accounting for the permutation invariance of neural networks, there is likely no loss barrier along the linear interpolation between two SGD solutions -- a phenomenon known as linear mode connectivity (LMC) modulo permutation. This phenomenon has sparked significant attention due to both its theoretical interest and practical relevance in applications such as model merging. In this paper, we provide a fine-grained analysis of this phenomenon for two-layer ReLU networks under a teacher-student setup. We show that as the student network width $m$ increases, the LMC loss barrier modulo permutation exhibits a double descent behavior. Particularly, when $m$ is sufficiently large, the barrier decreases to zero at a rate $O(m^{-1/2})$. Notably, this rate does not suffer from the curse of dimensionality and demonstrates how substantial permutation can reduce the LMC loss barrier. Moreover, we observe a sharp transition in the sparsity of GD/SGD solutions when increasing the learning rate and investigate how this sparsity preference affects the LMC loss barrier modulo permutation. Experiments on both synthetic and MNIST datasets corroborate our theoretical predictions and reveal a similar trend for more complex network architectures.

Comment: The paper provides a theoretical analysis of linear mode connectivity and sparsity in neural networks, which aligns with representation learning and training dynamics.

Relevance: 9 Novelty: 8
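
To make "loss barrier modulo permutation" from the abstract above concrete, here is a minimal sketch on a toy two-layer ReLU network with scalar input. It assumes the commonly used barrier definition (the largest gap between the loss on the interpolation path and the linear interpolation of the endpoint losses) and finds the best hidden-unit permutation by brute force, which is only feasible for tiny widths. The networks, data, and function names are hypothetical and not taken from the paper.

```python
import itertools

def relu(z):
    return z if z > 0 else 0.0

def predict(params, x):
    """Two-layer ReLU net with scalar input: f(x) = sum_j a_j * relu(w_j * x)."""
    w, a = params
    return sum(aj * relu(wj * x) for wj, aj in zip(w, a))

def loss(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def interpolate(p, q, t):
    """Parameter-wise linear interpolation (1 - t) * p + t * q."""
    return tuple(
        [(1 - t) * u + t * v for u, v in zip(pu, qu)] for pu, qu in zip(p, q)
    )

def barrier(p, q, data, steps=21):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    lp, lq = loss(p, data), loss(q, data)
    worst = 0.0
    for i in range(steps):
        t = i / (steps - 1)
        gap = loss(interpolate(p, q, t), data) - ((1 - t) * lp + t * lq)
        worst = max(worst, gap)
    return worst

def permute(params, perm):
    """Reorder hidden units; this leaves the network function unchanged."""
    w, a = params
    return [w[i] for i in perm], [a[i] for i in perm]

def barrier_modulo_permutation(p, q, data):
    width = len(p[0])
    return min(
        barrier(p, permute(q, perm), data)
        for perm in itertools.permutations(range(width))
    )

# Teacher is f(x) = x; both students fit it exactly but order their units differently.
data = [(x, x) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
net_a = ([1.0, -1.0], [1.0, -1.0])
net_b = ([-1.0, 1.0], [-1.0, 1.0])
naive = barrier(net_a, net_b, data)
aligned = barrier_modulo_permutation(net_a, net_b, data)
```

Here the naive interpolation passes through the all-zero network and so has a positive barrier, while swapping the two hidden units of `net_b` makes the endpoints identical and the barrier vanishes.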


6. MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering

ArXiv ID: 2503.06296

Authors: Vinay Kumar Verma, Shreyas Sunil Kulkarni, Happy Mittal, Deepak Gupta

Abstract: Question Answering (QA) and Visual Question Answering (VQA) are well-studied problems in the language and vision domain. One challenging scenario involves multiple sources of information, each of a different modality, where the answer to the question may exist in one or more sources. This scenario contains richer information but is highly complex to handle. In this work, we formulate a novel question-answer generation (QAG) framework in an environment containing multi-source, multimodal information. The answer may belong to any or all sources; therefore, selecting the most prominent answer source or an optimal combination of all sources for a given question is challenging. To address this issue, we propose a question-guided attention mechanism that learns attention across multiple sources and decodes this information for robust and unbiased answer generation. To learn attention within each source, we introduce an explicit alignment between questions and various information sources, which facilitates identifying the most pertinent parts of the source information relative to the question. Scalability in handling diverse questions poses a challenge. We address this by extending our model to a sparse mixture-of-experts (sparse-MoE) framework, enabling it to handle thousands of question types. Experiments on T5 and Flan-T5 using three datasets demonstrate the model's efficacy, supported by ablation studies.

Comment: The paper proposes a sparse Mixture-of-Experts framework for multi-source, multi-modal question answering, which aligns with foundational research on MoE and scalability.

Relevance: 9 Novelty: 8


7. Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization

ArXiv ID: 2503.06202

Authors: Wei Liu, Zhiying Deng, Zhongyu Niu, Jun Wang, Haozhao Wang, Zhigang Zeng, Ruixuan Li

Abstract: Extracting a small subset of crucial rationales from the full input is a key problem in explainability research. The most widely used fundamental criterion for rationale extraction is the maximum mutual information (MMI) criterion. In this paper, we first demonstrate that MMI suffers from diminishing marginal returns. Once part of the rationale has been identified, finding the remaining portions contributes only marginally to increasing the mutual information, making it difficult to use MMI to locate the rest. In contrast to MMI that aims to reproduce the prediction, we seek to identify the parts of the input that the network can actually utilize. This is achieved by comparing how different rationale candidates match the capability space of the weight matrix. The weight matrix of a neural network is typically low-rank, meaning that the linear combinations of its column vectors can only cover part of the directions in a high-dimensional space (high-dimension: the dimensions of an input vector). If an input is fully utilized by the network, it generally matches these directions (e.g., a portion of a hypersphere), resulting in a representation with a high norm. Conversely, if an input primarily falls outside (orthogonal to) these directions, its representation norm will approach zero, behaving like noise that the network cannot effectively utilize. Building on this, we propose using the norms of rationale candidates as an alternative objective to MMI. Through experiments on four text classification datasets and one graph classification dataset using three network architectures (GRUs, BERT, and GCN), we show that our method outperforms MMI and its improved variants in identifying better rationales. We also compare our method with a representative LLM (llama-3.1-8b-instruct) and find that our simple method gets comparable results to it and can sometimes even outperform it.

Comment: The paper critiques the MMI criterion and proposes a novel alternative for rationale extraction, which aligns with representation learning and interpretability.

Relevance: 9 Novelty: 8
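
The geometric argument in the abstract above (inputs aligned with the directions spanned by a low-rank weight matrix yield high-norm representations, orthogonal inputs yield near-zero ones) can be checked on a tiny example. The sketch below assumes a single linear layer computing h = W^T x, with W stored as d_in rows by d_out columns; the matrix, the inputs, and the helper names are hypothetical.

```python
import math

def represent(weight, x):
    """h = W^T x, where `weight` is a list of rows with shape (d_in, d_out)."""
    d_out = len(weight[0])
    return [sum(weight[i][j] * x[i] for i in range(len(x))) for j in range(d_out)]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

# Rank-2 weight matrix in a 3-dimensional input space: its columns span only
# the first two input directions.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]

x_used = [3.0, 4.0, 0.0]    # lies inside the span of W's columns
x_unused = [0.0, 0.0, 5.0]  # orthogonal to every column of W

norm_used = norm(represent(W, x_used))
norm_unused = norm(represent(W, x_unused))
```

Both inputs have the same length, yet only `x_used` survives the layer; `x_unused` is mapped to the zero vector, which is the sense in which the network "cannot utilize" it.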


8. Seesaw: High-throughput LLM Inference via Model Re-sharding

ArXiv ID: 2503.06433

Authors: Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko

Abstract: To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics inherent in the two stages of LLM inference, prefilling and decoding, render a single static parallelization strategy insufficient for the effective optimization of both stages. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that facilitates the dynamic reconfiguration of parallelization strategies across stages, thereby maximizing throughput at both phases. To mitigate re-sharding overhead and optimize computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. These approaches work synergistically to reduce the overhead caused by frequent stage transitions while ensuring maximum batching efficiency. Our evaluation demonstrates that Seesaw achieves a throughput increase of up to 1.78x (1.36x on average) compared to vLLM, the most widely used state-of-the-art LLM inference engine.

Comment: The paper introduces a dynamic re-sharding technique for LLM inference, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


9. Learning Energy-Based Models by Self-normalising the Likelihood

ArXiv ID: 2503.07021

Authors: Hugo Senetaire, Paul Jeha, Pierre-Alexandre Mattei, Jes Frellsen

Abstract: Training an energy-based model (EBM) with maximum likelihood is challenging due to the intractable normalisation constant. Traditional methods rely on expensive Markov chain Monte Carlo (MCMC) sampling to estimate the gradient of the logarithm of the normalisation constant. We propose a novel objective called self-normalised log-likelihood (SNL) that introduces a single additional learnable parameter representing the normalisation constant compared to the regular log-likelihood. SNL is a lower bound of the log-likelihood, and its optimum corresponds to both the maximum likelihood estimate of the model parameters and the normalisation constant. We show that the SNL objective is concave in the model parameters for exponential family distributions. Unlike the regular log-likelihood, the SNL can be directly optimised using stochastic gradient techniques by sampling from a crude proposal distribution. We validate the effectiveness of our proposed method on various density estimation tasks as well as EBMs for regression. Our results show that the proposed method, while simpler to implement and tune, outperforms existing techniques.

Comment: The paper proposes a novel self-normalized log-likelihood objective for energy-based models, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8
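
The abstract above states that SNL adds one scalar parameter, lower-bounds the log-likelihood, and is tight when that scalar equals the log normalisation constant. One objective with exactly these properties follows from the inequality log z <= z - 1: replace log Z with b + exp(-b) * Z - 1. The sketch below checks this numerically on a three-state EBM. The exact form of the objective is an assumption here, since the abstract does not give the formula, and Z is summed exactly for the toy state space rather than estimated by sampling from a proposal.

```python
import math

def partition(energies):
    """Exact normalisation constant Z of a discrete EBM p(x) = exp(-E(x)) / Z."""
    return sum(math.exp(-e) for e in energies)

def mean_neg_energy(energies, data):
    return -sum(energies[x] for x in data) / len(data)

def log_likelihood(energies, data):
    """Average exact log-likelihood of the data."""
    return mean_neg_energy(energies, data) - math.log(partition(energies))

def snl(energies, data, b):
    """Assumed self-normalised objective; the scalar b stands in for log Z."""
    z = partition(energies)
    return mean_neg_energy(energies, data) - b - math.exp(-b) * z + 1.0

energies = [0.0, 1.0, 2.0]  # hypothetical energies of three states
data = [0, 0, 1, 2, 0, 1]   # hypothetical observed state indices
exact = log_likelihood(energies, data)
log_z = math.log(partition(energies))

# Evaluate the objective at several values of b around the true log Z.
offsets = (-1.0, -0.5, 0.0, 0.5, 1.0)
values = [snl(energies, data, log_z + d) for d in offsets]
```

Every entry of `values` stays at or below `exact`, and the middle entry (b equal to log Z) matches it, so maximising over b recovers the ordinary log-likelihood.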


10. Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform

ArXiv ID: 2503.06676

Authors: Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen

Abstract: With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.

Comment: The paper introduces a novel data-free delta compression method inspired by JPEG compression, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
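
Steps (a) to (c) in the abstract above can be sketched in plain Python as follows. This is a simplified illustration under several assumptions that the abstract does not confirm: patches are 1-D slices rather than 2-D blocks, patch importance is the L2 norm, the more important half of the patches gets the higher bit-width, and quantisation is uniform. All function names and numbers are hypothetical, and this is not the authors' Delta-DCT implementation.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D list."""
    n = len(x)
    return [
        math.sqrt((1 if k == 0 else 2) / n)
        * sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(x))
        for k in range(n)
    ]

def idct(c):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(
            math.sqrt((1 if k == 0 else 2) / n) * ck
            * math.cos(math.pi / n * (i + 0.5) * k)
            for k, ck in enumerate(c)
        )
        for i in range(n)
    ]

def quantize(values, bits):
    """Uniform quantisation onto 2**bits evenly spaced levels in [-peak, peak]."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    step = 2 * peak / (2 ** bits - 1)
    return [round((v + peak) / step) * step - peak for v in values]

def compress_delta(delta, patch_size, high_bits, low_bits):
    """(a) patch, (b) rank patches and assign bit-widths, (c) DCT + quantise."""
    patches = [delta[i:i + patch_size] for i in range(0, len(delta), patch_size)]
    norms = [math.sqrt(sum(v * v for v in p)) for p in patches]
    ranked = sorted(range(len(patches)), key=lambda i: norms[i], reverse=True)
    important = set(ranked[: len(patches) // 2])  # assumed 50/50 split
    restored = []
    for i, patch in enumerate(patches):
        bits = high_bits if i in important else low_bits
        # Quantise in the DCT domain, then transform back to weight space.
        restored.extend(idct(quantize(dct(patch), bits)))
    return restored

# Hypothetical weights; the delta is finetuned minus pretrained.
finetuned = [0.52, -0.31, 0.11, 0.90, 0.25, 0.26, -0.74, 0.40]
pretrained = [0.50, -0.30, 0.10, 0.85, 0.25, 0.25, -0.75, 0.40]
delta = [f - p for f, p in zip(finetuned, pretrained)]
approx = compress_delta(delta, patch_size=4, high_bits=8, low_bits=4)
max_error = max(abs(a - d) for a, d in zip(approx, delta))
```

No data or training is involved at any point: the bit allocation depends only on the delta itself, which is the property the abstract calls data-free.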


11. Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization

ArXiv ID: 2503.06982

Authors: Ziqing Xu, Hancheng Min, Lachlan Ewen MacDonald, Jinqi Luo, Salma Tarmoun, Enrique Mallada, Rene Vidal

Abstract: Despite the empirical success of Low-Rank Adaptation (LoRA) in fine-tuning pre-trained models, there is little theoretical understanding of how first-order methods with carefully crafted initialization adapt models to new tasks. In this work, we take the first step towards bridging this gap by theoretically analyzing the learning dynamics of LoRA for matrix factorization (MF) under gradient flow (GF), emphasizing the crucial role of initialization. For small initialization, we theoretically show that GF converges to a neighborhood of the optimal solution, with smaller initialization leading to lower final error. Our analysis shows that the final error is affected by the misalignment between the singular spaces of the pre-trained model and the target matrix, and reducing the initialization scale improves alignment. To address this misalignment, we propose a spectral initialization for LoRA in MF and theoretically prove that GF with small spectral initialization converges to the fine-tuning task with arbitrary precision. Numerical experiments from MF and image classification validate our findings.

Comment: The paper provides theoretical insights into the learning dynamics of LoRA, which aligns with representation learning and low-rank adaptation in model compression.

Relevance: 9 Novelty: 8


12. Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks

ArXiv ID: 2503.06181

Authors: Devon Jarvis, Richard Klein, Benjamin Rosman, Andrew M. Saxe

Abstract: In spite of finite dimension ReLU neural networks being a consistent factor behind recent deep learning successes, a theory of feature learning in these models remains elusive. Currently, insightful theories still rely on assumptions including the linearity of the network computations, unstructured input data and architectural constraints such as infinite width or a single hidden layer. To begin to address this gap we establish an equivalence between ReLU networks and Gated Deep Linear Networks, and use their greater tractability to derive dynamics of learning. We then consider multiple variants of a core task reminiscent of multi-task learning or contextual control which requires both feature learning and nonlinearity. We make explicit that, for these tasks, the ReLU networks possess an inductive bias towards latent representations which are not strictly modular or disentangled but are still highly structured and reusable between contexts. This effect is amplified with the addition of more contexts and hidden layers. Thus, we take a step towards a theory of feature learning in finite ReLU networks and shed light on how structured mixed-selective latent representations can emerge due to a bias for node-reuse and learning speed.

Comment: The paper provides theoretical insights into feature learning in ReLU networks, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8


13. IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining

ArXiv ID: 2503.05920

Authors: Yixiao Li, Xianzhi Du, Ajay Jaiswal, Tao Lei, Tuo Zhao, Chong Wang, Jianyu Wang

Abstract: Recent advancements in large language models have intensified the need for efficient and deployable models within limited inference budgets. Structured pruning pipelines have shown promise in token efficiency compared to training target-size models from scratch. In this paper, we advocate incorporating enlarged model pretraining, which is often ignored in previous works, into pruning. We study the enlarge-and-prune pipeline as an integrated system to address two critical questions: whether it is worth pretraining an enlarged model even when the model is never deployed, and how to optimize the entire pipeline for better pruned models. We propose an integrated enlarge-and-prune pipeline, which combines enlarged model training, pruning, and recovery under a single cosine annealing learning rate schedule. This approach is further complemented by a novel iterative structured pruning method for gradual parameter removal. The proposed method helps to mitigate the knowledge loss caused by the rising learning rate in naive enlarge-and-prune pipelines and enables effective redistribution of model capacity among surviving neurons, facilitating smooth compression and enhanced performance. We conduct comprehensive experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining. The results demonstrate that the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also achieves superior performance of pruned models.

Comment: The paper proposes an integrated enlarge-and-prune pipeline for generative language model pretraining, which aligns with foundational research in model compression.

Relevance: 9 Novelty: 8


14. Sample-aware Adaptive Structured Pruning for Large Language Models

ArXiv ID: 2503.06184

Authors: Jun Kong, Xinge Ma, Jin Wang, Xuejie Zhang

Abstract: Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.

Comment: The paper proposes a structured pruning framework for LLMs, which aligns with the model compression criterion. The use of adaptive methods adds novelty to the pruning process.

Relevance: 9 Novelty: 8


15. Task Vector Quantization for Memory-Efficient Model Merging

ArXiv ID: 2503.06921

Authors: Youngeun Kim, Seunghwan Lee, Aecheon Jung, Bogon Ryu, Sungeun Hong

Abstract: Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to larger models and diverse tasks. In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. We observe that task vectors exhibit a narrow weight range, enabling low precision quantization (up to 4 bit) within existing task vector merging frameworks. To further mitigate quantization errors within ultra-low bit precision (e.g., 2 bit), we introduce Residual Task Vector Quantization, which decomposes the task vector into a base vector and offset component. We allocate bits based on quantization sensitivity, ensuring precision while minimizing error within a memory budget. Experiments on image classification and dense prediction show our method maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints.

Comment: The paper introduces a memory-efficient model merging method using task vector quantization, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
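
Illustration for entry 15: a minimal sketch of why quantizing a task vector can lose less precision than quantizing the fine-tuned checkpoint directly. It assumes plain uniform min-max quantization, which may differ from the quantizer used in the paper. The toy weight values and the helper names `quantize`, `dequantize`, and `max_abs_error` are hypothetical.

```python
def quantize(values, bits):
    """Uniform min-max quantization: returns integer codes plus (scale, lo)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Toy weights (made up for illustration only).
pretrained = [0.80, -1.20, 0.35, 2.10, -0.65, 1.40]
finetuned = [0.83, -1.18, 0.31, 2.12, -0.66, 1.45]

# Task vector: difference between fine-tuned and pre-trained weights.
task_vector = [f - p for f, p in zip(finetuned, pretrained)]

# Option A: quantize the fine-tuned checkpoint directly (wide value range).
codes_a, scale_a, lo_a = quantize(finetuned, bits=4)
recon_a = dequantize(codes_a, scale_a, lo_a)

# Option B: quantize the task vector (narrow range), then add it back
# to the shared full-precision pre-trained weights.
codes_b, scale_b, lo_b = quantize(task_vector, bits=4)
recon_b = [p + t for p, t in zip(pretrained, dequantize(codes_b, scale_b, lo_b))]

err_a = max_abs_error(finetuned, recon_a)
err_b = max_abs_error(finetuned, recon_b)
print(f"direct 4-bit error: {err_a:.4f}, task-vector 4-bit error: {err_b:.4f}")
```

Because the quantization step size is proportional to the value range, the narrower range of the task vector gives a finer grid at the same bit width. The sketch does not cover the residual (base plus offset) decomposition or the sensitivity-based bit allocation mentioned in the abstract.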


16. How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders

ArXiv ID: 2503.06394

Authors: Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi

Abstract: Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.

Comment: The paper uses sparse autoencoders to trace internal representations in LLMs, directly addressing representation learning and interpretability in LLMs.

Relevance: 9 Novelty: 8


17. ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

ArXiv ID: 2503.06881

Authors: Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He

Abstract: Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/ResMoE.

Comment: The paper introduces a compression method for Mixture-of-Experts models, which aligns with model compression and efficiency improvements.

Relevance: 9 Novelty: 8


18. eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference

ArXiv ID: 2503.06823

Authors: Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer

Abstract: In recent years, Mixture-of-Experts (MoE) has emerged as an effective approach for enhancing the capacity of deep neural networks (DNNs) with sub-linear computational costs. However, storing all experts on GPUs incurs significant memory overhead, increasing the monetary cost of MoE-based inference. To address this, we propose eMoE, a memory-efficient inference system for MoE-based large language models (LLMs) that leverages our observations from experimental measurements. eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. To reduce loading latency while maintaining accuracy, as we found using the same experts for subsequent prompts has minimal impact on perplexity, eMoE invokes the expert predictor every few prompts rather than for each prompt. In addition, it skips predictions for tasks less sensitive to routing accuracy. Finally, it has task-aware scheduling to minimize inference latency by considering Service Level Objectives (SLOs), task-specific output lengths, and expert loading latencies. Experimental results show that compared to existing systems, eMoE reduces memory consumption by up to 80% while maintaining accuracy and reduces inference latency by up to 17%. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.

Comment: The paper proposes a memory-efficient MoE inference system, directly aligning with the model architecture and efficiency criteria.

Relevance: 9 Novelty: 8


19. Towards Superior Quantization Accuracy: A Layer-sensitive Approach

ArXiv ID: 2503.06518

Authors: Feng Zhang, Yanbin Liu, Weihua Li, Jie Lv, Xiaodan Wang, Quan Bai

Abstract: Large Vision and Language Models have exhibited remarkable human-like intelligence in tasks such as natural language comprehension, problem-solving, logical reasoning, and knowledge retrieval. However, training and serving these models require substantial computational resources, posing a significant barrier to their widespread application and further research. To mitigate this challenge, various model compression techniques have been developed to reduce computational requirements. Nevertheless, existing methods often employ uniform quantization configurations, failing to account for the varying difficulties across different layers in quantizing large neural network models. This paper tackles this issue by leveraging layer-sensitivity features, such as activation sensitivity and weight distribution kurtosis, to identify layers that are challenging to quantize accurately and allocate additional memory budget to them. The proposed methods, named SensiBoost and KurtBoost, respectively, demonstrate notable improvement in quantization accuracy, achieving up to 9% lower perplexity with only a 2% increase in memory budget on LLaMA models compared to the baseline.

Comment: This paper proposes a layer-sensitive approach to quantization, which directly aligns with the model compression criterion. The methods SensiBoost and KurtBoost provide novel insights into layer-specific quantization strategies, improving accuracy with minimal memory overhead.

Relevance: 9 Novelty: 8
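
Illustration for entry 19: a minimal sketch of using weight-distribution kurtosis to flag hard-to-quantize layers. This is not the authors' KurtBoost procedure. It assumes, for illustration only, that "additional memory budget" means a higher bit width for the top-ranked layers. The layer names, weight values, and the `allocate_bits` helper are hypothetical.

```python
def kurtosis(values):
    """Pearson kurtosis (fourth standardized moment); about 3 for a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


def allocate_bits(layers, base_bits=4, boosted_bits=8, num_boosted=1):
    """Give the highest-kurtosis layers a larger bit width (assumed policy)."""
    ranked = sorted(layers, key=lambda name: kurtosis(layers[name]), reverse=True)
    boosted = set(ranked[:num_boosted])
    return {name: (boosted_bits if name in boosted else base_bits) for name in layers}


# Toy per-layer weights (made up). layer_1 has a single large outlier,
# which produces a heavy-tailed distribution and therefore high kurtosis.
layers = {
    "layer_0": [-1.0, -0.5, 0.0, 0.5, 1.0, -1.0, 1.0, 0.5],
    "layer_1": [0.0, 0.01, -0.01, 0.0, 0.02, -0.02, 0.0, 3.0],
    "layer_2": [-0.3, 0.3, -0.2, 0.2, -0.1, 0.1, 0.0, 0.0],
}

plan = allocate_bits(layers)
for name, bits in plan.items():
    print(f"{name}: kurtosis={kurtosis(layers[name]):.2f}, bits={bits}")
```

Outlier-heavy layers stretch the quantization range, so most weights end up sharing only a few grid levels. That is one intuition for treating high kurtosis as a signal of quantization difficulty.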


20. InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models

ArXiv ID: 2503.06692

Authors: Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang

Abstract: Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.

Comment: This paper introduces a novel paradigm for long-context reasoning in LLMs, addressing computational scaling and reasoning depth. It aligns with foundational research in LLMs by proposing a new iterative reasoning framework, which could have broader implications for model efficiency and architecture.

Relevance: 9 Novelty: 8


21. Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers

ArXiv ID: 2503.06183

Authors: Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello

Abstract: The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area- and power-constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero indices decompression operations required by our kernels, obtaining up to 1.9x extra speedup, at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), with less than 1.5% accuracy drop compared to a dense baseline.

Comment: The paper focuses on sparsity and pruning techniques for efficient DNNs on microcontrollers, aligning with the model compression criterion.

Relevance: 9 Novelty: 7
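
Illustration for entry 21: a minimal sketch of N:M structured sparsity, where only N weights are kept in every block of M consecutive weights and are stored with their in-block offsets. The sketch uses 1:4 sparsity for brevity rather than the 1:8 and 1:16 patterns in the abstract. The storage layout and the names `prune_n_m` and `sparse_dot` are assumptions, not the paper's kernels.

```python
def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude weights in each block of m.

    Returns the surviving values and their offsets inside each block.
    """
    assert len(weights) % m == 0, "length must be a multiple of m"
    values, offsets = [], []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        keep = sorted(range(m), key=lambda i: abs(block[i]), reverse=True)[:n]
        for i in sorted(keep):
            values.append(block[i])
            offsets.append(i)
    return values, offsets


def sparse_dot(values, offsets, activations, n, m):
    """Dot product that touches only the kept weights.

    Each kept weight needs an indirect load: the activation index is
    computed from the block start plus the stored offset.
    """
    total = 0.0
    for k, (v, off) in enumerate(zip(values, offsets)):
        block_start = (k // n) * m
        total += v * activations[block_start + off]
    return total


weights = [0.1, -0.9, 0.2, 0.05, 0.0, 0.3, -0.1, 0.7]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

values, offsets = prune_n_m(weights, n=1, m=4)
result = sparse_dot(values, offsets, activations, n=1, m=4)
print(values, offsets, result)
```

The indirect load inside `sparse_dot` corresponds to the operation the abstract says the ISA extension accelerates, alongside decompression of the non-zero indices.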


22. This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

ArXiv ID: 2503.05856

Authors: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic

Abstract: Mixture of large language model (LLM) Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a single carefully-instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.

Comment: The paper evaluates the robustness of Mixture-of-Agents (MoA) architectures against deceptive LLM agents, which relates to the model architecture criterion.

Relevance: 9 Novelty: 7


23. Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

ArXiv ID: 2503.07154

Authors: Jiaming Song, Linqi Zhou

Abstract: Recent years have seen significant advancements in foundation models through generative pre-training, yet algorithmic innovation in this space has largely stagnated around autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation creates a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits the progress on multimodal intelligence. We argue that an inference-first perspective, which prioritizes scaling efficiency during inference time across sequence length and refinement steps, can inspire novel generative pre-training algorithms. Using Inductive Moment Matching (IMM) as a concrete example, we demonstrate how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm that achieves superior sample quality with over an order of magnitude greater inference efficiency.

Comment: The paper discusses inference-time scaling for generative pre-training algorithms, which aligns with foundational research on efficiency and generative paradigms.

Relevance: 8 Novelty: 8


24. Characterizing Learning in Spiking Neural Networks with Astrocyte-Like Units

ArXiv ID: 2503.06798

Authors: Christopher S. Yang, Sylvester J. Gates III, Dulara De Zoysa, Jaehoon Choe, Wolfgang Losert, Corey B. Hart

Abstract: Traditional artificial neural networks take inspiration from biological networks, using layers of neuron-like nodes to pass information for processing. More realistic models include spiking in the neural network, capturing the electrical characteristics more closely. However, a large proportion of brain cells are of the glial cell type, in particular astrocytes, which have been suggested to play a role in performing computations. Here, we introduce a modified spiking neural network model with added astrocyte-like units and assess their impact on learning. We implement the network as a liquid state machine and task the network with performing a chaotic time-series prediction task. We varied the number and ratio of neuron-like and astrocyte-like units in the network to examine the latter units' effect on learning. We show that the combination of neurons and astrocytes together, as opposed to neuron-only and astrocyte-only networks, is critical for driving learning. Interestingly, we found that the highest learning rate was achieved when the ratio between astrocyte-like and neuron-like units was roughly 2 to 1, mirroring some estimates of the ratio of biological astrocytes to neurons. Our results demonstrate that incorporating astrocyte-like units which represent information across longer timescales can alter the learning rates of neural networks, and the proportion of astrocytes to neurons should be tuned appropriately to a given task.

Comment: The paper explores the impact of astrocyte-like units in spiking neural networks, which is an emerging trend in foundational research.

Relevance: 8 Novelty: 8


25. Understanding the role of autoencoders for stiff dynamical systems using information theory

ArXiv ID: 2503.06325

Authors: Vijayamanikandan Vijayarangan, Harshavardhana A. Uranakara, Francisco E. Hernández-Pérez, Hong G. Im

Abstract: Using information theory, this study provides insights into how the construction of the latent space of an autoencoder (AE) through deep neural network (DNN) training finds a smooth low-dimensional manifold in a stiff dynamical system. Our recent study [1] reported that an autoencoder (AE) combined with neural ODE (NODE) as a surrogate reduced order model (ROM) for the integration of stiff chemically reacting systems led to a significant reduction in the temporal stiffness, and the behavior was attributed to the identification of a slow invariant manifold by the nonlinear projection of the AE. The present work offers fundamental understanding of the mechanism by employing concepts from information theory and better mixing. The learning mechanisms of both the encoder and decoder are explained by plotting the evolution of mutual information and identifying two different phases. Subsequently, the density distribution is plotted for the physical and latent variables, which shows the transformation of the "rare event" in the physical space to a "highly likely" (more probable) event in the latent space provided by the nonlinear autoencoder. Finally, the nonlinear transformation leading to density redistribution is explained using concepts from information theory and probability.

Comment: The paper provides insights into how autoencoders encode information in stiff dynamical systems, aligning with representation learning.

Relevance: 8 Novelty: 8


26. Nearly Optimal Differentially Private ReLU Regression

ArXiv ID: 2503.06009

Authors: Meng Ding, Mingxi Lei, Shaowei Wang, Tianhang Zheng, Di Wang, Jinhui Xu

Abstract: In this paper, we investigate one of the most fundamental nonconvex learning problems, ReLU regression, in the Differential Privacy (DP) model. Previous studies on private ReLU regression heavily rely on stringent assumptions, such as constant bounded norms for feature vectors and labels. We relax these assumptions to a more standard setting, where data can be i.i.d. sampled from $O(1)$-sub-Gaussian distributions. We first show that when $\varepsilon = \tilde{O}(\sqrt{\frac{1}{N}})$ and there is some public data, it is possible to achieve an upper bound of $\tilde{O}(\frac{d^2}{N^2 \varepsilon^2})$ for the excess population risk in $(\varepsilon, \delta)$-DP, where $d$ is the dimension and $N$ is the number of data samples. Moreover, we relax the requirement of $\varepsilon$ and public data by proposing and analyzing a one-pass mini-batch Generalized Linear Model Perceptron algorithm (DP-MBGLMtron). Additionally, using the tracing attack argument technique, we demonstrate that the minimax rate of the estimation error for $(\varepsilon, \delta)$-DP algorithms is lower bounded by $\Omega(\frac{d^2}{N^2 \varepsilon^2})$. This shows that DP-MBGLMtron achieves the optimal utility bound up to logarithmic factors. Experiments further support our theoretical results.

Comment: The paper investigates differentially private ReLU regression, a fundamental nonconvex learning problem. It relaxes prior assumptions and proves matching upper and lower utility bounds up to logarithmic factors, which aligns with theoretical foundations of learning under privacy constraints.

Relevance: 8 Novelty: 8


27. From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning

ArXiv ID: 2503.05919

Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab

Abstract: Finetuning provides a scalable and cost-effective means of customizing language models for specific tasks or response styles, with greater reliability than prompting or in-context learning. In contrast, the conventional wisdom is that injecting knowledge via finetuning results in brittle performance and poor generalization. We argue that the dichotomy of "task customization" (e.g., instruction tuning) and "knowledge injection" (e.g., teaching new facts) is a distinction without a difference. We instead identify concrete factors that explain the heterogeneous effectiveness observed with finetuning. To this end, we conduct a large-scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning. Our findings indicate that question-answer training data formats provide much stronger knowledge generalization than document/article-style training data, numerical information can be harder for finetuning to retain than categorical information, and models struggle to apply finetuned knowledge during multi-step reasoning even when trained on similar examples -- all factors that render "knowledge injection" to be especially difficult, even after controlling for considerations like data augmentation and information volume. On the other hand, our findings also indicate that it is not fundamentally more difficult to finetune information about a real-world event than information about what a model's writing style should be.

Comment: The paper studies finetuning for knowledge injection in LLMs, providing insights into the limitations and challenges of finetuning, which aligns with foundational research on LLM behavior.

Relevance: 8 Novelty: 7


28. Uncertainty Quantification From Scaling Laws in Deep Neural Networks

ArXiv ID: 2503.05938

Authors: Ibrahim Elsharkawy, Yonatan Kahn, Benjamin Hooberman

Abstract: Quantifying the uncertainty from machine learning analyses is critical to their use in the physical sciences. In this work we focus on uncertainty inherited from the initialization distribution of neural networks. We compute the mean $\mu_{\mathcal{L}}$ and variance $\sigma_{\mathcal{L}}^2$ of the test loss $\mathcal{L}$ for an ensemble of multi-layer perceptrons (MLPs) with neural tangent kernel (NTK) initialization in the infinite-width limit, and compare empirically to the results from finite-width networks for three example tasks: MNIST classification, CIFAR classification and calorimeter energy regression. We observe scaling laws as a function of training set size $N_\mathcal{D}$ for both $\mu_{\mathcal{L}}$ and $\sigma_{\mathcal{L}}$, but find that the coefficient of variation $\epsilon_{\mathcal{L}} \equiv \sigma_{\mathcal{L}}/\mu_{\mathcal{L}}$ becomes independent of $N_\mathcal{D}$ at both infinite and finite width for sufficiently large $N_\mathcal{D}$. This implies that the coefficient of variation of a finite-width network may be approximated by its infinite-width value, and may in principle be calculable using finite-width perturbation theory.

Comment: The paper investigates uncertainty quantification in neural networks using scaling laws, which aligns with representation learning and theoretical insights.

Relevance: 8 Novelty: 7
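
Illustration for entry 28: a minimal sketch of estimating the coefficient of variation $\epsilon_{\mathcal{L}} = \sigma_{\mathcal{L}}/\mu_{\mathcal{L}}$ of the test loss over an ensemble of networks that differ only in their random initialization. The loss values are made-up placeholders, not results from the paper, and the function name `coefficient_of_variation` is hypothetical.

```python
import statistics


def coefficient_of_variation(losses):
    """Return (mu, sigma, sigma / mu) for a list of per-network test losses."""
    mu = statistics.mean(losses)
    sigma = statistics.pstdev(losses)  # population std over the ensemble
    return mu, sigma, sigma / mu


# Placeholder test losses, one per random initialization (illustrative only).
ensemble_losses = [0.50, 0.52, 0.48, 0.51, 0.49]

mu, sigma, eps = coefficient_of_variation(ensemble_losses)
print(f"mu={mu:.4f}, sigma={sigma:.4f}, epsilon={eps:.4f}")
```

Repeating this computation at several training set sizes $N_\mathcal{D}$ is one way to check empirically whether the ratio levels off, as the abstract reports for sufficiently large $N_\mathcal{D}$.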


29. Delusions of Large Language Models

ArXiv ID: 2503.06709

Authors: Hongshen Xu, Zixv Yang, Zichen Zhu, Kunyao Lan, Zihan Wang, Mengyue Wu, Ziwei Ji, Lu Chen, Pascale Fung, Kai Yu

Abstract: Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high-belief hallucinations: incorrect outputs with abnormally high confidence, which makes them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self-reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval-augmented generation and multi-agent debate. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.

Comment: The paper investigates LLM delusions, linking them to training dynamics and dataset noise, which aligns with foundational research on training dynamics in neural networks.

Relevance: 8 Novelty: 7


30. GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks

ArXiv ID: 2503.06352

Authors: Xiao Yue, Guangzhi Qu, Lige Gan

Abstract: One significant challenge of exploiting graph neural networks (GNNs) in real-life scenarios is that they are always treated as black boxes, which creates a need for interpretability. Model-level interpretations explain what patterns maximize the probability of predicting a certain class. However, existing model-level interpretation methods have several limitations, such as generating invalid explanation graphs and requiring extensive manual fine-tuning of hyperparameters. In this paper, we propose a new Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks (GIN-Graph), to generate reliable model-level explanation graphs. Implicit and likelihood-free generative adversarial networks are exploited to construct explanation graphs similar to the original graphs, while maximizing the prediction probability for a certain class by adopting a novel objective function. Experimental results indicate that GIN-Graph can be easily applied to GNN models trained on a variety of graph datasets to create meaningful explanation graphs without requiring extensive fine-tuning of hyperparameters.

Comment: The paper proposes a generative interpretation network for GNNs, which aligns with foundational research on interpretability and model-level explanations.

Relevance: 8 Novelty: 7


31. Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels

ArXiv ID: 2503.06211

Authors: Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Ricard Marxer

Abstract: Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute. Representation analyses and improved multimodal performance suggest our method enhances cross-modal transfer.

Comment: The paper focuses on text-speech language models and proposes methods to improve cross-modal transfer, which aligns with foundational research on representation learning and model architecture.

Relevance: 8 Novelty: 7


32. SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

ArXiv ID: 2503.06764

Authors: Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiandan Liang

Abstract: We present SemHiTok, a unified image tokenizer via a Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation tasks, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction and pixel reconstruction and equips the tokenizer with low-level texture feature extraction capability without degradation of high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves a state-of-the-art rFID score at 256×256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.

Comment: The paper introduces a unified image tokenizer for multimodal tasks, which is relevant to foundational research on representation learning and multimodal understanding.

Relevance: 8 Novelty: 7


33. SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs

ArXiv ID: 2503.05980

Authors: Samir Abdaljalil, Hasan Kurban, Parichit Sharma, Erchin Serpedin, Rachad Atat

Abstract: Large language models (LLMs) are increasingly deployed across diverse domains, yet they are prone to generating factually incorrect outputs - commonly known as "hallucinations." Among existing mitigation strategies, uncertainty-based methods are particularly attractive due to their ease of implementation, independence from external data, and compatibility with standard LLMs. In this work, we introduce a novel and scalable uncertainty-based semantic clustering framework for automated hallucination detection. Our approach leverages sentence embeddings and hierarchical clustering alongside a newly proposed inconsistency measure, SINdex, to yield more homogeneous clusters and more accurate detection of hallucination phenomena across various LLMs. Evaluations on prominent open- and closed-book QA datasets demonstrate that our method achieves AUROC improvements of up to 9.3% over state-of-the-art techniques. Extensive ablation studies further validate the effectiveness of each component in our framework.

Comment: The paper proposes a semantic inconsistency index for hallucination detection in LLMs, which is relevant to foundational research on LLM behavior and interpretability.

Relevance: 8 Novelty: 7
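
The SINdex abstract describes clustering sampled answers by sentence embedding and scoring how inconsistent the clusters are. A minimal sketch of that idea follows, assuming toy 2-D embeddings, average-linkage agglomerative clustering with a cosine-similarity threshold, and Shannon entropy as the inconsistency score. The abstract does not give the SINdex formula, so the entropy score, the threshold, and all function names here are hypothetical stand-ins rather than the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def agglomerative_clusters(embeddings, threshold=0.9):
    """Average-linkage clustering: merge until no pair exceeds the threshold."""
    clusters = [[i] for i in range(len(embeddings))]

    def avg_sim(a, b):
        sims = [cosine(embeddings[i], embeddings[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if s > best_sim:
                    best_sim, best_pair = s, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def inconsistency_score(clusters):
    """Entropy (bits) of cluster sizes; a stand-in for SINdex, not its definition."""
    n = sum(len(c) for c in clusters)
    entropy = -sum((len(c) / n) * math.log2(len(c) / n) for c in clusters)
    return entropy + 0.0  # normalizes -0.0 to 0.0 for the single-cluster case


# Hypothetical embeddings of sampled answers to the same question.
consistent = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
```

Under these assumptions, answers that all land in one cluster score zero, while answers that never merge score the maximum entropy for their count, which is the signal an uncertainty-based detector would threshold.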


34. Language Models Fail to Introspect About Their Knowledge of Language

ArXiv ID: 2503.07513

Authors: Siyuan Song, Jennifer Hu, Kyle Mahowald

Abstract: There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.

Comment: The paper investigates introspection in LLMs, which aligns with theoretical insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 7


35. Enhancing CBMs Through Binary Distillation with Applications to Test-Time Intervention

ArXiv ID: 2503.06730

Authors: Matthew Shen, Aliyah Hsu, Abhineet Agarwal, Bin Yu

Abstract: Concept bottleneck models (CBM) aim to improve model interpretability by predicting human level "concepts" in a bottleneck within a deep learning model architecture. However, how the predicted concepts are used in predicting the target still either remains black-box or is simplified to maintain interpretability at the cost of prediction performance. We propose to use Fast Interpretable Greedy Sum-Trees (FIGS) to obtain Binary Distillation (BD). This new method, called FIGS-BD, distills a binary-augmented concept-to-target portion of the CBM into an interpretable tree-based model, while mimicking the competitive prediction performance of the CBM teacher. FIGS-BD can be used in downstream tasks to explain and decompose CBM predictions into interpretable binary-concept-interaction attributions and guide adaptive test-time intervention. Across 4 datasets, we demonstrate that adaptive test-time intervention identifies key concepts that significantly improve performance for realistic human-in-the-loop settings that allow for limited concept interventions.

Comment: The paper proposes a method to enhance concept bottleneck models, which aligns with representation learning and interpretability.

Relevance: 8 Novelty: 7


36. Using Subgraph GNNs for Node Classification: an Overlooked Potential Approach

ArXiv ID: 2503.06614

Authors: Qian Zeng, Xin Lin, Jingyi Gao, Yang Yu

Abstract: Previous studies have demonstrated the strong performance of Graph Neural Networks (GNNs) in node classification. However, most existing GNNs adopt a node-centric perspective and rely on global message passing, leading to high computational and memory costs that hinder scalability. To mitigate these challenges, subgraph-based methods have been introduced, leveraging local subgraphs as approximations of full computational trees. While this approach improves efficiency, it often suffers from performance degradation due to the loss of global contextual information, limiting its effectiveness compared to global GNNs. To address this trade-off between scalability and classification accuracy, we reformulate the node classification task as a subgraph classification problem and propose SubGND (Subgraph GNN for NoDe). This framework introduces a differentiated zero-padding strategy and an Ego-Alter subgraph representation method to resolve label conflicts while incorporating an Adaptive Feature Scaling Mechanism to dynamically adjust feature contributions based on dataset-specific dependencies. Experimental results on six benchmark datasets demonstrate that SubGND achieves performance comparable to or surpassing global message-passing GNNs, particularly in heterophilic settings, highlighting its effectiveness and scalability as a promising solution for node classification.

Comment: The paper reformulates node classification as a subgraph classification problem, which aligns with architectural innovations in GNNs.

Relevance: 8 Novelty: 7


37. Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs

ArXiv ID: 2503.07384

Authors: Gonzalo Mancera, Daniel de Alcala, Julian Fierrez, Ruben Tolosana, Aythami Morales

Abstract: This work adapts and studies the gradient-based Membership Inference Test (gMINT) to the classification of text based on LLMs. MINT is a general approach intended to determine if given data was used for training machine learning models, and this work focuses on its application to the domain of Natural Language Processing. Using gradient-based analysis, the MINT model identifies whether particular data samples were included during the language model training phase, addressing growing concerns about data privacy in machine learning. The method was evaluated in seven Transformer-based models and six datasets comprising over 2.5 million sentences, focusing on text classification tasks. Experimental results demonstrate MINT's robustness, achieving AUC scores between 85% and 99%, depending on data size and model architecture. These findings highlight MINT's potential as a scalable and reliable tool for auditing machine learning models, ensuring transparency, safeguarding sensitive data, and fostering ethical compliance in the deployment of AI/NLP technologies.

Comment: The paper adapts a membership inference test for LLMs, which aligns with theoretical insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 7


38. TPU-Gen: LLM-Driven Custom Tensor Processing Unit Generator

ArXiv ID: 2503.05951

Authors: Deepak Vungarala, Mohammed E. Elbtity, Sumiya Syed, Sakila Alam, Kartik Pandit, Arnob Ghosh, Ramtin Zand, Shaahin Angizi

Abstract: The increasing complexity and scale of Deep Neural Networks (DNNs) necessitate specialized tensor accelerators, such as Tensor Processing Units (TPUs), to meet various computational and energy efficiency requirements. Nevertheless, designing an optimal TPU remains challenging due to the high level of domain expertise required, considerable manual design time, and lack of high-quality, domain-specific datasets. This paper introduces TPU-Gen, the first Large Language Model (LLM) based framework designed to automate the exact and approximate TPU generation process, focusing on systolic array architectures. TPU-Gen is supported with a meticulously curated, comprehensive, and open-source dataset that covers a wide range of spatial array designs and approximate multiply-and-accumulate units, enabling design reuse, adaptation, and customization for different DNN workloads. The proposed framework leverages Retrieval-Augmented Generation (RAG) as an effective solution for a data-scarce hardware domain in building LLMs, addressing the most intriguing issue, hallucinations. TPU-Gen transforms high-level architectural specifications into optimized low-level implementations through an effective hardware generation pipeline. Our extensive experimental evaluations demonstrate superior performance, power, and area efficiency, with an average reduction in area and power of 92% and 96% from the manual optimization reference values. These results set new standards for driving advancements in next-generation design automation tools powered by LLMs.

Comment: The paper introduces an LLM-driven framework for TPU generation, which aligns with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7


39. Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

ArXiv ID: 2503.06473

Authors: Hanze Li, Xiande Huang

Abstract: Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.

Comment: The paper proposes a method to enhance layer attention efficiency through pruning, which aligns with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7
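
The Efficient Layer Attention abstract quantifies redundancy with the KL divergence between the attention weights of adjacent layers. The sketch below illustrates only that measurement, assuming each layer's attention is a single probability vector. The fixed threshold is a hypothetical placeholder for the paper's EBQM selection rule, which the abstract does not specify, and the direction of the divergence is likewise an assumption.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def redundant_layers(attn_by_layer, threshold=0.01):
    """Indices of layers whose attention nearly matches the previous layer.

    `threshold` is an illustrative constant, not the EBQM rule from the paper.
    """
    flagged = []
    for i in range(1, len(attn_by_layer)):
        if kl_divergence(attn_by_layer[i], attn_by_layer[i - 1]) < threshold:
            flagged.append(i)
    return flagged


# Hypothetical per-layer attention weights over three earlier layers.
layers = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],  # nearly the same as layer 0
    [0.10, 0.20, 0.70],  # clearly different
    [0.10, 0.20, 0.70],  # identical to layer 2
]
print(redundant_layers(layers))
```

Layers flagged this way are the candidates a method of this kind would skip when retrieving from earlier layers.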


40. BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling

ArXiv ID: 2503.06121

Authors: Li weile, Liu Xiao

Abstract: Time series models face significant challenges in scaling to handle large and complex datasets, akin to the scaling achieved by large language models (LLMs). The unique characteristics of time series data and the computational demands of model scaling necessitate innovative approaches. While researchers have explored various architectures such as Transformers, LSTMs, and GRUs to address these challenges, we propose a novel solution using RWKV-7, which incorporates meta-learning into its state update mechanism. By integrating RWKV-7's time mix and channel mix components into the transformer-based time series model Timer, we achieve a substantial performance improvement of approximately 1.13 to 43.3x and a 4.5x reduction in training time with 1/23 of the parameters. Our code and model weights are publicly available for further research and development at https://github.com/Alic-Li/BlackGoose_Rimer.

Comment: The paper introduces RWKV-7 as a replacement for Transformers in time series modeling, which aligns with foundational research in model architecture.

Relevance: 8 Novelty: 7


41. Deep Cut-informed Graph Embedding and Clustering

ArXiv ID: 2503.06635

Authors: Zhiyuan Ning, Zaitian Wang, Ran Zhang, Ping Xu, Kunpeng Liu, Pengyang Wang, Chong Chen, Pengfei Wang, Yuanchun Zhou, Erik Cambria

Abstract: Graph clustering aims to divide the graph into different clusters. The recently emerging deep graph clustering approaches are largely built on graph neural networks (GNN). However, GNN is designed for general graph encoding and there is a common issue of representation collapse in existing GNN-based deep graph clustering algorithms. We attribute this issue to two main causes: (i) the inductive bias of GNN models: GNNs tend to generate similar representations for proximal nodes. Since graphs often contain a non-negligible amount of inter-cluster links, the bias results in erroneous message passing and leads to biased clustering; (ii) the clustering guided loss function: most traditional approaches strive to make all samples closer to pre-learned cluster centers, which causes a degenerate solution that assigns all data points to a single label and thus makes the samples less discriminative. To address these challenges, we investigate graph clustering from a graph cut perspective and propose an innovative and non-GNN-based Deep Cut-informed Graph embedding and Clustering framework, namely DCGC. This framework includes two modules: (i) cut-informed graph encoding; (ii) self-supervised graph clustering via optimal transport. For the encoding module, we derive a cut-informed graph embedding objective to fuse graph structure and attributes by minimizing their joint normalized cut. For the clustering module, we utilize the optimal transport theory to obtain the clustering assignments, which can balance the guidance of proximity to the pre-learned cluster center. With the above two tailored designs, DCGC is more suitable for the graph clustering task, which can effectively alleviate the problem of representation collapse and achieve better performance. We conduct extensive experiments to demonstrate that our method is simple but effective compared with benchmarks.

Comment: The paper introduces a novel graph clustering framework with a focus on representation learning through graph embeddings. It aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


42. TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation

ArXiv ID: 2503.07050

Authors: Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Peng Gao, Hongsheng Li

Abstract: Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion models. To bridge this gap, we introduce TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs), a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features, revealing that diffusion models inherently learn hierarchical features at multiple levels (e.g., 3D, semantic, class) during generative pre-training. Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97, demonstrating superior accuracy in capturing activation dynamics along the denoising trajectory. Beyond interpretability, we showcase TIDE's potential in downstream applications such as sparse activation-guided image editing and style transfer, enabling improved controllability for generative systems. By providing a comprehensive training and evaluation protocol tailored for DiTs, TIDE contributes to developing more interpretable, transparent, and trustworthy generative models.

Comment: The paper introduces sparse autoencoders for diffusion transformers, which aligns with representation learning and architectural insights, particularly in sparse methods and interpretability.

Relevance: 8 Novelty: 7


43. TokenButler: Token Importance is Predictable

ArXiv ID: 2503.07518

Authors: Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah

Abstract: Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily dependent on the input query. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity and downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: https://github.com/abdelfattah-lab/TokenButler

Comment: The paper introduces a token importance predictor for KV-cache optimization, which aligns with model compression and efficiency improvements in LLMs.

Relevance: 8 Novelty: 7
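
The TokenButler abstract rests on the observation that only a small subset of cached tokens contributes meaningfully to each decoding step. The sketch below illustrates that observation on toy data, using the true query-key scores as an oracle importance signal. TokenButler instead trains a lightweight predictor for these scores, which is not reproduced here, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(importance, k):
    """Positions of the k highest-importance tokens, in cache order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


def retained_attention_mass(query, keys, k):
    """Keep the top-k tokens by dot-product score; report the attention mass kept."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    weights = softmax(scores)
    kept = top_k_tokens(scores, k)
    return kept, sum(weights[i] for i in kept)


# Hypothetical query and five cached keys.
query = [1.0, 0.0]
keys = [[5.0, 0.0], [0.0, 1.0], [4.0, 0.0], [-1.0, 0.0], [0.0, -2.0]]
kept, mass = retained_attention_mass(query, keys, k=2)
```

In this toy case two of five tokens carry almost all of the attention mass, which is the property a query-aware importance predictor would try to exploit without computing the full scores.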


44. DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

ArXiv ID: 2503.07067

Authors: Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun

Abstract: Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.

Comment: The paper introduces a contrastive approach for LLM distillation, which aligns with foundational improvements in LLM training and efficiency.

Relevance: 8 Novelty: 7
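
The DistiLLM-2 abstract argues that teacher-generated and student-generated data should not share one loss. As a rough illustration, the sketch below pairs a forward KL term on teacher responses (which raises the student's likelihood of teacher tokens) with a reverse KL term on student responses (which penalizes student mass where the teacher assigns little). This pairing is an assumption made for illustration. The abstract does not state the exact losses, and every name below is hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) between two next-token distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean(values):
    return sum(values) / len(values)


def contrastive_distill_loss(teacher_resp, student_resp):
    """Each argument is a list of (teacher_dist, student_dist) pairs, one per token.

    teacher_resp: distributions evaluated along a teacher-generated response.
    student_resp: distributions evaluated along a student-generated response.
    """
    # Forward KL on teacher data: pull the student toward teacher tokens.
    pull = mean([kl(t, s) for t, s in teacher_resp])
    # Reverse KL on student data: push the student off tokens the teacher rejects.
    push = mean([kl(s, t) for t, s in student_resp])
    return pull + push


# Hypothetical two-token responses over a three-word vocabulary.
matched = [([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])] * 2
mismatched = [([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])] * 2
```

When student and teacher agree on both kinds of data the loss is zero, and disagreement on either kind increases it, so the two terms can be weighted or swapped independently per data type.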


45. AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning

ArXiv ID: 2503.06112

Authors: Hoang-Thang Ta, Anh Tran

Abstract: Kolmogorov-Arnold Networks (KANs) have inspired numerous works exploring their applications across a wide range of scientific problems, with the potential to replace Multilayer Perceptrons (MLPs). While many KANs are designed using basis and polynomial functions, such as B-splines, ReLU-KAN utilizes a combination of ReLU functions to mimic the structure of B-splines and take advantage of ReLU's speed. However, ReLU-KAN is not built for multiple inputs, and its limitations stem from ReLU's handling of negative values, which can restrict feature extraction. To address these issues, we introduce Activation Function-Based Kolmogorov-Arnold Networks (AF-KAN), expanding ReLU-KAN with various activations and their function combinations. This novel KAN also incorporates parameter reduction methods, primarily attention mechanisms and data normalization, to enhance performance on image classification datasets. We explore different activation functions, function combinations, grid sizes, and spline orders to validate the effectiveness of AF-KAN and determine its optimal configuration. In the experiments, AF-KAN significantly outperforms MLP, ReLU-KAN, and other KANs with the same parameter count. It also remains competitive even when using 6 to 10 times fewer parameters while maintaining the same network structure. However, AF-KAN requires a longer training time and consumes more FLOPs. The repository for this work is available at https://github.com/hoangthangta/All-KAN.

Comment: The paper proposes AF-KAN, a novel architecture inspired by Kolmogorov-Arnold Networks, with innovations in activation functions and parameter reduction methods. This aligns with the 'Model Architecture' criterion, as it explores architectural innovations and efficiency improvements.

Relevance: 8 Novelty: 7


46. Decision-Dependent Stochastic Optimization: The Role of Distribution Dynamics

ArXiv ID: 2503.07324

Authors: Zhiyu He, Saverio Bolognani, Florian Dörfler, Michael Muehlebach

Abstract: Distribution shifts have long been regarded as troublesome external forces that a decision-maker should either counteract or conform to. An intriguing feedback phenomenon termed decision dependence arises when the deployed decision affects the environment and alters the data-generating distribution. In the realm of performative prediction, this is encoded by distribution maps parameterized by decisions due to strategic behaviors. In contrast, we formalize an endogenous distribution shift as a feedback process featuring nonlinear dynamics that couple the evolving distribution with the decision. Stochastic optimization in this dynamic regime provides a fertile ground to examine the various roles played by dynamics in the composite problem structure. To this end, we develop an online algorithm that achieves optimal decision-making by both adapting to and shaping the dynamic distribution. Throughout the paper, we adopt a distributional perspective and demonstrate how this view facilitates characterizations of distribution dynamics and the optimality and generalization performance of the proposed algorithm. We showcase the theoretical results in an opinion dynamics context, where an opportunistic party maximizes the affinity of a dynamic polarized population, and in a recommender system scenario, featuring performance optimization with discrete distributions in the probability simplex.

Comment: The paper introduces a theoretical framework for decision-dependent stochastic optimization, which aligns with emerging trends in foundational research.

Relevance: 7 Novelty: 8


47. System 0/1/2/3: Quad-process theory for multi-timescale embodied collective cognitive systems

ArXiv ID: 2503.06138

Authors: Tadahiro Taniguchi, Yasushi Hirai, Masahiro Suzuki, Shingo Murata, Takato Horii, Kazutoshi Tanaka

Abstract: This paper introduces the System 0/1/2/3 framework as an extension of dual-process theory, employing a quad-process model of cognition. Expanding upon System 1 (fast, intuitive thinking) and System 2 (slow, deliberative thinking), we incorporate System 0, which represents pre-cognitive embodied processes, and System 3, which encompasses collective intelligence and symbol emergence. We contextualize this model within Bergson's philosophy by adopting multi-scale time theory to unify the diverse temporal dynamics of cognition. System 0 emphasizes morphological computation and passive dynamics, illustrating how physical embodiment enables adaptive behavior without explicit neural processing. Systems 1 and 2 are explained from a constructive perspective, incorporating neurodynamical and AI viewpoints. In System 3, we introduce collective predictive coding to explain how societal-level adaptation and symbol emergence operate over extended timescales. This comprehensive framework ranges from rapid embodied reactions to slow-evolving collective intelligence, offering a unified perspective on cognition across multiple timescales, levels of abstraction, and forms of human intelligence. The System 0/1/2/3 model provides a novel theoretical foundation for understanding the interplay between adaptive and cognitive processes, thereby opening new avenues for research in cognitive science, AI, robotics, and collective intelligence.

Comment: The paper introduces a quad-process theory for multi-timescale cognitive systems, which aligns with emerging trends in foundational research on cognition and AI.

Relevance: 7 Novelty: 8


48. NFIG: Autoregressive Image Generation with Next-Frequency Prediction

ArXiv ID: 2503.07076

Authors: Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Chi Zhang, Xuelong Li

Abstract: Autoregressive models have achieved promising results in natural language processing. However, for image generation tasks, they encounter substantial challenges in effectively capturing long-range dependencies, managing computational costs, and most crucially, defining meaningful autoregressive sequences that reflect natural image hierarchies. To address these issues, we present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images. This principled autoregressive sequence not only improves the quality of generated images by better capturing true causal relationships between image components, but also significantly reduces computational overhead during inference. Extensive experiments demonstrate that NFIG achieves state-of-the-art performance with fewer steps, offering a more efficient solution for image generation, with 1.25× speedup compared to VAR-d20 while achieving better performance (FID: 2.81) on the ImageNet-256 benchmark. We hope that our insight of incorporating frequency-domain knowledge to guide autoregressive sequence design will shed light on future research. We will make our code publicly available upon acceptance of the paper.

Comment: The paper proposes a novel autoregressive framework for image generation using frequency-guided stages, which aligns with representation learning and architectural insights. However, the focus on image generation makes it partially relevant.

Relevance: 7 Novelty: 8
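
The NFIG abstract orders generation from low-frequency structure to high-frequency detail. The sketch below shows only that ordering on a 1-D signal, using an orthonormal DCT and reconstructing from progressively wider frequency bands. NFIG itself works on image tokens with a learned autoregressive model, so the signal, the band edges, and the function names here are illustrative assumptions.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n)
        ))
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1 if k == 0 else 2) / n) * coeffs[k]
            * math.cos(math.pi / n * (i + 0.5) * k)
            for k in range(n)
        ))
    return out


def progressive_reconstructions(signal, band_edges):
    """Reconstruct the signal keeping only frequencies below each band edge."""
    coeffs = dct(signal)
    stages = []
    for edge in band_edges:
        kept = [c if k < edge else 0.0 for k, c in enumerate(coeffs)]
        stages.append(idct(kept))
    return stages


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


signal = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
stages = progressive_reconstructions(signal, band_edges=[1, 3, 8])
errors = [squared_error(signal, s) for s in stages]
```

The first stage recovers only the mean (global structure), and each later stage can only lower the reconstruction error, which mirrors the coarse-to-fine sequence the abstract describes.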


49. Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

ArXiv ID: 2503.06287

Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

Abstract: Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual grounding, though they inevitably require fine-tuning and additional model components to explicitly generate bounding boxes or segmentation masks. However, we discover that a few attention heads in frozen LVLMs demonstrate strong visual grounding capabilities. We refer to these heads, which consistently capture object locations related to text semantics, as localization heads. Using localization heads, we introduce a straightforward and effective training-free visual grounding framework that utilizes text-to-image attention maps from localization heads to identify the target objects. Surprisingly, only three out of thousands of attention heads are sufficient to achieve competitive localization performance compared to existing LVLM-based visual grounding methods that require fine-tuning. Our findings suggest that LVLMs can innately ground objects based on a deep comprehension of the text-image relationship, as they implicitly focus on relevant image regions to generate informative text outputs. All the source codes will be made available to the public.

Comment: The paper explores attention heads in large vision-language models for visual grounding, which provides insights into model architecture and representation learning. However, it is slightly application-driven.

Relevance: 7 Novelty: 8


50. Emergent Abilities in Large Language Models: A Survey

ArXiv ID: 2503.05788

Authors: Leonardo Berti, Flavio Giorgi, Gjergji Kasneci

Abstract: Large Language Models (LLMs) are leading a new technological revolution as one of the most promising research streams toward artificial general intelligence. The scaling of these models, accomplished by increasing the number of parameters and the magnitude of the training datasets, has been linked to various so-called emergent abilities that were previously unobserved. These emergent abilities, ranging from advanced reasoning and in-context learning to coding and problem-solving, have sparked an intense scientific debate: Are they truly emergent, or do they simply depend on external factors, such as training dynamics, the type of problems, or the chosen metric? What underlying mechanism causes them? Despite their transformative potential, emergent abilities remain poorly understood, leading to misconceptions about their definition, nature, predictability, and implications. In this work, we shed light on emergent abilities by conducting a comprehensive review of the phenomenon, addressing both its scientific underpinnings and real-world consequences. We first critically analyze existing definitions, exposing inconsistencies in conceptualizing emergent abilities. We then explore the conditions under which these abilities appear, evaluating the role of scaling laws, task complexity, pre-training loss, quantization, and prompting strategies. Our review extends beyond traditional LLMs and includes Large Reasoning Models (LRMs), which leverage reinforcement learning and inference-time search to amplify reasoning and self-reflection. However, emergence is not inherently positive. As AI systems gain autonomous reasoning capabilities, they also develop harmful behaviors, including deception, manipulation, and reward hacking. We highlight growing concerns about safety and governance, emphasizing the need for better evaluation frameworks and regulatory oversight.

Comment: The paper surveys emergent abilities in LLMs, which aligns with theoretical insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 6


51. Lifelong Learning with Task-Specific Adaptation: Addressing the Stability-Plasticity Dilemma

ArXiv ID: 2503.06213

Authors: Ruiyu Wang, Sen Wang, Xinxin Zuo, Qiang Sun

Abstract: Lifelong learning (LL) aims to continuously acquire new knowledge while retaining previously learned knowledge. A central challenge in LL is the stability-plasticity dilemma, which requires models to balance the preservation of previous knowledge (stability) with the ability to learn new tasks (plasticity). While parameter-efficient fine-tuning (PEFT) has been widely adopted in large language models, its application to lifelong learning remains underexplored. To bridge this gap, this paper proposes AdaLL, an adapter-based framework designed to address the dilemma through a simple, universal, and effective strategy. AdaLL co-trains the backbone network and adapters under regularization constraints, enabling the backbone to capture task-invariant features while allowing the adapters to specialize in task-specific information. Unlike methods that freeze the backbone network, AdaLL incrementally enhances the backbone's capabilities across tasks while minimizing interference through backbone regularization. This architectural design significantly improves both stability and plasticity, effectively eliminating the stability-plasticity dilemma. Extensive experiments demonstrate that AdaLL consistently outperforms existing methods across various configurations, including dataset choices, task sequences, and task scales.

Comment: The paper proposes a novel adapter-based framework for lifelong learning, which touches on representation learning and model architecture but is more application-driven.

Relevance: 7 Novelty: 7


52. What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces

ArXiv ID: 2503.05703

Authors: Jordi Armengol-Estapé, Quentin Carbonneaux, Tianjun Zhang, Aram H. Markosyan, Volker Seeker, Chris Cummins, Melanie Kambadur, Michael F. P. O'Boyle, Sida Wang, Gabriel Synnaeve, Hugh James Leather

Abstract: Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line and instruction-level) and strategies on the task of output prediction, obtaining around 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.

Comment: The paper explores training LLMs using program execution traces, which provides insights into representation learning and training dynamics. However, it is more focused on code understanding and generation rather than foundational representation learning.

Relevance: 7 Novelty: 7


53. Minion Gated Recurrent Unit for Continual Learning

ArXiv ID: 2503.06175

Authors: Abdullah M. Zyarah, Dhireesha Kudithipudi

Abstract: The increasing demand for continual learning in sequential data processing has led to progressively complex training methodologies and larger recurrent network architectures. Consequently, this has widened the knowledge gap between continual learning with recurrent neural networks (RNNs) and their ability to operate on devices with limited memory and compute. To address this challenge, we investigate the effectiveness of simplifying RNN architectures, particularly gated recurrent unit (GRU), and its impact on both single-task and multitask sequential learning. We propose a new variant of GRU, namely the minion recurrent unit (MiRU). MiRU replaces conventional gating mechanisms with scaling coefficients to regulate dynamic updates of hidden states and historical context, reducing computational costs and memory requirements. Despite its simplified architecture, MiRU maintains performance comparable to the standard GRU while achieving 2.90x faster training and reducing parameter usage by 2.88x, as demonstrated through evaluations on sequential image classification and natural language processing benchmarks. The impact of model simplification on its learning capacity is also investigated by performing continual learning tasks with a rehearsal-based strategy and global inhibition. We find that MiRU demonstrates stable performance in multitask learning even when using only rehearsal, unlike the standard GRU and its variants. These features position MiRU as a promising candidate for edge-device applications.

Comment: The paper proposes a simplified recurrent unit for continual learning, which aligns with model architecture innovations. However, the focus on edge-device applications makes it partially relevant.

Relevance: 7 Novelty: 7


54. Towards Experience Replay for Class-Incremental Learning in Fully-Binary Networks

ArXiv ID: 2503.07107

Authors: Yanis Basso-Bert, Anca Molnos, Romain Lemaire, William Guicquero, Antoine Dupret

Abstract: Binary Neural Networks (BNNs) are a promising approach to enable Artificial Neural Network (ANN) implementation on ultra-low power edge devices. Such devices may compute data in highly dynamic environments, in which the classes targeted for inference can evolve or even novel classes may arise, requiring continual learning. Class Incremental Learning (CIL) is a common type of continual learning for classification problems that has been scarcely addressed in the context of BNNs. Furthermore, most existing BNN models are not fully binary, as they require several real-valued network layers at the input, the output, and for batch normalization. This paper goes a step further, enabling class incremental learning in Fully-Binarized NNs (FBNNs) through four main contributions. First, we revisit the FBNN design and its training procedure so that they are suitable for CIL. Secondly, we explore loss balancing, a method to trade off the performance of past and current classes. Thirdly, we propose a semi-supervised method to pre-train the feature extractor of the FBNN for transferable representations. Fourthly, two conventional CIL methods, i.e., Latent and Native replay, are thoroughly compared. These contributions are exemplified first on the CIFAR100 dataset, before being scaled up to address the CORE50 continual learning benchmark. The final results based on our 3Mb FBNN on CORE50 exhibit performance on par with or better than that of larger conventional real-valued NN models.

Comment: The paper explores class-incremental learning in fully binary networks, which aligns with model architecture and efficiency but is more application-driven.

Relevance: 7 Novelty: 6


55. Gender Encoding Patterns in Pretrained Language Model Representations

ArXiv ID: 2503.06734

Authors: Mahdi Zakizadeh, Mohammad Taher Pilehvar

Abstract: Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.

Comment: The paper analyzes gender encoding in pretrained language models, which aligns with representation learning by exploring how biases are encoded in model representations.

Relevance: 7 Novelty: 6


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure itself.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thought, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
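
As a rough illustration of the output format requested above, the following sketch parses a JSONL response and checks that each line carries the four fields. The helper name `parse_jsonl`, the use of integer scores, and the lowercase-free key spelling are assumptions for illustration only; the two example records reuse IDs, comments, and scores from entries 52 and 55 above, and the actual pipeline that consumes this prompt may work differently.

```python
import json

# The four fields named in the instructions; exact key casing is an assumption.
REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_jsonl(text):
    """Parse a JSONL response into a list of dicts, one per non-empty line.

    Hypothetical helper: raises ValueError if a line lacks a required key.
    """
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records


# Two example records built from entries in this digest (scores as integers
# is an assumption; the prompt does not specify the value types).
example = "\n".join([
    json.dumps({
        "ARXIVID": "2503.05703",
        "COMMENT": "The paper explores training LLMs using program execution traces.",
        "RELEVANCE": 7,
        "NOVELTY": 7,
    }),
    json.dumps({
        "ARXIVID": "2503.06734",
        "COMMENT": "The paper analyzes gender encoding in pretrained language models.",
        "RELEVANCE": 7,
        "NOVELTY": 6,
    }),
])

records = parse_jsonl(example)
for record in records:
    print(record["ARXIVID"], record["RELEVANCE"], record["NOVELTY"])
```

Keeping one JSON object per line means a single malformed record can be reported by line number without discarding the rest of the response.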