Personalized Daily Arxiv Papers 03/04/2025

[gpt-4o]  Prompt  Completion  Total
Token     89804   13289       103093
Cost      $0.22   $0.13       $0.35

Total ArXiv papers: 1220

Total scanned papers: 618

Total relevant papers: 61

Table of contents with paper titles:

  1. Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry Authors: Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

  2. Efficiently Editing Mixture-of-Experts Models with Compressed Experts Authors: Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla

  3. Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang

  4. Compositional Reasoning with Transformers, RNNs, and Chain of Thought Authors: Gilad Yehudai, Noah Amsel, Joan Bruna

  5. KurTail : Kurtosis-based LLM Quantization Authors: Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, Martino Dazzi

  6. On the Power of Context-Enhanced Learning in LLMs Authors: Xingyu Zhu, Abhishek Panigrahi, Sanjeev Arora

  7. KVCrush: Key value cache size-reduction using similarity in head-behaviour Authors: Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain

  8. LoR2C : Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning Authors: Jiancheng Zhao, Xingda Yu, Yuxiang Zhang, Zhen Yang

  9. Towards Understanding the Benefit of Multitask Representation Learning in Decision Process Authors: Rui Lu, Yang Yue, Andrew Zhao, Simon Du, Gao Huang

  10. Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks Authors: Thang Do, Arnulf Jentzen, Adrian Riekert

  11. CE-U: Cross Entropy Unlearning Authors: Bo Yang

  12. EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection Authors: Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, Xuanjing Huang

  13. From superposition to sparse codes: interpretable representations in neural networks Authors: David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane

  14. Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation Authors: Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You

  15. DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models Authors: Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, Tao Chen

  16. RSQ: Learning from Important Tokens Leads to Better Quantized LLMs Authors: Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal

  17. Revisiting Large Language Model Pruning using Neuron Semantic Attribution Authors: Yizhuo Ding, Xinwei Sun, Yanwei Fu, Guosheng Hu

  18. Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace Authors: Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, Xi-He Qiu

  19. Asymptotic Theory of Eigenvectors for Latent Embeddings with Generalized Laplacian Matrices Authors: Jianqing Fan, Yingying Fan, Jinchi Lv, Fan Yang, Diwen Yu

  20. Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving Authors: Qihui Zhou, Peiqi Yin, Pengfei Zuo, James Cheng

  21. Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning Authors: Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheongwoong Kang, Jaesik Choi

  22. Projection Head is Secretly an Information Bottleneck Authors: Zhuo Ouyang, Kaiwen Hu, Qi Zhang, Yifei Wang, Yisen Wang

  23. Transformer Meets Twicing: Harnessing Unattended Residual Information Authors: Laziz Abdullaev, Tan Nguyen

  24. Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers Authors: Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

  25. CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering Authors: Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, Liang He

  26. CoSMoEs: Compact Sparse Mixture of Experts Authors: Patrick Huber, Akshat Shrivastava, Ernie Chang, Chinnadhurai Sankar, Ahmed Aly, Adithya Sagar

  27. Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs Authors: Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant Nair, Poulami Das

  28. When Can You Get Away with Low Memory Adam? Authors: Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein

  29. Liger: Linearizing Large Language Models to Gated Recurrent Structures Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng

  30. Steering Large Language Model Activations in Sparse Spaces Authors: Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent

  31. Relating Piecewise Linear Kolmogorov Arnold Networks to ReLU Networks Authors: Nandi Schoots, Mattia Jacopo Villani, Niels uit de Bos

  32. Homomorphism Expressivity of Spectral Invariant Graph Neural Networks Authors: Jingchu Gai, Yiheng Du, Bohang Zhang, Haggai Maron, Liwei Wang

  33. Depth-Adaptive Graph Neural Networks via Learnable Bakry-Émery Curvature Authors: Asela Hevapathige, Ahad N. Zehmakan, Qing Wang

  34. Riemann Tensor Neural Networks: Learning Conservative Systems with Physics-Constrained Networks Authors: Anas Jnini, Lorenzo Breschi, Flavio Vella

  35. Understanding Dataset Distillation via Spectral Filtering Authors: Deyu Bo, Songhua Liu, Xinchao Wang

  36. Modeling Arbitrarily Applicable Relational Responding with the Non-Axiomatic Reasoning System: A Machine Psychology Approach Authors: Robert Johansson

  37. Learning-Augmented Frequent Directions Authors: Anders Aamand, Justin Y. Chen, Siddharth Gollapudi, Sandeep Silwal, Hao Wu

  38. Multi-Level Collaboration in Model Merging Authors: Qi Li, Runpeng Yu, Xinchao Wang

  39. How simple can you go? An off-the-shelf transformer approach to molecular dynamics Authors: Max Eissler, Tim Korjakow, Stefan Ganscha, Oliver T. Unke, Klaus-Robert Müller, Stefan Gugler

  40. Improve Representation for Imbalanced Regression through Geometric Constraints Authors: Zijian Dong, Yilei Wu, Chongyao Chen, Yingtian Zou, Yichi Zhang, Juan Helen Zhou

  41. DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems Authors: Minoo Hosseinzadeh, Hana Khamfroush

  42. PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Jialin Li, Min Lin

  43. Generalization Bounds for Equivariant Networks on Markov Data Authors: Hui Li, Zhiguo Wang, Bohui Chen, Li Sheng

  44. SAKE: Steering Activations for Knowledge Editing Authors: Marco Scialanga, Thibault Laugel, Vincent Grari, Marcin Detyniecki

  45. Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo Authors: Hyunsu Kim, Giung Nam, Chulhee Yun, Hongseok Yang, Juho Lee

  46. Cauchy-Schwarz Regularizers Authors: Sueda Taner, Ziyi Wang, Christoph Studer

  47. Convergence of energy-based learning in linear resistive networks Authors: Anne-Men Huijzer, Thomas Chaffey, Bart Besselink, Henk J. van Waarde

  48. Regularization-based Framework for Quantization-, Fault- and Variability-Aware Training Authors: Anmol Biswas, Raghav Singhal, Sivakumar Elangovan, Shreyas Sabnis, Udayan Ganguly

  49. Re-Imagining Multimodal Instruction Tuning: A Representation View Authors: Yiyang Liu, James Chenhao Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han

  50. Constraining Sequential Model Editing with Editing Anchor Compression Authors: Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Ningyu Zhang, Jia-Chen Gu

  51. Unlocking Efficient, Scalable, and Continual Knowledge Editing with Basis-Level Representation Fine-Tuning Authors: Tianci Liu, Ruirui Li, Yunzhe Qi, Hui Liu, Xianfeng Tang, Tianqi Zheng, Qingyu Yin, Monica Xiao Cheng, Jun Huan, Haoyu Wang, Jing Gao

  52. Personalize Your LLM: Fake it then Align it Authors: Yijing Zhang, Dyah Adila, Changho Shin, Frederic Sala

  53. Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches Authors: Yifang Chen, Xuyang Guo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song

  54. ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation Authors: Yanzhou Pan, Huawei Lin, Yide Ran, Jiamin Chen, Xiaodong Yu, Weijie Zhao, Denghui Zhang, Zhaozhuo Xu

  55. Hypergraph Foundation Model Authors: Yifan Feng, Shiquan Liu, Xiangmin Han, Shaoyi Du, Zongze Wu, Han Hu, Yue Gao

  56. Architectural and Inferential Inductive Biases For Exchangeable Sequence Modeling Authors: Daksh Mittal, Ang Li, Tzu-Ching Yen, Daniel Guetta, Hongseok Namkoong

  57. Cauchy Random Features for Operator Learning in Sobolev Space Authors: Chunyang Liao, Deanna Needell, Hayden Schaeffer

  58. AMUN: Adversarial Machine UNlearning Authors: Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran

  59. Learning Stochastic Dynamical Systems with Structured Noise Authors: Ziheng Guo, James Greene, Ming Zhong

  60. Input Specific Neural Networks Authors: Asghar A. Jadoon, D. Thomas Seidl, Reese E. Jones, Jan N. Fuhg

  61. On the Saturation Effects of Spectral Algorithms in Large Dimensions Authors: Weihao Lu, Haobo Zhang, Yicheng Li, Qian Lin


1. Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

ArXiv ID: 2503.01822

Authors: Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Abstract: Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable -- switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscore the need for architecture-specific choices in model interpretability. Overall, we argue an SAE does not just reveal concepts -- it determines what can be seen at all.

Comment: The paper provides a theoretical framework for sparse autoencoders, directly addressing representation learning and the biases in concept detection.

Relevance: 10 Novelty: 9
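
Illustration (not from the paper): the abstract's claim that different SAEs impose different assumptions can be seen in a toy setting. The sketch below applies a ReLU encoder and a TopK encoder to the same activation with the same weights. The weights, the input, and the function names are all hypothetical, and neither encoder is the new SAE the authors propose.

```python
def pre_activations(w_enc, bias, x):
    """Affine map shared by both toy encoders: W_enc @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, bias)]


def relu_encode(w_enc, bias, x):
    """ReLU encoder: every latent with a positive pre-activation fires."""
    return [max(0.0, p) for p in pre_activations(w_enc, bias, x)]


def topk_encode(w_enc, bias, x, k):
    """TopK encoder: only the k largest pre-activations are kept."""
    pre = pre_activations(w_enc, bias, x)
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return [max(0.0, pre[i]) if i in keep else 0.0 for i in range(len(pre))]


# Hypothetical dictionary: two axis-aligned latents and one diagonal latent.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
BIAS = [0.0, 0.0, 0.0]
x = [1.0, 0.8]

z_relu = relu_encode(W_ENC, BIAS, x)     # all three latents are active
z_topk = topk_encode(W_ENC, BIAS, x, 1)  # only the diagonal latent survives
```

With identical weights, the ReLU code describes the input with three active latents while the TopK code attributes it to one, which is the kind of architecture-dependent view the abstract describes.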


2. Efficiently Editing Mixture-of-Experts Models with Compressed Experts

ArXiv ID: 2503.00634

Authors: Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla

Abstract: Mixture-of-Experts (MoE) models have become a key approach for scaling large language models efficiently by activating only a subset of experts during training and inference. Typically, the number of activated experts presents a trade-off: fewer experts reduce computational costs, while more experts improve performance. Recent studies reveal that not all activated experts contribute equally to model performance, with some providing minimal utility, particularly when finetuning pretrained MoE models for specialized downstream tasks. The co-existence of significant and redundant parameters in experts provides us an opportunity to reduce the number of activated experts while maintaining model performance. In this work, we propose the concept of compressed experts, lightweight modules that serve as compact representations of full experts. Our approach preserves the most important experts while replacing other auxiliary activated experts with compressed experts. The reduction of active parameters significantly lowers inference costs while achieving comparable performance. Extensive experiments on models including Phi-MoE and OLMoE demonstrate that compressed experts recover over 90% of full expert performance across various tasks while reducing more than 30% active parameters and saving 20% in inference costs. This approach enables efficient deployment of MoE models in resource-constrained settings and facilitates scaling to larger models with manageable overhead. Our code is available at https://github.com/yifei-he/Compressed-Experts.

Comment: The paper introduces compressed experts for Mixture-of-Experts (MoE) models, reducing inference costs while maintaining performance. This directly aligns with the 'Model Architecture' and 'Model Compression' criteria.

Relevance: 10 Novelty: 8
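
Illustration (not from the paper): the "subset of activated experts" in the abstract usually refers to top-k routing. The sketch below shows generic top-k routing with a softmax over the selected experts. Scalars stand in for hidden vectors, the experts and router logits are hypothetical, and compressed experts themselves are not modeled.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts only."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical experts acting on a scalar input.
EXPERTS = [lambda v: 2 * v, lambda v: v + 1, lambda v: -v, lambda v: v * v]
LOGITS = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(LOGITS, k=2)        # experts 1 and 3, equal weight
y = moe_forward(3.0, EXPERTS, LOGITS, k=2)
```

Only two of the four experts run per input here; the paper's proposal, as described in the abstract, goes further by swapping some of the activated experts for lighter modules.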


3. Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning

ArXiv ID: 2503.00639

Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang

Abstract: Disentangled representation learning aims to uncover latent variables underlying the observed data, and generally speaking, rather strong assumptions are needed to ensure identifiability. Some approaches rely on sufficient changes on the distribution of latent variables indicated by auxiliary variables such as domain indices, but acquiring enough domains is often challenging. Alternative approaches exploit structural sparsity assumptions on the mixing procedure, but such constraints are usually (partially) violated in practice. Interestingly, we find that these two seemingly unrelated assumptions can actually complement each other to achieve identifiability. Specifically, when conditioned on auxiliary variables, the sparse mixing procedure assumption provides structural constraints on the mapping from estimated to true latent variables and hence compensates for potentially insufficient distribution changes. Building on this insight, we propose an identifiability theory with less restrictive constraints regarding distribution changes and the sparse mixing procedure, enhancing applicability to real-world scenarios. Additionally, we develop an estimation framework incorporating a domain encoding network and a sparse mixing constraint and provide two implementations based on variational autoencoders and generative adversarial networks, respectively. Experiment results on synthetic and real-world datasets support our theoretical results.

Comment: The paper proposes a novel framework combining sparse mixing and distributional changes for disentangled representation learning, which directly aligns with foundational research in representation learning.

Relevance: 9 Novelty: 9


4. Compositional Reasoning with Transformers, RNNs, and Chain of Thought

ArXiv ID: 2503.01544

Authors: Gilad Yehudai, Noah Amsel, Joan Bruna

Abstract: We study and compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of problems we term Compositional Reasoning Questions (CRQ). This family captures problems like evaluating Boolean formulas and multi-step word problems. Assuming standard hardness assumptions from circuit complexity and communication complexity, we prove that none of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We also provide a construction for each architecture that solves CRQs. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. (Otherwise, a linear dimension is necessary). For transformers with chain of thought, our construction uses $n$ CoT tokens. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.

Comment: The paper compares the expressive power of transformers, RNNs, and chain-of-thought methods for compositional reasoning, providing theoretical insights into model capabilities. This aligns with the interest in analyzing architectures.

Relevance: 9 Novelty: 8


5. KurTail : Kurtosis-based LLM Quantization

ArXiv ID: 2503.01483

Authors: Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, Martino Dazzi

Abstract: One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages Kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes Kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We utilize layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and reduces perplexity by 2.9%, all while reducing the training cost. For comparison, learning the rotation using SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.

Comment: This paper introduces a novel quantization method for LLMs, addressing outliers and optimizing memory efficiency. It aligns with the model compression criterion, particularly in quantization and efficiency breakthroughs.

Relevance: 9 Novelty: 8
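
Illustration (not from the paper): why a single outlier hurts uniform 4-bit quantization, and why kurtosis flags it. The sketch below uses a plain symmetric quantizer and made-up activation values. It does not implement KurTail's learned rotation.

```python
def kurtosis(xs):
    """Fourth standardized moment; larger values mean heavier tails."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2)


def quantize_uniform(xs, bits=4):
    """Symmetric uniform quantization with the scale set by the largest magnitude."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4 bits
    scale = max(abs(x) for x in xs) / levels
    return [round(x / scale) * scale for x in xs]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical activations: the second list replaces one value with an outlier.
flat = [-1.0, -0.5, 0.5, 1.0, -0.75, 0.75, 0.25, -0.25]
spiky = [-1.0, -0.5, 0.5, 1.0, -0.75, 0.75, 0.25, 20.0]

flat_err = mse(flat, quantize_uniform(flat))
spiky_err = mse(spiky, quantize_uniform(spiky))
```

The outlier stretches the quantization scale so far that every other value in `spiky` rounds to zero, and its kurtosis is correspondingly much higher than that of `flat`.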


6. On the Power of Context-Enhanced Learning in LLMs

ArXiv ID: 2503.01821

Authors: Xingyu Zhu, Abhishek Panigrahi, Sanjeev Arora

Abstract: We formalize a new concept for LLMs, context-enhanced learning. It involves standard gradient-based learning on text except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works. Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be exponentially more sample-efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.

Comment: The paper formalizes context-enhanced learning for LLMs, providing theoretical insights into gradient-based learning with enhanced context. This aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8
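
Illustration (not from the paper): one common way to realize "context on which no auto-regressive gradients are computed" is a loss mask. The sketch below assumes that reading; the probabilities and the function name are hypothetical. In a real model the context tokens would still condition the predictions, and only their own prediction losses are dropped.

```python
import math


def masked_nll(token_probs, is_context):
    """Average negative log-likelihood over non-context tokens only.

    token_probs[i] is the model's probability of the correct token at
    position i; positions flagged as context contribute no loss term.
    """
    losses = [-math.log(p) for p, ctx in zip(token_probs, is_context) if not ctx]
    return sum(losses) / len(losses)


# Hypothetical sequence: two context tokens followed by two trained tokens.
probs = [0.9, 0.2, 0.5, 0.25]
mask = [True, True, False, False]

loss = masked_nll(probs, mask)  # depends only on the last two positions
```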


7. KVCrush: Key value cache size-reduction using similarity in head-behaviour

ArXiv ID: 2503.00022

Authors: Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain

Abstract: Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence length, KV caching significantly enhances generation throughput. However, due to large context lengths in the modern LLMs, the memory footprint of the KV is a huge bottleneck for model deployment directly impacting the model's batch size, hindering its ability to deliver high-throughput. Existing research addresses this challenge using several techniques, such as discarding low-attention tokens, quantization, and matrix approximation which typically lead to a negative impact on the model accuracy. In this paper, we propose KVCrush technology which can be combined with many KV compression technologies to improve the model accuracy at a much smaller memory. KVCrush provides an alternate representation scheme for key-value states, along with a low-overhead token pruning algorithm that accounts for the token distribution in the KV cache, which in turn allows for a smaller footprint while maintaining the accuracy of the model. Based on our results, KVCrush reduces LongBench KV Cache size by 4x with less than 1% accuracy drop and achieves state-of-the-art average accuracy with minimal overhead, incurring less than 0.5% total inference latency. KVCrush not only outperforms the accuracy of state-of-the-art importance-based token retention schemes but is also compatible with typical practical LLM deployments using KV cache paging schemes such as vLLM and mixed precision quantization.

Comment: The paper proposes a KV cache compression method for LLMs, addressing memory efficiency with minimal accuracy loss. This aligns with the model compression criterion, particularly in KV cache optimization.

Relevance: 9 Novelty: 8
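
Illustration (not from the paper): the linear-versus-quadratic point about KV caching in the abstract can be checked on a toy single-head decoder. The "projections" below are hypothetical scalar multiplications standing in for weight matrices, and the sketch covers plain KV caching only, not KVCrush's representation or pruning.

```python
import math


def project(x, w):
    """Toy stand-in for a key or value projection matrix."""
    return [w * xi for xi in x]


def attend(q, keys, values):
    """Softmax attention of one query over all stored keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]


def decode_no_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        keys = [project(x, 0.5) for x in tokens[:t]]
        values = [project(x, 2.0) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project each token once and append the result to the cache."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, k_cache, v_cache))
    return outputs, projections


tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
out_full, n_full = decode_no_cache(tokens)
out_cached, n_cached = decode_with_cache(tokens)
```

Both decoders return the same outputs, but the cached version performs 2n projections for n tokens instead of n(n+1). The price is that the cache grows with sequence length, which is the memory footprint that cache-compression methods target.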


8. LoR2C : Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning

ArXiv ID: 2503.00572

Authors: Jiancheng Zhao, Xingda Yu, Yuxiang Zhang, Zhen Yang

Abstract: In recent years, pretrained large language models have demonstrated outstanding performance across various natural language processing tasks. However, full-parameter fine-tuning methods require adjusting all model parameters, leading to immense computational resource demands. Although parameter-efficient fine-tuning methods like LoRA have significantly reduced the number of parameters, they still face challenges such as gradient vanishing and the potential for further parameter reduction. To address these issues, this paper proposes a novel parameter-efficient fine-tuning method called LoR2C (Low-Rank Residual Connection Adaptation). LoR2C introduces residual connections with low-rank matrices within the model layers, which not only reduces the number of fine-tuning parameters but also effectively alleviates the gradient vanishing problem. Additionally, this paper presents three optimization variants of LoR2C: ShareLoR2C, MergeLoR2C, and InjectLoR2C. These variants further improve parameter efficiency and model performance through parameter sharing, module merging, and injection mechanisms, respectively. Experimental results on multiple natural language understanding and natural language generation tasks demonstrate that LoR2C and its optimized variants significantly reduce parameter overhead while maintaining or even improving performance, outperforming existing mainstream parameter-efficient fine-tuning methods. Our code is publicly available at https://github.com/Oblivioniss/LoR2C.

Comment: The paper introduces a novel low-rank residual connection adaptation for parameter-efficient fine-tuning, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


9. Towards Understanding the Benefit of Multitask Representation Learning in Decision Process

ArXiv ID: 2503.00345

Authors: Rui Lu, Yang Yue, Andrew Zhao, Simon Du, Gao Huang

Abstract: Multitask Representation Learning (MRL) has emerged as a prevalent technique to improve sample efficiency in Reinforcement Learning (RL). Empirical studies have found that training agents on multiple tasks simultaneously within online and transfer learning environments can greatly improve efficiency. Despite its popularity, a comprehensive theoretical framework that elucidates its operational efficacy remains incomplete. Prior analyses have predominantly assumed that agents either possess a pre-known representation function or utilize functions from a linear class, where both are impractical. The complexity of real-world applications typically requires the use of sophisticated, non-linear functions such as neural networks as representation function, which are not pre-existing but must be learned. Our work tries to fill the gap by extending the analysis to unknown non-linear representations, giving a comprehensive analysis for its mechanism in online and transfer learning settings. We consider the setting in which an agent simultaneously plays $M$ contextual bandits (or MDPs), developing a shared representation function $\phi$ from a non-linear function class $\Phi$ using our novel Generalized Functional Upper Confidence Bound algorithm (GFUCB). We formally prove that this approach yields a regret upper bound that outperforms the lower bound associated with learning $M$ separate tasks, marking the first demonstration of MRL's efficacy in a general function class. This framework also explains the contribution of representations to transfer learning when faced with new, yet related tasks, and identifies key conditions for successful transfer. Empirical experiments further corroborate our theoretical findings.

Comment: The paper provides theoretical insights into multitask representation learning, directly addressing foundational aspects of representation learning.

Relevance: 9 Novelty: 8


10. Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks

ArXiv ID: 2503.01660

Authors: Thang Do, Arnulf Jentzen, Adrian Riekert

Abstract: Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.

Comment: The paper provides theoretical insights into the limitations of SGD optimization in deep learning, which aligns with foundational research on training dynamics.

Relevance: 9 Novelty: 8


11. CE-U: Cross Entropy Unlearning

ArXiv ID: 2503.01224

Authors: Bo Yang

Abstract: Large language models (LLMs) inadvertently memorize sensitive data from their massive pretraining corpora \cite{jang2022knowledge}. In this work, we propose CE-U (Cross Entropy Unlearning), a novel loss function designed specifically for unlearning tasks. CE-U addresses fundamental limitations of gradient ascent approaches which suffer from instability due to vanishing gradients when model confidence is high and gradient exploding when confidence is low. We also unify standard cross entropy supervision and cross entropy unlearning into a single framework. Notably, on the TOFU benchmark for unlearning \cite{maini2024tofu}, CE-U achieves state-of-the-art results on LLaMA2-7B with 1% and 5% forgetting, even without the use of any extra reference model or additional positive samples. Our theoretical analysis further reveals that the gradient instability issues also exist in popular reinforcement learning algorithms like DPO and GRPO, as they include a gradient ascent component. This suggests that applying CE-U principles to reinforcement learning could be a promising direction for improving stability and convergence.

Comment: CE-U proposes a novel loss function for unlearning in LLMs, which aligns with foundational research in LLM behavior and theoretical insights. The focus on gradient stability and theoretical analysis is a strong match.

Relevance: 9 Novelty: 8


12. EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection

ArXiv ID: 2503.01586

Authors: Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, Xuanjing Huang

Abstract: Rotary Position Embedding (RoPE) enables each attention head to capture multi-frequency information along the sequence dimension and is widely applied in foundation models. However, the nonlinearity introduced by RoPE complicates optimization of the key state in the Key-Value (KV) cache for RoPE-based attention. Existing KV cache compression methods typically store key state before rotation and apply the transformation during decoding, introducing additional computational overhead. This paper introduces EliteKV, a flexible modification framework for RoPE-based models supporting variable KV cache compression ratios. EliteKV first identifies the intrinsic frequency preference of each head using RoPElite, selectively restoring linearity to certain dimensions of key within attention computation. Building on this, joint low-rank compression of key and value enables partial cache sharing. Experimental results show that with minimal uptraining on only 0.6% of the original training data, RoPE-based models achieve a 75% reduction in KV cache size while preserving performance within a negligible margin. Furthermore, EliteKV consistently performs well across models of different scales within the same family.

Comment: EliteKV proposes a novel KV cache compression method for RoPE-based models, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8
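
Illustration (not from the paper): for readers less familiar with RoPE, the rotation the abstract refers to can be sketched in a few lines. This is standard RoPE with the usual base of 10000 and made-up query and key vectors. It does not implement RoPElite's frequency selection or the joint low-rank projection.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))      # relative offset 3
score_far = dot(rope(q, 105), rope(k, 102))   # same offset, shifted by 100
score_other = dot(rope(q, 5), rope(k, 4))     # relative offset 1
```

The attention score depends only on the relative offset between query and key, so `score_near` and `score_far` agree up to rounding. Because the rotation applied to a key depends on its position, a compression applied to unrotated keys has to be followed by the rotation at decode time, which is the overhead the abstract mentions.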


13. From superposition to sparse codes: interpretable representations in neural networks

ArXiv ID: 2503.01824

Authors: David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane

Abstract: Understanding how information is represented in neural networks is a fundamental challenge in both neuroscience and artificial intelligence. Despite their nonlinear architectures, recent evidence suggests that neural networks encode features in superposition, meaning that input concepts are linearly overlaid within the network's representations. We present a perspective that explains this phenomenon and provides a foundation for extracting interpretable representations from neural activations. Our theoretical framework consists of three steps: (1) Identifiability theory shows that neural networks trained for classification recover latent features up to a linear transformation. (2) Sparse coding methods can extract disentangled features from these representations by leveraging principles from compressed sensing. (3) Quantitative interpretability metrics provide a means to assess the success of these methods, ensuring that extracted features align with human-interpretable concepts. By bridging insights from theoretical neuroscience, representation learning, and interpretability research, we propose an emerging perspective on understanding neural representations in both artificial and biological systems. Our arguments have implications for neural coding theories, AI transparency, and the broader goal of making deep learning models more interpretable.

Comment: The paper provides a theoretical framework for understanding neural representations using sparse coding, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8


14. Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation

ArXiv ID: 2503.01776

Authors: Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You

Abstract: Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed, often by large margins, while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at https://github.com/neilwen987/CSR_Adaptive_Rep

Comment: The paper proposes a sparse coding method for adaptive representation learning, which aligns with foundational research in representation learning and efficiency.

Relevance: 9 Novelty: 8


15. DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models

ArXiv ID: 2503.01359

Authors: Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, Tao Chen

Abstract: Upcycled Mixture-of-Experts (MoE) models have shown great potential in various tasks by converting the original Feed-Forward Network (FFN) layers in pre-trained dense models into MoE layers. However, these models still suffer from significant parameter inefficiency due to the introduction of multiple experts. In this work, we propose a novel DeRS (Decompose, Replace, and Synthesis) paradigm to overcome this shortcoming, which is motivated by our observations about the unique redundancy mechanisms of upcycled MoE experts. Specifically, DeRS decomposes the experts into one expert-shared base weight and multiple expert-specific delta weights, and subsequently represents these delta weights in lightweight forms. Our proposed DeRS paradigm can be applied to enhance parameter efficiency in two different scenarios: 1) DeRS Compression for the inference stage, using sparsification or quantization to compress vanilla upcycled MoE models; and 2) DeRS Upcycling for the training stage, employing lightweight sparse or low-rank matrices to efficiently upcycle dense models into MoE models. Extensive experiments across three different tasks show that the proposed methods can achieve extreme parameter efficiency while maintaining the performance for both training and compression of upcycled MoE models.

Comment: The paper proposes a method for enhancing parameter efficiency in Mixture-of-Experts models, which aligns with foundational research in model architecture and efficiency.

Relevance: 9 Novelty: 8
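
The base-plus-delta decomposition described in the DeRS abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the choice of the element-wise mean as the shared base, the magnitude-based sparsification, and the names `decompose_experts`, `sparsify`, and `synthesize` are all assumptions made for illustration.

```python
import numpy as np

def decompose_experts(experts):
    """Split expert weights into one shared base and per-expert deltas.

    Assumption: the shared base is the element-wise mean of the experts.
    """
    base = np.mean(experts, axis=0)
    deltas = [w - base for w in experts]
    return base, deltas

def sparsify(delta, keep_ratio):
    """Keep the largest-magnitude entries of a delta and zero out the rest."""
    k = max(1, int(round(keep_ratio * delta.size)))
    threshold = np.sort(np.abs(delta), axis=None)[-k]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

def synthesize(base, delta):
    """Rebuild an expert weight from the shared base and its (sparse) delta."""
    return base + delta

# Toy "upcycled" experts: copies of one dense FFN weight plus small drift.
rng = np.random.default_rng(0)
dense_ffn = rng.normal(size=(8, 16))
experts = [dense_ffn + 0.01 * rng.normal(size=dense_ffn.shape) for _ in range(4)]

base, deltas = decompose_experts(experts)
sparse_deltas = [sparsify(d, keep_ratio=0.25) for d in deltas]
rebuilt = [synthesize(base, d) for d in sparse_deltas]

# Worst-case reconstruction error after dropping 75% of each delta.
max_error = max(np.max(np.abs(r - w)) for r, w in zip(rebuilt, experts))
```

Because upcycled experts start from the same dense weights, the deltas in this toy setup are small, so storing one base plus sparse deltas costs far less than storing every expert in full.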


16. RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

ArXiv ID: 2503.01820

Authors: Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal

Abstract: Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g. which have large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate outliers (those with exceptionally large magnitude), (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.

Comment: The paper proposes a novel quantization method (RSQ) for LLMs, focusing on token importance and efficiency, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
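
As a rough illustration of the "scale" step in the RSQ abstract, the sketch below weights each token's features by an importance score before forming the second-order statistics that a GPTQ-style quantizer would consume. The column-sum definition of attention concentration, the mean-1 normalization, and the function names are assumptions; the rotation and quantization steps are omitted.

```python
import numpy as np

def attention_concentration(attn):
    """Importance of each token as the total attention it receives.

    Assumption: column sums of a row-stochastic attention matrix,
    normalized to have mean 1.
    """
    scores = attn.sum(axis=0)
    return scores / scores.mean()

def weighted_second_moment(X, importance):
    """Second-order statistics X^T X computed from importance-scaled tokens."""
    scaled = X * importance[:, None]
    return scaled.T @ scaled

rng = np.random.default_rng(0)
n_tokens, d = 6, 4
logits = rng.normal(size=(n_tokens, n_tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
X = rng.normal(size=(n_tokens, d))

importance = attention_concentration(attn)
H_uniform = weighted_second_moment(X, np.ones(n_tokens))  # every token weighted equally
H_scaled = weighted_second_moment(X, importance)          # important tokens weighted more
```

With uniform weights the statistic reduces to the plain `X.T @ X`, which corresponds to the "uniform" reconstruction objective the abstract contrasts against.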


17. Revisiting Large Language Model Pruning using Neuron Semantic Attribution

ArXiv ID: 2503.01542

Authors: Yizhuo Ding, Xinwei Sun, Yanwei Fu, Guosheng Hu

Abstract: Model pruning is a vital technique for accelerating large language models by reducing their size and computational requirements. However, the generalizability of existing pruning methods across diverse datasets and tasks remains unclear. Thus, we conduct extensive evaluations on 24 datasets and 4 tasks using popular pruning methods. Based on these evaluations, we find, and then investigate, that the calibration set greatly affects the performance of pruning methods. In addition, we surprisingly find a significant performance drop of existing pruning methods in sentiment classification tasks. To understand the link between the performance drop and pruned neurons, we propose Neuron Semantic Attribution, which learns to associate each neuron with specific semantics. For the first time, this method makes the unpruned neurons of LLMs explainable.

Comment: The paper revisits pruning in LLMs using neuron semantic attribution, which aligns with model compression and provides insights into pruning behavior.

Relevance: 9 Novelty: 8


18. Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace

ArXiv ID: 2503.01419

Authors: Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, Xi-He Qiu

Abstract: Large language models (LLMs) are considered a milestone towards achieving Artificial General Intelligence (AGI). With their advanced emergent capabilities, they adapt to a wide range of specific applications. Fine-tuning LLMs for various downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) is well-known for its parameter efficiency. It can reduce the number of parameters needed to fine-tune LLMs by several orders of magnitude. However, LoRA-based approaches encounter a significant limitation due to the bottleneck imposed by rank-one decomposition. As the parameter count of LLMs increases, even rank-one decomposition might surpass the number of parameters truly necessary for handling more downstream tasks. In this paper, we propose a new method for Parameter-Efficient Fine-Tuning (PEFT) via deconvolution in subspace, dubbed DCFT. We innovatively use deconvolution to complete details and enhance knowledge in subspace incremental matrices, and dynamically control parameters by adjusting the kernel size, unconstrained by rank-one decomposition. Extensive experiments are conducted to validate the effectiveness of DCFT. Results show that, compared to LoRA, DCFT achieves an 8$\times$ reduction in parameters while still achieving highly impressive performance. Our code is available here: https://github.com/Godz-z/DCFT.

Comment: The paper proposes a parameter-efficient fine-tuning method (DCFT) for LLMs, which aligns with foundational research in model efficiency.

Relevance: 9 Novelty: 8


19. Asymptotic Theory of Eigenvectors for Latent Embeddings with Generalized Laplacian Matrices

ArXiv ID: 2503.00640

Authors: Jianqing Fan, Yingying Fan, Jinchi Lv, Fan Yang, Diwen Yu

Abstract: Laplacian matrices are commonly employed in many real applications, encoding the underlying latent structural information such as graphs and manifolds. The use of the normalization terms naturally gives rise to random matrices with dependency. It is well-known that dependency is a major bottleneck of new random matrix theory (RMT) developments. To this end, in this paper, we formally introduce a class of generalized (and regularized) Laplacian matrices, which contains the Laplacian matrix and the random adjacency matrix as a specific case, and suggest the new framework of the asymptotic theory of eigenvectors for latent embeddings with generalized Laplacian matrices (ATE-GL). Our new theory is empowered by the tool of generalized quadratic vector equation for dealing with RMT under dependency, and delicate high-order asymptotic expansions of the empirical spiked eigenvectors and eigenvalues based on local laws. The asymptotic normalities established for both spiked eigenvectors and eigenvalues will enable us to conduct precise inference and uncertainty quantification for applications involving the generalized Laplacian matrices with flexibility. We discuss some applications of the suggested ATE-GL framework and showcase its validity through some numerical examples.

Comment: The paper develops an asymptotic theory for eigenvectors in generalized Laplacian matrices, contributing to foundational research in representation learning and theoretical insights into latent embeddings.

Relevance: 9 Novelty: 8


20. Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving

ArXiv ID: 2503.00392

Authors: Qihui Zhou, Peiqi Yin, Pengfei Zuo, James Cheng

Abstract: Processing long contexts has become a critical capability for modern large language models (LLMs). However, serving long-context LLMs comes with significant inference costs due to the high memory overhead of the key-value (KV) cache. Existing work leverages dynamic sparse attention algorithms (DSAes) to mitigate the KV cache overhead, but these algorithms rely on top-$k$ KV cache selection, which results in a trade-off between accuracy and efficiency. A larger $k$ improves accuracy but decreases efficiency, while a smaller $k$ boosts efficiency but compromises accuracy. To overcome this trade-off, this paper presents PSA, a Progressive Sparse Attention mechanism that integrates algorithmic innovations with system co-design to achieve both high inference accuracy and improved efficiency in LLM serving. The PSA algorithm adaptively adjusts the KV cache budget of different tokens and layers according to their real attention weight distributions, rather than relying on a fixed budget $k$. This enables high accuracy while minimizing KV cache usage. To further enhance execution efficiency, we introduce a pipelined iteration scheme that reduces CPU-GPU interleaving and synchronization overhead during PSA computation. Additionally, we implement unified GPU memory management that optimizes PSA's memory utilization by accounting for uneven memory requirements across different model layers. Extensive experimental results demonstrate that PSA reduces KV cache usage for attention computation by up to 2.4$\times$ and 8.8$\times$, and increases end-to-end serving throughput by up to 1.4$\times$ and 2.0$\times$, compared to state-of-the-art DSAes and systems without sparse attention, respectively.

Comment: The paper introduces Progressive Sparse Attention (PSA) for efficient attention in LLMs, focusing on reducing KV cache usage and improving inference efficiency. This aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
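
The contrast between a fixed top-$k$ budget and a budget driven by the attention distribution can be sketched as follows. This is only one plausible reading of "adaptive budget", assuming the budget is the smallest set of cached entries whose attention weights reach a target coverage; the `coverage` parameter and the function name are hypothetical and not taken from the paper.

```python
import numpy as np

def adaptive_budget(attn_weights, coverage=0.95):
    """Return indices of the fewest KV entries whose weights sum to >= coverage.

    Assumption: attn_weights is a 1-D, non-negative vector that sums to 1.
    """
    order = np.argsort(attn_weights)[::-1]           # most-attended entries first
    cumulative = np.cumsum(attn_weights[order])
    budget = int(np.searchsorted(cumulative, coverage) + 1)
    budget = min(budget, attn_weights.size)
    return order[:budget]

# A peaked distribution needs few entries; a flat one needs all of them.
peaked = np.array([0.90, 0.06, 0.02, 0.01, 0.01])
flat = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

peaked_selection = adaptive_budget(peaked)
flat_selection = adaptive_budget(flat)
```

A fixed $k$ would either waste cache on the peaked case or lose accuracy on the flat case, which is the trade-off the abstract describes.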


21. Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning

ArXiv ID: 2503.01329

Authors: Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheongwoong Kang, Jaesik Choi

Abstract: Recent advancements in large language models (LLMs) based on transformer architectures have sparked significant interest in understanding their inner workings. In this paper, we introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs). Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Through spectral analysis of the model's dynamics, we uncover an increase in eigenvalue magnitude that challenges the weight-sharing assumption prevalent in existing theoretical studies. We also leverage the Lyapunov exponent to examine token-level sensitivity, enhancing model interpretability. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets, while offering flexible fine-tuning capabilities that can adapt to different architectural constraints.

Comment: The paper introduces Neural ODE Transformers, offering insights into internal dynamics and adaptive fine-tuning. This aligns with foundational research in model architecture and interpretability.

Relevance: 9 Novelty: 8


22. Projection Head is Secretly an Information Bottleneck

ArXiv ID: 2503.00507

Authors: Zhuo Ouyang, Kaiwen Hu, Qi Zhang, Yifei Wang, Yisen Wang

Abstract: Recently, contrastive learning has risen to be a promising paradigm for extracting meaningful data representations. Among various special designs, adding a projection head on top of the encoder during training and removing it for downstream tasks has proven to significantly enhance the performance of contrastive learning. However, despite its empirical success, the underlying mechanism of the projection head remains under-explored. In this paper, we develop an in-depth theoretical understanding of the projection head from the information-theoretic perspective. By establishing the theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck, filtering out the information irrelevant to the contrastive objective. Based on theoretical insights, we introduce modifications to projectors with training and structural regularizations. Empirically, our methods exhibit consistent improvement in the downstream performance across various real-world datasets, including CIFAR-10, CIFAR-100, and ImageNet-100. We believe our theoretical understanding on the role of the projection head will inspire more principled and advanced designs in this field. Code is available at https://github.com/PKU-ML/Projector_Theory.

Comment: The paper provides a theoretical understanding of the projection head in contrastive learning, aligning with foundational research in representation learning and offering novel insights into its role as an information bottleneck.

Relevance: 9 Novelty: 8


23. Transformer Meets Twicing: Harnessing Unattended Residual Information

ArXiv ID: 2503.00687

Authors: Laziz Abdullaev, Tan Nguyen

Abstract: Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism, a core component of transformers, has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers, thereby hurting its overall performance. In this work, we leverage the connection between self-attention computations and low-pass non-local means (NLM) smoothing filters and propose the Twicing Attention, a novel attention mechanism that uses kernel twicing procedure in nonparametric regression to alleviate the low-pass behavior of associated NLM smoothing with compelling theoretical guarantees and enhanced adversarial robustness. This approach enables the extraction and reuse of meaningful information retained in the residuals following the imperfect smoothing operation at each layer. Our proposed method offers two key advantages over standard self-attention: 1) a provably slower decay of representational capacity and 2) improved robustness and accuracy across various data modalities and tasks. We empirically demonstrate the performance gains of our model over baseline transformers on multiple tasks and benchmarks, including image classification and language modeling, on both clean and corrupted data.

Comment: The paper proposes Twicing Attention, a novel attention mechanism addressing representational capacity decay in transformers. This aligns with foundational research in model architecture and offers theoretical guarantees.

Relevance: 9 Novelty: 8
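
Kernel twicing in nonparametric regression smooths the data, smooths the residuals left over by that first pass, and adds the two together. For a smoother matrix $A$ this gives $AV + A(V - AV) = (2A - A^2)V$. The sketch below applies that identity to a softmax attention matrix; it is an illustration of the twicing idea under that assumption, and the paper's exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def twicing_attention(Q, K, V):
    """Smooth V with attention, then smooth the residual and add it back."""
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    smoothed = A @ V
    residual = V - smoothed          # information the first smoothing pass missed
    return smoothed + A @ residual, A

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, A = twicing_attention(Q, K, V)
twiced_matrix = 2 * A - A @ A        # equivalent single-matrix form
```

Since $A$ is row-stochastic, the rows of $2A - A^2$ still sum to one, although individual entries may become negative.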


24. Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

ArXiv ID: 2503.01805

Authors: Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

Abstract: Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement a task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We support our theoretical results with empirical evaluations.

Comment: The paper provides theoretical insights into the depth-width tradeoffs in transformers for graph tasks, which is highly relevant to understanding transformer architectures and their efficiency.

Relevance: 9 Novelty: 8


25. CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering

ArXiv ID: 2503.00413

Authors: Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, Liang He

Abstract: Multimodal large language models (MLLMs) have garnered widespread attention from researchers due to their remarkable understanding and generation capabilities in visual language tasks (e.g., visual question answering). However, the rapid pace of knowledge updates in the real world makes offline training of MLLMs costly, and when faced with non-stationary data streams, MLLMs suffer from catastrophic forgetting during learning. In this paper, we propose an MLLMs-based dual momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering (VQA). We integrate MLLMs with continual learning to utilize the rich commonsense knowledge in LLMs. We introduce a Dual-Router MoE (RMoE) strategy to select the global and local experts using task-level and instance-level routers, to robustly assign weights to the experts most appropriate for the task. Then, we design a dynamic Momentum MoE (MMoE) to update the parameters of experts dynamically based on the relationships between the experts and tasks/instances, so that the model can absorb new knowledge while maintaining existing knowledge. The extensive experimental results indicate that our method achieves state-of-the-art performance on 10 VQA tasks, proving the effectiveness of our approach.

Comment: The paper introduces a dual momentum Mixture-of-Experts framework for continual learning in multimodal tasks, which is highly relevant to MoE and architectural innovations.

Relevance: 9 Novelty: 8


26. CoSMoEs: Compact Sparse Mixture of Experts

ArXiv ID: 2503.00245

Authors: Patrick Huber, Akshat Shrivastava, Ernie Chang, Chinnadhurai Sankar, Ahmed Aly, Adithya Sagar

Abstract: Sparse Mixture of Expert (MoE) models are popular foundational architectures at large scale, but remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: Quality, Memory and Latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors) MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, further improving the MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.

Comment: This paper introduces Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference, addressing quality, memory, and latency. It is highly relevant to the Mixture-of-Experts (MoE) criterion and provides insights into architectural innovations.

Relevance: 9 Novelty: 8


27. Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs

ArXiv ID: 2503.00979

Authors: Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant Nair, Poulami Das

Abstract: Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications -- such as chatbots and interactive assistants -- where low latency and high memory efficiency are critical. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, crucial for tasks like content creation and code generation. Our studies on long-response tasks show $52.9\%$ memory savings and $18.2\%$ higher accuracy on average compared to state-of-the-art prior works, enabling efficient real-world deployment.

Comment: The paper introduces MorphKV, a novel inference-time technique for maintaining constant-sized KV caches in LLMs, addressing memory efficiency and accuracy trade-offs. This aligns with the 'Model Compression' criterion, particularly in the context of KV cache optimization.

Relevance: 9 Novelty: 8
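
A constant-sized cache guided by the attention patterns of recent tokens might look roughly like the sketch below. It is a simplified illustration, not MorphKV itself: the split into a recent window plus a fixed number of older positions, the summed-attention ranking, and the name `select_cache_positions` are all assumptions.

```python
import numpy as np

def select_cache_positions(attn_recent, n_recent, n_older):
    """Pick which KV positions to retain under a fixed budget.

    attn_recent: (n_recent, seq_len) attention scores of the most recent
    tokens over all cached positions (hypothetical input layout).
    Keeps the last n_recent positions plus the n_older earlier positions
    that the recent tokens attend to most.
    """
    seq_len = attn_recent.shape[1]
    split = max(0, seq_len - n_recent)
    recent = np.arange(split, seq_len)
    older = np.arange(split)
    scores = attn_recent[:, :split].sum(axis=0)
    ranked = older[np.argsort(scores)[::-1]]
    return np.sort(np.concatenate([ranked[:n_older], recent]))

# Toy, unnormalized scores: recent tokens focus on positions 1 and 4.
seq_len, n_recent, n_older = 10, 3, 2
attn_recent = np.full((n_recent, seq_len), 0.01)
attn_recent[:, 1] = 0.30
attn_recent[:, 4] = 0.20
kept = select_cache_positions(attn_recent, n_recent, n_older)
```

The cache size stays at `n_recent + n_older` no matter how long generation runs, which is the constant-memory property the abstract emphasizes.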


28. When Can You Get Away with Low Memory Adam?

ArXiv ID: 2503.01843

Authors: Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein

Abstract: Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to SlimAdam, a memory-efficient Adam variant. SlimAdam compresses the second moments along dimensions with high SNR when feasible, and leaves them uncompressed when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that SlimAdam matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for SlimAdam is available at https://github.com/dayal-kalra/low-memory-adam.

Comment: The paper introduces SlimAdam, a memory-efficient variant of Adam optimizer, which aligns with the model compression criterion by addressing memory efficiency through a novel SNR-based approach.

Relevance: 9 Novelty: 8
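
The idea of replacing a second-moment tensor by its mean along a dimension when the signal-to-noise ratio is high can be sketched in a few lines. The SNR definition used here (squared mean over variance along the axis, averaged over the remaining entries), the threshold of 1.0, and the function names are assumptions for illustration rather than the paper's exact criterion.

```python
import numpy as np

def snr_along(v, axis):
    """Mean of (mean^2 / variance) of second moments along one axis (assumed definition)."""
    mean = v.mean(axis=axis)
    var = v.var(axis=axis)
    return float(np.mean(mean**2 / (var + 1e-30)))

def compress_second_moment(v, axis, snr_threshold=1.0):
    """Replace v by its mean along `axis` if the SNR is high enough, else keep v."""
    if snr_along(v, axis) >= snr_threshold:
        return v.mean(axis=axis, keepdims=True)
    return v

# Case 1: entries within each row are nearly identical, so the row mean is a good proxy.
row_scale = np.array([[0.5], [1.0], [1.5], [2.0]])
v_uniform_rows = row_scale * (1.0 + 0.01 * np.cos(np.arange(8)))

# Case 2: each row has one spiking entry, so the row mean would misrepresent it.
v_spiky = np.eye(4, 8) * 100.0 + 1e-3

compressed = compress_second_moment(v_uniform_rows, axis=1)  # shape (4, 1)
kept = compress_second_moment(v_spiky, axis=1)               # shape (4, 8)
```

In the first case the 4x8 tensor shrinks to four numbers; in the second it is left untouched, mirroring the "compress when feasible, otherwise leave as-is" behavior described in the abstract.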


29. Liger: Linearizing Large Language Models to Gated Recurrent Structures

ArXiv ID: 2503.01496

Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng

Abstract: Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers $93\%$ of the Transformer-based LLM's performance using $0.02\%$ of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.

Comment: The paper introduces Liger, a method for linearizing LLMs into gated recurrent structures, which aligns with foundational research in model architecture and efficiency. The use of LoRA for lightweight fine-tuning and the introduction of Liger Attention are novel contributions.

Relevance: 9 Novelty: 8


30. Steering Large Language Model Activations in Sparse Spaces

ArXiv ID: 2503.00177

Authors: Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent

Abstract: A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.

Comment: The paper introduces Sparse Activation Steering (SAS) for guiding LLM behavior using sparse autoencoders. This aligns with foundational research in representation learning and interpretability, offering a novel approach to behavior modulation.

Relevance: 9 Novelty: 8


31. Relating Piecewise Linear Kolmogorov Arnold Networks to ReLU Networks

ArXiv ID: 2503.01702

Authors: Nandi Schoots, Mattia Jacopo Villani, Niels uit de Bos

Abstract: Kolmogorov-Arnold Networks are a new family of neural network architectures which hold promise for overcoming the curse of dimensionality and have interpretability benefits (arXiv:2404.19756). In this paper, we explore the connection between Kolmogorov-Arnold Networks (KANs) with piecewise linear (univariate real) functions and ReLU networks. We provide completely explicit constructions to convert a piecewise linear KAN into a ReLU network and vice versa.

Comment: The paper explores the connection between Kolmogorov-Arnold Networks and ReLU networks, providing explicit constructions. This aligns with the interest in theoretical insights into architectures.

Relevance: 8 Novelty: 8
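
The basic building block behind such a conversion is the standard fact that a continuous piecewise linear function of one variable can be written as a one-hidden-layer ReLU network, with one unit per knot and weights equal to the slope changes. The sketch below shows only this single-function case on the interval spanned by the knots; it is not the paper's full KAN construction, and the helper names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pwl_to_relu_params(knots, values):
    """Convert (knot, value) pairs into ReLU hinge points and output weights."""
    slopes = np.diff(values) / np.diff(knots)
    hinge_points = knots[:-1]                          # one ReLU unit per segment start
    weights = np.concatenate([[slopes[0]], np.diff(slopes)])
    return values[0], hinge_points, weights

def relu_network(x, knots, values):
    """Evaluate the equivalent ReLU network for x within [knots[0], knots[-1]]."""
    bias, hinge_points, weights = pwl_to_relu_params(knots, values)
    x = np.asarray(x, dtype=float)
    hidden = relu(x[..., None] - hinge_points)         # hidden layer activations
    return bias + hidden @ weights

knots = np.array([0.0, 1.0, 2.0, 4.0])
values = np.array([0.0, 2.0, 1.0, 1.0])
x = np.linspace(0.0, 4.0, 41)
y_relu = relu_network(x, knots, values)
y_reference = np.interp(x, knots, values)              # direct piecewise linear evaluation
```

Each hidden unit switches on at its knot and contributes the change in slope there, so the sum reproduces the original function segment by segment.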


32. Homomorphism Expressivity of Spectral Invariant Graph Neural Networks

ArXiv ID: 2503.00485

Authors: Jingchu Gai, Yiheng Du, Bohang Zhang, Haggai Maron, Liwei Wang

Abstract: Graph spectra are an important class of structural features on graphs that have shown promising results in enhancing Graph Neural Networks (GNNs). Despite their widespread practical use, the theoretical understanding of the power of spectral invariants -- particularly their contribution to GNNs -- remains incomplete. In this paper, we address this fundamental question through the lens of homomorphism expressivity, providing a comprehensive and quantitative analysis of the expressive power of spectral invariants. Specifically, we prove that spectral invariant GNNs can homomorphism-count exactly a class of specific tree-like graphs which we refer to as parallel trees. We highlight the significance of this result in various contexts, including establishing a quantitative expressiveness hierarchy across different architectural variants, offering insights into the impact of GNN depth, and understanding the subgraph counting capabilities of spectral invariant GNNs. In particular, our results significantly extend Arvind et al. (2024) and settle their open questions. Finally, we generalize our analysis to higher-order GNNs and answer an open question raised by Zhang et al. (2024).

Comment: The paper provides a theoretical analysis of spectral invariant GNNs, which aligns with foundational research in model architecture and representation learning.

Relevance: 8 Novelty: 8


33. Depth-Adaptive Graph Neural Networks via Learnable Bakry-Émery Curvature

ArXiv ID: 2503.01079

Authors: Asela Hevapathige, Ahad N. Zehmakan, Qing Wang

Abstract: Graph Neural Networks (GNNs) have demonstrated strong representation learning capabilities for graph-based tasks. Recent advances in GNNs leverage geometric properties, such as curvature, to enhance their representation capabilities by modeling complex connectivity patterns and information flow within graphs. However, most existing approaches focus solely on discrete graph topology, overlooking diffusion dynamics and task-specific dependencies essential for effective learning. To address this, we propose integrating Bakry-Émery curvature, which captures both structural and task-driven aspects of information propagation. We develop an efficient, learnable approximation strategy, making curvature computation scalable for large graphs. Furthermore, we introduce an adaptive depth mechanism that dynamically adjusts message-passing layers per vertex based on its curvature, ensuring efficient propagation. Our theoretical analysis establishes a link between curvature and feature distinctiveness, showing that high-curvature vertices require fewer layers, while low-curvature ones benefit from deeper propagation. Extensive experiments on benchmark datasets validate the effectiveness of our approach, showing consistent performance improvements across diverse graph learning tasks.

Comment: The paper proposes a depth-adaptive GNN leveraging Bakry-Émery curvature, which aligns with architectural innovations in graph neural networks.

Relevance: 8 Novelty: 8


34. Riemann Tensor Neural Networks: Learning Conservative Systems with Physics-Constrained Networks

ArXiv ID: 2503.00755

Authors: Anas Jnini, Lorenzo Breschi, Flavio Vella

Abstract: Divergence-free symmetric tensors (DFSTs) are fundamental in continuum mechanics, encoding conservation laws such as mass and momentum conservation. We introduce Riemann Tensor Neural Networks (RTNNs), a novel neural architecture that inherently satisfies the DFST condition to machine precision, providing a strong inductive bias for enforcing these conservation laws. We prove that RTNNs can approximate any sufficiently smooth DFST with arbitrary precision and demonstrate their effectiveness as surrogates for conservative PDEs, achieving improved accuracy across benchmarks. This work is the first to use DFSTs as an inductive bias in neural PDE surrogates and to explicitly enforce the conservation of both mass and momentum within a physics-constrained neural architecture.

Comment: The introduction of Riemann Tensor Neural Networks (RTNNs) aligns with foundational research in model architecture by enforcing physics-constrained inductive biases.

Relevance: 8 Novelty: 8
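
To make the divergence-free symmetric tensor (DFST) condition from the RTNN abstract concrete, here is the classical two-dimensional construction from a scalar potential (the Airy stress function). It is only an illustration of a tensor field that satisfies the condition by design; the symbol \phi is an assumed name, and this is not claimed to be the parameterization used in the paper.

```latex
% 2D illustration; \phi is an arbitrary smooth scalar potential (assumed name).
\sigma =
\begin{pmatrix}
  \partial_{yy}\phi & -\partial_{xy}\phi \\
  -\partial_{xy}\phi & \partial_{xx}\phi
\end{pmatrix},
\qquad
\begin{aligned}
(\nabla\cdot\sigma)_x &= \partial_x\,\partial_{yy}\phi - \partial_y\,\partial_{xy}\phi = 0,\\
(\nabla\cdot\sigma)_y &= -\partial_x\,\partial_{xy}\phi + \partial_y\,\partial_{xx}\phi = 0.
\end{aligned}
```

Both components vanish because mixed partial derivatives commute, so symmetry and zero divergence hold for any smooth \phi rather than being learned.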


35. Understanding Dataset Distillation via Spectral Filtering

ArXiv ID: 2503.01212

Authors: Deyu Bo, Songhua Liu, Xinchao Wang

Abstract: Dataset distillation (DD) has emerged as a promising approach to compress datasets and speed up model training. However, the underlying connections among various DD methods remain largely unexplored. In this paper, we introduce UniDD, a spectral filtering framework that unifies diverse DD objectives. UniDD interprets each DD objective as a specific filter function that affects the eigenvalues of the feature-feature correlation (FFC) matrix and modulates the frequency components of the feature-label correlation (FLC) matrix. In this way, UniDD reveals that the essence of DD fundamentally lies in matching frequency-specific features. Moreover, according to the filter behaviors, we classify existing methods into low-frequency matching and high-frequency matching, encoding global texture and local details, respectively. However, existing methods rely on fixed filter functions throughout distillation, which cannot capture the low- and high-frequency information simultaneously. To address this limitation, we further propose Curriculum Frequency Matching (CFM), which gradually adjusts the filter parameter to cover both low- and high-frequency information of the FFC and FLC matrices. Extensive experiments on small-scale datasets, such as CIFAR-10/100, and large-scale datasets, including ImageNet-1K, demonstrate the superior performance of CFM over existing baselines and validate the practicality of UniDD.

Comment: The paper introduces a spectral filtering framework for dataset distillation, which provides theoretical insights into representation learning.

Relevance: 8 Novelty: 8


36. Modeling Arbitrarily Applicable Relational Responding with the Non-Axiomatic Reasoning System: A Machine Psychology Approach

ArXiv ID: 2503.00611

Authors: Robert Johansson

Abstract: Arbitrarily Applicable Relational Responding (AARR) is a cornerstone of human language and reasoning, referring to the learned ability to relate symbols in flexible, context-dependent ways. In this paper, we present a novel theoretical approach for modeling AARR within an artificial intelligence framework using the Non-Axiomatic Reasoning System (NARS). NARS is an adaptive reasoning system designed for learning under uncertainty. By integrating principles from Relational Frame Theory - the behavioral psychology account of AARR - with the reasoning mechanisms of NARS, we conceptually demonstrate how key properties of AARR (mutual entailment, combinatorial entailment, and transformation of stimulus functions) can emerge from the inference rules and memory structures of NARS. Two theoretical experiments illustrate this approach: one modeling stimulus equivalence and transfer of function, and another modeling complex relational networks involving opposition frames. In both cases, the system logically demonstrates the derivation of untrained relations and context-sensitive transformations of stimulus significance, mirroring established human cognitive phenomena. These results suggest that AARR - long considered uniquely human - can be conceptually captured by suitably designed AI systems, highlighting the value of integrating behavioral science insights into artificial general intelligence (AGI) research.

Comment: This paper introduces a novel theoretical approach to model Arbitrarily Applicable Relational Responding (AARR) using the Non-Axiomatic Reasoning System (NARS). It aligns with emerging trends in integrating behavioral science insights into AI, making it relevant to foundational research.

Relevance: 8 Novelty: 8


37. Learning-Augmented Frequent Directions

ArXiv ID: 2503.00937

Authors: Anders Aamand, Justin Y. Chen, Siddharth Gollapudi, Sandeep Silwal, Hao Wu

Abstract: An influential paper of Hsu et al. (ICLR'19) introduced the study of learning-augmented streaming algorithms in the context of frequency estimation. A fundamental problem in the streaming literature, the goal of frequency estimation is to approximate the number of occurrences of items appearing in a long stream of data using only a small amount of memory. Hsu et al. develop a natural framework to combine the worst-case guarantees of popular solutions such as CountMin and CountSketch with learned predictions of high frequency elements. They demonstrate that learning the underlying structure of data can be used to yield better streaming algorithms, both in theory and practice. We simplify and generalize past work on learning-augmented frequency estimation. Our first contribution is a learning-augmented variant of the Misra-Gries algorithm which improves upon the error of learned CountMin and learned CountSketch and achieves the state-of-the-art performance of randomized algorithms (Aamand et al., NeurIPS'23) with a simpler, deterministic algorithm. Our second contribution is to adapt learning-augmentation to a high-dimensional generalization of frequency estimation corresponding to finding important directions (top singular vectors) of a matrix given its rows one-by-one in a stream. We analyze a learning-augmented variant of the Frequent Directions algorithm, extending the theoretical and empirical understanding of learned predictions to matrix streaming.

Comment: The paper introduces a learning-augmented variant of the Frequent Directions algorithm, which aligns with representation learning and efficiency breakthroughs in streaming algorithms.

Relevance: 8 Novelty: 8
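
For readers unfamiliar with the baseline named in the Learning-Augmented Frequent Directions abstract, the following is a minimal sketch of the classic (non-learned) Misra-Gries frequency estimator. The function name and the toy stream are assumptions for illustration; the learning-augmented variant proposed in the paper is not reproduced here.

```python
def misra_gries(stream, k):
    """Classic Misra-Gries summary using at most k - 1 counters.

    For a stream of length n, each returned estimate undercounts the
    true frequency by at most n / k and never overcounts it.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters, drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream (made up for illustration): "a" is the only heavy item.
stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
estimates = misra_gries(stream, k=3)
```

With k = 3 the summary keeps two counters, so the estimate for "a" lies between its true count minus n / k and its true count.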


38. Multi-Level Collaboration in Model Merging

ArXiv ID: 2503.01268

Authors: Qi Li, Runpeng Yu, Xinchao Wang

Abstract: Parameter-level model merging is an emerging paradigm in multi-task learning with significant promise. Previous research has explored its connections with prediction-level model ensembling (commonly viewed as the upper bound for merging) to reveal the potential of achieving performance consistency between the two. However, this observation relies on certain preconditions, such as being limited to two models, using ViT-based models, and fine-tuning all models from the same pre-trained checkpoint. To further understand the intrinsic connections between model merging and model ensembling, this paper explores an interesting possibility: If these restrictions are removed, can performance consistency still be achieved between merging and ensembling? To answer this question, we first theoretically establish a performance correlation between merging and ensembling. We find that even when previous restrictions are not met, there is still a way for model merging to attain a near-identical and superior performance similar to that of ensembling. To verify whether our findings are practical, we introduce a validation framework termed Neural Ligand (NeuLig). The learning process of NeuLig is meticulously designed with a specialized loss function supported by theoretical foundations. Experimental results demonstrate the robust resilience of NeuLig in terms of both model scale and the number of collaborating models. For instance, for the case involving 5 CLIP-ViT-B/32 models, parameter-level merging achieves the same performance as prediction-level ensembling (merging: 95.44% vs. ensembling: 95.46%).

Comment: The paper explores model merging and its theoretical connection to model ensembling, which aligns with representation learning and architectural innovations by addressing multi-task learning and parameter-level merging.

Relevance: 8 Novelty: 8
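
The distinction between parameter-level merging and prediction-level ensembling in the Multi-Level Collaboration abstract can be illustrated with a toy sketch. All names and weights below are hypothetical and this is not NeuLig. For purely linear models the two coincide exactly, while a nonlinearity can open a gap between them.

```python
def predict(weights, x, activation=None):
    """Linear score w . x, optionally passed through an activation."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return activation(score) if activation else score


def merge_parameters(models):
    """Parameter-level merging: average the weight vectors coordinate-wise."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def ensemble_predictions(models, x, activation=None):
    """Prediction-level ensembling: average the individual model outputs."""
    return sum(predict(m, x, activation) for m in models) / len(models)


def relu(z):
    return max(0.0, z)


# Hypothetical weights and input, chosen only for illustration.
models = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
x = [2.0, 4.0]
linear_gap = abs(predict(merge_parameters(models), x)
                 - ensemble_predictions(models, x))

# With a nonlinearity, merging and ensembling can disagree.
toy = [[1.0], [-1.0]]
relu_gap = abs(predict(merge_parameters(toy), [1.0], relu)
               - ensemble_predictions(toy, [1.0], relu))
```

In this toy setting `linear_gap` is zero by linearity, whereas `relu_gap` is not, which is one simple way to see why consistency between merging and ensembling is not automatic for deep networks.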


39. How simple can you go? An off-the-shelf transformer approach to molecular dynamics

ArXiv ID: 2503.01431

Authors: Max Eissler, Tim Korjakow, Stefan Ganscha, Oliver T. Unke, Klaus-Robert Müller, Stefan Gugler

Abstract: Most current neural networks for molecular dynamics (MD) include physical inductive biases, resulting in specialized and complex architectures. This is in contrast to most other machine learning domains, where specialist approaches are increasingly replaced by general-purpose architectures trained on vast datasets. In line with this trend, several recent studies have questioned the necessity of architectural features commonly found in MD models, such as built-in rotational equivariance or energy conservation. In this work, we contribute to the ongoing discussion by evaluating the performance of an MD model with as few specialized architectural features as possible. We present a recipe for MD using an Edge Transformer, an "off-the-shelf" transformer architecture that has been minimally modified for the MD domain, termed MD-ET. Our model implements neither built-in equivariance nor energy conservation. We use a simple supervised pre-training scheme on ~30 million molecular structures from the QCML database. Using this "off-the-shelf" approach, we show state-of-the-art results on several benchmarks after fine-tuning for a small number of steps. Additionally, we examine the effects of being only approximately equivariant and energy conserving for MD simulations, proposing a novel method for distinguishing the errors resulting from non-equivariance from other sources of inaccuracies like numerical rounding errors. While our model exhibits runaway energy increases on larger structures, we show approximately energy-conserving NVE simulations for a range of small structures.

Comment: The paper evaluates a minimally modified transformer for molecular dynamics, questioning the necessity of specialized architectural features. This aligns with the interest in analyzing existing architectures.

Relevance: 8 Novelty: 7


40. Improve Representation for Imbalanced Regression through Geometric Constraints

ArXiv ID: 2503.00876

Authors: Zijian Dong, Yilei Wu, Chongyao Chen, Yingtian Zou, Yichi Zhang, Juan Helen Zhou

Abstract: In representation learning, uniformity refers to the uniform feature distribution in the latent space (i.e., unit hypersphere). Previous work has shown that improving uniformity contributes to the learning of under-represented classes. However, most of the previous work focused on classification; the representation space of imbalanced regression remains unexplored. Classification-based methods are not suitable for regression tasks because they cluster features into distinct groups without considering the continuous and ordered nature essential for regression. In a geometric aspect, we uniquely focus on ensuring uniformity in the latent space for imbalanced regression through two key losses: enveloping and homogeneity. The enveloping loss encourages the induced trace to uniformly occupy the surface of a hypersphere, while the homogeneity loss ensures smoothness, with representations evenly spaced at consistent intervals. Our method integrates these geometric principles into the data representations via a Surrogate-driven Representation Learning (SRL) framework. Experiments with real-world regression and operator learning tasks highlight the importance of uniformity in imbalanced regression and validate the efficacy of our geometry-based loss functions.

Comment: The paper introduces geometric constraints for representation learning in imbalanced regression, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


41. DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems

ArXiv ID: 2503.01704

Authors: Minoo Hosseinzadeh, Hana Khamfroush

Abstract: With a recent trend of using Large Language Models (LLMs) for different applications within smart cities, there is a need for pushing these models toward the edge of the network while still preserving their performance. Edge Computing (EC), as a computing resource physically closer to end users, can help reduce the communication delay of serving end users' tasks for LLM-dependent services. However, EC servers have limited communication, computation, and storage capacity. This paper introduces DILEMMA, a novel framework addressing the challenges of deploying LLMs in EC systems by jointly optimizing layer placement and layer quantization. DILEMMA formulates an Integer Linear Programming problem to minimize total inference delay while ensuring acceptable LLM performance levels, leveraging layer-wise quantization and knowledge distillation for LLM performance control. Experimental evaluations on the OPT-350 model using the SQuAD dataset demonstrate that DILEMMA achieves a quantization ratio of up to 12.75% while preserving model loss, highlighting its effectiveness in resource-constrained environments.

Comment: The paper addresses LLM quantization and distributed inference, which is relevant to model compression and efficiency, particularly in resource-constrained environments.

Relevance: 8 Novelty: 7


42. PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

ArXiv ID: 2503.01328

Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Jialin Li, Min Lin

Abstract: Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. With empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In the cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitation. Our experiments prove that the per-device activation memory effectively reduces with the total number of stages, making PP a stronger alternative than TP, offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.

Comment: The paper introduces a memory optimization strategy for pipeline parallelism in LLM training, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


43. Generalization Bounds for Equivariant Networks on Markov Data

ArXiv ID: 2503.00292

Authors: Hui Li, Zhiguo Wang, Bohui Chen, Li Sheng

Abstract: Equivariant neural networks play a pivotal role in analyzing datasets with symmetry properties, particularly in complex data structures. However, integrating equivariance with Markov properties presents notable challenges due to the inherent dependencies within such data. Previous research has primarily concentrated on establishing generalization bounds under the assumption of independently and identically distributed data, frequently neglecting the influence of Markov dependencies. In this study, we investigate the impact of Markov properties on generalization performance alongside the role of equivariance within this context. We begin by applying a new McDiarmid's inequality to derive a generalization bound for neural networks trained on Markov datasets, using Rademacher complexity as a central measure of model capacity. Subsequently, we utilize group theory to compute the covering number under equivariant constraints, enabling us to obtain an upper bound on the Rademacher complexity based on this covering number. This bound provides practical insights into selecting low-dimensional irreducible representations, enhancing generalization performance for fixed-width equivariant neural networks.

Comment: The paper provides generalization bounds for equivariant networks on Markov data, which aligns with foundational research in model behavior and theoretical insights.

Relevance: 8 Novelty: 7


44. SAKE: Steering Activations for Knowledge Editing

ArXiv ID: 2503.01751

Authors: Marco Scialanga, Thibault Laugel, Vincent Grari, Marcin Detyniecki

Abstract: As Large Language Models have been shown to memorize real-world facts, the need to update this knowledge in a controlled and efficient manner arises. Designed with these constraints in mind, Knowledge Editing (KE) approaches propose to alter specific facts in pretrained models. However, they have been shown to suffer from several limitations, including their lack of contextual robustness and their failure to generalize to logical implications related to the fact. To overcome these issues, we propose SAKE, a steering activation method that models a fact to be edited as a distribution rather than a single prompt. Leveraging Optimal Transport, SAKE alters the LLM behavior over a whole fact-related distribution, defined as paraphrases and logical implications. Several numerical experiments demonstrate the effectiveness of this method: SAKE is thus able to perform more robust edits than its existing counterparts.

Comment: SAKE introduces a method for knowledge editing in LLMs, focusing on robustness and generalization. This aligns with foundational research in LLM behavior and interpretability.

Relevance: 8 Novelty: 7


45. Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo

ArXiv ID: 2503.00699

Authors: Hyunsu Kim, Giung Nam, Chulhee Yun, Hongseok Yang, Juho Lee

Abstract: Bayesian Neural Networks (BNNs) provide a promising framework for modeling predictive uncertainty and enhancing out-of-distribution (OOD) robustness by estimating the posterior distribution of network parameters. Stochastic Gradient Markov Chain Monte Carlo (SGMCMC) is one of the most powerful methods for scalable posterior sampling in BNNs, achieving efficiency by combining stochastic gradient descent with second-order Langevin dynamics. However, SGMCMC often suffers from limited sample diversity in practice, which affects uncertainty estimation and model performance. We propose a simple yet effective approach to enhance sample diversity in SGMCMC without the need for tempering or running multiple chains. Our approach reparameterizes the neural network by decomposing each of its weight matrices into a product of matrices, resulting in a sampling trajectory that better explores the target parameter space. This approach produces a more diverse set of samples, allowing faster mixing within the same computational budget. Notably, our sampler achieves these improvements without increasing the inference cost compared to the standard SGMCMC. Extensive experiments on image classification tasks, including OOD robustness, diversity, loss surface analyses, and a comparative study with Hamiltonian Monte Carlo, demonstrate the superiority of the proposed approach.

Comment: The paper proposes a reparameterization method to enhance sample diversity in SGMCMC for Bayesian Neural Networks, which aligns with foundational research in training dynamics and efficiency.

Relevance: 8 Novelty: 7


46. Cauchy-Schwarz Regularizers

ArXiv ID: 2503.01639

Authors: Sueda Taner, Ziyi Wang, Christoph Studer

Abstract: We introduce a novel class of regularization functions, called Cauchy-Schwarz (CS) regularizers, which can be designed to induce a wide range of properties in solution vectors of optimization problems. To demonstrate the versatility of CS regularizers, we derive regularization functions that promote discrete-valued vectors, eigenvectors of a given matrix, and orthogonal matrices. The resulting CS regularizers are simple, differentiable, and can be free of spurious stationary points, making them suitable for gradient-based solvers and large-scale optimization problems. In addition, CS regularizers automatically adapt to the appropriate scale, which is, for example, beneficial when discretizing the weights of neural networks. To demonstrate the efficacy of CS regularizers, we provide results for solving underdetermined systems of linear equations and weight quantization in neural networks. Furthermore, we discuss specializations, variations, and generalizations, which lead to an even broader class of new and possibly more powerful regularizers.

Comment: The paper introduces Cauchy-Schwarz regularizers, which align with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7
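
As a rough illustration of how the Cauchy-Schwarz inequality can yield a scale-adaptive regularizer of the kind described in the Cauchy-Schwarz Regularizers abstract, the sketch below applies the inequality to the vector of squared entries and the all-ones vector. The function name and the specific formula are assumptions derived directly from the inequality, and are not claimed to match the regularizers defined in the paper.

```python
def cs_binary_regularizer(x):
    """Penalty n * sum(x_i^4) - (sum(x_i^2))^2.

    By Cauchy-Schwarz, (sum(x_i^2 * 1))^2 <= n * sum(x_i^4), with equality
    exactly when all x_i^2 are equal. The penalty is therefore non-negative
    and vanishes only for vectors of the form a * (+-1, ..., +-1), for any
    scale a, without a having to be specified in advance.
    """
    n = len(x)
    sum_sq = sum(v * v for v in x)
    sum_fourth = sum(v ** 4 for v in x)
    return n * sum_fourth - sum_sq ** 2


# Illustrative inputs (made up): two scaled binary vectors and one that is not.
zero_at_scale_2 = cs_binary_regularizer([2.0, -2.0, 2.0])
zero_at_scale_half = cs_binary_regularizer([0.5, -0.5, 0.5, 0.5])
positive_otherwise = cs_binary_regularizer([1.0, 0.0, 0.0])
```

The penalty is a polynomial in the entries, so it is differentiable everywhere, which is the property that makes this style of regularizer convenient for gradient-based solvers.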


47. Convergence of energy-based learning in linear resistive networks

ArXiv ID: 2503.00349

Authors: Anne-Men Huijzer, Thomas Chaffey, Bart Besselink, Henk J. van Waarde

Abstract: Energy-based learning algorithms are alternatives to backpropagation and are well-suited to distributed implementations in analog electronic devices. However, a rigorous theory of convergence is lacking. We make a first step in this direction by analysing a particular energy-based learning algorithm, Contrastive Learning, applied to a network of linear adjustable resistors. It is shown that, in this setup, Contrastive Learning is equivalent to projected gradient descent on a convex function, for any step size, giving a guarantee of convergence for the algorithm.

Comment: The paper provides theoretical insights into energy-based learning algorithms, which is foundational and relevant to representation learning.

Relevance: 8 Novelty: 7
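
The abstract on energy-based learning in linear resistive networks states an equivalence with projected gradient descent on a convex function. The sketch below shows generic projected gradient descent onto a box constraint for a toy quadratic objective. The objective, bounds, and step size are assumptions chosen for illustration; it does not model a resistor network, and the "any step size" guarantee from the paper is specific to its setting rather than to this generic routine.

```python
def project_box(values, lower, upper):
    """Euclidean projection onto the box [lower, upper]^n."""
    return [min(upper, max(lower, v)) for v in values]


def projected_gradient_descent(grad, start, lower, upper, step, iterations):
    """Alternate a gradient step with a projection back onto the box."""
    point = project_box(start, lower, upper)
    for _ in range(iterations):
        moved = [p - step * g for p, g in zip(point, grad(point))]
        point = project_box(moved, lower, upper)
    return point


# Toy convex objective f(p) = sum((p_i - t_i)^2) with made-up targets t.
targets = [0.5, -1.0, 3.0]


def grad(p):
    return [2.0 * (pi - ti) for pi, ti in zip(p, targets)]


solution = projected_gradient_descent(
    grad, start=[1.0, 1.0, 1.0], lower=0.0, upper=2.0, step=0.25, iterations=100
)
```

Here the constrained minimizer is simply the targets clipped to the box, so coordinates whose target lies outside [0, 2] end up on the boundary.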


48. Regularization-based Framework for Quantization-, Fault- and Variability-Aware Training

ArXiv ID: 2503.01297

Authors: Anmol Biswas, Raghav Singhal, Sivakumar Elangovan, Shreyas Sabnis, Udayan Ganguly

Abstract: Efficient inference is critical for deploying deep learning models on edge AI devices. Low-bit quantization (e.g., 3- and 4-bit) with fixed-point arithmetic improves efficiency, while low-power memory technologies like analog nonvolatile memory enable further gains. However, these methods introduce non-ideal hardware behavior, including bit faults and device-to-device variability. We propose a regularization-based quantization-aware training (QAT) framework that supports fixed, learnable step-size, and learnable non-uniform quantization, achieving competitive results on CIFAR-10 and ImageNet. Our method also extends to Spiking Neural Networks (SNNs), demonstrating strong performance on 4-bit networks on CIFAR10-DVS and N-Caltech 101. Beyond quantization, our framework enables fault and variability-aware fine-tuning, mitigating stuck-at faults (fixed weight bits) and device resistance variability. Compared to prior fault-aware training, our approach significantly improves performance recovery under up to 20% bit-fault rate and 40% device-to-device variability. Our results establish a generalizable framework for quantization and robustness-aware training, enhancing efficiency and reliability in low-power, non-ideal hardware.

Comment: The paper proposes a regularization-based framework for quantization-aware training, which aligns with model compression and efficiency topics.

Relevance: 8 Novelty: 7
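
One plausible reading of "regularization-based" quantization-aware training, as described in the abstract above, is a penalty that pulls each weight toward its nearest quantization level. The sketch below covers only the fixed step-size case on a signed uniform grid. The function names, the squared-distance penalty, and the grid convention are assumptions for illustration, not the paper's formulation.

```python
def quantize(w, step, bits):
    """Round w to the nearest level of a signed uniform grid with 2**bits levels."""
    lowest, highest = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    level = max(lowest, min(highest, round(w / step)))
    return level * step


def quantization_penalty(weights, step, bits):
    """Sum of squared distances between weights and their quantized values.

    Added to a task loss with some coefficient, this term is zero exactly
    when every weight already sits on a representable grid point.
    """
    return sum((w - quantize(w, step, bits)) ** 2 for w in weights)


# 3-bit grid with step 0.25 covers the levels -1.0, -0.75, ..., 0.75.
on_grid = quantization_penalty([0.25, -0.5, 0.75], step=0.25, bits=3)
off_grid = quantization_penalty([0.3], step=0.25, bits=3)
clipped = quantize(2.0, step=0.25, bits=3)
```

Weights outside the representable range are clipped to the largest or smallest level, so the penalty also discourages weights from drifting beyond the grid.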


49. Re-Imagining Multimodal Instruction Tuning: A Representation View

ArXiv ID: 2503.00723

Authors: Yiyang Liu, James Chenhao Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han

Abstract: Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.

Comment: The paper introduces a novel approach to multimodal instruction tuning, which is relevant to representation learning and parameter-efficient methods.

Relevance: 8 Novelty: 7


50. Constraining Sequential Model Editing with Editing Anchor Compression

ArXiv ID: 2503.00035

Authors: Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Ningyu Zhang, Jia-Chen Gu

Abstract: Large language models (LLMs) struggle with hallucinations due to false or outdated knowledge. Given the high resource demands of retraining these models, there is an increasing focus on developing model editing. However, the general abilities of LLMs across downstream tasks are prone to significant degradation during sequential editing. This paper statistically observes that the parameter matrix after editing exhibits a significant deviation compared to its previous state as the number of edits increases. This serious deviation affects the original knowledge associations within LLMs and leads to the degradation of their general abilities. To this end, a framework termed Editing Anchor Compression (EAC) is proposed to constrain the deviation of the parameter matrix during sequential editing. It compresses the editing information by selecting editing anchors that are important in encoding new relations without deviating too much from the original matrix, thereby preserving the general abilities. Experiments of applying EAC to two popular editing methods on three LLMs across four tasks are conducted. Evaluation results show that EAC effectively minimizes unreasonable deviations caused by model editing, preserving over 70% of the general abilities while better retaining the editing knowledge compared to the original counterpart methods.

Comment: The paper introduces Editing Anchor Compression (EAC) to address sequential model editing in LLMs, which aligns with foundational research in model compression and efficiency. The focus on preserving general abilities while editing is a novel contribution.

Relevance: 8 Novelty: 7


51. Unlocking Efficient, Scalable, and Continual Knowledge Editing with Basis-Level Representation Fine-Tuning

ArXiv ID: 2503.00306

Authors: Tianci Liu, Ruirui Li, Yunzhe Qi, Hui Liu, Xianfeng Tang, Tianqi Zheng, Qingyu Yin, Monica Xiao Cheng, Jun Huan, Haoyu Wang, Jing Gao

Abstract: Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing methods designed to update certain knowledge in LLMs without changing unrelated others. To make selective edits, previous efforts often sought to update a small number of parameters in some specific layer(s) of an LLM. Nonetheless, in challenging scenarios, they still fall short in making successful edits while preserving knowledge irrelevant to the updates simultaneously, resulting in a notable editing-locality trade-off. In this work, we question if the trade-offs are caused by the fact that parameter-based updates have a global effect, i.e., edited parameters affect all inputs indiscriminately. In light of this, we explore the feasibility of representation fine-tuning, which applies some linear update to a few representations in a learned subspace, for knowledge editing. While being effective to enhance an LLM's general ability as demonstrated in the previous work, we theoretically show that this linear update imposes a tension in editing-locality trade-off. Subsequently, BaFT is proposed to break the linearity. BaFT computes a weight for each basis that spans a dimension of the subspace based on the input representation. This input-dependent weighting mechanism allows BaFT to manage different types of knowledge in an adaptive way, thereby achieving a better editing-locality trade-off. Experiments on three LLMs with five editing benchmarks in diverse scenarios show the superiority of our method.

Comment: The paper proposes BaFT, a method for knowledge editing in LLMs using basis-level representation fine-tuning. This aligns with foundational research in representation learning and model editing, offering a novel approach to the editing-locality trade-off.

Relevance: 8 Novelty: 7


52. Personalize Your LLM: Fake it then Align it

ArXiv ID: 2503.01048

Authors: Yijing Zhang, Dyah Adila, Changho Shin, Frederic Sala

Abstract: Personalizing large language models (LLMs) is essential for delivering tailored interactions that improve user experience. Many existing personalization methods require fine-tuning LLMs for each user, rendering them prohibitively expensive for widespread adoption. Although retrieval-based approaches offer a more compute-efficient alternative, they still depend on large, high-quality datasets that are not consistently available for all users. To address this challenge, we propose CHAMELEON, a scalable and efficient personalization approach that uses (1) self-generated personal preference data and (2) representation editing to enable quick and cost-effective personalization. Our experiments on various tasks, including those from the LaMP personalization benchmark, show that CHAMELEON efficiently adapts models to personal preferences, improving instruction-tuned models and outperforming two personalization baselines by an average of 40% across two model architectures.

Comment: The paper introduces CHAMELEON, an LLM personalization approach that combines self-generated preference data with representation editing, aligning with research on representation learning and efficient model adaptation.

Relevance: 8 Novelty: 7


53. Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches

ArXiv ID: 2503.01314

Authors: Yifang Chen, Xuyang Guo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song

Abstract: Recently, Large Language Models (LLMs) have achieved remarkable success. A key factor behind this success is the scaling law observed by OpenAI. Specifically, for models with Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training, demonstrating trends that span more than seven orders of magnitude. This scaling law challenges traditional machine learning wisdom, notably the Occam's razor principle, which suggests that an overparametrized algorithm will overfit the training datasets, resulting in poor test performance. Recent research has also identified the scaling law in simpler machine learning contexts, such as linear regression. However, fully explaining the scaling law in large practical models remains an elusive goal. In this work, we advance our understanding by demonstrating that the scaling law phenomenon extends to multiple regression and kernel regression settings, which are significantly more expressive and powerful than linear methods. Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of LLMs.

Comment: The paper extends scaling law phenomena to multiple and kernel regression, contributing to theoretical insights into scaling laws, which are relevant to understanding LLM behavior.

Relevance: 8 Novelty: 7


54. ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation

ArXiv ID: 2503.01052

Authors: Yanzhou Pan, Huawei Lin, Yide Ran, Jiamin Chen, Xiaodong Yu, Weijie Zhao, Denghui Zhang, Zhaozhuo Xu

Abstract: Large Language Models (LLMs) heavily rely on high-quality training data, making data valuation crucial for optimizing model performance, especially when working within a limited budget. In this work, we aim to offer a third-party data valuation approach that benefits both data providers and model developers. We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples in improving LLM performance during training. We further propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation. Our comprehensive evaluations demonstrate that this approach surpasses existing baselines in effectiveness and efficiency, demonstrating significant scalability advantages as LLM parameters increase.

Comment: The paper proposes a scalable data valuation method for LLMs, which is relevant to foundational research in LLM efficiency and data optimization.

Relevance: 8 Novelty: 7


55. Hypergraph Foundation Model

ArXiv ID: 2503.01203

Authors: Yifan Feng, Shiquan Liu, Xiangmin Han, Shaoyi Du, Zongze Wu, Han Hu, Yue Gao

Abstract: Hypergraph neural networks (HGNNs) effectively model complex high-order relationships in domains like protein interactions and social networks by connecting multiple vertices through hyperedges, enhancing modeling capabilities, and reducing information loss. Developing foundation models for hypergraphs is challenging due to their distinct data, which includes both vertex features and intricate structural information. We present Hyper-FM, a Hypergraph Foundation Model for multi-domain knowledge extraction, featuring Hierarchical High-Order Neighbor Guided Vertex Knowledge Embedding for vertex feature representation and Hierarchical Multi-Hypergraph Guided Structural Knowledge Extraction for structural information. Additionally, we curate 10 text-attributed hypergraph datasets to advance research between HGNNs and LLMs. Experiments on these datasets show that Hyper-FM outperforms baseline methods by approximately 13.3%, validating our approach. Furthermore, we propose the first scaling law for hypergraph foundation models, demonstrating that increasing domain diversity significantly enhances performance, unlike merely augmenting vertex and hyperedge counts. This underscores the critical role of domain diversity in scaling hypergraph models.

Comment: The paper proposes a hypergraph foundation model, which is relevant to architectural innovations and representation learning, particularly in the context of hypergraphs.

Relevance: 8 Novelty: 7


56. Architectural and Inferential Inductive Biases For Exchangeable Sequence Modeling

ArXiv ID: 2503.01215

Authors: Daksh Mittal, Ang Li, Tzu-Ching Yen, Daniel Guetta, Hongseok Namkoong

Abstract: Autoregressive models have emerged as a powerful framework for modeling exchangeable sequences - i.i.d. observations when conditioned on some latent factor - enabling direct modeling of uncertainty from missing data (rather than a latent). Motivated by the critical role posterior inference plays as a subroutine in decision-making (e.g., active learning, bandits), we study the inferential and architectural inductive biases that are most effective for exchangeable sequence modeling. For the inference stage, we highlight a fundamental limitation of the prevalent single-step generation approach: inability to distinguish between epistemic and aleatoric uncertainty. Instead, a long line of works in Bayesian statistics advocates for multi-step autoregressive generation; we demonstrate this "correct approach" enables superior uncertainty quantification that translates into better performance on downstream decision-making tasks. This naturally leads to the next question: which architectures are best suited for multi-step inference? We identify a subtle yet important gap between recently proposed Transformer architectures for exchangeable sequences (Muller et al., 2022; Nguyen & Grover, 2022; Ye & Namkoong, 2024), and prove that they in fact cannot guarantee exchangeability despite introducing significant computational overhead. We illustrate our findings using controlled synthetic settings, demonstrating how custom architectures can significantly underperform standard causal masks, underscoring the need for new architectural innovations.

Comment: The paper studies architectural and inferential biases in exchangeable sequence modeling, which aligns with the model architecture criterion by analyzing and proposing improvements to Transformer-based architectures.

Relevance: 8 Novelty: 7


57. Cauchy Random Features for Operator Learning in Sobolev Space

ArXiv ID: 2503.00300

Authors: Chunyang Liao, Deanna Needell, Hayden Schaeffer

Abstract: Operator learning is the approximation of operators between infinite dimensional Banach spaces using machine learning approaches. While most progress in this area has been driven by variants of deep neural networks such as the Deep Operator Network and Fourier Neural Operator, the theoretical guarantees are often in the form of a universal approximation property. However, the existence theorems do not guarantee that an accurate operator network is obtainable in practice. Motivated by the recent kernel-based operator learning framework, we propose a random feature operator learning method with theoretical guarantees and error bounds. The random feature method can be viewed as a randomized approximation of a kernel method, which significantly reduces the computation requirements for training. We provide a generalization error analysis for our proposed random feature operator learning method along with comprehensive numerical results. Compared to kernel-based and neural network methods, the proposed method can obtain similar or better test errors across benchmark examples with significantly reduced training times. An additional advantage is that our implementation is simple and does not require costly computational resources, such as GPUs.

Comment: The paper proposes a random feature method for operator learning with theoretical guarantees, which could be relevant to representation learning due to its focus on kernel-based methods and error bounds.

Relevance: 7 Novelty: 8


58. AMUN: Adversarial Machine UNlearning

ArXiv ID: 2503.00917

Authors: Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran

Abstract: Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on "exact" unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, "approximate" methods have fallen short of reaching the effectiveness of exact unlearning: models produced fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) dataset. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random 10% of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.

Comment: The paper introduces a novel adversarial machine unlearning method, which is relevant to model compression and efficiency, particularly in terms of fine-tuning and decision boundary adjustments.

Relevance: 7 Novelty: 8


59. Learning Stochastic Dynamical Systems with Structured Noise

ArXiv ID: 2503.01077

Authors: Ziheng Guo, James Greene, Ming Zhong

Abstract: Stochastic differential equations (SDEs) are a ubiquitous modeling framework that finds applications in physics, biology, engineering, social science, and finance. Due to the availability of large-scale data sets, there is growing interest in learning mechanistic models from observations with stochastic noise. In this work, we present a nonparametric framework to learn both the drift and diffusion terms in systems of SDEs where the stochastic noise is singular. Specifically, inspired by second-order equations from classical physics, we consider systems which possess structured noise, i.e. noise with a singular covariance matrix. We provide an algorithm for constructing estimators given trajectory data and demonstrate the effectiveness of our methods via a number of examples from physics and biology. As the developed framework is most naturally applicable to systems possessing a high degree of dimensionality reduction (i.e. symmetry), we also apply it to the high dimensional Cucker-Smale flocking model studied in collective dynamics and show that it is able to accurately infer the low dimensional interaction kernel from particle data.

Comment: The paper introduces a framework for learning stochastic dynamical systems with structured noise, which has potential relevance to foundational research in representation learning.

Relevance: 7 Novelty: 7


60. Input Specific Neural Networks

ArXiv ID: 2503.00268

Authors: Asghar A. Jadoon, D. Thomas Seidl, Reese E. Jones, Jan N. Fuhg

Abstract: The black-box nature of neural networks limits the ability to encode or impose specific structural relationships between inputs and outputs. While various studies have introduced architectures that ensure the network's output adheres to a particular form in relation to certain inputs, the majority of these approaches impose constraints on only a single set of inputs. This paper introduces a novel neural network architecture, termed the Input Specific Neural Network (ISNN), which extends this concept by allowing scalar-valued outputs to be subject to multiple constraints. Specifically, the ISNN can enforce convexity in some inputs, non-decreasing monotonicity combined with convexity with respect to others, and simple non-decreasing monotonicity or arbitrary relationships with additional inputs. The paper presents two distinct ISNN architectures, along with equations for the first and second derivatives of the output with respect to the inputs. These networks are broadly applicable. In this work, we restrict their usage to solving problems in computational mechanics. In particular, we show how they can be effectively applied to fitting data-driven constitutive models. We then embed our trained data-driven constitutive laws into a finite element solver where significant time savings can be achieved by using explicit manual differentiation using the derived equations as opposed to automatic differentiation. We also show how ISNNs can be used to learn structural relationships between inputs and outputs via a binary gating mechanism. Particularly, ISNNs are employed to model an anisotropic free energy potential to get the homogenized macroscopic response in a decoupled multiscale setting, where the network learns whether or not the potential should be modeled as polyconvex, and retains only the relevant layers while using the minimum number of inputs.

Comment: The paper introduces Input Specific Neural Networks (ISNNs) with novel architectural constraints for encoding structural relationships. While the focus is on computational mechanics applications, the architectural innovation aligns with the 'Model Architecture' criterion.

Relevance: 7 Novelty: 7


61. On the Saturation Effects of Spectral Algorithms in Large Dimensions

ArXiv ID: 2503.00504

Authors: Weihao Lu, Haobo Zhang, Yicheng Li, Qian Lin

Abstract: The saturation effects, which originally refer to the fact that kernel ridge regression (KRR) fails to achieve the information-theoretical lower bound when the regression function is over-smooth, have been observed for almost 20 years and were rigorously proved recently for kernel ridge regression and some other spectral algorithms over a fixed dimensional domain. The main focus of this paper is to explore the saturation effects for a large class of spectral algorithms (including KRR, gradient descent, etc.) in large dimensional settings where $n \asymp d^{\gamma}$. More precisely, we first propose an improved minimax lower bound for the kernel regression problem in large dimensional settings and show that the gradient flow with an early stopping strategy will result in an estimator achieving this lower bound (up to a logarithmic factor). Similar to the results in KRR, we can further determine the exact convergence rates (both upper and lower bounds) of a large class of (optimally tuned) spectral algorithms with different qualification $\tau$'s. In particular, we find that these exact rate curves (varying along $\gamma$) exhibit the periodic plateau behavior and the polynomial approximation barrier. Consequently, we can fully depict the saturation effects of the spectral algorithms and reveal a new phenomenon in large dimensional settings (i.e., the saturation effect occurs in the large dimensional setting as long as the source condition $s>\tau$ while it occurs in the fixed dimensional setting as long as $s>2\tau$).

Comment: This paper explores saturation effects in spectral algorithms, providing theoretical insights into their behavior in large dimensions. While not directly tied to representation learning or model architecture, it contributes to foundational understanding of algorithmic behavior.

Relevance: 7 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure itself.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thought, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
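
As a rough illustration of the JSONL format requested above, the sketch below parses a response with one JSON object per line and ranks the entries. The four keys come from the instruction. The function name `parse_jsonl_response`, the decision to skip malformed lines, and the sort order are assumptions made for this example, not part of the original pipeline. The sample records reuse IDs, comments, and scores from entries 53 and 57 of this digest.

```python
import json

# Keys named in the instruction above.
REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_jsonl_response(text):
    """Parse one JSON object per non-empty line.

    Assumption for this sketch: lines that are not valid JSON, or that
    lack any required key, are skipped rather than raising an error.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON lines in the model output
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record):
            records.append(record)
    return records


# Illustrative response built from two entries of this digest,
# plus one malformed line to show the skipping behaviour.
sample = "\n".join([
    json.dumps({
        "ARXIVID": "2503.00300",
        "COMMENT": "The paper proposes a random feature method for operator learning with theoretical guarantees.",
        "RELEVANCE": 7,
        "NOVELTY": 8,
    }),
    "this line is not JSON",
    json.dumps({
        "ARXIVID": "2503.01314",
        "COMMENT": "The paper extends scaling law phenomena to multiple and kernel regression.",
        "RELEVANCE": 8,
        "NOVELTY": 7,
    }),
])

records = parse_jsonl_response(sample)

# Assumed ordering: relevance first, novelty as tie-breaker, both descending.
ranked = sorted(records, key=lambda r: (r["RELEVANCE"], r["NOVELTY"]), reverse=True)

for r in ranked:
    print(r["ARXIVID"], r["RELEVANCE"], r["NOVELTY"])
```

Keeping one object per line means a single malformed line can be dropped without discarding the rest of the response, which is one plausible reason to prefer JSONL over a single JSON array here.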