Personalized Daily Arxiv Papers 03/04/2025

[gpt-4o]	Prompt	Completion	Total
Token	89804	13289	103093
Cost	$0.22	$0.13	$0.35
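
The cost row above can be reproduced with simple per-token arithmetic. The sketch below is illustrative only: the per-million-token prices are assumptions (the digest does not state them), chosen because they reproduce the rounded figures in the table.

```python
# Assumed prices in USD per one million tokens; NOT stated in the digest.
PROMPT_PRICE_PER_M = 2.50
COMPLETION_PRICE_PER_M = 10.00


def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a number of tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


# Token counts copied from the table above.
prompt_tokens = 89804
completion_tokens = 13289

prompt_cost = token_cost(prompt_tokens, PROMPT_PRICE_PER_M)
completion_cost = token_cost(completion_tokens, COMPLETION_PRICE_PER_M)

print(f"prompt cost:     ${prompt_cost:.2f}")
print(f"completion cost: ${completion_cost:.2f}")
print(f"total tokens:    {prompt_tokens + completion_tokens}")
```

Under these assumed prices the unrounded total is about $0.357, so the $0.35 shown in the table is consistent with adding the two rounded figures (or with truncation) rather than rounding the exact sum.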

Total ArXiv papers: 1220

Total scanned papers: 618

Total relevant papers: 61

Table of contents with paper titles:

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry Authors: Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba
Efficiently Editing Mixture-of-Experts Models with Compressed Experts Authors: Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla
Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang
Compositional Reasoning with Transformers, RNNs, and Chain of Thought Authors: Gilad Yehudai, Noah Amsel, Joan Bruna
KurTail : Kurtosis-based LLM Quantization Authors: Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, Martino Dazzi
On the Power of Context-Enhanced Learning in LLMs Authors: Xingyu Zhu, Abhishek Panigrahi, Sanjeev Arora
KVCrush: Key value cache size-reduction using similarity in head-behaviour Authors: Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain
LoR2C : Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning Authors: Jiancheng Zhao, Xingda Yu, Yuxiang Zhang, Zhen Yang
Towards Understanding the Benefit of Multitask Representation Learning in Decision Process Authors: Rui Lu, Yang Yue, Andrew Zhao, Simon Du, Gao Huang
Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks Authors: Thang Do, Arnulf Jentzen, Adrian Riekert
CE-U: Cross Entropy Unlearning Authors: Bo Yang
EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection Authors: Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, Xuanjing Huang
From superposition to sparse codes: interpretable representations in neural networks Authors: David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane
Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation Authors: Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models Authors: Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, Tao Chen
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs Authors: Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal
Revisiting Large Language Model Pruning using Neuron Semantic Attribution Authors: Yizhuo Ding, Xinwei Sun, Yanwei Fu, Guosheng Hu
Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace Authors: Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, Xi-He Qiu
Asymptotic Theory of Eigenvectors for Latent Embeddings with Generalized Laplacian Matrices Authors: Jianqing Fan, Yingying Fan, Jinchi Lv, Fan Yang, Diwen Yu
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving Authors: Qihui Zhou, Peiqi Yin, Pengfei Zuo, James Cheng
Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning Authors: Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheongwoong Kang, Jaesik Choi
Projection Head is Secretly an Information Bottleneck Authors: Zhuo Ouyang, Kaiwen Hu, Qi Zhang, Yifei Wang, Yisen Wang
Transformer Meets Twicing: Harnessing Unattended Residual Information Authors: Laziz Abdullaev, Tan Nguyen
Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers Authors: Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering Authors: Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, Liang He
CoSMoEs: Compact Sparse Mixture of Experts Authors: Patrick Huber, Akshat Shrivastava, Ernie Chang, Chinnadhurai Sankar, Ahmed Aly, Adithya Sagar
Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs Authors: Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant Nair, Poulami Das
When Can You Get Away with Low Memory Adam? Authors: Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein
Liger: Linearizing Large Language Models to Gated Recurrent Structures Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
Steering Large Language Model Activations in Sparse Spaces Authors: Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent
Relating Piecewise Linear Kolmogorov Arnold Networks to ReLU Networks Authors: Nandi Schoots, Mattia Jacopo Villani, Niels uit de Bos
Homomorphism Expressivity of Spectral Invariant Graph Neural Networks Authors: Jingchu Gai, Yiheng Du, Bohang Zhang, Haggai Maron, Liwei Wang
Depth-Adaptive Graph Neural Networks via Learnable Bakry-Émery Curvature Authors: Asela Hevapathige, Ahad N. Zehmakan, Qing Wang
Riemann Tensor Neural Networks: Learning Conservative Systems with Physics-Constrained Networks Authors: Anas Jnini, Lorenzo Breschi, Flavio Vella
Understanding Dataset Distillation via Spectral Filtering Authors: Deyu Bo, Songhua Liu, Xinchao Wang
Modeling Arbitrarily Applicable Relational Responding with the Non-Axiomatic Reasoning System: A Machine Psychology Approach Authors: Robert Johansson
Learning-Augmented Frequent Directions Authors: Anders Aamand, Justin Y. Chen, Siddharth Gollapudi, Sandeep Silwal, Hao Wu
Multi-Level Collaboration in Model Merging Authors: Qi Li, Runpeng Yu, Xinchao Wang
How simple can you go? An off-the-shelf transformer approach to molecular dynamics Authors: Max Eissler, Tim Korjakow, Stefan Ganscha, Oliver T. Unke, Klaus-Robert Müller, Stefan Gugler
Improve Representation for Imbalanced Regression through Geometric Constraints Authors: Zijian Dong, Yilei Wu, Chongyao Chen, Yingtian Zou, Yichi Zhang, Juan Helen Zhou
DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems Authors: Minoo Hosseinzadeh, Hana Khamfroush
PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Jialin Li, Min Lin
Generalization Bounds for Equivariant Networks on Markov Data Authors: Hui Li, Zhiguo Wang, Bohui Chen, Li Sheng
SAKE: Steering Activations for Knowledge Editing Authors: Marco Scialanga, Thibault Laugel, Vincent Grari, Marcin Detyniecki
Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo Authors: Hyunsu Kim, Giung Nam, Chulhee Yun, Hongseok Yang, Juho Lee
Cauchy-Schwarz Regularizers Authors: Sueda Taner, Ziyi Wang, Christoph Studer
Convergence of energy-based learning in linear resistive networks Authors: Anne-Men Huijzer, Thomas Chaffey, Bart Besselink, Henk J. van Waarde
Regularization-based Framework for Quantization-, Fault- and Variability-Aware Training Authors: Anmol Biswas, Raghav Singhal, Sivakumar Elangovan, Shreyas Sabnis, Udayan Ganguly
Re-Imagining Multimodal Instruction Tuning: A Representation View Authors: Yiyang Liu, James Chenhao Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han
Constraining Sequential Model Editing with Editing Anchor Compression Authors: Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Ningyu Zhang, Jia-Chen Gu
Unlocking Efficient, Scalable, and Continual Knowledge Editing with Basis-Level Representation Fine-Tuning Authors: Tianci Liu, Ruirui Li, Yunzhe Qi, Hui Liu, Xianfeng Tang, Tianqi Zheng, Qingyu Yin, Monica Xiao Cheng, Jun Huan, Haoyu Wang, Jing Gao
Personalize Your LLM: Fake it then Align it Authors: Yijing Zhang, Dyah Adila, Changho Shin, Frederic Sala
Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches Authors: Yifang Chen, Xuyang Guo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation Authors: Yanzhou Pan, Huawei Lin, Yide Ran, Jiamin Chen, Xiaodong Yu, Weijie Zhao, Denghui Zhang, Zhaozhuo Xu
Hypergraph Foundation Model Authors: Yifan Feng, Shiquan Liu, Xiangmin Han, Shaoyi Du, Zongze Wu, Han Hu, Yue Gao
Architectural and Inferential Inductive Biases For Exchangeable Sequence Modeling Authors: Daksh Mittal, Ang Li, Tzu-Ching Yen, Daniel Guetta, Hongseok Namkoong
Cauchy Random Features for Operator Learning in Sobolev Space Authors: Chunyang Liao, Deanna Needell, Hayden Schaeffer
AMUN: Adversarial Machine UNlearning Authors: Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran
Learning Stochastic Dynamical Systems with Structured Noise Authors: Ziheng Guo, James Greene, Ming Zhong
Input Specific Neural Networks Authors: Asghar A. Jadoon, D. Thomas Seidl, Reese E. Jones, Jan N. Fuhg
On the Saturation Effects of Spectral Algorithms in Large Dimensions Authors: Weihao Lu, Haobo Zhang, Yicheng Li, Qian Lin

1. Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

ArXiv ID: 2503.01822

Authors: Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Abstract: Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable -- switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations, and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscore the need for architecture-specific choices in model interpretability. Overall, we argue an SAE does not just reveal concepts -- it determines what can be seen at all.

Comment: The paper provides a theoretical framework for sparse autoencoders, directly addressing representation learning and the biases in concept detection.

Relevance: 10 Novelty: 9
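
To make the abstract's point that "different SAEs are not interchangeable" more concrete, here is a minimal sketch of a sparse autoencoder forward pass with a swappable encoder nonlinearity. It is not the paper's new SAE or its bilevel formulation: the ReLU and top-k activations are two commonly used choices, and every name, dimension, and the random initialization are assumptions made for illustration.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


def top_k(vec, k):
    """Keep the k largest entries and zero out the rest."""
    keep = set(sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, activation):
    """Encode x into a sparse code z, then decode z back to a reconstruction."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    z = activation(pre)
    x_hat = [r + b for r, b in zip(matvec(w_dec, z), b_dec)]
    return z, x_hat


def sae_loss(x, x_hat, z, l1_coeff=0.01):
    """Squared reconstruction error plus an L1 sparsity penalty on the code."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(v) for v in z)


# Hypothetical toy setup: 4-dimensional activations, 8 latent units.
rng = random.Random(0)
d_model, d_latent = 4, 8
w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_latent)]
w_dec = [[rng.gauss(0, 0.5) for _ in range(d_latent)] for _ in range(d_model)]
b_enc = [0.0] * d_latent
b_dec = [0.0] * d_model
x = [rng.gauss(0, 1) for _ in range(d_model)]

# Same weights, two different encoder nonlinearities.
z_relu, xhat_relu = sae_forward(x, w_enc, b_enc, w_dec, b_dec, relu)
z_topk, xhat_topk = sae_forward(
    x, w_enc, b_enc, w_dec, b_dec, lambda p: top_k(relu(p), 2)
)

print("active latents (ReLU):", sum(v > 0 for v in z_relu))
print("active latents (TopK):", sum(v > 0 for v in z_topk))
print("loss (ReLU):", sae_loss(x, xhat_relu, z_relu))
```

Even with identical weights, the two activations select different sets of active latents for the same input, which is a small-scale illustration of how the encoder's form constrains what the code can represent.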

2. Efficiently Editing Mixture-of-Experts Models with Compressed Experts

ArXiv ID: 2503.00634

Authors: Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla

Abstract: Mixture-of-Experts (MoE) models have become a key approach for scaling large language models efficiently by activating only a subset of experts during training and inference. Typically, the number of activated experts presents a trade-off: fewer experts reduce computational costs, while more experts improve performance. Recent studies reveal that not all activated experts contribute equally to model performance, with some providing minimal utility, particularly when finetuning pretrained MoE models for specialized downstream tasks. The co-existence of significant and redundant parameters in experts provides us an opportunity to reduce the number of activated experts while maintaining model performance. In this work, we propose the concept of compressed experts, lightweight modules that serve as compact representations of full experts. Our approach preserves the most important experts while replacing other auxiliary activated experts with compressed experts. The reduction of active parameters significantly lowers inference costs while achieving comparable performance. Extensive experiments on models including Phi-MoE and OLMoE demonstrate that compressed experts recover over 90% of full expert performance across various tasks while reducing active parameters by more than 30% and saving 20% in inference costs. This approach enables efficient deployment of MoE models in resource-constrained settings and facilitates scaling to larger models with manageable overhead. Our code is available at https://github.com/yifei-he/Compressed-Experts.

Comment: The paper introduces compressed experts for Mixture-of-Experts (MoE) models, reducing inference costs while maintaining performance. This directly aligns with the 'Model Architecture' and 'Model Compression' criteria.

Relevance: 10 Novelty: 8
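
The idea of keeping the most important routed experts and swapping the remaining activated ones for a cheap stand-in can be sketched as follows. This is a toy illustration, not the authors' implementation: the abstract does not say what form a compressed expert takes, so the constant per-expert vector used here is a hypothetical placeholder, as are all function names, parameters, and numbers.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_expert(weight, x):
    """Toy 'full' expert: a dense linear map (list of rows) applied to x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def moe_layer(x, router_scores, expert_weights, compressed_vectors,
              top_k=2, n_full=1):
    """Top-k routing where only the n_full highest-ranked experts run in full.

    The remaining routed experts are replaced by a hypothetical compressed
    stand-in (a constant vector), so they cost no matrix multiply.
    """
    gates = softmax(router_scores)
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    ranked = ranked[:top_k]
    norm = sum(gates[i] for i in ranked)  # renormalize over routed experts

    out = [0.0] * len(x)
    full_calls = 0
    for rank, i in enumerate(ranked):
        if rank < n_full:
            y = full_expert(expert_weights[i], x)
            full_calls += 1
        else:
            y = compressed_vectors[i]  # assumption: stand-in for a compressed expert
        g = gates[i] / norm
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, full_calls


# Hypothetical toy inputs: 3 experts acting on 2-dimensional tokens.
x = [1.0, 2.0]
router_scores = [2.0, 1.0, 0.1]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.0, 1.0], [1.0, 0.0]],
]
compressed_vectors = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

out, full_calls = moe_layer(x, router_scores, expert_weights, compressed_vectors)
print("output:", out)
print("full experts evaluated:", full_calls)
```

With `top_k=2` and `n_full=1`, two experts are routed but only one dense expert is evaluated, which is the kind of active-parameter saving the abstract describes.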

3. Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning

ArXiv ID: 2503.00639

Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang

Abstract: Disentangled representation learning aims to uncover latent variables underlying the observed data, and generally speaking, rather strong assumptions are needed to ensure identifiability. Some approaches rely on sufficient changes in the distribution of latent variables indicated by auxiliary variables such as domain indices, but acquiring enough domains is often challenging. Alternative approaches exploit structural sparsity assumptions on the mixing procedure, but such constraints are usually (partially) violated in practice. Interestingly, we find that these two seemingly unrelated assumptions can actually complement each other to achieve identifiability. Specifically, when conditioned on auxiliary variables, the sparse mixing procedure assumption provides structural constraints on the mapping from estimated to true latent variables and hence compensates for potentially insufficient distribution changes. Building on this insight, we propose an identifiability theory with less restrictive constraints regarding distribution changes and the sparse mixing procedure, enhancing applicability to real-world scenarios. Additionally, we develop an estimation framework incorporating a domain encoding network and a sparse mixing constraint and provide two implementations based on variational autoencoders and generative adversarial networks, respectively. Experimental results on synthetic and real-world datasets support our theoretical results.

Comment: The paper proposes a novel framework combining sparse mixing and distributional changes for disentangled representation learning, which directly aligns with foundational research in representation learning.

Relevance: 9 Novelty: 9

4. Compositional Reasoning with Transformers, RNNs, and Chain of Thought

ArXiv ID: 2503.01544

Authors: Gilad Yehudai, Noah Amsel, Joan Bruna

Abstract: We study and compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of problems we term Compositional Reasoning Questions (CRQ). This family captures problems like evaluating Boolean formulas and multi-step word problems. Under standard hardness assumptions from circuit complexity and communication complexity, we prove that none of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We also provide a construction for each architecture that solves CRQs. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. (Otherwise, a linear dimension is necessary.) For transformers with chain of thought, our construction uses $n$ CoT tokens. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.

Comment: The paper compares the expressive power of transformers, RNNs, and chain-of-thought methods for compositional reasoning, providing theoretical insights into model capabilities. This aligns with the interest in analyzing architectures.