Personalized Daily arXiv Papers 4/01/2025

[gpt-4o]	Prompt	Completion	Total
Token	46570	6439	53009
Cost	$0.12	$0.06	$0.18

Total arXiv papers: 798

Total scanned papers: 486

Total relevant papers: 28

Table of contents with paper titles:

1. Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models
   Authors: Zehua Liu, Han Wu, Ruifeng She, Xiaojin Fu, Xiongwei Han, Tao Zhong, Mingxuan Yuan
2. NoProp: Training Neural Networks without Back-propagation or Forward-propagation
   Authors: Qinyu Li, Yee Whye Teh, Razvan Pascanu
3. The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction
   Authors: Yihuai Hong, Dian Zhou, Meng Cao, Lei Yu, Zhijing Jin
4. TransMamba: Flexibly Switching between Transformer and Mamba
   Authors: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
5. Boosting Large Language Models with Mask Fine-Tuning
   Authors: Mingyuan Zhang, Yue Bai, Huan Wang, Yizhou Wang, Qihua Dong, Yun Fu
6. SQuat: Subspace-orthogonal KV Cache Quantization
   Authors: Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
7. Towards Understanding the Optimization Mechanisms in Deep Learning
   Authors: Binchuan Qi, Wei Gong, Li Li
8. MoRE-LLM: Mixture of Rule Experts Guided by a Large Language Model
   Authors: Alexander Koebler, Ingo Thon, Florian Buettner
9. TRA: Better Length Generalisation with Threshold Relative Attention
   Authors: Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao
10. Model Hemorrhage and the Robustness Limits of Large Language Models
   Authors: Ziyang Ma, Zuchao Li, Lefei Zhang, Gui-Song Xia, Bo Du, Liangpei Zhang, Dacheng Tao
11. Mixture of Routers
   Authors: Jia-Chen Zhang, Yu-Jie Xiong, Xi-He Qiu, Chun-Ming Xia, Fei Dai
12. KernelDNA: Dynamic Kernel Sharing via Decoupled Naive Adapters
   Authors: Haiduo Huang, Yadong Zhang, Pengju Ren
13. Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
   Authors: Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier
14. AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
   Authors: Kai Huang, Hao Zou, Bochen Wang, Ye Xi, Zhen Xie, Hao Wang
15. GMapLatent: Geometric Mapping in Latent Space
   Authors: Wei Zeng, Xuebin Chang, Jianghao Su, Xiang Gu, Jian Sun, Zongben Xu
16. Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
   Authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
17. Node Embeddings via Neighbor Embeddings
   Authors: Jan Niklas Böhm, Marius Keute, Alica Guzmán, Sebastian Damrich, Andrew Draganov, Dmitry Kobak
18. RBFleX-NAS: Training-Free Neural Architecture Search Using Radial Basis Function Kernel and Hyperparameter Detection
   Authors: Tomomasa Yamasaki, Zhehui Wang, Tao Luo, Niangjun Chen, Bo Wang
19. Bayesian Predictive Coding
   Authors: Alexander Tschantz, Magnus Koudahl, Hampus Linander, Lancelot Da Costa, Conor Heins, Jeff Beck, Christopher Buckley
20. AuditVotes: A Framework Towards More Deployable Certified Robustness for Graph Neural Networks
   Authors: Yuni Lai, Yulin Zhu, Yixuan Sun, Yulun Wu, Bin Xiao, Gaolei Li, Jianhua Li, Kai Zhou
21. Order Independence With Finetuning
   Authors: Katrina Brown, Reid McIlroy
22. Adaptive Layer-skipping in Pre-trained LLMs
   Authors: Xuan Luo, Weizhi Wang, Xifeng Yan
23. How to safely discard features based on aggregate SHAP values
   Authors: Robi Bhattacharjee, Karolin Frohnapfel, Ulrike von Luxburg
24. An extrapolated and provably convergent algorithm for nonlinear matrix decomposition with the ReLU function
   Authors: Nicolas Gillis, Margherita Porcelli, Giovanni Seraghiti
25. Partial Transportability for Domain Generalization
   Authors: Kasra Jalaldoust, Alexis Bellot, Elias Bareinboim
26. On Geometrical Properties of Text Token Embeddings for Strong Semantic Binding in Text-to-Image Generation
   Authors: Hoigi Seo, Junseo Bang, Haechang Lee, Joohoon Lee, Byung Hyun Lee, Se Young Chun
27. From Colors to Classes: Emergence of Concepts in Vision Transformers
   Authors: Teresa Dorszewski, Lenka Tětková, Robert Jenssen, Lars Kai Hansen, Kristoffer Knutsen Wickstrøm
28. Learning Library Cell Representations in Vector Space
   Authors: Rongjian Liang, Yi-Chen Lu, Wen-Hao Liu, Haoxing Ren

1. Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models

ArXiv ID: 2503.23100

Authors: Zehua Liu, Han Wu, Ruifeng She, Xiaojin Fu, Xiongwei Han, Tao Zhong, Mingxuan Yuan

Abstract: Mixture of Experts (MoE) has emerged as a pivotal architectural paradigm for efficient scaling of Large Language Models (LLMs), operating through selective activation of parameter subsets for each input token. Nevertheless, conventional MoE architectures encounter substantial challenges, including excessive memory utilization and communication overhead during training and inference, primarily attributable to the proliferation of expert modules. In this paper, we introduce Mixture of Latent Experts (MoLE), a novel parameterization methodology that facilitates the mapping of specific experts into a shared latent space. Specifically, all expert operations are systematically decomposed into two principal components: a shared projection into a lower-dimensional latent space, followed by expert-specific transformations with significantly reduced parametric complexity. This factorized approach substantially diminishes parameter count and computational requirements. Beyond the pretraining implementation of the MoLE architecture, we also establish a rigorous mathematical framework for transforming pre-trained MoE models into the MoLE architecture, characterizing the sufficient conditions for optimal factorization and developing a systematic two-phase algorithm for this conversion process. Our comprehensive theoretical analysis demonstrates that MoLE significantly enhances computational efficiency across multiple dimensions while preserving model representational capacity. Empirical evaluations corroborate our theoretical findings, confirming that MoLE achieves performance comparable to standard MoE implementations while substantially reducing resource requirements.

Comment: The paper introduces Mixture of Latent Experts (MoLE), a novel parameterization for MoE architectures, addressing computational efficiency and memory challenges. This aligns closely with the 'Model Architecture' and 'Model Compression' criteria.

Relevance: 10 Novelty: 8
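
Illustrative sketch (not from the paper): the abstract describes each expert as a shared projection into a lower-dimensional latent space followed by a small expert-specific transformation. The NumPy code below is one possible reading of that idea for a single linear expert map. The class name `LatentExperts`, the helper functions, the layer sizes, and the use of a plain weighted sum over pre-selected experts are all assumptions made for illustration. The paper's actual parameterization and its MoE-to-MoLE conversion algorithm may differ.

```python
import numpy as np


def moe_params(n_experts, d_in, d_out):
    """Parameters when every expert owns a full (d_out x d_in) matrix."""
    return n_experts * d_in * d_out


def mole_params(n_experts, d_in, d_out, d_latent):
    """Parameters with one shared (d_latent x d_in) projection plus
    one small (d_out x d_latent) matrix per expert."""
    return d_in * d_latent + n_experts * d_latent * d_out


class LatentExperts:
    """Hypothetical factorized expert layer: y_i = A_i (B x)."""

    def __init__(self, n_experts, d_in, d_out, d_latent, seed=0):
        rng = np.random.default_rng(seed)
        # B: projection shared by all experts.
        self.shared = rng.standard_normal((d_latent, d_in)) / np.sqrt(d_in)
        # A_i: expert-specific maps out of the latent space.
        self.experts = rng.standard_normal(
            (n_experts, d_out, d_latent)
        ) / np.sqrt(d_latent)

    def forward(self, x, expert_ids, gate_weights):
        """Combine the selected experts for one token vector x."""
        z = self.shared @ x  # computed once, reused by every selected expert
        return sum(w * (self.experts[i] @ z)
                   for i, w in zip(expert_ids, gate_weights))


# Example sizes (assumed): 8 experts, 1024 -> 4096, latent width 256.
print(moe_params(8, 1024, 4096))        # 33554432
print(mole_params(8, 1024, 4096, 256))  # 8650752
```

With these assumed sizes the factorized layer stores roughly a quarter of the parameters. The saving depends entirely on how small the latent width can be made without hurting quality, which is the question the paper's analysis addresses.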

2. NoProp: Training Neural Networks without Back-propagation or Forward-propagation

ArXiv ID: 2503.24322

Authors: Qinyu Li, Yee Whye Teh, Razvan Pascanu

Abstract: The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backwards propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods, that does not learn hierarchical representations -- at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use and computationally more efficient compared to other existing back-propagation-free methods. By departing from the traditional gradient based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.

Comment: The paper introduces a gradient-free learning method (NoProp) that departs from traditional backpropagation, offering a novel perspective on training dynamics.

Relevance: 9 Novelty: 9
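
Illustrative sketch (not from the paper): the abstract says each layer independently learns to denoise a noisy version of the target, and that the learned local denoisers are then chained at inference. The toy NumPy code below shows that structure only. Everything specific is an assumption made for illustration: the three-blob dataset, the one-hot label embeddings, the noise schedule `alphas`, the re-noising step at inference, and fitting each block in closed form by least squares rather than with the paper's training procedure.

```python
import numpy as np


def make_data(n, rng):
    """Three well-separated 2-D Gaussian blobs (a toy stand-in for images)."""
    angles = 2 * np.pi * np.arange(3) / 3
    centers = 3.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    y = rng.integers(0, 3, size=n)
    x = centers[y] + 0.3 * rng.standard_normal((n, 2))
    return x, y


def features(x, z):
    """Block input: the data, the current noisy label estimate, and a bias."""
    return np.concatenate([x, z, np.ones((len(x), 1))], axis=1)


def train_blocks(x, y, alphas, rng):
    """Fit one denoiser per block. Each block sees only its own noised
    target, so no signal is propagated between blocks during training."""
    targets = np.eye(3)[y]  # fixed label embeddings (one-hot, assumed)
    blocks = []
    for alpha in alphas:
        noise = rng.standard_normal(targets.shape)
        z = np.sqrt(alpha) * targets + np.sqrt(1 - alpha) * noise
        w, *_ = np.linalg.lstsq(features(x, z), targets, rcond=None)
        blocks.append(w)
    return blocks


def predict(x, blocks, alphas, rng):
    """Start from pure noise and apply the blocks in sequence."""
    z = rng.standard_normal((len(x), 3))
    next_alphas = list(alphas[1:]) + [1.0]  # last step adds no noise
    for w, alpha_next in zip(blocks, next_alphas):
        u_hat = features(x, z) @ w  # this block's estimate of the clean label
        noise = rng.standard_normal(u_hat.shape)
        # Re-noise the estimate to the level the next block was trained on.
        z = np.sqrt(alpha_next) * u_hat + np.sqrt(1 - alpha_next) * noise
    return u_hat.argmax(axis=1)


rng = np.random.default_rng(0)
alphas = [0.0, 0.3, 0.6, 0.9]  # assumed schedule: later blocks see less noise
x_train, y_train = make_data(600, rng)
x_test, y_test = make_data(300, rng)
blocks = train_blocks(x_train, y_train, alphas, rng)
accuracy = (predict(x_test, blocks, alphas, rng) == y_test).mean()
print(f"toy test accuracy: {accuracy:.2f}")
```

Because every block is fit against its own noised copy of the target, the blocks could in principle be trained in any order or in parallel, which is the property the abstract links to more efficient distributed learning.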

3. The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction

ArXiv ID: 2503.23084

Authors: Yihuai Hong, Dian Zhou, Meng Cao, Lei Yu, Zhijing Jin

Abstract: Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic understanding of LLMs' reasoning-memorization dynamics by identifying a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall. These features not only distinguish reasoning tasks from memory-intensive ones but can also be manipulated to causally influence model performance on reasoning tasks. Additionally, we show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation. Our findings offer new insights into the underlying mechanisms of reasoning and memory in LLMs and pave the way for the development of more robust and interpretable generative AI systems.

Comment: The paper provides mechanistic insights into reasoning and memorization dynamics in LLMs, which aligns with the foundational research on LLM behavior and interpretability.