Personalized Daily Arxiv Papers 02/20/2025

gpt-4o	Prompt	Completion	Total
Token	45328	6639	51967
Cost	$0.11	$0.07	$0.18

Total ArXiv papers: 521

Total scanned papers: 296

Total relevant papers: 43

Table of contents with paper titles:

MoM: Linear Sequence Modeling with Mixture-of-Memories Authors: Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng
PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models Authors: Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang
NestQuant: Nested Lattice Quantization for Matrix Products and LLMs Authors: Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy
The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent Authors: Yatin Dandi, Luca Pesce, Lenka Zdeborová, Florent Krzakala
BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference Authors: Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath
LESA: Learnable LLM Layer Scaling-Up Authors: Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao
MoBA: Mixture of Block Attention for Long-Context LLMs Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
Breaking the bonds of generative artificial intelligence by minimizing the maximum entropy Authors: Mattia Miotto, Lorenzo Monacelli
How Do LLMs Perform Two-Hop Reasoning in Context? Authors: Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell
Unraveling the Localized Latents: Learning Stratified Manifold Structures in LLM Embedding Space with Sparse Mixture-of-Experts Authors: Xin Li, Anand Sarwate
LSR-Adapt: Ultra-Efficient Parameter Tuning with Matrix Low Separation Rank Kernel Adaptation Authors: Xin Li, Anand Sarwate
RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals Authors: Jaemu Heo, Eldor Fozilov, Hyunmin Song, Taehwan Kim
Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou
Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference Authors: Qingfa Xiao, Jiachuan Wang, Haoyang Li, Cheng Deng, Jiaqi Tang, Shuangyin Li, Yongqi Zhang, Jun Wang, Lei Chen
On the Duality between Gradient Transformations and Adapters Authors: Lucas Torroba-Hennigen, Hunter Lang, Han Guo, Yoon Kim
Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization Authors: Or Raphael Bidusa, Shaul Markovitch
The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding? Authors: Yutao Sun, Mingshuai Chen, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Jianwei Yin
Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment Authors: Yuze Zhao, Tianyun Ji, Wenjun Feng, Zhenya Huang, Qi Liu, Zhiding Liu, Yixiao Ma, Kai Zhang, Enhong Chen
Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression Authors: Jingfeng Wu, Peter Bartlett, Matus Telgarsky, Bin Yu
Neural Attention Search Authors: Difan Deng, Marius Lindauer
ETS: Efficient Tree Search for Inference-Time Scaling Authors: Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
NVR: Vector Runahead on NPUs for Sparse Memory Access Authors: Hui Wang, Zhengpeng Zhao, Jing Wang, Yushu Du, Yuan Cheng, Bing Guo, He Xiao, Chenhao Ma, Xiaomeng Han, Dean You, Jiapeng Guan, Ran Wei, Dawei Yang, Zhe Jiang
Learning Is a Kan Extension Authors: Matthew Pugh, Jo Grundy, Corina Cirstea, Nick Harris
Language Models Can Predict Their Own Behavior Authors: Dhananjay Ashok, Jonathan May
What are Models Thinking about? Understanding Large Language Model Hallucinations "Psychology" through Model Inner State Analysis Authors: Peiran Wang, Yang Liu, Yunfei Lu, Jue Hong, Ye Wu
Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis Authors: Jiaqi Zhao, Ming Wang, Miao Zhang, Yuzhang Shang, Xuebo Liu, Yaowei Wang, Min Zhang, Liqiang Nie
Towards Invariance to Node Identifiers in Graph Neural Networks Authors: Maya Bechler-Speicher, Moshe Eliasof, Carola-Bibiane Schonlieb, Ran Gilad-Bachrach, Amir Globerson
Refining embeddings with fill-tuning: data-efficient generalised performance improvements for materials foundation models Authors: Matthew P. Wilson, Edward O. Pyzer-Knapp, Nicolas Galichet, Luke Dicks
Random Forest Autoencoders for Guided Representation Learning Authors: Adrien Aumon, Shuang Ni, Myriam Lizotte, Guy Wolf, Kevin R. Moon, Jake S. Rhodes
Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models Authors: Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Yue Xing, Jiliang Tang, Qi He
Generalization error bound for denoising score matching under relaxed manifold assumption Authors: Konstantin Yakovlev, Nikita Puchkin
Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning Authors: Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, Xiaoxing Ma
Mixup Regularization: A Probabilistic Perspective Authors: Yousef El-Laham, Niccolo Dalmasso, Svitlana Vyetrenko, Vamsi Potluru, Manuela Veloso
SPEX: Scaling Feature Interaction Explanations for LLMs Authors: Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Kannan Ramchandran, Bin Yu
How Expressive are Knowledge Graph Foundation Models? Authors: Xingyue Huang, Pablo Barceló, Michael M. Bronstein, İsmail İlkan Ceylan, Mikhail Galkin, Juan L Reutter, Miguel Romero Orth
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length Authors: Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, Afshin Dehghan
Task Shift: From Classification to Regression in Overparameterized Linear Models Authors: Tyler LaBonte, Kuo-Wei Lai, Vidya Muthukumar
Revisiting Privacy, Utility, and Efficiency Trade-offs when Fine-Tuning Large Language Models Authors: Soumi Das, Camila Kolling, Mohammad Aflah Khan, Mahsa Amani, Bishwamittra Ghosh, Qinyuan Wu, Till Speicher, Krishna P. Gummadi
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images Authors: Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, Nick Haber
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region Authors: Chak Tou Leong, Qingyu Yin, Jian Wang, Wenjie Li
Flow-based generative models as iterative algorithms in probability space Authors: Yao Xie, Xiuyuan Cheng
The impact of conformer quality on learned representations of molecular conformer ensembles Authors: Keir Adams, Connor W. Coley
RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision Authors: Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang

1. MoM: Linear Sequence Modeling with Mixture-of-Memories

ArXiv ID: 2502.13685

Authors: Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng

Abstract: Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain linear complexity during training and constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.

Comment: The paper proposes Mixture-of-Memories (MoM), a novel architecture for linear sequence modeling inspired by neuroscience, which aligns with the model architecture criterion and introduces a new paradigm.

Relevance: 10 Novelty: 9
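
Illustrative sketch (not the authors' implementation): the abstract describes a router that sends each token to a subset of independent memory states. The toy NumPy code below shows one way this could look. The top-k routing, the softmax mixing weights, the outer-product (linear attention style) memory update, and all names such as mom_forward and Wr are assumptions made for illustration only.

```python
import numpy as np

def mom_forward(x, Wq, Wk, Wv, Wr, top_k=2):
    """Toy mixture-of-memories pass over a sequence x of shape (T, d).

    Assumed design: each token is routed to its top_k memories, writes an
    outer-product update into them, and reads from a weighted mix of them.
    """
    T = x.shape[0]
    n_mem = Wr.shape[1]
    d_k, d_v = Wk.shape[1], Wv.shape[1]
    memories = np.zeros((n_mem, d_k, d_v))  # fixed size, independent of T
    outputs = np.zeros((T, d_v))
    for t in range(T):
        q, k, v = x[t] @ Wq, x[t] @ Wk, x[t] @ Wv
        logits = x[t] @ Wr                      # router scores, one per memory
        top = np.argsort(logits)[-top_k:]       # indices of the selected memories
        weights = np.exp(logits[top] - logits[top].max())
        weights /= weights.sum()                # softmax over the selected memories
        memories[top] += np.outer(k, v)         # write only to the selected memories
        mixed = np.tensordot(weights, memories[top], axes=1)  # (d_k, d_v)
        outputs[t] = q @ mixed                  # read from the mixed memory
    return outputs, memories

rng = np.random.default_rng(0)
d, d_k, d_v, n_mem, T = 8, 4, 4, 4, 16
x = rng.normal(size=(T, d))
Wq, Wk = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
Wv, Wr = rng.normal(size=(d, d_v)), rng.normal(size=(d, n_mem))
outputs, memories = mom_forward(x, Wq, Wk, Wv, Wr)
print(outputs.shape, memories.shape)
```

The per-token work depends only on top_k and the memory size, so the cost of processing a sequence grows linearly with its length while the state kept for decoding stays constant, which is the property the abstract emphasizes.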

2. PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

ArXiv ID: 2502.13179

Authors: Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang

Abstract: Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods utilize a mixed-precision scheme by leveraging an unstructured fine-grained mask to explicitly distinguish salient weights, which introduces an extra 1 bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask, with a negligible additional 0.0002 bits per weight, based on input activations from the perspective of reducing the upper bound of quantization error to allocate corresponding salient weight channels to 4-bit. For binarization of non-salient channels, an efficient block-wise scaling factors optimization framework is then presented to take implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty in per-channel extremely low-bit PTQ. Extensive experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Code is available at https://github.com/zjq0455/PTQ1.61.

Comment: The paper introduces a novel post-training quantization method for LLMs, achieving extremely low-bit quantization (1.61-bit) with innovative preprocessing and optimization techniques. This directly aligns with the 'Model Compression' criterion, particularly in quantization.

Relevance: 10 Novelty: 9
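
Illustrative arithmetic (an assumption-laden sketch, not taken from the paper): the abstract contrasts an unstructured mask costing 1 bit or more per weight with a one-dimensional structured mask costing about 0.0002 bits per weight. If the structured mask stores one bit per channel of a square weight matrix, the overhead is one over the matrix dimension. The 4096 x 4096 matrix size and the function name below are hypothetical choices used only to show the order of magnitude.

```python
def mask_bits_per_weight(rows, cols, structured):
    """Storage overhead of a salient-weight mask, in bits per weight.

    structured=True  : one mask bit per channel (assumed: one per column)
    structured=False : one mask bit per individual weight
    """
    n_weights = rows * cols
    mask_bits = cols if structured else n_weights
    return mask_bits / n_weights

# Hypothetical 4096 x 4096 linear layer.
structured_cost = mask_bits_per_weight(4096, 4096, structured=True)
unstructured_cost = mask_bits_per_weight(4096, 4096, structured=False)
print(f"structured:   {structured_cost:.6f} bits/weight")
print(f"unstructured: {unstructured_cost:.6f} bits/weight")
```

Under these assumptions the structured mask costs 1/4096 of a bit per weight, which is the same order of magnitude as the 0.0002 figure quoted in the abstract.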

3. NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

ArXiv ID: 2502.09720

Authors: Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy

Abstract: Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent work has mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than a 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14) compared to Meta's state-of-the-art SpinQuant (perplexity 7.3). Comparisons on various LLM evaluation benchmarks also show a reduction in performance degradation induced by quantization.

Comment: The paper introduces a novel quantization scheme (NestQuant) for LLMs, achieving state-of-the-art results in low-bit quantization. This directly aligns with the 'Model Compression' criterion, particularly in quantization.

Relevance: 10 Novelty: 9
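
Worked arithmetic for the "more than 55%" claim, using only the three perplexity values quoted in the abstract (6.14 unquantized, 7.3 SpinQuant, 6.6 NestQuant). The helper name gap_reduction is introduced here for illustration.

```python
def gap_reduction(ppl_full_precision, ppl_baseline, ppl_new):
    """Fraction by which the new method shrinks the perplexity gap to full precision."""
    gap_baseline = ppl_baseline - ppl_full_precision
    gap_new = ppl_new - ppl_full_precision
    return 1.0 - gap_new / gap_baseline

# Values quoted in the abstract (Llama-3-8B, wikitext2, 4-bit).
reduction = gap_reduction(6.14, 7.3, 6.6)
print(f"gap reduction: {reduction:.1%}")
```

The gap shrinks from 1.16 to 0.46 perplexity points, a reduction of roughly 60%, which is consistent with the abstract's "more than 55%".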

4. The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent

ArXiv ID: 2502.13961

Authors: Yatin Dandi, Luca Pesce, Lenka Zdeborová, Florent Krzakala

Abstract: Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. While the study of multi-index models with Gaussian data in high dimensions has provided analytical insights into the benefits of GD-trained neural networks over kernels, the role of depth in improving sample complexity and generalization in GD-trained networks remains poorly understood. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically less samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms. These findings open the way to further quantitative studies of the crucial role of depth in learning hierarchical structures with deep networks.

Comment: The paper provides theoretical insights into the computational advantages of depth in neural networks, aligning closely with representation learning and training dynamics.

Relevance: 10 Novelty: 9

5. BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference

ArXiv ID: 2502.13176

Authors: Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath

Abstract: In Large Language Model (LLM) inference, Key-Value (KV) caches (KV-caches) are essential for reducing time complexity. However, they result in a linear increase in GPU memory as the context length grows. While recent work explores KV-cache eviction and compression policies to reduce memory usage, they often consider uniform KV-caches across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method to allocate optimal memory for individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while keeping baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.

Comment: The paper introduces BaKlaVa, a method for optimizing KV-cache memory allocation in LLMs, which directly addresses model compression and efficiency in LLM inference.

Relevance: 10 Novelty: 8
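
Illustrative sketch (not BaKlaVa's actual algorithm): the abstract says each KV-cache receives a memory budget based on an estimated importance score from one-time profiling. One simple way to turn scores into budgets is proportional allocation with a per-cache floor. The proportional rule, the floor, the largest-remainder rounding, and the name allocate_kv_budgets are all assumptions for illustration.

```python
def allocate_kv_budgets(importance, total_budget, min_per_cache=1):
    """Split a total token budget across KV-caches in proportion to importance.

    importance    : non-negative score per KV-cache (assumed to come from profiling)
    total_budget  : total number of cached tokens allowed across all caches
    min_per_cache : floor so that no cache is starved completely
    """
    n = len(importance)
    if total_budget < n * min_per_cache:
        raise ValueError("total_budget is too small for the per-cache floor")
    remaining = total_budget - n * min_per_cache
    total_importance = sum(importance)
    if total_importance == 0:
        shares = [remaining / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [remaining * s / total_importance for s in importance]
    budgets = [min_per_cache + int(share) for share in shares]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = max(0, total_budget - sum(budgets))
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets

# Four hypothetical caches sharing 80 tokens; a uniform policy would give 20 each.
budgets = allocate_kv_budgets([4.0, 2.0, 1.0, 1.0], total_budget=80, min_per_cache=4)
print(budgets)
```

Compared with a uniform split, the most important cache keeps more tokens at the same total memory, which is the kind of non-uniform allocation the abstract argues for.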

6. LESA: Learnable LLM Layer Scaling-Up

ArXiv ID: 2502.13794

Authors: Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao

Abstract: Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.

Comment: LESA proposes a learnable method for scaling up LLM layers, which directly addresses architectural innovations and efficiency in LLM training.

Relevance: 10 Novelty: 8

7. MoBA: Mixture of Block Attention for Long-Context LLMs

ArXiv ID: 2502.13189

Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu

Abstract: Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.

Comment: The paper introduces Mixture of Block Attention (MoBA), which applies Mixture of Experts (MoE) principles to attention mechanisms in LLMs. This aligns closely with the 'Model Architecture' and 'Large Language Models' criteria, focusing on architectural innovation and efficiency improvements.

Relevance: 10 Novelty: 8
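
Illustrative sketch (not the released MoBA code): the abstract describes applying MoE-style routing to attention so that a query attends only to selected blocks of the context. The toy NumPy code below handles a single query over a cached context. Scoring each block by the query's dot product with the mean of its keys, the top-k selection, and the name block_sparse_attention are assumptions for illustration; causal masking and multi-head handling are omitted.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size, top_k):
    """Attend from one query to the top_k highest-scoring key/value blocks.

    q: (d,), K: (T, d), V: (T, d_v). Returns the output and the chosen block ids.
    """
    T, d = K.shape
    n_blocks = (T + block_size - 1) // block_size
    # Gate: score each block by q . mean(keys in the block) (assumed scoring rule).
    scores = np.array([
        q @ K[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    chosen = np.sort(np.argsort(scores)[-top_k:])
    # Gather the token positions covered by the chosen blocks.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, T)) for b in chosen
    ])
    logits = K[idx] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over the selected tokens only
    return weights @ V[idx], chosen

rng = np.random.default_rng(0)
T, d = 64, 16
q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))

sparse_out, chosen = block_sparse_attention(q, K, V, block_size=8, top_k=2)
full_out, _ = block_sparse_attention(q, K, V, block_size=8, top_k=8)

# Reference dense attention for comparison.
dense_logits = K @ q / np.sqrt(d)
dense_weights = np.exp(dense_logits - dense_logits.max())
dense_out = (dense_weights / dense_weights.sum()) @ V
print(chosen, np.allclose(full_out, dense_out))
```

Setting top_k to the number of blocks recovers ordinary dense attention, which mirrors the abstract's point about switching between full and sparse attention.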

8. Breaking the bonds of generative artificial intelligence by minimizing the maximum entropy

ArXiv ID: 2502.13287

Authors: Mattia Miotto, Lorenzo Monacelli

Abstract: The emergence of generative artificial intelligence (GenAI), comprising large language models, text-to-image generators, and AI algorithms for medical drug and material design, has had a transformative impact on society. However, despite an initial exponential growth surpassing Moore's law, progress is now plateauing, suggesting we are approaching the limits of current technology. Indeed, these models are notoriously data-hungry, prone to overfitting, and challenging to direct during the generative process, hampering their effective professional employment. To cope with these limitations, we propose a paradigm shift in GenAI by introducing an ab initio method based on the minimal maximum entropy principle. Our approach does not fit the data. Instead, it compresses information in the training set by finding a latent representation parameterized by arbitrary nonlinear functions, such as neural networks. The result is a general physics-driven model, which is data-efficient, resistant to overfitting, and flexible, permitting control and influence over the generative process. Benchmarking shows that our method outperforms variational autoencoders (VAEs) with similar neural architectures, particularly on undersampled datasets. We demonstrate the method's effectiveness in generating images, even with limited training data, and its unprecedented capability to customize the generation process a posteriori without the need for any fine-tuning or retraining.

Comment: This paper introduces a new paradigm for generative AI based on the minimal maximum entropy principle, which aligns with foundational research in representation learning and generative paradigms.

Relevance: 9 Novelty: 9

9. How Do LLMs Perform Two-Hop Reasoning in Context?

ArXiv ID: 2502.13913

Authors: Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell

Abstract: "Socrates is human. All humans are mortal. Therefore, Socrates is mortal." This classical example demonstrates two-hop reasoning, where a conclusion logically follows from two connected premises. While transformer-based Large Language Models (LLMs) can perform two-hop reasoning, they tend to collapse to random guessing when faced with distracting premises. To understand the underlying mechanism, we train a three-layer transformer on synthetic two-hop reasoning tasks. The training dynamics show two stages: a slow learning phase, where the 3-layer transformer performs random guessing like LLMs, followed by an abrupt phase transition, where the 3-layer transformer suddenly reaches 100% accuracy. Through reverse engineering, we explain the inner mechanisms for how models learn to randomly guess between distractions initially, and how they learn to ignore distractions eventually. We further propose a three-parameter model that supports the causal claims for the mechanisms to the training dynamics of the transformer. Finally, experiments on LLMs suggest that the discovered mechanisms generalize across scales. Our methodologies provide new perspectives for scientific understandings of LLMs and our findings provide new insights into how reasoning emerges during training.

Comment: This paper provides theoretical insights into the training dynamics of transformers for two-hop reasoning, which aligns with understanding training dynamics and interpretability in LLMs.