Personalized Daily Arxiv Papers 02/27/2025

[gpt-4o]	Prompt	Completion	Total
Token	48501	7042	55543
Cost	$0.12	$0.07	$0.19

Total ArXiv papers: 565

Total scanned papers: 343

Total relevant papers: 30

Table of contents with paper titles:

Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
CAMEx: Curvature-aware Merging of Experts Authors: Dung V. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Rachel S. Y. Teo, Tan M. Nguyen, Linh Duy Tran
General Reasoning Requires Learning to Reason from the Get-go Authors: Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal
FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge Authors: Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung
Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing Authors: Akshat Gupta, Christine Fang, Atahan Ozdemir, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli
HDEE: Heterogeneous Domain Expert Ensemble Authors: Oğuzhan Ersoy, Jari Kolehmainen, Gabriel Passamani Andrade
Consistent Amortized Clustering via Generative Flow Networks Authors: Irit Chelly, Roy Uziel, Oren Freifeld, Ari Pakman
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases Authors: Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, Tal Linzen
Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond Authors: Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, Kilian Q. Weinberger
Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential Authors: Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training Authors: Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
(Mis)Fitting: A Survey of Scaling Laws Authors: Margaret Li, Sneha Kudugunta, Luke Zettlemoyer
A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops Authors: Shi Fu, Yingjie Wang, Yuzhu Chen, Xinmei Tian, Dacheng Tao
On Pruning State-Space LLMs Authors: Tamer Ghattas, Michael Hassid, Roy Schwartz
Applications of Statistical Field Theory in Deep Learning Authors: Zohar Ringel, Noa Rubin, Edo Mor, Moritz Helias, Inbar Seroussi
Optimal Approximate Matrix Multiplication over Sliding Windows Authors: Ziqi Yao, Mingsong Chen, Cheng Chen
INFO-SEDD: Continuous Time Markov Chains as Scalable Information Metrics Estimators Authors: Alberto Foresti, Giulio Franzese, Pietro Michiardi
Optimal Stochastic Trace Estimation in Generative Modeling Authors: Xinyang Liu, Hengrong Du, Wei Deng, Ruqi Zhang
END: Early Noise Dropping for Efficient and Effective Context Denoising Authors: Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin
Sliding Window Attention Training for Efficient Large Language Models Authors: Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao
Revisiting Convolution Architecture in the Realm of DNA Foundation Models Authors: Yu Bo, Weian Mao, Yanjun Shao, Weiqiang Bai, Peng Ye, Xinzhu Ma, Junbo Zhao, Hao Chen, Chunhua Shen
Invariance Pair-Guided Learning: Enhancing Robustness in Neural Networks Authors: Martin Surner, Abdelmajid Khelil, Ludwig Bothmann
FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression Authors: Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation Authors: Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu
Investigating Generalization of One-shot LLM Steering Vectors Authors: Jacob Dunefsky, Arman Cohan
MixLLM: Dynamic Routing in Mixed Large Language Models Authors: Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen
Mechanistic Understanding of Language Models in Syntactic Code Completion Authors: Samuel Miller, Daking Rai, Ziyu Yao
Blending Optimal Control and Biologically Plausible Learning for Noise-Robust Physical Neural Networks Authors: Satoshi Sunada, Tomoaki Niiyama, Kazutaka Kanno, Rin Nogami, André Röhm, Takato Awano, Atsushi Uchida
Binary Neural Networks for Large Language Model: A Survey Authors: Liangdong Liu, Zhitong Zheng, Cong Wang, Tianhuang Su, Zhenyu Yang
Set and functional prediction: randomness, exchangeability, and conformal Authors: Vladimir Vovk

1. Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

ArXiv ID: 2502.19261

Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki

Abstract: The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.

Comment: Proposes Drop-Upcycling for training sparse Mixture of Experts (MoE) models, directly aligning with the 'Model Architecture' and 'Model Compression' criteria.

Relevance: 10 Novelty: 9
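
Illustration: the Drop-Upcycling abstract above describes copying a pre-trained dense model into each expert while re-initializing part of the weights. The sketch below is one possible reading of that idea, not the authors' implementation. The function name drop_upcycle, the choice to re-initialize whole hidden units of the feed-forward layer, the ratio parameter, and the use of the dense weights' mean and standard deviation for sampling are all assumptions made for illustration.

```python
import numpy as np


def drop_upcycle(w_up, w_down, n_experts, ratio, seed=0):
    """Build n_experts FFN copies from one dense FFN, partially re-initialized.

    w_up:   (d_model, d_ff) dense up-projection (hypothetical layout)
    w_down: (d_ff, d_model) dense down-projection (hypothetical layout)
    ratio:  fraction of the d_ff hidden units re-initialized per expert
    Returns a list of (up, down, reinit_indices) tuples, one per expert.
    """
    rng = np.random.default_rng(seed)
    d_model, d_ff = w_up.shape
    n_drop = int(round(ratio * d_ff))
    experts = []
    for _ in range(n_experts):
        # Start every expert from the dense weights (plain upcycling).
        up, down = w_up.copy(), w_down.copy()
        # Each expert draws its own subset of hidden units to re-initialize.
        idx = rng.choice(d_ff, size=n_drop, replace=False)
        # Assumption: sample new values from a normal distribution that
        # matches the mean and standard deviation of the dense weights.
        up[:, idx] = rng.normal(w_up.mean(), w_up.std(), size=(d_model, n_drop))
        down[idx, :] = rng.normal(
            w_down.mean(), w_down.std(), size=(n_drop, w_down.shape[1])
        )
        experts.append((up, down, idx))
    return experts
```

Because each expert re-initializes a different random subset, the experts share most of the dense model's weights but no longer start out identical, which is the property the abstract credits with promoting expert specialization.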

2. CAMEx: Curvature-aware Merging of Experts

ArXiv ID: 2502.18821

Authors: Dung V. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Rachel S. Y. Teo, Tan M. Nguyen, Linh Duy Tran

Abstract: Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach enhances both pre-training and fine-tuning, resulting in better optimization trajectories and improved generalization without the substantial memory overhead typically associated with curvature-aware methods. Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method.

Comment: The paper introduces CAMEx, a novel curvature-aware merging protocol for Mixture-of-Experts (MoE) models, which aligns closely with the 'Model Architecture' and 'Representation Learning' criteria. It provides theoretical and empirical insights into expert merging, improving optimization and generalization.

Relevance: 10 Novelty: 8
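
Illustration: the CAMEx abstract above contrasts Euclidean expert merging with curvature-aware methods that approximate the Fisher Information Matrix. The sketch below shows that contrast in its simplest generic form, a weighted average versus a diagonal-Fisher-weighted average. It is not the CAMEx protocol, whose natural-gradient formulation is not given in the abstract. The function names and the assumption that a per-parameter diagonal Fisher estimate is already available for each expert are hypothetical.

```python
import numpy as np


def euclidean_merge(experts, weights):
    """Merge expert parameters with a plain weighted average (flat geometry)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * e for w, e in zip(weights, experts))


def fisher_merge(experts, fishers, eps=1e-8):
    """Merge expert parameters weighted per coordinate by a diagonal Fisher.

    fishers[i] has the same shape as experts[i] and holds a non-negative
    importance estimate for each parameter (assumed to be precomputed).
    """
    numerator = sum(f * e for f, e in zip(fishers, experts))
    denominator = sum(fishers) + eps  # eps avoids division by zero
    return numerator / denominator


experts = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
fishers = [np.array([1.0, 3.0]), np.array([1.0, 1.0])]
flat = euclidean_merge(experts, [1.0, 1.0])
curved = fisher_merge(experts, fishers)
```

In the second coordinate the first expert has the larger Fisher value, so the Fisher-weighted result is pulled toward that expert, while the Euclidean average treats both coordinates the same. Storing one Fisher estimate per expert is the memory overhead the abstract says CAMEx is designed to avoid.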

3. General Reasoning Requires Learning to Reason from the Get-go

ArXiv ID: 2502.19402

Authors: Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal

Abstract: Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly -- the hallmarks of artificial general intelligence (AGI) -- remains fragile. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLMs' reasoning overfits to the training data and is limited in its transferability. We hypothesize that the core issue underlying such limited transferability is the coupling of reasoning and knowledge in LLMs. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) pretraining to reason using RL from scratch as an alternative to the widely used next-token prediction pretraining, (2) using a curriculum of synthetic tasks to ease the learning of a reasoning prior for RL that can then be transferred to natural language tasks, and (3) learning more generalizable reasoning functions using a small context window to reduce exploiting spurious correlations between tokens. Such a reasoning system coupled with a trained retrieval system and a large external memory bank as a knowledge store can overcome several limitations of existing architectures at learning to reason in novel scenarios.

Comment: The paper discusses disentangling reasoning and knowledge in LLMs, aligning with 'Large Language Models' as it proposes foundational changes to pretraining and reasoning paradigms. The focus on reasoning priors and curriculum learning adds significant novelty.

Relevance: 9 Novelty: 9

4. FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge

ArXiv ID: 2502.19207

Authors: Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung

Abstract: Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed, retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely-used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.

Comment: The paper introduces a novel unlearning method (KLUE) for faithful forgetting in LLMs, which aligns with foundational research on LLM behavior and interpretability.