Personalized Daily ArXiv Papers 2025-04-16

[gpt-4o]	Prompt	Completion	Total
Token	41080	5249	46329
Cost	$0.1	$0.05	$0.16

Total arXiv papers: 458

Total scanned papers: 283

Total relevant papers: 15

Table of contents with paper titles:

A Dual-Space Framework for General Knowledge Distillation of Large Language Models Authors: Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning Authors: Saif Punjwani, Larry Heck
Erzeugunsgrad, VC-Dimension and Neural Networks with rational activation function Authors: Luis Miguel Pardo, Daniel Sebastián
Dynamic Compressing Prompts for Efficient Inference of Large Language Models Authors: Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du
When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers Authors: Hongkang Li, Yihua Zhang, Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen
Looking beyond the next token Authors: Abitha Thankaraj, Yiding Jiang, J. Zico Kolter, Yonatan Bisk
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints Authors: Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang
Leveraging Submodule Linearity Enhances Task Arithmetic Performance in LLMs Authors: Rui Dai, Sile Hu, Xu Shen, Yonggang Zhang, Xinmei Tian, Jieping Ye
Cryo-em images are intrinsically low dimensional Authors: Luke Evans, Octavian-Vlad Murad, Lars Dingeldein, Pilar Cossio, Roberto Covino, Marina Meila
MiMu: Mitigating Multiple Shortcut Learning Behavior of Transformers Authors: Lili Zhao, Qi Liu, Wei Chen, Liyi Chen, Ruijun Sun, Min Hou, Yang Wang, Shijin Wang
Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning Authors: Juan Garcia Giraldo, Nikolaos Dimitriadis, Ke Wang, Pascal Frossard
VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers Authors: Run Wang, Gamze Islamoglu, Andrea Belano, Viviane Potocnik, Francesco Conti, Angelo Garofalo, Luca Benini
Better Estimation of the KL Divergence Between Language Models Authors: Afra Amini, Tim Vieira, Ryan Cotterell
Elucidating the Design Space of Multimodal Protein Language Models Authors: Cheng-Yen (Wesley) Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
MLPs and KANs for data-driven learning in physical problems: A performance comparison Authors: Raghav Pant, Sikan Li, Xingjian Li, Hassan Iqbal, Krishna Kumar

1. A Dual-Space Framework for General Knowledge Distillation of Large Language Models

ArXiv ID: 2504.11426

Authors: Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou

Abstract: Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.

Comment: The paper proposes a dual-space knowledge distillation framework for compressing large language models that addresses two limitations of existing white-box KD methods: mismatched output spaces and incompatibility across vocabularies. This aligns with the 'Model Compression' criterion through distillation-based efficiency gains rather than sparsity.
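
Illustration: the shared-head idea from the abstract can be sketched in a few lines. This is a toy example, not the authors' implementation. It assumes a single linear projector from the student space into the teacher space and a forward KL loss; the abstract does not specify the loss, the projector initialization, or the reverse (teacher-to-student) direction, so all function names, shapes, and numbers below are hypothetical. The point it shows is that once the student hidden state is projected, both distributions come from the same prediction head and therefore live in the same output space.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same output space."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def shared_head_kd_loss(teacher_hidden, student_hidden, projector, teacher_head):
    """Forward KL between teacher and projected-student distributions.

    Both distributions are produced by the SAME head (the teacher's),
    so they share one vocabulary and one output dimension.
    """
    teacher_probs = softmax(matvec(teacher_head, teacher_hidden))
    # Assumed linear projector: student space -> teacher space.
    projected = matvec(projector, student_hidden)
    student_probs = softmax(matvec(teacher_head, projected))
    return kl_divergence(teacher_probs, student_probs)


# Toy sizes (hypothetical): teacher dim 3, student dim 2, vocabulary of 4 tokens.
teacher_head = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
    [-0.4, 0.1, 0.9],
    [0.2, 0.2, 0.2],
]
projector = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
teacher_hidden = [0.7, -0.3, 0.2]
student_hidden = [0.4, 0.1]

loss = shared_head_kd_loss(teacher_hidden, student_hidden, projector, teacher_head)
print(f"shared-head KD loss: {loss:.6f}")
```

In this sketch the loss shrinks toward zero as the projected student state approaches the teacher state, which is the sense in which a shared head removes the output-space mismatch. Handling different vocabularies would additionally need a token alignment step such as the ETA algorithm the abstract mentions, which is not modeled here.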