Personalized Daily ArXiv Papers 2025-10-24

[gpt-5]	Prompt	Completion	Total
Token	38964	40718	79682
Cost	$0.05	$0.41	$0.46

Total arXiv papers: 659

Total scanned papers: 344

Total relevant papers: 15

Table of contents with paper titles:

Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples Authors: Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, Daniela Rus
AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training Authors: Huawei Bai, Yifan Huang, Wenqi Shi, Ansheng You, Feifan Shao, Tengfei Han, Minghui Yu
Collective Communication for 100k+ GPUs Authors: Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Jingliang Ren, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Xinfeng Xie, Yulun Wang, Bruce Wu, Jingyi Yang, Mingran Yang, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Cristian Lumezanu, Rui Miao, Zhe Qu, Venkat Ramesh, Maxim Samoylov, Jan Seidel, Feng Tian, Qiye Tan, Shuqiang Zhang, Yimeng Zhao, Shengbao Zheng, Art Zhu, Hongyi Zeng
Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning Authors: Xiaohan Lan, Fanfan Liu, Haibo Qiu, Siqi Yang, Delian Ruan, Peng Shi, Lin Ma
Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling Authors: Jinhee Kim, Jae Jun An, Kang Eun Jeon, Jong Hwan Ko
ARC-Encoder: learning compressed text representations for large language models Authors: Hippolyte Pilchen, Edouard Grave, Patrick Pérez
On the Structure of Stationary Solutions to McKean-Vlasov Equations with Applications to Noisy Transformers Authors: Krishnakumar Balasubramanian, Sayan Banerjee, Philippe Rigollet
Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning Authors: Reuben Dorent, Polina Golland, William Wells III
Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning Authors: Gabriel Y. Arteaga, Marius Aasan, Rwiddhi Chakraborty, Martine Hjelkrem-Tan, Thalles Silva, Michael Kampffmeyer, Adín Ramírez Rivera
Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction Authors: Mutian He, Philip N. Garner
H-SPLID: HSIC-based Saliency Preserving Latent Information Decomposition Authors: Lukas Miklautz, Chengzhi Shi, Andrii Shkabrii, Theodoros Thirimachos Davarakis, Prudence Lam, Claudia Plant, Jennifer Dy, Stratis Ioannidis
Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs Authors: Hongyi Liu, Jiaji Huang, Zhen Jia, Youngsuk Park, Yu-Xiang Wang
Diffusion Autoencoders with Perceivers for Long, Irregular and Multimodal Astronomical Sequences Authors: Yunyi Shen, Alexander Gagliano
IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks Authors: Insu Jeon, Wonkwang Lee, Myeongjang Pyeon, Gunhee Kim
Context-level Language Modeling by Learning Predictive Context Embeddings Authors: Beiya Dai, Yuliang Liu, Daozheng Xue, Qipeng Guo, Kai Chen, Xinbing Wang

1. Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples

ArXiv ID: 2510.20800

Authors: Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, Daniela Rus

Abstract: Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) the gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) increasing the factorization search space by allowing matrix rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we attribute this to adaptation to downstream tasks being dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
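
Sketch: a minimal NumPy illustration of the two ingredients the abstract names, rank reduction of a weight matrix and the gradient with respect to its singular values. It relies on the standard identity dL/ds_i = u_i^T (dL/dW) v_i for W = U diag(s) V^T. The function names and the first-order scoring rule in `predicted_loss_change` are assumptions made for illustration; the abstract does not state the paper's exact selection criterion, and the multi-subspace clustering step is not shown.

```python
import numpy as np


def truncate_rank(W, keep):
    """Best rank-`keep` approximation of W: drop the trailing singular components."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :keep] * s[:keep]) @ Vt[:keep]


def singular_value_grads(W, G):
    """Return (s, dL/ds) where G = dL/dW and dL/ds_i = u_i^T G v_i."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    grads = np.einsum("ji,jk,ik->i", U, G, Vt)
    return s, grads


def predicted_loss_change(W, G, keep):
    """First-order estimate of the loss change if singular values past `keep`
    are set to zero (each s_i moves by -s_i). Illustrative score only, not the
    paper's stated rule: a more negative value suggests truncation may help."""
    s, grads = singular_value_grads(W, G)
    return -np.sum(grads[keep:] * s[keep:])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 4))   # stand-in for one weight matrix
    G = rng.normal(size=(6, 4))   # stand-in for dL/dW from one gradient step
    print(predicted_loss_change(W, G, keep=2))
```

In this sketch, matrices would be ranked by their score from a single backward pass, and only the top few candidates would then be truncated and evaluated.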

Comment: Compression/Efficiency: layer-selective rank reduction and pruning of high-order components with low-rank factorization; rapid adaptation using a single gradient step on 100 samples.

Relevance: 10 Novelty: 9

2. AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training

ArXiv ID: 2510.20111

Authors: Huawei Bai, Yifan Huang, Wenqi Shi, Ansheng You, Feifan Shao, Tengfei Han, Minghui Yu

Abstract: The training efficiency and scalability of language models on massive clusters currently remain a critical bottleneck. Mainstream approaches like ND parallelism are often cumbersome and complex, while flexible alternatives such as the Zero Redundancy Optimizer (ZeRO) are frequently hampered by communication overhead. In this paper, we propose Asynchronous Hierarchical ZeRO Parallelism (AsyncHZP), a novel asynchronous variant of ZeRO designed to achieve superior performance while maintaining simplicity and memory efficiency. Unlike traditional ZeRO, which employs over-fine-grained sharding that can lead to inefficient communication, AsyncHZP adaptively reshards parameters, gradients, and optimizer states across different replica groups. This strategy optimizes device memory utilization and significantly reduces communication overhead. We also design a multi-stream asynchronous scheduling method that executes parameter all-gather and gradient reduce-scatter operations in dedicated background threads, effectively overlapping communication with computation while incurring negligible memory fragmentation. Empirical evaluations on both Dense and Mixture-of-Experts (MoE) models confirm that AsyncHZP maintains robust stability at scale. It consistently outperforms classic ND parallelism, achieving state-of-the-art performance without complex strategic tuning, thereby simplifying the path to efficient large-scale training.
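
Sketch: a single-process NumPy simulation of the baseline ZeRO communication pattern the abstract builds on, in which each rank stores only a shard of the parameters, all-gathers the full vector before computing, and reduce-scatters gradients so that it updates only its own shard. The function names and the toy quadratic loss are assumptions made for illustration. The sketch does not model AsyncHZP's adaptive resharding across replica groups or its background-thread scheduling.

```python
import numpy as np


def all_gather(shards):
    """Every simulated rank receives the full concatenated parameter vector."""
    full = np.concatenate(shards)
    return [full.copy() for _ in shards]


def reduce_scatter(per_rank_grads):
    """Sum full-size gradients across ranks; rank r keeps only slice r."""
    total = np.sum(per_rank_grads, axis=0)
    return np.array_split(total, len(per_rank_grads))


def zero_step(param_shards, per_rank_batches, lr=0.1):
    """One data-parallel SGD step with sharded parameters.

    Toy loss on each rank: 0.5 * mean ||w - x||^2 over its local batch,
    so the local gradient is w - mean(x).
    """
    full_params = all_gather(param_shards)
    grads = [w - batch.mean(axis=0)
             for w, batch in zip(full_params, per_rank_batches)]
    grad_shards = reduce_scatter(grads)
    n_ranks = len(param_shards)
    # Average the summed gradient, then update only the locally owned shard.
    return [p - lr * g / n_ranks for p, g in zip(param_shards, grad_shards)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.arange(8, dtype=float)
    shards = np.array_split(w, 4)                      # 4 simulated ranks
    batches = [rng.normal(size=(3, 8)) for _ in range(4)]
    print(np.concatenate(zero_step(shards, batches)))
```

The sharded step yields the same parameters as an unsharded SGD step on the averaged gradient; what changes is that each rank stores and updates only 1/N of the state, at the price of the two collectives per step.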

Comment: High-performance training: asynchronous hierarchical ZeRO with adaptive resharding and multi-stream overlap for scalable LLM training.