Personalized Daily ArXiv Papers 2025-12-22

[gpt-5]	Prompt	Completion	Total
Token	23313	26570	49883
Cost	$0.03	$0.27	$0.29

Total arXiv papers: 394

Total scanned papers: 232

Total relevant papers: 12

Table of contents with paper titles:

1. Learning What to Write: Write-Gated KV for Efficient Long-Context Inference
   Authors: Yen-Chieh Huang, Rui Fang, Ming-Syan Chen, Pi-Cheng Hsiu
2. A Unified Representation of Neural Networks Architectures
   Authors: Christophe Prieur, Mircea Lazar, Bogdan Robu
3. Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
   Authors: Zhenyu Liu, Yunzhen Liu, Zehao Fan, Garrett Gagnon, Yayue Hou, Nan Wu, Yangwook Kang, Liu Liu
4. GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping
   Authors: Yikang Yue, Yishu Yin, Xuehai Qian
5. Bridging Training and Merging Through Momentum-Aware Optimization
   Authors: Alireza Moayedikia, Alicia Troncoso
6. InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
   Authors: Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu
7. Disentangled representations via score-based variational autoencoders
   Authors: Benjamin S. H. Lyo, Eero P. Simoncelli, Cristina Savin
8. Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
   Authors: Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao
9. Mitigating Forgetting in Low Rank Adaptation
   Authors: Joanna Sliwa, Frank Schneider, Philipp Hennig, Jose Miguel Hernandez-Lobato
10. Dion2: A Simple Method to Shrink Matrix in Muon
    Authors: Kwangjun Ahn, Noah Amsel, John Langford
11. Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
    Authors: Lingxiao Zhao, Haoran Zhou, Yuezhi Che, Dazhao Cheng
12. DeepShare: Sharing ReLU Across Channels and Layers for Efficient Private Inference
    Authors: Yonathan Bornfeld, Shai Avidan

1. Learning What to Write: Write-Gated KV for Efficient Long-Context Inference

ArXiv ID: 2512.17452

Authors: Yen-Chieh Huang, Rui Fang, Ming-Syan Chen, Pi-Cheng Hsiu

Abstract: Long-context LLM inference is bottlenecked by quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45x prefill and 1.89-2.56x decode speedups on Llama models with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV .
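The admission idea in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function name `write_gated_cache`, the fixed threshold, the window size, and the hand-picked gate scores are all assumptions made for illustration, and the choice to apply the gate when a token leaves the local window is one possible reading of "before it enters the cache".

```python
from collections import deque


def write_gated_cache(gate_scores, threshold=0.5, window=2):
    """Stream tokens through a sliding local cache and a gated global cache.

    gate_scores[i] stands in for a learned utility prediction for token i
    (hypothetical values; a real gate would be a trained module).
    """
    local = deque()      # most recent token positions, always kept
    global_cache = []    # older positions that passed the write gate
    for pos, score in enumerate(gate_scores):
        local.append((pos, score))
        if len(local) > window:
            old_pos, old_score = local.popleft()
            # Admission decision (assumed timing): only high-utility tokens
            # are written to the persistent global cache; others are dropped.
            if old_score >= threshold:
                global_cache.append(old_pos)
    return global_cache, [pos for pos, _ in local]


scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
global_ids, local_ids = write_gated_cache(scores)
kept = len(global_ids) + len(local_ids)
print(global_ids, local_ids, f"kept {kept} of {len(scores)} tokens")
```

With these made-up scores, positions 0 and 2 are admitted to the global cache, positions 4 and 5 remain in the local window, and positions 1 and 3 are never written.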

Comment: Transformer efficiency: learned KV admission (write-gated KV) and compact global+local cache to reduce KV size and attention cost—cache/memory optimization for long-context inference.

Relevance: 10 Novelty: 8

2. A Unified Representation of Neural Networks Architectures

ArXiv ID: 2512.17593

Authors: Christophe Prieur, Mircea Lazar, Bogdan Robu

Abstract: In this paper we consider the limiting case of neural network (NN) architectures when the number of neurons in each hidden layer and the number of hidden layers tend to infinity, thus forming a continuum, and we derive approximation errors as a function of the number of neurons and/or hidden layers. First, we consider the case of neural networks with a single hidden layer and we derive an integral infinite-width neural representation that generalizes existing continuous neural network (CNN) representations. Then we extend this to deep residual CNNs that have a finite number of integral hidden layers and residual connections. Second, we revisit the relation between neural ODEs and deep residual NNs and we formalize approximation errors via discretization techniques. Then, we merge these two approaches into a unified homogeneous representation of NNs as a Distributed Parameter neural Network (DiPaNet) and we show that most of the existing finite and infinite-dimensional NN architectures are related via homogenization/discretization with the DiPaNet representation. Our approach is purely deterministic and applies to general, uniformly continuous matrix weight functions. Differences and similarities with neural fields are discussed along with further possible generalizations and applications of the DiPaNet framework.
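The relation between deep residual networks and neural ODEs mentioned in this abstract can be illustrated with a small numeric sketch. It uses the standard observation that a residual update x_{k+1} = x_k + h f(x_k) with step h = 1/depth is a forward Euler step for dx/dt = f(x). The scalar layer map f(x) = -x, shared across layers, is a hypothetical choice made so that the exact solution is known; it is not taken from the paper.

```python
import math


def residual_forward(x0, depth, f):
    """Apply `depth` residual blocks x <- x + h * f(x) with h = 1 / depth."""
    h = 1.0 / depth
    x = x0
    for _ in range(depth):
        x = x + h * f(x)
    return x


def decay(x):
    # Toy layer map (assumption): dx/dt = -x has exact solution x0 * exp(-t).
    return -x


exact = math.exp(-1.0)  # ODE solution at t = 1 starting from x0 = 1
errors = {
    depth: abs(residual_forward(1.0, depth, decay) - exact)
    for depth in (10, 100, 1000)
}
for depth, err in errors.items():
    print(depth, err)
```

In this toy setting the gap between the finite-depth network and the continuum limit shrinks roughly in proportion to 1/depth, which is the kind of discretization error the abstract refers to.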

Comment: Foundational architecture theory: unified continuum representation (DiPaNet) linking infinite width/depth, residual nets, and neural ODEs with approximation error analysis.

Relevance: 10 Novelty: 8

3. Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation

ArXiv ID: 2512.17073

Authors: Zhenyu Liu, Yunzhen Liu, Zehao Fan, Garrett Gagnon, Yayue Hou, Nan Wu, Yangwook Kang, Liu Liu

Abstract: Mixture-of-Experts (MoE) models scale capacity via sparse activation but stress memory and bandwidth. Offloading alleviates GPU memory by fetching experts on demand, yet token-level routing causes irregular transfers that make inference I/O-bound. Static uniform quantization reduces traffic but degrades accuracy under aggressive compression by ignoring expert heterogeneity. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which performs router-guided precision restoration using precomputed low-rank compensators. At inference time, our method transfers compact low-rank factors with Top-n (n<k) experts per token and applies compensation to them, keeping others low-bit. Integrated with offloading on GPU and GPU-NDP systems, our method delivers a superior bandwidth-accuracy trade-off and improved throughput.
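A rough sketch of the per-token transfer plan described in this abstract follows. Everything concrete here is an assumption for illustration: the function name `plan_transfers`, the rule that the n highest-scoring routed experts are the ones compensated, and the byte accounting (a low-bit weight matrix plus two 16-bit low-rank factors). The paper's actual selection rule and storage format may differ.

```python
def plan_transfers(router_scores, k, n, d_in, d_out, low_bits, rank, factor_bits=16):
    """Return (expert_id, compensated, bytes_moved) for the k routed experts.

    All k experts are fetched as low-bit weights; only the top-n by router
    score (assumed rule) also fetch rank-`rank` factors for compensation.
    """
    assert n < k <= len(router_scores)
    ranked = sorted(
        range(len(router_scores)),
        key=lambda expert: router_scores[expert],
        reverse=True,
    )
    plan = []
    for position, expert in enumerate(ranked[:k]):
        bits = d_in * d_out * low_bits            # quantized expert weights
        compensated = position < n
        if compensated:
            # Two factors of shapes (d_out, rank) and (rank, d_in).
            bits += rank * (d_in + d_out) * factor_bits
        plan.append((expert, compensated, bits // 8))
    return plan


plan = plan_transfers(
    router_scores=[0.1, 0.5, 0.3, 0.05],
    k=2, n=1, d_in=8, d_out=8, low_bits=2, rank=1,
)
print(plan)
```

With these toy sizes, expert 1 is fetched with its low-rank factors and expert 2 stays low-bit, so the extra traffic is paid only for the expert the router weights most heavily.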

Comment: Model Architecture (Mixture-of-Experts) + Model Compression/Efficiency — router-guided low-rank compensation with quantization/offloading to cut bandwidth while preserving accuracy.

Relevance: 10 Novelty: 8

4. GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

ArXiv ID: 2512.17570

Authors: Yikang Yue, Yishu Yin, Xuehai Qian

Abstract: SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all micro-batches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake
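The difference between horizontal and vertical scheduling in this abstract can be sketched as two loop orders over (layer, micro-batch) pairs. The helper names and the cost model are assumptions for illustration: the sketch covers only a forward pass, assumes a single layer's parameters fit in GPU memory at a time, and ignores activation storage and the overlapped optimizer step.

```python
def horizontal_schedule(num_layers, num_microbatches):
    """Run each micro-batch through all layers before starting the next."""
    return [
        (layer, mb)
        for mb in range(num_microbatches)
        for layer in range(num_layers)
    ]


def vertical_schedule(num_layers, num_microbatches):
    """Run all micro-batches through a layer before moving to the next layer."""
    return [
        (layer, mb)
        for layer in range(num_layers)
        for mb in range(num_microbatches)
    ]


def count_layer_loads(schedule):
    """Count parameter fetches when only one layer is resident at a time."""
    loads = 0
    resident = None
    for layer, _ in schedule:
        if layer != resident:
            loads += 1          # layer parameters must be fetched from SSD
            resident = layer
    return loads


layers, microbatches = 3, 4
print(count_layer_loads(horizontal_schedule(layers, microbatches)))
print(count_layer_loads(vertical_schedule(layers, microbatches)))
```

Under this simplified model, the horizontal order fetches each layer once per micro-batch (12 loads here), while the vertical order fetches each layer once (3 loads), which gives a feel for why the loop order matters when parameter I/O is the bottleneck.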

Comment: Systems/HPC contribution: SSD-offloaded LLM training with vertical micro-batch scheduling and optimizer-step overlap for memory/throughput optimization.