Personalized Daily ArXiv Papers 2025-10-17

[gpt-5]	Prompt	Completion	Total
Token	62659	46077	108736
Cost	$0.08	$0.46	$0.54

Total arXiv papers: 680

Total scanned papers: 424

Total relevant papers: 46

Table of contents with paper titles:

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression Authors: Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa
What Layers When: Learning to Skip Compute in LLMs with Residual Gates Authors: Filipe Laitenberger, Dawid Kopiczko, Cees G. M. Snoek, Yuki M. Asano
Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference Authors: Chao Han, Yijuan Liang, Zihao Xuan, Daokuan Wu, Wei Zhang, Xiaoyu Shen
MergeMoE: Efficient Compression of MoE Models via Expert Output Merging Authors: Ruijie Miao, Yilun Yao, Zihan Wang, Zhiming Wang, Bairen Yi, LingJun Liu, Yikai Zhao, Tong Yang
First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training Authors: Gyudong Kim, Hyukju Na, Jin Hyeon Kim, Hyunsung Jang, Jaemin Park, Jaegi Hwang, Namkoo Ha, Seungryong Kim, Young Geun Kim
Efficient Dynamic Structured Sparse Training with Learned Shuffles Authors: Abhishek Tyagi, Arjun Iyer, Liam Young, William H Renninger, Christopher Kanan, Yuhao Zhu
A Free Lunch in LLM Compression: Revisiting Retraining after Pruning Authors: Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta
On the expressivity of sparse maxout networks Authors: Moritz Grillo, Tobias Hofmann
Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning Authors: Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, Xinping Guan, Xiaokang Yang, Yao Mu
BitNet Distillation Authors: Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei
FairBatching: Fairness-Aware Batch Formation for LLM Inference Authors: Hongtao Lyu, Boyue Liu, Mingyu Wu, Haibo Chen
From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR Authors: Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, Andre Rosti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang
Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries Authors: Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas, David Lopez-Paz, Kartik Ahuja
Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production Authors: Alexandre Galashov, Matt Jones, Rosemary Ke, Yuan Cao, Vaishnavh Nagarajan, Michael C. Mozer
Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References Authors: Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences Authors: Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda
Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning Authors: Awni Altabaa, Siyu Chen, John Lafferty, Zhuoran Yang
Towards Reversible Model Merging For Low-rank Weights Authors: Mohammadsajad Alipour, Mohammad Mohammadi Amiri
MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving Authors: Jungi Lee, Junyong Park, Soohyun Cha, Jaehoon Cho, Jaewoong Sim
A Deep State-Space Model Compression Method using Upper Bound on Output Error Authors: Hiroki Sakamoto, Kazuhiro Sato
Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling Authors: Alexandru Meterez, Depen Morwani, Jingfeng Wu, Costin-Andrei Oncescu, Cengiz Pehlevan, Sham Kakade
Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models Authors: Jonas Geiping, Xinyu Yang, Guinan Su
Context-Selective State Space Models: Feedback is All You Need Authors: Riccardo Zattra, Giacomo Baggio, Umberto Casti, Augusto Ferrante, Francesco Ticozzi
Attention Is All You Need for KV Cache in Diffusion LLMs Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions Authors: Zihao Fu, Ming Liao, Chris Russell, Zhenguang G. Cai
ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing Authors: Shivanshu Kumar, Gopalakrishnan Srinivasan
Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning Authors: Minsik Choi, Hyegang Son, Changhoon Kim, Young Geun Kim
Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow Authors: Ching-Lin Hsiung, Tian-Sheuan Chang
Programmatic Representation Learning with Language Models Authors: Gabriel Poesia, Georgia Gabriela Sampaio
To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models Authors: Eran Malach, Omid Saremi, Sinead Williamson, Arwen Bradley, Aryo Lotfi, Emmanuel Abbe, Josh Susskind, Etai Littwin
When Flatness Does (Not) Guarantee Adversarial Robustness Authors: Nils Philipp Walter, Linara Adilova, Jilles Vreeken, Michael Kamp
Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training Authors: Jie Hao, Xiaochuan Gong, Jie Xu, Zhengdao Wang, Mingrui Liu
SHaRe-SSM: An Oscillatory Spiking Neural Network for Target Variable Modeling in Long Sequences Authors: Kartikay Agrawal, Abhijeet Vikram, Vedant Sharma, Vaishnavi N., Ayon Borthakur
Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models Authors: Ivan Lee, Taylor Berg-Kirkpatrick
Predicting Task Performance with Context-aware Scaling Laws Authors: Kyle Montgomery, David Park, Jianhong Tu, Michael Bendersky, Beliz Gunel, Dawn Song, Chenguang Wang
DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis Authors: Shruti Sarika Chakraborty, Peter Minary
A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space Authors: Bingjie Zhang, Yibo Yang, Renzhe, Dandan Guo, Jindong Gu, Philip Torr, Bernard Ghanem
Provable Unlearning with Gradient Ascent on Two-Layer ReLU Neural Networks Authors: Odelia Melamed, Gilad Yehudai, Gal Vardi
Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning Authors: Shikuang Deng, Jiayuan Zhang, Yuhang Wu, Ting Chen, Shi Gu
xLLM Technical Report Authors: Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang
Circuit Insights: Towards Interpretability Beyond Activations Authors: Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar Authors: Yinxi Li, Yuntian Deng, Pengyu Nie
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models Authors: Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar
Purifying Task Vectors in Knowledge-Aware Subspace for Model Merging Authors: Bang An, Yibo Yang, Philip Torr, Bernard Ghanem
LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning Authors: Beomseok Kang, Jiwon Song, Jae-Joon Kim
Semantic representations emerge in biologically inspired ensembles of cross-supervising neural networks Authors: Roy Urbach, Elad Schneidman

1. REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

ArXiv ID: 2510.13999

Authors: Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa

Abstract: Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we demonstrate that expert pruning is a superior strategy for generative tasks. We prove that merging introduces an irreducible error by causing a "functional subspace collapse", due to the loss of the router's independent, input-dependent control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.

Comment: MoE + Compression: theoretical case against expert merging and a router-weighted expert pruning criterion for one-shot SMoE compression.

Relevance: 10 Novelty: 9
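
The criterion described in the abstract combines router gate-values with expert activation norms. A minimal sketch of such a score follows, assuming it is the mean of gate value times output norm over the tokens routed to each expert. The exact averaging and the function names `reap_style_scores` and `lowest_scoring_experts` are illustrative assumptions, not the authors' implementation.

```python
import math

def reap_style_scores(gates, outputs, top_k):
    """Score each expert by mean(gate * ||expert output||) over its routed tokens.

    gates[t][e]   : router gate value of expert e for token t
    outputs[t][e] : output vector of expert e for token t
    top_k         : number of experts each token is routed to
    """
    num_experts = len(gates[0])
    sums = [0.0] * num_experts
    counts = [0] * num_experts
    for gate_row, out_row in zip(gates, outputs):
        # Experts that are actually active for this token (top-k by gate value).
        active = sorted(range(num_experts), key=lambda e: gate_row[e], reverse=True)[:top_k]
        for e in active:
            norm = math.sqrt(sum(v * v for v in out_row[e]))
            sums[e] += gate_row[e] * norm
            counts[e] += 1
    # Experts that never received a token get a score of 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def lowest_scoring_experts(scores, n_prune):
    """Indices of the n_prune experts with the smallest scores (pruning candidates)."""
    return sorted(sorted(range(len(scores)), key=lambda e: scores[e])[:n_prune])
```

Under this sketch an expert with large gate values but near-zero outputs scores low, as does an expert with large outputs that the router rarely weights heavily.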

2. What Layers When: Learning to Skip Compute in LLMs with Residual Gates

ArXiv ID: 2510.13876

Authors: Filipe Laitenberger, Dawid Kopiczko, Cees G. M. Snoek, Yuki M. Asano

Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.

Comment: Model Compression and Efficiency: token-wise layer skipping via residual-stream gates enabling dynamic computation with stable fine-tuning.

Relevance: 10 Novelty: 8
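
The inference-time procedure in the abstract (rank tokens by gate value, run the branch only for the top tokens within a per-layer budget) can be sketched roughly as below. This is a toy illustration under stated assumptions: the gate is taken to be one scalar per token computed from the token's current hidden state, and `gated_layer`, `w`, `b`, and `budget` are hypothetical names rather than the paper's API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_values(hidden, w, b):
    """One scalar sigmoid-linear gate per token: sigmoid(w . h + b)."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden]

def gated_layer(hidden, branch, w, b, budget):
    """Run `branch` only for the highest-gate tokens; the rest skip the layer.

    hidden : list of per-token hidden vectors
    branch : function mapping a hidden vector to the branch output (attention or MLP)
    budget : fraction of tokens allowed to execute the branch in this layer
    """
    gates = gate_values(hidden, w, b)
    n_keep = max(1, math.ceil(budget * len(hidden)))
    kept = set(sorted(range(len(hidden)), key=lambda t: gates[t], reverse=True)[:n_keep])
    out = []
    for t, h in enumerate(hidden):
        if t in kept:
            y = branch(h)
            # The gate scales the branch output before it re-enters the residual stream.
            out.append([hi + gates[t] * yi for hi, yi in zip(h, y)])
        else:
            # Skipped token: the residual stream passes through unchanged.
            out.append(list(h))
    return out, kept
```

Because skipped tokens never call `branch`, compute in this sketch scales with `budget` rather than with sequence length.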

3. Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

ArXiv ID: 2510.13831

Authors: Chao Han, Yijuan Liang, Zihao Xuan, Daokuan Wu, Wei Zhang, Xiaoyu Shen

Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing--a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing

Comment: Model Compression/Efficiency: informed token-level routing using a lightweight feature forecaster for execute-or-approximate computation.

Relevance: 10 Novelty: 8

4. MergeMoE: Efficient Compression of MoE Models via Expert Output Merging

ArXiv ID: 2510.14436

Authors: Ruijie Miao, Yilun Yao, Zihan Wang, Zhiming Wang, Bairen Yi, LingJun Liu, Yikai Zhao, Tong Yang

Abstract: The Mixture-of-Experts (MoE) technique has proven to be a promising solution to efficiently scale the model size, which has been widely applied in recent LLM advancements. However, the substantial memory overhead of MoE models has made their compression an important research direction. In this work, we provide a theoretical analysis of expert merging, a recently proposed technique for compressing MoE models. Rather than interpreting expert merging from the conventional perspective of parameter aggregation, we approach it from the perspective of merging experts' outputs. Our key insight is that the merging process can be interpreted as inserting additional matrices into the forward computation, which naturally leads to an optimization formulation. Building on this analysis, we introduce MergeMoE, a method that leverages mathematical optimization to construct the compression matrices. We evaluate MergeMoE on multiple MoE models and show that our algorithm consistently outperforms the baselines with the same compression ratios.

Comment: Model Compression and Efficiency (MoE): theoretical framing and optimized expert output merging for compressing MoE models.

Relevance: 10 Novelty: 8

5. First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training

ArXiv ID: 2510.14614

Authors: Gyudong Kim, Hyukju Na, Jin Hyeon Kim, Hyunsung Jang, Jaemin Park, Jaegi Hwang, Namkoo Ha, Seungryong Kim, Young Geun Kim

Abstract: As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become a standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block's MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by the observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input for the model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity than the baseline without increasing the training time.

Comment: Model Architecture + HPC: redesigns Transformer wiring to remove per-block MHA-MLP communication, eliminating TP all-reduce and enabling parallel MHA/MLP execution.

Relevance: 10 Novelty: 8

6. Efficient Dynamic Structured Sparse Training with Learned Shuffles

ArXiv ID: 2510.14812

Authors: Abhishek Tyagi, Arjun Iyer, Liam Young, William H Renninger, Christopher Kanan, Yuhao Zhu

Abstract: Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures -- block, N:M, and diagonals -- we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90-95% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.

Comment: Compression/Efficiency: dynamic structured sparsity augmented with learned permutations to match unstructured DST accuracy while accelerating training/inference.

Relevance: 10 Novelty: 8
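
To illustrate why pairing a fixed structure with a permutation widens the set of reachable masks, the toy sketch below computes the effective sparsity pattern of a block-diagonal weight matrix applied to a permuted input. The permutation here is fixed by hand; learning it jointly with the weights, as the abstract describes, is not shown, and the function names are illustrative assumptions.

```python
def block_diagonal_mask(n, block):
    """n x n 0/1 mask with dense blocks of size `block` on the diagonal."""
    return [[1 if i // block == j // block else 0 for j in range(n)] for i in range(n)]

def effective_mask(mask, perm):
    """Mask seen by the original input when the layer computes W @ (P x).

    The permuted input is (P x)[j] = x[perm[j]], so structured column j
    acts on original input coordinate perm[j].
    """
    n = len(mask)
    eff = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eff[i][perm[j]] = mask[i][j]
    return eff

mask = block_diagonal_mask(4, 2)
eff = effective_mask(mask, perm=[0, 2, 1, 3])
for row in eff:
    print(row)
```

The number of nonzeros is unchanged, so the hardware-friendly layout is kept for the stored weights while the connectivity seen by the input is no longer restricted to contiguous blocks.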

7. A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

ArXiv ID: 2510.14444

Authors: Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta

Abstract: While neural network pruning typically requires retraining the model to recover pruning-induced performance degradation, state-of-the-art Large Language Model (LLM) pruning methods instead solve a layer-wise mask selection and reconstruction problem on a small set of calibration data to avoid full retraining, which is considered computationally infeasible for LLMs. Reconstructing single matrices in isolation has favorable properties, such as convexity of the objective and significantly reduced memory requirements compared to full retraining. In practice, however, reconstruction is often implemented at coarser granularities, e.g., reconstructing a whole transformer block against its dense activations instead of a single matrix. In this work, we study the key design choices when reconstructing or retraining the remaining weights after pruning. We conduct an extensive computational study on state-of-the-art GPT architectures, and report several surprising findings that challenge common intuitions about retraining after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Most importantly, this Pareto-optimal setup achieves better performance than full retraining, despite requiring only a fraction of the memory. Furthermore, we demonstrate that simple and efficient pruning criteria such as Wanda can outperform much more complex approaches when the reconstruction step is properly executed, highlighting its importance. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery for LLMs.

Comment: Compression: shows reconstruction-based post-pruning retraining can beat full retraining; key design insights and efficient recovery after pruning.

Relevance: 10 Novelty: 8
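
For context on the Wanda criterion named in the abstract: it scores each weight by its magnitude times the L2 norm of the corresponding input feature over calibration data, and prunes the lowest-scoring weights within each output row. The sketch below covers only this mask-selection step in plain Python; the reconstruction step studied in the paper is not shown, and the function name `wanda_mask` is an illustrative choice.

```python
import math

def wanda_mask(weight, activations, sparsity):
    """Return a 0/1 keep-mask using a Wanda-style score |w_ij| * ||x_j||_2.

    weight      : rows are output neurons, columns are input features
    activations : list of calibration input vectors
    sparsity    : fraction of weights to prune in each output row
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across the calibration samples.
    col_norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    mask = []
    for row in weight:
        scores = [abs(w) * col_norms[j] for j, w in enumerate(row)]
        pruned = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        mask.append([0 if j in pruned else 1 for j in range(n_in)])
    return mask

# The small weight on a high-activation feature (last column) survives,
# while the largest-magnitude weight on a near-silent feature is pruned.
weight = [[1.0, -0.5, 0.2, 0.1]]
calibration = [[0.1, 1.0, 1.0, 10.0], [0.1, 1.0, 1.0, 10.0]]
print(wanda_mask(weight, calibration, sparsity=0.5))
```

Plain magnitude pruning would keep the first two weights here; the activation-aware score keeps the second and fourth instead.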

8. On the expressivity of sparse maxout networks

ArXiv ID: 2510.14068

Authors: Moritz Grillo, Tobias Hofmann

Abstract: We study the expressivity of sparse maxout networks, where each neuron takes a fixed number of inputs from the previous layer and employs a, possibly multi-argument, maxout activation. This setting captures key characteristics of convolutional or graph neural networks. We establish a duality between functions computable by such networks and a class of virtual polytopes, linking their geometry to questions of network expressivity. In particular, we derive a tight bound on the dimension of the associated polytopes, which serves as the central tool for our analysis. Building on this, we construct a sequence of depth hierarchies. While sufficiently deep sparse maxout networks are universal, we prove that if the required depth is not reached, width alone cannot compensate for the sparsity of a fixed indegree constraint.

Comment: Representation/Architecture Theory: expressivity analysis and depth hierarchies for sparse maxout networks under fixed indegree (sparsity).

Relevance: 9 Novelty: 9

9. Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

ArXiv ID: 2510.14300

Authors: Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, Xinping Guan, Xiaokang Yang, Yao Mu

Abstract: Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models, and scales up the action expert by replacing the feedforward layers with sparsely activated MoE layers. AdaMoE employs a decoupling technique that decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize. Instead, through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.

Comment: Model Architecture (MoE): action-specialized MoE for VLA with decoupled expert selection/weighting enabling collaborative expert usage.

Relevance: 10 Novelty: 7
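
The decoupling idea in the abstract (the router decides which experts run, a separate adapter decides how much each contributes) can be sketched as below. This is a loose illustration only: using a sigmoid of separate `scale_logits` as the contribution weight is an assumption made here for simplicity, expert outputs are scalars to keep the example short, and `decoupled_moe` is a hypothetical name, not the authors' code.

```python
import math

def decoupled_moe(router_logits, scale_logits, expert_outputs, top_k):
    """Select experts with the router, weight them with an independent adapter.

    router_logits  : per-expert selection scores (decide WHICH experts run)
    scale_logits   : per-expert adapter scores (decide HOW MUCH each contributes)
    expert_outputs : per-expert scalar outputs for the current token
    """
    n = len(router_logits)
    # Selection depends only on the router.
    chosen = sorted(range(n), key=lambda e: router_logits[e], reverse=True)[:top_k]
    # Weights depend only on the adapter; they are not normalized against each
    # other, so one expert's weight does not shrink another's (assumption).
    weights = {e: 1.0 / (1.0 + math.exp(-scale_logits[e])) for e in chosen}
    y = sum(weights[e] * expert_outputs[e] for e in chosen)
    return y, chosen
```

In a conventional top-k router the same softmax scores serve both purposes, so raising one expert's weight necessarily lowers the others; separating the two signals removes that coupling in this sketch.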

10. BitNet Distillation

ArXiv ID: 2510.13998

Authors: Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei

Abstract: In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between finetuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.

Comment: Model Compression and Efficiency: distillation to 1.58-bit (ternary) LLMs with SubLN and attention distillation; large memory/speed gains.

Relevance: 10 Novelty: 7
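
As background on the "1.58-bit" figure: a ternary weight carries log2(3), roughly 1.58 bits of information. The sketch below shows an absmean-style ternary quantizer of the kind used in BitNet b1.58 (scale by the mean absolute weight, round, clip to {-1, 0, 1}). Whether BitDistill uses exactly this quantizer is an assumption here, and the function name is illustrative.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, 1} with a single absmean scale.

    Returns (quantized, scale); scale * quantized approximates the original weights.
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    quantized = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return quantized, scale

weights = [[0.9, -0.05, 0.4], [-1.2, 0.02, 0.0]]
quantized, scale = ternary_quantize(weights)
print(quantized, round(scale, 4))
print(round(math.log2(3), 3))  # information content of one ternary weight, in bits
```

Small weights collapse to 0 and large ones saturate at -1 or 1, which is why the abstract pairs quantization with distillation and continued pre-training to recover task accuracy.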

11. FairBatching: Fairness-Aware Batch Formation for LLM Inference

ArXiv ID: 2510.14392

Authors: Hongtao Lyu, Boyue Liu, Mingyu Wu, Haibo Chen

Abstract: Large language model (LLM) inference systems face a fundamental tension between minimizing Time-to-First-Token (TTFT) latency for new requests and maintaining a high, steady token generation rate (low Time-Per-Output-Token, or TPOT) for ongoing requests. Existing stall-free batching schedulers proposed by Sarathi, while effective at preventing decode stalls, introduce significant computational unfairness. They prioritize decode tasks excessively, simultaneously leading to underutilized decode slack and unnecessary prefill queuing delays, which collectively degrade the system's overall quality of service (QoS). This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT) as a scheduling metric and the rigid decode-prioritizing policy that fails to adapt to dynamic workload bursts. We therefore propose FairBatching, a novel LLM inference scheduler that enforces fair resource allocation between prefill and decode tasks. It features an adaptive batch capacity determination mechanism, which dynamically adjusts the computational budget to improve the GPU utilization without triggering SLO violations. Its fair and dynamic batch formation algorithm breaks away from the decode-prioritizing paradigm, allowing computation resources to be reclaimed from bursting decode tasks to serve prefill surges, achieving global fairness. Furthermore, FairBatching provides a novel load estimation method, enabling more effective coordination with upper-level schedulers. Implemented and evaluated on realistic traces, FairBatching significantly reduces TTFT tail latency by up to 2.29x while robustly maintaining TPOT SLOs, achieving overall 20.0% improvement in single-node capacity and 54.3% improvement in cluster-level capacity.

Comment: High Performance Computing/Systems: fairness-aware batching scheduler improves TTFT/TPOT and GPU utilization for LLM inference.