Personalized Daily ArXiv Papers 2025-11-07

[gpt-5]	Prompt	Completion	Total
Token	36053	34849	70902
Cost	$0.05	$0.35	$0.39

Total arXiv papers: 458

Total scanned papers: 251

Total relevant papers: 20

Table of contents with paper titles:

1. TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training
   Authors: Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis
2. The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
   Authors: Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura
3. Block Rotation is All You Need for MXFP4 Quantization
   Authors: Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng
4. GMoPE: A Prompt-Expert Mixture Framework for Graph Foundation Models
   Authors: Zhibin Wang, Zhixing Zhang, Shuqi Wang, Xuanting Xie, Zhao Kang
5. DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization
   Authors: Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, Jian Cheng
6. SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
   Authors: Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Hakan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar
7. High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes
   Authors: Aukosh Jagannath, Taj Jones-McCormick, Varnan Sarangian
8. Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models
   Authors: Semih Cayci
9. An Augmentation Overlap Theory of Contrastive Learning
   Authors: Qi Zhang, Yifei Wang, Yisen Wang
10. Efficient Linear Attention for Multivariate Time Series Modeling via Entropy Equality
    Authors: Mingtao Zhang, Guoli Yang, Zhanxing Zhu, Mengzhu Wang, Xiaoying Bai
11. ODE approximation for the Adam algorithm: General and overparametrized setting
    Authors: Steffen Dereich, Arnulf Jentzen, Sebastian Kassing
12. Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing
    Authors: Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
13. Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks
    Authors: Alper Kalle, Theo Rudkiewicz, Mohamed-Oumar Ouerfelli, Mohamed Tamaazousti
14. Efficient Neural Networks with Discrete Cosine Transform Activations
    Authors: Marc Martinez-Gost, Sara Pepe, Ana Pérez-Neira, Miguel Ángel Lagunas
15. Robustness of Minimum-Volume Nonnegative Matrix Factorization under an Expanded Sufficiently Scattered Condition
    Authors: Giovanni Barbarino, Nicolas Gillis, Subhayan Saha
16. Sketch-Augmented Features Improve Learning Long-Range Dependencies in Graph Neural Networks
    Authors: Ryien Hosseini, Filippo Simini, Venkatram Vishwanath, Rebecca Willett, Henry Hoffmann
17. Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness
    Authors: Subeen Park, Joowang Kim, Hakyung Lee, Sunjae Yoo, Kyungwoo Song
18. PerfDojo: Automated ML Library Generation for Heterogeneous Architectures
    Authors: Andrei Ivanov, Siyuan Shen, Gioele Gottardo, Marcin Chrapek, Afif Boudaoud, Timo Schneider, Luca Benini, Torsten Hoefler
19. AILA--First Experiments with Localist Language Models
    Authors: Joachim Diederich
20. Optimizing Reasoning Efficiency through Prompt Difficulty Prediction
    Authors: Bo Zhao, Berkcan Kapusuzoglu, Kartik Balasubramaniam, Sambit Sahu, Supriyo Chakraborty, Genta Indra Winata

1. TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

ArXiv ID: 2511.03983

Authors: Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis

Abstract: We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.

Comment: Model Compression and Efficiency: distributed training-time sparsification via independent subnetwork training and aggregation enabling zero-cost, structured pruning; also an HPC-oriented distributed framework.

Relevance: 10 Novelty: 9
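
Illustration (not from the paper): the abstract's loop of training subnetworks in parallel, aggregating their parameters, and resampling new subnetworks can be pictured with a toy example. The sketch below is only that, under loud assumptions: a quadratic objective stands in for the LLM loss, subnetworks are random coordinate subsets, and aggregation is a per-coordinate average over the workers that trained each coordinate. The names `sample_masks`, `local_train`, and `aggregate` are hypothetical and do not come from TwIST.

```python
import random

def sample_masks(num_params, num_workers, keep, rng):
    """Give each worker a random subset of `keep` coordinates (its subnetwork)."""
    return [set(rng.sample(range(num_params), keep)) for _ in range(num_workers)]

def local_train(w, mask, target, lr, steps):
    """Gradient descent on 0.5 * (w_i - target_i)^2, touching only masked coordinates."""
    local = list(w)
    for _ in range(steps):
        for i in mask:
            local[i] -= lr * (local[i] - target[i])
    return local

def aggregate(w, worker_params, masks):
    """Average each coordinate over the workers that trained it; otherwise keep the old value."""
    new_w = list(w)
    for i in range(len(w)):
        owners = [p[i] for p, m in zip(worker_params, masks) if i in m]
        if owners:
            new_w[i] = sum(owners) / len(owners)
    return new_w

def loss(w, target):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(w, target))

rng = random.Random(0)  # fixed seed so the run is repeatable
target = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 4.0]  # toy stand-in for a trained model
w = [0.0] * len(target)
initial_loss = loss(w, target)

for _ in range(20):
    # Resample subnetworks every round, train them independently, then aggregate.
    masks = sample_masks(len(w), num_workers=4, keep=4, rng=rng)
    worker_params = [local_train(w, m, target, lr=0.5, steps=3) for m in masks]
    w = aggregate(w, worker_params, masks)

final_loss = loss(w, target)
print(initial_loss, final_loss)
```

In this toy setting every round can only shrink the error on the coordinates that were trained, so the loss is non-increasing. The actual aggregation rule, sampling scheme, and how structured subnetworks are extracted are not specified in the abstract.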

2. The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

ArXiv ID: 2511.04217

Authors: Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura

Abstract: The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA of $H$ heads and input dimension $d$ has hidden dimension $O(d\log(Hd^{3/2}))$ for the key and value, it contains an SLT that approximates an arbitrary MHA with the same input dimension with high probability. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA and transformer) and an approximate target counterpart decreases exponentially as the hidden dimension of the source model increases.

Comment: Model Architecture: establishes a strong lottery ticket existence result for multi-head attention in transformers, advancing sparsity/lottery-ticket theory for this core component.

Relevance: 10 Novelty: 9

3. Block Rotation is All You Need for MXFP4 Quantization

ArXiv ID: 2511.04214

Authors: Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng

Abstract: Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4, a new FP4 format with hardware support from several vendors (NVIDIA, AMD, Intel), raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also set a foundation for advancing PTQ research under emerging low-precision formats.

Comment: Model Compression: PTQ under MXFP4 (FP4) with a block-rotation strategy resolving incompatibility with power-of-two block scaling.

Relevance: 10 Novelty: 8
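
Illustration (not from the paper): the contrast between a global rotation and a block rotation can be seen on a single vector. The sketch below assumes an orthonormal Hadamard rotation, a block size of 4 (chosen only to keep the example small), and a vector with one outlier. `hadamard_rotate`, `block_rotate`, and `block_max_abs` are hypothetical helper names, and the paper's actual procedure may differ.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def block_rotate(x, block_size):
    """Rotate each contiguous block independently instead of the whole vector."""
    out = []
    for s in range(0, len(x), block_size):
        out.extend(hadamard_rotate(x[s:s + block_size]))
    return out

def block_max_abs(x, block_size):
    """Largest magnitude per block; a shared block scale would be derived from this."""
    return [max(abs(v) for v in x[s:s + block_size])
            for s in range(0, len(x), block_size)]

x = [8.0] + [0.0] * 7            # one outlier, everything else zero
global_rot = hadamard_rotate(x)  # outlier energy spread over all 8 entries
block_rot = block_rotate(x, 4)   # outlier energy confined to the first block

print(block_max_abs(global_rot, 4))
print(block_max_abs(block_rot, 4))
```

With the global rotation every block inherits part of the outlier, so every block's shared scale is affected. With the block rotation the second block stays at zero. How this interacts with power-of-two scales in MXFP4 is the subject of the paper's analysis and is not reproduced here.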

4. GMoPE: A Prompt-Expert Mixture Framework for Graph Foundation Models

ArXiv ID: 2511.03251

Authors: Zhibin Wang, Zhixing Zhang, Shuqi Wang, Xuanting Xie, Zhao Kang

Abstract: Graph Neural Networks (GNNs) have demonstrated impressive performance on task-specific benchmarks, yet their ability to generalize across diverse domains and tasks remains limited. Existing approaches often struggle with negative transfer, scalability issues, and high adaptation costs. To address these challenges, we propose GMoPE (Graph Mixture of Prompt-Experts), a novel framework that seamlessly integrates the Mixture-of-Experts (MoE) architecture with prompt-based learning for graphs. GMoPE leverages expert-specific prompt vectors and structure-aware MoE routing to enable each expert to specialize in distinct subdomains and dynamically contribute to predictions. To promote diversity and prevent expert collapse, we introduce a soft orthogonality constraint across prompt vectors, encouraging expert specialization and facilitating a more balanced expert utilization. Additionally, we adopt a prompt-only fine-tuning strategy that significantly reduces spatiotemporal complexity during transfer. We validate GMoPE through extensive experiments under various pretraining strategies and multiple downstream tasks. Results show that GMoPE consistently outperforms state-of-the-art baselines and achieves performance comparable to full-parameter fine-tuning while requiring only a fraction of the adaptation overhead. Our work provides a principled and scalable framework for advancing generalizable and efficient graph foundation models.

Comment: Model Architecture: Mixture-of-Experts framework for graph foundation models with structure-aware routing and prompt-expert vectors; prompt-only fine-tuning improves efficiency.

Relevance: 10 Novelty: 8

5. DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

ArXiv ID: 2511.04063

Authors: Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, Jian Cheng

Abstract: Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates superior performance. Compared to existing methods, it achieves 47$\times$ acceleration and 10$\times$ memory savings for rotational optimization on a 70B model. Furthermore, it is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments. Code is available at https://github.com/CAS-CLab/DartQuant.git.

Comment: Model Compression and Efficiency: distribution-aware rotational calibration (DartQuant) with efficient QR-orth optimization for LLM quantization at large scale.

Relevance: 10 Novelty: 8
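
Illustration (not from the paper): one generic way to keep a rotation matrix orthogonal is to take the Q factor of a QR decomposition of an unconstrained matrix. Whether this matches the paper's QR-Orth scheme is an assumption; the abstract does not describe it. The sketch uses modified Gram-Schmidt in plain Python, and `qr_orthogonalize`, `matvec`, and the 3x3 `latent` matrix are hypothetical.

```python
import math

def qr_orthogonalize(a):
    """Return Q from a modified Gram-Schmidt QR of square matrix `a` (list of rows)."""
    n = len(a)
    cols = [[a[r][c] for r in range(n)] for c in range(n)]
    q_cols = []
    for v in cols:
        v = list(v)
        for q in q_cols:
            # Remove the component of v along each previously accepted direction.
            proj = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - proj * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm < 1e-12:
            raise ValueError("matrix is numerically rank deficient")
        q_cols.append([vi / norm for vi in v])
    # Convert the list of columns back to a list of rows.
    return [[q_cols[c][r] for c in range(n)] for r in range(n)]

def matvec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

latent = [[2.0, 1.0, 0.0],
          [1.0, 3.0, 1.0],
          [0.0, 1.0, 4.0]]          # unconstrained parameters (toy values)
Q = qr_orthogonalize(latent)        # orthogonal matrix derived from them
x = [10.0, 0.1, -0.2]               # activation vector with one outlier
rotated = matvec(Q, x)

norm_before = math.sqrt(sum(v * v for v in x))
norm_after = math.sqrt(sum(v * v for v in rotated))
print(norm_before, norm_after)
```

Because Q is orthogonal, the rotation preserves the vector norm, so it only redistributes magnitude across coordinates. Choosing the rotation so that the post-rotation distribution is easier to quantize is the part the paper addresses.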

6. SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

ArXiv ID: 2511.03092

Authors: Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Hakan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

Abstract: The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.

Comment: High-Performance Inference/Memory Optimization: deployable KV-cache compression compatible with static-graph, continuous-batching accelerators for long-context LLMs.
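
Illustration (not from the paper): the abstract cites StreamingLLM as one way to bound KV cache size. The sketch below shows only the generic sink-plus-recent-window retention idea associated with that line of work: keep the first few positions and the most recent ones, and drop the middle. SnapStream's own policy is not described in the abstract, and `streaming_keep_indices` with its parameters is a hypothetical helper.

```python
def streaming_keep_indices(seq_len, num_sink, window):
    """Positions retained by a sink-plus-recent-window KV cache policy."""
    if seq_len <= num_sink + window:
        # Nothing to evict yet: the whole sequence fits in the budget.
        return list(range(seq_len))
    sinks = list(range(num_sink))                     # earliest tokens, always kept
    recent = list(range(seq_len - window, seq_len))   # sliding window of latest tokens
    return sinks + recent

kept = streaming_keep_indices(seq_len=16, num_sink=2, window=4)
short = streaming_keep_indices(seq_len=5, num_sink=2, window=4)
print(kept, short)
```

The cache size stays at `num_sink + window` entries regardless of sequence length, which is what makes a fixed-size policy of this kind compatible with static shapes.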