Personalized Daily ArXiv Papers 2025-11-14

[gpt-5]	Prompt	Completion	Total
Token	50111	42228	92339
Cost	$0.06	$0.42	$0.48

Total arXiv papers: 542

Total scanned papers: 334

Total relevant papers: 20

Table of contents with paper titles:

BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference Authors: Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, Zhengwei Qi
Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off Authors: Mingkuan Zhao, Wentao Hu, Jiayin Wang, Xin Lai, Tianchen Huang, Yuheng Min, Rui Yan, Xiaoyan Zhu
EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training Authors: Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang
Global Convergence of Four-Layer Matrix Factorization under Random Initialization Authors: Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du
Fractional neural attention for efficient multiscale sequence processing Authors: Cheng Kevin Qu, Andrew Ly, Pulin Gong
SVD-NO: Learning PDE Solution Operators with SVD Integral Kernels Authors: Noam Koren, Ralf J. J. Mackenbach, Ruud J. G. van Sloun, Kira Radinsky, Daniel Freedman
Koopman Invariants as Drivers of Emergent Time-Series Clustering in Joint-Embedding Predictive Architectures Authors: Pablo Ruiz-Morales, Dries Vanoost, Davy Pissoort, Mathias Verbeke
Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs Authors: Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna
On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks Authors: Arthur Castello Branco de Oliveira, Dhruv Jatkar, Eduardo Sontag
Rethinking Visual Information Processing in Multimodal LLMs Authors: Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C
Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training Authors: Weilin Wan, Fan Yi, Weizhong Zhang, Quan Zhou, Cheng Jin
TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training Authors: Houming Wu, Ling Chen
Semi-Unified Sparse Dictionary Learning with Learnable Top-K LISTA and FISTA Encoders Authors: Fengsheng Lin, Shengyi Yan, Trac Duy Tran
Black-Box On-Policy Distillation of Large Language Models Authors: Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei
Steering Pretrained Drafters during Speculative Decoding Authors: Frédéric Berdoz, Peer Rheinboldt, Roger Wattenhofer
Efficient Hyperdimensional Computing with Modular Composite Representations Authors: Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy Loutfi, Mauro Olivieri, Denis Kleyko
Generalization Can Emerge in Tabular Foundation Models From a Single Table Authors: Junwei Ma, Nour Shaheen, Alex Labach, Amine Mhedhbi, Frank Hutter, Anthony L. Caterini, Valentin Thomas
Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning Authors: Hyung-Jun Moon, Sung-Bae Cho
Generalizing PDE Emulation with Equation-Aware Neural Operators Authors: Qian-Ze Zhu, Paul Raccuglia, Michael P. Brenner
Continuum Dropout for Neural Differential Equations Authors: Jonghun Lee, YongKyung Oh, Sungil Kim, Dong-Young Lim

1. BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference

ArXiv ID: 2511.10054

Authors: Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, Zhengwei Qi

Abstract: Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.

Comment: Compression/Efficiency + Systems for MoE: exploits expert redundancy to accelerate memory-constrained MoE inference and mitigate PCIe offloading stalls when prefetch fails.

Relevance: 10 Novelty: 8
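
To put the prefetch-failure cost from the BuddyMoE abstract in perspective, here is a back-of-the-envelope sketch of the expected on-demand stall per token. The 10 ms fetch latency comes from the abstract; Mixtral-8x7B's 32 layers and top-2 routing are public model facts. The 5% miss rate and the assumption that missed fetches serialize over PCIe are hypothetical, chosen only for illustration.

```python
def expected_stall_ms(num_layers: int, experts_per_token: int,
                      miss_rate: float, fetch_ms: float = 10.0) -> float:
    """Expected stall per token if every prefetch miss triggers a blocking,
    serialized on-demand fetch (an assumption; real systems may overlap transfers)."""
    expert_activations = num_layers * experts_per_token
    return expert_activations * miss_rate * fetch_ms


# Mixtral-8x7B shape: 32 layers, 2 experts per token per layer.
# miss_rate=0.05 is a made-up value, not a measurement from the paper.
stall = expected_stall_ms(num_layers=32, experts_per_token=2, miss_rate=0.05)
print(f"expected stall per token: {stall:.1f} ms")
```

Under these assumptions even a small miss rate adds tens of milliseconds per token, which is why the choice between stalling and dropping the expert matters.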

2. Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

ArXiv ID: 2511.09596

Authors: Mingkuan Zhao, Wentao Hu, Jiayin Wang, Xin Lai, Tianchen Huang, Yuheng Min, Rui Yan, Xiaoyan Zhu

Abstract: The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that while delivering an approximately two-fold increase in training throughput, its performance is on par with standard dense attention, even surpassing it on select key metrics, while consistently outperforming representative sparse attention methods including Longformer, Reformer, and BigBird across all evaluation metrics.

Comment: Model Architecture and Efficiency: introduces principled structural sparsity in multi-head attention, reducing complexity by a factor of $H$.

Relevance: 10 Novelty: 8
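
The band partitioning described in the SPAttention abstract can be sketched roughly as follows. This is not the authors' implementation: it assumes causal attention, measures "distance" as query index minus key index, and balances bands by the number of (query, key) pairs they contain. The function names `band_edges` and `head_attends` are hypothetical.

```python
def band_edges(seq_len: int, num_heads: int) -> list[tuple[int, int]]:
    """Split causal distances 0..seq_len-1 into contiguous [start, stop) bands,
    one per head, with roughly equal numbers of (query, key) pairs per band."""
    if not 1 <= num_heads <= seq_len:
        raise ValueError("need 1 <= num_heads <= seq_len")
    total_pairs = seq_len * (seq_len + 1) // 2  # size of the causal triangle
    edges, start, seen = [], 0, 0
    for dist in range(seq_len):
        seen += seq_len - dist  # number of causal pairs at exactly this distance
        closed = len(edges)
        # Close the current band once it reaches its cumulative share of pairs.
        if closed < num_heads - 1 and seen * num_heads >= total_pairs * (closed + 1):
            edges.append((start, dist + 1))
            start = dist + 1
    edges.append((start, seq_len))  # the last head takes the remaining distances
    return edges


def head_attends(head: int, query: int, key: int, edges) -> bool:
    """True if this head is responsible for the (query, key) pair."""
    start, stop = edges[head]
    return start <= query - key < stop


edges = band_edges(seq_len=8, num_heads=4)
for head, (start, stop) in enumerate(edges):
    print(f"head {head}: distances [{start}, {stop})")
```

Because the bands are disjoint and together cover every causal distance, each (query, key) pair is scored by exactly one head, so the heads jointly evaluate one causal triangle instead of $H$ copies of it.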

3. EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training

ArXiv ID: 2511.10333

Authors: Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang

Abstract: Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they still suffer from considerable communication overhead. Existing approaches primarily rely on static gradient compression to enhance communication efficiency; however, these methods neglect the dynamic nature of evolving gradients during training, leading to performance degradation. Accelerating LLM training via compression without sacrificing performance remains a challenge. In this paper, we propose an entropy-driven dynamic gradient compression framework called EDGC. The core concept is to adjust the compression rate during LLM training based on the evolving trends of gradient entropy, taking into account both compression efficiency and error. EDGC consists of three key components. First, it employs a down-sampling method to efficiently estimate gradient entropy, reducing computation overhead. Second, it establishes a theoretical model linking compression rate with gradient entropy, enabling more informed compression decisions. Lastly, a window-based adjustment mechanism dynamically adapts the compression rate across pipeline stages, improving communication efficiency and maintaining model performance. We implemented EDGC on a 32-NVIDIA-V100 cluster and a 64-NVIDIA-H100 cluster to train GPT2-2.5B and GPT2-12.1B, respectively. The results show that EDGC significantly reduces communication latency and training time by up to 46.45% and 16.13%, respectively, while preserving LLM accuracy.

Comment: Compression/Efficiency and HPC systems: entropy-driven dynamic gradient compression for distributed LLM training.

Relevance: 10 Novelty: 8
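
Two of the ingredients named in the EDGC abstract, down-sampled entropy estimation and window-based adjustment, can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the paper's method: the histogram estimator, the stride-based sampling, the `KeepRatioController` class, and especially the linear rule mapping entropy to the fraction of gradient information kept (the paper derives its own theoretical model, which is not reproduced here).

```python
import math
from collections import Counter, deque


def sampled_entropy(grad: list[float], stride: int = 4, num_bins: int = 16) -> float:
    """Shannon entropy (in bits) of a histogram over every `stride`-th value.
    Assumes `grad` is non-empty."""
    sample = grad[::stride]
    lo, hi = min(sample), max(sample)
    if hi == lo:
        return 0.0  # all sampled values identical: no uncertainty
    width = (hi - lo) / num_bins
    counts = Counter(min(int((g - lo) / width), num_bins - 1) for g in sample)
    n = len(sample)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


class KeepRatioController:
    """Hypothetical rule: smooth entropy over a window, then map it linearly
    to a keep ratio (higher entropy -> keep more, i.e. compress less)."""

    def __init__(self, window: int = 3, num_bins: int = 16,
                 min_keep: float = 0.05, max_keep: float = 0.5):
        self.history = deque(maxlen=window)
        self.max_entropy = math.log2(num_bins)
        self.min_keep, self.max_keep = min_keep, max_keep

    def update(self, entropy: float) -> float:
        self.history.append(entropy)
        mean = sum(self.history) / len(self.history)
        frac = mean / self.max_entropy  # normalized to [0, 1]
        return self.min_keep + frac * (self.max_keep - self.min_keep)


controller = KeepRatioController()
grad = [math.sin(i) for i in range(1024)]  # stand-in for a real gradient tensor
print(controller.update(sampled_entropy(grad)))
```

Sampling every `stride`-th value cuts the cost of the entropy estimate by roughly that factor, and the window keeps the keep ratio from reacting to a single noisy step.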

4. Global Convergence of Four-Layer Matrix Factorization under Random Initialization

ArXiv ID: 2511.09925

Authors: Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du

Abstract: Gradient descent dynamics on the deep matrix factorization problem is extensively studied as a simplified theoretical model for deep neural networks. Although the convergence theory for two-layer matrix factorization is well-established, no global convergence guarantee for general deep matrix factorization under random initialization has been established to date. To address this gap, we provide a polynomial-time global convergence guarantee for randomly initialized gradient descent on four-layer matrix factorization, given certain conditions on the target matrix and a standard balanced regularization term. Our analysis employs new techniques to show saddle-avoidance properties of gradient descent dynamics, and extends previous theories to characterize the change in eigenvalues of layer weights.

Comment: Training Dynamics/Theory: first polynomial-time global convergence guarantee for gradient descent on four-layer matrix factorization under random initialization.

Relevance: 9 Novelty: 9
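
For orientation, a four-layer matrix factorization objective with a balancing regularizer is commonly written in the form below. This is a sketch of the usual formulation in the deep matrix factorization literature, not a quotation from the paper: the symbols $M$ (target matrix), $W_1, \dots, W_4$ (layer weights), and $\lambda$ (regularization weight), as well as the constant factors, are assumptions, and the paper's exact regularizer and scaling may differ.

$$
\min_{W_1,\dots,W_4}\; \frac{1}{2}\,\bigl\| W_4 W_3 W_2 W_1 - M \bigr\|_F^2
\;+\; \lambda \sum_{j=1}^{3} \bigl\| W_{j+1}^\top W_{j+1} - W_j W_j^\top \bigr\|_F^2
$$

The regularizer vanishes exactly when adjacent layers are balanced, that is, when $W_{j+1}^\top W_{j+1} = W_j W_j^\top$ for $j = 1, 2, 3$.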

5. Fractional neural attention for efficient multiscale sequence processing

ArXiv ID: 2511.10208

Authors: Cheng Kevin Qu, Andrew Ly, Pulin Gong

Abstract: Attention mechanisms underpin the computational power of Transformer models, which have achieved remarkable success across diverse domains. Yet understanding and extending the principles underlying self-attention remains a key challenge for advancing artificial intelligence. Drawing inspiration from the multiscale dynamics of biological attention and from dynamical systems theory, we introduce Fractional Neural Attention (FNA), a principled, neuroscience-inspired framework for multiscale information processing. FNA models token interactions through Lévy diffusion governed by the fractional Laplacian, intrinsically realizing simultaneous short- and long-range dependencies across multiple scales. This mechanism yields greater expressivity and faster information mixing, advancing the foundational capacity of Transformers. Theoretically, we show that FNA's dynamics are governed by the fractional diffusion equation, and that the resulting attention networks exhibit larger spectral gaps and shorter path lengths -- mechanistic signatures of enhanced computational efficiency. Empirically, FNA achieves competitive text-classification performance even with a single layer and a single head; it also improves performance in image processing and neural machine translation. Finally, the diffusion map algorithm from geometric harmonics enables dimensionality reduction of FNA weights while preserving the intrinsic structure of embeddings and hidden states. Together, these results establish FNA as a principled mechanism connecting self-attention, stochastic dynamics, and geometry, providing an interpretable, biologically grounded foundation for powerful, neuroscience-inspired AI.

Comment: Model Architecture: replaces standard self-attention with Fractional Neural Attention based on fractional Laplacian diffusion for multiscale dependencies; theory links to larger spectral gaps and shorter path lengths (efficiency).