Previous Day 2025-11-13
Monthly Overview 2025-11
Next Day 2025-11-17

Personalized Daily ArXiv Papers 2025-11-14

[gpt-5] Prompt Completion Total
Token 50111 42228 92339
Cost $0.06 $0.42 $0.48

Total arXiv papers: 542

Total scanned papers: 334

Total relevant papers: 20

Table of contents with paper titles:

  1. BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference Authors: Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, Zhengwei Qi

  2. Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off Authors: Mingkuan Zhao, Wentao Hu, Jiayin Wang, Xin Lai, Tianchen Huang, Yuheng Min, Rui Yan, Xiaoyan Zhu

  3. EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training Authors: Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang

  4. Global Convergence of Four-Layer Matrix Factorization under Random Initialization Authors: Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du

  5. Fractional neural attention for efficient multiscale sequence processing Authors: Cheng Kevin Qu, Andrew Ly, Pulin Gong

  6. SVD-NO: Learning PDE Solution Operators with SVD Integral Kernels Authors: Noam Koren, Ralf J. J. Mackenbach, Ruud J. G. van Sloun, Kira Radinsky, Daniel Freedman

  7. Koopman Invariants as Drivers of Emergent Time-Series Clustering in Joint-Embedding Predictive Architectures Authors: Pablo Ruiz-Morales, Dries Vanoost, Davy Pissoort, Mathias Verbeke

  8. Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs Authors: Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna

  9. On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks Authors: Arthur Castello Branco de Oliveira, Dhruv Jatkar, Eduardo Sontag

  10. Rethinking Visual Information Processing in Multimodal LLMs Authors: Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C

  11. Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training Authors: Weilin Wan, Fan Yi, Weizhong Zhang, Quan Zhou, Cheng Jin

  12. TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training Authors: Houming Wu, Ling Chen

  13. Semi-Unified Sparse Dictionary Learning with Learnable Top-K LISTA and FISTA Encoders Authors: Fengsheng Lin, Shengyi Yan, Trac Duy Tran

  14. Black-Box On-Policy Distillation of Large Language Models Authors: Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei

  15. Steering Pretrained Drafters during Speculative Decoding Authors: Fr\'ed\'eric Berdoz, Peer Rheinboldt, Roger Wattenhofer

  16. Efficient Hyperdimensional Computing with Modular Composite Representations Authors: Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy Loutfi, Mauro Olivieri, Denis Kleyko

  17. Generalization Can Emerge in Tabular Foundation Models From a Single Table Authors: Junwei Ma, Nour Shaheen, Alex Labach, Amine Mhedhbi, Frank Hutter, Anthony L. Caterini, Valentin Thomas

  18. Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning Authors: Hyung-Jun Moon, Sung-Bae Cho

  19. Generalizing PDE Emulation with Equation-Aware Neural Operators Authors: Qian-Ze Zhu, Paul Raccuglia, Michael P. Brenner

  20. Continuum Dropout for Neural Differential Equations Authors: Jonghun Lee, YongKyung Oh, Sungil Kim, Dong-Young Lim


1. BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference

ArXiv ID: 2511.10054

Authors: Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, Zhengwei Qi

Abstract: Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.

Comment: Compression/Efficiency + Systems for MoE: exploits expert redundancy to accelerate memory-constrained MoE inference and mitigate PCIe offloading stalls when prefetch fails.

Relevance: 10 Novelty: 8


2. Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

ArXiv ID: 2511.09596

Authors: Mingkuan Zhao, Wentao Hu, Jiayin Wang, Xin Lai, Tianchen Huang, Yuheng Min, Rui Yan, Xiaoyan Zhu

Abstract: The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that while delivering an approximately two-fold increase in training throughput, its performance is on par with standard dense attention, even surpassing it on select key metrics, while consistently outperforming representative sparse attention methods including Longformer, Reformer, and BigBird across all evaluation metrics.

Comment: Introduces principled structural sparsity in multi-head attention, reducing complexity by a factor of H—core Model Architecture and Efficiency.

Relevance: 10 Novelty: 8


3. EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training

ArXiv ID: 2511.10333

Authors: Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang

Abstract: Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they still suffer from considerable communication overhead. Existing approaches primarily rely on static gradient compression to enhance communication efficiency; however, these methods neglect the dynamic nature of evolving gradients during training, leading to performance degradation. Accelerating LLM training via compression without sacrificing performance remains a challenge. In this paper, we propose an entropy-driven dynamic gradient compression framework called EDGC. The core concept is to adjust the compression rate during LLM training based on the evolving trends of gradient entropy, taking into account both compression efficiency and error. EDGC consists of three key components.First, it employs a down-sampling method to efficiently estimate gradient entropy, reducing computation overhead. Second, it establishes a theoretical model linking compression rate with gradient entropy, enabling more informed compression decisions. Lastly, a window-based adjustment mechanism dynamically adapts the compression rate across pipeline stages, improving communication efficiency and maintaining model performance. We implemented EDGC on a 32-NVIDIA-V100 cluster and a 64-NVIDIA-H100 cluster to train GPT2-2.5B and GPT2-12.1B, respectively. The results show that EDGC significantly reduces communication latency and training time by up to 46.45% and 16.13% while preserving LLM accuracy.

Comment: Entropy-driven dynamic gradient compression for distributed LLM training—Compression/Efficiency and HPC systems innovation.

Relevance: 10 Novelty: 8


4. Global Convergence of Four-Layer Matrix Factorization under Random Initialization

ArXiv ID: 2511.09925

Authors: Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du

Abstract: Gradient descent dynamics on the deep matrix factorization problem is extensively studied as a simplified theoretical model for deep neural networks. Although the convergence theory for two-layer matrix factorization is well-established, no global convergence guarantee for general deep matrix factorization under random initialization has been established to date. To address this gap, we provide a polynomial-time global convergence guarantee for randomly initialized gradient descent on four-layer matrix factorization, given certain conditions on the target matrix and a standard balanced regularization term. Our analysis employs new techniques to show saddle-avoidance properties of gradient decent dynamics, and extends previous theories to characterize the change in eigenvalues of layer weights.

Comment: Training Dynamics/Theory: first polynomial-time global convergence guarantee for gradient descent on four-layer matrix factorization under random initialization.

Relevance: 9 Novelty: 9


5. Fractional neural attention for efficient multiscale sequence processing

ArXiv ID: 2511.10208

Authors: Cheng Kevin Qu, Andrew Ly, Pulin Gong

Abstract: Attention mechanisms underpin the computational power of Transformer models, which have achieved remarkable success across diverse domains. Yet understanding and extending the principles underlying self-attention remains a key challenge for advancing artificial intelligence. Drawing inspiration from the multiscale dynamics of biological attention and from dynamical systems theory, we introduce Fractional Neural Attention (FNA), a principled, neuroscience-inspired framework for multiscale information processing. FNA models token interactions through L\'evy diffusion governed by the fractional Laplacian, intrinsically realizing simultaneous short- and long-range dependencies across multiple scales. This mechanism yields greater expressivity and faster information mixing, advancing the foundational capacity of Transformers. Theoretically, we show that FNA's dynamics are governed by the fractional diffusion equation, and that the resulting attention networks exhibit larger spectral gaps and shorter path lengths -- mechanistic signatures of enhanced computational efficiency. Empirically, FNA achieves competitive text-classification performance even with a single layer and a single head; it also improves performance in image processing and neural machine translation. Finally, the diffusion map algorithm from geometric harmonics enables dimensionality reduction of FNA weights while preserving the intrinsic structure of embeddings and hidden states. Together, these results establish FNA as a principled mechanism connecting self-attention, stochastic dynamics, and geometry, providing an interpretable, biologically grounded foundation for powerful, neuroscience-inspired AI.

Comment: Model Architecture: replaces standard self-attention with Fractional Neural Attention based on fractional Laplacian diffusion for multiscale dependencies; theory links to larger spectral gaps and shorter path lengths (efficiency).

Relevance: 9 Novelty: 8


6. SVD-NO: Learning PDE Solution Operators with SVD Integral Kernels

ArXiv ID: 2511.10025

Authors: Noam Koren, Ralf J. J. Mackenbach, Ruud J. G. van Sloun, Kira Radinsky, Daniel Freedman

Abstract: Neural operators have emerged as a promising paradigm for learning solution operators of partial differential equa- tions (PDEs) directly from data. Existing methods, such as those based on Fourier or graph techniques, make strong as- sumptions about the structure of the kernel integral opera- tor, assumptions which may limit expressivity. We present SVD-NO, a neural operator that explicitly parameterizes the kernel by its singular-value decomposition (SVD) and then carries out the integral directly in the low-rank basis. Two lightweight networks learn the left and right singular func- tions, a diagonal parameter matrix learns the singular values, and a Gram-matrix regularizer enforces orthonormality. As SVD-NO approximates the full kernel, it obtains a high de- gree of expressivity. Furthermore, due to its low-rank struc- ture the computational complexity of applying the operator remains reasonable, leading to a practical system. In exten- sive evaluations on five diverse benchmark equations, SVD- NO achieves a new state of the art. In particular, SVD-NO provides greater performance gains on PDEs whose solutions are highly spatially variable. The code of this work is publicly available at https://github.com/2noamk/SVDNO.git.

Comment: Architecture + Low-rank Efficiency: parameterizes neural operator kernels via SVD with learned singular functions and values, enforcing orthonormality for an expressive yet efficient low-rank operator.

Relevance: 9 Novelty: 8


7. Koopman Invariants as Drivers of Emergent Time-Series Clustering in Joint-Embedding Predictive Architectures

ArXiv ID: 2511.09783

Authors: Pablo Ruiz-Morales, Dries Vanoost, Davy Pissoort, Mathias Verbeke

Abstract: Joint-Embedding Predictive Architectures (JEPAs), a powerful class of self-supervised models, exhibit an unexplained ability to cluster time-series data by their underlying dynamical regimes. We propose a novel theoretical explanation for this phenomenon, hypothesizing that JEPA's predictive objective implicitly drives it to learn the invariant subspace of the system's Koopman operator. We prove that an idealized JEPA loss is minimized when the encoder represents the system's regime indicator functions, which are Koopman eigenfunctions. This theory was validated on synthetic data with known dynamics, demonstrating that constraining the JEPA's linear predictor to be a near-identity operator is the key inductive bias that forces the encoder to learn these invariants. We further discuss that this constraint is critical for selecting this interpretable solution from a class of mathematically equivalent but entangled optima, revealing the predictor's role in representation disentanglement. This work demystifies a key behavior of JEPAs, provides a principled connection between modern self-supervised learning and dynamical systems theory, and informs the design of more robust and interpretable time-series models.

Comment: Representation Learning: theoretical link showing JEPAs learn Koopman invariant subspaces under near-identity predictors, explaining emergent regime clustering.

Relevance: 9 Novelty: 8


8. Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs

ArXiv ID: 2511.10480

Authors: Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna

Abstract: Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator(STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE will be publicly available to facilitate further research in distributed machine learning systems.

Comment: High Performance Computing: introduces a symbolic tensor-graph generator to synthesize high-fidelity distributed LLM execution traces and explore parallelization strategies at massive scale.

Relevance: 9 Novelty: 8


9. On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks

ArXiv ID: 2511.09810

Authors: Arthur Castello Branco de Oliveira, Dhruv Jatkar, Eduardo Sontag

Abstract: This paper investigates how the compositional structure of neural networks shapes their optimization landscape and training dynamics. We analyze the gradient flow associated with overparameterized optimization problems, which can be interpreted as training a neural network with linear activations. Remarkably, we show that the global convergence properties can be derived for any cost function that is proper and real analytic. We then specialize the analysis to scalar-valued cost functions, where the geometry of the landscape can be fully characterized. In this setting, we demonstrate that key structural features -- such as the location and stability of saddle points -- are universal across all admissible costs, depending solely on the overparameterized representation rather than on problem-specific details. Moreover, we show that convergence can be arbitrarily accelerated depending on the initialization, as measured by an imbalance metric introduced in this work. Finally, we discuss how these insights may generalize to neural networks with sigmoidal activations, showing through a simple example which geometric and dynamical properties persist beyond the linear case.

Comment: Training Dynamics/Representation Learning: theoretical analysis of gradient flow and convergence in overparameterized neural compositions, characterizing saddles and initialization effects.

Relevance: 9 Novelty: 8


10. Rethinking Visual Information Processing in Multimodal LLMs

ArXiv ID: 2511.10301

Authors: Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C

Abstract: Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.

Comment: Model Architecture: LLaViT introduces modality-specific QKV, bidirectional attention over visual tokens, and multi-scale representations to better integrate vision into LLMs.

Relevance: 9 Novelty: 8


11. Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training

ArXiv ID: 2511.09901

Authors: Weilin Wan, Fan Yi, Weizhong Zhang, Quan Zhou, Cheng Jin

Abstract: Modern deep neural networks rely heavily on massive model weights and training samples, incurring substantial computational costs. Weight pruning and coreset selection are two emerging paradigms proposed to improve computational efficiency. In this paper, we first explore the interplay between redundant weights and training samples through a transparent analysis: redundant samples, particularly noisy ones, cause model weights to become unnecessarily overtuned to fit them, complicating the identification of irrelevant weights during pruning; conversely, irrelevant weights tend to overfit noisy data, undermining coreset selection effectiveness. To further investigate and harness this interplay in deep learning, we develop a Simultaneous Weight and Sample Tailoring mechanism (SWaST) that alternately performs weight pruning and coreset selection to establish a synergistic effect in training. During this investigation, we observe that when simultaneously removing a large number of weights and samples, a phenomenon we term critical double-loss can occur, where important weights and their supportive samples are mistakenly eliminated at the same time, leading to model instability and nearly irreversible degradation that cannot be recovered in subsequent training. Unlike classic machine learning models, this issue can arise in deep learning due to the lack of theoretical guarantees on the correctness of weight pruning and coreset selection, which explains why these paradigms are often developed independently. We mitigate this by integrating a state preservation mechanism into SWaST, enabling stable joint optimization. Extensive experiments reveal a strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, delivering accuracy boosts of up to 17.83% alongside 10% to 90% FLOPs reductions.

Comment: Model Compression and Efficiency: jointly explores weight pruning and coreset selection with a new SWaST mechanism and state preservation to stabilize simultaneous reduction.

Relevance: 9 Novelty: 8


12. TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training

ArXiv ID: 2511.09741

Authors: Houming Wu, Ling Chen

Abstract: Training large language models (LLMs) is fundamentally constrained by limited device memory and costly inter-device communication. Although pipeline parallelism alleviates memory pressure by partitioning models across devices, it incurs activation communication overhead that scales linearly with sequence length, limiting efficiency in long-context training. Recent weight-passing approaches (e.g., WeiPipe) mitigate this by transmitting model weights instead of activations, but suffer from redundant peer-to-peer (P2P) transfers and underutilized intra-node bandwidth. We propose TawPipe--topology-aware weight pipeline parallelism, which exploits hierarchical bandwidth in distributed clusters for improved communication efficiency. TawPipe: (i) groups devices based on topology to optimize intra-node collective and inter-node P2P communication; (ii) assigns each device a fixed shard of model weights and gradients, avoiding redundant transfers; and (iii) overlaps communication with computation to hide latency. Unlike global collective operations used in fully sharded data parallelism (FSDP), TawPipe confines most communication within node boundaries, significantly reducing cross-node traffic. Extensive experiments on up to 24 GPUs with LLaMA-style models show that TawPipe achieves superior throughput and scalability compared to state-of-the-art baselines.

Comment: High Performance Computing — topology-aware weight pipeline parallelism that reduces cross-node traffic and overlaps compute/communication to scale long-context LLM training.

Relevance: 9 Novelty: 8


13. Semi-Unified Sparse Dictionary Learning with Learnable Top-K LISTA and FISTA Encoders

ArXiv ID: 2511.10575

Authors: Fengsheng Lin, Shengyi Yan, Trac Duy Tran

Abstract: We present a semi-unified sparse dictionary learning framework that bridges the gap between classical sparse models and modern deep architectures. Specifically, the method integrates strict Top-$K$ LISTA and its convex FISTA-based variant (LISTAConv) into the discriminative LC-KSVD2 model, enabling co-evolution between the sparse encoder and the dictionary under supervised or unsupervised regimes. This unified design retains the interpretability of traditional sparse coding while benefiting from efficient, differentiable training. We further establish a PALM-style convergence analysis for the convex variant, ensuring theoretical stability under block alternation. Experimentally, our method achieves 95.6\% on CIFAR-10, 86.3\% on CIFAR-100, and 88.5\% on TinyImageNet with faster convergence and lower memory cost ($<$4GB GPU). The results confirm that the proposed LC-KSVD2 + LISTA/LISTAConv pipeline offers an interpretable and computationally efficient alternative for modern deep architectures.

Comment: Representation Learning and Sparsity: semi-unified sparse dictionary learning with learnable Top-K LISTA/LISTAConv encoders and PALM-style convergence analysis.

Relevance: 9 Novelty: 7


14. Black-Box On-Policy Distillation of Large Language Models

ArXiv ID: 2511.10643

Authors: Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei

Abstract: Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.

Comment: Model Compression and Efficiency: introduces on-policy black-box distillation via a GAN-style discriminator, a new distillation paradigm without access to teacher logits/params.

Relevance: 8 Novelty: 7


15. Steering Pretrained Drafters during Speculative Decoding

ArXiv ID: 2511.09844

Authors: Fr\'ed\'eric Berdoz, Peer Rheinboldt, Roger Wattenhofer

Abstract: Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier's hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35\% under standard sampling and 22\% under greedy sampling, all while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.

Comment: Model Efficiency/HPC: improves speculative decoding via a lightweight dynamic alignment (steering vector) to increase token acceptance with negligible overhead.

Relevance: 8 Novelty: 7


16. Efficient Hyperdimensional Computing with Modular Composite Representations

ArXiv ID: 2511.09708

Authors: Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy Loutfi, Mauro Olivieri, Denis Kleyko

Abstract: The modular composite representation (MCR) is a computing model that represents information with high-dimensional integer vectors using modular arithmetic. Originally proposed as a generalization of the binary spatter code model, it aims to provide higher representational power while remaining a lighter alternative to models requiring high-precision components. Despite this potential, MCR has received limited attention. Systematic analyses of its trade-offs and comparisons with other models are lacking, sustaining the perception that its added complexity outweighs the improved expressivity. In this work, we revisit MCR by presenting its first extensive evaluation, demonstrating that it achieves a unique balance of capacity, accuracy, and hardware efficiency. Experiments measuring capacity demonstrate that MCR outperforms binary and integer vectors while approaching complex-valued representations at a fraction of their memory footprint. Evaluation on 123 datasets confirms consistent accuracy gains and shows that MCR can match the performance of binary spatter codes using up to 4x less memory. We investigate the hardware realization of MCR by showing that it maps naturally to digital logic and by designing the first dedicated accelerator. Evaluations on basic operations and 7 selected datasets demonstrate a speedup of up to 3 orders of magnitude and significant energy reductions compared to software implementation. When matched for accuracy against binary spatter codes, MCR achieves on average 3.08x faster execution and 2.68x lower energy consumption. These findings demonstrate that, although MCR requires more sophisticated operations than binary spatter codes, its modular arithmetic and higher per-component precision enable lower dimensionality. When realized with dedicated hardware, this results in a faster, more energy-efficient, and high-precision alternative to existing models.

Comment: Representation Learning + Efficiency/HW: introduces modular composite high-dimensional representations with analysis of capacity/accuracy and a dedicated accelerator for efficient implementation.

Relevance: 8 Novelty: 7


17. Generalization Can Emerge in Tabular Foundation Models From a Single Table

ArXiv ID: 2511.09665

Authors: Junwei Ma, Nour Shaheen, Alex Labach, Amine Mhedhbi, Frank Hutter, Anthony L. Caterini, Valentin Thomas

Abstract: Deep tabular modelling increasingly relies on in-context learning where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), discovering that a relatively small amount of data suffices for generalization. We find that simple self-supervised pre-training on just a \emph{single} real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze what aspects of the data are most important for building a Tabular Foundation Model (TFM) generalizing across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of \emph{tasks} one can construct from a dataset is key to downstream performance.

Comment: Representation Learning: analyses/pretrains tabular foundation models showing generalization can emerge from self-supervision on a single table; highlights task-construction effects.

Relevance: 8 Novelty: 7


18. Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning

ArXiv ID: 2511.09871

Authors: Hyung-Jun Moon, Sung-Bae Cho

Abstract: Continual learning methods used to force neural networks to process sequential tasks in isolation, preventing them from leveraging useful inter-task relationships and causing them to repeatedly relearn similar features or overly differentiate them. To address this problem, we propose a fully differentiable, exemplar-free expandable method composed of two complementary memories: One learns common features that can be used across all tasks, and the other combines the shared features to learn discriminative characteristics unique to each sample. Both memories are differentiable so that the network can autonomously learn latent representations for each sample. For each task, the memory adjustment module adaptively prunes critical slots and minimally expands capacity to accommodate new concepts, and orthogonal regularization enforces geometric separation between preserved and newly learned memory components to prevent interference. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that the proposed method outperforms 14 state-of-the-art methods for class-incremental learning, achieving final accuracies of 55.13\%, 37.24\%, and 30.11\%, respectively. Additional analysis confirms that, through effective integration and utilization of knowledge, the proposed method can increase average performance across sequential tasks, and it produces feature extraction results closest to the upper bound, thus establishing a new milestone in continual learning.

Comment: Differentiable dual-memory architecture with orthogonal regularization and adaptive pruning/expansion—Model Architecture for continual learning.

Relevance: 8 Novelty: 7


19. Generalizing PDE Emulation with Equation-Aware Neural Operators

ArXiv ID: 2511.09729

Authors: Qian-Ze Zhu, Paul Raccuglia, Michael P. Brenner

Abstract: Solving partial differential equations (PDEs) can be prohibitively expensive using traditional numerical methods. Deep learning-based surrogate models typically specialize in a single PDE with fixed parameters. We present a framework for equation-aware emulation that generalizes to unseen PDEs, conditioning a neural model on a vector encoding representing the terms in a PDE and their coefficients. We present a baseline of four distinct modeling technqiues, trained on a family of 1D PDEs from the APEBench suite. Our approach achieves strong performance on parameter sets held out from the training distribution, with strong stability for rollout beyond the training window, and generalization to an entirely unseen PDE. This work was developed as part of a broader effort exploring AI systems that automate the creation of expert-level empirical software for scorable scientific tasks. The data and codebase are available at https://github.com/google-research/generalized-pde-emulator.

Comment: Model Architecture and Representation Learning: equation-aware neural operator conditioned on PDE term encodings to generalize across PDE families.

Relevance: 8 Novelty: 7


20. Continuum Dropout for Neural Differential Equations

ArXiv ID: 2511.10446

Authors: Jonghun Lee, YongKyung Oh, Sungil Kim, Dong-Young Lim

Abstract: Neural Differential Equations (NDEs) excel at modeling continuous-time dynamics, effectively handling challenges such as irregular observations, missing values, and noise. Despite their advantages, NDEs face a fundamental challenge in adopting dropout, a cornerstone of deep learning regularization, making them susceptible to overfitting. To address this research gap, we introduce Continuum Dropout, a universally applicable regularization technique for NDEs built upon the theory of alternating renewal processes. Continuum Dropout formulates the on-off mechanism of dropout as a stochastic process that alternates between active (evolution) and inactive (paused) states in continuous time. This provides a principled approach to prevent overfitting and enhance the generalization capabilities of NDEs. Moreover, Continuum Dropout offers a structured framework to quantify predictive uncertainty via Monte Carlo sampling at test time. Through extensive experiments, we demonstrate that Continuum Dropout outperforms existing regularization methods for NDEs, achieving superior performance on various time series and image classification tasks. It also yields better-calibrated and more trustworthy probability estimates, highlighting its effectiveness for uncertainty-aware modeling.

Comment: Training Dynamics/Regularization for Neural Differential Equations: introduces a principled continuous-time dropout mechanism improving generalization and uncertainty quantification.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.