Personalized Daily ArXiv Papers 2025-09-11

[gpt-5]	Prompt	Completion	Total
Token	31263	32300	63563
Cost	$0.04	$0.32	$0.36

Total arXiv papers: 365

Total scanned papers: 223

Total relevant papers: 17

Table of contents with paper titles:

Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang
EvolKV: Evolutionary KV Cache Compression for LLM Inference Authors: Bohan Yu, Yekun Chai
OCTANE -- Optimal Control for Tensor-based Autoencoder Network Emergence: Explicit Case Authors: Ratna Khatri, Anthony Kolshorn, Colin Olson, Harbir Antil
Facet: highly efficient E(3)-equivariant networks for interatomic potentials Authors: Nicholas Miklaucic, Lai Wei, Rongzhi Dong, Nihang Fu, Sadman Sadeed Omee, Qingyang Li, Sourin Dey, Victor Fung, Jianjun Hu
Selective Induction Heads: How Transformers Select Causal Structures In Context Authors: Francesco D'Angelo, Francesco Croce, Nicolas Flammarion
Towards Interpretable Deep Neural Networks for Tabular Data Authors: Khawla Elhadri, Jörg Schlötterer, Christin Seifert
Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning Authors: Mominul Rubel, Adam Meyers, Gabriel Nicolosi
Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning Authors: Wei Huang, Anda Cheng, Yinggui Wang
Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models Authors: Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Rong-Hua Li, Guoren Wang
Variational Rank Reduction Autoencoders for Generative Authors: Alicia Tierz, Jad Mounayer, Beatriz Moya, Francisco Chinesta
Merge-of-Thought Distillation Authors: Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao
DEQuify your force field: More efficient simulations using deep equilibrium models Authors: Andreas Burger, Luca Thiede, Alán Aspuru-Guzik, Nandita Vijaykumar
Decentralized Stochastic Nonconvex Optimization under the Relaxed Smoothness Authors: Luo Luo, Xue Cui, Tingkai Jia, Cheng Chen
Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition Authors: Jing-Tong Tzeng, Carlos Busso, Chi-Chun Lee
Efficient Decoding Methods for Language Models on Encrypted Data Authors: Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg
Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis Authors: Matias D. Cattaneo, Boris Shigida
Tokenizing Loops of Antibodies Authors: Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer

1. Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism

ArXiv ID: 2509.08342

Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang

Abstract: Mixture-of-Experts (MoE) has emerged as a promising architecture for modern large language models (LLMs). However, massive parameters impose heavy GPU memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs. Offloading the expert parameters to CPU RAM offers an effective way to alleviate the VRAM requirements for MoE inference. Existing approaches typically cache a small subset of experts in VRAM and dynamically prefetch experts from RAM during inference, leading to significant degradation in inference speed due to the poor cache hit rate and substantial expert loading latency. In this work, we propose MoEpic, an efficient MoE inference system with a novel expert split mechanism. Specifically, each expert is vertically divided into two segments: top and bottom. MoEpic caches the top segment of hot experts, so that more experts will be stored under the limited VRAM budget, thereby improving the cache hit rate. During each layer's inference, MoEpic predicts and prefetches the activated experts for the next layer. Since the top segments of cached experts are exempt from fetching, the loading time is reduced, which allows efficient transfer-computation overlap. Nevertheless, the performance of MoEpic critically depends on the cache configuration (i.e., each layer's VRAM budget and expert split ratio). To this end, we propose a divide-and-conquer algorithm based on fixed-point iteration for adaptive cache configuration. Extensive experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost, while lowering the inference latency by about 37.51%-65.73% compared to the baselines.

Comment: High Performance Computing and Efficiency — MoE-specific inference system with expert offloading, caching/prefetch, vertical expert split, and adaptive cache configuration to reduce VRAM and latency.

Relevance: 10 Novelty: 8
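The trade-off behind the expert split can be illustrated with a toy accounting model. This is only a sketch under stated assumptions, not MoEpic's algorithm: the function name `split_cache_stats`, the activation frequencies, and the idea of measuring VRAM in "full expert" units are all hypothetical. It assumes the hottest experts are cached, that a hit still requires fetching the bottom segment, and that a miss requires fetching the whole expert.

```python
def split_cache_stats(freqs, vram_budget, split_ratio):
    """Toy model of caching only the top segment of hot experts.

    freqs       -- per-expert activation probabilities (assumed to sum to 1)
    vram_budget -- cache size, measured in full-expert units (assumption)
    split_ratio -- fraction of each expert that forms the cached top segment
    """
    if not 0.0 < split_ratio <= 1.0:
        raise ValueError("split_ratio must be in (0, 1]")
    # Smaller top segments let more experts fit in the same budget.
    n_cached = min(len(freqs), int(vram_budget / split_ratio))
    hit_rate = sum(sorted(freqs, reverse=True)[:n_cached])
    # A hit still fetches the bottom segment; a miss fetches the whole expert.
    expected_fetch = hit_rate * (1.0 - split_ratio) + (1.0 - hit_rate)
    return n_cached, hit_rate, expected_fetch


freqs = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]
for ratio in (1.0, 0.5, 0.25):
    n, hit, fetch = split_cache_stats(freqs, vram_budget=2, split_ratio=ratio)
    print(f"ratio={ratio}: cached={n}, hit_rate={hit:.2f}, fetch={fetch:.2f}")
```

In this toy setting a smaller split ratio raises the hit rate but leaves a larger bottom segment to fetch on every hit, which is consistent with the abstract's remark that performance depends critically on each layer's VRAM budget and split ratio.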

2. EvolKV: Evolutionary KV Cache Compression for LLM Inference

ArXiv ID: 2509.08315

Authors: Bohan Yu, Yekun Chai

Abstract: Existing key-value (KV) cache compression methods typically rely on heuristics, such as uniform cache allocation across layers or static eviction policies. However, they ignore the critical interplay between layer-specific feature patterns and task performance, which can lead to degraded generalization. In this paper, we propose EvolKV, an adaptive framework for layer-wise, task-driven KV cache compression that jointly optimizes memory efficiency and task performance. By reformulating cache allocation as a multi-objective optimization problem, EvolKV leverages evolutionary search to dynamically configure layer budgets while directly maximizing downstream performance. Extensive experiments on 11 tasks demonstrate that our approach outperforms all baseline methods across a wide range of KV cache budgets on long-context tasks and surpasses heuristic baselines by up to 7 percentage points on GSM8K. Notably, EvolKV achieves superior performance over the full KV cache setting on code completion while utilizing only 1.5% of the original budget, suggesting the untapped potential in learned compression strategies for KV cache budget allocation.

Comment: Efficiency/HPC criterion: task-driven KV cache compression for LLM inference via adaptive layer-wise budget allocation using evolutionary search.

Relevance: 10 Novelty: 8
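A minimal sketch of evolutionary search over per-layer cache budgets follows. It is not EvolKV's procedure: the single-objective `fitness` function, the `importance` weights, and all parameter names are hypothetical stand-ins for a real downstream task score, and the paper describes a multi-objective formulation. The sketch only shows the general pattern of mutating an allocation while keeping the total budget fixed.

```python
import math
import random


def fitness(budgets, importance):
    # Hypothetical proxy for task performance: diminishing returns per layer.
    return sum(w * math.log1p(b) for w, b in zip(importance, budgets))


def mutate(budgets, rng, step):
    # Move up to `step` cache slots from one layer to another (total unchanged).
    child = list(budgets)
    src, dst = rng.sample(range(len(child)), 2)
    moved = min(step, child[src])
    child[src] -= moved
    child[dst] += moved
    return child


def evolve(importance, total_budget, generations=200, pop_size=8, step=16, seed=0):
    rng = random.Random(seed)
    n = len(importance)
    uniform = [total_budget // n] * n
    uniform[0] += total_budget - sum(uniform)  # absorb any rounding remainder
    population = [uniform]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng, step) for _ in range(pop_size)]
        # Elitist selection: parents compete with children, best pop_size survive.
        population = sorted(population + children,
                            key=lambda b: fitness(b, importance),
                            reverse=True)[:pop_size]
    return uniform, population[0]


importance = [1.0, 4.0, 2.0, 0.5]  # assumed per-layer sensitivities
uniform, best = evolve(importance, total_budget=1024)
print("uniform:", uniform, round(fitness(uniform, importance), 3))
print("evolved:", best, round(fitness(best, importance), 3))
```

Because selection is elitist and the uniform allocation is in the initial population, the evolved allocation never scores worse than the uniform baseline under the assumed fitness.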

3. OCTANE -- Optimal Control for Tensor-based Autoencoder Network Emergence: Explicit Case

ArXiv ID: 2509.08169

Authors: Ratna Khatri, Anthony Kolshorn, Colin Olson, Harbir Antil

Abstract: This paper presents a novel, mathematically rigorous framework for autoencoder-type deep neural networks that combines optimal control theory and low-rank tensor methods to yield memory-efficient training and automated architecture discovery. The learning task is formulated as an optimization problem constrained by differential equations representing the encoder and decoder components of the network and the corresponding optimality conditions are derived via a Lagrangian approach. Efficient memory compression is enabled by approximating differential equation solutions on low-rank tensor manifolds using an adaptive explicit integration scheme. These concepts are combined to form OCTANE (Optimal Control for Tensor-based Autoencoder Network Emergence) -- a unified training framework that yields compact autoencoder architectures, reduces memory usage, and enables effective learning, even with limited training data. The framework's utility is illustrated with application to image denoising and deblurring tasks and recommendations regarding governing hyperparameters are provided.

Comment: Model Architecture and Compression/Efficiency — optimal-control formulation of autoencoders with low-rank tensor manifold integration for memory-efficient training and automated architecture discovery.

Relevance: 9 Novelty: 8

4. Facet: highly efficient E(3)-equivariant networks for interatomic potentials

ArXiv ID: 2509.08418

Authors: Nicholas Miklaucic, Lai Wei, Rongzhi Dong, Nihang Fu, Sadman Sadeed Omee, Qingyang Li, Sourin Dey, Victor Fung, Jianjun Hu

Abstract: Computational materials discovery is limited by the high cost of first-principles calculations. Machine learning (ML) potentials that predict energies from crystal structures are promising, but existing methods face computational bottlenecks. Steerable graph neural networks (GNNs) encode geometry with spherical harmonics, respecting atomic symmetries -- permutation, rotation, and translation -- for physically realistic predictions. Yet maintaining equivariance is difficult: activation functions must be modified, and each layer must handle multiple data types for different harmonic orders. We present Facet, a GNN architecture for efficient ML potentials, developed through systematic analysis of steerable GNNs. Our innovations include replacing expensive multi-layer perceptrons (MLPs) for interatomic distances with splines, which match performance while cutting computational and memory demands. We also introduce a general-purpose equivariant layer that mixes node information via spherical grid projection followed by standard MLPs -- faster than tensor products and more expressive than linear or gate layers. On the MPTrj dataset, Facet matches leading models with far fewer parameters and under 10% of their training compute. On a crystal relaxation task, it runs twice as fast as MACE models. We further show SevenNet-0's parameters can be reduced by over 25% with no accuracy loss. These techniques enable more than 10x faster training of large-scale foundation models for ML potentials, potentially reshaping computational materials discovery.

Comment: Model Architecture and Efficiency: presents a faster E(3)-equivariant layer via spherical-grid projection plus MLPs and spline distance encodings, cutting compute/memory and enabling >10x faster training of equivariant GNNs.

Relevance: 9 Novelty: 8

5. Selective Induction Heads: How Transformers Select Causal Structures In Context

ArXiv ID: 2509.08184

Authors: Francesco D'Angelo, Francesco Croce, Nicolas Flammarion

Abstract: Transformers have exhibited exceptional capabilities in sequence modeling tasks, leveraging self-attention and in-context learning. Critical to this success are induction heads, attention circuits that enable copying tokens based on their previous occurrences. In this work, we introduce a novel framework that showcases transformers' ability to dynamically handle causal structures. Existing works rely on Markov Chains to study the formation of induction heads, revealing how transformers capture causal dependencies and learn transition probabilities in-context. However, they rely on a fixed causal structure that fails to capture the complexity of natural languages, where the relationship between tokens dynamically changes with context. To this end, our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context. We empirically demonstrate that transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We provide a detailed construction of a 3-layer transformer to implement the selective induction head, and a theoretical analysis proving that this mechanism asymptotically converges to the maximum likelihood solution. Our findings advance the understanding of how transformers select causal structures, providing new insights into their functioning and interpretability.

Comment: Representation Learning/Architecture Analysis: mechanistic study of transformers introducing selective induction heads, with a constructive 3-layer design and theory on convergence to MLE.

Relevance: 9 Novelty: 8
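The data setting described in the abstract of paper 5 (a fixed transition matrix, with the lag varying across sequences) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the 3-state matrix `P`, and the candidate lag set are assumptions. A lag-k chain is equivalent to k interleaved first-order chains, and `select_lag` picks the lag with the highest log-likelihood, which is the maximum likelihood choice that the abstract says the learned mechanism converges to.

```python
import math
import random


def sample_sequence(P, lag, length, rng):
    """Sample x_t ~ P[x_{t-lag}]; the first `lag` tokens are uniform."""
    n = len(P)
    seq = [rng.randrange(n) for _ in range(lag)]
    for t in range(lag, length):
        seq.append(rng.choices(range(n), weights=P[seq[t - lag]])[0])
    return seq


def select_lag(seq, P, lags):
    """Return the candidate lag with the highest log-likelihood."""
    start = max(lags)  # score the same positions for every candidate

    def score(k):
        return sum(math.log(P[seq[t - k]][seq[t]]) for t in range(start, len(seq)))

    return max(lags, key=score)


# Assumed transition matrix, shared by all lags.
P = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]

rng = random.Random(0)
seq = sample_sequence(P, lag=2, length=2000, rng=rng)
print("selected lag:", select_lag(seq, P, lags=(1, 2, 3)))
```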

6. Towards Interpretable Deep Neural Networks for Tabular Data

ArXiv ID: 2509.08617

Authors: Khawla Elhadri, Jörg Schlötterer, Christin Seifert

Abstract: Tabular data is the foundation of many applications in fields such as finance and healthcare. Although DNNs tailored for tabular data achieve competitive predictive performance, they are blackboxes with little interpretability. We introduce XNNTab, a neural architecture that uses a sparse autoencoder (SAE) to learn a dictionary of monosemantic features within the latent space used for prediction. Using an automated method, we assign human-interpretable semantics to these features. This allows us to represent predictions as linear combinations of semantically meaningful components. Empirical evaluations demonstrate that XNNTab attains performance on par with or exceeding that of state-of-the-art, black-box neural models and classical machine learning approaches while being fully interpretable.

Comment: Matches Representation Learning and Model Architecture: employs a sparse autoencoder to learn a dictionary of monosemantic features with interpretable latent components for prediction.

Relevance: 9 Novelty: 7
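The idea in paper 6 of representing a prediction as a linear combination of labeled sparse features can be sketched like this. The function `explain_prediction`, the feature labels, and the numbers are all hypothetical; the sketch assumes a linear prediction head on top of sparse dictionary activations and is not XNNTab's implementation.

```python
def explain_prediction(activations, head_weights, bias, labels, top_k=3):
    """Decompose a linear-head output into per-feature contributions."""
    # Only active (non-zero) dictionary features contribute to the output.
    contributions = {
        labels[i]: a * head_weights[i]
        for i, a in enumerate(activations)
        if a != 0.0
    }
    logit = bias + sum(contributions.values())
    top = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return logit, top[:top_k]


# Hypothetical sparse activations and human-readable feature labels.
labels = ["young applicant", "high income", "many open loans", "short credit history"]
activations = [0.0, 2.0, 0.0, 0.5]
head_weights = [1.0, 0.75, -3.0, -1.0]

logit, top = explain_prediction(activations, head_weights, bias=0.1, labels=labels)
print("logit:", logit)
for name, value in top:
    print(f"  {name}: {value:+.2f}")
```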

7. Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning

ArXiv ID: 2509.08759

Authors: Mominul Rubel, Adam Meyers, Gabriel Nicolosi

Abstract: We introduce the Fourier Learning Machine (FLM), a neural network (NN) architecture designed to represent a multidimensional nonharmonic Fourier series. The FLM uses a simple feedforward structure with cosine activation functions to learn the frequencies, amplitudes, and phase shifts of the series as trainable parameters. This design allows the model to create a problem-specific spectral basis adaptable to both periodic and nonperiodic functions. Unlike previous Fourier-inspired NN models, the FLM is the first architecture able to represent a complete, separable Fourier basis in multiple dimensions using a standard Multilayer Perceptron-like architecture. A one-to-one correspondence between the Fourier coefficients and amplitudes and phase shifts is demonstrated, allowing for the translation between a full, separable basis form and the cosine phase-shifted one. Additionally, we evaluate the performance of FLMs on several scientific computing problems, including benchmark Partial Differential Equations (PDEs) and a family of Optimal Control Problems (OCPs). Computational experiments show that the performance of FLMs is comparable, and often superior, to that of established architectures like SIREN and vanilla feedforward NNs.

Comment: Model Architecture and Representation Learning: introduces a Fourier-based MLP with cosine activations that learns frequencies/amplitudes/phases, yielding a separable Fourier basis with one-to-one mapping to coefficients.

Relevance: 9 Novelty: 7
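The correspondence between Fourier coefficients and amplitude/phase pairs mentioned in the abstract of paper 7 rests, in one dimension, on the standard identity a*cos(wx) + b*sin(wx) = A*cos(wx - phi), with A = sqrt(a^2 + b^2) and phi = atan2(b, a). The sketch below checks that identity numerically for nonharmonic (non-integer-multiple) frequencies. The function names and the sign convention for the phase are assumptions; the paper's multidimensional construction is not reproduced here.

```python
import math


def to_amplitude_phase(a, b):
    """(a, b) in a*cos + b*sin  ->  (A, phi) in A*cos(wx - phi)."""
    return math.hypot(a, b), math.atan2(b, a)


def to_coefficients(amplitude, phase):
    """Inverse map: a = A*cos(phi), b = A*sin(phi)."""
    return amplitude * math.cos(phase), amplitude * math.sin(phase)


def series_cos_sin(x, freqs, coeffs):
    return sum(a * math.cos(w * x) + b * math.sin(w * x)
               for w, (a, b) in zip(freqs, coeffs))


def series_phase_shifted(x, freqs, amp_phase):
    # One cosine unit per frequency, as in a cosine-activated layer.
    return sum(A * math.cos(w * x - phi) for w, (A, phi) in zip(freqs, amp_phase))


freqs = [1.3, 2.7, 4.1]  # assumed nonharmonic frequencies
coeffs = [(0.5, -1.0), (2.0, 0.25), (-0.75, 0.6)]
amp_phase = [to_amplitude_phase(a, b) for a, b in coeffs]

for x in (0.0, 0.4, 1.9):
    lhs = series_cos_sin(x, freqs, coeffs)
    rhs = series_phase_shifted(x, freqs, amp_phase)
    print(f"x={x}: difference={abs(lhs - rhs):.2e}")
```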

8. Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning

ArXiv ID: 2509.08255

Authors: Wei Huang, Anda Cheng, Yinggui Wang

Abstract: Recent advancements in large language models (LLMs) have shown impressive capabilities in various downstream tasks but typically face Catastrophic Forgetting (CF) during fine-tuning. In this paper, we propose the Forgetting-Aware Pruning Metric (FAPM), a novel pruning-based approach to balance CF and downstream task performance. Our investigation reveals that the degree to which task vectors (i.e., the subtraction of pre-trained weights from the weights fine-tuned on downstream tasks) overlap with pre-trained model parameters is a critical factor for CF. Based on this finding, FAPM employs the ratio of the task vector to pre-trained model parameters as a metric to quantify CF, integrating this measure into the pruning criteria. Importantly, FAPM does not necessitate modifications to the training process or model architecture, nor does it require any auxiliary data. We conducted extensive experiments across eight datasets, covering natural language inference, General Q&A, Medical Q&A, Math Q&A, reading comprehension, and cloze tests. The results demonstrate that FAPM limits CF to just 0.25% while maintaining 99.67% accuracy on downstream tasks. We provide the code to reproduce our results.

Comment: Model Compression and Efficiency — pruning-based method (FAPM) using a task-vector criterion to control catastrophic forgetting without changing training or architecture.

Relevance: 9 Novelty: 7

9. Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models

ArXiv ID: 2509.08401

Authors: Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Rong-Hua Li, Guoren Wang

Abstract: Graph foundation models, inspired by the success of LLMs, are designed to learn the optimal embedding from multi-domain TAGs for the downstream cross-task generalization capability. During our investigation, graph VQ-MAE stands out among the increasingly diverse landscape of GFM architectures. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.

Comment: Model Architecture and Representation Learning criteria: mixture-of-codebooks with domain-aware routing (MoE-like) plus regularization to prevent representation collapse in graph foundation models.

Relevance: 9 Novelty: 7

10. Variational Rank Reduction Autoencoders for Generative

ArXiv ID: 2509.08515

Authors: Alicia Tierz, Jad Mounayer, Beatriz Moya, Francisco Chinesta

Abstract: Generative thermal design for complex geometries is fundamental in many areas of engineering, yet it faces two main challenges: the high computational cost of high-fidelity simulations and the limitations of conventional generative models. Approaches such as autoencoders (AEs) and variational autoencoders (VAEs) often produce unstructured latent spaces with discontinuities, which restricts their capacity to explore designs and generate physically consistent solutions. To address these limitations, we propose a hybrid framework that combines Variational Rank-Reduction Autoencoders (VRRAEs) with Deep Operator Networks (DeepONets). The VRRAE introduces a truncated SVD within the latent space, leading to continuous, interpretable, and well-structured representations that mitigate posterior collapse and improve geometric reconstruction. The DeepONet then exploits this compact latent encoding in its branch network, together with spatial coordinates in the trunk network, to predict temperature gradients efficiently and accurately. This hybrid approach not only enhances the quality of generated geometries and the accuracy of gradient prediction, but also provides a substantial advantage in inference efficiency compared to traditional numerical solvers. Overall, the study underscores the importance of structured latent representations for operator learning and highlights the potential of combining generative models and operator networks in thermal design and broader engineering applications.

Comment: Matches Model Architecture (autoencoders) using low-rank latent factorization (truncated SVD) to structure representations and mitigate posterior collapse; also aligns with Compression/Efficiency via rank reduction.