Personalized Daily ArXiv Papers 2025-09-16

[gpt-5]	Prompt	Completion	Total
Token	65210	69460	134670
Cost	$0.08	$0.69	$0.78

Total arXiv papers: 826

Total scanned papers: 531

Total relevant papers: 37

Table of contents with paper titles:

On Linear Mode Connectivity of Mixture-of-Experts Architectures Authors: Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training Authors: Chuan He, Zhanwang Deng, Zhaosong Lu
AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models Authors: Sangjun Lee, Seung-taek Woo, Jungyu Jin, Changhun Lee, Eunhyeok Park
From PowerSGD to PowerSGD+: Low-Rank Gradient Compression for Distributed Optimization with Convergence Guarantees Authors: Shengping Xie, Chuyan Chen, Kun Yuan
AQUA: Attention via QUery mAgnitudes for Memory and Compute Efficient Inference in LLMs Authors: Santhosh G S, Saurav Prakash, Balaraman Ravindran
Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings Authors: Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer
Long-time dynamics and universality of nonconvex gradient descent Authors: Qiyang Han
Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning Authors: Sugyeong Eo, Jungjun Lee, Chanjun Park, Heuiseok Lim
Resource-Aware Neural Network Pruning Using Graph-based Reinforcement Learning Authors: Dieter Balemans, Thomas Huybrechts, Jan Steckel, Siegfried Mercelis
Harnessing Optimization Dynamics for Curvature-Informed Model Merging Authors: Pouria Mahdavinia, Hamed Mahdavi, Niloofar Mireshghallah, Mehrdad Mahdavi
Contextuality, Holonomy and Discrete Fiber Bundles in Group-Valued Boltzmann Machines Authors: Jean-Pierre Magnot
Contrastive Network Representation Learning Authors: Zihan Dong, Xin Zhou, Ryumei Nakada, Lexin Li, Linjun Zhang
SpecVLM: Fast Speculative Decoding in Vision-Language Models Authors: Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts Authors: Cheng Li, Jiexiong Liu, Yixuan Chen, Jie ji
Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning Authors: Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, Hongyuan Zhu
PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint Authors: Bhoomit Vasani, Jack FitzGerald, Anjie Fang, Sushmit Vaish
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction Authors: Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
Deceptive Risk Minimization: Out-of-Distribution Generalization by Deceiving Distribution Shift Detectors Authors: Anirudha Majumdar
Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning Authors: Jia-Qi Yang, Lei Shi
Identifiable Autoregressive Variational Autoencoders for Nonlinear and Nonstationary Spatio-Temporal Blind Source Separation Authors: Mika Sipilä, Klaus Nordhausen, Sara Taskinen
LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications Authors: Yujun Lin, Zhekai Zhang, Song Han
A Differential Manifold Perspective and Universality Analysis of Continuous Attractors in Artificial Neural Networks Authors: Shaoxin Tian, Hongkai Liu, Yuying Yang, Jiali Yu, Zizheng Miao, Xuming Huang, Zhishuai Liu, Zhang Yi
Feature Space Topology Control via Hopkins Loss Authors: Einari Vaaras, Manu Airaksinen
Learning non-Markovian Dynamical Systems with Signature-based Encoders Authors: Eliott Pradeleix, Rémy Hosseinkhan-Boucher, Alena Shilova, Onofrio Semeraro, Lionel Mathelin
Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework Authors: Siming Fu, Sijun Dong, Xiaoliang Meng
Semantic-guided LoRA Parameters Generation Authors: Miaoge Li, Yang Chen, Zhijie Rao, Can Jiang, Jingcai Guo
Learning Neural Networks by Neuron Pursuit Authors: Akshay Kumar, Jarvis Haupt
E-ROBOT: a dimension-free method for robust statistics and machine learning via Schrödinger bridge Authors: Davide La Vecchia, Hang Liu
Spectral Bottleneck in Deep Neural Networks: Noise is All You Need Authors: Hemanth Chandravamsi, Dhanush V. Shenoy, Itay Zinn, Shimon Pisnoy, Steven H. Frankel
Quantum Graph Attention Networks: Trainable Quantum Encoders for Inductive Graph Learning Authors: Arthur M. Faria, Mehdi Djellabi, Igor O. Sokolov, Savvas Varsamopoulos
EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression Authors: Jingyu Xiao, Zhongyi Zhang, Yuxuan Wan, Yintong Huo, Yang Liu, Michael R. Lyu
From Grounding to Skolemization: A Logic-Constrained Vector Symbolic Architecture for Complex Query Answering Authors: Yuyin Lu, Hegang Chen, Yanghui Rao
HARP: Hallucination Detection via Reasoning Subspace Projection Authors: Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan
Kalman Bayesian Transformer Authors: Haoming Jing, Oren Wright, José M. F. Moura, Yorie Nakahira
Gradient Estimation Methods of Approximate Multipliers for High-Accuracy Retraining of Deep Learning Models Authors: Chang Meng, Wayne Burleson, Giovanni De Micheli
Verifying Computational Graphs in Production-Grade Distributed Machine Learning Frameworks Authors: Kahfi S. Zulkifli, Wenbo Qian, Shaowei Zhu, Yuan Zhou, Zhen Zhang, Chang Lou
DOSA: Differentiable Model-Based One-Loop Search for DNN Accelerators Authors: Charles Hong, Qijing Huang, Grace Dinh, Mahesh Subedar, Yakun Sophia Shao

1. On Linear Mode Connectivity of Mixture-of-Experts Architectures

ArXiv ID: 2509.11348

Authors: Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen

Abstract: Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been observed to be connected--up to permutation symmetries--by linear paths in parameter space along which the loss remains consistently low. This observation challenges classical views of non-convex optimization and has implications for model ensembling, generalization, and our understanding of neural loss geometry. Inspired by recent studies on LMC in standard neural networks, we systematically investigate this phenomenon within Mixture-of-Experts (MoE) architectures--a class of models known for their scalability and computational efficiency, which combine traditional neural networks--referred to as experts--through a learnable gating mechanism. We begin by conducting a comprehensive analysis of both dense and sparse gating regimes, demonstrating that the symmetries inherent to MoE architectures are fully characterized by permutations acting on both the expert components and the gating function. Building on these foundational findings, we propose a matching algorithm that enables alignment between independently trained MoEs, thereby facilitating the discovery of LMC. Finally, we empirically validate the presence of LMC using our proposed algorithm across diverse MoE configurations--including dense, sparse, and shared-expert variants--under a wide range of model settings and datasets of varying scales and modalities. Our results confirm the existence of LMC in MoE architectures and offer fundamental insights into the functional landscape and optimization dynamics of deep learning models.

Comment: Model Architecture (MoE): analyzes symmetries and Linear Mode Connectivity in MoE and introduces a matching algorithm to align independently trained experts.

Relevance: 10 Novelty: 8
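
Illustration (not from the paper): the permutation symmetry the abstract refers to can be checked numerically on a toy dense MoE. The sketch below assumes linear experts and a softmax gate; the names `dense_moe`, `gate_w`, and `expert_ws` are hypothetical. Reordering the experts together with the matching gate rows should leave the output unchanged, which is why independently trained MoEs need to be aligned before interpolating between them.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dense_moe(x, gate_w, expert_ws):
    """Toy dense MoE: softmax gate over experts, each expert a linear map."""
    weights = softmax(gate_w @ x)                    # one gate weight per expert
    outputs = np.stack([w @ x for w in expert_ws])   # (n_experts, d_out)
    return weights @ outputs                         # gate-weighted sum

rng = np.random.default_rng(0)
n_experts, d_in, d_out = 4, 5, 3
x = rng.normal(size=d_in)
gate_w = rng.normal(size=(n_experts, d_in))
expert_ws = rng.normal(size=(n_experts, d_out, d_in))

# Apply the same permutation to the gate rows and to the experts.
perm = rng.permutation(n_experts)
y_original = dense_moe(x, gate_w, expert_ws)
y_permuted = dense_moe(x, gate_w[perm], expert_ws[perm])
print(np.allclose(y_original, y_permuted))
```

Two such models compute the same function yet sit at different points in parameter space, so a straight line between them is only meaningful after matching expert order.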

2. Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

ArXiv ID: 2509.11983

Authors: Chuan He, Zhanwang Deng, Zhaosong Lu

Abstract: Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon, which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose low-rank orthogonalization, which explicitly leverages the low-rank nature of gradients during NN training. Building on this, we propose low-rank matrix-signed gradient descent and a low-rank variant of Muon. Our numerical experiments demonstrate the superior performance of low-rank orthogonalization, with the low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining -- surpassing the performance of the carefully tuned vanilla Muon. Theoretically, we establish the iteration complexity of the low-rank matrix-signed gradient descent for finding an approximate stationary solution, as well as that of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise.

Comment: Compression/Efficiency and HPC: Low-rank orthogonalization and optimizer design exploiting low-rank gradients for large-scale foundation model training.

Relevance: 10 Novelty: 8
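
Illustration (not from the paper): orthogonalizing a gradient matrix G = U S V^T means replacing it with U V^T, so every retained singular value becomes 1. The sketch below shows one plausible reading of a low-rank variant, keeping only the top-r singular directions via a truncated SVD. The paper's actual construction is not given in the abstract and may differ, and the function names are hypothetical.

```python
import numpy as np

def orthogonalize(g):
    """Full orthogonalization: keep singular vectors, set singular values to 1."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def low_rank_orthogonalize(g, rank):
    """Assumed low-rank variant: orthogonalize only the top-`rank` directions."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u[:, :rank] @ vt[:rank, :]

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))   # gradient of exact rank 3
o = low_rank_orthogonalize(g, rank=3)

# The result has three unit singular values; the remaining ones are zero.
singular_values = np.linalg.svd(o, compute_uv=False)
print(np.round(singular_values, 6))
```

An exact SVD is used here only for clarity; at scale one would expect an iterative or randomized approximation instead.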

3. AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models

ArXiv ID: 2509.12019

Authors: Sangjun Lee, Seung-taek Woo, Jungyu Jin, Changhun Lee, Eunhyeok Park

Abstract: To enable broader deployment of Large Language Models (LLMs), it is essential to identify the best-performing model under strict memory constraints. We present AMQ, Automated Mixed-Precision Weight-Only Quantization, a framework that assigns layer-wise quantization bit-widths to optimally balance model quality and memory usage. However, the combinatorial search space, with over 10^100 possible configurations, makes conventional black-box optimization infeasible. AMQ overcomes this challenge through four key innovations: (1) search space pruning using prior knowledge to exclude unpromising configurations, (2) quantization proxy to bypass costly format conversions during search, (3) quality predictor to minimize evaluation overhead, and (4) iterative search-and-update strategy for fast and stable convergence. By integrating these components, AMQ efficiently explores the quality-efficiency landscape, reaching the Pareto frontier and yielding LLMs that are both compact and high-performing. Our code is available at https://github.com/dlwns147/amq.

Comment: Mixed-precision weight-only quantization for LLMs with AutoML search (Model Compression and Efficiency: quantization, layer-wise bit-width assignment).

Relevance: 10 Novelty: 8
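
Illustration (not from the paper): a quick way to see how a layer-wise bit-width search can exceed 10^100 configurations. The layer count (224) and the candidate widths ({2, 3, 4} bits) are hypothetical numbers chosen for the arithmetic, not values from the abstract. The memory helper counts weight bits only and ignores quantization metadata such as scales and zero-points.

```python
import math

def log10_search_space(num_layers, num_bit_choices):
    """log10 of the number of per-layer bit-width assignments."""
    return num_layers * math.log10(num_bit_choices)

def weight_memory_bytes(param_counts, bit_widths):
    """Weight storage for one assignment, ignoring scales and zero-points."""
    total_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return total_bits / 8

# Hypothetical setup: 224 quantizable layers, 3 candidate bit-widths each.
exponent = log10_search_space(num_layers=224, num_bit_choices=3)
print(f"about 10^{exponent:.1f} configurations")

# Two hypothetical layers: 1M parameters at 4 bits, 2M parameters at 2 bits.
mem = weight_memory_bytes([1_000_000, 2_000_000], [4, 2])
print(mem)
```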

4. From PowerSGD to PowerSGD+: Low-Rank Gradient Compression for Distributed Optimization with Convergence Guarantees

ArXiv ID: 2509.11254

Authors: Shengping Xie, Chuyan Chen, Kun Yuan

Abstract: Low-rank gradient compression methods, such as PowerSGD, have gained attention in communication-efficient distributed optimization. However, the convergence guarantees of PowerSGD remain unclear, particularly in stochastic settings. In this paper, we show that PowerSGD does not always converge to the optimal solution and provide a clear counterexample to support this finding. To address this, we introduce PowerSGD+, which periodically updates the projection subspace via singular value decomposition, ensuring that it remains aligned with the optimal subspace. We prove that PowerSGD+ converges under standard assumptions and validate its effectiveness through empirical evaluation on large language model tasks.

Comment: Model Compression and Efficiency: low-rank gradient compression with provable convergence; High Performance Computing: communication-efficient distributed optimization.

Relevance: 10 Novelty: 8
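
Illustration (not from the paper): a minimal sketch of the rank-r power-iteration compression that PowerSGD-style methods rely on. Workers exchange two thin factors instead of the full gradient matrix. The `svd_refresh` helper is a hypothetical reading of the "periodic SVD update" described in the abstract. Error feedback and warm-starting across steps are omitted.

```python
import numpy as np

def power_compress(m, q):
    """One power-iteration step: returns factors with m ~= p @ q_new.T."""
    p = m @ q
    p, _ = np.linalg.qr(p)      # orthonormal basis for the sketched column space
    q_new = m.T @ p
    return p, q_new

def svd_refresh(m, rank):
    """Hypothetical refresh: reset q to the top right singular vectors of m."""
    _, _, vt = np.linalg.svd(m, full_matrices=False)
    return vt[:rank].T

rng = np.random.default_rng(0)
grad = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 7))   # exactly rank 2
q0 = rng.normal(size=(7, 2))

p, q1 = power_compress(grad, q0)
approx = p @ q1.T               # reconstruction from the two thin factors

p2, q2 = power_compress(grad, svd_refresh(grad, rank=2))
print(np.allclose(approx, grad), np.allclose(p2 @ q2.T, grad))
```

Here the factors hold 10*2 + 7*2 = 34 numbers instead of 70. The reconstruction is exact only because the toy gradient has rank 2; real gradients lose information under compression.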

5. AQUA: Attention via QUery mAgnitudes for Memory and Compute Efficient Inference in LLMs

ArXiv ID: 2509.11155

Authors: Santhosh G S, Saurav Prakash, Balaraman Ravindran

Abstract: The quadratic complexity of the attention mechanism remains a fundamental barrier to scaling Large Language Models (LLMs) to longer contexts, creating a critical bottleneck in both computation and memory. To address this, we introduce AQUA (Attention via QUery mAgnitudes), a novel and versatile approximation strategy that significantly reduces the cost of attention with a graceful performance trade-off. Our method operates in two phases: an efficient offline step where we compute a universal, language-agnostic projection matrix via SVD on a calibration dataset, and an online inference step where we project query and key vectors and dynamically select a sparse subset of dimensions based on the query's magnitude. We provide a formal theoretical analysis of AQUA, establishing the break-even point at which it becomes more computationally efficient than standard attention. Our empirical evaluations on state-of-the-art models like Llama-3.1-8B demonstrate that a 25% reduction in the attention dot-product computation can be achieved with a statistically insignificant impact on performance across a wide range of benchmarks. We further showcase the versatility of AQUA by demonstrating its ability to synergistically accelerate existing token eviction methods like H2O and to directly reduce KV-cache memory size. By offering a controllable knob to balance efficiency and accuracy, AQUA provides a practical and powerful tool for making large-scale LLM inference more accessible and sustainable.

Comment: Model Compression/Efficiency: attention approximation via SVD-based projection with dynamic dimension sparsification, including formal efficiency analysis and direct KV/compute reductions.

Relevance: 10 Novelty: 8
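
Illustration (not from the paper): a rough sketch of the two phases described in the abstract. It assumes the projection is the right-singular-vector basis of a calibration matrix and that dimensions are selected by the absolute value of the projected query. Both choices, and the function names, are assumptions made for this example.

```python
import numpy as np

def fit_projection(calibration_vectors):
    """Offline step: orthogonal basis from the SVD of calibration vectors."""
    _, _, vt = np.linalg.svd(calibration_vectors, full_matrices=False)
    return vt.T                                  # (d, d) orthogonal matrix

def approx_scores(q, keys, proj, keep):
    """Online step: project, keep the `keep` largest-magnitude query dims."""
    q_p = q @ proj
    k_p = keys @ proj
    dims = np.argsort(-np.abs(q_p))[:keep]       # indices of the top query dims
    return k_p[:, dims] @ q_p[dims]              # dot products on a subset only

rng = np.random.default_rng(0)
d, n = 16, 32
proj = fit_projection(rng.normal(size=(200, d)))
q = rng.normal(size=d)
keys = rng.normal(size=(n, d))

exact = keys @ q
full = approx_scores(q, keys, proj, keep=d)      # keeping all dims is exact
approx = approx_scores(q, keys, proj, keep=12)   # 25% fewer multiply-adds
print(np.allclose(full, exact), float(np.abs(approx - exact).max()))
```

Because the projection is orthogonal, keeping every dimension reproduces the exact scores; the approximation error comes only from the dropped dimensions.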

6. Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

ArXiv ID: 2509.10534

Authors: Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

Abstract: The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities, whereas RoPE's performance degrades significantly on longer sequences at test time without fine tuning or the use of position-interpolation methods.

Comment: Model Architecture: introduces PoPE positional embeddings to disentangle content (what) and position (where), improving length extrapolation and performance over RoPE.

Relevance: 10 Novelty: 8

7. Long-time dynamics and universality of nonconvex gradient descent

ArXiv ID: 2509.11426

Authors: Qiyang Han

Abstract: This paper develops a general approach to characterize the long-time trajectory behavior of nonconvex gradient descent in generalized single-index models in the large aspect ratio regime. In this regime, we show that for each iteration the gradient descent iterate concentrates around a deterministic vector called the "Gaussian theoretical gradient descent", whose dynamics can be tracked by a state evolution system of two recursive equations for two scalars. Our concentration guarantees hold universally for a broad class of design matrices and remain valid over long time horizons until algorithmic convergence or divergence occurs. Moreover, our approach reveals that gradient descent iterates are in general approximately independent of the data and strongly incoherent with the feature vectors, a phenomenon previously known as the "implicit regularization" effect of gradient descent in specific models under Gaussian data. As an illustration of the utility of our general theory, we present two applications of different natures in the regression setting. In the first, we prove global convergence of nonconvex gradient descent with general independent initialization for a broad class of structured link functions, and establish universality of randomly initialized gradient descent in phase retrieval for large aspect ratios. In the second, we develop a data-free iterative algorithm for estimating state evolution parameters along the entire gradient descent trajectory, thereby providing a low-cost yet statistically valid tool for practical tasks such as hyperparameter tuning and runtime determination. As a by-product of our analysis, we show that in the large aspect ratio regime, the Gaussian theoretical gradient descent coincides with a recent line of dynamical mean-field theory for gradient descent over the constant-time horizon.

Comment: Representation Learning/Training Dynamics: develops a state-evolution framework for long-time dynamics of nonconvex gradient descent, explaining implicit regularization and universality.

Relevance: 9 Novelty: 9

8. Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning

ArXiv ID: 2509.10513

Authors: Sugyeong Eo, Jungjun Lee, Chanjun Park, Heuiseok Lim

Abstract: A sparse Mixture-of-Experts (MoE) architecture has emerged as a highly scalable solution by conditionally activating sub-modules without a proportional increase in computational costs. However, improving expert specialization to enhance performance and generalization remains a challenge for MoE, especially in instruction tuning scenarios characterized by significant input heterogeneity. In this work, we propose the Mixture-of-Clustered-Experts (MoCE) to address this limitation through a dual-stage routing mechanism. The first stage in the mechanism performs expert group routing based on sequence-level features, while the second stage activates the top-$k$ experts within the group at the token level. This approach enables the effective partitioning of heterogeneous inputs based on their knowledge requirements, encouraging expert group specialization while maintaining the advantages of token-level routing. We evaluate MoCE across a comprehensive set of benchmarks, demonstrating its consistent superiority over strong baselines and its enhanced generalization capabilities. Detailed analysis further highlights the robustness and effectiveness of MoCE.

Comment: Model Architecture – MoE innovation via dual-stage routing (sequence-level group routing followed by token-level top‑k), enhancing expert specialization and generalization.

Relevance: 10 Novelty: 7
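
Illustration (not from the paper): a minimal sketch of the dual-stage routing idea, with a sequence-level choice of expert group followed by token-level top-k within that group. Mean pooling as the sequence feature, linear routers, and contiguous expert groups are all assumptions made for this example.

```python
import numpy as np

def dual_stage_route(tokens, group_w, expert_w, experts_per_group, top_k):
    """Return the chosen group and, per token, the ids of its top-k experts."""
    # Stage 1: one group per sequence, from a pooled sequence feature.
    seq_feature = tokens.mean(axis=0)            # assumption: mean pooling
    group = int(np.argmax(group_w @ seq_feature))

    # Stage 2: token-level top-k restricted to the experts of that group.
    start = group * experts_per_group
    candidate_ids = np.arange(start, start + experts_per_group)
    logits = tokens @ expert_w[candidate_ids].T  # (n_tokens, experts_per_group)
    top_local = np.argsort(-logits, axis=1)[:, :top_k]
    return group, candidate_ids[top_local]

rng = np.random.default_rng(0)
n_tokens, d, n_groups, experts_per_group, top_k = 6, 8, 3, 4, 2
tokens = rng.normal(size=(n_tokens, d))
group_w = rng.normal(size=(n_groups, d))
expert_w = rng.normal(size=(n_groups * experts_per_group, d))

group, chosen = dual_stage_route(tokens, group_w, expert_w, experts_per_group, top_k)
print(group, chosen)
```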

9. Resource-Aware Neural Network Pruning Using Graph-based Reinforcement Learning

ArXiv ID: 2509.10526

Authors: Dieter Balemans, Thomas Huybrechts, Jan Steckel, Siegfried Mercelis

Abstract: This paper presents a novel approach to neural network pruning by integrating a graph-based observation space into an AutoML framework to address the limitations of existing methods. Traditional pruning approaches often depend on hand-crafted heuristics and local optimization perspectives, which can lead to suboptimal performance and inefficient pruning strategies. Our framework transforms the pruning process by introducing a graph representation of the target neural network that captures complete topological relationships between layers and channels, replacing the limited layer-wise observation space with a global view of network structure. The core innovations include a Graph Attention Network (GAT) encoder that processes the network's graph representation and generates a rich embedding. Additionally, for the action space we transition from continuous pruning ratios to fine-grained binary action spaces which enables the agent to learn optimal channel importance criteria directly from data, moving away from predefined scoring functions. These contributions are modelled within a Constrained Markov Decision Process (CMDP) framework, allowing the agent to make informed pruning decisions while adhering to resource constraints such as target compression rates. For this, we design a self-competition reward system that encourages the agent to outperform its previous best performance while satisfying the defined constraints. We demonstrate the effectiveness of our approach through extensive experiments on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet. The experiments show that our method consistently outperforms traditional pruning techniques, showing state-of-the-art results while learning task-specific pruning strategies that identify functionally redundant connections beyond simple weight magnitude considerations.

Comment: Model Compression/Efficiency: resource-aware pruning via graph-based RL (GAT encoder over network graph) with fine-grained binary channel actions under constraints.

Relevance: 9 Novelty: 8

10. Harnessing Optimization Dynamics for Curvature-Informed Model Merging

ArXiv ID: 2509.11167

Authors: Pouria Mahdavinia, Hamed Mahdavi, Niloofar Mireshghallah, Mehrdad Mahdavi

Abstract: Model merging is an effective post-training strategy for composing capabilities in large language models without joint retraining. We study this in the supervised fine-tuning (SFT) stage, where multiple capability-based SFT checkpoints -- spanning math, code, precise instruction following, general instruction following, and knowledge recall -- must be consolidated into a single model. We introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware aggregation that leverages optimizer second-moment statistics as a diagonal curvature proxy to reweight parameter edits and mitigate interference. Complementing OTA, we propose Fast Fisher Grafting (FFG), a curvature-driven task-localization step that sparsifies conflicting or low-importance edits. FFG induces extremely low-rank masks concentrated in early attention query/key projections and token embeddings, exploiting shared curvature across capabilities. We further develop a memory-light compression of the second moments that preserves OTA's effect. Across diverse capability-based SFT checkpoints, OTA+FFG improves merged-model quality over strong weight-space baselines, reduces negative transfer, and remains robust across sparsity levels. Analyses reveal substantial curvature overlap between checkpoints, offering a novel lens on why simple linear merging can be effective in practice. Ablations confirm that FFG is critical for reducing task interference and that the compressed second moments retain the gains of the full formulation. To facilitate reproducibility, we open-source all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints at https://github.com/pmahdavi/ota-merge.

Comment: Training dynamics/model merging: curvature-informed merging using optimizer second moments and Fisher-based sparsifying grafting (low-rank masks) with memory-light curvature compression.

Relevance: 9 Novelty: 8
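
Illustration (not from the paper): one plausible form of curvature-weighted merging, where each checkpoint's parameter edit is weighted per coordinate by the square root of its optimizer second moment. The exact OTA weighting is not specified in the abstract, so the formula and the names below are assumptions. With equal second moments the rule reduces to plain averaging.

```python
import numpy as np

def curvature_weighted_merge(base, finetuned, second_moments, eps=1e-8):
    """Merge checkpoints using sqrt(second moment) as a diagonal curvature proxy."""
    deltas = [w - base for w in finetuned]                 # per-task edits
    curv = [np.sqrt(v) + eps for v in second_moments]      # assumed proxy
    weighted = sum(c * d for c, d in zip(curv, deltas))
    return base + weighted / sum(curv)

base = np.zeros(3)
finetuned = [np.ones(3), 3.0 * np.ones(3)]

# Equal curvature everywhere: the merge is the plain average of the edits.
equal = [np.ones(3), np.ones(3)]
merged_equal = curvature_weighted_merge(base, finetuned, equal)

# Task 0 has far larger curvature on coordinate 0, so its edit dominates there.
skewed = [np.array([1e4, 1.0, 1.0]), np.ones(3)]
merged_skewed = curvature_weighted_merge(base, finetuned, skewed)
print(merged_equal, merged_skewed)
```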

11. Contextuality, Holonomy and Discrete Fiber Bundles in Group-Valued Boltzmann Machines

ArXiv ID: 2509.10536

Authors: Jean-Pierre Magnot

Abstract: We propose a geometric extension of restricted Boltzmann machines (RBMs) by allowing weights to take values in abstract groups such as $\mathrm{GL}_n(\mathbb{R})$, $\mathrm{SU}(2)$, or even infinite-dimensional operator groups. This generalization enables the modeling of complex relational structures, including projective transformations, spinor dynamics, and functional symmetries, with direct applications to vision, language, and quantum learning. A central contribution of this work is the introduction of a contextuality index based on group-valued holonomies computed along cycles in the RBM graph. This index quantifies the global inconsistency or "curvature" induced by local weights, generalizing classical notions of coherence, consistency, and geometric flatness. We establish links with sheaf-theoretic contextuality, gauge theory, and noncommutative geometry, and provide numerical and diagrammatic examples in both finite and infinite dimensions. This framework opens novel directions in AI, from curvature-aware learning architectures to topological regularization in uncertain or adversarial environments.

Comment: Model Architecture: generalizes RBMs with group-valued weights and introduces a holonomy-based contextuality index, linking to geometric/topological regularization.

Relevance: 9 Novelty: 8

12. Contrastive Network Representation Learning

ArXiv ID: 2509.11316

Authors: Zihan Dong, Xin Zhou, Ryumei Nakada, Lexin Li, Linjun Zhang

Abstract: Network representation learning seeks to embed networks into a low-dimensional space while preserving the structural and semantic properties, thereby facilitating downstream tasks such as classification, trait prediction, edge identification, and community detection. Motivated by challenges in brain connectivity data analysis, which is characterized by subject-specific, high-dimensional, and sparse networks that lack node or edge covariates, we propose a novel contrastive learning-based statistical approach for network edge embedding, which we name Adaptive Contrastive Edge Representation Learning (ACERL). It builds on two key components: contrastive learning of augmented network pairs, and a data-driven adaptive random masking mechanism. We establish the non-asymptotic error bounds, and show that our method achieves the minimax optimal convergence rate for edge representation learning. We further demonstrate the applicability of the learned representation in multiple downstream tasks, including network classification, important edge detection, and community detection, and establish the corresponding theoretical guarantees. We validate our method through both synthetic data and real brain connectivity studies, and show its competitive performance compared to the baseline method of sparse principal components analysis.

Comment: Representation Learning: contrastive edge embedding for networks with adaptive masking and theoretical guarantees (non-asymptotic bounds, minimax optimality).

Relevance: 9 Novelty: 8

13. SpecVLM: Fast Speculative Decoding in Vision-Language Models

ArXiv ID: 2509.11815

Authors: Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum

Abstract: Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5--2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5--2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.

Comment: Compression/Efficiency and Systems: Speculative decoding for VLMs with KV-cache-aware design and elastic visual compression (pruning/pooling/resampler) for accelerated inference.

Relevance: 9 Novelty: 8

14. Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts

ArXiv ID: 2509.10530

Authors: Cheng Li, Jiexiong Liu, Yixuan Chen, Jie ji

Abstract: Transformer models based on the Mixture of Experts (MoE) architecture have made significant progress in long-sequence modeling, but existing models still have shortcomings in computational efficiency and the ability to capture long-range dependencies, especially in terms of the dynamic adaptability of expert resource allocation. In this paper, we propose a Dynamic Adaptive Shared Expert and Grouped Multi-Head Attention Hybrid Model (DASG-MoE) to enhance long-sequence modeling capabilities by integrating three modules. First, we employ the Grouped Multi-Head Attention (GMHA) mechanism to effectively reduce the computational complexity of long sequences. By parallel processing through sequence grouping, local sliding window attention, and feature aggregation, we address long-range dependency issues and the model's lack of generalization for local information. Second, we design a Dual-Scale Shared Expert Structure (DSSE), where shallow experts use lightweight computations to quickly respond to low-dimensional features, while deep experts process high-dimensional complex semantics through pre-training transfer and post-training optimization, achieving a dynamic balance between efficiency and accuracy. Third, we propose a hierarchical Adaptive Dynamic Routing (ADR) mechanism that dynamically selects expert levels based on feature complexity and task requirements, and optimizes resource allocation through a local expert activation strategy. Experiments on multiple long-sequence benchmark datasets demonstrate that our DASG-MoE model outperforms state-of-the-art models.

Comment: Model Architecture (MoE) and Efficiency: introduces grouped multi-head attention, dual-scale shared experts, and hierarchical adaptive routing for long-sequence modeling efficiency.

Relevance: 9 Novelty: 7

15. Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning

ArXiv ID: 2509.11344

Authors: Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, Hongyuan Zhu

Abstract: Self-supervised learning (SSL) conventionally relies on the instance consistency paradigm, assuming that different views of the same image can be treated as positive pairs. However, this assumption breaks down for non-iconic data, where different views may contain distinct objects or semantic information. In this paper, we investigate the effectiveness of SSL when instance consistency is not guaranteed. Through extensive ablation studies, we demonstrate that SSL can still learn meaningful representations even when positive pairs lack strict instance consistency. Furthermore, our analysis reveals that increasing view diversity, by enforcing zero overlap or using smaller crop scales, can enhance downstream performance on classification and dense prediction tasks. However, excessive diversity is found to reduce effectiveness, suggesting an optimal range for view diversity. To quantify this, we adopt the Earth Mover's Distance (EMD) as an estimator to measure mutual information between views, finding that moderate EMD values correlate with improved SSL learning, providing insights for future SSL framework design. We validate our findings across a range of settings, highlighting their robustness and applicability on diverse data sources.

Comment: Representation Learning: analyses of instance consistency and view diversity in SSL with an EMD-based estimator guiding view design.

Relevance: 9 Novelty: 7

16. PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint

ArXiv ID: 2509.10971

Authors: Bhoomit Vasani, Jack FitzGerald, Anjie Fang, Sushmit Vaish

Abstract: We introduce PHLoRA (Post-hoc LoRA, pronounced "flora"), a simple yet powerful method to extract low-rank adaptation adapters from full-rank fine-tuned models without requiring access to training data or gradients. By computing the low-rank decomposition of weight differences between a base model and its fine-tuned counterpart, our method reconstructs adapter modules that can be merged or dynamically routed at inference time via S-LoRA, or served in scalable industry settings using platforms like NVIDIA NIM. This approach amortizes latency overhead across requests and yields substantial cost savings. Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation, allowing adapter extraction from existing full-rank models or third-party checkpoints. Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that extracted adapters preserve high energy from the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged. Overall, PHLoRA provides a practical path for making all existing full-rank checkpoints adapter-ready, democratizing scalable inference for all models.

Comment: Low-rank adapter extraction from full-rank checkpoints (Model Compression and Efficiency: low-rank/LoRA, data-free post-hoc adapters).

Relevance: 9 Novelty: 7
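
The core step described in the PHLoRA abstract above, a low-rank decomposition of the difference between fine-tuned and base weights, can be illustrated with a truncated SVD. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `extract_lora`, the even split of singular values between the two factors, and the definition of "energy" as the retained fraction of squared singular values are all illustrative choices.

```python
import numpy as np

def extract_lora(w_base, w_finetuned, rank):
    """Return factors (b, a) such that b @ a is the best rank-`rank`
    approximation (in Frobenius norm) of the weight delta, plus the
    fraction of squared singular values that the truncation retains."""
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Assumption: split each singular value evenly between the two factors.
    sqrt_s = np.sqrt(s[:rank])
    b = u[:, :rank] * sqrt_s             # shape (out_features, rank)
    a = sqrt_s[:, None] * vt[:rank]      # shape (rank, in_features)
    total = float(np.sum(s ** 2))
    energy = float(np.sum(s[:rank] ** 2)) / total if total > 0 else 1.0
    return b, a, energy

# Synthetic check: a fine-tuned matrix whose delta is exactly rank 4.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 32))
w_ft = w_base + rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
b, a, energy = extract_lora(w_base, w_ft, rank=4)
print(b.shape, a.shape, round(energy, 6))
```

Because the synthetic delta has rank 4, re-merging `w_base + b @ a` recovers the fine-tuned matrix up to floating-point error; on real checkpoints the retained energy would depend on the chosen rank.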

17. Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

ArXiv ID: 2509.10798

Authors: Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che

Abstract: Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. The size of KV cache grows linearly as the length of the sequence extends, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method which incorporates a soft token list. This method only tunes the model's embedding layer at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens' attention map to the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation compared to existing eviction approaches. We validate our approach through experiments conducted on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on the LongBench and over 3 points on RULER. This proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.

Comment: Model Compression/Efficiency: proposes trainable soft query tokens to score KV-cache entries using global attention, improving eviction decisions and reducing memory/compute without full model retraining.

Relevance: 9 Novelty: 7
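
The eviction scheme that the Judge Q abstract above builds on, scoring cached keys with a small set of queries and keeping the highest-scoring entries, can be sketched roughly as follows. This is a single-head NumPy illustration under assumptions: the name `select_kv_to_keep`, the use of summed attention mass as the importance score, and the absence of causal masking are simplifications, not details taken from the paper. In Judge Q the scoring queries would come from trained soft tokens appended to the input rather than from the last prefill window.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_kv_to_keep(queries, keys, budget):
    """queries: (m, d) scoring queries; keys: (n, d) cached keys.
    Returns the sorted indices of the `budget` cache entries to keep."""
    d = keys.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)   # (m, n)
    # Assumption: importance of a key = total attention it receives.
    importance = attn.sum(axis=0)                            # (n,)
    keep = np.argsort(importance)[-budget:]
    return np.sort(keep)

# Toy cache: only entry 5 is aligned with the scoring queries.
keys = np.zeros((8, 4))
keys[5] = 1.0
queries = np.ones((3, 4))
print(select_kv_to_keep(queries, keys, budget=1))
```

The corresponding values would be gathered with the same indices; a multi-head version would typically score and select per head.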

18. Deceptive Risk Minimization: Out-of-Distribution Generalization by Deceiving Distribution Shift Detectors

ArXiv ID: 2509.12081

Authors: Anirudha Majumdar

Abstract: This paper proposes deception as a mechanism for out-of-distribution (OOD) generalization: by learning data representations that make training data appear independent and identically distributed (iid) to an observer, we can identify stable features that eliminate spurious correlations and generalize to unseen domains. We refer to this principle as deceptive risk minimization (DRM) and instantiate it with a practical differentiable objective that simultaneously learns features that eliminate distribution shifts from the perspective of a detector based on conformal martingales while minimizing a task-specific loss. In contrast to domain adaptation or prior invariant representation learning methods, DRM does not require access to test data or a partitioning of training data into a finite number of data-generating domains. We demonstrate the efficacy of DRM on numerical experiments with concept shift and a simulated imitation learning setting with covariate shift in environments that a robot is deployed in.

Comment: Representation Learning: objective for invariant features by deceiving distribution shift detectors (conformal martingales) to improve OOD generalization without domain partitioning.

Relevance: 8 Novelty: 8

19. Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning

ArXiv ID: 2509.11070

Authors: Jia-Qi Yang, Lei Shi

Abstract: We develop a stochastic approximation framework for learning nonlinear operators between infinite-dimensional spaces utilizing general Mercer operator-valued kernels. Our framework encompasses two key classes: (i) compact kernels, which admit discrete spectral decompositions, and (ii) diagonal kernels of the form $K(x,x')=k(x,x')T$, where $k$ is a scalar-valued kernel and $T$ is a positive operator on the output space. This broad setting induces expressive vector-valued reproducing kernel Hilbert spaces (RKHSs) that generalize the classical $K=kI$ paradigm, thereby enabling rich structural modeling with rigorous theoretical guarantees. To address target operators lying outside the RKHS, we introduce vector-valued interpolation spaces to precisely quantify misspecification error. Within this framework, we establish dimension-free polynomial convergence rates, demonstrating that nonlinear operator learning can overcome the curse of dimensionality. The use of general operator-valued kernels further allows us to derive rates for intrinsically nonlinear operator learning, going beyond the linear-type behavior inherent in diagonal constructions of $K=kI$. Importantly, this framework accommodates a wide range of operator learning tasks, ranging from integral operators such as Fredholm operators to architectures based on encoder-decoder representations. Moreover, we validate its effectiveness through numerical experiments on the two-dimensional Navier-Stokes equations.

Comment: Representation Learning: theoretical framework for nonlinear operator learning with general operator-valued kernels and dimension-free convergence rates.

Relevance: 8 Novelty: 8

ArXiv ID: 2509.11962

Authors: Mika Sipil\"a, Klaus Nordhausen, Sara Taskinen

Abstract: The modeling and prediction of multivariate spatio-temporal data involve numerous challenges. Dimension reduction methods can significantly simplify this process, provided that they account for the complex dependencies between variables and across time and space. Nonlinear blind source separation has emerged as a promising approach, particularly following recent advances in identifiability results. Building on these developments, we introduce the identifiable autoregressive variational autoencoder, which ensures the identifiability of latent components consisting of nonstationary autoregressive processes. The blind source separation efficacy of the proposed method is showcased through a simulation study, where it is compared against state-of-the-art methods, and the spatio-temporal prediction performance is evaluated against several competitors on air pollution and weather datasets.

Comment: Representation Learning/Autoencoders: Identifiable VAE with autoregressive latent processes, advancing identifiability theory.

Relevance: 8 Novelty: 8

21. LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications

ArXiv ID: 2509.12053

Authors: Yujun Lin, Zhekai Zhang, Song Han

Abstract: Modern tensor applications, especially foundation models and generative AI applications, require multiple input modalities (both vision and language), which increases the demand for flexible accelerator architectures. Existing frameworks suffer from a trade-off between design flexibility and the productivity of RTL generation: they are either limited to very few hand-written templates or cannot automatically generate the RTL. To address this challenge, we propose the LEGO framework, which targets tensor applications and automatically generates spatial architecture designs and outputs synthesizable RTL code without handwritten RTL design templates. Leveraging the affine-transformation-based architecture representation, the LEGO front end finds interconnections between function units, synthesizes the memory system, and fuses different spatial dataflow designs based on data reuse analysis. The LEGO back end then translates the hardware into a primitive-level graph to perform lower-level optimizations, and applies a set of linear-programming algorithms to optimally insert pipeline registers and reduce the overhead of unused logic when switching spatial dataflows. Our evaluation demonstrates that LEGO can achieve 3.2x speedup and 2.4x energy efficiency compared to previous work Gemmini, and can generate one architecture for diverse modern foundation models in generative AI applications.

Comment: High Performance Computing – algorithmic framework to automatically generate and optimize spatial accelerators for tensor workloads (affine-based representation, dataflow fusion, LP-based pipeline/register insertion).

Relevance: 8 Novelty: 8

22. A Differential Manifold Perspective and Universality Analysis of Continuous Attractors in Artificial Neural Networks

ArXiv ID: 2509.10514

Authors: Shaoxin Tian, Hongkai Liu, Yuying Yang, Jiali Yu, Zizheng Miao, Xuming Huang, Zhishuai Liu, Zhang Yi

Abstract: Continuous attractors are critical for information processing in both biological and artificial neural systems, with implications for spatial navigation, memory, and deep learning optimization. However, existing research lacks a unified framework to analyze their properties across diverse dynamical systems, limiting cross-architectural generalizability. This study establishes a novel framework from the perspective of differential manifolds to investigate continuous attractors in artificial neural networks. It verifies compatibility with prior conclusions, elucidates links between continuous attractor phenomena and eigenvalues of the local Jacobian matrix, and demonstrates the universality of singular value stratification in common classification models and datasets. These findings suggest continuous attractors may be ubiquitous in general neural networks, highlighting the need for a general theory, with the proposed framework offering a promising foundation given the close mathematical connection between eigenvalues and singular values.

Comment: Representation Learning: provides a theoretical framework for continuous attractors via differential manifolds, linking Jacobian eigenvalues and singular values to training dynamics and feature structure.

Relevance: 8 Novelty: 7

23. Feature Space Topology Control via Hopkins Loss

ArXiv ID: 2509.11154

Authors: Einari Vaaras, Manu Airaksinen

Abstract: Feature space topology refers to the organization of samples within the feature space. Modifying this topology can be beneficial in machine learning applications, including dimensionality reduction, generative modeling, transfer learning, and robustness to adversarial attacks. This paper introduces a novel loss function, Hopkins loss, which leverages the Hopkins statistic to enforce a desired feature space topology, which is in contrast to existing topology-related methods that aim to preserve input feature topology. We evaluate the effectiveness of Hopkins loss on speech, text, and image data in two scenarios: classification and dimensionality reduction using nonlinear bottleneck autoencoders. Our experiments show that integrating Hopkins loss into classification or dimensionality reduction has only a small impact on classification performance while providing the benefit of modifying feature topology.

Comment: Representation Learning: proposes a topology-shaping loss (Hopkins loss) to directly control feature-space organization in autoencoders/classifiers.

Relevance: 8 Novelty: 7
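
For readers unfamiliar with the Hopkins statistic that the loss in the entry above is built on, the following NumPy sketch shows one common way to compute it. It is a non-differentiable illustration under assumptions: the function name, the probe count, sampling probes uniformly inside the data's bounding box, and using plain Euclidean distances (some definitions raise distances to the power of the dimension) are choices made here, not details from the paper.

```python
import numpy as np

def hopkins_statistic(x, n_probes=50, seed=0):
    """Hopkins statistic of the rows of x (shape (n, d), n >= 2).
    Values near 0.5 suggest uniformly random data, values near 1
    suggest clustered data, values near 0 suggest a regular grid."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    m = min(n_probes, n - 1)
    rng = np.random.default_rng(seed)

    # u: distance from each uniform probe to its nearest data point.
    probes = rng.uniform(x.min(axis=0), x.max(axis=0), size=(m, d))
    u = np.linalg.norm(probes[:, None, :] - x[None, :, :], axis=-1).min(axis=1)

    # w: distance from each sampled data point to its nearest other data point.
    idx = rng.choice(n, size=m, replace=False)
    dists = np.linalg.norm(x[idx][:, None, :] - x[None, :, :], axis=-1)
    dists[np.arange(m), idx] = np.inf        # exclude the point itself
    w = dists.min(axis=1)

    return float(u.sum() / (u.sum() + w.sum()))

rng = np.random.default_rng(1)
clustered = np.concatenate([rng.normal(0, 0.01, (100, 2)),
                            rng.normal(10, 0.01, (100, 2))])
uniform = rng.uniform(size=(500, 2))
print(hopkins_statistic(clustered), hopkins_statistic(uniform))
```

A training loss would need a differentiable variant of this quantity and a target value to steer toward; how the paper does that is not specified in the abstract.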

24. Learning non-Markovian Dynamical Systems with Signature-based Encoders

ArXiv ID: 2509.12022

Authors: Eliott Pradeleix, R\'emy Hosseinkhan-Boucher, Alena Shilova, Onofrio Semeraro, Lionel Mathelin

Abstract: Neural ordinary differential equations offer an effective framework for modeling dynamical systems by learning a continuous-time vector field. However, they rely on the Markovian assumption - that future states depend only on the current state - which is often untrue in real-world scenarios where the dynamics may depend on the history of past states. This limitation becomes especially evident in settings involving the continuous control of complex systems with delays and memory effects. To capture historical dependencies, existing approaches often rely on recurrent neural network (RNN)-based encoders, which are inherently discrete and struggle with continuous modeling. In addition, they may exhibit poor training behavior. In this work, we investigate the use of the signature transform as an encoder for learning non-Markovian dynamics in a continuous-time setting. The signature transform offers a continuous-time alternative with strong theoretical foundations and proven efficiency in summarizing multidimensional information in time. We integrate a signature-based encoding scheme into encoder-decoder dynamics models and demonstrate that it outperforms RNN-based alternatives in test performance on synthetic benchmarks.

Comment: Model Architecture/Representation Learning: introduces a signature-transform encoder for continuous-time non-Markovian dynamics within neural ODE frameworks.

Relevance: 8 Novelty: 7
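
As background for the entry above, the signature transform summarizes a path by its iterated integrals. The sketch below computes the first two signature levels of a piecewise-linear path in NumPy. It is an illustration only: the function name `signature_depth2` and the truncation at depth 2 are choices made here, and the paper's encoder may use a different depth, library, or preprocessing.

```python
import numpy as np

def signature_depth2(path):
    """path: (T, d) samples, interpreted as a piecewise-linear path.
    Returns level 1 (shape (d,), the total increment) and level 2
    (shape (d, d)), where level2[i, j] is the iterated integral
    of (x_i(t) - x_i(0)) dx_j(t) along the path."""
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)          # (T-1, d) segment increments
    level1 = increments.sum(axis=0)
    offsets = path[:-1] - path[0]               # displacement at segment start
    # Exact integral over each linear segment, summed over segments.
    level2 = offsets.T @ increments + 0.5 * increments.T @ increments
    return level1, level2

# An L-shaped path: one step along x, then one step along y.
level1, level2 = signature_depth2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
levy_area = 0.5 * (level2[0, 1] - level2[1, 0])
print(level1, level2, levy_area)
```

The symmetric part of level 2 is determined by level 1 (the shuffle identity), so the new information at depth 2 sits in the antisymmetric part, the Lévy area, which depends on the order in which the coordinates move.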

25. Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework

ArXiv ID: 2509.11598

Authors: Siming Fu, Sijun Dong, Xiaoliang Meng

Abstract: Despite the remarkable success of Self-Supervised Learning (SSL), its generalization is fundamentally hindered by Shortcut Learning, where models exploit superficial features like texture instead of intrinsic structure. We experimentally verify this flaw within the generative paradigm (e.g., MAE) and argue it is a systemic issue also affecting discriminative methods, identifying it as the root cause of their failure on unseen domains. While existing methods often tackle this at a surface level by aligning or separating domain-specific features, they fail to alter the underlying learning mechanism that fosters shortcut dependency. To address this at its core, we propose HyGDL (Hybrid Generative-Discriminative Learning Framework), a hybrid framework that achieves explicit content-style disentanglement. Our approach is guided by the Invariance Pre-training Principle: forcing a model to learn an invariant essence by systematically varying a bias (e.g., style) at the input while keeping the supervision signal constant. HyGDL operates on a single encoder and analytically defines style as the component of a representation that is orthogonal to its style-invariant content, derived via vector projection.

Comment: Representation Learning – explicit content–style disentanglement using orthogonal projection with an invariance pre‑training principle to counter shortcut learning.

Relevance: 8 Novelty: 7
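
The abstract above defines style as the component of a representation orthogonal to its content, obtained by vector projection. A minimal NumPy sketch of that decomposition follows. How HyGDL obtains the content vector is not described in the abstract, so `content_dir` is simply taken as given here, and the function name and the `eps` safeguard are illustrative assumptions.

```python
import numpy as np

def split_content_style(z, content_dir, eps=1e-12):
    """Split representation(s) z (shape (d,) or (n, d)) into the projection
    onto `content_dir` and the orthogonal remainder, so content + style == z."""
    z = np.asarray(z, dtype=float)
    c = np.asarray(content_dir, dtype=float)
    c = c / (np.linalg.norm(c) + eps)                  # unit content direction
    coef = np.sum(z * c, axis=-1, keepdims=True)       # projection coefficient
    content = coef * c
    style = z - content                                # orthogonal to c
    return content, style

z = np.array([[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]])
content, style = split_content_style(z, content_dir=[1.0, 0.0, 0.0])
print(content)
print(style)
```

By construction the style component has zero inner product with the content direction, which is the orthogonality property the abstract refers to.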

26. Semantic-guided LoRA Parameters Generation

ArXiv ID: 2509.10535

Authors: Miaoge Li, Yang Chen, Zhijie Rao, Can Jiang, Jingcai Guo

Abstract: Low-Rank Adaptation (LoRA) has demonstrated strong generalization capabilities across a variety of tasks for efficiently fine-tuning AI models, especially on resource-constrained edges. However, in real-world applications, edge users often exhibit task-specific preferences that are difficult to handle with a unified model trained under a closed-world assumption, and the challenge may further increase when there are significant domain shifts between training and deployment. Meanwhile, retraining/fine-tuning models for each user is also impractical due to its cost-intensive nature and privacy concerns over raw data utilization from edges. To address these challenges, we propose Semantic-guided LoRA Parameter Generation (SG-LoRA), a first-of-its-kind framework to efficiently produce user-specific LoRA parameters without any additional training on user tasks or access to user-specific data. Concretely, SG-LoRA uses task descriptions as the semantic bridge, measuring their proximity to a set of known expert tasks in a shared embedding space. Based on this semantic guidance, it models the target task's LoRA parameter distribution to generate high-performing parameters for novel tasks. SG-LoRA enables the real-time construction of LoRA models aligned with individual intents by distilling knowledge from prominent LoRA experts while offering a privacy-preserving solution for personalized model adaptation in a novel zero-shot open-world setting proposed in this work. Extensive experiments on multiple challenging tasks confirm the superior performance and remarkable adaptability of SG-LoRA. Code is available at https://github.com/keepgoingjkg/SG-LoRA.

Comment: LoRA parameter generation via semantic guidance and composition of expert LoRAs (Model Compression/Efficiency: parameter-efficient adaptation).

Relevance: 8 Novelty: 7

27. Learning Neural Networks by Neuron Pursuit

ArXiv ID: 2509.12154

Authors: Akshay Kumar, Jarvis Haupt

Abstract: The first part of this paper studies the evolution of gradient flow for homogeneous neural networks near a class of saddle points exhibiting a sparsity structure. The choice of these saddle points is motivated by previous works on homogeneous networks, which identified the first saddle point encountered by gradient flow after escaping the origin. It is shown here that, when initialized sufficiently close to such saddle points, gradient flow remains near the saddle point for a sufficiently long time, during which the weights with small norm remain small but converge in direction. Furthermore, important empirical observations are made on the behavior of gradient descent after escaping these saddle points. The second part of the paper, motivated by these results, introduces a greedy algorithm to train deep neural networks called Neuron Pursuit (NP). It is an iterative procedure which alternates between expanding the network by adding neuron(s) with carefully chosen weights, and minimizing the training loss using this augmented network. The efficacy of the proposed algorithm is validated using numerical experiments.

Comment: Matches Representation Learning criterion via analysis of training dynamics (gradient flow near sparse-structured saddle points) and introduces a foundational training/architecture method (greedy neuron addition) rather than an application.

Relevance: 8 Novelty: 7

28. E-ROBOT: a dimension-free method for robust statistics and machine learning via Schr\"odinger bridge

ArXiv ID: 2509.11532

Authors: Davide La Vecchia, Hang Liu

Abstract: We propose the Entropic-regularized Robust Optimal Transport (E-ROBOT) framework, a novel method that combines the robustness of ROBOT with the computational and statistical benefits of entropic regularization. We show that, rooted in the Schr\"{o}dinger bridge problem theory, E-ROBOT defines the robust Sinkhorn divergence $\overline{W}_{\varepsilon,\lambda}$, where the parameter $\lambda$ controls robustness and $\varepsilon$ governs the regularization strength. Letting $n\in \mathbb{N}$ denote the sample size, a central theoretical contribution is establishing that the sample complexity of $\overline{W}_{\varepsilon,\lambda}$ is $\mathcal{O}(n^{-1/2})$, thereby avoiding the curse of dimensionality that plagues standard ROBOT. This dimension-free property unlocks the use of $\overline{W}_{\varepsilon,\lambda}$ as a loss function in large-dimensional statistical and machine learning tasks. With this regard, we demonstrate its utility through four applications: goodness-of-fit testing; computation of barycenters for corrupted 2D and 3D shapes; definition of gradient flows; and image colour transfer. From the computation standpoint, a perk of our novel method is that it can be easily implemented by modifying existing (\texttt{Python}) routines. From the theoretical standpoint, our work opens the door to many research directions in statistics and machine learning: we discuss some of them.

Comment: Representation Learning criterion: introduces a robust OT loss (robust Sinkhorn divergence via entropic regularization) with a dimension-free O(n^{-1/2}) sample complexity guarantee.

Relevance: 7 Novelty: 8

29. Spectral Bottleneck in Deep Neural Networks: Noise is All You Need

ArXiv ID: 2509.09719

Authors: Hemanth Chandravamsi, Dhanush V. Shenoy, Itay Zinn, Shimon Pisnoy, Steven H. Frankel

Abstract: Deep neural networks are known to exhibit a spectral learning bias, wherein low-frequency components are learned early in training, while high-frequency modes emerge more gradually in later epochs. However, when the target signal lacks low-frequency components and is dominated by broadband high frequencies, training suffers from a 'spectral bottleneck', and the model fails to reconstruct the entire signal, including the frequency components that lie within the network's representational capacity. We examine such a scenario in the context of implicit neural representations (INRs) with sinusoidal representation networks (SIRENs), focusing on the challenge of fitting high-frequency-dominant signals that are susceptible to spectral bottleneck. To effectively fit any target signal irrespective of its frequency content, we propose a generalized target-aware 'weight perturbation scheme' (WINNER - weight initialization with noise for neural representations) for network initialization. The scheme perturbs uniformly initialized weights with Gaussian noise, where the noise scales are adaptively determined by the spectral centroid of the target signal. We show that the noise scales can provide control over the spectra of network activations and the eigenbasis of the empirical neural tangent kernel. This method not only addresses the spectral bottleneck but also yields faster convergence and improved representation accuracy, outperforming state-of-the-art approaches in audio fitting and achieving notable gains in image fitting and denoising tasks. Beyond signal reconstruction, our approach opens new directions for adaptive weight initialization strategies in computer vision and scientific machine learning.

Comment: Representation Learning/Training Dynamics: tackles spectral bias with a target-aware weight initialization that modulates activation spectra and NTK eigenbasis.