Personalized Daily ArXiv Papers 2025-09-17

[gpt-5]	Prompt	Completion	Total
Token	66444	67838	134282
Cost	$0.08	$0.68	$0.76

Total arXiv papers: 675

Total scanned papers: 434

Total relevant papers: 47

Table of contents with paper titles:

Positional Encoding via Token-Aware Phase Attention Authors: Yu (Sid) Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian
PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint Authors: Bhoomit Vasani, Jack FitzGerald, Anjie Fang, Sushmit Vaish
Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings Authors: Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer
From PowerSGD to PowerSGD+: Low-Rank Gradient Compression for Distributed Optimization with Convergence Guarantees Authors: Shengping Xie, Chuyan Chen, Kun Yuan
Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction Authors: Ryan Lucas, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, Rahul Mazumder
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training Authors: Chuan He, Zhanwang Deng, Zhaosong Lu
On Linear Mode Connectivity of Mixture-of-Experts Architectures Authors: Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen
Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning Authors: Sugyeong Eo, Jungjun Lee, Chanjun Park, Heuiseok Lim
Long-time dynamics and universality of nonconvex gradient descent Authors: Qiyang Han
Contextuality, Holonomy and Discrete Fiber Bundles in Group-Valued Boltzmann Machines Authors: Jean-Pierre Magnot
Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts Authors: Cheng Li, Jiexiong Liu, Yixuan Chen, Jie ji
Identifiable Autoregressive Variational Autoencoders for Nonlinear and Nonstationary Spatio-Temporal Blind Source Separation Authors: Mika Sipilä, Klaus Nordhausen, Sara Taskinen
Why and How Auxiliary Tasks Improve JEPA Representations Authors: Jiacan Yu, Siyi Chen, Mingrui Liu, Nono Horiuchi, Vladimir Braverman, Zicheng Xu, Dan Haramati, Randall Balestriero
Harnessing Optimization Dynamics for Curvature-Informed Model Merging Authors: Pouria Mahdavinia, Hamed Mahdavi, Niloofar Mireshghallah, Mehrdad Mahdavi
Learning Neural Networks by Neuron Pursuit Authors: Akshay Kumar, Jarvis Haupt
AQUA: Attention via QUery mAgnitudes for Memory and Compute Efficient Inference in LLMs Authors: Santhosh G S, Saurav Prakash, Balaraman Ravindran
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving Authors: Dong Liu, Yanxuan Yu
LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications Authors: Yujun Lin, Zhekai Zhang, Song Han
DOSA: Differentiable Model-Based One-Loop Search for DNN Accelerators Authors: Charles Hong, Qijing Huang, Grace Dinh, Mahesh Subedar, Yakun Sophia Shao
A Modern Look at Simplicity Bias in Image Classification Tasks Authors: Xiaoguang Chang, Teng Wang, Changyin Sun
Scaling Up Data Parallelism in Decentralized Deep Learning Authors: Bing Xie, Junqi Yin, Zhenyu Zhou, Sarp Oral, Feiyi Wang
Resource-Aware Neural Network Pruning Using Graph-based Reinforcement Learning Authors: Dieter Balemans, Thomas Huybrechts, Jan Steckel, Siegfried Mercelis
AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models Authors: Sangjun Lee, Seung-taek Woo, Jungyu Jin, Changhun Lee, Eunhyeok Park
Black-box Model Merging for Language-Model-as-a-Service with Massive Model Repositories Authors: Shilian Chen, Jie Zhou, Tianyu Huai, Yujiang Lu, Junsong Li, Bihao Zhan, Qianjun Pan, Yutao Yang, Xin Li, Qin Chen, Hang Yan, Liang He
Spectral Bottleneck in Deep Neural Networks: Noise is All You Need Authors: Hemanth Chandravamsi, Dhanush V. Shenoy, Itay Zinn, Shimon Pisnoy, Steven H. Frankel
Verifying Computational Graphs in Production-Grade Distributed Machine Learning Frameworks Authors: Kahfi S. Zulkifli, Wenbo Qian, Shaowei Zhu, Yuan Zhou, Zhen Zhang, Chang Lou
Semantic-guided LoRA Parameters Generation Authors: Miaoge Li, Yang Chen, Zhijie Rao, Can Jiang, Jingcai Guo
Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction Authors: Marzieh Ajirak, Oded Bein, Ellen Rose Bowen, Dora Kanellopoulos, Avital Falk, Faith M. Gunning, Nili Solomonov, Logan Grosenick
ResidualViT for Efficient Temporally Dense Video Encoding Authors: Mattia Soldan, Fabian Caba Heilbron, Bernard Ghanem, Josef Sivic, Bryan Russell
Visualization and Analysis of the Loss Landscape in Graph Neural Networks Authors: Samir Moustafa, Lorenz Kummer, Simon Fetzel, Nils M. Kriege, Wilfried N. Gansterer
RepIt: Representing Isolated Targets to Steer Language Models Authors: Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang
Learning non-Markovian Dynamical Systems with Signature-based Encoders Authors: Eliott Pradeleix, Rémy Hosseinkhan-Boucher, Alena Shilova, Onofrio Semeraro, Lionel Mathelin
M4GN: Mesh-based Multi-segment Hierarchical Graph Network for Dynamic Simulations Authors: Bo Lei, Victor M. Castillo, Yeping Hu
CIARD: Cyclic Iterative Adversarial Robustness Distillation Authors: Liming Lu, Shuchao Pang, Xu Zheng, Xiang Gu, Anan Du, Yunhuai Liu, Yongbin Zhou
Deceptive Risk Minimization: Out-of-Distribution Generalization by Deceiving Distribution Shift Detectors Authors: Anirudha Majumdar
Feature Space Topology Control via Hopkins Loss Authors: Einari Vaaras, Manu Airaksinen
A Differential Manifold Perspective and Universality Analysis of Continuous Attractors in Artificial Neural Networks Authors: Shaoxin Tian, Hongkai Liu, Yuying Yang, Jiali Yu, Zizheng Miao, Xuming Huang, Zhishuai Liu, Zhang Yi
Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework Authors: Siming Fu, Sijun Dong, Xiaoliang Meng
MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors Authors: Xin Tong, Zhi Lin, Jingya Wang, Meng Han, Bo Jin
Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning Authors: Jia-Qi Yang, Lei Shi
LTA-thinker: Latent Thought-Augmented Training Framework for Large Language Models on Complex Reasoning Authors: Jiaqi Wang, Binquan Ji, Haibo Luo, Yiyang Qi, Ruiting Li, Huiyan Wang, Yuantao Han, Cangyi Yang, jiaxu Zhang, Feiliang Ren
The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations Authors: Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, Jing Shao
SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, Linfeng Zhang
Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors Authors: Aniket Didolkar, Nicolas Ballas, Sanjeev Arora, Anirudh Goyal
Gradient Estimation Methods of Approximate Multipliers for High-Accuracy Retraining of Deep Learning Models Authors: Chang Meng, Wayne Burleson, Giovanni De Micheli
Kalman Bayesian Transformer Authors: Haoming Jing, Oren Wright, José M. F. Moura, Yorie Nakahira
A Time-Series Foundation Model by Universal Delay Embedding Authors: Zijian Wang, Peng Tao, Jifan Shi, Rui Bao, Rui Liu, Luonan Chen

1. Positional Encoding via Token-Aware Phase Attention

ArXiv ID: 2509.12635

Authors: Yu (Sid) Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian

Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context than RoPE families.

Comment: Matches Model Architecture: new positional encoding (TAPA) for Transformers with theory on RoPE’s bias; improves long-context extrapolation.

Relevance: 10 Novelty: 9

2. PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint

ArXiv ID: 2509.10971

Authors: Bhoomit Vasani, Jack FitzGerald, Anjie Fang, Sushmit Vaish

Abstract: We introduce PHLoRA (pronounced "flora"; Post-hoc LoRA), a simple yet powerful method to extract low-rank adaptation adapters from full-rank fine-tuned models without requiring access to training data or gradients. By computing the low-rank decomposition of weight differences between a base model and its fine-tuned counterpart, our method reconstructs adapter modules that can be merged or dynamically routed at inference time via S-LoRA, or served in scalable industry settings using platforms like NVIDIA NIM. This approach amortizes latency overhead across requests and yields substantial cost savings. Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation, allowing adapter extraction from existing full-rank models or third-party checkpoints. Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that extracted adapters preserve high energy from the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged. Overall, PHLoRA provides a practical path for making all existing full-rank checkpoints adapter-ready, democratizing scalable inference for all models.

Comment: Matches Compression/Efficiency: data-free low-rank adapter (LoRA) extraction from full-rank checkpoints enabling scalable inference and pruning.

Relevance: 10 Novelty: 8
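
A minimal sketch of the idea in the abstract above, assuming the "low-rank decomposition of weight differences" is a truncated SVD of a single weight matrix's delta. The function name `extract_adapter`, the toy shapes, and the definition of "energy" as the fraction of squared singular values retained are illustrative assumptions, not the paper's code.

```python
import numpy as np

def extract_adapter(w_base, w_tuned, rank):
    """Return LoRA-style factors (b, a) with b @ a ~= w_tuned - w_base."""
    delta = w_tuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # (out, rank); singular values folded into b
    a = vt[:rank, :]             # (rank, in)
    # Assumed energy metric: share of squared singular values kept.
    energy = float((s[:rank] ** 2).sum() / (s ** 2).sum())
    return b, a, energy

# Toy data: a base weight plus an exactly rank-4 fine-tuning update.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 32))
true_delta = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
w_tuned = w_base + true_delta

b, a, energy = extract_adapter(w_base, w_tuned, rank=4)
w_remerged = w_base + b @ a      # re-merge the extracted adapter
```

Because the toy update has rank 4, a rank-4 adapter recovers it almost exactly; real fine-tuning deltas are only approximately low-rank, so the retained energy would fall below 1.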

3. Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

ArXiv ID: 2509.10534

Authors: Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

Abstract: The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities, whereas RoPE's performance degrades significantly on longer sequences at test time without fine tuning or the use of position-interpolation methods.

Comment: Model Architecture: introduces PoPE, a positional encoding that disentangles content vs. position in Transformers and improves length extrapolation.

Relevance: 10 Novelty: 8

4. From PowerSGD to PowerSGD+: Low-Rank Gradient Compression for Distributed Optimization with Convergence Guarantees

ArXiv ID: 2509.11254

Authors: Shengping Xie, Chuyan Chen, Kun Yuan

Abstract: Low-rank gradient compression methods, such as PowerSGD, have gained attention in communication-efficient distributed optimization. However, the convergence guarantees of PowerSGD remain unclear, particularly in stochastic settings. In this paper, we show that PowerSGD does not always converge to the optimal solution and provide a clear counterexample to support this finding. To address this, we introduce PowerSGD+, which periodically updates the projection subspace via singular value decomposition, ensuring that it remains aligned with the optimal subspace. We prove that PowerSGD+ converges under standard assumptions and validate its effectiveness through empirical evaluation on large language model tasks.

Comment: Model Compression and Efficiency: low-rank gradient compression with periodic SVD subspace updates and formal convergence guarantees (PowerSGD+).

Relevance: 10 Novelty: 8
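
The abstract does not give the update rule, so the following is only a rough sketch of "periodically updating the projection subspace via SVD" under simplifying assumptions: a single worker, no error feedback, and an orthonormal basis `q` that is refined by one power iteration per step and reset from an SVD every `refresh_every` steps. All function names and the refresh schedule are hypothetical.

```python
import numpy as np

def compress(grad, q):
    """Rank-r reconstruction of grad using the orthonormal basis q (n x r)."""
    p = grad @ q              # (m x r) factor that would be communicated
    return p @ q.T

def svd_basis(grad, rank):
    """Top-`rank` right singular vectors of grad as an (n x r) basis."""
    _, _, vt = np.linalg.svd(grad, full_matrices=False)
    return vt[:rank].T

def power_step(grad, q):
    """One power iteration (with QR re-orthonormalization) to refine q."""
    p, _ = np.linalg.qr(grad @ q)
    q_new, _ = np.linalg.qr(grad.T @ p)
    return q_new

rng = np.random.default_rng(0)
rank, refresh_every = 2, 5
grads = [rng.normal(size=(20, 10)) for _ in range(10)]  # stand-in gradients
q, _ = np.linalg.qr(rng.normal(size=(10, rank)))        # arbitrary start

errors = []
for step, grad in enumerate(grads):
    if step % refresh_every == 0:
        q = svd_basis(grad, rank)      # periodic SVD refresh
    else:
        q = power_step(grad, q)        # cheap refinement in between
    errors.append(np.linalg.norm(grad - compress(grad, q)))
```

On refresh steps the compression error matches the best possible rank-r error for that gradient (Eckart-Young); between refreshes the basis may lag behind, which is the drift the periodic refresh is meant to bound.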

5. Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

ArXiv ID: 2509.12464

Authors: Ryan Lucas, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, Rahul Mazumder

Abstract: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC

Comment: Model Compression and Efficiency: pruning via joint reconstruction of inputs and on-policy chain-of-thought for decode-dominated reasoning models (RAC).

Relevance: 10 Novelty: 8

6. Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

ArXiv ID: 2509.11983

Authors: Chuan He, Zhanwang Deng, Zhaosong Lu

Abstract: Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon, which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose low-rank orthogonalization, which explicitly leverages the low-rank nature of gradients during NN training. Building on this, we propose low-rank matrix-signed gradient descent and a low-rank variant of Muon. Our numerical experiments demonstrate the superior performance of low-rank orthogonalization, with the low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining -- surpassing the performance of the carefully tuned vanilla Muon. Theoretically, we establish the iteration complexity of the low-rank matrix-signed gradient descent for finding an approximate stationary solution, as well as that of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise.

Comment: Model Compression and Efficiency: introduces low-rank orthogonalization exploiting low-rank gradients; also relevant to HPC/foundation model training optimizers.

Relevance: 10 Novelty: 8

7. On Linear Mode Connectivity of Mixture-of-Experts Architectures

ArXiv ID: 2509.11348

Authors: Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen

Abstract: Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been observed to be connected--up to permutation symmetries--by linear paths in parameter space along which the loss remains consistently low. This observation challenges classical views of non-convex optimization and has implications for model ensembling, generalization, and our understanding of neural loss geometry. Inspired by recent studies on LMC in standard neural networks, we systematically investigate this phenomenon within Mixture-of-Experts (MoE) architectures--a class of models known for their scalability and computational efficiency, which combine traditional neural networks--referred to as experts--through a learnable gating mechanism. We begin by conducting a comprehensive analysis of both dense and sparse gating regimes, demonstrating that the symmetries inherent to MoE architectures are fully characterized by permutations acting on both the expert components and the gating function. Building on these foundational findings, we propose a matching algorithm that enables alignment between independently trained MoEs, thereby facilitating the discovery of LMC. Finally, we empirically validate the presence of LMC using our proposed algorithm across diverse MoE configurations--including dense, sparse, and shared-expert variants--under a wide range of model settings and datasets of varying scales and modalities. Our results confirm the existence of LMC in MoE architectures and offer fundamental insights into the functional landscape and optimization dynamics of deep learning models.

Comment: Model Architecture (MoE): analyzes symmetries and establishes linear mode connectivity in MoE; introduces expert/gating alignment algorithm.

Relevance: 10 Novelty: 8
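
To make the permutation symmetry concrete, here is a small sketch with a dense softmax-gated MoE whose experts are linear maps to a scalar (a deliberate simplification). It only illustrates that jointly permuting the experts and the gating rows leaves the function unchanged; it is not the paper's matching algorithm, and all names are hypothetical.

```python
import math
import random

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def moe(x, gate_rows, expert_rows):
    """Dense MoE with linear experts: sum_i softmax(G x)_i * (w_i . x)."""
    weights = softmax([dot(g, x) for g in gate_rows])
    return sum(w * dot(e, x) for w, e in zip(weights, expert_rows))

random.seed(0)
dim, n_experts = 4, 3
gate = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
x = [random.gauss(0, 1) for _ in range(dim)]

# Apply the same permutation to the gating rows and to the experts.
perm = [2, 0, 1]
gate_p = [gate[i] for i in perm]
experts_p = [experts[i] for i in perm]

y = moe(x, gate, experts)
y_p = moe(x, gate_p, experts_p)   # same function, different parameters
```

Two independently trained MoEs can therefore sit at permuted copies of similar solutions, which is why an alignment step is needed before interpolating linearly between their weights.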

8. Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning

ArXiv ID: 2509.10513

Authors: Sugyeong Eo, Jungjun Lee, Chanjun Park, Heuiseok Lim

Abstract: A sparse Mixture-of-Experts (MoE) architecture has emerged as a highly scalable solution by conditionally activating sub-modules without a proportional increase in computational costs. However, improving expert specialization to enhance performance and generalization remains a challenge for MoE, especially in instruction tuning scenarios characterized by significant input heterogeneity. In this work, we propose the Mixture-of-Clustered-Experts (MoCE) to address this limitation through a dual-stage routing mechanism. The first stage in the mechanism performs expert group routing based on sequence-level features, while the second stage activates the top-$k$ experts within the group at the token level. This approach enables the effective partitioning of heterogeneous inputs based on their knowledge requirements, encouraging expert group specialization while maintaining the advantages of token-level routing. We evaluate MoCE across a comprehensive set of benchmarks, demonstrating its consistent superiority over strong baselines and its enhanced generalization capabilities. Detailed analysis further highlights the robustness and effectiveness of MoCE.

Comment: Model Architecture (MoE): proposes dual-stage routing (sequence-level group routing + token-level top-k) to improve expert specialization/generalization.

Relevance: 10 Novelty: 8
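
The dual-stage routing described above can be sketched as follows. The sequence-level feature (mean of token vectors), the dot-product scoring, and every name here are assumptions made for illustration; the abstract does not specify them.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(tokens):
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def route(tokens, group_keys, expert_keys, groups, k):
    """Stage 1: pick one expert group per sequence.
    Stage 2: pick the top-k experts inside that group per token."""
    seq_feat = mean_vector(tokens)                     # assumed sequence feature
    group_id = max(range(len(group_keys)),
                   key=lambda g: dot(group_keys[g], seq_feat))
    candidates = groups[group_id]
    choices = []
    for tok in tokens:
        ranked = sorted(candidates,
                        key=lambda e: dot(expert_keys[e], tok),
                        reverse=True)
        choices.append(ranked[:k])
    return group_id, choices

group_keys = [[1.0, 0.0], [0.0, 1.0]]
groups = [[0, 1, 2], [3, 4, 5]]                        # expert ids per group
expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0],
               [0.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
tokens = [[2.0, 0.1], [1.5, -2.0]]

group_id, choices = route(tokens, group_keys, expert_keys, groups, k=2)
```

Both tokens are confined to the experts of the single group chosen for the sequence, yet each token still selects its own top-2 experts within that group.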

9. Long-time dynamics and universality of nonconvex gradient descent

ArXiv ID: 2509.11426

Authors: Qiyang Han

Abstract: This paper develops a general approach to characterize the long-time trajectory behavior of nonconvex gradient descent in generalized single-index models in the large aspect ratio regime. In this regime, we show that for each iteration the gradient descent iterate concentrates around a deterministic vector called the 'Gaussian theoretical gradient descent', whose dynamics can be tracked by a state evolution system of two recursive equations for two scalars. Our concentration guarantees hold universally for a broad class of design matrices and remain valid over long time horizons until algorithmic convergence or divergence occurs. Moreover, our approach reveals that gradient descent iterates are in general approximately independent of the data and strongly incoherent with the feature vectors, a phenomenon previously known as the 'implicit regularization' effect of gradient descent in specific models under Gaussian data. As an illustration of the utility of our general theory, we present two applications of different natures in the regression setting. In the first, we prove global convergence of nonconvex gradient descent with general independent initialization for a broad class of structured link functions, and establish universality of randomly initialized gradient descent in phase retrieval for large aspect ratios. In the second, we develop a data-free iterative algorithm for estimating state evolution parameters along the entire gradient descent trajectory, thereby providing a low-cost yet statistically valid tool for practical tasks such as hyperparameter tuning and runtime determination. As a by-product of our analysis, we show that in the large aspect ratio regime, the Gaussian theoretical gradient descent coincides with a recent line of dynamical mean-field theory for gradient descent over the constant-time horizon.

Comment: Matches Representation Learning: rigorous theory for long-time gradient descent dynamics and implicit regularization (training dynamics).

Relevance: 9 Novelty: 9

10. Contextuality, Holonomy and Discrete Fiber Bundles in Group-Valued Boltzmann Machines

ArXiv ID: 2509.10536

Authors: Jean-Pierre Magnot

Abstract: We propose a geometric extension of restricted Boltzmann machines (RBMs) by allowing weights to take values in abstract groups such as $\mathrm{GL}_n(\mathbb{R})$, $\mathrm{SU}(2)$, or even infinite-dimensional operator groups. This generalization enables the modeling of complex relational structures, including projective transformations, spinor dynamics, and functional symmetries, with direct applications to vision, language, and quantum learning. A central contribution of this work is the introduction of a contextuality index based on group-valued holonomies computed along cycles in the RBM graph. This index quantifies the global inconsistency or "curvature" induced by local weights, generalizing classical notions of coherence, consistency, and geometric flatness. We establish links with sheaf-theoretic contextuality, gauge theory, and noncommutative geometry, and provide numerical and diagrammatic examples in both finite and infinite dimensions. This framework opens novel directions in AI, from curvature-aware learning architectures to topological regularization in uncertain or adversarial environments.

Comment: Matches Model Architecture: extends RBMs to group-valued weights and introduces a holonomy-based contextuality index (topological/geometric regularization).

Relevance: 9 Novelty: 9

11. Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts

ArXiv ID: 2509.10530

Authors: Cheng Li, Jiexiong Liu, Yixuan Chen, Jie ji

Abstract: Transformer models based on the Mixture of Experts (MoE) architecture have made significant progress in long-sequence modeling, but existing models still have shortcomings in computational efficiency and the ability to capture long-range dependencies, especially in terms of the dynamic adaptability of expert resource allocation. In this paper, we propose a Dynamic Adaptive Shared Expert and Grouped Multi-Head Attention Hybrid Model (DASG-MoE) to enhance long-sequence modeling capabilities by integrating three modules. First, we employ the Grouped Multi-Head Attention (GMHA) mechanism to effectively reduce the computational complexity of long sequences. By parallel processing through sequence grouping, local sliding window attention, and feature aggregation, we address long-range dependency issues and the model's lack of generalization for local information. Second, we design a Dual-Scale Shared Expert Structure (DSSE), where shallow experts use lightweight computations to quickly respond to low-dimensional features, while deep experts process high-dimensional complex semantics through pre-training transfer and post-training optimization, achieving a dynamic balance between efficiency and accuracy. Third, we propose a hierarchical Adaptive Dynamic Routing (ADR) mechanism that dynamically selects expert levels based on feature complexity and task requirements, and optimizes resource allocation through a local expert activation strategy. Experiments on multiple long-sequence benchmark datasets demonstrate that our DASG-MoE model outperforms state-of-the-art models.

Comment: Strongly matches Model Architecture (MoE/Transformers): proposes grouped multi-head attention, dual-scale shared experts, and adaptive dynamic routing for efficient expert allocation.

Relevance: 10 Novelty: 7

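12. Identifiable Autoregressive Variational Autoencoders for Nonlinear and Nonstationary Spatio-Temporal Blind Source Separation

ArXiv ID: 2509.11962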

Authors: Mika Sipilä, Klaus Nordhausen, Sara Taskinen

Abstract: The modeling and prediction of multivariate spatio-temporal data involve numerous challenges. Dimension reduction methods can significantly simplify this process, provided that they account for the complex dependencies between variables and across time and space. Nonlinear blind source separation has emerged as a promising approach, particularly following recent advances in identifiability results. Building on these developments, we introduce the identifiable autoregressive variational autoencoder, which ensures the identifiability of latent components consisting of nonstationary autoregressive processes. The blind source separation efficacy of the proposed method is showcased through a simulation study, where it is compared against state-of-the-art methods, and the spatio-temporal prediction performance is evaluated against several competitors on air pollution and weather datasets.

Comment: Matches Model Architecture/Representation Learning: identifiable autoregressive VAE with identifiability guarantees for nonlinear, nonstationary sources.

Relevance: 9 Novelty: 8

13. Why and How Auxiliary Tasks Improve JEPA Representations

ArXiv ID: 2509.12249

Authors: Jiacan Yu, Siyi Chen, Mingrui Liu, Nono Horiuchi, Vladimir Braverman, Zicheng Xu, Dan Haramati, Randall Balestriero

Abstract: Joint-Embedding Predictive Architecture (JEPA) is increasingly used for visual representation learning and as a component in model-based RL, but its behavior remains poorly understood. We provide a theoretical characterization of a simple, practical JEPA variant that has an auxiliary regression head trained jointly with latent dynamics. We prove a No Unhealthy Representation Collapse theorem: in deterministic MDPs, if training drives both the latent-transition consistency loss and the auxiliary regression loss to zero, then any pair of non-equivalent observations, i.e., those that do not have the same transition dynamics or auxiliary label, must map to distinct latent representations. Thus, the auxiliary task anchors which distinctions the representation must preserve. Controlled ablations in a counting environment corroborate the theory and show that training the JEPA model jointly with the auxiliary head generates a richer representation than training them separately. Our work indicates a path to improve JEPA encoders: training them with an auxiliary function that, together with the transition dynamics, encodes the right equivalence relations.

Comment: Representation Learning: provides theory (no unhealthy collapse) for JEPA with auxiliary tasks, clarifying what distinctions encoders must preserve.

Relevance: 9 Novelty: 8

14. Harnessing Optimization Dynamics for Curvature-Informed Model Merging

ArXiv ID: 2509.11167

Authors: Pouria Mahdavinia, Hamed Mahdavi, Niloofar Mireshghallah, Mehrdad Mahdavi

Abstract: Model merging is an effective post-training strategy for composing capabilities in large language models without joint retraining. We study this in the supervised fine-tuning (SFT) stage, where multiple capability-based SFT checkpoints -- spanning math, code, precise instruction following, general instruction following, and knowledge recall -- must be consolidated into a single model. We introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware aggregation that leverages optimizer second-moment statistics as a diagonal curvature proxy to reweight parameter edits and mitigate interference. Complementing OTA, we propose Fast Fisher Grafting (FFG), a curvature-driven task-localization step that sparsifies conflicting or low-importance edits. FFG induces extremely low-rank masks concentrated in early attention query/key projections and token embeddings, exploiting shared curvature across capabilities. We further develop a memory-light compression of the second moments that preserves OTA's effect. Across diverse capability-based SFT checkpoints, OTA+FFG improves merged-model quality over strong weight-space baselines, reduces negative transfer, and remains robust across sparsity levels. Analyses reveal substantial curvature overlap between checkpoints, offering a novel lens on why simple linear merging can be effective in practice. Ablations confirm that FFG is critical for reducing task interference and that the compressed second moments retain the gains of the full formulation. To facilitate reproducibility, we open-source all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints at https://github.com/pmahdavi/ota-merge.

Comment: Model Compression and Efficiency: curvature-aware model merging (OTA) with sparse/low-rank grafting (FFG) to compose SFT capabilities without joint retraining.

Relevance: 9 Novelty: 8
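
As a rough illustration of "reweighting parameter edits" with a diagonal curvature proxy, the sketch below averages each checkpoint's edit (its weights minus the base weights) with per-parameter weights given by that checkpoint's optimizer second moment. The exact weighting used by OTA is not stated in the abstract, so this formula, the function name, and the toy numbers are assumptions.

```python
def curvature_weighted_merge(base, tuned, second_moments, eps=1e-12):
    """Per-parameter weighted average of edits (tuned - base).

    base: list of floats; tuned and second_moments: one list per checkpoint.
    A larger second moment means that checkpoint's edit counts for more.
    """
    merged = []
    for j, b in enumerate(base):
        num = sum(v[j] * (w[j] - b) for w, v in zip(tuned, second_moments))
        den = sum(v[j] for v in second_moments) + eps
        merged.append(b + num / den)
    return merged

base = [0.0, 0.0]
tuned = [[1.0, 0.0],    # e.g. a math checkpoint that mainly moved parameter 0
         [0.0, 2.0]]    # e.g. a code checkpoint that mainly moved parameter 1
second_moments = [[0.9, 0.1],
                  [0.1, 0.9]]

merged = curvature_weighted_merge(base, tuned, second_moments)
uniform = [sum(w[j] for w in tuned) / len(tuned) for j in range(len(base))]
```

Here the curvature-weighted result stays close to each checkpoint's own edit on the parameter it cares about (about 0.9 and 1.8), whereas uniform averaging halves both edits (0.5 and 1.0).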

15. Learning Neural Networks by Neuron Pursuit

ArXiv ID: 2509.12154

Authors: Akshay Kumar, Jarvis Haupt

Abstract: The first part of this paper studies the evolution of gradient flow for homogeneous neural networks near a class of saddle points exhibiting a sparsity structure. The choice of these saddle points is motivated from previous works on homogeneous networks, which identified the first saddle point encountered by gradient flow after escaping the origin. It is shown here that, when initialized sufficiently close to such saddle points, gradient flow remains near the saddle point for a sufficiently long time, during which the set of weights with small norm remain small but converge in direction. Furthermore, important empirical observations are made on the behavior of gradient descent after escaping these saddle points. The second part of the paper, motivated by these results, introduces a greedy algorithm to train deep neural networks called Neuron Pursuit (NP). It is an iterative procedure which alternates between expanding the network by adding neuron(s) with carefully chosen weights, and minimizing the training loss using this augmented network. The efficacy of the proposed algorithm is validated using numerical experiments.

Comment: Matches Representation Learning and training dynamics: analyzes gradient flow near sparse saddle points and introduces a greedy architecture growth/training algorithm (Neuron Pursuit).

Relevance: 9 Novelty: 8

16. AQUA: Attention via QUery mAgnitudes for Memory and Compute Efficient Inference in LLMs

ArXiv ID: 2509.11155

Authors: Santhosh G S, Saurav Prakash, Balaraman Ravindran

Abstract: The quadratic complexity of the attention mechanism remains a fundamental barrier to scaling Large Language Models (LLMs) to longer contexts, creating a critical bottleneck in both computation and memory. To address this, we introduce AQUA (Attention via QUery mAgnitudes), a novel and versatile approximation strategy that significantly reduces the cost of attention with a graceful performance trade-off. Our method operates in two phases: an efficient offline step where we compute a universal, language-agnostic projection matrix via SVD on a calibration dataset, and an online inference step where we project query and key vectors and dynamically select a sparse subset of dimensions based on the query's magnitude. We provide a formal theoretical analysis of AQUA, establishing the break-even point at which it becomes more computationally efficient than standard attention. Our empirical evaluations on state-of-the-art models like Llama-3.1-8B demonstrate that a 25% reduction in the attention dot-product computation can be achieved with a statistically insignificant impact on performance across a wide range of benchmarks. We further showcase the versatility of AQUA by demonstrating its ability to synergistically accelerate existing token eviction methods like H2O and to directly reduce KV-cache memory size. By offering a controllable knob to balance efficiency and accuracy, AQUA provides a practical and powerful tool for making large-scale LLM inference more accessible and sustainable.

Comment: Model Compression and Efficiency: approximate attention via SVD-based projection and query-magnitude sparsification; reduces KV compute and cache memory.

Relevance: 9 Novelty: 8
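
The online step described in the AQUA abstract (keep only the dimensions where the query is largest in magnitude) can be illustrated with a minimal sketch. It assumes the query and key have already been multiplied by the offline SVD projection, which is not shown; the function names and the `keep` parameter are illustrative and not taken from the paper.

```python
def top_magnitude_dims(query, keep):
    """Indices of the `keep` query coordinates with the largest absolute value."""
    order = sorted(range(len(query)), key=lambda i: abs(query[i]), reverse=True)
    return sorted(order[:keep])


def sparse_score(query, key, keep):
    """Approximate q . k using only the dimensions where |q_i| is largest."""
    dims = top_magnitude_dims(query, keep)
    return sum(query[i] * key[i] for i in dims)


# Hypothetical, already-projected vectors (the SVD projection is assumed done).
q = [0.1, -2.0, 0.05, 1.5]
k = [1.0, 0.5, -3.0, 2.0]

exact = sum(qi * ki for qi, ki in zip(q, k))  # full dot product over all 4 dims
approx = sparse_score(q, k, keep=2)           # dot product over dims 1 and 3 only
```

With `keep` equal to the full dimension the approximation coincides with the exact score; lowering `keep` trades accuracy for fewer multiply-adds, which is the "controllable knob" the abstract mentions.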

17. TinyServe: Query-Aware Cache Selection for Efficient LLM Serving

ArXiv ID: 2509.12211

Authors: Dong Liu, Yanxuan Yu

Abstract: Serving large language models (LLMs) efficiently remains challenging due to the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding. We present TinyServe, a lightweight and extensible serving system for deploying tiny LLMs (e.g., TinyLLaMA, GPT2-345M) with support for structured KV sparsity, plugin-based token selection, and hardware-efficient attention kernels. Unlike prior simulation frameworks, TinyServe executes real-time decoding with configurable sparsity strategies and fine-grained instrumentation. To reduce decoding cost, we introduce a query-aware page selection mechanism that leverages bounding-box metadata to estimate attention relevance between the query and KV cache blocks. This enables selective KV loading with minimal overhead and no model modifications. Our fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass. Experiments show that TinyServe achieves up to 3.4x speedup and over 2x memory savings with negligible accuracy drop. Additional analysis of cache reuse, page hit rate, and multi-GPU scaling confirms its practicality as an efficient system-level design for LLM training and inference research on resource-constrained hardware.

Comment: High-Performance Computing/Systems: query-aware KV-cache page selection with fused CUDA kernel enabling structured KV sparsity for efficient LLM serving.

Relevance: 9 Novelty: 8
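
One plausible reading of the "bounding-box metadata" in the TinyServe abstract is a per-page, per-dimension minimum and maximum over the cached keys, from which an upper bound on the query-key dot product follows. The sketch below is written under that assumption; it is a plain-Python illustration with hypothetical names, not the paper's fused CUDA kernel.

```python
def page_bounds(keys):
    """Per-dimension (min, max) over all key vectors stored in one page."""
    dims = range(len(keys[0]))
    lo = [min(k[d] for k in keys) for d in dims]
    hi = [max(k[d] for k in keys) for d in dims]
    return lo, hi


def page_upper_bound(query, lo, hi):
    """Upper bound on q . k for any key inside the page's bounding box."""
    return sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi))


def select_pages(query, pages, budget):
    """Indices of the `budget` pages with the highest upper bound."""
    scored = []
    for idx, keys in enumerate(pages):
        lo, hi = page_bounds(keys)  # in a real system this would be precomputed
        scored.append((page_upper_bound(query, lo, hi), idx))
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:budget])


# Hypothetical toy cache: three pages of two 2-dimensional keys each.
pages = [
    [[1.0, 0.0], [0.5, 0.2]],
    [[-1.0, 2.0], [0.0, 1.0]],
    [[0.1, -0.1], [0.0, 0.0]],
]
query = [1.0, -1.0]
chosen = select_pages(query, pages, budget=2)
```

Because the bound never underestimates any key's score inside a page, a page whose bound is low can be skipped without loading its keys, which is what makes selective KV loading cheap.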

18. LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications

ArXiv ID: 2509.12053

Authors: Yujun Lin, Zhekai Zhang, Song Han

Abstract: Modern tensor applications, especially foundation models and generative AI applications, require multiple input modalities (both vision and language), which increases the demand for flexible accelerator architecture. Existing frameworks suffer from the trade-off between design flexibility and productivity of RTL generation: either limited to very few hand-written templates or cannot automatically generate the RTL. To address this challenge, we propose the LEGO framework, which targets tensor applications and automatically generates spatial architecture design and outputs synthesizable RTL code without handwritten RTL design templates. Leveraging the affine-transformation-based architecture representation, LEGO front end finds interconnections between function units, synthesizes the memory system, and fuses different spatial dataflow designs based on data reuse analysis. LEGO back end then translates the hardware in a primitive-level graph to perform lower-level optimizations, and applies a set of linear-programming algorithms to optimally insert pipeline registers and reduce the overhead of unused logic when switching spatial dataflows. Our evaluation demonstrates that LEGO can achieve 3.2x speedup and 2.4x energy efficiency compared to previous work Gemmini, and can generate one architecture for diverse modern foundation models in generative AI applications.

Comment: High-Performance Computing/Systems: automatic spatial accelerator RTL generation with affine architecture representation and LP-based pipeline/register optimization.

Relevance: 9 Novelty: 8

19. DOSA: Differentiable Model-Based One-Loop Search for DNN Accelerators

ArXiv ID: 2509.10702

Authors: Charles Hong, Qijing Huang, Grace Dinh, Mahesh Subedar, Yakun Sophia Shao

Abstract: In the hardware design space exploration process, it is critical to optimize both hardware parameters and algorithm-to-hardware mappings. Previous work has largely approached this simultaneous optimization problem by separately exploring the hardware design space and the mapspace - both individually large and highly nonconvex spaces - independently. The resulting combinatorial explosion has created significant difficulties for optimizers. In this paper, we introduce DOSA, which consists of differentiable performance models and a gradient descent-based optimization technique to simultaneously explore both spaces and identify high-performing design points. Experimental results demonstrate that DOSA outperforms random search and Bayesian optimization by 2.80x and 12.59x, respectively, in improving DNN model energy-delay product, given a similar number of samples. We also demonstrate the modularity and flexibility of DOSA by augmenting our analytical model with a learned model, allowing us to optimize buffer sizes and mappings of a real DNN accelerator and attain a 1.82x improvement in energy-delay product.

Comment: High-Performance Computing: differentiable performance models enabling joint optimization of hardware parameters and DNN mapspace.

Relevance: 9 Novelty: 8
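
The core idea in the DOSA abstract, running gradient descent through a differentiable performance model, can be shown on a deliberately tiny example. The delay and energy formulas below are invented for illustration (they are not the paper's model), and a single continuous parameter `x` stands in for the joint hardware and mapping variables.

```python
def edp(x, work=64.0, fixed_delay=1.0, energy_per_unit=0.5, fixed_energy=2.0):
    """Toy energy-delay product: delay shrinks with x, energy grows with x."""
    delay = work / x + fixed_delay
    energy = energy_per_unit * x + fixed_energy
    return delay * energy


def edp_grad(x, work=64.0, fixed_delay=1.0, energy_per_unit=0.5, fixed_energy=2.0):
    """Analytic derivative of edp with respect to x (product rule)."""
    delay = work / x + fixed_delay
    energy = energy_per_unit * x + fixed_energy
    return (-work / x ** 2) * energy + delay * energy_per_unit


def optimize(x0, lr=1.0, steps=500):
    """Plain gradient descent, clamped so that x stays at least 1."""
    x = x0
    for _ in range(steps):
        x = max(1.0, x - lr * edp_grad(x))
    return x


best_x = optimize(x0=4.0)
```

For this toy model the minimizer can be derived by hand as sqrt(work * fixed_energy / (fixed_delay * energy_per_unit)) = 16, which gives a check that the descent loop behaves as intended.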

20. A Modern Look at Simplicity Bias in Image Classification Tasks

ArXiv ID: 2509.12265

Authors: Xiaoguang Chang, Teng Wang, Changyin Sun

Abstract: The simplicity bias (SB) of neural networks, i.e., their tendency to represent simple functions, is a key factor in their generalization capabilities. Recent studies show that an excessive SB may harm performance on complex tasks, and the need for this bias varies across tasks. Many of these studies focus on simple models or synthetic tasks. It remains challenging to measure the SB in large models and little is known about the relevance of the SB to various image classification tasks. In this paper, we investigate the relationship between the SB in CLIP models and their performance across image classification tasks. First, we theoretically analyze the potential limitation of existing measures of complexity that have been used to characterize small models. To address this, we propose a frequency-aware measure capturing finer-grained SB differences. We validate this measure on CLIP models subjected to two recent SB-modulation methods, demonstrating that it is more informative and consistent than previous measures. Second, we examine the relation between the SB of those models and their performance across a range of image classification tasks, including zero-shot and fine-tuning settings. These experiments reveal a range of behaviors. For example, a stronger SB correlates with a better performance on OOD generalization than on adversarial robustness. These results highlight the benefits of aligning a model's inductive biases with the characteristics of the target task.

Comment: Representation Learning: introduces a frequency-aware measure of simplicity bias in CLIP and links inductive bias to generalization/robustness.

Relevance: 9 Novelty: 7

21. Scaling Up Data Parallelism in Decentralized Deep Learning

ArXiv ID: 2509.12213

Authors: Bing Xie, Junqi Yin, Zhenyu Zhou, Sarp Oral, Feiyi Wang

Abstract: Although it has been extensively explored in theory, decentralized learning is not yet green-lighted for production use, largely due to a lack of stability, scalability, and generality in large scale DNN training. To shed light on the production use of decentralized learning, this work studies decentralized data parallel training at scale. To this end, we introduce a benchmarking framework, namely DBench, to host both centralized and decentralized DNN training. Building upon DBench, we introduce a benchmarking methodology to uncover the correlations between model accuracy and the variances of parameter tensors by varying communication graphs and training scales. Based on the benchmarking results, we observe that, (1) Similar to centralized learning, decentralized data parallel training also presents the issues of scalability and generality when the training scales up; (2) The model accuracy of decentralized learning is correlated to the number of connections in a communication graph; (3) The model accuracy of decentralized learning is surprisingly sensitive to the variance of parameter tensors across model replicas. Built upon the observations, we propose Ada, a decentralized adaptive approach that performs large scale DNN training following a decentralized SGD method and adapting the communication graph in use dynamically throughout training iterations. We apply Ada on large scale training and observe that Ada can obtain the best convergence rates consistently in decentralized DNN training, and delivers equally or comparably good model accuracy for all sample applications as centralized learning does, even when training ResNet50 for ImageNet-1K on the scale of 1008 GPUs.

Comment: High-Performance Computing: decentralized data-parallel training with adaptive communication graph (Ada) and large-scale distributed training insights.

Relevance: 9 Novelty: 7
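
Observations (2) and (3) in the abstract above relate accuracy to the number of graph connections and to the variance of parameters across replicas. The sketch below shows only the parameter-averaging half of decentralized SGD on two fixed graphs, with one scalar parameter per replica; gradient steps and Ada's adaptive graph schedule are omitted because the abstract does not specify them. All names are illustrative.

```python
def gossip_step(params, neighbors):
    """Each replica replaces its value with the mean over itself and its neighbors."""
    return [
        (params[i] + sum(params[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(params))
    ]


def variance(values):
    """Population variance of one parameter across replicas."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def ring(n):
    """Each replica talks to its two ring neighbors."""
    return [[(i - 1) % n, (i + 1) % n] for i in range(n)]


def complete(n):
    """Each replica talks to every other replica."""
    return [[j for j in range(n) if j != i] for i in range(n)]


# Hypothetical starting point: 8 replicas that have drifted apart.
params = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
after_ring = gossip_step(params, ring(8))
after_complete = gossip_step(params, complete(8))
```

One averaging round on the complete graph removes all cross-replica variance, while the sparser ring only reduces it, which gives some intuition for why a method might want denser graphs at some points in training and cheaper ones at others.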

22. Resource-Aware Neural Network Pruning Using Graph-based Reinforcement Learning

ArXiv ID: 2509.10526

Authors: Dieter Balemans, Thomas Huybrechts, Jan Steckel, Siegfried Mercelis

Abstract: This paper presents a novel approach to neural network pruning by integrating a graph-based observation space into an AutoML framework to address the limitations of existing methods. Traditional pruning approaches often depend on hand-crafted heuristics and local optimization perspectives, which can lead to suboptimal performance and inefficient pruning strategies. Our framework transforms the pruning process by introducing a graph representation of the target neural network that captures complete topological relationships between layers and channels, replacing the limited layer-wise observation space with a global view of network structure. The core innovations include a Graph Attention Network (GAT) encoder that processes the network's graph representation and generates a rich embedding. Additionally, for the action space we transition from continuous pruning ratios to fine-grained binary action spaces which enables the agent to learn optimal channel importance criteria directly from data, moving away from predefined scoring functions. These contributions are modelled within a Constrained Markov Decision Process (CMDP) framework, allowing the agent to make informed pruning decisions while adhering to resource constraints such as target compression rates. For this, we design a self-competition reward system that encourages the agent to outperform its previous best performance while satisfying the defined constraints. We demonstrate the effectiveness of our approach through extensive experiments on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet. The experiments show that our method consistently outperforms traditional pruning techniques, showing state-of-the-art results while learning task-specific pruning strategies that identify functionally redundant connections beyond simple weight magnitude considerations.

Comment: Model Compression and Efficiency: resource-aware pruning using a global graph representation with GAT encoder and CMDP-based decision-making.

Relevance: 9 Novelty: 7

23. AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models

ArXiv ID: 2509.12019

Authors: Sangjun Lee, Seung-taek Woo, Jungyu Jin, Changhun Lee, Eunhyeok Park

Abstract: To enable broader deployment of Large Language Models (LLMs), it is essential to identify the best-performing model under strict memory constraints. We present AMQ, Automated Mixed-Precision Weight-Only Quantization, a framework that assigns layer-wise quantization bit-widths to optimally balance model quality and memory usage. However, the combinatorial search space, with over 10^{100} possible configurations, makes conventional black-box optimization infeasible. AMQ overcomes this challenge through four key innovations: (1) search space pruning using prior knowledge to exclude unpromising configurations, (2) quantization proxy to bypass costly format conversions during search, (3) quality predictor to minimize evaluation overhead, and (4) iterative search-and-update strategy for fast and stable convergence. By integrating these components, AMQ efficiently explores the quality-efficiency landscape, reaching the Pareto frontier and yielding LLMs that are both compact and high-performing. Our code is available at https://github.com/dlwns147/amq.

Comment: Model Compression and Efficiency: automated mixed-precision weight-only quantization with search-space pruning and learned quality predictor.

Relevance: 9 Novelty: 7
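
The size of the search space quoted in the AMQ abstract follows from simple counting: with b candidate bit-widths per layer and n independently quantized layers there are b^n configurations. The sketch below works through that arithmetic for three hypothetical bit-width choices (the abstract does not list the actual candidates) and shows how the memory side of the trade-off can be summarized as a size-weighted average bit-width.

```python
import math


def num_configs(num_layers, num_bit_choices):
    """Number of layer-wise bit-width assignments."""
    return num_bit_choices ** num_layers


def layers_needed(num_bit_choices, target_log10=100):
    """Smallest layer count whose configuration count reaches 10**target_log10."""
    return math.ceil(target_log10 / math.log10(num_bit_choices))


def average_bits(bit_widths, layer_sizes):
    """Parameter-weighted mean bit-width of one configuration."""
    total = sum(layer_sizes)
    return sum(b * s for b, s in zip(bit_widths, layer_sizes)) / total


# Assumption: 3 candidate bit-widths per layer (for example 2, 3, and 4 bits).
threshold_layers = layers_needed(3)

# Hypothetical 3-layer model with layer sizes in parameters.
avg = average_bits([2, 4, 4], [100, 100, 200])
```

Under the three-choice assumption, a model needs roughly two hundred quantizable layers to exceed 10^100 configurations, which is why exhaustive or naive black-box search is ruled out.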

24. Black-box Model Merging for Language-Model-as-a-Service with Massive Model Repositories

ArXiv ID: 2509.12951

Authors: Shilian Chen, Jie Zhou, Tianyu Huai, Yujiang Lu, Junsong Li, Bihao Zhan, Qianjun Pan, Yutao Yang, Xin Li, Qin Chen, Hang Yan, Liang He

Abstract: Model merging refers to the process of integrating multiple distinct models into a unified model that preserves and combines the strengths and capabilities of the individual models. Most existing approaches rely on task vectors to combine models, typically under the assumption that model parameters are accessible. However, for extremely large language models (LLMs) such as GPT-4, which are often provided solely as black-box services through API interfaces (Language-Model-as-a-Service), model weights are not available to end users. This presents a significant challenge, which we refer to as black-box model merging (BMM) with massive LLMs. To address this challenge, we propose a derivative-free optimization framework based on the evolutionary algorithm (Evo-Merging) that enables effective model merging using only inference-time API queries. Our method consists of two key components: (1) sparsity-based denoising, designed to identify and filter out irrelevant or redundant information across models, and (2) sign-aware scaling, which dynamically computes optimal combination weights for the relevant models based on their performance. We also provide a formal justification, along with a theoretical analysis, for our asymmetric sparsification. Extensive experimental evaluations demonstrate that our approach achieves state-of-the-art results on a range of tasks, significantly outperforming existing strong baselines.

Comment: Matches Model Architecture/Efficiency: black-box model merging via derivative-free optimization with sparsity-based denoising and sign-aware scaling.

Relevance: 8 Novelty: 8

25. Spectral Bottleneck in Deep Neural Networks: Noise is All You Need

ArXiv ID: 2509.09719

Authors: Hemanth Chandravamsi, Dhanush V. Shenoy, Itay Zinn, Shimon Pisnoy, Steven H. Frankel

Abstract: Deep neural networks are known to exhibit a spectral learning bias, wherein low-frequency components are learned early in training, while high-frequency modes emerge more gradually in later epochs. However, when the target signal lacks low-frequency components and is dominated by broadband high frequencies, training suffers from a 'spectral bottleneck', and the model fails to reconstruct the entire signal, including the frequency components that lie within the network's representational capacity. We examine such a scenario in the context of implicit neural representations (INRs) with sinusoidal representation networks (SIRENs), focusing on the challenge of fitting high-frequency-dominant signals that are susceptible to spectral bottleneck. To effectively fit any target signal irrespective of its frequency content, we propose a generalized target-aware 'weight perturbation scheme' (WINNER - weight initialization with noise for neural representations) for network initialization. The scheme perturbs uniformly initialized weights with Gaussian noise, where the noise scales are adaptively determined by the spectral centroid of the target signal. We show that the noise scales can provide control over the spectra of network activations and the eigenbasis of the empirical neural tangent kernel. This method not only addresses the spectral bottleneck but also yields faster convergence and improved representation accuracy, outperforming state-of-the-art approaches in audio fitting and achieving notable gains in image fitting and denoising tasks. Beyond signal reconstruction, our approach opens new directions for adaptive weight initialization strategies in computer vision and scientific machine learning.

Comment: Matches Representation Learning/training dynamics: proposes target-aware weight initialization with noise to overcome spectral bias in INRs; analyzes activation spectra and NTK eigenbasis.

Relevance: 8 Novelty: 8
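
The two ingredients named in the abstract above, a spectral centroid of the target and Gaussian noise added to uniformly initialized weights, can be sketched as follows. The centroid uses the standard magnitude-weighted mean over DFT bins; the mapping from centroid to noise scale (`0.05 * centroid / nyquist_bin`) is a placeholder assumption, since the abstract does not give the paper's formula.

```python
import cmath
import math
import random


def spectral_centroid(signal):
    """Magnitude-weighted mean frequency bin over the non-negative DFT bins."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    total = sum(mags)
    return sum(k * m for k, m in enumerate(mags)) / total if total else 0.0


def perturb_weights(weights, noise_scale, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return [w + rng.gauss(0.0, noise_scale) for w in weights]


# Hypothetical target: a pure sinusoid at frequency bin 5 of a 64-sample window.
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
centroid = spectral_centroid(signal)

rng = random.Random(0)
base = [rng.uniform(-0.1, 0.1) for _ in range(8)]  # uniform initialization
nyquist_bin = n / 2
noise_scale = 0.05 * centroid / nyquist_bin        # ASSUMED mapping, not the paper's
noisy = perturb_weights(base, noise_scale, rng)
```

The point of the sketch is the data flow: a target whose energy sits at higher frequencies yields a larger centroid and therefore, under this assumed mapping, a stronger perturbation of the initial weights.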

26. Verifying Computational Graphs in Production-Grade Distributed Machine Learning Frameworks

ArXiv ID: 2509.10694

Authors: Kahfi S. Zulkifli, Wenbo Qian, Shaowei Zhu, Yuan Zhou, Zhen Zhang, Chang Lou

Abstract: Modern machine learning frameworks support very large models by incorporating parallelism and optimization techniques. Yet, these very techniques add new layers of complexity, introducing silent errors that severely degrade model performance. Existing solutions are either ad hoc or too costly for production. We present Scalify, a lightweight framework that exposes silent errors by verifying semantic equivalence of computational graphs using equality saturation and Datalog-style reasoning. To scale, Scalify partitions graphs with parallel rewriting and layer memoization, reuses rewrite templates, and augments equality saturation with relational reasoning and symbolic bijection inference. It further localizes discrepancies to precise code sites, turning verification results into actionable debugging guidance. Scalify verifies models as large as Llama-3.1-405B within minutes on a commodity machine and exposed five unknown bugs in Amazon production machine learning frameworks.

Comment: High-Performance Computing/Systems: semantic equivalence verification of large distributed ML computational graphs via equality saturation and Datalog-style reasoning.

Relevance: 8 Novelty: 8

27. Semantic-guided LoRA Parameters Generation

ArXiv ID: 2509.10535

Authors: Miaoge Li, Yang Chen, Zhijie Rao, Can Jiang, Jingcai Guo

Abstract: Low-Rank Adaptation (LoRA) has demonstrated strong generalization capabilities across a variety of tasks for efficiently fine-tuning AI models, especially on resource-constrained edges. However, in real-world applications, edge users often exhibit task-specific preferences that are difficult to handle with a unified model trained under a closed-world assumption, and the challenge may further increase when there are significant domain shifts between training and deployment. Meanwhile, retraining/fine-tuning models for each user is also impractical due to its cost-intensive nature and privacy concerns over raw data utilization from edges. To address these challenges, we propose Semantic-guided LoRA Parameter Generation (SG-LoRA), a first-of-its-kind framework to efficiently produce user-specific LoRA parameters without any additional training on user tasks or access to user-specific data. Concretely, SG-LoRA uses task descriptions as the semantic bridge, measuring their proximity to a set of known expert tasks in a shared embedding space. Based on this semantic guidance, it models the target task's LoRA parameter distribution to generate high-performing parameters for novel tasks. SG-LoRA enables the real-time construction of LoRA models aligned with individual intents by distilling knowledge from prominent LoRA experts and, meanwhile, offering a privacy-preserving solution for personalized model adaptation in a novel zero-shot open-world setting proposed in this work. Extensive experiments on multiple challenging tasks confirm the superior performance and remarkable adaptability of SG-LoRA. Code is available at https://github.com/keepgoingjkg/SG-LoRA.

Comment: Matches Compression/Efficiency: Low-Rank Adaptation (LoRA) parameter generation via semantic guidance for zero-shot personalization without retraining.