Personalized Daily ArXiv Papers 2025-09-19

[gpt-5]	Prompt	Completion	Total
Token	35904	38010	73914
Cost	$0.04	$0.38	$0.42

Total arXiv papers: 447

Total scanned papers: 272

Total relevant papers: 24

Table of contents with paper titles:

1. LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures (Authors: Hai Huang, Yann LeCun, Randall Balestriero)
2. Pre-training under infinite compute (Authors: Konwoo Kim, Suhas Kotha, Percy Liang, Tatsunori Hashimoto)
3. Stochastic Clock Attention for Aligning Continuous and Ordered Sequences (Authors: Hyungjoon Soh, Junghyo Jo)
4. Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models (Authors: Samet Demir, Zafer Dogan)
5. SparseDoctor: Towards Efficient Chat Doctor with Mixture of Experts Enhanced Large Language Models (Authors: Zhang Jianbin, Yulin Zhu, Wai Lun Lo, Richard Tai-Chiu Hsung, Harris Sik-Ho Tsang, Kai Zhou)
6. Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs (Authors: Ye Qiao, Sitao Huang)
7. Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting (Authors: Liran Nochumsohn, Raz Marshanski, Hedi Zisling, Omri Azencot)
8. Towards Pre-trained Graph Condensation via Optimal Transport (Authors: Yeyu Yan, Shuai Zheng, Wenjun Hui, Xiangkai Zhu, Dong Chen, Zhenfeng Zhu, Yao Zhao, Kunlun He)
9. Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers (Authors: Andrei Chertkov, Artem Basharin, Mikhail Saygin, Evgeny Frolov, Stanislav Straupe, Ivan Oseledets)
10. Precision Neural Networks: Joint Graph And Relational Learning (Authors: Andrea Cavallo, Samuel Rey, Antonio G. Marques, Elvin Isufi)
11. DeCoP: Enhancing Self-Supervised Time Series Representation with Dependency Controlled Pre-training (Authors: Yuemin Wu, Zhongze Wu, Xiu Su, Feng Yang, Hongyan Xu, Xi Lin, Wenti Huang, Shan You, Chang Xu)
12. eIQ Neutron: Redefining Edge-AI Inference with Integrated NPU and Compiler Innovations (Authors: Lennart Bamberg, Filippo Minnella, Roberto Bosio, Fabrizio Ottati, Yuebin Wang, Jongmin Lee, Luciano Lavagno, Adam Fuks)
13. Exploring the Global-to-Local Attention Scheme in Graph Transformers: An Empirical Study (Authors: Zhengwei Wang, Gang Wu)
14. Property-Isometric Variational Autoencoders for Sequence Modeling and Design (Authors: Elham Sadeghi, Xianqi Deng, I-Hsin Lin, Stacy M. Copp, Petko Bogdanov)
15. Attention Beyond Neighborhoods: Reviving Transformer for Graph Clustering (Authors: Xuanting Xie, Bingheng Li, Erlin Pan, Rui Hou, Wenyu Chen, Zhao Kang)
16. LiMuon: Light and Fast Muon Optimizer for Large Models (Authors: Feihu Huang, Yuning Luo, Songcan Chen)
17. Data coarse graining can improve model performance (Authors: Alex Nguyen, David J. Schwab, Vudtiwat Ngampruetikorn)
18. Decentralized Optimization with Topology-Independent Communication (Authors: Ying Lin, Yao Kuang, Ahmet Alacaoglu, Michael P. Friedlander)
19. Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation (Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu)
20. Towards universal property prediction in Cartesian space: TACE is all you need (Authors: Zemin Xu, Wenbo Xie, Daiqian Xie, P. Hu)
21. Learning Graph from Smooth Signals under Partial Observation: A Robustness Analysis (Authors: Hoang-Son Nguyen, Hoi-To Wai)
22. Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning (Authors: Shiwan Zhao, Xuyang Zhao, Jiaming Zhou, Aobo Kong, Qicheng Li, Yong Qin)
23. From Correction to Mastery: Reinforced Distillation of Large Language Model Agents (Authors: Yuanjie Lyu, Chengyu Wang, Jun Huang, Tong Xu)
24. AToken: A Unified Tokenizer for Vision (Authors: Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang)

1. LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures

ArXiv ID: 2509.14252

Authors: Hai Huang, Yann LeCun, Randall Balestriero

Abstract: Large Language Model (LLM) pretraining, finetuning, and evaluation rely on input-space reconstruction and generative capabilities. Yet, it has been observed in vision that embedding-space training objectives, e.g., with Joint Embedding Predictive Architectures (JEPAs), are far superior to their input-space counterparts. That mismatch in how training is achieved between language and vision raises a natural question: can language training methods learn a few tricks from the vision ones? The lack of JEPA-style LLMs is a testimony to the challenge of designing such objectives for language. In this work, we propose a first step in that direction where we develop LLM-JEPA, a JEPA-based solution for LLMs applicable both to finetuning and pretraining. Thus far, LLM-JEPA is able to outperform the standard LLM training objectives by a significant margin across models, all while being robust to overfitting. Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider, RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2 and Olmo families. Code: https://github.com/rbalestr-lab/llm-jepa.

Comment: Author match

2. Pre-training under infinite compute

ArXiv ID: 2509.14786

Authors: Konwoo Kim, Suhas Kotha, Percy Liang, Tatsunori Hashimoto

Abstract: Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that existing data-constrained approaches of increasing epoch count and parameter count eventually overfit, and we significantly improve upon such recipes by properly tuning regularization, finding that the optimal weight decay is $30\times$ larger than standard practice. Since our regularized recipe monotonically decreases loss following a simple power law in parameter count, we estimate its best possible performance via the asymptote of its scaling law rather than the performance at a fixed compute budget. We then identify that ensembling independently trained models achieves a significantly lower loss asymptote than the regularized recipe. Our best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using $5.17\times$ less data than our baseline, and our data scaling laws predict that this improvement persists at higher token budgets. We find that our data efficiency gains can be realized at much smaller parameter counts as we can distill an ensemble into a student model that is 8$\times$ smaller and retains $83\%$ of the ensembling benefit. Finally, our interventions designed for validation loss generalize to downstream benchmarks, achieving a $9\%$ improvement for pre-training evals and a $17.5\times$ data efficiency improvement over continued pre-training on math mid-training data. Our results show that simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.

Comment: Matches Compression/Efficiency and Representation Learning via data-constrained pretraining insights: strong regularization, epoch/parameter scaling laws, ensemble scaling and distillation for data efficiency.

Relevance: 9 Novelty: 8
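
Sketch: the abstract estimates a recipe's best achievable loss from the asymptote of a power law in parameter count. The snippet below illustrates that idea only. The functional form L(N) = E + A * N^(-alpha), the three-point closed-form fit, and the name `fit_power_law_asymptote` are assumptions made for this example, not the authors' fitting procedure.

```python
import math


def fit_power_law_asymptote(n1, ratio, losses):
    """Fit L(N) = E + A * N**(-alpha) through three points.

    The three losses are assumed to be measured at parameter counts
    n1, n1 * ratio, and n1 * ratio**2 (geometric spacing), which makes
    the fit solvable in closed form. Returns (E, A, alpha), where E is
    the loss asymptote as N grows without bound.
    """
    l1, l2, l3 = losses
    d12, d23 = l1 - l2, l2 - l3
    if not (d12 > d23 > 0):
        raise ValueError("losses must decrease with shrinking gaps")
    # Successive gaps shrink by a factor of ratio**alpha.
    alpha = math.log(d12 / d23) / math.log(ratio)
    growth = ratio ** alpha
    # The remaining gap to the asymptote is a geometric tail of d23.
    e = l3 - d23 / (growth - 1.0)
    a = (l1 - e) * n1 ** alpha
    return e, a, alpha


# Synthetic check: losses generated from a known law are recovered.
def synthetic_loss(n):
    return 3.0 + 50.0 * n ** -0.5


asymptote, scale, exponent = fit_power_law_asymptote(
    1e6, 4.0, [synthetic_loss(1e6), synthetic_loss(4e6), synthetic_loss(1.6e7)]
)
```

With real measurements one would normally fit more than three points by least squares; the closed form here only keeps the example short.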

3. Stochastic Clock Attention for Aligning Continuous and Ordered Sequences

ArXiv ID: 2509.14678

Authors: Hyungjoon Soh, Junghyo Jo

Abstract: We formulate an attention mechanism for continuous and ordered sequences that explicitly functions as an alignment model, which serves as the core of many sequence-to-sequence tasks. Standard scaled dot-product attention relies on positional encodings and masks but does not enforce continuity or monotonicity, which are crucial for frame-synchronous targets. We assign learned nonnegative clocks to source and target and model attention as the meeting probability of these clocks; a path-integral derivation yields a closed-form, Gaussian-like scoring rule with an intrinsic bias toward causal, smooth, near-diagonal alignments, without external positional regularizers. The framework supports two complementary regimes: normalized clocks for parallel decoding when a global length is available, and unnormalized clocks for autoregressive decoding -- both nearly parameter-free, drop-in replacements. In a Transformer text-to-speech testbed, this construction produces more stable alignments and improved robustness to global time-scaling while matching or improving accuracy over scaled dot-product baselines. We hypothesize applicability to other continuous targets, including video and temporal signal modeling.

Comment: Model Architecture: introduces a new attention mechanism (clock-based alignment) enforcing continuity/monotonicity as a drop-in replacement.

Relevance: 9 Novelty: 8
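
Sketch: a toy version of the normalized-clock regime described in the abstract. Each sequence gets a monotone clock (a cumulative sum of nonnegative rates, scaled to end at 1), and attention scores are a Gaussian function of the clock difference. The exact closed-form rule from the paper's path-integral derivation is not given in the abstract, so the squared-difference score, the fixed `width` parameter, and the function names are assumptions.

```python
import math


def normalized_clock(rates):
    """Cumulative sum of nonnegative rates, scaled so the clock ends at 1."""
    if any(r < 0 for r in rates):
        raise ValueError("clock rates must be nonnegative")
    total = sum(rates)
    if total <= 0:
        raise ValueError("at least one rate must be positive")
    clock, elapsed = [], 0.0
    for r in rates:
        elapsed += r
        clock.append(elapsed / total)
    return clock


def clock_attention(target_rates, source_rates, width=0.1):
    """Attention weights from a Gaussian score on clock differences.

    Returns one row per target step; each row is a softmax over source
    steps. `width` (an assumed hyperparameter) sets how sharply the
    weights concentrate where the two clocks agree.
    """
    target_clock = normalized_clock(target_rates)
    source_clock = normalized_clock(source_rates)
    weights = []
    for t in target_clock:
        scores = [-((t - s) ** 2) / (2.0 * width ** 2) for s in source_clock]
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        norm = sum(exps)
        weights.append([e / norm for e in exps])
    return weights


# Four target frames attending over eight source tokens.
attn = clock_attention([1.0] * 4, [1.0] * 8)
peaks = [row.index(max(row)) for row in attn]
```

Because both clocks are nondecreasing, the peak source index can only move forward as the target step advances, which is the near-diagonal, monotone bias the abstract refers to.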

4. Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models

ArXiv ID: 2509.15152

Authors: Samet Demir, Zafer Dogan

Abstract: We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head where the first layer is randomly initialized and fixed while the second layer is trained. Furthermore, we consider an asymptotic regime where the context length, input dimension, hidden dimension, number of training tasks, and number of training samples jointly grow. In this setting, we show that the random Transformer behaves equivalently to a finite-degree Hermite polynomial model in terms of ICL error. This equivalence is validated through simulations across varying activation functions, context lengths, hidden layer widths (revealing a double-descent phenomenon), and regularization settings. Our results offer theoretical and empirical insights into when and how MLP layers enhance ICL, and how nonlinearity and over-parameterization influence model performance.

Comment: Representation Learning: theoretical analysis of Transformer in-context learning, proving equivalence to a finite-degree Hermite polynomial model in an asymptotic regime.

Relevance: 9 Novelty: 8
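
Sketch: for readers unfamiliar with the term, a "finite-degree Hermite polynomial model" is a weighted sum of the first few Hermite polynomials of the input. The snippet evaluates such a model for a scalar input. It assumes the probabilists' convention (He_n); the abstract does not say which convention or which coefficients the paper uses, and the function names are hypothetical.

```python
def hermite_features(x, degree):
    """Probabilists' Hermite polynomials He_0(x) .. He_degree(x).

    Uses the recurrence He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x),
    starting from He_0(x) = 1 and He_1(x) = x.
    """
    values = [1.0]
    if degree >= 1:
        values.append(float(x))
    for n in range(1, degree):
        values.append(x * values[n] - n * values[n - 1])
    return values


def hermite_model(x, coeffs):
    """Evaluate sum_k coeffs[k] * He_k(x) for a finite list of coefficients."""
    feats = hermite_features(x, len(coeffs) - 1)
    return sum(c * h for c, h in zip(coeffs, feats))


# He_2(x) = x**2 - 1 and He_3(x) = x**3 - 3x, evaluated at x = 2.
features_at_two = hermite_features(2.0, 3)
prediction = hermite_model(2.0, [1.0, 0.0, 0.5])
```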

5. SparseDoctor: Towards Efficient Chat Doctor with Mixture of Experts Enhanced Large Language Models

ArXiv ID: 2509.14269

Authors: Zhang Jianbin, Yulin Zhu, Wai Lun Lo, Richard Tai-Chiu Hsung, Harris Sik-Ho Tsang, Kai Zhou

Abstract: Large language models (LLMs) have achieved great success in medical question answering and clinical decision-making, promoting the efficiency and popularization of the personalized virtual doctor in society. However, traditional fine-tuning strategies for LLMs require updating billions of parameters, substantially increasing the training cost, including training time and utility cost. To enhance the efficiency and effectiveness of current medical LLMs and explore the boundary of the representation capability of LLMs in the medical domain, we depart from traditional data-centric fine-tuning strategies (i.e., supervised fine-tuning or reinforcement learning from human feedback) and instead craft a novel sparse medical LLM named SparseDoctor, built on a contrastive-learning-enhanced LoRA-MoE (low-rank adaptation, mixture of experts) architecture. The automatic routing mechanism, supervised by contrastive learning, allocates computational resources among the different LoRA experts. Additionally, we introduce a novel expert memory queue mechanism to further boost the efficiency of the overall framework and prevent memory overflow during training. We conduct comprehensive evaluations on three typical medical benchmarks: CMB, CMExam, and CMMLU-Med. Experimental results demonstrate that the proposed LLM consistently outperforms strong baselines such as the HuatuoGPT series.

Comment: Directly matches Model Architecture (MoE with dynamic routing) and Compression/Efficiency (LoRA experts, sparse activation, memory optimization via expert memory queue).

Relevance: 9 Novelty: 7
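
Sketch: a generic LoRA-MoE forward pass, to make the architecture named in the abstract concrete. A frozen base weight is applied to the input, a router picks the top-k experts, and each chosen expert adds a low-rank update B(Ax) weighted by its gate value. This is the textbook pattern, not SparseDoctor's implementation: the contrastive routing loss and the expert memory queue are not reproduced, and all names and shapes are assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lora_moe_forward(x, base_w, experts, router_w, top_k=1):
    """Frozen base layer plus a gated sum of low-rank expert updates.

    experts: list of (A, B) pairs, A of shape r x d_in, B of shape d_out x r.
    router_w: one row of routing weights per expert.
    Returns (output, indices of the experts that were used).
    """
    gate = softmax(matvec(router_w, x))
    chosen = sorted(range(len(experts)), key=lambda i: gate[i], reverse=True)[:top_k]
    gate_mass = sum(gate[i] for i in chosen)
    y = matvec(base_w, x)
    for i in chosen:
        a, b = experts[i]
        delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
        weight = gate[i] / gate_mass     # renormalize over chosen experts
        y = [yi + weight * di for yi, di in zip(y, delta)]
    return y, chosen


# Two rank-1 experts on a 2-dimensional input; the router favors expert 0.
identity = [[1.0, 0.0], [0.0, 1.0]]
expert_list = [([[1.0, 0.0]], [[1.0], [0.0]]), ([[0.0, 1.0]], [[0.0], [1.0]])]
router = [[5.0, 0.0], [0.0, 5.0]]
output, used = lora_moe_forward([1.0, 0.0], identity, expert_list, router)
```

Only the chosen experts are evaluated, which is where the sparse-activation saving comes from.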

6. Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs

ArXiv ID: 2509.14391

Authors: Ye Qiao, Sitao Huang

Abstract: Extending LLM context windows is crucial for long-range tasks. RoPE-based position interpolation (PI) methods like linear and frequency-aware scaling extend input lengths without retraining, while post-training quantization (PTQ) enables practical deployment. We show that combining PI with PTQ degrades accuracy due to coupled effects (long-context aliasing, dynamic range dilation, axis grid anisotropy, and outlier shifting) that induce position-dependent logit noise. We provide the first systematic analysis of PI plus PTQ and introduce two diagnostics: Interpolation Pressure (per-band phase scaling sensitivity) and Tail Inflation Ratios (outlier shift from short to long contexts). To address this, we propose Q-ROAR, a RoPE-aware, weight-only stabilization that groups RoPE dimensions into a few frequency bands and performs a small search over per-band scales for W_Q and W_K, with an optional symmetric variant to preserve logit scale. The diagnostics-guided search uses a tiny long-context dev set and requires no fine-tuning, kernel, or architecture changes. Empirically, Q-ROAR recovers up to 0.7% accuracy on standard tasks and reduces GovReport perplexity by more than 10%, while preserving short-context performance and compatibility with existing inference stacks.

Comment: Compression/Efficiency: quantization-aware RoPE position interpolation with new diagnostics and weight-only stabilization, no fine-tuning required.

Relevance: 9 Novelty: 7
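
Sketch: two ingredients mentioned in the abstract, shown in isolation. First, linear position interpolation divides positions by a scale factor before computing RoPE angles, so a long position reuses the angles of a shorter one. Second, one plausible reading of the "symmetric variant" is that a band of query dimensions is multiplied by s while the matching key dimensions are multiplied by 1/s, which leaves the query-key dot product unchanged. That reading, the helper names, and the band layout are assumptions; the paper's band search and diagnostics are not reproduced.

```python
import math


def rope_angles(position, head_dim, base=10000.0, pi_scale=1.0):
    """RoPE rotation angles for one position, one angle per dimension pair.

    pi_scale > 1 applies linear position interpolation: the position is
    divided by pi_scale before the per-pair frequency is applied.
    """
    return [
        (position / pi_scale) * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def scale_band(vector, band, factor):
    """Return a copy of `vector` with the dimensions in `band` scaled."""
    out = list(vector)
    for i in band:
        out[i] *= factor
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Interpolating by 4 maps position 8192 onto the angles of position 2048.
long_angles = rope_angles(8192, 8, pi_scale=4.0)
short_angles = rope_angles(2048, 8)

# Symmetric band scaling: q gets s, k gets 1/s, the logit is preserved.
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 2.0]
band = [2, 3]  # hypothetical frequency band (dimension indices)
q_scaled = scale_band(q, band, 1.3)
k_scaled = scale_band(k, band, 1.0 / 1.3)
```

Scaling an output dimension of a query or key vector is equivalent to scaling the corresponding row of the projection weight, which is why such a change can be folded into the weights without touching the inference kernel.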

7. Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

ArXiv ID: 2509.15105

Authors: Liran Nochumsohn, Raz Marshanski, Hedi Zisling, Omri Azencot

Abstract: Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, we introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear matches state-of-the-art performance while offering superior efficiency, robustness to various sampling rates, and enhanced interpretability. The implementation of Super-Linear is available at https://github.com/azencot-group/SuperLinear.

Comment: Model Architecture (MoE): introduces a lightweight pretrained mixture-of-experts with spectral gating for expert selection; also aligns with efficiency-focused design.

Relevance: 9 Novelty: 7

8. Towards Pre-trained Graph Condensation via Optimal Transport

ArXiv ID: 2509.14722

Authors: Yeyu Yan, Shuai Zheng, Wenjun Hui, Xiangkai Zhu, Dong Chen, Zhenfeng Zhu, Yao Zhao, Kunlun He

Abstract: Graph condensation (GC) aims to distill the original graph into a small-scale graph, mitigating redundancy and accelerating GNN training. However, conventional GC approaches heavily rely on rigid GNNs and task-specific supervision. Such a dependency severely restricts their reusability and generalization across various tasks and architectures. In this work, we revisit the goal of ideal GC from the perspective of GNN optimization consistency, and then a generalized GC optimization objective is derived, by which those traditional GC methods can be viewed nicely as special cases of this optimization paradigm. Based on this, Pre-trained Graph Condensation (PreGC) via optimal transport is proposed to transcend the limitations of task- and architecture-dependent GC methods. Specifically, a hybrid-interval graph diffusion augmentation is presented to suppress the weak generalization ability of the condensed graph on particular architectures by enhancing the uncertainty of node states. Meanwhile, the matching between optimal graph transport plan and representation transport plan is tactfully established to maintain semantic consistencies across source graph and condensed graph spaces, thereby freeing graph condensation from task dependencies. To further facilitate the adaptation of condensed graphs to various downstream tasks, a traceable semantic harmonizer from source nodes to condensed nodes is proposed to bridge semantic associations through the optimized representation transport plan in pre-training. Extensive experiments verify the superiority and versatility of PreGC, demonstrating its task-independent nature and seamless compatibility with arbitrary GNNs.

Comment: Matches Model Compression and Efficiency via graph condensation framed with optimal transport, yielding task- and architecture-agnostic distilled graphs for accelerated GNN training.

Relevance: 8 Novelty: 8
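
Sketch: the abstract relies on transport plans between source nodes and condensed nodes. As a reminder of what such a plan is, the snippet computes an entropy-regularized plan with Sinkhorn iterations. Sinkhorn is swapped in here as a standard, simple solver; the abstract does not say which solver or cost PreGC uses, and the cost matrix, `eps`, and names below are made up for illustration.

```python
import math


def sinkhorn_plan(cost, source_mass, target_mass, eps=0.5, n_iter=200):
    """Entropy-regularized optimal transport plan via Sinkhorn scaling.

    cost[i][j] is the cost of moving mass from source node i to
    condensed node j. The returned plan has (approximately) row sums
    source_mass and column sums target_mass.
    """
    n, m = len(source_mass), len(target_mass)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [source_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [target_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Three source nodes condensed into two nodes, uniform mass on both sides.
toy_cost = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
plan = sinkhorn_plan(toy_cost, [1.0 / 3.0] * 3, [0.5, 0.5])
```

Each row of the plan says how a source node's mass is split across condensed nodes; cheap pairs receive most of the mass.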

9. Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers

ArXiv ID: 2509.15113

Authors: Andrei Chertkov, Artem Basharin, Mikhail Saygin, Evgeny Frolov, Stanislav Straupe, Ivan Oseledets

Abstract: The growing demand for energy-efficient, high-performance AI systems has drawn increased attention to alternative computing platforms (e.g., photonic, neuromorphic) due to their potential to accelerate learning and inference. However, integrating such physical components into deep learning pipelines remains challenging, as physical devices often offer limited expressiveness, and their non-differentiable nature renders on-device backpropagation difficult or infeasible. This motivates the development of hybrid architectures that combine digital neural networks with reconfigurable physical layers, which effectively behave as black boxes. In this work, we present a framework for the end-to-end training of such hybrid networks. This framework integrates stochastic zeroth-order optimization for updating the physical layer's internal parameters with a dynamic low-rank surrogate model that enables gradient propagation through the physical layer. A key component of our approach is the implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries, thereby avoiding costly full matrix reconstruction. We demonstrate our method across diverse deep learning tasks, including computer vision, audio classification, and language modeling. Notably, across all modalities, the proposed approach achieves near-digital baseline accuracy and consistently enables effective end-to-end training of hybrid models incorporating various non-differentiable physical components (spatial light modulators, microring resonators, and Mach-Zehnder interferometers). This work bridges hardware-aware deep learning and gradient-free optimization, thereby offering a practical pathway for integrating non-differentiable physical components into scalable, end-to-end trainable AI systems.

Comment: Model Efficiency/HPC: dynamic low-rank surrogate modeling with zeroth-order optimization enables end-to-end training through black-box layers.

Relevance: 8 Novelty: 8
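
Sketch: stochastic zeroth-order optimization estimates a gradient using only function evaluations, which is what makes it usable for a black-box layer. The snippet shows a common two-point estimator with Gaussian probe directions. The abstract does not specify which estimator the paper uses, so this particular form, the step size `mu`, and the function name are assumptions; the low-rank surrogate and the projector-splitting update are not reproduced.

```python
import random


def zo_gradient(f, x, mu=1e-3, n_samples=1000, seed=0):
    """Two-point zeroth-order gradient estimate of f at x.

    For each random direction u, the slope of f along u is measured by
    a central difference and then spread back along u. Averaging over
    many directions approximates the true gradient without ever
    differentiating f.
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = f([xi + mu * ui for xi, ui in zip(x, u)])
        minus = f([xi - mu * ui for xi, ui in zip(x, u)])
        slope = (plus - minus) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / n_samples
    return grad


# A stand-in for a black-box layer's loss: only evaluations are used.
def black_box(params):
    return sum(p * p for p in params)


point = [1.0, -2.0, 0.5]
estimate = zo_gradient(black_box, point, n_samples=20000)
true_gradient = [2.0 * p for p in point]
```

Each sample costs two queries to the black box, so the number of samples trades estimate noise against hardware queries.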

10. Precision Neural Networks: Joint Graph And Relational Learning

ArXiv ID: 2509.14821

Authors: Andrea Cavallo, Samuel Rey, Antonio G. Marques, Elvin Isufi

Abstract: CoVariance Neural Networks (VNNs) perform convolutions on the graph determined by the covariance matrix of the data, which enables expressive and stable covariance-based learning. However, covariance matrices are typically dense, fail to encode conditional independence, and are often precomputed in a task-agnostic way, which may hinder performance. To overcome these limitations, we study Precision Neural Networks (PNNs), i.e., VNNs on the precision matrix -- the inverse covariance. The precision matrix naturally encodes statistical independence, often exhibits sparsity, and preserves the covariance spectral structure. To make precision estimation task-aware, we formulate an optimization problem that jointly learns the network parameters and the precision matrix, and solve it via alternating optimization, by sequentially updating the network weights and the precision estimate. We theoretically bound the distance between the estimated and true precision matrices at each iteration, and demonstrate the effectiveness of joint estimation compared to two-step approaches on synthetic and real-world data.

Comment: Model Architecture: introduces Precision Neural Networks performing graph convolutions on the inverse covariance and jointly learns the precision matrix; Representation Learning: task-aware structure learning with theoretical bounds and sparsity via conditional independence.