Previous Day 2025-09-24
Monthly Overview 2025-09
Next Day 2025-09-26

Personalized Daily ArXiv Papers 2025-09-25

[gpt-5] Prompt Completion Total
Token 47985 53034 101019
Cost $0.06 $0.53 $0.59

Total arXiv papers: 596

Total scanned papers: 370

Total relevant papers: 35

Table of contents with paper titles:

  1. A Recovery Guarantee for Sparse Neural Networks Authors: Sara Fridovich-Keil, Mert Pilanci

  2. Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference Authors: Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui

  3. Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment Authors: Deokjae Lee, Hyun Oh Song

  4. Linear Transformers Implicitly Discover Unified Numerical Algorithms Authors: Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama

  5. TensLoRA: Tensor Alternatives for Low-Rank Adaptation Authors: Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, Fran\c{c}ois Leduc-Primeau

  6. Mamba Modulation: On the Length Generalization of Mamba Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui

  7. Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels Authors: Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin

  8. Learning Dynamics of Deep Learning -- Force Analysis of Deep Neural Networks Authors: Yi Ren

  9. Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models Authors: Junjie Yao, Zhi-Qin John Xu

  10. On the Rate of Convergence of Kolmogorov-Arnold Network Regression Estimators Authors: Wei Liu, Eleni Chatzi, Zhilu Lai

  11. Faster Than SVD, Smarter Than SGD: The OPLoRA Alternating Update Authors: Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horv\'ath, Martin Tak\'a\v{c}

  12. Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding Authors: Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li

  13. SIM-CoT: Supervised Implicit Chain-of-Thought Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin

  14. Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention Authors: Enhao Huang, Zhiyu Zhang, Tianxiang Xu, Chunshu Xia, Kaichun Hu, Yuchen Yang, Tongtong Pan, Dong Dong, Zhan Qin

  15. Sobolev acceleration for neural networks Authors: Jong Kwon Oh, Hanbaek Lyu, Hwijae Son

  16. CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato

  17. How deep is your network? Deep vs. shallow learning of transfer operators Authors: Mohammad Tabish, Benedict Leimkuhler, Stefan Klus

  18. Staying on the Manifold: Geometry-Aware Noise Injection Authors: Albert Kj{\o}ller Jacobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis

  19. How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models Authors: Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu, Yongwei Wang, Wenbo Su, Bo Zheng

  20. Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing Authors: Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang

  21. Projective Kolmogorov Arnold Neural Networks (P-KANs): Entropy-Driven Functional Space Discovery for Interpretable Machine Learning Authors: Alastair Poole, Stig McArthur, Saravan Kumar

  22. RoboSSM: Scalable In-context Imitation Learning via State-Space Models Authors: Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Mart\'in-Mart\'in, Peter Stone

  23. Latent Iterative Refinement Flow: A Geometric-Constrained Approach for Few-Shot Generation Authors: Songtao Li, Zhenyu Liao, Tianqi Hou, Ting Gao

  24. Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization Authors: Tianyu Ruan, Kuo Gai, Shihua Zhang

  25. Quantifying Compositionality of Classic and State-of-the-Art Embeddings Authors: Zhijin Guo (University of Oxford, University of Bristol), Chenhao Xue (University of Oxford), Zhaozhen Xu (University of Bristol), Hongbo Bo (University of Bristol), Yuxuan Ye (University of Bristol), Janet B. Pierrehumbert (University of Oxford), Martha Lewis (University of Amsterdam)

  26. Interpreting ResNet-based CLIP via Neuron-Attention Decomposition Authors: Edmund Bu, Yossi Gandelsman

  27. Modular Machine Learning with Applications to Genetic Circuit Composition Authors: Jichi Wang, Eduardo D. Sontag, Domitilla Del Vecchio

  28. Quantum Harmonic Analysis and the Structure in Data: Augmentation Authors: Monika Doerfler, Franz Luef, Henry McNulty

  29. Graph Variate Neural Networks Authors: Om Roy, Yashar Moshfeghi, Keith Smith

  30. A Unified Noise-Curvature View of Loss of Trainability Authors: Gunbir Singh Baveja, Mark Schmidt

  31. LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation Authors: Huizhen Shu, Xuying Li, Zhuo Li

  32. The Syntax and Semantics of einsum Authors: Maurice Wenig, Paul G. Rump, Mark Blacher, Joachim Giesen

  33. You Only Measure Once: On Designing Single-Shot Quantum Machine Learning Models Authors: Chen-Yu Liu, Leonardo Placidi, Kuan-Cheng Chen, Samuel Yen-Chi Chen, Gabriel Matos

  34. Time-adaptive H\'enonNets for separable Hamiltonian systems Authors: Konrad Janik, Peter Benner

  35. Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation Authors: Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li


1. A Recovery Guarantee for Sparse Neural Networks

ArXiv ID: 2509.20323

Authors: Sara Fridovich-Keil, Mert Pilanci

Abstract: We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.

Comment: Model Compression and Efficiency—sparsity: theoretical sparse recovery guarantees for ReLU networks via iterative hard thresholding with linear-memory footprint.

Relevance: 10 Novelty: 9


2. Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference

ArXiv ID: 2509.19781

Authors: Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui

Abstract: Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for \textit{online inference} remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during the online inference, task information is often unavailable, making the task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, \texttt{Tanbr}, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, \texttt{Tanbr} estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, \texttt{Tanbr} employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and decides optimal expert merging. We prove that \texttt{Tanbr} achieves a sublinear regret bound of {\small $\mathcal{O}(\sqrt{T} \log(T))$} over {\small $T$} rounds, despite operating over a continuous decision space, matching regret bounds compared to existing methods. Extensive experiments show that \texttt{Tanbr} reduces inference latency by at least {\small $45\%$} and memory usage by up to {\small $25\%$}, while maintaining a high accuracy compared to many state-of-the-art methods.

Comment: Direct MoE architecture/efficiency: task-aware expert merging with adaptive neural bandit router for online inference.

Relevance: 10 Novelty: 8


3. Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment

ArXiv ID: 2509.20214

Authors: Deokjae Lee, Hyun Oh Song

Abstract: We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.

Comment: Compression/Efficiency: weight-only PTQ for LLMs with fractional-bit quantizers and optimal bit allocation; practical CUDA kernels and mixed-scheme layer fusion.

Relevance: 10 Novelty: 8


4. Linear Transformers Implicitly Discover Unified Numerical Algorithms

ArXiv ID: 2509.19702

Authors: Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama

Abstract: We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice of Nystr\"om extrapolation. The model sees only input-output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across three distinct computational regimes (full visibility, rank-limited updates, and distributed computation). We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with rank-limited attention. Thus, a transformer trained solely to patch missing blocks implicitly discovers a unified, resource-adaptive iterative solver spanning prediction, estimation, and Nystr\"om extrapolation, highlighting a powerful capability of in-context learning.

Comment: Model Architecture: linear-attention Transformer analysis/unrolling revealing a unified iterative solver with theoretical convergence.

Relevance: 9 Novelty: 9


5. TensLoRA: Tensor Alternatives for Low-Rank Adaptation

ArXiv ID: 2509.19391

Authors: Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, Fran\c{c}ois Leduc-Primeau

Abstract: Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are considered independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes better than standard LoRA under similar parameter counts.

Comment: Model Compression/Efficiency: generalizes LoRA to tensorized low-rank adaptations with mode-specific compression across attention projections.

Relevance: 9 Novelty: 8


6. Mamba Modulation: On the Length Generalization of Mamba

ArXiv ID: 2509.19633

Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui

Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.

Comment: Model Architecture—state-space models (Mamba): analysis of transition matrix spectra and spectral modulation to improve long-context generalization.

Relevance: 9 Novelty: 8


7. Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

ArXiv ID: 2509.20294

Authors: Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin

Abstract: We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

Comment: Representation Learning/Theory: introduces effective span dimension with alignment-sensitive minimax rates and shows gradient flow reduces ESD, linking adaptive feature learning to generalization.

Relevance: 9 Novelty: 8


8. Learning Dynamics of Deep Learning -- Force Analysis of Deep Neural Networks

ArXiv ID: 2509.19554

Authors: Yi Ren

Abstract: This thesis explores how deep learning models learn over time, using ideas inspired by force analysis. Specifically, we zoom in on the model's training procedure to see how one training example affects another during learning, like analyzing how forces move objects. We break this influence into two parts: how similar the two examples are, and how strong the updating force is. This framework helps us understand a wide range of the model's behaviors in different real systems. For example, it explains why certain examples have non-trivial learning paths, why (and why not) some LLM finetuning methods work, and why simpler, more structured patterns tend to be learned more easily. We apply this approach to various learning tasks and uncover new strategies for improving model training. While the method is still developing, it offers a new way to interpret models' behaviors systematically.

Comment: Representation Learning/Training Dynamics: proposes a force-based framework analyzing inter-example influences during training, offering insights into how networks learn.

Relevance: 9 Novelty: 7


9. Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models

ArXiv ID: 2509.20124

Authors: Junjie Yao, Zhi-Qin John Xu

Abstract: The embedding space of language models is widely believed to capture the semantic relationships; for instance, embeddings of digits often exhibit an ordered structure that corresponds to their natural sequence. However, the mechanisms driving the formation of such structures remain poorly understood. In this work, we interpret the embedding structures via the data distribution. We propose a set of probability signatures that reflect the semantic relationships among tokens. Through experiments on the composite addition tasks using the linear model and feedforward network, combined with theoretical analysis of gradient flow dynamics, we reveal that these probability signatures significantly influence the embedding structures. We further generalize our analysis to large language models (LLMs) by training the Qwen2.5 architecture on the subsets of the Pile corpus. Our results show that the probability signatures are faithfully aligned with the embedding structures, particularly in capturing strong pairwise similarities among embeddings. Our work uncovers the mechanism of how data distribution guides the formation of embedding structures, establishing a novel understanding of the relationship between embedding organization and semantic patterns.

Comment: Representation Learning: provides mechanistic insight into how data distributions (probability signatures) shape embedding geometry via gradient-flow analysis.

Relevance: 9 Novelty: 7


10. On the Rate of Convergence of Kolmogorov-Arnold Network Regression Estimators

ArXiv ID: 2509.19830

Authors: Wei Liu, Eleni Chatzi, Zhilu Lai

Abstract: Kolmogorov-Arnold Networks (KANs) offer a structured and interpretable framework for multivariate function approximation by composing univariate transformations through additive or multiplicative aggregation. This paper establishes theoretical convergence guarantees for KANs when the univariate components are represented by B-splines. We prove that both additive and hybrid additive-multiplicative KANs attain the minimax-optimal convergence rate $O(n^{-2r/(2r+1)})$ for functions in Sobolev spaces of smoothness $r$. We further derive guidelines for selecting the optimal number of knots in the B-splines. The theory is supported by simulation studies that confirm the predicted convergence rates. These results provide a theoretical foundation for using KANs in nonparametric regression and highlight their potential as a structured alternative to existing methods.

Comment: Model Architecture: theoretical convergence guarantees and minimax rates for Kolmogorov-Arnold Networks, informing structured function approximation.

Relevance: 9 Novelty: 7


11. Faster Than SVD, Smarter Than SGD: The OPLoRA Alternating Update

ArXiv ID: 2509.19977

Authors: Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horv\'ath, Martin Tak\'a\v{c}

Abstract: Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. However, there is still a gap between full training with low-rank projections (SVDLoRA) and LoRA fine-tuning, indicating that LoRA steps can be further improved. In this study, we propose OPLoRA, a memory-efficient optimizer that closes this gap by casting LoRA optimization as an interpretable sub-problem and solving it efficiently with alternating least squares updates, where 1-2 alternating steps are empirically found to be sufficient to closely match truncated SVD without ever forming the full matrix. We also retrieve the recently proposed preconditioning methods for LoRA as a special case. OPLoRA supports momentum by maintaining a low-rank estimate using the same subroutine (LoRSum) for computing the step, with a memory budget of 3 times the number of LoRA parameters (i.e., same as Adam). We also propose an experimental scaled variant that uses the K-FAC metric, which could be of interest. Across a linear task, MNIST, CIFAR-100, and RoBERTa-base (MNLI), OPLoRA consistently approaches SVDLoRA's performance using significantly less memory.

Comment: Compression/Efficiency: alternating least-squares optimizer for LoRA approximates SVDLoRA with low memory, improving low-rank adaptation updates.

Relevance: 9 Novelty: 7


12. Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

ArXiv ID: 2509.19368

Authors: Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li

Abstract: Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. However, in practice, many approaches struggle to achieve the expected acceleration in such draft-then-verify paradigm even with a well-aligned early-exit head and selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM. Otherwise, the draft cost may overcome the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. We configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. We interleave drafting and verification per token. While the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. Such a verify-while-draft scheme keeps all units busy and validates tokens on-the-fly analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of 2.01x~3.81x, which gains almost the optimal acceleration at the fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.

Comment: High Performance Computing/Efficiency: pipeline-parallel early-exit self-speculative decoding with verify-while-draft scheduling for faster LLM inference.

Relevance: 9 Novelty: 7


13. SIM-CoT: Supervised Implicit Chain-of-Thought

ArXiv ID: 2509.20317

Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin

Abstract: Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3\times greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.

Comment: Model Architecture/Representation Learning: step-level supervision via an auxiliary decoder to stabilize and diversify latent states in implicit CoT, improving training dynamics with no inference overhead.

Relevance: 8 Novelty: 8


14. Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention

ArXiv ID: 2509.19331

Authors: Enhao Huang, Zhiyu Zhang, Tianxiang Xu, Chunshu Xia, Kaichun Hu, Yuchen Yang, Tongtong Pan, Dong Dong, Zhan Qin

Abstract: Complex-valued signals encode both amplitude and phase, yet most deep models treat attention as real-valued correlation, overlooking interference effects. We introduce the Holographic Transformer, a physics-inspired architecture that incorporates wave interference principles into self-attention. Holographic attention modulates interactions by relative phase and coherently superimposes values, ensuring consistency between amplitude and phase. A dual-headed decoder simultaneously reconstructs the input and predicts task outputs, preventing phase collapse when losses prioritize magnitude over phase. We demonstrate that holographic attention implements a discrete interference operator and maintains phase consistency under linear mixing. Experiments on PolSAR image classification and wireless channel prediction show strong performance, achieving high classification accuracy and F1 scores, low regression error, and increased robustness to phase perturbations. These results highlight that enforcing physical consistency in attention leads to generalizable improvements in complex-valued learning and provides a unified, physics-based framework for coherent signal modeling. The code is available at https://github.com/EonHao/Holographic-Transformers.

Comment: Model Architecture: introduces a physics-inspired complex-valued self-attention (holographic attention) within Transformers that explicitly models phase interference.

Relevance: 8 Novelty: 8


15. Sobolev acceleration for neural networks

ArXiv ID: 2509.19773

Authors: Jong Kwon Oh, Hanbaek Lyu, Hwijae Son

Abstract: Sobolev training, which integrates target derivatives into the loss functions, has been shown to accelerate convergence and improve generalization compared to conventional $L^2$ training. However, the underlying mechanisms of this training method remain only partially understood. In this work, we present the first rigorous theoretical framework proving that Sobolev training accelerates the convergence of Rectified Linear Unit (ReLU) networks. Under a student-teacher framework with Gaussian inputs and shallow architectures, we derive exact formulas for population gradients and Hessians, and quantify the improvements in conditioning of the loss landscape and gradient-flow convergence rates. Extensive numerical experiments validate our theoretical findings and show that the benefits of Sobolev training extend to modern deep learning tasks.

Comment: Training dynamics theory: rigorous analysis showing Sobolev training improves conditioning and accelerates convergence of ReLU networks.

Relevance: 8 Novelty: 8


16. CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks

ArXiv ID: 2509.19855

Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato

Abstract: The increasing demand for intelligent mobile applications has made multi-agent collaboration with Transformer-based large language models (LLMs) essential in mobile edge computing (MEC) networks. However, training LLMs in such environments remains challenging due to heavy computation, high end-to-end latency, and limited model generalization. We introduce CollaPipe, a hybrid distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving intelligent networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. Then we perform global model update via federated aggregation. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power. We derive and use a closed-form convergence bound to design an Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based on Lyapunov optimization, ensuring system stability under long-term constraints. Extensive experiments on downstream tasks with Transformer and BERT models show that CollaPipe improves computation efficiency by up to 15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device memory usage by more than half, enabling online learning in heterogeneous and dynamic communication environments.

Comment: High Performance Computing: adaptive pipeline parallelism with resource scheduling for distributed LLM training on heterogeneous edge devices, with convergence analysis and memory benefits.

Relevance: 8 Novelty: 7


17. How deep is your network? Deep vs. shallow learning of transfer operators

ArXiv ID: 2509.19930

Authors: Mohammad Tabish, Benedict Leimkuhler, Stefan Klus

Abstract: We propose a randomized neural network approach called RaNNDy for learning transfer operators and their spectral decompositions from data. The weights of the hidden layers of the neural network are randomly selected and only the output layer is trained. The main advantage is that without a noticeable reduction in accuracy, this approach significantly reduces the training time and resources while avoiding common problems associated with deep learning such as sensitivity to hyperparameters and slow convergence. Additionally, the proposed framework allows us to compute a closed-form solution for the output layer which directly represents the eigenfunctions of the operator. Moreover, it is possible to estimate uncertainties associated with the computed spectral properties via ensemble learning. We present results for different dynamical operators, including Koopman and Perron-Frobenius operators, which have important applications in analyzing the behavior of complex dynamical systems, and the Schr\"odinger operator. The numerical examples, which highlight the strengths but also weaknesses of the proposed framework, include several stochastic dynamical systems, protein folding processes, and the quantum harmonic oscillator.

Comment: Compression/Efficiency/Architecture: randomized network with fixed hidden layers and closed-form output for learning operators, reducing training cost and providing interpretable eigenfunctions.

Relevance: 8 Novelty: 7


18. Staying on the Manifold: Geometry-Aware Noise Injection

ArXiv ID: 2509.20201

Authors: Albert Kj{\o}ller Jacobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis

Abstract: It has been shown that perturbing the input during training implicitly regularises the gradient of the learnt function, leading to smoother models and enhancing generalisation. However, previous research mostly considered the addition of ambient noise in the input space, without considering the underlying structure of the data. In this work, we propose several methods of adding geometry-aware input noise that accounts for the lower dimensional manifold the input space inhabits. We start by projecting ambient Gaussian noise onto the tangent space of the manifold. In a second step, the noise sample is mapped on the manifold via the associated geodesic curve. We also consider Brownian motion noise, which moves in random steps along the manifold. We show that geometry-aware noise leads to improved generalization and robustness to hyperparameter selection on highly curved manifolds, while performing at least as well as training without noise on simpler manifolds. Our proposed framework extends to learned data manifolds.

Comment: Representation Learning: geometry-aware noise injection on learned manifolds to regularize gradients and improve generalization.

Relevance: 8 Novelty: 7


19. How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models

ArXiv ID: 2509.19371

Authors: Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu, Yongwei Wang, Wenbo Su, Bo Zheng

Abstract: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucination. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations, i.e. 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model's size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pertaining token budgets validate both the effectiveness and generalizability of our scaling law.

Comment: Representation Learning/Training dynamics: proposes a scaling law for knowledge infusion to balance specialization and forgetting during pretraining.

Relevance: 8 Novelty: 7


20. Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing

ArXiv ID: 2509.20336

Authors: Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang

Abstract: Transformer-based LLMs demonstrate strong performance on graph reasoning tasks, yet their internal mechanisms remain underexplored. To uncover these reasoning process mechanisms in a fundamental and unified view, we set the basic decoder-only transformers and explain them using the circuit-tracer framework. Through this lens, we visualize reasoning traces and identify two core mechanisms in graph reasoning: token merging and structural memorization, which underlie both path reasoning and substructure extraction tasks. We further quantify these behaviors and analyze how they are influenced by graph density and model size. Our study provides a unified interpretability framework for understanding structural reasoning in decoder-only Transformers.

Comment: Representation Learning: circuit tracing reveals token-merging and structural memorization mechanisms in decoder-only Transformers for graph reasoning.

Relevance: 8 Novelty: 7


21. Projective Kolmogorov Arnold Neural Networks (P-KANs): Entropy-Driven Functional Space Discovery for Interpretable Machine Learning

ArXiv ID: 2509.20049

Authors: Alastair Poole, Stig McArthur, Saravan Kumar

Abstract: Kolmogorov-Arnold Networks (KANs) relocate learnable nonlinearities from nodes to edges, demonstrating remarkable capabilities in scientific machine learning and interpretable modeling. However, current KAN implementations suffer from fundamental inefficiencies due to redundancy in high-dimensional spline parameter spaces, where numerous distinct parameterisations yield functionally equivalent behaviors. This redundancy manifests as a "nuisance space" in the model's Jacobian, leading to susceptibility to overfitting and poor generalization. We introduce Projective Kolmogorov-Arnold Networks (P-KANs), a novel training framework that guides edge function discovery towards interpretable functional representations through entropy-minimisation techniques from signal analysis and sparse dictionary learning. Rather than constraining functions to predetermined spaces, our approach maintains spline space flexibility while introducing "gravitational" terms that encourage convergence towards optimal functional representations. Our key insight recognizes that optimal representations can be identified through entropy analysis of projection coefficients, compressing edge functions to lower-parameter projective spaces (Fourier, Chebyshev, Bessel). P-KANs demonstrate superior performance across multiple domains, achieving up to 80% parameter reduction while maintaining representational capacity, significantly improved robustness to noise compared to standard KANs, and successful application to industrial automated fiber placement prediction. Our approach enables automatic discovery of mixed functional representations where different edges converge to different optimal spaces, providing both compression benefits and enhanced interpretability for scientific machine learning applications.

Comment: Model Architecture and Representation Learning: entropy-driven P-KAN training discovers lower-parameter functional bases (e.g., Fourier/Chebyshev) for compression and interpretability.

Relevance: 8 Novelty: 7


22. RoboSSM: Scalable In-context Imitation Learning via State-Space Models

ArXiv ID: 2509.19658

Authors: Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Mart\'in-Mart\'in, Peter Stone

Abstract: In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.

Comment: Model Architecture/Efficiency: replaces Transformers with state-space models (SSM) for linear-time inference and long-context extrapolation.

Relevance: 8 Novelty: 7


23. Latent Iterative Refinement Flow: A Geometric-Constrained Approach for Few-Shot Generation

ArXiv ID: 2509.19903

Authors: Songtao Li, Zhenyu Liao, Tianqi Hou, Ting Gao

Abstract: Few-shot generation, the synthesis of high-quality and diverse samples from limited training data, remains a significant challenge in generative modeling. Existing methods trained from scratch often fail to overcome overfitting and mode collapse, and fine-tuning large models can inherit biases while neglecting the crucial geometric structure of the latent space. To address these limitations, we introduce Latent Iterative Refinement Flow (LIRF), a novel approach that reframes few-shot generation as the progressive densification of geometrically structured manifold. LIRF establishes a stable latent space using an autoencoder trained with our novel \textbf{manifold-preservation loss} $L_{\text{manifold}}$. This loss ensures that the latent space maintains the geometric and semantic correspondence of the input data. Building on this, we propose an iterative generate-correct-augment cycle. Within this cycle, candidate samples are refined by a geometric \textbf{correction operator}, a provably contractive mapping that pulls samples toward the data manifold while preserving diversity. We also provide the \textbf{Convergence Theorem} demonstrating a predictable decrease in Hausdorff distance between generated and true data manifold. We also demonstrate the framework's scalability by generating coherent, high-resolution images on AFHQ-Cat. Ablation studies confirm that both the manifold-preserving latent space and the contractive correction mechanism are critical components of this success. Ultimately, LIRF provides a solution for data-scarce generative modeling that is not only theoretically grounded but also highly effective in practice.

Comment: Representation Learning/Autoencoders: manifold-preserving autoencoder and contractive correction operator with convergence guarantees for few-shot generation.

Relevance: 8 Novelty: 7


24. Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization

ArXiv ID: 2509.20334

Authors: Tianyu Ruan, Kuo Gai, Shihua Zhang

Abstract: Why do deep networks generalize well? In contrast to classical generalization theory, we approach this fundamental question by examining not only inputs and outputs, but the evolution of internal features. Our study suggests a phenomenon of temporal consistency that predictions remain stable when shallow features from earlier checkpoints combine with deeper features from later ones. This stability is not a trivial convergence artifact. It acts as a form of implicit, structured augmentation that supports generalization. We show that temporal consistency extends to unseen and corrupted data, but collapses when semantic structure is destroyed (e.g., random labels). Statistical tests further reveal that SGD injects anisotropic noise aligned with a few principal directions, reinforcing its role as a source of structured variability. Together, these findings suggest a conceptual perspective that links feature dynamics to generalization, pointing toward future work on practical surrogates for measuring temporal feature evolution.

Comment: Representation Learning—training dynamics: temporal feature consistency across depth and anisotropic SGD noise linked to generalization.

Relevance: 8 Novelty: 7


25. Quantifying Compositionality of Classic and State-of-the-Art Embeddings

ArXiv ID: 2509.19332

Authors: Zhijin Guo (University of Oxford, University of Bristol), Chenhao Xue (University of Oxford), Zhaozhen Xu (University of Bristol), Hongbo Bo (University of Bristol), Yuxuan Ye (University of Bristol), Janet B. Pierrehumbert (University of Oxford), Martha Lewis (University of Amsterdam)

Abstract: For language models to generalize correctly to novel expressions, it is critical that they exploit access compositional meanings when this is justified. Even if we don't know what a "pelp" is, we can use our knowledge of numbers to understand that "ten pelps" makes more pelps than "two pelps". Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. The SOTA generative, transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify the additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. Sentences, knowledge graphs, and word embeddings are evaluated and tracked the compositionality across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at https://github.com/Zhijin-Guo1/quantifying-compositionality.

Comment: Representation Learning—quantifying additive compositionality in embeddings via CCA and reconstruction with layer-wise/training-stage analysis.

Relevance: 8 Novelty: 7


26. Interpreting ResNet-based CLIP via Neuron-Attention Decomposition

ArXiv ID: 2509.19943

Authors: Edmund Bu, Yossi Gandelsman

Abstract: We present a novel technique for interpreting the neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the following attention heads of CLIP's attention-pooling layer. We find that these neuron-head pairs can be approximated by a single direction in CLIP-ResNet's image-text embedding space. Leveraging this insight, we interpret each neuron-head pair by associating it with text. Additionally, we find that only a sparse set of the neuron-head pairs have a significant contribution to the output value, and that some neuron-head pairs, while polysemantic, represent sub-concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training-free semantic segmentation, outperforming previous methods for CLIP-ResNet. Second, we utilize the contributions of neuron-head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.

Comment: Representation Learning: decomposes CLIP-ResNet computation into neuron–attention-head paths, finds sparse influential pairs and aligns them with directions/text, yielding interpretable representation units.

Relevance: 8 Novelty: 7


27. Modular Machine Learning with Applications to Genetic Circuit Composition

ArXiv ID: 2509.19601

Authors: Jichi Wang, Eduardo D. Sontag, Domitilla Del Vecchio

Abstract: In several applications, including in synthetic biology, one often has input/output data on a system composed of many modules, and although the modules' input/output functions and signals may be unknown, knowledge of the composition architecture can significantly reduce the amount of training data required to learn the system's input/output mapping. Learning the modules' input/output functions is also necessary for designing new systems from different composition architectures. Here, we propose a modular learning framework, which incorporates prior knowledge of the system's compositional structure to (a) identify the composing modules' input/output functions from the system's input/output data and (b) achieve this by using a reduced amount of data compared to what would be required without knowledge of the compositional structure. To achieve this, we introduce the notion of modular identifiability, which allows recovery of modules' input/output functions from a subset of the system's input/output data, and provide theoretical guarantees on a class of systems motivated by genetic circuits. We demonstrate the theory on computational studies showing that a neural network (NNET) that accounts for the compositional structure can learn the composing modules' input/output functions and predict the system's output on inputs outside of the training set distribution. By contrast, a neural network that is agnostic of the structure is unable to predict on inputs that fall outside of the training set distribution. By reducing the need for experimental data and allowing module identification, this framework offers the potential to ease the design of synthetic biological circuits and of multi-module systems more generally.

Comment: Model Architecture/Representation: modular learning with theoretical identifiability of component functions leveraging compositional structure.

Relevance: 7 Novelty: 7


28. Quantum Harmonic Analysis and the Structure in Data: Augmentation

ArXiv ID: 2509.19474

Authors: Monika Doerfler, Franz Luef, Henry McNulty

Abstract: In this short note, we study the impact of data augmentation on the smoothness of principal components of high-dimensional datasets. Using tools from quantum harmonic analysis, we show that eigenfunctions of operators corresponding to augmented data sets lie in the modulation space $M^1(\mathbb{R}^d)$, guaranteeing smoothness and continuity. Numerical examples on synthetic and audio data confirm the theoretical findings. While interesting in itself, the results suggest that manifold learning and feature extraction algorithms can benefit from systematic and informed augmentation principles.

Comment: Representation learning theory: analyzes how augmentation enforces smoothness of principal components via harmonic analysis.

Relevance: 7 Novelty: 7


29. Graph Variate Neural Networks

ArXiv ID: 2509.20311

Authors: Om Roy, Yashar Moshfeghi, Keith Smith

Abstract: Modelling dynamically evolving spatio-temporal signals is a prominent challenge in the Graph Neural Network (GNN) literature. Notably, GNNs assume an existing underlying graph structure. While this underlying structure may not always exist or is derived independently from the signal, a temporally evolving functional network can always be constructed from multi-channel data. Graph Variate Signal Analysis (GVSA) defines a unified framework consisting of a network tensor of instantaneous connectivity profiles against a stable support usually constructed from the signal itself. Building on GVSA and tools from graph signal processing, we introduce Graph-Variate Neural Networks (GVNNs): layers that convolve spatio-temporal signals with a signal-dependent connectivity tensor combining a stable long-term support with instantaneous, data-driven interactions. This design captures dynamic statistical interdependencies at each time step without ad hoc sliding windows and admits an efficient implementation with linear complexity in sequence length. Across forecasting benchmarks, GVNNs consistently outperform strong graph-based baselines and are competitive with widely used sequence models such as LSTMs and Transformers. On EEG motor-imagery classification, GVNNs achieve strong accuracy highlighting their potential for brain-computer interface applications.

Comment: Model architecture: GVNN layers with signal-dependent connectivity tensor; efficient spatio-temporal convolution with linear sequence complexity.

Relevance: 7 Novelty: 7


30. A Unified Noise-Curvature View of Loss of Trainability

ArXiv ID: 2509.19698

Authors: Gunbir Singh Baveja, Mark Schmidt

Abstract: Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve, so accuracy stalls or degrades despite adequate capacity and supervision. We analyze LoT incurred with Adam through an optimization lens and find that single indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy are not reliable predictors. Instead we introduce two complementary criteria: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound that combine into a per-layer predictive threshold that anticipates trainability behavior. Using this threshold, we build a simple per-layer scheduler that keeps each layers effective step below a safe limit, stabilizing training and improving accuracy across concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay, with learned learning-rate trajectories that mirror canonical decay.

Comment: Representation Learning/Training Dynamics: noise–curvature-based per-layer thresholds and scheduler to mitigate loss of trainability with Adam.

Relevance: 7 Novelty: 7


31. LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

ArXiv ID: 2509.19839

Authors: Huizhen Shu, Xuying Li, Zhuo Li

Abstract: Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LATENTGUARD, a novel three-stage framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach begins by fine-tuning an LLM on rationalized datasets containing both reasoning-enhanced refusal responses to adversarial prompts and reasoning-enhanced normal responses to benign queries, establishing robust behavioral priors across both safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This supervision enables the VAE to learn disentangled latent representations that capture distinct adversarial characteristics while maintaining semantic interpretability. Through targeted manipulation of learned latent dimensions, LATENTGUARD achieves selective refusal behavior, effectively blocking harmful requests while preserving helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate significant improvements in both safety controllability and response interpretability without compromising utility. Cross-architecture validation on Mistral-7B confirms the generalizability of our latent steering approach, showing consistent effectiveness across different model families. Our results suggest that structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.

Comment: Representation Learning: supervised VAE on internal MLP activations to learn disentangled, controllable latent factors for safety steering.

Relevance: 7 Novelty: 7


32. The Syntax and Semantics of einsum

ArXiv ID: 2509.20020

Authors: Maurice Wenig, Paul G. Rump, Mark Blacher, Joachim Giesen

Abstract: In 2011, einsum was introduced to NumPy as a practical and convenient notation for tensor expressions in machine learning, quantum circuit simulation, and other fields. It has since been implemented in additional Python frameworks such as PyTorch and TensorFlow, as well as in other programming languages such as Julia. Despite its practical success, the einsum notation still lacks a solid theoretical basis, and is not unified across the different frameworks, limiting opportunities for formal reasoning and systematic optimization. In this work, we discuss the terminology of tensor expressions and provide a formal definition of the einsum language. Based on this definition, we formalize and prove important equivalence rules for tensor expressions and highlight their relevance in practical applications.

Comment: Systems/HPC: formal syntax/semantics and equivalence rules for einsum enabling formal reasoning and optimization of tensor computations.

Relevance: 7 Novelty: 7


33. You Only Measure Once: On Designing Single-Shot Quantum Machine Learning Models

ArXiv ID: 2509.20090

Authors: Chen-Yu Liu, Leonardo Placidi, Kuan-Cheng Chen, Samuel Yen-Chi Chen, Gabriel Matos

Abstract: Quantum machine learning (QML) models conventionally rely on repeated measurements (shots) of observables to obtain reliable predictions. This dependence on large shot budgets leads to high inference cost and time overhead, which is particularly problematic as quantum hardware access is typically priced proportionally to the number of shots. In this work we propose You Only Measure Once (Yomo), a simple yet effective design that achieves accurate inference with dramatically fewer measurements, down to the single-shot regime. Yomo replaces Pauli expectation-value outputs with a probability aggregation mechanism and introduces loss functions that encourage sharp predictions. Our theoretical analysis shows that Yomo avoids the shot-scaling limitations inherent to expectation-based models, and our experiments on MNIST and CIFAR-10 confirm that Yomo consistently outperforms baselines across different shot budgets and under simulations with depolarizing channels. By enabling accurate single-shot inference, Yomo substantially reduces the financial and computational costs of deploying QML, thereby lowering the barrier to practical adoption of QML.

Comment: Model Architecture/Efficiency: probability-aggregation outputs for QML enable accurate single-shot inference, reducing measurement cost.

Relevance: 7 Novelty: 7


34. Time-adaptive H\'enonNets for separable Hamiltonian systems

ArXiv ID: 2509.20212

Authors: Konrad Janik, Peter Benner

Abstract: Measurement data is often sampled irregularly, i.e., not on equidistant time grids. This is also true for Hamiltonian systems. However, existing machine learning methods, which learn symplectic integrators, such as SympNets [1] and H\'enonNets [2] still require training data generated by fixed step sizes. To learn time-adaptive symplectic integrators, an extension to SympNets called TSympNets is introduced in [3]. The aim of this work is to do a similar extension for H\'enonNets. We propose a novel neural network architecture called T-H\'enonNets, which is symplectic by design and can handle adaptive time steps. We also extend the T-H\'enonNet architecture to non-autonomous Hamiltonian systems. Additionally, we provide universal approximation theorems for both new architectures for separable Hamiltonian systems and discuss why it is difficult to handle non-separable Hamiltonian systems with the proposed methods. To investigate these theoretical approximation capabilities, we perform different numerical experiments.

Comment: Model Architecture—time-adaptive symplectic HénonNets with universal approximation results for separable Hamiltonian systems.

Relevance: 7 Novelty: 7


35. Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation

ArXiv ID: 2509.19592

Authors: Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li

Abstract: Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.

Comment: Model Architecture/Efficiency: frame-stacked local transformers for multi-codebook decoding capture intra-timestep dependencies and accelerate generation—architectural and decoding-efficiency innovation.

Relevance: 7 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.