Personalized Daily ArXiv Papers 2025-09-25

[gpt-5]	Prompt	Completion	Total
Token	47985	53034	101019
Cost	$0.06	$0.53	$0.59

Total arXiv papers: 596

Total scanned papers: 370

Total relevant papers: 35

Table of contents with paper titles:

A Recovery Guarantee for Sparse Neural Networks Authors: Sara Fridovich-Keil, Mert Pilanci
Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference Authors: Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment Authors: Deokjae Lee, Hyun Oh Song
Linear Transformers Implicitly Discover Unified Numerical Algorithms Authors: Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama
TensLoRA: Tensor Alternatives for Low-Rank Adaptation Authors: Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, François Leduc-Primeau
Mamba Modulation: On the Length Generalization of Mamba Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui
Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels Authors: Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin
Learning Dynamics of Deep Learning -- Force Analysis of Deep Neural Networks Authors: Yi Ren
Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models Authors: Junjie Yao, Zhi-Qin John Xu
On the Rate of Convergence of Kolmogorov-Arnold Network Regression Estimators Authors: Wei Liu, Eleni Chatzi, Zhilu Lai
Faster Than SVD, Smarter Than SGD: The OPLoRA Alternating Update Authors: Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horváth, Martin Takáč
Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding Authors: Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
SIM-CoT: Supervised Implicit Chain-of-Thought Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention Authors: Enhao Huang, Zhiyu Zhang, Tianxiang Xu, Chunshu Xia, Kaichun Hu, Yuchen Yang, Tongtong Pan, Dong Dong, Zhan Qin
Sobolev acceleration for neural networks Authors: Jong Kwon Oh, Hanbaek Lyu, Hwijae Son
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato
How deep is your network? Deep vs. shallow learning of transfer operators Authors: Mohammad Tabish, Benedict Leimkuhler, Stefan Klus
Staying on the Manifold: Geometry-Aware Noise Injection Authors: Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis
How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models Authors: Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu, Yongwei Wang, Wenbo Su, Bo Zheng
Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing Authors: Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang
Projective Kolmogorov Arnold Neural Networks (P-KANs): Entropy-Driven Functional Space Discovery for Interpretable Machine Learning Authors: Alastair Poole, Stig McArthur, Saravan Kumar
RoboSSM: Scalable In-context Imitation Learning via State-Space Models Authors: Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone
Latent Iterative Refinement Flow: A Geometric-Constrained Approach for Few-Shot Generation Authors: Songtao Li, Zhenyu Liao, Tianqi Hou, Ting Gao
Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization Authors: Tianyu Ruan, Kuo Gai, Shihua Zhang
Quantifying Compositionality of Classic and State-of-the-Art Embeddings Authors: Zhijin Guo (University of Oxford, University of Bristol), Chenhao Xue (University of Oxford), Zhaozhen Xu (University of Bristol), Hongbo Bo (University of Bristol), Yuxuan Ye (University of Bristol), Janet B. Pierrehumbert (University of Oxford), Martha Lewis (University of Amsterdam)
Interpreting ResNet-based CLIP via Neuron-Attention Decomposition Authors: Edmund Bu, Yossi Gandelsman
Modular Machine Learning with Applications to Genetic Circuit Composition Authors: Jichi Wang, Eduardo D. Sontag, Domitilla Del Vecchio
Quantum Harmonic Analysis and the Structure in Data: Augmentation Authors: Monika Doerfler, Franz Luef, Henry McNulty
Graph Variate Neural Networks Authors: Om Roy, Yashar Moshfeghi, Keith Smith
A Unified Noise-Curvature View of Loss of Trainability Authors: Gunbir Singh Baveja, Mark Schmidt
LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation Authors: Huizhen Shu, Xuying Li, Zhuo Li
The Syntax and Semantics of einsum Authors: Maurice Wenig, Paul G. Rump, Mark Blacher, Joachim Giesen
You Only Measure Once: On Designing Single-Shot Quantum Machine Learning Models Authors: Chen-Yu Liu, Leonardo Placidi, Kuan-Cheng Chen, Samuel Yen-Chi Chen, Gabriel Matos
Time-adaptive HénonNets for separable Hamiltonian systems Authors: Konrad Janik, Peter Benner
Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation Authors: Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li

1. A Recovery Guarantee for Sparse Neural Networks

ArXiv ID: 2509.20323

Authors: Sara Fridovich-Keil, Mert Pilanci

Abstract: We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.

Comment: Model Compression and Efficiency—sparsity: theoretical sparse recovery guarantees for ReLU networks via iterative hard thresholding with linear-memory footprint.

Relevance: 10 Novelty: 9
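
For readers unfamiliar with the iterative hard thresholding (IHT) mentioned in the abstract above, here is a minimal sketch of the classical version for sparse linear regression. It is not the paper's algorithm for ReLU networks; the linear model, the toy sizes, and the step-size choice are all assumptions made for illustration.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    if k > 0:
        keep = np.argpartition(np.abs(x), -k)[-k:]
        out[keep] = x[keep]
    return out

def iht(A, y, k, n_iters=300):
    """Classical IHT for y ~ A @ x, with x assumed to be k-sparse."""
    # Step 1 / ||A||_2^2 guarantees the squared residual never increases.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)              # gradient of 0.5 * ||A x - y||^2
        x = hard_threshold(x - step * grad, k)
    return x

# Hypothetical toy problem: 150 measurements, 200 unknowns, 3 nonzeros.
rng = np.random.default_rng(0)
n, d, k = 150, 200, 3
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))
y = A @ x_true
x_hat = iht(A, y, k)
```

Only the current k-sparse iterate needs to be kept between steps, which is the intuition behind a memory footprint that scales with the number of nonzeros. On easy instances like this one the estimate typically lands on the planted support, but the sketch itself carries no recovery guarantee.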

2. Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference

ArXiv ID: 2509.19781

Authors: Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui

Abstract: Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for online inference remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during online inference, task information is often unavailable, making task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, Tanbr, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, Tanbr estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, Tanbr employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and decides the optimal expert merging. We prove that Tanbr achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T} \log(T))$ over $T$ rounds, despite operating over a continuous decision space, matching the regret bounds of existing methods. Extensive experiments show that Tanbr reduces inference latency by at least 45% and memory usage by up to 25%, while maintaining high accuracy compared to many state-of-the-art methods.

Comment: Direct MoE architecture/efficiency: task-aware expert merging with adaptive neural bandit router for online inference.

Relevance: 10 Novelty: 8

3. Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment

ArXiv ID: 2509.20214

Authors: Deokjae Lee, Hyun Oh Song

Abstract: We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.

Comment: Compression/Efficiency: weight-only PTQ for LLMs with fractional-bit quantizers and optimal bit allocation; practical CUDA kernels and mixed-scheme layer fusion.

Relevance: 10 Novelty: 8
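
To see why an optimal bit allocation naturally calls for fractional bitwidths, the following sketch applies the textbook allocation under the Gaussian distortion-rate function D(R) = sigma^2 * 2^(-2R). It is an assumption-laden illustration rather than the paper's formulation: the per-layer sensitivities are hypothetical, all layers are assumed to hold the same number of weights, and the non-negativity of bitwidths is not enforced.

```python
import numpy as np

def allocate_bits(weights, avg_bits):
    """Minimize sum_i w_i * 2**(-2 * b_i) subject to mean(b_i) == avg_bits.

    The closed form gives each layer the average budget plus half the
    log2 ratio of its weight to the geometric mean of all weights.
    """
    log_w = np.log2(np.asarray(weights, dtype=float))
    return avg_bits + 0.5 * (log_w - log_w.mean())

def total_distortion(weights, bits):
    """Total weighted distortion under the Gaussian distortion-rate model."""
    w = np.asarray(weights, dtype=float)
    b = np.asarray(bits, dtype=float)
    return float(np.sum(w * 2.0 ** (-2.0 * b)))

sens = [8.0, 1.0, 1.0]                     # hypothetical per-layer sensitivities
bits = allocate_bits(sens, avg_bits=3.0)   # array([4. , 2.5, 2.5])
```

With these numbers the optimal allocation assigns 2.5 bits to two of the layers, and its total distortion (0.09375) is lower than that of a uniform 3-bit allocation (0.15625). Integer-only quantizers would have to round such an allocation, which is the gap fractional-bit quantizers aim to close.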

4. Linear Transformers Implicitly Discover Unified Numerical Algorithms

ArXiv ID: 2509.19702

Authors: Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama

Abstract: We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is a masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice of Nyström extrapolation. The model sees only input-output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across three distinct computational regimes (full visibility, rank-limited updates, and distributed computation). We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with rank-limited attention. Thus, a transformer trained solely to patch missing blocks implicitly discovers a unified, resource-adaptive iterative solver spanning prediction, estimation, and Nyström extrapolation, highlighting a powerful capability of in-context learning.

Comment: Model Architecture: linear-attention Transformer analysis/unrolling revealing a unified iterative solver with theoretical convergence.

Relevance: 9 Novelty: 9

5. TensLoRA: Tensor Alternatives for Low-Rank Adaptation

ArXiv ID: 2509.19391

Authors: Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, François Leduc-Primeau

Abstract: Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are considered independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes outperforming standard LoRA under similar parameter counts.

Comment: Model Compression/Efficiency: generalizes LoRA to tensorized low-rank adaptations with mode-specific compression across attention projections.

Relevance: 9 Novelty: 8

6. Mamba Modulation: On the Length Generalization of Mamba

ArXiv ID: 2509.19633

Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui

Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.

Comment: Model Architecture—state-space models (Mamba): analysis of transition matrix spectra and spectral modulation to improve long-context generalization.

Relevance: 9 Novelty: 8
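
The role of the spectrum of A in the abstract above can be illustrated with a single diagonal mode. Under zero-order-hold discretization, each step multiplies the state by exp(delta_t * a), so the initial state is scaled by exp(a * sum(delta_t)) after N steps. The sketch below is a toy illustration under assumed values (constant step size, one negative eigenvalue, hypothetical context lengths); the uniform rescaling shown is not the paper's selective per-layer modulation rule, which the abstract does not specify.

```python
import math
import numpy as np

def state_decay(a, deltas):
    """Factor applied to the initial state after all steps, for one mode a < 0.

    Each step contributes exp(delta_t * a); the product over all steps
    is exp(a * sum(deltas)).
    """
    return float(np.exp(a * np.sum(deltas)))

train_len, test_len = 512, 2048   # hypothetical context lengths
delta = 0.01                      # hypothetical constant step size
a = -1.0                          # hypothetical eigenvalue of A

in_dist = state_decay(a, np.full(train_len, delta))
out_dist = state_decay(a, np.full(test_len, delta))
# Shrinking the eigenvalue by train_len / test_len restores the
# decay level seen at the training length.
rescaled = state_decay(a * train_len / test_len, np.full(test_len, delta))
```

At the longer length the unmodified mode decays far beyond anything seen in training, while the rescaled eigenvalue brings the accumulated decay back to the training-length value.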

7. Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

ArXiv ID: 2509.20294

Authors: Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin

Abstract: We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

Comment: Representation Learning/Theory: introduces effective span dimension with alignment-sensitive minimax rates and shows gradient flow reduces ESD, linking adaptive feature learning to generalization.

Relevance: 9 Novelty: 8

8. Learning Dynamics of Deep Learning -- Force Analysis of Deep Neural Networks

ArXiv ID: 2509.19554

Authors: Yi Ren

Abstract: This thesis explores how deep learning models learn over time, using ideas inspired by force analysis. Specifically, we zoom in on the model's training procedure to see how one training example affects another during learning, like analyzing how forces move objects. We break this influence into two parts: how similar the two examples are, and how strong the updating force is. This framework helps us understand a wide range of the model's behaviors in different real systems. For example, it explains why certain examples have non-trivial learning paths, why (and why not) some LLM finetuning methods work, and why simpler, more structured patterns tend to be learned more easily. We apply this approach to various learning tasks and uncover new strategies for improving model training. While the method is still developing, it offers a new way to interpret models' behaviors systematically.

Comment: Representation Learning/Training Dynamics: proposes a force-based framework analyzing inter-example influences during training, offering insights into how networks learn.

Relevance: 9 Novelty: 7

9. Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models

ArXiv ID: 2509.20124

Authors: Junjie Yao, Zhi-Qin John Xu

Abstract: The embedding space of language models is widely believed to capture the semantic relationships; for instance, embeddings of digits often exhibit an ordered structure that corresponds to their natural sequence. However, the mechanisms driving the formation of such structures remain poorly understood. In this work, we interpret the embedding structures via the data distribution. We propose a set of probability signatures that reflect the semantic relationships among tokens. Through experiments on the composite addition tasks using the linear model and feedforward network, combined with theoretical analysis of gradient flow dynamics, we reveal that these probability signatures significantly influence the embedding structures. We further generalize our analysis to large language models (LLMs) by training the Qwen2.5 architecture on the subsets of the Pile corpus. Our results show that the probability signatures are faithfully aligned with the embedding structures, particularly in capturing strong pairwise similarities among embeddings. Our work uncovers the mechanism of how data distribution guides the formation of embedding structures, establishing a novel understanding of the relationship between embedding organization and semantic patterns.

Comment: Representation Learning: provides mechanistic insight into how data distributions (probability signatures) shape embedding geometry via gradient-flow analysis.

Relevance: 9 Novelty: 7

10. On the Rate of Convergence of Kolmogorov-Arnold Network Regression Estimators

ArXiv ID: 2509.19830

Authors: Wei Liu, Eleni Chatzi, Zhilu Lai

Abstract: Kolmogorov-Arnold Networks (KANs) offer a structured and interpretable framework for multivariate function approximation by composing univariate transformations through additive or multiplicative aggregation. This paper establishes theoretical convergence guarantees for KANs when the univariate components are represented by B-splines. We prove that both additive and hybrid additive-multiplicative KANs attain the minimax-optimal convergence rate $O(n^{-2r/(2r+1)})$ for functions in Sobolev spaces of smoothness $r$. We further derive guidelines for selecting the optimal number of knots in the B-splines. The theory is supported by simulation studies that confirm the predicted convergence rates. These results provide a theoretical foundation for using KANs in nonparametric regression and highlight their potential as a structured alternative to existing methods.

Comment: Model Architecture: theoretical convergence guarantees and minimax rates for Kolmogorov-Arnold Networks, informing structured function approximation.

Relevance: 9 Novelty: 7
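
The rate quoted in the abstract above matches the standard bias-variance balance for univariate spline regression, sketched below. Here $K$ (number of knots) and $n$ (sample size) are notation introduced for this note; the paper's own knot-selection guidelines may differ in constants or conditions.

```latex
\underbrace{O\!\left(K^{-2r}\right)}_{\text{squared bias}}
+ \underbrace{O\!\left(\tfrac{K}{n}\right)}_{\text{variance}}
\;\Longrightarrow\;
K \asymp n^{1/(2r+1)}
\;\Longrightarrow\;
\text{risk} = O\!\left(n^{-2r/(2r+1)}\right)
```

Both terms equal $n^{-2r/(2r+1)}$ at the balancing choice of $K$, so smoother targets (larger $r$) call for fewer knots relative to $n$.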

11. Faster Than SVD, Smarter Than SGD: The OPLoRA Alternating Update

ArXiv ID: 2509.19977

Authors: Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horváth, Martin Takáč

Abstract: Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. However, there is still a gap between full training with low-rank projections (SVDLoRA) and LoRA fine-tuning, indicating that LoRA steps can be further improved. In this study, we propose OPLoRA, a memory-efficient optimizer that closes this gap by casting LoRA optimization as an interpretable sub-problem and solving it efficiently with alternating least squares updates, where 1-2 alternating steps are empirically found to be sufficient to closely match truncated SVD without ever forming the full matrix. We also retrieve the recently proposed preconditioning methods for LoRA as a special case. OPLoRA supports momentum by maintaining a low-rank estimate using the same subroutine (LoRSum) for computing the step, with a memory budget of 3 times the number of LoRA parameters (i.e., same as Adam). We also propose an experimental scaled variant that uses the K-FAC metric, which could be of interest. Across a linear task, MNIST, CIFAR-100, and RoBERTa-base (MNLI), OPLoRA consistently approaches SVDLoRA's performance using significantly less memory.

Comment: Compression/Efficiency: alternating least-squares optimizer for LoRA approximates SVDLoRA with low memory, improving low-rank adaptation updates.

Relevance: 9 Novelty: 7
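
The claim that one or two alternating steps can closely match a truncated SVD can be illustrated with plain alternating least squares (ALS) on an explicit matrix. This is a generic sketch, not OPLoRA itself: it forms the target matrix M densely, which the paper says it avoids, and the function name and random initialization are assumptions.

```python
import numpy as np

def als_low_rank(M, r, n_steps=2, seed=0):
    """Rank-r approximation M ~ U @ V.T via alternating least squares."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((M.shape[1], r))
    for _ in range(n_steps):
        # Fix V and minimize ||M - U V^T||_F over U: U = M V (V^T V)^{-1}
        U = np.linalg.solve(V.T @ V, (M @ V).T).T
        # Fix U and minimize over V: V = M^T U (U^T U)^{-1}
        V = np.linalg.solve(U.T @ U, (M.T @ U).T).T
    return U, V

rng = np.random.default_rng(1)
M = rng.standard_normal((30, 20))          # hypothetical full-rank target
U, V = als_low_rank(M, r=3, n_steps=2)
err_als = np.linalg.norm(M - U @ V.T)

# Reference: best rank-3 error from the truncated SVD (Eckart-Young).
s = np.linalg.svd(M, compute_uv=False)
err_svd = np.sqrt(np.sum(s[3:] ** 2))
```

Each half-step is a small r-by-r linear solve, and only the two factors are carried between steps. For a matrix that is exactly rank r, a single alternating step already reproduces it for a generic initialization; for general matrices the ALS error stays at or above the truncated-SVD error and approaches it as steps are added.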

12. Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

ArXiv ID: 2509.19368

Authors: Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li

Abstract: Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. However, in practice, many approaches struggle to achieve the expected acceleration in such a draft-then-verify paradigm, even with a well-aligned early-exit head and selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM. Otherwise, the draft cost may outweigh the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD), which fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. First, we configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. Second, we interleave drafting and verification per token: while the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. Such a verify-while-draft scheme keeps all units busy and validates tokens on the fly, analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios ranging from 2.01x to 3.81x, attaining almost the optimal acceleration at the fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.

Comment: High Performance Computing/Efficiency: pipeline-parallel early-exit self-speculative decoding with verify-while-draft scheduling for faster LLM inference.

Relevance: 9 Novelty: 7
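
The observation that sequential draft-then-verify only pays off at high acceptance rates can be reproduced with the standard speculative-decoding cost model, which assumes each draft token is accepted independently with probability alpha. This is a sketch of that generic model, not the paper's own analysis; `gamma` (draft length) and `c` (cost of one draft step relative to one full-model pass) are assumed parameters.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per verification pass with gamma draft tokens."""
    if alpha == 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def sequential_speedup(alpha, gamma, c):
    """Speedup over plain decoding when drafting and verifying run back to back.

    One round costs gamma draft steps (each c full-pass units) plus one
    full verification pass, and yields expected_tokens(alpha, gamma) tokens.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1.0)

# Hypothetical setting: 4 draft tokens, early exit at half the depth (c = 0.5).
low = sequential_speedup(alpha=0.5, gamma=4, c=0.5)    # about 0.65, a slowdown
high = sequential_speedup(alpha=0.9, gamma=4, c=0.5)   # about 1.37, a speedup
```

Under this model a 50% acceptance rate makes the sequential scheme slower than ordinary decoding, which is the regime that overlapping draft and verification work is meant to address.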

13. SIM-CoT: Supervised Implicit Chain-of-Thought

ArXiv ID: 2509.20317

Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin

Abstract: Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3x greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.

Comment: Model Architecture/Representation Learning: step-level supervision via an auxiliary decoder to stabilize and diversify latent states in implicit CoT, improving training dynamics with no inference overhead.

Relevance: 8 Novelty: 8

14. Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention

ArXiv ID: 2509.19331

Authors: Enhao Huang, Zhiyu Zhang, Tianxiang Xu, Chunshu Xia, Kaichun Hu, Yuchen Yang, Tongtong Pan, Dong Dong, Zhan Qin

Abstract: Complex-valued signals encode both amplitude and phase, yet most deep models treat attention as real-valued correlation, overlooking interference effects. We introduce the Holographic Transformer, a physics-inspired architecture that incorporates wave interference principles into self-attention. Holographic attention modulates interactions by relative phase and coherently superimposes values, ensuring consistency between amplitude and phase. A dual-headed decoder simultaneously reconstructs the input and predicts task outputs, preventing phase collapse when losses prioritize magnitude over phase. We demonstrate that holographic attention implements a discrete interference operator and maintains phase consistency under linear mixing. Experiments on PolSAR image classification and wireless channel prediction show strong performance, achieving high classification accuracy and F1 scores, low regression error, and increased robustness to phase perturbations. These results highlight that enforcing physical consistency in attention leads to generalizable improvements in complex-valued learning and provides a unified, physics-based framework for coherent signal modeling. The code is available at https://github.com/EonHao/Holographic-Transformers.

Comment: Model Architecture: introduces a physics-inspired complex-valued self-attention (holographic attention) within Transformers that explicitly models phase interference.

Relevance: 8 Novelty: 8

15. Sobolev acceleration for neural networks

ArXiv ID: 2509.19773

Authors: Jong Kwon Oh, Hanbaek Lyu, Hwijae Son

Abstract: Sobolev training, which integrates target derivatives into the loss functions, has been shown to accelerate convergence and improve generalization compared to conventional $L^2$ training. However, the underlying mechanisms of this training method remain only partially understood. In this work, we present the first rigorous theoretical framework proving that Sobolev training accelerates the convergence of Rectified Linear Unit (ReLU) networks. Under a student-teacher framework with Gaussian inputs and shallow architectures, we derive exact formulas for population gradients and Hessians, and quantify the improvements in conditioning of the loss landscape and gradient-flow convergence rates. Extensive numerical experiments validate our theoretical findings and show that the benefits of Sobolev training extend to modern deep learning tasks.

Comment: Training dynamics theory: rigorous analysis showing Sobolev training improves conditioning and accelerates convergence of ReLU networks.

Relevance: 8 Novelty: 8
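
For reference, a first-order Sobolev training objective of the kind described in the abstract above is commonly written as follows. The symbols $f_\theta$ (student network), $f^{*}$ (target or teacher), and the weight $\lambda$ are notation assumed for this note, not taken from the paper.

```latex
\mathcal{L}_{\mathrm{Sob}}(\theta)
= \mathbb{E}_{x}\!\left[\big(f_\theta(x) - f^{*}(x)\big)^{2}\right]
+ \lambda\,\mathbb{E}_{x}\!\left[\big\|\nabla_x f_\theta(x) - \nabla_x f^{*}(x)\big\|_2^{2}\right]
```

Setting $\lambda = 0$ recovers conventional $L^2$ training, so the second term is what carries the target-derivative information.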

16. CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks

ArXiv ID: 2509.19855

Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato

Abstract: The increasing demand for intelligent mobile applications has made multi-agent collaboration with Transformer-based large language models (LLMs) essential in mobile edge computing (MEC) networks. However, training LLMs in such environments remains challenging due to heavy computation, high end-to-end latency, and limited model generalization. We introduce CollaPipe, a hybrid distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving intelligent networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. We then perform the global model update via federated aggregation. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power. We derive and use a closed-form convergence bound to design a Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based on Lyapunov optimization, ensuring system stability under long-term constraints. Extensive experiments on downstream tasks with Transformer and BERT models show that CollaPipe improves computation efficiency by up to 15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device memory usage by more than half, enabling online learning in heterogeneous and dynamic communication environments.

Comment: High Performance Computing: adaptive pipeline parallelism with resource scheduling for distributed LLM training on heterogeneous edge devices, with convergence analysis and memory benefits.