Personalized Daily ArXiv Papers 2025-12-02

[gpt-5]	Prompt	Completion	Total
Token	69317	59850	129167
Cost	$0.09	$0.6	$0.69

Total arXiv papers: 1038

Total scanned papers: 656

Total relevant papers: 42

Table of contents with paper titles:

Improved Mean Flows: On the Challenges of Fastforward Generative Models Authors: Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter, Kaiming He
Efficient Turing Machine Simulation with Transformers Authors: Qian Li, Yuyi Wang
The Mean-Field Dynamics of Transformers Authors: Philippe Rigollet
HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs Authors: Ningning Chen, Weicai Ye, Ying Jiang
Low-Rank Prehab: Preparing Neural Networks for SVD Compression Authors: Haoran Qin, Shansita Sharma, Ali Abbasi, Chayne Thrash, Soheil Kolouri
WUSH: Near-Optimal Adaptive Transforms for LLM Quantization Authors: Jiale Chen, Vage Egiazarian, Torsten Hoefler, Dan Alistarh
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling Authors: Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han
LPCD: Unified Framework from Layer-Wise to Submodule Quantization Authors: Yuma Ichikawa, Yudai Fujimoto, Akira Sakai
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding Authors: Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed S. Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica
G-KV: Decoding-Time KV Cache Eviction with Global Attention Authors: Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan
Implicitly Normalized Online PCA: A Regularized Algorithm with Exact High-Dimensional Dynamics Authors: Samet Demir, Zafer Dogan
SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs Authors: Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, Guohao Dai
Efficiently Learning Branching Networks for Multitask Algorithmic Reasoning Authors: Dongyue Li, Zhenshuo Zhang, Minxuan Duan, Edgar Dobriban, Hongyang R. Zhang
Solving Neural Min-Max Games: The Role of Architecture, Initialization & Dynamics Authors: Deep Patel, Emmanouil-Vasileios Vlatakis-Gkaragkounis
Tuning Universality in Deep Neural Networks Authors: Arsham Ghavasieh
Constructing Efficient Fact-Storing MLPs for Transformers Authors: Owen Dugan, Roberto Garcia, Ronny Junkins, Jerry Liu, Dylan Zinsley, Sabri Eyuboglu, Atri Rudra, Chris Ré
AlignSAE: Concept-Aligned Sparse Autoencoders Authors: Minglai Yang, Xinyu Guo, Mihai Surdeanu, Liangming Pan
Less is More: Resource-Efficient Low-Rank Adaptation Authors: Chunlin Tian, Xuyang Wei, Huanrong Liu, Zhijiang Guo, Li Li
Morphling: Fast, Fused, and Flexible GNN Training at Scale Authors: Anubhab, Rupesh Nasre
Upcycled and Merged MoE Reward Model for Mitigating Reward Hacking Authors: Lingling Fu
From Coefficients to Directions: Rethinking Model Merging with Directional Alignment Authors: Zhikang Chen, Sen Cui, Deheng Ye, Min Zhang, Gang Niu, Yu Zhang, Masashi Sugiyama, Tingting Zhu
ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation Authors: Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, Sergey Levine
SVRG and Beyond via Posterior Correction Authors: Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan
Does Flatness imply Generalization for Logistic Loss in Univariate Two-Layer ReLU Network? Authors: Dan Qiao, Yu-Xiang Wang
Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves Authors: Alessio Borgi, Fabrizio Silvestri, Pietro Liò
Generative Modeling with Continuous Flows: Sample Complexity of Flow Matching Authors: Mudit Gaur, Prashant Trivedi, Shuchin Aeron, Amrit Singh Bedi, George K. Atia, Vaneet Aggarwal
An RKHS Perspective on Tree Ensembles Authors: Mehdi Dagdoug, Clement Dombry, Jean-Jil Duchamps
SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling Authors: Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li, Pengfei Liu
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference Authors: Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, Song Han
MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention Authors: Pedro M. P. Curvo, Jan-Willem van de Meent, Maksim Zhdanov
Fiber Bundle Networks: A Geometric Machine Learning Paradigm Authors: Dong Liu
Scalable and Interpretable Scientific Discovery via Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN) Authors: Y. Sungtaek Ju
Mode-Conditioning Unlocks Superior Test-Time Scaling Authors: Chen Henry Wu, Sachin Goyal, Aditi Raghunathan
Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging Authors: Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, Ran He
Preventing Model Collapse via Contraction-Conditioned Neural Filters Authors: Zongjian Han, Yiran Liang, Ruiwen Wang, Yiwei Luo, Yilin Huang, Xiaotong Song, Dongqing Wei
Beyond Loss Guidance: Using PDE Residuals as Spectral Attention in Diffusion Neural Operators Authors: Medha Sawhney, Abhilash Neog, Mridul Khurana, Anuj Karpatne
Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe Authors: Yahui Liu, Yang Yue, Jingyuan Zhang, Chenxi Sun, Yang Zhou, Wencong Zeng, Ruiming Tang, Guorui Zhou
H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons Authors: Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, Maosong Sun
Memory-Integrated Reconfigurable Adapters: A Unified Framework for Settings with Multiple Tasks Authors: Susmit Agrawal, Krishn Vishwas Kher, Saksham Mittal, Swarnim Maheshwari, Vineeth N. Balasubramanian
Fantastic Features and Where to Find Them: A Probing Method to combine Features from Multiple Foundation Models Authors: Benjamin Ramtoula, Pierre-Yves Lajoie, Paul Newman, Daniele De Martini
Upper Approximation Bounds for Neural Oscillators Authors: Zifeng Huang, Konstantin M. Zuev, Yong Xia, Michael Beer
One Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces Authors: Yandong Sun, Qiang Huang, Ziwei Xu, Yiqun Sun, Yixuan Tang, Anthony K. H. Tung

1. Improved Mean Flows: On the Challenges of Fastforward Generative Models

ArXiv ID: 2512.02012

Authors: Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter, Kaiming He

Abstract: MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.

Comment: Author match

2. Efficient Turing Machine Simulation with Transformers

ArXiv ID: 2512.00003

Authors: Qian Li, Yuyi Wang

Abstract: Constant bit-size Transformers are known to be Turing complete, but existing constructions require $\Omega(s(n))$ chain-of-thought (CoT) steps per simulated Turing machine (TM) step, leading to impractical reasoning lengths. In this paper, we significantly reduce this efficiency gap by proving that any $(t(n),s(n))$-bounded multi-tape TM can be simulated by a constant bit-size Transformer with an optimal $O(s(n))$-long context window and only $O(s(n)^c)$ CoT steps per TM step, where $c>0$ can be made arbitrarily small by making the Transformer's head-layer product sufficiently large. In addition, our construction shows that sparse attention with fixed geometric offsets suffices for efficient universal computation. Our proof leverages multi-queue TMs as a bridge. The main technical novelty is a more efficient simulation of multi-tape TMs by synchronous multi-queue TMs, improving both time and space complexity under stricter model assumptions.

Comment: Matches Model Architecture and Efficiency: theoretical construction for efficient TM simulation with constant‑bit Transformers and sparse attention with fixed offsets.

Relevance: 10 Novelty: 9

3. The Mean-Field Dynamics of Transformers

ArXiv ID: 2512.01868

Authors: Philippe Rigollet

Abstract: We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention continuous on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long metastable states where they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.

Comment: Representation Learning: mean-field theory for Transformer attention (Wasserstein gradient flows, clustering/phase transition) elucidates training dynamics and representation collapse in deep attention.

Relevance: 10 Novelty: 9
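
The particle-system view in this abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a commonly used idealization in which each token on the unit circle moves toward its softmax-attention average, projected onto the tangent space and renormalized. The update rule, `beta`, `dt`, and the starting angles are all illustrative assumptions.

```python
import math

def step(points, beta, dt):
    """One explicit Euler step of assumed softmax-attention dynamics on the unit circle."""
    new_points = []
    for xi in points:
        # Attention weights of this token over all tokens: softmax(beta * <xi, xj>).
        logits = [beta * (xi[0] * xj[0] + xi[1] * xj[1]) for xj in points]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        # Attention output: weighted average of the tokens.
        m = (sum(w * xj[0] for w, xj in zip(weights, points)) / total,
             sum(w * xj[1] for w, xj in zip(weights, points)) / total)
        # Project onto the tangent space at xi, move, then renormalize.
        dot = xi[0] * m[0] + xi[1] * m[1]
        moved = (xi[0] + dt * (m[0] - dot * xi[0]),
                 xi[1] + dt * (m[1] - dot * xi[1]))
        norm = math.hypot(moved[0], moved[1])
        new_points.append((moved[0] / norm, moved[1] / norm))
    return new_points

def min_cosine(points):
    """Smallest pairwise cosine similarity; it rises as the tokens contract."""
    return min(a[0] * b[0] + a[1] * b[1] for a in points for b in points)

# Hypothetical starting configuration: six tokens inside a half-circle.
angles = [0.0, 0.4, 0.9, 1.5, 2.2, 2.8]
tokens = [(math.cos(a), math.sin(a)) for a in angles]
spread_before = min_cosine(tokens)
for _ in range(2000):
    tokens = step(tokens, beta=1.0, dt=0.1)
spread_after = min_cosine(tokens)
```

Comparing `spread_before` with `spread_after` shows the tokens drawing together in this toy setting; the metastable multi-cluster states and the rates analyzed in the paper are beyond what this sketch captures.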

4. HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs

ArXiv ID: 2512.00862

Authors: Ningning Chen, Weicai Ye, Ying Jiang

Abstract: We introduce HBLLM, a wavelet-enhanced high-fidelity 1-bit post-training quantization method for Large Language Models (LLMs). By leveraging Haar wavelet transforms to enhance expressive capacity through frequency decomposition, HBLLM significantly improves quantization fidelity while maintaining minimal overhead. This approach features two innovative structure-aware grouping strategies: (1) frequency-aware multi-parameter intra-row grouping and (2) $\ell_2$-norm-based saliency-driven column selection. For non-salient weights, a shared mean is employed across quantization groups within each frequency band to optimize storage efficiency. Experiments conducted on the OPT and LLaMA models demonstrate that HBLLM achieves state-of-the-art performance in 1-bit quantization, attaining a perplexity of 6.71 on LLaMA2-13B with an average weight storage of only 1.08 bits. Code available at: https://github.com/Yeyke/HBLLM.

Comment: Matches Model Compression and Efficiency: proposes wavelet-enhanced 1-bit post-training quantization with structure-aware grouping and saliency-driven selection achieving SOTA fidelity at ~1.08 bits.

Relevance: 10 Novelty: 8

5. Low-Rank Prehab: Preparing Neural Networks for SVD Compression

ArXiv ID: 2512.01980

Authors: Haoran Qin, Shansita Sharma, Ali Abbasi, Chayne Thrash, Soheil Kolouri

Abstract: Low-rank approximation methods such as singular value decomposition (SVD) and its variants (e.g., Fisher-weighted SVD, Activation SVD) have recently emerged as effective tools for neural network compression. In this setting, decomposition acts as a "surgical" intervention, followed by fine-tuning that serves as "rehab" to recover accuracy. Inspired by prehabilitation in surgery, we introduce a pre-compression fine-tuning stage, Low-Rank Prehab, that explicitly encourages low-rank structure in weight matrices while preserving task performance. By conditioning the model before SVD, Prehab steers weights toward spectrally compact regions of the parameter space, enabling smoother low-rank approximation and improved recovery. Experiments on large language models (LLMs) and other Transformer-based architectures, including Vision Transformers (ViTs), show that Prehab substantially reduces the immediate accuracy drop after compression and consistently improves post-finetuning performance. Across a wide range of compression ratios, our method outperforms state-of-the-art SVD-based techniques such as SVD-LLM, highlighting the importance of preparing models for compression rather than only improving the compression and recovery stages. Source code is available at https://github.com/niqretnuh/PREHAB-SVD

Comment: Matches Model Compression and Efficiency: pre‑conditioning networks (Prehab) for superior SVD low‑rank compression with improved post‑finetuning accuracy.

Relevance: 10 Novelty: 8

6. WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

ArXiv ID: 2512.00956

Authors: Jiale Chen, Vage Egiazarian, Torsten Hoefler, Dan Alistarh

Abstract: Quantization to low bitwidth is a standard approach for deploying large language models; however, a few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. A common mitigation approach is to apply some fixed orthogonal transforms, such as Hadamard matrices, before quantization, which typically reduces the dynamic range. Yet, these transforms ignore the statistics of the data, and their optimality is currently not understood. In this work, we derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization using standard data-free quantizers for common numerical formats. Specifically, we provide derivations of the optimal adaptive (data-aware) transforms for round-to-nearest (RTN), AbsMax-scaled block quantizers for both integer and floating-point formats. The resulting construction, which we call WUSH, combines a Hadamard backbone with a data-dependent component based on second-order moments, yielding a non-orthogonal transform that is provably optimal under mild assumptions and remains structured for efficient implementation. Preliminary experimental results show that our approach consistently improves upon the Hadamard transform for common formats.

Comment: Model Compression and Efficiency: derives near-optimal adaptive linear transforms for joint weight–activation block quantization (RTN AbsMax), improving over Hadamard.

Relevance: 10 Novelty: 8
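
To make the baseline in this abstract concrete, here is a minimal sketch of a fixed Hadamard rotation applied before AbsMax-scaled round-to-nearest quantization. It covers only the fixed-transform baseline, not WUSH's data-dependent component. The 7-level symmetric integer grid and the example block (one outlier among small values) are assumptions chosen for illustration.

```python
import math

def hadamard(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]

def absmax_rtn(values, levels=7):
    """Round-to-nearest onto a symmetric integer grid scaled by the block's AbsMax."""
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical block: a single outlier (8.0) stretches the dynamic range.
block = [0.5, -0.4, 0.45, -0.55, 0.4, 0.5, -0.45, 8.0,
         -0.5, 0.55, -0.4, 0.45, 0.5, -0.5, 0.4, -0.45]

err_plain = squared_error(block, absmax_rtn(block))

rotated = hadamard(block)
# The normalized transform is its own inverse, so applying it again undoes the rotation.
restored = hadamard(absmax_rtn(rotated))
err_rotated = squared_error(block, restored)
```

In this example the outlier forces a coarse step in the plain quantizer, so every small value rounds to zero. After the rotation the outlier's energy is spread across the block, the AbsMax shrinks, and `err_rotated` comes out below `err_plain`. The rotation ignores the data statistics, which is the gap the paper targets.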

7. Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

ArXiv ID: 2512.02010

Authors: Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han

Abstract: As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.

Comment: Model Compression and Efficiency: proposes an FP4 (NVFP4) quantization algorithm (4/6) with adaptive block-level scaling to reduce near-maximum value error, enabling stable FP4 training/inference on Blackwell GPUs.

Relevance: 10 Novelty: 8
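
The two-scale idea can be sketched in a few lines. The code below is an illustration, not the authors' kernel: it assumes the two candidates map the block maximum to 6 or to 4 on the FP4 (E2M1) magnitude grid and that the choice is made by squared error. It also ignores the low-precision storage of the block scale itself. The function names and example blocks are hypothetical.

```python
import math

# Magnitudes representable in FP4 (E2M1).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, target_max):
    """Scale so the largest magnitude maps to target_max, then round to the FP4 grid."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return list(block), 0.0
    scale = absmax / target_max
    dequantized = []
    for v in block:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        dequantized.append(math.copysign(nearest * scale, v))
    error = sum((v - q) ** 2 for v, q in zip(block, dequantized))
    return dequantized, error

def four_over_six(block):
    """Try both targets and keep the one with lower squared error (assumed criterion)."""
    candidates = {t: quantize_block(block, t) for t in (6.0, 4.0)}
    best = min(candidates, key=lambda t: candidates[t][1])
    return best, candidates[best][0]

# Block with near-maximal values (8.5 and 8.0 next to a maximum of 10.0).
choice_a, deq_a = four_over_six([10.0, 8.5, 8.0, -1.0])
# Block whose values already sit on the grid when the maximum maps to 6.
choice_b, deq_b = four_over_six([6.0, 3.0, 1.0, -0.5])
```

For the first block the target of 4 wins, because the grid has no value between 4 and 6 and the near-maximal entries would otherwise be rounded far from their true values. For the second block the standard target of 6 reproduces every entry exactly, so it is kept.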

8. LPCD: Unified Framework from Layer-Wise to Submodule Quantization

ArXiv ID: 2512.01546

Authors: Yuma Ichikawa, Yudai Fujimoto, Akira Sakai

Abstract: Post-training quantization (PTQ) aims to preserve model-level behavior; however, most methods focus on individual linear layers. Even recent extensions, such as QEP and LoaQ, which mitigate error propagation or target specific submodules, still rely on layer-wise formulations and fail to capture the behavior of larger submodules. We introduce Layer-Projected Coordinate Descent (LPCD), a unified framework that extends PTQ beyond layers by optimizing relaxed objectives across arbitrary submodules and projecting the solutions with layer-wise quantizers. LPCD generalizes existing methods and provides a principled approach to quantizing complex submodules while maintaining the efficiency and compatibility of layer-wise PTQ pipelines. Across diverse LLM architectures and bit-widths, LPCD-based submodule quantization consistently enhances both layer-wise PTQ methods and existing submodule approaches.

Comment: Model Compression and Efficiency: unified PTQ framework extending from layer-wise to arbitrary submodule quantization via layer-projected coordinate descent.

Relevance: 10 Novelty: 8

9. Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

ArXiv ID: 2512.01278

Authors: Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed S. Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica

Abstract: Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens via elegantly reusing information from the verification stage. Furthermore, SparseSpec co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with an up to 2.13x throughput speedup.

Comment: Model Compression and Efficiency / HPC: sparse self-speculative decoding with PillarAttn, unified scheduler, delayed verification, and dynamic KV-cache management for faster long-CoT inference.

Relevance: 10 Novelty: 8

10. G-KV: Decoding-Time KV Cache Eviction with Global Attention

ArXiv ID: 2512.00504

Authors: Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan

Abstract: Recent reasoning large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. However, existing methods often focus on prompt compression or token eviction with local attention score, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance. Additionally, we introduce post-training techniques, including reinforcement learning and distillation, to optimize models for compressed KV cache settings. The code of this paper is available on: https://github.com/microsoft/G-KV.

Comment: Model Compression and Efficiency: decoding-time KV-cache eviction using a global attention-based scoring mechanism with post-training RL/distillation for compressed-cache settings.

Relevance: 10 Novelty: 7
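
The abstract does not spell out how local and historical attention scores are combined, so the following is only a hypothetical sketch of the general idea. It assumes a decayed running maximum as the global score and a fixed token budget; `update_scores`, `evict`, `decay`, and the example values are illustrative names and numbers, not the paper's.

```python
def update_scores(history, local, decay=0.9):
    """Blend each token's running score with its current-step attention score."""
    return [max(decay * h, l) for h, l in zip(history, local)]

def evict(cache, scores, budget):
    """Keep the `budget` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep], [scores[i] for i in keep]

cache = ["a", "b", "c", "d"]
history = [0.6, 0.05, 0.1, 0.0]    # token "a" mattered a lot in earlier steps
local = [0.02, 0.03, 0.5, 0.45]    # attention scores at the current step only

kept_local, _ = evict(cache, local, budget=2)
kept_global, _ = evict(cache, update_scores(history, local), budget=2)
```

With local scores alone, token "a" is evicted even though it was heavily attended earlier. With the blended score it survives, which is the behavior a global scoring mechanism is meant to provide.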

11. Implicitly Normalized Online PCA: A Regularized Algorithm with Exact High-Dimensional Dynamics

ArXiv ID: 2512.01231

Authors: Samet Demir, Zafer Dogan

Abstract: Many online learning algorithms, including classical online PCA methods, enforce explicit normalization steps that discard the evolving norm of the parameter vector. We show that this norm can in fact encode meaningful information about the underlying statistical structure of the problem, and that exploiting this information leads to improved learning behavior. Motivated by this principle, we introduce Implicitly Normalized Online PCA (INO-PCA), an online PCA algorithm that removes the unit-norm constraint and instead allows the parameter norm to evolve dynamically through a simple regularized update. We prove that in the high-dimensional limit the joint empirical distribution of the estimate and the true component converges to a deterministic measure-valued process governed by a nonlinear PDE. This analysis reveals that the parameter norm obeys a closed-form ODE coupled with the cosine similarity, forming an internal state variable that regulates learning rate, stability, and sensitivity to signal-to-noise ratio (SNR). The resulting dynamics uncover a three-way relationship between the norm, SNR, and optimal step size, and expose a sharp phase transition in steady-state performance. Both theoretically and experimentally, we show that INO-PCA consistently outperforms Oja's algorithm and adapts rapidly in non-stationary environments. Overall, our results demonstrate that relaxing norm constraints can be a principled and effective way to encode and exploit problem-relevant information in online learning algorithms.

Comment: Matches Representation Learning/training dynamics: introduces Implicitly Normalized Online PCA with exact high-dimensional dynamics (PDE/ODE) and performance phase transitions.

Relevance: 9 Novelty: 8

12. SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs

ArXiv ID: 2512.00722

Authors: Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, Guohao Dai

Abstract: In this paper, we point out that the objective of the retrieval algorithms is to align with the LLM, which is similar to the objective of knowledge distillation in LLMs. We analyze the similarity in information focus between the distilled language model (DLM) and the original LLM from the perspective of information theory, and thus propose a novel paradigm that leverages a DLM as the retrieval algorithm. Based on this insight, we present SpeContext, an algorithm and system co-design for long-context reasoning. (1) At the algorithm level, SpeContext proposes a lightweight retrieval head based on the head-level attention weights of the DLM, achieving a > 90% parameter reduction by pruning redundancy. (2) At the system level, SpeContext designs an asynchronous prefetch dataflow via the elastic loading strategy, effectively overlapping KV cache retrieval with the LLM computation. (3) At the compilation level, SpeContext constructs the theoretical memory model and implements an adaptive memory management system to achieve acceleration by maximizing GPU memory utilization. We deploy and evaluate SpeContext in two resource-constrained environments, cloud and edge. Extensive experiments show that, compared with the Huggingface framework, SpeContext achieves up to 24.89x throughput improvement in cloud and 10.06x speedup in edge with negligible accuracy loss, pushing the Pareto frontier of accuracy and throughput.

Comment: Matches High Performance Computing and Efficiency: algorithm–system co-design for long-context LLMs via speculative context sparsity, pruned retrieval heads, asynchronous prefetch, and adaptive memory management.

Relevance: 9 Novelty: 8

13. Efficiently Learning Branching Networks for Multitask Algorithmic Reasoning

ArXiv ID: 2512.01113

Authors: Dongyue Li, Zhenshuo Zhang, Minxuan Duan, Edgar Dobriban, Hongyang R. Zhang

Abstract: Algorithmic reasoning -- the ability to perform step-by-step logical inference -- has become a core benchmark for evaluating reasoning in graph neural networks (GNNs) and large language models (LLMs). Ideally, one would like to design a single model capable of performing well on multiple algorithmic reasoning tasks simultaneously. However, this is challenging when the execution steps of algorithms differ from one another, causing negative interference when they are trained together. We propose branching neural networks, a principled architecture for multitask algorithmic reasoning. Searching for the optimal $k$-ary tree with $L$ layers over $n$ algorithmic tasks is combinatorial, requiring exploration of up to $k^{nL}$ possible structures. We develop AutoBRANE, an efficient algorithm that reduces this search to $O(nL)$ time by solving a convex relaxation at each layer to approximate an optimal task partition. The method clusters tasks using gradient-based affinity scores and can be used on top of any base model, including GNNs and LLMs. We validate AutoBRANE on a broad suite of graph-algorithmic and text-based reasoning benchmarks. We show that gradient features estimate true task performance within 5% error across four GNNs and four LLMs (up to 34B parameters). On the CLRS benchmark, it outperforms the strongest single multitask GNN by 3.7% and the best baseline by 1.2%, while reducing runtime by 48% and memory usage by 26%. The learned branching structures reveal an intuitively reasonable hierarchical clustering of related algorithms. On three text-based graph reasoning benchmarks, AutoBRANE improves over the best non-branching multitask baseline by 3.2%. Finally, on a large graph dataset with 21M edges and 500 tasks, AutoBRANE achieves a 28% accuracy gain over existing multitask and branching architectures, along with a 4.5$\times$ reduction in runtime.

Comment: Matches Model Architecture and Efficiency: branching neural networks for multitask reasoning with efficient convex-relaxed structure search (dynamic/conditional computation).

Relevance: 9 Novelty: 8

14. Solving Neural Min-Max Games: The Role of Architecture, Initialization & Dynamics

ArXiv ID: 2512.00389

Authors: Deep Patel, Emmanouil-Vasileios Vlatakis-Gkaragkounis

Abstract: Many emerging applications - such as adversarial training, AI alignment, and robust optimization - can be framed as zero-sum games between neural nets, with von Neumann-Nash equilibria (NE) capturing the desirable system behavior. While such games often involve non-convex non-concave objectives, empirical evidence shows that simple gradient methods frequently converge, suggesting a hidden geometric structure. In this paper, we provide a theoretical framework that explains this phenomenon through the lens of hidden convexity and overparameterization. We identify sufficient conditions - spanning initialization, training dynamics, and network width - that guarantee global convergence to a NE in a broad class of non-convex min-max games. To our knowledge, this is the first such result for games that involve two-layer neural networks. Technically, our approach is twofold: (a) we derive a novel path-length bound for the alternating gradient descent-ascent scheme in min-max games; and (b) we show that the reduction from a hidden convex-concave geometry to a two-sided Polyak-Łojasiewicz (PŁ) min-max condition holds with high probability under overparameterization, using tools from random matrix theory.

Comment: Training Dynamics/Theory: proves global convergence to Nash equilibria in nonconvex min-max games with two-layer nets via hidden convexity and overparameterization.

Relevance: 9 Novelty: 8

15. Tuning Universality in Deep Neural Networks

ArXiv ID: 2512.00168

Authors: Arsham Ghavasieh

Abstract: Deep neural networks (DNNs) exhibit crackling-like avalanches whose origin lacks a mechanistic explanation. Here, I derive a stochastic theory of deep information propagation (DIP) by incorporating Central Limit Theorem (CLT)-level fluctuations. Four effective couplings $(r, h, D_1, D_2)$ characterize the dynamics, yielding a Landau description of the static exponents and a Directed Percolation (DP) structure of activity cascades. Tuning the couplings selects between avalanche dynamics generated by a Brownian Motion (BM) in a logarithmic trap and an absorbed free BM, each corresponding to a distinct universality class. Numerical simulations confirm the theory and demonstrate that activation function design controls the collective dynamics in random DNNs.

Comment: Training Dynamics/Representation Theory: stochastic deep information propagation linking activation design to universality classes and avalanche dynamics.

Relevance: 9 Novelty: 8

16. Constructing Efficient Fact-Storing MLPs for Transformers

ArXiv ID: 2512.00207

Authors: Owen Dugan, Roberto Garcia, Ronny Junkins, Jerry Liu, Dylan Zinsley, Sabri Eyuboglu, Atri Rudra, Chris Ré

Abstract: The success of large language models (LLMs) can be attributed in part to their ability to efficiently store factual knowledge as key-value mappings within their MLP parameters. Recent work has proposed explicit weight constructions to build such fact-storing MLPs, providing an improved understanding of LLM fact storage mechanisms. In this paper, we introduce an MLP construction framework that improves over previous constructions in three areas: it 1) works for all but a measure-zero set of feasible input-output pairs, 2) achieves asymptotically optimal parameter efficiency matching information-theoretic bounds for some embeddings, and 3) maintains usability within Transformers for factual recall. Through our improvements, we 1) discover a metric on value embeddings that characterizes facts-per-parameter scaling for both constructed and gradient-descent-trained MLPs, 2) identify a simple encoder-decoder mechanism that empirically matches gradient-descent MLP facts-per-parameter asymptotics across all the inputs and outputs we test, and 3) uncover a fundamental tradeoff between an MLP's fact-storage capacity and its usability within Transformers. Finally, we demonstrate a proof-of-concept application of fact-storing MLPs: modular fact editing on one-layer Transformers by \textit{replacing entire MLPs at once}.

Comment: Representation Learning/Model Architecture: explicit constructions of fact-storing MLPs with asymptotically optimal facts-per-parameter and analysis of encoder–decoder mechanisms within Transformers.

Relevance: 9 Novelty: 8

17. AlignSAE: Concept-Aligned Sparse Autoencoders

ArXiv ID: 2512.02004

Authors: Minglai Yang, Xinyu Guo, Mihai Surdeanu, Liangming Pan

Abstract: Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a defined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable "concept swaps", by targeting single, semantically aligned slots.

Comment: Representation Learning: introduces concept-aligned Sparse Autoencoders with supervised post-training to bind ontology concepts to sparse latent slots enabling causal interventions.

Relevance: 9 Novelty: 8

18. Less is More: Resource-Efficient Low-Rank Adaptation

ArXiv ID: 2512.00878

Authors: Chunlin Tian, Xuyang Wei, Huanrong Liu, Zhijiang Guo, Li Li

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), but it still incurs notable overhead and suffers from parameter interference in complex datasets. While recent works decouple LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA employs a unified A matrix across all transformer layers and introduces a runtime selective update of the B matrices to dynamically trade off the system resource budget against model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demonstrating improved efficiency and robustness.

Comment: Model Compression/Efficiency: EffiLoRA shares A across layers and selectively updates B at runtime to reduce PEFT cost while retaining performance.

Relevance: 9 Novelty: 7
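
To make the parameter saving from a shared A matrix concrete, here is a minimal counting sketch. It assumes one adapted weight per layer with A of shape rank x d_in and B of shape d_out x rank; the function names, the layer sizes, and the top-k rule in `select_b_updates` are illustrative assumptions, not EffiLoRA's actual selection criterion.

```python
def lora_trainable_params(num_layers, d_in, d_out, rank, share_a=False):
    """Count trainable adapter parameters, assuming one adapted weight per layer.

    Standard LoRA: every layer owns its own A (rank x d_in) and B (d_out x rank).
    Shared-A variant: a single A is reused by all layers; B stays per-layer.
    """
    a_params = rank * d_in
    b_params = d_out * rank
    if share_a:
        return a_params + num_layers * b_params
    return num_layers * (a_params + b_params)


def select_b_updates(layer_scores, budget):
    """Hypothetical runtime rule: update only the `budget` highest-scoring B matrices.

    Ties are broken by layer index; the returned layer indices are sorted.
    """
    ranked = sorted(range(len(layer_scores)), key=lambda i: (-layer_scores[i], i))
    return sorted(ranked[:budget])


# Illustrative sizes only (32 layers, hidden size 4096, rank 8).
standard = lora_trainable_params(32, 4096, 4096, 8)
shared = lora_trainable_params(32, 4096, 4096, 8, share_a=True)
print(standard, shared)
print(select_b_updates([0.1, 0.9, 0.4, 0.7], budget=2))
```

With square weights, sharing A removes close to half of the adapter parameters, and updating only a subset of B matrices lowers the per-step cost further.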

19. Morphling: Fast, Fused, and Flexible GNN Training at Scale

ArXiv ID: 2512.01678

Authors: Anubhab, Rupesh Nasre

Abstract: Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. The results show that Morphling improves per-epoch training throughput by an average of 20X on CPUs and 19X on GPUs over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.

Comment: High-Performance Computing: domain-specific code synthesis and sparsity-aware runtime for scalable, fused GNN training across CPU/GPU/MPI backends.

Relevance: 9 Novelty: 7

20. Upcycled and Merged MoE Reward Model for Mitigating Reward Hacking

ArXiv ID: 2512.00724

Authors: Lingling Fu

Abstract: Reward models play a critical role in Reinforcement Learning from Human Feedback (RLHF) by assessing the consistency between generated outputs and human preferences. However, conventional reward models are prone to reward hacking or over-optimization, where the policy exploits shortcut patterns to obtain high reward scores that do not reflect true human preference. Although Mixture-of-Experts (MoE)-based reward models can enhance discriminative capability, they typically introduce substantial computational overhead. To address these challenges, we propose an upcycle and merge MoE reward modeling approach. We first upcycle a dense reward model into a MoE architecture, where a shared expert captures general knowledge, while normal experts specialize in instruction-specific patterns. We then apply routing-weight normalization and merge experts back into a dense model through a learnable weight-averaging mechanism, preserving performance gains while significantly reducing inference cost. Experimental results demonstrate that our method effectively mitigates reward hacking across various model scales. Our work highlights the potential of upcycle and merge MoE structures for improving both robustness and efficiency of RLHF reward models.

Comment: Model Architecture: Mixture-of-Experts reward model upcycled from dense and merged back for efficient inference; addresses robustness to reward hacking.

Relevance: 9 Novelty: 7
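
The merge step can be pictured as a weighted average of expert weight matrices. The sketch below is a simplification under stated assumptions: all experts share one shape, and the merge coefficients are a softmax over given logits (in the paper they are learnable; here they are fixed inputs). `merge_experts` and `merge_logits` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax, so the merge coefficients sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(expert_weights, merge_logits):
    """Collapse same-shaped expert matrices into one dense matrix.

    expert_weights: list of matrices (lists of rows), one per expert.
    merge_logits:   one scalar per expert (assumed given, not trained here).
    """
    coeffs = softmax(merge_logits)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for coeff, weight in zip(coeffs, expert_weights):
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += coeff * weight[i][j]
    return merged


# Two toy 1x2 experts with equal logits average to their midpoint.
print(merge_experts([[[1.0, 2.0]], [[3.0, 4.0]]], [0.0, 0.0]))
```

Averaging weights is not the same as averaging expert outputs once nonlinearities are involved, which is presumably why the coefficients are trained rather than fixed.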

21. From Coefficients to Directions: Rethinking Model Merging with Directional Alignment

ArXiv ID: 2512.00391

Authors: Zhikang Chen, Sen Cui, Deheng Ye, Min Zhang, Gang Niu, Yu Zhang, Masashi Sugiyama, Tingting Zhu

Abstract: Model merging has emerged as a practical paradigm for integrating multiple independently trained models into a single model without joint retraining. Previous studies have demonstrated the effectiveness of combining parameters through strategies such as parameter decomposition, coefficient optimization, and subspace learning, significantly reducing the need for expensive joint training and achieving strong empirical performance across diverse tasks. However, these approaches predominantly treat merging as a problem of parameter space decomposition or fusion coefficient optimization, while overlooking the critical role of directional information in both parameter and feature spaces. In practice, naïve merging introduces inconsistencies in dominant parameter directions and disrupts structural coherence across models, which can degrade performance. Moreover, coefficient-based optimization methods implicitly assume compatible feature-space directions across models. However, Neural Collapse indicates that class features follow structured directional patterns, which may differ across independently trained models, making coefficient optimization alone insufficient. In this work, we emphasize the importance of directional alignment and introduce a unified geometric framework, Merging with Directional Alignment, which aligns directional structures consistently in both the parameter and feature spaces. Our analysis shows that directional alignment improves structural coherence, and extensive experiments across benchmarks, model scales, and task configurations further validate the effectiveness of our approach.

Comment: Matches Model Architecture and Representation Learning: introduces directional alignment across parameter and feature spaces (leveraging Neural Collapse) for principled model merging.

Relevance: 8 Novelty: 8

22. ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation

ArXiv ID: 2512.01457

Authors: Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, Sergey Levine

Abstract: Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility, which is a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.

Comment: Matches Efficiency/Test‑time Scaling: zero‑overhead reward and cost prediction from unused logits enables adaptive inference and compute allocation.

Relevance: 8 Novelty: 8
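
The sampling utility described in the abstract can be sketched as follows. This is an illustration under assumptions the abstract does not state: samples are treated as independent, each joint distribution is a small dictionary keyed by (reward, remaining_length), latency is taken to be the longest remaining length, and the cost terms enter with negative weights `alpha` and `beta`. All names are hypothetical.

```python
def expected_max(distributions):
    """E[max_i X_i] for independent discrete variables given as {value: prob} dicts."""
    values = sorted({v for dist in distributions for v in dist})
    expectation, prev_cdf = 0.0, 0.0
    for v in values:
        # P(max <= v) is the product of the per-variable CDFs at v.
        cdf = 1.0
        for dist in distributions:
            cdf *= sum(p for x, p in dist.items() if x <= v)
        expectation += v * (cdf - prev_cdf)
        prev_cdf = cdf
    return expectation


def marginal(joint, axis):
    """Marginalise a {(reward, length): prob} joint onto one coordinate."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out


def sampling_utility(joints, alpha, beta):
    """Expected best reward minus weighted total compute and latency."""
    rewards = [marginal(j, 0) for j in joints]
    lengths = [marginal(j, 1) for j in joints]
    exp_best_reward = expected_max(rewards)
    exp_compute = sum(sum(n * p for n, p in d.items()) for d in lengths)
    exp_latency = expected_max(lengths)
    return exp_best_reward - alpha * exp_compute - beta * exp_latency


# Toy case: each sample either succeeds quickly or fails slowly.
joint = {(1, 10): 0.5, (0, 20): 0.5}
one = sampling_utility([joint], alpha=0.01, beta=0.01)
two = sampling_utility([joint, joint], alpha=0.01, beta=0.01)
print(one, two)
```

Comparing the utility for one versus two samples is the kind of decision a meta-action would make: a second sample raises the expected best reward but also adds compute and latency.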

23. SVRG and Beyond via Posterior Correction

ArXiv ID: 2512.01930

Authors: Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan

Abstract: Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections, but have seen limited success in deep learning. Here, we show surprising new foundational connections of SVRG to a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: First, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.

Comment: Matches Training Efficiency/HPC: connects SVRG to Bayesian posterior correction and derives Hessian‑ and Adam‑like SVRG variants improving deep model training.

Relevance: 8 Novelty: 8
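
For readers unfamiliar with the gradient correction that SVRG relies on, the sketch below shows the classic SVRG update on a scalar least-squares problem. It illustrates only standard SVRG, not the paper's posterior-correction view or its Newton-like and Adam-like variants; the toy data, step size, and loop lengths are arbitrary choices.

```python
import random


def svrg_least_squares(a, b, step=0.01, epochs=40, inner_steps=100, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 over a scalar w with SVRG."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Gradient of the i-th term 0.5 * (a[i] * w - b[i])**2 with respect to w.
        return a[i] * (a[i] * w - b[i])

    w = 0.0
    for _ in range(epochs):
        # Outer loop: freeze a snapshot and compute its full-batch gradient once.
        snapshot = w
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Corrected stochastic gradient: unbiased, and its variance shrinks
            # as both w and the snapshot approach the minimiser.
            g = grad_i(w, i) - grad_i(snapshot, i) + full_grad
            w -= step * g
    return w


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.3]
w_hat = svrg_least_squares(a, b)
w_star = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)  # closed form
print(w_hat, w_star)
```

The term `- grad_i(snapshot, i) + full_grad` is the correction: it has zero mean over `i`, so the update stays unbiased while the noise from sampling a single term is largely cancelled.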

24. Does Flatness imply Generalization for Logistic Loss in Univariate Two-Layer ReLU Network?

ArXiv ID: 2512.01473

Authors: Dan Qiao, Yu-Xiang Wang

Abstract: We consider the problem of generalization of arbitrarily overparameterized two-layer ReLU Neural Networks with univariate input. Recent work showed that under square loss, flat solutions (motivated by flat / stable minima and Edge of Stability phenomenon) provably cannot overfit, but it remains unclear whether the same phenomenon holds for logistic loss. This is a puzzling open problem because existing work on logistic loss shows that gradient descent with increasing step size converges to interpolating solutions (at infinity, for the margin-separable cases). In this paper, we prove that flatness-implied generalization is more delicate under logistic loss. On the positive side, we show that flat solutions enjoy near-optimal generalization bounds within a region between the left-most and right-most "uncertain" sets determined by each candidate solution. On the negative side, we show that there exist arbitrarily flat yet overfitting solutions at infinity that are (falsely) certain everywhere, thus certifying that flatness alone is insufficient for generalization in general. We demonstrate the effects predicted by our theory in a well-controlled simulation study.

Comment: Matches Representation Learning/Training Dynamics theory: analyzes when flatness implies generalization for logistic loss in 2‑layer ReLU nets.

Relevance: 8 Novelty: 8

25. Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves

ArXiv ID: 2512.00242

Authors: Alessio Borgi, Fabrizio Silvestri, Pietro Liò

Abstract: Sheaf Neural Networks equip graph structures with a cellular sheaf: a geometric structure which assigns local vector spaces (stalks) and learnable linear restriction/transport maps to nodes and edges, yielding an edge-aware inductive bias that handles heterophily and limits oversmoothing. However, common Neural Sheaf Diffusion implementations rely on SVD-based sheaf normalization and dense per-edge restriction maps, which scale with stalk dimension, require frequent Laplacian rebuilds, and yield brittle gradients. To address these limitations, we introduce Polynomial Neural Sheaf Diffusion (PolyNSD), a new sheaf diffusion approach whose propagation operator is a degree-K polynomial in a normalised sheaf Laplacian, evaluated via a stable three-term recurrence on a spectrally rescaled operator. This provides an explicit K-hop receptive field in a single layer (independently of the stalk dimension), with a trainable spectral response obtained as a convex mixture of K+1 orthogonal polynomial basis responses. PolyNSD enforces stability via convex mixtures, spectral rescaling, and residual/gated paths, reaching new state-of-the-art results on both homophilic and heterophilic benchmarks, inverting the Neural Sheaf Diffusion trend by obtaining these results with just diagonal restriction maps, decoupling performance from large stalk dimension, while reducing runtime and memory requirements.

Comment: Model Architecture: Polynomial Neural Sheaf Diffusion introduces stable spectral filtering on sheaf Laplacians with diagonal restriction maps, improving scalability and stability.

Relevance: 8 Novelty: 8
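
The three-term recurrence on a spectrally rescaled operator can be illustrated with Chebyshev polynomials. The sketch makes several assumptions beyond the abstract: the basis is Chebyshev (the abstract only says orthogonal polynomials), the operator is an ordinary normalised graph Laplacian with scalar features rather than a sheaf Laplacian, its spectrum lies in [0, lambda_max] with lambda_max = 2, and the convex mixture weights are given rather than learned.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def chebyshev_filter(laplacian, x, mixture, lambda_max=2.0):
    """Apply sum_k mixture[k] * T_k(L_tilde) to x, with L_tilde = 2 L / lambda_max - I.

    Uses T_0 = I, T_1 = L_tilde, T_k = 2 * L_tilde * T_{k-1} - T_{k-2}.
    `mixture` must be a convex combination (non-negative, summing to 1).
    """
    if min(mixture) < 0.0 or abs(sum(mixture) - 1.0) > 1e-9:
        raise ValueError("mixture must be non-negative and sum to 1")

    def rescaled(v):
        # Maps the spectrum from [0, lambda_max] to [-1, 1].
        lv = matvec(laplacian, v)
        return [2.0 * a / lambda_max - b for a, b in zip(lv, v)]

    t_prev = list(x)                      # T_0 x
    out = [mixture[0] * v for v in t_prev]
    if len(mixture) == 1:
        return out
    t_curr = rescaled(x)                  # T_1 x
    out = [o + mixture[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(mixture)):
        lt = rescaled(t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]
        out = [o + mixture[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Normalised Laplacian of a single edge; a degree-2 filter with uniform weights.
laplacian = [[1.0, -1.0], [-1.0, 1.0]]
filtered = chebyshev_filter(laplacian, [1.0, 0.0], [1 / 3, 1 / 3, 1 / 3])
print(filtered)
```

A degree-K mixture needs only K operator applications, which is how a single layer can reach K hops without forming powers of the Laplacian explicitly.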

26. Generative Modeling with Continuous Flows: Sample Complexity of Flow Matching

ArXiv ID: 2512.01286

Authors: Mudit Gaur, Prashant Trivedi, Shuchin Aeron, Amrit Singh Bedi, George K. Atia, Vaneet Aggarwal

Abstract: Flow matching has recently emerged as a promising alternative to diffusion-based generative models, offering faster sampling and simpler training by learning continuous flows governed by ordinary differential equations. Despite growing empirical success, the theoretical understanding of flow matching remains limited, particularly in terms of sample complexity results. In this work, we provide the first analysis of the sample complexity for flow-matching based generative models without assuming access to the empirical risk minimizer (ERM) of the loss function for estimating the velocity field. Under standard assumptions on the loss function for velocity field estimation and boundedness of the data distribution, we show that, with $\mathcal{O}(\epsilon^{-4})$ samples, a sufficiently expressive neural network can learn a velocity field such that the Wasserstein-2 distance between the learned and the true distribution is less than $\mathcal{O}(\epsilon)$. The key technical idea is to decompose the velocity field estimation error into neural-network approximation error, statistical error due to the finite sample size, and optimization error due to the finite number of optimization steps for estimating the velocity field. Each of these terms is then handled via techniques that may be of independent interest.

Comment: Theory for Generative Modeling: first sample complexity bounds for flow matching by decomposing approximation/statistical/optimization errors to guarantee W2 convergence.

Relevance: 8 Novelty: 8

27. An RKHS Perspective on Tree Ensembles

ArXiv ID: 2512.00397

Authors: Mehdi Dagdoug, Clement Dombry, Jean-Jil Duchamps

Abstract: Random Forests and Gradient Boosting are among the most effective algorithms for supervised learning on tabular data. Both belong to the class of tree-based ensemble methods, where predictions are obtained by aggregating many randomized regression trees. In this paper, we develop a theoretical framework for analyzing such methods through Reproducing Kernel Hilbert Spaces (RKHSs) constructed on tree ensembles -- more precisely, on the random partitions generated by randomized regression trees. We establish fundamental analytical properties of the resulting Random Forest kernel, including boundedness, continuity, and universality, and show that a Random Forest predictor can be characterized as the unique minimizer of a penalized empirical risk functional in this RKHS, providing a variational interpretation of ensemble learning. We further extend this perspective to the continuous-time formulation of Gradient Boosting introduced by Dombry and Duchamps, and demonstrate that it corresponds to a gradient flow on a Hilbert manifold induced by the Random Forest RKHS. A key feature of this framework is that both the kernel and the RKHS geometry are data-dependent, offering a theoretical explanation for the strong empirical performance of tree-based ensembles. Finally, we illustrate the practical potential of this approach by introducing a kernel principal component analysis built on the Random Forest kernel, which enhances the interpretability of ensemble models, as well as GVI, a new geometric variable importance criterion.

Comment: Representation Learning/Theory: RKHS framework for tree ensembles with variational interpretation and gradient flow on a data-dependent Hilbert manifold.

Relevance: 8 Novelty: 8

28. SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

ArXiv ID: 2512.00466

Authors: Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li, Pengfei Liu

Abstract: Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform resource distribution across all reasoning sub-problems, creating fundamental bottlenecks where challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates performance bottlenecks where additional computational resources yield diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that selectively allocates computational resources based on sub-problem difficulty. SCALE operates through four stages: (1) problem decomposition into sequential reasoning sub-problems, (2) difficulty assessment of each sub-problem to distinguish between routine operations and computationally challenging sub-problems, (3) selective processing mode assignment between System 1 for simple sub-problems and System 2 for complex ones, and (4) sequential execution with context propagation. By concentrating resources on challenging sub-problems while processing routine operations efficiently, SCALE achieves substantial performance improvements with superior resource utilization. Extensive experiments demonstrate that SCALE significantly outperforms uniform scaling baselines, achieving accuracy improvements of up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, representing a major advance in test-time scaling that addresses fundamental limitations of current approaches.

Comment: Matches Efficiency/test-time compute: selective resource allocation for reasoning sub-problems (dynamic routing between fast/slow processing) to improve cost–accuracy trade-offs.