Previous Day 2025-12-02
Monthly Overview 2025-12
Next Day 2025-12-04

Personalized Daily ArXiv Papers 2025-12-03

[gpt-5] Prompt Completion Total
Token 37964 33415 71379
Cost $0.05 $0.33 $0.38

Total arXiv papers: 550

Total scanned papers: 292

Total relevant papers: 20

Table of contents with paper titles:

  1. Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models Authors: Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz

  2. Reversing Large Language Models for Efficient Training and Fine-Tuning Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber

  3. Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang

  4. Understanding and Harnessing Sparsity in Unified Multimodal Models Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan

  5. FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang

  6. Model Recovery at the Edge under Resource Constraints for Physical AI Authors: Bin Xu, Ayan Banerjee, Sandeep K. S. Gupta

  7. Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation Authors: Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee

  8. A Fully First-Order Layer for Differentiable Optimization Authors: Zihao Zhao, Kai-Chia Mo, Shing-Hei Ho, Brandon Amos, Kai Wang

  9. Enforcing Orderedness to Improve Feature Consistency Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang

  10. ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity Authors: Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu

  11. SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification Authors: Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia

  12. Training Dynamics of Learning 3D-Rotational Equivariance Authors: Max W. Shen, Ewa Nowara, Michael Maser, Kyunghyun Cho

  13. Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting Authors: Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen

  14. VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen

  15. CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

  16. Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles Authors: Yizhou Zhang, Lun Du

  17. From monoliths to modules: Decomposing transducers for efficient world modelling Authors: Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas

  18. Embedding networks with the random walk first return time distribution Authors: Vedanta Thapar, Renaud Lambiotte, George T. Cantwell

  19. In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs Authors: Vishnu Sarukkai, Asanshay Gupta, James Hong, Micha\"el Gharbi, Kayvon Fatahalian

  20. Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents Authors: Haozhuo Zheng, Cheng Wang, Yang Liu


1. Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

ArXiv ID: 2512.02636

Authors: Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz

Abstract: Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.

Comment: Matches Model Compression and Efficiency: joint distillation for flow-based models yielding few-step sampling and tractable likelihood with orders-of-magnitude fewer NFEs.

Relevance: 10 Novelty: 9


2. Reversing Large Language Models for Efficient Training and Fine-Tuning

ArXiv ID: 2512.02056

Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber

Abstract: Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Different from standard, baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, relieving the need to store activations. This property allows for a drastic reduction in memory consumption, allowing for the processing of larger batch sizes for the same available memory, thereby offering improved throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.

Comment: HPC/Memory Optimization + Model Architecture: reversible LLM layers enable backprop without storing activations; includes conversion of pretrained models into reversible form.

Relevance: 10 Novelty: 9


3. Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

ArXiv ID: 2512.02185

Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang

Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.

Comment: Structured pruning tailored for reasoning LLMs using self-generated traces and decode-only gradients—directly matches Compression/Efficiency (pruning).

Relevance: 10 Novelty: 8


4. Understanding and Harnessing Sparsity in Unified Multimodal Models

ArXiv ID: 2512.02351

Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan

Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at \href{https://github.com/Shwai-He/SparseUnifiedModel}{this link}.

Comment: Compression/Efficiency + Model Architecture: training-free pruning analysis and MoE-based sparse activation to reduce active parameters in unified multimodal models.

Relevance: 10 Novelty: 8


5. FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization

ArXiv ID: 2512.02901

Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang

Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.

Comment: Compression/Efficiency: widely-linear transformation and phase-aware ultra-low-bit quantization enabling 2-bit QAT while reusing pretrained checkpoints.

Relevance: 10 Novelty: 8


6. Model Recovery at the Edge under Resource Constraints for Physical AI

ArXiv ID: 2512.02283

Authors: Bin Xu, Ayan Banerjee, Sandeep K. S. Gupta

Abstract: Model Recovery (MR) enables safe, explainable decision making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is hindered by the iterative nature of neural ordinary differential equations (NODEs), which are inefficient on FPGAs. Memory and energy consumption are the main concerns when applying MR on edge devices for real-time operation. We propose MERINDA, a novel FPGA-accelerated MR framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy at fixed accuracy, highlighting MERINDA's suitability for resource-constrained, real-time MCAS.

Comment: Matches Model Compression and Efficiency + HPC: replaces iterative NODE solvers with a parallelizable neural architecture on FPGA, reducing memory/energy and runtime.

Relevance: 9 Novelty: 8


7. Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation

ArXiv ID: 2512.02441

Authors: Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee

Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.

Comment: Matches Model Compression and Efficiency: low-rank, basis-oriented parameter-efficient transfer with orthogonal task-informed subspaces; strong PEFT contribution.

Relevance: 9 Novelty: 8


8. A Fully First-Order Layer for Differentiable Optimization

ArXiv ID: 2512.02494

Authors: Zihao Zhao, Kai-Chia Mo, Shing-Hei Ho, Brandon Amos, Kai Wang

Abstract: Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{\oo}(1)$ time, leading to an overall complexity of $\tilde{\oo}(\delta^{-1}\epsilon^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. Our code is available here: https://github.com/guaguakai/FFOLayer.

Comment: High Performance Computing/Efficiency: a fully first-order differentiable optimization layer avoiding Hessian solves, reducing compute and memory with non-asymptotic guarantees.

Relevance: 9 Novelty: 8


9. Enforcing Orderedness to Improve Feature Consistency

ArXiv ID: 2512.02194

Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang

Abstract: Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.

Comment: Proposes Ordered Sparse Autoencoders resolving permutation non-identifiability—strong match to Autoencoders and Representation Learning.

Relevance: 9 Novelty: 7


10. ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity

ArXiv ID: 2512.02403

Authors: Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu

Abstract: Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W, and improves attention-level energy efficiency by 2.95x and 2.26x over SOTA attention accelerators SpAtten and Sanger, respectively.

Comment: Designs a sparse Transformer accelerator using local similarity and HLog quantization—strong match to Compression/Efficiency and systems-level HPC for Transformers.

Relevance: 9 Novelty: 7


11. SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

ArXiv ID: 2512.02337

Authors: Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia

Abstract: Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.

Comment: Self-speculative decoding with partial KV verification to accelerate long-context generation—matches HPC/Efficiency (decoding acceleration, KV optimization).

Relevance: 9 Novelty: 7


12. Training Dynamics of Learning 3D-Rotational Equivariance

ArXiv ID: 2512.02303

Authors: Max W. Shen, Ewa Nowara, Michael Maser, Kyunghyun Cho

Abstract: While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation to 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to $\leq$2\% held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant ``efficiency gap'' is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.

Comment: Analyzes training dynamics of learning 3D-rotational equivariance with a principled equivariance error measure—matches Representation Learning and training dynamics.

Relevance: 9 Novelty: 7


13. Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting

ArXiv ID: 2512.02061

Authors: Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen

Abstract: Multivariate time series forecasts are widely used, such as industrial, transportation and financial forecasts. However, the dominant frequencies in time series may shift with the evolving spectral distribution of the data. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in frequency coverage imbalance issue. Specifically, too few experts can lead to the overlooking of critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data's frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of experts. Additionally, to prevent noise introduction from direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency domain features, further optimizing the feature representation. The experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.

Comment: Model Architecture: proposes an adaptive Mixture-of-Experts that selects the number of experts via spectral cues; MoE innovation squarely fits core topics.

Relevance: 9 Novelty: 7


14. VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

ArXiv ID: 2512.02700

Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen

Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9\% pruning rate, while delivering an end-to-end inference speedup.

Comment: Model Compression/Efficiency: training-free token pruning with spatial sparsity buffering and redundancy-aware selection for VLMs.

Relevance: 9 Novelty: 7


15. CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

ArXiv ID: 2512.02551

Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used {\it torch.matmul} to state-of-the-art Nvidia's closed-source libraries, i.e., {\it cuBLAS}, {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0\% over {\it torch.matmul} on average; +19.2\% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8\% over {\it cuBLASLt-heuristic}, which queries {\it cuBLASLt} library and selects the algorithm based on the heuristic's suggestion; and +11.4\% over the most competitive {\it cuBLASLt-AutoTuning} model, which selects the fastest algorithm from up to 100 candidates from {\it cuBLASLt}'s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7\%, +26.0\%, +22.4\%, and +15.9\% for {\it torch.matmul}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning} respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2

Comment: LLM+RL-driven optimization of HGEMM CUDA kernels surpassing cuBLAS—matches HPC with novel systems-level kernel optimization.

Relevance: 8 Novelty: 8


16. Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles

ArXiv ID: 2512.02409

Authors: Yizhou Zhang, Lun Du

Abstract: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that \textbf{static pruning induces a bounded operator and therefore cannot change the spectral tail exponent}; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes \textbf{time-dependent data curation}, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.

Comment: Representation Learning/Training Dynamics: spectral-operator analysis of data curation showing limits of static pruning and acceleration via time-dependent reweighting.

Relevance: 8 Novelty: 8


17. From monoliths to modules: Decomposing transducers for efficient world modelling

ArXiv ID: 2512.02193

Authors: Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas

Abstract: World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. Although realistic world models often have high computational demands, efficient modelling is usually possible by exploiting the fact that real-world scenarios tend to involve subcomponents that interact in a modular manner. In this paper, we explore this idea by developing a framework for decomposing complex world models represented by transducers, a class of models generalising POMDPs. Whereas the composition of transducers is well understood, our results clarify how to invert this process, deriving sub-transducers operating on distinct input-output subspaces, enabling parallelizable and interpretable alternatives to monolithic world modelling that can support distributed inference. Overall, these results lay a groundwork for bridging the structural transparency demanded by AI safety and the computational efficiency required for real-world inference.

Comment: Matches Model Architecture/HPC: formal decomposition of transducer-based world models into modular sub-transducers enabling parallelizable, interpretable, distributed inference.

Relevance: 8 Novelty: 7


18. Embedding networks with the random walk first return time distribution

ArXiv ID: 2512.02694

Authors: Vedanta Thapar, Renaud Lambiotte, George T. Cantwell

Abstract: We propose the first return time distribution (FRTD) of a random walk as an interpretable and mathematically grounded node embedding. The FRTD assigns a probability mass function to each node, allowing us to define a distance between any pair of nodes using standard metrics for discrete distributions. We present several arguments to motivate the FRTD embedding. First, we show that FRTDs are strictly more informative than eigenvalue spectra, yet insufficient for complete graph identification, thus placing FRTD equivalence between cospectrality and isomorphism. Second, we argue that FRTD equivalence between nodes captures structural similarity. Third, we empirically demonstrate that the FRTD embedding outperforms manually designed graph metrics in network alignment tasks. Finally, we show that random networks that approximately match the FRTD of a desired target also preserve other salient features. Together these results demonstrate the FRTD as a simple and mathematically principled embedding for complex networks.

Comment: Matches Representation Learning: proposes first-return-time-distribution-based graph embeddings with theoretical grounding and empirical benefits.

Relevance: 8 Novelty: 7


19. In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs

ArXiv ID: 2512.02543

Authors: Vishnu Sarukkai, Asanshay Gupta, James Hong, Micha\"el Gharbi, Kayvon Fatahalian

Abstract: The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation (training a low cost-student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of $\textit{self-consistency cascades}$ to know when the trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at $\textbf{2.5$\times$ lower cost}$, reducing per-episode costs from \$0.059 to \$0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding \$34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a $\textbf{2$\times$ cost reduction}$ at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.

Comment: Efficiency: training-free in-context distillation with self-consistency cascades to reduce LLM agent inference cost while preserving accuracy.

Relevance: 8 Novelty: 7


20. Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents

ArXiv ID: 2512.02667

Authors: Haozhuo Zheng, Cheng Wang, Yang Liu

Abstract: The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.

Comment: Model Architecture: a Graph VQ-VAE producing high-fidelity discrete latents and an autoregressive Transformer over them for efficient generation.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.