Personalized Daily ArXiv Papers 2025-12-16

[gpt-5]   Prompt   Completion   Total
Token     64560    54894        119454
Cost      $0.08    $0.55        $0.63

Total arXiv papers: 873

Total scanned papers: 555

Total relevant papers: 34

Table of contents with paper titles:

  1. World Models Can Leverage Human Videos for Dexterous Manipulation Authors: Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, Yann LeCun

  2. CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks Authors: Jonathan Wenshøj, Tong Chen, Bob Pepin, Raghavendra Selvan

  3. Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability Authors: Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves

  4. Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings Authors: Yoav Gelberg, Koshi Eguchi, Takuya Akiba, Edoardo Cetin

  5. BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models Authors: Zhengyang Wang, Ziyue Liu, Ruijie Zhang, Avinash Maurya, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

  6. SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping Authors: Yu-Chen Lu, Sheng-Feng Yu, Hui-Hsien Weng, Pei-Shuo Wang, Yu-Fang Hu, Liang Hung-Chun, Hung-Yueh Chiang, Kai-Chiang Wu

  7. Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics Authors: Jingdi Lei, Di Zhang, Soujanya Poria

  8. Improving Recursive Transformers with Mixture of LoRAs Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian

  9. Phase transitions reveal hierarchical structure in deep neural networks Authors: Ibrahim Talha Ersoy, Andrés Fernando Cardozo Licha, Karoline Wiesner

  10. Resting Neurons, Active Insights: Improving Input Sparsification for Large Language Models Authors: Haotian Xu, Tian Gao, Tsui-Wei Weng, Tengfei Ma

  11. Universality of high-dimensional scaling limits of stochastic gradient descent Authors: Reza Gheissari, Aukosh Jagannath

  12. V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim

  13. SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision Authors: Yuseon Choi, Sangjin Kim, Jungjun Oh, Byeongcheol Kim, Hoi-Jun Yoo

  14. D-STEER - Preference Alignment Techniques Learn to Behave, not to Believe -- Beneath the Surface, DPO as Steering Vector Perturbation in Activation Space Authors: Samarth Raina, Saksham Aggarwal, Aman Chadha, Vinija Jain, Amitava Das

  15. Optimized Architectures for Kolmogorov-Arnold Networks Authors: James Bagrow, Josh Bongard

  16. Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity Authors: Yiyang Jia, Chengxu Zhou

  17. Efficient Vision-Language Reasoning via Adaptive Token Pruning Authors: Xue Li, Xiaonan Song, Henry Hu

  18. Exploring the Design Space of Transition Matching Authors: Uriel Singer, Yaron Lipman

  19. CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving Authors: Dong Liu, Yanxuan Yu

  20. Spiking Manifesto Authors: Eugene Izhikevich

  21. Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences Authors: Liviu Aolaritei, Michael I. Jordan

  22. State over Tokens: Characterizing the Role of Reasoning Tokens Authors: Mosh Levy, Zohar Elyoseph, Shauli Ravfogel, Yoav Goldberg

  23. Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry Authors: Behrooz Tahmasebi, Melanie Weber

  24. CORE: Contrastive Masked Feature Reconstruction on Graphs Authors: Jianyuan Bo, Yuan Fang

  25. Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution Authors: Abhinav Kumar, Tristan Aumentado-Armstrong, Lazar Valkov, Gopal Sharma, Alex Levinshtein, Radek Grzeszczuk, Suren Kumar

  26. Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates Authors: Wenjun Yu, Sitian Chen, Cheng Chen, Amelie Chi Zhou

  27. Alada: Alternating Adaptation of Momentum Method for Memory-Efficient Matrix Optimization Authors: Xiaoyu He, Yu Cai, Jin Jia, Canxi Huang, Wenqing Chen, Zibin Zheng

  28. Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing Authors: Zewen Qiang, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu

  29. Wait, Wait, Wait... Why Do Reasoning Models Loop? Authors: Charilaos Pipis, Shivam Garg, Vasilis Kontonis, Vaishnavi Shrivastava, Akshay Krishnamurthy, Dimitris Papailiopoulos

  30. Scalable Formal Verification via Autoencoder Latent Space Abstraction Authors: Robert Reed, Morteza Lahijanian, Luca Laurenti

  31. Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models Authors: Chendong Sun

  32. PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders Authors: Ruixin Guo, Ruoming Jin, Xinyu Li, Yang Zhou

  33. High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees Authors: Elynn Chen, Yuefeng Han, Jiayu Li

  34. DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication Authors: Zehan Zhu, Heng Zhao, Yan Huang, Joey Tianyi Zhou, Shouling Ji, Jinming Xu


1. World Models Can Leverage Human Videos for Dexterous Manipulation

ArXiv ID: 2512.13644

Authors: Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, Yann LeCun

Abstract: Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.

Comment: Author match


2. CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks

ArXiv ID: 2512.12981

Authors: Jonathan Wenshøj, Tong Chen, Bob Pepin, Raghavendra Selvan

Abstract: While joint pruning-quantization is theoretically superior to sequential application, current joint methods rely on auxiliary procedures outside the training loop for finding compression parameters. This reliance adds engineering complexity and hyperparameter tuning, while also lacking a direct data-driven gradient signal, which might result in sub-optimal compression. In this paper, we introduce CoDeQ, a simple, fully differentiable method for joint pruning-quantization. Our approach builds on a key observation: the dead-zone of a scalar quantizer is equivalent to magnitude pruning, and can be used to induce sparsity directly within the quantization operator. Concretely, we parameterize the dead-zone width and learn it via backpropagation, alongside the quantization parameters. This design provides explicit control of sparsity, regularized by a single global hyperparameter, while decoupling sparsity selection from bit-width selection. The result is a method for Compression with Dead-zone Quantizer (CoDeQ) that supports both fixed-precision and mixed-precision quantization (controlled by an optional second hyperparameter). It simultaneously determines the sparsity pattern and quantization parameters in a single end-to-end optimization. Consequently, CoDeQ does not require any auxiliary procedures, making the method architecture-agnostic and straightforward to implement. On ImageNet with ResNet-18, CoDeQ reduces bit operations to ~5% while maintaining close to full precision accuracy in both fixed and mixed-precision regimes.

Comment: Model Compression and Efficiency — end-to-end joint pruning and quantization by learning quantizer dead-zone widths (fully differentiable).

Relevance: 10 Novelty: 9
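
The key observation in the abstract (a quantizer's dead zone acts as magnitude pruning) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the function names, the round-half-up rule, and the fixed `dead_zone`, `step`, and `max_level` values are assumptions for illustration, and the learned, differentiable dead-zone width that CoDeQ trains by backpropagation is not modeled here.

```python
import math

def dead_zone_quantize(weights, dead_zone, step, max_level):
    """Forward pass of a symmetric uniform quantizer with an explicit dead zone.

    Weights with |w| <= dead_zone become exactly zero (magnitude pruning).
    The rest are rounded to the nearest multiple of `step` and clipped to
    at most `max_level` steps. All parameters are fixed here (assumption).
    """
    quantized = []
    for w in weights:
        if abs(w) <= dead_zone:
            quantized.append(0.0)
            continue
        level = min(max_level, math.floor(abs(w) / step + 0.5))  # round half up
        quantized.append(math.copysign(level * step, w) if level else 0.0)
    return quantized

def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)

weights = [0.03, -0.12, 0.4, -0.07, 0.9, 0.01]
narrow = dead_zone_quantize(weights, dead_zone=0.1, step=0.25, max_level=3)
wide = dead_zone_quantize(weights, dead_zone=0.5, step=0.25, max_level=3)
print(narrow, sparsity(narrow))
print(wide, sparsity(wide))
```

Widening the dead zone zeroes more weights without touching `step` or `max_level`, which is the sense in which sparsity selection can be decoupled from bit-width selection.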


3. Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability

ArXiv ID: 2512.13568

Authors: Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves

Abstract: Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.

Comment: Representation Learning: proposes an information-theoretic metric for superposition via sparse autoencoders; connects feature capacity to robustness.

Relevance: 10 Novelty: 9
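
As a rough illustration of turning Shannon entropy into an "effective number of features", the sketch below normalizes activation magnitudes into a distribution and takes the exponential of its entropy. Whether the paper uses exactly this normalization and this entropy-to-count mapping is an assumption; the function name `effective_feature_count` is hypothetical.

```python
import math

def effective_feature_count(activations):
    """Exponential of the Shannon entropy of normalized activation magnitudes.

    Returns 1.0 when a single feature carries all the mass and n when
    n features are equally active. Assumed form, for illustration only.
    """
    total = sum(abs(a) for a in activations)
    if total == 0.0:
        return 0.0
    probs = [abs(a) / total for a in activations if a != 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

print(effective_feature_count([1.0, 1.0, 1.0, 1.0]))   # four equally active features
print(effective_feature_count([5.0, 0.0, 0.0, 0.0]))   # one dominant feature
print(effective_feature_count([4.0, 1.0, 1.0, 0.0]))   # in between
```

Under this reading, a layer whose effective count exceeds its neuron count would be encoding features in superposition.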


4. Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

ArXiv ID: 2512.12167

Authors: Yoav Gelberg, Koshi Eguchi, Takuya Akiba, Edoardo Cetin

Abstract: So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LM). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining, following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.

Comment: Model Architecture/Efficiency — zero-shot context extension by dropping positional embeddings post-training without long-context finetuning.

Relevance: 10 Novelty: 8


5. BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

ArXiv ID: 2512.12131

Authors: Zhengyang Wang, Ziyue Liu, Ruijie Zhang, Avinash Maurya, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

Abstract: The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimal impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46-1.91× speedup over full-rank model baselines and 1.87-2.27× speedup over low-rank models with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.

Comment: High Performance Computing — bottleneck-aware tensor parallelism and system optimizations for low-rank LLMs; also aligns with low-rank efficiency.

Relevance: 10 Novelty: 8


6. SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping

ArXiv ID: 2512.13494

Authors: Yu-Chen Lu, Sheng-Feng Yu, Hui-Hsien Weng, Pei-Shuo Wang, Yu-Fang Hu, Liang Hung-Chun, Hung-Yueh Chiang, Kai-Chiang Wu

Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLMs more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% in accuracy on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.

Comment: Low-rank compression for LLMs via shared projection and block skipping; directly fits Compression/Efficiency (low-rank methods).

Relevance: 10 Novelty: 8
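
The claim that ranks must drop by more than half, and the benefit of a shared projection, both follow from parameter counting. The sketch below assumes each m×n matrix is factored as an m×r factor times an r×n factor, and that matrices reading the same input can share one r×n input-side factor. The helper names, the 4096-dimensional example, and the 50% budget are illustrative assumptions; block skipping is not modeled.

```python
def low_rank_params(m, n, r):
    """Parameters of an (m x r) @ (r x n) factorization of one m x n matrix."""
    return r * (m + n)

def max_rank_separate(m, n, num_mats, budget):
    """Largest rank when every matrix keeps its own pair of factors."""
    return budget // (num_mats * (m + n))

def max_rank_shared(m, n, num_mats, budget):
    """Largest rank when all matrices share one (r x n) input projection."""
    return budget // (n + num_mats * m)

m = n = 4096
num_mats = 3                          # e.g. three projections reading the same input
dense = num_mats * m * n
budget = dense // 2                   # keep 50% of the dense parameters

# A single square matrix only breaks even at r = n / 2.
print(low_rank_params(m, n, n // 2) == m * n)
print(max_rank_separate(m, n, num_mats, budget))
print(max_rank_shared(m, n, num_mats, budget))
```

With these numbers the shared layout affords rank 1536 instead of 1024 at the same parameter budget, which is the kind of rank headroom the abstract describes.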


7. Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

ArXiv ID: 2512.12602

Authors: Jingdi Lei, Di Zhang, Soujanya Poria

Abstract: Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution effectively corresponding to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.

Comment: Model Architecture/Efficiency: exact linear-time attention via continuous-time dynamics (error-free linear attention) with theoretical foundations.

Relevance: 10 Novelty: 8
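
To make the rank-1 argument concrete, the sketch below compares a standard delta-rule step, read as one Euler step of dS/dt = (v - S k) k^T, with the exact solution of that ODE over the same interval. Because k k^T has a single nonzero eigenvalue λ = k·k, the exact update keeps the delta-rule form with the step size β replaced by (1 - exp(-βλ))/λ. This ODE formulation and all names are assumptions based on the abstract's description, not the paper's code.

```python
import math

def delta_step(S, k, v, alpha):
    """S <- S + alpha * (v - S k) k^T, with S stored as a list of rows."""
    Sk = [sum(s * x for s, x in zip(row, k)) for row in S]
    return [[s + alpha * (vi - ski) * kj for s, kj in zip(row, k)]
            for row, vi, ski in zip(S, v, Sk)]

def euler_step(S, k, v, beta):
    """Standard delta rule: a single explicit Euler step of size beta."""
    return delta_step(S, k, v, beta)

def exact_step(S, k, v, beta):
    """Closed-form solution of dS/dt = (v - S k) k^T over an interval beta."""
    lam = sum(x * x for x in k)
    alpha = beta if lam == 0.0 else (1.0 - math.exp(-beta * lam)) / lam
    return delta_step(S, k, v, alpha)

def residual(S, k, v):
    """Largest absolute entry of v - S k (how far the memory is from recalling v)."""
    return max(abs(vi - sum(s * x for s, x in zip(row, k)))
               for row, vi in zip(S, v))

S0 = [[0.0, 0.0], [0.0, 0.0]]
k, v, beta = [1.0, 2.0], [1.0, -1.0], 0.5   # beta * (k . k) = 2.5, a large step

print(residual(euler_step(S0, k, v, beta), k, v))   # Euler overshoots
print(residual(exact_step(S0, k, v, beta), k, v))   # exact step contracts
```

In this toy setting the exact step shrinks the recall error by exp(-βλ) for any step size, whereas the Euler step multiplies it by |1 - βλ|, which exceeds 1 once βλ > 2.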


8. Improving Recursive Transformers with Mixture of LoRAs

ArXiv ID: 2512.12880

Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian

Abstract: Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M-120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.

Comment: Model Architecture and Compression/Efficiency: conditional computation via Mixture of LoRAs inside shared FFN for recursive transformers; expert merging for deployment.

Relevance: 10 Novelty: 8


9. Phase transitions reveal hierarchical structure in deep neural networks

ArXiv ID: 2512.11866

Authors: Ibrahim Talha Ersoy, Andrés Fernando Cardozo Licha, Karoline Wiesner

Abstract: Training Deep Neural Networks relies on the model converging on a high-dimensional, non-convex loss landscape toward a good minimum. Yet, much of the phenomenology of training remains poorly understood. We focus on three seemingly disparate observations: the occurrence of phase transitions reminiscent of statistical physics, the ubiquity of saddle points, and the phenomenon of mode connectivity relevant for model merging. We unify these within a single explanatory framework, the geometry of the loss and error landscapes. We analytically show that phase transitions in DNN learning are governed by saddle points in the loss landscape. Building on this insight, we introduce a simple, fast, and easy-to-implement algorithm that uses the L2 regularizer as a tool to probe the geometry of error landscapes. We apply it to confirm mode connectivity in DNNs trained on the MNIST dataset by efficiently finding paths that connect global minima. We then show numerically that saddle points induce transitions between models that encode distinct digit classes. Our work establishes the geometric origin of key training phenomena in DNNs and reveals a hierarchy of accuracy basins analogous to phases in statistical physics.

Comment: Representation Learning — theoretical link between saddle points, phase transitions, and mode connectivity; introduces a probe of loss geometry.

Relevance: 9 Novelty: 8


10. Resting Neurons, Active Insights: Improving Input Sparsification for Large Language Models

ArXiv ID: 2512.12744

Authors: Haotian Xu, Tian Gao, Tsui-Wei Weng, Tengfei Ma

Abstract: Large Language Models (LLMs) achieve state-of-the-art performance across a wide range of applications, but their massive scale poses significant challenges for both efficiency and interpretability. Structural pruning, which reduces model size by removing redundant computational units such as neurons, has been widely explored as a solution. This study focuses on input sparsification, an increasingly popular technique that improves efficiency by selectively activating only a subset of entry values for each input. However, existing approaches focus primarily on computational savings, often overlooking the representational consequences of sparsification and leaving a noticeable performance gap compared to full models. In this work, we first reinterpret input sparsification as a form of dynamic structural pruning. Motivated by the spontaneous baseline firing rates observed in biological neurons, we introduce a small set of trainable spontaneous neurons that act as compensatory units to stabilize activations in sparsified LLMs. Experiments demonstrate that these auxiliary neurons substantially reduce the sparsification-induced performance gap while generalizing effectively across tasks.

Comment: Proposes input sparsification as dynamic structural pruning with trainable compensatory neurons; fits Model Compression/Efficiency (sparsity/pruning, conditional activation).

Relevance: 9 Novelty: 8


11. Universality of high-dimensional scaling limits of stochastic gradient descent

ArXiv ID: 2512.13634

Authors: Reza Gheissari, Aukosh Jagannath

Abstract: We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE's fixed points are not universal.

Comment: Theoretical universality of high-dimensional SGD scaling limits (ODE/SDE); strong fit to Representation Learning (training dynamics theory).

Relevance: 9 Novelty: 8


12. V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

ArXiv ID: 2512.12284

Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim

Abstract: Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. The iterative prefill stage brings significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves real-time (3.9-8.3 FPS), energy-efficient streaming video LLM inference on edge deployment with negligible accuracy loss. While DRE only accounts for 2.2% power and 2.0% area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.

Comment: Dynamic KV cache retrieval with software–hardware co-design for streaming video LLMs; directly matches Compression/Efficiency (cache optimization) and HPC inference acceleration.

Relevance: 9 Novelty: 8


13. SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision

ArXiv ID: 2512.12930

Authors: Yuseon Choi, Sangjin Kim, Jungjun Oh, Byeongcheol Kim, Hoi-Jun Yoo

Abstract: Low-bit quantization is a promising technique for efficient transformer inference by reducing computational and memory overhead. However, aggressive bitwidth reduction remains challenging due to activation outliers, leading to accuracy degradation. Existing methods, such as outlier-handling and group quantization, achieve high accuracy but incur substantial energy consumption. To address this, we propose SeVeDo, an energy-efficient SVD-based heterogeneous accelerator that structurally separates outlier-sensitive components into a high-precision low-rank path, while the remaining computations are executed in a low-bit residual datapath with group quantization. To further enhance efficiency, Hierarchical Group Quantization (HGQ) combines coarse-grained floating-point scaling with fine-grained shifting, effectively reducing dequantization cost. Also, SVD-guided mixed precision (SVD-MP) statically allocates higher bitwidths to precision-sensitive components identified through low-rank decomposition, thereby minimizing floating-point operation cost. Experimental results show that SeVeDo achieves a peak energy efficiency of 13.8 TOPS/W, surpassing conventional designs, with 12.7 TOPS/W on ViT-Base and 13.4 TOPS/W on Llama2-7B benchmarks.

Comment: Heterogeneous accelerator with hierarchical group quantization and SVD-guided mixed precision; strong fit to Compression/Efficiency and HPC inference.

Relevance: 9 Novelty: 8


14. D-STEER - Preference Alignment Techniques Learn to Behave, not to Believe -- Beneath the Surface, DPO as Steering Vector Perturbation in Activation Space

ArXiv ID: 2512.11838

Authors: Samarth Raina, Saksham Aggarwal, Aman Chadha, Vinija Jain, Amitava Das

Abstract: Direct Preference Optimization (DPO) has become a standard recipe for aligning large language models, yet it is still unclear what kind of change it actually induces inside the network. This paper argues that DPO does not rewrite a model's internal beliefs; instead, it acts as a low-rank steering mechanism that nudges activations along a small number of preference directions. Using a simple derivation, we show that the DPO gradient depends only on the difference between the logit embeddings of preferred and dispreferred completions, implying a first-order shift in the final hidden representation rather than a deep restructuring of semantics. We then extract an empirical steering vector from a DPO-tuned model and demonstrate that adding this vector to base activations reproduces most of the aligned behavior, while subtracting it nearly restores the original model. Finally, spectral analyses reveal rank-one dominance and entropy collapse in upper layers, indicating that alignment is funneled through a narrow subspace. Taken together, these results support a behavioral illusion view of DPO: it teaches models how to act aligned, not what to believe.

Comment: Representation Learning/Training Dynamics: shows DPO acts as a low-rank steering perturbation (rank-1 dominance) in activation space.

Relevance: 9 Novelty: 8
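
The add-and-subtract experiment described in the abstract can be sketched with plain vectors. The abstract does not say how the steering vector is extracted; the mean difference between tuned and base hidden states used below is a common choice and is an assumption here, as are the function names and the toy activations.

```python
def mean_difference(tuned_states, base_states):
    """Average of (tuned - base) hidden states over paired examples."""
    count = len(base_states)
    dim = len(base_states[0])
    return [sum(t[d] - b[d] for t, b in zip(tuned_states, base_states)) / count
            for d in range(dim)]

def steer(hidden, direction, scale=1.0):
    """Shift a hidden state along the steering direction."""
    return [h + scale * d for h, d in zip(hidden, direction)]

# Toy paired activations (hypothetical values, same prompts through both models).
base_states = [[1.0, 0.0], [0.0, 1.0]]
tuned_states = [[1.5, 0.0], [0.5, 1.0]]

direction = mean_difference(tuned_states, base_states)
print(direction)
print(steer(base_states[0], direction))          # base + vector, approximates tuned
print(steer(tuned_states[0], direction, -1.0))   # tuned - vector, approximates base
```

In this toy case the shift is exactly rank one, so adding or subtracting the vector recovers the other model's activations exactly; the paper reports only that most of the behavior is reproduced.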


15. Optimized Architectures for Kolmogorov-Arnold Networks

ArXiv ID: 2512.12448

Authors: James Bagrow, Josh Bongard

Abstract: Efforts to improve Kolmogorov-Arnold networks (KANs) with architectural enhancements have been stymied by the complexity those enhancements bring, undermining the interpretability that makes KANs attractive in the first place. Here we study overprovisioned architectures combined with sparsification to learn compact, interpretable KANs without sacrificing accuracy. Crucially, we focus on differentiable sparsification, turning architecture search into an end-to-end optimization problem. Across function approximation benchmarks, dynamical systems forecasting, and real-world prediction tasks, we demonstrate competitive or superior accuracy while discovering substantially smaller models. Overprovisioning and sparsification are synergistic, with the combination outperforming either alone. The result is a principled path toward models that are both more expressive and more interpretable, addressing a key tension in scientific machine learning.

Comment: Matches Model Compression and Efficiency: differentiable sparsification to learn compact architectures (also Model Architecture for KAN design).

Relevance: 9 Novelty: 7


16. Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity

ArXiv ID: 2512.12713

Authors: Yiyang Jia, Chengxu Zhou

Abstract: Control policies in deep reinforcement learning are often implemented with fixed-capacity multilayer perceptrons trained by backpropagation, which lack structural plasticity and depend on global error signals. This paper introduces the Self-Motivated Growing Neural Network (SMGrNN), a controller whose topology evolves online through a local Structural Plasticity Module (SPM). The SPM monitors neuron activations and edge-wise weight update statistics over short temporal windows and uses these signals to trigger neuron insertion and pruning, while synaptic weights are updated by a standard gradient-based optimizer. This allows network capacity to be regulated during learning without manual architectural tuning. SMGrNN is evaluated on control benchmarks via policy distillation. Compared with multilayer perceptron baselines, it achieves similar or higher returns, lower variance, and task-appropriate network sizes. Ablation studies with growth disabled and growth-only variants isolate the role of structural plasticity, showing that adaptive topology improves reward stability. The local and modular design of SPM enables future integration of a Hebbian plasticity module and spike-timing-dependent plasticity, so that SMGrNN can support both artificial and spiking neural implementations driven by local rules.

Comment: Matches Model Architecture: dynamic/growing neural network with local structural plasticity (neuron insertion/pruning).

Relevance: 9 Novelty: 7


17. Efficient Vision-Language Reasoning via Adaptive Token Pruning

ArXiv ID: 2512.12701

Authors: Xue Li, Xiaonan Song, Henry Hu

Abstract: Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.

Comment: Model Compression and Efficiency — adaptive token pruning at the vision-language interface using attention/similarity-based importance.

Relevance: 9 Novelty: 7
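
The hybrid scoring step described in the ATP abstract above can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function name `hybrid_topk`, the mixing weight `alpha`, and the min-max normalization are all assumptions, since the abstract does not say how the two signals are combined.

```python
def hybrid_topk(cls_attention, text_similarity, k, alpha=0.5):
    """Return the (sorted) indices of the k visual tokens with the highest hybrid score.

    cls_attention:   per-token ViT CLS attention (intra-modal saliency).
    text_similarity: per-token text-image similarity (inter-modal relevance).
    alpha:           hypothetical mixing weight; not specified in the abstract.
    """
    if len(cls_attention) != len(text_similarity):
        raise ValueError("both score lists must cover the same tokens")
    if not 0 < k <= len(cls_attention):
        raise ValueError("k must be between 1 and the number of tokens")

    def minmax(values):
        # Rescale to [0, 1] so the two signals are comparable (an assumption).
        lo, hi = min(values), max(values)
        span = hi - lo
        return [0.0 for _ in values] if span == 0 else [(v - lo) / span for v in values]

    saliency = minmax(cls_attention)
    relevance = minmax(text_similarity)
    scores = [alpha * s + (1 - alpha) * r for s, r in zip(saliency, relevance)]

    # Rank tokens by score, keep the top k, and restore their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


kept = hybrid_topk([0.1, 0.5, 0.2, 0.3], [0.9, 0.1, 0.8, 0.2], k=2, alpha=0.7)
```

Only the tokens at the returned indices would be passed on to the LLM; the backbone itself stays untouched, which matches the abstract's description of ATP as a gating module.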


18. Exploring the Design Space of Transition Matching

ArXiv ID: 2512.12465

Authors: Uriel Singer, Yaron Lipman

Abstract: Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations) we evaluate the effect of the head module architecture and modeling during training as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while at the same time indicating which design choices are not likely to provide further gains.

Comment: Model Architecture: systematic design/training/sampling study of Transition Matching with head–backbone architecture for generative models.

Relevance: 9 Novelty: 7


19. CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

ArXiv ID: 2512.11920

Authors: Dong Liu, Yanxuan Yu

Abstract: Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose CXL-SpecKV, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an FPGA-accelerated KV-cache compression and decompression engine that reduces memory bandwidth requirements by up to 4x. When evaluated on state-of-the-art LLMs, CXL-SpecKV achieves up to 3.2x higher throughput compared to GPU-only baselines, while reducing memory costs by 2.8x and maintaining accuracy. Our system demonstrates that intelligent memory disaggregation combined with speculative execution can effectively address the memory wall challenge in large-scale LLM serving. Our code implementation has been open-sourced at https://github.com/FastLM/CXL-SpecKV.

Comment: High Performance Computing/Efficiency: disaggregated KV-cache over CXL with FPGA acceleration, speculative prefetch, and compression for LLM serving.

Relevance: 9 Novelty: 7


20. Spiking Manifesto

ArXiv ID: 2512.11843

Authors: Eugene Izhikevich

Abstract: Practically everything computers do is better, faster, and more power-efficient than the brain. For example, a calculator crunches numbers more energy-efficiently than any human. Yet AI models are a thousand times less efficient than the brain. These models use artificial neural networks (ANNs) and require GPUs for the multiplication of huge matrices. In contrast, spiking neural networks (SNNs) of the brain have no matrix multiplication and much smaller energy requirements. This manifesto proposes a framework for thinking about popular AI models in terms of spiking networks and polychronization, and for interpreting spiking activity as nature's way of implementing look-up tables. This offers a way to convert AI models into a novel type of architecture with the promise of a thousandfold improvement in efficiency. Code is available at https://github.com/izhikevich/SNN

Comment: Matches Model Architecture/Efficiency: proposes spiking network reinterpretation of ANNs for potential thousandfold efficiency gains.

Relevance: 8 Novelty: 8


21. Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

ArXiv ID: 2512.13123

Authors: Liviu Aolaritei, Michael I. Jordan

Abstract: We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-\alpha$ and is almost surely finite under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.

Comment: Matches Optimization/Training (HPC relevance): anytime-valid stopping rules for SGD via confidence sequences for principled training control.

Relevance: 8 Novelty: 8


22. State over Tokens: Characterizing the Role of Reasoning Tokens

ArXiv ID: 2512.12777

Authors: Mosh Levy, Zohar Elyoseph, Shauli Ravfogel, Yoav Goldberg

Abstract: Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model's actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state -- the sole persistent information carrier across the model's stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text and surfaces previously overlooked research questions on these tokens. We argue that to truly understand the process that LLMs do, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.

Comment: Conceptual framework recasting reasoning tokens as externalized computational state; aligns with Representation Learning/training dynamics.

Relevance: 8 Novelty: 8


23. Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry

ArXiv ID: 2512.11855

Authors: Behrooz Tahmasebi, Melanie Weber

Abstract: Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, achieving exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic averaging complexity. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.

Comment: Representation Learning/Theory: quantifies complexity gap between exact vs approximate symmetry, guiding symmetry-inductive biases.

Relevance: 8 Novelty: 8


24. CORE: Contrastive Masked Feature Reconstruction on Graphs

ArXiv ID: 2512.13235

Authors: Jianyuan Bo, Yuan Fang

Abstract: In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.

Comment: Matches Representation Learning: theoretical link between masked feature reconstruction and contrastive objectives with a unified framework.

Relevance: 8 Novelty: 7
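
The pairing scheme described in the CORE abstract above (positives between a masked node's original and reconstructed features, other masked nodes as negatives) can be illustrated with a generic InfoNCE-style loss. This is a sketch under stated assumptions, not the paper's objective: the cosine similarity, the `temperature` value, and the function names are all illustrative choices, and feature vectors are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_contrastive_loss(original, reconstructed, temperature=0.5):
    """InfoNCE-style loss over masked nodes only (illustrative, not the paper's exact loss).

    original[i] and reconstructed[i] are the raw and reconstructed features
    of the i-th masked node. Pair (i, i) is the positive; pairs (i, j != i)
    with the other masked nodes act as negatives.
    """
    n = len(original)
    total = 0.0
    for i in range(n):
        logits = [cosine(reconstructed[i], original[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


features = [[1.0, 0.0], [0.0, 1.0]]
aligned = masked_contrastive_loss(features, features)
swapped = masked_contrastive_loss(features, features[::-1])
```

In this toy case the loss is small when each reconstruction matches its own node and large when reconstructions are swapped between nodes, which is the behavior a contrastive term of this kind is meant to encourage.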


25. Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution

ArXiv ID: 2512.12898

Authors: Abhinav Kumar, Tristan Aumentado-Armstrong, Lazar Valkov, Gopal Sharma, Alex Levinshtein, Radek Grzeszczuk, Suren Kumar

Abstract: Accurately learning high-frequency signals is a challenge in computer vision and graphics, as neural networks often struggle with these signals due to spectral bias or optimization difficulties. While current techniques like Fourier encodings have made great strides in improving performance, there remains scope for improvement when presented with high-frequency information. This paper introduces Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolution convolves a low-frequency signal with queries (such as coordinates) to enhance the learning of intricate high-frequency signals. We empirically demonstrate that Qonvolutions enhance performance across a variety of high-frequency learning tasks crucial to both the computer vision and graphics communities, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In particular, by combining Gaussian splatting with Qonvolutions for NVS, we showcase state-of-the-art performance on real-world complex scenes, even outperforming powerful radiance field models on image quality.

Comment: Matches Model Architecture: introduces Queried-Convolutions to better learn high-frequency signals.

Relevance: 8 Novelty: 7


26. Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates

ArXiv ID: 2512.12295

Authors: Wenjun Yu, Sitian Chen, Cheng Chen, Amelie Chi Zhou

Abstract: Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter synchronization overheads. Production DLRMs deploy decoupled training/inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak <= 20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers within inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular value monitoring to constrain memory overhead (<2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate update inference contention (P99 latency impact <20ms). Evaluations show LiveUpdate reduces update costs by 2x versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, LiveUpdate delivers online model updates while outperforming state-of-the-art delta-update methods by 0.04% to 0.24% in accuracy.

Comment: Matches High Performance Computing/Efficiency: inference-side low-rank (LoRA) updates and systems optimizations for freshness with minimal overhead.

Relevance: 8 Novelty: 7


27. Alada: Alternating Adaptation of Momentum Method for Memory-Efficient Matrix Optimization

ArXiv ID: 2512.13034

Authors: Xiaoyu He, Yu Cai, Jin Jia, Canxi Huang, Wenqing Chen, Zibin Zheng

Abstract: This work proposes Alada, an adaptive momentum method for stochastic optimization over large-scale matrices. Alada employs a rank-one factorization approach to estimate the second moment of gradients, where factors are updated alternately to minimize the estimation error. Alada achieves sublinear memory overheads and can be readily extended to optimizing tensor-shaped variables. We also equip Alada with a first moment estimation rule, which enhances the algorithm's robustness without incurring additional memory overheads. The theoretical performance of Alada aligns with that of traditional methods such as Adam. Numerical studies conducted on several natural language processing tasks demonstrate the reduction in memory overheads and the robustness in training large models relative to Adam and its variants.

Comment: Matches Model Compression and Efficiency/HPC: memory-efficient optimizer via rank-one factorized second-moment estimation.

Relevance: 8 Novelty: 7


28. Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

ArXiv ID: 2512.13109

Authors: Zewen Qiang, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu

Abstract: Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the "lost in the middle" phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. It means that in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling the attention weight between the initial token and others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6% on the MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4% on KV-Retrieval tasks.

Comment: Representation Learning — analyzes U-shaped attention bias and introduces initial-saliency scaling to improve long-context processing.

Relevance: 8 Novelty: 7


29. Wait, Wait, Wait... Why Do Reasoning Models Loop?

ArXiv ID: 2512.12895

Authors: Charilaos Pipis, Shivam Garg, Vasilis Kontonis, Vaishnavi Shrivastava, Akshay Krishnamurthy, Dimitris Papailiopoulos

Abstract: Reasoning models (e.g., DeepSeek-R1) generate long chains of thought to solve harder problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, we find that looping is common at low temperature. Larger models tend to loop less, and distilled students loop significantly even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we refer to as errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even when there is no hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain much longer than necessary at high temperature; in this sense, temperature is a stopgap rather than a holistic solution. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.

Comment: Analyzes training dynamics and inductive biases in Transformers that cause looping; matches Representation Learning criterion (training dynamics of neural networks).

Relevance: 8 Novelty: 7


30. Scalable Formal Verification via Autoencoder Latent Space Abstraction

ArXiv ID: 2512.13593

Authors: Robert Reed, Morteza Lahijanian, Luca Laurenti

Abstract: Finite Abstraction methods provide a powerful formal framework for proving that systems satisfy their specifications. However, these techniques face scalability challenges for high-dimensional systems, as they rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring the correctness of the resulting verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the effectiveness of our approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements without loss of rigor.

Comment: Autoencoder-based latent space abstraction with formal guarantees to scale verification; matches Representation Learning (autoencoder latent modeling) and systems scalability.

Relevance: 8 Novelty: 7


31. Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

ArXiv ID: 2512.13194

Authors: Chendong Sun

Abstract: Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as $1 - \max(P_{\mathrm{target}})$. By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.

Comment: Efficiency/HPC: adaptive rejection sampling for speculative decoding using target-model uncertainty to increase throughput in autoregressive inference.

Relevance: 8 Novelty: 7
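
The acceptance rule described in the EARS abstract above can be sketched as a small change to the usual speculative-sampling test, which accepts a draft token when a uniform draw `u` is at most the ratio of target to draft probability. The sketch below is an assumption-laden illustration, not the paper's formula: the additive form of the tolerance, the coefficient `beta`, and the function name `ears_accept` are hypothetical; only the uncertainty measure, one minus the maximum target probability, comes from the abstract.

```python
def ears_accept(p_target, q_draft, token, u, beta=0.1):
    """Decide whether to accept a drafted token under a relaxed threshold.

    p_target: target-model probabilities over the vocabulary at this position.
    q_draft:  draft-model probabilities over the same vocabulary.
    token:    index of the token proposed by the draft model.
    u:        a uniform random draw in [0, 1), passed in for reproducibility.
    beta:     hypothetical tolerance coefficient; beta=0 gives the standard rule.
    """
    if q_draft[token] <= 0:
        raise ValueError("the draft model must assign positive probability to its own proposal")
    # Uncertainty as defined in the abstract: 1 - max(P_target).
    uncertainty = 1.0 - max(p_target)
    ratio = p_target[token] / q_draft[token]
    # Assumed form: add a tolerance proportional to the uncertainty.
    threshold = min(1.0, ratio + beta * uncertainty)
    return u <= threshold


uncertain_target = [0.40, 0.35, 0.25]
confident_target = [0.98, 0.01, 0.01]
draft = [0.2, 0.6, 0.2]

standard = ears_accept(uncertain_target, draft, token=1, u=0.62, beta=0.0)
relaxed = ears_accept(uncertain_target, draft, token=1, u=0.62, beta=0.1)
strict = ears_accept(confident_target, draft, token=1, u=0.62, beta=0.1)
```

With the same draw `u`, the relaxed rule accepts a plausible token that the standard rule rejects when the target distribution is flat, while a confident target still rejects. Any such relaxation means the accepted tokens no longer follow the target distribution exactly, which is consistent with the small accuracy drop the abstract reports.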


32. PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders

ArXiv ID: 2512.12905

Authors: Ruixin Guo, Ruoming Jin, Xinyu Li, Yang Zhou

Abstract: Linear Autoencoders (LAEs) have shown strong performance in state-of-the-art recommender systems. However, this success remains largely empirical, with limited theoretical understanding. In this paper, we investigate the generalizability -- a theoretical measure of model performance in statistical learning -- of multivariate linear regression and LAEs. We first propose a PAC-Bayes bound for multivariate linear regression, extending the earlier bound for single-output linear regression by Shalaeva et al., and establish sufficient conditions for its convergence. We then show that LAEs, when evaluated under a relaxed mean squared error, can be interpreted as constrained multivariate linear regression models on bounded data, to which our bound adapts. Furthermore, we develop theoretical methods to improve the computational efficiency of optimizing the LAE bound, enabling its practical evaluation on large models and real-world datasets. Experimental results demonstrate that our bound is tight and correlates well with practical ranking metrics such as Recall@K and NDCG@K.

Comment: Theory for Representation Learning: PAC-Bayes bounds for multivariate linear regression and linear autoencoders, enabling principled generalization analysis.

Relevance: 8 Novelty: 7


33. High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees

ArXiv ID: 2512.12122

Authors: Elynn Chen, Yuefeng Han, Jiayu Li

Abstract: High-dimensional tensor-valued predictors arise in modern applications, increasingly as learned representations from neural networks. Existing tensor classification methods rely on sparsity or Tucker structures and often lack theoretical guarantees. Motivated by empirical evidence that discriminative signals concentrate along a few multilinear components, we introduce CP low-rank structure for the discriminant tensor, a modeling perspective not previously explored. Under a Tensor Gaussian Mixture Model, we propose high-dimensional CP low-rank Tensor Discriminant Analysis (CP-TDA) with Randomized Composite PCA (rc-PCA) initialization, which is essential for handling dependent and anisotropic noise under weaker signal strength and incoherence conditions, followed by an iterative refinement algorithm. We establish global convergence and minimax-optimal misclassification rates. To handle tensor data deviating from tensor normality, we develop the first semiparametric tensor discriminant model, in which learned tensor representations are mapped via deep generative models into a latent space tailored for CP-TDA. Misclassification risk decomposes into representation, approximation, and estimation errors. Numerical studies and real data analysis on graph classification demonstrate substantial gains over existing tensor classifiers and state-of-the-art graph neural networks, particularly in high-dimensional, small-sample regimes.

Comment: Representation Learning: CP low-rank discriminant tensor model with global convergence and minimax guarantees; low-rank structure focus.

Relevance: 8 Novelty: 7


34. DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication

ArXiv ID: 2512.13583

Authors: Zehan Zhu, Heng Zhao, Yan Huang, Joey Tianyi Zhou, Shouling Ji, Jinming Xu

Abstract: In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O}\left( \sqrt{d\log \left( \frac{1}{\delta} \right)}/(\sqrt{n}J\epsilon) \right)$ ($J$ and $d$ are the number of local samples and the dimension of decision variables, respectively) with $\left(\epsilon, \delta\right)$-DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.

Comment: HPC/Distributed Training and Communication Efficiency: DP stochastic gradient push with compressed communication over directed graphs with utility bounds.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.
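
As a rough illustration of the output format requested above, the sketch below parses a JSONL response and checks the four fields. The function name `parse_selection` and the validation rules are assumptions for illustration; the sample record reuses the ArXiv ID and scores of paper 17 from this page with a shortened comment.

```python
import json

REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def parse_selection(jsonl_text):
    """Parse a JSONL response into a list of dicts, one per paper.

    Raises ValueError if a line lacks a required key or if a score is not
    an integer from 1 to 10 (a hypothetical check; the prompt only states
    the 1-10 range).
    """
    records = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
                raise ValueError(f"line {line_number}: {key} must be an integer from 1 to 10")
        records.append(record)
    return records


sample = (
    '{"ARXIVID": "2512.12701", '
    '"COMMENT": "Model Compression and Efficiency: adaptive token pruning.", '
    '"RELEVANCE": 9, "NOVELTY": 7}\n'
)
papers = parse_selection(sample)
```

Each line is an independent JSON object, so a malformed record can be reported by line number without discarding the rest of the response.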

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation; score the highest if the paper contains relevant keywords.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with questionable motivation; purely engineering work such as a new benchmark or dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
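
The two rubrics above share the same five score bands (1-2, 3-4, 5-6, 7-8, 9-10), so a score on either axis can be turned into its band label with one lookup. The following is a minimal sketch for illustration only, not part of the scoring pipeline; the names `RELEVANCE_LABELS`, `NOVELTY_LABELS`, `band`, and `describe` are hypothetical.

```python
# Band labels copied from the rubrics, ordered from the lowest band (1-2)
# to the highest band (9-10). The variable names are illustrative only.
RELEVANCE_LABELS = [
    "Ignore", "Irrelevant", "Borderline", "Relevant", "Completely Relevant",
]
NOVELTY_LABELS = [
    "Low", "Tangential", "Borderline", "Improvements", "Breakthrough",
]


def band(score: int, labels: list[str]) -> str:
    """Map an integer score in 1..10 to the label of its two-point band."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    # Scores 1-2 map to index 0, 3-4 to index 1, ..., 9-10 to index 4.
    return labels[(score - 1) // 2]


def describe(relevance: int, novelty: int) -> tuple[str, str]:
    """Label each axis independently; neither score affects the other."""
    return band(relevance, RELEVANCE_LABELS), band(novelty, NOVELTY_LABELS)


if __name__ == "__main__":
    # A highly relevant paper can still have low novelty, and vice versa.
    print(describe(10, 2))
    print(describe(4, 9))
```

Because the two axes are labeled independently, a paper can land in "Completely Relevant" on one axis and "Low" on the other.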

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insight into the structures themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.
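
As a rough illustration of how the three keyword groups above could act as a cheap pre-filter on titles, here is a minimal sketch. It assumes plain case-insensitive substring matching, which the prompt does not specify, and it covers only the keywords listed explicitly (each list ends in "etc."). The names `KEYWORDS` and `tag_title` are hypothetical, and this check is far cruder than the rubric-based scoring described earlier.

```python
# Lowercased keywords taken from the three lists above. Only explicitly
# listed terms are included; the open-ended "etc." is not expanded.
KEYWORDS = {
    "relevant": [
        "mixture of experts", "moe", "representation learning", "compression",
        "efficiency", "sparse", "sparsity", "pruning", "quantization",
        "low-rank", "foundation model",
    ],
    "irrelevant": [
        "reinforcement learning", "transfer learning", "federated learning",
        "online learning", "diffusion model",
    ],
    "application": [
        "image segmentation", "medical imaging", "3d vision",
        "video understanding", "information retrieval", "summarization",
        "recommendation system", "machine translation", "speech recognition",
        "signal processing", "time series", "knowledge graph",
    ],
}


def tag_title(title: str) -> set[str]:
    """Return every keyword group with at least one keyword in the title."""
    lowered = title.lower()
    return {
        group
        for group, words in KEYWORDS.items()
        if any(word in lowered for word in words)
    }


if __name__ == "__main__":
    print(tag_title("Pruning and Quantization for Medical Imaging"))
    print(tag_title("Federated Learning for Speech Recognition"))
```

A title can match several groups at once, so the result is a set of tags rather than a single verdict. Substring matching can also produce false positives on short keywords such as "moe".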