Personalized Daily ArXiv Papers 2025-12-16

[gpt-5]	Prompt	Completion	Total
Token	64560	54894	119454
Cost	$0.08	$0.55	$0.63

Total arXiv papers: 873

Total scanned papers: 555

Total relevant papers: 34

Table of contents with paper titles:

World Models Can Leverage Human Videos for Dexterous Manipulation Authors: Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, Yann LeCun
CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks Authors: Jonathan Wensh{\o}j, Tong Chen, Bob Pepin, Raghavendra Selvan
Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability Authors: Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves
Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings Authors: Yoav Gelberg, Koshi Eguchi, Takuya Akiba, Edoardo Cetin
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models Authors: Zhengyang Wang, Ziyue Liu, Ruijie Zhang, Avinash Maurya, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang
SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping Authors: Yu-Chen Lu, Sheng-Feng Yu, Hui-Hsien Weng, Pei-Shuo Wang, Yu-Fang Hu, Liang Hung-Chun, Hung-Yueh Chiang, Kai-Chiang Wu
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics Authors: Jingdi Lei, Di Zhang, Soujanya Poria
Improving Recursive Transformers with Mixture of LoRAs Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian
Phase transitions reveal hierarchical structure in deep neural networks Authors: Ibrahim Talha Ersoy, Andr\'es Fernando Cardozo Licha, Karoline Wiesner
Resting Neurons, Active Insights: Improving Input Sparsification for Large Language Models Authors: Haotian Xu, Tian Gao, Tsui-Wei Weng, Tengfei Ma
Universality of high-dimensional scaling limits of stochastic gradient descent Authors: Reza Gheissari, Aukosh Jagannath
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim
SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision Authors: Yuseon Choi, Sangjin Kim, Jungjun Oh, Byeongcheol Kim, Hoi-Jun Yoo
D-STEER - Preference Alignment Techniques Learn to Behave, not to Believe -- Beneath the Surface, DPO as Steering Vector Perturbation in Activation Space Authors: Samarth Raina, Saksham Aggarwal, Aman Chadha, Vinija Jain, Amitava Das
Optimized Architectures for Kolmogorov-Arnold Networks Authors: James Bagrow, Josh Bongard
Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity Authors: Yiyang Jia, Chengxu Zhou
Efficient Vision-Language Reasoning via Adaptive Token Pruning Authors: Xue Li, Xiaonan Song, Henry Hu
Exploring the Design Space of Transition Matching Authors: Uriel Singer, Yaron Lipman
CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving Authors: Dong Liu, Yanxuan Yu
Spiking Manifesto Authors: Eugene Izhikevich
Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences Authors: Liviu Aolaritei, Michael I. Jordan
State over Tokens: Characterizing the Role of Reasoning Tokens Authors: Mosh Levy, Zohar Elyoseph, Shauli Ravfogel, Yoav Goldberg
Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry Authors: Behrooz Tahmasebi, Melanie Weber
CORE: Contrastive Masked Feature Reconstruction on Graphs Authors: Jianyuan Bo, Yuan Fang
Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution Authors: Abhinav Kumar, Tristan Aumentado-Armstrong, Lazar Valkov, Gopal Sharma, Alex Levinshtein, Radek Grzeszczuk, Suren Kumar
Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates Authors: Wenjun Yu, Sitian Chen, Cheng Chen, Amelie Chi Zhou
Alada: Alternating Adaptation of Momentum Method for Memory-Efficient Matrix Optimization Authors: Xiaoyu He, Yu Cai, Jin Jia, Canxi Huang, Wenqing Chen, Zibin Zheng
Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing Authors: Zewen Qiang, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu
Wait, Wait, Wait... Why Do Reasoning Models Loop? Authors: Charilaos Pipis, Shivam Garg, Vasilis Kontonis, Vaishnavi Shrivastava, Akshay Krishnamurthy, Dimitris Papailiopoulos
Scalable Formal Verification via Autoencoder Latent Space Abstraction Authors: Robert Reed, Morteza Lahijanian, Luca Laurenti
Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models Authors: Chendong Sun
PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders Authors: Ruixin Guo, Ruoming Jin, Xinyu Li, Yang Zhou
High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees Authors: Elynn Chen, Yuefeng Han, Jiayu Li
DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication Authors: Zehan Zhu, Heng Zhao, Yan Huang, Joey Tianyi Zhou, Shouling Ji, Jinming Xu

1. World Models Can Leverage Human Videos for Dexterous Manipulation

ArXiv ID: 2512.13644

Authors: Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, Yann LeCun

Abstract: Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.

Comment: Author match

2. CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks

ArXiv ID: 2512.12981

Authors: Jonathan Wensh{\o}j, Tong Chen, Bob Pepin, Raghavendra Selvan

Abstract: While joint pruning--quantization is theoretically superior to sequential application, current joint methods rely on auxiliary procedures outside the training loop for finding compression parameters. This reliance adds engineering complexity and hyperparameter tuning, while also lacking a direct data-driven gradient signal, which might result in sub-optimal compression. In this paper, we introduce CoDeQ, a simple, fully differentiable method for joint pruning--quantization. Our approach builds on a key observation: the dead-zone of a scalar quantizer is equivalent to magnitude pruning, and can be used to induce sparsity directly within the quantization operator. Concretely, we parameterize the dead-zone width and learn it via backpropagation, alongside the quantization parameters. This design provides explicit control of sparsity, regularized by a single global hyperparameter, while decoupling sparsity selection from bit-width selection. The result is a method for Compression with Dead-zone Quantizer (CoDeQ) that supports both fixed-precision and mixed-precision quantization (controlled by an optional second hyperparameter). It simultaneously determines the sparsity pattern and quantization parameters in a single end-to-end optimization. Consequently, CoDeQ does not require any auxiliary procedures, making the method architecture-agnostic and straightforward to implement. On ImageNet with ResNet-18, CoDeQ reduces bit operations to ~5% while maintaining close to full precision accuracy in both fixed and mixed-precision regimes.

Comment: Model Compression and Efficiency — end-to-end joint pruning and quantization by learning quantizer dead-zone widths (fully differentiable).

Relevance: 10 Novelty: 9

3. Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability

ArXiv ID: 2512.13568

Authors: Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves

Abstract: Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.

Comment: Representation Learning: proposes an information-theoretic metric for superposition via sparse autoencoders; connects feature capacity to robustness.

Relevance: 10 Novelty: 9

4. Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

ArXiv ID: 2512.12167

Authors: Yoav Gelberg, Koshi Eguchi, Takuya Akiba, Edoardo Cetin

Abstract: So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LM). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining, following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.

Comment: Model Architecture/Efficiency — zero-shot context extension by dropping positional embeddings post-training without long-context finetuning.

Relevance: 10 Novelty: 8

5. BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

ArXiv ID: 2512.12131

Authors: Zhengyang Wang, Ziyue Liu, Ruijie Zhang, Avinash Maurya, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

Abstract: The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimum impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46-1.91$\times$ speedup over full-rank model baselines and 1.87-2.27$\times$ speedup over low-rank model with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.

Comment: High Performance Computing — bottleneck-aware tensor parallelism and system optimizations for low-rank LLMs; also aligns with low-rank efficiency.

Relevance: 10 Novelty: 8

6. SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping

ArXiv ID: 2512.13494

Authors: Yu-Chen Lu, Sheng-Feng Yu, Hui-Hsien Weng, Pei-Shuo Wang, Yu-Fang Hu, Liang Hung-Chun, Hung-Yueh Chiang, Kai-Chiang Wu

Abstract: Large language models (LLM) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLM more suitable for resource-constrained environments. Nonetheless, na\"ive low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.

Comment: Low-rank compression for LLMs via shared projection and block skipping; directly fits Compression/Efficiency (low-rank methods).

Relevance: 10 Novelty: 8

7. Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

ArXiv ID: 2512.12602

Authors: Jingdi Lei, Di Zhang, Soujanya Poria

Abstract: Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallelism and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution effectively corresponding to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.

Comment: Model Architecture/Efficiency: exact linear-time attention via continuous-time dynamics (error-free linear attention) with theoretical foundations.

Relevance: 10 Novelty: 8

8. Improving Recursive Transformers with Mixture of LoRAs

ArXiv ID: 2512.12880

Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian

Abstract: Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M--120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.

Comment: Model Architecture and Compression/Efficiency: conditional computation via Mixture of LoRAs inside shared FFN for recursive transformers; expert merging for deployment.

Relevance: 10 Novelty: 8

9. Phase transitions reveal hierarchical structure in deep neural networks

ArXiv ID: 2512.11866

Authors: Ibrahim Talha Ersoy, Andr\'es Fernando Cardozo Licha, Karoline Wiesner

Abstract: Training Deep Neural Networks relies on the model converging on a high-dimensional, non-convex loss landscape toward a good minimum. Yet, much of the phenomenology of training remains ill understood. We focus on three seemingly disparate observations: the occurrence of phase transitions reminiscent of statistical physics, the ubiquity of saddle points, and phenomenon of mode connectivity relevant for model merging. We unify these within a single explanatory framework, the geometry of the loss and error landscapes. We analytically show that phase transitions in DNN learning are governed by saddle points in the loss landscape. Building on this insight, we introduce a simple, fast, and easy to implement algorithm that uses the L2 regularizer as a tool to probe the geometry of error landscapes. We apply it to confirm mode connectivity in DNNs trained on the MNIST dataset by efficiently finding paths that connect global minima. We then show numerically that saddle points induce transitions between models that encode distinct digit classes. Our work establishes the geometric origin of key training phenomena in DNNs and reveals a hierarchy of accuracy basins analogous to phases in statistical physics.

Comment: Representation Learning — theoretical link between saddle points, phase transitions, and mode connectivity; introduces a probe of loss geometry.

Relevance: 9 Novelty: 8

10. Resting Neurons, Active Insights: Improving Input Sparsification for Large Language Models

ArXiv ID: 2512.12744

Authors: Haotian Xu, Tian Gao, Tsui-Wei Weng, Tengfei Ma

Abstract: Large Language Models (LLMs) achieve state-of-the-art performance across a wide range of applications, but their massive scale poses significant challenges for both efficiency and interpretability. Structural pruning, which reduces model size by removing redundant computational units such as neurons, has been widely explored as a solution, and this study devotes to input sparsification, an increasingly popular technique that improves efficiency by selectively activating only a subset of entry values for each input. However, existing approaches focus primarily on computational savings, often overlooking the representational consequences of sparsification and leaving a noticeable performance gap compared to full models. In this work, we first reinterpret input sparsification as a form of dynamic structural pruning. Motivated by the spontaneous baseline firing rates observed in biological neurons, we introduce a small set of trainable spontaneous neurons that act as compensatory units to stabilize activations in sparsified LLMs. Experiments demonstrate that these auxiliary neurons substantially reduce the sparsification-induced performance gap while generalizing effectively across tasks.

Comment: Proposes input sparsification as dynamic structural pruning with trainable compensatory neurons; fits Model Compression/Efficiency (sparsity/pruning, conditional activation).

Relevance: 9 Novelty: 8

11. Universality of high-dimensional scaling limits of stochastic gradient descent

ArXiv ID: 2512.13634

Authors: Reza Gheissari, Aukosh Jagannath

Abstract: We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE's fixed points are not universal.

Comment: Theoretical universality of high-dimensional SGD scaling limits (ODE/SDE); strong fit to Representation Learning (training dynamics theory).

Relevance: 9 Novelty: 8

12. V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

ArXiv ID: 2512.12284

Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim

Abstract: Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. Due to its iterative prefill stage, it suffers from significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time of 3.9-8.3 FPS and energy-efficient streaming video LLM inference on edge deployment with negligible accuracy loss. While DRE only accounts for 2.2% power and 2.0% area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.

Comment: Dynamic KV cache retrieval with software–hardware co-design for streaming video LLMs; directly matches Compression/Efficiency (cache optimization) and HPC inference acceleration.

Relevance: 9 Novelty: 8

13. SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision

ArXiv ID: 2512.12930

Authors: Yuseon Choi, Sangjin Kim, Jungjun Oh, Byeongcheol Kim, Hoi-Jun Yoo

Abstract: Low-bit quantization is a promising technique for efficient transformer inference by reducing computational and memory overhead. However, aggressive bitwidth reduction remains challenging due to activation outliers, leading to accuracy degradation. Existing methods, such as outlier-handling and group quantization, achieve high accuracy but incur substantial energy consumption. To address this, we propose SeVeDo, an energy-efficient SVD-based heterogeneous accelerator that structurally separates outlier-sensitive components into a high-precision low-rank path, while the remaining computations are executed in a low-bit residual datapath with group quantization. To further enhance efficiency, Hierarchical Group Quantization (HGQ) combines coarse-grained floating-point scaling with fine-grained shifting, effectively reducing dequantization cost. Also, SVD-guided mixed precision (SVD-MP) statically allocates higher bitwidths to precision-sensitive components identified through low-rank decomposition, thereby minimizing floating-point operation cost. Experimental results show that SeVeDo achieves a peak energy efficiency of 13.8TOPS/W, surpassing conventional designs, with 12.7TOPS/W on ViT-Base and 13.4TOPS/W on Llama2-7B benchmarks.

Comment: Heterogeneous accelerator with hierarchical group quantization and SVD-guided mixed precision; strong fit to Compression/Efficiency and HPC inference.

Relevance: 9 Novelty: 8

14. D-STEER - Preference Alignment Techniques Learn to Behave, not to Believe -- Beneath the Surface, DPO as Steering Vector Perturbation in Activation Space

ArXiv ID: 2512.11838

Authors: Samarth Raina, Saksham Aggarwal, Aman Chadha, Vinija Jain, Amitava Das

Abstract: Direct Preference Optimization (DPO) has become a standard recipe for aligning large language models, yet it is still unclear what kind of change it actually induces inside the network. This paper argues that DPO does not rewrite a models internal beliefs; instead, it acts as a low rank steering mechanism that nudges activations along a small number of preference directions. Using a simple derivation, we show that the DPO gradient depends only on the difference between the logit embeddings of preferred and dispreferred completions, implying a first order shift in the final hidden representation rather than a deep restructuring of semantics. We then extract an empirical steering vector from a DPO tuned model and demonstrate that adding this vector to base activations reproduces most of the aligned behavior, while subtracting it nearly restores the original model. Finally, spectral analyses reveal rank-one dominance and entropy collapse in upper layers, indicating that alignment is funneled through a narrow subspace. Taken together, these results support a behavioral illusion view of DPO: it teaches models how to act aligned, not what to believe.

Comment: Representation Learning/Training Dynamics: shows DPO acts as a low-rank steering perturbation (rank-1 dominance) in activation space.

Relevance: 9 Novelty: 8

15. Optimized Architectures for Kolmogorov-Arnold Networks

ArXiv ID: 2512.12448

Authors: James Bagrow, Josh Bongard

Abstract: Efforts to improve Kolmogorov-Arnold networks (KANs) with architectural enhancements have been stymied by the complexity those enhancements bring, undermining the interpretability that makes KANs attractive in the first place. Here we study overprovisioned architectures combined with sparsification to learn compact, interpretable KANs without sacrificing accuracy. Crucially, we focus on differentiable sparsification, turning architecture search into an end-to-end optimization problem. Across function approximation benchmarks, dynamical systems forecasting, and real-world prediction tasks, we demonstrate competitive or superior accuracy while discovering substantially smaller models. Overprovisioning and sparsification are synergistic, with the combination outperforming either alone. The result is a principled path toward models that are both more expressive and more interpretable, addressing a key tension in scientific machine learning.

Comment: Matches Model Compression and Efficiency: differentiable sparsification to learn compact architectures (also Model Architecture for KAN design).

Relevance: 9 Novelty: 7

16. Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity

ArXiv ID: 2512.12713

Authors: Yiyang Jia, Chengxu Zhou

Abstract: Control policies in deep reinforcement learning are often implemented with fixed-capacity multilayer perceptrons trained by backpropagation, which lack structural plasticity and depend on global error signals. This paper introduces the Self-Motivated Growing Neural Network (SMGrNN), a controller whose topology evolves online through a local Structural Plasticity Module (SPM). The SPM monitors neuron activations and edge-wise weight update statistics over short temporal windows and uses these signals to trigger neuron insertion and pruning, while synaptic weights are updated by a standard gradient-based optimizer. This allows network capacity to be regulated during learning without manual architectural tuning. SMGrNN is evaluated on control benchmarks via policy distillation. Compared with multilayer perceptron baselines, it achieves similar or higher returns, lower variance, and task-appropriate network sizes. Ablation studies with growth disabled and growth-only variants isolate the role of structural plasticity, showing that adaptive topology improves reward stability. The local and modular design of SPM enables future integration of a Hebbian plasticity module and spike-timing-dependent plasticity, so that SMGrNN can support both artificial and spiking neural implementations driven by local rules.

Comment: Matches Model Architecture: dynamic/growing neural network with local structural plasticity (neuron insertion/pruning).

Relevance: 9 Novelty: 7

17. Efficient Vision-Language Reasoning via Adaptive Token Pruning

ArXiv ID: 2512.12701

Authors: Xue Li, Xiaonan Song, Henry Hu

Abstract: Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.

Comment: Model Compression and Efficiency — adaptive token pruning at the vision-language interface using attention/similarity-based importance.

Relevance: 9 Novelty: 7

18. Exploring the Design Space of Transition Matching

ArXiv ID: 2512.12465

Authors: Uriel Singer, Yaron Lipman

Abstract: Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples, however it uses a second ``internal'' generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations) we evaluate the affect of the head module architecture and modeling during training as-well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with high frequency sampler provides best ranking across all metrics reaching state-of-the-art among all tested baselines, while Transformer head with sequence scaling and low frequency sampling is a runner up excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide most quality and efficiency gains, while at the same time indicate what design choices are not likely to provide further gains.

Comment: Model Architecture: systematic design/training/sampling study of Transition Matching with head–backbone architecture for generative models.

Relevance: 9 Novelty: 7

19. CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

ArXiv ID: 2512.11920

Authors: Dong Liu, Yanxuan Yu

Abstract: Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose \textbf{CXL-SpecKV}, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an FPGA-accelerated KV-cache compression and decompression engine that reduces memory bandwidth requirements by up to 4$\times$. When evaluated on state-of-the-art LLM models, CXL-SpecKV achieves up to 3.2$\times$ higher throughput compared to GPU-only baselines, while reducing memory costs by 2.8$\times$ and maintaining accuracy. Our system demonstrates that intelligent memory disaggregation combined with speculative execution can effectively address the memory wall challenge in large-scale LLM serving. Our code implementation has been open-sourced at https://github.com/FastLM/CXL-SpecKV.

Comment: High Performance Computing/Efficiency: disaggregated KV-cache over CXL with FPGA acceleration, speculative prefetch, and compression for LLM serving.

Relevance: 9 Novelty: 7

20. Spiking Manifesto

ArXiv ID: 2512.11843

Authors: Eugene Izhikevich

Abstract: Practically everything computers do is better, faster, and more power-efficient than the brain. For example, a calculator crunches numbers more energy-efficiently than any human. Yet AI models are a thousand times less efficient than the brain. These models use artificial neural networks (ANNs) and require GPUs for the multiplication of huge matrices. In contrast, spiking neural networks (SNNs) of the brain have no matrix multiplication and much smaller energy requirements. This manifesto proposes a framework for thinking about popular AI models in terms of spiking networks and polychronization, and for interpreting spiking activity as nature's way of implementing look-up tables. This offers a way to convert AI models into a novel type of architecture with the promise of a thousandfold improvement in efficiency. Code is available at https://github.com/izhikevich/SNN

Comment: Matches Model Architecture/Efficiency: proposes spiking network reinterpretation of ANNs for potential thousandfold efficiency gains.

Relevance: 8 Novelty: 8

21. Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

ArXiv ID: 2512.13123

Authors: Liviu Aolaritei, Michael I. Jordan

Abstract: We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-\alpha$ and is almost surely finite under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.

Comment: Matches Optimization/Training (HPC relevance): anytime-valid stopping rules for SGD via confidence sequences for principled training control.

Relevance: 8 Novelty: 8

22. State over Tokens: Characterizing the Role of Reasoning Tokens

ArXiv ID: 2512.12777

Authors: Mosh Levy, Zohar Elyoseph, Shauli Ravfogel, Yoav Goldberg

Abstract: Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model's actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state -- the sole persistent information carrier across the model's stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text and surfaces previously overlooked research questions on these tokens. We argue that to truly understand the process that LLMs do, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.

Comment: Conceptual framework recasting reasoning tokens as externalized computational state; aligns with Representation Learning/training dynamics.

Relevance: 8 Novelty: 8

23. Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry

ArXiv ID: 2512.11855

Authors: Behrooz Tahmasebi, Melanie Weber

Abstract: Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, achieving exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic averaging complexity. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.

Comment: Representation Learning/Theory: quantifies complexity gap between exact vs approximate symmetry, guiding symmetry-inductive biases.

Relevance: 8 Novelty: 8

24. CORE: Contrastive Masked Feature Reconstruction on Graphs

ArXiv ID: 2512.13235

Authors: Jianyuan Bo, Yuan Fang

Abstract: In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.

Comment: Matches Representation Learning: theoretical link between masked feature reconstruction and contrastive objectives with a unified framework.