Personalized Daily ArXiv Papers 2025-12-24

[gpt-5]	Prompt	Completion	Total
Token	29156	31877	61033
Cost	$0.04	$0.32	$0.36

Total arXiv papers: 403

Total scanned papers: 236

Total relevant papers: 15

Table of contents with paper titles:

Block-Recurrent Dynamics in Vision Transformers Authors: Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller
UCCL-EP: Portable Expert-Parallel Communication Authors: Ziming Mao, Yihan Zhang, Chihan Cui, Kaichao You, Zhongjie Chen, Zhiying Xu, Scott Shenker, Costin Raiciu, Yang Zhou, Ion Stoica
How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts Authors: Sumin Park, Noseong Park
Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling Authors: Indranil Halder, Cengiz Pehlevan
BRIDGE: Budget-aware Reasoning via Intermediate Distillation with Guided Examples Authors: Xuan-An Le, Minh-Nam Tran, Son Nguyen
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures Authors: Yedi Zhang, Andrew Saxe, Peter E. Latham
Distilling to Hybrid Attention Models via KL-Guided Layer Selection Authors: Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, Yoon Kim
Relu and softplus neural nets as zero-sum turn-based games Authors: Stephane Gaubert, Yiannis Vlassopoulos
ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge Authors: Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou
How Much 3D Do Video Foundation Models Encode? Authors: Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg
Field-Space Attention for Structure-Preserving Earth System Transformers Authors: Maximilian Witte, Johannes Meuer, \'Etienne Pl\'esiat, Christopher Kadow
High-Performance Self-Supervised Learning by Joint Training of Flow Matching Authors: Kosuke Ukita, Tsuyoshi Okita
Learning to Reason in LLMs by Expectation Maximization Authors: Junghyun Lee, Branislav Kveton, Sunav Choudhary, Subhojyoti Mukherjee, Anup Rao, Ryan A. Rossi, Alexa Siu
PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models Authors: Mingue Park, Jisung Hwang, Seungwoo Yoo, Kyeongmin Yeo, Minhyuk Sung
Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs Authors: Rui Pan, Zhuofu Chen, Ravi Netravali

1. Block-Recurrent Dynamics in Vision Transformers

ArXiv ID: 2512.19941

Authors: Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller

Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.

Comment: Model Architecture and Representation Learning: discovers block-recurrent depth structure in ViTs and trains recurrent surrogates (Raptor), with dynamical/low-rank analyses.

Relevance: 10 Novelty: 8

2. UCCL-EP: Portable Expert-Parallel Communication

ArXiv ID: 2512.19849

Authors: Ziming Mao, Yihan Zhang, Chihan Cui, Kaichao You, Zhongjie Chen, Zhiying Xu, Scott Shenker, Costin Raiciu, Yang Zhou, Ion Stoica

Abstract: Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces. We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to $2.1\times$ for dispatch and combine throughput. On NVIDIA-only platform, UCCL-EP achieves comparable performance to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.

Comment: High Performance Computing (MoE): portable expert-parallel communication using GPU–CPU control channel and GPUDirect RDMA, enabling DeepEP-level performance across GPU/NICs.

Relevance: 10 Novelty: 8

3. How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts

ArXiv ID: 2512.19765

Authors: Sumin Park, Noseong Park

Abstract: Finding the optimal configuration of Sparse Mixture-ofExperts (SMoE) that maximizes semantic differentiation among experts is essential for exploiting the full potential of MoE architectures. However, existing SMoE frameworks either heavily rely on hyperparameter tuning or overlook the importance of diversifying semantic roles across experts when adapting the expert pool size. We propose Mixture-of-Experts for Adaptive Semantic Specialization (MASS), a semanticaware MoE framework for adaptive expert expansion and dynamic routing. MASS introduces two key advancements: (i) a gradient-based semantic drift detector that prompts targeted expert expansion when the existing expert pool lacks capacity to capture the full semantic diversity of the data, and (ii) an integration of adaptive routing strategy that dynamically adjusts expert usage based on token-level routing confidence mass. We first demonstrate that MASS reliably converges to the point of optimal balance between cost-performance trade-off with notably improved sematic specialization in a highly controlled synthetic setup. Further empirical results on real-world datasets across language and vision domains show that MASS consistently outperforms a range of strong MoE baselines, demonstrating its domain robustness and enhanced expert specialization.

Comment: Model Architecture (MoE): adaptive expert expansion with semantic drift detection and dynamic routing to optimize expert specialization.

Relevance: 10 Novelty: 8

4. Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling

ArXiv ID: 2512.19905

Authors: Indranil Halder, Cengiz Pehlevan

Abstract: Recent developments in large language models have shown advantages in reallocating a notable share of computational resource from training time to inference time. However, the principles behind inference time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw $k$ inference-time samples and select via softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with increasing inference time samples $k$. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$ beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-$k$" limit with the teacher as reward, we theoretically show that the generalization error decays as $\Theta(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.

Comment: Representation Learning/Training Dynamics: analytically tractable inference-time scaling with best-of-k theory for LLM-as-a-judge.

Relevance: 9 Novelty: 9

5. BRIDGE: Budget-aware Reasoning via Intermediate Distillation with Guided Examples

ArXiv ID: 2512.20403

Authors: Xuan-An Le, Minh-Nam Tran, Son Nguyen

Abstract: Distilling knowledge from large proprietary models (e.g., GPT-4) to tiny deployable models (less than 1B parameters) faces a critical capacity-budget trap: the 1000x capacity gap between teachers and students prevents effective direct transfer, while API costs prohibit extensive data collection. We introduce BRIDGE (Budget-Aware Reasoning via Intermediate Distillation), a two-phase framework that resolves these constraints through strategic intermediation and budget asymmetry. In Phase 1, a mid-sized Teacher Assistant (TA; e.g., about 7B) learns from the black-box teacher on a strictly limited subset of data (e.g., 3-5%), selected via a zero-API-cost pipeline that balances entropic difficulty and semantic diversity using only local TA inference. In Phase 2, we exploit this asymmetry-teacher queries are expensive, whereas TA inference is free to amplify supervision: the refined TA generates synthetic rationales for the full dataset to train the tiny student. Crucially, we apply an instruction-tuning curriculum to establish behavioral alignment in the tiny student before transferring reasoning. Our theoretical analysis shows that BRIDGE yields tighter generalization bounds than direct distillation when data is abundant. Experiments across medical, legal, and financial benchmarks demonstrate consistent improvements: BRIDGE delivers student performance gains of 28-41%, closing the capability gap with proprietary teachers by 12-16% while using 10x fewer teacher queries. Notably, BRIDGE defies the conventional cost-performance frontier, surpassing direct distillation baselines that use 100% of the budget while consuming only 5% of the resources.

Comment: Model Compression/Efficiency: budget-aware distillation via teacher assistant with theoretical generalization bounds.

Relevance: 9 Novelty: 8

6. Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

ArXiv ID: 2512.20607

Authors: Yedi Zhang, Andrew Saxe, Peter E. Latham

Abstract: Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.

Comment: Representation Learning/Training Dynamics: unified theory of simplicity bias via saddle-to-saddle dynamics across FCN, CNN, and attention architectures.

Relevance: 9 Novelty: 8

7. Distilling to Hybrid Attention Models via KL-Guided Layer Selection

ArXiv ID: 2512.20569

Authors: Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, Yoon Kim

Abstract: Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.

Comment: Model Architecture and Efficiency: KL-guided layer selection to convert selected Transformer layers to linear attention, distilling into hybrid attention models.

Relevance: 9 Novelty: 7

8. Relu and softplus neural nets as zero-sum turn-based games

ArXiv ID: 2512.20582

Authors: Stephane Gaubert, Yiannis Vlassopoulos

Abstract: We show that the output of a ReLU neural network can be interpreted as the value of a zero-sum, turn-based, stopping game, which we call the ReLU net game. The game runs in the direction opposite to that of the network, and the input of the network serves as the terminal reward of the game. In fact, evaluating the network is the same as running the Shapley-Bellman backward recursion for the value of the game. Using the expression of the value of the game as an expected total payoff with respect to the path measure induced by the transition probabilities and a pair of optimal policies, we derive a discrete Feynman-Kac-type path-integral formula for the network output. This game-theoretic representation can be used to derive bounds on the output from bounds on the input, leveraging the monotonicity of Shapley operators, and to verify robustness properties using policies as certificates. Moreover, training the neural network becomes an inverse game problem: given pairs of terminal rewards and corresponding values, one seeks transition probabilities and rewards of a game that reproduces them. Finally, we show that a similar approach applies to neural networks with Softplus activation functions, where the ReLU net game is replaced by its entropic regularization.

Comment: Representation Learning: game-theoretic and path-integral formulation of ReLU/Softplus networks enabling robustness certificates and training insights.

Relevance: 8 Novelty: 8

9. ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge

ArXiv ID: 2512.20276

Authors: Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou

Abstract: Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hin dered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typi cally operate at only 3-5 Hz on edge devices due to the memory bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge plat forms. At the core of ActionFlow is a Cross-Request Pipelin ing strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dy namic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.

Comment: High Performance Computing/Efficiency: cross-request pipelining and unified KV ring buffer to accelerate autoregressive VLA inference on edge.