Personalized Daily ArXiv Papers 2025-12-11

[gpt-5]	Prompt	Completion	Total
Token	30293	27558	57851
Cost	$0.04	$0.28	$0.31

Total arXiv papers: 436

Total scanned papers: 260

Total relevant papers: 22

Table of contents with paper titles:

FALCON: Few-step Accurate Likelihoods for Continuous Flows Authors: Danyal Rehman, Tara Akhound-Sadegh, Artem Gazizov, Yoshua Bengio, Alexander Tong
Closing the Train-Test Gap in World Models for Gradient-Based Planning Authors: Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal, Yann LeCun, Oumayma Bounou, Pavel Izmailov, Micah Goldblum
StructuredDNA: A Bio-Physical Framework for Energy-Aware Transformer Routing Authors: Mustapha Hamdi
Rates and architectures for learning geometrically non-trivial operators Authors: T. Mitchell Roddenberry, Leo Tzou, Ivan Dokmanić, Maarten V. de Hoop, Richard G. Baraniuk
Provably Learning from Modern Language Models via Low Logit Rank Authors: Noah Golowich, Allen Liu, Abhishek Shetty
Towards Lossless Ultimate Vision Token Compression for VLMs Authors: Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen
Tensor-Compressed and Fully-Quantized Training of Neural PDE Solvers Authors: Jinming Lu, Jiayi Tian, Yequan Zhao, Hai Li, Zheng Zhang
Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression Authors: Weiyi He, Yue Xing
Self-Supervised Learning with Gaussian Processes Authors: Yunshan Duan, Sinead Williamson
Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power Authors: Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao
HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression Authors: Gustavo Coelho Haase, Paulo Henrique Dourado da Silva
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving Authors: Chiheng Lou, Sheng Qi, Rui Kang, Yong Zhang, Chen Sun, Pengcheng Wang, Bingyang Liu, Xuanzhe Liu, Xin Jin
Supervised learning pays attention Authors: Erin Craig, Robert Tibshirani
TinyDéjàVu: Smaller Memory Footprint & Faster Inference on Sensor Data Streams with Always-On Microcontrollers Authors: Zhaolan Huang, Emmanuel Baccelli
Resolving Conflicts in Lifelong Learning via Aligning Updates in Subspaces Authors: Yueer Zhou, Yichen Wu, Ying Wei
Luxical: High-Speed Lexical-Dense Text Embeddings Authors: DatologyAI: Luke Merrick, Alex Fang, Aldo Carranza, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Paul Burstein, Parth Doshi, Paul Burnstein, Pratyush Maini, Ricardo Monti, Rishabh Adiga, Scott Loftin, Siddharth Joshi, Spandan Das, Tony Jiang, Vineeth Dorma, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
Branching Strategies Based on Subgraph GNNs: A Study on Theoretical Promise versus Practical Reality Authors: Junru Zhou, Yicheng Wang, Pan Li
Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach Authors: Salvador Carrión, Francisco Casacuberta
Understanding temperature tuning in energy-based models Authors: Peter W Fields, Vudtiwat Ngampruetikorn, David J Schwab, Stephanie E Palmer
Circuits, Features, and Heuristics in Molecular Transformers Authors: Kristof Varadi, Mark Marosi, Peter Antal
Spectral Embedding via Chebyshev Bases for Robust DeepONet Approximation Authors: Muhammad Abid, Omer San
Banach neural operator for Navier-Stokes equations Authors: Bo Zhang

1. FALCON: Few-step Accurate Likelihoods for Continuous Flows

ArXiv ID: 2512.09914

Authors: Danyal Rehman, Tara Akhound-Sadegh, Artem Gazizov, Yoshua Bengio, Alexander Tong

Abstract: Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann Generators tackle this problem by pairing a generative model, capable of exact likelihood computation, with importance sampling to obtain consistent samples under the target distribution. Current Boltzmann Generators primarily use continuous normalizing flows (CNFs) trained with flow matching for efficient training of powerful models. However, likelihood calculation for these models is extremely costly, requiring thousands of function evaluations per sample, severely limiting their adoption. In this work, we propose Few-step Accurate Likelihoods for Continuous Flows (FALCON), a method which allows for few-step sampling with a likelihood accurate enough for importance sampling applications by introducing a hybrid training objective that encourages invertibility. We show FALCON outperforms state-of-the-art normalizing flow models for molecular Boltzmann sampling and is two orders of magnitude faster than the equivalently performing CNF model.

Comment: Author match
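
Illustrative sketch (not from the paper): the abstract describes pairing a generative model with importance sampling to obtain consistent samples under the target distribution. The snippet below shows only the generic self-normalized importance sampling step. The Gaussian proposal, the quadratic toy energy, and the function name snis are assumptions made for illustration; in a Boltzmann Generator the proposal log-density would come from the flow's likelihood.

```python
import numpy as np

def snis(samples, log_q, energy, observable):
    """Self-normalized importance sampling toward the density exp(-energy)."""
    log_w = -energy(samples) - log_q      # unnormalized log-weights
    w = np.exp(log_w - log_w.max())       # subtract max for numerical stability
    w /= w.sum()                          # self-normalize the weights
    ess = 1.0 / np.sum(w ** 2)            # Kish effective sample size
    return float(np.sum(w * observable(samples))), float(ess)

rng = np.random.default_rng(0)
sigma = 1.5                               # toy proposal: N(0, sigma^2)
x = rng.normal(0.0, sigma, size=20_000)
log_q = -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Toy target: exp(-x^2 / 2), a standard normal up to normalization,
# so the reweighted second moment should be close to 1.
second_moment, ess = snis(x, log_q, lambda s: 0.5 * s ** 2, lambda s: s ** 2)
print(second_moment, ess)
```

The effective sample size is the usual diagnostic here: when the proposal likelihood is inaccurate, a few weights dominate and the effective sample size collapses, which is why likelihood accuracy matters for this use case.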

2. Closing the Train-Test Gap in World Models for Gradient-Based Planning

ArXiv ID: 2512.09929

Authors: Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal, Yann LeCun, Oumayma Bounou, Pavel Izmailov, Micah Goldblum

Abstract: World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.

Comment: Author match
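
Illustrative sketch (not from the paper): gradient-based planning, as described in the abstract, optimizes an action sequence by differentiating a planning cost through the world model. In the toy below, hand-written additive dynamics stand in for a learned model so the gradient can be written analytically. The function name plan_by_gradient, the dynamics, and the cost are assumptions made for illustration.

```python
import numpy as np

def plan_by_gradient(s0, goal, horizon=5, lr=0.05, steps=100):
    """Optimize an action sequence by gradient descent through a toy model.

    The hand-written dynamics s_{t+1} = s_t + a_t stand in for a learned,
    differentiable world model; the cost is the squared distance to the goal.
    """
    actions = np.zeros((horizon, s0.shape[0]))
    for _ in range(steps):
        final_state = s0 + actions.sum(axis=0)   # roll the model forward
        grad = 2.0 * (final_state - goal)        # d cost / d a_t, identical for every t
        actions -= lr * grad                     # broadcast the update over time steps
    return actions

s0 = np.array([0.0, 0.0])
goal = np.array([1.0, -2.0])
actions = plan_by_gradient(s0, goal)
final_state = s0 + actions.sum(axis=0)
print(final_state)
```

A sampling-based planner such as CEM would instead draw many candidate action sequences and keep the best ones; the gradient route needs only one rollout per update, at the price of relying on the model's gradients being informative.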

3. StructuredDNA: A Bio-Physical Framework for Energy-Aware Transformer Routing

ArXiv ID: 2512.08968

Authors: Mustapha Hamdi

Abstract: The rapid scaling of large computational models has led to a critical increase in energy and compute costs. Inspired by biological systems where structure and function emerge from low-energy configurations, we introduce StructuredDNA, a sparse architecture framework for modular, energy-aware Transformer routing. StructuredDNA replaces dense Mixture-of-Experts routing with a bio-physical, energy-guided routing layer based on semantic energy minimization. Inputs are dynamically grouped into semantic codons, and routing selects a single expert by minimizing a global energy functional that combines cohesion, uncertainty, and computational cost. We validate StructuredDNA on both specialized (BioASQ) and open-domain benchmarks (WikiText-103). On BioASQ (K = 50), we achieve a 97.7% reduction in Energy Utilization Density (EUD) and a Semantic Stability Index (SSI) of 0.998. We further demonstrate a Semantic Scaling Law on WikiText-103, showing that the architecture generalizes to open domains by scaling expert granularity (K = 2048) while maintaining more than 99% energy efficiency. StructuredDNA thus establishes a robust, domain-agnostic paradigm for future sparse computational frameworks. StructuredDNA provides an explicit link between bio-physical principles and sparse expert routing in Transformer architectures, and points toward future energy-aware, modular, and scalable computational systems. We discuss limitations of this proof-of-concept study and outline directions for scaling the approach to larger models, datasets, and hardware platforms. The StructuredDNA implementation is available at https://github.com/InnoDeep-repos/StructuredDNA .

Comment: Model Architecture and Efficiency: sparse single-expert Transformer routing replacing dense MoE via an energy-minimization routing layer.

Relevance: 10 Novelty: 8
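
Illustrative sketch (not the authors' implementation): the abstract says routing selects a single expert by minimizing an energy that combines cohesion, uncertainty, and computational cost, but it does not give the functional form. The version below assumes a simple weighted sum, with cosine similarity to a per-expert centroid as the cohesion term. The function name route_single_expert and the weights lam_u and lam_c are hypothetical.

```python
import numpy as np

def route_single_expert(x, centroids, uncertainty, cost, lam_u=0.1, lam_c=0.1):
    """Pick the expert with the lowest (hypothetical) routing energy.

    energy_k = -cosine(x, centroid_k) + lam_u * uncertainty_k + lam_c * cost_k
    """
    x_unit = x / np.linalg.norm(x)
    c_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cohesion = c_unit @ x_unit                    # higher means a better semantic match
    energy = -cohesion + lam_u * uncertainty + lam_c * cost
    return int(np.argmin(energy)), energy

centroids = np.eye(3)                             # three toy experts
x = np.array([0.9, 0.1, 0.0])
zeros = np.zeros(3)

expert_free, _ = route_single_expert(x, centroids, zeros, zeros)
# Making expert 0 very expensive shifts the choice to the next-best match.
expert_costly, _ = route_single_expert(x, centroids, zeros, np.array([20.0, 0.0, 0.0]))
print(expert_free, expert_costly)
```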

4. Rates and architectures for learning geometrically non-trivial operators

ArXiv ID: 2512.09376

Authors: T. Mitchell Roddenberry, Leo Tzou, Ivan Dokmanić, Maarten V. de Hoop, Richard G. Baraniuk

Abstract: Deep learning methods have proven capable of recovering operators between high-dimensional spaces, such as solution maps of PDEs and similar objects in mathematical physics, from very few training samples. This phenomenon of data-efficiency has been proven for certain classes of elliptic operators with simple geometry, i.e., operators that do not change the domain of the function or propagate singularities. However, scientific machine learning is commonly used for problems that do involve the propagation of singularities in a priori unknown ways, such as waves, advection, and fluid dynamics. In light of this, we expand the learning theory to include double fibration transforms--geometric integral operators that include generalized Radon and geodesic ray transforms. We prove that this class of operators does not suffer from the curse of dimensionality: the error decays superalgebraically, that is, faster than any fixed power of the reciprocal of the number of training samples. Furthermore, we investigate architectures that explicitly encode the geometry of these transforms, demonstrating that an architecture reminiscent of cross-attention based on levelset methods yields a parameterization that is universal, stable, and learns double fibration transforms from very few training examples. Our results contribute to a rapidly-growing line of theoretical work on learning operators for scientific machine learning.

Comment: Model Architecture + Representation Learning: theory and architectures for learning geometric integral operators; proposes cross-attention–reminiscent architecture with superalgebraic sample efficiency.

Relevance: 9 Novelty: 9

5. Provably Learning from Modern Language Models via Low Logit Rank

ArXiv ID: 2512.09892

Authors: Noah Golowich, Allen Liu, Abhishek Shetty

Abstract: While modern language models and their inner workings are incredibly complex, recent work (Golowich, Liu & Shetty; 2025) has proposed a simple and potentially tractable abstraction for them through the observation that empirically, these language models all seem to have approximately low logit rank. Roughly, this means that a matrix formed by the model's log probabilities of various tokens conditioned on certain sequences of tokens is well approximated by a low rank matrix. In this paper, our focus is on understanding how this structure can be exploited algorithmically for obtaining provable learning guarantees. Since low logit rank models can encode hard-to-learn distributions such as noisy parities, we study a query learning model with logit queries that reflects the access model for common APIs. Our main result is an efficient algorithm for learning any approximately low logit rank model from queries. We emphasize that our structural assumption closely reflects the behavior that is empirically observed in modern language models. Thus, our result gives what we believe is the first end-to-end learning guarantee for a generative model that plausibly captures modern language models.

Comment: Representation/Low-Rank Theory: exploits empirically low logit rank to give provable, efficient learning algorithms under logit queries.

Relevance: 9 Novelty: 9
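
Illustrative sketch (not from the paper): the abstract describes low logit rank as a matrix of log probabilities being well approximated by a low-rank matrix. The toy below builds a synthetic model whose logits factor through a small bottleneck and checks the numerical rank of the resulting log-probability matrix with an SVD. Indexing rows by contexts and columns by next tokens, and all sizes, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, vocab, rank = 200, 50, 4

# Synthetic "model": logits factor through a rank-4 bottleneck.
logits = rng.normal(size=(n_contexts, rank)) @ rng.normal(size=(rank, vocab))

# Log-softmax subtracts one normalizer per row, which adds at most one to the rank.
row_max = logits.max(axis=1, keepdims=True)
log_norm = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
log_probs = logits - log_norm

singular_values = np.linalg.svd(log_probs, compute_uv=False)
numerical_rank = int(np.sum(singular_values > 1e-9 * singular_values[0]))
print(numerical_rank)
```

For a real language model the rank is only approximate, so one would look at how quickly the singular values decay rather than at an exact cutoff.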

6. Towards Lossless Ultimate Vision Token Compression for VLMs

ArXiv ID: 2512.09010

Authors: Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen

Abstract: Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based compression algorithms suffer from either position bias or class imbalance, leading to significant accuracy degradation. They also fail to generalize to shallow LLM layers, which exhibit weaker cross-modal interactions. To address this, we extend token compression to the visual encoder through an effective iterative merging scheme that is orthogonal in spatial axes to accelerate the computation across the entire VLM. Furthermore, we integrate a spectrum pruning unit into the LLM through an attention/similarity-free low-pass filter, which gradually prunes redundant visual tokens and is fully compatible with modern FlashAttention. On this basis, we propose the Lossless Ultimate Vision tokens Compression (LUVC) framework. LUVC systematically compresses visual tokens until complete elimination at the final layer of the LLM, so that the high-dimensional visual features are gradually fused into the multimodal queries. The experiments show that LUVC achieves a 2x inference speedup in the language model with negligible accuracy degradation, and its training-free nature enables immediate deployment across multiple VLMs.

Comment: Model Compression and Efficiency: training-free iterative visual token merging and spectrum pruning compatible with FlashAttention for end-to-end VLM token compression.

Relevance: 9 Novelty: 8

7. Tensor-Compressed and Fully-Quantized Training of Neural PDE Solvers

ArXiv ID: 2512.09202

Authors: Jinming Lu, Jiayi Tian, Yequan Zhao, Hai Li, Zheng Zhang

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a promising paradigm for solving partial differential equations (PDEs) by embedding physical laws into neural network training objectives. However, their deployment on resource-constrained platforms is hindered by substantial computational and memory overhead, primarily stemming from higher-order automatic differentiation, intensive tensor operations, and reliance on full-precision arithmetic. To address these challenges, we present a framework that enables scalable and energy-efficient PINN training on edge devices. This framework integrates fully quantized training, Stein's estimator (SE)-based residual loss computation, and tensor-train (TT) decomposition for weight compression. It contributes three key innovations: (1) a mixed-precision training method that uses a square-block MX (SMX) format to eliminate data duplication during backpropagation; (2) a difference-based quantization scheme for the Stein's estimator that mitigates underflow; and (3) a partial-reconstruction scheme (PRS) for TT-Layers that reduces quantization-error accumulation. We further design PINTA, a precision-scalable hardware accelerator, to fully exploit the performance of the framework. Experiments on the 2-D Poisson, 20-D Hamilton-Jacobi-Bellman (HJB), and 100-D Heat equations demonstrate that the proposed framework achieves accuracy comparable to or better than full-precision, uncompressed baselines while delivering 5.5x to 83.5x speedups and 159.6x to 2324.1x energy savings. This work enables real-time PDE solving on edge devices and paves the way for energy-efficient scientific computing at scale.

Comment: Compression/Efficiency + HPC: fully-quantized training, tensor-train compression, and a precision-scalable accelerator for efficient PINN/PDE solvers.

Relevance: 9 Novelty: 8
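
Illustrative sketch (not the paper's quantized scheme): the abstract mentions Stein's estimator for residual loss computation, which replaces automatic differentiation with function evaluations at randomly perturbed inputs. The snippet below shows the generic first-order form in full precision on a toy quadratic. The function name stein_gradient, the smoothing width sigma, and the sample count are assumptions made for illustration.

```python
import numpy as np

def stein_gradient(f, x, sigma=0.1, n_samples=200_000, seed=0):
    """Derivative-free gradient estimate of the Gaussian-smoothed f at x.

    Uses Stein's identity with antithetic pairs:
    grad f_sigma(x) = E[ delta * (f(x + delta) - f(x - delta)) ] / (2 sigma^2),
    where delta ~ N(0, sigma^2 I).
    """
    rng = np.random.default_rng(seed)
    delta = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    diff = f(x + delta) - f(x - delta)            # shape (n_samples,)
    return (delta * diff[:, None]).mean(axis=0) / (2.0 * sigma ** 2)

f = lambda z: np.sum(z ** 2, axis=-1)             # toy function with true gradient 2x
x = np.array([1.0, -2.0])
grad_est = stein_gradient(f, x)
print(grad_est)
```

The estimator forms the difference f(x + delta) - f(x - delta) of two nearby function values, which gets smaller as sigma shrinks; this is one plausible reading of why the abstract singles out underflow as a concern under low-precision arithmetic.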

8. Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression

ArXiv ID: 2512.09275

Authors: Weiyi He, Yue Xing

Abstract: Positional encoding (PE) is a core architectural component of Transformers, yet its impact on the Transformer's generalization and robustness remains unclear. In this work, we provide the first generalization analysis for a single-layer Transformer under in-context regression that explicitly accounts for a completely trainable PE module. Our result shows that PE systematically enlarges the generalization gap. Extending to the adversarial setting, we derive the adversarial Rademacher generalization bound. We find that the gap between models with and without PE is magnified under attack, demonstrating that PE amplifies the vulnerability of models. Our bounds are empirically validated by a simulation study. Together, this work establishes a new framework for understanding the clean and adversarial generalization in ICL with PE.

Comment: Representation Learning/Transformer analysis: clean and adversarial generalization bounds (Rademacher) quantifying the impact of positional encoding in in-context regression.

Relevance: 9 Novelty: 8

9. Self-Supervised Learning with Gaussian Processes

ArXiv ID: 2512.09322

Authors: Yunshan Duan, Sinead Williamson

Abstract: Self-supervised learning (SSL) is a machine learning paradigm where models learn to understand the underlying structure of data without explicit supervision from labeled samples. The representations acquired from SSL have proven useful for many downstream tasks, including clustering and linear classification. To ensure smoothness of the representation space, most SSL methods rely on the ability to generate pairs of observations that are similar to a given instance. However, generating these pairs may be challenging for many types of data. Moreover, these methods lack consideration of uncertainty quantification and can perform poorly in out-of-sample prediction settings. To address these limitations, we propose Gaussian process self-supervised learning (GPSSL), a novel approach that applies Gaussian process (GP) models to representation learning. GP priors are imposed on the representations, and we obtain a generalized Bayesian posterior minimizing a loss function that encourages informative representations. The covariance function inherent in GPs naturally pulls representations of similar units together, serving as an alternative to using explicitly defined positive samples. We show that GPSSL is closely related to both kernel PCA and VICReg, a popular neural network-based SSL method, but, unlike both, allows for posterior uncertainties that can be propagated to downstream tasks. Experiments on various datasets, considering classification and regression tasks, demonstrate that GPSSL outperforms traditional methods in terms of accuracy, uncertainty quantification, and error control.

Comment: Representation Learning: GP priors on representations with connections to kernel PCA and VICReg, enabling uncertainty-aware SSL.

Relevance: 9 Novelty: 8

10. Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power

ArXiv ID: 2512.09673

Authors: Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao

Abstract: Equivariant neural networks encode symmetry as an inductive bias and have achieved strong empirical performance across a wide range of domains. However, their expressive power remains poorly understood. Focusing on 2-layer ReLU networks, this paper investigates the impact of equivariance constraints on the expressivity of equivariant and layer-wise equivariant networks. By examining the boundary hyperplanes and the channel vectors of ReLU networks, we construct an example showing that equivariance constraints could strictly limit expressive power. However, we demonstrate that this drawback can be compensated for by enlarging the model size. Furthermore, we show that despite a larger model size, the resulting architecture could still correspond to a hypothesis space with lower complexity, implying superior generalizability for equivariant networks.

Comment: Representation/Architecture Theory: analyzes expressivity-generalization tradeoffs in equivariant networks and compensation via model size.

Relevance: 9 Novelty: 7

11. HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression

ArXiv ID: 2512.09886

Authors: Gustavo Coelho Haase, Paulo Henrique Dourado da Silva

Abstract: Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large teachers to small students, (3) suboptimal coordination in multi-teacher scenarios, and (4) inefficient use of computational resources. We present HPM-KD, a framework that integrates six synergistic components: (i) Adaptive Configuration Manager via meta-learning that eliminates manual hyperparameter tuning, (ii) Progressive Distillation Chain with automatically determined intermediate models, (iii) Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) Meta-Learned Temperature Scheduler that adapts temperature throughout training, (v) Parallel Processing Pipeline with intelligent load balancing, and (vi) Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD achieves 10x-15x compression while maintaining 85% accuracy retention, eliminates the need for manual tuning, and reduces training time by 30-40% via parallelization. Ablation studies confirm the independent contribution of each component (0.10-0.98 pp). HPM-KD is available as part of the open-source DeepBridge library.

Comment: Model Compression and Efficiency: hierarchical progressive multi-teacher knowledge distillation with adaptive hyperparameters and parallelization.

Relevance: 9 Novelty: 7
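
Illustrative sketch (not the HPM-KD implementation): the abstract refers to a temperature that is adapted throughout distillation. For orientation, the snippet below shows the standard temperature-softened distillation loss that such a schedule would plug into, with a fixed temperature. The function names and the toy logits are assumptions made for illustration, and nothing here reflects the DeepBridge API.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits / temperature)
    log_p_student = log_softmax(student_logits / temperature)
    kl = np.sum(np.exp(log_p_teacher) * (log_p_teacher - log_p_student), axis=-1)
    return float(temperature ** 2 * kl.mean())

teacher = np.array([[4.0, 1.0, 0.0], [0.5, 2.5, 0.0]])
student = np.array([[2.0, 1.5, 0.5], [0.0, 1.0, 1.0]])

loss_same = distillation_loss(teacher, teacher)   # identical logits give zero loss
loss_diff = distillation_loss(student, teacher)   # mismatched logits give positive loss
print(loss_same, loss_diff)
```

A higher temperature flattens both distributions so the student also learns from the teacher's relative ranking of wrong classes; the T^2 factor keeps gradient magnitudes comparable across temperatures.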

12. WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

ArXiv ID: 2512.09472

Authors: Chiheng Lou, Sheng Qi, Rui Kang, Yong Zhang, Chen Sun, Pengcheng Wang, Bingyang Liu, Xuanzhe Liu, Xin Jin

Abstract: Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of such compromise as their unawareness of future workload characteristics. In contrast, recent analysis on real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance by proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation under real-world datasets shows that WarmServe improves TTFT by up to 50.8x compared to the state-of-the-art autoscaling-based system, while being capable of serving up to 2.5x more requests compared to the GPU-sharing system.

Comment: High Performance Computing (serving systems): predictive one-for-many GPU prewarming, evict-aware placement, and zero-overhead memory switching for multi-LLM serving.