Personalized Daily ArXiv Papers 2025-08-08

[gpt-4o]	Prompt	Completion	Total
Token	42245	4662	46907
Cost	$0.11	$0.05	$0.15

Total arXiv papers: 530

Total scanned papers: 320

Total relevant papers: 23

Table of contents with paper titles:

InfoQ: Mixed-Precision Quantization via Global Information Flow Authors: Mehmet Emre Akbulut, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri
Tractable Sharpness-Aware Learning of Probabilistic Circuits Authors: Hrithik Suresh, Sahil Sidheekh, Vishnu Shreeram M. P, Sriraam Natarajan, Narayanan C. Krishnan
Fairy$\pm i$: the First 2-bit Complex LLM with All Parameters in ${\pm1, \pm i}$ Authors: Feiyu Wang, Guoan Wang, Yihao Zhang, Shengfan Wang, Weitao Li, Bokai Huang, Shimao Chen, Zihan Jiang, Rui Xu, Tong Yang
Integrated Influence: Data Attribution with Baseline Authors: Linxiao Yang, Xinyu Gu, Liang Sun
Task complexity shapes internal representations and robustness in neural networks Authors: Robert Jankowski, Filippo Radicchi, M. \'Angeles Serrano, Mari\'an Bogu\~n\'a, Santo Fortunato
Can SGD Handle Heavy-Tailed Noise? Authors: Ilyas Fatkhullin, Florian H\"ubler, Guanghui Lan
Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization Authors: Wei Liu, Anweshit Panda, Ujwal Pandey, Christopher Brissette, Yikang Shen, George M. Slota, Naigang Wang, Jie Chen, Yangyang Xu
MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs Authors: Xiaodong Chen, Mingming Ha, Zhenzhong Lan, Jing Zhang, Jianguo Li
Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity Authors: Hikaru Umeda, Hideaki Iiduka
Making Prompts First-Class Citizens for Adaptive LLM Pipelines Authors: Ugur Cetintemel, Shu Chen, Alexander W. Lee, Deepti Raghavan
How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations Authors: Brandon Jaipersaud, David Krueger, Ekdeep Singh Lubana
On the Design of Expressive and Trainable Pulse-based Quantum Machine Learning Models Authors: Han-Xiao Tao, Xin Wang, Re-Bing Wu
LAG: Logic-Augmented Generation from a Cartesian Perspective Authors: Yilin Xiao, Chuang Zhou, Qinggang Zhang, Su Dong, Shengyuan Chen, Xiao Huang
MolSnap: Snap-Fast Molecular Generation with Latent Variational Mean Flow Authors: Md Atik Ahamed, Qiang Ye, Qiang Cheng
CF3: Compact and Fast 3D Feature Fields Authors: Hyunjoon Lee, Joonkyu Min, Jaesik Park
Cross-LoRA: A Data-Free LoRA Transfer Framework across Heterogeneous LLMs Authors: Feifan Xia, Mingyang Liao, Yuyang Fang, Defang Li, Yantong Xie, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling Authors: Yixuan Zhang, Wenxin Zhang, Hua Jiang, Quyu Kong, Feng Zhou
Taxonomy of Faults in Attention-Based Neural Networks Authors: Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, Mohammad Masudur Rahman
MENDR: Manifold Explainable Neural Data Representations Authors: Matthew Chen, Micky Nnamdi, Justin Shao, Andrew Hornback, Hongyun Huang, Ben Tamo, Yishan Zhong, Benoit Marteau, Wenqi Shi, May Dongmei Wang
Pruning Large Language Models by Identifying and Preserving Functional Networks Authors: Yiheng Liu, Junhao Ning, Sichen Xia, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu
Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos Authors: Haoyu Zhang, Shihao Zhang, Ian Colbert, Rayan Saab
Attention Basin: Why Contextual Position Matters in Large Language Models Authors: Zihao Yi, Delong Zeng, Zhenqing Ling, Haohao Luo, Zhe Xu, Wei Liu, Jian Luan, Wanxia Cao, Ying Shen
Gaussian mixture layers for neural networks Authors: Sinho Chewi, Philippe Rigollet, Yuling Yan

1. InfoQ: Mixed-Precision Quantization via Global Information Flow

ArXiv ID: 2508.04753

Authors: Mehmet Emre Akbulut, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri

Abstract: Mixed-precision quantization (MPQ) is crucial for deploying deep neural networks on resource-constrained devices, but finding the optimal bit-width for each layer represents a complex combinatorial optimization problem. Current state-of-the-art methods rely on computationally expensive search algorithms or local sensitivity heuristic proxies like the Hessian, which fail to capture the cascading global effects of quantization error. In this work, we argue that the quantization sensitivity of a layer should not be measured by its local properties, but by its impact on the information flow throughout the entire network. We introduce InfoQ, a novel framework for MPQ that is training-free in the bit-width search phase. InfoQ assesses layer sensitivity by quantizing each layer at different bit-widths and measuring, through a single forward pass, the resulting change in mutual information in the subsequent layers. This quantifies how much each layer quantization impacts the network information flow. The resulting scores are used to formulate bit-width allocation as an integer linear programming problem, which is solved efficiently to minimize total sensitivity under a given budget (e.g., model size or BitOps). Our retraining-free search phase provides a superior search-time/accuracy trade-off (using two orders of magnitude less data compared to state-of-the-art methods such as LIMPQ), while yielding up to a 1% accuracy improvement for MobileNetV2 and ResNet18 on ImageNet at high compression rates (14X and 10.66X).

Comment: The paper presents InfoQ, a novel framework for mixed-precision quantization focusing on global information flow, relevant to model compression.

Relevance: 9 Novelty: 8

2. Tractable Sharpness-Aware Learning of Probabilistic Circuits

ArXiv ID: 2508.05537

Authors: Hrithik Suresh, Sahil Sidheekh, Vishnu Shreeram M. P, Sriraam Natarajan, Narayanan C. Krishnan

Abstract: Probabilistic Circuits (PCs) are a class of generative models that allow exact and tractable inference for a wide range of queries. While recent developments have enabled the learning of deep and expressive PCs, this increased capacity can often lead to overfitting, especially when data is limited. We analyze PC overfitting from a log-likelihood-landscape perspective and show that it is often caused by convergence to sharp optima that generalize poorly. Inspired by sharpness aware minimization in neural networks, we propose a Hessian-based regularizer for training PCs. As a key contribution, we show that the trace of the Hessian of the log-likelihood-a sharpness proxy that is typically intractable in deep neural networks-can be computed efficiently for PCs. Minimizing this Hessian trace induces a gradient-norm-based regularizer that yields simple closed-form parameter updates for EM, and integrates seamlessly with gradient based learning methods. Experiments on synthetic and real-world datasets demonstrate that our method consistently guides PCs toward flatter minima, improves generalization performance.

Comment: The paper presents a Hessian-based regularizer for probabilistic circuits, offering insights into training dynamics and representation learning.

Relevance: 9 Novelty: 8

3. Fairy$\pm i$: the First 2-bit Complex LLM with All Parameters in ${\pm1, \pm i}$

ArXiv ID: 2508.05571

Authors: Feiyu Wang, Guoan Wang, Yihao Zhang, Shengfan Wang, Weitao Li, Bokai Huang, Shimao Chen, Zihan Jiang, Rui Xu, Tong Yang

Abstract: Quantization-Aware Training (QAT) integrates quantization into the training loop, enabling LLMs to learn robust low-bit representations, and is widely recognized as one of the most promising research directions. All current QAT research focuses on minimizing quantization error on full-precision models, where the full-precision accuracy acts as an upper bound (accuracy ceiling). No existing method has even attempted to surpass this ceiling. To break this ceiling, we propose a new paradigm: raising the ceiling (full-precision model), and then still quantizing it efficiently into 2 bits. We propose Fairy$\pm i$, the first 2-bit quantization framework for complex-valued LLMs. Specifically, our method leverages the representational advantages of the complex domain to boost full-precision accuracy. We map weights to the fourth roots of unity ${\pm1, \pm i}$, forming a perfectly symmetric and information-theoretically optimal 2-bit representation. Importantly, each quantized weight has either a zero real or imaginary part, enabling multiplication-free inference using only additions and element swaps. Experimental results show that Fairy$\pm i$ outperforms the ceiling of existing 2-bit quantization approaches in terms of both PPL and downstream tasks, while maintaining strict storage and compute efficiency. This work opens a new direction for building highly accurate and practical LLMs under extremely low-bit constraints.

Comment: The paper presents a novel 2-bit quantization framework for complex-valued LLMs, which is relevant to model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

4. Integrated Influence: Data Attribution with Baseline

ArXiv ID: 2508.05089

Authors: Linxiao Yang, Xinyu Gu, Liang Sun

Abstract: As an effective approach to quantify how training samples influence test sample, data attribution is crucial for understanding data and model and further enhance the transparency of machine learning models. We find that prevailing data attribution methods based on leave-one-out (LOO) strategy suffer from the local-based explanation, as these LOO-based methods only perturb a single training sample, and overlook the collective influence in the training set. On the other hand, the lack of baseline in many data attribution methods reduces the flexibility of the explanation, e.g., failing to provide counterfactual explanations. In this paper, we propose Integrated Influence, a novel data attribution method that incorporates a baseline approach. Our method defines a baseline dataset, follows a data degeneration process to transition the current dataset to the baseline, and accumulates the influence of each sample throughout this process. We provide a solid theoretical framework for our method, and further demonstrate that popular methods, such as influence functions, can be viewed as special cases of our approach. Experimental results show that Integrated Influence generates more reliable data attributions compared to existing methods in both data attribution task and mislablled example identification task.

Comment: The paper introduces a novel data attribution method, Integrated Influence, which provides a theoretical framework and insights into data attribution, aligning with representation learning.

Relevance: 9 Novelty: 8

5. Task complexity shapes internal representations and robustness in neural networks

ArXiv ID: 2508.05463

Authors: Robert Jankowski, Filippo Radicchi, M. \'Angeles Serrano, Mari\'an Bogu\~n\'a, Santo Fortunato

Abstract: Neural networks excel across a wide range of tasks, yet remain black boxes. In particular, how their internal representations are shaped by the complexity of the input data and the problems they solve remains obscure. In this work, we introduce a suite of five data-agnostic probes-pruning, binarization, noise injection, sign flipping, and bipartite network randomization-to quantify how task difficulty influences the topology and robustness of representations in multilayer perceptrons (MLPs). MLPs are represented as signed, weighted bipartite graphs from a network science perspective. We contrast easy and hard classification tasks on the MNIST and Fashion-MNIST datasets. We show that binarizing weights in hard-task models collapses accuracy to chance, whereas easy-task models remain robust. We also find that pruning low-magnitude edges in binarized hard-task models reveals a sharp phase-transition in performance. Moreover, moderate noise injection can enhance accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights. Finally, preserving only the sign structure-instead of precise weight magnitudes-through bipartite network randomizations suffices to maintain high accuracy. These phenomena define a model- and modality-agnostic measure of task complexity: the performance gap between full-precision and binarized or shuffled neural network performance. Our findings highlight the crucial role of signed bipartite topology in learned representations and suggest practical strategies for model compression and interpretability that align with task complexity.

Comment: The paper investigates how task complexity influences internal representations in neural networks, which is relevant to representation learning and model compression.

Relevance: 9 Novelty: 8

6. Can SGD Handle Heavy-Tailed Noise?

ArXiv ID: 2508.04860

Authors: Ilyas Fatkhullin, Florian H\"ubler, Guanghui Lan

Abstract: Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise -- common in modern machine learning and reinforcement learning -- remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded $p$-th moments for some $p \in (1, 2]$, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: $\mathcal{O}(\varepsilon^{-\frac{p}{p-1}})$ and $\mathcal{O}(\varepsilon^{-\frac{p}{2(p-1)}})$, respectively. For non-convex objectives under H\"older smoothness, we prove convergence to a stationary point with rate $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$, and complement this with a matching lower bound specific to SGD with arbitrary polynomial step-size schedules. Finally, we consider non-convex Mini-batch SGD under standard smoothness and bounded central moment assumptions, and show that it also achieves a comparable $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$ sample complexity with a potential improvement in the smoothness constant. These results challenge the prevailing view that heavy-tailed noise renders SGD ineffective, and establish vanilla SGD as a robust and theoretically principled baseline -- even in regimes where the variance is unbounded.

Comment: The paper provides theoretical insights into the behavior of SGD under heavy-tailed noise, contributing to the understanding of training dynamics in neural networks.

Relevance: 9 Novelty: 8

7. Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization

ArXiv ID: 2508.04950

Authors: Wei Liu, Anweshit Panda, Ujwal Pandey, Christopher Brissette, Yikang Shen, George M. Slota, Naigang Wang, Jie Chen, Yangyang Xu

Abstract: In this paper, we design two compressed decentralized algorithms for solving nonconvex stochastic optimization under two different scenarios. Both algorithms adopt a momentum technique to achieve fast convergence and a message-compression technique to save communication costs. Though momentum acceleration and compressed communication have been used in literature, it is highly nontrivial to theoretically prove the effectiveness of their composition in a decentralized algorithm that can maintain the benefits of both sides, because of the need to simultaneously control the consensus error, the compression error, and the bias from the momentum gradient. For the scenario where gradients are bounded, our proposal is a compressed decentralized adaptive method. To the best of our knowledge, this is the first decentralized adaptive stochastic gradient method with compressed communication. For the scenario of data heterogeneity without bounded gradients, our proposal is a compressed decentralized heavy-ball method, which applies a gradient tracking technique to address the challenge of data heterogeneity. Notably, both methods achieve an optimal convergence rate, and they can achieve linear speed up and adopt topology-independent algorithmic parameters within a certain regime of the user-specified error tolerance. Superior empirical performance is observed over state-of-the-art methods on training deep neural networks (DNNs) and Transformers.

Comment: The paper introduces Gaussian mixture layers for neural networks, which is a novel architectural innovation relevant to model architecture.

Relevance: 9 Novelty: 8

8. MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

ArXiv ID: 2508.05257

Authors: Xiaodong Chen, Mingming Ha, Zhenzhong Lan, Jing Zhang, Jianguo Li

Abstract: The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% relatively) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only 1%-2% accuracy drop (about 2% drops when measured relatively).

Comment: The paper presents a novel method for compressing MoE-based LLMs, which is highly relevant to model compression and MoE architecture.

Relevance: 9 Novelty: 8

9. Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity

ArXiv ID: 2508.05297

Authors: Hikaru Umeda, Hideaki Iiduka

Abstract: The unprecedented growth of deep learning models has enabled remarkable advances but introduced substantial computational bottlenecks. A key factor contributing to training efficiency is batch-size and learning-rate scheduling in stochastic gradient methods. However, naive scheduling of these hyperparameters can degrade optimization efficiency and compromise generalization. Motivated by recent theoretical insights, we investigated how the batch size and learning rate should be increased during training to balance efficiency and convergence. We analyzed this problem on the basis of stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations needed to reach an $\epsilon$-approximate stationary point of the empirical loss. We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity and validated them through extensive experiments. Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.

Comment: The paper provides theoretical insights into optimizing batch size and learning rate schedules for SGD, which is relevant to training dynamics in neural networks.

Relevance: 9 Novelty: 8

10. Making Prompts First-Class Citizens for Adaptive LLM Pipelines

ArXiv ID: 2508.05012

Authors: Ugur Cetintemel, Shu Chen, Alexander W. Lee, Deepti Raghavan

Abstract: Modern LLM pipelines increasingly resemble data-centric systems: they retrieve external context, compose intermediate outputs, validate results, and adapt based on runtime feedback. Yet, the central element guiding this process -- the prompt -- remains a brittle, opaque string, disconnected from the surrounding dataflow. This disconnect limits reuse, optimization, and runtime control. In this paper, we describe our vision and an initial design for SPEAR, a language and runtime that fills this prompt management gap by making prompts structured, adaptive, and first-class components of the execution model. SPEAR enables (1) runtime prompt refinement -- modifying prompts dynamically in response to execution-time signals such as confidence, latency, or missing context; and (2) structured prompt management -- organizing prompt fragments into versioned views with support for introspection and logging. SPEAR defines a prompt algebra that governs how prompts are constructed and adapted within a pipeline. It supports multiple refinement modes (manual, assisted, and automatic), giving developers a balance between control and automation. By treating prompt logic as structured data, SPEAR enables optimizations such as operator fusion, prefix caching, and view reuse. Preliminary experiments quantify the behavior of different refinement modes compared to static prompts and agentic retries, as well as the impact of prompt-level optimizations such as operator fusion.

Comment: The paper introduces SPEAR, a novel approach to managing prompts in LLM pipelines, focusing on making prompts structured and adaptive. This aligns with the interest in foundational research on LLM behavior and architecture.

Relevance: 9 Novelty: 8

11. How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations

ArXiv ID: 2508.05625

Authors: Brandon Jaipersaud, David Krueger, Ekdeep Singh Lubana

Abstract: Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient.

Comment: The paper uses linear probes to study persuasion dynamics in LLMs, providing insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 7

12. On the Design of Expressive and Trainable Pulse-based Quantum Machine Learning Models

ArXiv ID: 2508.05559

Authors: Han-Xiao Tao, Xin Wang, Re-Bing Wu

Abstract: Pulse-based Quantum Machine Learning (QML) has emerged as a novel paradigm in quantum artificial intelligence due to its exceptional hardware efficiency. For practical applications, pulse-based models must be both expressive and trainable. Previous studies suggest that pulse-based models under dynamic symmetry can be effectively trained, thanks to a favorable loss landscape that has no barren plateaus. However, the resulting uncontrollability may compromise expressivity when the model is inadequately designed. This paper investigates the requirements for pulse-based QML models to be expressive while preserving trainability. We present a necessary condition pertaining to the system's initial state, the measurement observable, and the underlying dynamical symmetry Lie algebra, supported by numerical simulations. Our findings establish a framework for designing practical pulse-based QML models that balance expressivity and trainability.

Comment: The paper discusses the design of pulse-based quantum machine learning models, focusing on expressivity and trainability, which is a novel paradigm in AI for science.

Relevance: 8 Novelty: 8

13. LAG: Logic-Augmented Generation from a Cartesian Perspective

ArXiv ID: 2508.05509

Authors: Yilin Xiao, Chuang Zhou, Qinggang Zhang, Su Dong, Shengyuan Chen, Xiao Huang

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from \textit{Discours de la m\'ethode}, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.

Comment: The paper proposes a novel logic-augmented generation paradigm for LLMs, focusing on systematic question decomposition and dependency-aware reasoning, relevant to large language models.

Relevance: 8 Novelty: 8

14. MolSnap: Snap-Fast Molecular Generation with Latent Variational Mean Flow

ArXiv ID: 2508.05411

Authors: Md Atik Ahamed, Qiang Ye, Qiang Cheng

Abstract: Molecular generation conditioned on textual descriptions is a fundamental task in computational chemistry and drug discovery. Existing methods often struggle to simultaneously ensure high-quality, diverse generation and fast inference. In this work, we propose a novel causality-aware framework that addresses these challenges through two key innovations. First, we introduce a Causality-Aware Transformer (CAT) that jointly encodes molecular graph tokens and text instructions while enforcing causal dependencies during generation. Second, we develop a Variational Mean Flow (VMF) framework that generalizes existing flow-based methods by modeling the latent space as a mixture of Gaussians, enhancing expressiveness beyond unimodal priors. VMF enables efficient one-step inference while maintaining strong generation quality and diversity. Extensive experiments on four standard molecular benchmarks demonstrate that our model outperforms state-of-the-art baselines, achieving higher novelty (up to 74.5\%), diversity (up to 70.3\%), and 100\% validity across all datasets. Moreover, VMF requires only one number of function evaluation (NFE) during conditional generation and up to five NFEs for unconditional generation, offering substantial computational efficiency over diffusion-based methods.

Comment: The paper proposes a novel framework for molecular generation using a Causality-Aware Transformer and Variational Mean Flow, which is relevant to foundational research in AI for Science.