Personalized Daily ArXiv Papers 2025-06-04

[gpt-4o]	Prompt	Completion	Total
Token	73570	10382	83952
Cost	$0.18	$0.1	$0.29

Total arXiv papers: 1125

Total scanned papers: 630

Total relevant papers: 44

Table of contents with paper titles:

FORT: Forward-Only Regression Training of Normalizing Flows Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose
VUSA: Virtually Upscaled Systolic Array Architecture to Exploit Unstructured Sparsity in AI Acceleration Authors: Shereef Helal, Alberto Garcia-Ortiz, Lennart Bamberg
Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks Authors: Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, Lenka Zdeborova
Probing Neural Topology of Large Language Models Authors: Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi
HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference Authors: Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, Ouxiang Zhou, Ruibo Liu, Guanbin Xu, Youhui Bai, Bowen Ye, Kun Yuan, Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li
PoLAR: Polar-Decomposed Low-Rank Adapter Representation Authors: Kai Lion, Liang Zhang, Bingcong Li, Niao He
Data Pruning by Information Maximization Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi
Non-Asymptotic Length Generalization Authors: Thomas Chen, Tengyu Ma, Zhiyuan Li
Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs Authors: Ze Yu Zhang, Bolin Ding, Bryan Kian Hsiang Low
Computational Thresholds in Multi-Modal Learning via the Spiked Matrix-Tensor Model Authors: Hugo Tabanelli, Pierre Mergny, Lenka Zdeborova, Florent Krzakala
QKV Projections Require a Fraction of Their Memory Authors: Malik Khalf, Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster
Why Gradients Rapidly Increase Near the End of Training Authors: Aaron Defazio
Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures Authors: Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch
Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds Authors: Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang
Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space Authors: Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan
Compiler Optimization via LLM Reasoning for Efficient Model Serving Authors: Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh
Unlocking Personalized Knowledge in Federated Large Language Model: The Power of Mixture of Experts Authors: Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi
Earley-Driven Dynamic Pruning for Efficient Structured Decoding Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni
Quantifying task-relevant representational similarity using decision variable correlation Authors: Yu (Eric) Qian, Wilson S. Geisler, Xue-Xin Wei
Constrained Sliced Wasserstein Embedding Authors: Navid NaderiAlizadeh, Darian Salehi, Xinran Liu, Soheil Kolouri
Quotient Network -- A Network Similar to ResNet but Learning Quotients Authors: Peng Hui, Jiamuyang Zhao, Changxin Li, Qingzhen Zhu
A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models Authors: YuQing Xie, Tess Smidt
It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs Authors: Jun Wu, Yirong Xiong, Jiangtao Wen, Yuxing Han
Johnny: Structuring Representation Space to Enhance Machine Abstract Reasoning Ability Authors: Ruizhuo Song, Beiming Yuan
Taming LLMs by Scaling Learning Rates with Gradient Grouping Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
Towards Better Generalization and Interpretability in Unsupervised Concept-Based Models Authors: Francesco De Santis, Philippe Bich, Gabriele Ciravegna, Pietro Barbiero, Danilo Giordano, Tania Cerquitelli
Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation Authors: Farzaneh Mahdisoltani, Saeed Mahdisoltani, Roger B. Grosse, David J. Fleet
Random at First, Fast at Last: NTK-Guided Fourier Pre-Processing for Tabular DL Authors: Renat Sergazinov, Jing Wu, Shao-An Yin
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences Authors: Hyojin Bahng, Caroline Chan, Fredo Durand, Phillip Isola
On Universality Classes of Equivariant Networks Authors: Marco Pacini, Gabriele Santin, Bruno Lepri, Shubhendu Trivedi
WeightLoRA: Keep Only Necessary Adapters Authors: Andrey Veprikov, Vladimir Solodkin, Alexander Zyl, Andrey Savchenko, Aleksandr Beznosikov
Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability Authors: Yarden Bakish, Itamar Zimerman, Hila Chefer, Lior Wolf
Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN Authors: Wei Huang, Hanchen Wang, Dong Wen, Shaozhen Ma, Wenjie Zhang, Xuemin Lin
Protein Inverse Folding From Structure Feedback Authors: Junde Xu, Zijun Gao, Xinyi Zhou, Jie Hu, Xingyi Cheng, Le Song, Guangyong Chen, Pheng-Ann Heng, Jiezhong Qiu
Less is More: Local Intrinsic Dimensions of Contextual Language Models Authors: Benjamin Matthias Ruppik, Julius von Rohrscheidt, Carel van Niekerk, Michael Heck, Renato Vukovic, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Bastian Rieck, Marcus Zibrowius, Milica Gašić
FlexiSAGA: A Flexible Systolic Array GEMM Accelerator for Sparse and Dense Processing Authors: Mika Markus Müller, Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Oliver Bringmann
From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models Authors: Asım Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani
Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis Authors: Qi Chen, Jierui Zhu, Florian Shkurti
StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs Authors: Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li
From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models Authors: Tianqin Li, Ziqi Wen, Leiran Song, Jun Liu, Zhi Jing, Tai Sing Lee
Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen
LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning Authors: Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu
Sheaves Reloaded: A Directional Awakening Authors: Stefano Fiorini, Hakan Aktas, Iulia Duta, Stefano Coniglio, Pietro Morerio, Alessio Del Bue, Pietro Liò
Not All Tokens Are Meant to Be Forgotten Authors: Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Douglas Zytko, Prashant Khanduri, Dongxiao Zhu

1. FORT: Forward-Only Regression Training of Normalizing Flows

ArXiv ID: 2506.01158

Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose

Abstract: Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to neural dynamical systems that encompass modern large-scale diffusion and flow matching models. Despite the scalability of training, the generation of high-quality samples and their corresponding likelihood under the model requires expensive numerical simulation -- inhibiting adoption in numerous scientific applications such as equilibrium sampling of molecular systems. In this paper, we revisit classical normalizing flows as one-step generative models with exact likelihoods and propose a novel, scalable training objective that does not require computing the expensive change of variable formula used in conventional maximum likelihood training. We propose Forward-Only Regression Training (FORT), a simple $\ell_2$-regression objective that maps prior samples under our flow to specifically chosen targets. We demonstrate that FORT supports a wide class of targets, such as optimal transport targets and targets from pre-trained continuous-time normalizing flows (CNF). We further demonstrate that by using CNF targets, our one-step flows allow for larger-scale training that exceeds the performance and stability of maximum likelihood training, while unlocking a broader class of architectures that were previously challenging to train. Empirically, we elucidate that our trained flows can perform equilibrium conformation sampling in Cartesian coordinates of alanine dipeptide, alanine tripeptide, and alanine tetrapeptide.

Comment: Author match
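
The regression objective described in the abstract can be written schematically as below. This is only a sketch: the symbols $f_\theta$ (the one-step flow), $p_0$ (the prior), and $T$ (the map assigning each prior sample its chosen target, for example an optimal transport pairing or the output of a pre-trained CNF) are notation assumed here for illustration, not taken from the paper.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_0 \sim p_0}
    \left[ \left\| f_\theta(x_0) - T(x_0) \right\|_2^2 \right]
```

Written this way, evaluating the loss needs only a forward pass of $f_\theta$ and no Jacobian determinant, which is the contrast with maximum likelihood training that the abstract draws.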

2. VUSA: Virtually Upscaled Systolic Array Architecture to Exploit Unstructured Sparsity in AI Acceleration

ArXiv ID: 2506.01166

Authors: Shereef Helal, Alberto Garcia-Ortiz, Lennart Bamberg

Abstract: Leveraging high degrees of unstructured sparsity is a promising approach to enhance the efficiency of deep neural network (DNN) accelerators, which is particularly important for emerging Edge-AI applications. We introduce VUSA, a systolic-array architecture that virtually grows based on the present sparsity to perform larger matrix multiplications with the same number of physical multiply-accumulate (MAC) units. The proposed architecture achieves savings of 37% and 68% in area and power efficiency, respectively, at the same peak performance, compared to a baseline systolic array architecture in a commercial 16-nm technology. Still, the proposed architecture supports acceleration for any DNN with any sparsity, even no sparsity at all. Thus, the proposed architecture is application-independent, making it viable for general-purpose AI acceleration.

Comment: The paper introduces a novel systolic-array architecture that exploits unstructured sparsity, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

3. Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks

ArXiv ID: 2506.02651

Authors: Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, Lenka Zdeborova

Abstract: We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases: escape from uninformative initialization and alignment with the target subspace, and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.

Comment: The paper provides theoretical insights into the dynamics of SGD in sequence models and attention networks, relevant to representation learning and model architecture.

Relevance: 9 Novelty: 8

4. Probing Neural Topology of Large Language Models

ArXiv ID: 2506.01042

Authors: Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi

Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at https://github.com/DavyMorgan/llm-graph-probing.

Comment: The paper introduces a method for uncovering the functional connectivity topology of LLM neurons, which aligns with the interest in understanding LLM behavior and interpretability.

Relevance: 9 Novelty: 8

5. HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference

ArXiv ID: 2506.02572

Authors: Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, Ouxiang Zhou, Ruibo Liu, Guanbin Xu, Youhui Bai, Bowen Ye, Kun Yuan, Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li

Abstract: Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-$k$ attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggle to strike a balance between efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-$k$ Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the top-$k$ attention process. Unlike existing top-$k$ attention methods, which seek an absolute estimate of the qk score, typically at great cost, HATA maps queries and keys into binary hash codes and obtains the relative qk score order at low cost, which is sufficient for realizing top-$k$ attention. Extensive experiments demonstrate that HATA achieves up to 7.2$\times$ speedup compared to vanilla full attention while maintaining model accuracy. In addition, HATA outperforms the state-of-the-art top-$k$ attention methods in both accuracy and efficiency across multiple mainstream LLM models and diverse tasks. HATA is open source at https://github.com/gpzlx1/HATA.

Comment: The paper introduces a novel attention mechanism for LLMs, which is relevant to foundational research in model architecture and efficiency.

Relevance: 9 Novelty: 8
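
The idea of ranking keys by binary hash codes instead of exact qk scores can be illustrated with a minimal sketch. It is not the HATA implementation: the paper learns its hash functions, whereas this example swaps in random hyperplanes (sign random projections), and all function names are hypothetical.

```python
import random

def make_hash_planes(dim, n_bits, seed=0):
    """Random hyperplanes standing in for learned hash functions (an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(vec, planes):
    """Pack the sign bit of each projection into one integer code."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def topk_by_hamming(query, keys, planes, k):
    """Return indices of the k keys whose codes are closest to the query's code."""
    q_code = hash_code(query, planes)
    dists = []
    for i, key in enumerate(keys):
        hamming = bin(q_code ^ hash_code(key, planes)).count("1")
        dists.append((hamming, i))
    dists.sort()  # small Hamming distance is used as a proxy for a large qk score
    return [i for _, i in dists[:k]]

# Usage: select candidate keys, then run exact attention only on those keys.
planes = make_hash_planes(dim=4, n_bits=16, seed=1)
query = [1.0, 2.0, -0.5, 0.3]
keys = [[-x for x in query], query, [0.9, 2.1, -0.4, 0.2]]
print(topk_by_hamming(query, keys, planes, k=2))
```

Only the relative order of the Hamming distances matters here, which mirrors the abstract's point that a relative score order is enough for top-$k$ selection.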

6. PoLAR: Polar-Decomposed Low-Rank Adapter Representation

ArXiv ID: 2506.03133

Authors: Kai Lion, Liang Zhang, Bingcong Li, Niao He

Abstract: We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading fine-tuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.

Comment: The paper proposes a novel low-rank adaptation method for large-scale models, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
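
For readers unfamiliar with the terms in the abstract, the two quantities can be written out as follows. The stable rank definition is the standard one; the symbols $X$, $\Theta$, $Y$, $m$, $n$, and $r$ for the factorized update are notation assumed here for illustration and may differ from the paper's.

```latex
% Stable rank of a matrix A; it never exceeds the algebraic rank.
\operatorname{sr}(A) = \frac{\|A\|_F^2}{\|A\|_2^2} \le \operatorname{rank}(A)

% Sketch of a polar-style low-rank update: two direction matrices with
% orthonormal columns (points on Stiefel manifolds) and a free scale matrix.
\Delta W = X \, \Theta \, Y^\top, \qquad
X^\top X = I_r, \quad Y^\top Y = I_r, \quad
X \in \mathbb{R}^{m \times r}, \; Y \in \mathbb{R}^{n \times r}, \;
\Theta \in \mathbb{R}^{r \times r}
```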

7. Data Pruning by Information Maximization

ArXiv ID: 2506.01701

Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

Abstract: In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.

Comment: The paper presents a novel data pruning method, InfoMax, which is relevant to model compression through coreset selection and sparsification techniques.

Relevance: 9 Novelty: 8
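
The objective described in the abstract (total importance minus pairwise redundancy) can be sketched on a toy problem. Two things here are assumptions: the trade-off weight `alpha` is introduced only for illustration, and the solver is a simple greedy heuristic swapped in for readability, not the gradient-based solver the paper proposes.

```python
def objective(selected, scores, similarity, alpha=1.0):
    """Sum of importance scores minus alpha times pairwise similarity in the set."""
    info = sum(scores[i] for i in selected)
    redundancy = sum(
        similarity[i][j]
        for pos, i in enumerate(selected)
        for j in selected[pos + 1:]
    )
    return info - alpha * redundancy

def greedy_coreset(scores, similarity, budget, alpha=1.0):
    """Greedily add the sample with the largest marginal gain in the objective."""
    selected = []
    remaining = set(range(len(scores)))
    penalty = [0.0] * len(scores)  # similarity of each sample to the selected set
    while remaining and len(selected) < budget:
        best = max(sorted(remaining), key=lambda i: scores[i] - alpha * penalty[i])
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            penalty[i] += similarity[i][best]
    return selected

# Samples 0 and 1 are near-duplicates; sample 2 is less important but distinct.
scores = [1.0, 0.9, 0.5]
similarity = [
    [1.0, 0.95, 0.0],
    [0.95, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(greedy_coreset(scores, similarity, budget=2))  # → [0, 2]
```

The toy example shows the intended effect: the second-highest-scoring sample is skipped because it is nearly redundant with the first.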

8. Non-Asymptotic Length Generalization

ArXiv ID: 2506.03085

Authors: Thomas Chen, Tengyu Ma, Zhiyuan Li

Abstract: Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of ground-truth function under some given complexity measure. We refer to this minimum input length to length generalize as length complexity. We show the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is $2n - 2$ where $n$ is the number of states of the ground-truth automaton. Our main results are upper bounds of length complexity for a subset of a transformer-related function class called C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is $O(T^2)$ when the ground-truth function has precision $T$, and that the length complexity of 2-layer C-RASP functions is $O(T^{O(K)})$ when the ground-truth function has precision $T$ and $K$ heads.

Comment: The paper provides a theoretical framework for length generalization, which is relevant to emerging trends in foundational research.

Relevance: 9 Novelty: 8
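
The $2n - 2$ figure for Deterministic Finite Automata can be related to a classical automata-theory fact: two inequivalent DFAs with at most $n$ states each must disagree on some string of length at most $2n - 2$. The sketch below checks this by brute force on two hand-built 2-state automata. The dictionary encoding and function names are hypothetical, and the link to the paper's definition of length complexity is an interpretation, not a statement from the paper.

```python
from itertools import product

def accepts(dfa, word):
    """Run the DFA on the word and report whether it ends in an accepting state."""
    state = dfa["start"]
    for symbol in word:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]

def first_disagreement(dfa_a, dfa_b, alphabet, max_len):
    """Return the first string (shortest first) on which the DFAs differ, or None."""
    for length in range(max_len + 1):
        for word in product(alphabet, repeat=length):
            if accepts(dfa_a, word) != accepts(dfa_b, word):
                return "".join(word)
    return None

# Accepts strings with an even number of 'a' symbols.
even_a = {
    "start": 0,
    "accepting": {0},
    "delta": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
}
# Accepts strings that do not end in 'a'.
no_trailing_a = {
    "start": 0,
    "accepting": {0},
    "delta": {(0, "a"): 1, (1, "a"): 1, (0, "b"): 0, (1, "b"): 0},
}

n = 2  # number of states in each automaton
print(first_disagreement(even_a, no_trailing_a, "ab", max_len=2 * n - 2))  # → aa
```

In this example the two automata agree on every string of length at most 1 and first differ at length 2, which equals $2n - 2$ for $n = 2$.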

9. Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs

ArXiv ID: 2506.02965

Authors: Ze Yu Zhang, Bolin Ding, Bryan Kian Hsiang Low

Abstract: Mixture-of-Experts (MoE) has been gaining popularity due to its successful adaptation to large language models (LLMs). In this work, we introduce Privacy-preserving Collaborative Mixture-of-Experts (PC-MoE), which leverages the sparsity of the MoE architecture for memory-efficient decentralized collaborative LLM training, enabling multiple parties with limited GPU-memory and data resources to collectively train more capable LLMs than they could achieve individually. At the same time, this approach protects training data privacy of each participant by keeping training data, as well as parts of the forward pass signal and gradients locally within each party. By design, PC-MoE synergistically combines the strengths of distributed computation with strong confidentiality assurances. Unlike most privacy-preserving schemes, which pay for confidentiality with lower task accuracy, our framework breaks that trade-off: across seven popular LLM benchmarks, it almost matches (and sometimes exceeds) the performance and convergence rate of a fully centralized model, enjoys near 70% peak GPU RAM reduction, while being fully robust against reconstruction attacks.

Comment: The paper introduces a privacy-preserving collaborative training framework for MoE LLMs, which is relevant to model architecture and LLMs.

Relevance: 9 Novelty: 8

10. Computational Thresholds in Multi-Modal Learning via the Spiked Matrix-Tensor Model

ArXiv ID: 2506.02664

Authors: Hugo Tabanelli, Pierre Mergny, Lenka Zdeborova, Florent Krzakala

Abstract: We study the recovery of multiple high-dimensional signals from two noisy, correlated modalities: a spiked matrix and a spiked tensor sharing a common low-rank structure. This setting generalizes classical spiked matrix and tensor models, unveiling intricate interactions between inference channels and surprising algorithmic behaviors. Notably, while the spiked tensor model is typically intractable at low signal-to-noise ratios, its correlation with the matrix enables efficient recovery via Bayesian Approximate Message Passing, inducing staircase-like phase transitions reminiscent of neural network phenomena. In contrast, empirical risk minimization for joint learning fails: the tensor component obstructs effective matrix recovery, and joint optimization significantly degrades performance, highlighting the limitations of naive multi-modal learning. We show that a simple Sequential Curriculum Learning strategy (first recovering the matrix, then leveraging it to guide tensor recovery) resolves this bottleneck and achieves optimal weak recovery thresholds. This strategy, implementable with spectral methods, emphasizes the critical role of structural correlation and learning order in multi-modal high-dimensional inference.

Comment: The paper explores a novel approach to multi-modal learning using a spiked matrix-tensor model, which provides insights into training dynamics and inference in high-dimensional settings.

Relevance: 9 Novelty: 8

11. QKV Projections Require a Fraction of Their Memory

ArXiv ID: 2506.02939

Authors: Malik Khalf, Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster

Abstract: The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training. While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the $Q$, $K$, and $V$ tensors from the input $x$ is often overlooked. To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that reduces memory consumption of the $Q,K,V$ projections in attention layers by a factor of up to $\times 512$, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.

Comment: The paper proposes a novel tensor compression technique for QKV projections in attention layers, relevant to model compression.

Relevance: 9 Novelty: 8

12. Why Gradients Rapidly Increase Near the End of Training

ArXiv ID: 2506.02285

Authors: Aaron Defazio

Abstract: During long-duration Large Language Model (LLM) training runs the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.

Comment: The paper provides insights into the training dynamics of LLMs, which aligns with foundational research in training dynamics.

Relevance: 9 Novelty: 7
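
One standard way to see how weight decay, normalization, and the learning rate can interact is the steady-state argument below. It is a sketch for plain SGD with coupled weight decay and is not necessarily the note's own derivation; the symbols $w_t$ (weights feeding a normalization layer), $g_t$ (their gradient), $\gamma_t$ (learning rate), and $\lambda$ (weight-decay coefficient) are introduced here for illustration.

```latex
\begin{align}
% Scale invariance of a normalized layer gives \langle g_t, w_t \rangle = 0.
w_{t+1} &= (1 - \gamma_t \lambda)\, w_t - \gamma_t\, g_t \\
\|w_{t+1}\|^2 &= (1 - \gamma_t \lambda)^2 \|w_t\|^2 + \gamma_t^2 \|g_t\|^2 \\
% Steady state: \|w_{t+1}\| = \|w_t\|.
\frac{\|g_t\|^2}{\|w_t\|^2} &= \frac{2\lambda}{\gamma_t} - \lambda^2
  \approx \frac{2\lambda}{\gamma_t}
\end{align}
```

Under this approximation, a schedule that drives $\gamma_t$ toward zero pushes the ratio of gradient norm to weight norm upward, which is consistent with the late-training growth the abstract describes.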

13. Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

ArXiv ID: 2506.01197

Authors: Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch

Abstract: Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.

Comment: The paper introduces a modified sparse autoencoder architecture that incorporates hierarchical semantics, which is relevant to representation learning and sparse methods.

Relevance: 9 Novelty: 7

14. Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds

ArXiv ID: 2506.03100

Authors: Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang

Abstract: Retrieval-augmented generation (RAG) has seen many empirical successes in recent years by aiding the LLM with external knowledge. However, its theoretical aspect has remained mostly unexplored. In this paper, we propose the first finite-sample generalization bound for RAG in in-context linear regression and derive an exact bias-variance tradeoff. Our framework views the retrieved texts as query-dependent noisy in-context examples and recovers the classical in-context learning (ICL) and standard RAG as the limit cases. Our analysis suggests that an intrinsic ceiling on generalization error exists on RAG as opposed to the ICL. Furthermore, our framework is able to model retrieval both from the training data and from external corpora by introducing uniform and non-uniform RAG noise. In line with our theory, we show the sample efficiency of ICL and RAG empirically with experiments on common QA benchmarks, such as Natural Questions and TriviaQA.

Comment: The paper provides a theoretical framework for retrieval-augmented generation, offering insights into LLM behavior, which aligns with the core topics.

Relevance: 8 Novelty: 8

15. Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space

ArXiv ID: 2506.00771

Authors: Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan

Abstract: Medicinal chemists often optimize drugs by considering their 3D structures and designing structurally distinct molecules that retain key features, such as shapes, pharmacophores, or chemical properties. Previous deep learning approaches address this through supervised tasks like molecule inpainting or property-guided optimization. In this work, we propose a flexible zero-shot molecule manipulation method by navigating in a shared latent space of 3D molecules. We introduce a Variational AutoEncoder (VAE) for 3D molecules, named MolFLAE, which learns a fixed-dimensional, SE(3)-equivariant latent space independent of atom counts. MolFLAE encodes 3D molecules using an SE(3)-equivariant neural network into a fixed number of latent nodes, distinguished by learned embeddings. The latent space is regularized, and molecular structures are reconstructed via a Bayesian Flow Network (BFN) conditioned on the encoder's latent output. MolFLAE achieves competitive performance on standard unconditional 3D molecule generation benchmarks. Moreover, the latent space of MolFLAE enables zero-shot molecule manipulation, including atom number editing, structure reconstruction, and coordinated latent interpolation for both structure and properties. We further demonstrate our approach on a drug optimization task for the human glucocorticoid receptor, generating molecules with improved hydrophilicity while preserving key interactions, under computational evaluations. These results highlight the flexibility, robustness, and real-world utility of our method, opening new avenues for molecule editing and optimization.

Comment: The paper introduces a VAE for 3D molecules with SE(3)-equivariant latent space, relevant to AI for science and representation learning.

Relevance: 8 Novelty: 8

16. Compiler Optimization via LLM Reasoning for Efficient Model Serving

ArXiv ID: 2506.01374

Authors: Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh

Abstract: While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models continues to be a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads due to the exponentially large and highly interdependent space of possible transformations. Although existing stochastic search techniques can be effective, they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We set out to investigate the research question of whether reasoning with large language models (LLMs), without any retraining, can leverage the context-aware decision space of compiler optimization to significantly improve sample efficiency. To that end, we introduce a novel compilation framework (dubbed REASONING COMPILER) that formulates optimization as a sequential, context-aware decision process, guided by a large language model and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-aware transformations that reflect the current program state and accumulated performance feedback. Monte Carlo tree search (MCTS) incorporates the LLM-generated proposals to balance exploration and exploitation, facilitating structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.

Comment: The paper explores compiler optimization using LLM reasoning, which is relevant to large language models and efficiency improvements.