Personalized Daily ArXiv Papers 2025-06-03

[gpt-4o]	Prompt	Completion	Total
Token	87195	11992	99187
Cost	$0.22	$0.12	$0.34

Total arXiv papers: 1424

Total scanned papers: 816

Total relevant papers: 56

Table of contents with paper titles:

FORT: Forward-Only Regression Training of Normalizing Flows Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose
Esoteric Language Models Authors: Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat
MLorc: Momentum Low-rank Compression for Large Language Model Adaptation Authors: Wei Shen, Yaxiang Zhang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen
SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model Authors: Zhao Yang, Jiwei Zhu, Bing Su
Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures Authors: Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch
Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism Authors: Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long
FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts Authors: Xinyi Wang, Lirong Gao, Haobo Wang, Yiming Zhang, Junbo Zhao
Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer Authors: Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li
Model Reprogramming Demystified: A Neural Tangent Kernel Perspective Authors: Ming-Yu Chung, Jiashuo Fan, Hancheng Ye, Qinsi Wang, Wei-Chen Shen, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo
Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers Authors: Kazuki Irie, Morris Yau, Samuel J. Gershman
It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs Authors: Jun Wu, Yirong Xiong, Jiangtao Wen, Yuxing Han
LLM Cannot Discover Causality, and Should Be Restricted to Non-Decisional Support in Causal Discovery Authors: Xingyu Wu, Kui Yu, Jibin Wu, Kay Chen Tan
Slow Feature Analysis as Variational Inference Objective Authors: Merlin Schüler, Laurenz Wiskott
Unified Scaling Laws for Compressed Representations Authors: Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Mher Safaryan, Dan Alistarh
Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization Authors: Wojciech Masarczyk, Mateusz Ostaszewski, Tin Sum Cheng, Tomasz Trzciński, Aurelien Lucchi, Razvan Pascanu
Uni-LoRA: One Vector is All You Need Authors: Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji
Ultra-Quantisation: Efficient Embedding Search via 1.58-bit Encodings Authors: Richard Connor, Alan Dearle, Ben Claydon
Latent Structured Hopfield Network for Semantic Association and Retrieval Authors: Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng
Probing Neural Topology of Large Language Models Authors: Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi
Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning Authors: Ryotaro Kawata, Kohsei Matsutani, Yuri Kinoshita, Naoki Nishikawa, Taiji Suzuki
On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning Authors: Magdalena Proszewska, Nikolay Malkin, N. Siddharth
Tug-of-war between idiom's figurative and literal meanings in LLMs Authors: Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg
Earley-Driven Dynamic Pruning for Efficient Structured Decoding Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni
Overfitting has a limitation: a model-independent generalization error bound based on Rényi entropy Authors: Atsushi Suzuki
Unlocking the Power of Rehearsal in Continual Learning: A Theoretical Perspective Authors: Junze Deng, Qinhang Wu, Peizhong Ju, Sen Lin, Yingbin Liang, Ness Shroff
zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network Authors: Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kun Yuan, Binhang Yuan
Learning DNF through Generalized Fourier Representations Authors: Mohsen Heidari, Roni Khardon
PMNO: A novel physics guided multi-step neural operator predictor for partial differential equations Authors: Jin Song, Kenji Kawaguchi, Zhenya Yan
Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis Authors: Qi Chen, Jierui Zhu, Florian Shkurti
Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space Authors: Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan
Existing Large Language Model Unlearning Evaluations Are Inconclusive Authors: Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter
Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs Authors: Xue Xian Zheng, Weihang Liu, Xin Lou, Stefan Vlaski, Tareq Al-Naffouri
Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness Authors: Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, Volkan Cevher
Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts Authors: Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Self-supervised Latent Space Optimization with Nebula Variational Coding Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari
Quotient Network - A Network Similar to ResNet but Learning Quotients Authors: Peng Hui, Jiamuyang Zhao, Changxin Li, Qingzhen Zhu
Connecting Neural Models Latent Geometries with Relative Geodesic Representations Authors: Hanlin Yu, Berfin Inal, Georgios Arvanitidis, Soren Hauberg, Francesco Locatello, Marco Fumero
LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning Authors: Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu
Trade-offs in Data Memorization via Strong Data Processing Inequalities Authors: Vitaly Feldman, Guy Kornowski, Xin Lyu
Flexible Mixed Precision Quantization for Learned Image Compression Authors: Md Adnan Faisal Hossain, Zhihao Duan, Fengqing Zhu
MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models Authors: Srivathsan Badrinarayanan, Rishikesh Magar, Akshay Antony, Radheesh Sharma Meda, Amir Barati Farimani
Rapid yet accurate Tile-circuit and device modeling for Analog In-Memory Computing Authors: J. Luquin, C. Mackin, S. Ambrogio, A. Chen, F. Baldi, G. Miralles, M. J. Rasch, J. Büchel, M. Lalwani, W. Ponghiran, P. Solomon, H. Tsai, G. W. Burr, P. Narayanan
Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods Authors: Yifan Hao, Xingyuan Pan, Hanning Zhang, Chenlu Ye, Rui Pan, Tong Zhang
Unlocking Personalized Knowledge in Federated Large Language Model: The Power of Mixture of Experts Authors: Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi
Mamba Drafters for Speculative Decoding Authors: Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati
VUSA: Virtually Upscaled Systolic Array Architecture to Exploit Unstructured Sparsity in AI Acceleration Authors: Shereef Helal, Alberto Garcia-Ortiz, Lennart Bamberg
Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors Authors: Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas
Control-R: Towards controllable test-time scaling Authors: Di Zhang, Weida Wang, Junxian Li, Xunzhi Wang, Jiatong Li, Jianbo Wu, Jingdi Lei, Haonan He, Peng Ye, Shufei Zhang, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou
Data Pruning by Information Maximization Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi
Boosting Bot Detection via Heterophily-Aware Representation Learning and Prototype-Guided Cluster Discovery Authors: Buyun He, Xiaorui Jiang, Qi Wu, Hao Liu, Yingguang Yang, Yong Liao
From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models Authors: Asım Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani
Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval Authors: Seongwan Park, Taeklim Kim, Youngjoong Ko
Concept-Centric Token Interpretation for Vector-Quantized Generative Models Authors: Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu
Weight-Space Linear Recurrent Neural Networks Authors: Roussel Desmond Nzoyem, Nawid Keshtmand, Idriss Tsayem, David A. W. Barton, Tom Deakin
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers Authors: Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

1. FORT: Forward-Only Regression Training of Normalizing Flows

ArXiv ID: 2506.01158

Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose

Abstract: Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to neural dynamical systems that encompass modern large-scale diffusion and flow matching models. Despite the scalability of training, the generation of high-quality samples and their corresponding likelihood under the model requires expensive numerical simulation -- inhibiting adoption in numerous scientific applications such as equilibrium sampling of molecular systems. In this paper, we revisit classical normalizing flows as one-step generative models with exact likelihoods and propose a novel, scalable training objective that does not require computing the expensive change of variable formula used in conventional maximum likelihood training. We propose Forward-Only Regression Training (FORT), a simple $\ell_2$-regression objective that maps prior samples under our flow to specifically chosen targets. We demonstrate that FORT supports a wide class of targets, such as optimal transport targets and targets from pre-trained continuous-time normalizing flows (CNF). We further demonstrate that by using CNF targets, our one-step flows allow for larger-scale training that exceeds the performance and stability of maximum likelihood training, while unlocking a broader class of architectures that were previously challenging to train. Empirically, we elucidate that our trained flows can perform equilibrium conformation sampling in Cartesian coordinates of alanine dipeptide, alanine tripeptide, and alanine tetrapeptide.

Comment: Author match

2. Esoteric Language Models

ArXiv ID: 2506.01928

Authors: Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat

Abstract: Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the first to introduce KV caching for MDMs while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to 65x faster inference than standard MDMs and 4x faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/Eso-LMs

Comment: The paper introduces a new family of models, Eso-LMs, which combines AR and MDM paradigms and introduces KV caching for MDMs, aligning with foundational research in LLM architecture.

Relevance: 9 Novelty: 8

3. MLorc: Momentum Low-rank Compression for Large Language Model Adaptation

ArXiv ID: 2506.01897

Authors: Wei Shen, Yaxiang Zhang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen

Abstract: With increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). By directly compressing and reconstructing momentum rather than gradients, MLorc avoids imposing a fixed-rank constraint on weight update matrices and better preserves the training dynamics of full-parameter fine-tuning, in contrast to existing low-rank approaches such as LoRA and GaLore. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning with a small rank (e.g., $r=4$), and generalizes well across different optimizers -- all while not compromising time or memory efficiency. Furthermore, we provide a theoretical guarantee for its convergence under reasonable assumptions.

Comment: The paper proposes a novel memory-efficient training paradigm for LLMs using momentum low-rank compression, aligning with foundational research in model compression.

Relevance: 9 Novelty: 8

4. SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model

ArXiv ID: 2506.01833

Authors: Zhao Yang, Jiwei Zhu, Bing Su

Abstract: Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our Species-Profile Adaptive Collaborative Experts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at https://github.com/ZhuJiwei111/SPACE.

Comment: The paper introduces a Mixture of Experts (MoE) model for DNA representation learning, which is relevant to both representation learning and model architecture.

Relevance: 9 Novelty: 8

5. Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

ArXiv ID: 2506.01197

Authors: Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch

Abstract: Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.

Comment: The paper introduces a modified sparse autoencoder architecture incorporating hierarchical semantics, relevant to representation learning and model architecture.

Relevance: 9 Novelty: 8

6. Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

ArXiv ID: 2506.01260

Authors: Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long

Abstract: Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel training, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model-parallel training requires compressing activations and activation gradients as they propagate through layers, accumulating compression errors. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation and negligible memory/compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace to confine the activations and gradients, allowing full reconstruction in subsequent layers. Our method achieves up to 100x improvement in communication efficiency and enables training billion-parameter-scale models over low-end GPUs connected via consumer-grade internet speeds as low as 80Mbps, matching the convergence of centralized datacenter systems with 100Gbps connections under model parallelism.

Comment: The paper proposes a novel compression algorithm for decentralized training, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

7. FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts

ArXiv ID: 2506.00495

Authors: Xinyi Wang, Lirong Gao, Haobo Wang, Yiming Zhang, Junbo Zhao

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a widely adopted strategy for adapting pre-trained Large Language Models (LLMs) to downstream tasks, significantly reducing memory and computational costs. However, most existing PEFT techniques uniformly deploy LoRA adapters across all layers, disregarding the intrinsic heterogeneity of layer contributions and task-specific rank requirements. This uniform paradigm leads to redundant parameter allocation and suboptimal adaptation efficiency. To address these limitations, we propose FLoE, a novel PEFT framework that introduces two key innovations: (i) a Fisher information-guided importance scoring mechanism to dynamically identify task-critical transformer layers for MoE-based low-rank adaptation, enabling sparse adapter deployment; and (ii) a Bayesian optimization-driven rank allocator that automatically determines optimal LoRA ranks on specific datasets without exhaustive grid search. Extensive experiments across diverse LLMs and benchmarks reveal that FLoE achieves impressive efficiency-accuracy trade-offs, making FLoE particularly advantageous in resource-constrained environments that necessitate rapid adaptation.

Comment: The paper proposes a novel PEFT framework for LLMs using MoE-based low-rank adaptation, relevant to model architecture and compression.

Relevance: 9 Novelty: 8

8. Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

ArXiv ID: 2506.01115

Authors: Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li

Abstract: The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT -- the Mixing Transformer -- a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads -- a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.
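Illustration (not from the paper): the abstract describes MixiT as attention whose coefficients are random and fixed at initialization, so the mixing pattern cannot depend on the input. The sketch below shows that idea in plain Python. The causal mask, the softmax normalization of the random scores, and the names `fixed_random_mixing` and `mix` are assumptions made for this example. The paper's construction may differ, and a real model would also apply learned value and output projections.

```python
import math
import random


def fixed_random_mixing(n, seed=0):
    """Build an n x n causal mixing matrix once, from random scores only."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        # Random scores for positions 0..i (assumed causal masking).
        scores = [rng.gauss(0.0, 1.0) for _ in range(i + 1)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Pad future positions with zeros so every row has length n.
        rows.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return rows


def mix(weights, values):
    """Output token i is a fixed weighted average of value vectors 0..i."""
    dim = len(values[0])
    return [
        [sum(w * v[k] for w, v in zip(row, values)) for k in range(dim)]
        for row in weights
    ]


# The same weights are reused for every input sequence: nothing in
# `weights` is computed from x_a or x_b.
weights = fixed_random_mixing(4)
x_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
x_b = [[5.0, 5.0], [0.0, 0.0], [3.0, 1.0], [1.0, 2.0]]
out_a = mix(weights, x_a)
out_b = mix(weights, x_b)
```

Standard attention would recompute the weights from queries and keys for each input. Here they are a constant of the model, which is the property the abstract contrasts with input-dependent attention on retrieval tasks.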

Comment: The paper analyzes the roles of attention and MLP in Transformers, providing insights into model architecture.

Relevance: 9 Novelty: 8

9. Model Reprogramming Demystified: A Neural Tangent Kernel Perspective

ArXiv ID: 2506.00620

Authors: Ming-Yu Chung, Jiashuo Fan, Hancheng Ye, Qinsi Wang, Wei-Chen Shen, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo

Abstract: Model Reprogramming (MR) is a resource-efficient framework that adapts large pre-trained models to new tasks with minimal additional parameters and data, offering a promising solution to the challenges of training large models for diverse tasks. Despite its empirical success across various domains such as computer vision and time-series forecasting, the theoretical foundations of MR remain underexplored. In this paper, we present a comprehensive theoretical analysis of MR through the lens of the Neural Tangent Kernel (NTK) framework. We demonstrate that the success of MR is governed by the eigenvalue spectrum of the NTK matrix on the target dataset and establish the critical role of the source model's effectiveness in determining reprogramming outcomes. Our contributions include a novel theoretical framework for MR, insights into the relationship between source and target models, and extensive experiments validating our findings.

Comment: The paper provides a theoretical analysis of Model Reprogramming using the Neural Tangent Kernel framework, which aligns with representation learning insights.

Relevance: 9 Novelty: 8

10. Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers

ArXiv ID: 2506.00744

Authors: Kazuki Irie, Morris Yau, Samuel J. Gershman

Abstract: We develop hybrid memory architectures for general-purpose sequence processing neural networks, that combine key-value memory using softmax attention (KV-memory) with dynamic synaptic memory through fast-weight programming (FW-memory) -- the core principles of quadratic and linear transformers, respectively. These two memory systems have complementary but individually limited properties: KV-memory offers precise retrieval but is constrained by quadratic complexity in sequence length, while FW-memory supports arbitrarily long sequences and enables more expressive computation but sacrifices precise recall. We propose and compare three methods to blend these two systems into a single memory system to leverage the strengths of both. We conduct experiments on general language modeling and retrieval tasks by training 340M- and 1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks designed to precisely illustrate the benefits of certain hybrid methods over others. We also evaluate our hybrid memory systems on reinforcement learning in partially observable environments. Overall, we demonstrate how a well-designed hybrid can overcome the limitations of its individual components, offering new insights into the design principle of neural memory systems.

Comment: The paper discusses hybrid memory architectures in transformers, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8

11. It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs

ArXiv ID: 2506.00486

Authors: Jun Wu, Yirong Xiong, Jiangtao Wen, Yuxing Han

Abstract: Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters are better fit by generalized Gaussian distributions (GGDs). By optimizing GG priors during training, BackSlash can reduce parameters by up to 90% with minimal performance loss. Building on this foundational insight, we propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: (1) a GG-based initialization scheme that aligns with the statistical structure of trained models, resulting in faster convergence and improved accuracy; (2) DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, improving compressibility with minimized degradation in performance; and (3) RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training, enabling low-cost inference without compromising accuracy. Experiments across diverse model architectures show that our framework consistently yields smaller and faster models that match or outperform standard training baselines. By grounding LLM development in principled statistical modeling, this work forges a new path toward efficient, scalable, and hardware-aware AI systems. The code is available on our project page: https://huggingface.co/spaces/shifeng3711/gg_prior.

Comment: The paper proposes a framework for LLM optimization using generalized Gaussian priors, which aligns with model compression and efficiency.

Relevance: 9 Novelty: 8

12. LLM Cannot Discover Causality, and Should Be Restricted to Non-Decisional Support in Causal Discovery

ArXiv ID: 2506.00844

Authors: Xingyu Wu, Kui Yu, Jibin Wu, Kay Chen Tan

Abstract: This paper critically re-evaluates LLMs' role in causal discovery and argues against their direct involvement in determining causal relationships. We demonstrate that LLMs' autoregressive, correlation-driven modeling inherently lacks the theoretical grounding for causal reasoning and introduces unreliability when used as priors in causal discovery algorithms. Through empirical studies, we expose the limitations of existing LLM-based methods and reveal that deliberate prompt engineering (e.g., injecting ground-truth knowledge) could overstate their performance, helping to explain the consistently favorable results reported in much of the current literature. Based on these findings, we strictly confine LLMs' role to a non-decisional auxiliary capacity: LLMs should not participate in determining the existence or directionality of causal relationships, but can assist the search process for causal graphs (e.g., LLM-based heuristic search). Experiments across various settings confirm that, by strictly isolating LLMs from causal decision-making, LLM-guided heuristic search can accelerate the convergence and outperform both traditional and LLM-based methods in causal structure learning. We conclude with a call for the community to shift focus from naively applying LLMs to developing specialized models and training methods that respect the core principles of causal discovery.

Comment: The paper provides theoretical insights into the limitations of LLMs in causal discovery, aligning with the criteria for foundational research in LLM behavior.

Relevance: 9 Novelty: 8

13. Slow Feature Analysis as Variational Inference Objective

ArXiv ID: 2506.00580

Authors: Merlin Schüler, Laurenz Wiskott

Abstract: This work presents a novel probabilistic interpretation of Slow Feature Analysis (SFA) through the lens of variational inference. Unlike prior formulations that recover linear SFA from Gaussian state-space models with linear emissions, this approach relaxes the key constraint of linearity. While it does not lead to full equivalence to non-linear SFA, it recasts the classical slowness objective in a variational framework. Specifically, it allows the slowness objective to be interpreted as a regularizer to a reconstruction loss. Furthermore, we argue why, from the perspective of slowness optimization, the reconstruction loss takes on the role of the constraints that ensure informativeness in SFA. We conclude with a discussion of potential new research directions.

Comment: The paper provides a novel probabilistic interpretation of Slow Feature Analysis through variational inference, which aligns with representation learning insights.

Relevance: 9 Novelty: 8

14. Unified Scaling Laws for Compressed Representations

ArXiv ID: 2506.01863

Authors: Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Mher Safaryan, Dan Alistarh

Abstract: Scaling laws have shaped recent advances in machine learning by enabling predictable scaling of model performance based on model size, computation, and data volume. Concurrently, the rise in computational cost for AI has motivated model compression techniques, notably quantization and sparsification, which have emerged to mitigate the steep computational demands associated with large-scale training and inference. This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it is applicable both individually and composably across compression types. Based on this, our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric -- based on the representation's ability to fit random Gaussian data -- which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.

Comment: The paper explores the interplay between scaling laws and compression formats, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

15. Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization

ArXiv ID: 2506.01562

Authors: Wojciech Masarczyk, Mateusz Ostaszewski, Tin Sum Cheng, Tomasz Trzciński, Aurelien Lucchi, Razvan Pascanu

Abstract: The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias - a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logits norm, which is implicitly influenced by hyperparameters or directly modified by softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
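Illustration (not from the paper): the abstract ties the reported bias to the norm of the softmax logits, which the temperature modifies directly. Dividing the logits by a temperature T rescales their norm by 1/T. The sketch below shows only that mechanical effect on the output distribution, using made-up logits and the hypothetical helper names `softmax` and `entropy`. It does not reproduce the rank deficit bias itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # arbitrary example values
for t in (0.5, 1.0, 4.0):
    p = softmax(logits, t)
    print(t, [round(x, 3) for x in p], round(entropy(p), 3))
```

A lower temperature sharpens the distribution toward the largest logit, and a higher temperature flattens it toward uniform, so the entropy printed in the last column grows with T.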

Comment: The paper provides insights into the role of the softmax function in representation learning, which is relevant to understanding training dynamics in neural networks.

Relevance: 9 Novelty: 8
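
The temperature knob mentioned in the abstract above can be illustrated with a minimal sketch. Dividing the logits by a temperature T rescales the logits norm by 1/T before the softmax is applied. The function names and example logits here are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the rank deficit analysis itself.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; used only to show how peaked the output is."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]                   # made-up example logits
sharp = softmax(logits, temperature=0.5)   # larger effective logits norm
base = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=5.0)    # smaller effective logits norm
```

Lower temperatures give a more peaked distribution and higher temperatures a flatter one, which is the sense in which temperature directly modifies the logits norm.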

16. Uni-LoRA: One Vector is All You Need

ArXiv ID: 2506.00799

Authors: Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji

Abstract: Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace $R^d$, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$. Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a "one-vector-only" solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.

Comment: The paper presents Uni-LoRA, a framework for parameter-efficient fine-tuning, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
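
The projection view in the abstract above, where D LoRA parameters are reconstructed as P v from a trainable vector v of length d, can be sketched as follows. This is one simple way to obtain an isometric P (each row has a single nonzero entry, and each column is normalized to unit length). Whether it matches the paper's exact construction is an assumption, and the names `build_projection` and `project` are hypothetical.

```python
import math
import random

def build_projection(D, d, seed=0):
    """Assign each of D coordinates to one of d buckets (assumed construction).

    Row i of P has one nonzero entry scale[i] in column bucket[i].
    Columns have disjoint supports and unit norm, so P^T P = I.
    """
    rng = random.Random(seed)
    bucket = [i % d for i in range(D)]  # no bucket is empty when D >= d
    rng.shuffle(bucket)
    counts = [0] * d
    for j in bucket:
        counts[j] += 1
    scale = [1.0 / math.sqrt(counts[j]) for j in bucket]
    return bucket, scale

def project(v, bucket, scale):
    """theta = P v: expand d trainable values into D LoRA parameters."""
    return [scale[i] * v[bucket[i]] for i in range(len(bucket))]

D, d = 1000, 8
bucket, scale = build_projection(D, d)
v = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25, 3.0, 1.0]  # the single trainable vector
theta = project(v, bucket, scale)                # flattened LoRA parameters
```

Because P is never materialized, the expansion costs O(D) time and only the d entries of v would need gradients.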

17. Ultra-Quantisation: Efficient Embedding Search via 1.58-bit Encodings

ArXiv ID: 2506.00528

Authors: Richard Connor, Alan Dearle, Ben Claydon

Abstract: Many modern search domains comprise high-dimensional vectors of floating point numbers derived from neural networks, in the form of embeddings. Typical embeddings range in size from hundreds to thousands of dimensions, making the size of the embeddings, and the speed of comparison, a significant issue. Quantisation is a class of mechanism which replaces the floating point values with a smaller representation, for example a short integer. This gives an approximation of the embedding space in return for a smaller data representation and a faster comparison function. Here we take this idea almost to its extreme: we show how vectors of arbitrary-precision floating point values can be replaced by vectors whose elements are drawn from the set {-1,0,1}. This yields very significant savings in space and metric evaluation cost, while maintaining a strong correlation for similarity measurements. This is achieved by way of a class of convex polytopes which exist in the high-dimensional space. In this article we give an outline description of these objects, and show how they can be used for the basis of such radical quantisation while maintaining a surprising degree of accuracy.

Comment: The paper introduces a novel quantization method for efficient embedding search, which aligns with model compression and efficiency.

Relevance: 9 Novelty: 8
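
The "1.58-bit" figure in the title above is the information content of one ternary symbol, log2(3). As a minimal sketch, the snippet below ternarizes vectors with a plain magnitude threshold and compares them with an integer dot product. The thresholding rule is a stand-in chosen for illustration; the paper's method is based on convex polytopes, which are not reproduced here.

```python
import math

def ternarize(vec, threshold):
    """Map each component to -1, 0, or 1 by a magnitude threshold (assumed rule)."""
    return [0 if abs(x) < threshold else (1 if x > 0 else -1) for x in vec]

def dot(a, b):
    """Dot product; on ternary vectors this needs only additions and subtractions."""
    return sum(x * y for x, y in zip(a, b))

bits_per_element = math.log2(3)  # about 1.585 bits per ternary element

a = [0.9, -0.1, 0.4, -0.8]       # made-up embeddings
b = [0.7, 0.05, -0.5, -0.6]
qa = ternarize(a, threshold=0.3)
qb = ternarize(b, threshold=0.3)
similarity = dot(qa, qb)
```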

18. Latent Structured Hopfield Network for Semantic Association and Retrieval

ArXiv ID: 2506.01303

Authors: Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng

Abstract: Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms.

Comment: The paper introduces a biologically inspired framework integrating Hopfield networks into an autoencoder architecture, which aligns with representation learning and model architecture.

Relevance: 9 Novelty: 8

19. Probing Neural Topology of Large Language Models

ArXiv ID: 2506.01042

Authors: Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi

Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at https://github.com/DavyMorgan/llm-graph-probing.

Comment: The paper introduces graph probing to uncover functional connectivity in LLMs, providing insights into their internal mechanisms, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8

20. Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning

ArXiv ID: 2506.01656

Authors: Ryotaro Kawata, Kohsei Matsutani, Yuri Kinoshita, Naoki Nishikawa, Taiji Suzuki

Abstract: Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically distributes each input to appropriate experts, has achieved successful results in the field of machine learning. However, theoretical understanding of this architecture is falling behind due to its inherent complexity. In this paper, we theoretically study the sample and runtime complexity of MoE following the stochastic gradient descent (SGD) when learning a regression task with an underlying cluster structure of single index models. On the one hand, we prove that a vanilla neural network fails in detecting such a latent organization as it can only process the problem as a whole. This is intrinsically related to the concept of information exponent which is low for each cluster, but increases when we consider the entire task. On the other hand, we show that a MoE succeeds in dividing this problem into easier subproblems by leveraging the ability of each expert to weakly recover the simpler function corresponding to an individual cluster. To the best of our knowledge, this work is among the first to explore the benefits of the MoE framework by examining its SGD dynamics in the context of nonlinear regression.

Comment: The paper provides a theoretical study on the sample and runtime complexity of MoE, aligning with foundational research in model architecture.

Relevance: 9 Novelty: 7

21. On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

ArXiv ID: 2506.00136

Authors: Magdalena Proszewska, Nikolay Malkin, N. Siddharth

Abstract: Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for tasks such as downstream classification, controllable generation, and interpolation. However, the generative performance of DAs relies heavily on how well the latent variables can be modelled and subsequently sampled from. Better generative modelling is also the primary goal of another class of diffusion models -- those that learn their forward (noising) process. While effective at adjusting the noise process in an input-dependent manner, they must satisfy additional constraints derived from the terminal conditions of the diffusion process. Here, we draw a connection between these two classes of models and show that certain design decisions (latent variable choice, conditioning method, etc.) in the DA framework -- leading to a model we term DMZ -- allow us to obtain the best of both worlds: effective representations as evaluated on downstream tasks, including domain transfer, as well as more efficient modelling and generation with fewer denoising steps compared to standard DMs.

Comment: The paper discusses diffusion autoencoders for efficient generation and representation learning, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 7

22. Tug-of-war between idiom's figurative and literal meanings in LLMs

ArXiv ID: 2506.01723

Authors: Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg

Abstract: Idioms present a unique challenge for language models due to their non-compositional figurative meanings, which often strongly diverge from the idiom's literal interpretation. This duality requires a model to learn to represent and decide between the two meanings in order to interpret an idiom figuratively or literally. In this paper, we employ tools from mechanistic interpretability to trace how a large pretrained causal transformer (LLama3.2-1B-base) deals with this ambiguity. We localize three steps of idiom processing: First, the idiom's figurative meaning is retrieved in early attention and MLP sublayers. We identify specific attention heads which boost the figurative meaning of the idiom while suppressing the idiom's literal interpretation. The model subsequently represents the figurative representation through an intermediate path. Meanwhile, a parallel bypass route forwards the literal interpretation, ensuring that both readings remain available. Overall, our findings provide mechanistic evidence for idiom comprehension in an autoregressive transformer.

Comment: The paper provides mechanistic insights into how LLMs process idioms, which aligns with the interest in theoretical insights into LLM behavior.

Relevance: 9 Novelty: 7

23. Earley-Driven Dynamic Pruning for Efficient Structured Decoding

ArXiv ID: 2506.01151

Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni

Abstract: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose ZapFormat, a novel dynamic pruning strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only consistently maintains high-precision compliant outputs but also achieves significant improvements in inference speed, up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.

Comment: The paper proposes a dynamic pruning strategy for efficient structured decoding, relevant to model compression.

Relevance: 9 Novelty: 7
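
The token logits mask described in the abstract above can be sketched with a toy example. The vocabulary and the `is_valid_prefix` check are hypothetical stand-ins for a real tokenizer and a context-free grammar parser. This is not Formatron's API and it contains no Earley parsing. It only shows why the naive approach tests every vocabulary token at every decoding step.

```python
import math

VOCAB = ["[", "]", "1", ","]  # toy vocabulary (assumption)

def is_valid_prefix(text):
    """Toy stand-in for a grammar check: brackets never close below depth 0."""
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return True

def mask_logits(prefix, logits):
    """Set the logit of every token that would break the prefix check to -inf."""
    return [
        logit if is_valid_prefix(prefix + tok) else -math.inf
        for tok, logit in zip(VOCAB, logits)
    ]

logits = [0.1, 2.0, 0.3, 0.4]          # made-up model scores, one per token
masked_start = mask_logits("", logits)  # "]" is not allowed at the start
masked_open = mask_logits("[", logits)  # "]" is allowed after "["
```

Each call runs the validity check once per vocabulary entry, which is the per-step overhead the abstract says existing engines incur.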

24. Overfitting has a limitation: a model-independent generalization error bound based on Rényi entropy

ArXiv ID: 2506.00182

Authors: Atsushi Suzuki

Abstract: Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization error, which is the impact of overfitting. Understanding generalization error behavior of increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity, failing to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound for generalization error applicable to algorithms whose outputs are determined solely by the data's histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the Rényi entropy of the data-generating distribution, suggesting that a small generalization error can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data, where the performance degradation is attributed to the consequent increase in the data distribution's Rényi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the Rényi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.

Comment: The paper introduces a model-independent generalization error bound based on Rényi entropy, which is relevant to emerging trends in theoretical work.

Relevance: 8 Novelty: 8
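
For reference, the Rényi entropy of order alpha of a discrete distribution p is H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha). The sketch below computes it and shows that mixing a distribution with uniform noise raises the entropy, which is the effect the abstract above ties to degraded generalization. The order alpha = 2 and the example distributions are assumptions for illustration; the abstract does not state which order the bound uses.

```python
import math

def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
# Mix the skewed distribution half-and-half with uniform noise.
noisy = [0.5 * p + 0.5 * u for p, u in zip(skewed, uniform)]

h_uniform = renyi_entropy(uniform, alpha=2.0)
h_skewed = renyi_entropy(skewed, alpha=2.0)
h_noisy = renyi_entropy(noisy, alpha=2.0)
```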

25. Unlocking the Power of Rehearsal in Continual Learning: A Theoretical Perspective

ArXiv ID: 2506.00205

Authors: Junze Deng, Qinhang Wu, Peizhong Ju, Sen Lin, Yingbin Liang, Ness Shroff

Abstract: Rehearsal-based methods have shown superior performance in addressing catastrophic forgetting in continual learning (CL) by storing and training on a subset of past data alongside new data in the current task. While such a concurrent rehearsal strategy is widely used, it remains unclear if this approach is always optimal. Inspired by human learning, where sequentially revisiting tasks helps mitigate forgetting, we explore whether sequential rehearsal can offer greater benefits for CL compared to standard concurrent rehearsal. To address this question, we conduct a theoretical analysis of rehearsal-based CL in overparameterized linear models, comparing two strategies: 1) Concurrent Rehearsal, where past and new data are trained together, and 2) Sequential Rehearsal, where new data is trained first, followed by revisiting past data sequentially. By explicitly characterizing forgetting and generalization error, we show that sequential rehearsal performs better when tasks are less similar. These insights further motivate a novel Hybrid Rehearsal method, which trains similar tasks concurrently and revisits dissimilar tasks sequentially. We characterize its forgetting and generalization performance, and our experiments with deep neural networks further confirm that the hybrid approach outperforms standard concurrent rehearsal. This work provides the first comprehensive theoretical analysis of rehearsal-based CL.

Comment: The paper provides a theoretical analysis of rehearsal-based continual learning, which is relevant to representation learning.

Relevance: 8 Novelty: 8

26. zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

ArXiv ID: 2506.01084

Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West

Abstract: Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.

Comment: The paper introduces a framework for adaptive vocabularies in language models via token compression, which is relevant to model compression.

Relevance: 8 Novelty: 8
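
Component (1) in the abstract above builds on LZW compression. A minimal sketch of textbook LZW applied to a sequence of token ids follows, where each newly learned run receives an id at or above the base vocabulary size, playing the role of a hypertoken. The function name, the base vocabulary size, and the absence of any cap on run length are assumptions; the abstract does not give the tokenizer's actual details.

```python
def lzw_compress(tokens, base_vocab_size):
    """Textbook LZW over token ids; learned runs get ids >= base_vocab_size."""
    table = {}                 # tuple of base token ids -> hypertoken id
    next_id = base_vocab_size
    out = []
    current = ()
    for tok in tokens:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in table:
            current = candidate          # keep extending a known run
        else:
            # Emit the longest known run, then learn the extended one.
            out.append(current[0] if len(current) == 1 else table[current])
            table[candidate] = next_id
            next_id += 1
            current = (tok,)
    if current:
        out.append(current[0] if len(current) == 1 else table[current])
    return out, table

tokens = [5, 7, 5, 7, 5, 7, 5]           # made-up token ids
compressed, table = lzw_compress(tokens, base_vocab_size=10)
```

Here seven base tokens shrink to four symbols, two of which are learned on the fly. The table is rebuilt from the stream itself, so no extra vocabulary has to be shipped in advance.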

27. TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

ArXiv ID: 2506.01352

Authors: Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kun Yuan, Binhang Yuan

Abstract: Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants but faces significant network communication bottlenecks, particularly in pipeline-parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. Existing activation compression methods, such as AQ-SGD, mitigate quantization-induced errors through error compensation but impose prohibitive memory overhead by requiring storage of previous activations. To address these issues, we introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers. We further provide a theoretical analysis, proving that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of $\mathcal{O}(1/\sqrt{T})$, matching that of vanilla stochastic gradient descent. Extensive experiments on diverse LLM tasks demonstrate that TAH-Quant achieves an aggressive activation quantization ratio (3-4 bits), which provides up to 4.3$\times$ end-to-end speedup without compromising training convergence, matches state-of-the-art methods, incurs no extra memory overhead, and generalizes well across different training scenarios.

Comment: The paper introduces TAH-Quant, a novel activation quantization framework, relevant to model compression.

Relevance: 8 Novelty: 8
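
To illustrate why a Hadamard-based transform helps with quantization outliers, as the abstract above describes, the sketch below applies an orthonormal fast Walsh-Hadamard transform to a vector with one large entry. The transform spreads that entry's energy across all coordinates, which shrinks the dynamic range a low-bit quantizer has to cover. This covers the plain transform only. The tile-wise layout, entropy-guided bit allocation, and pivot element swapping from the paper are not shown, and the example values are made up.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)                  # makes the transform norm-preserving
    return [v * scale for v in y]

activations = [8.0, 0.1, -0.2, 0.1, 0.0, 0.2, -0.1, 0.1]  # one large outlier
rotated = fwht(activations)
recovered = fwht(rotated)  # the orthonormal transform is its own inverse
```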

28. Learning DNF through Generalized Fourier Representations

ArXiv ID: 2506.01075

Authors: Mohsen Heidari, Roni Khardon

Abstract: The Fourier representation for the uniform distribution over the Boolean cube has found numerous applications in algorithms and complexity analysis. Notably, in learning theory, learnability of Disjunctive Normal Form (DNF) under uniform as well as product distributions has been established through such representations. This paper makes five main contributions. First, it introduces a generalized Fourier expansion that can be used with any distribution $D$ through the representation of the distribution as a Bayesian network (BN). Second, it shows that the main algorithmic tools for learning with the Fourier representation, that use membership queries to approximate functions by recovering their heavy Fourier coefficients, can be used with slight modifications with the generalized expansion. These results hold for any distribution. Third, it analyzes the $L_1$ spectral norm of conjunctions under the new expansion, showing that it is bounded for a class of distributions which can be represented by difference bounded tree BN, where a parent node in the BN representation can change the conditional expectation of a child node by at most $\alpha<0.5$. Lower bounds are presented to show that such constraints are necessary. The fourth contribution uses these results to show the learnability of DNF with membership queries under difference bounded tree BN. The final contribution is to develop an algorithm for learning difference-bounded tree BN distributions, thus extending the DNF learnability result to cases where the distribution is not known in advance.

Comment: The paper introduces a generalized Fourier representation for learning DNF, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 8

29. PMNO: A novel physics guided multi-step neural operator predictor for partial differential equations

ArXiv ID: 2506.01598

Authors: Jin Song, Kenji Kawaguchi, Zhenya Yan

Abstract: Neural operators, which aim to approximate mappings between infinite-dimensional function spaces, have been widely applied in the simulation and prediction of physical systems. However, the limited representational capacity of network architectures, combined with their heavy reliance on large-scale data, often hinders effective training and results in poor extrapolation performance. In this paper, inspired by traditional numerical methods, we propose a novel physics guided multi-step neural operator (PMNO) architecture to address these challenges in long-horizon prediction of complex physical systems. Distinct from general operator learning methods, the PMNO framework replaces the single-step input with multi-step historical data in the forward pass and introduces an implicit time-stepping scheme based on the Backward Differentiation Formula (BDF) during backpropagation. This design not only strengthens the model's extrapolation capacity but also facilitates more efficient and stable training with fewer data samples, especially for long-term predictions. Meanwhile, a causal training strategy is employed to circumvent the need for multi-stage training and to ensure efficient end-to-end optimization. The neural operator architecture possesses resolution-invariant properties, enabling the trained model to perform fast extrapolation on arbitrary spatial resolutions. We demonstrate the superior predictive performance of the PMNO predictor across a diverse range of physical systems, including a 2D linear system, modeling over an irregular domain, complex-valued wave dynamics, and reaction-diffusion processes. Depending on the specific problem setting, various neural operator architectures, including FNO, DeepONet, and their variants, can be seamlessly integrated into the PMNO framework.

Comment: The paper presents a novel neural operator architecture for PDEs, which is relevant to foundational research in AI for science.

Relevance: 8 Novelty: 8
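
For readers unfamiliar with the Backward Differentiation Formula mentioned in the abstract above, the standard first- and second-order schemes for an ODE y' = f(t, y) with step size h are shown below. Which order PMNO uses, and how the scheme enters its training loss, is not stated in the abstract, so this is background only.

```latex
% BDF1 (backward Euler)
y_{n+1} - y_n = h \, f(t_{n+1}, y_{n+1})

% BDF2
y_{n+2} - \tfrac{4}{3} y_{n+1} + \tfrac{1}{3} y_n = \tfrac{2}{3} h \, f(t_{n+2}, y_{n+2})
```

Both schemes are implicit: the unknown newest value appears inside f, and the multi-step form uses several past states, which parallels the multi-step historical input described in the abstract.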

30. Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis

ArXiv ID: 2506.00849

Authors: Qi Chen, Jierui Zhu, Florian Shkurti

Abstract: Despite the empirical success of Diffusion Models (DMs) and Variational Autoencoders (VAEs), their generalization performance remains theoretically underexplored, especially lacking a full consideration of the shared encoder-generator structure. Leveraging recent information-theoretic tools, we propose a unified theoretical framework that provides guarantees for the generalization of both the encoder and generator by treating them as randomized mappings. This framework further enables (1) a refined analysis for VAEs, accounting for the generator's generalization, which was previously overlooked; (2) illustrating an explicit trade-off in generalization terms for DMs that depends on the diffusion time $T$; and (3) providing computable bounds for DMs based solely on the training data, allowing the selection of the optimal $T$ and the integration of such bounds into the optimization process to improve model performance. Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.

Comment: The paper provides a unified theoretical framework for analyzing the generalization of VAEs and DMs, which aligns with foundational research in representation learning.