This is a remedial run for missed papers from 03/20/2025 to 03/20/2025.

Results generated on 03/24/2025.

Personalized Daily Arxiv Papers 3/21/2025

[gpt-4o]	Prompt	Completion	Total
Token	46460	7584	54044
Cost	$0.12	$0.08	$0.19

Total arXiv papers: 250

Total scanned papers: 250

Total relevant papers: 40

Table of contents with paper titles:

Mixture of Lookup Experts Authors: Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang
The global convergence time of stochastic gradient descent in non-convex landscapes: Sharp estimates via large deviations Authors: Waïss Azizian, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens Authors: Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Weinan E, Linfeng Zhang, Guolin Ke
Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, Xueqi Cheng
ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism Authors: Venmugil Elango
Tuning LLMs by RAG Principles: Towards LLM-native Memory Authors: Jiale Wei, Shuchi Wu, Ruochen Liu, Xiang Ying, Jingbo Shang, Fangbo Tao
Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences Authors: Krithik Ramesh, Sameed M. Siddiqui, Albert Gu, Michael D. Mitzenmacher, Pardis C. Sabeti
CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners Authors: Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng
Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts Authors: Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min
Accelerating Transformer Inference and Training with 2:4 Activation Sparsity Authors: Daniel Haziza, Timothy Chou, Dhruv Choudhary, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut, Jesse Cai
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models Authors: Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie
Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing Authors: Vishnu Asutosh Dasu, Md Rafi ur Rashid, Vipul Gupta, Saeid Tizpaz-Niari, Gang Tan
Gene42: Long-Range Genomic Foundation Model With Dense Attention Authors: Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, and Shadab Khan
QCPINN: Quantum Classical Physics-Informed Neural Networks for Solving PDEs Authors: Afrah Farea, Saiful Khan, Mustafa Serdar Celebi
HiQ-Lip: The First Quantum-Classical Hierarchical Method for Global Lipschitz Constant Estimation of ReLU Networks Authors: Haoqi He, Yan Xiao
Universal approximation property of neural stochastic differential equations Authors: Anna P. Kwossek, David J. Prömel, Josef Teichmann
Efficient Training of Neural Fractional-Order Differential Equation via Adjoint Backpropagation Authors: Qiyu Kang, Xuhao Li, Kai Zhao, Wenjun Cui, Yanan Zhao, Weihua Deng, Wee Peng Tay
Subgradient Method for System Identification with Non-Smooth Objectives Authors: Baturalp Yalcin, Javad Lavaei
The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement Authors: Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang
VP-NTK: Exploring the Benefits of Visual Prompting in Differentially Private Data Synthesis Authors: Chia-Yi Hsu, Jia-You Chen, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang
Bezier Distillation Authors: Ling Feng, SK Yang
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge Authors: Xuan Shen, Weize Ma, Jing Liu, Changdi Yang, Rui Ding, Quanyi Wang, Henghui Ding, Wei Niu, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer Authors: Tony Zhang, Rickard Brännvall
Advances in Protein Representation Learning: Methods, Applications, and Future Directions Authors: Viet Thanh Duy Nguyen, Truong-Son Hy
Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation Authors: Yijia Luo, Yulin Song, Xingyao Zhang, Jiaheng Liu, Weixun Wang, GengRu Chen, Wenbo Su, Bo Zheng
Rethinking Robustness in Machine Learning: A Posterior Agreement Approach Authors: João Borges S. Carvalho, Alessandro Torcinovich, Victor Jimenez Rodriguez, Antonio E. Cinà, Carlos Cotrini, Lea Schönherr, Joachim M. Buhmann
Entropy-based Exploration Conduction for Multi-step Reasoning Authors: Jinghan Zhang, Xiting Wang, Fengran Mo, Yeyang Zhou, Wanfu Gao, Kunpeng Liu
A preliminary data fusion study to assess the feasibility of Foundation Process-Property Models in Laser Powder Bed Fusion Authors: Oriol Vendrell-Gallart, Nima Negarandeh, Zahra Zanjani Foumani, Mahsa Amiri, Lorenzo Valdevit, Ramin Bostanabad
Blend the Separated: Mixture of Synergistic Experts for Data-Scarcity Drug-Target Interaction Prediction Authors: Xinlong Zhai, Chunchen Wang, Ruijia Wang, Jiazheng Kang, Shujie Li, Boyu Chen, Tengfei Ma, Zikai Zhou, Cheng Yang, Chuan Shi
Distributed Learning over Arbitrary Topology: Linear Speed-Up with Polynomial Transient Time Authors: Runze You, Shi Pu
On the Cone Effect in the Learning Dynamics Authors: Zhanpeng Zhou, Yongyi Yang, Jie Ren, Mahito Sugiyama, Junchi Yan
Manifold learning in metric spaces Authors: Liane Xu, Amit Singer
Disentangling Uncertainties by Learning Compressed Data Representation Authors: Zhiyu An, Zhibo Hou, Wan Du
Procrustes Wasserstein Metric: A Modified Benamou-Brenier Approach with Applications to Latent Gaussian Distributions Authors: Kevine Meugang Toukam
Machine learning identifies nullclines in oscillatory dynamical systems Authors: Bartosz Prokop, Jimmy Billen, Nikita Frolov, Lendert Gelens
Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement Authors: Shu Yang, Chengting Yu, Lei Liu, Hanzhi Ma, Aili Wang, Erping Li
Palatable Conceptions of Disembodied Being: Terra Incognita in the Space of Possible Minds Authors: Murray Shanahan
Survey on Evaluation of LLM-based Agents Authors: Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions Authors: Hadi Amini, Md Jueal Mia, Yasaman Saadati, Ahmed Imteaj, Seyedsina Nabavirazavi, Urmish Thakker, Md Zarif Hossain, Awal Ahmed Fime, S. S. Iyengar
Line Space Clustering (LSC): Feature-Based Clustering using K-medians and Dynamic Time Warping for Versatility Authors: Joanikij Chulev, Angela Mladenovska

1. Mixture of Lookup Experts

ArXiv ID: 2503.15798

Authors: Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang

Abstract: Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert's computation results based on input ids and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE.

Comment: The paper introduces Mixture of Lookup Experts (MoLE), which aligns closely with foundational research in Mixture-of-Experts architectures and efficiency.

Relevance: 10 Novelty: 9
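
The re-parameterization idea in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy sizes, the random weights, the ReLU FFN form, and the function names (`expert_ffn`, `mole_lookup`, `mole_compute`) are all assumptions made for illustration, and the gate weights are simply passed in rather than produced by a router. The point it shows is that when an expert only ever sees embedding-layer outputs, its result for every token id can be precomputed once and then fetched by index.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_hidden, n_experts = 50, 8, 16, 4

# Hypothetical toy parameters: an embedding table and per-expert FFN weights.
embedding = rng.normal(size=(vocab_size, d_model))
w_in = rng.normal(size=(n_experts, d_model, d_hidden))
w_out = rng.normal(size=(n_experts, d_hidden, d_model))


def expert_ffn(e, x):
    """Apply expert e (an assumed two-layer ReLU FFN) to rows of x."""
    return np.maximum(x @ w_in[e], 0.0) @ w_out[e]


# Re-parameterization step: evaluate every expert on every embedding row once.
# lut has shape (vocab_size, n_experts, d_model).
lut = np.stack([expert_ffn(e, embedding) for e in range(n_experts)], axis=1)


def mole_lookup(input_ids, gate_weights):
    """Inference path: fetch precomputed expert outputs by token id."""
    looked_up = lut[input_ids]  # (n, n_experts, d_model)
    return np.einsum("ne,ned->nd", gate_weights, looked_up)


def mole_compute(input_ids, gate_weights):
    """Training-style path: run the expert FFNs on the embeddings."""
    x = embedding[input_ids]
    outs = np.stack([expert_ffn(e, x) for e in range(n_experts)], axis=1)
    return np.einsum("ne,ned->nd", gate_weights, outs)
```

Calling `mole_lookup` and `mole_compute` on the same ids and gate weights should give matching results up to floating-point error, while the lookup path performs no expert matrix multiplications.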

2. The global convergence time of stochastic gradient descent in non-convex landscapes: Sharp estimates via large deviations

ArXiv ID: 2503.16398

Authors: Waïss Azizian, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos

Abstract: In this paper, we examine the time it takes for stochastic gradient descent (SGD) to reach the global minimum of a general, non-convex loss function. We approach this question through the lens of randomly perturbed dynamical systems and large deviations theory, and we provide a tight characterization of the global convergence time of SGD via matching upper and lower bounds. These bounds are dominated by the most "costly" set of obstacles that the algorithm may need to overcome to reach a global minimizer from a given initialization, coupling in this way the global geometry of the underlying loss landscape with the statistics of the noise entering the process. Finally, motivated by applications to the training of deep neural networks, we also provide a series of refinements and extensions of our analysis for loss functions with shallow local minima.

Comment: The paper provides theoretical insights into the global convergence time of SGD in non-convex landscapes, which aligns with foundational research in training dynamics of neural networks.

Relevance: 9 Novelty: 8

3. Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

ArXiv ID: 2503.16278

Authors: Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Weinan E, Linfeng Zhang, Guolin Ke

Abstract: Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.

Comment: The paper introduces Uni-3DAR, a unified framework for 3D generation and understanding via autoregression. It aligns with foundational research in representation learning and architecture innovations.

Relevance: 9 Novelty: 8
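
To make the octree-based compression mentioned in the abstract above more concrete, here is a small sketch of a generic octree occupancy encoding. It is a textbook-style construction, not Uni-3DAR's tokenizer: the function name `octree_tokens`, the 8-bit child-occupancy code, and the unit-cube coordinate convention are assumptions, and the paper's subtree compression and fine-grained tokens are not modelled. It only shows why sparse 3D data yields short token sequences: empty cells are never expanded.

```python
import numpy as np


def octree_tokens(points, depth):
    """Encode points in [0, 1)^3 as per-level child-occupancy codes.

    Returns one list per level; each entry is an 8-bit code for one
    occupied parent cell, with bit b set if child b contains a point.
    """
    pts = np.asarray(points, dtype=float)
    cells = {(0, 0, 0)}  # occupied cells at the current level (root only)
    levels = []
    for level in range(1, depth + 1):
        res = 2 ** level
        occupied = {
            tuple(int(v) for v in c) for c in np.floor(pts * res).astype(int)
        }
        tokens = []
        for cx, cy, cz in sorted(cells):
            code = 0
            for bit in range(8):
                child = (
                    2 * cx + ((bit >> 2) & 1),
                    2 * cy + ((bit >> 1) & 1),
                    2 * cz + (bit & 1),
                )
                if child in occupied:
                    code |= 1 << bit
            tokens.append(code)
        levels.append(tokens)
        cells = occupied  # only occupied cells are subdivided further
    return levels
```

For two points near opposite corners of the unit cube, each level emits one token per occupied parent cell, so the sequence length tracks the number of occupied cells rather than the full grid resolution.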

4. Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models

ArXiv ID: 2503.15888

Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, Xueqi Cheng

Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose CK-PLUG, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on Llama3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at: https://github.com/byronBBL/CK-PLUG.

Comment: The paper introduces CK-PLUG for controlling knowledge reliance in LLMs, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8
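
The entropy-shift idea described in the abstract above can be sketched as follows. This is a hedged illustration only: the sign convention for the confidence gain (entropy without context minus entropy with context), the geometric interpolation in `blend`, and the parameter name `alpha` are assumptions chosen for clarity, not the formulas used by CK-PLUG.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an implementation convenience, not from the paper


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + EPS)).sum())


def confidence_gain(p_param, p_ctx):
    """Assumed convention: entropy drop after the context is inserted.

    A negative value means the model became less certain with the
    context, which is treated here as a sign of knowledge conflict.
    """
    return entropy(p_param) - entropy(p_ctx)


def blend(p_param, p_ctx, alpha):
    """Hypothetical control rule: alpha=1 favours parametric knowledge,
    alpha=0 favours the retrieved context."""
    logits = alpha * np.log(np.asarray(p_param, dtype=float) + EPS) \
        + (1.0 - alpha) * np.log(np.asarray(p_ctx, dtype=float) + EPS)
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()


def next_token_dist(p_param, p_ctx, alpha):
    """Adjust the distribution only when the confidence gain is negative."""
    if confidence_gain(p_param, p_ctx) < 0:
        return blend(p_param, p_ctx, alpha)
    return np.asarray(p_ctx, dtype=float)
```

In this sketch a single scalar moves the output between the two distributions for conflicting tokens and leaves non-conflicting tokens at the context-aware distribution.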

5. ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism

ArXiv ID: 2503.15758

Authors: Venmugil Elango

Abstract: Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows the model to capture intricate dependencies within the data. However, the self-attention mechanism also incurs significant computational and memory costs, particularly for long sequences. In this paper, we introduce ATTENTION2D, a novel approach that exploits parallelism along two dimensions - query and key/value - of the self-attention operation. This method enables efficient distribution and parallelization of computations across multiple devices. Our approach facilitates asymptotically faster training and inference phases compared to previous methods, without relying on approximations or incurring additional computational or memory overheads. Furthermore, unlike existing techniques that struggle to scale with an increasing number of processing units, our approach effectively scales with additional processing units. Our experimental results confirm the effectiveness of our method in improving communication efficiency and scalability. Compared to Ring Attention, our approach demonstrated up to a 5x performance boost on a GPT-3-like model using 64 NVIDIA A100 GPUs across 16 nodes, and up to a 9.4x performance boost on 64 NVIDIA H100 GPUs across 64 nodes.

Comment: The paper introduces ATTENTION2D for distributed self-attention, which aligns with foundational research in model architecture and efficiency.

Relevance: 9 Novelty: 8
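
Splitting self-attention along both the query and the key/value dimensions, as the abstract above describes, relies on the fact that softmax attention can be computed exactly from per-block partial results. The NumPy sketch below shows that identity on a single machine. It is not the ATTENTION2D communication scheme: the function names, the loop standing in for a device grid, and the reduction order are assumptions for illustration.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference single-device softmax attention."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def block_partial(q_blk, k_blk, v_blk):
    """Work done by one hypothetical (query block, key/value block) cell.

    Returns the row-wise score maximum, the unnormalized softmax sum,
    and the unnormalized weighted values.
    """
    s = q_blk @ k_blk.T / np.sqrt(q_blk.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    return m, p.sum(axis=-1, keepdims=True), p @ v_blk


def attention_2d(q, k, v, q_parts, kv_parts):
    """Exact attention assembled from a q_parts x kv_parts grid of blocks."""
    out_rows = []
    for q_blk in np.array_split(q, q_parts):
        partials = [
            block_partial(q_blk, k_blk, v_blk)
            for k_blk, v_blk in zip(
                np.array_split(k, kv_parts), np.array_split(v, kv_parts)
            )
        ]
        # Rescale every partial result to a shared maximum before summing.
        m_glob = np.max(np.stack([m for m, _, _ in partials]), axis=0)
        denom = sum(np.exp(m - m_glob) * l for m, l, _ in partials)
        numer = sum(np.exp(m - m_glob) * o for m, _, o in partials)
        out_rows.append(numer / denom)
    return np.concatenate(out_rows)
```

Because the rescaling is exact, `attention_2d` should agree with `full_attention` up to floating-point error for any block counts, which is consistent with the abstract's claim that no approximation is involved.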

6. Tuning LLMs by RAG Principles: Towards LLM-native Memory

ArXiv ID: 2503.16071

Authors: Jiale Wei, Shuchi Wu, Ruochen Liu, Xiang Ying, Jingbo Shang, Fangbo Tao

Abstract: Memory, additional information beyond the training of large language models (LLMs), is crucial to various real-world applications, such as personal assistants. The two mainstream solutions to incorporate memory into the generation process are long-context LLMs and retrieval-augmented generation (RAG). In this paper, we first systematically compare these two types of solutions on three renovated/new datasets and show that (1) long-context solutions, although more expensive, more easily capture the big picture and better answer queries that require considering the memory as a whole; and (2) when the queries concern specific information, RAG solutions are more competitive, especially when the keywords can be explicitly matched. Therefore, we propose a novel method, RAG-Tuned-LLM, which fine-tunes a relatively small (e.g., 7B) LLM using data generated following the RAG principles, so that it can combine the advantages of both solutions. Extensive experiments on three datasets demonstrate that RAG-Tuned-LLM can beat long-context LLMs and RAG methods across a wide range of query types.

Comment: The paper proposes a novel method combining RAG principles with LLM fine-tuning, which aligns with foundational research in LLM architecture and memory integration.

Relevance: 9 Novelty: 8

7. Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences

ArXiv ID: 2503.16351

Authors: Krithik Ramesh, Sameed M. Siddiqui, Albert Gu, Michael D. Mitzenmacher, Pardis C. Sabeti

Abstract: Deep learning architectures such as convolutional neural networks and Transformers have revolutionized biological sequence modeling, with recent advances driven by scaling up foundation and task-specific models. The computational resources and large datasets required, however, limit their applicability in biological contexts. We introduce Lyra, a subquadratic architecture for sequence modeling, grounded in the biological framework of epistasis for understanding sequence-to-function relationships. Mathematically, we demonstrate that state space models efficiently capture global epistatic interactions and combine them with projected gated convolutions for modeling local relationships. We demonstrate that Lyra is performant across over 100 wide-ranging biological tasks, achieving state-of-the-art (SOTA) performance in many key areas, including protein fitness landscape prediction, biophysical property prediction (e.g., disordered protein region functions), peptide engineering applications (e.g., antibody binding, cell-penetrating peptide prediction), RNA structure analysis, RNA function prediction, and CRISPR guide design. It achieves this with orders-of-magnitude improvements in inference speed and reduction in parameters (up to 120,000-fold in our tests) compared to recent biology foundation models. Using Lyra, we were able to train and run every task in this study on two or fewer GPUs in under two hours, democratizing access to biological sequence modeling at SOTA performance, with potential applications to many fields.

Comment: The paper introduces Lyra, a subquadratic architecture for biological sequence modeling, which is relevant to foundational research in model architecture and efficiency.

Relevance: 9 Novelty: 8

8. CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

ArXiv ID: 2503.16356

Authors: Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng

Abstract: Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they struggle to generalize these updates to multi-hop reasoning tasks that depend on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference, we observe that current layer-localized KE approaches, such as MEMIT and WISE, which edit only single or a few model layers, struggle to effectively incorporate updated information into these reasoning pathways. To address this limitation, we propose CaKE (Circuit-aware Knowledge Editing), a novel method that enables more effective integration of updated knowledge in LLMs. CaKE leverages strategically curated data, guided by our circuits-based analysis, that enforces the model to utilize the modified knowledge, stimulating the model to develop appropriate reasoning circuits for newly integrated knowledge. Experimental results show that CaKE enables more accurate and consistent use of updated knowledge across related reasoning tasks, leading to an average of 20% improvement in multi-hop reasoning accuracy on MQuAKE dataset compared to existing KE methods. We release the code and data in https://github.com/zjunlp/CaKE.

Comment: The paper proposes CaKE, a circuit-aware knowledge editing method for LLMs, which aligns with foundational research in LLM behavior and reasoning circuits.

Relevance: 9 Novelty: 8

9. Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

ArXiv ID: 2503.16057

Authors: Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min

Abstract: Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains and promising scaling properties.

Comment: The paper introduces Race-DiT, a Mixture of Experts (MoE) model for diffusion transformers with a flexible routing strategy and regularization techniques. It aligns closely with the MoE criterion under model architecture.

Relevance: 9 Novelty: 8
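
One way to read "tokens and experts compete together" in the abstract above is as a single top-k selection over the whole token-by-expert score matrix, instead of a separate top-k per token. The sketch below illustrates that reading under stated assumptions: the function name `expert_race`, the boolean assignment mask, and the toy scores are hypothetical, and the paper's regularization and similarity loss are not modelled.

```python
import numpy as np


def expert_race(scores, k):
    """Select the k highest-scoring (token, expert) pairs overall.

    scores has shape (n_tokens, n_experts). The returned boolean mask has
    the same shape; True marks an expert assigned to a token.
    """
    flat = scores.ravel()
    top = np.argpartition(flat, -k)[-k:]  # indices of the k largest scores
    mask = np.zeros(flat.shape, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)
```

With a global selection, a token whose scores are all high can receive several experts while a low-scoring token may receive none, which matches the abstract's description of dynamically assigning experts to critical tokens.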

10. Accelerating Transformer Inference and Training with 2:4 Activation Sparsity

ArXiv ID: 2503.16672

Authors: Daniel Haziza, Timothy Chou, Dhruv Choudhary, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut, Jesse Cai

Abstract: In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.

Comment: The paper demonstrates how to leverage 2:4 activation sparsity for accelerating transformer inference and training, aligning with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8
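
For readers unfamiliar with the 2:4 pattern named in the abstract above, the sketch below shows what it means for an activation tensor: in every group of four consecutive values, at most two are nonzero. The magnitude-based choice of which two to keep, and the function names `squared_relu` and `prune_2_4`, are assumptions for illustration; the paper's actual kernels and selection rule may differ.

```python
import numpy as np


def squared_relu(x):
    """Squared-ReLU activation: max(x, 0) squared, which zeroes negatives."""
    return np.maximum(x, 0.0) ** 2


def prune_2_4(a):
    """Keep the two largest-magnitude entries in each group of four.

    Groups are taken along the last axis, whose length must be a
    multiple of four. Groups that already have two or more zeros are
    returned unchanged.
    """
    assert a.shape[-1] % 4 == 0
    groups = a.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in every group.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(a.shape)
```

Since Squared-ReLU already maps every negative pre-activation to zero, many groups satisfy the pattern before any pruning, which is the intrinsic sparsity the abstract refers to.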

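11. Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

ArXiv ID: 2503.16036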

Authors: Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie

Abstract: Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, since visual content does not contribute equally to user instructions, existing strategies (e.g., average pooling) inevitably lose potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and we apply an attention mechanism to complete the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved for easier understanding by LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that our HICom can obtain distinguished video understanding ability with fewer tokens, increasing the performance by 2.43% on average on three multiple-choice QA benchmarks and saving 78.8% tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.

Comment: The paper proposes a hybrid-level token compression strategy for MLLMs, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8

12. Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing

ArXiv ID: 2503.15815

Authors: Vishnu Asutosh Dasu, Md Rafi ur Rashid, Vipul Gupta, Saeid Tizpaz-Niari, Gang Tan

Abstract: This paper explores pruning attention heads as a post-processing bias mitigation method for large language models (LLMs). Modern AI systems such as LLMs are expanding into sensitive social contexts where fairness concerns become especially crucial. Since LLMs develop decision-making patterns by training on massive datasets of human-generated content, they naturally encode and perpetuate societal biases. While modifying training datasets and algorithms is expensive and requires significant resources, post-processing techniques, such as selectively deactivating neurons and attention heads in pre-trained LLMs, can provide feasible and effective approaches to improve fairness. However, identifying the optimal subset of parameters to prune presents a combinatorial challenge within LLMs' immense parameter space, requiring solutions that efficiently balance competing objectives across the frontiers of model fairness and utility. To address the computational challenges, we explore a search-based program repair approach via randomized simulated annealing. Given the prohibitive evaluation costs in billion-parameter LLMs, we develop surrogate deep neural networks that efficiently model the relationship between attention head states (active/inactive) and their corresponding fairness/utility metrics. This allows us to perform optimization over the surrogate models and efficiently identify optimal subsets of attention heads for selective pruning rather than directly searching through the LLM parameter space. This paper introduces Attention Pruning, a fairness-aware surrogate simulated annealing approach to prune attention heads in LLMs that disproportionately contribute to bias while minimally impacting overall model utility. Our experiments show that Attention Pruning achieves up to 40% reduction in gender bias and outperforms the state-of-the-art bias mitigation strategies.

Comment: The paper explores attention pruning for bias mitigation in LLMs, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 7
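
The search procedure described in the abstract above, simulated annealing over active/inactive attention-head states scored by a cheap surrogate, can be sketched generically. Everything specific here is an assumption: the function name `simulated_annealing`, the linear cooling schedule, the single-head flip move, and the toy additive cost in the usage note stand in for the paper's surrogate networks and fairness/utility metrics.

```python
import math
import random


def simulated_annealing(cost, n_heads, steps=2000, t0=1.0, seed=0):
    """Search over binary head masks (1 = active, 0 = pruned).

    cost is any callable mapping a mask (list of 0/1) to a float to be
    minimized, for example a surrogate model's predicted bias plus a
    utility penalty. Returns the best mask seen and its cost.
    """
    rng = random.Random(seed)
    state = [1] * n_heads  # start with every head active
    cur = cost(state)
    best, best_cost = state[:], cur
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (assumed)
        cand = state[:]
        cand[rng.randrange(n_heads)] ^= 1  # flip one head on or off
        c = cost(cand)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if c <= cur or rng.random() < math.exp((cur - c) / temp):
            state, cur = cand, c
            if cur < best_cost:
                best, best_cost = state[:], cur
    return best, best_cost
```

As a usage example, with a hypothetical additive surrogate where each active head adds some bias and each pruned head costs some utility, the search should settle on pruning exactly the heads whose bias contribution exceeds their utility.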

13. Gene42: Long-Range Genomic Foundation Model With Dense Attention

ArXiv ID: 2503.16565

Authors: Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, and Shadab Khan

Abstract: We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.

Comment: The paper introduces Gene42, a genomic foundation model with dense attention for long-range context. It aligns with foundational research in architecture innovations for science applications.

Relevance: 8 Novelty: 8

14. QCPINN: Quantum Classical Physics-Informed Neural Networks for Solving PDEs

ArXiv ID: 2503.16678

Authors: Afrah Farea, Saiful Khan, Mustafa Serdar Celebi

Abstract: Hybrid quantum-classical neural network methods represent an emerging approach to solving computational challenges by leveraging advantages from both paradigms. As physics-informed neural networks (PINNs) have been successfully applied to solve partial differential equations (PDEs) by incorporating physical constraints into neural architectures, this work investigates whether quantum-classical physics-informed neural networks (QCPINNs) can efficiently solve PDEs with reduced parameter counts compared to classical approaches. We evaluate two quantum circuit paradigms: continuous-variable (CV) and qubit-based discrete-variable (DV) across multiple circuit ansatze (Alternate, Cascade, Cross mesh, and Layered). Benchmarking across five challenging PDEs (Helmholtz, Cavity, Wave, Klein-Gordon, and Convection-Diffusion equations) demonstrates that our hybrid approaches achieve comparable accuracy to classical PINNs while requiring up to 89% fewer trainable parameters. DV-based implementations, particularly those with angle encoding and cascade circuit configurations, exhibit better stability and convergence properties across all problem types. For the Convection-Diffusion equation, our angle-cascade QCPINN achieves parameter efficiency and a 37% reduction in relative L2 error compared to classical counterparts. Our findings highlight the potential of quantum-enhanced architectures for physics-informed learning, establishing parameter efficiency as a quantifiable quantum advantage while providing a foundation for future quantum-classical hybrid systems solving complex physical models.

Comment: The paper explores quantum-classical hybrid architectures for physics-informed neural networks, which introduces architectural innovations relevant to AI for Science.

Relevance: 8 Novelty: 8

15. HiQ-Lip: The First Quantum-Classical Hierarchical Method for Global Lipschitz Constant Estimation of ReLU Networks

ArXiv ID: 2503.16342

Authors: Haoqi He, Yan Xiao

Abstract: Estimating the global Lipschitz constant of neural networks is crucial for understanding and improving their robustness and generalization capabilities. However, precise calculations are NP-hard, and current semidefinite programming (SDP) methods face challenges such as high memory usage and slow processing speeds. In this paper, we propose HiQ-Lip, a hybrid quantum-classical hierarchical method that leverages Coherent Ising Machines (CIMs) to estimate the global Lipschitz constant. We tackle the estimation by converting it into a Quadratic Unconstrained Binary Optimization (QUBO) problem and implement a multilevel graph coarsening and refinement strategy to adapt to the constraints of contemporary quantum hardware. Our experimental evaluations on fully connected neural networks demonstrate that HiQ-Lip not only provides estimates comparable to state-of-the-art methods but also significantly accelerates the computation process. In specific tests involving two-layer neural networks with 256 hidden neurons, HiQ-Lip doubles the solving speed and offers more accurate upper bounds than the existing best method, LiPopt. These findings highlight the promising utility of small-scale quantum devices in advancing the estimation of neural network robustness.

Comment: The paper proposes HiQ-Lip, a hybrid quantum-classical method for estimating the global Lipschitz constant of ReLU networks. It aligns with foundational research in neural network robustness and generalization.

Relevance: 8 Novelty: 8

16. Universal approximation property of neural stochastic differential equations

ArXiv ID: 2503.16696

Authors: Anna P. Kwossek, David J. Prömel, Josef Teichmann

Abstract: We identify various classes of neural networks that are able to approximate continuous functions locally uniformly subject to fixed global linear growth constraints. For such neural networks the associated neural stochastic differential equations can approximate general stochastic differential equations, both of Itô diffusion type, arbitrarily well. Moreover, quantitative error estimates are derived for stochastic differential equations with sufficiently regular coefficients.

Comment: The paper identifies neural networks capable of approximating continuous functions under linear growth constraints, aligning with foundational research in neural network theory.

Relevance: 8 Novelty: 8

17. Efficient Training of Neural Fractional-Order Differential Equation via Adjoint Backpropagation

ArXiv ID: 2503.16666

Authors: Qiyu Kang, Xuhao Li, Kai Zhao, Wenjun Cui, Yanan Zhao, Weihua Deng, Wee Peng Tay

Abstract: Fractional-order differential equations (FDEs) enhance traditional differential equations by extending the order of differential operators from integers to real numbers, offering greater flexibility in modeling complex dynamical systems with nonlocal characteristics. Recent progress at the intersection of FDEs and deep learning has catalyzed a new wave of innovative models, demonstrating the potential to address challenges such as graph representation learning. However, training neural FDEs has primarily relied on direct differentiation through forward-pass operations in FDE numerical solvers, leading to increased memory usage and computational complexity, particularly in large-scale applications. To address these challenges, we propose a scalable adjoint backpropagation method for training neural FDEs by solving an augmented FDE backward in time, which substantially reduces memory requirements. This approach provides a practical neural FDE toolbox and holds considerable promise for diverse applications. We demonstrate the effectiveness of our method in several tasks, achieving performance comparable to baseline models while significantly reducing computational overhead.

Comment: The paper introduces an adjoint backpropagation method for training neural fractional-order differential equations, which offers efficiency improvements and theoretical insights into training dynamics.