Personalized Daily ArXiv Papers 2025-07-09

[gpt-4o]	Prompt	Completion	Total
Token	32862	3896	36758
Cost	$0.08	$0.04	$0.12

Total arXiv papers: 535

Total scanned papers: 310

Total relevant papers: 18

Table of contents with paper titles:

FACT: the Features At Convergence Theorem for neural networks Authors: Enric Boix-Adsera, Neil Mallinar, James B. Simon, Mikhail Belkin
Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning Authors: Anish Dhir, Cristiana Diaconu, Valentinian Mihai Lungu, James Requeima, Richard E. Turner, Mark van der Wilk
Causal Foundation Models: Disentangling Physics from Instrument Properties Authors: Jeroen Audenaert, Daniel Muthukrishna, Paul F. Gregory, David W. Hogg, V. Ashley Villar
Generalized and Unified Equivalences between Hardness and Pseudoentropy Authors: Lunjia Hu, Salil Vadhan
SingLoRA: Low Rank Adaptation Using a Single Matrix Authors: David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Differential Mamba Authors: Nadav Schneider, Itamar Zimerman, Eliya Nachmani
Predicting mutational effects on protein binding from folding energy Authors: Arthur Deng, Karsten Householder, Fang Wu, Sebastian Thrun, K. Christopher Garcia, Brian Trippe
Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach Authors: Xiaobing Chen, Boyang Zhang, Xiangwei Zhou, Mingxuan Sun, Shuai Zhang, Songyang Zhang, Geoffrey Ye Li
Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs Authors: Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu
Simple Convergence Proof of Adam From a Sign-like Descent Perspective Authors: Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Zhouchen Lin
Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN) Authors: Reza T. Batley, Chanwook Park, Wing Kam Liu, Sourav Saha
LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks Authors: William Fleshman, Benjamin Van Durme
Mitigating Shortcut Learning with InterpoLated Learning Authors: Michalis Korakakis, Andreas Vlachos, Adrian Weller
Coding Triangle: How Does Large Language Model Understand Code? Authors: Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen
Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning Authors: Ze'ev Zukerman, Bassel Hamoud, Kfir Y. Levy
QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models Authors: Sebastian Siegel, Ming-Jay Yang, Younes Bouhadjar, Maxime Fabre, Emre Neftci, John Paul Strachan
Modern Methods in Associative Memory Authors: Dmitry Krotov, Benjamin Hoover, Parikshit Ram, Bao Pham

1. FACT: the Features At Convergence Theorem for neural networks

ArXiv ID: 2507.05644

Authors: Enric Boix-Adsera, Neil Mallinar, James B. Simon, Mikhail Belkin

Abstract: A central challenge in deep learning theory is to understand how neural networks learn and represent features. To this end, we prove the Features at Convergence Theorem (FACT), which gives a self-consistency equation that neural network weights satisfy at convergence when trained with nonzero weight decay. For each weight matrix $W$, this equation relates the "feature matrix" $W^\top W$ to the set of input vectors passed into the matrix during forward propagation and the loss gradients passed through it during backpropagation. We validate this relation empirically, showing that neural features indeed satisfy the FACT at convergence. Furthermore, by modifying the "Recursive Feature Machines" of Radhakrishnan et al. 2024 so that they obey the FACT, we arrive at a new learning algorithm, FACT-RFM. FACT-RFM achieves high performance on tabular data and captures various feature learning behaviors that occur in neural network training, including grokking in modular arithmetic and phase transitions in learning sparse parities.

Comment: The paper presents the Features At Convergence Theorem (FACT) for neural networks, providing insights into how neural networks learn and represent features, aligning with representation learning.

Relevance: 10 Novelty: 9
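
One way to see how a relation between $W^\top W$, layer inputs, and backpropagated gradients can arise is a toy check on a single linear layer. The sketch below is an illustration under stated assumptions, not the paper's theorem or code: it assumes a squared loss, an L2 penalty `0.5 * lam * ||W||^2`, and a closed-form minimizer standing in for "convergence". At such a stationary point the weight-decay term balances the loss gradient, which gives $W^\top W = -\frac{1}{\lambda}\,\mathbb{E}[(\nabla_x \ell)\, x^\top]$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, lam = 200, 5, 3, 0.1   # lam is the (assumed) weight-decay strength
X = rng.standard_normal((n, d_in))     # layer inputs, one row per sample
Y = rng.standard_normal((n, d_out))    # regression targets

# Closed-form minimizer of mean(0.5 * ||W x - y||^2) + 0.5 * lam * ||W||_F^2,
# used here as a stand-in for a network trained to convergence.
W = (Y.T @ X / n) @ np.linalg.inv(X.T @ X / n + lam * np.eye(d_in))

G = X @ W.T - Y          # per-sample loss gradient w.r.t. the layer output
grad_x = G @ W           # the same gradient backpropagated to the layer input

lhs = W.T @ W                          # the "feature matrix"
rhs = -(grad_x.T @ X) / (n * lam)      # -(1/lam) * mean of (grad_x)(x^T)

print("max abs difference:", np.abs(lhs - rhs).max())
```

In this toy setting the two sides agree up to floating-point error; whether and how the identity extends to deep nonlinear networks is what the paper itself addresses.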

2. Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning

ArXiv ID: 2507.05526

Authors: Anish Dhir, Cristiana Diaconu, Valentinian Mihai Lungu, James Requeima, Richard E. Turner, Mark van der Wilk

Abstract: In scientific domains -- from biology to the social sciences -- many questions boil down to "What effect will we observe if we intervene on a particular variable?" If the causal relationships (e.g., a causal graph) are known, it is possible to estimate the intervention distributions. In the absence of this domain knowledge, the causal structure must be discovered from the available observational data. However, observational data are often compatible with multiple causal graphs, making methods that commit to a single structure prone to overconfidence. A principled way to manage this structural uncertainty is via Bayesian inference, which averages over a posterior distribution on possible causal structures and functional mechanisms. Unfortunately, the number of causal structures grows super-exponentially with the number of nodes in the graph, making computations intractable. We propose to circumvent these challenges by using meta-learning to create an end-to-end model: the Model-Averaged Causal Estimation Transformer Neural Process (MACE-TNP). The model is trained to predict the Bayesian model-averaged interventional posterior distribution, and its end-to-end nature bypasses the need for expensive calculations. Empirically, we demonstrate that MACE-TNP outperforms strong Bayesian baselines. Our work establishes meta-learning as a flexible and scalable paradigm for approximating complex Bayesian causal inference that can be scaled to increasingly challenging settings in the future.

Comment: The paper introduces a meta-learning approach to Bayesian causal inference that predicts model-averaged interventional distributions end to end, matching the emerging-trends criterion for theoretical work.

Relevance: 9 Novelty: 9

3. Causal Foundation Models: Disentangling Physics from Instrument Properties

ArXiv ID: 2507.05333

Authors: Jeroen Audenaert, Daniel Muthukrishna, Paul F. Gregory, David W. Hogg, V. Ashley Villar

Abstract: Foundation models for structured time series data must contend with a fundamental challenge: observations often conflate the true underlying physical phenomena with systematic distortions introduced by measurement instruments. This entanglement limits model generalization, especially in heterogeneous or multi-instrument settings. We present a causally-motivated foundation model that explicitly disentangles physical and instrumental factors using a dual-encoder architecture trained with structured contrastive learning. Leveraging naturally occurring observational triplets (i.e., where the same target is measured under varying conditions, and distinct targets are measured under shared conditions) our model learns separate latent representations for the underlying physical signal and instrument effects. Evaluated on simulated astronomical time series designed to resemble the complexity of variable stars observed by missions like NASA's Transiting Exoplanet Survey Satellite (TESS), our method significantly outperforms traditional single-latent space foundation models on downstream prediction tasks, particularly in low-data regimes. These results demonstrate that our model supports key capabilities of foundation models, including few-shot generalization and efficient adaptation, and highlight the importance of encoding causal structure into representation learning for structured data.

Comment: The paper presents a causally-motivated foundation model using a dual-encoder architecture and structured contrastive learning, which aligns with representation learning and model architecture criteria.

Relevance: 9 Novelty: 8

4. Generalized and Unified Equivalences between Hardness and Pseudoentropy

ArXiv ID: 2507.05972

Authors: Lunjia Hu, Salil Vadhan

Abstract: Pseudoentropy characterizations provide a quantitatively precise demonstration of the close relationship between computational hardness and computational randomness. We prove a unified pseudoentropy characterization that generalizes and strengthens previous results for both uniform and non-uniform models of computation. Our characterization holds for a general family of entropy notions that encompasses the common notions of Shannon entropy and min entropy as special cases. Moreover, we show that the characterizations for different entropy notions can be simultaneously achieved by a single, universal function that simultaneously witnesses computational hardness and computational randomness. A key technical insight of our work is that the notion of weight-restricted calibration from the recent literature on algorithm fairness, along with standard computational indistinguishability (known as multiaccuracy in the fairness literature), suffices for proving pseudoentropy characterizations for general entropy notions. This demonstrates the power of weight-restricted calibration to enhance the classic Complexity-Theoretic Regularity Lemma (Trevisan, Tulsiani, and Vadhan, 2009) and Leakage Simulation Lemma (Jetchev and Pietrzak, 2014) and allows us to achieve an exponential improvement in the complexity dependency on the alphabet size compared to the pseudoentropy characterizations by Casacuberta, Dwork, and Vadhan (2024) based on the much stronger notion of multicalibration. We show that the exponential dependency on the alphabet size is inevitable for multicalibration as well as for the weaker notion of calibrated multiaccuracy.

Comment: The paper provides a unified pseudoentropy characterization, which is relevant to emerging trends in theoretical work challenging established assumptions.

Relevance: 9 Novelty: 8

5. SingLoRA: Low Rank Adaptation Using a Single Matrix

ArXiv ID: 2507.05566

Authors: David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel

Abstract: Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weights update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLama 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.

Comment: The paper proposes SingLoRA, a novel approach to low-rank adaptation, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
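
The contrast the abstract draws between a two-matrix update and a single-matrix update can be sketched in a few lines. This is a minimal illustration assuming a square weight matrix and plain NumPy; the function names, shapes, and initialization scale are hypothetical, and any scaling or scheduling the paper applies to the update is not reproduced here.

```python
import numpy as np

def lora_update(B, A):
    """Standard LoRA-style update: product of B (d x r) and A (r x d)."""
    return B @ A

def singlora_update(A):
    """Single-matrix update as described in the abstract: A (d x r) times its transpose."""
    return A @ A.T

rng = np.random.default_rng(0)
d, r = 64, 4                                   # hypothetical layer width and rank
W0 = rng.standard_normal((d, d))               # frozen pretrained weight (toy values)

B_lora = np.zeros((d, r))                      # LoRA conventionally starts B at zero
A_lora = 0.01 * rng.standard_normal((r, d))
A_single = 0.01 * rng.standard_normal((d, r))

delta = singlora_update(A_single)              # symmetric, rank at most r
W_adapted = W0 + delta

lora_params = B_lora.size + A_lora.size        # 2 * d * r trainable values
single_params = A_single.size                  # d * r trainable values
print(lora_params, single_params)
```

For a square layer the single-matrix form needs half as many trainable values, consistent with the "roughly halves" claim. The resulting update is symmetric by construction, so a non-square layer would need additional handling that this sketch does not cover.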

6. Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

ArXiv ID: 2507.05724

Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model the Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.

Comment: The paper discusses a novel MoE architecture for ASR, focusing on shared routing decisions, which provides insights into MoE architectures.

Relevance: 9 Novelty: 8
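
The difference between per-layer routers and one shared router can be sketched with a toy top-1 MoE stack. This is an illustration only, assuming NumPy, tanh experts, a residual connection, and no gating weights or load-balancing loss; none of these details are taken from the paper, and all names are hypothetical.

```python
import numpy as np

def moe_layer(x, router_w, expert_ws):
    """Toy top-1 MoE layer: each token is processed only by its highest-scoring expert."""
    logits = x @ router_w                  # (tokens, n_experts)
    choice = logits.argmax(axis=-1)        # expert index per token
    out = np.empty_like(x)
    for e, w in enumerate(expert_ws):
        mask = choice == e
        out[mask] = np.tanh(x[mask] @ w)
    return x + out, choice                 # residual connection

rng = np.random.default_rng(0)
tokens, d, n_experts, n_layers = 16, 8, 4, 3
x0 = rng.standard_normal((tokens, d))
experts = [[0.1 * rng.standard_normal((d, d)) for _ in range(n_experts)]
           for _ in range(n_layers)]

# Shared routing: one router matrix is reused by every MoE layer.
shared_router = rng.standard_normal((d, n_experts))
h, shared_choices = x0, []
for layer_experts in experts:
    h, choice = moe_layer(h, shared_router, layer_experts)
    shared_choices.append(choice)

# Independent routing: each layer owns its router, as in the per-layer baseline.
independent_routers = [rng.standard_normal((d, n_experts)) for _ in range(n_layers)]
h, independent_choices = x0, []
for router_w, layer_experts in zip(independent_routers, experts):
    h, choice = moe_layer(h, router_w, layer_experts)
    independent_choices.append(choice)
```

Even with a shared router, a token's expert choice can change from layer to layer because the hidden state changes; sharing ties the routing function across layers, not the decisions themselves.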

7. Differential Mamba

ArXiv ID: 2507.06204

Authors: Nadav Schneider, Itamar Zimerman, Eliya Nachmani

Abstract: Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.

Comment: The paper introduces a novel differential mechanism for the Mamba architecture, which aligns with foundational research in model architecture.

Relevance: 9 Novelty: 8

8. Predicting mutational effects on protein binding from folding energy

ArXiv ID: 2507.05502

Authors: Arthur Deng, Karsten Householder, Fang Wu, Sebastian Thrun, K. Christopher Garcia, Brian Trippe

Abstract: Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning predictors for this task have been proposed, but, presumably due to the scarcity of binding data, these methods underperform computationally expensive estimates based on empirical force fields. In response, we propose a transfer-learning approach that leverages advances in protein sequence modeling and folding stability prediction for this task. The key idea is to parameterize the binding energy as the difference between the folding energy of the protein complex and the sum of the folding energies of its binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more limited binding energy measurements. The resulting predictor, StaB-ddG, is the first deep learning predictor to match the accuracy of the state-of-the-art empirical force-field method FoldX, while offering an over 1,000x speed-up.

Comment: The paper proposes a novel transfer-learning approach for predicting protein binding effects, which aligns with foundational research in AI for Science, particularly in molecular modeling.

Relevance: 9 Novelty: 8
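
The parameterization described in the abstract, binding energy as the complex's folding energy minus the sum of its partners' folding energies, reduces to simple arithmetic once a folding-energy estimate is available. The sketch below assumes a hypothetical `fold_energy` callable and uses made-up numbers purely for illustration; it is not StaB-ddG, and the sign convention (mutant minus wild type) is an assumption.

```python
def binding_energy(fold_energy, complex_id, partner_ids):
    """Binding energy = folding energy of the complex minus the partners' folding energies."""
    return fold_energy(complex_id) - sum(fold_energy(p) for p in partner_ids)

def ddg_binding(fold_energy, wt_complex, wt_partners, mut_complex, mut_partners):
    """Mutational effect on binding: mutant binding energy minus wild-type binding energy."""
    return (binding_energy(fold_energy, mut_complex, mut_partners)
            - binding_energy(fold_energy, wt_complex, wt_partners))

# Made-up folding energies for a toy complex A+B and a mutant A* of chain A.
toy_energies = {"A+B": -12.0, "A": -5.0, "B": -4.0, "A*+B": -11.0, "A*": -5.5}
toy_fold_energy = toy_energies.__getitem__   # stand-in for a learned folding-energy proxy

ddg = ddg_binding(toy_fold_energy, "A+B", ["A", "B"], "A*+B", ["A*", "B"])
print(ddg)  # 1.5 with the toy numbers above
```

In the paper's setting the lookup table would be replaced by a model-based proxy for folding energy, such as the pre-trained inverse-folding model the abstract mentions.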

9. Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach

ArXiv ID: 2507.05685

Authors: Xiaobing Chen, Boyang Zhang, Xiangwei Zhou, Mingxuan Sun, Shuai Zhang, Songyang Zhang, Geoffrey Ye Li

Abstract: The integration of Federated Learning (FL) and Mixture-of-Experts (MoE) presents a compelling pathway for training more powerful, large-scale artificial intelligence models (LAMs) on decentralized data while preserving privacy. However, efficient federated training of these complex MoE-structured LAMs is hindered by significant system-level challenges, particularly in managing the interplay between heterogeneous client resources and the sophisticated coordination required for numerous specialized experts. This article highlights a critical, yet underexplored concept: the absence of robust quantitative strategies for dynamic client-expert alignment that holistically considers varying client capacities and the imperative for system-wise load balancing. Specifically, we propose a conceptual system design for intelligent client-expert alignment that incorporates dynamic fitness scoring, global expert load monitoring, and client capacity profiling. By tackling these systemic issues, we can unlock more scalable, efficient, and robust training mechanisms with fewer communication rounds for convergence, paving the way for the widespread deployment of large-scale federated MoE-structured LAMs in edge computing with ultra-high communication efficiency.

Comment: The paper discusses a system-level approach to federated training of Mixture-of-Experts (MoE) models, which aligns with the core topic of model architecture, specifically MoE.

Relevance: 9 Novelty: 7

10. Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

ArXiv ID: 2507.06056

Authors: Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu

Abstract: Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or "gibberish", we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).

Comment: The paper investigates memorization in LLMs, providing theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 7

11. Simple Convergence Proof of Adam From a Sign-like Descent Perspective

ArXiv ID: 2507.05966

Authors: Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Zhouchen Lin

Abstract: Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as $\bm{x}_{t+1} = \bm{x}_t - \frac{\gamma_t}{{\sqrt{\bm{v}_t}+\epsilon}} \circ \bm{m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as $\bm{x}_{t+1} = \bm{x}_t - \gamma_t \frac{|\bm{m}_t|}{{\sqrt{\bm{v}_t}+\epsilon}} \circ {\rm Sign}(\bm{m}_t)$. This reformulation significantly simplifies the convergence analysis. For the first time, with some mild conditions, we prove that Adam achieves the optimal rate of ${\cal O}(\frac{1}{T^{\sfrac{1}{4}}})$ rather than the previous ${\cal O} \left(\frac{\ln T}{T^{\sfrac{1}{4}}}\right)$ under weak assumptions of the generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, without dependence on the model dimensionality or the numerical stability parameter $\epsilon$. Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.

Comment: The paper provides a novel convergence proof for the Adam optimizer, offering theoretical insights into its behavior, which is relevant to training dynamics in neural networks.

Relevance: 8 Novelty: 8
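
The two update rules in the abstract describe the same step, since $|\bm{m}_t| \circ {\rm Sign}(\bm{m}_t) = \bm{m}_t$ elementwise; only the grouping of terms differs. A small numerical check makes this concrete. It is a sketch assuming NumPy and the bias-correction-free form written in the abstract; the function names and hyperparameter values are illustrative.

```python
import numpy as np

def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Exponential moving averages of gradients (m) and squared gradients (v)."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    return m, v

def step_preconditioned(x, m, v, lr, eps):
    """Adam viewed as preconditioned SGD with momentum."""
    return x - lr / (np.sqrt(v) + eps) * m

def step_sign_like(x, m, v, lr, eps):
    """Adam viewed as sign descent with an elementwise step size |m| / (sqrt(v) + eps)."""
    return x - lr * np.abs(m) / (np.sqrt(v) + eps) * np.sign(m)

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
grads = [rng.standard_normal(10) for _ in range(50)]   # toy stochastic gradients
m, v = adam_moments(grads)

x_pre = step_preconditioned(x, m, v, lr=1e-3, eps=1e-8)
x_sign = step_sign_like(x, m, v, lr=1e-3, eps=1e-8)
print(np.allclose(x_pre, x_sign))
```

The sign-like form separates the update direction, ${\rm Sign}(\bm{m}_t)$, from an elementwise step-size factor, and it is this view that the abstract credits with simplifying the convergence analysis.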

12. Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN)

ArXiv ID: 2507.05498

Authors: Reza T. Batley, Chanwook Park, Wing Kam Liu, Sourav Saha

Abstract: Data-driven science and computation have advanced immensely to construct complex functional relationships using trainable parameters. However, efficiently discovering interpretable and accurate closed-form expressions from complex datasets remains a challenge. The article presents a novel approach called Explainable Hierarchical Deep Learning Neural Networks or Ex-HiDeNN that uses an accurate, frugal, fast, separable, and scalable neural architecture with symbolic regression to discover closed-form expressions from limited observations. The article presents the two-step Ex-HiDeNN algorithm with a separability checker embedded in it. The accuracy and efficiency of Ex-HiDeNN are tested on several benchmark problems, including discerning a dynamical system from data, and the outcomes are reported. Ex-HiDeNN generally shows outstanding approximation capability in these benchmarks, producing orders of magnitude smaller errors compared to reference data and traditional symbolic regression. Later, Ex-HiDeNN is applied to three engineering applications: a) discovering a closed-form fatigue equation, b) identification of hardness from micro-indentation test data, and c) discovering the expression for the yield surface with data. In every case, Ex-HiDeNN outperformed the reference methods used in the literature. The proposed method is built upon the foundation and published works of the authors on Hierarchical Deep Learning Neural Network (HiDeNN) and Convolutional HiDeNN. The article also provides a clear idea about the current limitations and future extensions of Ex-HiDeNN.

Comment: The paper proposes a novel approach for discovering interpretable expressions using a hierarchical deep learning architecture, relevant to representation learning and model architecture.