Personalized Daily ArXiv Papers 2025-07-09
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 32862 | 3896 | 36758 |
| Cost | $0.08 | $0.04 | $0.12 |
Total arXiv papers: 535
Total scanned papers: 310
Total relevant papers: 18
Table of contents with paper titles:
-
FACT: the Features At Convergence Theorem for neural networks Authors: Enric Boix-Adsera, Neil Mallinar, James B. Simon, Mikhail Belkin
-
Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning Authors: Anish Dhir, Cristiana Diaconu, Valentinian Mihai Lungu, James Requeima, Richard E. Turner, Mark van der Wilk
-
Causal Foundation Models: Disentangling Physics from Instrument Properties Authors: Jeroen Audenaert, Daniel Muthukrishna, Paul F. Gregory, David W. Hogg, V. Ashley Villar
-
Generalized and Unified Equivalences between Hardness and Pseudoentropy Authors: Lunjia Hu, Salil Vadhan
-
SingLoRA: Low Rank Adaptation Using a Single Matrix Authors: David Bensa\"id, Noam Rotstein, Roy Velich, Daniel Bensa\"id, Ron Kimmel
-
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
-
Differential Mamba Authors: Nadav Schneider, Itamar Zimerman, Eliya Nachmani
-
Predicting mutational effects on protein binding from folding energy Authors: Arthur Deng, Karsten Householder, Fang Wu, Sebastian Thrun, K. Christopher Garcia, Brian Trippe
-
Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach Authors: Xiaobing Chen, Boyang Zhang, Xiangwei Zhou, Mingxuan Sun, Shuai Zhang, Songyang Zhang, Geoffrey Ye Li
-
Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs Authors: Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu
-
Simple Convergence Proof of Adam From a Sign-like Descent Perspective Authors: Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Zhouchen Lin
-
Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN) Authors: Reza T. Batley, Chanwook Park, Wing Kam Liu, Sourav Saha
-
LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks Authors: William Fleshman, Benjamin Van Durme
-
Mitigating Shortcut Learning with InterpoLated Learning Authors: Michalis Korakakis, Andreas Vlachos, Adrian Weller
-
Coding Triangle: How Does Large Language Model Understand Code? Authors: Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen
-
Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning Authors: Ze'ev Zukerman, Bassel Hamoud, Kfir Y. Levy
-
QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models Authors: Sebastian Siegel, Ming-Jay Yang, Younes Bouhadjar, Maxime Fabre, Emre Neftci, John Paul Strachan
-
Modern Methods in Associative Memory Authors: Dmitry Krotov, Benjamin Hoover, Parikshit Ram, Bao Pham
1. FACT: the Features At Convergence Theorem for neural networks
ArXiv ID: 2507.05644
Authors: Enric Boix-Adsera, Neil Mallinar, James B. Simon, Mikhail Belkin
Abstract: A central challenge in deep learning theory is to understand how neural networks learn and represent features. To this end, we prove the Features at Convergence Theorem (FACT), which gives a self-consistency equation that neural network weights satisfy at convergence when trained with nonzero weight decay. For each weight matrix $W$, this equation relates the "feature matrix" $W^\top W$ to the set of input vectors passed into the matrix during forward propagation and the loss gradients passed through it during backpropagation. We validate this relation empirically, showing that neural features indeed satisfy the FACT at convergence. Furthermore, by modifying the "Recursive Feature Machines" of Radhakrishnan et al. 2024 so that they obey the FACT, we arrive at a new learning algorithm, FACT-RFM. FACT-RFM achieves high performance on tabular data and captures various feature learning behaviors that occur in neural network training, including grokking in modular arithmetic and phase transitions in learning sparse parities.
Comment: The paper presents the Features At Convergence Theorem (FACT) for neural networks, providing insights into how neural networks learn and represent features, aligning with representation learning.
Relevance: 10 Novelty: 9
2. Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning
ArXiv ID: 2507.05526
Authors: Anish Dhir, Cristiana Diaconu, Valentinian Mihai Lungu, James Requeima, Richard E. Turner, Mark van der Wilk
Abstract: In scientific domains -- from biology to the social sciences -- many questions boil down to \textit{What effect will we observe if we intervene on a particular variable?} If the causal relationships (e.g.~a causal graph) are known, it is possible to estimate the intervention distributions. In the absence of this domain knowledge, the causal structure must be discovered from the available observational data. However, observational data are often compatible with multiple causal graphs, making methods that commit to a single structure prone to overconfidence. A principled way to manage this structural uncertainty is via Bayesian inference, which averages over a posterior distribution on possible causal structures and functional mechanisms. Unfortunately, the number of causal structures grows super-exponentially with the number of nodes in the graph, making computations intractable. We propose to circumvent these challenges by using meta-learning to create an end-to-end model: the Model-Averaged Causal Estimation Transformer Neural Process (MACE-TNP). The model is trained to predict the Bayesian model-averaged interventional posterior distribution, and its end-to-end nature bypasses the need for expensive calculations. Empirically, we demonstrate that MACE-TNP outperforms strong Bayesian baselines. Our work establishes meta-learning as a flexible and scalable paradigm for approximating complex Bayesian causal inference, that can be scaled to increasingly challenging settings in the future.
Comment: The paper introduces a meta-learning approach for Bayesian causal inference, which is a cutting-edge theoretical work in emerging trends.
Relevance: 9 Novelty: 9
3. Causal Foundation Models: Disentangling Physics from Instrument Properties
ArXiv ID: 2507.05333
Authors: Jeroen Audenaert, Daniel Muthukrishna, Paul F. Gregory, David W. Hogg, V. Ashley Villar
Abstract: Foundation models for structured time series data must contend with a fundamental challenge: observations often conflate the true underlying physical phenomena with systematic distortions introduced by measurement instruments. This entanglement limits model generalization, especially in heterogeneous or multi-instrument settings. We present a causally-motivated foundation model that explicitly disentangles physical and instrumental factors using a dual-encoder architecture trained with structured contrastive learning. Leveraging naturally occurring observational triplets (i.e., where the same target is measured under varying conditions, and distinct targets are measured under shared conditions) our model learns separate latent representations for the underlying physical signal and instrument effects. Evaluated on simulated astronomical time series designed to resemble the complexity of variable stars observed by missions like NASA's Transiting Exoplanet Survey Satellite (TESS), our method significantly outperforms traditional single-latent space foundation models on downstream prediction tasks, particularly in low-data regimes. These results demonstrate that our model supports key capabilities of foundation models, including few-shot generalization and efficient adaptation, and highlight the importance of encoding causal structure into representation learning for structured data.
Comment: The paper presents a causally-motivated foundation model using a dual-encoder architecture and structured contrastive learning, which aligns with representation learning and model architecture criteria.
Relevance: 9 Novelty: 8
4. Generalized and Unified Equivalences between Hardness and Pseudoentropy
ArXiv ID: 2507.05972
Authors: Lunjia Hu, Salil Vadhan
Abstract: Pseudoentropy characterizations provide a quantitatively precise demonstration of the close relationship between computational hardness and computational randomness. We prove a unified pseudoentropy characterization that generalizes and strengthens previous results for both uniform and non-uniform models of computation. Our characterization holds for a general family of entropy notions that encompasses the common notions of Shannon entropy and min entropy as special cases. Moreover, we show that the characterizations for different entropy notions can be simultaneously achieved by a single, universal function that simultaneously witnesses computational hardness and computational randomness. A key technical insight of our work is that the notion of weight-restricted calibration from the recent literature on algorithm fairness, along with standard computational indistinguishability (known as multiaccuracy in the fairness literature), suffices for proving pseudoentropy characterizations for general entropy notions. This demonstrates the power of weight-restricted calibration to enhance the classic Complexity-Theoretic Regularity Lemma (Trevisan, Tulsiani, and Vadhan, 2009) and Leakage Simulation Lemma (Jetchev and Pietrzak, 2014) and allows us to achieve an exponential improvement in the complexity dependency on the alphabet size compared to the pseudoentropy characterizations by Casacuberta, Dwork, and Vadhan (2024) based on the much stronger notion of multicalibration. We show that the exponential dependency on the alphabet size is inevitable for multicalibration as well as for the weaker notion of calibrated multiaccuracy.
Comment: The paper provides a unified pseudoentropy characterization, which is relevant to emerging trends in theoretical work challenging established assumptions.
Relevance: 9 Novelty: 8
5. SingLoRA: Low Rank Adaptation Using a Single Matrix
ArXiv ID: 2507.05566
Authors: David Bensa\"id, Noam Rotstein, Roy Velich, Daniel Bensa\"id, Ron Kimmel
Abstract: Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weights update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLama 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.
Comment: The paper proposes SingLoRA, a novel approach to low-rank adaptation, relevant to model compression and efficiency.
Relevance: 9 Novelty: 8
6. Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
ArXiv ID: 2507.05724
Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model \emph{Omni-router Transformer}. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
Comment: The paper discusses a novel MoE architecture for ASR, focusing on shared routing decisions, which provides insights into MoE architectures.
Relevance: 9 Novelty: 8
7. Differential Mamba
ArXiv ID: 2507.06204
Authors: Nadav Schneider, Itamar Zimerman, Eliya Nachmani
Abstract: Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.
Comment: The paper introduces a novel differential mechanism for the Mamba architecture, which aligns with foundational research in model architecture.
Relevance: 9 Novelty: 8
8. Predicting mutational effects on protein binding from folding energy
ArXiv ID: 2507.05502
Authors: Arthur Deng, Karsten Householder, Fang Wu, Sebastian Thrun, K. Christopher Garcia, Brian Trippe
Abstract: Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning predictors for this task have been proposed, but, presumably due to the scarcity of binding data, these methods underperform computationally expensive estimates based on empirical force fields. In response, we propose a transfer-learning approach that leverages advances in protein sequence modeling and folding stability prediction for this task. The key idea is to parameterize the binding energy as the difference between the folding energy of the protein complex and the sum of the folding energies of its binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more limited binding energy measurements. The resulting predictor, StaB-ddG, is the first deep learning predictor to match the accuracy of the state-of-the-art empirical force-field method FoldX, while offering an over 1,000x speed-up.
Comment: The paper proposes a novel transfer-learning approach for predicting protein binding effects, which aligns with foundational research in AI for Science, particularly in molecular modeling.
Relevance: 9 Novelty: 8
9. Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach
ArXiv ID: 2507.05685
Authors: Xiaobing Chen, Boyang Zhang, Xiangwei Zhou, Mingxuan Sun, Shuai Zhang, Songyang Zhang, Geoffrey Ye Li
Abstract: The integration of Federated Learning (FL) and Mixture-of-Experts (MoE) presents a compelling pathway for training more powerful, large-scale artificial intelligence models (LAMs) on decentralized data while preserving privacy. However, efficient federated training of these complex MoE-structured LAMs is hindered by significant system-level challenges, particularly in managing the interplay between heterogeneous client resources and the sophisticated coordination required for numerous specialized experts. This article highlights a critical, yet underexplored concept: the absence of robust quantitative strategies for dynamic client-expert alignment that holistically considers varying client capacities and the imperative for system-wise load balancing. Specifically, we propose a conceptual system design for intelligent client-expert alignment that incorporates dynamic fitness scoring, global expert load monitoring, and client capacity profiling. By tackling these systemic issues, we can unlock more scalable, efficient, and robust training mechanisms {with fewer communication rounds for convergence}, paving the way for the widespread deployment of large-scale federated MoE-structured LAMs in edge computing with ultra-high communication efficiency.
Comment: The paper discusses a system-level approach to federated training of Mixture-of-Experts (MoE) models, which aligns with the core topic of model architecture, specifically MoE.
Relevance: 9 Novelty: 7
10. Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs
ArXiv ID: 2507.06056
Authors: Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu
Abstract: Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or "gibberish", we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).
Comment: The paper investigates memorization in LLMs, providing theoretical insights into LLM behavior and interpretability.
Relevance: 9 Novelty: 7
11. Simple Convergence Proof of Adam From a Sign-like Descent Perspective
ArXiv ID: 2507.05966
Authors: Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Zhouchen Lin
Abstract: Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as $\bm{x}{t+1} = \bm{x}_t - \frac{\gamma_t}{{\sqrt{\bm{v}_t}+\epsilon}} \circ \bm{m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as $\bm{x}\right)$ under weak assumptions of the generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, without dependence on the model dimensionality or the numerical stability parameter $\epsilon$. Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.} = \bm{x}_t - \gamma_t \frac{|\bm{m}_t|}{{\sqrt{\bm{v}_t}+\epsilon}} \circ {\rm Sign}(\bm{m}_t)$. This reformulation significantly simplifies the convergence analysis. For the first time, with some mild conditions, we prove that Adam achieves the optimal rate of ${\cal O}(\frac{1}{T^{\sfrac{1}{4}}})$ rather than the previous ${\cal O} \left(\frac{\ln T}{T^{\sfrac{1}{4}}
Comment: The paper provides a novel convergence proof for the Adam optimizer, offering theoretical insights into its behavior, which is relevant to training dynamics in neural networks.
Relevance: 8 Novelty: 8
12. Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN)
ArXiv ID: 2507.05498
Authors: Reza T. Batley, Chanwook Park, Wing Kam Liu, Sourav Saha
Abstract: Data-driven science and computation have advanced immensely to construct complex functional relationships using trainable parameters. However, efficiently discovering interpretable and accurate closed-form expressions from complex dataset remains a challenge. The article presents a novel approach called Explainable Hierarchical Deep Learning Neural Networks or Ex-HiDeNN that uses an accurate, frugal, fast, separable, and scalable neural architecture with symbolic regression to discover closed-form expressions from limited observation. The article presents the two-step Ex-HiDeNN algorithm with a separability checker embedded in it. The accuracy and efficiency of Ex-HiDeNN are tested on several benchmark problems, including discerning a dynamical system from data, and the outcomes are reported. Ex-HiDeNN generally shows outstanding approximation capability in these benchmarks, producing orders of magnitude smaller errors compared to reference data and traditional symbolic regression. Later, Ex-HiDeNN is applied to three engineering applications: a) discovering a closed-form fatigue equation, b) identification of hardness from micro-indentation test data, and c) discovering the expression for the yield surface with data. In every case, Ex-HiDeNN outperformed the reference methods used in the literature. The proposed method is built upon the foundation and published works of the authors on Hierarchical Deep Learning Neural Network (HiDeNN) and Convolutional HiDeNN. The article also provides a clear idea about the current limitations and future extensions of Ex-HiDeNN.
Comment: The paper proposes a novel approach for discovering interpretable expressions using a hierarchical deep learning architecture, relevant to representation learning and model architecture.
Relevance: 8 Novelty: 7
13. LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks
ArXiv ID: 2507.05346
Authors: William Fleshman, Benjamin Van Durme
Abstract: The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG's compatibility with alternative solutions such as retrieval-augmented generation (RAG).
Comment: The paper proposes LoRA-Augmented Generation for efficient selection and combination of task-specific adapters, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
14. Mitigating Shortcut Learning with InterpoLated Learning
ArXiv ID: 2507.05527
Authors: Michalis Korakakis, Andreas Vlachos, Adrian Weller
Abstract: Empirical risk minimization (ERM) incentivizes models to exploit shortcuts, i.e., spurious correlations between input attributes and labels that are prevalent in the majority of the training data but unrelated to the task at hand. This reliance hinders generalization on minority examples, where such correlations do not hold. Existing shortcut mitigation approaches are model-specific, difficult to tune, computationally expensive, and fail to improve learned representations. To address these issues, we propose InterpoLated Learning (InterpoLL) which interpolates the representations of majority examples to include features from intra-class minority examples with shortcut-mitigating patterns. This weakens shortcut influence, enabling models to acquire features predictive across both minority and majority examples. Experimental results on multiple natural language understanding tasks demonstrate that InterpoLL improves minority generalization over both ERM and state-of-the-art shortcut mitigation methods, without compromising accuracy on majority examples. Notably, these gains persist across encoder, encoder-decoder, and decoder-only architectures, demonstrating the method's broad applicability.
Comment: The paper introduces InterpoLated Learning to mitigate shortcut learning, which is relevant to representation learning and training dynamics in neural networks.
Relevance: 8 Novelty: 7
15. Coding Triangle: How Does Large Language Model Understand Code?
ArXiv ID: 2507.06138
Authors: Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen
Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.
Comment: The paper evaluates LLMs' understanding of code, which is relevant to large language models, focusing on theoretical insights into LLM behavior.
Relevance: 8 Novelty: 7
16. Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning
ArXiv ID: 2507.05508
Authors: Ze'ev Zukerman, Bassel Hamoud, Kfir Y. Levy
Abstract: Distributed learning methods have gained substantial momentum in recent years, with communication overhead often emerging as a critical bottleneck. Gradient compression techniques alleviate communication costs but involve an inherent trade-off between the empirical efficiency of biased compressors and the theoretical guarantees of unbiased compressors. In this work, we introduce a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. This approach effectively bridges the gap between biased and unbiased methods, combining the strengths of both. To showcase the versatility of our method, we apply it to popular compressors, like Top-$k$ and bit-wise compressors, resulting in enhanced variants. Furthermore, we derive an adaptive version of our approach to further improve its performance. We validate our method empirically on distributed deep learning tasks.
Comment: The paper introduces a novel Multilevel Monte Carlo compression scheme, which is relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
17. QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models
ArXiv ID: 2507.06079
Authors: Sebastian Siegel, Ming-Jay Yang, Younes Bouhadjar, Maxime Fabre, Emre Neftci, John Paul Strachan
Abstract: Structured State Space models (SSM) have recently emerged as a new class of deep learning models, particularly well-suited for processing long sequences. Their constant memory footprint, in contrast to the linearly scaling memory demands of Transformers, makes them attractive candidates for deployment on resource-constrained edge-computing devices. While recent works have explored the effect of quantization-aware training (QAT) on SSMs, they typically do not address its implications for specialized edge hardware, for example, analog in-memory computing (AIMC) chips. In this work, we demonstrate that QAT can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We analyze the relation between model size and numerical precision, and show that QAT enhances robustness to analog noise and enables structural pruning. Finally, we integrate these techniques to deploy SSMs on a memristive analog in-memory computing substrate and highlight the resulting benefits in terms of computational efficiency.
Comment: The paper discusses quantization-aware training for structured state-space models, relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
18. Modern Methods in Associative Memory
ArXiv ID: 2507.06211
Authors: Dmitry Krotov, Benjamin Hoover, Parikshit Ram, Bao Pham
Abstract: Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.
Comment: The paper discusses Associative Memories and their connection to modern AI architectures like Transformers, which aligns with foundational research in model architecture.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.