Personalized Daily ArXiv Papers 2025-07-02

[gpt-4o]	Prompt	Completion	Total
Token	33705	3738	37443
Cost	$0.08	$0.04	$0.12

Total arXiv papers: 498

Total scanned papers: 315

Total relevant papers: 21

Table of contents with paper titles:

Hebbian Physics Networks: A Self-Organizing Computational Architecture Based on Local Physical Laws Authors: Gunjan Auti, Hirofumi Daiguji, Gouhei Tanaka
Testing the spin-bath view of self-attention: A Hamiltonian analysis of GPT-2 Transformer Authors: Satadeep Bhattacharjee, Seung-Cheol Lee
The language of time: a language model perspective on time-series foundation models Authors: Yi Xie, Yun Xiong, Zejian Shi, Hao Niu, Zhengfu Liu
Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap Authors: Yifan Sun, Yushan Liang, Zhen Zhang, Jiaye Teng
A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search Authors: Austin R. Ellis-Mohr, Anuj K. Nayak, Lav R. Varshney
Description of the Training Process of Neural Networks via Ergodic Theorem : Ghost nodes Authors: Eun-Ji Park, Sangwon Yun
Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact Authors: Rizwan Qureshi, Ranjan Sapkota, Abbas Shah, Amgad Muneer, Anas Zafar, Ashmal Vayani, Maged Shoman, Abdelrahman B. M. Eldaly, Kai Zhang, Ferhat Sadak, Shaina Raza, Xinqi Fan, Ravid Shwartz-Ziv, Hong Yan, Vinjia Jain, Aman Chadha, Manoj Karkee, Jia Wu, Philip Torr, Seyedali Mirjalili
LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing Authors: Wenbing Li, Zikai Song, Hang Zhou, Yunyao Zhang, Junqing Yu, Wei Yang
Disentangled Feature Importance Authors: Jin-Hong Du, Kathryn Roeder, Larry Wasserman
BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants and Noise Contrastive Estimation Authors: Rishal Aggrwal, Jacky Chen, Nicholas M. Boffi, David Ryan Koes
Model Fusion via Neuron Interpolation Authors: Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
SAFER: Probing Safety in Reward Models with Sparse Autoencoder Authors: Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections Authors: Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu
Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies Authors: Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang
NN-Former: Rethinking Graph Structure in Neural Architecture Representation Authors: Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows Authors: Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly
Towards Undistillable Models by Minimizing Conditional Mutual Information Authors: Linfeng Ye, Shayan Mohajer Hamidi, En-hui Yang
Enhancing Interpretability in Generative Modeling: Statistically Disentangled Latent Spaces Guided by Generative Factors in Scientific Datasets Authors: Arkaprabha Ganguli, Nesar Ramachandra, Julie Bessac, Emil Constantinescu
Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones Authors: Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao
DFReg: A Physics-Inspired Framework for Global Weight Distribution Regularization in Neural Networks Authors: Giovanni Ruggieri
Gradient-based Fine-Tuning through Pre-trained Model Regularization Authors: Xuanbo Liu, Liu Liu, Fuxiang Wu, Fusheng Hao, Xianglong Liu

1. Hebbian Physics Networks: A Self-Organizing Computational Architecture Based on Local Physical Laws

ArXiv ID: 2507.00641

Authors: Gunjan Auti, Hirofumi Daiguji, Gouhei Tanaka

Abstract: Traditional machine learning approaches in physics rely on global optimization, limiting interpretability and enforcing physical constraints externally. We introduce the Hebbian Physics Network (HPN), a self-organizing computational framework in which learning emerges from local Hebbian updates driven by violations of conservation laws. Grounded in non-equilibrium thermodynamics and inspired by Prigogine's theory of dissipative structures, HPNs eliminate the need for global loss functions by encoding physical laws directly into the system's local dynamics. Residuals - quantified imbalances in continuity, momentum, or energy - serve as thermodynamic signals that drive weight adaptation through generalized Hebbian plasticity. We demonstrate this approach on incompressible fluid flow and continuum diffusion, where physically consistent structures emerge from random initial conditions without supervision. HPNs reframe computation as a residual-driven thermodynamic process, offering an interpretable, scalable, and physically grounded alternative for modeling complex dynamical systems.

Comment: The paper introduces a novel computational architecture, Hebbian Physics Networks, which is grounded in non-equilibrium thermodynamics and offers a new perspective on modeling complex dynamical systems. This aligns with the emerging trends criterion.

Relevance: 9 Novelty: 9

2. Testing the spin-bath view of self-attention: A Hamiltonian analysis of GPT-2 Transformer

ArXiv ID: 2507.00683

Authors: Satadeep Bhattacharjee, Seung-Cheol Lee

Abstract: The recently proposed physics-based framework by Huo and Johnson models the attention mechanism of Large Language Models (LLMs) as an interacting two-body spin system, offering a first-principles explanation for phenomena like repetition and bias. Building on this hypothesis, we extract the complete Query-Key weight matrices from a production-grade GPT-2 model and derive the corresponding effective Hamiltonian for every attention head. From these Hamiltonians we obtain analytic phase boundaries, logit gap criteria that predict which token should dominate the next-token distribution for a given context. A systematic evaluation on 144 heads across 20 factual-recall prompts reveals a strong negative correlation between the theoretical logit gaps and the model's empirical token rankings ($r\approx-0.70$, $p<10^{-3}$). Targeted ablations further show that suppressing the heads most aligned with the spin-bath predictions induces the anticipated shifts in output probabilities, confirming a causal link rather than a coincidental association. Taken together, our findings provide the first strong empirical evidence for the spin-bath analogy in a production-grade model. This validation not only furnishes a tractable, physics-inspired lens for interpretability but also provides the groundwork for novel generative models, bridging the gap between theoretical condensed matter physics and AI.

Comment: The paper provides a theoretical analysis of the attention mechanism in GPT-2 using a physics-based framework, offering insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8

3. The language of time: a language model perspective on time-series foundation models

ArXiv ID: 2507.00078

Authors: Yi Xie, Yun Xiong, Zejian Shi, Hao Niu, Zhengfu Liu

Abstract: With the rise of large language models, the paradigm of training foundation models with massive parameter counts on vast datasets has been adopted in multiple domains to achieve remarkable success. Time series foundation models represent a significant extension of this paradigm, demonstrating exceptional expressive power, generalization, and cross-domain transferability. However, this gives rise to a fundamental paradox: time series data reflect distinct dynamical systems, making cross-domain transfer intuitively implausible, yet this is contradicted by the models' empirical success. To resolve this paradox, this paper investigates, from both theoretical and experimental perspectives, the representation learning mechanisms and generalization capabilities of patch-based time series foundation models. We argue that such models are not merely applying a new architecture but are fundamentally generalizing the representation paradigm of language models by extending deterministic vector-based representations to latent probabilistic distributional forms. Our theoretical analysis supports this framework by demonstrating that continuous time-series patches can be faithfully quantized into a discrete vocabulary whose key statistical properties are highly consistent with those of natural language. This generalization allows time series models to inherit the robust representation and transfer abilities of large language models, thereby explaining their superior performance in temporal tasks. Ultimately, our work provides a rigorous theoretical cornerstone for understanding, evaluating, and improving the safety and reliability of large-scale time series foundation models.

Comment: The paper investigates the representation learning mechanisms of time-series foundation models, which aligns with the core topic of representation learning.

Relevance: 9 Novelty: 8

4. Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

ArXiv ID: 2507.00075

Authors: Yifan Sun, Yushan Liang, Zhen Zhang, Jiaye Teng

Abstract: Self-improvement is among the most prominent techniques within the realm of large language models (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM's solver capability and verifier capability. Based on the theoretical framework, we further introduce how to predict the ultimate power of self-improvement using only information from the first few training epochs. We empirically validate the effectiveness of the theoretical model on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.

Comment: The paper models the training dynamics of LLM self-improvement, providing theoretical insights into LLM behavior, which is relevant to foundational research in LLMs.

Relevance: 9 Novelty: 8

5. A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search

ArXiv ID: 2507.00004

Authors: Austin R. Ellis-Mohr, Anuj K. Nayak, Lav R. Varshney

Abstract: Large language models (LLMs) demand considerable computational, energy, and financial resources during both training and deployment. While scaling laws for training have guided much of the field's recent progress, inference costs now represent a significant and growing component of the overall resource burden, particularly for reasoning-focused models. Existing characterizations of compute-optimality that consider model size, dataset size, and inference tokens in isolation or in fixed combinations risk overlooking more efficient operating points. We introduce directed stochastic skill search (DS3), a general framework that represents inference as stochastic traversal over a learned skill graph. From a simplified yet expressive instantiation, we derive closed-form expressions for task success and compute cost across a wide range of inference strategies -- including chain-of-thought (CoT) and tree-of-thought (ToT) -- enabling comparative analysis as a function of task difficulty and model capability. To that end, we extend a prior first-principles tripartite graph framework of LLM training to incorporate inference, and separately bridge DS3 with empirical methods that characterize LLM scaling behavior. We theoretically recover empirically observed patterns, including: linear accuracy scaling with logarithmic compute; variation in preferred inference strategies as a function of task difficulty and model capability; emergent behavior elicited by reasoning even when performance plateaus under parameter scaling; and both best-of-N (BoN) and majority voting behavior captured within a unified analytical framework. By explicitly characterizing training-inference interdependencies, our framework deepens theoretical understanding and supports principled algorithmic design and resource allocation.

Comment: The paper introduces a framework for inference compute scaling in LLMs, providing theoretical insights into LLM behavior, aligning with the Large Language Models criterion.

Relevance: 9 Novelty: 8
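
As a point of reference for the best-of-N (BoN) behavior mentioned in the abstract, here is a minimal sketch of the textbook BoN success probability. It assumes independent samples, each correct with probability `p`, and a perfect verifier that always picks a correct sample when one exists. These assumptions and the function name `best_of_n_success` are illustrative choices, not the closed-form expressions derived in the paper.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumes each sample succeeds with probability p and that a perfect
    verifier selects a correct sample whenever one is present.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


# Success probability as the number of samples (inference compute) doubles.
curve = {n: best_of_n_success(0.2, n) for n in (1, 2, 4, 8, 16)}
for n, prob in curve.items():
    print(f"N={n:2d}  success={prob:.3f}")
```

Under these assumptions the gain from each doubling of N shrinks as the success probability approaches 1, which is one simple way to see why the preferred inference strategy can depend on task difficulty.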

6. Description of the Training Process of Neural Networks via Ergodic Theorem : Ghost nodes

ArXiv ID: 2507.01003

Authors: Eun-Ji Park, Sangwon Yun

Abstract: Recent studies have proposed interpreting the training process from an ergodic perspective. Building on this foundation we present a unified framework for understanding and accelerating the training of deep neural networks via stochastic gradient descent. By analyzing the geometric landscape of the objective function we introduce a practical diagnostic, the running estimate of the largest Lyapunov exponent, which provably distinguishes genuine convergence toward stable minimizers from mere statistical stabilization near saddle points. We then propose a ghost category extension for standard classifiers that adds auxiliary ghost output nodes so the model gains extra descent directions that open a lateral corridor around narrow loss barriers and enable the optimizer to bypass poor basins during the early training phase. We show that this extension strictly reduces approximation error and that after sufficient convergence the ghost dimensions collapse and the extended model's invariant law coincides with that of the original and there exists a path in the enlarged parameter space along which the total loss does not increase while the original loss decreases by an arbitrary margin. Taken together these results provide a principled architecture level intervention that accelerates early stage trainability while preserving asymptotic behavior.

Comment: The paper introduces a novel framework for understanding and accelerating the training of neural networks via an ergodic perspective, which aligns with the representation learning and model architecture criteria.

Relevance: 9 Novelty: 8
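
To illustrate what a running estimate of the largest Lyapunov exponent looks like, the sketch below computes one for a toy one-dimensional system, the logistic map x -> 4x(1 - x), whose exponent is known to be ln 2. This is a generic dynamical-systems example under stated assumptions; it is not the paper's diagnostic for stochastic gradient descent, and the function name `running_lyapunov` is hypothetical.

```python
import math


def running_lyapunov(x0: float = 0.3, steps: int = 50_000, burn_in: int = 1_000):
    """Running estimate of the Lyapunov exponent of x -> 4x(1 - x).

    The exponent is the long-run average of log|f'(x)| along the orbit,
    where f'(x) = 4(1 - 2x). Returns the list of running averages.
    """
    x = x0
    # Discard the transient so the average reflects long-run behavior.
    for _ in range(burn_in):
        x = 4.0 * x * (1.0 - x)

    total = 0.0
    estimates = []
    for t in range(1, steps + 1):
        total += math.log(abs(4.0 * (1.0 - 2.0 * x)))
        estimates.append(total / t)
        x = 4.0 * x * (1.0 - x)
    return estimates


estimates = running_lyapunov()
print(f"final estimate: {estimates[-1]:.3f}, ln 2 = {math.log(2):.3f}")
```

A positive estimate signals sensitive dependence on initial conditions, while a negative one signals contraction toward a stable point; the abstract describes using the sign of an analogous quantity to separate genuine convergence from mere statistical stabilization.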

7. Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact

ArXiv ID: 2507.00951

Authors: Rizwan Qureshi, Ranjan Sapkota, Abbas Shah, Amgad Muneer, Anas Zafar, Ashmal Vayani, Maged Shoman, Abdelrahman B. M. Eldaly, Kai Zhang, Ferhat Sadak, Shaina Raza, Xinqi Fan, Ravid Shwartz-Ziv, Hong Yan, Vinjia Jain, Aman Chadha, Manoj Karkee, Jia Wu, Philip Torr, Seyedali Mirjalili

Abstract: Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.

Comment: The paper discusses foundational aspects of AGI, focusing on cognitive and architectural foundations, which aligns with emerging trends and theoretical insights into AI.

Relevance: 9 Novelty: 8

8. LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

ArXiv ID: 2507.00029

Authors: Wenbing Li, Zikai Song, Hang Zhou, Yunyao Zhang, Junqing Yu, Wei Yang

Abstract: Recent efforts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for adapting large language models (LLMs) to multiple tasks still exhibit prevailing limitations: they either swap entire attention/feed-forward layers for switch experts or bolt on parallel expert branches, diluting parameter efficiency and task fidelity. We propose the LoRA-Mixer, a modular and lightweight MoE framework that integrates LoRA experts. Our core innovation lies in replacing the projection matrices of the attention module's input/output linear layers with dynamically routed, task-specific LoRA experts. This design ensures seamless compatibility with diverse foundation models, including transformers and state space models (SSMs), by leveraging their inherent linear projection structures. The framework supports two operational paradigms: (1) joint optimization of LoRA experts and routing mechanisms via a novel hard-soft routing strategy, or (2) direct deployment of pre-trained, frozen LoRA modules sourced from external repositories. To enable robust router training with limited data while ensuring stable routing decisions and maximizing expert reuse, we introduce an adaptive Specialization Balance Loss (SBL) that jointly optimizes expert balance and task-specific alignment. Extensive experiments on seven benchmark datasets, including MedQA, CoLA, SST-2, GSM8K, ARC-E, ARC-C, and HumanEval, demonstrate the effectiveness of LoRA-Mixer. On datasets such as GSM8K, HumanEval, and MedQA, LoRA-Mixer achieves significant improvements of 7.61%, 4.88%, and 3.08% over the base models, respectively. Compared with state-of-the-art methods, LoRA-Mixer achieves additional improvements of 1.09%, 1.45%, and 1.68%, respectively, using only 48% of the parameters, demonstrating its efficiency and strong performance.

Comment: The paper presents a modular MoE framework integrating LoRA experts, which is relevant to model architecture and compression.

Relevance: 9 Novelty: 7
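
The idea of replacing a linear projection with dynamically routed LoRA experts can be sketched as follows. This is a minimal, dependency-free illustration that assumes plain softmax routing over a frozen base weight `W`; it does not implement the paper's hard-soft routing strategy or its Specialization Balance Loss, and all names (`routed_lora_projection`, `router`, `experts`) are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]


def routed_lora_projection(x, W, experts, router):
    """Compute y = W x + sum_i g_i(x) * B_i (A_i x), with g = softmax(router x).

    W is the frozen base projection; each expert is a low-rank pair (A, B).
    """
    gates = softmax(matvec(router, x))
    y = matvec(W, x)
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))  # low-rank update B (A x)
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y, gates


# Toy example: identity base projection and two rank-1 experts.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # expert 0 adds x[0] to output 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # expert 1 adds x[1] to output 1
]
router = [[0.0, 0.0], [0.0, 0.0]]  # zero router gives uniform gates

y, gates = routed_lora_projection(x, W, experts, router)
print(y, gates)
```

Because the base weight is untouched and each expert only adds a low-rank term, the same wrapper applies to any layer that exposes a linear projection, which is the compatibility argument the abstract makes for transformers and SSMs.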

9. Disentangled Feature Importance

ArXiv ID: 2507.00260

Authors: Jin-Hong Du, Kathryn Roeder, Larry Wasserman

Abstract: Feature importance quantification faces a fundamental challenge: when predictors are correlated, standard methods systematically underestimate their contributions. We prove that major existing approaches target identical population functionals under squared-error loss, revealing why they share this correlation-induced bias. To address this limitation, we introduce Disentangled Feature Importance (DFI), a nonparametric generalization of the classical $R^2$ decomposition via optimal transport. DFI transforms correlated features into independent latent variables using a transport map, eliminating correlation distortion. Importance is computed in this disentangled space and attributed back through the transport map's sensitivity. DFI provides a principled decomposition of importance scores that sum to the total predictive variability for latent additive models and to interaction-weighted functional ANOVA variances more generally, under arbitrary feature dependencies. We develop a comprehensive semiparametric theory for DFI. For general transport maps, we establish root-$n$ consistency and asymptotic normality of importance estimators in the latent space, which extends to the original feature space for the Bures-Wasserstein map. Notably, our estimators achieve second-order estimation error, which vanishes if both regression function and transport map estimation errors are $o_{\mathbb{P}}(n^{-1/4})$. By design, DFI avoids the computational burden of repeated submodel refitting and the challenges of conditional covariate distribution estimation, thereby achieving computational efficiency.

Comment: The paper introduces Disentangled Feature Importance, a novel method for feature importance quantification, which aligns with representation learning.

Relevance: 8 Novelty: 8
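
A short worked example shows how correlation can shrink a standard importance score. It assumes a two-predictor Gaussian linear model and uses leave-one-covariate-out (LOCO) importance under squared-error loss as the illustrative "standard method"; both choices are assumptions made here for illustration and are not taken from the paper.

```latex
\begin{align*}
Y &= \beta_1 X_1 + \beta_2 X_2 + \varepsilon,
  \qquad
  (X_1, X_2) \sim \mathcal{N}\!\left(0,
  \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right),
  \qquad
  \mathbb{E}[\varepsilon] = 0,\ \varepsilon \perp (X_1, X_2), \\
\mathbb{E}[Y \mid X_2] &= (\beta_2 + \rho\,\beta_1)\, X_2, \\
\mathrm{LOCO}_1
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X_2])^2\big]
   - \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X_1, X_2])^2\big] \\
  &= \beta_1^2 \operatorname{Var}(X_1 \mid X_2)
   = \beta_1^2\,(1 - \rho^2).
\end{align*}
```

As $|\rho| \to 1$ the score for $X_1$ tends to zero even when $\beta_1$ is large, because $X_2$ can stand in for $X_1$. This is the kind of correlation-induced underestimation the abstract describes.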

10. BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants and Noise Contrastive Estimation

ArXiv ID: 2507.00846

Authors: Rishal Aggrwal, Jacky Chen, Nicholas M. Boffi, David Ryan Koes

Abstract: Efficient sampling from the Boltzmann distribution defined by an energy function is a key challenge in modeling physical systems such as molecules. Boltzmann Generators tackle this by leveraging Continuous Normalizing Flows that transform a simple prior into a distribution that can be reweighted to match the Boltzmann distribution using sample likelihoods. However, obtaining likelihoods requires computing costly Jacobians during integration, making it impractical for large molecular systems. To overcome this, we propose learning the likelihood of the generated distribution via an energy-based model trained with noise contrastive estimation and score matching. By using stochastic interpolants to anneal between the prior and generated distributions, we combine both the objective functions to efficiently learn the density function. On the alanine dipeptide system, we demonstrate that our method yields free energy profiles and energy distributions comparable to those obtained with exact likelihoods. Additionally, we show that free energy differences between metastable states can be estimated accurately with orders-of-magnitude speedup.

Comment: The paper proposes a novel method for learning likelihoods in Boltzmann generation using stochastic interpolants and noise contrastive estimation, which aligns with the AI for Science criterion.

Relevance: 8 Novelty: 8
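
The reweighting step that Boltzmann Generators rely on can be sketched with self-normalized importance sampling. The toy below assumes a one-dimensional harmonic energy E(x) = x^2 / 2 with kT = 1 and a Gaussian proposal with a known density standing in for the generative model; in the paper the likelihood is instead learned by an energy-based model. All function names are hypothetical.

```python
import math
import random


def energy(x: float) -> float:
    """Toy harmonic energy; its Boltzmann distribution at kT = 1 is N(0, 1)."""
    return 0.5 * x * x


def log_proposal(x: float, sigma: float) -> float:
    """Log-density of the Gaussian proposal N(0, sigma^2)."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def reweighted_mean(f, n=100_000, sigma=2.0, kT=1.0, seed=0):
    """Estimate E[f(x)] under exp(-E/kT) using proposal samples.

    Weights are w proportional to exp(-E(x)/kT) / q(x); normalizing by their
    sum removes the need to know the partition function.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n)]
    log_w = [-energy(x) / kT - log_proposal(x, sigma) for x in xs]
    m = max(log_w)  # subtract the maximum for numerical stability
    ws = [math.exp(lw - m) for lw in log_w]
    total = sum(ws)
    return sum(w * f(x) for w, x in zip(ws, xs)) / total


second_moment = reweighted_mean(lambda x: x * x)
print(f"reweighted E[x^2] = {second_moment:.3f} (exact value is 1)")
```

The weights need the proposal's density at every sample, which is exactly the quantity that is expensive to obtain from a continuous normalizing flow and that the abstract proposes to learn instead.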

11. Model Fusion via Neuron Interpolation

ArXiv ID: 2507.00037

Authors: Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh

Abstract: Model fusion aims to combine the knowledge of multiple models by creating one representative model that captures the strengths of all of its parents. However, this process is non-trivial due to differences in internal representations, which can stem from permutation invariance, random initialization, or differently distributed training data. We present a novel, neuron-centric family of model fusion algorithms designed to integrate multiple trained neural networks into a single network effectively regardless of training data distribution. Our algorithms group intermediate neurons of parent models to create target representations that the fused model approximates with its corresponding sub-network. Unlike prior approaches, our approach incorporates neuron attribution scores into the fusion process. Furthermore, our algorithms can generalize to arbitrary layer types. Experimental results on various benchmark datasets demonstrate that our algorithms consistently outperform previous fusion techniques, particularly in zero-shot and non-IID fusion scenarios. The code is available at https://github.com/AndrewSpano/neuron-interpolation-model-fusion.

Comment: The paper introduces a novel neuron-centric model fusion algorithm, which aligns with foundational research in model architecture.