Personalized Daily ArXiv Papers 2025-07-08

[gpt-4o]	Prompt	Completion	Total
Token	65206	8719	73925
Cost	$0.16	$0.09	$0.25
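
The cost row can be reproduced from the token counts. The sketch below assumes rates of $2.50 per million prompt tokens and $10.00 per million completion tokens; these rates are not stated in this digest, and the helper name usage_cost is hypothetical.

```python
# Assumed rates in USD per one million tokens (not stated in the digest).
PROMPT_RATE = 2.50
COMPLETION_RATE = 10.00

def usage_cost(prompt_tokens, completion_tokens):
    """Return (prompt_cost, completion_cost, total_cost) in USD, unrounded."""
    prompt_cost = prompt_tokens * PROMPT_RATE / 1_000_000
    completion_cost = completion_tokens * COMPLETION_RATE / 1_000_000
    return prompt_cost, completion_cost, prompt_cost + completion_cost

prompt_cost, completion_cost, total_cost = usage_cost(65206, 8719)
print(f"${prompt_cost:.2f} ${completion_cost:.2f} ${total_cost:.2f}")
```

Under those assumed rates, the rounded values agree with the Cost row above.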

Total arXiv papers: 1200

Total scanned papers: 783

Total relevant papers: 48

Table of contents with paper titles:

Activation Steering for Chain-of-Thought Compression Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Massoud Pedram
Intervening to learn and compose disentangled representations Authors: Alex Markham, Jeri A. Chang, Isaac Hirsch, Liam Solus, Bryon Aragam
Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data Authors: Yunrui Qiu, Richard John, Lukas Herron, Pratyush Tiwary
any4: Learned 4-bit Numeric Representation for LLMs Authors: Mostafa Elhoushi, Jeff Johnson
Simplifying Graph Neural Kernels: from Stacking Layers to Collapsed Structure Authors: Lin Wang, Shijie Wang, Sirui Huang, Qing Li
RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs Authors: Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens Authors: Konstantin Nikolaou, Sven Krippendorf, Samuel Tovey, Christian Holm
DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging Authors: Neha Verma, Kenton Murray, Kevin Duh
Dyn-O: Building Structured World Models with Object-Centric Representations Authors: Zizhao Wang, Kaixin Wang, Li Zhao, Peter Stone, Jiang Bian
A Dynamical Systems Perspective on the Analysis of Neural Networks Authors: Dennis Chemnitz, Maximilian Engel, Christian Kuehn, Sara-Viola Kuntz
BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers Authors: Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Kentaro Katayama, Takumi Honda, Maciej Besta, Torsten Hoefler
SOSAE: Self-Organizing Sparse AutoEncoder Authors: Sarthak Ketanbhai Modi, Zi Pong Lim, Yushi Cao, Yupeng Cheng, Yon Shin Teo, Shang-Wei Lin
Implicit Regularisation in Diffusion Models: An Algorithm-Dependent Generalisation Analysis Authors: Tyler Farghly, Patrick Rebeschini, George Deligiannidis, Arnaud Doucet
IMPACT: Importance-Aware Activation Space Reconstruction Authors: Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li
Exploring Core and Periphery Precepts in Biological and Artificial Intelligence: An Outcome-Based Perspective Authors: Niloofar Shadab, Tyler Cody, Alejandro Salado, Taylan G. Topcu, Mohammad Shadab, Peter Beling
Neural Inhibition Improves Dynamic Routing and Mixture of Experts Authors: Will Y. Zou, Jennifer Y. Zhang
OrbitAll: A Unified Quantum Mechanical Representation Deep Learning Framework for All Molecular Systems Authors: Beom Seok Kang, Vignesh C. Bhethanabotla, Amin Tavakoli, Maurice D. Hanisch, William A. Goddard III, Anima Anandkumar
Scientific Machine Learning of Chaotic Systems Discovers Governing Equations for Neural Populations Authors: Anthony G. Chesebro, David Hofmann, Vaibhav Dixit, Earl K. Miller, Richard H. Granger, Alan Edelman, Christopher V. Rackauckas, Lilianne R. Mujica-Parodi, Helmut H. Strey
Mpemba Effect in Large-Language Model Training Dynamics: A Minimal Analysis of the Valley-River model Authors: Sibei Liu, Zhijian Hu
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai
Cascade: Token-Sharded Private LLM Inference Authors: Rahul Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal
How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models Authors: Dharshan Kumaran, Stephen M Fleming, Larisa Markeeva, Joe Heyward, Andrea Banino, Mrinal Mathur, Razvan Pascanu, Simon Osindero, Benedetto de Martino, Petar Velickovic, Viorica Patraucean
HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference Authors: Weishu Deng, Yujie Yang, Peiran Du, Lingfeng Xiang, Zhen Lin, Chen Zhong, Song Jiang, Hui Lu, Jia Rao
Reason to Rote: Rethinking Memorization in Reasoning Authors: Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, Barbara Plank
Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency Authors: Naoki Nishikawa, Rei Higuchi, Taiji Suzuki
SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control Authors: Xingyang He, Xiao Ling, Jie Liu
DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective Authors: Hyung Gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe, Ahmed Hussen Abdelaziz
Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication Authors: Samuel Pfrommer, George Ma, Yixiao Huang, Somayeh Sojoudi
Normalized Iterative Hard Thresholding for Tensor Recovery Authors: Li Li, Yuneng Liang, Kaijie Zheng, Jian Lu
Recovering Plasticity of Neural Networks via Soft Weight Rescaling Authors: Seungwon Oh, Sangyeon Park, Isaac Han, Kyung-Joong Kim
Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models Authors: Etrit Haxholli, Yeti Z. Gürbüz, Oğul Can, Eli Waxman
Pseudo-likelihood produces associative memories able to generalize, even for asymmetric couplings Authors: Francesco D'Amico, Dario Bocchi, Luca Maria Del Bono, Saverio Rossi, Matteo Negri
LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers Authors: Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, Xinting Hu
Beyond Token Pruning: Operation Pruning in Vision-Language Models Authors: Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer
Critiques of World Models Authors: Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu
DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation Authors: Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
LLMs model how humans induce logically structured rules Authors: Alyssa Loo, Ellie Pavlick, Roman Feiman
Scaling Context Requires Rethinking Attention Authors: Carles Gelada, Jacob Buckman, Sean Zhang, Txus Bach
LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization Authors: Xujia Wang, Yunjia Qi, Bin Xu
Bridging KAN and MLP: MJKAN, a Hybrid Architecture with Both Efficiency and Expressiveness Authors: Hanseon Joo, Hayoung Choi, Ook Lee, Minjong Cheon
Tractable Representation Learning with Probabilistic Circuits Authors: Steven Braun, Sahil Sidheekh, Antonio Vergari, Martin Mundt, Sriraam Natarajan, Kristian Kersting
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts Authors: Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu
SPEAR: Structured Pruning for Spiking Neural Networks via Synaptic Operation Estimation and Reinforcement Learning Authors: Hui Xie, Yuhe Liu, Shaoqi Yang, Jinyang Guo, Yufei Guo, Yuqing Ma, Jiaxin Chen, Jiaheng Liu, Xianglong Liu
Efficient Certified Reasoning for Binarized Neural Networks Authors: Jiong Yang, Yong Kiam Tan, Mate Soos, Magnus O. Myreen, Kuldeep S. Meel
Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces Authors: Henry B. Moss, Sebastian W. Ober, Tom Diethe
Quantifying Cross-Attention Interaction in Transformers for Interpreting TCR-pMHC Binding Authors: Jiarui Li, Zixiang Yin, Haley Smith, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
Meta-Learning Transformers to Improve In-Context Generalization Authors: Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci
MPX: Mixed Precision Training for JAX Authors: Alexander Gräfe, Sebastian Trimpe

1. Activation Steering for Chain-of-Thought Compression

ArXiv ID: 2507.04742

Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Massoud Pedram

Abstract: Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as "chains of thought" (CoTs). However, these rationales are often overly verbose, even for simple problems, leading to wasted context, increased latency, and higher energy consumption. We observe that verbose, English-heavy CoTs and concise, math-centric CoTs occupy distinct regions in the model's residual-stream activation space. By extracting and injecting a "steering vector" to transition between these modes, we can reliably shift generation toward more concise reasoning, effectively compressing CoTs without retraining. We formalize this approach as Activation-Steered Compression (ASC), an inference-time technique that shortens reasoning traces by directly modifying hidden representations. In addition, we provide a theoretical analysis of the impact of ASC on the output distribution, derived from a closed-form KL-divergence-bounded constraint to regulate steering strength. Using only 100 paired verbose and concise examples, ASC achieves up to 67.43% reduction in CoT length on MATH500 and GSM8K datasets, while maintaining accuracy across 7B, 8B, and 32B parameter models. As a training-free method, ASC introduces negligible runtime overhead and, on MATH500, delivers an average 2.73x speedup in end-to-end reasoning wall-clock time on an 8B model. This makes ASC a practical and efficient tool for streamlining the deployment of reasoning-capable LLMs in latency- or cost-sensitive settings. The code is available at: https://github.com/ArminAzizi98/ASC

Comment: The paper introduces Activation-Steered Compression (ASC), a novel inference-time technique for compressing chains of thought in LLMs by modifying hidden representations, aligning with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
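
The steering idea in the abstract above can be illustrated with a minimal sketch. It assumes the steering vector is the difference between the mean activations of concise and verbose examples, a common choice that the abstract does not confirm. The function names, the toy numbers, and the free parameter alpha are hypothetical; the paper's KL-bounded rule for steering strength is not reproduced here.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(concise_acts, verbose_acts):
    """Difference of means: points from the verbose mode toward the concise mode."""
    mu_c = mean_vector(concise_acts)
    mu_v = mean_vector(verbose_acts)
    return [c - v for c, v in zip(mu_c, mu_v)]

def steer(hidden, vector, alpha):
    """Add the scaled steering vector to one hidden state (alpha is a free knob here)."""
    return [h + alpha * s for h, s in zip(hidden, vector)]

# Toy 3-dimensional "activations" (made-up numbers, for illustration only).
verbose = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
concise = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
v = steering_vector(concise, verbose)
print(v)                                      # [-2.0, 2.0, 0.0]
print(steer([1.0, 1.0, 1.0], v, alpha=0.5))   # [0.0, 2.0, 1.0]
```

In a real model the same addition would be applied to residual-stream activations at a chosen layer during generation; which layer and which strength to use are exactly what the paper's analysis addresses.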

2. Intervening to learn and compose disentangled representations

ArXiv ID: 2507.04754

Authors: Alex Markham, Jeri A. Chang, Isaac Hirsch, Liam Solus, Bryon Aragam

Abstract: In designing generative models, it is commonly believed that in order to learn useful latent structure, we face a fundamental tension between expressivity and structure. In this paper we challenge this view by proposing a new approach to training arbitrarily expressive generative models that simultaneously learn disentangled latent structure. This is accomplished by adding a simple decoder-only module to the head of an existing decoder block that can be arbitrarily complex. The module learns to process concept information by implicitly inverting linear representations from an encoder. Inspired by the notion of intervention in causal graphical models, our module selectively modifies its architecture during training, allowing it to learn a compact joint model over different contexts. We show how adding this module leads to disentangled representations that can be composed for out-of-distribution generation. To further validate our proposed approach, we prove a new identifiability result that extends existing work on identifying structured representations in nonlinear models.

Comment: The paper proposes a novel approach to learning disentangled representations in generative models, relevant to representation learning.

Relevance: 9 Novelty: 8

3. Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data

ArXiv ID: 2507.03174

Authors: Yunrui Qiu, Richard John, Lukas Herron, Pratyush Tiwary

Abstract: Accurate characterization of the equilibrium distributions of complex molecular systems and their dependence on environmental factors such as temperature is essential for understanding thermodynamic properties and transition mechanisms. Projecting these distributions onto meaningful low-dimensional representations enables interpretability and downstream analysis. Recent advances in generative AI, particularly flow models such as Normalizing Flows (NFs), have shown promise in modeling such distributions, but their scope is limited without tailored representation learning. In this work, we introduce Latent Thermodynamic Flows (LaTF), an end-to-end framework that tightly integrates representation learning and generative modeling. LaTF unifies the State Predictive Information Bottleneck (SPIB) with NFs to simultaneously learn low-dimensional latent representations, referred to as Collective Variables (CVs), classify metastable states, and generate equilibrium distributions across temperatures beyond the training data. The two components of representation learning and generative modeling are optimized jointly, ensuring that the learned latent features capture the system's slow, important degrees of freedom while the generative model accurately reproduces the system's equilibrium behavior. We demonstrate LaTF's effectiveness across diverse systems, including a model potential, the Chignolin protein, and cluster of Lennard Jones particles, with thorough evaluations and benchmarking using multiple metrics and extensive simulations. Finally, we apply LaTF to a RNA tetraloop system, where despite using simulation data from only two temperatures, LaTF reconstructs the temperature-dependent structural ensemble and melting behavior, consistent with experimental and prior extensive computational results.

Comment: The paper introduces a framework integrating representation learning and generative modeling, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8

4. any4: Learned 4-bit Numeric Representation for LLMs

ArXiv ID: 2507.04610

Authors: Mostafa Elhoushi, Jeff Johnson

Abstract: We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4 .

Comment: The paper presents a learned 4-bit quantization method for LLMs, contributing to model compression and efficiency.

Relevance: 9 Novelty: 8
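
The lookup-table idea in the abstract above can be sketched as follows: each weight is stored as a 4-bit index into a table of 16 learned values. Fitting the table with 1-D k-means over one row of weights is an assumption made for this sketch, not a description of the any4 code, and all function names are hypothetical.

```python
import math

def learn_codebook(values, k=16, iters=20):
    """Fit k scalar centroids to `values` with Lloyd's k-means (1-D)."""
    lo, hi = min(values), max(values)
    # Start from evenly spaced centroids over the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[idx].append(v)
        # Empty buckets keep their previous centroid.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def quantize(values, codebook):
    """Map each value to the index (0..15 for k=16) of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda j: abs(v - codebook[j])) for v in values]

def dequantize(indices, codebook):
    """Recover approximate values with a plain table lookup."""
    return [codebook[i] for i in indices]

# Toy weight row (synthetic values, for illustration only).
row = [0.1 * math.sin(i) + (0.5 if i % 7 == 0 else 0.0) for i in range(64)]
codebook = learn_codebook(row)
indices = quantize(row, codebook)      # 64 integers, each fits in 4 bits
restored = dequantize(indices, codebook)
```

Because the table entries are free to sit anywhere, a learned table can match a skewed weight distribution more closely than an evenly spaced 16-level grid over the same range.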

5. Simplifying Graph Neural Kernels: from Stacking Layers to Collapsed Structure

ArXiv ID: 2507.03560

Authors: Lin Wang, Shijie Wang, Sirui Huang, Qing Li

Abstract: The Graph Neural Tangent Kernel (GNTK) successfully bridges the gap between kernel methods and Graph Neural Networks (GNNs), addressing key challenges such as the difficulty of training deep networks and the limitations of traditional kernel methods. However, the existing layer-stacking strategy in GNTK introduces redundant computations, significantly increasing computational complexity and limiting scalability for practical applications. To address these issues, this paper proposes the Simplified Graph Neural Tangent Kernel (SGTK), which replaces the traditional multi-layer stacking mechanism with a continuous $K$-step aggregation operation. This novel approach streamlines the iterative kernel computation process, effectively eliminating redundant calculations while preserving the kernel's expressiveness. By reducing the dependency on layer stacking, SGTK achieves both computational simplicity and efficiency. Furthermore, we introduce the Simplified Graph Neural Kernel (SGNK), which models infinitely wide Graph Neural Networks as Gaussian Processes. This allows kernel values to be directly determined from the expected outputs of activation functions in the infinite-width regime, bypassing the need for explicit layer-by-layer computation. SGNK further reduces computational complexity while maintaining the capacity to capture intricate structural patterns in graphs. Extensive experiments on node and graph classification tasks demonstrate that the proposed SGTK and SGNK achieve performance comparable to existing approaches while improving computational efficiency. Implementation details are available at https://anonymous.4open.science/r/SGNK-1CE4/.

Comment: The paper proposes a simplified graph neural tangent kernel, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8

6. RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

ArXiv ID: 2507.03253

Authors: Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng

Abstract: The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.

Comment: The paper introduces RefineX, a novel framework for refining pre-training data in LLMs, which aligns with foundational research in LLM pretraining and data quality improvement.

Relevance: 9 Novelty: 8

7. Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens

ArXiv ID: 2507.05035

Authors: Konstantin Nikolaou, Sven Krippendorf, Samuel Tovey, Christian Holm

Abstract: Scaling laws offer valuable insights into the relationship between neural network performance and computational cost, yet their underlying mechanisms remain poorly understood. In this work, we empirically analyze how neural networks behave under data and model scaling through the lens of the neural tangent kernel (NTK). This analysis establishes a link between performance scaling and the internal dynamics of neural networks. Our findings of standard vision tasks show that similar performance scaling exponents can occur even though the internal model dynamics show opposite behavior. This demonstrates that performance scaling alone is insufficient for understanding the underlying mechanisms of neural networks. We also address a previously unresolved issue in neural scaling: how convergence to the infinite-width limit affects scaling behavior in finite-width models. To this end, we investigate how feature learning is lost as the model width increases and quantify the transition between kernel-driven and feature-driven scaling regimes. We identify the maximum model width that supports feature learning, which, in our setups, we find to be more than ten times smaller than typical large language model widths.

Comment: The paper provides insights into neural network dynamics through the NTK lens, contributing to representation learning and understanding training dynamics.

Relevance: 9 Novelty: 8

8. DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging

ArXiv ID: 2507.04517

Authors: Neha Verma, Kenton Murray, Kevin Duh

Abstract: Model compression offers a promising path to reducing the cost and inaccessibility of large pre-trained models, without significantly compromising their impressive performance. Large Transformer models, including large language models (LLMs), often contain computational redundancy, which can serve as a target for new model compression methods. In this work, we specifically target neuron-level redundancies in model layers by combining groups of similar neurons into fewer neurons. We frame this width reduction as a Discrete Optimal Transport problem, and propose DOTResize, a novel Transformer compression method that uses optimal transport theory to transform and compress model weights. To ensure applicability within the Transformer architecture, we motivate and incorporate entropic regularization and matrix factorization into the transportation maps produced by our method. Unlike pruning-based approaches which discard neurons based on importance measures, DOTResize re-projects the entire neuron width, allowing the retention and redistribution of useful signal across the reduced layer. Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize can outperform these methods across multiple LLM families and sizes, while achieving measurable reductions in real-world computational cost.

Comment: DOTResize presents a novel approach to model compression using Discrete Optimal Transport, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
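
The abstract above frames width reduction as discrete optimal transport with entropic regularization. A minimal sketch of that general idea uses Sinkhorn iterations to compute a transport plan from original neurons to a smaller set of target neurons, then merges by plan-weighted averaging. Every specific below is an assumption for illustration: the squared-Euclidean cost on weight vectors, the hand-picked targets, the uniform marginals, and the function names. The paper's matrix-factorization step is not modeled.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def merge_neurons(weights, plan):
    """Barycentric merge: each reduced neuron is a plan-weighted average of originals."""
    n, m, d = len(weights), len(plan[0]), len(weights[0])
    merged = []
    for j in range(m):
        mass = sum(plan[i][j] for i in range(n))
        merged.append([sum(plan[i][j] * weights[i][k] for i in range(n)) / mass
                       for k in range(d)])
    return merged

# Four toy neurons (2-D weight vectors) reduced to two; targets are hand-picked here.
neurons = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
targets = [[1.0, 0.0], [0.0, 1.0]]
cost = [[sum((x - y) ** 2 for x, y in zip(w, t)) for t in targets] for w in neurons]
plan = sinkhorn(cost)
merged = merge_neurons(neurons, plan)
print(merged)  # roughly [[0.95, 0.05], [0.05, 0.95]]
```

Unlike pruning, every original neuron contributes some mass to the reduced layer, which matches the abstract's point about redistributing signal instead of discarding it.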

9. Dyn-O: Building Structured World Models with Object-Centric Representations

ArXiv ID: 2507.03298

Authors: Zizhao Wang, Kaixin Wang, Li Zhao, Peter Stone, Jiang Bian

Abstract: World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object-centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can generalize to more complex settings with diverse textures and cluttered scenes. In this paper, we fill this gap by introducing Dyn-O, an enhanced structured world model built upon object-centric representations. Compared to prior work in object-centric representations, Dyn-O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we find that our method can learn object-centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object-centric features into dynamics-agnostic and dynamics-aware components, we enable finer-grained manipulation of these features and generate more diverse imagined trajectories.

Comment: The paper introduces Dyn-O, an object-centric world model, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8

10. A Dynamical Systems Perspective on the Analysis of Neural Networks

ArXiv ID: 2507.05164

Authors: Dennis Chemnitz, Maximilian Engel, Christian Kuehn, Sara-Viola Kuntz

Abstract: In this chapter, we utilize dynamical systems to analyze several aspects of machine learning algorithms. As an expository contribution we demonstrate how to re-formulate a wide variety of challenges from deep neural networks, (stochastic) gradient descent, and related topics into dynamical statements. We also tackle three concrete challenges. First, we consider the process of information propagation through a neural network, i.e., we study the input-output map for different architectures. We explain the universal embedding property for augmented neural ODEs representing arbitrary functions of given regularity, the classification of multilayer perceptrons and neural ODEs in terms of suitable function classes, and the memory-dependence in neural delay equations. Second, we consider the training aspect of neural networks dynamically. We describe a dynamical systems perspective on gradient descent and study stability for overdetermined problems. We then extend this analysis to the overparameterized setting and describe the edge of stability phenomenon, also in the context of possible explanations for implicit bias. For stochastic gradient descent, we present stability results for the overparameterized setting via Lyapunov exponents of interpolation solutions. Third, we explain several results regarding mean-field limits of neural networks. We describe a result that extends existing techniques to heterogeneous neural networks involving graph limits via digraph measures. This shows how large classes of neural networks naturally fall within the framework of Kuramoto-type models on graphs and their large-graph limits. Finally, we point out that similar strategies to use dynamics to study explainable and reliable AI can also be applied to settings such as generative models or fundamental issues in gradient training methods, such as backpropagation or vanishing/exploding gradients.

Comment: The paper uses dynamical systems to analyze neural networks, providing insights into training dynamics and architecture, aligning with representation learning and model architecture.

Relevance: 9 Novelty: 8

11. BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers

ArXiv ID: 2507.03117

Authors: Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Kentaro Katayama, Takumi Honda, Maciej Besta, Torsten Hoefler

Abstract: The energy consumption of large-scale ML models is dominated by data movement - shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.

Comment: The paper introduces BLaST, a method for sparsification in Transformers, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
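
A block sparsity pattern of the kind the abstract above describes can be sketched by scoring fixed-size tiles of a weight matrix and zeroing the weakest ones. This is a one-shot illustration, whereas the abstract describes an iterative procedure; the squared-norm scoring rule and the function name block_prune are assumptions, not the paper's method.

```python
def block_prune(weights, block, keep_fraction):
    """Zero out all but the highest-norm `block` x `block` tiles of a matrix."""
    rows, cols = len(weights), len(weights[0])
    scores = {}
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            # Squared Frobenius norm of the tile whose top-left corner is (r, c).
            scores[(r, c)] = sum(
                weights[i][j] ** 2
                for i in range(r, min(r + block, rows))
                for j in range(c, min(c + block, cols))
            )
    n_keep = max(1, round(keep_fraction * len(scores)))
    kept = set(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    pruned = [[0.0] * cols for _ in range(rows)]
    for (r, c) in kept:
        for i in range(r, min(r + block, rows)):
            for j in range(c, min(c + block, cols)):
                pruned[i][j] = weights[i][j]
    return pruned, kept

# Toy 4x4 matrix with two strong 2x2 tiles on the diagonal (made-up values).
W = [[5.0, 5.0, 0.1, 0.1],
     [5.0, 5.0, 0.1, 0.1],
     [0.2, 0.2, 3.0, 3.0],
     [0.2, 0.2, 3.0, 3.0]]
pruned, kept = block_prune(W, block=2, keep_fraction=0.5)
print(sorted(kept))  # [(0, 0), (2, 2)]
```

Keeping whole tiles instead of scattered individual weights is what lets a sparse matrix-matrix kernel skip entire blocks of memory traffic.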

12. SOSAE: Self-Organizing Sparse AutoEncoder

ArXiv ID: 2507.04644

Authors: Sarthak Ketanbhai Modi, Zi Pong Lim, Yushi Cao, Yupeng Cheng, Yon Shin Teo, Shang-Wei Lin

Abstract: The process of tuning the size of the hidden layers for autoencoders has the benefit of providing optimally compressed representations for the input data. However, such hyper-parameter tuning process would take a lot of computation and time effort with grid search as the default option. In this paper, we introduce the Self-Organization Regularization for Autoencoders that dynamically adapts the dimensionality of the feature space to the optimal size. Inspired by physics concepts, Self-Organizing Sparse AutoEncoder (SOSAE) induces sparsity in feature space in a structured way that permits the truncation of the non-active part of the feature vector without any loss of information. This is done by penalizing the autoencoder based on the magnitude and the positional index of the feature vector dimensions, which during training constricts the feature space in both terms. Extensive experiments on various datasets show that our SOSAE can tune the feature space dimensionality up to 130 times lesser Floating-point Operations (FLOPs) than other baselines while maintaining the same quality of tuning and performance.

Comment: The paper introduces a Self-Organizing Sparse AutoEncoder, which is relevant to representation learning and model compression through sparsity.

Relevance: 9 Novelty: 8
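
The abstract above describes a penalty that depends on both the magnitude and the positional index of each feature dimension. One way to picture this is a weighted L1 term whose weights grow with the index, so that activity is pushed toward the front of the feature vector and the inactive tail can be truncated. The geometric weighting, the growth factor, and both function names are assumptions for this sketch; the paper's exact regularizer may differ.

```python
def positional_sparsity_penalty(features, growth=1.5):
    """Sum of |z_i| weighted by growth**i, so later positions cost more."""
    return sum(abs(z) * growth ** i for i, z in enumerate(features))

def active_length(features, tol=1e-6):
    """Number of leading dimensions to keep; the trailing near-zero tail is dropped."""
    last = 0
    for i, z in enumerate(features):
        if abs(z) > tol:
            last = i + 1
    return last

# Same magnitudes, placed at the front versus the back of the vector.
front = [0.8, 0.5, 0.0, 0.0]
back = [0.0, 0.0, 0.5, 0.8]
print(positional_sparsity_penalty(front) < positional_sparsity_penalty(back))  # True
print(active_length(front), active_length(back))  # 2 4
```

With such a penalty added to the reconstruction loss, a vector like front is preferred, and its last two dimensions can be removed without changing the encoded information.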

13. Implicit Regularisation in Diffusion Models: An Algorithm-Dependent Generalisation Analysis

ArXiv ID: 2507.03756

Authors: Tyler Farghly, Patrick Rebeschini, George Deligiannidis, Arnaud Doucet

Abstract: The success of denoising diffusion models raises important questions regarding their generalisation behaviour, particularly in high-dimensional settings. Notably, it has been shown that when training and sampling are performed perfectly, these models memorise training data -- implying that some form of regularisation is essential for generalisation. Existing theoretical analyses primarily rely on algorithm-independent techniques such as uniform convergence, heavily utilising model structure to obtain generalisation bounds. In this work, we instead leverage the algorithmic aspects that promote generalisation in diffusion models, developing a general theory of algorithm-dependent generalisation for this setting. Borrowing from the framework of algorithmic stability, we introduce the notion of score stability, which quantifies the sensitivity of score-matching algorithms to dataset perturbations. We derive generalisation bounds in terms of score stability, and apply our framework to several fundamental learning settings, identifying sources of regularisation. In particular, we consider denoising score matching with early stopping (denoising regularisation), sampler-wide coarse discretisation (sampler regularisation) and optimising with SGD (optimisation regularisation). By grounding our analysis in algorithmic properties rather than model structure, we identify multiple sources of implicit regularisation unique to diffusion models that have so far been overlooked in the literature.

Comment: The paper provides a general theory of algorithm-dependent generalization for diffusion models, focusing on implicit regularization, which is relevant to emerging trends in theoretical work.

Relevance: 9 Novelty: 8

14. IMPACT: Importance-Aware Activation Space Reconstruction

ArXiv ID: 2507.03828

Authors: Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li

Abstract: Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure, prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and uniform reconstruction can harm performance. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links model compression decisions to their impact on model behavior. IMPACT formulates an optimization problem that considers both activation structure and gradient sensitivity, and derives a closed-form solution where the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This enables low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.

Comment: The paper presents a novel approach to model compression based on importance-aware activation reconstruction: rather than reconstructing weights, or activations uniformly, it weights activation dimensions by their effect on model behavior. This aligns with the model compression criterion. The IMPACT framework derives a closed-form solution for low-rank bases that preserve accuracy, which is a significant theoretical contribution.

Relevance: 9 Novelty: 8
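
Illustrative sketch (not from the paper): the IMPACT abstract states that the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. The snippet below shows one way such a basis could be computed with NumPy. The weighting scheme (mean squared gradient per activation dimension) and the names `importance_weighted_basis` and `project` are assumptions made for illustration; the abstract does not specify the authors' actual formulation.

```python
import numpy as np


def importance_weighted_basis(acts, grads, rank):
    """Return a (dim, rank) orthonormal basis for low-rank reconstruction.

    acts, grads: arrays of shape (n_samples, dim).
    ASSUMPTION: importance of each dimension is its mean squared gradient;
    this is a stand-in, not the weighting used in the paper.
    """
    importance = np.mean(grads ** 2, axis=0)       # (dim,)
    weighted = acts * np.sqrt(importance)          # scale each dimension
    cov = weighted.T @ weighted / acts.shape[0]    # weighted covariance (dim, dim)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :rank]              # keep the top-`rank` eigenvectors


def project(acts, basis):
    """Reconstruct activations from their coordinates in the low-rank basis."""
    return acts @ basis @ basis.T


rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 8))
grads = rng.normal(size=(64, 8))
basis = importance_weighted_basis(acts, grads, rank=3)
approx = project(acts, basis)
```

With uniform importance this reduces to ordinary PCA on the activations, which is the "uniform reconstruction" baseline the abstract argues against.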

15. Exploring Core and Periphery Precepts in Biological and Artificial Intelligence: An Outcome-Based Perspective

ArXiv ID: 2507.04594

Authors: Niloofar Shadab, Tyler Cody, Alejandro Salado, Taylan G. Topcu, Mohammad Shadab, Peter Beling

Abstract: Engineering methodologies predominantly revolve around established principles of decomposition and recomposition. These principles involve partitioning inputs and outputs at the component level, ensuring that the properties of individual components are preserved upon composition. However, this view does not transfer well to intelligent systems, particularly when addressing the scaling of intelligence as a system property. Our prior research contends that the engineering of general intelligence necessitates a fresh set of overarching systems principles. As a result, we introduced the "core and periphery" principles, a novel conceptual framework rooted in abstract systems theory and the Law of Requisite Variety. In this paper, we assert that these abstract concepts hold practical significance. Through empirical evidence, we illustrate their applicability to both biological and artificial intelligence systems, bridging abstract theory with real-world implementations. Then, we expand on our previous theoretical framework by mathematically defining core-dominant vs periphery-dominant systems.

Comment: The paper introduces 'core and periphery' principles for intelligent systems, which could be a novel theoretical framework in AI.

Relevance: 8 Novelty: 9

16. Neural Inhibition Improves Dynamic Routing and Mixture of Experts

ArXiv ID: 2507.03221

Authors: Will Y. Zou, Jennifer Y. Zhang

Abstract: To be effective, efficient, and diverse, deep learning models need to dynamically choose their architecture based on signals from a population of neurons. We hypothesize dynamic routing models can be improved with neural inhibition in those neural populations. This means signals commonly shared among the various modes of data statistics can be inhibited so that the routing model can choose a specialized expert path for each data sample. Only through inhibition is the routing mechanism able to effectively select neural pathways. We believe this is an under-studied and under-verified implementation methodology for Mixture-of-Experts, dynamic routing, and transformer language models. We provide experimental evidence that the neural inhibition algorithm significantly boosts performance on general tasks and motivates more effort to be invested in this research direction.

Comment: The paper explores neural inhibition in dynamic routing and Mixture of Experts, contributing to model architecture innovations.

Relevance: 9 Novelty: 7

17. OrbitAll: A Unified Quantum Mechanical Representation Deep Learning Framework for All Molecular Systems

ArXiv ID: 2507.03853

Authors: Beom Seok Kang, Vignesh C. Bhethanabotla, Amin Tavakoli, Maurice D. Hanisch, William A. Goddard III, Anima Anandkumar

Abstract: Despite the success of deep learning methods in quantum chemistry, their representational capacity is most often confined to neutral, closed-shell molecules. However, real-world chemical systems often exhibit complex characteristics, including varying charges, spins, and environments. We introduce OrbitAll, a geometry- and physics-informed deep learning framework that can represent all molecular systems with electronic structure information. OrbitAll utilizes spin-polarized orbital features from the underlying quantum mechanical method, and combines them with graph neural networks satisfying SE(3)-equivariance. The resulting framework can represent and process any molecular system with arbitrary charges, spins, and environmental effects. OrbitAll demonstrates superior performance and generalization on predicting charged, open-shell, and solvated molecules, while also robustly extrapolating to molecules significantly larger than the training data by leveraging a physics-informed architecture. OrbitAll achieves chemical accuracy using 10 times less training data than competing AI models, with a speedup of approximately $10^3$ - $10^4$ compared to density functional theory.

Comment: The paper introduces a deep learning framework for quantum mechanical representation, which is relevant to AI for Science and foundational research in molecular modeling.

Relevance: 8 Novelty: 8

18. Scientific Machine Learning of Chaotic Systems Discovers Governing Equations for Neural Populations

ArXiv ID: 2507.03631

Authors: Anthony G. Chesebro, David Hofmann, Vaibhav Dixit, Earl K. Miller, Richard H. Granger, Alan Edelman, Christopher V. Rackauckas, Lilianne R. Mujica-Parodi, Helmut H. Strey

Abstract: Discovering governing equations that describe complex chaotic systems remains a fundamental challenge in physics and neuroscience. Here, we introduce the PEM-UDE method, which combines the prediction-error method with universal differential equations to extract interpretable mathematical expressions from chaotic dynamical systems, even with limited or noisy observations. This approach succeeds where traditional techniques fail by smoothing optimization landscapes and removing the chaotic properties during the fitting process without distorting optimal parameters. We demonstrate its efficacy by recovering hidden states in the Rossler system and reconstructing dynamics from noise-corrupted electrical circuit data, where the correct functional form of the dynamics is recovered even when one of the observed time series is corrupted by noise 5x the magnitude of the true signal. We demonstrate that this method is capable of recovering the correct dynamics, whereas direct symbolic regression methods, such as SINDy, fail to do so with the given amount of data and noise. Importantly, when applied to neural populations, our method derives novel governing equations that respect biological constraints such as network sparsity - a constraint necessary for cortical information processing yet not captured in next-generation neural mass models - while preserving microscale neuronal parameters. These equations predict an emergent relationship between connection density and both oscillation frequency and synchrony in neural circuits. We validate these predictions using three intracranial electrode recording datasets from the medial entorhinal cortex, prefrontal cortex, and orbitofrontal cortex. Our work provides a pathway to develop mechanistic, multi-scale brain models that generalize across diverse neural architectures, bridging the gap between single-neuron dynamics and macroscale brain activity.

Comment: The paper presents a method for discovering governing equations in chaotic systems, relevant to foundational research in AI for Science.

Relevance: 8 Novelty: 8

19. Mpemba Effect in Large-Language Model Training Dynamics: A Minimal Analysis of the Valley-River model

ArXiv ID: 2507.04206

Authors: Sibei Liu, Zhijian Hu

Abstract: Learning rate (LR) schedules in large language model (LLM) training often follow empirical templates: warm-up, constant plateau/stable phase, and decay (WSD). However, the mechanistic explanation for this strategy remains underexplored, and the choice of plateau height and decay schedule is largely heuristic. In this paper, we connect training dynamics to a thermodynamic analogy via the Mpemba effect - a phenomenon in which a hotter system cools faster than a colder one when quenched into the same bath. We analyze a class of "valley-river" loss landscapes, where sharp (valley) directions equilibrate quickly, while flatter (river) directions govern global descent. The Mpemba effect provides an explanation for the necessity of the warm-up phase and motivates a high plateau - rather than a low one - for accelerating loss decrease during decay. We show that for certain loss landscapes, there exists an optimal plateau learning rate - the "strong Mpemba point" - at which the slowest mode vanishes, resulting in faster convergence during the decay phase. We derive analytical conditions for its existence and estimate decay dynamics required to preserve the Mpemba advantage. Our minimal model and analysis offer a principled justification for plateau-based schedulers and provide guidance for tuning LR in LLMs with minimal hyperparameter sweep.

Comment: The paper connects LLM training dynamics to the Mpemba effect, providing theoretical insights into learning rate schedules, which is relevant to training dynamics in neural networks.

Relevance: 8 Novelty: 8
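
Illustrative sketch (not from the paper): the warm-up, stable, decay (WSD) template described in the Mpemba abstract can be written as a small schedule function. The linear warm-up and linear decay shapes, the function name `wsd_lr`, and all parameter names are assumptions chosen for illustration; the paper analyzes the plateau height and decay dynamics theoretically and may use different shapes.

```python
def wsd_lr(step, total_steps, warmup_steps, decay_steps, peak_lr, final_lr=0.0):
    """Learning rate at `step` for a warm-up / stable / decay schedule.

    ASSUMPTION: linear warm-up to `peak_lr`, constant plateau, then linear
    decay to `final_lr` over the last `decay_steps` steps.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    if step < warmup_steps:
        # Warm-up phase: ramp linearly up to the plateau height.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the plateau learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from peak_lr to final_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (final_lr - peak_lr) * progress


# Example: 100 steps, 10 warm-up steps, 20 decay steps, plateau height 1.0.
schedule = [wsd_lr(s, 100, 10, 20, peak_lr=1.0) for s in range(101)]
```

In this sketch `peak_lr` is the plateau height, the quantity for which the abstract argues an optimal value (the "strong Mpemba point") can exist.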

20. DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

ArXiv ID: 2507.04947

Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai

Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.

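Comment: The paper presents a novel masked autoregressive image generation framework with a deep compression hybrid tokenizer, which is relevant to efficiency: the 32x compression applies to the spatial token grid rather than to model weights, and it yields higher throughput and lower latency.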

Relevance: 8 Novelty: 8

21. Cascade: Token-Sharded Private LLM Inference

ArXiv ID: 2507.05228

Authors: Rahul Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal

Abstract: As LLMs continue to increase in parameter size, the computational resources required to run them are available to fewer parties. Therefore, third-party inference services -- where LLMs are hosted by third parties with significant computational resources -- are becoming increasingly popular. However, third-party inference raises critical concerns about user data privacy. To mitigate these risks, privacy researchers have developed provably secure schemes for third-party inference, such as Secure Multi-Party Computation (SMPC). However, SMPC protocols have significant computational and communication overhead, and do not scale to large models. In this work, we propose a new multi-party inference protocol, Cascade, that avoids these punitive costs by leveraging sharding in the sequence dimension to maintain privacy, trading off cryptographic privacy guarantees for increased performance and scalability. We demonstrate that Cascade is resistant to a generalization of a recent attack that is highly effective against other statistical privacy schemes, and that it is further resistant to learning-based attacks. As Cascade is orders of magnitude faster than existing schemes, our findings offer practical solutions for secure deployment of modern state-of-the-art LLMs.

Comment: The paper proposes a new multi-party inference protocol for LLMs that shards the input along the sequence dimension, trading cryptographic privacy guarantees for performance and scalability, which is relevant to large language models and efficient inference.

Relevance: 8 Novelty: 8

22. How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models

ArXiv ID: 2507.03120

Authors: Dharshan Kumaran, Stephen M Fleming, Larisa Markeeva, Joe Heyward, Andrea Banino, Mrinal Mathur, Razvan Pascanu, Simon Osindero, Benedetto de Martino, Petar Velickovic, Viorica Patraucean

Abstract: Large language models (LLMs) exhibit strikingly conflicting behaviors: they can appear steadfastly overconfident in their initial answers whilst at the same time being prone to excessive doubt when challenged. To investigate this apparent paradox, we developed a novel experimental paradigm, exploiting the unique ability to obtain confidence estimates from LLMs without creating memory of their initial judgments -- something impossible in human participants. We show that LLMs -- Gemma 3, GPT4o and o1-preview -- exhibit a pronounced choice-supportive bias that reinforces and boosts their estimate of confidence in their answer, resulting in a marked resistance to change their mind. We further demonstrate that LLMs markedly overweight inconsistent compared to consistent advice, in a fashion that deviates qualitatively from normative Bayesian updating. Finally, we demonstrate that these two mechanisms -- a drive to maintain consistency with prior commitments and hypersensitivity to contradictory feedback -- parsimoniously capture LLM behavior in a different domain. Together, these findings furnish a mechanistic account of LLM confidence that explains both their stubbornness and excessive sensitivity to criticism.

Comment: The paper explores LLMs' confidence mechanisms, providing insights into LLM behavior and interpretability, relevant to foundational research in LLMs.