Personalized Daily ArXiv Papers 2025-05-13

[gpt-4o]	Prompt	Completion	Total
Token	49097	6947	56044
Cost	$0.12	$0.07	$0.19

Total arXiv papers: 790

Total scanned papers: 495

Total relevant papers: 36

Table of contents with paper titles:

The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts Authors: Enric Boix-Adsera, Philippe Rigollet
Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations Authors: Patrick Blumenberg, Thomas Graave, Tim Fingscheidt
UMoE: Unifying Attention and FFN with Shared Experts Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration Authors: HamidReza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek El-Ghazawi
Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models Authors: Rei Higuchi, Taiji Suzuki
Analytic theory of dropout regularization Authors: Francesco Mori, Francesca Mignacco
Learning from Samples: Inverse Problems over measures via Sharpened Fenchel-Young Losses Authors: Francisco Andrade, Gabriel Peyré, Clarice Poon
ICE-Pruning: An Iterative Cost-Efficient Pruning Pipeline for Deep Neural Networks Authors: Wenhao Hu, Paul Henderson, José Cano
Beyond Attention: Toward Machines with Intrinsic Higher Mental States Authors: Ahsan Adeel
Towards the Three-Phase Dynamics of Generalization Power of a DNN Authors: Yuxuan He, Junpeng Zhang, Hongyuan Zhang, Quanshi Zhang
GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance Authors: Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song
Symbolic Rule Extraction from Attention-Guided Sparse Representations in Vision Transformers Authors: Parth Padalkar, Gopal Gupta
IIKL: Isometric Immersion Kernel Learning with Riemannian Manifold for Geometric Preservation Authors: Zihao Chen, Wenyong Wang, Jiachen Yang, Yu Xiang
FreqMoE: Dynamic Frequency Enhancement for Neural PDE Solvers Authors: Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Zhenzhe Zhang, Tianchen Zhu, Shanghang Zhang, Jianxin Li
Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures Authors: Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
Triangulating PL functions and the existence of efficient ReLU DNNs Authors: Danny Calegari
GraphComp: Extreme Error-bounded Compression of Scientific Data via Temporal Graph Autoencoders Authors: Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Ibrahim Hoteit, Panos Kalnis
Learning curves theory for hierarchically compositional data with power-law distributed features Authors: Francesco Cagnetta, Hyunmo Kang, Matthieu Wyart
Boltzmann Classifier: A Thermodynamic-Inspired Approach to Supervised Learning Authors: Muhamed Amin, Bernard R. Brooks
Certified Data Removal Under High-dimensional Settings Authors: Haolin Zou, Arnab Auddy, Yongchan Kwon, Kamiar Rahnama Rad, Arian Maleki
Identifying Causal Direction via Variational Bayesian Compression Authors: Quang-Duy Tran, Bao Duong, Phuoc Nguyen, Thin Nguyen
Importance Analysis for Dynamic Control of Balancing Parameter in a Simple Knowledge Distillation Setting Authors: Seongmin Kim, Kwanho Kim, Minseung Kim, Kanghyun Jo
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models Authors: Hang Wu, Jianian Zhu, Yinghui Li, Haojie Wang, Biao Hou, Jidong Zhai
InfoNCE is a Free Lunch for Semantically guided Graph Contrastive Learning Authors: Zixu Wang, Bingbing Xu, Yige Yuan, Huawei Shen, Xueqi Cheng
Solving Nonlinear PDEs with Sparse Radial Basis Function Networks Authors: Zihan Shao, Konstantin Pieper, Xiaochuan Tian
Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction Authors: Yu Mao, Holger Pirk, Chun Jason Xue
ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data Authors: Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano
Efficient Parallelization of Message Passing Neural Networks Authors: Junfan Xia, Bin Jiang
Mask-PINNs: Regulating Feature Distributions in Physics-Informed Neural Networks Authors: Feilong Jiang, Xiaonan Hou, Jianqiao Ye, Min Xia
PRUNE: A Patching Based Repair Framework for Certifiable Unlearning of Neural Networks Authors: Xuran Li, Jingyi Wang, Xiaohan Yuan, Peixin Zhang, Zhan Qin, Zhibo Wang, Kui Ren
Probing In-Context Learning: Impact of Task Complexity and Model Architecture on Generalization and Efficiency Authors: Binwen Liu, Peiyu Xu, Quan Yuan, Yihong Chen
Comet: Accelerating Private Inference for Large Language Model by Predicting Activation Sparsity Authors: Guang Yan, Yuhui Zhang, Zimu Guo, Lutan Zhao, Xiaojun Chen, Chen Wang, Wenhao Wang, Dan Meng, Rui Hou
CAT Merging: A Training-Free Approach for Resolving Conflicts in Model Merging Authors: Wenju Sun, Qingyong Li, Yangli-ao Geng, Boyang Li
The Influence of the Memory Capacity of Neural DDEs on the Universal Approximation Property Authors: Christian Kuehn, Sara-Viola Kuntz
Deeply Explainable Artificial Neural Network Authors: David Zucker
Feature Representation Transferring to Lightweight Models via Perception Coherence Authors: Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone

1. The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

ArXiv ID: 2505.06839

Authors: Enric Boix-Adsera, Philippe Rigollet

Abstract: Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.

Comment: The paper focuses on Mixture-of-Experts (MoE) and provides theoretical insights into the impact of granularity on network expressivity, which aligns closely with the 'Model Architecture' criterion.

Relevance: 10 Novelty: 8
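
As a rough illustration of the design parameter discussed in entry 1, the sketch below routes one input through a toy MoE layer with top-k gating, where k plays the role of the granularity (the number of active experts). The function name `moe_forward`, the scalar "experts", and the softmax over the selected logits are assumptions made for illustration; this is not the construction analyzed in the paper.

```python
import math

def moe_forward(x, gate_logits, experts, k):
    """Route input x to the k experts with the largest gate logits.

    Hypothetical toy layer: each expert is a plain Python callable and
    the output is the softmax-weighted sum over the selected experts.
    """
    if not 1 <= k <= len(experts):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (the "active" experts).
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (shifted for stability).
    m = max(gate_logits[i] for i in top)
    weights = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(weights)
    # Only the selected experts are evaluated; the others are skipped.
    output = sum(w / total * experts[i](x) for w, i in zip(weights, top))
    return output, top

# Four toy experts; two of them tie for the largest gate logit.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
y1, active1 = moe_forward(3.0, logits, experts, k=1)
y2, active2 = moe_forward(3.0, logits, experts, k=2)
```

With k=1 only a single expert contributes to the output, while k=2 mixes two experts; total parameter count is the same in both cases, and only the number of experts evaluated per input changes.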

2. Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

ArXiv ID: 2505.06653

Authors: Patrick Blumenberg, Thomas Graave, Tim Fingscheidt

Abstract: Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the quantization error and empirically achieving less degradation in language modeling performance. Thirdly, we explore additional variations of block-wise quantization methods applied to LLMs through an experimental study on the importance of accurately representing zero and large-amplitude weights on the one hand, and optimization towards various error metrics on the other hand. Lastly, we introduce a mixed-precision quantization strategy dubbed outlier-preserving quantization (OPQ) to address the distributional mismatch induced by outlier weights in block-wise quantization. By storing outlier weights in 16-bit precision (OPQ) while applying BOF4-S, we achieve top performance among 4-bit block-wise quantization techniques w.r.t. perplexity.

Comment: The paper proposes a novel 4-bit quantization method (BOF4) for LLMs, which directly aligns with the 'Model Compression' criterion and introduces significant improvements.

Relevance: 10 Novelty: 8
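
The block-wise scheme described in entry 2 can be sketched as follows: weights are split into blocks, each block is normalized by its absolute maximum, and every normalized weight is mapped to the nearest of 16 codebook levels. The codebook here is a uniform placeholder grid, not the NF4, AF4, or BOF4 levels, and the `signed=True` branch is only one reading of "signed absolute block maximum" (dividing by the signed value of the largest-magnitude weight so that it always maps to +1). All function names are hypothetical.

```python
def quantize_blockwise(weights, block_size, codebook, signed=False):
    """Quantize a flat list of floats block by block.

    Returns (codes, scales): one codebook index per weight and one
    scale per block. The codebook is assumed to cover [-1, 1].
    """
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Element with the largest magnitude in this block.
        peak = max(block, key=abs)
        # Plain absmax uses |peak|; the signed variant keeps its sign,
        # so the peak always normalizes to +1 (assumed reading).
        scale = peak if signed else abs(peak)
        if scale == 0.0:
            scale = 1.0  # all-zero block: any nonzero scale works
        scales.append(scale)
        for w in block:
            x = w / scale
            # Index of the codebook level nearest to the normalized weight.
            codes.append(min(range(len(codebook)), key=lambda i: abs(codebook[i] - x)))
    return codes, scales

def dequantize_blockwise(codes, scales, block_size, codebook):
    """Invert quantize_blockwise up to the quantization error."""
    return [codebook[c] * scales[i // block_size] for i, c in enumerate(codes)]

# Placeholder 4-bit codebook: 16 evenly spaced levels in [-1, 1].
# This is NOT the NF4, AF4, or BOF4 codebook.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]

weights = [0.02, -0.5, 0.31, 0.0, -0.07, 0.12, 0.9, -0.33]
codes, scales = quantize_blockwise(weights, block_size=4, codebook=CODEBOOK)
restored = dequantize_blockwise(codes, scales, block_size=4, codebook=CODEBOOK)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Note that the uniform placeholder grid has no level at exactly zero, so the zero weight in the first block is not reconstructed exactly. This is the kind of effect the abstract refers to when it discusses the importance of accurately representing zero.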

3. UMoE: Unifying Attention and FFN with Shared Experts

ArXiv ID: 2505.07260

Authors: Yuanhang Yang, Chaozheng Wang, Jing Li

Abstract: Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify the MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, revealing an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.

Comment: The paper introduces UMoE, a novel architecture unifying MoE designs in attention and FFN layers, which directly aligns with the 'Model Architecture' criterion, particularly focusing on Mixture-of-Experts (MoE).

Relevance: 10 Novelty: 8

4. QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

ArXiv ID: 2505.06481

Authors: HamidReza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek El-Ghazawi

Abstract: The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality and maintains throughput comparable to serving a single model, while incurring a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.

Comment: The paper focuses on memory efficiency and runtime reconfiguration for serving Mixture-of-Experts LLMs, which directly aligns with the MoE and model compression criteria.

Relevance: 9 Novelty: 8

5. Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models

ArXiv ID: 2505.07558

Authors: Rei Higuchi, Taiji Suzuki

Abstract: Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models such as the Bradley-Terry model. This assumption leads to statistical inconsistency, where more data doesn't guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method, Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods on many major benchmarks. DDRO unlocks the potential for truly data-driven alignment, paving the way for more reliable and human-aligned LLMs.

Comment: The paper introduces a statistically consistent method for aligning LLMs, which aligns with foundational research in LLM alignment and theoretical insights.

Relevance: 9 Novelty: 8

6. Analytic theory of dropout regularization

ArXiv ID: 2505.07792

Authors: Francesco Mori, Francesca Mignacco

Abstract: Dropout is a regularization technique widely used in training artificial neural networks to mitigate overfitting. It consists of dynamically deactivating subsets of the network during training to promote more robust representations. Despite its widespread adoption, dropout probabilities are often selected heuristically, and theoretical explanations of its success remain sparse. Here, we analytically study dropout in two-layer neural networks trained with online stochastic gradient descent. In the high-dimensional limit, we derive a set of ordinary differential equations that fully characterize the evolution of the network during training and capture the effects of dropout. We obtain a number of exact results describing the generalization error and the optimal dropout probability at short, intermediate, and long training times. Our analysis shows that dropout reduces detrimental correlations between hidden nodes, mitigates the impact of label noise, and that the optimal dropout probability increases with the level of noise in the data. Our results are validated by extensive numerical simulations.

Comment: The paper provides an analytic theory of dropout regularization, offering theoretical insights into training dynamics and generalization error. This aligns with the representation learning criterion, particularly in understanding how networks encode information.

Relevance: 9 Novelty: 8
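
For readers less familiar with the mechanism analyzed in entry 6, the sketch below shows dropout applied to a vector of hidden activations. It uses the common "inverted dropout" rescaling, which is an assumption here; the paper's exact parameterization of the two-layer network and its dropout probability may differ. The function name `dropout` and its arguments are illustrative.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop during training.

    Survivors are rescaled by 1 / (1 - p_drop) so the expected value of
    each unit is unchanged (inverted dropout, assumed here).
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        # At evaluation time the full network is used unchanged.
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded generator for reproducibility
hidden = [1.0] * 10000
dropped = dropout(hidden, p_drop=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)
```

Because a fresh random mask is drawn on every call, hidden units cannot rely on each other being present, which is the intuition behind the reduced correlations between hidden nodes reported in the abstract.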

7. Learning from Samples: Inverse Problems over measures via Sharpened Fenchel-Young Losses

ArXiv ID: 2505.07124

Authors: Francisco Andrade, Gabriel Peyré, Clarice Poon

Abstract: Estimating parameters from samples of an optimal probability distribution is essential in applications ranging from socio-economic modeling to biological system analysis. In these settings, the probability distribution arises as the solution to an optimization problem that captures either static interactions among agents or the dynamic evolution of a system over time. Our approach relies on minimizing a new class of loss functions, called sharpened Fenchel-Young losses, which measure the sub-optimality gap of the optimization problem over the space of measures. We study the stability of this estimation method when only a finite number of samples is available. The parameters to be estimated typically correspond to a cost function in static problems and to a potential function in dynamic problems. To analyze stability, we introduce a general methodology that leverages the strong convexity of the loss function together with the sample complexity of the forward optimization problem. Our analysis emphasizes two specific settings in the context of optimal transport, where our method provides explicit stability guarantees. The first is inverse unbalanced optimal transport (iUOT) with entropic regularization, where the parameters to estimate are cost functions that govern transport computations; this method has applications such as link prediction in machine learning. The second is inverse gradient flow (iJKO), where the objective is to recover a potential function that drives the evolution of a probability distribution via the Jordan-Kinderlehrer-Otto (JKO) time-discretization scheme; this is particularly relevant for understanding cell population dynamics in single-cell genomics. Finally, we validate our approach through numerical experiments on Gaussian distributions, where closed-form solutions are available, to demonstrate the practical performance of our methods.

Comment: The paper introduces sharpened Fenchel-Young losses for inverse problems, which is a novel theoretical contribution relevant to representation learning and optimization.

Relevance: 9 Novelty: 8

8. ICE-Pruning: An Iterative Cost-Efficient Pruning Pipeline for Deep Neural Networks

ArXiv ID: 2505.07411

Authors: Wenhao Hu, Paul Henderson, José Cano

Abstract: Pruning is a widely used method for compressing Deep Neural Networks (DNNs), where less relevant parameters are removed from a DNN model to reduce its size. However, removing parameters reduces model accuracy, so pruning is typically combined with fine-tuning, and sometimes other operations such as rewinding weights, to recover accuracy. A common approach is to repeatedly prune and then fine-tune, with increasing amounts of model parameters being removed in each step. While straightforward to implement, pruning pipelines that follow this approach are computationally expensive due to the need for repeated fine-tuning. In this paper we propose ICE-Pruning, an iterative pruning pipeline for DNNs that significantly decreases the time required for pruning by reducing the overall cost of fine-tuning, while maintaining a similar accuracy to existing pruning pipelines. ICE-Pruning is based on three main components: i) an automatic mechanism to determine after which pruning steps fine-tuning should be performed; ii) a freezing strategy for faster fine-tuning in each pruning step; and iii) a custom pruning-aware learning rate scheduler to further improve the accuracy of each pruning step and reduce the overall time consumption. We also propose an efficient auto-tuning stage for the hyperparameters (e.g., freezing percentage) introduced by the three components. We evaluate ICE-Pruning on several DNN models and datasets, showing that it can accelerate pruning by up to 9.61x. Code is available at https://github.com/gicLAB/ICE-Pruning

Comment: The paper introduces ICE-Pruning, a novel pruning pipeline for model compression, which is directly relevant to foundational research in model efficiency.

Relevance: 9 Novelty: 8

9. Beyond Attention: Toward Machines with Intrinsic Higher Mental States

ArXiv ID: 2505.06257

Authors: Ahsan Adeel

Abstract: Attending to what is relevant is fundamental to both the mammalian brain and modern machine learning models such as Transformers. Yet, determining relevance remains a core challenge, traditionally offloaded to learning algorithms like backpropagation. Inspired by recent cellular neurobiological evidence linking neocortical pyramidal cells to distinct mental states, this work shows how models (e.g., Transformers) can emulate high-level perceptual processing and awake thought (imagination) states to pre-select relevant information before applying attention. Triadic neuronal-level modulation loops among questions ($Q$), clues (keys, $K$), and hypotheses (values, $V$) enable diverse, deep, parallel reasoning chains at the representation level and allow a rapid shift from initial biases to refined understanding. This leads to orders-of-magnitude faster learning with significantly reduced computational demand (e.g., fewer heads, layers, and tokens), at an approximate cost of $\mathcal{O}(N)$, where $N$ is the number of input tokens. Results span reinforcement learning (e.g., CarRacing in a high-dimensional visual setup), computer vision, and natural language question answering.

Comment: The paper explores a novel approach to attention mechanisms inspired by neurobiology, which aligns with architectural innovations and emerging trends.

Relevance: 9 Novelty: 8

10. Towards the Three-Phase Dynamics of Generalization Power of a DNN

ArXiv ID: 2505.06993

Authors: Yuxuan He, Junpeng Zhang, Hongyuan Zhang, Quanshi Zhang

Abstract: This paper proposes a new perspective for analyzing the generalization power of deep neural networks (DNNs), i.e., directly disentangling and analyzing the dynamics of generalizable and non-generalizable interactions encoded by a DNN through the training process. Specifically, this work builds upon the recent theoretical achievement in explainable AI, which proves that the detailed inference logic of DNNs can be strictly rewritten as a small number of AND-OR interaction patterns. Based on this, we propose an efficient method to quantify the generalization power of each interaction, and we discover a distinct three-phase dynamics of the generalization power of interactions during training. In particular, the early phase of training typically removes noisy and non-generalizable interactions and learns simple and generalizable ones. The second and the third phases tend to capture increasingly complex interactions that are harder to generalize. Experimental results verify that the learning of non-generalizable interactions is the direct cause of the gap between the training and testing losses.

Comment: The paper provides a theoretical analysis of the generalization dynamics in DNNs, which is highly relevant to representation learning and training dynamics. The discovery of three-phase dynamics is novel and insightful.

Relevance: 9 Novelty: 8

11. GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

ArXiv ID: 2505.07004

Authors: Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song

Abstract: Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.

Comment: The paper introduces a novel quantization method for LLMs, which is highly relevant to model compression and efficiency. The integration of gradient information into the quantization objective is a significant contribution.

Relevance: 9 Novelty: 8

12. Symbolic Rule Extraction from Attention-Guided Sparse Representations in Vision Transformers

ArXiv ID: 2505.06745

Authors: Parth Padalkar, Gopal Gupta

Abstract: Recent neuro-symbolic approaches have successfully extracted symbolic rule-sets from CNN-based models to enhance interpretability. However, applying similar techniques to Vision Transformers (ViTs) remains challenging due to their lack of modular concept detectors and reliance on global self-attention mechanisms. We propose a framework for symbolic rule extraction from ViTs by introducing a sparse concept layer inspired by Sparse Autoencoders (SAEs). This linear layer operates on attention-weighted patch representations and learns a disentangled, binarized representation in which individual neurons activate for high-level visual concepts. To encourage interpretability, we apply a combination of L1 sparsity, entropy minimization, and supervised contrastive loss. These binarized concept activations are used as input to the FOLD-SE-M algorithm, which generates a rule-set in the form of logic programs. Our method achieves a 5.14% better classification accuracy than the standard ViT while enabling symbolic reasoning. Crucially, the extracted rule-set is not merely post-hoc but acts as a logic-based decision layer that operates directly on the sparse concept representations. The resulting programs are concise and semantically meaningful. This work is the first to extract executable logic programs from ViTs using sparse symbolic representations. It bridges the gap between transformer-based vision models and symbolic logic programming, providing a step forward in interpretable and verifiable neuro-symbolic AI.

Comment: The paper proposes a framework for symbolic rule extraction from Vision Transformers, which aligns with representation learning and interpretability. The use of sparse concept layers and symbolic reasoning is novel and foundational.

Relevance: 9 Novelty: 8

13. IIKL: Isometric Immersion Kernel Learning with Riemannian Manifold for Geometric Preservation

ArXiv ID: 2505.06288

Authors: Zihao Chen, Wenyong Wang, Jiachen Yang, Yu Xiang

Abstract: Geometric representation learning that preserves the intrinsic geometric and topological properties of discrete non-Euclidean data is crucial in scientific applications. Previous research generally mapped non-Euclidean discrete data into Euclidean space during representation learning, which may lead to the loss of some critical geometric information. In this paper, we propose a novel Isometric Immersion Kernel Learning (IIKL) method to build a Riemannian manifold and isometrically induce a Riemannian metric from discrete non-Euclidean data. We prove that isometric immersion is equivalent to the kernel function in the tangent bundle on the manifold, which explicitly guarantees the invariance of the inner product between vectors in an arbitrary tangent space throughout the learning process, thus maintaining the geometric structure of the original data. Moreover, a novel parameterized learning model based on IIKL is introduced, and an alternating training method for this model is derived using Maximum Likelihood Estimation (MLE), ensuring efficient convergence. Experimental results showed that, using the learned Riemannian manifold and its metric, our model successfully preserved the intrinsic geometric representation of data in both 3D and high-dimensional datasets, and significantly improved the accuracy of downstream tasks such as data reconstruction and classification. It is shown that our method could reduce the inner product invariant loss by more than 90% compared to state-of-the-art (SOTA) methods, and also achieved an average 40% improvement in downstream reconstruction accuracy and a 90% reduction in error for geometric metrics involving isometric and conformal.

Comment: The paper proposes a novel method for geometric representation learning using Riemannian manifolds, which aligns with the 'Representation Learning' criterion by addressing how data is encoded while preserving geometric properties.

Relevance: 9 Novelty: 8

14. FreqMoE: Dynamic Frequency Enhancement for Neural PDE Solvers

ArXiv ID: 2505.06858

Authors: Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Zhenzhe Zhang, Tianchen Zhu, Shanghang Zhang, Jianxin Li

Abstract: Fourier Neural Operators (FNO) have emerged as promising solutions for efficiently solving partial differential equations (PDEs) by learning infinite-dimensional function mappings through frequency domain transformations. However, the sparsity of high-frequency signals limits computational efficiency for high-dimensional inputs, and fixed-pattern truncation often causes high-frequency signal loss, reducing performance in scenarios such as high-resolution inputs or long-term predictions. To address these challenges, we propose FreqMoE, an efficient and progressive training framework that exploits the dependency of high-frequency signals on low-frequency components. The model first learns low-frequency weights and then applies a sparse upward-cycling strategy to construct a mixture of experts (MoE) in the frequency domain, effectively extending the learned weights to high-frequency regions. Experiments on both regular and irregular grid PDEs demonstrate that FreqMoE achieves up to 16.6% accuracy improvement while using merely 2.1% parameters (47.32x reduction) compared to dense FNO. Furthermore, the approach demonstrates remarkable stability in long-term predictions and generalizes seamlessly to various FNO variants and grid structures, establishing a new "Low frequency Pretraining, High frequency Fine-tuning" paradigm for solving PDEs.

Comment: The paper proposes FreqMoE, a novel MoE-based framework for solving PDEs, which aligns with the 'Model Architecture' criterion by innovating in the MoE space.

Relevance: 9 Novelty: 8

15. Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures

ArXiv ID: 2505.07070

Authors: Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, Matthieu Wyart

Abstract: How do neural language models acquire a language's structure when trained for next-token prediction? We address this question by deriving theoretical scaling laws for neural network performance on synthetic datasets generated by the Random Hierarchy Model (RHM) -- an ensemble of probabilistic context-free grammars designed to capture the hierarchical structure of natural language while remaining analytically tractable. Previously, we developed a theory of representation learning based on data correlations that explains how deep learning models capture the hierarchical structure of the data sequentially, one layer at a time. Here, we extend our theoretical framework to account for architectural differences. In particular, we predict and empirically validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance compared to transformer models, which rely on global self-attention mechanisms. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.

Comment: The paper provides theoretical insights into representation learning and scaling laws in hierarchical languages, aligning with the 'Representation Learning' criterion by analyzing how architectures encode information.

Relevance: 9 Novelty: 8

16. Triangulating PL functions and the existence of efficient ReLU DNNs

ArXiv ID: 2505.07137

Authors: Danny Calegari

Abstract: We show that every piecewise linear function $f:R^d \to R$ with compact support a polyhedron $P$ has a representation as a sum of so-called 'simplex functions'. Such representations arise from degree 1 triangulations of the relative homology class (in $R^{d+1}$) bounded by $P$ and the graph of $f$, and give a short elementary proof of the existence of efficient universal ReLU neural networks that simultaneously compute all such functions $f$ of bounded complexity.

Comment: The paper provides a theoretical proof for efficient ReLU DNNs and aligns with 'Model Architecture' by addressing the representation of piecewise linear functions.

Relevance: 9 Novelty: 7

17. GraphComp: Extreme Error-bounded Compression of Scientific Data via Temporal Graph Autoencoders

ArXiv ID: 2505.06316

Authors: Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Ibrahim Hoteit, Panos Kalnis

Abstract: The generation of voluminous scientific data poses significant challenges for efficient storage, transfer, and analysis. Recently, error-bounded lossy compression methods emerged due to their ability to achieve high compression ratios while controlling data distortion. However, they often overlook the inherent spatial and temporal correlations within scientific data, thus missing opportunities for higher compression. In this paper we propose GRAPHCOMP, a novel graph-based method for error-bounded lossy compression of scientific data. We perform irregular segmentation of the original grid data and generate a graph representation that preserves the spatial and temporal correlations. Inspired by Graph Neural Networks (GNNs), we then propose a temporal graph autoencoder to learn latent representations that significantly reduce the size of the graph, effectively compressing the original data. Decompression reverses the process and utilizes the learnt graph model together with the latent representation to reconstruct an approximation of the original data. The decompressed data are guaranteed to satisfy a user-defined point-wise error bound. We compare our method against the state-of-the-art error-bounded lossy methods (i.e., HPEZ, SZ3.1, SPERR, and ZFP) on large-scale real and synthetic data. GRAPHCOMP consistently achieves the highest compression ratio across most datasets, outperforming the second-best method by margins ranging from 22% to 50%.

Comment: The paper introduces a novel graph-based compression method using temporal graph autoencoders, which aligns with model compression and representation learning.

Relevance: 8 Novelty: 8

18. Learning curves theory for hierarchically compositional data with power-law distributed features

ArXiv ID: 2505.07067

Authors: Francesco Cagnetta, Hyunmo Kang, Matthieu Wyart

Abstract: Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars -- probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.

Comment: The paper provides a theoretical analysis of learning curves for hierarchically compositional data, which aligns with representation learning and emerging trends.

Relevance: 8 Novelty: 8

19. Boltzmann Classifier: A Thermodynamic-Inspired Approach to Supervised Learning

ArXiv ID: 2505.06753

Authors: Muhamed Amin, Bernard R. Brooks

Abstract: We propose a novel classification algorithm, the Boltzmann Classifier, inspired by the thermodynamic principles underlying the Boltzmann distribution. Our method computes a probabilistic estimate for each class based on an energy function derived from feature-wise deviations between input samples and class-specific centroids. The resulting probabilities are proportional to the exponential negative energies, normalized across classes, analogous to the Boltzmann distribution used in statistical mechanics. In addition, the KT variable can be used to make the high-energy states more accessible, which allows their probabilities to be tuned as needed. We evaluate the model performance on several datasets from different applications. The model achieves high accuracy, which indicates that the Boltzmann Classifier is competitive with standard models like logistic regression and k-nearest neighbors while offering a thermodynamically motivated probabilistic interpretation. Our classifier does not require iterative optimization or backpropagation and is thus computationally efficient and easy to integrate into existing workflows. This work demonstrates how ideas from physics can inform new directions in machine learning, providing a foundation for interpretable, energy-based decision-making systems.

Comment: The paper introduces the Boltzmann Classifier, which is a novel energy-based approach to supervised learning, aligning with emerging trends and foundational innovations.

Relevance: 8 Novelty: 8
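
Sketch: The scoring rule described in the Boltzmann Classifier abstract can be illustrated roughly as below. This is a minimal sketch, not the authors' implementation. It assumes the energy is the sum of absolute feature-wise deviations from each class centroid (the abstract does not give the exact form), and the names `fit_centroids`, `boltzmann_proba`, `predict`, and `kT` are hypothetical.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the sorted class labels and one mean feature vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def boltzmann_proba(X, centroids, kT=1.0):
    """Class probabilities proportional to exp(-energy / kT), normalized per sample."""
    # Assumed energy: sum of absolute feature-wise deviations from each centroid.
    energy = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
    # Shift by the per-sample minimum energy for numerical stability;
    # the shift cancels in the normalization.
    weights = np.exp(-(energy - energy.min(axis=1, keepdims=True)) / kT)
    return weights / weights.sum(axis=1, keepdims=True)


def predict(X, classes, centroids, kT=1.0):
    """Pick the class with the highest Boltzmann probability."""
    return classes[np.argmax(boltzmann_proba(X, centroids, kT), axis=1)]


# Toy data: two well-separated classes in two dimensions.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X_train, y_train)

query = np.array([[0.0, 0.5]])
p_cold = boltzmann_proba(query, centroids, kT=1.0)
p_hot = boltzmann_proba(query, centroids, kT=100.0)
```

Fitting is a single pass that computes class means, which is consistent with the abstract's statement that no iterative optimization is needed. Raising `kT` flattens the distribution, so the higher-energy class receives more probability in `p_hot` than in `p_cold`.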

20. Certified Data Removal Under High-dimensional Settings

ArXiv ID: 2505.07640

Authors: Haolin Zou, Arnab Auddy, Yongchan Kwon, Kamiar Rahnama Rad, Arian Maleki

Abstract: Machine unlearning focuses on the computationally efficient removal of specific training data from trained models, ensuring that the influence of forgotten data is effectively eliminated without the need for full retraining. Despite advances in low-dimensional settings, where the number of parameters $p$ is much smaller than the sample size $n$, extending similar theoretical guarantees to high-dimensional regimes remains challenging. We propose an unlearning algorithm that starts from the original model parameters and performs a theory-guided sequence of Newton steps $T \in \{1, 2\}$. After this update, carefully scaled isotropic Laplacian noise is added to the estimate to ensure that any (potential) residual influence of forget data is completely removed. We show that when both $n, p \to \infty$ with a fixed ratio $n/p$, significant theoretical and computational obstacles arise due to the interplay between the complexity of the model and the finite signal-to-noise ratio. Finally, we show that, unlike in low-dimensional settings, a single Newton step is insufficient for effective unlearning in high-dimensional problems -- however, two steps are enough to achieve the desired certifiability. We provide numerical experiments to support the certifiability and accuracy claims of this approach.

Comment: The paper proposes a high-dimensional unlearning algorithm with theoretical guarantees, which aligns with 'Emerging Trends' due to its novel approach to machine unlearning.

Relevance: 8 Novelty: 8
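
Sketch: The update described in the Certified Data Removal abstract (a few Newton steps from the original parameters, followed by added noise) might look roughly like this for L2-regularized logistic regression. The model choice, the function names, and `noise_scale` are assumptions made for illustration. The abstract does not state how the noise is scaled, so the value used here is a placeholder and carries no certification guarantee. Isotropic Laplacian noise is read here as a density proportional to exp(-||b|| / scale), sampled as a uniform direction times a Gamma-distributed radius.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient(w, X, y, lam):
    """Gradient of the L2-regularized logistic loss over (X, y)."""
    return X.T @ (sigmoid(X @ w) - y) + lam * w


def newton_step(w, X, y, lam):
    """One Newton step on the L2-regularized logistic loss over (X, y)."""
    p = sigmoid(X @ w)
    hessian = (X.T * (p * (1.0 - p))) @ X + lam * np.eye(X.shape[1])
    return w - np.linalg.solve(hessian, gradient(w, X, y, lam))


def unlearn(w_full, X_keep, y_keep, lam, steps=2, noise_scale=0.0, rng=None):
    """Newton steps on the retained data, then optional isotropic Laplacian noise."""
    w = w_full.copy()
    for _ in range(steps):
        w = newton_step(w, X_keep, y_keep, lam)
    if noise_scale > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        direction = rng.standard_normal(w.shape)
        direction /= np.linalg.norm(direction)
        # Radius of a p-dimensional density proportional to exp(-||b|| / scale).
        radius = rng.gamma(shape=w.size, scale=noise_scale)
        w = w + radius * direction
    return w


# Synthetic data (hypothetical sizes, chosen only to exercise the code).
rng = np.random.default_rng(0)
n, p, lam = 200, 5, 1.0
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -0.5, 0.25, 0.0, 0.5])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

# Fit on all data with plain Newton iterations.
w_full = np.zeros(p)
for _ in range(25):
    w_full = newton_step(w_full, X, y, lam)

# Forget the first five samples.
X_keep, y_keep = X[5:], y[5:]
w_two = unlearn(w_full, X_keep, y_keep, lam, steps=2)
w_noisy = unlearn(w_full, X_keep, y_keep, lam, steps=2, noise_scale=1e-3, rng=rng)

residual_before = np.linalg.norm(gradient(w_full, X_keep, y_keep, lam))
residual_after = np.linalg.norm(gradient(w_two, X_keep, y_keep, lam))
```

The residual gradient norm on the retained data is one simple way to see how far the updated parameters are from the retrained solution. The paper's actual certification criterion is not given in the abstract.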

21. Identifying Causal Direction via Variational Bayesian Compression

ArXiv ID: 2505.07503

Authors: Quang-Duy Tran, Bao Duong, Phuoc Nguyen, Thin Nguyen

Abstract: Telling apart the cause and effect between two random variables with purely observational data is a challenging problem that finds applications in various scientific disciplines. A key principle utilized in this task is the algorithmic Markov condition, which postulates that the joint distribution, when factorized according to the causal direction, yields a more succinct codelength compared to the anti-causal direction. Previous approaches approximate these codelengths by relying on simple functions or Gaussian processes (GPs) with easily evaluable complexity, compromising between model fitness and computational complexity. To overcome these limitations, we propose leveraging the variational Bayesian learning of neural networks as an interpretation of the codelengths. Consequently, we can enhance the model fitness and promote the succinctness of the codelengths while avoiding the significant computational complexity of the GP-based approaches. Extensive experiments on both synthetic and real-world benchmarks in cause-effect identification demonstrate the effectiveness of our proposed method, surpassing the overall performance of related complexity-based and structural causal model regression-based approaches.

Comment: The paper proposes a method for identifying causal direction using variational Bayesian compression, which involves foundational insights into representation learning through succinctness and model fitness. This aligns well with the criteria for representation learning.

Relevance: 8 Novelty: 8
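
Sketch: The codelength comparison behind the algorithmic Markov condition, as described in the abstract of entry 21, can be written as a decision rule. The notation $L(\cdot)$ for an approximate codelength is introduced here for illustration and is not taken from the paper. How each term is estimated (in the paper, via variational Bayesian neural networks) is not shown.

```latex
% Total codelength of each factorization of the joint distribution:
\[
  L_{X \to Y} = L(P_X) + L(P_{Y \mid X}),
  \qquad
  L_{Y \to X} = L(P_Y) + L(P_{X \mid Y}).
\]
% Decision rule: prefer the direction with the shorter total codelength.
\[
  \text{infer } X \to Y \ \text{ if } \ L_{X \to Y} < L_{Y \to X},
  \qquad
  \text{infer } Y \to X \ \text{ if } \ L_{Y \to X} < L_{X \to Y}.
\]
```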

22. Importance Analysis for Dynamic Control of Balancing Parameter in a Simple Knowledge Distillation Setting

ArXiv ID: 2505.06270

Authors: Seongmin Kim, Kwanho Kim, Minseung Kim, Kanghyun Jo

Abstract: Although deep learning models owe their remarkable success to deep and complex architectures, this very complexity typically comes at the expense of real-time performance. To address this issue, a variety of model compression techniques have been proposed, among which knowledge distillation (KD) stands out for its strong empirical performance. KD comprises two concurrent processes: (i) matching the outputs of a large, pre-trained teacher network and a lightweight student network, and (ii) training the student to solve its designated downstream task. The associated loss functions are termed the distillation loss and the downstream-task loss, respectively. Numerous prior studies report that KD is most effective when the influence of the distillation loss outweighs that of the downstream-task loss. The influence (or importance) is typically regulated by a balancing parameter. This paper provides a mathematical rationale showing that in a simple KD setting when the loss is decreasing, the balancing parameter should be dynamically adjusted

Comment: The paper discusses a mathematical rationale for dynamically adjusting the balancing parameter in knowledge distillation, which aligns with model compression and training dynamics in neural networks.
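
Sketch: The two-term KD objective described in the abstract can be illustrated as a weighted sum of a distillation loss and a downstream-task loss. This is a minimal sketch under assumptions: the convex-combination form, the KL-divergence distillation term, the temperature, and the names `kd_loss` and `alpha` are not taken from the paper, and the common temperature-squared rescaling is omitted. The abstract is cut off before the paper's dynamic adjustment rule, so only a fixed balancing parameter is shown.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax with the usual max shift for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def kd_loss(student_logits, teacher_logits, labels, alpha=0.7, temperature=1.0):
    """Return (total, distillation, task) losses for one batch.

    total = alpha * distillation + (1 - alpha) * task, where alpha is the
    balancing parameter (assumed form).
    """
    # Distillation loss: KL(teacher || student) on temperature-softened outputs.
    log_p_student = log_softmax(student_logits / temperature)
    log_p_teacher = log_softmax(teacher_logits / temperature)
    p_teacher = np.exp(log_p_teacher)
    distill = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=1).mean()

    # Downstream-task loss: cross-entropy of the student against hard labels.
    log_p = log_softmax(student_logits)
    task = -log_p[np.arange(len(labels)), labels].mean()

    total = alpha * distill + (1.0 - alpha) * task
    return total, distill, task


# Toy batch: the student matches the teacher exactly, so the distillation
# term vanishes and only the task term contributes.
student = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
teacher = student.copy()
labels = np.array([0, 2])
total, distill, task = kd_loss(student, teacher, labels, alpha=0.7)
```

With `alpha` above 0.5 the distillation term dominates, which matches the regime that the abstract reports prior studies found most effective.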