Personalized Daily ArXiv Papers 2025-04-30

[gpt-4o]	Prompt	Completion	Total
Token	35188	5033	40221
Cost	$0.09	$0.05	$0.14

Total arXiv papers: 438

Total scanned papers: 289

Total relevant papers: 25

Table of contents with paper titles:

Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Authors: Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, Xipeng Qiu
Nonlinear Computation with Linear Optics via Source-Position Encoding Authors: N. Richardson, C. Bosch, R. P. Adams
Partial Answer of How Transformers Learn Automata Authors: Tiantian (Crystal) Zhang
GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection Authors: DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
The Limits of AI Explainability: An Algorithmic Information Theory Approach Authors: Shrisha Rao
Learning Laplacian Positional Encodings for Heterophilous Graphs Authors: Michael Ito, Jiong Zhu, Dexiong Chen, Danai Koutra, Jenna Wiens
Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding Authors: Gabe Guo, Stefano Ermon
Jekyll-and-Hyde Tipping Point in an AI's Behavior Authors: Neil F. Johnson, Frank Yingjie Huo
Equivariant non-linear maps for neural networks on homogeneous spaces Authors: Elias Nyholm, Oscar Carlsson, Maurice Weiler, Daniel Persson
Energy-Based Coarse-Graining in Molecular Dynamics: A Flow-Based Framework Without Data Authors: Maximilian Stupp, P. S. Koutsourelakis
Coreset selection for the Sinkhorn divergence and generic smooth divergences Authors: Alex Kokot, Alex Luedtke
FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation Authors: Jae Yong Lee, Gwang Jae Jung, Byung Chan Lim, Hyung Ju Hwang
Provably faster randomized and quantum algorithms for k-means clustering via uniform sampling Authors: Tyler Chen, Archan Ray, Akshay Seshadri, Dylan Herman, Bao Bach, Pranav Deshpande, Abhishek Som, Niraj Kumar, Marco Pistoia
Learning and Generalization with Mixture Data Authors: Harsh Vardhan, Avishek Ghosh, Arya Mazumdar
FX-DARTS: Designing Topology-unconstrained Architectures with Differentiable Architecture Search and Entropy-based Super-network Shrinking Authors: Xuan Rao, Bo Zhao, Derong Liu, Cesare Alippi
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
Head-Tail-Aware KL Divergence in Knowledge Distillation for Spiking Neural Networks Authors: Tianqing Zhang, Zixin Zhu, Kairong Yu, Hongwei Wang
DYNAMAX: Dynamic computing for Transformers and Mamba based architectures Authors: Miguel Nogales, Matteo Gambella, Manuel Roveri
Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework Authors: Yuan Xia, Akanksha Atrey, Fadoua Khmaissia, Kedar S. Namjoshi
Group Relative Knowledge Distillation: Learning from Teacher's Relational Inductive Bias Authors: Chao Li, Changhua Zhou, Jia Chen
SFi-Former: Sparse Flow Induced Attention for Graph Transformer Authors: Zhonghao Li, Ji Shi, Xinming Zhang, Miao Zhang, Bo Li
On Stochastic Rounding with Few Random Bits Authors: Andrew Fitzgibbon, Stephen Felix
Guessing Efficiently for Constrained Subspace Approximation Authors: Aditya Bhaskara, Sepideh Mahabadi, Madhusudhan Reddy Pittu, Ali Vakilian, David P. Woodruff
Explanations Go Linear: Interpretable and Individual Latent Encoding for Post-hoc Explainability Authors: Simone Piaggesi, Riccardo Guidotti, Fosca Giannotti, Dino Pedreschi

1. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition

ArXiv ID: 2504.20938

Authors: Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, Xipeng Qiu

Abstract: We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers to disentangle original Multi Head Self Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition to understand attention-mediated interaction between features in different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors like induction heads, successor heads and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoder (SAE) are both sparse dictionary learning methods applied to different Transformer components, and lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads, each corresponding to an atomic operation in Llama-3.1-8B. Automated interpretability analysis indicates that Lorsa achieves parity with SAE in interpretability while Lorsa exhibits superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablation, Lorsa scaling law and error analysis.

Comment: The paper introduces Low-Rank Sparse Attention (Lorsa), which aligns with the criteria of representation learning and model compression by exploring sparse dictionary learning and low-rank decomposition in Transformer attention layers. It also provides insights into training dynamics and interpretability.

Relevance: 10 Novelty: 8

2. Nonlinear Computation with Linear Optics via Source-Position Encoding

ArXiv ID: 2504.20401

Authors: N. Richardson, C. Bosch, R. P. Adams

Abstract: Optical computing systems provide an alternate hardware model which appears to be aligned with the demands of neural network workloads. However, the challenge of implementing energy efficient nonlinearities in optics -- a key requirement for realizing neural networks -- is a conspicuous missing link. In this work we introduce a novel method to achieve nonlinear computation in fully linear media. Our method can operate at low power and requires only the ability to drive the optical system at a data-dependent spatial position. Leveraging this positional encoding, we formulate a fully automated, topology-optimization-based hardware design framework for extremely specialized optical neural networks, drawing on modern advancements in optimization and machine learning. We evaluate our optical designs on machine learning classification tasks: demonstrating significant improvements over linear methods, and competitive performance when compared to standard artificial neural networks.

Comment: The paper proposes a novel method for nonlinear computation in linear optical systems, which aligns with 'Emerging Trends' and foundational advancements in hardware for neural networks.

Relevance: 9 Novelty: 9

3. Partial Answer of How Transformers Learn Automata

ArXiv ID: 2504.20395

Authors: Tiantian (Crystal) Zhang

Abstract: We introduce a novel framework for simulating finite automata using representation-theoretic semidirect products and Fourier modules, achieving more efficient Transformer-based implementations.

Comment: The paper introduces a novel framework for simulating finite automata using representation-theoretic methods, which aligns with representation learning and theoretical insights into Transformer behavior.

Relevance: 9 Novelty: 8

4. GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection

ArXiv ID: 2504.20437

Authors: DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao

Abstract: Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. However, there are several remaining challenges for GaLore, such as the computational overhead of SVD for subspace updates and the integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.

Comment: The paper introduces GaLore 2, which focuses on gradient low-rank projection for efficient LLM pretraining, aligning with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
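
As a rough illustration of the gradient low-rank projection idea named in the abstract above, the update can be written as below. This follows the way the original GaLore method is commonly presented, not anything specific to GaLore 2. The symbols ($G_t$ for the gradient of an $m \times n$ weight matrix $W_t$, $P_t$ for the projector, $\rho_t$ for the optimizer's update rule such as Adam, $\eta$ for the learning rate, $r$ for the rank) are notation assumed for this sketch.

```latex
\begin{aligned}
R_t     &= P_t^{\top} G_t, \qquad G_t \in \mathbb{R}^{m \times n},\; P_t \in \mathbb{R}^{m \times r},\; r \ll \min(m, n) \\
N_t     &= \rho_t(R_t) \\
W_{t+1} &= W_t - \eta \, P_t N_t
\end{aligned}
```

Here $P_t$ holds the top-$r$ left singular vectors of a recent gradient and is refreshed only periodically, which is where the SVD overhead mentioned in the abstract comes from. The optimizer state lives in the $r \times n$ projected space rather than the full $m \times n$ space, which is the source of the memory savings.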

5. Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

ArXiv ID: 2504.20966

Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji

Abstract: We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M parameter models demonstrate that softpick maintains performance parity with softmax on standard benchmarks while achieving 0% sink rate. The softpick transformer produces hidden states with significantly lower kurtosis (340 vs 33,510) and creates sparse attention maps (46.97% sparsity). Models using softpick consistently outperform softmax when quantized, with particularly pronounced advantages at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at https://github.com/zaydzuhri/softpick-attention.

Comment: Introduces 'softpick', a rectified softmax replacement for transformer attention mechanisms, with implications for sparsity, quantization, and interpretability, directly aligning with model architecture and compression topics.

Relevance: 9 Novelty: 8
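
To make "rectified, not sum-to-one" in the abstract above concrete, here is a minimal pure-Python sketch of one such normalization: ReLU(e^x - 1) divided by the sum of |e^x - 1|. The exact formula, the `eps` constant, and the function name `softpick` as used here are assumptions of this sketch, not taken from the abstract; the authors' repository linked above is the reference for the actual definition.

```python
import math


def softpick(scores, eps=1e-8):
    """Illustrative rectified normalization: relu(e^x - 1) / (sum |e^x - 1| + eps).

    Assumed form for this sketch; see the authors' code for the real definition.
    """
    # Shift by m >= 0 so every exponent is <= 0 and math.exp cannot overflow.
    # e^(x-m) - e^(-m) equals (e^x - 1) * e^(-m); the common factor e^(-m)
    # cancels between numerator and denominator (up to the tiny eps term).
    m = max(0.0, max(scores))
    shifted = [math.exp(x - m) - math.exp(-m) for x in scores]
    denom = sum(abs(s) for s in shifted) + eps
    # Rectify: non-positive scores receive exactly zero weight.
    return [max(s, 0.0) / denom for s in shifted]


weights = softpick([1.0, -1.0])
all_zero = softpick([0.0, 0.0])
```

In this sketch a negative score gets exactly zero weight and the weights sum to less than one, so no token is forced to absorb leftover probability mass, which is the intuition behind removing the attention sink.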

6. The Limits of AI Explainability: An Algorithmic Information Theory Approach

ArXiv ID: 2504.20676

Authors: Shrisha Rao

Abstract: This paper establishes a theoretical foundation for understanding the fundamental limits of AI explainability through algorithmic information theory. We formalize explainability as the approximation of complex models by simpler ones, quantifying both approximation error and explanation complexity using Kolmogorov complexity. Our key theoretical contributions include: (1) a complexity gap theorem proving that any explanation significantly simpler than the original model must differ from it on some inputs; (2) precise bounds showing that explanation complexity grows exponentially with input dimension but polynomially with error tolerance for Lipschitz functions; and (3) a characterization of the gap between local and global explainability, demonstrating that local explanations can be significantly simpler while maintaining accuracy in relevant regions. We further establish a regulatory impossibility theorem proving that no governance framework can simultaneously pursue unrestricted AI capabilities, human-interpretable explanations, and negligible error. These results highlight considerations likely to be relevant to the design, evaluation, and oversight of explainable AI systems.

Comment: This paper provides a theoretical foundation for AI explainability using algorithmic information theory, aligning with the 'Emerging Trends' criterion for foundational research.

Relevance: 9 Novelty: 8

7. Learning Laplacian Positional Encodings for Heterophilous Graphs

ArXiv ID: 2504.20430

Authors: Michael Ito, Jiong Zhu, Dexiong Chen, Danai Koutra, Jenna Wiens

Abstract: In this work, we theoretically demonstrate that current graph positional encodings (PEs) are not beneficial and could potentially hurt performance in tasks involving heterophilous graphs, where nodes that are close tend to have different labels. This limitation is critical as many real-world networks exhibit heterophily, and even highly homophilous graphs can contain local regions of strong heterophily. To address this limitation, we propose Learnable Laplacian Positional Encodings (LLPE), a new PE that leverages the full spectrum of the graph Laplacian, enabling them to capture graph structure on both homophilous and heterophilous graphs. Theoretically, we prove LLPE's ability to approximate a general class of graph distances and demonstrate its generalization properties. Empirically, our evaluation on 12 benchmarks demonstrates that LLPE improves accuracy across a variety of GNNs, including graph transformers, by up to 35% and 14% on synthetic and real-world graphs, respectively. Going forward, our work represents a significant step towards developing PEs that effectively capture complex structures in heterophilous graphs.

Comment: The paper introduces Learnable Laplacian Positional Encodings, which aligns with foundational research in representation learning and graph neural networks.

Relevance: 9 Novelty: 8

8. Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding

ArXiv ID: 2504.20456

Authors: Gabe Guo, Stefano Ermon

Abstract: In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that only works with infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions, via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted. We empirically verify that ASSD speeds up language generation, without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction of language modeling.

Comment: The paper revives Any-Subset Autoregressive Models (AS-ARMs) with a principled approach to parallel sampling and decoding, which aligns with the 'Large Language Models' criterion by providing theoretical insights into language model behavior.

Relevance: 9 Novelty: 8

9. Jekyll-and-Hyde Tipping Point in an AI's Behavior

ArXiv ID: 2504.20980

Authors: Neil F. Johnson, Frank Yingjie Huo

Abstract: Trust in AI is undermined by the fact that there is no science that predicts -- or that can explain to the public -- when an LLM's output (e.g. ChatGPT) is likely to tip mid-response to become wrong, misleading, irrelevant or dangerous. With deaths and trauma already being blamed on LLMs, this uncertainty is even pushing people to treat their 'pet' LLM more politely to 'dissuade' it (or its future Artificial General Intelligence offspring) from suddenly turning on them. Here we address this acute need by deriving from first principles an exact formula for when a Jekyll-and-Hyde tipping point occurs at LLMs' most basic level. Requiring only secondary school mathematics, it shows the cause to be the AI's attention spreading so thin it suddenly snaps. This exact formula provides quantitative predictions for how the tipping-point can be delayed or prevented by changing the prompt and the AI's training. Tailored generalizations will provide policymakers and the public with a firm platform for discussing any of AI's broader uses and risks, e.g. as a personal counselor, medical advisor, decision-maker for when to use force in a conflict situation. It also meets the need for clear and transparent answers to questions like "should I be polite to my LLM?"

Comment: The paper derives a formula for tipping points in LLM behavior, aligning with the 'Large Language Models' criterion by providing theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8

10. Equivariant non-linear maps for neural networks on homogeneous spaces

ArXiv ID: 2504.20974

Authors: Elias Nyholm, Oscar Carlsson, Maurice Weiler, Daniel Persson

Abstract: This paper presents a novel framework for non-linear equivariant neural network layers on homogeneous spaces. The seminal work of Cohen et al. on equivariant $G$-CNNs on homogeneous spaces characterized the representation theory of such layers in the linear setting, finding that they are given by convolutions with kernels satisfying so-called steerability constraints. Motivated by the empirical success of non-linear layers, such as self-attention or input dependent kernels, we set out to generalize these insights to the non-linear setting. We derive generalized steerability constraints that any such layer needs to satisfy and prove the universality of our construction. The insights gained into the symmetry-constrained functional dependence of equivariant operators on feature maps and group elements inform the design of future equivariant neural network layers. We demonstrate how several common equivariant network architectures ($G$-CNNs, implicit steerable kernel networks, conventional and relative position embedded attention based transformers, and LieTransformers) may be derived from our framework.

Comment: This paper provides a theoretical framework for non-linear equivariant neural network layers, which aligns with architectural innovations and foundational research.

Relevance: 9 Novelty: 8

11. Energy-Based Coarse-Graining in Molecular Dynamics: A Flow-Based Framework Without Data

ArXiv ID: 2504.20940

Authors: Maximilian Stupp, P. S. Koutsourelakis

Abstract: Coarse-grained (CG) models offer an effective route to reducing the complexity of molecular simulations, yet conventional approaches depend heavily on long all-atom molecular dynamics (MD) trajectories to adequately sample configurational space. This data-driven dependence limits their accuracy and generalizability, as unvisited configurations remain excluded from the resulting CG model. We introduce a data-free generative framework for coarse-graining that directly targets the all-atom Boltzmann distribution. Our model defines a structured latent space comprising slow collective variables, which are statistically associated with multimodal marginal densities capturing metastable states, and fast variables, which represent the remaining degrees of freedom with simple, unimodal conditional distributions. A potentially learnable, bijective map from the full latent space to the all-atom configuration space enables automatic and accurate reconstruction of molecular structures. The model is trained using an energy-based objective that minimizes the reverse Kullback-Leibler divergence, relying solely on the interatomic potential rather than sampled trajectories. A tempering scheme is used to stabilize training and promote exploration of diverse configurations. Once trained, the model can generate unbiased, one-shot equilibrium all-atom samples. We validate the method on two synthetic systems (a double-well potential and a Gaussian mixture) as well as on the benchmark alanine dipeptide. The model captures all relevant modes of the Boltzmann distribution, accurately reconstructs atomic configurations, and learns physically meaningful coarse-grained representations, all without any simulation data.

Comment: The paper introduces a data-free generative framework for coarse-graining in molecular dynamics, aligning with 'AI for Science' and foundational generative modeling.

Relevance: 8 Novelty: 8
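
The abstract above says the reverse Kullback-Leibler objective relies only on the interatomic potential. A short worked identity shows why this is possible in general. The notation is assumed for this sketch and is not taken from the paper: $q_\theta$ is the model density, $U$ the potential energy, $\beta$ the inverse temperature, and $Z$ the unknown normalizing constant of the Boltzmann density $p(x) = e^{-\beta U(x)} / Z$.

```latex
\mathrm{KL}\left(q_\theta \,\|\, p\right)
  = \mathbb{E}_{x \sim q_\theta}\left[\log q_\theta(x) - \log p(x)\right]
  = \mathbb{E}_{x \sim q_\theta}\left[\log q_\theta(x) + \beta\, U(x)\right] + \log Z
```

Because $\log Z$ does not depend on $\theta$, minimizing the expectation only requires samples from the model itself and evaluations of $U$, not trajectories drawn from $p$.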

12. Coreset selection for the Sinkhorn divergence and generic smooth divergences

ArXiv ID: 2504.20194

Authors: Alex Kokot, Alex Luedtke

Abstract: We introduce CO2, an efficient algorithm to produce convexly-weighted coresets with respect to generic smooth divergences. By employing a functional Taylor expansion, we show a local equivalence between sufficiently regular losses and their second order approximations, reducing the coreset selection problem to maximum mean discrepancy minimization. We apply CO2 to the Sinkhorn divergence, providing a novel sampling procedure that requires logarithmically many data points to match the approximation guarantees of random sampling. To show this, we additionally verify several new regularity properties for entropically regularized optimal transport of independent interest. Our approach leads to a new perspective linking coreset selection and kernel quadrature to classical statistical methods such as moment and score matching. We showcase this method with a practical application of subsampling image data, and highlight key directions to explore for improved algorithmic efficiency and theoretical guarantees.

Comment: The paper introduces a coreset selection algorithm for smooth divergences, which is a foundational contribution to optimization and efficiency methods.

Relevance: 8 Novelty: 8

13. FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation

ArXiv ID: 2504.20408

Authors: Jae Yong Lee, Gwang Jae Jung, Byung Chan Lim, Hyung Ju Hwang

Abstract: The Boltzmann equation, a fundamental model in kinetic theory, describes the evolution of particle distribution functions through a nonlinear, high-dimensional collision operator. However, its numerical solution remains computationally demanding, particularly for inelastic collisions and high-dimensional velocity domains. In this work, we propose the Fourier Neural Spectral Network (FourierSpecNet), a hybrid framework that integrates the Fourier spectral method with deep learning to approximate the collision operator in Fourier space efficiently. FourierSpecNet achieves resolution-invariant learning and supports zero-shot super-resolution, enabling accurate predictions at unseen resolutions without retraining. Beyond empirical validation, we establish a consistency result showing that the trained operator converges to the spectral solution as the discretization is refined. We evaluate our method on several benchmark cases, including Maxwellian and hard-sphere molecular models, as well as inelastic collision scenarios. The results demonstrate that FourierSpecNet offers competitive accuracy while significantly reducing computational cost compared to traditional spectral solvers. Our approach provides a robust and scalable alternative for solving the Boltzmann equation across both elastic and inelastic regimes.

Comment: The paper proposes a hybrid framework for solving the Boltzmann equation using deep learning, which is foundational research in AI for Science.

Relevance: 8 Novelty: 8

14. Provably faster randomized and quantum algorithms for k-means clustering via uniform sampling

ArXiv ID: 2504.20982

Authors: Tyler Chen, Archan Ray, Akshay Seshadri, Dylan Herman, Bao Bach, Pranav Deshpande, Abhishek Som, Niraj Kumar, Marco Pistoia

Abstract: The $k$-means algorithm (Lloyd's algorithm) is a widely used method for clustering unlabeled data. A key bottleneck of the $k$-means algorithm is that each iteration requires time linear in the number of data points, which can be expensive in big data applications. This was improved in recent works proposing quantum and quantum-inspired classical algorithms to approximate the $k$-means algorithm locally, in time depending only logarithmically on the number of data points (along with data dependent parameters) [$q$-means: A quantum algorithm for unsupervised machine learning; Kerenidis, Landman, Luongo, and Prakash, NeurIPS 2019; Do you know what $q$-means?, Doriguello, Luongo, Tang]. In this work, we describe a simple randomized mini-batch $k$-means algorithm and a quantum algorithm inspired by the classical algorithm. We prove worst-case guarantees that significantly improve upon the bounds for previous algorithms. Our improvements are due to a careful use of uniform sampling, which preserves certain symmetries of the $k$-means problem that are not preserved in previous algorithms that use data norm-based sampling.

Comment: The paper proposes randomized and quantum algorithms for k-means clustering, which is foundational research in optimization and efficiency methods.

Relevance: 8 Novelty: 8
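
For a concrete picture of what a mini-batch $k$-means step with uniform sampling can look like, here is a small pure-Python sketch. It is a generic Lloyd-style update computed from a uniformly sampled batch, not the paper's exact algorithm or its analysis; the names `sq_dist` and `minibatch_kmeans_step`, the with-replacement sampling, and the toy data are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_kmeans_step(points, centroids, batch_size, rng):
    """One centroid update estimated from a uniformly sampled mini-batch."""
    # Uniform sampling with replacement: every point is equally likely,
    # independent of its norm.
    batch = [rng.choice(points) for _ in range(batch_size)]
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in batch:
        # Assign the sampled point to its nearest current centroid.
        j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # A cluster that received no sampled point keeps its previous centroid.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]


# Toy usage: two well-separated groups of points in the plane.
rng = random.Random(0)
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centroids = [(0.0, 1.0), (4.0, 4.0)]
for _ in range(5):
    centroids = minibatch_kmeans_step(points, centroids, batch_size=8, rng=rng)
```

The cost of each step depends on `batch_size` and the number of centroids rather than on the full dataset size, which is the bottleneck the abstract describes for standard Lloyd iterations.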

15. Learning and Generalization with Mixture Data

ArXiv ID: 2504.20651

Authors: Harsh Vardhan, Avishek Ghosh, Arya Mazumdar

Abstract: In many, if not most, machine learning applications the training data is naturally heterogeneous (e.g. federated learning, adversarial attacks and domain adaptation in neural net training). Data heterogeneity is identified as one of the major challenges in modern day large-scale learning. A classical way to represent heterogeneous data is via a mixture model. In this paper, we study generalization performance and statistical rates when data is sampled from a mixture distribution. We first characterize the heterogeneity of the mixture in terms of the pairwise total variation distance of the sub-population distributions. Thereafter, as a central theme of this paper, we characterize the range where the mixture may be treated as a single (homogeneous) distribution for learning. In particular, we study the generalization performance under the classical PAC framework and the statistical error rates for parametric (linear regression, mixture of hyperplanes) as well as non-parametric (Lipschitz, convex and Hölder-smooth) regression problems. In order to do this, we obtain Rademacher complexity and (local) Gaussian complexity bounds with mixture data, and apply them to get the generalization and convergence rates respectively. We observe that as the (regression) function classes get more complex, the requirement on the pairwise total variation distance gets stringent, which matches our intuition. We also do a finer analysis for the case of mixed linear regression and provide a tight bound on the generalization error in terms of heterogeneity.

Comment: The paper studies generalization and statistical rates for mixture data, providing theoretical insights into heterogeneous data learning. This aligns with foundational research in representation learning and generalization theory.

Relevance: 8 Novelty: 8

16. FX-DARTS: Designing Topology-unconstrained Architectures with Differentiable Architecture Search and Entropy-based Super-network Shrinking

ArXiv ID: 2504.20079

Authors: Xuan Rao, Bo Zhao, Derong Liu, Cesare Alippi

Abstract: Strong priors are imposed on the search space of Differentiable Architecture Search (DARTS), such that cells of the same type share the same topological structure and each intermediate node retains two operators from distinct nodes. While these priors reduce optimization difficulties and improve the applicability of searched architectures, they hinder the subsequent development of automated machine learning (Auto-ML) and prevent the optimization algorithm from exploring more powerful neural networks through improved architectural flexibility. This paper aims to reduce these prior constraints by eliminating restrictions on cell topology and modifying the discretization mechanism for super-networks. Specifically, the Flexible DARTS (FX-DARTS) method, which leverages an Entropy-based Super-Network Shrinking (ESS) framework, is presented to address the challenges arising from the elimination of prior constraints. Notably, FX-DARTS enables the derivation of neural architectures without strict prior rules while maintaining the stability in the enlarged search space. Experimental results on image classification benchmarks demonstrate that FX-DARTS is capable of exploring a set of neural architectures with competitive trade-offs between performance and computational complexity within a single search procedure.

Comment: The paper introduces FX-DARTS, which focuses on differentiable architecture search and entropy-based super-network shrinking, aligning with the model architecture criterion by exploring architectural flexibility and optimization.