Personalized Daily Arxiv Papers 3/24/2025

[gpt-4o]   Prompt    Completion   Total
Tokens     44946     6046         50992
Cost       $0.11     $0.06        $0.17

Total arXiv papers: 502

Total scanned papers: 298

Total relevant papers: 28

Table of contents with paper titles:

  1. Offline Model-Based Optimization: Comprehensive Review Authors: Minsu Kim, Jiayao Gu, Ye Yuan, Taeyoung Yun, Zixuan Liu, Yoshua Bengio, Can Chen

  2. Large Language Model Compression via the Nested Activation-Aware Decomposition Authors: Jun Lu, Tianyi Xu, Bill Ding, David Li, Yu Kang

  3. Malliavin-Bismut Score-based Diffusion Models Authors: Ehsan Mirafzali, Utkarsh Gupta, Patrick Wyrod, Frank Proske, Daniele Venturi, Razvan Marinescu

  4. Exploring a Principled Framework for Deep Subspace Clustering Authors: Xianghan Meng, Zhiyuan Huang, Wei He, Xianbiao Qi, Rong Xiao, Chun-Guang Li

  5. SuperARC: A Test for General and Super Intelligence Based on First Principles of Recursion Theory and Algorithmic Probability Authors: Alberto Hernández-Espinosa, Luan Ozelim, Felipe S. Abrahão, Hector Zenil

  6. Glivenko-Cantelli for $f$-divergence Authors: Haoming Wang, Lek-Heng Lim

  7. Accelerating Transformer Inference and Training with 2:4 Activation Sparsity Authors: Daniel Haziza, Timothy Chou, Dhruv Choudhary, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut, Jesse Cai

  8. Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs Authors: Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee

  9. Nonparametric Factor Analysis and Beyond Authors: Yujia Zheng, Yang Liu, Jiaxiong Yao, Yingyao Hu, Kun Zhang

  10. NdLinear Is All You Need for Representation Learning Authors: Alex Reneau, Jerry Yao-Chieh Hu, Zhongfang Zhuang, Ting-Chun Liu

  11. Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation Authors: Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee

  12. Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models Authors: Haichao Zhang, Zhuowei Li, Dimitris Metaxas, Yun Fu

  13. Structure Is Not Enough: Leveraging Behavior for Neural Network Weight Reconstruction Authors: Léo Meynent, Ivan Melev, Konstantin Schürholt, Göran Kauermann, Damian Borth

  14. Physics-Informed Deep B-Spline Networks for Dynamical Systems Authors: Zhuoyuan Wang, Raffaele Romagnoli, Jasmine Ratchford, Yorie Nakahira

  15. Ordered Topological Deep Learning: a Network Modeling Case Study Authors: Guillermo Bernárdez, Miquel Ferriol-Galmés, Carlos Güemes-Palau, Mathilde Papillon, Pere Barlet-Ros, Albert Cabellos-Aparicio, Nina Miolane

  16. A Learnability Analysis on Neuro-Symbolic Learning Authors: Hao-Yuan He, Ming Li

  17. Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement Authors: Shu Yang, Chengting Yu, Lei Liu, Hanzhi Ma, Aili Wang, Erping Li

  18. KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference Authors: Huan Yang, Renji Zhang, Deyu Zhang

  19. Neural-Guided Equation Discovery Authors: Jannis Brugger, Mattia Cerrato, David Richter, Cedric Derstroff, Daniel Maninger, Mira Mezini, Stefan Kramer

  20. PRIOT: Pruning-Based Integer-Only Transfer Learning for Embedded Systems Authors: Honoka Anada, Sefutsu Ryu, Masayuki Usui, Tatsuya Kaneko, Shinya Takamaeda-Yamazaki

  21. Token-Level Uncertainty-Aware Objective for Language Model Post-Training Authors: Tingkai Liu, Ari S. Benjamin, Anthony M. Zador

  22. An Accelerated Bregman Algorithm for ReLU-based Symmetric Matrix Decomposition Authors: Qingsong Wang

  23. Gene42: Long-Range Genomic Foundation Model With Dense Attention Authors: Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, Shadab Khan

  24. SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche

  25. Model-free front-to-end training of a large high performance laser neural network Authors: Anas Skalli, Satoshi Sunada, Mirko Goldmann, Marcin Gebski, Stephan Reitzenstein, James A. Lott, Tomasz Czyszanowski, Daniel Brunner

  26. Efficient Training of Neural Fractional-Order Differential Equation via Adjoint Backpropagation Authors: Qiyu Kang, Xuhao Li, Kai Zhao, Wenjun Cui, Yanan Zhao, Weihua Deng, Wee Peng Tay

  27. Do regularization methods for shortcut mitigation work as intended? Authors: Haoyang Hong, Ioanna Papanikolaou, Sonali Parbhoo

  28. Rethinking the Role of Spatial Mixing Authors: George Cazenavette, Joel Julin, Simon Lucey


1. Offline Model-Based Optimization: Comprehensive Review

ArXiv ID: 2503.17286

Authors: Minsu Kim, Jiayao Gu, Ye Yuan, Taeyoung Yun, Zixuan Liu, Yoshua Bengio, Can Chen

Abstract: Offline optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and beyond. The main difficulty lies in accurately estimating the objective landscape beyond the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking (reward hacking), exploiting model inaccuracies in unseen regions, or other spurious optimizations that yield misleadingly high performance estimates outside the training distribution. Recent advances in model-based optimization (MBO) have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out-of-distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single-objective and multi-objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out-of-distribution regions, and generative modeling, which explores high-dimensional design spaces to identify high-performing designs. Finally, we examine the key challenges and propose promising directions for advancement in this rapidly evolving field, including safe control of superintelligent systems.

Comment: Author match


2. Large Language Model Compression via the Nested Activation-Aware Decomposition

ArXiv ID: 2503.17101

Authors: Jun Lu, Tianyi Xu, Bill Ding, David Li, Yu Kang

Abstract: In this paper, we tackle the critical challenge of compressing large language models (LLMs) to facilitate their practical deployment and broader adoption. We introduce a novel post-training compression paradigm that focuses on low-rank decomposition of LLM weights. Our analysis identifies two main challenges in this task: the variability in LLM activation distributions and handling unseen activations from different datasets and models. To address these challenges, we propose a nested activation-aware framework (NSVD) for LLMs, a training-free approach designed to enhance the accuracy of low-rank decompositions by managing activation outliers through transforming the weight matrix based on activation distribution and the original weight matrix. This method allows for the absorption of outliers into the transformed weight matrix, improving decomposition accuracy. Our comprehensive evaluation across eight datasets and six models from three distinct LLM families demonstrates the superiority of NSVD over current state-of-the-art methods, especially at medium to large compression ratios or in multilingual and multitask settings.

Comment: The paper focuses on a novel low-rank decomposition method for compressing large language models (LLMs), which aligns closely with the 'Model Compression' criterion. The proposed nested activation-aware framework (NSVD) introduces a new approach to handle activation variability and outliers, making it a significant contribution to compression techniques.

Relevance: 10 Novelty: 8
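
To make the idea of activation-aware low-rank decomposition in the abstract above more concrete, here is a minimal NumPy sketch. It is not the NSVD algorithm itself, whose details the abstract does not give. It shows the generic whitening trick: factor `W @ S` instead of `W`, where `S` is a Cholesky factor of the activation Gram matrix, so that the truncation error is measured on the activations. The function names, the calibration matrix `X`, and the `eps` regularizer are all assumptions made for illustration.

```python
import numpy as np

def truncated_svd(W, rank):
    """Best rank-`rank` approximation of W in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def activation_aware_low_rank(W, X, rank, eps=1e-8):
    """Rank-`rank` approximation of W chosen to keep ||W X - W_hat X||_F small.

    W: (out, in) weight matrix; X: (in, n) calibration activations (assumed given).
    """
    gram = X @ X.T + eps * np.eye(X.shape[0])   # (in, in) activation Gram matrix
    S = np.linalg.cholesky(gram)                # gram = S @ S.T
    WS_k = truncated_svd(W @ S, rank)           # truncate in the transformed space
    return np.linalg.solve(S.T, WS_k.T).T       # map back: WS_k @ inv(S)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
X = np.logspace(0, 2, 32)[:, None] * rng.normal(size=(32, 256))  # outlier channels
for name, W_hat in [("plain SVD", truncated_svd(W, 8)),
                    ("activation-aware", activation_aware_low_rank(W, X, 8))]:
    print(name, np.linalg.norm(W @ X - W_hat @ X))
```

Because `||(W - W_hat) X||_F` equals `||(W - W_hat) S||_F` when `S @ S.T = X @ X.T`, truncating `W @ S` is optimal for the output error on the calibration data, which is why outlier channels are handled better than with a plain SVD of `W`.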


3. Malliavin-Bismut Score-based Diffusion Models

ArXiv ID: 2503.16917

Authors: Ehsan Mirafzali, Utkarsh Gupta, Patrick Wyrod, Frank Proske, Daniele Venturi, Razvan Marinescu

Abstract: We introduce a new framework that employs Malliavin calculus to derive explicit expressions for the score function -- i.e., the gradient of the log-density -- associated with solutions to stochastic differential equations (SDEs). Our approach integrates classical integration-by-parts techniques with modern tools, such as Bismut's formula and Malliavin calculus, to address linear and nonlinear SDEs. In doing so, we establish a rigorous connection between the Malliavin derivative, its adjoint (the Malliavin divergence or the Skorokhod integral), Bismut's formula, and diffusion generative models, thus providing a systematic method for computing $\nabla \log p_t(x)$. For the linear case, we present a detailed study proving that our formula is equivalent to the actual score function derived from the solution of the Fokker--Planck equation for linear SDEs. Additionally, we derive a closed-form expression for $\nabla \log p_t(x)$ for nonlinear SDEs with state-independent diffusion coefficients. These advancements provide fresh theoretical insights into the smoothness and structure of probability densities and practical implications for score-based generative modelling, including the design and analysis of new diffusion models. Moreover, our findings promote the adoption of the robust Malliavin calculus framework in machine learning research. These results directly apply to various pure and applied mathematics fields, such as generative modelling, the study of SDEs driven by fractional Brownian motion, and the Fokker--Planck equations associated with nonlinear SDEs.

Comment: The paper introduces a novel theoretical framework using Malliavin calculus for score-based diffusion models, which aligns with foundational research in generative modeling.

Relevance: 9 Novelty: 9
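
As a reference point for the linear case mentioned in the abstract above, the score of a scalar Ornstein-Uhlenbeck process can be written down directly from its Gaussian transition density. This worked example is an illustration chosen here, not a formula taken from the paper; the symbols $\theta$, $\sigma$, $x_0$, $m_t$ and $v_t$ are assumptions of the example.

```latex
% Scalar OU process started at a fixed point x_0:
%   dX_t = -\theta X_t \, dt + \sigma \, dW_t, \qquad X_0 = x_0, \quad \theta > 0.
% Its law at time t is Gaussian, X_t \sim \mathcal{N}(m_t, v_t), with
\[
  m_t = x_0 e^{-\theta t}, \qquad
  v_t = \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta t}\right),
\]
% so the score is
\[
  \nabla \log p_t(x) = -\frac{x - m_t}{v_t},
\]
% and p_t solves the Fokker--Planck equation
\[
  \partial_t p_t(x) = \theta \, \partial_x \bigl( x \, p_t(x) \bigr)
                      + \frac{\sigma^2}{2} \, \partial_x^2 p_t(x).
\]
```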


4. Exploring a Principled Framework for Deep Subspace Clustering

ArXiv ID: 2503.17288

Authors: Xianghan Meng, Zhiyuan Huang, Wei He, Xianbiao Qi, Rong Xiao, Chun-Guang Li

Abstract: Subspace clustering is a classical unsupervised learning task, built on a basic assumption that high-dimensional data can be approximated by a union of subspaces (UoS). Nevertheless, real-world data often deviate from the UoS assumption. To address this challenge, state-of-the-art deep subspace clustering algorithms attempt to jointly learn UoS representations and self-expressive coefficients. However, the general framework of the existing algorithms suffers from catastrophic feature collapse and lacks a theoretical guarantee to learn the desired UoS representation. In this paper, we present a Principled fRamewOrk for Deep Subspace Clustering (PRO-DSC), which is designed to learn structured representations and self-expressive coefficients in a unified manner. Specifically, in PRO-DSC, we incorporate an effective regularization on the learned representations into the self-expressive model, prove that the regularized self-expressive model is able to prevent feature space collapse, and demonstrate that the learned optimal representations under certain conditions lie on a union of orthogonal subspaces. Moreover, we provide a scalable and efficient approach to implement our PRO-DSC and conduct extensive experiments to verify our theoretical findings and demonstrate the superior performance of our proposed deep subspace clustering approach. The code is available at https://github.com/mengxianghan123/PRO-DSC.

Comment: The paper presents a principled framework for deep subspace clustering, addressing feature collapse and providing theoretical guarantees. This aligns with representation learning and foundational clustering methods.

Relevance: 9 Novelty: 9


5. SuperARC: A Test for General and Super Intelligence Based on First Principles of Recursion Theory and Algorithmic Probability

ArXiv ID: 2503.16743

Authors: Alberto Hernández-Espinosa, Luan Ozelim, Felipe S. Abrahão, Hector Zenil

Abstract: We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models in the context of their Artificial General Intelligence (AGI) and Superintelligence (ASI) claims. Unlike other tests, this test does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity. The test challenges aspects related to features of intelligence of fundamental nature such as synthesis and model creation in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and optimal Bayesian inference for planning can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. Our results show no clear evidence of LLM convergence towards a defined level of intelligence, particularly AGI or ASI. We found that LLM model versions tend to be fragile and incremental, as new versions may perform worse than older ones, with progress largely driven by the size of training data. The results were compared with a hybrid neurosymbolic approach that theoretically guarantees model convergence from optimal inference based on the principles of algorithmic probability and Kolmogorov complexity. The method outperforms LLMs in a proof-of-concept on short binary sequences. Our findings confirm suspicions regarding the fundamental limitations of LLMs, exposing them as systems optimised for the perception of mastery over human language. Progress among different LLM versions from the same developers was found to be inconsistent and limited, particularly in the absence of a solid symbolic counterpart.

Comment: The paper introduces a test for AGI and ASI based on algorithmic probability, which challenges established assumptions and aligns with emerging trends in foundational AI research.

Relevance: 9 Novelty: 9


6. Glivenko-Cantelli for $f$-divergence

ArXiv ID: 2503.17355

Authors: Haoming Wang, Lek-Heng Lim

Abstract: We extend the celebrated Glivenko-Cantelli theorem, sometimes called the fundamental theorem of statistics, from its standard setting of total variation distance to all $f$-divergences. A key obstacle in this endeavor is to define $f$-divergence on a subcollection of a $\sigma$-algebra that forms a $\pi$-system but not a $\sigma$-subalgebra. This is a side contribution of our work. We will show that this notion of $f$-divergence on the $\pi$-system of rays preserves nearly all known properties of standard $f$-divergence, yields a novel integral representation of the Kolmogorov-Smirnov distance, and has a Glivenko-Cantelli theorem.

Comment: The paper extends the Glivenko-Cantelli theorem to f-divergences, which is a cutting-edge theoretical contribution and aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 9
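
For readers who want a numerical feel for the classical statement that the abstract above generalizes, the sketch below computes the Kolmogorov-Smirnov distance (a supremum over rays) between an empirical CDF and the true Uniform(0, 1) CDF. It covers only the classical theorem, not the paper's $f$-divergence extension. The function name and the choice of the uniform distribution are assumptions made for illustration.

```python
import numpy as np

def ks_distance_to_uniform(samples):
    """sup_x |F_n(x) - F(x)| where F is the Uniform(0, 1) CDF."""
    x = np.sort(samples)
    n = len(x)
    upper = np.arange(1, n + 1) / n - x   # F_n at each sample minus F
    lower = x - np.arange(0, n) / n       # F minus F_n just before each sample
    return max(upper.max(), lower.max())

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    print(n, ks_distance_to_uniform(rng.uniform(size=n)))
```

The printed distances should shrink as `n` grows, which is the almost-sure uniform convergence the classical theorem asserts.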


7. Accelerating Transformer Inference and Training with 2:4 Activation Sparsity

ArXiv ID: 2503.16672

Authors: Daniel Haziza, Timothy Chou, Dhruv Choudhary, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut, Jesse Cai

Abstract: In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.

Comment: The paper explores activation sparsity in Transformers for efficiency, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8
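
The 2:4 pattern named in the abstract above means that at most two of every four consecutive values are nonzero. The NumPy sketch below shows what that constraint looks like on Squared-ReLU activations. It simply keeps the two largest-magnitude entries per group of four. How the authors treat groups with more than two nonzeros is not stated in the abstract, so this rule and the function names are assumptions.

```python
import numpy as np

def squared_relu(x):
    """Squared-ReLU activation: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def prune_2_of_4(x):
    """Keep the 2 largest-magnitude entries in every group of 4 along the last axis."""
    assert x.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = x.reshape(-1, 4)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :2]   # indices of the top-2 per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return np.where(mask, groups, 0.0).reshape(x.shape)

rng = np.random.default_rng(0)
acts = squared_relu(rng.normal(size=(2, 8)))
sparse_acts = prune_2_of_4(acts)
print(acts.round(2))
print(sparse_acts.round(2))
```

Since ReLU-style activations already zero out roughly half of their inputs, many groups satisfy the pattern before any pruning, which is the intrinsic sparsity the abstract refers to.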


8. Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

ArXiv ID: 2503.16870

Authors: Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee

Abstract: Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method 'Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.

Comment: The paper proposes a sparse logit sampling method for knowledge distillation in LLMs, which aligns with model compression and efficiency. The use of importance sampling for unbiased estimates is a novel contribution.

Relevance: 9 Novelty: 8
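
The contrast drawn in the abstract above, between biased Top-K caching and an unbiased sampling-based target, can be checked on a toy distribution. The sketch below is an illustration under assumptions, not the paper's procedure: it samples token ids from the teacher distribution itself, so every draw carries weight `1/n_samples`. The teacher probabilities `p` and both function names are made up for the example.

```python
import numpy as np

def top_k_target(p, k):
    """Keep the k largest teacher probabilities and renormalize (biased)."""
    idx = np.argsort(p)[-k:]
    q = np.zeros_like(p)
    q[idx] = p[idx]
    return q / q.sum()

def sampled_target(p, n_samples, rng):
    """Sparse target from n_samples draws of the teacher distribution.

    At most n_samples entries are nonzero, and the expectation equals p.
    """
    ids = rng.choice(len(p), size=n_samples, p=p)
    return np.bincount(ids, minlength=len(p)) / n_samples

rng = np.random.default_rng(0)
p = np.array([0.4, 0.25, 0.15, 0.1, 0.05, 0.03, 0.02])   # toy teacher distribution
mean_sampled = np.mean([sampled_target(p, 4, rng) for _ in range(20_000)], axis=0)
print("teacher     ", p)
print("top-2       ", top_k_target(p, 2).round(3))
print("mean sampled", mean_sampled.round(3))
```

Averaged over many draws, the sampled target approaches the teacher distribution, while the Top-K target puts all of its mass on the head tokens no matter how often it is recomputed.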


9. Nonparametric Factor Analysis and Beyond

ArXiv ID: 2503.16865

Authors: Yujia Zheng, Yang Liu, Jiaxiong Yao, Yingyao Hu, Kun Zhang

Abstract: Nearly all identifiability results in unsupervised representation learning inspired by, e.g., independent component analysis, factor analysis, and causal representation learning, rely on assumptions of additive independent noise or noiseless regimes. In contrast, we study the more general case where noise can take arbitrary forms, depend on latent variables, and be non-invertibly entangled within a nonlinear function. We propose a general framework for identifying latent variables in the nonparametric noisy settings. We first show that, under suitable conditions, the generative model is identifiable up to certain submanifold indeterminacies even in the presence of non-negligible noise. Furthermore, under the structural or distributional variability conditions, we prove that latent variables of the general nonlinear models are identifiable up to trivial indeterminacies. Based on the proposed theoretical framework, we have also developed corresponding estimation methods and validated them in various synthetic and real-world settings. Interestingly, our estimate of the true GDP growth from alternative measurements suggests more insightful information on the economies than official reports. We expect our framework to provide new insight into how both researchers and practitioners deal with latent variables in real-world scenarios.

Comment: The paper provides a theoretical framework for identifying latent variables in noisy settings, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8


10. NdLinear Is All You Need for Representation Learning

ArXiv ID: 2503.17353

Authors: Alex Reneau, Jerry Yao-Chieh Hu, Zhongfang Zhuang, Ting-Chun Liu

Abstract: Many high-impact machine learning tasks involve multi-dimensional data (e.g., images, volumetric medical scans, multivariate time-series). Yet, most neural architectures flatten inputs, discarding critical cross-dimension information. We introduce NdLinear, a novel linear transformation that preserves these structures without extra overhead. By operating separately along each dimension, NdLinear captures dependencies that standard fully connected layers overlook. Extensive experiments across convolutional, recurrent, and transformer-based networks show significant improvements in representational power and parameter efficiency. Crucially, NdLinear serves as a foundational building block for large-scale foundation models by operating on any unimodal or multimodal data in its native form. This removes the need for flattening or modality-specific preprocessing. NdLinear rethinks core architectural priorities beyond attention, enabling more expressive, context-aware models at scale. We propose NdLinear as a drop-in replacement for standard linear layers -- marking an important step toward next-generation neural architectures.

Comment: The paper introduces NdLinear, a novel linear transformation for preserving multi-dimensional data structures, which aligns with foundational research in representation learning and architectural innovations.

Relevance: 9 Novelty: 8
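
One plausible reading of "operating separately along each dimension" in the abstract above is a separate linear map applied along each non-batch axis, as sketched below in NumPy. This is an interpretation rather than the authors' layer: bias terms, initialization and any ordering details are omitted, and the function name and shapes are assumptions.

```python
import numpy as np

def nd_linear(x, weights):
    """Apply one linear map per non-batch axis of x.

    x: (batch, d1, ..., dn); weights[i] has shape (d_{i+1}, h_{i+1}).
    Returns an array of shape (batch, h1, ..., hn).
    """
    out = x
    for axis, w in enumerate(weights, start=1):
        # Contract axis `axis` with w, then move the new axis back into place.
        out = np.moveaxis(np.tensordot(out, w, axes=([axis], [0])), -1, axis)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 5))
weights = [rng.normal(size=(3, 6)), rng.normal(size=(4, 7)), rng.normal(size=(5, 8))]
y = nd_linear(x, weights)
print(y.shape, sum(w.size for w in weights))
```

For these shapes the three factors hold 3*6 + 4*7 + 5*8 = 86 weights, whereas a dense layer on the flattened input would need (3*4*5) * (6*7*8) = 20160, which illustrates the kind of parameter saving the abstract describes.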


11. Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation

ArXiv ID: 2503.17361

Authors: Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee

Abstract: Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel-Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel-Softmax interpolant with a time-dependent temperature. Using this interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel-Softmax Score Matching which learns to regress the gradient of the probability density. Our framework enables high-quality, diverse generation and scales efficiently to higher-dimensional simplices. To enable training-free guidance, we propose Straight-Through Guided Flows (STGFlow), a classifier-based guidance method that leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state-of-the-art performance in conditional DNA promoter design, sequence-only protein generation, and target-binding peptide design for rare disease treatment.

Comment: The paper introduces a novel generative framework for biological sequence generation, which aligns with foundational research in AI for Science.

Relevance: 8 Novelty: 8
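
The building block behind the interpolant in the abstract above is the standard Gumbel-Softmax relaxation, which yields a point on the simplex that sharpens toward a vertex as the temperature drops. The sketch below shows only that relaxation. The geometric schedule in `temperature_at` is a hypothetical stand-in, since the abstract does not specify the paper's time-dependent temperature.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Relaxed categorical sample on the probability simplex."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    z = (logits + g) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_at(t, tau_max=10.0, tau_min=0.1):
    """Hypothetical schedule: geometric decay from tau_max (t=0) to tau_min (t=1)."""
    return tau_max * (tau_min / tau_max) ** t

logits = np.log(np.array([0.5, 0.3, 0.2]))
for t in (0.0, 0.5, 1.0):
    sample = gumbel_softmax(logits, temperature_at(t), np.random.default_rng(0))
    print(t, sample.round(3))
```

Reusing the same seed for each `t` keeps the noise fixed, so the printout isolates the effect of the temperature: near-uniform at `t = 0` and close to one-hot at `t = 1`.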


12. Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models

ArXiv ID: 2503.16980

Authors: Haichao Zhang, Zhuowei Li, Dimitris Metaxas, Yun Fu

Abstract: Token-based video representation has emerged as a promising approach for enabling large language models to interpret video content. However, existing token reduction techniques, such as token pruning and token merging, often disrupt essential spatial-temporal positional embeddings, failing to adequately balance computational efficiency with fewer tokens. Consequently, these methods result in relatively lengthy token sequences, limiting their applicability in scenarios requiring extreme token compression, such as video large language models. In this paper, we introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens. To address this challenge, we propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence. Specifically, we disentangle video representations by separating visual embeddings from grid-level motion information, structuring them into: 1. a concise token base, created by clustering tokens that describe object-level content; 2. a token dynamics map, capturing detailed spatial-temporal motion patterns across grids. Furthermore, we introduce a cross-dynamics attention mechanism that integrates motion features into the token base without increasing token length, thereby maintaining compactness and spatial-temporal integrity. The experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%. Additionally, we propose two novel subtasks within extreme token reduction (fixed-length and adaptive-length compression), both effectively representing long token sequences for video-language tasks. Our method offers significantly lower theoretical complexity, fewer tokens, and enhanced throughput, thus providing an efficient solution for video LLMs.

Comment: The paper introduces a novel token reduction framework for video representation in large language models, which aligns with architectural innovations and efficiency improvements. The focus on extreme token reduction is a promising direction.

Relevance: 8 Novelty: 8


13. Structure Is Not Enough: Leveraging Behavior for Neural Network Weight Reconstruction

ArXiv ID: 2503.17138

Authors: Léo Meynent, Ivan Melev, Konstantin Schürholt, Göran Kauermann, Damian Borth

Abstract: The weights of neural networks (NNs) have recently gained prominence as a new data modality in machine learning, with applications ranging from accuracy and hyperparameter prediction to representation learning or weight generation. One approach to leverage NN weights involves training autoencoders (AEs), using contrastive and reconstruction losses. This allows such models to be applied to a wide variety of downstream tasks, and they demonstrate strong predictive performance and low reconstruction error. However, despite the low reconstruction error, these AEs reconstruct NN models with deteriorated performance compared to the original ones, limiting their usability with regard to model weight generation. In this paper, we identify a limitation of weight-space AEs, specifically highlighting that a structural loss, that uses the Euclidean distance between original and reconstructed weights, fails to capture some features critical for reconstructing high-performing models. We analyze the addition of a behavioral loss for training AEs in weight space, where we compare the output of the reconstructed model with that of the original one, given some common input. We show a strong synergy between structural and behavioral signals, leading to increased performance in all downstream tasks evaluated, in particular NN weights reconstruction and generation.

Comment: The paper introduces a behavioral loss for neural network weight reconstruction, which aligns with representation learning and autoencoders. The focus on combining structural and behavioral signals is a novel approach.

Relevance: 8 Novelty: 8
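
The two loss terms described in the abstract above can be sketched for a tiny MLP: a structural loss on the raw weights and a behavioral loss on model outputs for shared probe inputs. The architecture, the probe inputs and the function names are assumptions for illustration, and the hidden-unit permutation is an example chosen here (not a claim from the paper) to show how the two signals can disagree.

```python
import numpy as np

def mlp(params, x):
    """Two-layer ReLU network; params = (w1, b1, w2, b2)."""
    w1, b1, w2, b2 = params
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def structural_loss(params_a, params_b):
    """Mean squared Euclidean distance between flattened weights."""
    a = np.concatenate([p.ravel() for p in params_a])
    b = np.concatenate([p.ravel() for p in params_b])
    return np.mean((a - b) ** 2)

def behavioral_loss(params_a, params_b, probe_inputs):
    """Mean squared difference between the two models' outputs on shared inputs."""
    return np.mean((mlp(params_a, probe_inputs) - mlp(params_b, probe_inputs)) ** 2)

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
w2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
original = (w1, b1, w2, b2)
perm = np.arange(8)[::-1]                              # reorder the hidden units
permuted = (w1[:, perm], b1[perm], w2[perm, :], b2)    # same function, different weights
probes = rng.normal(size=(32, 4))
print("structural:", structural_loss(original, permuted))
print("behavioral:", behavioral_loss(original, permuted, probes))
```

The permuted network computes the same function, so its behavioral loss is zero up to rounding while its structural loss is large. This is one simple way to see that weight-space distance and functional similarity carry different information.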


14. Physics-Informed Deep B-Spline Networks for Dynamical Systems

ArXiv ID: 2503.16777

Authors: Zhuoyuan Wang, Raffaele Romagnoli, Jasmine Ratchford, Yorie Nakahira

Abstract: Physics-informed machine learning provides an approach to combining data and governing physics laws for solving complex partial differential equations (PDEs). However, efficiently solving PDEs with varying parameters and changing initial conditions and boundary conditions (ICBCs) with theoretical guarantees remains an open challenge. We propose a hybrid framework that uses a neural network to learn B-spline control points to approximate solutions to PDEs with varying system and ICBC parameters. The proposed network can be trained efficiently as one can directly specify ICBCs without imposing losses, calculate physics-informed loss functions through analytical formulas, and requires only learning the weights of B-spline functions as opposed to both weights and basis as in traditional neural operator learning methods. We provide theoretical guarantees that the proposed B-spline networks serve as universal approximators for the set of solutions of PDEs with varying ICBCs under mild conditions and establish bounds on the generalization errors in physics-informed learning. We also demonstrate in experiments that the proposed B-spline network can solve problems with discontinuous ICBCs and outperforms existing methods, and is able to learn solutions of 3D dynamics with diverse initial conditions.

Comment: The paper proposes a hybrid framework using B-spline networks for solving PDEs, which is relevant to AI for science and introduces theoretical guarantees, making it foundational.

Relevance: 8 Novelty: 8
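
The remark in the abstract above that boundary conditions can be specified directly, without a loss term, follows from a property of clamped B-splines: the curve passes through its first and last control points. The 1D sketch below pins those two control points and leaves the interior ones free, standing in for values a network would predict. The function names and the uniform clamped knot vector are assumptions, and the paper's network and multi-dimensional construction are not reproduced.

```python
import numpy as np

def clamped_knots(n_control, degree):
    """Uniform clamped knot vector on [0, 1]."""
    interior = np.linspace(0.0, 1.0, n_control - degree + 1)[1:-1]
    return np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])

def bspline_basis(t, knots, degree):
    """All B-spline basis functions at scalar t (Cox-de Boor recursion)."""
    n_basis = len(knots) - degree - 1
    b = np.zeros(len(knots) - 1)
    for i in range(len(knots) - 1):          # degree-0 indicators on half-open spans
        if knots[i] <= t < knots[i + 1]:
            b[i] = 1.0
    if t == knots[-1]:                       # close the last non-empty span
        b[n_basis - 1] = 1.0
    for d in range(1, degree + 1):
        nxt = np.zeros(len(knots) - 1 - d)
        for i in range(len(nxt)):
            left = knots[i + d] - knots[i]
            right = knots[i + d + 1] - knots[i + 1]
            if left > 0:
                nxt[i] += (t - knots[i]) / left * b[i]
            if right > 0:
                nxt[i] += (knots[i + d + 1] - t) / right * b[i + 1]
        b = nxt
    return b

def spline_with_fixed_ends(free_coeffs, left_value, right_value, t, degree=3):
    """Evaluate a spline whose end control points are pinned to boundary values."""
    control = np.concatenate([[left_value], free_coeffs, [right_value]])
    knots = clamped_knots(len(control), degree)
    return bspline_basis(t, knots, degree) @ control

free = np.array([0.3, -1.2, 0.7, 2.0])      # stand-in for network-predicted coefficients
for t in (0.0, 0.5, 1.0):
    print(t, spline_with_fixed_ends(free, left_value=1.0, right_value=-1.0, t=t))
```

Whatever values the free coefficients take, the spline equals `left_value` at `t = 0` and `right_value` at `t = 1`, so only the physics residual would need to be learned.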


15. Ordered Topological Deep Learning: a Network Modeling Case Study

ArXiv ID: 2503.16746

Authors: Guillermo Bernárdez, Miquel Ferriol-Galmés, Carlos Güemes-Palau, Mathilde Papillon, Pere Barlet-Ros, Albert Cabellos-Aparicio, Nina Miolane

Abstract: Computer networks are the foundation of modern digital infrastructure, facilitating global communication and data exchange. As demand for reliable high-bandwidth connectivity grows, advanced network modeling techniques become increasingly essential to optimize performance and predict network behavior. Traditional modeling methods, such as packet-level simulators and queueing theory, have notable limitations -- either being computationally expensive or relying on restrictive assumptions that reduce accuracy. In this context, the deep learning-based RouteNet family of models has recently redefined network modeling by showing an unprecedented cost-performance trade-off. In this work, we revisit RouteNet's sophisticated design and uncover its hidden connection to Topological Deep Learning (TDL), an emerging field that models higher-order interactions beyond standard graph-based methods. We demonstrate that, although originally formulated as a heterogeneous Graph Neural Network, RouteNet serves as the first instantiation of a new form of TDL. More specifically, this paper presents OrdGCCN, a novel TDL framework that introduces the notion of ordered neighbors in arbitrary discrete topological spaces, and shows that RouteNet's architecture can be naturally described as an ordered topological neural network. To the best of our knowledge, this marks the first successful real-world application of state-of-the-art TDL principles -- which we confirm through extensive testbed experiments -- laying the foundation for the next generation of ordered TDL-driven applications.

Comment: The paper introduces a novel topological deep learning framework, which aligns with architectural innovations and emerging trends.

Relevance: 8 Novelty: 8


16. A Learnability Analysis on Neuro-Symbolic Learning

ArXiv ID: 2503.16797

Authors: Hao-Yuan He, Ming Li

Abstract: This paper analyzes the learnability of neuro-symbolic (NeSy) tasks within hybrid systems. We show that the learnability of NeSy tasks can be characterized by their derived constraint satisfaction problems (DCSPs). Specifically, a task is learnable if the corresponding DCSP has a unique solution; otherwise, it is unlearnable. For learnable tasks, we establish error bounds by exploiting the clustering property of the hypothesis space. Additionally, we analyze the asymptotic error for general NeSy tasks, showing that the expected error scales with the disagreement among solutions. Our results offer a principled approach to determining learnability and provide insights into the design of new algorithms.

Comment: The paper provides a learnability analysis for neuro-symbolic tasks, which aligns with foundational research in representation learning and theoretical insights.

Relevance: 8 Novelty: 7


17. Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement

ArXiv ID: 2503.16572

Authors: Shu Yang, Chengting Yu, Lei Liu, Hanzhi Ma, Aili Wang, Erping Li

Abstract: Spiking Neural Networks (SNNs) have garnered considerable attention as a potential alternative to Artificial Neural Networks (ANNs). Recent studies have highlighted SNNs' potential on large-scale datasets. For SNN training, two main approaches exist: direct training and ANN-to-SNN (ANN2SNN) conversion. To fully leverage existing ANN models in guiding SNN learning, either direct ANN-to-SNN conversion or ANN-SNN distillation training can be employed. In this paper, we propose an ANN-SNN distillation framework from the ANN-to-SNN perspective, designed with a block-wise replacement strategy for ANN-guided learning. By generating intermediate hybrid models that progressively align SNN feature spaces to those of ANN through rate-based features, our framework naturally incorporates rate-based backpropagation as a training method. Our approach achieves results comparable to or better than state-of-the-art SNN distillation methods, showing both training and learning efficiency.

Comment: The paper proposes a novel ANN-SNN distillation framework with a block-wise replacement strategy, which aligns with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7


18. KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference

ArXiv ID: 2503.16525

Authors: Huan Yang, Renji Zhang, Deyu Zhang

Abstract: This paper presents KVShare, a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity, designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Addressing the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity), KVShare achieves fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets demonstrate that KVShare improves KV cache hit rates by over 60%, while maintaining output quality comparable to full computation (no significant degradation in BLEU and Rouge-L metrics). This approach effectively reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.

Comment: The paper introduces KVShare for efficient LLM inference, which aligns with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7


19. Neural-Guided Equation Discovery

ArXiv ID: 2503.16953

Authors: Jannis Brugger, Mattia Cerrato, David Richter, Cedric Derstroff, Daniel Maninger, Mira Mezini, Stefan Kramer

Abstract: Deep learning approaches are becoming increasingly attractive for equation discovery. We show the advantages and disadvantages of using neural-guided equation discovery by giving an overview of recent papers and the results of experiments using our modular equation discovery system MGMT ($\textbf{M}$ulti-Task $\textbf{G}$rammar-Guided $\textbf{M}$onte-Carlo $\textbf{T}$ree Search for Equation Discovery). The system uses neural-guided Monte-Carlo Tree Search (MCTS) and supports both supervised and reinforcement learning, with a search space defined by a context-free grammar. We summarize seven desirable properties of equation discovery systems, emphasizing the importance of embedding tabular data sets for such learning approaches. Using the modular structure of MGMT, we compare seven architectures (among them, RNNs, CNNs, and Transformers) for embedding tabular datasets on the auxiliary task of contrastive learning for tabular data sets on an equation discovery task. For almost all combinations of modules, supervised learning outperforms reinforcement learning. Moreover, our experiments indicate an advantage of using grammar rules as action space instead of tokens. Two adaptations of MCTS -- risk-seeking MCTS and AmEx-MCTS -- can improve equation discovery with that kind of search.

Comment: The paper explores neural-guided equation discovery, which aligns with emerging trends in foundational AI research. It introduces a modular system and compares architectures, making it relevant to representation learning.

Relevance: 8 Novelty: 7


20. PRIOT: Pruning-Based Integer-Only Transfer Learning for Embedded Systems

ArXiv ID: 2503.16860

Authors: Honoka Anada, Sefutsu Ryu, Masayuki Usui, Tatsuya Kaneko, Shinya Takamaeda-Yamazaki

Abstract: On-device transfer learning is crucial for adapting a common backbone model to the unique environment of each edge device. Tiny microcontrollers, such as the Raspberry Pi Pico, are key targets for on-device learning but often lack floating-point units, necessitating integer-only training. Dynamic computation of quantization scale factors, which is adopted in former studies, incurs high computational costs. Therefore, this study focuses on integer-only training with static scale factors, which is challenging with existing training methods. We propose a new training method named PRIOT, which optimizes the network by pruning selected edges rather than updating weights, allowing effective training with static scale factors. The pruning pattern is determined by the edge-popup algorithm, which trains a parameter named score assigned to each edge instead of the original parameters and prunes the edges with low scores before inference. Additionally, we introduce a memory-efficient variant, PRIOT-S, which only assigns scores to a small fraction of edges. We implement PRIOT and PRIOT-S on the Raspberry Pi Pico and evaluate their accuracy and computational costs using a tiny CNN model on the rotated MNIST dataset and the VGG11 model on the rotated CIFAR-10 dataset. Our results demonstrate that PRIOT improves accuracy by 8.08 to 33.75 percentage points over existing methods, while PRIOT-S reduces memory footprint with minimal accuracy loss.

Comment: The paper introduces a pruning-based integer-only training method for embedded systems, which aligns with model compression topics like pruning and quantization, making it relevant to foundational research.

Relevance: 8 Novelty: 7
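
The pruning step described in the PRIOT abstract above (keep the edges with high scores, drop the rest, and leave the weights untouched) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_fraction` parameter, and all numeric values are hypothetical, and the training of the scores themselves is left out entirely.

```python
def top_k_mask(scores, keep_fraction):
    """Return a 0/1 mask that keeps the edges with the highest scores."""
    n_keep = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


def pruned_dot(weights, mask, inputs):
    """Integer-only dot product over the edges that survive pruning."""
    return sum(w * m * x for w, m, x in zip(weights, mask, inputs))


# Hypothetical values for illustration only.
weights = [3, -2, 5, 1]   # frozen integer weights, never updated
scores = [10, -4, 7, 0]   # one integer score per edge
inputs = [1, 2, 3, 4]

mask = top_k_mask(scores, keep_fraction=0.5)
output = pruned_dot(weights, mask, inputs)
```

Because only the mask changes, the weights and any scale factors attached to them can stay fixed, which is the property the abstract relies on for static scale factors.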


21. Token-Level Uncertainty-Aware Objective for Language Model Post-Training

ArXiv ID: 2503.16511

Authors: Tingkai Liu, Ari S. Benjamin, Anthony M. Zador

Abstract: In the current work, we connect token-level uncertainty in causal language modeling to two types of training objectives: 1) masked maximum likelihood (MLE), 2) self-distillation. We show that masked MLE is effective in reducing epistemic uncertainty, and serves as an effective token-level automatic curriculum learning technique. However, masked MLE is prone to overfitting and requires self-distillation regularization to improve or maintain performance on out-of-distribution tasks. We demonstrate significant performance gains via the proposed training objective (combined masked MLE and self-distillation) across multiple architectures (Gemma, LLaMA, Phi) and datasets (Alpaca, ShareGPT, GSM8K), mitigating overfitting while maintaining adaptability during post-training. Our findings suggest that uncertainty-aware training provides an effective mechanism for enhancing language model training.

Comment: The paper proposes a token-level uncertainty-aware objective for language model post-training, which aligns with foundational research in LLM training dynamics and uncertainty modeling.

Relevance: 8 Novelty: 7


22. An Accelerated Bregman Algorithm for ReLU-based Symmetric Matrix Decomposition

ArXiv ID: 2503.16846

Authors: Qingsong Wang

Abstract: Symmetric matrix decomposition is an active research area in machine learning. This paper focuses on exploiting the low-rank structure of non-negative and sparse symmetric matrices via the rectified linear unit (ReLU) activation function. We propose the ReLU-based nonlinear symmetric matrix decomposition (ReLU-NSMD) model, introduce an accelerated alternating partial Bregman (AAPB) method for its solution, and present the algorithm's convergence results. Our algorithm leverages the Bregman proximal gradient framework to overcome the challenge of estimating the global $L$-smooth constant in the classic proximal gradient algorithm. Numerical experiments on synthetic and real datasets validate the effectiveness of our model and algorithm.

Comment: The paper focuses on low-rank structure and sparsity in symmetric matrix decomposition, which aligns with the model compression and representation learning criteria.

Relevance: 8 Novelty: 7


23. Gene42: Long-Range Genomic Foundation Model With Dense Attention

ArXiv ID: 2503.16565

Authors: Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, Shadab Khan

Abstract: We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.

Comment: The paper introduces a genomic foundation model with dense attention, which is relevant to foundational research in large language models and architecture innovations.

Relevance: 8 Novelty: 7


24. SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

ArXiv ID: 2503.17239

Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche

Abstract: Fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even for benign fine-tuning datasets. We address this challenge by proposing SafeMERGE, a post-fine-tuning framework that preserves safety while maintaining task utility. It achieves this by selectively merging fine-tuned and safety-aligned model layers only when those deviate from safe behavior, measured by a cosine similarity criterion. We evaluate SafeMERGE against other fine-tuning- and post-fine-tuning-stage approaches for Llama-2-7B-Chat and Qwen-2-7B-Instruct models on GSM8K and PubMedQA tasks while exploring different merging strategies. We find that SafeMERGE consistently reduces harmful outputs compared to other baselines without significantly sacrificing performance, sometimes even enhancing it. The results suggest that our selective, subspace-guided, and per-layer merging method provides an effective safeguard against the inadvertent loss of safety in fine-tuned LLMs while outperforming simpler post-fine-tuning-stage defenses.

Comment: The paper introduces a selective layer-wise merging method for fine-tuned LLMs, which aligns with foundational research in large language models and safety alignment.

Relevance: 8 Novelty: 7
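
The selective, per-layer merging idea in the SafeMERGE abstract above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's method: it compares flattened layer weights directly with cosine similarity and interpolates linearly, whereas the paper describes a subspace-guided criterion whose details are not given in the abstract. The names `selective_merge`, `threshold`, and `alpha`, and all weight values, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flat weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def selective_merge(finetuned, safe, threshold=0.9, alpha=0.5):
    """Merge a layer with its safety-aligned counterpart only if it has drifted.

    Assumption: drift means cosine similarity below `threshold`;
    merging is linear interpolation with weight `alpha` on the fine-tuned layer.
    """
    merged = {}
    for name, w_ft in finetuned.items():
        w_safe = safe[name]
        if cosine(w_ft, w_safe) < threshold:
            merged[name] = [alpha * a + (1 - alpha) * b for a, b in zip(w_ft, w_safe)]
        else:
            merged[name] = list(w_ft)  # layer left as fine-tuned
    return merged


# Hypothetical two-layer models with two weights per layer.
finetuned = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
safe = {"layer0": [1.0, 0.0], "layer1": [1.0, 0.0]}
merged = selective_merge(finetuned, safe)
```

Here `layer0` matches its safety-aligned counterpart and is left alone, while `layer1` has drifted and is pulled halfway back.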


25. Model-free front-to-end training of a large high performance laser neural network

ArXiv ID: 2503.16943

Authors: Anas Skalli, Satoshi Sunada, Mirko Goldmann, Marcin Gebski, Stephan Reitzenstein, James A. Lott, Tomasz Czyszanowski, Daniel Brunner

Abstract: Artificial neural networks (ANNs) have become ubiquitous and have revolutionized many applications ranging from computer vision to medical diagnoses. However, they offer a fundamentally connectionist and distributed approach to computing, in stark contrast to classical computers that use the von Neumann architecture. This distinction has sparked renewed interest in developing unconventional hardware to support more efficient implementations of ANNs, rather than merely emulating them on traditional systems. Photonics stands out as a particularly promising platform, providing scalability, high speed, energy efficiency, and the ability for parallel information processing. However, fully realized autonomous optical neural networks (ONNs) with in-situ learning capabilities are still rare. In this work, we demonstrate a fully autonomous and parallel ONN using a multimode vertical cavity surface emitting laser (VCSEL) and off-the-shelf components. Our ONN is highly efficient and is scalable both in network size and inference bandwidth towards the GHz range. High performance hardware-compatible optimization algorithms are necessary in order to minimize reliance on external von Neumann computers to fully exploit the potential of ONNs. As such we present and extensively study several algorithms which are broadly compatible with a wide range of systems. We then apply these algorithms to optimize our ONN, and benchmark them using the MNIST dataset. We show that our ONN can achieve high accuracy and convergence efficiency, even under limited hardware resources. Crucially, we compare these algorithms in terms of scaling and optimization efficiency, measured by convergence time, which matters when working with limited external resources. Our work provides some guidance for the design of future ONNs as well as a simple and flexible way to train them.

Comment: The paper demonstrates a photonic ONN with in-situ learning capabilities, which aligns with emerging trends in unconventional neural network architectures.

Relevance: 7 Novelty: 8


26. Efficient Training of Neural Fractional-Order Differential Equation via Adjoint Backpropagation

ArXiv ID: 2503.16666

Authors: Qiyu Kang, Xuhao Li, Kai Zhao, Wenjun Cui, Yanan Zhao, Weihua Deng, Wee Peng Tay

Abstract: Fractional-order differential equations (FDEs) enhance traditional differential equations by extending the order of differential operators from integers to real numbers, offering greater flexibility in modeling complex dynamical systems with nonlocal characteristics. Recent progress at the intersection of FDEs and deep learning has catalyzed a new wave of innovative models, demonstrating the potential to address challenges such as graph representation learning. However, training neural FDEs has primarily relied on direct differentiation through forward-pass operations in FDE numerical solvers, leading to increased memory usage and computational complexity, particularly in large-scale applications. To address these challenges, we propose a scalable adjoint backpropagation method for training neural FDEs by solving an augmented FDE backward in time, which substantially reduces memory requirements. This approach provides a practical neural FDE toolbox and holds considerable promise for diverse applications. We demonstrate the effectiveness of our method in several tasks, achieving performance comparable to baseline models while significantly reducing computational overhead.

Comment: The paper proposes an adjoint backpropagation method for training neural fractional-order differential equations, which aligns with emerging trends in foundational research. The focus on reducing memory requirements is a significant contribution.

Relevance: 7 Novelty: 8


27. Do regularization methods for shortcut mitigation work as intended?

ArXiv ID: 2503.17015

Authors: Haoyang Hong, Ioanna Papanikolaou, Sonali Parbhoo

Abstract: Mitigating shortcuts, where models exploit spurious correlations in training data, remains a significant challenge for improving generalization. Regularization methods have been proposed to address this issue by enhancing model generalizability. However, we demonstrate that these methods can sometimes overregularize, inadvertently suppressing causal features along with spurious ones. In this work, we analyze the theoretical mechanisms by which regularization mitigates shortcuts and explore the limits of its effectiveness. Additionally, we identify the conditions under which regularization can successfully eliminate shortcuts without compromising causal features. Through experiments on synthetic and real-world datasets, our comprehensive analysis provides valuable insights into the strengths and limitations of regularization techniques for addressing shortcuts, offering guidance for developing more robust models.

Comment: The paper analyzes regularization methods for mitigating shortcuts, providing theoretical insights into their mechanisms. This aligns with representation learning and training dynamics.

Relevance: 7 Novelty: 7


28. Rethinking the Role of Spatial Mixing

ArXiv ID: 2503.16760

Authors: George Cazenavette, Joel Julin, Simon Lucey

Abstract: Until quite recently, the backbone of nearly every state-of-the-art computer vision model has been the 2D convolution. At its core, a 2D convolution simultaneously mixes information across both the spatial and channel dimensions of a representation. Many recent computer vision architectures consist of sequences of isotropic blocks that disentangle the spatial and channel-mixing components. This separation of the operations allows us to more closely juxtapose the effects of spatial and channel mixing in deep learning. In this paper, we take an initial step towards garnering a deeper understanding of the roles of these mixing operations. Through our experiments and analysis, we discover that on both classical (ResNet) and cutting-edge (ConvMixer) models, we can reach nearly the same level of classification performance by training only the channel mixers and leaving the spatial mixers at their random initializations. Furthermore, we show that models with random, fixed spatial mixing are naturally more robust to adversarial perturbations. Lastly, we show that this phenomenon extends past the classification regime, as such models can also decode pixel-shuffled images.

Comment: The paper investigates the role of spatial mixing in deep learning architectures, providing insights into the training dynamics and robustness of models. While it does not directly address representation learning or model architecture innovation, its analysis of spatial and channel mixing offers foundational insights into existing architectures.

Relevance: 7 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
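
As a minimal sketch of the response format requested above, the snippet below builds two example JSONL lines and parses them. The IDs and scores are copied from entries in this digest and the comments are shortened paraphrases; representing the scores as integers is an assumption, and `parse_jsonl` and `REQUIRED_KEYS` are hypothetical names used only for illustration.

```python
import json

# Two example response lines, one JSON object per line.
response = "\n".join([
    '{"ARXIVID": "2503.16797", "COMMENT": "Learnability analysis for neuro-symbolic tasks.", "RELEVANCE": 8, "NOVELTY": 7}',
    '{"ARXIVID": "2503.16760", "COMMENT": "Analysis of spatial and channel mixing.", "RELEVANCE": 7, "NOVELTY": 7}',
])

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_jsonl(text):
    """Parse one JSON object per non-empty line and check the expected keys."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        records.append(record)
    return records


records = parse_jsonl(response)
```

Parsing line by line keeps a single malformed record from hiding the position of the error, which is the main practical reason to prefer JSONL over one large JSON array here.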

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure itself.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thought, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: