Personalized Daily Arxiv Papers 03/12/2025

[gpt-4o]	Prompt	Completion	Total
Token	54680	7576	62256
Cost	$0.13	$0.08	$0.21

Total ArXiv papers: 605

Total scanned papers: 347

Total relevant papers: 35

Table of contents with paper titles:

Mixture of Experts Made Intrinsically Interpretable Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr
A Theory of Learning with Autoregressive Chain of Thought Authors: Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, Nathan Srebro
How good is PAC-Bayes at explaining generalisation? Authors: Antoine Picard-Weibel, Eugenio Clerico, Roman Moscoviz, Benjamin Guedj
A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning Authors: Chungpa Lee, Jeongheon Oh, Kibok Lee, Jy-yong Sohn
SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs Authors: Jaewoo Song, Fangzhen Lin
MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration Authors: Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao
Route Sparse Autoencoder to Interpret Large Language Models Authors: Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Gojun Ma, Xiang Wang, Xiangnan He
CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement Authors: Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen
Accelerating MoE Model Inference with Expert Sharding Authors: Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, Martijn de Vos, Milos Vujasinovic
ELECTRA: A Symmetry-breaking Cartesian Network for Charge Density Prediction with Floating Orbitals Authors: Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, Arghya Bhowmik
EFPC: Towards Efficient and Flexible Prompt Compression Authors: Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu
ProTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models Authors: Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao
Accurate INT8 Training Through Dynamic Block-Level Fallback Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen
The Space Between: On Folding, Symmetries and Sampling Authors: Michal Lewandowski, Bernhard Heinzl, Raphael Pisoni, Bernhard A. Moser
CIMAGE: Exploiting the Conditional Independence in Masked Graph Auto-encoders Authors: Jongwon Park, Heesoo Jung, Hogun Park
Symbolic Neural Ordinary Differential Equations Authors: Xin Li, Chengli Zhao, Xue Zhang, Xiaojun Duan
How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks? Authors: Gal Alon, Yehuda Dar
Deep ARTMAP: Generalized Hierarchical Learning with Adaptive Resonance Theory Authors: Niklas M. Melton, Leonardo Enzo Brito da Silva, Sasha Petrenko, Donald C. Wunsch II
Scaling Probabilistic Circuits via Data Partitioning Authors: Jonas Seng, Florian Peter Busch, Pooja Prasad, Devendra Singh Dhami, Martin Mundt, Kristian Kersting
A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training Authors: Chengcheng Yan, Jiawei Xu, Qingsong Wang, Zheng Peng
HOFAR: High-Order Augmentation of Flow Autoregressive Transformers Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors Authors: Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, Chun Yuan
Benign Overfitting and the Geometry of the Ridge Regression Solution in Binary Classification Authors: Alexander Tsigler, Luiz F. O. Chamon, Spencer Frei, Peter L. Bartlett
Learning and Evaluating Hierarchical Feature Representations Authors: Depanshu Sani, Saket Anand
Personalized Convolutional Dictionary Learning of Physiological Time Series Authors: Axel Roques, Samuel Gruffaz, Kyurae Kim, Alain Oliviero-Durmus, Laurent Oudre
Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy Authors: Wei Junhao, Yu Zhe, Sakuma Jun
Aligning Text to Image in Diffusion Models is Easier Than You Think Authors: Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye
MinGRU-Based Encoder for Turbo Autoencoder Frameworks Authors: Rick Fritschek, Rafael F. Schaefer
Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks? Authors: Payel Das, Ching-Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano
Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning Authors: Hubert Baniecki, Przemyslaw Biecek
Two-Dimensional Deep ReLU CNN Approximation for Korobov Functions: A Constructive Approach Authors: Qin Fang, Lei Shi, Min Xu, Ding-Xuan Zhou
Median Consensus Embedding for Dimensionality Reduction Authors: Yui Tomo, Daisuke Yoneoka
Strengthening the Internal Adversarial Robustness in Lifted Neural Networks Authors: Christopher Zach
Accelerated Distributed Optimization with Compression and Error Feedback Authors: Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian U. Stich
Learning to Match Unpaired Data with Minimum Entropy Coupling Authors: Mustapha Bounoua, Giulio Franzese, Pietro Michiardi

1. Mixture of Experts Made Intrinsically Interpretable

ArXiv ID: 2503.07639

Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr

Abstract: Neurons in large language models often exhibit polysemanticity, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.

Comment: The paper introduces MoE-X, a Mixture-of-Experts model designed for intrinsic interpretability, which aligns closely with the MoE and interpretability criteria.

Relevance: 10 Novelty: 8
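
The abstract's statement that an MoE layer can be rewritten as an equivalent sparse, large MLP can be checked numerically on a toy layer. The following is a minimal sketch under stated assumptions, not the MoE-X implementation: the experts are two-layer ReLU MLPs, the router is a plain softmax with top-k selection (MoE-X's sparsity-aware routing is not modelled), and all names and sizes are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def top_k_gates(scores, k):
    """Softmax the router scores, keep the k largest gates, zero the rest."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]

def moe_forward(x, experts, gates):
    """MoE view: evaluate only the selected experts and mix their outputs."""
    out = [0.0] * len(experts[0][1])
    for (W1, W2), g in zip(experts, gates):
        if g == 0.0:
            continue  # unselected experts are never evaluated
        y = matvec(W2, relu(matvec(W1, x)))
        out = [o + g * yi for o, yi in zip(out, y)]
    return out

def wide_mlp_forward(x, experts, gates):
    """Wide-MLP view: one hidden vector of size E*H with zeroed blocks."""
    hidden = []
    for (W1, _), g in zip(experts, gates):
        # For clarity every block is computed and then gated; a real
        # implementation would skip the blocks whose gate is zero.
        hidden.extend(g * h for h in relu(matvec(W1, x)))
    d_out = len(experts[0][1])
    # Concatenate the experts' output matrices column-wise.
    W2_wide = [sum((W2[r] for _, W2 in experts), []) for r in range(d_out)]
    return matvec(W2_wide, hidden), hidden

random.seed(0)
D_IN, HIDDEN, D_OUT, NUM_EXPERTS, TOP_K = 3, 4, 2, 4, 2  # hypothetical sizes
experts = [(rand_matrix(HIDDEN, D_IN), rand_matrix(D_OUT, HIDDEN))
           for _ in range(NUM_EXPERTS)]
router = rand_matrix(NUM_EXPERTS, D_IN)
x = [0.5, -1.0, 2.0]

gates = top_k_gates(matvec(router, x), TOP_K)
moe_out = moe_forward(x, experts, gates)
wide_out, hidden = wide_mlp_forward(x, experts, gates)
print(moe_out, wide_out)
```

Both views give the same output up to floating-point rounding, and at most TOP_K * HIDDEN of the NUM_EXPERTS * HIDDEN hidden units are nonzero, which is the sense in which the wide MLP is sparse.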

2. A Theory of Learning with Autoregressive Chain of Thought

ArXiv ID: 2503.07932

Authors: Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, Nathan Srebro

Abstract: For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.

Comment: The paper formalizes learning with autoregressive chain-of-thought, which aligns with foundational research in LLMs and introduces theoretical insights.

Relevance: 9 Novelty: 9
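
The setup in the abstract, iterating one fixed next-token generator and reading off the last token as the answer, can be sketched in a few lines. This is only an illustration: `chain_of_thought` and the toy generator `fib_mod10` are hypothetical names, and the toy generator is not one of the base classes analysed in the paper.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
Generator = Callable[[Sequence[Token]], Token]

def chain_of_thought(generator: Generator,
                     prompt: Sequence[Token],
                     steps: int) -> Tuple[List[Token], Token]:
    """Apply the same generator `steps` times; return (chain, final answer)."""
    if steps < 1:
        raise ValueError("at least one generation step is required")
    sequence = list(prompt)
    for _ in range(steps):
        # Time invariance: the identical function is used at every step.
        sequence.append(generator(sequence))
    chain = sequence[len(prompt):]
    return chain, chain[-1]

def fib_mod10(sequence: Sequence[Token]) -> Token:
    """Toy generator: next token is the sum of the last two tokens, mod 10."""
    return (sequence[-1] + sequence[-2]) % 10

chain, answer = chain_of_thought(fib_mod10, [1, 1], steps=5)
print(chain, answer)
```

Training on `(prompt, chain)` pairs corresponds to the observed chain-of-thought setting in the abstract, while training on `(prompt, answer)` pairs alone corresponds to the latent setting.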

3. How good is PAC-Bayes at explaining generalisation?

ArXiv ID: 2503.08231

Authors: Antoine Picard-Weibel, Eugenio Clerico, Roman Moscoviz, Benjamin Guedj

Abstract: We discuss necessary conditions for a PAC-Bayes bound to provide a meaningful generalisation guarantee. Our analysis reveals that the optimal generalisation guarantee depends solely on the distribution of the risk induced by the prior distribution. In particular, achieving a target generalisation level is only achievable if the prior places sufficient mass on high-performing predictors. We relate these requirements to the prevalent practice of using data-dependent priors in deep learning PAC-Bayes applications, and discuss the implications for the claim that PAC-Bayes "explains" generalisation.

Comment: The paper provides a theoretical analysis of PAC-Bayes bounds and their ability to explain generalization, which is highly relevant to foundational research in representation learning and generalization theory.

Relevance: 9 Novelty: 8

4. A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning

ArXiv ID: 2503.08203

Authors: Chungpa Lee, Jeongheon Oh, Kibok Lee, Jy-yong Sohn

Abstract: Supervised contrastive learning (SupCL) has emerged as a prominent approach in representation learning, leveraging both supervised and self-supervised losses. However, achieving an optimal balance between these losses is challenging; failing to do so can lead to class collapse, reducing discrimination among individual embeddings in the same class. In this paper, we present theoretically grounded guidelines for SupCL to prevent class collapse in learned representations. Specifically, we introduce the Simplex-to-Simplex Embedding Model (SSEM), a theoretical framework that models various embedding structures, including all embeddings that minimize the supervised contrastive loss. Through SSEM, we analyze how hyperparameters affect learned representations, offering practical guidelines for hyperparameter selection to mitigate the risk of class collapse. Our theoretical findings are supported by empirical results across synthetic and real-world datasets.

Comment: The paper provides a theoretical framework to prevent class collapse in supervised contrastive learning, which is highly relevant to foundational research in representation learning.

Relevance: 9 Novelty: 8

5. SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

ArXiv ID: 2503.07657

Authors: Jaewoo Song, Fangzhen Lin

Abstract: The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to the basic linear quantization, they typically require high-end graphics processing units (GPUs), are often restricted to specific deep neural network (DNN) frameworks, and require calibration datasets. This limitation poses challenges for using such algorithms on various neural processing units (NPUs) and edge AI devices, which have diverse model formats and frameworks. In this paper, we show SplitQuantV2, an innovative algorithm designed to enhance low-bit linear quantization of LLMs, can achieve results comparable to those of advanced algorithms. SplitQuantV2 preprocesses models by splitting linear and convolution layers into functionally equivalent, quantization-friendly structures. The algorithm's platform-agnostic, concise, and efficient nature allows for implementation without the need for GPUs. Our evaluation on the Llama 3.2 1B Instruct model using the AI2's Reasoning Challenge (ARC) dataset demonstrates that SplitQuantV2 improves the accuracy of the INT4 quantization model by 11.76%p, matching the performance of the original floating-point model. Remarkably, SplitQuantV2 took only 2 minutes 6 seconds to preprocess the 1B model and perform linear INT4 quantization using only an Apple M4 CPU. SplitQuantV2 provides a practical solution for low-bit quantization on LLMs, especially when complex, computation-intensive algorithms are inaccessible due to hardware limitations or framework incompatibilities.

Comment: The paper introduces SplitQuantV2, a novel low-bit quantization method for LLMs, which aligns with the model compression criterion and demonstrates practical efficiency improvements.

Relevance: 9 Novelty: 8
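
The idea of splitting a layer into functionally equivalent parts that quantize better can be illustrated on a single weight vector. The abstract does not state how SplitQuantV2 chooses its split, so the rule below (sort the weights by value and cut them into three equal-sized groups) is an assumption made for illustration, as are all function names. Each group keeps its own weights and zeros elsewhere, so the groups sum back to the original weights, and each group gets its own INT4 scale.

```python
def quantize_int4(weights):
    """Symmetric linear INT4 quantization followed by dequantization."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 7  # map the largest magnitude to the INT4 level 7
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]

def split_by_value(weights, parts=3):
    """Assumed split rule: sort by value, cut into `parts` contiguous groups."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    size = -(-len(order) // parts)  # ceiling division
    groups = []
    for start in range(0, len(order), size):
        members = set(order[start:start + size])
        groups.append([w if i in members else 0.0
                       for i, w in enumerate(weights)])
    return groups

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical weights with a few large values that dominate the scale.
weights = [0.01, -0.02, 0.03, 0.5, -0.45, 4.0, -3.8, 0.02]

plain = quantize_int4(weights)
groups = split_by_value(weights)
recombined = [sum(col) for col in zip(*groups)]
split = [sum(col) for col in zip(*(quantize_int4(g) for g in groups))]

print("unsplit MSE:", mse(weights, plain))
print("split MSE:  ", mse(weights, split))
```

On this example the small weights are rounded to zero under the single shared scale but survive when their group has its own scale. The price is extra layers and scales to store, a trade-off this sketch does not measure.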

6. MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration

ArXiv ID: 2503.07654

Authors: Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao

Abstract: Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on exploring the per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation inference of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks compared to FP16 baseline to 1.3 points on Llama-2-70B model. On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.

Comment: The paper focuses on a novel static quantization framework for LLMs, which aligns with the model compression criterion, particularly in sparsity and quantization.

Relevance: 9 Novelty: 8

7. Route Sparse Autoencoder to Interpret Large Language Models

ArXiv ID: 2503.08200

Authors: Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Gojun Ma, Xiang Wang, Xiangnan He

Abstract: Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at https://github.com/swei2001/RouteSAEs.

Comment: The paper proposes a sparse autoencoder framework for LLM interpretability, which aligns with representation learning and interpretability of LLMs.

Relevance: 9 Novelty: 8

8. CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

ArXiv ID: 2503.07938

Authors: Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen

Abstract: While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE (Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the shared information between target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.

Comment: The paper introduces CAD-VAE, a novel disentangled VAE framework addressing fairness in representation learning, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8

9. Accelerating MoE Model Inference with Expert Sharding

ArXiv ID: 2503.08467

Authors: Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, Martijn de Vos, Milos Vujasinovic

Abstract: Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. While prior work has focused on optimizing MoE training and decoder architectures, inference for encoder-based MoE models in a multi-GPU with expert parallelism setting remains underexplored. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts. Unlike existing approaches that rely on heuristic capacity factors or drop tokens, MoEShard evenly distributes computation across GPUs and ensures full token retention, maximizing utilization regardless of routing skewness. We achieve this through a strategic row- and column-wise decomposition of expert matrices. This reduces idle time and avoids bottlenecks caused by imbalanced expert assignments. Furthermore, MoEShard minimizes kernel launches by fusing decomposed expert computations, significantly improving throughput. We evaluate MoEShard against DeepSpeed on encoder-based architectures, demonstrating speedups of up to 6.4x in time to first token (TTFT). Our results show that tensor sharding, when properly applied to experts, is a viable and effective strategy for efficient MoE inference.

Comment: The paper addresses efficiency in Mixture-of-Experts (MoE) inference through expert sharding, which directly aligns with the model architecture and compression criteria. The tensor sharding approach is a novel contribution to MoE inference.

Relevance: 9 Novelty: 8
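
The row- and column-wise decomposition mentioned in the abstract can be sketched for a single expert. The sketch assumes a two-layer ReLU expert and the usual tensor-parallel split (rows of the first matrix, matching columns of the second); the abstract does not say which matrix is split which way, so that pairing, the function names, and the toy weights are all assumptions. Kernel fusion and multi-GPU communication are not modelled.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0, a) for a in v]

def expert_forward(W1, W2, x):
    """Unsharded expert: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def shard_expert(W1, W2, num_shards):
    """Give each shard a slice of hidden units: rows of W1, columns of W2."""
    hidden = len(W1)
    bounds = [(hidden * s) // num_shards for s in range(num_shards + 1)]
    return [(W1[lo:hi], [row[lo:hi] for row in W2])
            for lo, hi in zip(bounds, bounds[1:])]

def sharded_forward(shards, x):
    """Each shard computes a partial output; the partials are summed."""
    partials = [expert_forward(W1_rows, W2_cols, x)
                for W1_rows, W2_cols in shards]
    return [sum(vals) for vals in zip(*partials)]

# Toy expert with 4 hidden units, 2 inputs, 2 outputs (integer weights
# so that the comparison is exact).
W1 = [[1, -2], [0, 3], [2, 1], [-1, -1]]
W2 = [[1, 0, 2, -1], [3, 1, 0, 2]]
x = [1, 2]

full = expert_forward(W1, W2, x)
two_way = sharded_forward(shard_expert(W1, W2, 2), x)
four_way = sharded_forward(shard_expert(W1, W2, 4), x)
print(full, two_way, four_way)
```

Because the activation is applied per hidden unit, every shard processes every token with an equal slice of the expert, so the work per shard does not depend on how many tokens the router sends to that expert.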

10. ELECTRA: A Symmetry-breaking Cartesian Network for Charge Density Prediction with Floating Orbitals

ArXiv ID: 2503.08305

Authors: Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, Arghya Bhowmik

Abstract: We present the Electronic Tensor Reconstruction Algorithm (ELECTRA) - an equivariant model for predicting electronic charge densities using "floating" orbitals. Floating orbitals are a long-standing idea in the quantum chemistry community that promises more compact and accurate representations by placing orbitals freely in space, as opposed to centering all orbitals at the position of atoms. Finding ideal placements of these orbitals requires extensive domain knowledge though, which thus far has prevented widespread adoption. We solve this in a data-driven manner by training a Cartesian tensor network to predict orbital positions along with orbital coefficients. This is made possible through a symmetry-breaking mechanism that is used to learn position displacements with lower symmetry than the input molecule while preserving the rotation equivariance of the charge density itself. Inspired by recent successes of Gaussian Splatting in representing densities in space, we are using Gaussians as our orbitals and predict their weights and covariance matrices. Our method achieves a state-of-the-art balance between computational efficiency and predictive accuracy on established benchmarks.

Comment: The paper introduces a symmetry-breaking equivariant model for predicting electronic charge densities, which is foundational in AI for science and introduces a novel generative paradigm.

Relevance: 9 Novelty: 8

11. EFPC: Towards Efficient and Flexible Prompt Compression

ArXiv ID: 2503.07956

Authors: Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu

Abstract: The emergence of large language models (LLMs) like GPT-4 has revolutionized natural language processing (NLP), enabling diverse, complex tasks. However, extensive token counts lead to high computational and financial burdens. To address this, we propose Efficient and Flexible Prompt Compression (EFPC), a novel method unifying task-aware and task-agnostic compression for a favorable accuracy-efficiency trade-off. EFPC uses GPT-4 to generate compressed prompts and integrates them with original prompts for training. During training and inference, we selectively prepend user instructions and compress prompts based on predicted probabilities. EFPC is highly data-efficient, achieving significant performance with minimal data. Compared to the state-of-the-art method LLMLingua-2, EFPC achieves a 4.8% relative improvement in F1-score with 1% additional data at a 4x compression rate, and an 11.4% gain with 10% additional data on the LongBench single-doc QA benchmark. EFPC's unified framework supports broad applicability and enhances performance across various models, tasks, and domains, offering a practical advancement in NLP.

Comment: The paper proposes a novel prompt compression method for LLMs, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8

12. ProTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models

ArXiv ID: 2503.08179

Authors: Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao

Abstract: Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProTeX empowers decoder-only LLMs to effectively address a diverse spectrum of protein-related tasks.

Comment: The paper introduces a framework for protein structure reasoning and editing using LLMs, which aligns with foundational AI for science and multimodal generative paradigms.

Relevance: 9 Novelty: 8

13. Accurate INT8 Training Through Dynamic Block-Level Fallback

ArXiv ID: 2503.08040

Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen

Abstract: Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already demonstrated its effectiveness on GPT2 models with block-level quantization. However, it struggles with modern Transformer variants incorporating GLU units. This is because those variants demonstrate complex distributions of activation outliers. To address the challenge, we propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings. Moreover, our method achieves a 1.57x end-to-end training speedup on RTX4090 GPUs.

Comment: The paper proposes a dynamic fallback quantization method for INT8 training, which aligns with the model compression criterion by addressing efficiency and robustness in low-bit training.

Relevance: 9 Novelty: 8
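
The block-level fallback idea can be illustrated with a simple simulated quantizer. This is a sketch under assumptions, not the paper's mixed-precision GEMM: the outlier test (a fixed magnitude threshold), the use of plain symmetric 16-bit quantization for fallback blocks, and every name below are hypothetical choices made for illustration.

```python
def quantize_block(block, bits):
    """Symmetric per-block quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in block]

def fallback_quantize(values, block_size, outlier_threshold):
    """Use 8 bits per block, falling back to 16 bits when a block has an outlier."""
    out, fallbacks = [], 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Assumed detection rule: any magnitude above the threshold.
        has_outlier = max(abs(v) for v in block) > outlier_threshold
        fallbacks += int(has_outlier)
        out.extend(quantize_block(block, 16 if has_outlier else 8))
    return out, fallbacks

def max_error(reference, approx):
    return max(abs(r - a) for r, a in zip(reference, approx))

# Hypothetical activations: the second block contains one large outlier.
activations = [0.1, -0.2, 0.05, 0.3, 0.2, -0.1, 50.0, 0.15]

mixed, num_fallbacks = fallback_quantize(activations, 4, 10.0)
int8_only, no_fallbacks = fallback_quantize(activations, 4, float("inf"))

print("blocks using 16-bit:", num_fallbacks)
print("max error, INT8 only:    ", max_error(activations, int8_only))
print("max error, with fallback:", max_error(activations, mixed))
```

In the INT8-only run the outlier stretches the block's scale so far that the small values next to it are rounded badly; the fallback run spends extra precision only on that one block and leaves the other block's INT8 result unchanged.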

14. The Space Between: On Folding, Symmetries and Sampling

ArXiv ID: 2503.08502

Authors: Michal Lewandowski, Bernhard Heinzl, Raphael Pisoni, Bernhard A. Moser

Abstract: Recent findings suggest that consecutive layers of neural networks with the ReLU activation function fold the input space during the learning process. While many works hint at this phenomenon, an approach to quantify the folding was only recently proposed by means of a space folding measure based on Hamming distance in the ReLU activation space. We generalize this measure to a wider class of activation functions through introduction of equivalence classes of input data, analyse its mathematical and computational properties and come up with an efficient sampling strategy for its implementation. Moreover, it has been observed that space folding values increase with network depth when the generalization error is low, but decrease when the error increases. This underpins that learned symmetries in the data manifold (e.g., invariance under reflection) become visible in terms of space folds, contributing to the network's generalization capacity. Inspired by these findings, we outline a novel regularization scheme that encourages the network to seek solutions characterized by higher folding values.

Comment: The paper explores space folding in neural networks, which aligns with representation learning and provides insights into training dynamics.

Relevance: 8 Novelty: 8

15. CIMAGE: Exploiting the Conditional Independence in Masked Graph Auto-encoders

ArXiv ID: 2503.07852

Authors: Jongwon Park, Heesoo Jung, Hogun Park

Abstract: Recent Self-Supervised Learning (SSL) methods encapsulating relational information via masking in Graph Neural Networks (GNNs) have shown promising performance. However, most existing approaches rely on random masking strategies in either feature or graph space, which may fail to capture task-relevant information fully. We posit that this limitation stems from an inability to achieve minimum redundancy between masked and unmasked components while ensuring maximum relevance of both to potential downstream tasks. Conditional Independence (CI) inherently satisfies the minimum redundancy and maximum relevance criteria, but its application typically requires access to downstream labels. To address this challenge, we introduce CIMAGE, a novel approach that leverages Conditional Independence to guide an effective masking strategy within the latent space. CIMAGE utilizes CI-aware latent factor decomposition to generate two distinct contexts, leveraging high-confidence pseudo-labels derived from unsupervised graph clustering. In this framework, the pretext task involves reconstructing the masked second context solely from the information provided by the first context. Our theoretical analysis further supports the superiority of CIMAGE's novel CI-aware masking method by demonstrating that the learned embedding exhibits approximate linear separability, which enables accurate predictions for the downstream task. Comprehensive evaluations across diverse graph benchmarks illustrate the advantage of CIMAGE, with notably higher average rankings on node classification and link prediction tasks. Notably, our proposed model highlights the under-explored potential of CI in enhancing graph SSL methodologies and offers enriched insights for effective graph representation learning.

Comment: The paper introduces CIMAGE, a novel CI-aware masking strategy for graph autoencoders, contributing to foundational research in representation learning for graphs.

Relevance: 8 Novelty: 8

16. Symbolic Neural Ordinary Differential Equations

ArXiv ID: 2503.08059

Authors: Xin Li, Chengli Zhao, Xue Zhang, Xiaojun Duan

Abstract: Differential equations are widely used to describe complex dynamical systems with evolving parameters in nature and engineering. Effectively learning a family of maps from the parameter function to the system dynamics is of great significance. In this study, we propose a novel learning framework of symbolic continuous-depth neural networks, termed Symbolic Neural Ordinary Differential Equations (SNODEs), to effectively and accurately learn the underlying dynamics of complex systems. Specifically, our learning framework comprises three stages: initially, pre-training a predefined symbolic neural network via a gradient flow matching strategy; subsequently, fine-tuning this network using Neural ODEs; and finally, constructing a general neural network to capture residuals. In this process, we apply the SNODEs framework to partial differential equation systems through Fourier analysis, achieving resolution-invariant modeling. Moreover, this framework integrates the strengths of symbolism and connectionism, boasting a universal approximation theorem while significantly enhancing interpretability and extrapolation capabilities relative to state-of-the-art baseline methods. We demonstrate this through experiments on several representative complex systems. Therefore, our framework can be further applied to a wide range of scientific problems, such as system bifurcation and control, reconstruction and forecasting, as well as the discovery of new equations.

Comment: The paper proposes Symbolic Neural ODEs, integrating symbolic and neural approaches for learning dynamical systems, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 8

17. How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?

ArXiv ID: 2503.08633

Authors: Gal Alon, Yehuda Dar

Abstract: Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch. In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width. We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples. Our results show that unlearning excels on overparameterized models, in terms of balancing between generalization and achieving the unlearning goal; although for bias removal this requires the unlearning method to use the unlearned examples. We further elucidate our error-based analysis by measuring how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.

Comment: This paper investigates the effect of overparameterization on machine unlearning, which aligns with foundational research on training dynamics in neural networks. It provides theoretical insights into how unlearning interacts with model parameterization.