Personalized Daily Arxiv Papers 03/12/2025
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 54680 | 7576 | 62256 |
| Cost | $0.13 | $0.08 | $0.21 |
Total ArXiv papers: 605
Total scanned papers: 347
Total relevant papers: 35
Table of contents with paper titles:
- Mixture of Experts Made Intrinsically Interpretable. Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr
- A Theory of Learning with Autoregressive Chain of Thought. Authors: Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, Nathan Srebro
- How good is PAC-Bayes at explaining generalisation? Authors: Antoine Picard-Weibel, Eugenio Clerico, Roman Moscoviz, Benjamin Guedj
- A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning. Authors: Chungpa Lee, Jeongheon Oh, Kibok Lee, Jy-yong Sohn
- SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs. Authors: Jaewoo Song, Fangzhen Lin
- MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration. Authors: Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao
- Route Sparse Autoencoder to Interpret Large Language Models. Authors: Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Gojun Ma, Xiang Wang, Xiangnan He
- CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement. Authors: Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen
- Accelerating MoE Model Inference with Expert Sharding. Authors: Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, Martijn de Vos, Milos Vujasinovic
- ELECTRA: A Symmetry-breaking Cartesian Network for Charge Density Prediction with Floating Orbitals. Authors: Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, Arghya Bhowmik
- EFPC: Towards Efficient and Flexible Prompt Compression. Authors: Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu
- ProTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models. Authors: Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao
- Accurate INT8 Training Through Dynamic Block-Level Fallback. Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen
- The Space Between: On Folding, Symmetries and Sampling. Authors: Michal Lewandowski, Bernhard Heinzl, Raphael Pisoni, Bernhard A. Moser
- CIMAGE: Exploiting the Conditional Independence in Masked Graph Auto-encoders. Authors: Jongwon Park, Heesoo Jung, Hogun Park
- Symbolic Neural Ordinary Differential Equations. Authors: Xin Li, Chengli Zhao, Xue Zhang, Xiaojun Duan
- How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks? Authors: Gal Alon, Yehuda Dar
- Deep ARTMAP: Generalized Hierarchical Learning with Adaptive Resonance Theory. Authors: Niklas M. Melton, Leonardo Enzo Brito da Silva, Sasha Petrenko, Donald C. Wunsch II
- Scaling Probabilistic Circuits via Data Partitioning. Authors: Jonas Seng, Florian Peter Busch, Pooja Prasad, Devendra Singh Dhami, Martin Mundt, Kristian Kersting
- A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training. Authors: Chengcheng Yan, Jiawei Xu, Qingsong Wang, Zheng Peng
- HOFAR: High-Order Augmentation of Flow Autoregressive Transformers. Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
- Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors. Authors: Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, Chun Yuan
- Benign Overfitting and the Geometry of the Ridge Regression Solution in Binary Classification. Authors: Alexander Tsigler, Luiz F. O. Chamon, Spencer Frei, Peter L. Bartlett
- Learning and Evaluating Hierarchical Feature Representations. Authors: Depanshu Sani, Saket Anand
- Personalized Convolutional Dictionary Learning of Physiological Time Series. Authors: Axel Roques, Samuel Gruffaz, Kyurae Kim, Alain Oliviero-Durmus, Laurent Oudre
- Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy. Authors: Wei Junhao, Yu Zhe, Sakuma Jun
- Aligning Text to Image in Diffusion Models is Easier Than You Think. Authors: Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye
- MinGRU-Based Encoder for Turbo Autoencoder Frameworks. Authors: Rick Fritschek, Rafael F. Schaefer
- Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks? Authors: Payel Das, Ching-Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano
- Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning. Authors: Hubert Baniecki, Przemyslaw Biecek
- Two-Dimensional Deep ReLU CNN Approximation for Korobov Functions: A Constructive Approach. Authors: Qin Fang, Lei Shi, Min Xu, Ding-Xuan Zhou
- Median Consensus Embedding for Dimensionality Reduction. Authors: Yui Tomo, Daisuke Yoneoka
- Strengthening the Internal Adversarial Robustness in Lifted Neural Networks. Authors: Christopher Zach
- Accelerated Distributed Optimization with Compression and Error Feedback. Authors: Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian U. Stich
- Learning to Match Unpaired Data with Minimum Entropy Coupling. Authors: Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
1. Mixture of Experts Made Intrinsically Interpretable
ArXiv ID: 2503.07639
Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr
Abstract: Neurons in large language models often exhibit polysemanticity, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
Comment: The paper introduces MoE-X, a Mixture-of-Experts model designed for intrinsic interpretability, which aligns closely with the MoE and interpretability criteria.
Relevance: 10 Novelty: 8
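The abstract's claim that an MoE layer can be rewritten as an equivalent sparse, wide MLP can be illustrated with a minimal NumPy sketch. It assumes ReLU experts without biases and a hypothetical top-k softmax router; none of these details are taken from the paper, and this is not the MoE-X implementation.

```python
import numpy as np

def moe_forward(x, W1, W2, gates):
    """Gated sum of expert MLPs: sum_e gates[e] * relu(x @ W1[e]) @ W2[e]."""
    out = np.zeros(W2.shape[-1])
    for e, g in enumerate(gates):
        if g != 0.0:  # inactive experts are skipped entirely
            out += g * (np.maximum(x @ W1[e], 0.0) @ W2[e])
    return out

def wide_mlp_forward(x, W1, W2, gates):
    """The same computation as one wide MLP with a block-sparse hidden layer."""
    n_experts, d_in, d_hidden = W1.shape
    W1_wide = np.concatenate(list(W1), axis=1)  # (d_in, n_experts * d_hidden)
    W2_wide = np.concatenate(list(W2), axis=0)  # (n_experts * d_hidden, d_out)
    hidden = np.maximum(x @ W1_wide, 0.0)
    # Scale each expert's block of hidden units by its gate; zero gates
    # switch whole blocks off, which is where the sparsity comes from.
    hidden = hidden * np.repeat(gates, d_hidden)
    return hidden @ W2_wide

rng = np.random.default_rng(0)
n_experts, d_in, d_hidden, d_out, top_k = 8, 16, 32, 16, 2
W1 = rng.normal(size=(n_experts, d_in, d_hidden))
W2 = rng.normal(size=(n_experts, d_hidden, d_out))
x = rng.normal(size=d_in)

# Hypothetical router (an assumption): softmax over random logits, top-k kept.
logits = rng.normal(size=n_experts)
gates = np.exp(logits) / np.exp(logits).sum()
gates[np.argsort(gates)[:-top_k]] = 0.0

y_moe = moe_forward(x, W1, W2, gates)
y_wide = wide_mlp_forward(x, W1, W2, gates)
print(np.allclose(y_moe, y_wide))
```

The wide form makes the hidden size grow with the number of experts, while the gate mask keeps only `top_k` blocks active per input.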
2. A Theory of Learning with Autoregressive Chain of Thought
ArXiv ID: 2503.07932
Authors: Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, Nathan Srebro
Abstract: For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
Comment: The paper formalizes learning with autoregressive chain-of-thought, which aligns with foundational research in LLMs and introduces theoretical insights.
Relevance: 9 Novelty: 9
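The setup in the abstract (iterate one fixed next-token generator, then read off the final token as the answer) can be sketched in a few lines. The linear-threshold generator over a sliding window of binary tokens, and the names `linear_threshold_generator` and `cot_answer`, are illustrative assumptions rather than the paper's construction.

```python
def linear_threshold_generator(weights, bias):
    """Next-token rule: 1 if w . (last len(w) tokens) + bias >= 0, else 0."""
    def generate(seq):
        window = list(seq[-len(weights):])
        window = [0] * (len(weights) - len(window)) + window  # left-pad
        score = sum(w * t for w, t in zip(weights, window)) + bias
        return 1 if score >= 0 else 0
    return generate

def cot_answer(generator, prompt, steps):
    """Apply the same generator `steps` times; return (answer, chain)."""
    seq = list(prompt)
    for _ in range(steps):
        seq.append(generator(seq))  # time-invariant: one rule at every step
    chain = seq[len(prompt):]
    return chain[-1], chain

# Emits 1 only when the previous two tokens are (1, 0).
gen = linear_threshold_generator(weights=[1.0, -1.0], bias=-0.5)
answer, chain = cot_answer(gen, prompt=[1, 0], steps=3)
print(answer, chain)
```

Training with the chain observed would supervise every element of `chain`; training on prompt-answer pairs alone would supervise only `answer` and treat the rest as latent.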
3. How good is PAC-Bayes at explaining generalisation?
ArXiv ID: 2503.08231
Authors: Antoine Picard-Weibel, Eugenio Clerico, Roman Moscoviz, Benjamin Guedj
Abstract: We discuss necessary conditions for a PAC-Bayes bound to provide a meaningful generalisation guarantee. Our analysis reveals that the optimal generalisation guarantee depends solely on the distribution of the risk induced by the prior distribution. In particular, achieving a target generalisation level is only achievable if the prior places sufficient mass on high-performing predictors. We relate these requirements to the prevalent practice of using data-dependent priors in deep learning PAC-Bayes applications, and discuss the implications for the claim that PAC-Bayes "explains" generalisation.
Comment: The paper provides a theoretical analysis of PAC-Bayes bounds and their ability to explain generalization, which is highly relevant to foundational research in representation learning and generalization theory.
Relevance: 9 Novelty: 8
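For orientation, one classical PAC-Bayes bound (a McAllester-style statement with Maurer's constant, given here as background and not as the bound analysed in the paper) reads as follows. It assumes a loss bounded in [0, 1], an i.i.d. sample of size n, and a prior π chosen before seeing the data; the exact logarithmic term varies between references.

```latex
% With probability at least 1 - \delta over the sample,
% simultaneously for all posteriors \rho:
R(\rho) \;\le\; \hat{R}_n(\rho)
  + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

The prior enters only through KL(ρ‖π), so a posterior concentrated on good predictors yields a small bound only if the prior already gives those predictors non-negligible mass, which is the kind of requirement the abstract refers to.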
4. A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning
ArXiv ID: 2503.08203
Authors: Chungpa Lee, Jeongheon Oh, Kibok Lee, Jy-yong Sohn
Abstract: Supervised contrastive learning (SupCL) has emerged as a prominent approach in representation learning, leveraging both supervised and self-supervised losses. However, achieving an optimal balance between these losses is challenging; failing to do so can lead to class collapse, reducing discrimination among individual embeddings in the same class. In this paper, we present theoretically grounded guidelines for SupCL to prevent class collapse in learned representations. Specifically, we introduce the Simplex-to-Simplex Embedding Model (SSEM), a theoretical framework that models various embedding structures, including all embeddings that minimize the supervised contrastive loss. Through SSEM, we analyze how hyperparameters affect learned representations, offering practical guidelines for hyperparameter selection to mitigate the risk of class collapse. Our theoretical findings are supported by empirical results across synthetic and real-world datasets.
Comment: The paper provides a theoretical framework to prevent class collapse in supervised contrastive learning, which is highly relevant to foundational research in representation learning.
Relevance: 9 Novelty: 8
5. SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs
ArXiv ID: 2503.07657
Authors: Jaewoo Song, Fangzhen Lin
Abstract: The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to the basic linear quantization, they typically require high-end graphics processing units (GPUs), are often restricted to specific deep neural network (DNN) frameworks, and require calibration datasets. This limitation poses challenges for using such algorithms on various neural processing units (NPUs) and edge AI devices, which have diverse model formats and frameworks. In this paper, we show SplitQuantV2, an innovative algorithm designed to enhance low-bit linear quantization of LLMs, can achieve results comparable to those of advanced algorithms. SplitQuantV2 preprocesses models by splitting linear and convolution layers into functionally equivalent, quantization-friendly structures. The algorithm's platform-agnostic, concise, and efficient nature allows for implementation without the need for GPUs. Our evaluation on the Llama 3.2 1B Instruct model using the AI2's Reasoning Challenge (ARC) dataset demonstrates that SplitQuantV2 improves the accuracy of the INT4 quantization model by 11.76%p, matching the performance of the original floating-point model. Remarkably, SplitQuantV2 took only 2 minutes 6 seconds to preprocess the 1B model and perform linear INT4 quantization using only an Apple M4 CPU. SplitQuantV2 provides a practical solution for low-bit quantization on LLMs, especially when complex, computation-intensive algorithms are inaccessible due to hardware limitations or framework incompatibilities.
Comment: The paper introduces SplitQuantV2, a novel low-bit quantization method for LLMs, which aligns with the model compression criterion and demonstrates practical efficiency improvements.
Relevance: 9 Novelty: 8
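To make "splitting a layer into functionally equivalent, quantization-friendly structures" concrete, here is a rough sketch. It assumes a split of the weight matrix into three value ranges by quantiles and stores each part's mask so masked-out entries stay exactly zero; both choices are simplifications made for illustration and are not claimed to match SplitQuantV2's actual algorithm.

```python
import numpy as np

def linear_quantize(w, bits=4):
    """Asymmetric linear quantization followed by dequantization."""
    levels = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((w - lo) / scale) * scale + lo

def split_by_value(W, n_parts=3):
    """Split W into parts covering disjoint value ranges; the parts sum to W."""
    edges = np.quantile(W, np.linspace(0.0, 1.0, n_parts + 1))
    idx = np.clip(np.searchsorted(edges, W, side="right") - 1, 0, n_parts - 1)
    masks = [idx == i for i in range(n_parts)]
    return [np.where(m, W, 0.0) for m in masks], masks

def quantize_split(W, bits=4, n_parts=3):
    """Quantize each part over its own, narrower value range."""
    parts, masks = split_by_value(W, n_parts)
    out = np.zeros_like(W)
    for part, mask in zip(parts, masks):
        if mask.any():
            out[mask] = linear_quantize(part[mask], bits)
    return out

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
x = rng.normal(size=64)

parts, _ = split_by_value(W)
y_full = x @ W
y_split = sum(x @ p for p in parts)  # same output before quantization

err_plain = np.mean((linear_quantize(W) - W) ** 2)
err_split = np.mean((quantize_split(W) - W) ** 2)
print(np.allclose(y_full, y_split), err_plain, err_split)
```

Because each part spans a narrower range, its INT4 grid is finer, which is why the split version reconstructs the weights more closely in this toy setting.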
6. MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
ArXiv ID: 2503.07654
Authors: Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao
Abstract: Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on exploring the per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation inference of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks compared to FP16 baseline to 1.3 points on Llama-2-70B model. On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.
Comment: The paper focuses on a novel static quantization framework for LLMs, which aligns with the model compression criterion, particularly in sparsity and quantization.
Relevance: 9 Novelty: 8
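The idea of absorbing per-channel quantization scales into a neighbouring linear map rests on a simple identity, sketched below. This shows only the algebra under assumed symmetric 4-bit activation quantization with scales calibrated on the example data; it is not the paper's QSM procedure, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out = 4, 8, 5
# Activations whose channels have very different ranges.
X = rng.normal(size=(tokens, d_in)) * np.linspace(0.1, 10.0, d_in)
W = rng.normal(size=(d_in, d_out))

# Static per-channel scales, calibrated on X itself (an assumption).
qmax = 7  # symmetric signed 4-bit range [-7, 7]
scales = np.abs(X).max(axis=0) / qmax

X_q = np.clip(np.round(X / scales), -qmax, qmax)  # integer-valued activations

# Naive path: dequantize the activations at run time, then multiply.
y_dequant = (X_q * scales) @ W

# Folded path: fold the per-channel scales into the rows of W once, offline,
# so no dequantization step is needed before the matrix multiplication.
W_folded = scales[:, None] * W
y_folded = X_q @ W_folded

print(np.allclose(y_dequant, y_folded))
```

Since the scales are static, `W_folded` can be computed ahead of time, which removes per-token scaling work from autoregressive decoding.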
7. Route Sparse Autoencoder to Interpret Large Language Models
ArXiv ID: 2503.08200
Authors: Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Gojun Ma, Xiang Wang, Xiangnan He
Abstract: Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at https://github.com/swei2001/RouteSAEs.
Comment: The paper proposes a sparse autoencoder framework for LLM interpretability, which aligns with representation learning and interpretability of LLMs.
Relevance: 9 Novelty: 8
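A toy version of "a router that weights activations from several layers, feeding one shared sparse autoencoder" might look like the sketch below. The sum pooling, softmax router, TopK activation, and all shapes are assumptions chosen for brevity and should not be read as RouteSAE's actual design.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route_and_encode(layer_acts, W_router, W_enc, W_dec, k):
    """layer_acts: (n_layers, d_model) activations of one token."""
    pooled = layer_acts.sum(axis=0)       # hypothetical pooling over layers
    weights = softmax(pooled @ W_router)  # one routing weight per layer
    routed = weights @ layer_acts         # weighted mix, shape (d_model,)
    pre = routed @ W_enc                  # shared encoder, (n_features,)
    latents = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]            # keep the k largest pre-activations
    latents[top] = np.maximum(pre[top], 0.0)
    recon = latents @ W_dec               # shared decoder, (d_model,)
    return weights, latents, recon

rng = np.random.default_rng(0)
n_layers, d_model, n_features, k = 4, 16, 64, 8
layer_acts = rng.normal(size=(n_layers, d_model))
W_router = rng.normal(size=(d_model, n_layers)) * 0.1
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = rng.normal(size=(n_features, d_model)) * 0.1

weights, latents, recon = route_and_encode(layer_acts, W_router, W_enc, W_dec, k)
print(weights.round(3), np.count_nonzero(latents))
```

The router adds only a `d_model x n_layers` matrix on top of a single SAE, which is consistent with the abstract's point about minimal parameter overhead.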
8. CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement
ArXiv ID: 2503.07938
Authors: Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen
Abstract: While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE (Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the shared information between target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.
Comment: The paper introduces CAD-VAE, a novel disentangled VAE framework addressing fairness in representation learning, which aligns with foundational research in representation learning.
Relevance: 9 Novelty: 8
9. Accelerating MoE Model Inference with Expert Sharding
ArXiv ID: 2503.08467
Authors: Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, Martijn de Vos, Milos Vujasinovic
Abstract: Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. While prior work has focused on optimizing MoE training and decoder architectures, inference for encoder-based MoE models in a multi-GPU with expert parallelism setting remains underexplored. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts. Unlike existing approaches that rely on heuristic capacity factors or drop tokens, MoEShard evenly distributes computation across GPUs and ensures full token retention, maximizing utilization regardless of routing skewness. We achieve this through a strategic row- and column-wise decomposition of expert matrices. This reduces idle time and avoids bottlenecks caused by imbalanced expert assignments. Furthermore, MoEShard minimizes kernel launches by fusing decomposed expert computations, significantly improving throughput. We evaluate MoEShard against DeepSpeed on encoder-based architectures, demonstrating speedups of up to 6.4x in time to first token (TTFT). Our results show that tensor sharding, when properly applied to experts, is a viable and effective strategy for efficient MoE inference.
Comment: The paper addresses efficiency in Mixture-of-Experts (MoE) inference through expert sharding, which directly aligns with the model architecture and compression criteria. The tensor sharding approach is a novel contribution to MoE inference.
Relevance: 9 Novelty: 8
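The row- and column-wise decomposition mentioned in the abstract can be illustrated with the usual tensor-parallel pattern for a two-layer expert: split the first matrix by columns and the second by rows, then sum the partial outputs. Which matrix receives which split in MoEShard is an assumption here, and the shards are simulated in one process rather than placed on separate GPUs.

```python
import numpy as np

def expert_forward(x, W1, W2):
    """Reference expert FFN: relu(x @ W1) @ W2."""
    return np.maximum(x @ W1, 0.0) @ W2

def sharded_expert_forward(x, W1, W2, n_shards):
    """Every shard holds a slice of every expert and sees every token."""
    W1_shards = np.array_split(W1, n_shards, axis=1)  # column-wise split
    W2_shards = np.array_split(W2, n_shards, axis=0)  # row-wise split
    partials = [np.maximum(x @ a, 0.0) @ b
                for a, b in zip(W1_shards, W2_shards)]
    return sum(partials)  # stands in for a sum-reduction across devices

rng = np.random.default_rng(0)
tokens, d_model, d_hidden = 6, 16, 64
x = rng.normal(size=(tokens, d_model))
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_hidden, d_model))

y_ref = expert_forward(x, W1, W2)
y_sharded = sharded_expert_forward(x, W1, W2, n_shards=4)
print(np.allclose(y_ref, y_sharded))
```

The result is exact because the activation is elementwise, and since every shard processes an equal slice of each expert, the work per device does not depend on how skewed the routing is.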
10. ELECTRA: A Symmetry-breaking Cartesian Network for Charge Density Prediction with Floating Orbitals
ArXiv ID: 2503.08305
Authors: Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, Arghya Bhowmik
Abstract: We present the Electronic Tensor Reconstruction Algorithm (ELECTRA) - an equivariant model for predicting electronic charge densities using "floating" orbitals. Floating orbitals are a long-standing idea in the quantum chemistry community that promises more compact and accurate representations by placing orbitals freely in space, as opposed to centering all orbitals at the position of atoms. Finding ideal placements of these orbitals requires extensive domain knowledge though, which thus far has prevented widespread adoption. We solve this in a data-driven manner by training a Cartesian tensor network to predict orbital positions along with orbital coefficients. This is made possible through a symmetry-breaking mechanism that is used to learn position displacements with lower symmetry than the input molecule while preserving the rotation equivariance of the charge density itself. Inspired by recent successes of Gaussian Splatting in representing densities in space, we are using Gaussians as our orbitals and predict their weights and covariance matrices. Our method achieves a state-of-the-art balance between computational efficiency and predictive accuracy on established benchmarks.
Comment: The paper introduces a symmetry-breaking equivariant model for predicting electronic charge densities, which is foundational in AI for science and introduces a novel generative paradigm.
Relevance: 9 Novelty: 8
11. EFPC: Towards Efficient and Flexible Prompt Compression
ArXiv ID: 2503.07956
Authors: Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu
Abstract: The emergence of large language models (LLMs) like GPT-4 has revolutionized natural language processing (NLP), enabling diverse, complex tasks. However, extensive token counts lead to high computational and financial burdens. To address this, we propose Efficient and Flexible Prompt Compression (EFPC), a novel method unifying task-aware and task-agnostic compression for a favorable accuracy-efficiency trade-off. EFPC uses GPT-4 to generate compressed prompts and integrates them with original prompts for training. During training and inference, we selectively prepend user instructions and compress prompts based on predicted probabilities. EFPC is highly data-efficient, achieving significant performance with minimal data. Compared to the state-of-the-art method LLMLingua-2, EFPC achieves a 4.8% relative improvement in F1-score with 1% additional data at a 4x compression rate, and an 11.4% gain with 10% additional data on the LongBench single-doc QA benchmark. EFPC's unified framework supports broad applicability and enhances performance across various models, tasks, and domains, offering a practical advancement in NLP.
Comment: The paper proposes a novel prompt compression method for LLMs, which aligns with foundational research in model compression and efficiency.
Relevance: 9 Novelty: 8
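Compressing a prompt "based on predicted probabilities" at a fixed compression rate can be sketched as a keep-the-top-scoring-tokens filter. The probabilities below are made up, and the function `compress_prompt` is a hypothetical helper; how EFPC's model produces its scores is not shown.

```python
def compress_prompt(tokens, keep_probs, rate):
    """Keep roughly len(tokens) / rate tokens with the highest keep
    probability, preserving their original order."""
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    n_keep = max(1, round(len(tokens) / rate))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: keep_probs[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore reading order
    return [tokens[i] for i in kept]

tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"]
# Made-up per-token keep probabilities (an assumption for illustration).
keep_probs = [0.10, 0.30, 0.20, 0.90, 0.80, 0.10, 0.05, 0.40]

compressed = compress_prompt(tokens, keep_probs, rate=4)
print(compressed)
```

At a 4x rate, eight tokens are reduced to the two with the highest scores.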
12. ProTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models
ArXiv ID: 2503.08179
Authors: Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao
Abstract: Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProTeX empowers decoder-only LLMs to effectively address a diverse spectrum of protein-related tasks.
Comment: The paper introduces a framework for protein structure reasoning and editing using LLMs, which aligns with foundational AI for science and multimodal generative paradigms.
Relevance: 9 Novelty: 8
13. Accurate INT8 Training Through Dynamic Block-Level Fallback
ArXiv ID: 2503.08040
Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen
Abstract: Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already demonstrated its effectiveness on GPT2 models with block-level quantization. However, it struggles with modern Transformer variants incorporating GLU units. This is because those variants demonstrate complex distributions of activation outliers. To address the challenge, we propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings. Moreover, our method achieves a 1.57x end-to-end training speedup on RTX4090 GPUs.
Comment: The paper proposes a dynamic fallback quantization method for INT8 training, which aligns with the model compression criterion by addressing efficiency and robustness in low-bit training.
Relevance: 9 Novelty: 8
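Block-level quantization with a fallback for outlier blocks can be sketched as follows. The outlier test (block maximum compared with a multiple of the block median), the threshold value, and the use of float16 for the fallback are all assumptions made for illustration; the paper's criterion and its mixed-precision GEMM kernel are not reproduced here.

```python
import numpy as np

def quantize_blocks_with_fallback(x, block=4, threshold=6.0):
    """Per-block INT8 round trip of a 1-D float array.
    Blocks flagged as containing outliers are kept in 16-bit instead."""
    out = np.empty_like(x, dtype=np.float32)
    fell_back = []
    for start in range(0, len(x), block):
        blk = x[start:start + block]
        absmax = np.abs(blk).max()
        # Hypothetical outlier test, not the paper's criterion.
        is_outlier = absmax > threshold * (np.median(np.abs(blk)) + 1e-8)
        if is_outlier:
            out[start:start + block] = blk.astype(np.float16)
        else:
            scale = absmax / 127 if absmax > 0 else 1.0
            q = np.clip(np.round(blk / scale), -127, 127).astype(np.int8)
            out[start:start + block] = q * scale
        fell_back.append(bool(is_outlier))
    return out, fell_back

x = np.array([0.5, -0.4, 0.3, 0.6, 0.2, -0.1, 50.0, 0.3], dtype=np.float32)
out, fell_back = quantize_blocks_with_fallback(x)
print(fell_back, np.max(np.abs(out - x)))
```

Without the fallback, the second block's single large value would set an INT8 step of roughly 0.4 and wipe out its small entries; keeping just that block in 16-bit avoids this while the first block still uses INT8.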
14. The Space Between: On Folding, Symmetries and Sampling
ArXiv ID: 2503.08502
Authors: Michal Lewandowski, Bernhard Heinzl, Raphael Pisoni, Bernhard A. Moser
Abstract: Recent findings suggest that consecutive layers of neural networks with the ReLU activation function fold the input space during the learning process. While many works hint at this phenomenon, an approach to quantify the folding was only recently proposed by means of a space folding measure based on Hamming distance in the ReLU activation space. We generalize this measure to a wider class of activation functions through introduction of equivalence classes of input data, analyse its mathematical and computational properties and come up with an efficient sampling strategy for its implementation. Moreover, it has been observed that space folding values increase with network depth when the generalization error is low, but decrease when the error increases. This underpins that learned symmetries in the data manifold (e.g., invariance under reflection) become visible in terms of space folds, contributing to the network's generalization capacity. Inspired by these findings, we outline a novel regularization scheme that encourages the network to seek solutions characterized by higher folding values.
Comment: The paper explores space folding in neural networks, which aligns with representation learning and provides insights into training dynamics.
Relevance: 8 Novelty: 8
15. CIMAGE: Exploiting the Conditional Independence in Masked Graph Auto-encoders
ArXiv ID: 2503.07852
Authors: Jongwon Park, Heesoo Jung, Hogun Park
Abstract: Recent Self-Supervised Learning (SSL) methods encapsulating relational information via masking in Graph Neural Networks (GNNs) have shown promising performance. However, most existing approaches rely on random masking strategies in either feature or graph space, which may fail to capture task-relevant information fully. We posit that this limitation stems from an inability to achieve minimum redundancy between masked and unmasked components while ensuring maximum relevance of both to potential downstream tasks. Conditional Independence (CI) inherently satisfies the minimum redundancy and maximum relevance criteria, but its application typically requires access to downstream labels. To address this challenge, we introduce CIMAGE, a novel approach that leverages Conditional Independence to guide an effective masking strategy within the latent space. CIMAGE utilizes CI-aware latent factor decomposition to generate two distinct contexts, leveraging high-confidence pseudo-labels derived from unsupervised graph clustering. In this framework, the pretext task involves reconstructing the masked second context solely from the information provided by the first context. Our theoretical analysis further supports the superiority of CIMAGE's novel CI-aware masking method by demonstrating that the learned embedding exhibits approximate linear separability, which enables accurate predictions for the downstream task. Comprehensive evaluations across diverse graph benchmarks illustrate the advantage of CIMAGE, with notably higher average rankings on node classification and link prediction tasks. Notably, our proposed model highlights the under-explored potential of CI in enhancing graph SSL methodologies and offers enriched insights for effective graph representation learning.
Comment: The paper introduces CIMAGE, a novel CI-aware masking strategy for graph autoencoders, contributing to foundational research in representation learning for graphs.
Relevance: 8 Novelty: 8
16. Symbolic Neural Ordinary Differential Equations
ArXiv ID: 2503.08059
Authors: Xin Li, Chengli Zhao, Xue Zhang, Xiaojun Duan
Abstract: Differential equations are widely used to describe complex dynamical systems with evolving parameters in nature and engineering. Effectively learning a family of maps from the parameter function to the system dynamics is of great significance. In this study, we propose a novel learning framework of symbolic continuous-depth neural networks, termed Symbolic Neural Ordinary Differential Equations (SNODEs), to effectively and accurately learn the underlying dynamics of complex systems. Specifically, our learning framework comprises three stages: initially, pre-training a predefined symbolic neural network via a gradient flow matching strategy; subsequently, fine-tuning this network using Neural ODEs; and finally, constructing a general neural network to capture residuals. In this process, we apply the SNODEs framework to partial differential equation systems through Fourier analysis, achieving resolution-invariant modeling. Moreover, this framework integrates the strengths of symbolism and connectionism, boasting a universal approximation theorem while significantly enhancing interpretability and extrapolation capabilities relative to state-of-the-art baseline methods. We demonstrate this through experiments on several representative complex systems. Therefore, our framework can be further applied to a wide range of scientific problems, such as system bifurcation and control, reconstruction and forecasting, as well as the discovery of new equations.
Comment: The paper proposes Symbolic Neural ODEs, integrating symbolic and neural approaches for learning dynamical systems, which aligns with foundational research in representation learning.
Relevance: 8 Novelty: 8
17. How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?
ArXiv ID: 2503.08633
Authors: Gal Alon, Yehuda Dar
Abstract: Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch. In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width. We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples. Our results show that unlearning excels on overparameterized models, in terms of balancing between generalization and achieving the unlearning goal; although for bias removal this requires the unlearning method to use the unlearned examples. We further elucidate our error-based analysis by measuring how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.
Comment: This paper investigates the effect of overparameterization on machine unlearning, which aligns with foundational research on training dynamics in neural networks. It provides empirical insights into how unlearning interacts with the model parameterization level (DNN width).
Relevance: 8 Novelty: 7
18. Deep ARTMAP: Generalized Hierarchical Learning with Adaptive Resonance Theory
ArXiv ID: 2503.07641
Authors: Niklas M. Melton, Leonardo Enzo Brito da Silva, Sasha Petrenko, Donald C. Wunsch II
Abstract: This paper presents Deep ARTMAP, a novel extension of the ARTMAP architecture that generalizes the self-consistent modular ART (SMART) architecture to enable hierarchical learning (supervised and unsupervised) across arbitrary transformations of data. The Deep ARTMAP framework operates as a divisive clustering mechanism, supporting an arbitrary number of modules with customizable granularity within each module. Inter-ART modules regulate the clustering at each layer, permitting unsupervised learning while enforcing a one-to-many mapping from clusters in one layer to the next. While Deep ARTMAP reduces to both ARTMAP and SMART in particular configurations, it offers significantly enhanced flexibility, accommodating a broader range of data transformations and learning modalities.
Comment: The paper proposes Deep ARTMAP, a novel hierarchical learning framework extending ARTMAP. It introduces architectural innovations relevant to model architecture research.
Relevance: 8 Novelty: 7
19. Scaling Probabilistic Circuits via Data Partitioning
ArXiv ID: 2503.08141
Authors: Jonas Seng, Florian Peter Busch, Pooja Prasad, Devendra Singh Dhami, Martin Mundt, Kristian Kersting
Abstract: Probabilistic circuits (PCs) enable us to learn joint distributions over a set of random variables and to perform various probabilistic queries in a tractable fashion. Though the tractability property allows PCs to scale beyond non-tractable models such as Bayesian Networks, scaling training and inference of PCs to larger, real-world datasets remains challenging. To remedy the situation, we show how PCs can be learned across multiple machines by recursively partitioning a distributed dataset, thereby unveiling a deep connection between PCs and federated learning (FL). This leads to federated circuits (FCs) -- a novel and flexible federated learning (FL) framework that (1) allows one to scale PCs on distributed learning environments (2) train PCs faster and (3) unifies for the first time horizontal, vertical, and hybrid FL in one framework by re-framing FL as a density estimation problem over distributed datasets. We demonstrate FC's capability to scale PCs on various large-scale datasets. Also, we show FC's versatility in handling horizontal, vertical, and hybrid FL within a unified framework on multiple classification tasks.
Comment: The paper introduces Federated Circuits (FCs) for scaling probabilistic circuits, which aligns with foundational research in model efficiency and scalability.
Relevance: 8 Novelty: 7
20. A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training
ArXiv ID: 2503.08489
Authors: Chengcheng Yan, Jiawei Xu, Qingsong Wang, Zheng Peng
Abstract: The stochastic gradient descent (SGD) algorithm has achieved remarkable success in training deep learning models. However, it has several limitations, including susceptibility to vanishing gradients, sensitivity to input data, and a lack of robust theoretical guarantees. In recent years, alternating minimization (AM) methods have emerged as a promising alternative for model training by employing gradient-free approaches to iteratively update model parameters. Despite their potential, these methods often exhibit slow convergence rates. To address this challenge, we propose a novel Triple-Inertial Accelerated Alternating Minimization (TIAM) framework for neural network training. The TIAM approach incorporates a triple-inertial acceleration strategy with a specialized approximation method, facilitating targeted acceleration of different terms in each sub-problem optimization. This integration improves the efficiency of convergence, achieving superior performance with fewer iterations. Additionally, we provide a convergence analysis of the TIAM algorithm, including its global convergence properties and convergence rate. Extensive experiments validate the effectiveness of the TIAM method, showing significant improvements in generalization capability and computational efficiency compared to existing approaches, particularly when applied to the rectified linear unit (ReLU) and its variants.
Comment: The paper introduces a novel optimization framework (TIAM) for neural network training, which could provide insights into training dynamics and efficiency improvements.
Relevance: 8 Novelty: 7
21. HOFAR: High-Order Augmentation of Flow Autoregressive Transformers
ArXiv ID: 2503.08032
Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Abstract: Flow Matching and Transformer architectures have demonstrated remarkable performance in image generation tasks, with recent work FlowAR [Ren et al., 2024] synergistically integrating both paradigms to advance synthesis fidelity. However, current FlowAR implementations remain constrained by first-order trajectory modeling during the generation process. This paper introduces a novel framework that systematically enhances flow autoregressive transformers through high-order supervision. We provide theoretical analysis and empirical evaluation showing that our High-Order FlowAR (HOFAR) demonstrates measurable improvements in generation quality compared to baseline models. The proposed approach advances the understanding of flow-based autoregressive modeling by introducing a systematic framework for analyzing trajectory dynamics through high-order expansion.
Comment: The paper introduces high-order supervision for flow autoregressive transformers, which aligns with model architecture innovations.
Relevance: 8 Novelty: 7
22. Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors
ArXiv ID: 2503.08099
Authors: Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, Chun Yuan
Abstract: Model merging seeks to integrate task-specific expert models into a unified architecture while preserving multi-task generalization capabilities, yet parameter interference between constituent models frequently induces performance degradation. Although prior work has explored many merging strategies, resolving interference without additional data for retraining or test-time computation remains challenging. In this paper, we theoretically demonstrate that the task vectors of the linear layer constitute an approximate linear subspace for its corresponding input. Therefore, we can minimize interference under the guidance of task vectors. Based on this insight, we propose WUDI-Merging (Whoever started the interference shoUld enD It), a simple yet effective model merging method that eliminates interference without any additional data or rescaling coefficients. Comprehensive empirical evaluations across vision and language benchmarks demonstrate our method's superiority, achieving state-of-the-art performance in data-free model merging scenarios (average 10.9% improvement versus baseline methods) while even outperforming mainstream test-time adaptation approaches by 3.3%, and only very few computing resources are required. The code will be publicly available soon.
Comment: The paper proposes a model merging method guided by task vectors, which aligns with model architecture and efficiency innovations.
Relevance: 8 Novelty: 7
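For readers unfamiliar with the term, a task vector is commonly defined as the difference between fine-tuned and base weights. The sketch below shows plain task arithmetic (adding task vectors back onto the base model), which is the kind of data-free baseline such methods build on. It is not WUDI-Merging itself; the function names, the `scale` parameter, and the dictionary-of-arrays model format are assumptions made for illustration.

```python
import numpy as np

def task_vector(finetuned, base):
    """Task vector: fine-tuned weights minus base weights, per parameter."""
    return {name: finetuned[name] - base[name] for name in base}

def merge_by_task_arithmetic(base, finetuned_models, scale=1.0):
    """Naive data-free merge: base + scale * sum of task vectors.

    Hypothetical helper for illustration only; it does nothing to
    resolve interference between the task vectors.
    """
    merged = {name: w.copy() for name, w in base.items()}
    for model in finetuned_models:
        tau = task_vector(model, base)
        for name in merged:
            merged[name] += scale * tau[name]
    return merged

# Toy example with a single 2x2 linear layer.
base = {"linear.weight": np.zeros((2, 2))}
model_a = {"linear.weight": np.array([[1.0, 0.0], [0.0, 0.0]])}
model_b = {"linear.weight": np.array([[0.0, 0.0], [0.0, 2.0]])}
merged = merge_by_task_arithmetic(base, [model_a, model_b])
```

In this toy case the two task vectors touch disjoint entries, so there is no interference; interference arises when they overlap, which is the situation the paper targets.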
23. Benign Overfitting and the Geometry of the Ridge Regression Solution in Binary Classification
ArXiv ID: 2503.07966
Authors: Alexander Tsigler, Luiz F. O. Chamon, Spencer Frei, Peter L. Bartlett
Abstract: In this work, we investigate the behavior of ridge regression in an overparameterized binary classification task. We assume examples are drawn from (anisotropic) class-conditional cluster distributions with opposing means and we allow for the training labels to have a constant level of label-flipping noise. We characterize the classification error achieved by ridge regression under the assumption that the covariance matrix of the cluster distribution has a high effective rank in the tail. We show that ridge regression has qualitatively different behavior depending on the scale of the cluster mean vector and its interaction with the covariance matrix of the cluster distributions. In regimes where the scale is very large, the conditions that allow for benign overfitting turn out to be the same as those for the regression task. We additionally provide insights into how the introduction of label noise affects the behavior of the minimum norm interpolator (MNI). The optimal classifier in this setting is a linear transformation of the cluster mean vector and in the noiseless setting the MNI approximately learns this transformation. On the other hand, the introduction of label noise can significantly change the geometry of the solution while preserving the same qualitative behavior.
Comment: The paper investigates ridge regression in overparameterized settings and provides theoretical insights into benign overfitting, which aligns with foundational research in representation learning.
Relevance: 8 Novelty: 7
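The setting in this abstract can be reproduced in a few lines. The sketch below draws two clusters with opposing means, flips a fraction of the labels, and fits ridge regression and the minimum norm interpolator (MNI) in the overparameterized regime. The dimensions, the isotropic noise, the cluster mean, and the 10% flip rate are all assumptions chosen for illustration; the paper studies anisotropic covariances.

```python
import numpy as np

def ridge_classifier(X, y, lam):
    """Ridge solution in dual form: w = X^T (X X^T + lam I)^{-1} y."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha

def min_norm_interpolator(X, y):
    """Minimum-norm solution of X w = y (the limit of ridge as lam -> 0)."""
    return np.linalg.pinv(X) @ y

rng = np.random.default_rng(0)
n, d = 20, 200                       # overparameterized: d > n
mu = np.zeros(d)
mu[0] = 3.0                          # hypothetical cluster mean vector
clean = rng.choice([-1.0, 1.0], size=n)
X = clean[:, None] * mu + rng.standard_normal((n, d))  # isotropic noise (assumption)
flip = rng.random(n) < 0.1           # constant level of label-flipping noise
y = np.where(flip, -clean, clean)

w_mni = min_norm_interpolator(X, y)
w_ridge = ridge_classifier(X, y, lam=1e-8)
```

With `d > n` the MNI fits every training label, including the flipped ones; whether test error stays low despite that is the benign-overfitting question the paper analyzes.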
24. Learning and Evaluating Hierarchical Feature Representations
ArXiv ID: 2503.07853
Authors: Depanshu Sani, Saket Anand
Abstract: Hierarchy-aware representations ensure that the semantically closer classes are mapped closer in the feature space, thereby reducing the severity of mistakes while enabling consistent coarse-level class predictions. Towards this end, we propose a novel framework, Hierarchical Composition of Orthogonal Subspaces (Hier-COS), which learns to map deep feature embeddings into a vector space that is, by design, consistent with the structure of a given taxonomy tree. Our approach augments neural network backbones with a simple transformation module that maps learned discriminative features to subspaces defined using a fixed orthogonal frame. This construction naturally improves the severity of mistakes and promotes hierarchical consistency. Furthermore, we highlight the fundamental limitations of existing hierarchical evaluation metrics popularly used by the vision community and introduce a preference-based metric, Hierarchically Ordered Preference Score (HOPS), to overcome these limitations. We benchmark our method on multiple large and challenging datasets having deep label hierarchies (ranging from 3 - 12 levels) and compare with several baselines and SOTA. Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art hierarchical performance across all the datasets while simultaneously beating top-1 accuracy in all but one case. We also demonstrate the performance of a Vision Transformer (ViT) backbone and show that learning a transformation module alone can map the learned features from a pre-trained ViT to Hier-COS and yield substantial performance benefits.
Comment: The paper introduces a novel framework for hierarchical feature representation learning, which aligns with the representation learning criterion. The proposed Hier-COS method and new evaluation metric (HOPS) add methodological contributions.
Relevance: 8 Novelty: 7
25. Personalized Convolutional Dictionary Learning of Physiological Time Series
ArXiv ID: 2503.07687
Authors: Axel Roques, Samuel Gruffaz, Kyurae Kim, Alain Oliviero-Durmus, Laurent Oudre
Abstract: Human physiological signals tend to exhibit both global and local structures: the former are shared across a population, while the latter reflect inter-individual variability. For instance, kinetic measurements of the gait cycle during locomotion present common characteristics, although idiosyncrasies may be observed due to biomechanical disposition or pathology. To better represent datasets with local-global structure, this work extends Convolutional Dictionary Learning (CDL), a popular method for learning interpretable representations, or dictionaries, of time-series data. In particular, we propose Personalized CDL (PerCDL), in which a local dictionary models local information as a personalized spatiotemporal transformation of a global dictionary. The transformation is learnable and can combine operations such as time warping and rotation. Formal computational and statistical guarantees for PerCDL are provided and its effectiveness on synthetic and real human locomotion data is demonstrated.
Comment: The paper extends Convolutional Dictionary Learning (CDL) with a personalized approach, which aligns with representation learning. The focus on local-global structures and formal guarantees adds methodological novelty.
Relevance: 8 Novelty: 7
26. Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
ArXiv ID: 2503.07661
Authors: Wei Junhao, Yu Zhe, Sakuma Jun
Abstract: Model merging is a technique that combines multiple finetuned models into a single model without additional training, allowing a free-rider to cheaply inherit specialized capabilities. This study investigates methodologies to suppress unwanted model merging by free-riders. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. In contrast, we propose a first proactive defense against model merging. Specifically, our defense method modifies the model parameters so that the model is disrupted if the model is merged with any other model, while its functionality is kept unchanged if not merged with others. Our approach consists of two modules, rearranging MLP parameters and scaling attention heads, which push the model out of the shared basin in parameter space, causing the merging performance with other models to degrade significantly. We conduct extensive experiments on image classification, image generation, and text classification to demonstrate that our defense severely disrupts merging while retaining the functionality of the post-protect model. Moreover, we analyze potential adaptive attacks and further propose a dropout-based pruning to improve our proposal's robustness.
Comment: The paper proposes a novel defense mechanism against model merging by modifying model parameters, which aligns with foundational research in model architecture and parameter manipulation.
Relevance: 8 Novelty: 7
27. Aligning Text to Image in Diffusion Models is Easier Than You Think
ArXiv ID: 2503.08250
Authors: Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye
Abstract: While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations still remains. Although many approaches have attempted to address this issue by fine-tuning models using various reward models, etc., we revisit the challenge from the perspective of representation alignment, an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, are suboptimal from the standpoint of representation alignment. Instead, a better alignment can be achieved through contrastive learning that leverages both positive and negative pairs. To achieve this efficiently even with pretrained models, we introduce a lightweight contrastive fine-tuning strategy called SoftREPA that uses soft text tokens. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.
Comment: The paper proposes a lightweight contrastive fine-tuning strategy for text-to-image diffusion models, which aligns with representation learning and introduces methodological improvements.
Relevance: 8 Novelty: 7
28. MinGRU-Based Encoder for Turbo Autoencoder Frameworks
ArXiv ID: 2503.08451
Authors: Rick Fritschek, Rafael F. Schaefer
Abstract: Early neural channel coding approaches leveraged dense neural networks with one-hot encodings to design adaptive encoder-decoder pairs, improving block error rate (BLER) and automating the design process. However, these methods struggled with scalability as the size of message sets and block lengths increased. TurboAE addressed this challenge by focusing on bit-sequence inputs rather than symbol-level representations, transforming the scalability issue associated with large message sets into a sequence modeling problem. While recurrent neural networks (RNNs) were a natural fit for sequence processing, their reliance on sequential computations made them computationally expensive and inefficient for long sequences. As a result, TurboAE adopted convolutional network blocks, which were faster to train and more scalable, but lacked the sequential modeling advantages of RNNs. Recent advances in efficient RNN architectures, such as minGRU and minLSTM, and structured state space models (SSMs) like S4 and S6, overcome these limitations by significantly reducing memory and computational overhead. These models enable scalable sequence processing, making RNNs competitive for long-sequence tasks. In this work, we revisit RNNs for Turbo autoencoders by integrating the lightweight minGRU model with a Mamba block from SSMs into a parallel Turbo autoencoder framework. Our results demonstrate that this hybrid design matches the performance of convolutional network-based Turbo autoencoder approaches for short sequences while significantly improving scalability and training efficiency for long block lengths. This highlights the potential of efficient RNNs in advancing neural channel coding for long-sequence scenarios.
Comment: The paper revisits RNNs for Turbo autoencoders and integrates efficient RNN architectures, which aligns with foundational research in model architecture and sequence modeling.
Relevance: 8 Novelty: 7
29. Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?
ArXiv ID: 2503.07903
Authors: Payel Das, Ching-Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano
Abstract: Large language models often expose their brittleness in reasoning tasks, especially while executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture, in which the memory learns the relative order of facts in context, and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments performed under a variety of challenging scenarios, including the presence of long distractor text or target answer changes in test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization of MemReasoner is achieved using none-to-weak supporting fact supervision (using none and 1% of supporting facts for one- and two-hop tasks, respectively). In contrast, baseline models overall struggle to generalize and benefit far less from using full supporting fact supervision. The results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language models' context processing ability toward reasoning tasks.
Comment: The paper introduces a memory-augmented LLM architecture for reasoning tasks, which provides insights into LLM behavior and architecture, making it relevant to foundational research.
Relevance: 8 Novelty: 7
30. Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning
ArXiv ID: 2503.08636
Authors: Hubert Baniecki, Przemyslaw Biecek
Abstract: A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called "intrinsically (aka inherently) interpretable" models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of prototype-based networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.
Comment: The paper critiques the interpretability of prototype-based networks and highlights their vulnerabilities, which aligns with emerging trends in interpretability but lacks a strong foundational contribution.
Relevance: 7 Novelty: 7
31. Two-Dimensional Deep ReLU CNN Approximation for Korobov Functions: A Constructive Approach
ArXiv ID: 2503.07976
Authors: Qin Fang, Lei Shi, Min Xu, Ding-Xuan Zhou
Abstract: This paper investigates approximation capabilities of two-dimensional (2D) deep convolutional neural networks (CNNs), with Korobov functions serving as a benchmark. We focus on 2D CNNs, comprising multi-channel convolutional layers with zero-padding and ReLU activations, followed by a fully connected layer. We propose a fully constructive approach for building 2D CNNs to approximate Korobov functions and provide rigorous analysis of the complexity of the constructed networks. Our results demonstrate that 2D CNNs achieve near-optimal approximation rates under the continuous weight selection model, significantly alleviating the curse of dimensionality. This work provides a solid theoretical foundation for 2D CNNs and illustrates their potential for broader applications in function approximation.
Comment: The paper provides a theoretical analysis of 2D CNNs for approximating Korobov functions, contributing to foundational understanding of CNN approximation capabilities.
Relevance: 7 Novelty: 7
32. Median Consensus Embedding for Dimensionality Reduction
ArXiv ID: 2503.08103
Authors: Yui Tomo, Daisuke Yoneoka
Abstract: This study proposes median consensus embedding (MCE) to address variability in low-dimensional embeddings caused by random initialization in dimensionality reduction techniques such as t-distributed stochastic neighbor embedding. MCE is defined as the geometric median of multiple embeddings. By assuming multiple embeddings as independent and identically distributed random samples and applying large deviation theory, we prove that MCE achieves consistency at an exponential rate. Furthermore, we develop a practical algorithm to implement MCE by constructing a distance function between embeddings based on the Frobenius norm of the pairwise distance matrix of data points. Application to real-world data demonstrates that MCE converges rapidly and significantly reduces instability. These results confirm that MCE effectively mitigates random initialization issues in embedding methods.
Comment: The paper introduces a novel method for reducing variability in dimensionality reduction techniques, which is relevant to representation learning but focuses on a specific embedding stability issue.
Relevance: 7 Novelty: 7
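The two ingredients named in the abstract, a distance between embeddings based on the Frobenius norm of pairwise distance matrices and a geometric median, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the median is computed over distance matrices with a standard Weiszfeld iteration, and recovering a final embedding from the median matrix is left out. All function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of an embedding X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def embedding_distance(X, Y):
    """Frobenius norm of the difference of pairwise distance matrices.

    Invariant to rotations and translations of either embedding.
    """
    return np.linalg.norm(pairwise_distances(X) - pairwise_distances(Y))

def geometric_median(mats, n_iter=200, tol=1e-10, eps=1e-12):
    """Weiszfeld iteration for the geometric median of equally shaped matrices."""
    mats = np.asarray(mats, dtype=float)
    median = mats.mean(axis=0)
    for _ in range(n_iter):
        dists = np.array([np.linalg.norm(m - median) for m in mats])
        weights = 1.0 / np.maximum(dists, eps)   # guard against division by zero
        new_median = np.tensordot(weights, mats, axes=1) / weights.sum()
        shift = np.linalg.norm(new_median - median)
        median = new_median
        if shift < tol:
            break
    return median
```

Because the median, unlike the mean, is not pulled far by a minority of outlying runs, a few badly initialized embeddings have limited influence on the consensus.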
33. Strengthening the Internal Adversarial Robustness in Lifted Neural Networks
ArXiv ID: 2503.07818
Authors: Christopher Zach
Abstract: Lifted neural networks (i.e. neural architectures explicitly optimizing over respective network potentials to determine the neural activities) can be combined with a type of adversarial training to gain robustness for internal as well as input layers, in addition to improved generalization performance. In this work we first investigate how adversarial robustness in this framework can be further strengthened by solely modifying the training loss. In a second step we fix some remaining limitations and arrive at a novel training loss for lifted neural networks, that combines targeted and untargeted adversarial perturbations.
Comment: The paper explores adversarial robustness in lifted neural networks, which aligns with representation learning and training dynamics but is not a major breakthrough.
Relevance: 7 Novelty: 6
34. Accelerated Distributed Optimization with Compression and Error Feedback
ArXiv ID: 2503.08427
Authors: Yuan Gao, Anton Rodomanov, Jeremy Rack, Sebastian U. Stich
Abstract: Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization with contractive compression remains limited, particularly in conjunction with Nesterov acceleration -- a cornerstone for achieving faster convergence in optimization. In this paper, we propose a novel algorithm, ADEF (Accelerated Distributed Error Feedback), which integrates Nesterov acceleration, contractive compression, error feedback, and gradient difference compression. We prove that ADEF achieves the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex regime. Numerical experiments validate our theoretical findings and demonstrate the practical efficacy of ADEF in reducing communication costs while maintaining fast convergence.
Comment: The paper proposes a distributed optimization algorithm with compression and error feedback, which aligns with model compression and efficiency but is not groundbreaking.
Relevance: 7 Novelty: 6
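Two of the building blocks listed in the abstract, contractive compression and error feedback, are easy to show in isolation. The sketch below uses top-k sparsification as the compressor and the classic error-feedback update, in which a client adds back whatever the compressor dropped in earlier rounds. It is not ADEF: Nesterov acceleration and gradient difference compression are omitted, and the class and function names are assumptions.

```python
import numpy as np

def top_k(v, k):
    """Contractive compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

class ErrorFeedbackClient:
    """Client that transmits compressed updates and remembers what was dropped."""

    def __init__(self, dim, k):
        self.error = np.zeros(dim)   # accumulated compression error
        self.k = k

    def compress(self, grad):
        corrected = grad + self.error         # re-inject previously dropped mass
        message = top_k(corrected, self.k)    # what is actually sent to the server
        self.error = corrected - message      # carry the remainder to the next round
        return message

client = ErrorFeedbackClient(dim=4, k=1)
g = np.array([1.0, -3.0, 2.0, 0.5])
first = client.compress(g)
second = client.compress(g)
```

The invariant worth noting is that the sum of all transmitted messages plus the stored error always equals the sum of all gradients seen so far, so nothing is lost permanently, only delayed.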
35. Learning to Match Unpaired Data with Minimum Entropy Coupling
ArXiv ID: 2503.08501
Authors: Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
Abstract: Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which is a significant challenge to learn a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint Entropy, while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application for cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint Entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.
Comment: The paper addresses unpaired data matching using Minimum Entropy Coupling and diffusion models, which is relevant to representation learning but not highly novel.
Relevance: 7 Novelty: 6
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
- Representation Learning
  - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
  - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
- Model Architecture
  - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations.
  - Irrelevant: Merely using existing architectures for a certain task without insights into the structure itself.
- Model Compression
  - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
  - Irrelevant: Straightforward applications of existing compression methods to new tasks.
- Large Language Models (LLMs)
  - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability.
  - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
- AI for Science
  - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
  - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
- Emerging Trends
  - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
  - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers, such as those using MoE to solve a real-world problem.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (This is usually a very rare concept that belongs to fundamental research.)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications to existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with unclear motivation; purely engineering work such as a new benchmark or dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
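The two rubrics above split the 1-10 range into the same five bands, so a score can be mapped back to its label with a small lookup. The following is a minimal sketch rather than part of the original prompt: the band labels are copied from the rubric, while the names `RELEVANCE_BANDS`, `NOVELTY_BANDS`, and `band_label` are hypothetical, and scores are assumed to be integers.

```python
# Band labels are copied from the rubric; the helper and its names are
# illustrative assumptions, not part of the original prompt.
RELEVANCE_BANDS = {
    (9, 10): "Completely Relevant",
    (7, 8): "Relevant",
    (5, 6): "Borderline",
    (3, 4): "Irrelevant",
    (1, 2): "Ignore",
}

NOVELTY_BANDS = {
    (9, 10): "Breakthrough",
    (7, 8): "Improvements",
    (5, 6): "Borderline",
    (3, 4): "Tangential",
    (1, 2): "Low",
}


def band_label(score, bands):
    """Return the rubric label whose inclusive range contains the score."""
    if not isinstance(score, int):
        # Assumption: scores are whole numbers, since each band is an integer range.
        raise ValueError(f"score must be an integer, got {score!r}")
    for (low, high), label in bands.items():
        if low <= score <= high:
            return label
    raise ValueError(f"score must be between 1 and 10, got {score!r}")


# The two axes are looked up independently, mirroring the rubric's
# requirement that relevance and novelty not be confused.
print(band_label(8, RELEVANCE_BANDS), "/", band_label(4, NOVELTY_BANDS))
```

Keeping one table per axis makes it harder to apply a relevance label to a novelty score by accident.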
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
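The output format described above can be checked mechanically before the scores are used. The sketch below is one possible validator, not the pipeline's actual code: the function name `parse_scored_papers` is hypothetical, the sample record uses a placeholder ID and comment rather than a real paper, and scores are assumed to be integers.

```python
import json

# The four field names come from the instructions above.
REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_scored_papers(jsonl_text):
    """Parse a JSONL response and validate each record (hypothetical helper)."""
    records = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            # Assumption: scores are integers in the inclusive range 1-10.
            if not isinstance(score, int) or not 1 <= score <= 10:
                raise ValueError(f"line {line_number}: {key} must be 1-10")
        records.append(record)
    return records


# Placeholder record for illustration only; the ID and comment are not real.
sample = (
    '{"ARXIVID": "0000.00000", "COMMENT": "Placeholder comment.", '
    '"RELEVANCE": 9, "NOVELTY": 7}\n'
)
papers = parse_scored_papers(sample)
print(papers[0]["ARXIVID"], papers[0]["RELEVANCE"], papers[0]["NOVELTY"])
```

Failing loudly on a malformed line, instead of skipping it, makes it obvious when the model drifts from the requested format.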