Personalized Daily Arxiv Papers 3/28/2025

[gpt-4o]	Prompt	Completion	Total
Token	35005	4344	39349
Cost	$0.09	$0.04	$0.13

Total arXiv papers: 390

Total scanned papers: 227

Total relevant papers: 17

Table of contents with paper titles:

MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness Authors: Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun (Eric) Liang, Xiang Chen
Exploring the Energy Landscape of RBMs: Reciprocal Space Insights into Bosons, Hierarchical Learning and Symmetry Breaking Authors: J. Quetzalcóatl Toledo-Marin, Anindita Maiti, Geoffrey C. Fox, Roger G. Melko
Collab: Controlled Decoding using Mixture of Agents for LLM Alignment Authors: Souradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, Sumitra Ganesh
Neuroplasticity in Artificial Intelligence -- An Overview and Inspirations on Drop In & Out Learning Authors: Yupei Li, Manuel Milling, Björn W. Schuller
Shared Global and Local Geometry of Language Model Embeddings Authors: Andrew Lee, Melanie Weber, Fernanda Viégas, Martin Wattenberg
HOT: Hadamard-based Optimized Training Authors: Seonggon Kim, Juncheol Shin, Seung-taek Woo, Eunhyeok Park
Nonlinear Multiple Response Regression and Learning of Latent Spaces Authors: Ye Tian, Sanyou Wu, Long Feng
How do language models learn facts? Dynamics, curricula and hallucinations Authors: Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De
Stochastic Engrams for Efficient Continual Learning with Binarized Neural Networks Authors: Isabelle Aguilar, Luis Fernando Herbozo Contreras, Omid Kavehei
Scalable Expectation Estimation with Subtractive Mixture Models Authors: Lena Zellinger, Nicola Branchini, Víctor Elvira, Antonio Vergari
Consistent Multigroup Low-Rank Approximation Authors: Antonis Matakos, Martino Ciaperoni, Heikki Mannila
F-INR: Functional Tensor Decomposition for Implicit Neural Representations Authors: Sai Karthikeya Vemuri, Tim Büchner, Joachim Denzler
Uncertainty propagation in feed-forward neural network models Authors: Jeremy Diamzon, Daniele Venturi
Outlier dimensions favor frequent tokens in language model Authors: Iuri Macocco, Nora Graichen, Gemma Boleda, Marco Baroni
Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets Authors: Alexander Levine, Peter Stone, Amy Zhang
Effective Skill Unlearning through Intervention and Abstention Authors: Yongce Li, Chung-En Sun, Tsui-Wei Weng
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models Authors: Pin-Yu Chen, Han Shen, Payel Das, Tianyi Chen

1. MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness

ArXiv ID: 2503.21135

Authors: Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun (Eric) Liang, Xiang Chen

Abstract: With the advances in artificial intelligence, Mixture-of-Experts (MoE) has become the main form of Large Language Models (LLMs), and its demand for model compression is increasing. Quantization is an effective method that not only compresses the models but also significantly accelerates their performance. Existing quantization methods have gradually shifted the focus from parameter scaling to the analysis of data distributions. However, their analysis is designed for dense LLMs and relies on the simple one-model-all-data mapping, which is unsuitable for MoEs. This paper proposes a new quantization framework called MoQa. MoQa decouples the data-model distribution complexity of MoEs in multiple analysis stages, quantitatively revealing the dynamics during sparse data activation, data-parameter mapping, and inter-expert correlations. Based on these, MoQa identifies particular experts' and parameters' significance with optimal data-model distribution awareness and proposes a series of fine-grained mix-quantization strategies adaptive to various data activation and expert combination scenarios. Moreover, MoQa discusses the limitations of existing quantization and analyzes the impact of each stage analysis, showing novel insights for MoE quantization. Experiments show that MoQa achieves a 1.69~2.18 perplexity decrease in language modeling tasks and a 1.58%~8.91% accuracy improvement in zero-shot inference tasks. We believe MoQa will play a role in future MoE construction, optimization, and compression.

Comment: This paper introduces a novel quantization framework for MoE models, addressing compression and efficiency challenges specific to sparse data activation and expert combinations. It aligns with the model compression and MoE criteria.

Relevance: 10 Novelty: 8

2. Exploring the Energy Landscape of RBMs: Reciprocal Space Insights into Bosons, Hierarchical Learning and Symmetry Breaking

ArXiv ID: 2503.21536

Authors: J. Quetzalcóatl Toledo-Marin, Anindita Maiti, Geoffrey C. Fox, Roger G. Melko

Abstract: Deep generative models have become ubiquitous due to their ability to learn and sample from complex distributions. Despite the proliferation of various frameworks, the relationships among these models remain largely unexplored, a gap that hinders the development of a unified theory of AI learning. We address two central challenges: clarifying the connections between different deep generative models and deepening our understanding of their learning mechanisms. We focus on Restricted Boltzmann Machines (RBMs), known for their universal approximation capabilities for discrete distributions. By introducing a reciprocal space formulation, we reveal a connection between RBMs, diffusion processes, and coupled Bosons. We show that at initialization, the RBM operates at a saddle point, where the local curvature is determined by the singular values, whose distribution follows the Marcenko-Pastur law and exhibits rotational symmetry. During training, this rotational symmetry is broken due to hierarchical learning, where different degrees of freedom progressively capture features at multiple levels of abstraction. This leads to a symmetry breaking in the energy landscape, reminiscent of Landau theory. This symmetry breaking in the energy landscape is characterized by the singular values and the weight matrix eigenvector matrix. We derive the corresponding free energy in a mean-field approximation. We show that in the limit of infinite size RBM, the reciprocal variables are Gaussian distributed. Our findings indicate that in this regime, there will be some modes for which the diffusion process will not converge to the Boltzmann distribution. To illustrate our results, we trained replicas of RBMs with different hidden layer sizes using the MNIST dataset. Our findings bridge the gap between disparate generative frameworks and also shed light on the processes underpinning learning in generative models.

Comment: The paper explores the energy landscape of RBMs and connects them to broader theoretical frameworks like symmetry breaking and hierarchical learning, which aligns with representation learning and theoretical insights into generative models.

Relevance: 9 Novelty: 9

3. Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

ArXiv ID: 2503.21720

Authors: Souradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, Sumitra Ganesh

Abstract: Alignment of Large Language Models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen the test-time performance w.r.t. the target task, we propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture of agent collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrates the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.

Comment: The paper proposes Collab, an inference-time alignment method that selects, at each token, among multiple off-the-shelf aligned LLMs using a long-term utility metric. This aligns with foundational research on LLM alignment and controlled decoding; the token-level policy-switching mechanism is the main novelty.

Relevance: 9 Novelty: 9
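
The token-level selection described in the Collab abstract can be illustrated with a toy loop. This is only a sketch under stated assumptions: the `Agent` and `Utility` interfaces, the function name `collaborative_decode`, and the toy agents are all hypothetical, and the utility function merely stands in for whatever long-term utility estimate the paper uses.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (not from the paper): an agent proposes the next
# token for a prefix; the utility scores a (prefix, token) pair and stands
# in for a learned long-term utility estimate.
Agent = Callable[[Sequence[str]], str]
Utility = Callable[[Sequence[str], str], float]


def collaborative_decode(
    agents: List[Agent],
    utility: Utility,
    prompt: Sequence[str],
    max_tokens: int,
    eos: str = "<eos>",
) -> Tuple[List[str], List[int]]:
    """At every step, keep the proposal with the highest utility score."""
    tokens = list(prompt)
    chosen: List[int] = []  # index of the agent selected at each step
    for _ in range(max_tokens):
        proposals = [agent(tokens) for agent in agents]
        scores = [utility(tokens, tok) for tok in proposals]
        best = max(range(len(agents)), key=scores.__getitem__)
        tokens.append(proposals[best])
        chosen.append(best)
        if proposals[best] == eos:
            break
    return tokens, chosen


# Toy agents: each always proposes the same token.
def agent_a(prefix: Sequence[str]) -> str:
    return "a"


def agent_b(prefix: Sequence[str]) -> str:
    return "b"


# Toy utility: penalize repeating the previous token, otherwise prefer "a".
def toy_utility(prefix: Sequence[str], token: str) -> float:
    if prefix and prefix[-1] == token:
        return -1.0
    return {"a": 1.0, "b": 0.9}[token]
```

With these toy definitions, `collaborative_decode([agent_a, agent_b], toy_utility, ["x"], 3)` alternates between the two agents, which is the policy-switching behavior the abstract describes in miniature.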

4. Neuroplasticity in Artificial Intelligence -- An Overview and Inspirations on Drop In & Out Learning

ArXiv ID: 2503.21419

Authors: Yupei Li, Manuel Milling, Björn W. Schuller

Abstract: Artificial Intelligence (AI) has achieved new levels of performance and spread in public usage with the rise of deep neural networks (DNNs). Initially inspired by human neurons and their connections, NNs have become the foundation of AI models for many advanced architectures. However, some of the most integral processes in the human brain, particularly neurogenesis and neuroplasticity, in addition to the more widespread neuroapoptosis, have largely been ignored in DNN architecture design. Instead, contemporary AI development predominantly focuses on constructing advanced frameworks, such as large language models, which retain a static structure of neural connections during training and inference. In this light, we explore how neurogenesis, neuroapoptosis, and neuroplasticity can inspire future AI advances. Specifically, we examine analogous activities in artificial NNs, introducing the concepts of "dropin" for neurogenesis and revisiting "dropout" and structural pruning for neuroapoptosis. We additionally suggest neuroplasticity combining the two for future large NNs in "life-long learning" settings following the biological inspiration. We conclude by advocating for greater research efforts in this interdisciplinary domain and identifying promising directions for future exploration.

Comment: The paper explores neuroplasticity-inspired mechanisms like 'dropin' and 'dropout' for neural networks, which aligns with emerging trends and foundational research in model architecture and lifelong learning.

Relevance: 9 Novelty: 8

5. Shared Global and Local Geometry of Language Model Embeddings

ArXiv ID: 2503.21073

Authors: Andrew Lee, Melanie Weber, Fernanda Viégas, Martin Wattenberg

Abstract: Researchers have recently suggested that models share common representations. In this work, we find that the token embeddings of language models exhibit common geometric structure. First, we find "global" similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each token embedding. Our intrinsic dimension measure demonstrates that token embeddings lie on a lower dimensional manifold. We qualitatively show that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Both characterizations allow us to find similarities in the local geometry of token embeddings. Perhaps most surprisingly, we find that alignment in token embeddings persists through the hidden states of language models, allowing us to develop an application for interpretability. Namely, we empirically demonstrate that steering vectors from one language model can be transferred to another, despite the two models having different dimensions.

Comment: The paper explores the geometric structure of token embeddings in language models, providing insights into representation learning and interpretability. It aligns with the criterion of understanding how deep networks encode information.

Relevance: 9 Novelty: 8

6. HOT: Hadamard-based Optimized Training

ArXiv ID: 2503.21261

Authors: Seonggon Kim, Juncheol Shin, Seung-taek Woo, Eunhyeok Park

Abstract: It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6 times acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.

Comment: The paper introduces Hadamard-based optimizations for backpropagation, focusing on memory and computational efficiency, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
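
As a rough illustration of what "Hadamard quantization" in the HOT abstract refers to, the sketch below rotates a vector with an orthonormal Walsh-Hadamard transform, quantizes it uniformly, and rotates back. It is a generic textbook-style sketch, not the paper's kernel: the function names, the symmetric 4-bit quantizer, and the per-vector scale are all assumptions made for this example.

```python
import math
from typing import List, Tuple


def fwht(values: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize(values: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric uniform quantizer with one scale per vector (an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def hadamard_quantize_roundtrip(
    values: List[float], bits: int = 4
) -> Tuple[List[float], float]:
    """Transform, quantize, dequantize, and transform back."""
    codes, scale = quantize(fwht(values), bits)
    restored = fwht([c * scale for c in codes])
    return restored, scale
```

Because the normalized transform is orthogonal, the Euclidean reconstruction error equals the quantization error in the transformed domain, so it is bounded by `sqrt(n) * scale / 2` for a vector of length `n`.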

7. Nonlinear Multiple Response Regression and Learning of Latent Spaces

ArXiv ID: 2503.21608

Authors: Ye Tian, Sanyou Wu, Long Feng

Abstract: Identifying low-dimensional latent structures within high-dimensional data has long been a central topic in the machine learning community, driven by the need for data compression, storage, transmission, and deeper data understanding. Traditional methods, such as principal component analysis (PCA) and autoencoders (AE), operate in an unsupervised manner, ignoring label information even when it is available. In this work, we introduce a unified method capable of learning latent spaces in both unsupervised and supervised settings. We formulate the problem as a nonlinear multiple-response regression within an index model context. By applying the generalized Stein's lemma, the latent space can be estimated without knowing the nonlinear link functions. Our method can be viewed as a nonlinear generalization of PCA. Moreover, unlike AE and other neural network methods that operate as "black boxes", our approach not only offers better interpretability but also reduces computational complexity while providing strong theoretical guarantees. Comprehensive numerical experiments and real data analyses demonstrate the superior performance of our method.

Comment: This paper proposes a novel method for learning latent spaces, which aligns with representation learning. The approach offers interpretability and theoretical guarantees, making it relevant to foundational research.

Relevance: 9 Novelty: 8
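
The idea of estimating a latent direction without knowing the link function can be illustrated with the classical first-order Stein's lemma. The sketch assumes a much simpler setting than the paper's (a single response, a single index, and standard Gaussian inputs), and the names `simulate` and `estimate_index_direction` are hypothetical. For x ~ N(0, I) and y = g(beta^T x), Stein's lemma gives E[y x] = E[g'(beta^T x)] beta, so the sample average of y x points along beta.

```python
import math
import random
from typing import List, Tuple


def simulate(
    beta: List[float], n: int, seed: int = 0
) -> Tuple[List[List[float]], List[float]]:
    """Draw x ~ N(0, I) and y = tanh(beta^T x); tanh is an arbitrary link."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    ys = [math.tanh(sum(b * v for b, v in zip(beta, x))) for x in xs]
    return xs, ys


def estimate_index_direction(
    xs: List[List[float]], ys: List[float]
) -> List[float]:
    """Normalize the empirical moment (1/n) * sum_i y_i x_i.

    The estimator never evaluates the link function itself.
    """
    n, d = len(xs), len(xs[0])
    moment = [sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(d)]
    norm = math.sqrt(sum(m * m for m in moment))
    return [m / norm for m in moment]
```

In the multiple-response setting the abstract describes, the analogous moment would be a matrix whose leading singular vectors span the latent space; that extension is not shown here.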

8. How do language models learn facts? Dynamics, curricula and hallucinations

ArXiv ID: 2503.21676

Authors: Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De

Abstract: Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.

Comment: This paper investigates the learning dynamics of large language models, focusing on factual knowledge acquisition and the emergence of hallucinations. It provides theoretical insights into LLM behavior, aligning well with the foundational research on LLMs.

Relevance: 9 Novelty: 8

9. Stochastic Engrams for Efficient Continual Learning with Binarized Neural Networks

ArXiv ID: 2503.21436

Authors: Isabelle Aguilar, Luis Fernando Herbozo Contreras, Omid Kavehei

Abstract: The ability to learn continuously in artificial neural networks (ANNs) is often limited by catastrophic forgetting, a phenomenon in which new knowledge becomes dominant. By taking mechanisms of memory encoding in neuroscience (a.k.a. engrams) as inspiration, we propose a novel approach that integrates stochastically-activated engrams as a gating mechanism for metaplastic binarized neural networks (mBNNs). This method leverages the computational efficiency of mBNNs combined with the robustness of probabilistic memory traces to mitigate forgetting and maintain the model's reliability. Previously validated metaplastic optimization techniques have been incorporated to enhance synaptic stability further. Compared to baseline binarized models and benchmark fully connected continual learning approaches, our method is the only strategy capable of reaching average accuracies over 20% in class-incremental scenarios and achieving comparable domain-incremental results to full precision state-of-the-art methods. Furthermore, we achieve a significant reduction in peak GPU and RAM usage, under 5% and 20%, respectively. Our findings demonstrate (A) an improved stability vs. plasticity trade-off, (B) a reduced memory intensiveness, and (C) an enhanced performance in binarized architectures. By uniting principles of neuroscience and efficient computing, we offer new insights into the design of scalable and robust deep learning systems.

Comment: The paper proposes a neuroscience-inspired approach to continual learning using binarized neural networks, which aligns with model compression (binarization) and sparsity. The integration of stochastic engrams adds a novel perspective.

Relevance: 8 Novelty: 8

10. Scalable Expectation Estimation with Subtractive Mixture Models

ArXiv ID: 2503.21346

Authors: Lena Zellinger, Nicola Branchini, Víctor Elvira, Antonio Vergari

Abstract: Many Monte Carlo (MC) and importance sampling (IS) methods use mixture models (MMs) for their simplicity and ability to capture multimodal distributions. Recently, subtractive mixture models (SMMs), i.e. MMs with negative coefficients, have shown greater expressiveness and success in generative modeling. However, their negative parameters complicate sampling, requiring costly auto-regressive techniques or accept-reject algorithms that do not scale in high dimensions. In this work, we use the difference representation of SMMs to construct an unbiased IS estimator ($\Delta\text{Ex}$) that removes the need to sample from the SMM, enabling high-dimensional expectation estimation with SMMs. In our experiments, we show that $\Delta\text{Ex}$ can achieve comparable estimation quality to auto-regressive sampling while being considerably faster in MC estimation. Moreover, we conduct initial experiments with $\Delta\text{Ex}$ using hand-crafted proposals, gaining first insights into how to construct safe proposals for $\Delta\text{Ex}$.

Comment: The paper introduces an unbiased importance sampling estimator ($\Delta\text{Ex}$) for subtractive mixture models (SMMs) that avoids sampling from the SMM itself, enabling scalable expectation estimation. This is a novel contribution to generative modeling and aligns with representation learning through advanced mixture models.

Relevance: 8 Novelty: 8
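
The "difference representation" mentioned in the abstract can be illustrated on a toy one-dimensional case. The sketch assumes a density of the form p(x) = a * p_plus(x) - b * p_minus(x) with a - b = 1, so that E_p[f] = a * E_{p_plus}[f] - b * E_{p_minus}[f] and only the two positive components ever need to be sampled. It uses plain Monte Carlo on each component; the paper's $\Delta\text{Ex}$ estimator and its proposals may differ, and the function name `difference_estimate` is hypothetical.

```python
import random
from typing import Callable


def difference_estimate(
    f: Callable[[float], float],
    pos_sampler: Callable[[random.Random], float],
    neg_sampler: Callable[[random.Random], float],
    pos_weight: float,
    neg_weight: float,
    n: int,
    seed: int = 0,
) -> float:
    """Estimate E_p[f] for p = pos_weight * p_plus - neg_weight * p_minus.

    Samples are drawn only from the two positive components, never from p.
    """
    rng = random.Random(seed)
    pos_mean = sum(f(pos_sampler(rng)) for _ in range(n)) / n
    neg_mean = sum(f(neg_sampler(rng)) for _ in range(n)) / n
    return pos_weight * pos_mean - neg_weight * neg_mean


# Toy target: p(x) = 2 * N(x; 0, 2^2) - 1 * N(x; 0, 1), which is nonnegative
# everywhere and integrates to one. Its second moment is 2 * 4 - 1 * 1 = 7.
def sample_wide(rng: random.Random) -> float:
    return rng.gauss(0.0, 2.0)


def sample_narrow(rng: random.Random) -> float:
    return rng.gauss(0.0, 1.0)
```

Calling `difference_estimate(lambda x: x * x, sample_wide, sample_narrow, 2.0, 1.0, 20000)` should return a value close to the analytic second moment of 7.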

11. Consistent Multigroup Low-Rank Approximation

ArXiv ID: 2503.21563

Authors: Antonis Matakos, Martino Ciaperoni, Heikki Mannila

Abstract: We consider the problem of consistent low-rank approximation for multigroup data: we ask for a sequence of $k$ basis vectors such that projecting the data onto their spanned subspace treats all groups as equally as possible, by minimizing the maximum error among the groups. Additionally, we require that the sequence of basis vectors satisfies the natural consistency property: when looking for the best $k$ vectors, the first $d<k$ vectors are the best possible solution to the problem of finding $d$ basis vectors. Thus, this multigroup low-rank approximation method naturally generalizes SVD and reduces to SVD for data with a single group. We give an iterative algorithm for this task that sequentially adds to the basis the vector that gives the best rank-1 projection according to the min-max criterion, and then projects the data onto the orthogonal complement of that vector. For finding the best rank-1 projection, we use primal-dual approaches or semidefinite programming. We analyze the theoretical properties of the algorithms and demonstrate empirically that the proposed methods compare favorably to existing methods for multigroup (or fair) PCA.

Comment: The paper proposes a consistent low-rank approximation method for multigroup data, which aligns with model compression through low-rank approaches and introduces a novel iterative algorithm for fair PCA.

Relevance: 8 Novelty: 8

12. F-INR: Functional Tensor Decomposition for Implicit Neural Representations

ArXiv ID: 2503.21507

Authors: Sai Karthikeya Vemuri, Tim Büchner, Joachim Denzler

Abstract: Implicit Neural Representation (INR) has emerged as a powerful tool for encoding discrete signals into continuous, differentiable functions using neural networks. However, these models often have an unfortunate reliance on monolithic architectures to represent high-dimensional data, leading to prohibitive computational costs as dimensionality grows. We propose F-INR, a framework that reformulates INR learning through functional tensor decomposition, breaking down high-dimensional tasks into lightweight, axis-specific sub-networks. Each sub-network learns a low-dimensional data component (e.g., spatial or temporal). Then, we combine these components via tensor operations, reducing forward pass complexity while improving accuracy through specialized learning. F-INR is modular and, therefore, architecture-agnostic, compatible with MLPs, SIREN, WIRE, or other state-of-the-art INR architectures. It is also decomposition-agnostic, supporting CP, TT, and Tucker modes with user-defined rank for speed-accuracy control. In our experiments, F-INR trains $100\times$ faster than existing approaches on video tasks while achieving higher fidelity (+3.4 dB PSNR). Similar gains hold for image compression, physics simulations, and 3D geometry reconstruction. Through this, F-INR offers a new scalable, flexible solution for high-dimensional signal modeling.

Comment: F-INR proposes a novel framework for implicit neural representations using functional tensor decomposition, which aligns with model compression and efficiency topics. The modular and decomposition-agnostic approach is a significant contribution to scalable signal modeling.

Relevance: 8 Novelty: 8
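
The axis-specific decomposition in the F-INR abstract can be illustrated with a CP-style toy example. In this sketch, fixed closed-form functions stand in for the trained per-axis sub-networks, and the names `x_factors`, `y_factors`, and `cp_grid` are hypothetical. The target sin(x + y) = sin(x)cos(y) + cos(x)sin(y) is exactly rank 2, so two factors per axis reproduce it.

```python
import math
from typing import Callable, List

FactorNet = Callable[[float], List[float]]  # maps one coordinate to R factors


def x_factors(x: float) -> List[float]:
    """Stand-in for the x-axis sub-network (rank 2)."""
    return [math.sin(x), math.cos(x)]


def y_factors(y: float) -> List[float]:
    """Stand-in for the y-axis sub-network (rank 2)."""
    return [math.cos(y), math.sin(y)]


def cp_grid(
    xs: List[float], ys: List[float], x_net: FactorNet, y_net: FactorNet
) -> List[List[float]]:
    """Evaluate sum_r u_r(x) * v_r(y) on the full grid.

    Each sub-network runs once per coordinate along its own axis
    (len(xs) + len(ys) calls) rather than once per grid point.
    """
    fx = [x_net(x) for x in xs]
    fy = [y_net(y) for y in ys]
    return [[sum(a * b for a, b in zip(u, v)) for v in fy] for u in fx]
```

The saving in network evaluations grows with dimensionality, which is one plausible reading of the reduced forward-pass complexity the abstract reports; TT and Tucker modes would combine the factors differently.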

13. Uncertainty propagation in feed-forward neural network models

ArXiv ID: 2503.21059

Authors: Jeremy Diamzon, Daniele Venturi

Abstract: We develop new uncertainty propagation methods for feed-forward neural network architectures with leaky ReLU activation functions subject to random perturbations in the input vectors. In particular, we derive analytical expressions for the probability density function (PDF) of the neural network output and its statistical moments as a function of the input uncertainty and the parameters of the network, i.e., weights and biases. A key finding is that an appropriate linearization of the leaky ReLU activation function yields accurate statistical results even for large perturbations in the input vectors. This can be attributed to the way information propagates through the network. We also propose new analytically tractable Gaussian copula surrogate models to approximate the full joint PDF of the neural network output. To validate our theoretical results, we conduct Monte Carlo simulations and a thorough error analysis on a multi-layer neural network representing a nonlinear integro-differential operator between two polynomial function spaces. Our findings demonstrate excellent agreement between the theoretical predictions and Monte Carlo simulations.

Comment: The paper develops analytical methods for uncertainty propagation in neural networks, which could provide foundational insights into training dynamics and network behavior.
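
To make the linearization idea concrete, the sketch below propagates an input mean and independent input variances through a one-hidden-layer leaky ReLU network using a plain first-order (delta-method) expansion around the input mean. This is a standard approximation chosen for illustration; the paper's "appropriate linearization" may be constructed differently, and the function name `propagate_linearized` and the network shape are assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def leaky_relu(z: float, alpha: float) -> float:
    return z if z > 0 else alpha * z


def propagate_linearized(
    w1: Matrix, b1: List[float], w2: Matrix, b2: List[float],
    mean: List[float], variances: List[float], alpha: float = 0.01,
) -> Tuple[List[float], Matrix]:
    """Approximate output mean and covariance for independent input noise.

    The activation is replaced by its slope at the mean pre-activation, so
    the network becomes affine: cov_out = J diag(variances) J^T with
    J = W2 diag(slopes) W1.
    """
    pre = [sum(w * m for w, m in zip(row, mean)) + b for row, b in zip(w1, b1)]
    hidden = [leaky_relu(z, alpha) for z in pre]
    out_mean = [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(w2, b2)]
    slopes = [1.0 if z > 0 else alpha for z in pre]
    n_in, n_hid, n_out = len(mean), len(w1), len(w2)
    jac = [[sum(w2[o][h] * slopes[h] * w1[h][i] for h in range(n_hid))
            for i in range(n_in)] for o in range(n_out)]
    cov = [[sum(jac[a][i] * variances[i] * jac[b][i] for i in range(n_in))
            for b in range(n_out)] for a in range(n_out)]
    return out_mean, cov
```

For example, with identity first-layer weights, biases `[1, -1]`, output weights `[[1, 1]]`, `alpha=0.1`, zero input mean, and input variances `[0.04, 0.09]`, the slopes are 1 and 0.1, giving an output mean of 0.9 and an output variance of 0.04 + 0.01 * 0.09 = 0.0409.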