Personalized Daily Arxiv Papers 02/17/2025

	Prompt	Completion	Total
Token	66350	5324	71674
Cost	$0.17	$0.05	$0.22

Total scanned papers: 267

Total relevant papers: 21

Table of contents with paper titles:

Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control Authors: Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio, Luca Scimeca
Solvable Dynamics of Self-Supervised Word Embeddings and the Emergence of Analogical Reasoning Authors: Dhruva Karkada, James B. Simon, Yasaman Bahri, Michael R. DeWeese
Representation and Interpretation in Artificial and Natural Computing Authors: Luis A. Pineda
A novel approach to data generation in generative model Authors: JaeHong Kim (Healthcare, Legal and Policy Center, Graduate school of Law, Korea University, Seoul 02841, Korea, Human-Inspired AI Research, Korea University, Seoul 02841, Korea), Jaewon Shim (Center for 0D Nanofluidics, Institute of Applied Physics, Department of Physics and Astronomy, Seoul National University, Seoul 08826, Korea)
NestQuant: Nested Lattice Quantization for Matrix Products and LLMs Authors: Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy
On Space Folds of ReLU Neural Networks Authors: Michal Lewandowski, Hamid Eghbalzadeh, Bernhard Heinzl, Raphael Pisoni, Bernhard A. Moser
Prediction hubs are context-informed frequent tokens in LLMs Authors: Beatrix M. G. Nielsen, Iuri Macocco, Marco Baroni
Fenchel-Young Variational Learning Authors: Sophia Sklaviadis, Sweta Agrawal, Antonio Farinhas, Andre Martins, Mario Figueiredo
Deep Tree Tensor Networks for Image Recognition Authors: Chang Nie, Junfang Chen, Yajie Chen
Estimation of the Learning Coefficient Using Empirical Loss Authors: Tatsuyoshi Takio, Joe Suzuki
STAR: Spectral Truncation and Rescale for Model Merging Authors: Yu-Ang Lee, Ching-Yun Ko, Tejaswini Pedapati, I-Hsin Chung, Mi-Yen Yeh, Pin-Yu Chen
Data-Adaptive Low-Rank Sparse Subspace Clustering Authors: Ivica Kopriva
Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data Authors: Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong
Revisiting Generalization Power of a DNN in Terms of Symbolic Interactions Authors: Lei Cheng, Junpeng Zhang, Qihan Ren, Quanshi Zhang
Elastic Representation: Mitigating Spurious Correlations for Group Robustness Authors: Tao Wen, Zihan Wang, Quan Zhang, Qi Lei
Heterogeneous Resource Allocation with Multi-task Learning for Wireless Networks Authors: Nikos A. Mitsiou, Pavlos S. Bouzinis, Panagiotis G. Sarigiannidis, George K. Karagiannidis
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection Authors: Bettina Messmer, Vinko Sabol\v{c}ec, Martin Jaggi
Process Reward Models for LLM Agents: Practical Framework and Directions Authors: Sanjiban Choudhury
Conditional Latent Coding with Learnable Synthesized Reference for Deep Image Compression Authors: Siqi Wu, Yinda Chen, Dong Liu, Zhihai He
The Ann Arbor Architecture for Agent-Oriented Programming Authors: Wei Dong
Nonasymptotic CLT and Error Bounds for Two-Time-Scale Stochastic Approximation Authors: Seo Taek Kong, Sihan Zeng, Thinh T. Doan, R. Srikant

1. Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control

ArXiv ID: 2502.10236

Authors: Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio, Luca Scimeca

Abstract: Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. For topologically structured data, we devise a frequency-based noising operator to purposefully manipulate, and set, these inductive biases. We first show that appropriate manipulations of the noising forward process can lead DPMs to focus on particular aspects of the distribution to learn. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control induces increased generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.

Comment: Author match

2. Solvable Dynamics of Self-Supervised Word Embeddings and the Emergence of Analogical Reasoning

ArXiv ID: 2502.09863

Authors: Dhruva Karkada, James B. Simon, Yasaman Bahri, Michael R. DeWeese

Abstract: The remarkable success of large language models relies on their ability to implicitly learn structured latent representations from the pretraining corpus. As a simpler surrogate for representation learning in language modeling, we study a class of solvable contrastive self-supervised algorithms which we term quadratic word embedding models. These models resemble the word2vec algorithm and perform similarly on downstream tasks. Our main contributions are analytical solutions for both the training dynamics (under certain hyperparameter choices) and the final word embeddings, given in terms of only the corpus statistics. Our solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on WikiText, we find that the top subspaces represent interpretable concepts. Finally, we use our dynamical theory to predict how and when models acquire the ability to complete analogies.

Comment: The paper provides analytical solutions for self-supervised word embedding dynamics, offering foundational insights into representation learning and training dynamics.

Relevance: 9 Novelty: 9

3. Representation and Interpretation in Artificial and Natural Computing

ArXiv ID: 2502.10383

Authors: Luis A. Pineda

Abstract: Artificial computing machinery transforms representations through an objective process, to be interpreted subjectively by humans, so the machine and the interpreter are different entities, but in the putative natural computing both processes are performed by the same agent. The method or process that transforms a representation is called here \emph{the mode of computing}. The mode used by digital computers is the algorithmic one, but there are others, such as quantum computers and diverse forms of non-conventional computing, and there is an open-ended set of representational formats and modes that could be used in artificial and natural computing. A mode based on a notion of computing different from Turing's may perform feats beyond what the Turing Machine does but the modes would not be of the same kind and could not be compared. For a mode of computing to be more powerful than the algorithmic one, it ought to compute functions lacking an effective algorithm, and Church Thesis would not hold. Here, a thought experiment including a computational demon using a hypothetical mode for such an effect is presented. If there is natural computing, there is a mode of natural computing whose properties may be causal to the phenomenological experience. Discovering it would come with solving the hard problem of consciousness; but if it turns out that such a mode does not exist, there is no such thing as natural computing, and the mind is not a computational process.

Comment: The paper discusses representation and modes of computing, touching on theoretical aspects of computing beyond Turing Machines. It aligns with emerging trends and foundational research.

Relevance: 9 Novelty: 9

4. A novel approach to data generation in generative model

ArXiv ID: 2502.10092

Authors: JaeHong Kim (Healthcare, Legal and Policy Center, Graduate school of Law, Korea University, Seoul 02841, Korea, Human-Inspired AI Research, Korea University, Seoul 02841, Korea), Jaewon Shim (Center for 0D Nanofluidics, Institute of Applied Physics, Department of Physics and Astronomy, Seoul National University, Seoul 08826, Korea)

Abstract: Variational Autoencoders (VAEs) and other generative models are widely employed in artificial intelligence to synthesize new data. However, current approaches rely on Euclidean geometric assumptions and statistical approximations that fail to capture the structured and emergent nature of data generation. This paper introduces the Convergent Fusion Paradigm (CFP) theory, a novel geometric framework that redefines data generation by integrating dimensional expansion accompanied by qualitative transformation. By modifying the latent space geometry to interact with emergent high-dimensional structures, CFP theory addresses key challenges such as identifiability issues and unintended artifacts like hallucinations in Large Language Models (LLMs). CFP theory is based on two key conceptual hypotheses that redefine how generative models structure relationships between data and algorithms. Through the lens of CFP theory, we critically examine existing metric-learning approaches. CFP theory advances this perspective by introducing time-reversed metric embeddings and structural convergence mechanisms, leading to a novel geometric approach that better accounts for data generation as a structured epistemic process. Beyond its computational implications, CFP theory provides philosophical insights into the ontological underpinnings of data generation. By offering a systematic framework for high-dimensional learning dynamics, CFP theory contributes to establishing a theoretical foundation for understanding the data-relationship structures in AI. Finally, future research in CFP theory will be led to its implications for fully realizing qualitative transformations, introducing the potential of Hilbert space in generative modeling.

Comment: The paper introduces the Convergent Fusion Paradigm (CFP) theory, which redefines data generation in generative models and offers a novel geometric framework, aligning with foundational research in representation learning and generative modeling.

Relevance: 9 Novelty: 9

5. NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

ArXiv ID: 2502.09720

Authors: Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy

Abstract: Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent work have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP etc). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than 55% reduction in perplexity gap with respect to unquantized model (perplexity of 6.14) compared to state-of-the-art Meta's SpinQuant (perplexity 7.3). Comparisons on various LLM evaluation benchmarks also show a reduction in performance degradation induced by quantization.

Comment: NestQuant introduces a novel quantization scheme for LLMs, aligning with the model compression criterion. The use of nested lattices and its theoretical grounding make it highly relevant.

Relevance: 9 Novelty: 8

6. On Space Folds of ReLU Neural Networks

ArXiv ID: 2502.09954

Authors: Michal Lewandowski, Hamid Eghbalzadeh, Bernhard Heinzl, Raphael Pisoni, Bernhard A. Moser

Abstract: Recent findings suggest that the consecutive layers of ReLU neural networks can be understood geometrically as space folding transformations of the input space, revealing patterns of self-similarity. In this paper, we present the first quantitative analysis of this space folding phenomenon in ReLU neural networks. Our approach focuses on examining how straight paths in the Euclidean input space are mapped to their counterparts in the Hamming activation space. In this process, the convexity of straight lines is generally lost, giving rise to non-convex folding behavior. To quantify this effect, we introduce a novel measure based on range metrics, similar to those used in the study of random walks, and provide the proof for the equivalence of convexity notions between the input and activation spaces. Furthermore, we provide empirical analysis on a geometrical analysis benchmark (CantorNet) as well as an image classification benchmark (MNIST). Our work advances the understanding of the activation space in ReLU neural networks by leveraging the phenomena of geometric folding, providing valuable insights on how these models process input information.

Comment: The paper provides a quantitative analysis of space folding in ReLU networks, offering foundational insights into neural network behavior and representation learning.

Relevance: 9 Novelty: 8

7. Prediction hubs are context-informed frequent tokens in LLMs

ArXiv ID: 2502.10201

Authors: Beatrix M. G. Nielsen, Iuri Macocco, Marco Baroni

Abstract: Hubness, the tendency for few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data, often negatively impacting distance-based analysis. As autoregressive large language models (LLMs) operate on high-dimensional representations, we ask whether they are also affected by hubness. We first show, theoretically, that the only representation comparison operation performed by LLMs, namely that between context and unembedding vectors to determine continuation probabilities, is not characterized by the concentration of distances phenomenon that typically causes the appeareance of nuisance hubness. We then empirically show that this comparison still leads to a high degree of hubness, but the hubs in this case do not constitute a disturbance. They are rather the result of context-modulated frequent tokens often appearing in the pool of likely candidates for next token prediction. On the other hand, when other distance computations involving LLM representations are performed, we do not have the same theoretical guarantees, and, indeed, we see nuisance hubs appear. In summary, our work highlights, on the one hand, how hubness, while omnipresent in high-dimensional spaces, is not always a negative property that needs to be mitigated, and, on the other hand, it shows that various widely-used LLMs have developed a guessing strategy that consists in constantly assigning a high probability to frequent tokens.

Comment: The paper explores hubness in LLMs and provides theoretical and empirical insights into token prediction behavior, aligning with foundational research on LLM behavior and interpretability.

Relevance: 9 Novelty: 8

8. Fenchel-Young Variational Learning

ArXiv ID: 2502.10295

Authors: Sophia Sklaviadis, Sweta Agrawal, Antonio Farinhas, Andre Martins, Mario Figueiredo

Abstract: From a variational perspective, many statistical learning criteria involve seeking a distribution that balances empirical risk and regularization. In this paper, we broaden this perspective by introducing a new general class of variational methods based on Fenchel-Young (FY) losses, treated as divergences that generalize (and encompass) the familiar Kullback-Leibler divergence at the core of classical variational learning. Our proposed formulation -- FY variational learning -- includes as key ingredients new notions of FY free energy, FY evidence, FY evidence lower bound, and FY posterior. We derive alternating minimization and gradient backpropagation algorithms to compute (or lower bound) the FY evidence, which enables learning a wider class of models than previous variational formulations. This leads to generalized FY variants of classical algorithms, such as an FY expectation-maximization (FYEM) algorithm, and latent-variable models, such as an FY variational autoencoder (FYVAE). Our new methods are shown to be empirically competitive, often outperforming their classical counterparts, and most importantly, to have qualitatively novel features. For example, FYEM has an adaptively sparse E-step, while the FYVAE can support models with sparse observations and sparse posteriors.

Comment: The paper proposes Fenchel-Young Variational Learning, a generalization of variational methods with new theoretical insights and applications to latent-variable models, aligning with foundational research in representation learning and autoencoders.

Relevance: 9 Novelty: 8

9. Deep Tree Tensor Networks for Image Recognition

ArXiv ID: 2502.09928

Authors: Chang Nie, Junfang Chen, Yajie Chen

Abstract: Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parameter decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image processing. When employed, they primarily serve to compress parameters within off-the-shelf networks, thus losing their distinctive capability to enhance exponential-order feature interactions. This paper introduces a novel architecture named \textit{\textbf{D}eep \textbf{T}ree \textbf{T}ensor \textbf{N}etwork} (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a \emph{tree}-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interacting modules (AIMs), and this design facilitates efficient implementation. Moreover, we theoretically reveal the equivalency among quantum-inspired TN models and polynomial and multilinear networks under certain conditions, and we believe that DTTN can inspire more interpretable studies in this field. We evaluate the proposed model against a series of benchmarks and achieve excellent performance compared to its peers and cutting-edge architectures. Our code will soon be publicly available.

Comment: The paper introduces a novel architecture, Deep Tree Tensor Networks (DTTN), which focuses on tensor networks and their application to feature interactions. This aligns with the 'Model Architecture' criterion, particularly in architectural innovations.

Relevance: 8 Novelty: 8

10. Estimation of the Learning Coefficient Using Empirical Loss

ArXiv ID: 2502.09998

Authors: Tatsuyoshi Takio, Joe Suzuki

Abstract: The learning coefficient plays a crucial role in analyzing the performance of information criteria, such as the Widely Applicable Information Criterion (WAIC) and the Widely Applicable Bayesian Information Criterion (WBIC), which Sumio Watanabe developed to assess model generalization ability. In regular statistical models, the learning coefficient is given by d/2, where d is the dimension of the parameter space. More generally, it is defined as the absolute value of the pole order of a zeta function derived from the Kullback-Leibler divergence and the prior distribution. However, except for specific cases such as reduced-rank regression, the learning coefficient cannot be derived in a closed form. Watanabe proposed a numerical method to estimate the learning coefficient, which Imai further refined to enhance its convergence properties. These methods utilize the asymptotic behavior of WBIC and have been shown to be statistically consistent as the sample size grows. In this paper, we propose a novel numerical estimation method that fundamentally differs from previous approaches and leverages a new quantity, "Empirical Loss," which was introduced by Watanabe. Through numerical experiments, we demonstrate that our proposed method exhibits both lower bias and lower variance compared to those of Watanabe and Imai. Additionally, we provide a theoretical analysis that elucidates why our method outperforms existing techniques and present empirical evidence that supports our findings.

Comment: The paper proposes a novel method for estimating the learning coefficient using empirical loss, which contributes to theoretical insights into model generalization. This aligns with foundational research in representation learning.

Relevance: 8 Novelty: 8

11. STAR: Spectral Truncation and Rescale for Model Merging

ArXiv ID: 2502.10339

Authors: Yu-Ang Lee, Ching-Yun Ko, Tejaswini Pedapati, I-Hsin Chung, Mi-Yen Yeh, Pin-Yu Chen

Abstract: Model merging is an efficient way of obtaining a multi-task model from several pretrained models without further fine-tuning, and it has gained attention in various domains, including natural language processing (NLP). Despite the efficiency, a key challenge in model merging is the seemingly inevitable decrease in task performance as the number of models increases. In this paper, we propose $\mathbf{S}$pectral $\mathbf{T}$runcation $\mathbf{A}$nd $\mathbf{R}$escale (STAR) that aims at mitigating ``merging conflicts'' by truncating small components in the respective spectral spaces, which is followed by an automatic parameter rescaling scheme to retain the nuclear norm of the original matrix. STAR requires no additional inference on original training data and is robust to hyperparamater choice. We demonstrate the effectiveness of STAR through extensive model merging cases on diverse NLP tasks. Specifically, STAR works robustly across varying model sizes, and can outperform baselines by 4.2$\%$ when merging 12 models on Flan-T5. Our code is publicly available at https://github.com/IBM/STAR.

Comment: The paper introduces STAR, a method for model merging that addresses merging conflicts through spectral truncation and rescaling. This aligns with model efficiency and compression criteria.

Relevance: 8 Novelty: 8

12. Data-Adaptive Low-Rank Sparse Subspace Clustering

ArXiv ID: 2502.10106

Authors: Ivica Kopriva

Abstract: Low-rank sparse subspace clustering (LRSSC) algorithms built on self-expressive model effectively capture both the global and local structure of the data. However, existing solutions, primarily based on proximal operators associated with Sp/Lp , p e {0, 1/2, 2/3, 1}, norms are not data-adaptive. In this work, we propose an LRSSC algorithm incorporating a data-adaptive surrogate for the S0/L0 quasi-norm. We provide a numerical solution for the corresponding proximal operator in cases where an analytical expression is unavailable. The proposed LRSSC algorithm is formulated within the proximal mapping framework, and we present theoretical proof of its global convergence toward a stationary point. We evaluate the performance of the proposed method on three well known datasets, comparing it against LRSSC algorithms constrained by Sp/Lp, p e {0, 1/2, 2/3, 1}, norms.

Comment: The paper proposes a data-adaptive low-rank sparse subspace clustering algorithm, which aligns with foundational research in representation learning and sparsity.

Relevance: 8 Novelty: 8

13. Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data

ArXiv ID: 2502.10381

Authors: Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong

Abstract: Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.

Comment: The paper introduces a novel theoretical framework for learning from imbalanced data, including a new margin loss function and learning guarantees, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 8

14. Revisiting Generalization Power of a DNN in Terms of Symbolic Interactions

ArXiv ID: 2502.10162

Authors: Lei Cheng, Junpeng Zhang, Qihan Ren, Quanshi Zhang

Abstract: This paper aims to analyze the generalization power of deep neural networks (DNNs) from the perspective of interactions. Unlike previous analysis of a DNN's generalization power in a highdimensional feature space, we find that the generalization power of a DNN can be explained as the generalization power of the interactions. We found that the generalizable interactions follow a decay-shaped distribution, while non-generalizable interactions follow a spindle-shaped distribution. Furthermore, our theory can effectively disentangle these two types of interactions from a DNN. We have verified that our theory can well match real interactions in a DNN in experiments.

Comment: The paper provides a novel perspective on DNN generalization by analyzing symbolic interactions, which aligns with representation learning and training dynamics.