Personalized Daily ArXiv Papers 2025-06-09

[gpt-4o]	Prompt	Completion	Total
Token	53737	6680	60417
Cost	$0.13	$0.07	$0.2

Total arXiv papers: 670

Total scanned papers: 381

Total relevant papers: 38

Table of contents with paper titles:

Grokking Beyond the Euclidean Norm of Model Parameters Authors: Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning Authors: Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Irina Rish, Ekaterina Lobacheva
Transformative or Conservative? Conservation laws for ResNets and Transformers Authors: Sibylle Marcotte, R\'emi Gribonval, Gabriel Peyr\'e
Contextually Guided Transformers via Low-Rank Adaptation Authors: Andrey Zhmoginov, Jihwan Lee, Max Vladymyrov, Mark Sandler
The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models Authors: Alex Damian, Jason D. Lee, Joan Bruna
Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model Authors: Chuang Ma, Tomoyuki Obuchi, Toshiyuki Tanaka
Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning Authors: Ngoc Bui, Menglin Yang, Runjin Chen, Leonardo Neves, Mingxuan Ju, Rex Ying, Neil Shah, Tong Zhao
BAQ: Efficient Bit Allocation Quantization for Large Language Models Authors: Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah
A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization Authors: Muhammed Ustaomeroglu, Guannan Qu
LFA applied to CNNs: Efficient Singular Value Decomposition of Convolutional Mappings by Local Fourier Analysis Authors: Antonia van Betteray, Matthias Rottmann, Karsten Kahl
CoFrNets: Interpretable Neural Architecture Inspired by Continued Fractions Authors: Isha Puri, Amit Dhurandhar, Tejaswini Pedapati, Kartikeyan Shanmugam, Dennis Wei, Kush R. Varshney
When can in-context learning generalize out of task distribution? Authors: Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab
Towards an Explainable Comparison and Alignment of Feature Embeddings Authors: Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia
Cartridges: Lightweight and general-purpose long context representations via self-study Authors: Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re
MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models Authors: Jie Cao, Tianwei Lin, Hongyang He, Rolan Yan, Wenqiao Zhang, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang
Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones Authors: Andrey Zhmoginov, Jihwan Lee, Mark Sandler
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling Authors: Yuxuan Yue, Zukang Xu, Zhihang Yuan, Dawei Yang, Jianglong Wu, Liqiang Nie
Similarity Matching Networks: Hebbian Learning and Convergence Over Multiple Time Scales Authors: Veronica Centorrino, Francesco Bullo, Giovanni Russo
Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective Authors: Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, Loris D'Antoni
Tensor-to-Tensor Models with Fast Iterated Sum Features Authors: Joscha Diehl, Rasheed Ibraheem, Leonard Schmitz, Yue Wu
ENMA: Tokenwise Autoregression for Generative Neural PDE Operators Authors: Armand Kassa\"i Koupa\"i, Lise Le Boudec, Louis Serrano, Patrick Gallinari
Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties Authors: Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Mixture-of-Experts Meets In-Context Reinforcement Learning Authors: Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU Authors: Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Weifeng Liu, Qingxiao Sun
Evaluating Neuron Explanations: A Unified Framework with Sanity Checks Authors: Tuomas Oikarinen, Ge Yan, Tsui-Wei Weng
Learning to Weight Parameters for Data Attribution Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Flow-Attentional Graph Neural Networks Authors: Pascal Plettenberg, Dominik K\"ohler, Bernhard Sick, Josephine M. Thomas
Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation Authors: Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang, Ying Wei
Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning Authors: Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo
UniPTMs: The First Unified Multi-type PTM Site Prediction Model via Master-Slave Architecture-Based Multi-Stage Fusion Strategy and Hierarchical Contrastive Loss Authors: Yiyu Lin, Yan Wang, You Zhou, Xinye Ni, Jiahui Wu, Sen Yang
RETENTION: Resource-Efficient Tree-Based Ensemble Model Acceleration with Content-Addressable Memory Authors: Yi-Chun Liao, Chieh-Lin Tsai, Yuan-Hao Chang, Cam\'elia Slimani, Jalil Boukhobza, Tei-Wei Kuo
SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models Authors: Emil Biju, Shayan Talaei, Zhemin Huang, Mohammadreza Pourreza, Azalia Mirhoseini, Amin Saberi
Pruning Spurious Subgraphs for Graph Out-of-Distribtuion Generalization Authors: Tianjun Yao, Haoxuan Li, Yongqiang Chen, Tongliang Liu, Le Song, Eric Xing, Zhiqiang Shen
A projection-based framework for gradient-free and parallel learning Authors: Andreas Bergmeister, Manish Krishan Lal, Stefanie Jegelka, Suvrit Sra
Model-Driven Graph Contrastive Learning Authors: Ali Azizpour, Nicolas Zilberstein, Santiago Segarra
On Measuring Long-Range Interactions in Graph Neural Networks Authors: Jacob Bamberger, Benjamin Gutteridge, Scott le Roux, Michael M. Bronstein, Xiaowen Dong
Topology-aware Neural Flux Prediction Guided by Physics Authors: Haoyang Jiang, Jindong Wang, Xingquan Zhu, Yi He
Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router Authors: Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, Yong Li

1. Grokking Beyond the Euclidean Norm of Model Parameters

ArXiv ID: 2506.05718

Authors: Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau

Abstract: Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the $\ell_2$ norm is not a reliable proxy for generalization when the model is regularized toward a different property $P$, as the $\ell_2$ norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified solely through data selection, with any other hyperparameter fixed.

Comment: The paper discusses grokking in neural networks, focusing on regularization and over-parameterization, which is relevant to representation learning and training dynamics.

Relevance: 9 Novelty: 8

2. Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning

ArXiv ID: 2506.05447

Authors: Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Irina Rish, Ekaterina Lobacheva

Abstract: This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl

Comment: The paper provides insights into the training dynamics of language models, which is relevant to understanding LLM behavior.

Relevance: 9 Novelty: 8

3. Transformative or Conservative? Conservation laws for ResNets and Transformers

ArXiv ID: 2506.06194

Authors: Sibylle Marcotte, R\'emi Gribonval, Gabriel Peyr\'e

Abstract: While conservation laws in gradient flow training dynamics are well understood for (mostly shallow) ReLU and linear networks, their study remains largely unexplored for more practical architectures. This paper bridges this gap by deriving and analyzing conservation laws for modern architectures, with a focus on convolutional ResNets and Transformer networks. For this, we first show that basic building blocks such as ReLU (or linear) shallow networks, with or without convolution, have easily expressed conservation laws, and no more than the known ones. In the case of a single attention layer, we also completely describe all conservation laws, and we show that residual blocks have the same conservation laws as the same block without a skip connection. We then introduce the notion of conservation laws that depend only on a subset of parameters (corresponding e.g. to a pair of consecutive layers, to a residual block, or to an attention layer). We demonstrate that the characterization of such laws can be reduced to the analysis of the corresponding building block in isolation. Finally, we examine how these newly discovered conservation principles, initially established in the continuous gradient flow regime, persist under discrete optimization dynamics, particularly in the context of Stochastic Gradient Descent (SGD).

Comment: The paper explores conservation laws in ResNets and Transformers, providing theoretical insights into these architectures.

Relevance: 9 Novelty: 8

4. Contextually Guided Transformers via Low-Rank Adaptation

ArXiv ID: 2506.05672

Authors: Andrey Zhmoginov, Jihwan Lee, Max Vladymyrov, Mark Sandler

Abstract: Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates the need for explicit prompts by learning to encode context into the model's weights. Our Contextually Guided Transformer (CGT) model maintains a contextual summary at each sequence position, allowing it to update the weights on the fly based on the preceding context. This approach enables the model to self-specialize, effectively creating a tailored model for processing information following a given prefix. We demonstrate the effectiveness of our method on synthetic in-context learning tasks and language modeling benchmarks. Furthermore, we introduce techniques for enhancing the interpretability of the learned contextual representations, drawing connections to Variational Autoencoders and promoting smoother, more consistent context encoding. This work offers a novel direction for efficient and adaptable language modeling by integrating context directly into the model's architecture.

Comment: The paper proposes a modification to Transformer architecture for context encoding, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8

5. The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models

ArXiv ID: 2506.05500

Authors: Alex Damian, Jason D. Lee, Joan Bruna

Abstract: In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the \emph{generative leap} exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of $n=\Theta(d^{1 \vee \k/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with $r$-dimensional first hidden layer).

Comment: The paper discusses efficient learning of Gaussian Multi-index models, focusing on representation learning and sample complexity, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8

6. Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model

ArXiv ID: 2506.05801

Authors: Chuang Ma, Tomoyuki Obuchi, Toshiyuki Tanaka

Abstract: A phenomenon known as ''Neural Collapse (NC)'' in deep classification tasks, in which the penultimate-layer features and the final classifiers exhibit an extremely simple geometric structure, has recently attracted considerable attention, with the expectation that it can deepen our understanding of how deep neural networks behave. The Unconstrained Feature Model (UFM) has been proposed to explain NC theoretically, and there emerges a growing body of work that extends NC to tasks other than classification and leverages it for practical applications. In this study, we investigate whether a similar phenomenon arises in deep Ordinal Regression (OR) tasks, via combining the cumulative link model for OR and UFM. We show that a phenomenon we call Ordinal Neural Collapse (ONC) indeed emerges and is characterized by the following three properties: (ONC1) all optimal features in the same class collapse to their within-class mean when regularization is applied; (ONC2) these class means align with the classifier, meaning that they collapse onto a one-dimensional subspace; (ONC3) the optimal latent variables (corresponding to logits or preactivations in classification tasks) are aligned according to the class order, and in particular, in the zero-regularization limit, a highly local and simple geometric relationship emerges between the latent variables and the threshold values. We prove these properties analytically within the UFM framework with fixed threshold values and corroborate them empirically across a variety of datasets. We also discuss how these insights can be leveraged in OR, highlighting the use of fixed thresholds.

Comment: The paper analyzes neural collapse in ordinal regression, providing theoretical insights into deep learning behavior, which is relevant to representation learning.

Relevance: 9 Novelty: 8

7. Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning

ArXiv ID: 2506.05826

Authors: Ngoc Bui, Menglin Yang, Runjin Chen, Leonardo Neves, Mingxuan Ju, Rex Ying, Neil Shah, Tong Zhao

Abstract: Backward compatible representation learning enables updated models to integrate seamlessly with existing ones, avoiding to reprocess stored data. Despite recent advances, existing compatibility approaches in Euclidean space neglect the uncertainty in the old embedding model and force the new model to reconstruct outdated representations regardless of their quality, thereby hindering the learning process of the new model. In this paper, we propose to switch perspectives to hyperbolic geometry, where we treat time as a natural axis for capturing a model's confidence and evolution. By lifting embeddings into hyperbolic space and constraining updated embeddings to lie within the entailment cone of the old ones, we maintain generational consistency across models while accounting for uncertainties in the representations. To further enhance compatibility, we introduce a robust contrastive alignment loss that dynamically adjusts alignment weights based on the uncertainty of the old embeddings. Experiments validate the superiority of the proposed method in achieving compatibility, paving the way for more resilient and adaptable machine learning systems.

Comment: The paper introduces a hyperbolic geometry approach for backward-compatible representation learning, which is relevant to representation learning and introduces a novel perspective.

Relevance: 9 Novelty: 8

8. BAQ: Efficient Bit Allocation Quantization for Large Language Models

ArXiv ID: 2506.05664

Authors: Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah

Abstract: Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed \textbf{BAQ} (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56$\times$ lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All codes of this paper are available at https://github.com/CSU-ModelCompression/BAQ.

Comment: The paper proposes an efficient bit allocation quantization method for LLMs, focusing on model compression and efficiency, which is relevant to model compression.

Relevance: 9 Novelty: 8

9. A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization

ArXiv ID: 2506.06179

Authors: Muhammed Ustaomeroglu, Guannan Qu

Abstract: Self-attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self-attention through the lens of interacting entities, ranging from agents in multi-agent reinforcement learning to alleles in genetic sequences, and show that a single layer linear self-attention can efficiently represent, learn, and generalize functions capturing pairwise interactions, including out-of-distribution scenarios. Our analysis reveals that self-attention acts as a mutual interaction learner under minimal assumptions on the diversity of interaction patterns observed during training, thereby encompassing a wide variety of real-world domains. In addition, we validate our theoretical insights through experiments demonstrating that self-attention learns interaction functions and generalizes across both population distributions and out-of-distribution scenarios. Building on our theories, we introduce HyperFeatureAttention, a novel neural network module designed to learn couplings of different feature-level interactions between entities. Furthermore, we propose HyperAttention, a new module that extends beyond pairwise interactions to capture multi-entity dependencies, such as three-way, four-way, or general n-way interactions.

Comment: The paper provides a theoretical study of self-attention, a core component of modern neural architectures, and introduces new modules like HyperFeatureAttention and HyperAttention, which are relevant to model architecture innovations.

Relevance: 9 Novelty: 8

10. LFA applied to CNNs: Efficient Singular Value Decomposition of Convolutional Mappings by Local Fourier Analysis

ArXiv ID: 2506.05617

Authors: Antonia van Betteray, Matthias Rottmann, Karsten Kahl

Abstract: The singular values of convolutional mappings encode interesting spectral properties, which can be used, e.g., to improve generalization and robustness of convolutional neural networks as well as to facilitate model compression. However, the computation of singular values is typically very resource-intensive. The naive approach involves unrolling the convolutional mapping along the input and channel dimensions into a large and sparse two-dimensional matrix, making the exact calculation of all singular values infeasible due to hardware limitations. In particular, this is true for matrices that represent convolutional mappings with large inputs and a high number of channels. Existing efficient methods leverage the Fast Fourier transformation (FFT) to transform convolutional mappings into the frequency domain, enabling the computation of singular values for matrices representing convolutions with larger input and channel dimensions. For a constant number of channels in a given convolution, an FFT can compute N singular values in O(N log N) complexity. In this work, we propose an approach of complexity O(N) based on local Fourier analysis, which additionally exploits the shift invariance of convolutional operators. We provide a theoretical analysis of our algorithm's runtime and validate its efficiency through numerical experiments. Our results demonstrate that our proposed method is scalable and offers a practical solution to calculate the entire set of singular values - along with the corresponding singular vectors if needed - for high-dimensional convolutional mappings.

Comment: The paper proposes a novel approach for efficient singular value decomposition of convolutional mappings, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

11. CoFrNets: Interpretable Neural Architecture Inspired by Continued Fractions

ArXiv ID: 2506.05586

Authors: Isha Puri, Amit Dhurandhar, Tejaswini Pedapati, Kartikeyan Shanmugam, Dennis Wei, Kush R. Varshney

Abstract: In recent years there has been a considerable amount of research on local post hoc explanations for neural networks. However, work on building interpretable neural architectures has been relatively sparse. In this paper, we present a novel neural architecture, CoFrNet, inspired by the form of continued fractions which are known to have many attractive properties in number theory, such as fast convergence of approximations to real numbers. We show that CoFrNets can be efficiently trained as well as interpreted leveraging their particular functional form. Moreover, we prove that such architectures are universal approximators based on a proof strategy that is different than the typical strategy used to prove universal approximation results for neural networks based on infinite width (or depth), which is likely to be of independent interest. We experiment on nonlinear synthetic functions and are able to accurately model as well as estimate feature attributions and even higher order terms in some cases, which is a testament to the representational power as well as interpretability of such architectures. To further showcase the power of CoFrNets, we experiment on seven real datasets spanning tabular, text and image modalities, and show that they are either comparable or significantly better than other interpretable models and multilayer perceptrons, sometimes approaching the accuracies of state-of-the-art models.

Comment: The paper proposes a novel neural architecture inspired by continued fractions, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8

12. When can in-context learning generalize out of task distribution?

ArXiv ID: 2506.05574

Authors: Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab

Abstract: In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emph{out-of-distribution}. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.

Comment: The paper investigates conditions for in-context learning to generalize out of task distribution, which is relevant to large language models and theoretical insights.

Relevance: 9 Novelty: 8

13. Towards an Explainable Comparison and Alignment of Feature Embeddings

ArXiv ID: 2506.06231

Authors: Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia

Abstract: While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the \emph{Spectral Pairwise Embedding Comparison (SPEC)} framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the SPEC's application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The code is available at https://github.com/mjalali/embedding-comparison.

Comment: The paper proposes a framework for comparing and aligning feature embeddings, relevant to representation learning.

Relevance: 9 Novelty: 8

14. Cartridges: Lightweight and general-purpose long context representations via self-study

ArXiv ID: 2506.06266

Authors: Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re

Abstract: Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.

Comment: The paper introduces a method for efficient long context representations, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

15. MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models

ArXiv ID: 2506.05928

Authors: Jie Cao, Tianwei Lin, Hongyang He, Rolan Yan, Wenqiao Zhang, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang

Abstract: Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ \emph{homogeneous} MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a \emph{heterogeneous} \textbf{Mixture-of-Adapters (MoA)} approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: \textbf{(i)} \textit{Soft MoA} achieves fine-grained integration by performing a weighted fusion of all expert outputs; \textbf{(ii)} \textit{Sparse MoA} activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at https://github.com/DCDmllm/MoA.

Comment: The paper introduces a heterogeneous mixture of adapters for LLM fine-tuning, relevant to MoE and model architecture.

Relevance: 9 Novelty: 8

16. Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones

ArXiv ID: 2506.05641

Authors: Andrey Zhmoginov, Jihwan Lee, Mark Sandler

Abstract: Modern Foundation Models (FMs) are typically trained on corpora spanning a wide range of different data modalities, topics and downstream tasks. Utilizing these models can be very computationally expensive and is out of reach for most consumer devices. Furthermore, most of the broad FM knowledge may actually be irrelevant for a specific task at hand. Here we explore a technique for mapping parameters of a large Transformer to parameters of a smaller specialized model. By making this transformation task-specific, we aim to capture a narrower scope of the knowledge needed for performing a specific task by a smaller model. We study our method on image modeling tasks, showing that performance of generated models exceeds that of universal conditional models.

Comment: The paper explores generating smaller specialized models from large transformers, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

17. PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling

ArXiv ID: 2506.05432

Authors: Yuxuan Yue, Zukang Xu, Zhihang Yuan, Dawei Yang, Jianglong Wu, Liqiang Nie

Abstract: Large Language Models (LLMs) face significant challenges in edge deployment due to their massive parameter scale. Vector Quantization (VQ), a clustering-based quantization method, serves as a prevalent solution to this issue for its extremely low-bit (even at 2-bit) and considerable accuracy. Since a vector is a quantity in mathematics and physics that has both direction and magnitude, existing VQ works typically quantize them in a coupled manner. However, we find that direction exhibits significantly greater sensitivity to quantization compared to the magnitude. For instance, when separately clustering the directions and magnitudes of weight vectors in LLaMA-2-7B, the accuracy drop of zero-shot tasks are 46.5\% and 2.3\%, respectively. This gap even increases with the reduction of clustering centers. Further, Euclidean distance, a common metric to access vector similarities in current VQ works, places greater emphasis on reducing the magnitude error. This property is contrary to the above finding, unavoidably leading to larger quantization errors. To these ends, this paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework consisting of two key modules: 1) Polar Coordinate Decoupling (PCD), which transforms vectors into their polar coordinate representations and perform independent quantization of the direction and magnitude parameters.2) Distribution Aligned Codebook Construction (DACC), which optimizes the direction and magnitude codebooks in accordance with the source distribution. Experimental results show that PCDVQ outperforms baseline methods at 2-bit level by at least 1.5\% zero-shot accuracy, establishing a novel paradigm for accurate and highly compressed LLMs.

Comment: The paper proposes a novel vector quantization framework for LLMs, which is relevant to model compression techniques.

Relevance: 8 Novelty: 8

18. Similarity Matching Networks: Hebbian Learning and Convergence Over Multiple Time Scales

ArXiv ID: 2506.06134

Authors: Veronica Centorrino, Francesco Bullo, Giovanni Russo

Abstract: A recent breakthrough in biologically-plausible normative frameworks for dimensionality reduction is based upon the similarity matching cost function and the low-rank matrix approximation problem. Despite clear biological interpretation, successful application in several domains, and experimental validation, a formal complete convergence analysis remains elusive. Building on this framework, we consider and analyze a continuous-time neural network, the \emph{similarity matching network}, for principal subspace projection. Derived from a min-max-min objective, this biologically-plausible network consists of three coupled dynamics evolving at different time scales: neural dynamics, lateral synaptic dynamics, and feedforward synaptic dynamics at the fast, intermediate, and slow time scales, respectively. The feedforward and lateral synaptic dynamics consist of Hebbian and anti-Hebbian learning rules, respectively. By leveraging a multilevel optimization framework, we prove convergence of the dynamics in the offline setting. Specifically, at the first level (fast time scale), we show strong convexity of the cost function and global exponential convergence of the corresponding gradient-flow dynamics. At the second level (intermediate time scale), we prove strong concavity of the cost function and exponential convergence of the corresponding gradient-flow dynamics within the space of positive definite matrices. At the third and final level (slow time scale), we study a non-convex and non-smooth cost function, provide explicit expressions for its global minima, and prove almost sure convergence of the corresponding gradient-flow dynamics to the global minima. These results rely on two empirically motivated conjectures that are supported by thorough numerical experiments. Finally, we validate the effectiveness of our approach via a numerical example.

Comment: The paper provides a convergence analysis of a biologically-plausible neural network for dimensionality reduction, relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 8

19. Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective

ArXiv ID: 2506.05754

Authors: Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, Loris D'Antoni

Abstract: Constrained decoding enables Language Models (LMs) to produce samples that provably satisfy hard constraints. However, existing constrained-decoding approaches often distort the underlying model distribution, a limitation that is especially problematic in applications like program fuzzing, where one wants to generate diverse and valid program inputs for testing purposes. We propose a new constrained sampling framework based on Markov Chain Monte Carlo (MCMC) that simultaneously satisfies three core desiderata: constraint satisfying (every sample satisfies the constraint), monotonically converging (the sampling process converges to the true conditional distribution), and efficient (high-quality samples emerge in few steps). Our method constructs a proposal distribution over valid outputs and applies a Metropolis-Hastings acceptance criterion based on the LM's likelihood, ensuring principled and efficient exploration of the constrained space. Empirically, our sampler outperforms existing methods on both synthetic benchmarks and real-world program fuzzing tasks.

Comment: The paper proposes a constrained sampling framework for language models, which is relevant to large language models and introduces a novel MCMC-based approach.

Relevance: 8 Novelty: 8

20. Tensor-to-Tensor Models with Fast Iterated Sum Features

ArXiv ID: 2506.06041

Authors: Joscha Diehl, Rasheed Ibraheem, Leonard Schmitz, Yue Wu

Abstract: Data in the form of images or higher-order tensors is ubiquitous in modern deep learning applications. Owing to their inherent high dimensionality, the need for subquadratic layers processing such data is even more pressing than for sequence data. We propose a novel tensor-to-tensor layer with linear cost in the input size, utilizing the mathematical gadget of ``corner trees'' from the field of permutation counting. In particular, for order-two tensors, we provide an image-to-image layer that can be plugged into image processing pipelines. On the one hand, our method can be seen as a higher-order generalization of state-space models. On the other hand, it is based on a multiparameter generalization of the signature of iterated integrals (or sums). The proposed tensor-to-tensor concept is used to build a neural network layer called the Fast Iterated Sums (FIS) layer which integrates seamlessly with other layer types. We demonstrate the usability of the FIS layer with both classification and anomaly detection tasks. By replacing some layers of a smaller ResNet architecture with FIS, a similar accuracy (with a difference of only 0.1\%) was achieved in comparison to a larger ResNet while reducing the number of trainable parameters and multi-add operations. The FIS layer was also used to build an anomaly detection model that achieved an average AUROC of 97.3\% on the texture images of the popular MVTec AD dataset. The processing and modelling codes are publicly available at https://github.com/diehlj/fast-iterated-sums.

Comment: The paper introduces a novel tensor-to-tensor layer for neural networks, which is relevant to model architecture as it proposes a new layer type with efficiency improvements.