Previous Day 2025-04-24
Monthly Overview 2025-04
Next Day 2025-04-28

Personalized Daily ArXiv Papers 2025-04-25

[gpt-4o] Prompt Completion Total
Token 26265 3534 29799
Cost $0.07 $0.04 $0.1

Total arXiv papers: 392

Total scanned papers: 225

Total relevant papers: 19

Table of contents with paper titles:

  1. I-Con: A Unifying Framework for Representation Learning Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

  2. Representation Learning via Non-Contrastive Mutual Information Authors: Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai

  3. Quantum Doubly Stochastic Transformers Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

  4. Symbolic Representation for Any-to-Any Generative Tasks Authors: Jiaqi Chen, Xiaoye Zhu, Yue Wang, Tianyang Liu, Xinhui Chen, Ying Chen, Chak Tou Leong, Yifei Ke, Joseph Liu, Yiwen Yuan, Julian McAuley, Li-jia Li

  5. Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu

  6. Backslash: Rate Constrained Optimized Training of Large Language Models Authors: Jun Wu, Jiangtao Wen, Yuxing Han

  7. When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars Authors: Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki

  8. Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi

  9. The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti

  10. Enhancing Variational Autoencoders with Smooth Robust Latent Encoding Authors: Hyomin Lee, Minseon Kim, Sangwon Jang, Jongheon Jeong, Sung Ju Hwang

  11. SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures Authors: Max Hartman, Lav Varshney

  12. Likelihood-Free Variational Autoencoders Authors: Chen Xu, Qiang Wang, Lijun Sun

  13. Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models Authors: Julius Vetter, Manuel Gloeckler, Daniel Gedon, Jakob H. Macke

  14. Physics-informed features in supervised machine learning Authors: Margherita Lampani, Sabrina Guastavino, Michele Piana, Federico Benvenuto

  15. HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Qin Xie, Guiming Xie, Xuejian Gong

  16. Towards Robust LLMs: an Adversarial Robustness Measurement Framework Authors: Natan Levy, Adiel Ashrov, Guy Katz

  17. HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing Authors: Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim

  18. In-Context Learning can distort the relationship between sequence likelihoods and biological fitness Authors: Pranav Kantroo, G\"unter P. Wagner, Benjamin B. Machta

  19. NeuralGrok: Accelerate Grokking by Neural Gradient Transformation Authors: Xinyu Zhou, Simin Fan, Martin Jaggi, Jie Fu


1. I-Con: A Unifying Framework for Representation Learning

ArXiv ID: 2504.16929

Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

Abstract: As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.

Comment: The paper introduces a unifying framework for representation learning, connecting various loss functions and methods through an information-theoretic perspective. This aligns closely with the 'Representation Learning' criterion, particularly in understanding how deep networks encode information.

Relevance: 10 Novelty: 9


2. Representation Learning via Non-Contrastive Mutual Information

ArXiv ID: 2504.16667

Authors: Zhaohan Daniel Guo, Bernardo Avila Pires, Khimya Khetarpal, Dale Schuurmans, Bo Dai

Abstract: Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.

Comment: The paper proposes a novel non-contrastive mutual information objective (MINC) for self-supervised representation learning, which is highly relevant to foundational research in representation learning. The approach combines strengths of contrastive and non-contrastive methods, offering a significant methodological improvement.

Relevance: 10 Novelty: 8


3. Quantum Doubly Stochastic Transformers

ArXiv ID: 2504.16275

Authors: Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

Abstract: At the core of the Transformer, the Softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often destabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn's algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn's algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the Softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard Vision Transformer and other doubly stochastic Transformers. Beyond the established Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. The QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.

Comment: The paper introduces a quantum-inspired doubly stochastic Transformer, replacing Softmax with a variational quantum circuit. This aligns with the 'Model Architecture' criterion, particularly in exploring novel architectural paradigms.

Relevance: 9 Novelty: 9


4. Symbolic Representation for Any-to-Any Generative Tasks

ArXiv ID: 2504.17261

Authors: Jiaqi Chen, Xiaoye Zhu, Yue Wang, Tianyang Liu, Xinhui Chen, Ying Chen, Chak Tou Leong, Yifei Ke, Joseph Liu, Yiwen Yuan, Julian McAuley, Li-jia Li

Abstract: We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introduces an explicit symbolic representation comprising three core primitives: functions, parameters, and topological logic. Leveraging a pre-trained language model, our inference engine maps natural language instructions directly to symbolic workflows in a training-free manner. Our framework successfully performs over 12 diverse multimodal generative tasks, demonstrating strong performance and flexibility without the need for task-specific tuning. Experiments show that our method not only matches or outperforms existing state-of-the-art unified models in content quality, but also offers greater efficiency, editability, and interruptibility. We believe that symbolic task representations provide a cost-effective and extensible foundation for advancing the capabilities of generative AI.

Comment: The paper proposes a symbolic generative task description language, which introduces a novel paradigm for generative AI. This aligns with emerging trends and foundational research in representation learning.

Relevance: 9 Novelty: 9


5. Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention

ArXiv ID: 2504.16795

Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu

Abstract: A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose \textbf{H}ierarchical \textbf{S}parse \textbf{A}ttention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selecting the top-$k$ chunks and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.

Comment: The paper introduces a novel hierarchical sparse attention mechanism (HSA) for RNNs, enhancing their efficiency and long-range context modeling. This aligns with the 'Model Architecture' criterion, particularly in architectural innovations for efficiency.

Relevance: 9 Novelty: 8


6. Backslash: Rate Constrained Optimized Training of Large Language Models

ArXiv ID: 2504.16968

Authors: Jun Wu, Jiangtao Wen, Yuxing Han

Abstract: The rapid advancement of large-language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (Backslash), a novel training-time compression approach based on rate-distortion optimization (RDO). Backslash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments in various architectures and tasks demonstrate that Backslash can reduce memory usage by 60\% - 90\% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, Backslash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80\% pruning rates), and enables network simplification for accelerated inference on edge devices.

Comment: The paper introduces a novel training-time compression method (Backslash) based on rate-distortion optimization, which aligns with the 'Model Compression' criterion. It provides significant insights into compression during training, a less explored area.

Relevance: 9 Novelty: 8


7. When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

ArXiv ID: 2504.17562

Authors: Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki

Abstract: The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task's prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve model's performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.

Comment: The paper investigates metadata conditioning in language model pretraining, providing theoretical insights into when this technique works or fails. This aligns with foundational research in representation learning and LLM behavior.

Relevance: 9 Novelty: 8


8. Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

ArXiv ID: 2504.16922

Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi

Abstract: Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.

Comment: The paper introduces Generalized Neighborhood Attention (GNA), a sparse attention mechanism with significant efficiency improvements. This aligns with foundational research in model architecture and sparsity.

Relevance: 9 Novelty: 8


9. The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

ArXiv ID: 2504.17768

Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti

Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks-including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) an isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.

Comment: This paper explores sparse attention in Transformer LLMs, directly addressing sparsity and efficiency trade-offs, which aligns with the model compression and efficiency criteria.

Relevance: 9 Novelty: 8


10. Enhancing Variational Autoencoders with Smooth Robust Latent Encoding

ArXiv ID: 2504.17219

Authors: Hyomin Lee, Minseon Kim, Sangwon Jang, Jongheon Jeong, Sung Ju Hwang

Abstract: Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models, as in Stable Diffusion, yet questions regarding their robustness remain largely underexplored. Although adversarial training has been an established technique for enhancing robustness in predictive models, it has been overlooked for generative models due to concerns about potential fidelity degradation by the nature of trade-offs between performance and robustness. In this work, we challenge this presumption, introducing Smooth Robust Latent VAE (SRL-VAE), a novel adversarial training framework that boosts both generation quality and robustness. In contrast to conventional adversarial training, which focuses on robustness only, our approach smooths the latent space via adversarial perturbations, promoting more generalizable representations while regularizing with originality representation to sustain original fidelity. Applied as a post-training step on pre-trained VAEs, SRL-VAE improves image robustness and fidelity with minimal computational overhead. Experiments show that SRL-VAE improves both generation quality, in image reconstruction and text-guided image editing, and robustness, against Nightshade attacks and image editing attacks. These results establish a new paradigm, showing that adversarial training, once thought to be detrimental to generative models, can instead enhance both fidelity and robustness.

Comment: The paper enhances VAEs with adversarial training to improve robustness and fidelity, contributing to foundational insights into representation learning and generative model robustness.

Relevance: 9 Novelty: 8


11. SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures

ArXiv ID: 2504.16140

Authors: Max Hartman, Lav Varshney

Abstract: Joint Embedding Predictive Architectures (JEPA) have emerged as a powerful framework for learning general-purpose representations. However, these models often lack interpretability and suffer from inefficiencies due to dense embedding representations. We propose SparseJEPA, an extension that integrates sparse representation learning into the JEPA framework to enhance the quality of learned representations. SparseJEPA employs a penalty method that encourages latent space variables to be shared among data features with strong semantic relationships, while maintaining predictive performance. We demonstrate the effectiveness of SparseJEPA by training on the CIFAR-100 dataset and pre-training a lightweight Vision Transformer. The improved embeddings are utilized in linear-probe transfer learning for both image classification and low-level tasks, showcasing the architecture's versatility across different transfer tasks. Furthermore, we provide a theoretical proof that demonstrates that the grouping mechanism enhances representation quality. This was done by displaying that grouping reduces Multiinformation among latent-variables, including proofing the Data Processing Inequality for Multiinformation. Our results indicate that incorporating sparsity not only refines the latent space but also facilitates the learning of more meaningful and interpretable representations. In further work, hope to further extend this method by finding new ways to leverage the grouping mechanism through object-centric representation learning.

Comment: SparseJEPA introduces sparse representation learning into Joint Embedding Predictive Architectures, which directly aligns with representation learning and sparsity. The theoretical proof on Multiinformation reduction adds a novel theoretical insight.

Relevance: 9 Novelty: 8


12. Likelihood-Free Variational Autoencoders

ArXiv ID: 2504.17622

Authors: Chen Xu, Qiang Wang, Lijun Sun

Abstract: Variational Autoencoders (VAEs) typically rely on a probabilistic decoder with a predefined likelihood, most commonly an isotropic Gaussian, to model the data conditional on latent variables. While convenient for optimization, this choice often leads to likelihood misspecification, resulting in blurry reconstructions and poor data fidelity, especially for high-dimensional data such as images. In this work, we propose \textit{EnVAE}, a novel likelihood-free generative framework that has a deterministic decoder and employs the energy score -- a proper scoring rule -- to build the reconstruction loss. This enables likelihood-free inference without requiring explicit parametric density functions. To address the computational inefficiency of the energy score, we introduce a fast variant, \textit{FEnVAE}, based on the local smoothness of the decoder and the sharpness of the posterior distribution of latent variables. This yields an efficient single-sample training objective that integrates seamlessly into existing VAE pipelines with minimal overhead. Empirical results on standard benchmarks demonstrate that \textit{EnVAE} achieves superior reconstruction and generation quality compared to likelihood-based baselines. Our framework offers a general, scalable, and statistically principled alternative for flexible and nonparametric distribution learning in generative modeling.

Comment: The paper introduces a likelihood-free VAE framework (EnVAE) with a novel reconstruction loss based on the energy score. This aligns with the 'Model Architecture' criterion, particularly in advancing autoencoder-based generative modeling.

Relevance: 8 Novelty: 8


13. Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models

ArXiv ID: 2504.17660

Authors: Julius Vetter, Manuel Gloeckler, Daniel Gedon, Jakob H. Macke

Abstract: Simulation-based inference (SBI) offers a flexible and general approach to performing Bayesian inference: In SBI, a neural network is trained on synthetic data simulated from a model and used to rapidly infer posterior distributions for observed data. A key goal for SBI is to achieve accurate inference with as few simulations as possible, especially for expensive simulators. In this work, we address this challenge by repurposing recent probabilistic foundation models for tabular data: We show how tabular foundation models -- specifically TabPFN -- can be used as pre-trained autoregressive conditional density estimators for SBI. We propose Neural Posterior Estimation with Prior-data Fitted Networks (NPE-PF) and show that it is competitive with current SBI approaches in terms of accuracy for both benchmark tasks and two complex scientific inverse problems. Crucially, it often substantially outperforms them in terms of simulation efficiency, sometimes requiring orders of magnitude fewer simulations. NPE-PF eliminates the need for inference network selection, training, and hyperparameter tuning. We also show that it exhibits superior robustness to model misspecification and can be scaled to simulation budgets that exceed the context size limit of TabPFN. NPE-PF provides a new direction for SBI, where training-free, general-purpose inference models offer efficient, easy-to-use, and flexible solutions for a wide range of stochastic inverse problems.

Comment: The paper introduces a novel approach to simulation-based Bayesian inference using tabular foundation models, which aligns with foundational research in efficiency and representation learning.

Relevance: 8 Novelty: 8


14. Physics-informed features in supervised machine learning

ArXiv ID: 2504.17112

Authors: Margherita Lampani, Sabrina Guastavino, Michele Piana, Federico Benvenuto

Abstract: Supervised machine learning involves approximating an unknown functional relationship from a limited dataset of features and corresponding labels. The classical approach to feature-based machine learning typically relies on applying linear regression to standardized features, without considering their physical meaning. This may limit model explainability, particularly in scientific applications. This study proposes a physics-informed approach to feature-based machine learning that constructs non-linear feature maps informed by physical laws and dimensional analysis. These maps enhance model interpretability and, when physical laws are unknown, allow for the identification of relevant mechanisms through feature ranking. The method aims to improve both predictive performance in regression tasks and classification skill scores by integrating domain knowledge into the learning process, while also enabling the potential discovery of new physical equations within the context of explainable machine learning.

Comment: The paper discusses a physics-informed approach to feature-based machine learning, which aligns with 'Representation Learning' by integrating domain knowledge into feature extraction, enhancing interpretability and potential discovery of new mechanisms.

Relevance: 8 Novelty: 7


15. HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models

ArXiv ID: 2504.17449

Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Qin Xie, Guiming Xie, Xuejian Gong

Abstract: The significant computational demands of pretrained language models (PLMs), which often require dedicated hardware, present a substantial challenge in serving them efficiently, especially in multi-tenant environments. To address this, we introduce HMI, a Hierarchical knowledge management-based Multi-tenant Inference system, designed to manage tenants with distinct PLMs resource-efficiently. Our approach is three-fold: Firstly, we categorize PLM knowledge into general, domain-specific, and task-specific. Leveraging insights on knowledge acquisition across different model layers, we construct hierarchical PLMs (hPLMs) by extracting and storing knowledge at different levels, significantly reducing GPU memory usage per tenant. Secondly, we establish hierarchical knowledge management for hPLMs generated by various tenants in HMI. We manage domain-specific knowledge with acceptable storage increases by constructing and updating domain-specific knowledge trees based on frequency. We manage task-specific knowledge within limited GPU memory through parameter swapping. Finally, we propose system optimizations to enhance resource utilization and inference throughput. These include fine-grained pipelining via hierarchical knowledge prefetching to overlap CPU and I/O operations with GPU computations, and optimizing parallel implementations with batched matrix multiplications. Our experimental results demonstrate that the proposed HMI can efficiently serve up to 10,000 hPLMs (hBERTs and hGPTs) on a single GPU, with only a negligible compromise in accuracy.

Comment: The paper introduces a hierarchical knowledge management system for efficient multi-tenant inference in PLMs, which aligns with 'Model Compression' and 'Model Architecture' by addressing resource efficiency and hierarchical design.

Relevance: 8 Novelty: 7


16. Towards Robust LLMs: an Adversarial Robustness Measurement Framework

ArXiv ID: 2504.17723

Authors: Natan Levy, Adiel Ashrov, Guy Katz

Abstract: The rise of Large Language Models (LLMs) has revolutionized artificial intelligence, yet these models remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications. While adversarial robustness in vision-based neural networks has been extensively studied, LLM robustness remains under-explored. We adapt the Robustness Measurement and Assessment (RoMA) framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters. By comparing RoMA's estimates to those of formal verification methods, we demonstrate its accuracy with minimal error margins while maintaining computational efficiency. Our empirical evaluation reveals that robustness varies significantly not only between different models but also across categories within the same task and between various types of perturbations. This non-uniformity underscores the need for task-specific robustness evaluations, enabling practitioners to compare and select models based on application-specific robustness requirements. Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.

Comment: The paper adapts a robustness measurement framework for LLMs, providing insights into adversarial robustness, which aligns with the 'Large Language Models' criterion by addressing theoretical aspects of LLM behavior.

Relevance: 8 Novelty: 7


17. HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

ArXiv ID: 2504.16112

Authors: Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim

Abstract: The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memoryintensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Also, the HPU, as an add-on card, scales out to accommodate surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we show the HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPUonly system, providing scalability without increasing the number of GPUs.

Comment: The paper introduces a novel co-processing unit (HPU) to improve LLM inference efficiency, which aligns with model compression and efficiency breakthroughs. The focus on memory-bound operations and GPU-HPU collaboration is relevant to foundational efficiency research.

Relevance: 8 Novelty: 7


18. In-Context Learning can distort the relationship between sequence likelihoods and biological fitness

ArXiv ID: 2504.17068

Authors: Pranav Kantroo, G\"unter P. Wagner, Benjamin B. Machta

Abstract: Language models have emerged as powerful predictors of the viability of biological sequences. During training these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon most prominently manifests as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer-based models to be particularly vulnerable to this effect. This behavior is mediated by a look-up operation where the model seeks the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model's learned priors. This phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features such as reversed complement motifs in RNA sequences that fold into hairpin structures.

Comment: The paper explores how in-context learning distorts sequence likelihoods in biological fitness, providing theoretical insights into LLM behavior. This aligns with foundational research in LLM interpretability.

Relevance: 8 Novelty: 7


19. NeuralGrok: Accelerate Grokking by Neural Gradient Transformation

ArXiv ID: 2504.17243

Authors: Xinyu Zhou, Simin Fan, Martin Jaggi, Jie Fu

Abstract: Grokking is proposed and widely studied as an intricate phenomenon in which generalization is achieved after a long-lasting period of overfitting. In this work, we propose NeuralGrok, a novel gradient-based approach that learns an optimal gradient transformation to accelerate the generalization of transformers in arithmetic tasks. Specifically, NeuralGrok trains an auxiliary module (e.g., an MLP block) in conjunction with the base model. This module dynamically modulates the influence of individual gradient components based on their contribution to generalization, guided by a bilevel optimization algorithm. Our extensive experiments demonstrate that NeuralGrok significantly accelerates generalization, particularly in challenging arithmetic tasks. We also show that NeuralGrok promotes a more stable training paradigm, constantly reducing the model's complexity, while traditional regularization methods, such as weight decay, can introduce substantial instability and impede generalization. We further investigate the intrinsic model complexity leveraging a novel Absolute Gradient Entropy (AGE) metric, which explains that NeuralGrok effectively facilitates generalization by reducing the model complexity. We offer valuable insights on the grokking phenomenon of Transformer models, which encourages a deeper understanding of the fundamental principles governing generalization ability.

Comment: NeuralGrok proposes a gradient-based approach to accelerate grokking in transformers, which aligns with foundational research in training dynamics and generalization in neural networks. The use of Absolute Gradient Entropy (AGE) as a metric adds a novel perspective.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: