This is a remedial run for missed papers from 03/13/2025 to 03/13/2025.

Results generated on 03/24/2025.

Personalized Daily Arxiv Papers 3/14/2025

[gpt-4o] Prompt Completion Total
Token 41757 6534 48291
Cost $0.1 $0.07 $0.17

Total arXiv papers: 289

Total scanned papers: 289

Total relevant papers: 44

Table of contents with paper titles:

  1. Empirical Computation Authors: Eric Tang, Marcel Böhme

  2. Transformers without Normalization Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu

  3. KV-Distill: Nearly Lossless Learnable Context Compression for LLMs Authors: Vivek Chari, Guanghui Qin, Benjamin Van Durme

  4. Compute Optimal Scaling of Skills: Knowledge vs Reasoning Authors: Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes

  5. The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity Authors: Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel

  6. ASIDE: Architectural Separation of Instructions and Data in Language Models Authors: Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert

  7. Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search Authors: Matteo Gambella, Fabrizio Pittorino, Manuel Roveri

  8. ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs Authors: Xin Liu, Pei Liu, Guoming Tang

  9. Spherical dimension Authors: Bogdan Chornomaz, Shay Moran, Tom Waknine

  10. The Relativity of Causal Knowledge Authors: Gabriele D'Acunto, Claudio Battiloro

  11. On the Identifiability of Causal Abstractions Authors: Xiusi Li, Sékou-Oumar Kaba, Siamak Ravanbakhsh

  12. Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks Authors: Eirik Høyheim, Lars Skaaret-Lund, Solve Sæbø, Aliaksandr Hubin

  13. Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning Authors: Jonathan Shaki, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus

  14. Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores Authors: Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan

  15. Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models Authors: Andy Zhou

  16. From Equations to Insights: Unraveling Symbolic Structures in PDEs with LLMs Authors: Rohan Bhatnagar, Ling Liang, Krish Patel, Haizhao Yang

  17. Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data Authors: Dibyakanti Kumar, Samyak Jha, Anirbit Mukherjee

  18. Multiplicative Learning Authors: Han Kim, Hyungjoon Soh, Vipul Periwal, Junghyo Jo

  19. Deep Learning based discovery of Integrable Systems Authors: Shailesh Lal, Suvajit Majumder, Evgeny Sobko

  20. Thermodynamic Bound on Energy and Negentropy Costs of Inference in Deep Neural Networks Authors: Alexei V. Tkachenko

  21. Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild Authors: Damien Teney, Liangze Jiang, Florin Gogianu, Ehsan Abbasnejad

  22. Inter-environmental world modeling for continuous and compositional dynamics Authors: Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro

  23. Radar: Fast Long-Context Decoding for Any Transformer Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung

  24. SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning Authors: Tianhao Peng, Xuhong Li, Haitao Yuan, Yuchen Li, Haoyi Xiong

  25. Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding Authors: Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

  26. Robustness Tokens: Towards Adversarial Robustness of Transformers Authors: Brian Pulfer, Yury Belousov, Slava Voloshynovskiy

  27. Sample Compression for Continual Learning Authors: Jacob Comeau, Mathieu Bazinet, Pascal Germain, Cem Subakan

  28. OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models Authors: Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna

  29. Poly-MgNet: Polynomial Building Blocks in Multigrid-Inspired ResNets Authors: Antonia van Betteray, Matthias Rottmann, Karsten Kahl

  30. HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Authors: Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger

  31. Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration Authors: Emily C. Ehrhardt, Hanno Gottschalk, Tobias J. Riedlinger

  32. Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations Authors: Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, Antonio Orvieto

  33. Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling Authors: Shuqi Lu, Xiaohong Ji, Bohang Zhang, Lin Yao, Siyuan Liu, Zhifeng Gao, Linfeng Zhang, Guolin Ke

  34. Numerical Error Analysis of Large Language Models Authors: Stanislav Budzinskiy, Wenyi Fang, Longbin Zeng, Philipp Petersen

  35. Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective Authors: Xiaoming Zhao, Alexander G. Schwing

  36. PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction Authors: Han Wan, Qi Wang, Yuan Mi, Hao Sun

  37. Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More Authors: Arvid Frydenlund

  38. DTA: Dual Temporal-channel-wise Attention for Spiking Neural Networks Authors: Minje Kim, Minjun Kim, Xu Yang

  39. Structured Preconditioners in Adaptive Optimization: A Unified Analysis Authors: Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li

  40. Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers? Authors: Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta

  41. Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures Authors: Nina Vesseron, Louis Béthune, Marco Cuturi

  42. Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning Authors: Yongqi Li, Xiaowei Zhang

  43. OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary Authors: Yifeng Yang, Lin Zhu, Zewen Sun, Hengyu Liu, Qinying Gu, Nanyang Ye

  44. AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu


1. Empirical Computation

ArXiv ID: 2503.10954

Authors: Eric Tang, Marcel Böhme

Abstract: In this vision paper, we explore the challenges and opportunities of a form of computation that employs an empirical (rather than a formal) approach, where the solution of a computational problem is returned as empirically most likely (rather than necessarily correct). We call this approach as empirical computation and observe that its capabilities and limits cannot be understood within the classic, rationalist framework of computation. While we take a very broad view of "computational problem", a classic, well-studied example is sorting: Given a set of $n$ numbers, return these numbers sorted in ascending order. * To run a classical, formal computation, we might first think about a specific algorithm (e.g., merge sort) before developing a specific program that implements it. The program will expect the input to be given in a specific format, type, or data structure (e.g., unsigned 32-bit integers). In software engineering, we have many approaches to analyze the correctness of such programs. From complexity theory, we know that there exists no correct program that can solve the average instance of the sorting problem faster than $O(n\log n)$. * To run an empirical computation, we might directly ask a large language model (LLM) to solve any computational problem (which can be stated informally in natural language) and provide the input in any format (e.g., negative numbers written as Chinese characters). There is no (problem-specific) program that could be analyzed for correctness. Also, the time it takes an LLM to return an answer is entirely independent of the computational complexity of the problem that is solved. What are the capabilities or limits of empirical computation in the general, in the problem-, or in the instance-specific? Our purpose is to establish empirical computation as a field in SE that is timely and rich with interesting problems.

Comment: The paper introduces 'empirical computation,' a novel paradigm that challenges classical computational frameworks. This aligns with the 'Emerging Trends' criterion as it proposes a cutting-edge theoretical direction.

Relevance: 9 Novelty: 9


2. Transformers without Normalization

ArXiv ID: 2503.10622

Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu

Abstract: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.

Comment: The paper introduces Dynamic Tanh as a replacement for normalization layers in Transformers, aligning with 'Model Architecture' due to its challenge to conventional practices.

Relevance: 9 Novelty: 8
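
To make the element-wise operation from the abstract concrete, here is a minimal pure-Python sketch that applies tanh(alpha * x) to a vector. The function name `dyt`, the default value of `alpha`, and the list-based representation are assumptions made for this illustration; the abstract does not say how alpha is initialized or trained, or whether the full method uses additional parameters.

```python
import math

def dyt(xs, alpha=0.5):
    """Apply tanh(alpha * x) element-wise to a list of floats.

    alpha is a scalar; 0.5 is an arbitrary illustrative default,
    not a value taken from the paper.
    """
    return [math.tanh(alpha * x) for x in xs]

# Hypothetical activations with a wide dynamic range.
activations = [-10.0, -1.0, 0.0, 1.0, 10.0]
squashed = dyt(activations)
```

Every output lies strictly between -1 and 1, so extreme inputs are squashed while values near zero pass through almost linearly, which is the S-shaped mapping the abstract refers to.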


3. KV-Distill: Nearly Lossless Learnable Context Compression for LLMs

ArXiv ID: 2503.10337

Authors: Vivek Chari, Guanghui Qin, Benjamin Van Durme

Abstract: Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations (stored in the so-called KV cache) account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models, and enables the compression of arbitrary spans of a context while preserving pre-trained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.

Comment: KV-Distill introduces a compression framework for LLMs, aligning with 'Model Compression' due to its focus on efficient context representation.

Relevance: 9 Novelty: 8
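
The student-teacher objective mentioned in the abstract can be sketched as a KL divergence between two next-token distributions. This is only an illustration under stated assumptions: the abstract says "KL-type divergence" without fixing the exact form or direction, so the forward KL used here, the helper names `softmax` and `kl_divergence`, and the toy logits are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """Forward KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits over a 3-token vocabulary:
# the teacher reads the uncompressed cache, the student the compressed one.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.8]

loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

The loss is zero when the two distributions agree and grows as the student's predictions drift away from the teacher's.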


4. Compute Optimal Scaling of Skills: Knowledge vs Reasoning

ArXiv ID: 2503.10061

Authors: Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes

Abstract: Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.

Comment: The paper studies skill-dependent scaling laws in LLMs, aligning with 'Large Language Models' due to its theoretical insights into scaling behavior.

Relevance: 9 Novelty: 8


5. The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity

ArXiv ID: 2503.10587

Authors: Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel

Abstract: Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow Neural Networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. All together, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and its relation with the implicit bias, offering potential pathways for designing more efficient and robust models.

Comment: The paper explores the spectral bias of shallow neural networks shaped by activation functions, providing theoretical insights into representation learning and training dynamics.

Relevance: 9 Novelty: 8


6. ASIDE: Architectural Separation of Instructions and Data in Language Models

ArXiv ID: 2503.10566

Authors: Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert

Abstract: Despite their remarkable performance, large language models lack elementary safety features, and this makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause for the success of prompt injection attacks. In this work, we propose an architectural change, ASIDE, that allows the model to clearly separate between instructions and data by using separate embeddings for them. Instead of training the embeddings from scratch, we propose a method to convert an existing model to ASIDE form by using two copies of the original model's embeddings layer, and applying an orthogonal rotation to one of them. We demonstrate the effectiveness of our method by showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.

Comment: The paper proposes an architectural change (ASIDE) for LLMs to separate instructions and data, contributing to foundational insights into LLM architecture.

Relevance: 9 Novelty: 8
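
The conversion described in the abstract, two copies of the embedding layer with an orthogonal rotation applied to one of them, can be sketched in a few lines. Everything specific below is an assumption for illustration: the 90-degree pairwise rotation, the choice to rotate the data copy rather than the instruction copy, the toy embedding values, and the names `rotate_pairs` and `embed`.

```python
def rotate_pairs(vec):
    """Rotate each consecutive coordinate pair (a, b) by 90 degrees to (-b, a).

    This is one simple orthogonal map on even-dimensional vectors; it is an
    illustrative choice, not necessarily the rotation used in the paper.
    """
    assert len(vec) % 2 == 0, "this sketch assumes an even embedding dimension"
    out = []
    for i in range(0, len(vec), 2):
        a, b = vec[i], vec[i + 1]
        out.extend([-b, a])
    return out

# Toy embedding table (hypothetical values): token id -> 4-dimensional vector.
embedding = {0: [1.0, 0.0, 2.0, 1.0], 1: [0.5, -1.0, 0.0, 3.0]}

def embed(token_id, is_data):
    """Original embedding for instruction tokens, rotated copy for data tokens."""
    vec = embedding[token_id]
    return rotate_pairs(vec) if is_data else list(vec)

instr_vec = embed(0, is_data=False)
data_vec = embed(0, is_data=True)
```

Because the map is orthogonal, the rotated vector keeps its norm, so the same token gets a different but equally scaled representation depending on whether it appears as an instruction or as data.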


7. Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search

ArXiv ID: 2503.10404

Authors: Matteo Gambella, Fabrizio Pittorino, Manuel Roveri

Abstract: Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A$^2$M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A$^2$M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, A$^2$M is able to increase the test accuracy, on average across different differentiable NAS methods, by +3.60% on CIFAR-10, +4.60% on CIFAR-100, and +3.64% on ImageNet16-120, demonstrating its superior effectiveness in practice. A$^2$M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at https://github.com/AI-Tech-Research-Lab/AsquaredM.

Comment: The paper proposes a framework for flat minima in neural architecture search, contributing to foundational research in model architecture and optimization.

Relevance: 9 Novelty: 8


8. ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs

ArXiv ID: 2503.10714

Authors: Xin Liu, Pei Liu, Guoming Tang

Abstract: The linear growth of key-value (KV) cache memory and quadratic computational complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often suffer from irreversible information loss or require costly parameter retraining. We propose ZeroMerge, a dynamic zero-shot compression framework that achieves efficient cache management through three key innovations: (1) Fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) A residual merging mechanism that preserves critical context through compensated attention scoring, and (3) Parameter-free adaptation compatible with diverse LLM architectures without retraining. Comprehensive evaluations across LLaMA-2 model demonstrate that ZeroMerge maintains full-cache performance at 5% compression ratios while doubling inference throughput at 40K token lengths. The method effectively balances memory efficiency, generation quality, and deployment flexibility, advancing practical long-context LLM applications. The code is available at https://github.com/SusCom-Lab/ZeroMerge.

Comment: The paper proposes a parameter-free KV cache compression method for LLMs, contributing to foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8


9. Spherical dimension

ArXiv ID: 2503.10240

Authors: Bogdan Chornomaz, Shay Moran, Tom Waknine

Abstract: We introduce and study the spherical dimension, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size d (in the VC sense) is completed into a continuous object, specifically a d-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.

Comment: The paper introduces spherical dimension as a topological relaxation of VC dimension, contributing to foundational research in representation learning and theoretical insights.

Relevance: 9 Novelty: 8


10. The Relativity of Causal Knowledge

ArXiv ID: 2503.11718

Authors: Gabriele D'Acunto, Claudio Battiloro

Abstract: Recent advances in artificial intelligence reveal the limits of purely predictive systems and call for a shift toward causal and collaborative reasoning. Drawing inspiration from the revolution of Grothendieck in mathematics, we introduce the relativity of causal knowledge, which posits structural causal models (SCMs) are inherently imperfect, subjective representations embedded within networks of relationships. By leveraging category theory, we arrange SCMs into a functor category and show that their observational and interventional probability measures naturally form convex structures. This result allows us to encode non-intervened SCMs with convex spaces of probability measures. Next, using sheaf theory, we construct the network sheaf and cosheaf of causal knowledge. These structures enable the transfer of causal knowledge across the network while incorporating interventional consistency and the perspective of the subjects, ultimately leading to the formal, mathematical definition of relative causal knowledge.

Comment: The paper introduces a novel perspective on causal knowledge using category theory, which aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 8


11. On the Identifiability of Causal Abstractions

ArXiv ID: 2503.10834

Authors: Xiusi Li, Sékou-Oumar Kaba, Siamak Ravanbakhsh

Abstract: Causal representation learning (CRL) enhances machine learning models' robustness and generalizability by learning structural causal models associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown intervention, to identify the latent causal model. (Brehmer et al., 2022) showed that this is indeed possible, given that all latent variables can be intervened on individually. However, this is a highly restrictive assumption in many systems. In this work, we instead assume interventions on arbitrary subsets of latent variables, which is more realistic. We introduce a theoretical framework that calculates the degree to which we can identify a causal model, given a set of possible interventions, up to an abstraction that describes the system at a higher level of granularity.

Comment: The paper explores causal representation learning with a focus on identifiability, aligning with foundational research in representation learning.

Relevance: 9 Novelty: 8


12. Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks

ArXiv ID: 2503.10496

Authors: Eirik Høyheim, Lars Skaaret-Lund, Solve Sæbø, Aliaksandr Hubin

Abstract: Modeling natural phenomena with artificial neural networks (ANNs) often provides highly accurate predictions. However, ANNs often suffer from over-parameterization, complicating interpretation and raising uncertainty issues. Bayesian neural networks (BNNs) address the latter by representing weights as probability distributions, allowing for predictive uncertainty evaluation. Latent binary Bayesian neural networks (LBBNNs) further handle structural uncertainty and sparsify models by removing redundant weights. This article advances LBBNNs by enabling covariates to skip to any succeeding layer or be excluded, simplifying networks and clarifying input impacts on predictions. Ultimately, a linear model or even a constant can be found to be optimal for a specific problem at hand. Furthermore, the input-skip LBBNN approach reduces network density significantly compared to standard LBBNNs, achieving over 99% reduction for small networks and over 99.9% for larger ones, while still maintaining high predictive accuracy and uncertainty measurement. For example, on MNIST, we reached 97% accuracy and great calibration with just 935 weights, reaching state-of-the-art for compression of neural networks. Furthermore, the proposed method accurately identifies the true covariates and adjusts for system non-linearity. The main contribution is the introduction of active paths, enhancing directly designed global and local explanations within the LBBNN framework, that have theoretical guarantees and do not require post hoc external tools for explanations.

Comment: The paper introduces input-skip Latent Binary Bayesian Neural Networks, contributing to sparsity and model compression with theoretical guarantees.

Relevance: 9 Novelty: 8


13. Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning

ArXiv ID: 2503.10408

Authors: Jonathan Shaki, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus

Abstract: We study the capabilities of Large Language Models (LLM) on binary relations, a ubiquitous concept in math employed in most reasoning, math and logic benchmarks. This work focuses on equality, inequality, and inclusion, along with the properties they satisfy, such as ir/reflexivity, a/symmetry, transitivity, and logical complexity (e.g., number of reasoning "hops"). We propose an alternative to in-context learning that trains only the representations of newly introduced tokens, namely out-of-context representation learning. This method mitigates linguistic biases already present in a model and, differently from in-context learning, does not rely on external information or illustrations. We argue out-of-context representation learning as a better alternative to in-context learning and fine-tuning to evaluate the capabilities of LLMs on logic tasks that are the building blocks of more complex reasoning benchmarks.

Comment: The paper introduces out-of-context representation learning for logical tasks, contributing to foundational insights into representation learning and LLM behavior.

Relevance: 9 Novelty: 8


14. Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores

ArXiv ID: 2503.10725

Authors: Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan

Abstract: The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency without compromising model accuracy. Structured sparsity emerges as a compelling strategy to address these challenges by leveraging the emerging sparse computing hardware. Prior works mainly focus on the sparsity in model parameters, neglecting the inherent sparse patterns in activations. This oversight can lead to additional computational costs associated with activations, potentially resulting in suboptimal performance. This paper presents Samoyeds, an innovative acceleration system for MoE LLMs utilizing Sparse Tensor Cores (SpTCs). Samoyeds is the first to apply sparsity simultaneously to both activations and model parameters. It introduces a bespoke sparse data format tailored for MoE computation and develops a specialized sparse-sparse matrix multiplication kernel. Furthermore, Samoyeds incorporates systematic optimizations specifically designed for the execution of dual-side structured sparse MoE LLMs on SpTCs, further enhancing system performance. Evaluations show that Samoyeds outperforms SOTA works by up to 1.99$\times$ at the kernel level and 1.58$\times$ at the model level. Moreover, it enhances memory efficiency, increasing maximum supported batch sizes by 4.41$\times$ on average. Additionally, Samoyeds surpasses existing SOTA structured sparse solutions in both model accuracy and hardware portability.

Comment: The paper presents Samoyeds, focusing on structured sparsity in MoE models, which aligns closely with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


15. Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models

ArXiv ID: 2503.10617

Authors: Andy Zhou

Abstract: Adapting large language models to multiple tasks can cause cross-skill interference, where improvements for one skill degrade another. While methods such as LoRA impose orthogonality constraints at the weight level, they do not fully address interference in hidden-state representations. We propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based approach that learns multiple orthonormal subspace transformations, each specializing in a distinct skill, and composes them via a lightweight router. By isolating these subspace edits in the hidden state, rather than weight matrices, CS-ReFT prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring only 0.0098% of model parameters. These findings show that specialized representation edits, composed via a simple router, significantly enhance multi-task instruction following with minimal overhead.

Comment: The paper proposes CS-ReFT, focusing on representation-based fine-tuning for LLMs, contributing to foundational insights into representation learning and LLM behavior.

Relevance: 9 Novelty: 8


16. From Equations to Insights: Unraveling Symbolic Structures in PDEs with LLMs

ArXiv ID: 2503.09986

Authors: Rohan Bhatnagar, Ling Liang, Krish Patel, Haizhao Yang

Abstract: Motivated by the remarkable success of artificial intelligence (AI) across diverse fields, the application of AI to solve scientific problems, often formulated as partial differential equations (PDEs), has garnered increasing attention. While most existing research concentrates on theoretical properties (such as well-posedness, regularity, and continuity) of the solutions, alongside direct AI-driven methods for solving PDEs, the challenge of uncovering symbolic relationships within these equations remains largely unexplored. In this paper, we propose leveraging large language models (LLMs) to learn such symbolic relationships. Our results demonstrate that LLMs can effectively predict the operators involved in PDE solutions by utilizing the symbolic information in the PDEs. Furthermore, we show that discovering these symbolic relationships can substantially improve both the efficiency and accuracy of the finite expression method for finding analytical approximation of PDE solutions, delivering a fully interpretable solution pipeline. This work opens new avenues for understanding the symbolic structure of scientific problems and advancing their solution processes.

Comment: The paper explores symbolic structures in PDEs using LLMs, contributing to foundational research in AI for science.

Relevance: 9 Novelty: 8


17. Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data

ArXiv ID: 2503.10428

Authors: Dibyakanti Kumar, Samyak Jha, Anirbit Mukherjee

Abstract: In this work, we will establish that the Langevin Monte-Carlo algorithm can learn depth-2 neural nets of any size and for any data and we give non-asymptotic convergence rates for it. We achieve this via showing that under Total Variation distance and q-Rényi divergence, the iterates of Langevin Monte Carlo converge to the Gibbs distribution of Frobenius norm regularized losses for any of these nets, when using smooth activations and in both classification and regression settings. Most critically, the amount of regularization needed for our results is independent of the size of the net. This result combines several recent observations, like our previous papers showing that two-layer neural loss functions can always be regularized by a certain constant amount such that they satisfy the Villani conditions, and thus their Gibbs measures satisfy a Poincaré inequality.

Comment: The paper provides theoretical insights into learning depth-2 neural networks using Langevin Monte-Carlo, which contributes to foundational research in representation learning and training dynamics.

Relevance: 8 Novelty: 8
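
For readers unfamiliar with the algorithm named in this abstract, here is a minimal sketch of a Langevin Monte-Carlo iteration, assuming the standard update w ← w − η∇L(w) + sqrt(2η/β)·ξ with ξ drawn from a standard Gaussian. The tiny tanh network, the function name `lmc_depth2`, and every hyperparameter value are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def lmc_depth2(X, y, width=8, lam=0.1, eta=1e-3, beta=1e3, steps=200, seed=0):
    """Langevin Monte-Carlo on a depth-2 net f(x) = a^T tanh(W x).

    Loss: 0.5 * mean squared error + (lam / 2) * (||W||_F^2 + ||a||^2).
    All hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.5, size=(width, d))
    a = rng.normal(scale=0.5, size=width)
    noise_scale = np.sqrt(2.0 * eta / beta)   # Gaussian noise level per step
    for _ in range(steps):
        H = np.tanh(X @ W.T)                  # hidden activations, shape (n, width)
        r = H @ a - y                         # residuals, shape (n,)
        grad_a = H.T @ r / n + lam * a
        grad_W = ((r[:, None] * a) * (1.0 - H**2)).T @ X / n + lam * W
        # Gradient step plus injected Gaussian noise
        W = W - eta * grad_W + noise_scale * rng.normal(size=W.shape)
        a = a - eta * grad_a + noise_scale * rng.normal(size=a.shape)
    return W, a
```

The Frobenius-norm term `lam` plays the role of the regularization mentioned in the abstract; the amount the paper's guarantees actually require is not reproduced here.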


18. Multiplicative Learning

ArXiv ID: 2503.10144

Authors: Han Kim, Hyungjoon Soh, Vipul Periwal, Junghyo Jo

Abstract: Efficient training of artificial neural networks remains a key challenge in deep learning. Backpropagation (BP), the standard learning algorithm, relies on gradient descent and typically requires numerous iterations for convergence. In this study, we introduce Expectation Reflection (ER), a novel learning approach that updates weights multiplicatively based on the ratio of observed to predicted outputs. Unlike traditional methods, ER maintains consistency without requiring ad hoc loss functions or learning rate hyperparameters. We extend ER to multilayer networks and demonstrate its effectiveness in performing image classification tasks. Notably, ER achieves optimal weight updates in a single iteration. Additionally, we reinterpret ER as a modified form of gradient descent incorporating the inverse mapping of target propagation. These findings suggest that ER provides an efficient and scalable alternative for training neural networks.

Comment: The paper introduces Expectation Reflection, a novel multiplicative learning approach, aligning with 'Representation Learning' due to its innovative training dynamics.

Relevance: 8 Novelty: 8


19. Deep Learning based discovery of Integrable Systems

ArXiv ID: 2503.10469

Authors: Shailesh Lal, Suvajit Majumder, Evgeny Sobko

Abstract: We introduce a novel machine learning based framework for discovering integrable models. Our approach first employs a synchronized ensemble of neural networks to find high-precision numerical solution to the Yang-Baxter equation within a specified class. Then, using an auxiliary system of algebraic equations, [Q_2, Q_3] = 0, and the numerical value of the Hamiltonian obtained via deep learning as a seed, we reconstruct the entire Hamiltonian family, forming an algebraic variety. We illustrate our presentation with three- and four-dimensional spin chains of difference form with local interactions. Remarkably, all discovered Hamiltonian families form rational varieties.

Comment: The paper introduces a novel framework for discovering integrable systems using neural networks, which aligns with foundational AI for science research.

Relevance: 8 Novelty: 8


20. Thermodynamic Bound on Energy and Negentropy Costs of Inference in Deep Neural Networks

ArXiv ID: 2503.09980

Authors: Alexei V. Tkachenko

Abstract: The fundamental thermodynamic bound is derived for the energy cost of inference in Deep Neural Networks (DNNs). By applying Landauer's principle, we demonstrate that the linear operations in DNNs can, in principle, be performed reversibly, whereas the non-linear activation functions impose an unavoidable energy cost. The resulting theoretical lower bound on the inference energy is determined by the average number of neurons undergoing state transition for each inference. We also restate the thermodynamic bound in terms of negentropy, a metric which is more universal than energy for assessing thermodynamic cost of information processing. The concept of negentropy is further elaborated in the context of information processing in biological and engineered systems as well as human intelligence. Our analysis provides insight into the physical limits of DNN efficiency and suggests potential directions for developing energy-efficient AI architectures that leverage reversible analog computing.

Comment: The paper derives thermodynamic bounds for inference in DNNs, contributing to foundational insights into efficiency and theoretical limits.

Relevance: 8 Novelty: 8
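
To give a sense of scale for Landauer's principle, the sketch below computes the textbook limit of k_B·T·ln 2 joules per irreversibly erased bit. Treating each neuron state transition as roughly one erased bit is an assumption made here for illustration; the paper's bound may carry different constants, and the function name is hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def landauer_energy_joules(bits_erased, temperature_kelvin=300.0):
    """Minimum energy needed to irreversibly erase `bits_erased` bits at temperature T."""
    return bits_erased * K_B * temperature_kelvin * math.log(2)

# Hypothetical example: one million state transitions per inference at 300 K.
energy_per_inference = landauer_energy_joules(1e6)
```

At room temperature the limit is on the order of 10^-21 J per bit, many orders of magnitude below what current digital hardware spends per operation.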


21. Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild

ArXiv ID: 2503.10065

Authors: Damien Teney, Liangze Jiang, Florin Gogianu, Ehsan Abbasnejad

Abstract: Neural architectures tend to fit their data with relatively simple functions. This "simplicity bias" is widely regarded as key to their success. This paper explores the limits of this principle. Building on recent findings that the simplicity bias stems from ReLU activations [96], we introduce a method to meta-learn new activation functions and inductive biases better suited to specific tasks. Findings: We identify multiple tasks where the simplicity bias is inadequate and ReLUs suboptimal. In these cases, we learn new activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias induced by ReLUs proves adequate on image tasks where the best learned activations are nearly identical to ReLUs and GeLUs. Implications: Contrary to popular belief, the simplicity bias of ReLU networks is not universally useful. It is near-optimal for image classification, but other inductive biases are sometimes preferable. We showed that activation functions can control these inductive biases, but future tailored architectures might provide further benefits. Advances are still needed to characterize a model's inductive biases beyond "complexity", and their adequacy with the data.

Comment: The paper explores meta-learning activation functions to optimize inductive biases, contributing to architectural innovations and representation learning.

Relevance: 8 Novelty: 8


22. Inter-environmental world modeling for continuous and compositional dynamics

ArXiv ID: 2503.09911

Authors: Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro

Abstract: Various world model frameworks are being developed today based on autoregressive frameworks that rely on discrete representations of actions and observations, and these frameworks are succeeding in constructing interactive generative models for the target environment of interest. Meanwhile, humans demonstrate remarkable generalization abilities to combine experiences in multiple environments to mentally simulate and learn to control agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and object-centric autoencoder. On synthetic benchmark and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

Comment: The paper introduces WLA for inter-environmental world modeling, contributing to foundational research in representation learning and emerging trends.

Relevance: 8 Novelty: 8


23. Radar: Fast Long-Context Decoding for Any Transformer

ArXiv ID: 2503.10571

Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung

Abstract: Transformer models have demonstrated exceptional performance across a wide range of applications. Though forming the foundation of Transformer models, the dot-product attention does not scale well to long-context data since its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with the previous methods on a wide range of tasks. The results demonstrate that Radar achieves the state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.

Comment: Radar proposes a training-free method to accelerate Transformer inference for long-context data, aligning with the 'Model Compression' criterion due to its focus on efficiency improvements.

Relevance: 8 Novelty: 7


24. SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning

ArXiv ID: 2503.10100

Authors: Tianhao Peng, Xuhong Li, Haitao Yuan, Yuchen Li, Haoyi Xiong

Abstract: Graph contrastive learning has emerged as a powerful technique for learning graph representations that are robust and discriminative. However, traditional approaches often neglect the critical role of subgraph structures, particularly the intra-subgraph characteristics and inter-subgraph relationships, which are crucial for generating informative and diverse contrastive pairs. These subgraph features are crucial as they vary significantly across different graph types, such as social networks where they represent communities, and biochemical networks where they symbolize molecular interactions. To address this issue, our work proposes a novel subgraph-oriented learnable augmentation method for graph contrastive learning, termed SOLA-GCL, that centers around subgraphs, taking full advantage of the subgraph information for data augmentation. Specifically, SOLA-GCL initially partitions a graph into multiple densely connected subgraphs based on their intrinsic properties. To preserve and enhance the unique characteristics inherent to subgraphs, a graph view generator optimizes augmentation strategies for each subgraph, thereby generating tailored views for graph contrastive learning. This generator uses a combination of intra-subgraph and inter-subgraph augmentation strategies, including node dropping, feature masking, intra-edge perturbation, inter-edge perturbation, and subgraph swapping. Extensive experiments have been conducted on various graph learning applications, ranging from social networks to molecules, under semi-supervised learning, unsupervised learning, and transfer learning settings to demonstrate the superiority of our proposed approach over the state-of-the-art in GCL.

Comment: SOLA-GCL proposes a subgraph-oriented augmentation method for graph contrastive learning, contributing to representation learning with novel augmentation strategies.

Relevance: 8 Novelty: 7


25. Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

ArXiv ID: 2503.10135

Authors: Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

Abstract: Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To this end, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.

Comment: Gumiho introduces a hybrid speculative decoding architecture for LLMs, aligning with 'Model Architecture' due to its structural innovation.

Relevance: 8 Novelty: 7
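
The verification step described in this abstract can be sketched as follows, assuming greedy (exact-match) verification rather than any sampling-based acceptance rule. The function name and the token lists are hypothetical; the sketch only illustrates why an early mismatch is costlier than a late one.

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Return the tokens kept after greedy verification of a draft sequence.

    draft_tokens:  tokens proposed by the draft heads.
    target_tokens: the target model's own greedy choice at each draft position.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            # The target's token replaces the first mismatch;
            # every later draft token is discarded.
            accepted.append(t)
            break
        accepted.append(d)
    return accepted

late_miss = accept_draft_tokens([5, 7, 9, 2], [5, 7, 1, 2])   # mismatch at position 3
early_miss = accept_draft_tokens([3, 7, 9, 2], [5, 7, 9, 2])  # mismatch at position 1
```

In the second call the later three draft tokens agree with the target but are still thrown away, which is the intuition behind spending more capacity on the early heads.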


26. Robustness Tokens: Towards Adversarial Robustness of Transformers

ArXiv ID: 2503.10191

Authors: Brian Pulfer, Yury Belousov, Slava Voloshynovskiy

Abstract: Recently, large pre-trained foundation models have become widely adopted by machine learning practitioners for a multitude of tasks. Given that such models are publicly available, relying on their use as backbone models for downstream tasks might result in high vulnerability to adversarial attacks crafted with the same public model. In this work, we propose Robustness Tokens, a novel approach specific to the transformer architecture that fine-tunes a few additional private tokens with low computational requirements instead of tuning model parameters as done in traditional adversarial training. We show that Robustness Tokens make Vision Transformer models significantly more robust to white-box adversarial attacks while also retaining the original downstream performances.

Comment: The paper proposes Robustness Tokens for adversarial robustness in Transformers, aligning with 'Model Architecture' due to its structural innovation.

Relevance: 8 Novelty: 7


27. Sample Compression for Continual Learning

ArXiv ID: 2503.10503

Authors: Jacob Comeau, Mathieu Bazinet, Pascal Germain, Cem Subakan

Abstract: Continual learning algorithms aim to learn from a sequence of tasks, making the training distribution non-stationary. The majority of existing continual learning approaches in the literature rely on heuristics and do not provide learning guarantees for the continual learning setup. In this paper, we present a new method called 'Continual Pick-to-Learn' (CoP2L), which is able to retain the most representative samples for each task in an efficient way. The algorithm is adapted from the Pick-to-Learn algorithm, rooted in the sample compression theory. This allows us to provide high-confidence upper bounds on the generalization loss of the learned predictors, numerically computable after every update of the learned model. We also empirically show on several standard continual learning benchmarks that our algorithm is able to outperform standard experience replay, significantly mitigating catastrophic forgetting.

Comment: The paper introduces a sample compression method for continual learning, which aligns with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7


28. OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models

ArXiv ID: 2503.10959

Authors: Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna

Abstract: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMM's recurrent state transitions restrict the capture of long-range interactions and lead to semantically weak synthetic data, and (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen to generate semantically rich and meaningful synthetic data. It applies contrastive learning on patch level VMM features generated through neighborhood interactions in the latent state space, (2) OuroMamba-Quant to employ mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a thresholding-based outlier channel selection strategy for activations that is updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve practical latency speedup of up to 2.36x. Code will be released soon.

Comment: The paper presents a data-free quantization framework for vision models, contributing to foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7


29. Poly-MgNet: Polynomial Building Blocks in Multigrid-Inspired ResNets

ArXiv ID: 2503.10594

Authors: Antonia van Betteray, Matthias Rottmann, Karsten Kahl

Abstract: The structural analogies of ResNets and Multigrid (MG) methods such as common building blocks like convolutions and poolings were already pointed out by He et al. in 2016. Multigrid methods are used in the context of scientific computing for solving large sparse linear systems arising from partial differential equations. MG methods particularly rely on two main concepts: smoothing and residual restriction / coarsening. Exploiting these analogies, He and Xu developed the MgNet framework, which integrates MG schemes into the design of ResNets. In this work, we introduce a novel neural network building block inspired by polynomial smoothers from MG theory. Our polynomial block from an MG perspective naturally extends the MgNet framework to Poly-MgNet and at the same time reduces the number of weights in MgNet. We present a comprehensive study of our polynomial block, analyzing the choice of initial coefficients, the polynomial degree, the placement of activation functions, as well as of batch normalizations. Our results demonstrate that constructing (quadratic) polynomial building blocks based on real and imaginary polynomial roots enhances Poly-MgNet's capacity in terms of accuracy. Furthermore, our approach achieves an improved trade-off of model accuracy and number of weights compared to ResNet as well as compared to specific configurations of MgNet.

Comment: The paper introduces polynomial building blocks inspired by multigrid methods, contributing to architectural innovations in neural networks.

Relevance: 8 Novelty: 7


30. HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks

ArXiv ID: 2503.10894

Authors: Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger

Abstract: Mechanistic interpretability has made great strides in identifying neural network features (e.g., directions in hidden activation space) that mediate concepts (e.g., the birth year of a person) and enable predictable manipulation. Distributed alignment search (DAS) leverages supervision from counterfactual data to learn concept features within hidden states, but DAS assumes we can afford to conduct a brute force search over potential feature locations. To address this, we present HyperDAS, a transformer-based hypernetwork architecture that (1) automatically locates the token-positions of the residual stream that a concept is realized in and (2) constructs features of those residual stream vectors for the concept. In experiments with Llama3-8B, HyperDAS achieves state-of-the-art performance on the RAVEL benchmark for disentangling concepts in hidden states. In addition, we review the design decisions we made to mitigate the concern that HyperDAS (like all powerful interpretability methods) might inject new information into the target model rather than faithfully interpreting it.

Comment: The paper proposes HyperDAS for automating mechanistic interpretability, contributing to foundational research in representation learning and interpretability.

Relevance: 8 Novelty: 7


31. Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration

ArXiv ID: 2503.10729

Authors: Emily C. Ehrhardt, Hanno Gottschalk, Tobias J. Riedlinger

Abstract: NeuralODE is one example of generative machine learning based on the push forward of a simple source measure with a bijective mapping, which in the case of NeuralODE is given by the flow of an ordinary differential equation. Using Liouville's formula, the log-density of the push forward measure is easy to compute and thus NeuralODE can be trained based on the maximum likelihood method such that the Kullback-Leibler divergence between the push forward through the flow map and the target measure generating the data becomes small. In this work, we give a detailed account of the consistency of maximum likelihood based empirical risk minimization for a generic class of target measures. In contrast to prior work, we do not only consider the statistical learning theory, but also give a detailed numerical analysis of the NeuralODE algorithm based on the 2nd order Runge-Kutta (RK) time integration. Using the universal approximation theory for deep ReQU networks, the stability and convergence rates for the RK scheme as well as metric entropy and concentration inequalities, we are able to prove that NeuralODE is a probably approximately correct (PAC) learning algorithm.

Comment: The paper provides a detailed analysis of NeuralODE with Runge-Kutta integration, contributing to foundational research in representation learning and generative modeling.

Relevance: 8 Novelty: 7
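
The combination of a second-order Runge-Kutta scheme with Liouville's formula can be sketched as below, assuming Heun's method as the RK2 variant and a hand-supplied divergence function. In an actual NeuralODE the vector field `f` would be a neural network and the divergence would be the trace of its Jacobian; the linear field used in the usage lines is purely an illustrative stand-in.

```python
import numpy as np

def rk2_flow_with_logdensity(f, div_f, x0, logp0, T=1.0, steps=100):
    """Integrate dx/dt = f(x) and d(log p)/dt = -div f(x) with Heun's RK2 scheme."""
    h = T / steps
    x, logp = np.asarray(x0, dtype=float), float(logp0)
    for _ in range(steps):
        k1x, k1l = f(x), -div_f(x)            # slopes at the current point
        x_pred = x + h * k1x                  # Euler predictor
        k2x, k2l = f(x_pred), -div_f(x_pred)  # slopes at the predicted point
        x = x + 0.5 * h * (k1x + k2x)         # Heun corrector for the state
        logp = logp + 0.5 * h * (k1l + k2l)   # same rule for the log-density
    return x, logp

# Illustrative linear field f(x) = A x, whose divergence is trace(A).
A = np.diag([-0.5, -0.3])
x_T, logp_T = rk2_flow_with_logdensity(
    lambda v: A @ v, lambda v: np.trace(A), np.array([1.0, 2.0]), 0.0
)
```

Because this flow contracts volume (negative divergence), the log-density along the trajectory increases by −trace(A)·T.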


32. Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations

ArXiv ID: 2503.10799

Authors: Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, Antonio Orvieto

Abstract: Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e. diagonal) sequence mixing. In this paper, we propose to compute a dense linear RNN as the fixed-point of a parallelizable diagonal linear RNN in a single layer. We explore mechanisms to improve its memory and state-tracking abilities in practice, and achieve state-of-the-art results on the commonly used toy tasks $A_5$, $S_5$, copying, and modular arithmetic. We hope our results will open new avenues to more expressive and efficient sequence mixers.

Comment: The paper explores a novel approach to sequence mixing in RNNs, which could provide insights into representation learning and architectural innovations.

Relevance: 8 Novelty: 7
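
One way to picture "a dense linear RNN as the fixed point of a diagonal linear RNN" is the toy construction below. It is an assumption-laden illustration, not the paper's parameterization: the dense transition matrix A is split into its diagonal part D and remainder R, and each sweep runs a diagonal recurrence whose input includes R applied to the previous sweep's states. Function names are hypothetical.

```python
import numpy as np

def dense_rnn_sequential(A, X):
    """Reference: h_t = A h_{t-1} + x_t with h_0 = 0, computed step by step."""
    h = np.zeros(A.shape[0])
    out = []
    for x in X:
        h = A @ h + x
        out.append(h)
    return np.stack(out)

def dense_rnn_fixed_point(A, X, iters):
    """Iterate a diagonal linear RNN whose fixed point is the dense RNN above.

    Split A = D + R with D = diag(A). Each sweep solves
    h_t = D h_{t-1} + (R h_prev_{t-1} + x_t), where h_prev is the previous sweep.
    """
    d = np.diag(A)
    R = A - np.diag(d)
    T, n = X.shape
    H = np.zeros((T, n))
    for _ in range(iters):
        H_shift = np.vstack([np.zeros((1, n)), H[:-1]])  # previous sweep's h_{t-1}
        U = X + H_shift @ R.T                            # inputs to the diagonal RNN
        H_new = np.zeros((T, n))
        h = np.zeros(n)
        for t in range(T):                               # elementwise, scan-friendly recurrence
            h = d * h + U[t]
            H_new[t] = h
        H = H_new
    return H
```

In this toy version the first k time steps are exact after k sweeps, so T sweeps always suffice; fewer sweeps give a good approximation when the off-diagonal part R is small. How the paper obtains convergence in a few iterations is not reproduced here.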


33. Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling

ArXiv ID: 2503.10489

Authors: Shuqi Lu, Xiaohong Ji, Bohang Zhang, Lin Yao, Siyuan Liu, Zhifeng Gao, Linfeng Zhang, Guolin Ke

Abstract: Molecular pretrained representations (MPR) have emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that only modeling these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.

Comment: The paper introduces a new Transformer-based architecture for molecular representation learning, which aligns with foundational research in representation learning and model architecture.

Relevance: 8 Novelty: 7


34. Numerical Error Analysis of Large Language Models

ArXiv ID: 2503.10251

Authors: Stanislav Budzinskiy, Wenyi Fang, Longbin Zeng, Philipp Petersen

Abstract: Large language models based on transformer architectures have become integral to state-of-the-art natural language processing applications. However, their training remains computationally expensive and exhibits instabilities, some of which are expected to be caused by finite-precision computations. We provide a theoretical analysis of the impact of round-off errors within the forward pass of a transformer architecture which yields fundamental bounds for these effects. In addition, we conduct a series of numerical experiments which demonstrate the practical relevance of our bounds. Our results yield concrete guidelines for choosing hyperparameters that mitigate round-off errors, leading to more robust and stable inference.

Comment: The paper provides theoretical analysis on numerical errors in LLMs, which aligns with foundational research in model efficiency and robustness.

Relevance: 8 Novelty: 7


35. Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective

ArXiv ID: 2503.10638

Authors: Xiaoming Zhao, Alexander G. Schwing

Abstract: Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution for a pre-trained denoising diffusion model and the real data distribution, mainly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.

Comment: The paper provides a classifier-centric perspective on classifier-free guidance, which aligns with foundational research in representation learning and model behavior.

Relevance: 8 Novelty: 7
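
As background for this entry, the commonly used classifier-free guidance combination can be sketched as follows. It assumes the convention in which a guidance scale of 1 recovers the purely conditional prediction; other write-ups shift the scale by one. The function name is hypothetical, and the paper's proposed postprocessing step is not shown.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional noise prediction toward the conditional one."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Scale > 1 pushes the prediction beyond the conditional estimate.
guided = cfg_combine([0.0, 0.0], [1.0, 2.0], guidance_scale=2.0)
```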


36. PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction

ArXiv ID: 2503.10253

Authors: Han Wan, Qi Wang, Yuan Mi, Hao Sun

Abstract: Simulation of spatiotemporal systems governed by partial differential equations is widely applied in fields such as biology, chemistry, aerospace dynamics, and meteorology. Traditional numerical methods incur high computational costs due to the requirement of small time steps for accurate predictions. While machine learning has reduced these costs, long-term predictions remain challenged by error accumulation, particularly in scenarios with insufficient data or varying time scales, where stability and accuracy are compromised. Existing methods often neglect the effective utilization of multi-scale data, leading to suboptimal robustness in predictions. To address these issues, we propose a novel multi-scale learning framework, namely, the Physics-Informed Multi-Scale Recurrent Learning (PIMRL), to effectively leverage multi-scale data for spatiotemporal dynamics prediction. The PIMRL framework comprises two modules: the micro-scale module embeds physical knowledge into neural networks via pretraining, and the macro-scale module adopts a data-driven approach to learn the temporal evolution of physics in the latent space. Experimental results demonstrate that the PIMRL framework consistently achieves state-of-the-art performance across five benchmark datasets ranging from one to three dimensions, showing average improvements of over 9% in both RMSE and MAE evaluation metrics, with maximum enhancements reaching up to 80%.

Comment: The paper introduces a physics-informed multi-scale learning framework, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


37. Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More

ArXiv ID: 2503.10542

Authors: Arvid Frydenlund

Abstract: This work concerns the path-star task, a minimal example of searching over a graph. The graph, $G$, is star-shaped with $D$ arms radiating from a start node, $s$. A language model (LM) is given $G$, $s$, and a target node $t$, which ends one of the arms and is tasked with generating the arm containing $t$. The minimal nature of this task means only a single choice needs to be made: which of the $D$ arms contains $t$? Decoder-only LMs fail to solve this elementary task above $1/D$ chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision and we present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task's minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction.

Comment: The paper explores training pathologies in language models, which aligns with foundational research into LLM behavior and interpretability.

Relevance: 8 Novelty: 7
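
The path-star task described in this abstract can be made concrete with a small generator, sketched below. The node numbering, the shuffling of labels and edges, and the function name are assumptions made for illustration; the paper's exact tokenization of the graph is not reproduced.

```python
import random

def make_path_star(num_arms, arm_length, seed=0):
    """Build a star graph: `num_arms` disjoint paths radiating from start node 0.

    Returns (edges, start, target, answer_arm), where answer_arm is the node
    sequence from the start node to the target leaf.
    """
    rng = random.Random(seed)
    labels = list(range(1, num_arms * arm_length + 1))
    rng.shuffle(labels)                      # random ids so arms cannot be guessed from labels
    start, arms, edges = 0, [], []
    for i in range(num_arms):
        arm = labels[i * arm_length:(i + 1) * arm_length]
        path = [start] + arm
        edges.extend(zip(path[:-1], path[1:]))
        arms.append(path)
    rng.shuffle(edges)                       # edge order carries no hint about the answer
    answer_arm = rng.choice(arms)
    return edges, start, answer_arm[-1], answer_arm

edges, start, target, answer_arm = make_path_star(num_arms=4, arm_length=3)
```

A model that ignores the target and picks an arm uniformly at random is correct with probability 1/D, which is the chance level the abstract refers to.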


38. DTA: Dual Temporal-channel-wise Attention for Spiking Neural Networks

ArXiv ID: 2503.10052

Authors: Minje Kim, Minjun Kim, Xu Yang

Abstract: Spiking Neural Networks (SNNs) present a more energy-efficient alternative to Artificial Neural Networks (ANNs) by harnessing spatio-temporal dynamics and event-driven spikes. Effective utilization of temporal information is crucial for SNNs, leading to the exploration of attention mechanisms to enhance this capability. Conventional attention operations either apply an identical operation or employ non-identical operations across target dimensions. We identify that these approaches provide distinct perspectives on temporal information. To leverage the strengths of both operations, we propose a novel Dual Temporal-channel-wise Attention (DTA) mechanism that integrates both identical/non-identical attention strategies. To the best of our knowledge, this is the first attempt to concentrate on both the correlation and dependency of temporal-channel using both identical and non-identical attention operations. Experimental results demonstrate that the DTA mechanism achieves state-of-the-art performance on both static datasets (CIFAR10, CIFAR100, ImageNet-1k) and a dynamic dataset (CIFAR10-DVS), elevating spike representation and capturing complex temporal-channel relationships. We open-source our code: https://github.com/MnJnKIM/DTA-SNN.

Comment: The paper introduces Dual Temporal-channel-wise Attention for Spiking Neural Networks, contributing to architectural innovations.

Relevance: 8 Novelty: 7


39. Structured Preconditioners in Adaptive Optimization: A Unified Analysis

ArXiv ID: 2503.10537

Authors: Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li

Abstract: We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rate to several important structured preconditioned algorithms including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal Adagrad, AdaGrad-Norm which use less space and compute) are often presented as computationally efficient approximations to full-matrix Adagrad, aiming for improved optimization performance through better approximations. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is relatively much cheaper than full-matrix AdaGrad could outperform it both theoretically and experimentally.

Comment: The paper provides a unified analysis of structured preconditioners in adaptive optimization, contributing to foundational insights into model efficiency and optimization.

Relevance: 8 Novelty: 7
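
The abstract above contrasts preconditioners with different amounts of structure. As a rough illustration only, the sketch below implements the textbook update rules for two of the algorithms it names: diagonal AdaGrad (one accumulator per coordinate) and AdaGrad-Norm (a single scalar accumulator). The toy quadratic objective, function names, step count, and learning rate are assumptions; this is not the paper's analysis and it does not cover one-sided Shampoo.

```python
import math

def f(x):
    # Toy ill-conditioned objective (an assumption for illustration).
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)

def grad(x):
    # Gradient of f.
    return [x[0], 100.0 * x[1]]

def diagonal_adagrad(x, steps=200, lr=0.5, eps=1e-8):
    acc = [0.0] * len(x)  # one squared-gradient sum per coordinate
    for _ in range(steps):
        g = grad(x)
        acc = [a + gi * gi for a, gi in zip(acc, g)]
        x = [xi - lr * gi / (math.sqrt(a) + eps)
             for xi, gi, a in zip(x, g, acc)]
    return x

def adagrad_norm(x, steps=200, lr=0.5, eps=1e-8):
    acc = 0.0  # single scalar: running sum of squared gradient norms
    for _ in range(steps):
        g = grad(x)
        acc += sum(gi * gi for gi in g)
        step = lr / (math.sqrt(acc) + eps)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

x_diag = diagonal_adagrad([1.0, 1.0])
x_norm = adagrad_norm([1.0, 1.0])
```

The only difference between the two loops is the shape of the accumulator: a vector in the diagonal case and a scalar in the norm case, which is what "more structured" means in terms of memory here.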


40. Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

ArXiv ID: 2503.10632

Authors: Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta

Abstract: Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful in finding symbolic representations and continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training them motivated us to propose a more modular version, and we designed particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer path, attention visualization, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available on: https://subhajitmaity.me/KArAt

Comment: The paper proposes Kolmogorov-Arnold Attention for ViTs, contributing to architectural innovations and representation learning.

Relevance: 8 Novelty: 7


41. Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures

ArXiv ID: 2503.10576

Authors: Nina Vesseron, Louis Béthune, Marco Cuturi

Abstract: A common approach to generative modeling is to split model-fitting into two blocks: define first how to sample noise (e.g. Gaussian) and choose next what to do with it (e.g. using a single map or flows). We explore in this work an alternative route that ties sampling and mapping. We find inspiration in moment measures, a result that states that for any measure $\rho$ supported on a compact convex set of $\mathbb{R}^d$, there exists a unique convex potential $u$ such that $\rho=\nabla u\,\sharp\,e^{-u}$. While this does seem to tie effectively sampling (from log-concave distribution $e^{-u}$) and action (pushing particles through $\nabla u$), we observe on simple examples (e.g., Gaussians or 1D distributions) that this choice is ill-suited for practical tasks. We study an alternative factorization, where $\rho$ is factorized as $\nabla w^*\,\sharp\,e^{-w}$, where $w^*$ is the convex conjugate of $w$. We call this approach conjugate moment measures, and show far more intuitive results on these examples. Because $\nabla w^*$ is the Monge map between the log-concave distribution $e^{-w}$ and $\rho$, we rely on optimal transport solvers to propose an algorithm to recover $w$ from samples of $\rho$, and parameterize $w$ as an input-convex neural network.

Comment: The paper introduces a novel generative modeling approach using conjugate moment measures, which could be relevant to representation learning and emerging trends.

Relevance: 7 Novelty: 8
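
As a worked illustration of the two factorizations in the abstract above, consider the centered Gaussian case it mentions. The quadratic ansatz and the symbols $A$ and $\Sigma$ are assumptions introduced here, and the calculation is formal: a Gaussian is not compactly supported, and normalizing constants are absorbed into the constant $c$.

```latex
% Target: \rho = \mathcal{N}(0, \Sigma), with \Sigma symmetric positive definite.
% Ansatz (assumption): quadratic potential with symmetric positive definite A.

% Moment measure: \rho = \nabla u \,\sharp\, e^{-u}
\begin{aligned}
u(x) &= \tfrac12 x^\top A x + c
  \;\Rightarrow\; e^{-u} = \mathcal{N}(0, A^{-1}), \quad \nabla u(x) = A x, \\
\nabla u \,\sharp\, e^{-u} &= \mathcal{N}(0,\, A A^{-1} A) = \mathcal{N}(0, A)
  \;\Rightarrow\; A = \Sigma, \quad e^{-u} = \mathcal{N}(0, \Sigma^{-1}).
\end{aligned}

% Conjugate moment measure: \rho = \nabla w^* \,\sharp\, e^{-w}
\begin{aligned}
w(x) &= \tfrac12 x^\top A x + c
  \;\Rightarrow\; w^*(y) = \tfrac12 y^\top A^{-1} y - c, \quad \nabla w^*(y) = A^{-1} y, \\
\nabla w^* \,\sharp\, e^{-w} &= \mathcal{N}(0,\, A^{-1} A^{-1} A^{-1}) = \mathcal{N}(0, A^{-3})
  \;\Rightarrow\; A = \Sigma^{-1/3}, \quad e^{-w} = \mathcal{N}(0, \Sigma^{1/3}).
\end{aligned}
```

Under this ansatz the moment-measure noise has covariance $\Sigma^{-1}$, so it is wide where the target is narrow, whereas the conjugate version has covariance $\Sigma^{1/3}$, which stretches in the same directions as the target.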


42. Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning

ArXiv ID: 2503.10005

Authors: Yongqi Li, Xiaowei Zhang

Abstract: Training deep neural networks is challenging. To accelerate training and enhance performance, we propose PadamP, a novel optimization algorithm. PadamP is derived by applying the adaptive estimation of the p-th power of the second-order moments under scale invariance, enhancing projection adaptability by modifying the projection discrimination condition. It is integrated into Adam-type algorithms, accelerating training, boosting performance, and improving generalization in deep learning. Combining projected gradient benefits with adaptive moment estimation, PadamP tackles unconstrained non-convex problems. Convergence for the non-convex case is analyzed, focusing on the decoupling of first-order moment estimation coefficients and second-order moment estimation coefficients. Unlike prior work relying on , our proof generalizes the convergence theorem, enhancing practicality. Experiments using VGG-16 and ResNet-18 on CIFAR-10 and CIFAR-100 show PadamP's effectiveness, with notable performance on CIFAR-10/100, especially for VGG-16. The results demonstrate that PadamP outperforms existing algorithms in terms of convergence speed and generalization ability, making it a valuable addition to the field of deep learning optimization.

Comment: The paper proposes a novel optimization algorithm (PadamP) for training deep networks, which contributes to foundational research in training dynamics.

Relevance: 7 Novelty: 7


43. OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary

ArXiv ID: 2503.10468

Authors: Yifeng Yang, Lin Zhu, Zewen Sun, Hengyu Liu, Qinying Gu, Nanyang Ye

Abstract: Out-of-distribution (OOD) detection remains challenging for deep learning models, particularly when test-time OOD samples differ significantly from training outliers. We propose OODD, a novel test-time OOD detection method that dynamically maintains and updates an OOD dictionary without fine-tuning. Our approach leverages a priority queue-based dictionary that accumulates representative OOD features during testing, combined with an informative inlier sampling strategy for in-distribution (ID) samples. To ensure stable performance during early testing, we propose a dual OOD stabilization mechanism that leverages strategically generated outliers derived from ID data. To our best knowledge, extensive experiments on the OpenOOD benchmark demonstrate that OODD significantly outperforms existing methods, achieving a 26.0% improvement in FPR95 on CIFAR-100 Far OOD detection compared to the state-of-the-art approach. Furthermore, we present an optimized variant of the KNN-based OOD detection framework that achieves a 3x speedup while maintaining detection performance.

Comment: The paper introduces a novel OOD detection method, which could provide insights into representation learning and model robustness.

Relevance: 7 Novelty: 7


44. AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation

ArXiv ID: 2503.10720

Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu

Abstract: While RAG demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. The core idea of AttentionRAG lies in its attention focus mechanism, which reformulates RAG queries into a next-token prediction paradigm. This mechanism isolates the query's semantic focus to a single token, enabling precise and efficient attention calculation between queries and retrieved contexts. Extensive experiments on LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3$\times$ context compression while outperforming LLMLingua methods by around 10\% in key metrics.

Comment: The paper proposes AttentionRAG, focusing on context pruning in retrieval-augmented generation, which is relevant to model compression and efficiency.

Relevance: 7 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
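
As an illustration of the output format requested above, the following sketch parses a JSONL response with one object per line. The four key names come from the instruction, and the sample IDs and scores are taken from entries on this page. The function name `parse_selection`, the shortened placeholder comments, and the choice to skip blank lines are assumptions.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}

def parse_selection(jsonl_text):
    """Parse a JSONL response into a list of dicts, one per paper."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (an assumption)
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records

# Placeholder comments; IDs and scores mirror two entries listed on this page.
sample = (
    '{"ARXIVID": "2503.10052", "COMMENT": "Architectural innovation.", '
    '"RELEVANCE": 8, "NOVELTY": 7}\n'
    '{"ARXIVID": "2503.10576", "COMMENT": "New generative paradigm.", '
    '"RELEVANCE": 7, "NOVELTY": 8}\n'
)
papers = parse_selection(sample)
```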

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: