This is a remedial run for missed papers from 03/13/2025 to 03/13/2025.

Results generated on 03/24/2025.

Personalized Daily Arxiv Papers 3/14/2025

[gpt-4o]	Prompt	Completion	Total
Token	41757	6534	48291
Cost	$0.1	$0.07	$0.17

Total arXiv papers: 289

Total scanned papers: 289

Total relevant papers: 44

Table of contents with paper titles:

Empirical Computation Authors: Eric Tang, Marcel Böhme
Transformers without Normalization Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
KV-Distill: Nearly Lossless Learnable Context Compression for LLMs Authors: Vivek Chari, Guanghui Qin, Benjamin Van Durme
Compute Optimal Scaling of Skills: Knowledge vs Reasoning Authors: Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes
The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity Authors: Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel
ASIDE: Architectural Separation of Instructions and Data in Language Models Authors: Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert
Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search Authors: Matteo Gambella, Fabrizio Pittorino, Manuel Roveri
ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs Authors: Xin Liu, Pei Liu, Guoming Tang
Spherical dimension Authors: Bogdan Chornomaz, Shay Moran, Tom Waknine
The Relativity of Causal Knowledge Authors: Gabriele D'Acunto, Claudio Battiloro
On the Identifiability of Causal Abstractions Authors: Xiusi Li, Sékou-Oumar Kaba, Siamak Ravanbakhsh
Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks Authors: Eirik Høyheim, Lars Skaaret-Lund, Solve Sæbø, Aliaksandr Hubin
Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning Authors: Jonathan Shaki, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus
Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores Authors: Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan
Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models Authors: Andy Zhou
From Equations to Insights: Unraveling Symbolic Structures in PDEs with LLMs Authors: Rohan Bhatnagar, Ling Liang, Krish Patel, Haizhao Yang
Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data Authors: Dibyakanti Kumar, Samyak Jha, Anirbit Mukherjee
Multiplicative Learning Authors: Han Kim, Hyungjoon Soh, Vipul Periwal, Junghyo Jo
Deep Learning based discovery of Integrable Systems Authors: Shailesh Lal, Suvajit Majumder, Evgeny Sobko
Thermodynamic Bound on Energy and Negentropy Costs of Inference in Deep Neural Networks Authors: Alexei V. Tkachenko
Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild Authors: Damien Teney, Liangze Jiang, Florin Gogianu, Ehsan Abbasnejad
Inter-environmental world modeling for continuous and compositional dynamics Authors: Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro
Radar: Fast Long-Context Decoding for Any Transformer Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung
SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning Authors: Tianhao Peng, Xuhong Li, Haitao Yuan, Yuchen Li, Haoyi Xiong
Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding Authors: Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum
Robustness Tokens: Towards Adversarial Robustness of Transformers Authors: Brian Pulfer, Yury Belousov, Slava Voloshynovskiy
Sample Compression for Continual Learning Authors: Jacob Comeau, Mathieu Bazinet, Pascal Germain, Cem Subakan
OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models Authors: Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna
Poly-MgNet: Polynomial Building Blocks in Multigrid-Inspired ResNets Authors: Antonia van Betteray, Matthias Rottmann, Karsten Kahl
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Authors: Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration Authors: Emily C. Ehrhardt, Hanno Gottschalk, Tobias J. Riedlinger
Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations Authors: Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, Antonio Orvieto
Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling Authors: Shuqi Lu, Xiaohong Ji, Bohang Zhang, Lin Yao, Siyuan Liu, Zhifeng Gao, Linfeng Zhang, Guolin Ke
Numerical Error Analysis of Large Language Models Authors: Stanislav Budzinskiy, Wenyi Fang, Longbin Zeng, Philipp Petersen
Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective Authors: Xiaoming Zhao, Alexander G. Schwing
PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction Authors: Han Wan, Qi Wang, Yuan Mi, Hao Sun
Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More Authors: Arvid Frydenlund
DTA: Dual Temporal-channel-wise Attention for Spiking Neural Networks Authors: Minje Kim, Minjun Kim, Xu Yang
Structured Preconditioners in Adaptive Optimization: A Unified Analysis Authors: Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li
Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers? Authors: Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta
Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures Authors: Nina Vesseron, Louis Béthune, Marco Cuturi
Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning Authors: Yongqi Li, Xiaowei Zhang
OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary Authors: Yifeng Yang, Lin Zhu, Zewen Sun, Hengyu Liu, Qinying Gu, Nanyang Ye
AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu

1. Empirical Computation

ArXiv ID: 2503.10954

Authors: Eric Tang, Marcel Böhme

Abstract: In this vision paper, we explore the challenges and opportunities of a form of computation that employs an empirical (rather than a formal) approach, where the solution of a computational problem is returned as empirically most likely (rather than necessarily correct). We call this approach empirical computation and observe that its capabilities and limits cannot be understood within the classic, rationalist framework of computation. While we take a very broad view of "computational problem", a classic, well-studied example is sorting: Given a set of $n$ numbers, return these numbers sorted in ascending order. * To run a classical, formal computation, we might first think about a specific algorithm (e.g., merge sort) before developing a specific program that implements it. The program will expect the input to be given in a specific format, type, or data structure (e.g., unsigned 32-bit integers). In software engineering, we have many approaches to analyze the correctness of such programs. From complexity theory, we know that there exists no correct program that can solve the average instance of the sorting problem faster than $O(n\log n)$. * To run an empirical computation, we might directly ask a large language model (LLM) to solve any computational problem (which can be stated informally in natural language) and provide the input in any format (e.g., negative numbers written as Chinese characters). There is no (problem-specific) program that could be analyzed for correctness. Also, the time it takes an LLM to return an answer is entirely independent of the computational complexity of the problem that is solved. What are the capabilities or limits of empirical computation in general, for a specific problem, or for a specific instance? Our purpose is to establish empirical computation as a field in SE that is timely and rich with interesting problems.

Comment: The paper introduces 'empirical computation,' a novel paradigm that challenges classical computational frameworks. This aligns with the 'Emerging Trends' criterion as it proposes a cutting-edge theoretical direction.

Relevance: 9 Novelty: 9
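
Illustration: the abstract's example of a classical, formal computation is merge sort. The following is a minimal sketch of that textbook algorithm, not code from the paper; the function name `merge_sort` is chosen here for illustration. It shows the kind of problem-specific program whose correctness and $O(n\log n)$ running time can be analyzed, which is what an empirical computation lacks.

```python
def merge_sort(values):
    """Return a new list with the items of `values` in ascending order."""
    # Lists of length 0 or 1 are already sorted.
    if len(values) <= 1:
        return list(values)

    # Split in half and sort each half recursively.
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])

    # Merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Unlike a natural-language request to an LLM, this function only accepts inputs whose items support `<=` comparisons, which mirrors the abstract's point about fixed input formats.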

2. Transformers without Normalization

ArXiv ID: 2503.10622

Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu

Abstract: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.

Comment: The paper introduces Dynamic Tanh as a replacement for normalization layers in Transformers, aligning with 'Model Architecture' due to its challenge to conventional practices.

Relevance: 9 Novelty: 8
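
Illustration: the element-wise operation named in the abstract, tanh(alpha * x), can be sketched in a few lines. This is a minimal sketch under assumptions: `alpha` is treated as a plain float with an arbitrary placeholder default, the input is a flat list of floats, and anything beyond the formula stated in the abstract is left out.

```python
import math


def dyt(x, alpha=0.5):
    """Apply tanh(alpha * x_i) to every element x_i of the list `x`.

    `alpha` is a hypothetical placeholder value, not a setting from the paper.
    """
    return [math.tanh(alpha * xi) for xi in x]


# Large-magnitude inputs are squashed towards -1 or 1, while inputs near
# zero pass through almost linearly (scaled by alpha).
activations = dyt([-10.0, -0.1, 0.0, 0.1, 10.0])
```

The output traces the S-shaped input-output mapping that the abstract says layer normalization often produces in Transformers.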

3. KV-Distill: Nearly Lossless Learnable Context Compression for LLMs

ArXiv ID: 2503.10337

Authors: Vivek Chari, Guanghui Qin, Benjamin Van Durme

Abstract: Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations, stored in the so-called KV cache, account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models, and enables the compression of arbitrary spans of a context while preserving pre-trained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.

Comment: KV-Distill introduces a compression framework for LLMs, aligning with 'Model Compression' due to its focus on efficient context representation.

Relevance: 9 Novelty: 8
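
Illustration: the abstract describes matching outputs of a compressed (student) and an uncompressed (teacher) cache with a "KL-type divergence". The sketch below computes the plain forward KL divergence between two next-token distributions. The direction KL(teacher || student), the toy logits, and the function names are assumptions made for illustration; the paper's exact loss may differ.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Forward KL divergence KL(p || q) between two discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical next-token logits from the full cache (teacher) and the
# compressed cache (student) for the same position.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 1.2, 0.3])
loss = kl_divergence(teacher, student)
```

The loss is zero when the two distributions coincide and grows as the student's predictions drift from the teacher's.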

4. Compute Optimal Scaling of Skills: Knowledge vs Reasoning

ArXiv ID: 2503.10061

Authors: Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes

Abstract: Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.

Comment: The paper studies skill-dependent scaling laws in LLMs, aligning with 'Large Language Models' due to its theoretical insights into scaling behavior.

Relevance: 9 Novelty: 8

5. The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity

ArXiv ID: 2503.10587

Authors: Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel

Abstract: Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow Neural Networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. All together, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and its relation with the implicit bias, offering potential pathways for designing more efficient and robust models.

Comment: The paper explores the spectral bias of shallow neural networks shaped by activation functions, providing theoretical insights into representation learning and training dynamics.

Relevance: 9 Novelty: 8

6. ASIDE: Architectural Separation of Instructions and Data in Language Models

ArXiv ID: 2503.10566

Authors: Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Alexandra Volkova, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert

Abstract: Despite their remarkable performance, large language models lack elementary safety features, and this makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause for the success of prompt injection attacks. In this work, we propose an architectural change, ASIDE, that allows the model to clearly separate between instructions and data by using separate embeddings for them. Instead of training the embeddings from scratch, we propose a method to convert an existing model to ASIDE form by using two copies of the original model's embeddings layer, and applying an orthogonal rotation to one of them. We demonstrate the effectiveness of our method by showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.

Comment: The paper proposes an architectural change (ASIDE) for LLMs to separate instructions and data, contributing to foundational insights into LLM architecture.

Relevance: 9 Novelty: 8
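
Illustration: the abstract says ASIDE uses two copies of the embedding layer and applies an orthogonal rotation to one of them. The sketch below shows one simple orthogonal rotation, a 90-degree rotation of each consecutive coordinate pair. The choice of this particular rotation, the decision to rotate the data copy rather than the instruction copy, and the names `rotate_pairs_90` and `embed` are all assumptions for illustration, not details confirmed by the abstract.

```python
def rotate_pairs_90(vec):
    """Rotate each consecutive coordinate pair (a, b) to (-b, a).

    This is an orthogonal map: it preserves vector norms, and the
    rotated vector is orthogonal to the original one.
    """
    if len(vec) % 2 != 0:
        raise ValueError("embedding dimension must be even")
    rotated = []
    for k in range(0, len(vec), 2):
        a, b = vec[k], vec[k + 1]
        rotated.extend([-b, a])
    return rotated


def embed(token_id, is_data, table):
    """Look up a token embedding, rotating it when the token is data."""
    vec = table[token_id]
    return rotate_pairs_90(vec) if is_data else list(vec)


# Hypothetical one-token embedding table with dimension 4.
table = {0: [1.0, 2.0, 3.0, 4.0]}
as_instruction = embed(0, False, table)
as_data = embed(0, True, table)
```

With this rotation the same token receives distinguishable representations depending on its role, while no new embedding parameters are introduced.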

7. Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search

ArXiv ID: 2503.10404

Authors: Matteo Gambella, Fabrizio Pittorino, Manuel Roveri

Abstract: Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A$^2$M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A$^2$M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, A$^2$M is able to increase the test accuracy, on average across different differentiable NAS methods, by +3.60\% on CIFAR-10, +4.60\% on CIFAR-100, and +3.64\% on ImageNet16-120, demonstrating its superior effectiveness in practice. A$^2$M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at https://github.com/AI-Tech-Research-Lab/AsquaredM.

Comment: The paper proposes a framework for flat minima in neural architecture search, contributing to foundational research in model architecture and optimization.

Relevance: 9 Novelty: 8

8. ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs

ArXiv ID: 2503.10714

Authors: Xin Liu, Pei Liu, Guoming Tang

Abstract: The linear growth of key-value (KV) cache memory and quadratic computational complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often suffer from irreversible information loss or require costly parameter retraining. We propose ZeroMerge, a dynamic zero-shot compression framework that achieves efficient cache management through three key innovations: (1) Fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) A residual merging mechanism that preserves critical context through compensated attention scoring, and (3) Parameter-free adaptation compatible with diverse LLM architectures without retraining. Comprehensive evaluations on the LLaMA-2 model demonstrate that ZeroMerge maintains full-cache performance at 5\% compression ratios while doubling inference throughput at 40K token lengths. The method effectively balances memory efficiency, generation quality, and deployment flexibility, advancing practical long-context LLM applications. The code is available at https://github.com/SusCom-Lab/ZeroMerge.

Comment: The paper proposes a parameter-free KV cache compression method for LLMs, contributing to foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8

9. Spherical dimension

ArXiv ID: 2503.10240

Authors: Bogdan Chornomaz, Shay Moran, Tom Waknine

Abstract: We introduce and study the spherical dimension, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size d (in the VC sense) is completed into a continuous object, specifically a d-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.

Comment: The paper introduces spherical dimension as a topological relaxation of VC dimension, contributing to foundational research in representation learning and theoretical insights.

Relevance: 9 Novelty: 8
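
Illustration: the abstract builds on the notion of a set being shattered "in the VC sense". The sketch below checks that standard definition by brute force: a finite set of points is shattered if every possible true/false labeling is realized by some hypothesis. The threshold class and the names `is_shattered` and `thresholds` are toy choices made here; the spherical dimension itself is not computed.

```python
from itertools import product


def is_shattered(points, hypotheses):
    """Return True if `hypotheses` realize every labeling of `points`."""
    # Collect the labelings that the hypothesis class actually produces.
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    # Shattering requires all 2^len(points) labelings to appear.
    return all(
        labels in realized
        for labels in product((False, True), repeat=len(points))
    )


# Toy hypothesis class: threshold functions x >= t on the real line.
thresholds = [lambda x, t=t: x >= t for t in (0.5, 1.5, 2.5, 3.5)]

one_point = is_shattered([1.0], thresholds)        # True
two_points = is_shattered([1.0, 2.0], thresholds)  # False
```

Thresholds shatter any single point but no pair, since no threshold can label the smaller point true and the larger point false; this matches the well-known fact that the class has VC dimension 1.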

10. The Relativity of Causal Knowledge

ArXiv ID: 2503.11718

Authors: Gabriele D'Acunto, Claudio Battiloro

Abstract: Recent advances in artificial intelligence reveal the limits of purely predictive systems and call for a shift toward causal and collaborative reasoning. Drawing inspiration from the revolution of Grothendieck in mathematics, we introduce the relativity of causal knowledge, which posits structural causal models (SCMs) are inherently imperfect, subjective representations embedded within networks of relationships. By leveraging category theory, we arrange SCMs into a functor category and show that their observational and interventional probability measures naturally form convex structures. This result allows us to encode non-intervened SCMs with convex spaces of probability measures. Next, using sheaf theory, we construct the network sheaf and cosheaf of causal knowledge. These structures enable the transfer of causal knowledge across the network while incorporating interventional consistency and the perspective of the subjects, ultimately leading to the formal, mathematical definition of relative causal knowledge.

Comment: The paper introduces a novel perspective on causal knowledge using category theory, which aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 8

11. On the Identifiability of Causal Abstractions

ArXiv ID: 2503.10834

Authors: Xiusi Li, Sékou-Oumar Kaba, Siamak Ravanbakhsh

Abstract: Causal representation learning (CRL) enhances machine learning models' robustness and generalizability by learning structural causal models associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown intervention, to identify the latent causal model. Brehmer et al. (2022) showed that this is indeed possible, given that all latent variables can be intervened on individually. However, this is a highly restrictive assumption in many systems. In this work, we instead assume interventions on arbitrary subsets of latent variables, which is more realistic. We introduce a theoretical framework that calculates the degree to which we can identify a causal model, given a set of possible interventions, up to an abstraction that describes the system at a higher level of granularity.

Comment: The paper explores causal representation learning with a focus on identifiability, aligning with foundational research in representation learning.

Relevance: 9 Novelty: 8

12. Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks

ArXiv ID: 2503.10496

Authors: Eirik Høyheim, Lars Skaaret-Lund, Solve Sæbø, Aliaksandr Hubin

Abstract: Modeling natural phenomena with artificial neural networks (ANNs) often provides highly accurate predictions. However, ANNs often suffer from over-parameterization, complicating interpretation and raising uncertainty issues. Bayesian neural networks (BNNs) address the latter by representing weights as probability distributions, allowing for predictive uncertainty evaluation. Latent binary Bayesian neural networks (LBBNNs) further handle structural uncertainty and sparsify models by removing redundant weights. This article advances LBBNNs by enabling covariates to skip to any succeeding layer or be excluded, simplifying networks and clarifying input impacts on predictions. Ultimately, a linear model or even a constant can be found to be optimal for a specific problem at hand. Furthermore, the input-skip LBBNN approach reduces network density significantly compared to standard LBBNNs, achieving over 99% reduction for small networks and over 99.9% for larger ones, while still maintaining high predictive accuracy and uncertainty measurement. For example, on MNIST, we reached 97% accuracy and great calibration with just 935 weights, reaching state-of-the-art for compression of neural networks. Furthermore, the proposed method accurately identifies the true covariates and adjusts for system non-linearity. The main contribution is the introduction of active paths, enhancing directly designed global and local explanations within the LBBNN framework, that have theoretical guarantees and do not require post hoc external tools for explanations.

Comment: The paper introduces input-skip Latent Binary Bayesian Neural Networks, contributing to sparsity and model compression with theoretical guarantees.

Relevance: 9 Novelty: 8

13. Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning

ArXiv ID: 2503.10408

Authors: Jonathan Shaki, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus

Abstract: We study the capabilities of Large Language Models (LLMs) on binary relations, a ubiquitous concept in math employed in most reasoning, math and logic benchmarks. This work focuses on equality, inequality, and inclusion, along with the properties they satisfy, such as ir/reflexivity, a/symmetry, transitivity, and logical complexity (e.g., number of reasoning "hops"). We propose an alternative to in-context learning that trains only the representations of newly introduced tokens, namely out-of-context representation learning. This method mitigates linguistic biases already present in a model and, unlike in-context learning, does not rely on external information or illustrations. We argue that out-of-context representation learning is a better alternative to in-context learning and fine-tuning for evaluating the capabilities of LLMs on logic tasks that are the building blocks of more complex reasoning benchmarks.

Comment: The paper introduces out-of-context representation learning for logical tasks, contributing to foundational insights into representation learning and LLM behavior.

Relevance: 9 Novelty: 8

14. Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores

ArXiv ID: 2503.10725

Authors: Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan

Abstract: The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency without compromising model accuracy. Structured sparsity emerges as a compelling strategy to address these challenges by leveraging the emerging sparse computing hardware. Prior works mainly focus on the sparsity in model parameters, neglecting the inherent sparse patterns in activations. This oversight can lead to additional computational costs associated with activations, potentially resulting in suboptimal performance. This paper presents Samoyeds, an innovative acceleration system for MoE LLMs utilizing Sparse Tensor Cores (SpTCs). Samoyeds is the first to apply sparsity simultaneously to both activations and model parameters. It introduces a bespoke sparse data format tailored for MoE computation and develops a specialized sparse-sparse matrix multiplication kernel. Furthermore, Samoyeds incorporates systematic optimizations specifically designed for the execution of dual-side structured sparse MoE LLMs on SpTCs, further enhancing system performance. Evaluations show that Samoyeds outperforms SOTA works by up to 1.99$\times$ at the kernel level and 1.58$\times$ at the model level. Moreover, it enhances memory efficiency, increasing maximum supported batch sizes by 4.41$\times$ on average. Additionally, Samoyeds surpasses existing SOTA structured sparse solutions in both model accuracy and hardware portability.

Comment: The paper presents Samoyeds, focusing on structured sparsity in MoE models, which aligns closely with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

15. Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models

ArXiv ID: 2503.10617

Authors: Andy Zhou

Abstract: Adapting large language models to multiple tasks can cause cross-skill interference, where improvements for one skill degrade another. While methods such as LoRA impose orthogonality constraints at the weight level, they do not fully address interference in hidden-state representations. We propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based approach that learns multiple orthonormal subspace transformations, each specializing in a distinct skill, and composes them via a lightweight router. By isolating these subspace edits in the hidden state, rather than weight matrices, CS-ReFT prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring only 0.0098% of model parameters. These findings show that specialized representation edits, composed via a simple router, significantly enhance multi-task instruction following with minimal overhead.

Comment: The paper proposes CS-ReFT, focusing on representation-based fine-tuning for LLMs, contributing to foundational insights into representation learning and LLM behavior.

Relevance: 9 Novelty: 8

16. From Equations to Insights: Unraveling Symbolic Structures in PDEs with LLMs

ArXiv ID: 2503.09986

Authors: Rohan Bhatnagar, Ling Liang, Krish Patel, Haizhao Yang

Abstract: Motivated by the remarkable success of artificial intelligence (AI) across diverse fields, the application of AI to solve scientific problems, often formulated as partial differential equations (PDEs), has garnered increasing attention. While most existing research concentrates on theoretical properties (such as well-posedness, regularity, and continuity) of the solutions, alongside direct AI-driven methods for solving PDEs, the challenge of uncovering symbolic relationships within these equations remains largely unexplored. In this paper, we propose leveraging large language models (LLMs) to learn such symbolic relationships. Our results demonstrate that LLMs can effectively predict the operators involved in PDE solutions by utilizing the symbolic information in the PDEs. Furthermore, we show that discovering these symbolic relationships can substantially improve both the efficiency and accuracy of the finite expression method for finding analytical approximations of PDE solutions, delivering a fully interpretable solution pipeline. This work opens new avenues for understanding the symbolic structure of scientific problems and advancing their solution processes.

Comment: The paper explores symbolic structures in PDEs using LLMs, contributing to foundational research in AI for science.

Relevance: 9 Novelty: 8

17. Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data

ArXiv ID: 2503.10428

Authors: Dibyakanti Kumar, Samyak Jha, Anirbit Mukherjee

Abstract: In this work, we establish that the Langevin Monte-Carlo algorithm can learn depth-2 neural nets of any size and for any data, and we give non-asymptotic convergence rates for it. We achieve this by showing that under Total Variation distance and q-Rényi divergence, the iterates of Langevin Monte Carlo converge to the Gibbs distribution of Frobenius norm regularized losses for any of these nets, when using smooth activations and in both classification and regression settings. Most critically, the amount of regularization needed for our results is independent of the size of the net. This result combines several recent observations, like our previous papers showing that two-layer neural loss functions can always be regularized by a certain constant amount such that they satisfy the Villani conditions, and thus their Gibbs measures satisfy a Poincaré inequality.
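As a rough illustration of the setting, here is a minimal sketch of the standard unadjusted Langevin update, θ ← θ − η∇L(θ) + √(2η/β)·ξ with ξ standard Gaussian, applied to a depth-2 net with a smooth activation and a Frobenius-norm penalty. The tanh activation, squared loss, training of both layers, and all names and hyperparameters (`lam`, `beta`, `step_size`) are assumptions made for the example; the abstract does not specify them.

```python
import numpy as np

def loss_and_grads(W, a, X, y, lam):
    # Depth-2 net f(x) = a^T tanh(W x) with squared loss and an
    # L2 / Frobenius-norm penalty of strength lam on both layers.
    Z = np.tanh(X @ W.T)                  # (n, width) hidden activations
    err = Z @ a - y                       # (n,) residuals
    n = X.shape[0]
    loss = 0.5 * np.mean(err ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))
    grad_a = Z.T @ err / n + lam * a
    # Chain rule through tanh: d tanh(u)/du = 1 - tanh(u)^2.
    grad_W = ((err[:, None] * a[None, :]) * (1.0 - Z ** 2)).T @ X / n + lam * W
    return loss, grad_W, grad_a

def langevin_monte_carlo(X, y, width, steps, step_size, beta, lam, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((width, d)) / np.sqrt(d)
    a = rng.standard_normal(width) / np.sqrt(width)
    noise_scale = np.sqrt(2.0 * step_size / beta)  # beta = inverse temperature
    losses = []
    for _ in range(steps):
        loss, grad_W, grad_a = loss_and_grads(W, a, X, y, lam)
        losses.append(loss)
        # Gradient step plus isotropic Gaussian noise on every parameter.
        W = W - step_size * grad_W + noise_scale * rng.standard_normal(W.shape)
        a = a - step_size * grad_a + noise_scale * rng.standard_normal(a.shape)
    return W, a, losses
```

The iterates are samples rather than a single minimizer: for a small step size they are meant to approximate draws from the Gibbs distribution proportional to exp(−β·L), which is the object the convergence result concerns.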

Comment: The paper provides theoretical insights into learning depth-2 neural networks using Langevin Monte-Carlo, which contributes to foundational research in representation learning and training dynamics.

Relevance: 8 Novelty: 8

18. Multiplicative Learning

ArXiv ID: 2503.10144

Authors: Han Kim, Hyungjoon Soh, Vipul Periwal, Junghyo Jo

Abstract: Efficient training of artificial neural networks remains a key challenge in deep learning. Backpropagation (BP), the standard learning algorithm, relies on gradient descent and typically requires numerous iterations for convergence. In this study, we introduce Expectation Reflection (ER), a novel learning approach that updates weights multiplicatively based on the ratio of observed to predicted outputs. Unlike traditional methods, ER maintains consistency without requiring ad hoc loss functions or learning rate hyperparameters. We extend ER to multilayer networks and demonstrate its effectiveness in performing image classification tasks. Notably, ER achieves optimal weight updates in a single iteration. Additionally, we reinterpret ER as a modified form of gradient descent incorporating the inverse mapping of target propagation. These findings suggest that ER provides an efficient and scalable alternative for training neural networks.

Comment: The paper introduces Expectation Reflection, a novel multiplicative learning approach, aligning with 'Representation Learning' due to its innovative training dynamics.

Relevance: 8 Novelty: 8

19. Deep Learning based discovery of Integrable Systems

ArXiv ID: 2503.10469

Authors: Shailesh Lal, Suvajit Majumder, Evgeny Sobko

Abstract: We introduce a novel machine learning based framework for discovering integrable models. Our approach first employs a synchronized ensemble of neural networks to find a high-precision numerical solution to the Yang-Baxter equation within a specified class. Then, using an auxiliary system of algebraic equations, [Q_2, Q_3] = 0, and the numerical value of the Hamiltonian obtained via deep learning as a seed, we reconstruct the entire Hamiltonian family, forming an algebraic variety. We illustrate our presentation with three- and four-dimensional spin chains of difference form with local interactions. Remarkably, all discovered Hamiltonian families form rational varieties.

Comment: The paper introduces a novel framework for discovering integrable systems using neural networks, which aligns with foundational AI for science research.

Relevance: 8 Novelty: 8

20. Thermodynamic Bound on Energy and Negentropy Costs of Inference in Deep Neural Networks

ArXiv ID: 2503.09980

Authors: Alexei V. Tkachenko

Abstract: The fundamental thermodynamic bound is derived for the energy cost of inference in Deep Neural Networks (DNNs). By applying Landauer's principle, we demonstrate that the linear operations in DNNs can, in principle, be performed reversibly, whereas the non-linear activation functions impose an unavoidable energy cost. The resulting theoretical lower bound on the inference energy is determined by the average number of neurons undergoing a state transition for each inference. We also restate the thermodynamic bound in terms of negentropy, a metric which is more universal than energy for assessing the thermodynamic cost of information processing. The concept of negentropy is further elaborated in the context of information processing in biological and engineered systems as well as human intelligence. Our analysis provides insight into the physical limits of DNN efficiency and suggests potential directions for developing energy-efficient AI architectures that leverage reversible analog computing.
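To give a sense of scale, the sketch below evaluates Landauer's limit, k_B·T·ln 2 joules per erased bit, and multiplies it by a number of neuron state transitions. Charging exactly one bit per transition (`bits_per_transition=1.0`) is an assumption made for illustration; the abstract does not state the paper's exact prefactor, and the function names are hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact value in the 2019 SI)

def landauer_energy_per_bit(temperature_kelvin):
    # Landauer's principle: erasing one bit dissipates at least k_B * T * ln 2.
    return K_B * temperature_kelvin * math.log(2)

def inference_energy_floor(n_transitions, temperature_kelvin=300.0,
                           bits_per_transition=1.0):
    # Illustrative floor only: assumes each neuron state transition
    # erases `bits_per_transition` bits (an assumption, not the paper's result).
    return (n_transitions * bits_per_transition
            * landauer_energy_per_bit(temperature_kelvin))
```

At 300 K the per-bit limit is about 2.87e-21 J, so under the one-bit assumption even a billion transitions per inference would correspond to a floor of only a few picojoules.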

Comment: The paper derives thermodynamic bounds for inference in DNNs, contributing to foundational insights into efficiency and theoretical limits.

Relevance: 8 Novelty: 8

21. Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild

ArXiv ID: 2503.10065

Authors: Damien Teney, Liangze Jiang, Florin Gogianu, Ehsan Abbasnejad

Abstract: Neural architectures tend to fit their data with relatively simple functions. This "simplicity bias" is widely regarded as key to their success. This paper explores the limits of this principle. Building on recent findings that the simplicity bias stems from ReLU activations [96], we introduce a method to meta-learn new activation functions and inductive biases better suited to specific tasks. Findings: We identify multiple tasks where the simplicity bias is inadequate and ReLUs suboptimal. In these cases, we learn new activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias induced by ReLUs proves adequate on image tasks where the best learned activations are nearly identical to ReLUs and GeLUs. Implications: Contrary to popular belief, the simplicity bias of ReLU networks is not universally useful. It is near-optimal for image classification, but other inductive biases are sometimes preferable. We showed that activation functions can control these inductive biases, but future tailored architectures might provide further benefits. Advances are still needed to characterize a model's inductive biases beyond "complexity", and their adequacy with the data.

Comment: The paper explores meta-learning activation functions to optimize inductive biases, contributing to architectural innovations and representation learning.

Relevance: 8 Novelty: 8

22. Inter-environmental world modeling for continuous and compositional dynamics

ArXiv ID: 2503.09911

Authors: Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro

Abstract: Various world model frameworks are being developed today based on autoregressive frameworks that rely on discrete representations of actions and observations, and these frameworks are succeeding in constructing interactive generative models for the target environment of interest. Meanwhile, humans demonstrate remarkable generalization abilities to combine experiences in multiple environments to mentally simulate and learn to control agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and an object-centric autoencoder. On synthetic benchmarks and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

Comment: The paper introduces WLA for inter-environmental world modeling, contributing to foundational research in representation learning and emerging trends.

Relevance: 8 Novelty: 8

23. Radar: Fast Long-Context Decoding for Any Transformer

ArXiv ID: 2503.10571

Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung

Abstract: Transformer models have demonstrated exceptional performance across a wide range of applications. Though it forms the foundation of Transformer models, dot-product attention does not scale well to long-context data since its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.
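The abstract does not describe Radar's search procedure, so the sketch below is not Radar itself. It shows a generic stand-in for the same idea, attending only to the context tokens judged most important for the current query: keys are grouped into fixed-length segments, each segment is scored by its mean key (a heuristic assumed here), and exact attention runs over the top-scoring segments only. All function names and parameters are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max()            # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def full_attention(q, K, V):
    # Exact single-query dot-product attention over all cached tokens.
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def segment_topk_attention(q, K, V, segment_len, top_segments):
    # Assumes the context length is a multiple of segment_len.
    n, d = K.shape
    n_seg = n // segment_len
    # One summary vector per segment: the mean of its keys (heuristic).
    summaries = K.reshape(n_seg, segment_len, d).mean(axis=1)
    seg_scores = summaries @ q
    keep = np.argsort(seg_scores)[-top_segments:]      # best-scoring segments
    idx = np.sort(np.concatenate(
        [np.arange(s * segment_len, (s + 1) * segment_len) for s in keep]))
    # Exact attention restricted to the selected tokens.
    return full_attention(q, K[idx], V[idx]), idx
```

Scoring costs one dot product per segment instead of one per token, and nothing is evicted from the cache: a different query can select different segments at the next decoding step.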

Comment: Radar proposes a training-free method to accelerate Transformer inference for long-context data, aligning with the 'Model Compression' criterion due to its focus on efficiency improvements.