Personalized Daily ArXiv Papers 2025-05-26

[gpt-4o]	Prompt	Completion	Total
Token	61244	8088	69332
Cost	$0.15	$0.08	$0.23

Total arXiv papers: 835

Total scanned papers: 547

Total relevant papers: 53

Table of contents with paper titles:

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning Authors: Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
Generative Distribution Embeddings Authors: Nic Fishman, Gokul Gowri, Peng Yin, Jonathan Gootenberg, Omar Abudayyeh
SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models Authors: Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Dan Alistarh
ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training Authors: Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu
Mixture of Low Rank Adaptation with Partial Parameter Sharing for Time Series Forecasting Authors: Licheng Pan, Zhichao Chen, Haoxuan Li, Guangyi Liu, Zhijian Xu, Zhaoran Liu, Hao Wang, Ying Wei
Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models Authors: Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, Cristopher Salvi, Terry Lyons
PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval Authors: Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models Authors: Viktoriia Chekalina, Daniil Moskovskiy, Daria Cherniuk, Maxim Kurkin, Andrey Kuznetsov, Evgeny Frolov
JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model Authors: Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks Authors: Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Florent Krzakala
Scale-invariant Attention Authors: Ben Anson, Xi Wang, Laurence Aitchison
Attention with Trained Embeddings Provably Selects Important Tokens Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli
Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems Authors: Jiayi Geng, Howard Chen, Dilip Arumugam, Thomas L. Griffiths
Stochastic Weight Sharing for Bayesian Neural Networks Authors: Moule Lin, Shuhao Guan, Weipeng Jing, Goetz Botterweck, Andrea Patane
The emergence of sparse attention: impact of data distribution and benefits of repetition Authors: Nicolas Zucchet, Francesco d'Angelo, Andrew K. Lampinen, Stephanie C. Y. Chan
Time to Spike? Understanding the Representational Power of Spiking Neural Networks in Discrete Time Authors: Duc Anh Nguyen, Ernesto Araya, Adalbert Fono, Gitta Kutyniok
Large Language Models Implicitly Learn to See and Hear Just By Reading Authors: Prateek Verma, Mert Pilanci
The Discovery Engine: A Framework for AI-Driven Synthesis and Navigation of Scientific Knowledge Landscapes Authors: Vladimir Baulin, Austin Cook, Daniel Friedman, Janna Lumiruusu, Andrew Pashea, Shagor Rahman, Benedikt Waldeck
CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning Authors: Jinyuan Feng, Chaopeng Wei, Tenghai Qiu, Tianyi Hu, Zhiqiang Pu
An approach to identify the most semantically informative deep representations of text and images Authors: Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio
ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs Authors: Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, Kannan Ramchandran
TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling Authors: Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan
HiLAB: A Hybrid Inverse-Design Framework Authors: Reza Marzban, Hamed Abiri, Raphael Pestourie, Ali Adibi
An Iterative Framework for Generative Backmapping of Coarse Grained Proteins Authors: Georgios Kementzidis, Erin Wong, John Nicholson, Ruichen Xu, Yuefan Deng
Learning with Restricted Boltzmann Machines: Asymptotics of AMP and GD in High Dimensions Authors: Yizhou Xu, Florent Krzakala, Lenka Zdeborová
From Compression to Expansion: A Layerwise Analysis of In-Context Learning Authors: Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu
Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding Authors: Manuel Lecha, Andrea Cavallo, Francesca Dominici, Ran Levi, Alessio Del Bue, Elvin Isufi, Pietro Morerio, Claudio Battiloro
Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization Authors: Francois Chaubard, Mykel Kochenderfer
Emergence of Hebbian Dynamics in Regularized Non-Local Learners Authors: David Koplow, Tomaso Poggio, Liu Ziyin
Out of the Shadows: Exploring a Latent Space for Neural Network Verification Authors: Lukas Koller, Tobias Ladner, Matthias Althoff
Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling Authors: Matthieu Blanke, Yongquan Qu, Sara Shamekh, Pierre Gentine
Continuum Transformers Perform In-Context Learning by Operator Gradient Descent Authors: Abhiti Mishra, Yash Patel, Ambuj Tewari
Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions Authors: Jianhao Ma, Geyu Liang, Salar Fattahi
Next Token Perception Score: Analytical Assessment of your LLM Perception Skills Authors: Yu-Ang Cheng, Leyang Hu, Hai Huang, Randall Balestriero
Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives Authors: Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu
DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies Authors: Ning Yang, Fangxin Liu, Junjie Wang, Tao Yang, Kan Liu, Haibing Guan, Li Jiang
DAM-GT: Dual Positional Encoding-Based Attention Masking Graph Transformer for Node Classification Authors: Chenyang Li, Jinsong Chen, John E. Hopcroft, Kun He
New Tight Bounds for SGD without Variance Assumption: A Computer-Aided Lyapunov Analysis Authors: Daniel Cortild, Lucas Ketels, Juan Peypouquet, Guillaume Garrigos
NeUQI: Near-Optimal Uniform Quantization Parameter Initialization Authors: Li Lin, Xinyu Hu, Xiaojun Wan
C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models Authors: Amir Hossein Rahmati, Sanket Jantre, Weifeng Zhang, Yucheng Wang, Byung-Jun Yoon, Nathan M. Urban, Xiaoning Qian
NeuroTrails: Training with Dynamic Sparse Heads as the Key to Effective Ensembling Authors: Bram Grooten, Farid Hasanov, Chenxiang Zhang, Qiao Xiao, Boqian Wu, Zahra Atashgahi, Ghada Sokar, Shiwei Liu, Lu Yin, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu
COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection Authors: Jaewon Cheon, Pilsung Kang
Towards General Continuous Memory for Vision-Language Models Authors: Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang
Leveraging KANs for Expedient Training of Multichannel MLPs via Preconditioning and Geometric Refinement Authors: Jonas A. Actor, Graham Harper, Ben Southworth, Eric C. Cyr
Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning Authors: Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han
Hybrid Mamba-Transformer Decoder for Error-Correcting Codes Authors: Shy-el Cohen, Yoni Choukroun, Eliya Nachmani
Range-Arithmetic: Verifiable Deep Learning Inference on an Untrusted Party Authors: Ali Rahimi, Babak H. Khalaj, Mohammad Ali Maddah-Ali
TI-DeepONet: Learnable Time Integration for Stable Long-Term Extrapolation Authors: Dibyajyoti Nayak, Somdatta Goswami
Beyond Discreteness: Finite-Sample Analysis of Straight-Through Estimator for Quantization Authors: Halyun Jeong, Jack Xin, Penghang Yin
Selection Mechanisms for Sequence Modeling using Linear State Space Models Authors: Umberto Casti, Sandro Zampieri, Fabio Pasqualetti
Transformer brain encoders explain human high-level visual responses Authors: Hossein Adeli, Minni Sun, Nikolaus Kriegeskorte
Scalable Valuation of Human Feedback through Provably Robust Model Alignment Authors: Masahiro Fujisawa, Masaki Adachi, Michael A. Osborne
A Principled Bayesian Framework for Training Binary and Spiking Neural Networks Authors: James A. Walker, Moein Khajehnejad, Adeel Razi

1. From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

ArXiv ID: 2505.17117

Authors: Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv

Abstract: Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.

Comment: Author match

2. Generative Distribution Embeddings

ArXiv ID: 2505.18150

Authors: Nic Fishman, Gokul Gowri, Peng Yin, Jonathan Gootenberg, Omar Abudayyeh

Abstract: Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the $W_2$ distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark GDEs against existing approaches on synthetic datasets, demonstrating consistently stronger performance. We then apply GDEs to six key problems in computational biology: learning representations of cell populations from lineage-tracing data (150K cells), predicting perturbation effects on single-cell transcriptomes (1M cells), predicting perturbation effects on cellular phenotypes (20M single-cell images), modeling tissue-specific DNA methylation patterns (253M sequences), designing synthetic yeast promoters (34M sequences), and spatiotemporal modeling of viral protein sequences (1M sequences).

Comment: The paper introduces a new framework for learning representations of distributions, relevant to representation learning and generative paradigms.

Relevance: 9 Novelty: 9
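
The abstract above measures GDE latents against the $W_2$ distance and optimal transport trajectories for Gaussians. As a point of reference only, the sketch below computes those target quantities in the one-dimensional Gaussian case, where both have closed forms. It is not the GDE method itself; the function names are hypothetical, and the restriction to 1D is an assumption made to keep the formulas exact.

```python
import numpy as np

def w2_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Closed-form W2 distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return float(np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2))

def w2_empirical_1d(x, y):
    """W2 between two equal-size 1D samples: match sorted values pairwise."""
    x, y = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def ot_interpolate_1d(mu1, sigma1, mu2, sigma2, t):
    """Optimal transport interpolation between two 1D Gaussians at time t in [0, 1].

    In 1D the interpolant is again Gaussian, with linearly interpolated
    mean and standard deviation.
    """
    return (1 - t) * mu1 + t * mu2, (1 - t) * sigma1 + t * sigma2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 100_000)
    y = rng.normal(3.0, 2.0, 100_000)
    print("closed form:", w2_gaussian_1d(0.0, 1.0, 3.0, 2.0))
    print("from samples:", w2_empirical_1d(x, y))
```

A learned embedding of sample sets could be checked against these values: latent distances should track `w2_gaussian_1d`, and latent midpoints should decode to distributions close to `ot_interpolate_1d(..., t=0.5)`.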

3. SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models

ArXiv ID: 2505.17967

Authors: Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Dan Alistarh

Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store full projection matrices for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.

Comment: The paper presents a novel low-rank optimization method for LLMs, which is relevant to model compression through low-rank approaches.

Relevance: 9 Novelty: 8
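
The two-step procedure described in the abstract above (a fixed DCT basis, then per-layer column selection by alignment with the gradient) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the alignment score (the norm of the gradient's projection onto each basis vector) is an assumption, since the abstract does not specify it.

```python
import numpy as np

def dct_basis(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of size n x n; each row is one basis vector."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] /= np.sqrt(2.0)  # rescale the constant row so rows are orthonormal
    return basis

def select_columns(grad: np.ndarray, basis: np.ndarray, rank: int) -> np.ndarray:
    """Indices of the `rank` basis vectors most aligned with grad (shape m x n)."""
    # One matrix multiplication, then a sort (assumed score: projection norm).
    scores = np.linalg.norm(grad @ basis.T, axis=0)
    return np.sort(np.argsort(scores)[-rank:])

def project(grad: np.ndarray, basis: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Low-rank representation of grad (m x r) in the selected basis vectors."""
    return grad @ basis.T[:, idx]

def reconstruct(low: np.ndarray, basis: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Map the low-rank representation back to the full m x n space."""
    return low @ basis[idx, :]
```

Only `idx` needs to be stored per layer; the basis is shared and computed once, which mirrors the memory argument made in the abstract.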

4. ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

ArXiv ID: 2505.17331

Authors: Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu

Abstract: This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining its learning capacity. ECHO-LLaMA transforms LLaMA models into shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.

Comment: The paper introduces ECHO-LLaMA, focusing on efficient caching and computational efficiency in LLMs, aligning with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

5. Mixture of Low Rank Adaptation with Partial Parameter Sharing for Time Series Forecasting

ArXiv ID: 2505.17872

Authors: Licheng Pan, Zhichao Chen, Haoxuan Li, Guangyi Liu, Zhijian Xu, Zhaoran Liu, Hao Wang, Ying Wei

Abstract: Multi-task forecasting has become the standard approach for time-series forecasting (TSF). However, we show that it suffers from an Expressiveness Bottleneck, where predictions at different time steps share the same representation, leading to unavoidable errors even with optimal representations. To address this issue, we propose a two-stage framework: first, pre-train a foundation model for one-step-ahead prediction; then, adapt it using step-specific LoRA modules. This design enables the foundation model to handle any number of forecast steps while avoiding the expressiveness bottleneck. We further introduce the Mixture-of-LoRA (MoLA) model, which employs adaptively weighted LoRA experts to achieve partial parameter sharing across steps. This approach enhances both efficiency and forecasting performance by exploiting interdependencies between forecast steps. Experiments show that MoLA significantly improves model expressiveness and outperforms state-of-the-art time-series forecasting methods. Code is available at https://anonymous.4open.science/r/MoLA-BC92.

Comment: The paper introduces a Mixture-of-Low-Rank Adaptation model for time series forecasting, aligning with model architecture and representation learning criteria.

Relevance: 9 Novelty: 8
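
One way to picture "adaptively weighted LoRA experts" with partial sharing across forecast steps is a single frozen weight plus a small pool of low-rank updates, mixed with step-specific weights. The sketch below assumes exactly that; the class name, the softmax gating per step, and the zero initialization of the `B` factors are illustrative assumptions, not details taken from the paper or its code.

```python
import numpy as np

class MixtureOfLoRA:
    """Toy linear layer: frozen weight plus step-weighted low-rank experts."""

    def __init__(self, w0, num_experts, rank, num_steps, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w0.shape
        self.w0 = w0  # frozen pre-trained weight, shape (d_out, d_in)
        # Each expert k contributes the low-rank update B[k] @ A[k].
        self.A = rng.normal(scale=0.01, size=(num_experts, rank, d_in))
        self.B = np.zeros((num_experts, d_out, rank))  # zero init: no change at start
        # Assumed gating: one logit per (forecast step, expert).
        self.gate_logits = np.zeros((num_steps, num_experts))

    def step_weight(self, step):
        """Effective weight for one forecast step."""
        logits = self.gate_logits[step]
        gate = np.exp(logits - logits.max())
        gate /= gate.sum()
        # Sum over experts k and rank r: gate[k] * B[k] @ A[k].
        delta = np.einsum("k,kor,kri->oi", gate, self.B, self.A)
        return self.w0 + delta

    def forward(self, x, step):
        """Apply the step-specific weight to a batch x of shape (batch, d_in)."""
        return x @ self.step_weight(step).T
```

Because the experts are shared and only the gate differs per step, nearby forecast steps can reuse the same updates, which is one reading of the partial parameter sharing the abstract describes.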

6. Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models

ArXiv ID: 2505.17761

Authors: Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, Cristopher Salvi, Terry Lyons

Abstract: Structured Linear Controlled Differential Equations (SLiCEs) provide a unifying framework for sequence models with structured, input-dependent state-transition matrices that retain the maximal expressivity of dense matrices whilst being cheaper to compute. The framework encompasses existing architectures, such as input-dependent block-diagonal linear recurrent neural networks and DeltaNet's diagonal-plus-low-rank structure, as well as two novel variants based on sparsity and the Walsh-Hadamard transform. We prove that, unlike the diagonal state-transition matrices of S4 and Mamba, SLiCEs employing block-diagonal, sparse, or Walsh-Hadamard matrices match the maximal expressivity of dense matrices. Empirically, SLiCEs solve the $A_5$ state-tracking benchmark with a single layer, achieve best-in-class length generalisation on regular language tasks among parallel-in-time models, and match the state-of-the-art performance of log neural controlled differential equations on six multivariate time-series classification datasets while cutting the average time per training step by a factor of twenty.

Comment: The paper introduces Structured Linear CDEs, a novel sequence model framework, aligning with model architecture innovations.

Relevance: 9 Novelty: 8

7. PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

ArXiv ID: 2505.17639

Authors: Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

Abstract: Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 when pruned to 8/128 configuration (50% expert reduction), and still achieves 72.0% with aggressive 8/32 pruning (87.5% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15% on MATH500 and 81.3% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.

Comment: PreMoe introduces a framework for efficient deployment of MoE models using expert pruning and retrieval, relevant to model architecture and compression.

Relevance: 9 Novelty: 8

8. Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models

ArXiv ID: 2505.17974

Authors: Viktoriia Chekalina, Daniil Moskovskiy, Daria Cherniuk, Maxim Kurkin, Andrey Kuznetsov, Evgeny Frolov

Abstract: The Fisher information is a fundamental concept for characterizing the sensitivity of parameters in neural networks. However, leveraging the full observed Fisher information is too expensive for large models, so most methods rely on simple diagonal approximations. While efficient, this approach ignores parameter correlations, often resulting in reduced performance on downstream tasks. In this work, we mitigate these limitations and propose Generalized Fisher-Weighted SVD (GFWSVD), a post-training LLM compression technique that accounts for both diagonal and off-diagonal elements of the Fisher information matrix, providing a more accurate reflection of parameter importance. To make the method tractable, we introduce a scalable adaptation of the Kronecker-factored approximation algorithm for the observed Fisher information. We demonstrate the effectiveness of our method on LLM compression, showing improvements over existing compression baselines. For example, at a 20% compression rate on the MMLU benchmark, our method outperforms FWSVD, which is based on a diagonal approximation of the Fisher information, by 5 percent, SVD-LLM by 3 percent, and ASVD by 6 percent.

Comment: The paper proposes a novel compression technique for LLMs using a generalized Fisher-weighted SVD, which is relevant to model compression.

Relevance: 9 Novelty: 8

9. JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

ArXiv ID: 2505.17257

Authors: Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild

Abstract: Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA

Comment: The paper introduces JanusDNA, a hybrid DNA foundation model using MoE architecture, relevant to model architecture and foundational research in AI for science.

Relevance: 9 Novelty: 8

10. The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

ArXiv ID: 2505.17958

Authors: Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Florent Krzakala

Abstract: We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.

Comment: The paper provides a theoretical analysis of overparameterized quadratic networks, focusing on capacity control through low-rank structures, which is relevant to representation learning and model architecture.

Relevance: 9 Novelty: 8

11. Scale-invariant Attention

ArXiv ID: 2505.17083

Authors: Ben Anson, Xi Wang, Laurence Aitchison

Abstract: One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long context attention mechanisms to have: scale-invariant total attention, and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.

Comment: The paper proposes a scale-invariant attention mechanism, which is relevant to model architecture innovations, particularly in attention mechanisms.

Relevance: 9 Novelty: 8

12. Attention with Trained Embeddings Provably Selects Important Tokens

ArXiv ID: 2505.17282

Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli

Abstract: Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_X$ capture the importance of tokens in the dataset by aligning with the output vector $v$ proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training $p$ via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.

Comment: The paper provides theoretical insights into token embeddings and attention mechanisms, relevant to representation learning and model architecture.

Relevance: 9 Novelty: 8
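
The model in the abstract above, $\texttt{Softmax}(p^\top E_X^\top) E_X v$, is small enough to write out directly. The sketch below evaluates it for one sequence; the function name and the toy inputs in the usage note are hypothetical, and no training step is shown.

```python
import numpy as np

def attention_output(E_X, p, v):
    """Evaluate Softmax(p^T E_X^T) E_X v for a single input sequence.

    E_X: (T, d) token embeddings, p: (d,) <cls> embedding, v: (d,) output vector.
    Returns the scalar model output and the (T,) attention weights.
    """
    logits = E_X @ p                         # p^T E_{x_i} for each token i
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return float(weights @ (E_X @ v)), weights
```

With `p` set to zero the weights are uniform and the output is the average of `E_X @ v`; as `p` aligns with one token's embedding, the softmax concentrates on that token, which is the selection behaviour the paper analyses.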

13. Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

ArXiv ID: 2505.17968

Authors: Jiayi Geng, Howard Chen, Dilip Arumugam, Thomas L. Griffiths

Abstract: Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene -- actively querying the black-box with specific inputs to observe the resulting output -- improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.

Comment: The paper provides theoretical insights into LLM behavior, specifically in reverse-engineering black-box systems, which aligns with the LLM criterion.

Relevance: 9 Novelty: 8

14. Stochastic Weight Sharing for Bayesian Neural Networks

ArXiv ID: 2505.17856

Authors: Moule Lin, Shuhao Guan, Weipeng Jing, Goetz Botterweck, Andrea Patane

Abstract: While offering a principled framework for uncertainty quantification in deep learning, the employment of Bayesian Neural Networks (BNNs) is still constrained by their increased computational requirements and the convergence difficulties when training very deep, state-of-the-art architectures. In this work, we reinterpret weight-sharing quantization techniques from a stochastic perspective in the context of training and inference with BNNs. Specifically, we leverage 2D adaptive Gaussian distributions, Wasserstein distance estimations, and alpha blending to encode the stochastic behaviour of a BNN in a lower dimensional, soft Gaussian representation. Through extensive empirical investigation, we demonstrate that our approach significantly reduces the computational overhead inherent in Bayesian learning by several orders of magnitude, enabling the efficient Bayesian training of large-scale models, such as ResNet-101 and Vision Transformer (ViT). On various computer vision benchmarks, including CIFAR10, CIFAR100, and ImageNet1k, our approach compresses model parameters by approximately 50x and reduces model size by 75%, while achieving accuracy and uncertainty estimations comparable to the state-of-the-art.

Comment: The paper presents a novel approach to compress Bayesian Neural Networks using stochastic weight-sharing quantization, which aligns with the interest in model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

15. The emergence of sparse attention: impact of data distribution and benefits of repetition

ArXiv ID: 2505.17863

Authors: Nicolas Zucchet, Francesco d'Angelo, Andrew K. Lampinen, Stephanie C. Y. Chan

Abstract: Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.

Comment: The paper studies the emergence of sparse attention in transformers, providing theoretical insights into training dynamics, which aligns with representation learning and model architecture interests.

Relevance: 9 Novelty: 8

16. Time to Spike? Understanding the Representational Power of Spiking Neural Networks in Discrete Time

ArXiv ID: 2505.18023

Authors: Duc Anh Nguyen, Ernesto Araya, Adalbert Fono, Gitta Kutyniok

Abstract: Recent years have seen significant progress in developing spiking neural networks (SNNs) as a potential solution to the energy challenges posed by conventional artificial neural networks (ANNs). However, our theoretical understanding of SNNs remains relatively limited compared to the ever-growing body of literature on ANNs. In this paper, we study a discrete-time model of SNNs based on leaky integrate-and-fire (LIF) neurons, referred to as discrete-time LIF-SNNs, a widely used framework that still lacks solid theoretical foundations. We demonstrate that discrete-time LIF-SNNs with static inputs and outputs realize piecewise constant functions defined on polyhedral regions, and more importantly, we quantify the network size required to approximate continuous functions. Moreover, we investigate the impact of latency (number of time steps) and depth (number of layers) on the complexity of the input space partitioning induced by discrete-time LIF-SNNs. Our analysis highlights the importance of latency and contrasts these networks with ANNs employing piecewise linear activation functions. Finally, we present numerical experiments to support our theoretical findings.

Comment: The paper provides theoretical insights into the representational power of spiking neural networks, which aligns with the representation learning criterion. It explores the complexity of input space partitioning and compares SNNs with ANNs, contributing to foundational research.

Relevance: 9 Novelty: 8
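
To make the "piecewise constant" claim in the abstract above concrete, here is a minimal sketch of a single discrete-time LIF neuron driven by a static input. The leak factor `beta`, the threshold `theta`, the reset-by-subtraction rule, and the function name are illustrative assumptions and are not taken from the paper; the point is only that the spike count over a fixed number of time steps can take finitely many integer values, so it is constant on intervals of the input.

```python
def lif_spike_count(x, beta=0.5, theta=1.0, steps=4):
    """Count the spikes of one discrete-time LIF neuron driven by a static input x.

    All parameter choices here are illustrative assumptions.
    """
    u = 0.0        # membrane potential
    spikes = 0
    for _ in range(steps):
        u = beta * u + x       # leak the old potential, then integrate the input
        if u >= theta:         # fire when the threshold is reached
            spikes += 1
            u -= theta         # reset by subtraction (assumed reset rule)
    return spikes
```

With these defaults, inputs 0.2, 0.6, 0.61, and 1.0 give spike counts 0, 1, 1, and 4: nearby inputs share an output, and the output jumps only at a finite set of breakpoints. Increasing `steps` (the latency) allows more distinct output levels.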

17. Large Language Models Implicitly Learn to See and Hear Just By Reading

ArXiv ID: 2505.17091

Authors: Prateek Verma, Mert Pilanci

Abstract: This paper presents a fascinating find: by training an auto-regressive LLM on text tokens, the text model inherently develops an internal ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLMs fine-tune text LLMs to give text output conditioned on image and audio embeddings. In contrast, our architecture takes in patches of images, audio waveforms, or tokens as input and produces the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification on the FSD-50K and GTZAN datasets. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This supports the notion that text LLMs learn powerful internal circuits that can be utilized by activating the necessary connections for various applications, rather than training models from scratch every single time.

Comment: The paper suggests that LLMs can inherently develop abilities to understand images and audio, which is a novel insight into LLM behavior.

Relevance: 8 Novelty: 9

ArXiv ID: 2505.17500

Authors: Vladimir Baulin, Austin Cook, Daniel Friedman, Janna Lumiruusu, Andrew Pashea, Shagor Rahman, Benedikt Waldeck

Abstract: The prevailing model for disseminating scientific knowledge relies on individual publications dispersed across numerous journals and archives. This legacy system is ill suited to the recent exponential proliferation of publications, contributing to insurmountable information overload, issues surrounding reproducibility and retractions. We introduce the Discovery Engine, a framework to address these challenges by transforming an array of disconnected literature into a unified, computationally tractable representation of a scientific domain. Central to our approach is the LLM-driven distillation of publications into structured "knowledge artifacts," instances of a universal conceptual schema, complete with verifiable links to source evidence. These artifacts are then encoded into a high-dimensional Conceptual Tensor. This tensor serves as the primary, compressed representation of the synthesized field, where its labeled modes index scientific components (concepts, methods, parameters, relations) and its entries quantify their interdependencies. The Discovery Engine allows dynamic "unrolling" of this tensor into human-interpretable views, such as explicit knowledge graphs (the CNM graph) or semantic vector spaces, for targeted exploration. Crucially, AI agents operate directly on the graph using abstract mathematical and learned operations to navigate the knowledge landscape, identify non-obvious connections, pinpoint gaps, and assist researchers in generating novel knowledge artifacts (hypotheses, designs). By converting literature into a structured tensor and enabling agent-based interaction with this compact representation, the Discovery Engine offers a new paradigm for AI-augmented scientific inquiry and accelerated discovery.

Comment: The Discovery Engine framework for AI-driven synthesis of scientific knowledge is a novel paradigm, relevant to AI for science.

Relevance: 8 Novelty: 9

19. CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning

ArXiv ID: 2505.17553

Authors: Jinyuan Feng, Chaopeng Wei, Tenghai Qiu, Tianyi Hu, Zhiqiang Pu

Abstract: In parameter-efficient fine-tuning, mixture-of-experts (MoE), which involves specializing functionalities into different experts and sparsely activating them appropriately, has been widely adopted as a promising approach to trade-off between model capacity and computation overhead. However, current MoE variants fall short on heterogeneous datasets, ignoring the fact that experts may learn similar knowledge, resulting in the underutilization of MoE's capacity. In this paper, we propose Contrastive Representation for MoE (CoMoE), a novel method to promote modularization and specialization in MoE, where the experts are trained along with a contrastive objective by sampling from activated and inactivated experts in top-k routing. We demonstrate that such a contrastive objective recovers the mutual-information gap between inputs and the two types of experts. Experiments on several benchmarks and in multi-task settings demonstrate that CoMoE can consistently enhance MoE's capacity and promote modularization among the experts.

Comment: CoMoE focuses on enhancing Mixture-of-Experts (MoE) through contrastive representation, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 7
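
The abstract above distinguishes activated from inactivated experts under top-k routing. The following is a minimal sketch of that split for a single token, assuming a softmax router with renormalized gates; the function name and the renormalization step are assumptions for illustration, and the contrastive objective of CoMoE itself is not reproduced here.

```python
import math


def top_k_route(logits, k):
    """Split experts into activated (top-k) and inactivated sets.

    Returns a dict of renormalized gate weights for the activated experts
    and a sorted list of inactivated expert indices.
    """
    # Softmax over router logits, shifted by the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Expert indices ordered by routing probability, highest first.
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    active, inactive = order[:k], order[k:]

    # Renormalize so the activated gates sum to one.
    norm = sum(probs[i] for i in active)
    gates = {i: probs[i] / norm for i in active}
    return gates, sorted(inactive)
```

For example, `top_k_route([2.0, 0.5, 1.0, -1.0], k=2)` activates experts 0 and 2 and leaves experts 1 and 3 inactivated; a contrastive method could then draw positives from the first group and negatives from the second.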

20. An approach to identify the most semantically informative deep representations of text and images

ArXiv ID: 2505.17101

Authors: Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio

Abstract: Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.

Comment: The paper investigates deep representations in LLMs and vision transformers, aligning with the representation learning criterion.

Relevance: 9 Novelty: 7

21. ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

ArXiv ID: 2505.17495

Authors: Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, Kannan Ramchandran

Abstract: Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical -- higher-order interactions are accompanied by their lower-order subsets -- which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over marginal attribution approaches while using $10\times$ fewer inferences than SPEX. By accounting for interactions, ProxySPEX identifies features that influence model output over 20% more than those selected by marginal approaches. Further, we apply ProxySPEX to two interpretability tasks: data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. ProxySPEX identifies interactions that enable more aggressive pruning of heads than marginal approaches.

Comment: The paper introduces a novel method for efficient interpretability in LLMs, aligning with the LLM criterion.

Relevance: 9 Novelty: 7

22. TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

ArXiv ID: 2505.17155

Authors: Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan

Abstract: Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key source of inefficiency is that LRMs often generate redundant thinking CoTs, which exhibit clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and an asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the MATH500, AIME24, AIME25, and GPQA benchmarks, the reasoning runtime of Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.

Comment: The paper proposes a framework for efficient test-time scaling in large reasoning models, focusing on compression and efficiency, which is relevant to model compression.

Relevance: 9 Novelty: 7

23. HiLAB: A Hybrid Inverse-Design Framework

ArXiv ID: 2505.17491

Authors: Reza Marzban, Hamed Abiri, Raphael Pestourie, Ali Adibi

Abstract: HiLAB (Hybrid inverse-design with Latent-space learning, Adjoint-based partial optimizations, and Bayesian optimization) is a new paradigm for inverse design of nanophotonic structures. Combining early-terminated topological optimization (TO) with a Vision Transformer-based variational autoencoder (VAE) and a Bayesian search, HiLAB addresses multi-functional device design by generating diverse freeform configurations at reduced simulation costs. Shortened adjoint-driven TO runs, coupled with randomized physical parameters, produce robust initial structures. These structures are compressed into a compact latent space by the VAE, enabling Bayesian optimization to co-optimize geometry and physical hyperparameters. Crucially, the trained VAE can be reused for alternative objectives or constraints by adjusting only the acquisition function. Compared to conventional TO pipelines prone to local optima, HiLAB systematically explores near-global optima with considerably fewer electromagnetic simulations. Even after accounting for training overhead, the total number of full simulations decreases by over an order of magnitude, accelerating the discovery of fabrication-friendly devices. Demonstrating its efficacy, HiLAB is used to design an achromatic beam deflector for red, green, and blue wavelengths, achieving balanced diffraction efficiencies of ~25% while mitigating chromatic aberrations, a performance surpassing existing demonstrations. Overall, HiLAB provides a flexible platform for robust, multi-parameter photonic designs and rapid adaptation to next-generation nanophotonic challenges.

Comment: The paper presents HiLAB, a new paradigm for inverse design in nanophotonics, which aligns with AI for Science through latent-space learning and Bayesian optimization for photonic device design.

Relevance: 8 Novelty: 8

24. An Iterative Framework for Generative Backmapping of Coarse Grained Proteins

ArXiv ID: 2505.18082

Authors: Georgios Kementzidis, Erin Wong, John Nicholson, Ruichen Xu, Yuefan Deng

Abstract: The techniques of data-driven backmapping from coarse-grained (CG) to fine-grained (FG) representation often struggle with accuracy, unstable training, and physical realism, especially when applied to complex systems such as proteins. In this work, we introduce a novel iterative framework by using conditional Variational Autoencoders and graph-based neural networks, specifically designed to tackle the challenges associated with such large-scale biomolecules. Our method enables stepwise refinement from CG beads to full atomistic details. We outline the theory of iterative generative backmapping and demonstrate via numerical experiments the advantages of multistep schemes by applying them to proteins of vastly different structures with very coarse representations. This multistep approach not only improves the accuracy of reconstructions but also makes the training process more computationally efficient for proteins with ultra-CG representations.

Comment: The paper introduces a novel iterative framework for generative backmapping of proteins, aligning with AI for Science through foundational research in molecular modeling.

Relevance: 8 Novelty: 8

25. Learning with Restricted Boltzmann Machines: Asymptotics of AMP and GD in High Dimensions

ArXiv ID: 2505.18046

Authors: Yizhou Xu, Florent Krzakala, Lenka Zdeborová

Abstract: The Restricted Boltzmann Machine (RBM) is one of the simplest generative neural networks capable of learning input distributions. Despite its simplicity, the analysis of its performance in learning from the training data is only well understood in cases that essentially reduce to singular value decomposition of the data. Here, we consider the limit of a large dimension of the input space and a constant number of hidden units. In this limit, we simplify the standard RBM training objective into a form that is equivalent to the multi-index model with non-separable regularization. This opens a path to analyze training of the RBM using methods that are established for multi-index models, such as Approximate Message Passing (AMP) and its state evolution, and the analysis of Gradient Descent (GD) via the dynamical mean-field theory. We then give rigorous asymptotics of the training dynamics of RBM on data generated by the spiked covariance model as a prototype of a structure suitable for unsupervised learning. We show in particular that RBM reaches the optimal computational weak recovery threshold, aligning with the BBP transition, in the spiked covariance model.

Comment: The paper provides a theoretical analysis of Restricted Boltzmann Machines (RBM) using methods like Approximate Message Passing, relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 8

26. From Compression to Expansion: A Layerwise Analysis of In-Context Learning

ArXiv ID: 2505.17322

Authors: Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu

Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term Layerwise Compression-Expansion: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance -- improving with model size and the number of demonstrations -- and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.

Comment: The paper provides a layerwise analysis of in-context learning in LLMs, relevant to understanding LLM behavior and representation learning.

Relevance: 8 Novelty: 8

27. Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding

ArXiv ID: 2505.17939

Authors: Manuel Lecha, Andrea Cavallo, Francesca Dominici, Ran Levi, Alessio Del Bue, Elvin Isufi, Pietro Morerio, Claudio Battiloro

Abstract: Graph Neural Networks (GNNs) excel at learning from pairwise interactions but often overlook multi-way and hierarchical relationships. Topological Deep Learning (TDL) addresses this limitation by leveraging combinatorial topological spaces. However, existing TDL models are restricted to undirected settings and fail to capture the higher-order directed patterns prevalent in many complex systems, e.g., brain networks, where such interactions are both abundant and functionally significant. To fill this gap, we introduce Semi-Simplicial Neural Networks (SSNs), a principled class of TDL models that operate on semi-simplicial sets -- combinatorial structures that encode directed higher-order motifs and their directional relationships. To enhance scalability, we propose Routing-SSNs, which dynamically select the most informative relations in a learnable manner. We prove that SSNs are strictly more expressive than standard graph and TDL models. We then introduce a new principled framework for brain dynamics representation learning, grounded in the ability of SSNs to provably recover topological descriptors shown to successfully characterize brain activity. Empirically, SSNs achieve state-of-the-art performance on brain dynamics classification tasks, outperforming the second-best model by up to 27%, and message passing GNNs by up to 50% in accuracy. Our results highlight the potential of principled topological models for learning from structured brain data, establishing a unique real-world case study for TDL. We also test SSNs on standard node classification and edge regression tasks, showing competitive performance. We will make the code and data publicly available.

Comment: The paper introduces Semi-Simplicial Neural Networks for brain activity decoding, which is relevant to emerging trends in model architecture.

Relevance: 8 Novelty: 8

28. Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization

ArXiv ID: 2505.17852

Authors: Francois Chaubard, Mykel Kochenderfer

Abstract: During inference, Recurrent Neural Networks (RNNs) have constant cost in both FLOPs and GPU memory as context length increases, because they compress all prior tokens into a fixed-size memory. In contrast, transformers scale linearly in FLOPs and, at best, linearly in memory during generation, since they must attend to all previous tokens explicitly. Despite this inference-time advantage, training large RNNs on long contexts remains impractical because standard optimization methods depend on Backpropagation Through Time (BPTT). BPTT requires retention of all intermediate activations during the forward pass, causing memory usage to scale linearly with both context length and model size. In this paper, we show that Zero-Order Optimization (ZOO) methods such as Random-vector Gradient Estimation (RGE) can successfully replace BPTT to train RNNs with convergence rates that match or exceed those of BPTT by up to 19-fold, while using orders of magnitude less memory and cost, as the model remains in inference mode throughout training. We further demonstrate that Central-Difference RGE (CD-RGE) corresponds to optimizing a smoothed surrogate loss, inherently regularizing training and improving generalization. Our method matches or outperforms BPTT across three settings: (1) overfitting, (2) transduction, and (3) language modeling. Across all tasks, with sufficient perturbations, our models generalize as well as or better than those trained with BPTT, often in fewer steps. Despite the need for more forward passes per step, we can surpass BPTT wall-clock time per step using recent advancements such as FlashRNN and distributed inference.

Comment: The paper explores zero-order optimization for training large RNNs, which is relevant to model architecture and efficiency improvements.

Relevance: 8 Novelty: 8
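
As a rough illustration of the central-difference random-vector gradient estimate mentioned in the abstract above, the sketch below averages directional finite differences along Gaussian random directions, using forward evaluations only. The toy quadratic loss, the step sizes, the number of perturbations, and all function names are assumptions chosen for illustration; nothing here reproduces the paper's RNN training setup.

```python
import random


def cd_rge_gradient(loss, x, num_perturbations=8, eps=1e-3, rng=random):
    """Central-difference random-vector gradient estimate of `loss` at `x`.

    Only forward evaluations of `loss` are used; no backpropagation.
    """
    grad = [0.0] * len(x)
    for _ in range(num_perturbations):
        v = [rng.gauss(0.0, 1.0) for _ in x]                # random direction
        plus = loss([xi + eps * vi for xi, vi in zip(x, v)])
        minus = loss([xi - eps * vi for xi, vi in zip(x, v)])
        slope = (plus - minus) / (2.0 * eps)                # directional derivative along v
        for i, vi in enumerate(v):
            grad[i] += slope * vi / num_perturbations       # average slope * direction
    return grad


def quadratic(x):
    """Toy loss standing in for a model's training loss (an assumption)."""
    return sum(xi * xi for xi in x)


def train(steps=200, lr=0.05, seed=0):
    """Plain gradient descent driven by the zero-order estimate."""
    rng = random.Random(seed)
    x = [1.0, -2.0, 0.5, 3.0, -1.5]
    for _ in range(steps):
        g = cd_rge_gradient(quadratic, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

Each step costs `2 * num_perturbations` forward passes and stores no intermediate activations, which is the memory trade-off the abstract describes.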

29. Emergence of Hebbian Dynamics in Regularized Non-Local Learners

ArXiv ID: 2505.18069

Authors: David Koplow, Tomaso Poggio, Liu Ziyin

Abstract: Stochastic Gradient Descent (SGD) has emerged as a remarkably effective learning algorithm, underpinning nearly all state-of-the-art machine learning models, from large language models to autonomous vehicles. Despite its practical success, SGD appears fundamentally distinct from biological learning mechanisms. It is widely believed that the biological brain cannot implement gradient descent because it is nonlocal, and we have found little (if any) experimental evidence for it. In contrast, the brain is widely thought to learn via local Hebbian learning principles, which have been seen as incompatible with gradient descent. In this paper, we establish a theoretical and empirical connection between the learning signals of neural networks trained using SGD with weight decay and those trained with Hebbian learning near convergence. We show that SGD with regularization can appear to learn according to a Hebbian rule, and SGD with injected noise according to an anti-Hebbian rule. We also provide empirical evidence that Hebbian learning properties can emerge in a network with weight decay from virtually any learning rule--even random ones. These results may bridge a long-standing gap between artificial and biological learning, revealing Hebbian properties as an epiphenomenon of deeper optimization principles and cautioning against interpreting their presence in neural data as evidence against more complex hetero-synaptic mechanisms.

Comment: The paper establishes a connection between SGD and Hebbian learning, which is relevant to emerging trends in learning dynamics.

Relevance: 8 Novelty: 8

30. Out of the Shadows: Exploring a Latent Space for Neural Network Verification

ArXiv ID: 2505.17854

Authors: Lukas Koller, Tobias Ladner, Matthias Althoff

Abstract: Neural networks are ubiquitous. However, they are often sensitive to small input changes. Hence, to prevent unexpected behavior in safety-critical applications, their formal verification -- a notoriously hard problem -- is necessary. Many state-of-the-art verification algorithms use reachability analysis or abstract interpretation to enclose the set of possible outputs of a neural network. Often, the verification is inconclusive due to the conservatism of the enclosure. To address this problem, we design a novel latent space for formal verification that enables the transfer of output specifications to the input space for an iterative specification-driven input refinement, i.e., we iteratively reduce the set of possible inputs to only enclose the unsafe ones. The latent space is constructed from a novel view of projection-based set representations, e.g., zonotopes, which are commonly used in reachability analysis of neural networks. A projection-based set representation is a "shadow" of a higher-dimensional set -- a latent space -- that does not change during a set propagation through a neural network. Hence, the input set and the output enclosure are "shadows" of the same latent space that we can use to transfer constraints. We present an efficient verification tool for neural networks that uses our iterative refinement to significantly reduce the number of subproblems in a branch-and-bound procedure. Using zonotopes as a set representation, unlike many other state-of-the-art approaches, our approach can be realized by only using matrix operations, which enables a significant speed-up through efficient GPU acceleration. We demonstrate that our tool achieves competitive performance, which would place it among the top-ranking tools of the last neural network verification competition (VNN-COMP'24).

Comment: The paper presents a novel latent space for neural network verification, which is relevant to model architecture and efficiency improvements.

Relevance: 8 Novelty: 8
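
The abstract above notes that zonotope-based verification can be carried out with matrix operations alone. The sketch below shows the two standard zonotope facts behind that remark: an affine layer maps the center and each generator linearly, and the enclosing box is the center plus or minus the summed absolute generator entries. The function names are assumptions, nonlinear activations are not handled, and the paper's latent-space refinement is not reproduced.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def affine_map(center, generators, W, b):
    """Image of the zonotope {c + sum_i a_i * g_i : |a_i| <= 1} under x -> W x + b."""
    new_center = [y + bi for y, bi in zip(matvec(W, center), b)]
    # The offset b shifts only the center; generators are mapped by W alone.
    new_generators = [matvec(W, g) for g in generators]
    return new_center, new_generators


def interval_bounds(center, generators):
    """Tightest axis-aligned box that contains the zonotope."""
    radius = [sum(abs(g[i]) for g in generators) for i in range(len(center))]
    lower = [c - r for c, r in zip(center, radius)]
    upper = [c + r for c, r in zip(center, radius)]
    return lower, upper
```

For instance, mapping the unit box (center `[0.0, 0.0]`, generators `[[1.0, 0.0], [0.0, 1.0]]`) through `W = [[1.0, 1.0], [1.0, -1.0]]` with `b = [0.5, 0.0]` yields the box from `[-1.5, -2.0]` to `[2.5, 2.0]`.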

31. Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling

ArXiv ID: 2505.18017

Authors: Matthieu Blanke, Yongquan Qu, Sara Shamekh, Pierre Gentine

Abstract: Deep generative models hold great promise for representing complex physical systems, but their deployment is currently limited by the lack of guarantees on the physical plausibility of the generated outputs. Ensuring that known physical constraints are enforced is therefore critical when applying generative models to scientific and engineering problems. We address this limitation by developing a principled framework for sampling from a target distribution while rigorously satisfying physical constraints. Leveraging the variational formulation of Langevin dynamics, we propose Split Augmented Langevin (SAL), a novel primal-dual sampling algorithm that enforces constraints progressively through variable splitting, with convergence guarantees. While the method is developed theoretically for Langevin dynamics, we demonstrate its effective applicability to diffusion models. In particular, we use constrained diffusion models to generate physical fields satisfying energy and mass conservation laws. We apply our method to diffusion-based data assimilation on a complex physical system, where enforcing physical constraints substantially improves both forecast accuracy and the preservation of critical conserved quantities. We also demonstrate the potential of SAL for challenging feasibility problems in optimal control.

Comment: The paper introduces a novel sampling algorithm for generative models, aligning with the emerging trends criterion.

Relevance: 8 Novelty: 8
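
For orientation, the sketch below shows plain unadjusted Langevin dynamics in one dimension, the unconstrained baseline that constrained samplers such as the one in the abstract above build on. It does not implement SAL, variable splitting, or any constraint handling; the target (a standard normal), the step size, and the function names are illustrative assumptions.

```python
import math
import random


def langevin_samples(grad_log_density, x0=0.0, step=0.01, num_steps=200_000, seed=0):
    """Unadjusted Langevin dynamics in one dimension (no constraints enforced)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift along the score, plus Gaussian noise scaled by sqrt(2 * step).
        x = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


def standard_normal_score(x):
    """Gradient of the log density of N(0, 1)."""
    return -x
```

Running `langevin_samples(standard_normal_score)` produces a chain whose empirical mean and variance should be close to 0 and 1, up to discretization bias and Monte Carlo error.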

32. Continuum Transformers Perform In-Context Learning by Operator Gradient Descent

ArXiv ID: 2505.17838

Authors: Abhiti Mishra, Yash Patel, Ambuj Tewari

Abstract: Transformers robustly exhibit the ability to perform in-context learning, whereby their predictive accuracy on a task can increase not by parameter updates but merely with the placement of training samples in their context windows. Recent works have shown that transformers achieve this by implementing gradient descent in their forward passes. Such results, however, are restricted to standard transformer architectures, which handle finite-dimensional inputs. In the space of PDE surrogate modeling, a generalization of transformers to handle infinite-dimensional function inputs, known as "continuum transformers," has been proposed and similarly observed to exhibit in-context learning. Despite impressive empirical performance, such in-context learning has yet to be theoretically characterized. We herein demonstrate that continuum transformers perform in-context operator learning by performing gradient descent in an operator RKHS. We demonstrate this using novel proof strategies that leverage a generalized representer theorem for Hilbert spaces and gradient flows over the space of functionals of a Hilbert space. We additionally show the operator learned in context is the Bayes Optimal Predictor in the infinite depth limit of the transformer. We then provide empirical validations of this optimality result and demonstrate that the parameters under which such gradient descent is performed are recovered through the continuum transformer training.

Comment: The paper provides a theoretical characterization of in-context learning in continuum transformers, which aligns with interests in large language models and theoretical insights.

Relevance: 8 Novelty: 8

33. Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions

ArXiv ID: 2505.17304

Authors: Jianhao Ma, Geyu Liang, Salar Fattahi

Abstract: Implicit regularization refers to the phenomenon where local search algorithms converge to low-dimensional solutions, even when such structures are neither explicitly specified nor encoded in the optimization problem. While widely observed, this phenomenon remains theoretically underexplored, particularly in modern over-parameterized problems. In this paper, we study the conditions that enable implicit regularization by investigating when gradient-based methods converge to second-order stationary points (SOSPs) within an implicit low-dimensional region of a smooth, possibly nonconvex function. We show that successful implicit regularization hinges on two key conditions: (i) the ability to efficiently escape strict saddle points, while (ii) maintaining proximity to the implicit region. Existing analyses enabling the convergence of gradient descent (GD) to SOSPs often rely on injecting large perturbations to escape strict saddle points. However, this comes at the cost of deviating from the implicit region. The central premise of this paper is that it is possible to achieve the best of both worlds: efficiently escaping strict saddle points using infinitesimal perturbations, while controlling deviation from the implicit region via a small deviation rate. We show that infinitesimally perturbed gradient descent (IPGD), which can be interpreted as GD with inherent "round-off errors", can provably satisfy both conditions. We apply our framework to the problem of over-parameterized matrix sensing, where we establish formal guarantees for the implicit regularization behavior of IPGD. We further demonstrate through extensive experiments that these insights extend to a broader class of learning problems.
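The saddle-escaping role of tiny perturbations can be seen on a two-dimensional toy problem. The sketch below is an illustration only, not the paper's IPGD algorithm or its analysis: the objective f(x, y) = 0.5x^2 - 0.5y^2 + 0.25y^4, the uniform noise of magnitude 1e-12 standing in for round-off error, and the step size are all assumptions. The origin is a strict saddle of f, and the minima sit at (0, 1) and (0, -1).

```python
import random

def grad(x, y):
    # Gradient of f(x, y) = 0.5*x**2 - 0.5*y**2 + 0.25*y**4.
    return x, -y + y ** 3

def descend(x, y, step=0.1, iters=2000, noise=0.0, seed=0):
    # Gradient descent with an optional tiny additive perturbation per step.
    rng = random.Random(seed)
    for _ in range(iters):
        gx, gy = grad(x, y)
        x -= step * gx + noise * rng.uniform(-1.0, 1.0)
        y -= step * gy + noise * rng.uniform(-1.0, 1.0)
    return x, y

# Starting on the line y = 0, exact GD never leaves it and lands on the saddle.
plain = descend(1.0, 0.0)
# A perturbation of size 1e-12 is amplified along the negative-curvature
# direction until the iterate settles near one of the two minima.
perturbed = descend(1.0, 0.0, noise=1e-12)
```

Because the perturbation is so small, the perturbed iterate stays within roughly 1e-11 of the unperturbed path in the x direction, which gives a loose picture of escaping the saddle without straying far.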

Comment: The paper explores implicit regularization in gradient descent, which is relevant to representation learning and training dynamics in neural networks.