This is a remedial run for missed papers from 05/22/2025 to 05/22/2025.

Results generated on 05/26/2025.

Personalized Daily ArXiv Papers 2025-05-23

[gpt-4o]	Prompt	Completion	Total
Token	55307	7125	62432
Cost	$0.14	$0.07	$0.21

Total arXiv papers: 481

Total scanned papers: 481

Total relevant papers: 45

Table of contents with paper titles:

Structure-Aligned Protein Language Model Authors: Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier
The Computational Complexity of Counting Linear Regions in ReLU Neural Networks Authors: Moritz Stargalla, Christoph Hertrich, Daniel Reichman
Understanding Prompt Tuning and In-Context Learning via Meta-Learning Authors: Tim Genewein, Kevin Wenliang Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter
FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design Authors: Renjie Wei, Songqiang Xu, Qingyu Guo, Meng Li
Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning Authors: Adnan Oomerjee, Zafeirios Fountas, Zhongwei Yu, Haitham Bou-Ammar, Jun Wang
Latent Principle Discovery for Language Model Self-Improvement Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
Is Quantum Optimization Ready? An Effort Towards Neural Network Compression using Adiabatic Quantum Computing Authors: Zhehui Wanga, Benjamin Chen Ming Choonga, Tian Huang, Daniel Gerlinghoffa, Rick Siow Mong Goh, Cheng Liu, Tao Luo
NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics Authors: Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei
Robust Invariant Representation Learning by Distribution Extrapolation Authors: Kotaro Yoshida, Konstantinos Slavakis
Stochastic Forward-Forward Learning through Representational Dimensionality Compression Authors: Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng
From Compression to Expansion: A Layerwise Analysis of In-Context Learning Authors: Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu
TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning Authors: Florentin Beck, William Rudman, Carsten Eickhoff
Attention with Trained Embeddings Provably Selects Important Tokens Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli
Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions Authors: Jianhao Ma, Geyu Liang, Salar Fattahi
Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model Authors: Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation Authors: Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun
Learning novel representations of variable sources from multi-modal $\textit{Gaia}$ data via autoencoders Authors: P. Huijse, J. De Ridder, L. Eyer, L. Rimoldini, B. Holl, N. Chornay, J. Roquette, K. Nienartowicz, G. Jevardat de Fombelle, D. J. Fritzewski, A. Kemp, V. Vanlaer, M. Vanrespaille, H. Wang, M. I. Carnerero, C. M. Raiteri, G. Marton, M. Madarász, G. Clementini, P. Gavras, C. Aerts
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs Authors: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning Authors: Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han
TI-DeepONet: Learnable Time Integration for Stable Long-Term Extrapolation Authors: Dibyajyoti Nayak, Somdatta Goswami
LaSER: How Learning Can Guide the Evolution of Equations Authors: Nam H. Le, Josh Bongard
Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space Authors: Fuchun Ge, Pavlo O. Dral
Scalable Graph Generative Modeling via Substructure Sequences Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation Authors: Yuhui Zhang, Dongshen Wu, Yuichiro Wada, Takafumi Kanamori
AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training Authors: Huishuai Zhang, Bohan Wang, Luoxin Chen
Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods Authors: Majid Mohammadi, Siu Lun Chau, Krikamol Muandet
Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo
Training Long-Context LLMs Efficiently via Chunk-wise Optimization Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation Authors: Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
Large-Scale Bayesian Tensor Reconstruction: An Approximate Message Passing Solution Authors: Bingyang Cheng, Zhongtao Chen, Yichen Jin, Hao Zhang, Chen Zhang, Edmud Y. Lam, Yik-Chung Wu
ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training Authors: Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu
EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning Authors: Jiawei Liu, Qisi Chen, Jianshu Zhang, Quan Liu, Defu Lian
TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling Authors: Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan
Native Segmentation Vision Transformers Authors: Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé
SELF: Self-Extend the Context Length With Logistic Growth Function Authors: Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
HOFT: Householder Orthogonal Fine-tuning Authors: Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan
AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer Authors: Jiquan Shan, Junxiao Wang, Lifeng Zhao, Liang Cai, Hongyuan Zhang, Ioannis Liritzis
NAN: A Training-Free Solution to Coefficient Estimation in Model Merging Authors: Chongjie Si, Kangtao Lv, Jingjing Jiang, Yadao Wang, Yongwei Wang, Xiaokang Yang, Wenbo Su, Bo Zheng, Wei Shen
Understanding Differential Transformer Unchains Pretrained Self-Attentions Authors: Chaerin Kong, Jiho Jang, Nojun Kwak
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving Authors: Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models Authors: Yue Li, Xin Yi, Dongsheng Shi, Gerard de Melo, Xiaoling Wang, Linlin Wang
PaTH Attention: Position Encoding via Accumulating Householder Transformations Authors: Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
Small-to-Large Generalization: Data Influences Models Consistently Across Scale Authors: Alaa Khaddaj, Logan Engstrom, Aleksander Madry
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models Authors: Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

1. Structure-Aligned Protein Language Model

ArXiv ID: 2505.16896

Authors: Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier

Abstract: Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY results in notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.

Comment: Author match

2. The Computational Complexity of Counting Linear Regions in ReLU Neural Networks

ArXiv ID: 2505.16716

Authors: Moritz Stargalla, Christoph Hertrich, Daniel Reichman

Abstract: An established measure of the expressive power of a given ReLU neural network is the number of linear regions into which it partitions the input space. There exist many different, non-equivalent definitions of what a linear region actually is. We systematically assess which papers use which definitions and discuss how they relate to each other. We then analyze the computational complexity of counting the number of such regions for the various definitions. Generally, this turns out to be an intractable problem. We prove NP- and #P-hardness results already for networks with one hidden layer and strong hardness of approximation results for two or more hidden layers. Finally, on the algorithmic side, we demonstrate that counting linear regions can at least be achieved in polynomial space for some common definitions.

Comment: The paper analyzes the computational complexity of counting linear regions in ReLU networks, relevant to model architecture and theoretical insights.

Relevance: 9 Novelty: 8

3. Understanding Prompt Tuning and In-Context Learning via Meta-Learning

ArXiv ID: 2505.17010

Authors: Tim Genewein, Kevin Wenliang Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter

Abstract: Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.

Comment: The paper provides a theoretical understanding of prompt tuning and in-context learning via meta-learning, relevant to large language models and representation learning.

Relevance: 9 Novelty: 8

4. FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design

ArXiv ID: 2505.16335

Authors: Renjie Wei, Songqiang Xu, Qingyu Guo, Meng Li

Abstract: Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, leading to better image quality and faster inference speed compared to existing diffusion models. However, the large parameter size and computation cost hinder its deployment on edge devices. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR featuring algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR. To address them, we propose Dual Format Quantization for the highly imbalanced input activation. We further propose Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation to address the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier with lookup tables on FPGA and propose the first FPGA-based VAR accelerator featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that compared to the state-of-the-art quantization method, our proposed FPQVAR significantly improves Fr\'echet Inception Distance (FID) from 10.83 to 3.58, Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves the performance of 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 image/s, which is 3.1x higher than the integer-based accelerator. It also demonstrates 3.6x and 2.8x higher energy efficiency compared to the integer-based accelerator and GPU baseline, respectively.

Comment: The paper presents a novel quantization framework for visual autoregressive models, which aligns with model compression through quantization and efficiency improvements.

Relevance: 9 Novelty: 8

5. Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning

ArXiv ID: 2505.16950

Authors: Adnan Oomerjee, Zafeirios Fountas, Zhongwei Yu, Haitham Bou-Ammar, Jun Wang

Abstract: Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We prove using IB theory that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters, as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.

Comment: The paper proposes a modification to the Transformer architecture for improved generalization, aligning with model architecture innovations and KV cache manipulation.

Relevance: 9 Novelty: 8

6. Latent Principle Discovery for Language Model Self-Improvement

ArXiv ID: 2505.16927

Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

Abstract: When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.

Comment: The paper focuses on a novel method for self-improvement in language models by discovering latent principles, which aligns with foundational research in representation learning and LLM behavior.

Relevance: 9 Novelty: 8

7. Is Quantum Optimization Ready? An Effort Towards Neural Network Compression using Adiabatic Quantum Computing

ArXiv ID: 2505.16332

Authors: Zhehui Wanga, Benjamin Chen Ming Choonga, Tian Huang, Daniel Gerlinghoffa, Rick Siow Mong Goh, Cheng Liu, Tao Luo

Abstract: Quantum optimization is the most mature quantum computing technology to date, providing a promising approach towards efficiently solving complex combinatorial problems. Methods such as adiabatic quantum computing (AQC) have been employed in recent years on important optimization problems across various domains. In deep learning, deep neural networks (DNN) have reached immense sizes to support new predictive capabilities. Optimization of large-scale models is critical for sustainable deployment, but becomes increasingly challenging with ever-growing model sizes and complexity. While quantum optimization is suitable for solving complex problems, its application to DNN optimization is not straightforward, requiring thorough reformulation for compatibility with commercially available quantum devices. In this work, we explore the potential of adopting AQC for fine-grained pruning-quantization of convolutional neural networks. We rework established heuristics to formulate model compression as a quadratic unconstrained binary optimization (QUBO) problem, and assess the solution space offered by commercial quantum annealing devices. Through our exploratory efforts of reformulation, we demonstrate that AQC can achieve effective compression of practical DNN models. Experiments demonstrate that adiabatic quantum computing (AQC) not only outperforms classical algorithms like genetic algorithms and reinforcement learning in terms of time efficiency but also excels at identifying global optima.

Comment: The paper explores quantum optimization for neural network compression, specifically using adiabatic quantum computing for pruning and quantization.

Relevance: 9 Novelty: 8

8. NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

ArXiv ID: 2505.16210

Authors: Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with an 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.

Comment: The paper introduces NQKV, a KV cache quantization scheme, which directly relates to model compression and efficiency.

Relevance: 9 Novelty: 8

9. Robust Invariant Representation Learning by Distribution Extrapolation

ArXiv ID: 2505.16126

Authors: Kotaro Yoshida, Konstantinos Slavakis

Abstract: Invariant risk minimization (IRM) aims to enable out-of-distribution (OOD) generalization in deep learning by learning invariant representations. As IRM poses an inherently challenging bi-level optimization problem, most existing approaches -- including IRMv1 -- adopt penalty-based single-level approximations. However, empirical studies consistently show that these methods often fail to outperform well-tuned empirical risk minimization (ERM), highlighting the need for more robust IRM implementations. This work theoretically identifies a key limitation common to many IRM variants: their penalty terms are highly sensitive to limited environment diversity and over-parameterization, resulting in performance degradation. To address this issue, a novel extrapolation-based framework is proposed that enhances environmental diversity by augmenting the IRM penalty through synthetic distributional shifts. Extensive experiments -- ranging from synthetic setups to realistic, over-parameterized scenarios -- demonstrate that the proposed method consistently outperforms state-of-the-art IRM variants, validating its effectiveness and robustness.

Comment: The paper proposes a novel extrapolation-based framework for invariant representation learning, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8

10. Stochastic Forward-Forward Learning through Representational Dimensionality Compression

ArXiv ID: 2505.16649

Authors: Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng

Abstract: The Forward-Forward (FF) algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise "goodness" function to guide learning. Existing goodness functions, inspired by energy-based learning (EBL), are typically defined as the sum of squared post-synaptic activations, neglecting the correlations between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for clamped inputs when noise is considered while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples. We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared outputs, which is equivalent to making predictions based on the energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at https://github.com/ZhichaoZhu/StochasticForwardForward

Comment: The paper introduces a novel goodness function for the Forward-Forward algorithm, contributing to representation learning with a focus on dimensionality compression.

Relevance: 9 Novelty: 8

11. From Compression to Expansion: A Layerwise Analysis of In-Context Learning

ArXiv ID: 2505.17322

Authors: Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu

Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term Layerwise Compression-Expansion: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance -- improving with model size and the number of demonstrations -- and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.

Comment: The paper conducts a layerwise analysis of in-context learning in large language models, aligning with the large language models criterion.

Relevance: 9 Novelty: 8

12. TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

ArXiv ID: 2505.16743

Authors: Florentin Beck, William Rudman, Carsten Eickhoff

Abstract: Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM

Comment: The paper introduces a novel pruning method, TRIM, which is relevant to model compression and sparsity.

Relevance: 9 Novelty: 8

13. Attention with Trained Embeddings Provably Selects Important Tokens

ArXiv ID: 2505.17282

Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli

Abstract: Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_X$ capture the importance of tokens in the dataset by aligning with the output vector $v$ proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training $p$ via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.

Comment: The paper provides theoretical insights into how attention with trained embeddings selects important tokens, relevant to Representation Learning.

Relevance: 9 Novelty: 8

14. Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions

ArXiv ID: 2505.17304

Authors: Jianhao Ma, Geyu Liang, Salar Fattahi

Abstract: Implicit regularization refers to the phenomenon where local search algorithms converge to low-dimensional solutions, even when such structures are neither explicitly specified nor encoded in the optimization problem. While widely observed, this phenomenon remains theoretically underexplored, particularly in modern over-parameterized problems. In this paper, we study the conditions that enable implicit regularization by investigating when gradient-based methods converge to second-order stationary points (SOSPs) within an implicit low-dimensional region of a smooth, possibly nonconvex function. We show that successful implicit regularization hinges on two key conditions: $(i)$ the ability to efficiently escape strict saddle points, while $(ii)$ maintaining proximity to the implicit region. Existing analyses enabling the convergence of gradient descent (GD) to SOSPs often rely on injecting large perturbations to escape strict saddle points. However, this comes at the cost of deviating from the implicit region. The central premise of this paper is that it is possible to achieve the best of both worlds: efficiently escaping strict saddle points using infinitesimal perturbations, while controlling deviation from the implicit region via a small deviation rate. We show that infinitesimally perturbed gradient descent (IPGD), which can be interpreted as GD with inherent ``round-off errors'', can provably satisfy both conditions. We apply our framework to the problem of over-parameterized matrix sensing, where we establish formal guarantees for the implicit regularization behavior of IPGD. We further demonstrate through extensive experiments that these insights extend to a broader class of learning problems.

Comment: The paper studies implicit regularization in gradient descent, which is relevant to representation learning and training dynamics.

Relevance: 9 Novelty: 8

15. Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

ArXiv ID: 2505.16694

Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo

Abstract: Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phases change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis lead to a deeper understanding of the source of the transformer's ICL ability.

Comment: The paper analyzes the emergence of multi-phase circuits in transformers, which is relevant to understanding transformer architecture and training dynamics.

Relevance: 9 Novelty: 8

16. JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

ArXiv ID: 2505.17257

Authors: Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild

Abstract: Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA

Comment: The paper presents a hybrid DNA foundation model with MoE architecture, which is relevant to model architecture innovations and efficiency.

Relevance: 9 Novelty: 8

17. Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation

ArXiv ID: 2505.17288

Authors: Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun

Abstract: Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new next token predictor on good generations. The second method, Best-of-N, trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy a better rate of convergence in either n or a rate of convergence with better dependence on the response length.

Comment: The paper theoretically compares two methods for adapting large language models, focusing on the learning dynamics and convergence rates, which aligns with the Large Language Models criterion by providing theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8

ArXiv ID: 2505.16320

Authors: P. Huijse, J. De Ridder, L. Eyer, L. Rimoldini, B. Holl, N. Chornay, J. Roquette, K. Nienartowicz, G. Jevardat de Fombelle, D. J. Fritzewski, A. Kemp, V. Vanlaer, M. Vanrespaille, H. Wang, M. I. Carnerero, C. M. Raiteri, G. Marton, M. Madarász, G. Clementini, P. Gavras, C. Aerts

Abstract: Gaia Data Release 3 (DR3) published for the first time epoch photometry, BP/RP (XP) low-resolution mean spectra, and supervised classification results for millions of variable sources. This extensive dataset offers a unique opportunity to study their variability by combining multiple Gaia data products. In preparation for DR4, we propose and evaluate a machine learning methodology capable of ingesting multiple Gaia data products to achieve an unsupervised classification of stellar and quasar variability. A dataset of 4 million Gaia DR3 sources is used to train three variational autoencoders (VAE), which are artificial neural networks (ANNs) designed for data compression and generation. One VAE is trained on Gaia XP low-resolution spectra, another on a novel approach based on the distribution of magnitude differences in the Gaia G band, and the third on folded Gaia G band light curves. Each Gaia source is compressed into 15 numbers, representing the coordinates in a 15-dimensional latent space generated by combining the outputs of these three models. The learned latent representation produced by the ANN effectively distinguishes between the main variability classes present in Gaia DR3, as demonstrated through both supervised and unsupervised classification analysis of the latent space. The results highlight a strong synergy between light curves and low-resolution spectral data, emphasising the benefits of combining the different Gaia data products. A two-dimensional projection of the latent variables reveals numerous overdensities, most of which strongly correlate with astrophysical properties, showing the potential of this latent space for astrophysical discovery. We show that the properties of our novel latent representation make it highly valuable for variability analysis tasks, including classification, clustering and outlier detection.

Comment: The paper uses variational autoencoders to learn novel representations from Gaia data, relevant to representation learning and autoencoders.

Relevance: 9 Novelty: 7

19. Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

ArXiv ID: 2505.16831

Authors: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du

Abstract: Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.

Comment: The paper investigates the reversibility of machine unlearning in LLMs, providing theoretical insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 7

20. Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

ArXiv ID: 2505.17315

Authors: Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han

Abstract: Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

Comment: The paper explores the role of long-context ability in reasoning, which is relevant to Large Language Models and their theoretical insights.

Relevance: 9 Novelty: 7

21. TI-DeepONet: Learnable Time Integration for Stable Long-Term Extrapolation

ArXiv ID: 2505.17341

Authors: Dibyajyoti Nayak, Somdatta Goswami

Abstract: Accurate temporal extrapolation presents a fundamental challenge for neural operators in modeling dynamical systems, where reliable predictions must extend significantly beyond the training time horizon. Conventional Deep Operator Network (DeepONet) approaches employ two inherently limited training paradigms - fixed-horizon rollouts that predict complete spatiotemporal solutions while disregarding temporal causality, and autoregressive formulations that accumulate errors through sequential predictions. We introduce TI-DeepONet, a framework that integrates neural operators with adaptive numerical time-stepping techniques to preserve the Markovian structure of dynamical systems while mitigating error propagation in extended temporal forecasting. Our approach reformulates the learning objective from direct state prediction to the approximation of instantaneous time-derivative fields, which are then integrated using established numerical schemes. This architecture supports continuous-time prediction and enables deployment of higher-precision integrators during inference than those used during training, balancing computational efficiency with predictive accuracy. We further develop TI(L)-DeepONet, which incorporates learnable coefficients for intermediate slopes in the integration process, adapting to solution-specific variations and enhancing fidelity. Evaluation across three canonical PDEs shows that TI(L)-DeepONet marginally outperforms TI-DeepONet, with both reducing relative L2 extrapolation errors: approximately 81% over autoregressive and 70% over fixed-horizon methods. Notably, both maintain prediction stability for temporal domains extending to about twice the training interval. This research establishes a physics-aware operator learning paradigm that bridges neural approximation with numerical analysis while preserving the causal structure of dynamical systems.

Comment: TI-DeepONet introduces a novel framework for neural operators with adaptive time-stepping, relevant to model architecture and emerging trends.

Relevance: 8 Novelty: 8

22. LaSER: How Learning Can Guide the Evolution of Equations

ArXiv ID: 2505.17309

Authors: Nam H. Le, Josh Bongard

Abstract: Evolution and learning are two distinct yet complementary forms of adaptation. While evolutionary processes operate across generations via the selection of genotypes, learning occurs within the lifetime of an individual, shaping behavior through phenotypic adjustment. The Baldwin effect describes how lifetime learning can improve evolutionary search without altering inherited structures. While this has proven effective in areas like neuroevolution, where gradient-based learning is often used to fine-tune weights or behaviors produced by evolution, it remains underexplored in systems that evolve non-differentiable symbolic structures like Genetic Programming (GP). GP evolves explicit syntax trees that represent equations, offering strong interpretability but limited generalization due to the burden of discovering both useful representations and precise mappings. Here, we show for the first time that integrating a simple form of supervised learning, applied at the semantic or behavioral level during evaluation, can effectively guide the evolution of equations in GP. To achieve this, we propose a new GP pipeline, LaSER (Latent Semantic Evolutionary Regression), where each GP individual generates a semantic representation that is passed to a supervised learner. The quality of the learned mapping is used to assign fitness, without modifying the underlying syntax tree or evolutionary process. Across standard symbolic regression benchmarks, in terms of generalization ability, LaSER significantly outperforms traditional GP and, in several cases, matches or exceeds popular machine learning regressors, while preserving the symbolic interpretability. By separating evolution from learning, LaSER offers a practical route to integrating GP with modern ML workflows, and opens new avenues for research at the intersection of evolutionary computation and representation learning.

Comment: LaSER integrates learning with genetic programming for symbolic regression, relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 8

23. Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space

ArXiv ID: 2505.16301

Authors: Fuchun Ge, Pavlo O. Dral

Abstract: Molecular dynamics (MD) is a powerful tool for exploring the behavior of atomistic systems, but its reliance on sequential numerical integration limits simulation efficiency. We present MDtrajNet-1, a foundational AI model that directly generates MD trajectories across chemical space, bypassing force calculations and integration. This approach accelerates simulations by up to two orders of magnitude compared to traditional MD, even those enhanced by machine-learning interatomic potentials. MDtrajNet-1 combines equivariant neural networks with a Transformer-based architecture to achieve strong accuracy and transferability in predicting long-time trajectories for both known and unseen systems. Remarkably, the errors of the trajectories generated by MDtrajNet-1 for various molecular systems are close to those of the conventional ab initio MD. The model's flexible design supports diverse application scenarios, including different statistical ensembles, boundary conditions, and interaction types. By overcoming the intrinsic speed barrier of conventional MD, MDtrajNet-1 opens new frontiers in efficient and scalable atomistic simulations.

Comment: The paper introduces a foundational AI model for molecular dynamics, which is relevant to AI for Science with a focus on architecture-level innovations.

Relevance: 8 Novelty: 8

24. Scalable Graph Generative Modeling via Substructure Sequences

ArXiv ID: 2505.16130

Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye

Abstract: Graph neural networks (GNNs) has been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node classification, graph classification, and transfer learning -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at https://github.com/Zehong-Wang/G2PM.

Comment: The paper presents a generative Transformer pre-training framework for graphs, which is relevant to model architecture innovations beyond message-passing in GNNs.

Relevance: 8 Novelty: 8

25. TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation

ArXiv ID: 2505.16923

Authors: Yuhui Zhang, Dongshen Wu, Yuichiro Wada, Takafumi Kanamori

Abstract: A reliable uncertainty estimation method is the foundation of many modern out-of-distribution (OOD) detectors, which are critical for safe deployments of deep learning models in the open world. In this work, we propose TULiP, a theoretically-driven post-hoc uncertainty estimator for OOD detection. Our approach considers a hypothetical perturbation applied to the network before convergence. Based on linearized training dynamics, we bound the effect of such perturbation, resulting in an uncertainty score computable by perturbing model parameters. Ultimately, our approach computes uncertainty from a set of sampled predictions. We visualize our bound on synthetic regression and classification datasets. Furthermore, we demonstrate the effectiveness of TULiP using large-scale OOD detection benchmarks for image classification. Our method exhibits state-of-the-art performance, particularly for near-distribution samples.

Comment: TULiP proposes a new method for uncertainty estimation using linearization and weight perturbation, which relates to representation learning and training dynamics.