This is a remedial run for missed papers from 05/22/2025 to 05/22/2025.
Results generated on 05/26/2025.
Personalized Daily ArXiv Papers 2025-05-23
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 55307 | 7125 | 62432 |
| Cost | $0.14 | $0.07 | $0.21 |
Total arXiv papers: 481
Total scanned papers: 481
Total relevant papers: 45
Table of contents with paper titles:
-
Structure-Aligned Protein Language Model Authors: Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier
-
The Computational Complexity of Counting Linear Regions in ReLU Neural Networks Authors: Moritz Stargalla, Christoph Hertrich, Daniel Reichman
-
Understanding Prompt Tuning and In-Context Learning via Meta-Learning Authors: Tim Genewein, Kevin Wenliang Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter
-
FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design Authors: Renjie Wei, Songqiang Xu, Qingyu Guo, Meng Li
-
Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning Authors: Adnan Oomerjee, Zafeirios Fountas, Zhongwei Yu, Haitham Bou-Ammar, Jun Wang
-
Latent Principle Discovery for Language Model Self-Improvement Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
-
Is Quantum Optimization Ready? An Effort Towards Neural Network Compression using Adiabatic Quantum Computing Authors: Zhehui Wanga, Benjamin Chen Ming Choonga, Tian Huang, Daniel Gerlinghoffa, Rick Siow Mong Goh, Cheng Liu, Tao Luo
-
NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics Authors: Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei
-
Robust Invariant Representation Learning by Distribution Extrapolation Authors: Kotaro Yoshida, Konstantinos Slavakis
-
Stochastic Forward-Forward Learning through Representational Dimensionality Compression Authors: Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng
-
From Compression to Expansion: A Layerwise Analysis of In-Context Learning Authors: Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu
-
TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning Authors: Florentin Beck, William Rudman, Carsten Eickhoff
-
Attention with Trained Embeddings Provably Selects Important Tokens Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli
-
Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions Authors: Jianhao Ma, Geyu Liang, Salar Fattahi
-
Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
-
JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model Authors: Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
-
Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation Authors: Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun
-
Learning novel representations of variable sources from multi-modal $\textit{Gaia}$ data via autoencoders Authors: P. Huijse, J. De Ridder, L. Eyer, L. Rimoldini, B. Holl, N. Chornay, J. Roquette, K. Nienartowicz, G. Jevardat de Fombelle, D. J. Fritzewski, A. Kemp, V. Vanlaer, M. Vanrespaille, H. Wang, M. I. Carnerero, C. M. Raiteri, G. Marton, M. Madarász, G. Clementini, P. Gavras, C. Aerts
-
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs Authors: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
-
Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning Authors: Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han
-
TI-DeepONet: Learnable Time Integration for Stable Long-Term Extrapolation Authors: Dibyajyoti Nayak, Somdatta Goswami
-
LaSER: How Learning Can Guide the Evolution of Equations Authors: Nam H. Le, Josh Bongard
-
Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space Authors: Fuchun Ge, Pavlo O. Dral
-
Scalable Graph Generative Modeling via Substructure Sequences Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
-
TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation Authors: Yuhui Zhang, Dongshen Wu, Yuichiro Wada, Takafumi Kanamori
-
AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training Authors: Huishuai Zhang, Bohan Wang, Luoxin Chen
-
Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods Authors: Majid Mohammadi, Siu Lun Chau, Krikamol Muandet
-
Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo
-
Training Long-Context LLMs Efficiently via Chunk-wise Optimization Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
-
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation Authors: Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
-
Large-Scale Bayesian Tensor Reconstruction: An Approximate Message Passing Solution Authors: Bingyang Cheng, Zhongtao Chen, Yichen Jin, Hao Zhang, Chen Zhang, Edmud Y. Lam, Yik-Chung Wu
-
ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training Authors: Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu
-
EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning Authors: Jiawei Liu, Qisi Chen, Jianshu Zhang, Quan Liu, Defu Lian
-
TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling Authors: Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan
-
Native Segmentation Vision Transformers Authors: Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé
-
SELF: Self-Extend the Context Length With Logistic Growth Function Authors: Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
-
HOFT: Householder Orthogonal Fine-tuning Authors: Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan
-
AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer Authors: Jiquan Shan, Junxiao Wang, Lifeng Zhao, Liang Cai, Hongyuan Zhang, Ioannis Liritzis
-
NAN: A Training-Free Solution to Coefficient Estimation in Model Merging Authors: Chongjie Si, Kangtao Lv, Jingjing Jiang, Yadao Wang, Yongwei Wang, Xiaokang Yang, Wenbo Su, Bo Zheng, Wei Shen
-
Understanding Differential Transformer Unchains Pretrained Self-Attentions Authors: Chaerin Kong, Jiho Jang, Nojun Kwak
-
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving Authors: Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
-
Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models Authors: Yue Li, Xin Yi, Dongsheng Shi, Gerard de Melo, Xiaoling Wang, Linlin Wang
-
PaTH Attention: Position Encoding via Accumulating Householder Transformations Authors: Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
-
Small-to-Large Generalization: Data Influences Models Consistently Across Scale Authors: Alaa Khaddaj, Logan Engstrom, Aleksander Madry
-
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models Authors: Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
1. Structure-Aligned Protein Language Model
ArXiv ID: 2505.16896
Authors: Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier
Abstract: Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY results in notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
Comment: Author match
2. The Computational Complexity of Counting Linear Regions in ReLU Neural Networks
ArXiv ID: 2505.16716
Authors: Moritz Stargalla, Christoph Hertrich, Daniel Reichman
Abstract: An established measure of the expressive power of a given ReLU neural network is the number of linear regions into which it partitions the input space. There exist many different, non-equivalent definitions of what a linear region actually is. We systematically assess which papers use which definitions and discuss how they relate to each other. We then analyze the computational complexity of counting the number of such regions for the various definitions. Generally, this turns out to be an intractable problem. We prove NP- and #P-hardness results already for networks with one hidden layer and strong hardness of approximation results for two or more hidden layers. Finally, on the algorithmic side, we demonstrate that counting linear regions can at least be achieved in polynomial space for some common definitions.
Comment: The paper analyzes the computational complexity of counting linear regions in ReLU networks, relevant to model architecture and theoretical insights.
Relevance: 9 Novelty: 8
3. Understanding Prompt Tuning and In-Context Learning via Meta-Learning
ArXiv ID: 2505.17010
Authors: Tim Genewein, Kevin Wenliang Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter
Abstract: Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
Comment: The paper provides a theoretical understanding of prompt tuning and in-context learning via meta-learning, relevant to large language models and representation learning.
Relevance: 9 Novelty: 8
4. FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design
ArXiv ID: 2505.16335
Authors: Renjie Wei, Songqiang Xu, Qingyu Guo, Meng Li
Abstract: Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, leading to better image quality and faster inference speed compared to existing diffusion models. However, the large parameter size and computation cost hinder its deployment on edge devices. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR featuring algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR. To address them, we propose Dual Format Quantization for the highly imbalanced input activation. We further propose Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation to address the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier with lookup tables on FPGA and propose the first FPGA-based VAR accelerator featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that compared to the state-of-the-art quantization method, our proposed FPQVAR significantly improves Fr\'echet Inception Distance (FID) from 10.83 to 3.58, Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves the performance of 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 image/s, which is 3.1x higher than the integer-based accelerator. It also demonstrates 3.6x and 2.8x higher energy efficiency compared to the integer-based accelerator and GPU baseline, respectively.
Comment: The paper presents a novel quantization framework for visual autoregressive models, which aligns with model compression through quantization and efficiency improvements.
Relevance: 9 Novelty: 8
5. Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning
ArXiv ID: 2505.16950
Authors: Adnan Oomerjee, Zafeirios Fountas, Zhongwei Yu, Haitham Bou-Ammar, Jun Wang
Abstract: Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We prove using IB theory that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters, as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.
Comment: The paper proposes a modification to the Transformer architecture for improved generalization, aligning with model architecture innovations and KV cache manipulation.
Relevance: 9 Novelty: 8
6. Latent Principle Discovery for Language Model Self-Improvement
ArXiv ID: 2505.16927
Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
Abstract: When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.
Comment: The paper focuses on a novel method for self-improvement in language models by discovering latent principles, which aligns with foundational research in representation learning and LLM behavior.
Relevance: 9 Novelty: 8
7. Is Quantum Optimization Ready? An Effort Towards Neural Network Compression using Adiabatic Quantum Computing
ArXiv ID: 2505.16332
Authors: Zhehui Wanga, Benjamin Chen Ming Choonga, Tian Huang, Daniel Gerlinghoffa, Rick Siow Mong Goh, Cheng Liu, Tao Luo
Abstract: Quantum optimization is the most mature quantum computing technology to date, providing a promising approach towards efficiently solving complex combinatorial problems. Methods such as adiabatic quantum computing (AQC) have been employed in recent years on important optimization problems across various domains. In deep learning, deep neural networks (DNN) have reached immense sizes to support new predictive capabilities. Optimization of large-scale models is critical for sustainable deployment, but becomes increasingly challenging with ever-growing model sizes and complexity. While quantum optimization is suitable for solving complex problems, its application to DNN optimization is not straightforward, requiring thorough reformulation for compatibility with commercially available quantum devices. In this work, we explore the potential of adopting AQC for fine-grained pruning-quantization of convolutional neural networks. We rework established heuristics to formulate model compression as a quadratic unconstrained binary optimization (QUBO) problem, and assess the solution space offered by commercial quantum annealing devices. Through our exploratory efforts of reformulation, we demonstrate that AQC can achieve effective compression of practical DNN models. Experiments demonstrate that adiabatic quantum computing (AQC) not only outperforms classical algorithms like genetic algorithms and reinforcement learning in terms of time efficiency but also excels at identifying global optima.
Comment: The paper explores quantum optimization for neural network compression, specifically using adiabatic quantum computing for pruning and quantization.
Relevance: 9 Novelty: 8
8. NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics
ArXiv ID: 2505.16210
Authors: Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with an 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.
Comment: The paper introduces NQKV, a KV cache quantization scheme, which directly relates to model compression and efficiency.
Relevance: 9 Novelty: 8
9. Robust Invariant Representation Learning by Distribution Extrapolation
ArXiv ID: 2505.16126
Authors: Kotaro Yoshida, Konstantinos Slavakis
Abstract: Invariant risk minimization (IRM) aims to enable out-of-distribution (OOD) generalization in deep learning by learning invariant representations. As IRM poses an inherently challenging bi-level optimization problem, most existing approaches -- including IRMv1 -- adopt penalty-based single-level approximations. However, empirical studies consistently show that these methods often fail to outperform well-tuned empirical risk minimization (ERM), highlighting the need for more robust IRM implementations. This work theoretically identifies a key limitation common to many IRM variants: their penalty terms are highly sensitive to limited environment diversity and over-parameterization, resulting in performance degradation. To address this issue, a novel extrapolation-based framework is proposed that enhances environmental diversity by augmenting the IRM penalty through synthetic distributional shifts. Extensive experiments -- ranging from synthetic setups to realistic, over-parameterized scenarios -- demonstrate that the proposed method consistently outperforms state-of-the-art IRM variants, validating its effectiveness and robustness.
Comment: The paper proposes a novel extrapolation-based framework for invariant representation learning, which aligns with the representation learning criterion.
Relevance: 9 Novelty: 8
10. Stochastic Forward-Forward Learning through Representational Dimensionality Compression
ArXiv ID: 2505.16649
Authors: Zhichao Zhu, Yang Qi, Hengyuan Ma, Wenlian Lu, Jianfeng Feng
Abstract: The Forward-Forward (FF) algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise "goodness" function to guide learning. Existing goodness functions, inspired by energy-based learning (EBL), are typically defined as the sum of squared post-synaptic activations, neglecting the correlations between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for clamped inputs when noise is considered while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples. We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared outputs, which is equivalent to making predictions based on the energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at https://github.com/ZhichaoZhu/StochasticForwardForward
Comment: The paper introduces a novel goodness function for the Forward-Forward algorithm, contributing to representation learning with a focus on dimensionality compression.
Relevance: 9 Novelty: 8
11. From Compression to Expansion: A Layerwise Analysis of In-Context Learning
ArXiv ID: 2505.17322
Authors: Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term Layerwise Compression-Expansion: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance -- improving with model size and the number of demonstrations -- and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.
Comment: The paper conducts a layerwise analysis of in-context learning in large language models, aligning with the large language models criterion.
Relevance: 9 Novelty: 8
12. TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
ArXiv ID: 2505.16743
Authors: Florentin Beck, William Rudman, Carsten Eickhoff
Abstract: Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
Comment: The paper introduces a novel pruning method, TRIM, which is relevant to model compression and sparsity.
Relevance: 9 Novelty: 8
13. Attention with Trained Embeddings Provably Selects Important Tokens
ArXiv ID: 2505.17282
Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli
Abstract: Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_X$ capture the importance of tokens in the dataset by aligning with the output vector $v$ proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training $p$ via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.
Comment: The paper provides theoretical insights into how attention with trained embeddings selects important tokens, relevant to Representation Learning.
Relevance: 9 Novelty: 8
14. Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions
ArXiv ID: 2505.17304
Authors: Jianhao Ma, Geyu Liang, Salar Fattahi
Abstract: Implicit regularization refers to the phenomenon where local search algorithms converge to low-dimensional solutions, even when such structures are neither explicitly specified nor encoded in the optimization problem. While widely observed, this phenomenon remains theoretically underexplored, particularly in modern over-parameterized problems. In this paper, we study the conditions that enable implicit regularization by investigating when gradient-based methods converge to second-order stationary points (SOSPs) within an implicit low-dimensional region of a smooth, possibly nonconvex function. We show that successful implicit regularization hinges on two key conditions: $(i)$ the ability to efficiently escape strict saddle points, while $(ii)$ maintaining proximity to the implicit region. Existing analyses enabling the convergence of gradient descent (GD) to SOSPs often rely on injecting large perturbations to escape strict saddle points. However, this comes at the cost of deviating from the implicit region. The central premise of this paper is that it is possible to achieve the best of both worlds: efficiently escaping strict saddle points using infinitesimal perturbations, while controlling deviation from the implicit region via a small deviation rate. We show that infinitesimally perturbed gradient descent (IPGD), which can be interpreted as GD with inherent ``round-off errors'', can provably satisfy both conditions. We apply our framework to the problem of over-parameterized matrix sensing, where we establish formal guarantees for the implicit regularization behavior of IPGD. We further demonstrate through extensive experiments that these insights extend to a broader class of learning problems.
Comment: The paper studies implicit regularization in gradient descent, which is relevant to representation learning and training dynamics.
Relevance: 9 Novelty: 8
15. Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence
ArXiv ID: 2505.16694
Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
Abstract: Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phases change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis lead to a deeper understanding of the source of the transformer's ICL ability.
Comment: The paper analyzes the emergence of multi-phase circuits in transformers, which is relevant to understanding transformer architecture and training dynamics.
Relevance: 9 Novelty: 8
16. JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model
ArXiv ID: 2505.17257
Authors: Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
Abstract: Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA
Comment: The paper presents a hybrid DNA foundation model with MoE architecture, which is relevant to model architecture innovations and efficiency.
Relevance: 9 Novelty: 8
17. Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
ArXiv ID: 2505.17288
Authors: Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun
Abstract: Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new next token predictor on good generations. The second method, Best-of-N, trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy a better rate of convergence in either n or a rate of convergence with better dependence on the response length.
Comment: The paper theoretically compares two methods for adapting large language models, focusing on the learning dynamics and convergence rates, which aligns with the Large Language Models criterion by providing theoretical insights into LLM behavior.
Relevance: 9 Novelty: 8
18. Learning novel representations of variable sources from multi-modal $\textit{Gaia}$ data via autoencoders
ArXiv ID: 2505.16320
Authors: P. Huijse, J. De Ridder, L. Eyer, L. Rimoldini, B. Holl, N. Chornay, J. Roquette, K. Nienartowicz, G. Jevardat de Fombelle, D. J. Fritzewski, A. Kemp, V. Vanlaer, M. Vanrespaille, H. Wang, M. I. Carnerero, C. M. Raiteri, G. Marton, M. Madarász, G. Clementini, P. Gavras, C. Aerts
Abstract: Gaia Data Release 3 (DR3) published for the first time epoch photometry, BP/RP (XP) low-resolution mean spectra, and supervised classification results for millions of variable sources. This extensive dataset offers a unique opportunity to study their variability by combining multiple Gaia data products. In preparation for DR4, we propose and evaluate a machine learning methodology capable of ingesting multiple Gaia data products to achieve an unsupervised classification of stellar and quasar variability. A dataset of 4 million Gaia DR3 sources is used to train three variational autoencoders (VAE), which are artificial neural networks (ANNs) designed for data compression and generation. One VAE is trained on Gaia XP low-resolution spectra, another on a novel approach based on the distribution of magnitude differences in the Gaia G band, and the third on folded Gaia G band light curves. Each Gaia source is compressed into 15 numbers, representing the coordinates in a 15-dimensional latent space generated by combining the outputs of these three models. The learned latent representation produced by the ANN effectively distinguishes between the main variability classes present in Gaia DR3, as demonstrated through both supervised and unsupervised classification analysis of the latent space. The results highlight a strong synergy between light curves and low-resolution spectral data, emphasising the benefits of combining the different Gaia data products. A two-dimensional projection of the latent variables reveals numerous overdensities, most of which strongly correlate with astrophysical properties, showing the potential of this latent space for astrophysical discovery. We show that the properties of our novel latent representation make it highly valuable for variability analysis tasks, including classification, clustering and outlier detection.
Comment: The paper uses variational autoencoders to learn novel representations from Gaia data, relevant to representation learning and autoencoders.
Relevance: 9 Novelty: 7
19. Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
ArXiv ID: 2505.16831
Authors: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
Abstract: Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
Comment: The paper investigates the reversibility of machine unlearning in LLMs, providing theoretical insights into LLM behavior and interpretability.
Relevance: 9 Novelty: 7
20. Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
ArXiv ID: 2505.17315
Authors: Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han
Abstract: Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.
Comment: The paper explores the role of long-context ability in reasoning, which is relevant to Large Language Models and their theoretical insights.
Relevance: 9 Novelty: 7
21. TI-DeepONet: Learnable Time Integration for Stable Long-Term Extrapolation
ArXiv ID: 2505.17341
Authors: Dibyajyoti Nayak, Somdatta Goswami
Abstract: Accurate temporal extrapolation presents a fundamental challenge for neural operators in modeling dynamical systems, where reliable predictions must extend significantly beyond the training time horizon. Conventional Deep Operator Network (DeepONet) approaches employ two inherently limited training paradigms - fixed-horizon rollouts that predict complete spatiotemporal solutions while disregarding temporal causality, and autoregressive formulations that accumulate errors through sequential predictions. We introduce TI-DeepONet, a framework that integrates neural operators with adaptive numerical time-stepping techniques to preserve the Markovian structure of dynamical systems while mitigating error propagation in extended temporal forecasting. Our approach reformulates the learning objective from direct state prediction to the approximation of instantaneous time-derivative fields, which are then integrated using established numerical schemes. This architecture supports continuous-time prediction and enables deployment of higher-precision integrators during inference than those used during training, balancing computational efficiency with predictive accuracy. We further develop TI(L)-DeepONet, which incorporates learnable coefficients for intermediate slopes in the integration process, adapting to solution-specific variations and enhancing fidelity. Evaluation across three canonical PDEs shows that TI(L)-DeepONet marginally outperforms TI-DeepONet, with both reducing relative L2 extrapolation errors: approximately 81% over autoregressive and 70% over fixed-horizon methods. Notably, both maintain prediction stability for temporal domains extending to about twice the training interval. This research establishes a physics-aware operator learning paradigm that bridges neural approximation with numerical analysis while preserving the causal structure of dynamical systems.
Comment: TI-DeepONet introduces a novel framework for neural operators with adaptive time-stepping, relevant to model architecture and emerging trends.
Relevance: 8 Novelty: 8
22. LaSER: How Learning Can Guide the Evolution of Equations
ArXiv ID: 2505.17309
Authors: Nam H. Le, Josh Bongard
Abstract: Evolution and learning are two distinct yet complementary forms of adaptation. While evolutionary processes operate across generations via the selection of genotypes, learning occurs within the lifetime of an individual, shaping behavior through phenotypic adjustment. The Baldwin effect describes how lifetime learning can improve evolutionary search without altering inherited structures. While this has proven effective in areas like neuroevolution, where gradient-based learning is often used to fine-tune weights or behaviors produced by evolution, it remains underexplored in systems that evolve non-differentiable symbolic structures like Genetic Programming (GP). GP evolves explicit syntax trees that represent equations, offering strong interpretability but limited generalization due to the burden of discovering both useful representations and precise mappings. Here, we show for the first time that integrating a simple form of supervised learning, applied at the semantic or behavioral level during evaluation, can effectively guide the evolution of equations in GP. To achieve this, we propose a new GP pipeline, LaSER (Latent Semantic Evolutionary Regression), where each GP individual generates a semantic representation that is passed to a supervised learner. The quality of the learned mapping is used to assign fitness, without modifying the underlying syntax tree or evolutionary process. Across standard symbolic regression benchmarks, in terms of generalization ability, LaSER significantly outperforms traditional GP and, in several cases, matches or exceeds popular machine learning regressors, while preserving the symbolic interpretability. By separating evolution from learning, LaSER offers a practical route to integrating GP with modern ML workflows, and opens new avenues for research at the intersection of evolutionary computation and representation learning.
Comment: LaSER integrates learning with genetic programming for symbolic regression, relevant to representation learning and emerging trends.
Relevance: 8 Novelty: 8
23. Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space
ArXiv ID: 2505.16301
Authors: Fuchun Ge, Pavlo O. Dral
Abstract: Molecular dynamics (MD) is a powerful tool for exploring the behavior of atomistic systems, but its reliance on sequential numerical integration limits simulation efficiency. We present MDtrajNet-1, a foundational AI model that directly generates MD trajectories across chemical space, bypassing force calculations and integration. This approach accelerates simulations by up to two orders of magnitude compared to traditional MD, even those enhanced by machine-learning interatomic potentials. MDtrajNet-1 combines equivariant neural networks with a Transformer-based architecture to achieve strong accuracy and transferability in predicting long-time trajectories for both known and unseen systems. Remarkably, the errors of the trajectories generated by MDtrajNet-1 for various molecular systems are close to those of the conventional ab initio MD. The model's flexible design supports diverse application scenarios, including different statistical ensembles, boundary conditions, and interaction types. By overcoming the intrinsic speed barrier of conventional MD, MDtrajNet-1 opens new frontiers in efficient and scalable atomistic simulations.
Comment: The paper introduces a foundational AI model for molecular dynamics, which is relevant to AI for Science with a focus on architecture-level innovations.
Relevance: 8 Novelty: 8
24. Scalable Graph Generative Modeling via Substructure Sequences
ArXiv ID: 2505.16130
Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Abstract: Graph neural networks (GNNs) has been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node classification, graph classification, and transfer learning -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at https://github.com/Zehong-Wang/G2PM.
Comment: The paper presents a generative Transformer pre-training framework for graphs, which is relevant to model architecture innovations beyond message-passing in GNNs.
Relevance: 8 Novelty: 8
25. TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation
ArXiv ID: 2505.16923
Authors: Yuhui Zhang, Dongshen Wu, Yuichiro Wada, Takafumi Kanamori
Abstract: A reliable uncertainty estimation method is the foundation of many modern out-of-distribution (OOD) detectors, which are critical for safe deployments of deep learning models in the open world. In this work, we propose TULiP, a theoretically-driven post-hoc uncertainty estimator for OOD detection. Our approach considers a hypothetical perturbation applied to the network before convergence. Based on linearized training dynamics, we bound the effect of such perturbation, resulting in an uncertainty score computable by perturbing model parameters. Ultimately, our approach computes uncertainty from a set of sampled predictions. We visualize our bound on synthetic regression and classification datasets. Furthermore, we demonstrate the effectiveness of TULiP using large-scale OOD detection benchmarks for image classification. Our method exhibits state-of-the-art performance, particularly for near-distribution samples.
Comment: TULiP proposes a new method for uncertainty estimation using linearization and weight perturbation, which relates to representation learning and training dynamics.
Relevance: 8 Novelty: 7
26. AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training
ArXiv ID: 2505.16363
Authors: Huishuai Zhang, Bohan Wang, Luoxin Chen
Abstract: We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$ smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
Comment: AdamS introduces a new optimizer for LLM pretraining and post-training, relevant to large language models and optimization techniques.
Relevance: 8 Novelty: 7
27. Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods
ArXiv ID: 2505.16516
Authors: Majid Mohammadi, Siu Lun Chau, Krikamol Muandet
Abstract: Kernel methods are widely used in machine learning due to their flexibility and expressive power. However, their black-box nature poses significant challenges to interpretability, limiting their adoption in high-stakes applications. Shapley value-based feature attribution techniques, such as SHAP and kernel-specific variants like RKHS-SHAP, offer a promising path toward explainability. Yet, computing exact Shapley values remains computationally intractable in general, motivating the development of various approximation schemes. In this work, we introduce PKeX-Shapley, a novel algorithm that utilizes the multiplicative structure of product kernels to enable the exact computation of Shapley values in polynomial time. We show that product-kernel models admit a functional decomposition that allows for a recursive formulation of Shapley values. This decomposition not only yields computational efficiency but also enhances interpretability in kernel-based learning. We also demonstrate how our framework can be generalized to explain kernel-based statistical discrepancies such as the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC), thus offering new tools for interpretable statistical inference.
Comment: The paper presents a method for computing exact Shapley values in polynomial time for product-kernel methods, relevant to representation learning and interpretability.
Relevance: 8 Novelty: 7
28. Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space
ArXiv ID: 2505.16386
Authors: Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo
Abstract: The increasing complexity of large-scale language models has amplified concerns regarding their interpretability and reusability. While traditional embedding models like Word2Vec and GloVe offer scalability, they lack transparency and often behave as black boxes. Conversely, interpretable models such as the Tsetlin Machine (TM) have shown promise in constructing explainable learning systems, though they previously faced limitations in scalability and reusability. In this paper, we introduce Omni Tsetlin Machine AutoEncoder (Omni TM-AE), a novel embedding model that fully exploits the information contained in the TM's state matrix, including literals previously excluded from clause formation. This method enables the construction of reusable, interpretable embeddings through a single training phase. Extensive experiments across semantic similarity, sentiment classification, and document clustering tasks show that Omni TM-AE performs competitively with and often surpasses mainstream embedding models. These results demonstrate that it is possible to balance performance, scalability, and interpretability in modern Natural Language Processing (NLP) systems without resorting to opaque architectures.
Comment: The paper introduces a novel embedding model using Tsetlin Machine, which aligns with model architecture innovations and interpretability.
Relevance: 8 Novelty: 7
29. Training Long-Context LLMs Efficiently via Chunk-wise Optimization
ArXiv ID: 2505.16710
Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
Abstract: While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose \textit{Sequential Chunk-wise Optimization} (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce \textit{Sparse Chunk-wise Optimization} (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands maximum sequence length from 1K to 16K tokens, while SpaCO demonstrates accelerated training speed -- achieving up to 3x faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at \href{https://github.com/wenhaoli-xmu/seco}{here}.
Comment: The paper introduces a memory-efficient training paradigm for long-context LLMs, which is relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
30. Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
ArXiv ID: 2505.16146
Authors: Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
Abstract: Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.
Comment: The paper leverages sparse autoencoders to mitigate hallucinations in LVLMs, which is relevant to representation learning and model architecture.
Relevance: 8 Novelty: 7
31. Large-Scale Bayesian Tensor Reconstruction: An Approximate Message Passing Solution
ArXiv ID: 2505.16305
Authors: Bingyang Cheng, Zhongtao Chen, Yichen Jin, Hao Zhang, Chen Zhang, Edmud Y. Lam, Yik-Chung Wu
Abstract: Tensor CANDECOMP/PARAFAC decomposition (CPD) is a fundamental model for tensor reconstruction. Although the Bayesian framework allows for principled uncertainty quantification and automatic hyperparameter learning, existing methods do not scale well for large tensors because of high-dimensional matrix inversions. To this end, we introduce CP-GAMP, a scalable Bayesian CPD algorithm. This algorithm leverages generalized approximate message passing (GAMP) to avoid matrix inversions and incorporates an expectation-maximization routine to jointly infer the tensor rank and noise power. Through multiple experiments, for synthetic 100x100x100 rank 20 tensors with only 20% elements observed, the proposed algorithm reduces runtime by 82.7% compared to the state-of-the-art variational Bayesian CPD method, while maintaining comparable reconstruction accuracy.
Comment: The paper presents a scalable Bayesian CPD algorithm, which is relevant to model compression and efficiency through low-rank approaches.
Relevance: 8 Novelty: 7
32. ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training
ArXiv ID: 2505.17331
Authors: Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu
Abstract: This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining its learning capacity. ECHO-LLaMA transforms LLaMA models into shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77\% higher token-per-second throughput during training, up to 16\% higher Model FLOPs Utilization (MFU), and up to 14\% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7\% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
Comment: The paper introduces ECHO-LLaMA, an efficient LLaMA architecture with shared KV caching, which is relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
33. EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning
ArXiv ID: 2505.16312
Authors: Jiawei Liu, Qisi Chen, Jianshu Zhang, Quan Liu, Defu Lian
Abstract: Large Language Models (LLMs) excel at complex reasoning through search algorithms, yet current strategies often suffer from massive token consumption due to redundant exploration of semantically equivalent steps. Existing semantic similarity methods struggle to accurately identify such equivalence in domain-specific contexts like mathematical reasoning. To address this, we propose EquivPruner, a simple yet effective approach that identifies and prunes semantically equivalent actions during LLM reasoning search. We also introduce MathEquiv, the first dataset we created for mathematical statement equivalence, which enables the training of a lightweight equivalence detector. Extensive experiments across various models and tasks demonstrate that EquivPruner significantly reduces token consumption, improving searching efficiency and often bolstering reasoning accuracy. For instance, when applied to Qwen2.5-Math-7B-Instruct on GSM8K, EquivPruner reduced token consumption by 48.1\% while also improving accuracy. Our code is available at https://github.com/Lolo1222/EquivPruner.
Comment: The paper introduces EquivPruner, a method for pruning semantically equivalent actions in LLM reasoning, which relates to model compression and efficiency.
Relevance: 8 Novelty: 7
34. TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling
ArXiv ID: 2505.17155
Authors: Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan
Abstract: Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key inefficiency source is LRMs often generate redundant thinking CoTs, which demonstrate clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the four MATH500, AIME24, AIME25, and GPQA benchmarks, the reasoning runtime of Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.
Comment: The paper introduces TrimR, a framework for dynamic CoT compression in reasoning models, which relates to model compression and efficiency.
Relevance: 8 Novelty: 7
35. Native Segmentation Vision Transformers
ArXiv ID: 2505.16993
Authors: Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé
Abstract: Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer, that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages results in hierarchical segmentation that arises natively in the feature extraction process, resulting in our coined Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is https://research.nvidia.com/labs/dvl/projects/native-segmentation.
Comment: The paper proposes a new vision transformer architecture with native segmentation capabilities, aligning with the model architecture criterion.
Relevance: 8 Novelty: 7
36. SELF: Self-Extend the Context Length With Logistic Growth Function
ArXiv ID: 2505.17296
Authors: Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Abstract: Large language models suffer issues when operated on long contexts that are larger than their training context length due to the standard position encoding for tokens in the attention layer. Tokens a long distance apart will rarely have an effect on each other and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution of grouping consecutive tokens at varying group sizes using a logistic capacity equation combined with a constant group size at smaller relative distances. Our model had an increase in performance of up to 12% compared to the LongLM extension method in LEval (specifically on the Qwen model). On summarization related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than the LongLM. Our code is available at https://github.com/alexeipc/SELF-LLM.
Comment: The paper proposes a method to extend context length in large language models, which aligns with the large language models criterion.
Relevance: 8 Novelty: 7
37. HOFT: Householder Orthogonal Fine-tuning
ArXiv ID: 2505.16531
Authors: Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan
Abstract: Adaptation of foundation models using low-rank methods is a widespread approach. Another way to adapt these models is to employ orthogonal fine-tuning methods, which are less time and memory efficient despite their good generalization properties. In this work, we propose Householder Orthogonal Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to alleviate time and space complexity. Moreover, some theoretical properties of the orthogonal fine-tuning paradigm are explored. From this exploration, Scaled Householder Orthogonal Fine-tuning (SHOFT) is proposed. Both HOFT and SHOFT are evaluated in downstream tasks, namely commonsense reasoning, machine translation, subject-driven generation and mathematical reasoning. Compared with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or better results.
Comment: The paper proposes a novel orthogonal fine-tuning method for foundation models, aligning with the model compression criterion.
Relevance: 8 Novelty: 7
38. AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer
ArXiv ID: 2505.16463
Authors: Jiquan Shan, Junxiao Wang, Lifeng Zhao, Liang Cai, Hongyuan Zhang, Ioannis Liritzis
Abstract: Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given $n$ patches, they will have quadratic complexity such as $\mathcal{O}(n^2)$ and the time cost is high when splitting the input image with a small granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs the anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity will be reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, where $m$ is an anchor number and $m < n$. Notably, by representing the anchors with the neurons in a neural layer, we can differentiable learn these distributions and approximate global self-attention through the Markov process. Moreover, we extend the proposed model to three downstream tasks including classification, detection, and segmentation. Extensive experiments show the effectiveness of our AnchorFormer, e.g., achieving up to a 9.0% higher accuracy or 46.7% FLOPs reduction on ImageNet classification, 81.3% higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.
Comment: The paper proposes an efficient vision transformer architecture, AnchorFormer, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
39. NAN: A Training-Free Solution to Coefficient Estimation in Model Merging
ArXiv ID: 2505.16148
Authors: Chongjie Si, Kangtao Lv, Jingjing Jiang, Yadao Wang, Yongwei Wang, Xiaokang Yang, Wenbo Su, Bo Zheng, Wei Shen
Abstract: Model merging offers a training-free alternative to multi-task learning by combining independently fine-tuned models into a unified one without access to raw data. However, existing approaches often rely on heuristics to determine the merging coefficients, limiting their scalability and generality. In this work, we revisit model merging through the lens of least-squares optimization and show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. Based on this insight, we propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies. Extensive experiments on show that NAN consistently improves performance of baseline methods.
Comment: The paper discusses a training-free method for model merging, which is relevant to model efficiency and compression.
Relevance: 8 Novelty: 7
40. Understanding Differential Transformer Unchains Pretrained Self-Attentions
ArXiv ID: 2505.16333
Authors: Chaerin Kong, Jiho Jang, Nojun Kwak
Abstract: Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, Differential Transformer architecture demands large-scale training from scratch, hindering utilization of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a lightweight differential operation on the output value matrix, DEX effectively incorporates the key advantages of differential attention while remaining lightweight in both training and inference. Evaluations confirm that DEX substantially improves the pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01\%).
Comment: The paper investigates Differential Transformer and proposes DEX, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
41. DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
ArXiv ID: 2505.16278
Authors: Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.
Comment: The paper focuses on a novel MoE-based framework for autonomous driving, which aligns with the Model Architecture criterion.
Relevance: 8 Novelty: 7
42. Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models
ArXiv ID: 2505.16104
Authors: Yue Li, Xin Yi, Dongsheng Shi, Gerard de Melo, Xiaoling Wang, Linlin Wang
Abstract: With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.
Comment: The paper presents a novel approach to restore safety in pruned LVLMs, aligning with Model Compression.
Relevance: 8 Novelty: 7
43. PaTH Attention: Position Encoding via Accumulating Householder Transformations
ArXiv ID: 2505.16381
Authors: Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
Abstract: The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
Comment: The paper introduces a novel position encoding scheme for transformers, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
44. Small-to-Large Generalization: Data Influences Models Consistently Across Scale
ArXiv ID: 2505.16260
Authors: Alaa Khaddaj, Logan Engstrom, Aleksander Madry
Abstract: Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affects predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
Comment: The paper explores the influence of training data distribution on model behavior across scales, which is relevant to representation learning and training dynamics.
Relevance: 8 Novelty: 7
45. Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
ArXiv ID: 2505.16416
Authors: Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at https://github.com/lose4578/CircleRoPE.
Comment: The paper introduces Circle-RoPE, a novel positional encoding scheme for large vision-language models, which aligns with the Model Architecture criterion by proposing a new encoding structure. It also addresses cross-modal biases, which is a significant insight into model behavior.
Relevance: 8 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.