Personalized Daily Arxiv Papers 02/03/2025
| Prompt | Completion | Total | |
|---|---|---|---|
| Token | 85599 | 7805 | 93404 |
| Cost | $2.14 | $0.78 | $2.92 |
Total scanned papers: 298
Total relevant papers: 27
Table of contents with paper titles:
-
Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germ\'an Pizarro, Mat\'ias Fuentes, Francisco Meza, Cristian Buc, Crist\'obal Rojas
-
TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao
-
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models Authors: Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
-
Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected Authors: Yingtao Zhang, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
-
A theoretical framework for overfitting in energy-based modeling Authors: Giovanni Catania, Aur\'elien Decelle, Cyril Furtlehner, Beatriz Seoane
-
Norm-Bounded Low-Rank Adaptation Authors: Ruigang Wang, Krishnamurthy Dvijotham, Ian R. Manchester
-
Low-Rank Adapting Models for Sparse Autoencoders Authors: Matthew Chen, Joshua Engels, Max Tegmark
-
An Invitation to Neuroalgebraic Geometry Authors: Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathl\'en Kohn
-
A Metric for the Balance of Information in Graph Learning Authors: Alex O. Davies, Nirav S. Ajmeri, Telmo de Menezes e Silva Filho
-
Self-Supervised Learning Using Nonlinear Dependence Authors: M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth
-
Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models Authors: Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci
-
Memory-Efficient Fine-Tuning of Transformers via Token Selection Authors: Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
-
\underline{E2}Former: A Linear-time \underline{E}fficient and \underline{E}quivariant Trans\underline{former} for Scalable Molecular Modeling Authors: Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Jia Zhang, Mark Gerstein, Tao Qin
-
Symmetric Pruning of Large Language Models Authors: Kai Yi, Peter Richt\'arik
-
A Comunication Framework for Compositional Generation Authors: Rafael Elberg, Mircea Petrache, Denis Parra
-
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution Authors: Shichang Zhang, Tessa Han, Usha Bhalla, Hima Lakkaraju
-
Neural SDEs as a Unified Approach to Continuous-Domain Sequence Modeling Authors: Macheng Shen, Chen Cheng
-
Error Slice Discovery via Manifold Compactness Authors: Han Yu, Jiashuo Liu, Hao Zou, Renzhe Xu, Yue He, Xingxuan Zhang, Peng Cui
-
BCAT: A Block Causal Transformer for PDE Foundation Models for Fluid Dynamics Authors: Yuxuan Liu, Jingmin Sun, Hayden Schaeffer
-
Temperature-Annealed Boltzmann Generators Authors: Henrik Schopmans, Pascal Friederich
-
Understanding Oversmoothing in GNNs as Consensus in Opinion Dynamics Authors: Keqin Wang, Yulong Yang, Ishan Saha, Christine Allen-Blanchette
-
Can We Predict the Effect of Prompts? Authors: Jae Yong Lee, Sungmin Kang, Shin Yoo
-
Unraveling Zeroth-Order Optimization through the Lens of Low-Dimensional Structured Perturbations Authors: Sihwan Park, Jihun Yun, SungYub Kim, Souvik Kundu, Eunho Yang
-
What is causal about causal models and representations? Authors: Frederik Hytting J{\o}rgensen, Luigi Gresele, Sebastian Weichwald
-
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning Authors: Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang
-
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment Authors: Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Sch\"onfeld, Ali Thabet, Jonas Kohler
-
Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment Authors: Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
1. Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method
ArXiv ID: 2501.19215
Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germ\'an Pizarro, Mat\'ias Fuentes, Francisco Meza, Cristian Buc, Crist\'obal Rojas
Abstract: We propose a novel method to evaluate the theoretical limits of Transformers, allowing us to prove the first lower bounds against one-layer softmax Transformers with infinite precision. We establish those bounds for three tasks that require advanced reasoning. The first task, Match3 (Sanford et al., 2023), requires looking at all triples of positions. The second and third tasks address compositionality-based reasoning: one is composition of functions (Peng et al., 2024) and the other is composition of binary relations. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. In an attempt to overcome these limitations, we introduce Strassen attention and prove that with this mechanism a one-layer Transformer can in principle solve all these tasks. We also show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard (Vaswani et al, 2017), higher-order attention (Sanford et al., 2023) and triangular attention (Bergen et al. 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.
Comment: This paper addresses an innovative attention mechanism (Strassen Attention) for improving transformer scalability and reasoning abilities, directly aligning with model architecture advancements. It provides theoretical and experimental insights.
Relevance: 10 Novelty: 9
2. TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs
ArXiv ID: 2501.19057
Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao
Abstract: Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout the training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimension. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the training cost. TeZO can also be easily extended to the Adam variant while consuming less memory than MeZO-SGD, and requiring about only 35% memory of MeZO-Adam. Both comprehensive theoretical analysis and extensive experimental research have validated its efficiency, achieving SOTA-comparable results with lower overhead of time and memory.
Comment: The paper explores a novel method for memory-efficient fine-tuning of LLMs by leveraging low-rankness across temporal dimensions and employing Canonical Polyadic Decomposition (CPD). This is closely tied to the 'Model Compression' criteria.
Relevance: 10 Novelty: 9
3. Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
ArXiv ID: 2501.19392
Authors: Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
Abstract: Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
Comment: This paper addresses KV cache compression in LLMs, explicitly aligning with the model compression criterion. The introduction of AQUA-KV for adaptive quantization is relevant and demonstrates novel efficiency improvements.
Relevance: 10 Novelty: 8
4. Brain-inspired sparse training enables Transformers and LLMs to perform as fully connected
ArXiv ID: 2501.19107
Authors: Yingtao Zhang, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
Abstract: This study aims to enlarge our current knowledge on application of brain-inspired network science principles for training artificial neural networks (ANNs) with sparse connectivity. Dynamic sparse training (DST) can reduce the computational demands in ANNs, but faces difficulties to keep peak performance at high sparsity levels. The Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages a gradient-free, topology-driven link regrowth, which has shown ultra-sparse (1% connectivity or lower) advantage across various tasks compared to fully connected networks. Yet, CHT suffers two main drawbacks: (i) its time complexity is O(Nd^3) - N node network size, d node degree - hence it can apply only to ultra-sparse networks. (ii) it selects top link prediction scores, which is inappropriate for the early training epochs, when the network presents unreliable connections. We propose a GPU-friendly approximation of the CH link predictor, which reduces the computational complexity to O(N^3), enabling a fast implementation of CHT in large-scale models. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. To improve performance, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that, using 1% of connections, CHTs outperforms fully connected networks in MLP on visual classification tasks, compressing some networks to < 30% nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Using 30% of the connections, CHTss achieves superior performance compared to other dynamic sparse training methods in language modeling, and it surpasses the fully connected counterpart in zero-shot evaluations.
Comment: The use of brain-inspired sparse training and dynamic sparse connectivity in transformer-based models is directly relevant to sparsity and model compression, fitting well with foundational contributions in efficiency and architecture.
Relevance: 9 Novelty: 9
5. A theoretical framework for overfitting in energy-based modeling
ArXiv ID: 2501.19158
Authors: Giovanni Catania, Aur\'elien Decelle, Cyril Furtlehner, Beatriz Seoane
Abstract: We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models.
Comment: Develops a theoretical framework for overfitting in energy-based generative models, exploring spectral learning dynamics. Matches foundational research in representation learning.
Relevance: 9 Novelty: 9
6. Norm-Bounded Low-Rank Adaptation
ArXiv ID: 2501.19050
Authors: Ruigang Wang, Krishnamurthy Dvijotham, Ian R. Manchester
Abstract: In this work, we propose norm-bounded low-rank adaptation (NB-LoRA) for parameter-efficient fine tuning. We introduce two parameterizations that allow explicit bounds on each singular value of the weight adaptation matrix, which can therefore satisfy any prescribed unitarily invariant norm bound, including the Schatten norms (e.g., nuclear, Frobenius, spectral norm). The proposed parameterizations are unconstrained and complete, i.e. they cover all matrices satisfying the prescribed rank and norm constraints. Experiments on vision fine-tuning benchmarks show that the proposed approach can achieve good adaptation performance while avoiding model catastrophic forgetting and also substantially improve robustness to a wide range of hyper-parameters, including adaptation rank, learning rate and number of training epochs. We also explore applications in privacy-preserving model merging and low-rank matrix completion.
Comment: Proposes NB-LoRA for parameter-efficient fine-tuning with norm bounds on adaptation matrices. This paper aligns closely with model compression, particularly low-rank adaptation techniques, making it highly relevant.
Relevance: 10 Novelty: 8
7. Low-Rank Adapting Models for Sparse Autoencoders
ArXiv ID: 2501.19406
Authors: Matthew Chen, Joshua Engels, Max Tegmark
Abstract: Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we improve on these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma-2-2B and 2$\times$ to 10$\times$ faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
Comment: This paper attempts to improve sparse autoencoders by combining low-rank adaptation with interpretability-driven design, making it directly relevant to topics like sparsity, low-rank techniques, and sparse autoencoders.
Relevance: 10 Novelty: 8
8. An Invitation to Neuroalgebraic Geometry
ArXiv ID: 2501.18915
Authors: Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathl\'en Kohn
Abstract: In this expository work, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, such as neural networks with polynomial activations, whose associated function spaces are semi-algebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.
Comment: The paper introduces a new theoretical framework connecting algebraic geometry and machine learning, specifically targeting neural networks. This aligns with 'Representation Learning' as it provides unique insights on the expressivity and training dynamics of neural networks.
Relevance: 9 Novelty: 9
9. A Metric for the Balance of Information in Graph Learning
ArXiv ID: 2501.19137
Authors: Alex O. Davies, Nirav S. Ajmeri, Telmo de Menezes e Silva Filho
Abstract: Graph learning on molecules makes use of information from both the molecular structure and the features attached to that structure. Much work has been conducted on biasing either towards structure or features, with the aim that bias bolsters performance. Identifying which information source a dataset favours, and therefore how to approach learning that dataset, is an open issue. Here we propose Noise-Noise Ratio Difference (NNRD), a quantitative metric for whether there is more useful information in structure or features. By employing iterative noising on features and structure independently, leaving the other intact, NNRD measures the degradation of information in each. We employ NNRD over a range of molecular tasks, and show that it corresponds well to a loss of information, with intuitive results that are more expressive than simple performance aggregates. Our future work will focus on expanding data domains, tasks and types, as well as refining our choice of baseline model.
Comment: Introduces a metric (NNRD) to balance structural and feature information in graph learning. Contains fundamental insights into managing representation biases in molecular graph data.
Relevance: 9 Novelty: 8
10. Self-Supervised Learning Using Nonlinear Dependence
ArXiv ID: 2501.18875
Authors: M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth
Abstract: Self-supervised learning has gained significant attention in contemporary applications, particularly due to the scarcity of labeled data. While existing SSL methodologies primarily address feature variance and linear correlations, they often neglect the intricate relations between samples and the nonlinear dependencies inherent in complex data. In this paper, we introduce Correlation-Dependence Self-Supervised Learning (CDSSL), a novel framework that unifies and extends existing SSL paradigms by integrating both linear correlations and nonlinear dependencies, encapsulating sample-wise and feature-wise interactions. Our approach incorporates the Hilbert-Schmidt Independence Criterion (HSIC) to robustly capture nonlinear dependencies within a Reproducing Kernel Hilbert Space, enriching representation learning. Experimental evaluations on diverse benchmarks demonstrate the efficacy of CDSSL in improving representation quality.
Comment: Introduces CDSSL, a self-supervised learning framework incorporating nonlinear dependencies through HSIC. This aligns well with representation learning as it focuses on enhancing representation quality through novel methodologies for capturing complex data relationships.
Relevance: 9 Novelty: 8
11. Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
ArXiv ID: 2501.19090
Authors: Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci
Abstract: The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its tensor coherence and GPU compatibility across all densities. However, low-rank pruning has struggled to match the performance of semi-structured pruning, often doubling perplexity (PPL) at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that unsupervisedly learns a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving an additional 24.2\% memory savings and 24.6\% faster inference over low-rank layers at r/d = 0.5, thereby significantly enhancing performance at the same density. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility.
Comment: The paper introduces Pivoting Factorization, which directly applies to low-rank compression in large language models, making it highly relevant to model compression and efficiency techniques.
Relevance: 9 Novelty: 8
12. Memory-Efficient Fine-Tuning of Transformers via Token Selection
ArXiv ID: 2501.18824
Authors: Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
Abstract: Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is available at https://github.com/facebookresearch/tokentune.
Comment: TokenTune proposes efficient fine-tuning for transformers through token selection. It focuses on memory-efficient training, fitting well into 'Model Compression' and highlights methodological contributions.
Relevance: 9 Novelty: 8
13. \underline{E2}Former: A Linear-time \underline{E}fficient and \underline{E}quivariant Trans\underline{former} for Scalable Molecular Modeling
ArXiv ID: 2501.19216
Authors: Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Jia Zhang, Mark Gerstein, Tao Qin
Abstract: Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $ O(| \mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.
Comment: The E2Former introduces a novel efficient and equivariant transformer architecture with significant computational speedups, aligning well with the 'Model Architecture' criterion and particularly Transformer-based innovations.
Relevance: 9 Novelty: 8
14. Symmetric Pruning of Large Language Models
ArXiv ID: 2501.18980
Authors: Kai Yi, Peter Richt\'arik
Abstract: Popular post-training pruning methods such as Wanda and RIA are known for their simple, yet effective, designs that have shown exceptional empirical performance. Wanda optimizes performance through calibrated activations during pruning, while RIA emphasizes the relative, rather than absolute, importance of weight elements. Despite their practical success, a thorough theoretical foundation explaining these outcomes has been lacking. This paper introduces new theoretical insights that redefine the standard minimization objective for pruning, offering a deeper understanding of the factors contributing to their success. Our study extends beyond these insights by proposing complementary strategies that consider both input activations and weight significance. We validate these approaches through rigorous experiments, demonstrating substantial enhancements over existing methods. Furthermore, we introduce a novel training-free fine-tuning approach $R^2$-DSnoT that incorporates relative weight importance and a regularized decision boundary within a dynamic pruning-and-growing framework, significantly outperforming strong baselines and establishing a new state of the art.
Comment: This paper delves into symmetric pruning methods for large language models, integrating theoretical insights and proposing new strategies, aligning well with 'Model Compression' relevance.
Relevance: 9 Novelty: 8
15. A Comunication Framework for Compositional Generation
ArXiv ID: 2501.19182
Authors: Rafael Elberg, Mircea Petrache, Denis Parra
Abstract: Compositionality and compositional generalization--the ability to understand novel combinations of known concepts--are central characteristics of human language and are hypothesized to be essential for human cognition. In machine learning, the emergence of this property has been studied in a communication game setting, where independent agents (a sender and a receiver) converge to a shared encoding policy from a set of states to a space of discrete messages, where the receiver can correctly reconstruct the states observed by the sender using only the sender's messages. The use of communication games in generation tasks is still largely unexplored, with recent methods for compositional generation focusing mainly on the use of supervised guidance (either through class labels or text). In this work, we take the first steps to fill this gap, and we present a self-supervised generative communication game-based framework for creating compositional encodings in learned representations from pre-trained encoder-decoder models. In an Iterated Learning (IL) protocol involving a sender and a receiver, we apply alternating pressures for compression and diversity of encoded discrete messages, so that the protocol converges to an efficient but unambiguous encoding. Approximate message entropy regularization is used to favor compositional encodings. Our framework is based on rigorous justifications and proofs of defining and balancing the concepts of Eficiency, Unambiguity and Non-Holisticity in encoding. We test our method on the compositional image dataset Shapes3D, demonstrating robust performance in both reconstruction and compositionality metrics, surpassing other tested discrete message frameworks.
Comment: Explores compositional generation through iterative pressure-based protocol, introducing useful insights into encoder-decoder paradigms and representation learning.
Relevance: 9 Novelty: 8
16. Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
ArXiv ID: 2501.18887
Authors: Shichang Zhang, Tessa Han, Usha Bhalla, Hima Lakkaraju
Abstract: The increasing complexity of AI systems has made understanding their behavior a critical challenge. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of approaches and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across three domains and present a unified view to demonstrate that these seemingly distinct methods employ similar approaches, such as perturbations, gradients, and linear approximations, differing primarily in their perspectives rather than core techniques. Our unified perspective enhances understanding of existing attribution methods, identifies shared concepts and challenges, makes this field more accessible to newcomers, and highlights new directions not only for attribution and interpretability but also for broader AI research, including model editing, steering, and regulation.
Comment: Unifies attribution methodologies and contributes to enhancing interpretability in AI systems; aligns with emerging trends in foundational understanding.
Relevance: 8 Novelty: 9
17. Neural SDEs as a Unified Approach to Continuous-Domain Sequence Modeling
ArXiv ID: 2501.18871
Authors: Macheng Shen, Chen Cheng
Abstract: Inspired by the ubiquitous use of differential equations to model continuous dynamics across diverse scientific and engineering domains, we propose a novel and intuitive approach to continuous sequence modeling. Our method interprets time-series data as \textit{discrete samples from an underlying continuous dynamical system}, and models its time evolution using Neural Stochastic Differential Equation (Neural SDE), where both the flow (drift) and diffusion terms are parameterized by neural networks. We derive a principled maximum likelihood objective and a \textit{simulation-free} scheme for efficient training of our Neural SDE model. We demonstrate the versatility of our approach through experiments on sequence modeling tasks across both embodied and generative AI. Notably, to the best of our knowledge, this is the first work to show that SDE-based continuous-time modeling also excels in such complex scenarios, and we hope that our work opens up new avenues for research of SDE models in high-dimensional and temporally intricate domains.
Comment: Proposes Neural SDEs for sequence modeling in a continuous-time perspective, offering foundational insights into time-series representation learning. Novel approach to intrinsic modeling dynamics.
Relevance: 8 Novelty: 8
18. Error Slice Discovery via Manifold Compactness
ArXiv ID: 2501.19032
Authors: Han Yu, Jiashuo Liu, Hao Zou, Renzhe Xu, Yue He, Xingxuan Zhang, Peng Cui
Abstract: Despite the great performance of deep learning models in many areas, they still make mistakes and underperform on certain subsets of data, i.e. error slices. Given a trained model, it is important to identify its semantically coherent error slices that are easy to interpret, which is referred to as the error slice discovery problem. However, there is no proper metric of slice coherence without relying on extra information like predefined slice labels. Current evaluation of slice coherence requires access to predefined slices formulated by metadata like attributes or subclasses. Its validity heavily relies on the quality and abundance of metadata, where some possible patterns could be ignored. Besides, current algorithms cannot directly incorporate the constraint of coherence into their optimization objective due to the absence of an explicit coherence metric, which could potentially hinder their effectiveness. In this paper, we propose manifold compactness, a coherence metric without reliance on extra information by incorporating the data geometry property into its design, and experiments on typical datasets empirically validate the rationality of the metric. Then we develop Manifold Compactness based error Slice Discovery (MCSD), a novel algorithm that directly treats risk and coherence as the optimization objective, and is flexible to be applied to models of various tasks. Extensive experiments on the benchmark and case studies on other typical datasets demonstrate the superiority of MCSD.
Comment: The paper introduces a novel metric for evaluating coherence in error slice discovery using manifold compactness and optimizes for risk and coherence simultaneously, which aligns with criteria around representation learning and training dynamics.
Relevance: 8 Novelty: 8
19. BCAT: A Block Causal Transformer for PDE Foundation Models for Fluid Dynamics
ArXiv ID: 2501.18972
Authors: Yuxuan Liu, Jingmin Sun, Hayden Schaeffer
Abstract: We introduce BCAT, a PDE foundation model designed for autoregressive prediction of solutions to two dimensional fluid dynamics problems. Our approach uses a block causal transformer architecture to model next frame predictions, leveraging previous frames as contextual priors rather than relying solely on sub-frames or pixel-based inputs commonly used in image generation methods. This block causal framework more effectively captures the spatial dependencies inherent in nonlinear spatiotemporal dynamics and physical phenomena. In an ablation study, next frame prediction demonstrated a 2.9x accuracy improvement over next token prediction. BCAT is trained on a diverse range of fluid dynamics datasets, including incompressible and compressible Navier-Stokes equations across various geometries and parameter regimes, as well as the shallow-water equations. The model's performance was evaluated on 6 distinct downstream prediction tasks and tested on about 8K trajectories to measure robustness on a variety of fluid dynamics simulations. BCAT achieved an average relative error of 1.92% across all evaluation tasks, outperforming prior approaches on standard benchmarks.
Comment: The paper introduces a PDE foundation model with a novel block causal Transformer architecture, offering significant advancements in dynamic spatiotemporal modeling, which aligns with model architecture innovation.
Relevance: 8 Novelty: 8
20. Temperature-Annealed Boltzmann Generators
ArXiv ID: 2501.19077
Authors: Henrik Schopmans, Pascal Friederich
Abstract: Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a longstanding challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collapse and often do not learn to sample the full configurational space. Here, we present temperature-annealed Boltzmann generators (TA-BG) to address this challenge. First, we demonstrate that training a normalizing flow with the reverse Kullback-Leibler divergence at high temperatures is possible without mode collapse. Furthermore, we introduce a reweighting-based training objective to anneal the distribution to lower target temperatures. We apply this methodology to three molecular systems of increasing complexity and, compared to the baseline, achieve better results in almost all metrics while requiring up to three times fewer target energy evaluations. For the largest system, our approach is the only method that accurately resolves the metastable states of the system.
Comment: Proposes temperature-annealed Boltzmann generators for efficient sampling in molecular systems, demonstrating innovations in energy efficiency, which aligns with AI for science criteria focused on foundational molecular modeling.
Relevance: 8 Novelty: 8
21. Understanding Oversmoothing in GNNs as Consensus in Opinion Dynamics
ArXiv ID: 2501.19089
Authors: Keqin Wang, Yulong Yang, Ishan Saha, Christine Allen-Blanchette
Abstract: In contrast to classes of neural networks where the learned representations become increasingly expressive with network depth, the learned representations in graph neural networks (GNNs), tend to become increasingly similar. This phenomena, known as oversmoothing, is characterized by learned representations that cannot be reliably differentiated leading to reduced predictive performance. In this paper, we propose an analogy between oversmoothing in GNNs and consensus or agreement in opinion dynamics. Through this analogy, we show that the message passing structure of recent continuous-depth GNNs is equivalent to a special case of opinion dynamics (i.e., linear consensus models) which has been theoretically proven to converge to consensus (i.e., oversmoothing) for all inputs. Using the understanding developed through this analogy, we design a new continuous-depth GNN model based on nonlinear opinion dynamics and prove that our model, which we call behavior-inspired message passing neural network (BIMP) circumvents oversmoothing for general inputs. Through extensive experiments, we show that BIMP is robust to oversmoothing and adversarial attack, and consistently outperforms competitive baselines on numerous benchmarks.
Comment: Analyzes oversmoothing in GNNs and proposes a novel architecture inspired by opinion dynamics to address it. While this aligns with representation learning, it is focused on GNNs, a domain slightly peripheral to the primary interests.
Relevance: 7 Novelty: 8
22. Can We Predict the Effect of Prompts?
ArXiv ID: 2501.18883
Authors: Jae Yong Lee, Sungmin Kang, Shin Yoo
Abstract: Large Language Models (LLMs) are machine learning models that have seen widespread adoption due to their capability of handling previously difficult tasks. LLMs, due to their training, are sensitive to how exactly a question is presented, also known as prompting. However, prompting well is challenging, as it has been difficult to uncover principles behind prompting -- generally, trial-and-error is the most common way of improving prompts, despite its significant computational cost. In this context, we argue it would be useful to perform `predictive prompt analysis', in which an automated technique would perform a quick analysis of a prompt and predict how the LLM would react to it, relative to a goal provided by the user. As a demonstration of the concept, we present Syntactic Prevalence Analyzer (SPA), a predictive prompt analysis approach based on sparse autoencoders (SAEs). SPA accurately predicted how often an LLM would generate target syntactic structures during code synthesis, with up to 0.994 Pearson correlation between the predicted and actual prevalence of the target structure. At the same time, SPA requires only 0.4\% of the time it takes to run the LLM on a benchmark. As LLMs are increasingly used during and integrated into modern software development, our proposed predictive prompt analysis concept has the potential to significantly ease the use of LLMs for both practitioners and researchers.
Comment: The paper proposes predictive prompt analysis for LLMs via sparse autoencoders, which aligns with representation learning enhancements and interpretability-related advances.
Relevance: 8 Novelty: 7
23. Unraveling Zeroth-Order Optimization through the Lens of Low-Dimensional Structured Perturbations
ArXiv ID: 2501.19099
Authors: Sihwan Park, Jihun Yun, SungYub Kim, Souvik Kundu, Eunho Yang
Abstract: Zeroth-order (ZO) optimization has emerged as a promising alternative to gradient-based backpropagation methods, particularly for black-box optimization and large language model (LLM) fine-tuning. However, ZO methods suffer from slow convergence due to high-variance stochastic gradient estimators. While structured perturbations, such as sparsity and low-rank constraints, have been explored to mitigate these issues, their effectiveness remains highly under-explored. In this work, we develop a unified theoretical framework that analyzes both the convergence and generalization properties of ZO optimization under structured perturbations. We show that high dimensionality is the primary bottleneck and introduce the notions of \textit{stable rank} and \textit{effective overlap} to explain how structured perturbations reduce gradient noise and accelerate convergence. Using the uniform stability under our framework, we then provide the first theoretical justification for why these perturbations enhance generalization. Additionally, through empirical analysis, we identify that \textbf{block coordinate descent} (BCD) to be an effective structured perturbation method. Extensive experiments show that, compared to existing alternatives, memory-efficient ZO (MeZO) with BCD (\textit{MeZO-BCD}) can provide improved converge with a faster wall-clock time/iteration by up to $\times\textbf{2.09}$ while yielding similar or better accuracy.
Comment: This work provides a new theoretical framework for Zeroth-Order Optimization, including structured perturbations and connections to generalization, aligning with 'Representation Learning' and efficiency methods.
Relevance: 7 Novelty: 8
24. What is causal about causal models and representations?
ArXiv ID: 2501.19335
Authors: Frederik Hytting J{\o}rgensen, Luigi Gresele, Sebastian Weichwald
Abstract: Causal Bayesian networks are 'causal' models since they make predictions about interventional distributions. To connect such causal model predictions to real-world outcomes, we must determine which actions in the world correspond to which interventions in the model. For example, to interpret an action as an intervention on a treatment variable, the action will presumably have to a) change the distribution of treatment in a way that corresponds to the intervention, and b) not change other aspects, such as how the outcome depends on the treatment; while the marginal distributions of some variables may change as an effect. We introduce a formal framework to make such requirements for different interpretations of actions as interventions precise. We prove that the seemingly natural interpretation of actions as interventions is circular: Under this interpretation, every causal Bayesian network that correctly models the observational distribution is trivially also interventionally valid, and no action yields empirical data that could possibly falsify such a model. We prove an impossibility result: No interpretation exists that is non-circular and simultaneously satisfies a set of natural desiderata. Instead, we examine non-circular interpretations that may violate some desiderata and show how this may in turn enable the falsification of causal models. By rigorously examining how a causal Bayesian network could be a 'causal' model of the world instead of merely a mathematical object, our formal framework contributes to the conceptual foundations of causal representation learning, causal discovery, and causal abstraction, while also highlighting some limitations of existing approaches.
Comment: The paper tackles the foundations of causal models and their real-world interpretability, contributing to the theoretical understanding of causal representation learning. However, it does not explicitly target foundational LLM or representation learning innovations.
Relevance: 7 Novelty: 7
25. BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning
ArXiv ID: 2501.18858
Authors: Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Within this framework, we introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps. First, it generates high-quality rationales by approximating the optimal thinking process through reinforcement learning, using a novel reward shaping mechanism. Second, it enhances the base LLM by maximizing the joint probability of rationale generation with respect to the model's parameters. Theoretically, we demonstrate BRiTE's convergence at a rate of $1/T$ with $T$ representing the number of iterations. Empirical evaluations on math and coding benchmarks demonstrate that our approach consistently improves performance across different base models without requiring human-annotated thinking processes. In addition, BRiTE demonstrates superior performance compared to existing algorithms that bootstrap thinking processes use alternative methods such as rejection sampling, and can even match or exceed the results achieved through supervised fine-tuning with human-annotated data.
Comment: Introduces a probabilistic framework and reinforcement learning-inspired approach to enhance reasoning in LLMs but primarily focuses on process improvements and does not delve into foundational insights about architectures or representation learning.
Relevance: 7 Novelty: 7
26. Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
ArXiv ID: 2501.19309
Authors: Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Sch\"onfeld, Ali Thabet, Jonas Kohler
Abstract: The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive generation, leveraging a fast draft model to propose candidate tokens, which are then verified in parallel based on their likelihood under the target model. While this approach guarantees to reproduce the target output, it incurs a substantial penalty: many high-quality draft tokens are rejected, even when they represent objectively valid continuations. Indeed, we show that even powerful draft models such as GPT-4o, as well as human text cannot achieve high acceptance rates under the standard verification scheme. This severely limits the speedup potential of current speculative decoding methods, as an early rejection becomes overwhelmingly likely when solely relying on alignment of draft and target. We thus ask the following question: Can we adapt verification to recognize correct, but non-aligned replies? To this end, we draw inspiration from the LLM-as-a-judge framework, which demonstrated that LLMs are able to rate answers in a versatile way. We carefully design a dataset to elicit the same capability in the target model by training a compact module on top of the embeddings to produce ``judgements" of the current continuation. We showcase our strategy on the Llama-3.1 family, where our 8b/405B-Judge achieves a speedup of 9x over Llama-405B, while maintaining its quality on a large range of benchmarks. These benefits remain present even in optimized inference frameworks, where our method reaches up to 141 tokens/s for 8B/70B-Judge and 129 tokens/s for 8B/405B on 2 and 8 H100s respectively.
Comment: Proposes enhancements to speculative decoding via LLM-based judging, underscoring efficiency in autoregressive generation. Ties to LLM behavior but lacks foundational contributions to architecture or theory.
Relevance: 7 Novelty: 7
27. Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment
ArXiv ID: 2501.19060
Authors: Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated exceptional generalization capabilities and can quickly adapt to downstream tasks through prompt fine-tuning. Unfortunately, in classification tasks involving non-training classes, known as open-vocabulary setting, fine-tuned VLMs often overfit to train classes, resulting in a misalignment between confidence scores and actual accuracy on unseen classes, which significantly undermines their reliability in real-world deployments. Existing confidence calibration methods typically require training parameters or analyzing features from the training dataset, restricting their ability to generalize unseen classes without corresponding train data. Moreover, VLM-specific calibration methods rely solely on text features from train classes as calibration indicators, which inherently limits their ability to calibrate train classes. To address these challenges, we propose an effective multimodal calibration method Contrast-Aware Calibration (CAC). Building on the original CLIP's zero-shot adaptability and the conclusion from empirical analysis that poor intra-class and inter-class discriminative ability on unseen classes is the root cause, we calculate calibration weights based on the contrastive difference between the original and fine-tuned CLIP. This method not only adapts to calibrating unseen classes but also overcomes the limitations of previous VLM calibration methods that could not calibrate train classes. In experiments involving 11 datasets with 5 fine-tuning methods, CAC consistently achieved the best calibration effect on both train and unseen classes without sacrificing accuracy and inference speed.
Comment: Proposes a contrast-aware calibration method for vision-language models like CLIP, focusing on fine-tuning dynamics and addressing misalignment issues, which partially matches the representation learning criterion.
Relevance: 7 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research, avoiding purely application-driven work:
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords for Relevant Domains: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
Hints on Irrelevant Domains: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
Hints on Application Tasks: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.