Personalized Daily ArXiv Papers 2025-04-28

[gpt-4o]  Prompt  Completion  Total
Token     26244   3513        29757
Cost      $0.07   $0.04       $0.1

Total arXiv papers: 357

Total scanned papers: 190

Total relevant papers: 15

Table of contents with paper titles:

  1. Score-Based Deterministic Density Sampling Authors: Vasily Ilin, Bamdad Hosseini, Jingwei Hu

  2. Non-identifiability distinguishes Neural Networks among Parametric Models Authors: Sourav Chatterjee, Timothy Sudijono

  3. BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs Authors: Hongyu Wang, Shuming Ma, Furu Wei

  4. Scaling Laws For Scalable Oversight Authors: Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark

  5. NoEsis: Differentially Private Knowledge Transfer in Modular LLM Adaptation Authors: Rob Romijnders, Stefanos Laskaridis, Ali Shahin Shamsabadi, Hamed Haddadi

  6. Generalization Guarantees for Multi-View Representation Learning and Application to Regularization via Gaussian Product Mixture Prior Authors: Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski

  7. Gradient Descent as a Shrinkage Operator for Spectral Bias Authors: Simon Lucey

  8. Random-Set Large Language Models Authors: Muhammad Mubashar, Shireen Kudukkil Manchingal, Fabio Cuzzolin

  9. Studying Small Language Models with Susceptibilities Authors: Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet

  10. Neural operators struggle to learn complex PDEs in pedestrian mobility: Hughes model case study Authors: Prajwal Chauhan, Salah Eddine Choutri, Mohamed Ghattassi, Nader Masmoudi, Saif Eddin Jabari

  11. Efficient Learning on Large Graphs using a Densifying Regularity Lemma Authors: Jonathan Kouchly, Ben Finkelshtein, Michael Bronstein, Ron Levie

  12. Subfunction Structure Matters: A New Perspective on Local Optima Networks Authors: S. L. Thomson, M. W. Przewozniczek

  13. Outlier-aware Tensor Robust Principal Component Analysis with Self-guided Data Augmentation Authors: Yangyang Xu, Kexin Li, Li Yang, You-Wei Wen

  14. A Model Zoo on Phase Transitions in Neural Networks Authors: Konstantin Schürholt, Léo Meynent, Yefan Zhou, Haiquan Lu, Yaoqing Yang, Damian Borth

  15. Representation Learning for Distributional Perturbation Extrapolation Authors: Julius von Kügelgen, Jakob Ketterer, Xinwei Shen, Nicolai Meinshausen, Jonas Peters


1. Score-Based Deterministic Density Sampling

ArXiv ID: 2504.18130

Authors: Vasily Ilin, Bamdad Hosseini, Jingwei Hu

Abstract: We propose and analyze a deterministic sampling framework using Score-Based Transport Modeling (SBTM) for sampling an unnormalized target density $\pi$. While diffusion generative modeling relies on pre-training the score function $\nabla \log f_t$ using samples from $\pi$, SBTM addresses the more general and challenging setting where only $\nabla \log\pi$ is known. SBTM approximates the Wasserstein gradient flow on KL$(f_t|\pi)$ by learning the time-varying score $\nabla \log f_t$ on the fly using score matching. The learned score gives immediate access to relative Fisher information, providing a built-in convergence diagnostic. The deterministic trajectories are smooth, interpretable, and free of Brownian-motion noise, while having the same distribution as ULA. We prove that SBTM dissipates relative entropy at the same rate as the exact gradient flow, provided sufficient training. We further extend our framework to annealed dynamics, to handle non log-concave targets. Numerical experiments validate our theoretical findings: SBTM converges at the optimal rate, has smooth trajectories, and is easily integrated with annealed dynamics. We compare to the baselines of ULA and annealed ULA.

Comment: The paper proposes a deterministic sampling framework using Score-Based Transport Modeling, which aligns with 'Emerging Trends' and 'Representation Learning' due to its novel approach to sampling and convergence diagnostics.

Relevance: 9 Novelty: 8
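
As a reading aid for the SBTM abstract above, the standard relations behind its terms can be written out. This is textbook background on Langevin dynamics and the Wasserstein gradient flow of the relative entropy, not the paper's own derivation; the step size $h$ and the Gaussian noise $\xi_k$ are notation introduced here, and the equivalence of distributions is stated for the continuous-time limit.

```latex
% Unadjusted Langevin algorithm (stochastic baseline), step size h:
X_{k+1} = X_k + h\,\nabla \log \pi(X_k) + \sqrt{2h}\,\xi_k,
\qquad \xi_k \sim \mathcal{N}(0, I)

% Deterministic particle dynamics with the same marginals f_t
% (continuous-time limit), driven by the difference of the two scores:
\frac{dx}{dt} = \nabla \log \pi(x) - \nabla \log f_t(x)

% Along this flow the relative entropy decays at the rate given by the
% relative Fisher information, which is why a learned score
% \nabla \log f_t doubles as a convergence diagnostic:
\frac{d}{dt}\,\mathrm{KL}(f_t|\pi)
  = -\int f_t(x)\,\bigl\|\nabla \log f_t(x) - \nabla \log \pi(x)\bigr\|^2\,dx
```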


2. Non-identifiability distinguishes Neural Networks among Parametric Models

ArXiv ID: 2504.18017

Authors: Sourav Chatterjee, Timothy Sudijono

Abstract: One of the enduring problems surrounding neural networks is to identify the factors that differentiate them from traditional statistical models. We prove a pair of results which distinguish feedforward neural networks among parametric models at the population level, for regression tasks. Firstly, we prove that for any pair of random variables $(X,Y)$, neural networks always learn a nontrivial relationship between $X$ and $Y$, if one exists. Secondly, we prove that for reasonable smooth parametric models, under local and global identifiability conditions, there exists a nontrivial $(X,Y)$ pair for which the parametric model learns the constant predictor $\mathbb{E}[Y]$. Together, our results suggest that a lack of identifiability distinguishes neural networks among the class of smooth parametric models.

Comment: This paper provides theoretical insights into the non-identifiability of neural networks, distinguishing them from traditional parametric models. It aligns closely with foundational research in representation learning.

Relevance: 9 Novelty: 8


3. BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

ArXiv ID: 2504.18415

Authors: Hongyu Wang, Shuming Ma, Furu Wei

Abstract: Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.

Comment: BitNet v2 introduces a novel method for 4-bit activation quantization in 1-bit LLMs, addressing efficiency and memory challenges. This is highly relevant to model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
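
The effect described in the BitNet v2 abstract above, a Hadamard transform flattening outlier-heavy activations before low-bit quantization, can be illustrated with a small sketch. This is not the paper's H-BitLinear module: the helper names, the toy activation vector, and the generic symmetric absmax 4-bit quantizer are all assumptions made for illustration.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in x]

def quantize_int4_absmax(x):
    """Generic symmetric absmax quantizer (an assumption, not the paper's scheme).

    Returns integer codes in [-8, 7] and the step size used.
    """
    step = max(abs(v) for v in x) / 7.0
    codes = [max(-8, min(7, round(v / step))) for v in x]
    return codes, step

def peak_to_rms(x):
    """Ratio of the largest magnitude to the root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

# Toy activation vector with a single large outlier (hypothetical values).
x = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 8.0]
y = fwht(x)

codes_x, step_x = quantize_int4_absmax(x)
codes_y, step_y = quantize_int4_absmax(y)

print("peak/RMS before:", peak_to_rms(x), "after:", peak_to_rms(y))
print("4-bit step before:", step_x, "after:", step_y)
```

The transform preserves the vector's norm but spreads the outlier's energy over all coordinates, so the peak-to-RMS ratio and the quantization step both shrink. The sketch makes no claim about end-to-end accuracy, which the abstract attributes to training with the transform in place.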


4. Scaling Laws For Scalable Oversight

ArXiv ID: 2504.18530

Authors: Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark

Abstract: Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific and deception-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: "Mafia", "Debate", "Backdoor Code" and "Wargames". For each game, we find scaling laws that approximate how domain performance depends on general AI system capability (using Chatbot Arena Elo as a proxy for general capability). We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. In our numerical examples, the NSO success rate is below 52% when overseeing systems that are 400 Elo points stronger than the baseline overseer, and it declines further for overseeing even stronger systems.

Comment: The paper proposes a framework for scalable oversight and introduces theoretical scaling laws, which aligns with emerging trends in foundational AI research.

Relevance: 9 Novelty: 8
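
The two ingredients named in the scalable-oversight abstract above, Elo scores and a piecewise-linear map with two plateaus, can be sketched as follows. The sketch assumes the standard Elo expected-score formula; the function names and the knee, floor, and ceiling parameters are hypothetical and are not taken from the paper.

```python
def elo_win_probability(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def domain_elo(general_elo, lo_knee, hi_knee, floor, ceiling):
    """Piecewise-linear map from general to domain-specific Elo.

    Flat at `floor` below `lo_knee` (task incompetence), linear in between,
    flat at `ceiling` above `hi_knee` (task saturation). All parameters are
    illustrative placeholders.
    """
    if general_elo <= lo_knee:
        return floor
    if general_elo >= hi_knee:
        return ceiling
    frac = (general_elo - lo_knee) / (hi_knee - lo_knee)
    return floor + frac * (ceiling - floor)

# Equal ratings give even odds; a 400-point deficit gives 1/11 in one game.
print(elo_win_probability(1000, 1000))
print(elo_win_probability(1000, 1400))

# Hypothetical curve: knees at 1000 and 1400, domain Elo between 200 and 600.
for g in (800, 1200, 1500):
    print(g, domain_elo(g, 1000, 1400, 200, 600))
```

The single-game probability above is a property of the Elo formula alone. The nested-oversight success rates quoted in the abstract come from the paper's multi-step analysis and are not reproduced by this sketch.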


5. NoEsis: Differentially Private Knowledge Transfer in Modular LLM Adaptation

ArXiv ID: 2504.18147

Authors: Rob Romijnders, Stefanos Laskaridis, Ali Shahin Shamsabadi, Hamed Haddadi

Abstract: Large Language Models (LLMs) are typically trained on vast amounts of data from various sources. Even when designed modularly (e.g., Mixture-of-Experts), LLMs can leak private information about their sources. Conversely, training such models in isolation arguably prohibits generalization. To this end, we propose a framework, NoEsis, which builds upon the desired properties of modularity, privacy, and knowledge transfer. NoEsis integrates differential privacy with a hybrid two-stage parameter-efficient fine-tuning scheme that combines domain-specific low-rank adapters, acting as experts, with common prompt tokens, acting as a knowledge-sharing backbone. Results from our evaluation on CodeXGLUE showcase that NoEsis can achieve provable privacy guarantees with tangible knowledge transfer across domains, and empirically show protection against Membership Inference Attacks. Finally, on code completion tasks, NoEsis bridges at least 77% of the accuracy gap between the non-shared and the non-private baseline.

Comment: The paper introduces a modular framework for LLM adaptation with differential privacy, which aligns with the core topic of model architecture, particularly modularity and privacy-preserving methods. The use of low-rank adapters adds relevance to model compression.

Relevance: 9 Novelty: 8


6. Generalization Guarantees for Multi-View Representation Learning and Application to Regularization via Gaussian Product Mixture Prior

ArXiv ID: 2504.18455

Authors: Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski

Abstract: We study the problem of distributed multi-view representation learning. In this problem, each of $K$ agents observes a distinct, possibly statistically correlated, view and independently extracts from it a suitable representation, such that a decoder that receives all $K$ representations correctly estimates the hidden label. In the absence of any explicit coordination between the agents, a central question is: what should each agent extract from its view that is necessary and sufficient for a correct estimation at the decoder? In this paper, we investigate this question from a generalization error perspective. First, we establish several generalization bounds in terms of the relative entropy between the distribution of the representations extracted from training and "test" datasets and a data-dependent symmetric prior, i.e., the Minimum Description Length (MDL) of the latent variables for all views and training and test datasets. Then, we use the obtained bounds to devise a regularizer and investigate in depth the selection of a suitable prior. In particular, we show, with supporting experiments, that our data-dependent Gaussian mixture priors with judiciously chosen weights lead to good performance. For single-view settings (i.e., $K=1$), our approach outperforms the existing Variational Information Bottleneck (VIB) and Category-Dependent VIB (CDVIB) approaches in our experiments. Interestingly, we show that a weighted attention mechanism emerges naturally in this setting. Finally, for the multi-view setting, we show that selecting the joint prior as a Gaussian product mixture induces a Gaussian mixture marginal prior for each marginal view and implicitly encourages the agents to extract and output redundant features, a finding which is somewhat counter-intuitive.

Comment: The paper explores multi-view representation learning with a focus on generalization bounds and introduces a novel regularizer based on Gaussian mixture priors. This aligns closely with the 'Representation Learning' criterion, particularly in training dynamics and feature learning.

Relevance: 9 Novelty: 8


7. Gradient Descent as a Shrinkage Operator for Spectral Bias

ArXiv ID: 2504.18207

Authors: Simon Lucey

Abstract: We generalize the connection between activation function and spline regression/smoothing and characterize how this choice may influence spectral bias within a 1D shallow network. We then demonstrate how gradient descent (GD) can be reinterpreted as a shrinkage operator that masks the singular values of a neural network's Jacobian. Viewed this way, GD implicitly selects the number of frequency components to retain, thereby controlling the spectral bias. An explicit relationship is proposed between the choice of GD hyperparameters (learning rate & number of iterations) and bandwidth (the number of active components). GD regularization is shown to be effective only with monotonic activation functions. Finally, we highlight the utility of non-monotonic activation functions (sinc, Gaussian) as iteration-efficient surrogates for spectral bias.

Comment: This paper provides a theoretical analysis of gradient descent as a shrinkage operator for spectral bias, which aligns with 'Representation Learning' and training dynamics in neural networks. It also introduces novel insights into activation functions and spectral bias.

Relevance: 9 Novelty: 8
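
The shrinkage view in the abstract above can be made concrete in the simplest setting. The sketch below assumes a linear least-squares problem already expressed in its singular basis, standing in for the network Jacobian; it is not the paper's construction, and the function names and numeric values are illustrative. For one singular direction with singular value s, gradient descent from zero initialization multiplies the least-squares solution by the filter factor 1 - (1 - lr * s^2)^steps.

```python
def gd_filter_factor(s, lr, steps):
    """Fraction of a singular direction recovered after `steps` of gradient descent."""
    return 1.0 - (1.0 - lr * s * s) ** steps

def run_gd(s, c, lr, steps):
    """Gradient descent on 0.5 * (s * w - c)**2, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w -= lr * s * (s * w - c)
    return w

lr, steps, c = 0.2, 50, 1.0          # illustrative hyperparameters
singular_values = [2.0, 0.5, 0.05]   # large, medium, small

for s in singular_values:
    w_gd = run_gd(s, c, lr, steps)
    w_closed = (c / s) * gd_filter_factor(s, lr, steps)
    print(f"s={s}: filter={gd_filter_factor(s, lr, steps):.4f}, "
          f"gd={w_gd:.6f}, closed_form={w_closed:.6f}")
```

Directions with large singular values are kept almost fully while small ones are suppressed, and raising the learning rate or the iteration count moves the cutoff toward smaller singular values. This is the sense in which the two hyperparameters act like a bandwidth control in this toy setting.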


8. Random-Set Large Language Models

ArXiv ID: 2504.18085

Authors: Muhammad Mubashar, Shireen Kudukkil Manchingal, Fabio Cuzzolin

Abstract: Large Language Models (LLMs) are known to produce very high-quality texts and responses to our queries. But how much can we trust this generated text? In this paper, we study the problem of uncertainty quantification in LLMs. We propose a novel Random-Set Large Language Model (RS-LLM) approach which predicts finite random sets (belief functions) over the token space, rather than probability vectors as in classical LLMs. To do so efficiently, we also present a methodology based on hierarchical clustering to extract and use a budget of "focal" subsets of tokens upon which the belief prediction is defined, rather than using all possible collections of tokens, making the method scalable yet effective. RS-LLMs encode the epistemic uncertainty induced in their generation process by the size and diversity of their training set via the size of the credal sets associated with the predicted belief functions. The proposed approach is evaluated on the CoQA and OBQA datasets using Llama2-7b, Mistral-7b and Phi-2 models and is shown to outperform the standard model on both datasets in terms of answer correctness, while also showing potential in estimating the second-level uncertainty in its predictions and providing the capability to detect when it is hallucinating.

Comment: The paper proposes a novel approach to uncertainty quantification in LLMs using random sets, which aligns with the 'Large Language Models' criterion due to its focus on foundational improvements in LLM behavior and interpretability.

Relevance: 9 Novelty: 8


9. Studying Small Language Models with Susceptibilities

ArXiv ID: 2504.18274

Authors: Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet

Abstract: We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small, controlled perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. Building a set of perturbations (probes) yields a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer. Susceptibilities link local learning coefficients from singular learning theory with linear-response theory, and quantify how local loss landscape geometry deforms under shifts in the data distribution.

Comment: The paper develops a linear response framework for interpretability in small language models, which aligns with representation learning by analyzing how components of a network respond to data distribution shifts. The focus on susceptibility and attribution scores is novel.

Relevance: 8 Novelty: 9


10. Neural operators struggle to learn complex PDEs in pedestrian mobility: Hughes model case study

ArXiv ID: 2504.18267

Authors: Prajwal Chauhan, Salah Eddine Choutri, Mohamed Ghattassi, Nader Masmoudi, Saif Eddin Jabari

Abstract: This paper investigates the limitations of neural operators in learning solutions for a Hughes model, a first-order hyperbolic conservation law system for crowd dynamics. The model couples a Fokker-Planck equation representing pedestrian density with a Hamilton-Jacobi-type (eikonal) equation. This Hughes model belongs to the class of nonlinear hyperbolic systems that often exhibit complex solution structures, including shocks and discontinuities. In this study, we assess the performance of three state-of-the-art neural operators (Fourier Neural Operator, Wavelet Neural Operator, and Multiwavelet Neural Operator) in various challenging scenarios. Specifically, we consider (1) discontinuous and Gaussian initial conditions and (2) diverse boundary conditions, while also examining the impact of different numerical schemes. Our results show that these neural operators perform well in easy scenarios with fewer discontinuities in the initial condition, yet they struggle in complex scenarios with multiple initial discontinuities and dynamic boundary conditions, even when trained specifically on such complex samples. The predicted solutions often appear smoother, resulting in a reduction in total variation and a loss of important physical features. This smoothing behavior is similar to issues discussed by Daganzo (1995), where models that introduce artificial diffusion were shown to miss essential features such as shock waves in hyperbolic systems. These results suggest that current neural operator architectures may introduce unintended regularization effects that limit their ability to capture transport dynamics governed by discontinuities. They also raise concerns about generalizing these methods to traffic applications where shock preservation is essential.

Comment: The paper critiques neural operators' ability to handle complex PDEs, which aligns with 'Emerging Trends' as it highlights limitations in current architectures and raises foundational questions about their generalization.

Relevance: 8 Novelty: 8


11. Efficient Learning on Large Graphs using a Densifying Regularity Lemma

ArXiv ID: 2504.18273

Authors: Jonathan Kouchly, Ben Finkelshtein, Michael Bronstein, Ron Levie

Abstract: Learning on large graphs presents significant challenges, with traditional Message Passing Neural Networks suffering from computational and memory costs scaling linearly with the number of edges. We introduce the Intersecting Block Graph (IBG), a low-rank factorization of large directed graphs based on combinations of intersecting bipartite components, each consisting of a pair of communities, for source and target nodes. By giving less weight to non-edges, we show how to efficiently approximate any graph, sparse or dense, by a dense IBG. Specifically, we prove a constructive version of the weak regularity lemma, showing that for any chosen accuracy, every graph, regardless of its size or sparsity, can be approximated by a dense IBG whose rank depends only on the accuracy. This dependence of the rank solely on the accuracy, and not on the sparsity level, is in contrast to previous forms of the weak regularity lemma. We present a graph neural network architecture operating on the IBG representation of the graph and demonstrating competitive performance on node classification, spatio-temporal graph analysis, and knowledge graph completion, while having memory and computational complexity linear in the number of nodes rather than edges.

Comment: The paper introduces a novel low-rank factorization method for large graphs, which aligns with the 'Model Compression' criterion due to its focus on efficiency and sparsity. Additionally, it provides theoretical insights via a constructive version of the weak regularity lemma, which is foundational.

Relevance: 8 Novelty: 8


12. Subfunction Structure Matters: A New Perspective on Local Optima Networks

ArXiv ID: 2504.17799

Authors: S. L. Thomson, M. W. Przewozniczek

Abstract: Local optima networks (LONs) capture fitness landscape information. They are typically constructed in a black-box manner; information about the problem structure is not utilised. This also applies to the analysis of LONs: knowledge about the problem, such as interaction between variables, is not considered. We challenge this status-quo with an alternative approach: we consider how LON analysis can be improved by incorporating subfunction-based information - this can either be known a-priori or learned during search. To this end, LONs are constructed for several benchmark pseudo-boolean problems using three approaches: firstly, the standard algorithm; a second algorithm which uses deterministic grey-box crossover; and a third algorithm which selects perturbations based on learned information about variable interactions. Metrics related to subfunction changes in a LON are proposed and compared with metrics from previous literature which capture other aspects of a LON. Incorporating problem structure in LON construction and analysing it can bring enriched insight into optimisation dynamics. Such information may be crucial to understanding the difficulty of solving a given problem with state-of-the-art linkage learning optimisers. In light of the results, we suggest incorporation of problem structure as an alternative paradigm in landscape analysis for problems with known or suspected subfunction structure.

Comment: The paper explores a novel perspective on local optima networks by incorporating subfunction-based information, which aligns with the 'Emerging Trends' criterion as it challenges established assumptions in landscape analysis.

Relevance: 8 Novelty: 7


13. Outlier-aware Tensor Robust Principal Component Analysis with Self-guided Data Augmentation

ArXiv ID: 2504.18323

Authors: Yangyang Xu, Kexin Li, Li Yang, You-Wei Wen

Abstract: Tensor Robust Principal Component Analysis (TRPCA) is a fundamental technique for decomposing multi-dimensional data into a low-rank tensor and an outlier tensor, yet existing methods relying on sparse outlier assumptions often fail under structured corruptions. In this paper, we propose a self-guided data augmentation approach that employs adaptive weighting to suppress outlier influence, reformulating the original TRPCA problem into a standard Tensor Principal Component Analysis (TPCA) problem. The proposed model involves an optimization-driven weighting scheme that dynamically identifies and downweights outlier contributions during tensor augmentation. We develop an efficient proximal block coordinate descent algorithm with closed-form updates to solve the resulting optimization problem, ensuring computational efficiency. Theoretical convergence is guaranteed through a framework combining block coordinate descent with majorization-minimization principles. Numerical experiments on synthetic and real-world datasets, including face recovery, background subtraction, and hyperspectral denoising, demonstrate that our method effectively handles various corruption patterns. The results show the improvements in both accuracy and computational efficiency compared to state-of-the-art methods.

Comment: The paper proposes a novel optimization-driven approach for Tensor Robust Principal Component Analysis, which aligns with 'Representation Learning' due to its focus on low-rank tensor decomposition and handling structured corruptions.

Relevance: 8 Novelty: 7


14. A Model Zoo on Phase Transitions in Neural Networks

ArXiv ID: 2504.18072

Authors: Konstantin Schürholt, Léo Meynent, Yefan Zhou, Haiquan Lu, Yaoqing Yang, Damian Borth

Abstract: Using the weights of trained Neural Network (NN) models as a data modality has recently gained traction as a research field - dubbed Weight Space Learning (WSL). Multiple recent works propose WSL methods to analyze models, evaluate methods, or synthesize weights. Weight space learning methods require populations of trained models as datasets for development and evaluation. However, existing collections of models - called 'model zoos' - are unstructured or follow a rudimentary definition of diversity. In parallel, work rooted in statistical physics has identified phases and phase transitions in NN models. Models are homogeneous within the same phase but qualitatively differ from one phase to another. We combine the idea of 'model zoos' with phase information to create a controlled notion of diversity in populations. We introduce 12 large-scale zoos that systematically cover known phases and vary over model architecture, size, and datasets. These datasets cover different modalities, such as computer vision, natural language processing, and scientific ML. For every model, we compute loss landscape metrics and validate full coverage of the phases. With this dataset, we provide the community with a resource with a wide range of potential applications for WSL and beyond. Evidence suggests the loss landscape phase plays a role in applications such as model training, analysis, or sparsification. We demonstrate this in an exploratory study of downstream methods like transfer learning or model weights averaging.

Comment: The paper introduces a structured 'model zoo' for weight space learning and explores phase transitions in neural networks, which could provide foundational insights into representation learning and training dynamics.

Relevance: 8 Novelty: 7


15. Representation Learning for Distributional Perturbation Extrapolation

ArXiv ID: 2504.18522

Authors: Julius von Kügelgen, Jakob Ketterer, Xinwei Shen, Nicolai Meinshausen, Jonas Peters

Abstract: We consider the problem of modelling the effects of unseen perturbations such as gene knockdowns or drug combinations on low-level measurements such as RNA sequencing data. Specifically, given data collected under some perturbations, we aim to predict the distribution of measurements for new perturbations. To address this challenging extrapolation task, we posit that perturbations act additively in a suitable, unknown embedding space. More precisely, we formulate the generative process underlying the observed data as a latent variable model, in which perturbations amount to mean shifts in latent space and can be combined additively. Unlike previous work, we prove that, given sufficiently diverse training perturbations, the representation and perturbation effects are identifiable up to affine transformation, and use this to characterize the class of unseen perturbations for which we obtain extrapolation guarantees. To estimate the model from data, we propose a new method, the perturbation distribution autoencoder (PDAE), which is trained by maximising the distributional similarity between true and predicted perturbation distributions. The trained model can then be used to predict previously unseen perturbation distributions. Empirical evidence suggests that PDAE compares favourably to existing methods and baselines at predicting the effects of unseen perturbations.

Comment: This paper introduces a new method, PDAE, for representation learning in the context of distributional perturbation extrapolation. It provides theoretical guarantees and focuses on latent variable modeling, which aligns with 'Representation Learning' and foundational generative paradigms.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
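
A response in the JSONL format requested above could be consumed roughly as follows. This is a minimal sketch, not the digest's actual pipeline: the function name and the validation rule are assumptions, and the two sample records are rebuilt from entries in this digest (with shortened comments) rather than copied from real model output.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}

def parse_selection(jsonl_text):
    """Parse one JSON object per non-empty line and check the expected keys."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records

# Illustrative input: IDs and scores taken from two entries in this digest.
sample = "\n".join([
    json.dumps({"ARXIVID": "2504.18274", "COMMENT": "Linear response framework.",
                "RELEVANCE": 8, "NOVELTY": 9}),
    json.dumps({"ARXIVID": "2504.18130", "COMMENT": "Deterministic sampling.",
                "RELEVANCE": 9, "NOVELTY": 8}),
])

records = parse_selection(sample)

# Rank by relevance first, then novelty, highest scores first.
ranked = sorted(records, key=lambda r: (r["RELEVANCE"], r["NOVELTY"]), reverse=True)
for r in ranked:
    print(r["ARXIVID"], r["RELEVANCE"], r["NOVELTY"])
```

Treating each line as an independent JSON object means one malformed line can be reported by number without discarding the rest of the response, which is the usual reason to prefer JSONL over a single JSON array for model output.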

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning
     - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
     - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture
     - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations.
     - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression
     - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
     - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs)
     - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability.
     - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science
     - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
     - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends
     - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
     - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: