Personalized Daily Arxiv Papers 03/07/2025
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 41894 | 5862 | 47756 |
| Cost | $0.1 | $0.06 | $0.16 |
Total ArXiv papers: 523
Total scanned papers: 281
Total relevant papers: 30
Table of contents with paper titles:
-
A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers Authors: William Merrill, Ashish Sabharwal
-
L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling Authors: Zhuo Chen, Oriol Mayn\'e i Comas, Zhuotao Jin, Di Luo, Marin Solja\v{c}i\'c
-
Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining Authors: Houyi Li, Wenzheng Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Yangshijie Xu, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
-
Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability Authors: Lijia Yu, Yibo Miao, Yifan Zhu, Xiao-Shan Gao, Lijun Zhang
-
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions Authors: Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig
-
SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning Authors: Chen Li, Yinyi Luo, Anudeep Bolimera, Marios Savvides
-
Learning Causal Response Representations through Direct Effect Analysis Authors: Homer Durand, Gherardo Varando, Gustau Camps-Valls
-
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization Authors: Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
-
Causally Reliable Concept Bottleneck Models Authors: Giovanni De Felice, Arianna Casanova Flores, Francesco De Santis, Silvia Santini, Johannes Schneider, Pietro Barbiero, Alberto Termine
-
Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators Authors: Blaine Quackenbush, Paul J. Atzberger
-
Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling Authors: Yan Li, Pengfei Zheng, Shuang Chen, Zewei Xu, Yunfei Du, Zhengang Wang
-
Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size Authors: Alireza Behtash, Marijan Fofonjka, Ethan Baird, Tyler Mauer, Hossein Moghimifam, David Stout, Joel Dennison
-
Activation Space Interventions Can Be Transferred Between Large Language Models Authors: Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah
-
How can representation dimension dominate structurally pruned LLMs? Authors: Mingxue Xu, Lisa Alazraki, Danilo P. Mandic
-
Enough Coin Flips Can Make LLMs Act Bayesian Authors: Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, David M. Chan
-
Provable Robust Overfitting Mitigation in Wasserstein Distributionally Robust Optimization Authors: Shuang Liu, Yihan Wang, Yifan Zhu, Yibo Miao, Xiao-Shan Gao
-
Generative Learning of Densities on Manifolds Authors: Dimitris G. Giovanis, Ellis Crabtree, Roger G. Ghanem, Ioannis G. kevrekidis
-
An optimal Petrov-Galerkin framework for operator networks Authors: Philip Charles, Deep Ray, Yue Yu, Joost Prins, Hugo Melchers, Michael R. A. Abdelmalik, Jeffrey Cochran, Assad A. Oberai, Thomas J. R. Hughes, Mats G. Larson
-
Sample-Optimal Agnostic Boosting with Unlabeled Data Authors: Udaya Ghai, Karan Singh
-
Generalized Interpolating Discrete Diffusion Authors: Dimitri von R\"utte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Sch\"olkopf, Thomas Hofmann
-
All-atom Diffusion Transformers: Unified generative modelling of molecules and materials Authors: Chaitanya K. Joshi, Xiang Fu, Yi-Lun Liao, Vahe Gharakhanyan, Benjamin Kurt Miller, Anuroop Sriram, Zachary W. Ulissi
-
Quantitative Flow Approximation Properties of Narrow Neural ODEs Authors: Karthik Elamvazhuthi
-
Boosting Offline Optimizers with Surrogate Sensitivity Authors: Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
-
LEWIS (LayEr WIse Sparsity) -- A Training Free Guided Model Merging Approach Authors: Hetarth Chopra, Vidhi Rambhia, Vikram Adve
-
An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding Authors: Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
-
Simple Self Organizing Map with Visual Transformer Authors: Alan Luo, Kaiwen Yuan
-
Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows Authors: Xiangxin Zhou, Yi Xiao, Haowei Lin, Xinheng He, Jiaqi Guan, Yang Wang, Qiang Liu, Feng Zhou, Liang Wang, Jianzhu Ma
-
IDInit: A Universal and Stable Initialization Method for Neural Network Training Authors: Yu Pan, Chaozheng Wang, Zekai Wu, Qifan Wang, Min Zhang, Zenglin Xu
-
Bi-Lipschitz Ansatz for Anti-Symmetric Functions Authors: Nadav Dym, Jianfeng Lu, Matan Mizrachi
-
Continual Optimization with Symmetry Teleportation for Multi-Task Learning Authors: Zhipeng Zhou, Ziqiao Meng, Pengcheng Wu, Peilin Zhao, Chunyan Miao
1. A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers
ArXiv ID: 2503.03961
Authors: William Merrill, Ashish Sabharwal
Abstract: Recent theoretical results show transformers cannot express sequential reasoning problems over long input lengths, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer's depth affects its expressive power. We address these questions by analyzing the expressive power of transformers whose depth can grow minimally with context length $n$. We show even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems: recognizing regular languages, which captures state tracking abilities, and graph connectivity, which underlies multi-step reasoning. Notably, both of these problems cannot be expressed by fixed-depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain-of-thought steps. Empirically, we find our theoretical depth requirements for regular language recognition match the practical depth requirements of transformers remarkably well. Thus, our results clarify precisely how depth affects transformers' reasoning capabilities, providing potential practical insights for designing models that are better at sequential reasoning.
Comment: The paper provides theoretical insights into the expressive power of log-depth transformers, directly addressing foundational questions about model architecture and depth scaling.
Relevance: 10 Novelty: 9
2. L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
ArXiv ID: 2503.04725
Authors: Zhuo Chen, Oriol Mayn\'e i Comas, Zhuotao Jin, Di Luo, Marin Solja\v{c}i\'c
Abstract: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
Comment: The paper establishes a mutual information scaling law for long-context language modeling, which provides theoretical insights into LLM behavior and aligns with the LLM criterion.
Relevance: 9 Novelty: 9
3. Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
ArXiv ID: 2503.04715
Authors: Houyi Li, Wenzheng Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Yangshijie Xu, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Abstract: The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed models and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.07\% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To our best known, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, as well as establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/
Comment: The paper establishes scaling laws for hyperparameters in LLM pretraining, providing theoretical insights into model optimization and aligning with foundational research in LLM behavior.
Relevance: 9 Novelty: 9
4. Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability
ArXiv ID: 2503.04111
Authors: Lijia Yu, Yibo Miao, Yifan Zhu, Xiao-Shan Gao, Lijun Zhang
Abstract: The primary objective of learning methods is generalization. Classic uniform generalization bounds, which rely on VC-dimension or Rademacher complexity, fail to explain the significant attribute that over-parameterized models in deep learning exhibit nice generalizability. On the other hand, algorithm-dependent generalization bounds, like stability bounds, often rely on strict assumptions. To establish generalizability under less stringent assumptions, this paper investigates the generalizability of neural networks that minimize or approximately minimize empirical risk. We establish a lower bound for population accuracy based on the expressiveness of these networks, which indicates that with an adequate large number of training samples and network sizes, these networks, including over-parameterized ones, can generalize effectively. Additionally, we provide a necessary condition for generalization, demonstrating that, for certain data distributions, the quantity of training data required to ensure generalization exceeds the network size needed to represent the corresponding data distribution. Finally, we provide theoretical insights into several phenomena in deep learning, including robust generalization, importance of over-parameterization, and effect of loss function on generalization.
Comment: The paper provides theoretical insights into generalizability based on expressiveness, directly addressing foundational questions in representation learning and over-parameterization.
Relevance: 9 Novelty: 8
5. Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
ArXiv ID: 2503.03862
Authors: Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig
Abstract: Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveal insights into data composition, such as the trade-off between language and code tasks at 15-25\% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
Comment: The paper meta-analyzes design decisions in language models, providing insights into architectural choices and their downstream impact, which aligns with foundational research in model architecture.
Relevance: 9 Novelty: 8
6. SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning
ArXiv ID: 2503.04530
Authors: Chen Li, Yinyi Luo, Anudeep Bolimera, Marios Savvides
Abstract: Large Language Models (LLMs) excel in reasoning but remain constrained by their Chain-of-Thought (CoT) approach, which struggles with complex tasks requiring more nuanced topological reasoning. We introduce SOLAR, Scalable Optimization of Large-scale Architecture for Reasoning, a framework that dynamically optimizes various reasoning topologies to enhance accuracy and efficiency. Our Topological Annotation Generation (TAG) system automates topological dataset creation and segmentation, improving post-training and evaluation. Additionally, we propose Topological-Scaling, a reward-driven framework that aligns training and inference scaling, equipping LLMs with adaptive, task-aware reasoning. SOLAR achieves substantial gains on MATH and GSM8K: +5% accuracy with Topological Tuning, +9% with Topological Reward, and +10.02% with Hybrid Scaling. It also reduces response length by over 5% for complex problems, lowering inference latency. To foster the reward system, we train a multi-task Topological Reward Model (M-TRM), which autonomously selects the best reasoning topology and answer in a single pass, eliminating the need for training and inference on multiple single-task TRMs (S-TRMs), thus reducing both training cost and inference latency. In addition, in terms of performance, M-TRM surpasses all S-TRMs, improving accuracy by +10% and rank correlation by +9%. To the best of our knowledge, SOLAR sets a new benchmark for scalable, high-precision LLM reasoning while introducing an automated annotation process and a dynamic reasoning topology competition mechanism.
Comment: The paper introduces SOLAR, a framework for reasoning in LLMs with novel topological approaches, aligning with foundational research in model architecture and reasoning.
Relevance: 9 Novelty: 8
7. Learning Causal Response Representations through Direct Effect Analysis
ArXiv ID: 2503.04358
Authors: Homer Durand, Gherardo Varando, Gustau Camps-Valls
Abstract: We propose a novel approach for learning causal response representations. Our method aims to extract directions in which a multidimensional outcome is most directly caused by a treatment variable. By bridging conditional independence testing with causal representation learning, we formulate an optimisation problem that maximises the evidence against conditional independence between the treatment and outcome, given a conditioning set. This formulation employs flexible regression models tailored to specific applications, creating a versatile framework. The problem is addressed through a generalised eigenvalue decomposition. We show that, under mild assumptions, the distribution of the largest eigenvalue can be bounded by a known $F$-distribution, enabling testable conditional independence. We also provide theoretical guarantees for the optimality of the learned representation in terms of signal-to-noise ratio and Fisher information maximisation. Finally, we demonstrate the empirical effectiveness of our approach in simulation and real-world experiments. Our results underscore the utility of this framework in uncovering direct causal effects within complex, multivariate settings.
Comment: The paper focuses on causal representation learning, which aligns with the representation learning criterion. It introduces a novel optimization framework and provides theoretical guarantees, making it relevant to foundational research.
Relevance: 9 Novelty: 8
8. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
ArXiv ID: 2503.04598
Authors: Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
Abstract: Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the location of layer normalization. While Pre-Norm structures facilitate easier training due to their more prominent identity path, they often yield suboptimal performance compared to Post-Norm. In this paper, we propose $\textbf{HybridNorm}$, a straightforward yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. This design not only stabilizes training but also enhances performance, particularly in the context of LLMs. Comprehensive experiments in both dense and sparse architectures show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches, achieving state-of-the-art results across various benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. %Code will be made publicly available. Code is available at https://github.com/BryceZhuo/HybridNorm.
Comment: The paper proposes HybridNorm, a novel normalization strategy for transformers, which directly aligns with the model architecture criterion. It provides insights into training stability and performance improvements.
Relevance: 9 Novelty: 8
9. Causally Reliable Concept Bottleneck Models
ArXiv ID: 2503.04363
Authors: Giovanni De Felice, Arianna Casanova Flores, Francesco De Santis, Silvia Santini, Johannes Schneider, Pietro Barbiero, Alberto Termine
Abstract: Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable concepts, facilitating explainability and human interaction. However, these architectures, on par with popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose \emph{Causally reliable Concept Bottleneck Models} (C$^2$BMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and \emph{unstructured} background knowledge (e.g., scientific literature). Experimental evidence suggest that C$^2$BM are more interpretable, causally reliable, and improve responsiveness to interventions w.r.t. standard opaque and concept-based models, while maintaining their accuracy.
Comment: The paper introduces a concept bottleneck model with causal reasoning capabilities, aligning with representation learning and emerging trends in explainable AI. It also provides a pipeline for learning causal structures.
Relevance: 9 Novelty: 8
10. Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators
ArXiv ID: 2503.04649
Authors: Blaine Quackenbush, Paul J. Atzberger
Abstract: We introduce methods for obtaining pretrained Geometric Neural Operators (GNPs) that can serve as basal foundation models for use in obtaining geometric features. These can be used within data processing pipelines for machine learning tasks and numerical methods. We show how our GNPs can be trained to learn robust latent representations for the differential geometry of point-clouds to provide estimates of metric, curvature, and other shape-related features. We demonstrate how our pre-trained GNPs can be used (i) to estimate the geometric properties of surfaces of arbitrary shape and topologies with robustness in the presence of noise, (ii) to approximate solutions of geometric partial differential equations (PDEs) on manifolds, and (iii) to solve equations for shape deformations such as curvature driven flows. We also release a package of the codes and weights for using our pre-trained GNPs for processing point cloud representations. This allows for incorporating our pre-trained GNPs as components for reuse within existing and new data processing pipelines. The GNPs also can be used as part of numerical solvers involving geometry or as part of methods for performing inference and other geometric tasks.
Comment: The paper introduces Geometric Neural Operators (GNPs) for point cloud representations, which aligns with foundational research in representation learning and architecture-level innovations.
Relevance: 9 Novelty: 8
11. Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling
ArXiv ID: 2503.04398
Authors: Yan Li, Pengfei Zheng, Shuang Chen, Zewei Xu, Yunfei Du, Zhengang Wang
Abstract: MoE (Mixture of Experts) prevails as a neural architecture that can scale modern transformer-based LLMs (Large Language Models) to unprecedented scales. Nevertheless, large MoEs' great demands of computing power, memory capacity and memory bandwidth make scalable serving a fundamental challenge and efficient parallel inference has become a requisite to attain adequate throughput under latency constraints. DeepSpeed-MoE, one state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm including EP (Expert Parallelism), TP (Tensor Parallel) and DP (Data Parallelism). However, our analysis shows DeepSpeed-MoE's inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives to route token activation. Our work aims to boost DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE. Speculative MoE has two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens' expert routing paths and pre-schedule tokens and experts across devices to losslessly trim EP's communication volume. Besides DeepSpeed-MoE, we also build Speculative MoE into a prevailing MoE inference engine SGLang. Experiments show Speculative MoE can significantly boost state-of-the-art MoE inference frameworks on fast homogeneous and slow heterogeneous interconnects.
Comment: The paper focuses on improving MoE inference efficiency with speculative parallelization, which directly aligns with foundational research in MoE architectures and efficiency.
Relevance: 9 Novelty: 8
12. Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size
ArXiv ID: 2503.04704
Authors: Alireza Behtash, Marijan Fofonjka, Ethan Baird, Tyler Mauer, Hossein Moghimifam, David Stout, Joel Dennison
Abstract: We present a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs) using Entropy-Weighted Quantization (EWQ). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without causing significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures-from 1.6B to 70B parameters-showcasing consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding of EWQ is its ability to reduce perplexity compared to unquantized models, suggesting the presence of beneficial regularization through selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need for loading model weights. This technique leverages universal characteristics of entropy distribution that persist across various architectures and scales, enabling near-instantaneous quantization decisions while maintaining 80% classification accuracy with full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
Comment: The paper proposes a novel entropy-weighted quantization method for LLMs, which aligns with the model compression criterion. The findings on entropy and precision requirements are insightful and relevant.
Relevance: 9 Novelty: 8
13. Activation Space Interventions Can Be Transferred Between Large Language Models
ArXiv ID: 2503.04429
Authors: Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah
Abstract: The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models' outputs in a predictable way. Additionally, we propose a new task, \textit{corrupted capabilities}, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable ``lightweight safety switches", allowing dynamic toggling between model behaviors.
Comment: The paper explores activation space interventions and their transferability between LLMs, which aligns with representation learning and foundational insights into LLM behavior.
Relevance: 9 Novelty: 8
14. How can representation dimension dominate structurally pruned LLMs?
ArXiv ID: 2503.04377
Authors: Mingxue Xu, Lisa Alazraki, Danilo P. Mandic
Abstract: Pruning assumes a subnetwork exists in the original deep neural network, which can achieve comparative model performance with less computation than the original. However, it is unclear how the model performance varies with the different subnetwork extractions. In this paper, we choose the representation dimension (or embedding dimension, model dimension, the dimension of the residual stream in the relevant literature) as the entry point to this issue. We investigate the linear transformations in the LLM transformer blocks and consider a specific structured pruning approach, SliceGPT, to extract the subnetworks of different representation dimensions. We mechanistically analyse the activation flow during the model forward passes, and find the representation dimension dominates the linear transformations, model predictions, and, finally, the model performance. Explicit analytical relations are given to calculate the pruned model performance (perplexity and accuracy) without actual evaluation, and are empirically validated with Llama-3-8B-Instruct and Phi-3-mini-4k-Instruct.
Comment: The paper investigates the role of representation dimension in pruned LLMs, providing foundational insights into structured pruning and its impact on model performance.
Relevance: 9 Novelty: 8
15. Enough Coin Flips Can Make LLMs Act Bayesian
ArXiv ID: 2503.04722
Authors: Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, David M. Chan
Abstract: Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs utilize ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.
Comment: The paper investigates whether LLMs perform Bayesian reasoning during in-context learning, providing theoretical insights into LLM behavior and interpretability. This aligns closely with the foundational research on LLMs and their emergent capabilities.
Relevance: 9 Novelty: 8
16. Provable Robust Overfitting Mitigation in Wasserstein Distributionally Robust Optimization
ArXiv ID: 2503.04315
Authors: Shuang Liu, Yihan Wang, Yifan Zhu, Yibo Miao, Xiao-Shan Gao
Abstract: Wasserstein distributionally robust optimization (WDRO) optimizes against worst-case distributional shifts within a specified uncertainty set, leading to enhanced generalization on unseen adversarial examples, compared to standard adversarial training which focuses on pointwise adversarial perturbations. However, WDRO still suffers fundamentally from the robust overfitting problem, as it does not consider statistical error. We address this gap by proposing a novel robust optimization framework under a new uncertainty set for adversarial noise via Wasserstein distance and statistical error via Kullback-Leibler divergence, called the Statistically Robust WDRO. We establish a robust generalization bound for the new optimization framework, implying that out-of-distribution adversarial performance is at least as good as the statistically robust training loss with high probability. Furthermore, we derive conditions under which Stackelberg and Nash equilibria exist between the learner and the adversary, giving an optimal robust model in certain sense. Finally, through extensive experiments, we demonstrate that our method significantly mitigates robust overfitting and enhances robustness within the framework of WDRO.
Comment: The paper proposes a novel robust optimization framework under Wasserstein DRO, which aligns with foundational research in optimization and robustness.
Relevance: 8 Novelty: 8
17. Generative Learning of Densities on Manifolds
ArXiv ID: 2503.03963
Authors: Dimitris G. Giovanis, Ellis Crabtree, Roger G. Ghanem, Ioannis G. kevrekidis
Abstract: A generative modeling framework is proposed that combines diffusion models and manifold learning to efficiently sample data densities on manifolds. The approach utilizes Diffusion Maps to uncover possible low-dimensional underlying (latent) spaces in the high-dimensional data (ambient) space. Two approaches for sampling from the latent data density are described. The first is a score-based diffusion model, which is trained to map a standard normal distribution to the latent data distribution using a neural network. The second one involves solving an It\^o stochastic differential equation in the latent space. Additional realizations of the data are generated by lifting the samples back to the ambient space using Double Diffusion Maps, a recently introduced technique typically employed in studying dynamical system reduction; here the focus lies in sampling densities rather than system dynamics. The proposed approaches enable sampling high dimensional data densities restricted to low-dimensional, a priori unknown manifolds. The efficacy of the proposed framework is demonstrated through a benchmark problem and a material with multiscale structure.
Comment: The paper combines diffusion models and manifold learning for generative modeling, which aligns with foundational research in representation learning and generative paradigms.
Relevance: 8 Novelty: 8
18. An optimal Petrov-Galerkin framework for operator networks
ArXiv ID: 2503.04024
Authors: Philip Charles, Deep Ray, Yue Yu, Joost Prins, Hugo Melchers, Michael R. A. Abdelmalik, Jeffrey Cochran, Assad A. Oberai, Thomas J. R. Hughes, Mats G. Larson
Abstract: The optimal Petrov-Galerkin formulation to solve partial differential equations (PDEs) recovers the best approximation in a specified finite-dimensional (trial) space with respect to a suitable norm. However, the recovery of this optimal solution is contingent on being able to construct the optimal weighting functions associated with the trial basis. While explicit constructions are available for simple one- and two-dimensional problems, such constructions for a general multidimensional problem remain elusive. In the present work, we revisit the optimal Petrov-Galerkin formulation through the lens of deep learning. We propose an operator network framework called Petrov-Galerkin Variationally Mimetic Operator Network (PG-VarMiON), which emulates the optimal Petrov-Galerkin weak form of the underlying PDE. The PG-VarMiON is trained in a supervised manner using a labeled dataset comprising the PDE data and the corresponding PDE solution, with the training loss depending on the choice of the optimal norm. The special architecture of the PG-VarMiON allows it to implicitly learn the optimal weighting functions, thus endowing the proposed operator network with the ability to generalize well beyond the training set. We derive approximation error estimates for PG-VarMiON, highlighting the contributions of various error sources, particularly the error in learning the true weighting functions. Several numerical results are presented for the advection-diffusion equation to demonstrate the efficacy of the proposed method. By embedding the Petrov-Galerkin structure into the network architecture, PG-VarMiON exhibits greater robustness and improved generalization compared to other popular deep operator frameworks, particularly when the training data is limited.
Comment: The paper proposes a novel operator network framework (PG-VarMiON) for solving PDEs, embedding Petrov-Galerkin structure into the architecture. This aligns with foundational research in model architecture.
Relevance: 8 Novelty: 8
19. Sample-Optimal Agnostic Boosting with Unlabeled Data
ArXiv ID: 2503.04706
Authors: Udaya Ghai, Karan Singh
Abstract: Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising on sample efficiency. However, in the agnostic case, existing boosting algorithms fall short of achieving the optimal sample complexity. This paper highlights an unexpected and previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed, unlabeled and labeled inclusive, is never more than that for the best known agnostic boosting algorithm -- so this result is never worse -- while only a vanishing fraction of these need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in the distribution-specific setting, where unlabeled samples can be availed for free. We detail other applications of this result in reinforcement learning.
Comment: The paper proposes a novel agnostic boosting algorithm leveraging unlabeled data, which aligns with foundational research in representation learning and efficiency improvements.
Relevance: 8 Novelty: 8
20. Generalized Interpolating Discrete Diffusion
ArXiv ID: 2503.04482
Authors: Dimitri von R\"utte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Sch\"olkopf, Thomas Hofmann
Abstract: While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion and derive the theoretical backbone of a family of general interpolating discrete diffusion (GIDD) processes offering greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Our code and models are open-source: https://github.com/dvruette/gidd/
Comment: The paper introduces a generalized interpolating discrete diffusion (GIDD) framework, which provides theoretical insights into diffusion processes and aligns with foundational research in representation learning.
Relevance: 8 Novelty: 8
21. All-atom Diffusion Transformers: Unified generative modelling of molecules and materials
ArXiv ID: 2503.03965
Authors: Chaitanya K. Joshi, Xiang Fu, Yi-Lun Liao, Vahe Gharakhanyan, Benjamin Kurt Miller, Anuroop Sriram, Zachary W. Ulissi
Abstract: Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems - such as molecules and materials - the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representations of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on QM9 and MP20 datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, exceeding state-of-the-art results from molecule and crystal-specific models. ADiT uses standard Transformers for both the autoencoder and diffusion model, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry. Open source code: https://github.com/facebookresearch/all-atom-diffusion-transformer
Comment: The paper proposes a unified generative model for molecules and materials using Transformers, which aligns with foundational research in generative modeling and representation learning.
Relevance: 8 Novelty: 8
22. Quantitative Flow Approximation Properties of Narrow Neural ODEs
ArXiv ID: 2503.04068
Authors: Karthik Elamvazhuthi
Abstract: In this note, we revisit the problem of flow approximation properties of neural ordinary differential equations (NODEs). The approximation properties have been considered as a flow controllability problem in recent literature. The neural ODE is considered {\it narrow} when the parameters have dimension equal to the input of the neural network, and hence have limited width. We derive the relation of narrow NODEs in approximating flows of shallow but wide NODEs. Due to existing results on approximation properties of shallow neural networks, this facilitates understanding which kind of flows of dynamical systems can be approximated using narrow neural ODEs. While approximation properties of narrow NODEs have been established in literature, the proofs often involve extensive constructions or require invoking deep controllability theorems from control theory. In this paper, we provide a simpler proof technique that involves only ideas from ODEs and Gr{\"o}nwall's lemma. Moreover, we provide an estimate on the number of switches needed for the time dependent weights of the narrow NODE to mimic the behavior of a NODE with a single layer wide neural network as the velocity field.
Comment: The paper explores the approximation properties of narrow Neural ODEs, which aligns with foundational research in representation learning and training dynamics of neural networks.
Relevance: 8 Novelty: 7
23. Boosting Offline Optimizers with Surrogate Sensitivity
ArXiv ID: 2503.04181
Authors: Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Abstract: Offline optimization is an important task in numerous material engineering domains where online experimentation to collect data is too expensive and needs to be replaced by an in silico maximization of a surrogate of the black-box function. Although such a surrogate can be learned from offline data, its prediction might not be reliable outside the offline data regime, which happens when the surrogate has narrow prediction margin and is (therefore) sensitive to small perturbations of its parameterization. This raises the following questions: (1) how to regulate the sensitivity of a surrogate model; and (2) whether conditioning an offline optimizer with such less sensitive surrogate will lead to better optimization performance. To address these questions, we develop an optimizable sensitivity measurement for the surrogate model, which then inspires a sensitivity-informed regularizer that is applicable to a wide range of offline optimizers. This development is both orthogonal and synergistic to prior research on offline optimization, which is demonstrated in our extensive experiment benchmark.
Comment: The paper develops a sensitivity-informed regularizer for offline optimization, which aligns with foundational research in optimization and robustness.
Relevance: 8 Novelty: 7
24. LEWIS (LayEr WIse Sparsity) -- A Training Free Guided Model Merging Approach
ArXiv ID: 2503.03874
Authors: Hetarth Chopra, Vidhi Rambhia, Vikram Adve
Abstract: As specialized large language models (LLMs) become increasingly prevalent, model merging methods are being used to combine them to create a single multi-task model without requiring any additional data or training. However, these approaches fall short when the objective of merging is to increase the downstream model's performance on a particular task-specific benchmark. In this work, we propose LEWIS (Layer Wise Sparsity), a guided model-merging framework that uses activation-based layer importance to dynamically adjust layer-wise task-vector sparsity required for the merge process. LEWIS uses a calibration dataset to prioritize critical layers during the task-vector pruning process required for model merging. This approach guides existing merging methods by preserving essential layer-wise task-specific knowledge while ensuring the merged model performs the best at benchmarks resembling the calibration dataset. Our experiments demonstrate the effectiveness of LEWIS with performance improvements of code instruction-following and math-solving models created through model merging up to 4 percent and 11.3 percent, respectively, outperforming unguided data-less model merging approaches that use uniform-sparsity.
Comment: The paper introduces a sparsity-based model merging approach, which aligns with the model compression criterion. The method is novel in its use of layer-wise sparsity for task-specific performance improvements.
Relevance: 8 Novelty: 7
25. An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding
ArXiv ID: 2503.04667
Authors: Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
Abstract: This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.
Comment: The paper introduces a multi-task representation learning framework with theoretical principles for sufficiency and redundancy minimization, aligning with the representation learning criterion.
Relevance: 8 Novelty: 7
26. Simple Self Organizing Map with Visual Transformer
ArXiv ID: 2503.04121
Authors: Alan Luo, Kaiwen Yuan
Abstract: Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly-often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate to directly address the limitations of ViTs in limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration on how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code will be publicly available.
Comment: The paper explores combining Vision Transformers (ViTs) with Self-Organizing Maps (SOMs), which aligns with foundational research in representation learning and architectural innovations.
Relevance: 8 Novelty: 7
27. Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows
ArXiv ID: 2503.03989
Authors: Xiangxin Zhou, Yi Xiao, Haowei Lin, Xinheng He, Jiaqi Guan, Yang Wang, Qiang Liu, Feng Zhou, Liang Wang, Jianzhu Ma
Abstract: The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.
Comment: The paper proposes a generative modeling approach for protein dynamics, which aligns with foundational research in AI for Science, particularly in molecular modeling.
Relevance: 8 Novelty: 7
28. IDInit: A Universal and Stable Initialization Method for Neural Network Training
ArXiv ID: 2503.04626
Authors: Yu Pan, Chaozheng Wang, Zekai Wu, Qifan Wang, Min Zhang, Zenglin Xu
Abstract: Deep neural networks have achieved remarkable accomplishments in practice. The success of these networks hinges on effective initialization methods, which are vital for ensuring stable and rapid convergence during training. Recently, initialization methods that maintain identity transition within layers have shown good efficiency in network training. These techniques (e.g., Fixup) set specific weights to zero to achieve identity control. However, settings of remaining weight (e.g., Fixup uses random values to initialize non-zero weights) will affect the inductive bias that is achieved only by a zero weight, which may be harmful to training. Addressing this concern, we introduce fully identical initialization (IDInit), a novel method that preserves identity in both the main and sub-stem layers of residual networks. IDInit employs a padded identity-like matrix to overcome rank constraints in non-square weight matrices. Furthermore, we show the convergence problem of an identity matrix can be solved by stochastic gradient descent. Additionally, we enhance the universality of IDInit by processing higher-order weights and addressing dead neuron problems. IDInit is a straightforward yet effective initialization method, with improved convergence, stability, and performance across various settings, including large-scale datasets and deep models.
Comment: The paper proposes a novel initialization method for neural networks, which aligns with foundational research in training dynamics and stability.
Relevance: 8 Novelty: 7
29. Bi-Lipschitz Ansatz for Anti-Symmetric Functions
ArXiv ID: 2503.04263
Authors: Nadav Dym, Jianfeng Lu, Matan Mizrachi
Abstract: Motivated by applications for simulating quantum many body functions, we propose a new universal ansatz for approximating anti-symmetric functions. The main advantage of this ansatz over previous alternatives is that it is bi-Lipschitz with respect to a naturally defined metric. As a result, we are able to obtain quantitative approximation results for approximation of Lipschitz continuous antisymmetric functions. Moreover, we provide preliminary experimental evidence to the improved performance of this ansatz for learning antisymmetric functions.
Comment: The paper introduces a new ansatz for approximating anti-symmetric functions, which is relevant to representation learning due to its focus on function approximation and theoretical insights.
Relevance: 7 Novelty: 7
30. Continual Optimization with Symmetry Teleportation for Multi-Task Learning
ArXiv ID: 2503.04046
Authors: Zhipeng Zhou, Ziqiao Meng, Pengcheng Wu, Peilin Zhao, Chunyan Miao
Abstract: Multi-task learning (MTL) is a widely explored paradigm that enables the simultaneous learning of multiple tasks using a single model. Despite numerous solutions, the key issues of optimization conflict and task imbalance remain under-addressed, limiting performance. Unlike existing optimization-based approaches that typically reweight task losses or gradients to mitigate conflicts or promote progress, we propose a novel approach based on Continual Optimization with Symmetry Teleportation (COST). During MTL optimization, when an optimization conflict arises, we seek an alternative loss-equivalent point on the loss landscape to reduce conflict. Specifically, we utilize a low-rank adapter (LoRA) to facilitate this practical teleportation by designing convergent, loss-invariant objectives. Additionally, we introduce a historical trajectory reuse strategy to continually leverage the benefits of advanced optimizers. Extensive experiments on multiple mainstream datasets demonstrate the effectiveness of our approach. COST is a plug-and-play solution that enhances a wide range of existing MTL methods. When integrated with state-of-the-art methods, COST achieves superior performance.
Comment: The paper introduces a novel optimization method for multi-task learning using low-rank adapters, which aligns with foundational research in model architecture and training dynamics.
Relevance: 7 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.