Previous Day 2025-08-29
Monthly Overview 2025-09
Next Day 2025-09-03

Personalized Daily ArXiv Papers 2025-09-01

[gpt-4o] Prompt Completion Total
Token 17129 1713 18842
Cost $0.04 $0.02 $0.06

Total arXiv papers: 337

Total scanned papers: 208

Total relevant papers: 11

Table of contents with paper titles:

  1. Adaptive Heavy-Tailed Stochastic Gradient Descent Authors: Bodu Gong, Gustavo Enrique Batista, Pierre Lafaye de Micheaux

  2. Distribution-Aware Feature Selection for SAEs Authors: Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah

  3. Normalized Maximum Likelihood Code-Length on Riemannian Manifold Data Spaces Authors: Kota Fukuzawa, Atsushi Suzuki, Kenji Yamanishi

  4. NSPDI-SNN: An efficient lightweight SNN based on nonlinear synaptic pruning and dendritic integration Authors: Wuque Cai, Hongze Sun, Jiayi He, Qianqian Liao, Yunliang Zang, Duo Chen, Dezhong Yao, Daqing Guo

  5. RelP: Faithful and Efficient Circuit Discovery via Relevance Patching Authors: Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda

  6. Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations Authors: Ha Min Son, Zhe Zhao, Shahbaz Rezaei, Xin Liu

  7. MoE-Health: A Mixture of Experts Framework for Robust Multimodal Healthcare Prediction Authors: Xiaoyang Wang, Christopher C. Yang

  8. Weighted Support Points from Random Measures: An Interpretable Alternative for Generative Modeling Authors: Peiqi Zhao, Carlos E. Rodr\'iguez, Rams\'es H. Mena, Stephen G. Walker

  9. A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou

  10. Rethinking Layer-wise Model Merging through Chain of Merges Authors: Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara

  11. Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks Authors: Matteo Pinna, Andrea Ceni, Claudio Gallicchio


1. Adaptive Heavy-Tailed Stochastic Gradient Descent

ArXiv ID: 2508.21353

Authors: Bodu Gong, Gustavo Enrique Batista, Pierre Lafaye de Micheaux

Abstract: In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One key insight widely accepted in the machine learning community is the idea that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters. In contrast, sharp minima are typically more sensitive and less stable. Motivated by two key empirical observations - the inherent heavy-tailed distribution of gradient noise in stochastic gradient descent and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau, we introduce Adaptive Heavy Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of injected noise into an optimizer based on the Edge of Stability phenomenon. AHTSGD consistently outperforms SGD and other noise-based methods on benchmarks like MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. It ultimately accelerates early training from poor initializations and improves generalization across clean and noisy settings, remaining robust to learning rate choices.

Comment: The paper introduces Adaptive Heavy Tailed Stochastic Gradient Descent (AHTSGD), which is relevant to representation learning as it provides insights into training dynamics and optimization in neural networks.

Relevance: 9 Novelty: 8


2. Distribution-Aware Feature Selection for SAEs

ArXiv ID: 2508.21324

Authors: Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah

Abstract: Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry more information than others. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an "activation lottery," where rare high-magnitude features crowd out more informative but lower-magnitude ones. To address this issue, we introduce Sampled-SAE: we score the columns (representing features) of the batch activation matrix (via $L_2$ norm or entropy), forming a candidate pool of size $Kl$, and then apply Top-$K$ to select tokens across the batch from the restricted pool of features. Varying $l$ traces a spectrum between batch-level and token-specific selection. At $l=1$, tokens draw only from $K$ globally influential features, while larger $l$ expands the pool toward standard BatchTopK and more token-specific features across the batch. Small $l$ thus enforces global consistency; large $l$ favors fine-grained reconstruction. On Pythia-160M, no single value optimizes $l$ across all metrics: the best choice depends on the trade-off between shared structure, reconstruction fidelity, and downstream performance. Sampled-SAE thus reframes BatchTopK as a tunable, distribution-aware family.

Comment: The paper discusses Distribution-Aware Feature Selection for Sparse Autoencoders (SAEs), which is relevant to representation learning and model compression as it involves sparse methods and feature selection.

Relevance: 9 Novelty: 8


3. Normalized Maximum Likelihood Code-Length on Riemannian Manifold Data Spaces

ArXiv ID: 2508.21466

Authors: Kota Fukuzawa, Atsushi Suzuki, Kenji Yamanishi

Abstract: In recent years, with the large-scale expansion of graph data, there has been an increased focus on Riemannian manifold data spaces other than Euclidean space. In particular, the development of hyperbolic spaces has been remarkable, and they have high expressive power for graph data with hierarchical structures. Normalized Maximum Likelihood (NML) is employed in regret minimization and model selection. However, existing formulations of NML have been developed primarily in Euclidean spaces and are inherently dependent on the choice of coordinate systems, making it non-trivial to extend NML to Riemannian manifolds. In this study, we define a new NML that reflects the geometric structure of Riemannian manifolds, called the Riemannian manifold NML (Rm-NML). This Rm-NML is invariant under coordinate transformations and coincides with the conventional NML under the natural parameterization in Euclidean space. We extend existing computational techniques for NML to the setting of Riemannian manifolds. Furthermore, we derive a method to simplify the computation of Rm-NML on Riemannian symmetric spaces, which encompass data spaces of growing interest such as hyperbolic spaces. To illustrate the practical application of our proposed method, we explicitly computed the Rm-NML for normal distributions on hyperbolic spaces.

Comment: The paper extends the concept of Normalized Maximum Likelihood to Riemannian manifolds, which is a significant theoretical contribution to model selection and representation learning in non-Euclidean spaces.

Relevance: 9 Novelty: 8


4. NSPDI-SNN: An efficient lightweight SNN based on nonlinear synaptic pruning and dendritic integration

ArXiv ID: 2508.21566

Authors: Wuque Cai, Hongze Sun, Jiayi He, Qianqian Liao, Yunliang Zang, Duo Chen, Dezhong Yao, Daqing Guo

Abstract: Spiking neural networks (SNNs) are artificial neural networks based on simulated biological neurons and have attracted much attention in recent artificial intelligence technology studies. The dendrites in biological neurons have efficient information processing ability and computational power; however, the neurons of SNNs rarely match the complex structure of the dendrites. Inspired by the nonlinear structure and highly sparse properties of neuronal dendrites, in this study, we propose an efficient, lightweight SNN method with nonlinear pruning and dendritic integration (NSPDI-SNN). In this method, we introduce nonlinear dendritic integration (NDI) to improve the representation of the spatiotemporal information of neurons. We implement heterogeneous state transition ratios of dendritic spines and construct a new and flexible nonlinear synaptic pruning (NSP) method to achieve the high sparsity of SNN. We conducted systematic experiments on three benchmark datasets (DVS128 Gesture, CIFAR10-DVS, and CIFAR10) and extended the evaluation to two complex tasks (speech recognition and reinforcement learning-based maze navigation task). Across all tasks, NSPDI-SNN consistently achieved high sparsity with minimal performance degradation. In particular, our method achieved the best experimental results on all three event stream datasets. Further analysis showed that NSPDI significantly improved the efficiency of synaptic information transfer as sparsity increased. In conclusion, our results indicate that the complex structure and nonlinear computation of neuronal dendrites provide a promising approach for developing efficient SNN methods.

Comment: The paper introduces a novel method for spiking neural networks (SNNs) focusing on nonlinear synaptic pruning and dendritic integration, which aligns with model compression and representation learning through sparsity and pruning.

Relevance: 9 Novelty: 8


5. RelP: Faithful and Efficient Circuit Discovery via Relevance Patching

ArXiv ID: 2508.21258

Authors: Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda

Abstract: Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network's output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.

Comment: The paper introduces Relevance Patching (RelP), a method improving mechanistic interpretability in neural networks, which is relevant to representation learning and theoretical insights into model behavior.

Relevance: 9 Novelty: 8


6. Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations

ArXiv ID: 2508.21769

Authors: Ha Min Son, Zhe Zhao, Shahbaz Rezaei, Xin Liu

Abstract: Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP `forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.

Comment: The paper focuses on enhancing domain awareness in foundational models like CLIP, which aligns with representation learning and foundational model insights. It introduces a novel approach to disentangle classification from domain-aware representations, which is a significant theoretical contribution.

Relevance: 9 Novelty: 8


7. MoE-Health: A Mixture of Experts Framework for Robust Multimodal Healthcare Prediction

ArXiv ID: 2508.21793

Authors: Xiaoyang Wang, Christopher C. Yang

Abstract: Healthcare systems generate diverse multimodal data, including Electronic Health Records (EHR), clinical notes, and medical images. Effectively leveraging this data for clinical prediction is challenging, particularly as real-world samples often present with varied or incomplete modalities. Existing approaches typically require complete modality data or rely on manual selection strategies, limiting their applicability in real-world clinical settings where data availability varies across patients and institutions. To address these limitations, we propose MoE-Health, a novel Mixture of Experts framework designed for robust multimodal fusion in healthcare prediction. MoE-Health architecture is specifically developed to handle samples with differing modalities and improve performance on critical clinical tasks. By leveraging specialized expert networks and a dynamic gating mechanism, our approach dynamically selects and combines relevant experts based on available data modalities, enabling flexible adaptation to varying data availability scenarios. We evaluate MoE-Health on the MIMIC-IV dataset across three critical clinical prediction tasks: in-hospital mortality prediction, long length of stay, and hospital readmission prediction. Experimental results demonstrate that MoE-Health achieves superior performance compared to existing multimodal fusion methods while maintaining robustness across different modality availability patterns. The framework effectively integrates multimodal information, offering improved predictive performance and robustness in handling heterogeneous and incomplete healthcare data, making it particularly suitable for deployment in diverse healthcare environments with heterogeneous data availability.

Comment: The paper presents MoE-Health, a Mixture of Experts framework for multimodal healthcare prediction, which is relevant to model architecture as it involves MoE and dynamic networks.

Relevance: 9 Novelty: 7


8. Weighted Support Points from Random Measures: An Interpretable Alternative for Generative Modeling

ArXiv ID: 2508.21255

Authors: Peiqi Zhao, Carlos E. Rodr\'iguez, Rams\'es H. Mena, Stephen G. Walker

Abstract: Support points summarize a large dataset through a smaller set of representative points that can be used for data operations, such as Monte Carlo integration, without requiring access to the full dataset. In this sense, support points offer a compact yet informative representation of the original data. We build on this idea to introduce a generative modeling framework based on random weighted support points, where the randomness arises from a weighting scheme inspired by the Dirichlet process and the Bayesian bootstrap. The proposed method generates diverse and interpretable sample sets from a fixed dataset, without relying on probabilistic modeling assumptions or neural network architectures. We present the theoretical formulation of the method and develop an efficient optimization algorithm based on the Convex--Concave Procedure (CCP). Empirical results on the MNIST and CelebA-HQ datasets show that our approach produces high-quality and diverse outputs at a fraction of the computational cost of black-box alternatives such as Generative Adversarial Networks (GANs) or Denoising Diffusion Probabilistic Models (DDPMs). These results suggest that random weighted support points offer a principled, scalable, and interpretable alternative for generative modeling. A key feature is their ability to produce genuinely interpolative samples that preserve underlying data structure.

Comment: The paper introduces a new generative modeling framework using random weighted support points, which offers a compact representation of data. This aligns with representation learning and introduces a novel method that is interpretable and efficient, providing a theoretical contribution to generative modeling.

Relevance: 8 Novelty: 8


9. A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

ArXiv ID: 2508.21148

Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou

Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.

Comment: The survey on Scientific Large Language Models (Sci-LLMs) provides a comprehensive overview of foundational models in scientific research, focusing on data-centric challenges and evaluation protocols, which aligns with the interest in foundational research in LLMs.

Relevance: 8 Novelty: 7


10. Rethinking Layer-wise Model Merging through Chain of Merges

ArXiv ID: 2508.21421

Authors: Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara

Abstract: Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized modules in-creases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques often rely on interference heuristics,importance weighting, or activation matching while treating each layer independently, thereby failing to account for the inter-layer dependencies inherent in deep networks. This simplification leads to distributional mismatches, especially inactivation-based methods, when changes in early layers are not properly reflected in downstream ones. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural networks training. To address it, we propose Chain of Merges (CoM), a layer-wise merging procedure that updates activation statistics in an auto-regressive fashion, explicitly accounting for cross-layer interactions. CoM produces a coherent merged model through a series of conditionally optimal updates, effectively mitigating degradation caused by covariate shift. Experiments on standard bench-marks demonstrate that CoM achieves state-of-the-art performance.

Comment: The paper proposes a novel method for model merging that accounts for inter-layer dependencies, which is relevant to model architecture and efficiency. It introduces a new approach to mitigate internal covariate shift, showing substantial insights.

Relevance: 8 Novelty: 7


11. Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks

ArXiv ID: 2508.21172

Authors: Matteo Pinna, Andrea Ceni, Claudio Gallicchio

Abstract: Echo State Networks (ESNs) are a particular type of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) framework, popular for their fast and efficient learning. However, traditional ESNs often struggle with long-term information processing. In this paper, we introduce a novel class of deep untrained RNNs based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). We show that leveraging a hierarchy of untrained residual recurrent layers significantly boosts memory capacity and long-term temporal modeling. For the temporal residual connections, we consider different orthogonal configurations, including randomly generated and fixed-structure configurations, and we study their effect on network dynamics. A thorough mathematical analysis outlines necessary and sufficient conditions to ensure stable dynamics within DeepResESN. Our experiments on a variety of time series tasks showcase the advantages of the proposed approach over traditional shallow and deep RC.

Comment: The paper presents Deep Residual Echo State Networks, exploring residual orthogonal connections in untrained RNNs, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.