Personalized Daily Arxiv Papers 03/19/2025
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 39134 | 5372 | 44506 |
| Cost | $0.09 | $0.05 | $0.15 |
Total arXiv papers: 559
Total scanned papers: 325
Total relevant papers: 29
Table of contents with paper titles:
- Improved Scalable Lipschitz Bounds for Deep Neural Networks Authors: Usman Syed, Bin Hu
- Higher-Order Graphon Neural Networks: Approximation and Cut Distance Authors: Daniel Herbst, Stefanie Jegelka
- ROCK: A variational formulation for occupation kernel methods in Reproducing Kernel Hilbert Spaces Authors: Victor Rielly, Kamel Lahouel, Chau Nguyen, Bruno Jedynak
- RWKV-7 "Goose" with Expressive Dynamic State Evolution Authors: Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Haowen Hou, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, Christian Zhou-Zheng
- Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms Authors: Xiaojian Li, Yongkang Leng, Ruiqing Ding, Hangjie Mo, Shanlin Yang
- Frac-Connections: Fractional Extension of Hyper-Connections Authors: Defa Zhu, Hongzhi Huang, Jundong Zhou, Zihao Huang, Yutao Zeng, Banggu Wu, Qiyang Min, Xun Zhou
- DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers Authors: Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, Kun Gai
- Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Sepp Hochreiter
- Learning on LLM Output Signatures for gray-box LLM Behavior Analysis Authors: Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron
- Landscape Complexity for the Empirical Risk of Generalized Linear Models: Discrimination between Structured Data Authors: Theodoros G. Tsironis, Aris L. Moustakas
- Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications Authors: Yizhou Xu, Antoine Maillard, Lenka Zdeborová, Florent Krzakala
- Revealing higher-order neural representations with generative artificial intelligence Authors: Hojjat Azimi Asrari, Megan A. K. Peters
- Fuzzy Rule-based Differentiable Representation Learning Authors: Wei Zhang, Zhaohong Deng, Guanjin Wang, Kup-Sze Choi
- Ensemble Knowledge Distillation for Machine Learning Interatomic Potentials Authors: Sakib Matin, Emily Shinkle, Yulia Pimonova, Galen T. Craven, Ying Wai Li, Kipton Barros, Nicholas Lubbers
- Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model Authors: Kai Tong, Kang Pan, Xiao Zhang, Erli Meng, Run He, Yawen Cui, Nuoyan Guo, Huiping Zhuang
- From Demonstrations to Rewards: Alignment Without Explicit Human Preferences Authors: Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor
- PENCIL: Long Thoughts with Short Memory Authors: Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
- Learning local neighborhoods of non-Gaussian graphical models: A measure transport approach Authors: Sarah Liaw, Rebecca Morrison, Youssef Marzouk, Ricardo Baptista
- FeNeC: Enhancing Continual Learning via Feature Clustering with Neighbor- or Logit-Based Classification Authors: Kamil Książek, Hubert Jastrzębski, Bartosz Trojan, Krzysztof Pniaczek, Michał Karp, Jacek Tabor
- Layer-wise Adaptive Gradient Norm Penalizing Method for Efficient and Accurate Deep Learning Authors: Sunwoo Lee
- ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Kozlov, Alexander Heinecke
- Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models Authors: Siwei Zhang, Yun Xiong, Yateng Tang, Xi Chen, Zian Jia, Zehao Gu, Jiarong Xu, Jiawei Zhang
- Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels Authors: Yujia Tong, Yuze Wang, Jingling Yuan, Chuang Hu
- Unified Analysis of Decentralized Gradient Descent: a Contraction Mapping Framework Authors: Erik G. Larsson, Nicolo Michelusi
- Positivity sets of hinge functions Authors: Josef Schicho, Ayush Kumar Tewari, Audie Warren
- End-to-End Optimal Detector Design with Mutual Information Surrogates Authors: Kinga Anna Wozniak, Stephen Mulligan, Jan Kieseler, Markus Klute, Francois Fleuret, Tobias Golling
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Shenyang Tong, Hailei Gong
- Quantification of Uncertainties in Probabilistic Deep Neural Network by Implementing Boosting of Variational Inference Authors: Pavia Bera, Sanjukta Bhanja
- On the clustering behavior of sliding windows Authors: Boris Alexeev, Wenyan Luo, Dustin G. Mixon, Yan X Zhang
1. Improved Scalable Lipschitz Bounds for Deep Neural Networks
ArXiv ID: 2503.14297
Authors: Usman Syed, Bin Hu
Abstract: Computing tight Lipschitz bounds for deep neural networks is crucial for analyzing their robustness and stability, but existing approaches either produce relatively conservative estimates or rely on semidefinite programming (SDP) formulations (namely the LipSDP condition) that face scalability issues. Building upon ECLipsE-Fast, the state-of-the-art Lipschitz bound method that avoids SDP formulations, we derive a new family of improved scalable Lipschitz bounds that can be combined to outperform ECLipsE-Fast. Specifically, we leverage more general parameterizations of feasible points of LipSDP to derive various closed-form Lipschitz bounds, avoiding the use of SDP solvers. In addition, we show that our technique encompasses ECLipsE-Fast as a special case and leads to a much larger class of scalable Lipschitz bounds for deep neural networks. Our empirical study shows that our bounds improve ECLipsE-Fast, further advancing the scalability and precision of Lipschitz estimation for large neural networks.
Comment: The paper introduces improved scalable Lipschitz bounds for deep neural networks, which directly contributes to understanding training dynamics and robustness, aligning with representation learning.
Relevance: 9 Novelty: 8
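For context on the "relatively conservative estimates" the abstract of paper 1 refers to, here is a minimal sketch of the textbook baseline: for a feed-forward network with 1-Lipschitz activations such as ReLU, the product of the layers' spectral norms is a valid (often loose) Lipschitz bound. This is not the paper's method and not ECLipsE-Fast; the helper names (`matvec`, `spectral_norm`, `naive_lipschitz_bound`, `forward`) are illustrative assumptions, and the spectral norm is only estimated by power iteration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def spectral_norm(W, iters=500, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W.

    Power iteration approaches the true value from below, so this is an
    estimate rather than a certified upper bound.
    """
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.5) for _ in range(len(W[0]))]
    Wt = [list(col) for col in zip(*W)]
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in u]
    return math.sqrt(sum(c * c for c in matvec(W, v)))


def naive_lipschitz_bound(weights):
    """Product of per-layer spectral norms (assumes 1-Lipschitz activations)."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound


def forward(weights, x):
    """Bias-free ReLU network; ReLU is applied on all but the last layer."""
    for i, W in enumerate(weights):
        x = matvec(W, x)
        if i < len(weights) - 1:
            x = [max(0.0, c) for c in x]
    return x
```

The bound ignores how activations interact with the weights across layers, which is the slack that SDP-based conditions like LipSDP, and the closed-form bounds described in the abstract, aim to reduce.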
2. Higher-Order Graphon Neural Networks: Approximation and Cut Distance
ArXiv ID: 2503.14338
Authors: Daniel Herbst, Stefanie Jegelka
Abstract: Graph limit models, like graphons for limits of dense graphs, have recently been used to study size transferability of graph neural networks (GNNs). While most literature focuses on message passing GNNs (MPNNs), in this work we attend to the more powerful higher-order GNNs. First, we extend the $k$-WL test for graphons (Böker, 2023) to the graphon-signal space and introduce signal-weighted homomorphism densities as a key tool. As an exemplary focus, we generalize Invariant Graph Networks (IGNs) to graphons, proposing Invariant Graphon Networks (IWNs) defined via a subset of the IGN basis corresponding to bounded linear operators. Even with this restricted basis, we show that IWNs of order $k$ are at least as powerful as the $k$-WL test, and we establish universal approximation results for graphon-signals in $L^p$ distances. This significantly extends the prior work of Cai & Wang (2022), showing that IWNs--a subset of their IGN-small--retain effectively the same expressivity as the full IGN basis in the limit. In contrast to their approach, our blueprint of IWNs also aligns better with the geometry of graphon space, for example facilitating comparability to MPNNs. We highlight that, while typical higher-order GNNs are discontinuous w.r.t. cut distance--which causes their lack of convergence and is inherently tied to the definition of $k$-WL--their transferability remains comparable to MPNNs.
Comment: The paper extends higher-order GNNs to graphon models, providing theoretical insights into their approximation and transferability, aligning with representation learning and emerging trends.
Relevance: 9 Novelty: 8
3. ROCK: A variational formulation for occupation kernel methods in Reproducing Kernel Hilbert Spaces
ArXiv ID: 2503.13791
Authors: Victor Rielly, Kamel Lahouel, Chau Nguyen, Bruno Jedynak
Abstract: We present a Representer Theorem result for a large class of weak formulation problems. We provide examples of applications of our formulation both in traditional machine learning and numerical methods as well as in new and emerging techniques. Finally we apply our formulation to generalize the multivariate occupation kernel (MOCK) method for learning dynamical systems from data proposing the more general Riesz Occupation Kernel (ROCK) method. Our generalized methods are both more computationally efficient and performant on most of the benchmarks we test against.
Comment: The paper presents a variational formulation for kernel methods, which is a foundational contribution to representation learning and computational efficiency.
Relevance: 9 Novelty: 8
4. RWKV-7 "Goose" with Expressive Dynamic State Evolution
ArXiv ID: 2503.14456
Authors: Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Haowen Hou, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, Christian Zhou-Zheng
Abstract: We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.
Comment: RWKV-7 introduces a novel sequence modeling architecture with theoretical insights into its capabilities, making it relevant to model architecture innovations.
Relevance: 9 Novelty: 8
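The abstract of paper 4 says RWKV-7 generalizes the delta rule. As background only, the sketch below shows the classic scalar-rate delta rule on an associative-memory state matrix: the state's current prediction for a key is nudged toward the new value. It does not include RWKV-7's vector-valued gating, in-context learning rates, or relaxed value replacement; the function names and the `beta` parameter are illustrative assumptions.

```python
def read(S, q):
    """Read from the state: returns S q, with S stored as a list of rows."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]


def delta_rule_step(S, k, v, beta):
    """Classic delta rule: S <- S + beta * (v - S k) k^T.

    With a unit-norm key and beta = 1, the old value stored under k is
    fully replaced, so reading with k afterwards returns v.
    """
    pred = read(S, k)  # what the state currently returns for key k
    return [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(S, v, pred)
    ]
```

Because the state has a fixed size, each step costs the same regardless of sequence position, which matches the constant memory and constant per-token inference time the abstract claims.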
5. Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms
ArXiv ID: 2503.13530
Authors: Xiaojian Li, Yongkang Leng, Ruiqing Ding, Hangjie Mo, Shanlin Yang
Abstract: The human-like reasoning capabilities exhibited by Large Language Models (LLMs) challenge the traditional neural network theory's understanding of the flexibility of fixed-parameter systems. This paper proposes the "Cognitive Activation" theory, revealing the essence of LLMs' reasoning mechanisms from the perspective of dynamic systems: the model's reasoning ability stems from a chaotic process of dynamic information extraction in the parameter space. By introducing the Quasi-Lyapunov Exponent (QLE), we quantitatively analyze the chaotic characteristics of the model at different layers. Experiments show that the model's information accumulation follows a nonlinear exponential law, and the Multilayer Perceptron (MLP) accounts for a higher proportion in the final output than the attention mechanism. Further experiments indicate that minor initial value perturbations will have a substantial impact on the model's reasoning ability, confirming the theoretical analysis that large language models are chaotic systems. This research provides a chaos theory framework for the interpretability of LLMs' reasoning and reveals potential pathways for balancing creativity and reliability in model design.
Comment: The paper introduces a chaos theory framework for understanding LLM reasoning, which provides theoretical insights into LLM behavior and interpretability.
Relevance: 9 Novelty: 8
6. Frac-Connections: Fractional Extension of Hyper-Connections
ArXiv ID: 2503.14125
Authors: Defa Zhu, Hongzhi Huang, Jundong Zhou, Zihao Huang, Yutao Zeng, Banggu Wu, Qiyang Min, Xun Zhou
Abstract: Residual connections are central to modern deep learning architectures, enabling the training of very deep networks by mitigating gradient vanishing. Hyper-Connections recently generalized residual connections by introducing multiple connection strengths at different depths, thereby addressing the seesaw effect between gradient vanishing and representation collapse. However, Hyper-Connections increase memory access costs by expanding the width of hidden states. In this paper, we propose Frac-Connections, a novel approach that divides hidden states into multiple parts rather than expanding their width. Frac-Connections retain partial benefits of Hyper-Connections while reducing memory consumption. To validate their effectiveness, we conduct large-scale experiments on language tasks, with the largest being a 7B MoE model trained on up to 3T tokens, demonstrating that Frac-Connections significantly outperform residual connections.
Comment: Frac-Connections propose a novel architectural improvement to residual connections, which aligns with foundational research in model architecture.
Relevance: 9 Novelty: 8
7. DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
ArXiv ID: 2503.14487
Authors: Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, Kun Gai
Abstract: Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/
Comment: DiffMoE introduces a novel MoE-based approach for diffusion models, which aligns with foundational research in model architecture and efficiency.
Relevance: 9 Novelty: 8
8. Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels
ArXiv ID: 2503.14376
Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Sepp Hochreiter
Abstract: Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels. Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs, that enables arbitrary large chunk sizes by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM. Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.
Comment: Tiled Flash Linear Attention introduces a novel kernel algorithm for efficient sequence modeling, which aligns with foundational research in model architecture and efficiency.
Relevance: 9 Novelty: 8
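To make the "chunkwise-parallel formulation" in the abstract of paper 8 concrete, the sketch below compares a step-by-step recurrent evaluation of plain (ungated, unnormalized) causal linear attention with a chunkwise evaluation that carries one state matrix per chunk. It is an assumption-laden toy in pure Python: it omits gating, the mLSTM specifics, and TFLA's additional tiling inside each chunk, and the function names are hypothetical.

```python
def recurrent_linear_attention(Q, K, V):
    """o_t = q_t^T S_t with S_t = S_{t-1} + k_t v_t^T, one step at a time."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(dk):
            for j in range(dv):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(dk)) for j in range(dv)])
    return out


def chunkwise_linear_attention(Q, K, V, chunk):
    """Same outputs, computed chunk by chunk.

    Each output is an inter-chunk term (read from the state carried over
    from earlier chunks) plus an intra-chunk causal attention term.
    """
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # state at the start of the chunk
    out = []
    for start in range(0, len(Q), chunk):
        end = min(start + chunk, len(Q))
        for t in range(start, end):
            o = [sum(Q[t][i] * S[i][j] for i in range(dk)) for j in range(dv)]
            for s in range(start, t + 1):  # causal part inside the chunk
                w = sum(a * b for a, b in zip(Q[t], K[s]))
                for j in range(dv):
                    o[j] += w * V[s][j]
            out.append(o)
        for s in range(start, end):  # update the state once per chunk
            for i in range(dk):
                for j in range(dv):
                    S[i][j] += K[s][i] * V[s][j]
    return out
```

One state is produced per chunk, so smaller chunks mean more intermediate states, which is the memory and IO cost the abstract attributes to limited chunk sizes.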
9. Learning on LLM Output Signatures for gray-box LLM Behavior Analysis
ArXiv ID: 2503.14043
Authors: Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron
Abstract: Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited, particularly in detecting data contamination and hallucinations. While recently proposed probing techniques provide insights through activation analysis, they require "white-box" access to model internals, often unavailable. Current "gray-box" approaches typically analyze only the probability of the actual tokens in the sequence with simple task-specific heuristics. Importantly, these methods overlook the rich information contained in the full token distribution at each processing step. To address these limitations, we propose that gray-box analysis should leverage the complete observable output of LLMs, consisting of both the previously used token probabilities as well as the complete token distribution sequences - a unified data type we term LOS (LLM Output Signature). To this end, we develop a transformer-based approach to process LOS that theoretically guarantees approximation of existing techniques while enabling more nuanced analysis. Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings, significantly outperforming existing baselines. Furthermore, it demonstrates strong transfer capabilities across datasets and LLMs, suggesting that LOS captures fundamental patterns in LLM behavior. Our code is available at: https://github.com/BarSGuy/LLM-Output-Signatures-Network.
Comment: The paper proposes a gray-box analysis method for LLM behavior, which aligns with foundational research in understanding LLM behavior and interpretability.
Relevance: 9 Novelty: 8
10. Landscape Complexity for the Empirical Risk of Generalized Linear Models: Discrimination between Structured Data
ArXiv ID: 2503.14403
Authors: Theodoros G. Tsironis, Aris L. Moustakas
Abstract: We use the Kac-Rice formula and results from random matrix theory to obtain the average number of critical points of a family of high-dimensional empirical loss functions, where the data are correlated $d$-dimensional Gaussian vectors, whose number has a fixed ratio with their dimension. The correlations are introduced to model the existence of structure in the data, as is common in current Machine-Learning systems. Under a technical hypothesis, our results are exact in the large-$d$ limit, and characterize the annealed landscape complexity, namely the logarithm of the expected number of critical points at a given value of the loss. We first address in detail the landscape of the loss function of a single perceptron and then generalize it to the case where two competing data sets with different covariance matrices are present, with the perceptron seeking to discriminate between them. The latter model can be applied to understand the interplay between adversity and non-trivial data structure. For completeness, we also treat the case of a loss function used in training Generalized Linear Models in the presence of correlated input data.
Comment: The paper uses random matrix theory to analyze the landscape complexity of empirical risk functions, which is foundational and relevant to understanding training dynamics in neural networks.
Relevance: 9 Novelty: 8
11. Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications
ArXiv ID: 2503.14121
Authors: Yizhou Xu, Antoine Maillard, Lenka Zdeborová, Florent Krzakala
Abstract: In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g. a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the "Gaussian equivalence" phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allows for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in [ETB+24] regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in [MTM+24] on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.
Comment: The paper provides theoretical insights into matrix sensing and Bayes-optimal learning, which aligns with foundational research in representation learning and efficiency.
Relevance: 8 Novelty: 9
12. Revealing higher-order neural representations with generative artificial intelligence
ArXiv ID: 2503.14333
Authors: Hojjat Azimi Asrari, Megan A. K. Peters
Abstract: Studies often aim to reveal how neural representations encode aspects of an observer's environment, such as its contents or structure. These are "first-order" representations (FORs), because they're "about" the external world. A less-common target is "higher-order" representations (HORs), which are "about" FORs -- their contents, stability, or uncertainty. HORs of uncertainty appear critically involved in adaptive behaviors including learning under uncertainty, influencing learning rates and internal model updating based on environmental feedback. However, HORs about uncertainty are unlikely to be direct "read-outs" of FOR characteristics, instead reflecting estimation processes which may be lossy, bias-prone, or distortive and which may also incorporate estimates of distributions of uncertainty the observer is likely to experience. While some research has targeted neural representations of "instantaneously" estimated uncertainty, how the brain represents *distributions* of expected uncertainty remains largely unexplored. Here, we propose a novel reinforcement learning (RL) based generative artificial intelligence (genAI) approach to explore neural representations of uncertainty distributions. We use existing functional magnetic resonance imaging data, where humans learned to 'de-noise' their brain states to achieve target neural patterns, to train denoising diffusion genAI models with RL algorithms to learn noise distributions similar to how humans might learn to do the same. We then explore these models' learned noise-distribution HORs compared to control models trained with traditional backpropagation. Results reveal model-dependent differences in noise distribution representations -- with the RL-based model offering much higher explanatory power for human behavior -- offering an exciting path towards using genAI to explore neural noise-distribution HORs.
Comment: The paper uses generative AI to explore higher-order neural representations, which aligns with emerging trends in representation learning and theoretical insights.
Relevance: 8 Novelty: 8
13. Fuzzy Rule-based Differentiable Representation Learning
ArXiv ID: 2503.13548
Authors: Wei Zhang, Zhaohong Deng, Guanjin Wang, Kup-Sze Choi
Abstract: Representation learning has emerged as a crucial focus in machine and deep learning, involving the extraction of meaningful and useful features and patterns from the input data, thereby enhancing the performance of various downstream tasks such as classification, clustering, and prediction. Current mainstream representation learning methods primarily rely on non-linear data mining techniques such as kernel methods and deep neural networks to extract abstract knowledge from complex datasets. However, most of these methods are black-box, lacking transparency and interpretability in the learning process, which constrains their practical utility. To this end, this paper introduces a novel representation learning method grounded in an interpretable fuzzy rule-based model. Specifically, it is built upon the Takagi-Sugeno-Kang fuzzy system (TSK-FS) to initially map input data to a high-dimensional fuzzy feature space through the antecedent part of the TSK-FS. Subsequently, a novel differentiable optimization method is proposed for the consequence part learning which can preserve the model's interpretability and transparency while further exploring the nonlinear relationships within the data. This optimization method retains the essence of traditional optimization, with certain parts of the process parameterized corresponding differentiable modules constructed, and a deep optimization process implemented. Consequently, this method not only enhances the model's performance but also ensures its interpretability. Moreover, a second-order geometry preservation method is introduced to further improve the robustness of the proposed method. Extensive experiments conducted on various benchmark datasets validate the superiority of the proposed method, highlighting its potential for advancing representation learning methodologies.
Comment: The paper introduces a fuzzy rule-based representation learning method, which is a novel and interpretable approach to feature extraction.
Relevance: 8 Novelty: 8
14. Ensemble Knowledge Distillation for Machine Learning Interatomic Potentials
ArXiv ID: 2503.14293
Authors: Sakib Matin, Emily Shinkle, Yulia Pimonova, Galen T. Craven, Ying Wai Li, Kipton Barros, Nicholas Lubbers
Abstract: Machine learning interatomic potentials (MLIPs) are a promising tool to accelerate atomistic simulations and molecular property prediction. The quality of MLIPs strongly depends on the quantity of available training data as well as the quantum chemistry (QC) level of theory used to generate that data. Datasets generated with high-fidelity QC methods, such as coupled cluster, are typically restricted to small molecules and may be missing energy gradients. With this limited quantity of data, it is often difficult to train good MLIP models. We present an ensemble knowledge distillation (EKD) method to improve MLIP accuracy when trained to energy-only datasets. In our EKD approach, first, multiple teacher models are trained to QC energies and then used to generate atomic forces for all configurations in the dataset. Next, a student MLIP is trained to both QC energies and to ensemble-averaged forces generated by the teacher models. We apply this workflow on the ANI-1ccx dataset which consists of organic molecules with configuration energies computed at the coupled cluster level of theory. The resulting student MLIPs achieve new state-of-the-art accuracy on the out-of-sample COMP6 benchmark and improved stability for molecular dynamics simulations. The EKD approach for MLIP is broadly applicable for chemical, biomolecular and materials science simulations.
Comment: The paper introduces an ensemble knowledge distillation method for improving machine learning interatomic potentials, which is relevant to foundational efficiency and representation learning methods.
Relevance: 8 Novelty: 8
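The two-stage workflow in the abstract of paper 14 (teachers trained on energies produce forces; the student fits reference energies plus ensemble-averaged teacher forces) can be sketched as a target construction and a combined loss. This is a rough illustration under stated assumptions: forces are flat lists of components, the squared-error form and the `force_weight` parameter are hypothetical choices, and no actual MLIP model is involved.

```python
def ensemble_mean_forces(teacher_forces):
    """Average force components over teachers.

    teacher_forces[m][i] is force component i predicted by teacher m.
    """
    n_teachers = len(teacher_forces)
    return [sum(col) / n_teachers for col in zip(*teacher_forces)]


def ekd_loss(e_pred, e_ref, f_pred, f_teacher_mean, force_weight=1.0):
    """Student loss: squared energy error against the reference (QC) energy
    plus a weighted mean squared error against the ensemble-averaged forces."""
    energy_term = (e_pred - e_ref) ** 2
    force_term = sum(
        (p - t) ** 2 for p, t in zip(f_pred, f_teacher_mean)
    ) / len(f_pred)
    return energy_term + force_weight * force_term
```

The force targets here come from the teachers rather than from quantum chemistry, which is what lets the approach apply to energy-only datasets as the abstract describes.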
15. Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model
ArXiv ID: 2503.13575
Authors: Kai Tong, Kang Pan, Xiao Zhang, Erli Meng, Run He, Yawen Cui, Nuoyan Guo, Huiping Zhuang
Abstract: Large Language Models (LLMs) possess encompassing capabilities that can process diverse language-related tasks. However, finetuning LLMs diminishes these general skills, and continual finetuning further causes severe degradation of accumulated knowledge. Recently, Continual Learning (CL) for LLMs has emerged, aiming to continually adapt LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either leverage previous data to replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. To address these issues, this paper proposes Analytic Subspace Routing (ASR). For each task, we isolate the learning within a subspace of deep layers' features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property of previously learned tasks with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.
Comment: The paper proposes a novel subspace routing mechanism for continual learning in LLMs, which aligns with representation learning and model architecture innovations.
Relevance: 8 Novelty: 8
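The abstract of paper 15 relies on Recursive Least Squares (RLS) so the router can be updated without revisiting old data. The sketch below is the standard textbook RLS update for a linear map, not the paper's router: `rls_init`, `rls_update`, and the regularization `gamma` are illustrative names, and the features and targets are plain lists. Because RLS reproduces the ridge-regression solution over all samples seen so far, the result does not depend on the order in which samples arrive.

```python
def rls_init(dim, n_out, gamma=1e-3):
    """Start from P = I / gamma (inverse of the regularizer) and W = 0."""
    P = [[(1.0 / gamma if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    W = [[0.0] * n_out for _ in range(dim)]
    return P, W


def rls_update(P, W, x, y):
    """One RLS step for a sample (x, y); no past samples are needed.

    P <- P - (P x)(P x)^T / (1 + x^T P x)
    W <- W + P_new x (y - W^T x)^T
    """
    dim, n_out = len(x), len(y)
    Px = [sum(P[i][j] * x[j] for j in range(dim)) for i in range(dim)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(dim))
    # P is symmetric, so x^T P equals (P x)^T.
    P_new = [[P[i][j] - Px[i] * Px[j] / denom for j in range(dim)] for i in range(dim)]
    gain = [Px[i] / denom for i in range(dim)]  # equals P_new x
    pred = [sum(W[i][c] * x[i] for i in range(dim)) for c in range(n_out)]
    err = [y[c] - pred[c] for c in range(n_out)]
    W_new = [[W[i][c] + gain[i] * err[c] for c in range(n_out)] for i in range(dim)]
    return P_new, W_new
```

Only `P` and `W` are carried between updates, which is the sense in which such a router can adapt to incoming data without access to historical data.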
16. From Demonstrations to Rewards: Alignment Without Explicit Human Preferences
ArXiv ID: 2503.13538
Authors: Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor
Abstract: One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexities of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding distinct types of data, including demonstration data and preference data. In RLHF, human preferences are typically modeled through a reward model, which serves as a proxy to guide policy learning during the reinforcement learning stage, ultimately producing a policy aligned with human preferences. However, in this paper, we propose a fresh perspective on learning alignment based on inverse reinforcement learning principles, where the optimal policy is still derived from reward maximization. Instead of relying on preference data, we directly learn the reward model from demonstration data. This new formulation offers the flexibility to be applied even when only demonstration data is available, a capability that current RLHF methods lack, and it also shows that demonstration data offers more utility than what conventional wisdom suggests. Our extensive evaluation, based on a public reward benchmark, the HuggingFace Open LLM Leaderboard and MT-Bench, demonstrates that our approach compares favorably to state-of-the-art methods that rely solely on demonstration data.
Comment: The paper introduces a novel approach to alignment using inverse reinforcement learning principles, which is relevant to foundational LLM research and offers a fresh perspective on reward modeling.
Relevance: 8 Novelty: 8
17. PENCIL: Long Thoughts with Short Memory
ArXiv ID: 2503.14337
Authors: Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
Abstract: While recent works (e.g. o1, DeepSeek R1) have demonstrated great promise of using long Chain-of-Thought (CoT) to improve reasoning capabilities of language models, scaling it up during test-time is challenging due to inefficient memory usage -- intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We propose PENCIL, which incorporates a reduction mechanism into the autoregressive generation process, allowing the model to recursively clean up intermediate thoughts based on patterns learned from training. With this reduction mechanism, PENCIL significantly reduces the maximal context length required during generation, and thus can generate longer thoughts with limited memory, solving larger-scale problems given more thinking time. For example, we demonstrate PENCIL achieves 97% accuracy on the challenging Einstein's puzzle -- a task even large models like GPT-4 struggle with -- using only a small 25M-parameter transformer with 2048 context length. Theoretically, we prove PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity, and thus can solve arbitrary computational tasks that would otherwise be intractable given context window constraints.
Comment: PENCIL introduces a novel reduction mechanism for autoregressive generation, which aligns with foundational research in model architecture and efficiency improvements.
Relevance: 8 Novelty: 8
18. Learning local neighborhoods of non-Gaussian graphical models: A measure transport approach
ArXiv ID: 2503.13899
Authors: Sarah Liaw, Rebecca Morrison, Youssef Marzouk, Ricardo Baptista
Abstract: Identifying the Markov properties or conditional independencies of a collection of random variables is a fundamental task in statistics for modeling and inference. Existing approaches often learn the structure of a probabilistic graphical model, which encodes these dependencies, by assuming that the variables follow a distribution with a simple parametric form. Moreover, the computational cost of many algorithms scales poorly for high-dimensional distributions, as they need to estimate all the edges in the graph simultaneously. In this work, we propose a scalable algorithm to infer the conditional independence relationships of each variable by exploiting the local Markov property. The proposed method, named Localized Sparsity Identification for Non-Gaussian Distributions (L-SING), estimates the graph by using flexible classes of transport maps to represent the conditional distribution for each variable. We show that L-SING includes existing approaches, such as neighborhood selection with Lasso, as a special case. We demonstrate the effectiveness of our algorithm in both Gaussian and non-Gaussian settings by comparing it to existing methods. Lastly, we show the scalability of the proposed approach by applying it to high-dimensional non-Gaussian examples, including a biological dataset with more than 150 variables.
Comment: The paper introduces a novel method (L-SING) for identifying conditional independence relationships in non-Gaussian graphical models, leveraging sparsity and transport maps. This aligns with the representation learning criterion, particularly in sparse methods and training dynamics.
Relevance: 8 Novelty: 7
19. FeNeC: Enhancing Continual Learning via Feature Clustering with Neighbor- or Logit-Based Classification
ArXiv ID: 2503.14301
Authors: Kamil Książek, Hubert Jastrzębski, Bartosz Trojan, Krzysztof Pniaczek, Michał Karp, Jacek Tabor
Abstract: The ability of deep learning models to learn continuously is essential for adapting to new data categories and evolving data distributions. In recent years, approaches leveraging frozen feature extractors after an initial learning phase have been extensively studied. Many of these methods estimate per-class covariance matrices and prototypes based on backbone-derived feature representations. Within this paradigm, we introduce FeNeC (Feature Neighborhood Classifier) and FeNeC-Log, its variant based on the log-likelihood function. Our approach generalizes the existing concept by incorporating data clustering to capture greater intra-class variability. Utilizing the Mahalanobis distance, our models classify samples either through a nearest neighbor approach or trainable logit values assigned to consecutive classes. Our proposition may be reduced to the existing approaches in a special case while extending them with the ability of more flexible adaptation to data. We demonstrate that two FeNeC variants achieve competitive performance in scenarios where task identities are unknown and establish state-of-the-art results on several benchmarks.
Comment: The paper introduces a novel clustering-based approach for continual learning, which aligns with representation learning and foundational methods for adapting to evolving data.
Relevance: 8 Novelty: 7
20. Layer-wise Adaptive Gradient Norm Penalizing Method for Efficient and Accurate Deep Learning
ArXiv ID: 2503.14205
Authors: Sunwoo Lee
Abstract: Sharpness-aware minimization (SAM) is known to improve the generalization performance of neural networks. However, it is not widely used in real-world applications yet due to its expensive model perturbation cost. A few variants of SAM have been proposed to tackle such an issue, but they commonly do not alleviate the cost noticeably. In this paper, we propose a lightweight layer-wise gradient norm penalizing method that tackles the expensive computational cost of SAM while maintaining its superior generalization performance. Our study empirically proves that the gradient norm of the whole model can be effectively suppressed by penalizing the gradient norm of only a few critical layers. We also theoretically show that such a partial model perturbation does not harm the convergence rate of SAM, allowing them to be safely adapted in real-world applications. To demonstrate the efficacy of the proposed method, we perform extensive experiments comparing the proposed method to mini-batch SGD and the conventional SAM using representative computer vision and language modeling benchmarks.
Comment: The paper proposes a layer-wise gradient norm penalizing method to improve computational efficiency, which aligns with model compression and training dynamics.
Relevance: 8 Novelty: 7
21. ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
ArXiv ID: 2503.13565
Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Kozlov, Alexander Heinecke
Abstract: Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy over the 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as "draft" to generate the next few tokens and use the "target" large model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance ratio of the draft-generated tokens and the relative token throughput of the draft versus the target model. Nevertheless, an efficient SD pipeline requires pre-training and aligning the draft model to the target model, making it impractical for LLM inference in a plug-and-play fashion. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline. Then we pursue an opportunity for further acceleration: the MXFP4 draft token generation itself can be accelerated via speculative decoding by using yet another smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts since it recursively applies speculation for accelerating the draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts we outperform state-of-the-art speculative decoding, yielding speedups up to 2.72x over the BF16 baseline.
Comment: The paper introduces multi-level speculative decoding with quantized drafts, which aligns with model compression and efficiency improvements.
Relevance: 8 Novelty: 7
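The draft-then-verify loop described in the abstract above can be sketched with toy models. This is a generic greedy speculative decoding illustration, not ML-SpecQD itself: there is no MXFP4 quantization and no multi-level drafting here, and `target_model`, `draft_model`, and the window size `k` are hypothetical stand-ins. What it does show is why the output matches target-only decoding and how the acceptance ratio arises.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_model(context: List[int]) -> int:
    """Toy 'large' model: next token is the sum of the last two tokens mod 10."""
    return (context[-1] + context[-2]) % 10


def draft_model(context: List[int]) -> int:
    """Toy 'cheap' model: agrees with the target unless the last token is a multiple of 3."""
    if context[-1] % 3 == 0:
        return (context[-1] + 1) % 10
    return (context[-1] + context[-2]) % 10


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], float]:
    """Return the decoded sequence and the fraction of draft tokens accepted."""
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = proposed = 0
    while len(out) < goal:
        # 1) The draft proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        proposed += len(proposal)
        # 2) The target checks each proposed token (a real system batches this in one pass).
        for tok in proposal:
            expected = target(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                out.append(expected)  # replace the first mismatch with the target's token
                break
        else:
            # Every draft token was accepted: take one extra token from the target.
            if len(out) < goal:
                out.append(target(out))
    rate = accepted / proposed if proposed else 0.0
    return out, rate


spec_out, acceptance = speculative_decode(target_model, draft_model, [1, 2], n_new=20)
print(spec_out, acceptance)
```

Every token appended is either confirmed or produced by the target, so the result is identical to `greedy_decode(target_model, [1, 2], 20)`; the draft only affects how many target checks are wasted.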
22. Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models
ArXiv ID: 2503.14411
Authors: Siwei Zhang, Yun Xiong, Yateng Tang, Xi Chen, Zian Jia, Zehao Gu, Jiarong Xu, Jiawei Zhang
Abstract: Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that biasedly prioritize structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures for synergistic reinforcement. To tackle these issues, we present Cross, a novel framework that seamlessly extends existing TGNNs for TTAG modeling. The key idea is to employ the advanced large language models (LLMs) to extract the dynamic semantics in text space and then generate expressive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the Cross framework, which empowers the LLM to offer the temporal semantic understanding of node's evolving contexts of textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor for synthesizing illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experimental results on four public datasets and one practical industrial dataset demonstrate Cross's significant effectiveness and robustness.
Comment: The paper presents a framework for unifying text semantics and graph structures using LLMs, which aligns with representation learning and explores the interplay between semantics and structures.
Relevance: 8 Novelty: 7
23. Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels
ArXiv ID: 2503.13917
Authors: Yujia Tong, Yuze Wang, Jingling Yuan, Chuang Hu
Abstract: Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation, yet raises critical challenges for implementing machine unlearning (MU) under data privacy regulations. Existing MU methods designed for full-precision models fail to address two fundamental limitations in quantized networks: 1) Noise amplification from label mismatch during data processing, and 2) Gradient imbalance between forgotten and retained data during training. These issues are exacerbated by quantized models' constrained parameter space and discrete optimization. We propose Q-MUL, the first dedicated unlearning framework for quantized models. Our method introduces two key innovations: 1) Similar Labels assignment replaces random labels with semantically consistent alternatives to minimize noise injection, and 2) Adaptive Gradient Reweighting dynamically aligns parameter update contributions from forgotten and retained data. Through systematic analysis of quantized model vulnerabilities, we establish theoretical foundations for these mechanisms. Extensive evaluations on benchmark datasets demonstrate Q-MUL's superiority over existing approaches.
Comment: Q-MUL addresses machine unlearning for quantized models, which is relevant to model compression and efficiency improvements.
Relevance: 8 Novelty: 7
24. Unified Analysis of Decentralized Gradient Descent: a Contraction Mapping Framework
ArXiv ID: 2503.14353
Authors: Erik G. Larsson, Nicolo Michelusi
Abstract: The decentralized gradient descent (DGD) algorithm, and its sibling, diffusion, are workhorses in decentralized machine learning, distributed inference and estimation, and multi-agent coordination. We propose a novel, principled framework for the analysis of DGD and diffusion for strongly convex, smooth objectives, and arbitrary undirected topologies, using contraction mappings coupled with a result called the mean Hessian theorem (MHT). The use of these tools yields tight convergence bounds, both in the noise-free and noisy regimes. While these bounds are qualitatively similar to results found in the literature, our approach using contractions together with the MHT decouples the algorithm dynamics (how quickly the algorithm converges to its fixed point) from its asymptotic convergence properties (how far the fixed point is from the global optimum). This yields a simple, intuitive analysis that is accessible to a broader audience. Extensions are provided to multiple local gradient updates, time-varying step sizes, noisy gradients (stochastic DGD and diffusion), communication noise, and random topologies.
Comment: The paper provides a novel contraction mapping framework for decentralized gradient descent, which is relevant to emerging trends in foundational optimization research.
Relevance: 8 Novelty: 7
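The distinction the abstract above draws, between how fast DGD reaches its fixed point and how far that fixed point sits from the global optimum, can be seen on a toy problem. The sketch below is an assumption-laden illustration rather than the paper's analysis: four agents on a ring each hold a scalar quadratic f_i(x) = 0.5 (x - b_i)^2, and the mixing matrix `W`, the targets `b`, and the step sizes are all made up for the example.

```python
from typing import List


def dgd(b: List[float], W: List[List[float]], alpha: float, iters: int = 5000) -> List[float]:
    """Decentralized gradient descent: mix with neighbors, then take a local gradient step."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        mixed = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Gradient of f_i(x) = 0.5 * (x - b_i)^2 evaluated at the agent's own iterate.
        x = [mixed[i] - alpha * (x[i] - b[i]) for i in range(n)]
    return x


# Ring of 4 agents; W is symmetric and doubly stochastic (hypothetical weights).
W = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
b = [1.0, 2.0, 3.0, 6.0]
optimum = sum(b) / len(b)  # minimizer of the sum of the local objectives


def gap(x: List[float]) -> float:
    """Largest distance of any agent's iterate from the global optimum."""
    return max(abs(v - optimum) for v in x)


x_large_step = dgd(b, W, alpha=0.1)
x_small_step = dgd(b, W, alpha=0.01)
print(gap(x_large_step), gap(x_small_step))
```

With a constant step size the agents settle at a fixed point that is close to, but not exactly at, the optimum; shrinking `alpha` shrinks that residual gap at the cost of slower convergence.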
25. Positivity sets of hinge functions
ArXiv ID: 2503.13512
Authors: Josef Schicho, Ayush Kumar Tewari, Audie Warren
Abstract: In this paper we investigate which subsets of the real plane are realisable as the set of points on which a one-layer ReLU neural network takes a positive value. In the case of cones we give a full characterisation of such sets. Furthermore, we give a necessary condition for any subset of $\mathbb R^d$. We give various examples of such one-layer neural networks.
Comment: The paper investigates the geometry of ReLU activation functions, which provides insights into neural network behavior and representation learning.
Relevance: 8 Novelty: 6
26. End-to-End Optimal Detector Design with Mutual Information Surrogates
ArXiv ID: 2503.14342
Authors: Kinga Anna Wozniak, Stephen Mulligan, Jan Kieseler, Markus Klute, Francois Fleuret, Tobias Golling
Abstract: We introduce a novel approach for end-to-end black-box optimization of high energy physics (HEP) detectors using local deep learning (DL) surrogates. These surrogates approximate a scalar objective function that encapsulates the complex interplay of particle-matter interactions and physics analysis goals. In addition to a standard reconstruction-based metric commonly used in the field, we investigate the information-theoretic metric of mutual information. Unlike traditional methods, mutual information is inherently task-agnostic, offering a broader optimization paradigm that is less constrained by predefined targets. We demonstrate the effectiveness of our method in a realistic physics analysis scenario: optimizing the thicknesses of calorimeter detector layers based on simulated particle interactions. The surrogate model learns to approximate objective gradients, enabling efficient optimization with respect to energy resolution. Our findings reveal three key insights: (1) end-to-end black-box optimization using local surrogates is a practical and compelling approach for detector design, providing direct optimization of detector parameters in alignment with physics analysis goals; (2) mutual information-based optimization yields design choices that closely match those from state-of-the-art physics-informed methods, indicating that these approaches operate near optimality and reinforcing their reliability in HEP detector design; and (3) information-theoretic methods provide a powerful, generalizable framework for optimizing scientific instruments. By reframing the optimization process through an information-theoretic lens rather than domain-specific heuristics, mutual information enables the exploration of new avenues for discovery beyond conventional approaches.
Comment: The paper focuses on mutual information as a task-agnostic optimization metric, which aligns with representation learning and foundational insights into optimization frameworks.
Relevance: 7 Novelty: 7
27. Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
ArXiv ID: 2503.13551
Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Shenyang Tong, Hailei Gong
Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate steps. In this paper, we propose a novel reward model approach, Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM performs better in assessing reasoning coherence and self-reflection, particularly when the previous reasoning step is incorrect. Furthermore, to address the inefficiency of autonomously generating PRM training data via Monte Carlo Tree Search (MCTS), we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC) based on node merging (combining two consecutive reasoning steps into one step) in the tree structure. This approach diversifies MCTS results for HRM with negligible computational overhead, enhancing label robustness by introducing noise. Empirical results on the PRM800K dataset demonstrate that HRM, in conjunction with HNC, achieves superior stability and reliability in evaluation compared to PRM. Furthermore, cross-domain evaluations on MATH500 and GSM8K confirm HRM's superior generalization and robustness across diverse reasoning tasks. The code for all experiments will be released at https://github.com/tengwang0318/hierarchial_reward_model.
Comment: The paper proposes a hierarchical reward model for reasoning in LLMs, which aligns with foundational insights into improving reasoning capabilities in LLMs.
Relevance: 7 Novelty: 7
28. Quantification of Uncertainties in Probabilistic Deep Neural Network by Implementing Boosting of Variational Inference
ArXiv ID: 2503.13909
Authors: Pavia Bera, Sanjukta Bhanja
Abstract: Modern neural network architectures have achieved remarkable accuracies but remain highly dependent on their training data, often lacking interpretability in their learned mappings. While effective on large datasets, they tend to overfit on smaller ones. Probabilistic neural networks, such as those utilizing variational inference, address this limitation by incorporating uncertainty estimation through weight distributions rather than point estimates. However, standard variational inference often relies on a single-density approximation, which can lead to poor posterior estimates and hinder model performance. We propose Boosted Bayesian Neural Networks (BBNN), a novel approach that enhances neural network weight distribution approximations using Boosting Variational Inference (BVI). By iteratively constructing a mixture of densities, BVI expands the approximating family, enabling a more expressive posterior that leads to improved generalization and uncertainty estimation. While this approach increases computational complexity, it significantly enhances accuracy, an essential tradeoff, particularly in high-stakes applications such as medical diagnostics, where false negatives can have severe consequences. Our experimental results demonstrate that BBNN achieves ~5% higher accuracy compared to conventional neural networks while providing superior uncertainty quantification. This improvement highlights the effectiveness of leveraging a mixture-based variational family to better approximate the posterior distribution, ultimately advancing probabilistic deep learning.
Comment: The paper introduces Boosted Bayesian Neural Networks for better uncertainty quantification, which aligns with representation learning and training dynamics in neural networks.
Relevance: 7 Novelty: 7
29. On the clustering behavior of sliding windows
ArXiv ID: 2503.14393
Authors: Boris Alexeev, Wenyan Luo, Dustin G. Mixon, Yan X Zhang
Abstract: Things can go spectacularly wrong when clustering timeseries data that has been preprocessed with a sliding window. We highlight three surprising failures that emerge depending on how the window size compares with the timeseries length. In addition to computational examples, we present theoretical explanations for each of these failure modes.
Comment: The paper provides theoretical insights into clustering behavior with sliding windows, which could be relevant for emerging trends in foundational research.
Relevance: 7 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
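As an illustration of the JSONL response format requested above, here is a small sketch of how such output could be parsed and checked. The helper name `parse_scores` and the validation rules are assumptions, not part of the prompt; the two sample records reuse the IDs, comments, and scores of papers 17 and 25 from this digest.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def parse_scores(jsonl_text: str) -> List[Dict]:
    """Parse one JSON object per non-empty line and check the four expected fields."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        for key in ("RELEVANCE", "NOVELTY"):
            if not 1 <= int(record[key]) <= 10:
                raise ValueError(f"line {line_no}: {key} must be in 1-10")
        records.append(record)
    return records


sample = "\n".join([
    json.dumps({
        "ARXIVID": "2503.14337",
        "COMMENT": "PENCIL introduces a novel reduction mechanism for autoregressive generation, "
                   "which aligns with foundational research in model architecture and efficiency improvements.",
        "RELEVANCE": 8,
        "NOVELTY": 8,
    }),
    json.dumps({
        "ARXIVID": "2503.13512",
        "COMMENT": "The paper investigates the geometry of ReLU activation functions, "
                   "which provides insights into neural network behavior and representation learning.",
        "RELEVANCE": 8,
        "NOVELTY": 6,
    }),
])
papers = parse_scores(sample)
print([(p["ARXIVID"], p["RELEVANCE"], p["NOVELTY"]) for p in papers])
```

Keeping one object per line means a malformed record can be reported by line number without discarding the rest of the response.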
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
  - Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  - Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
- Relevance 7-8 (Relevant)
  - Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  - Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
- Relevance 5-6 (Borderline)
  - Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  - Examples: Work referencing MoE centered on reinforcement learning.
- Relevance 3-4 (Irrelevant)
  - Focus: Largely outside our interests with no association to our topics.
  - Examples: Application-focused papers like using MoE to solve a problem in the real world.
- Relevance 1-2 (Ignore)
  - Focus: Purely unrelated to our topics. Completely a different domain.
  - Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
  - Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  - Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
- Novelty 7-8 (Improvements)
  - Definition: Substantial insights/enhancements, though not a full paradigm shift.
  - Examples: Modifications on existing methods yielding significantly better results.
- Novelty 5-6 (Borderline)
  - Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  - Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
- Novelty 3-4 (Tangential)
  - Definition: Minor or domain-specific improvements with limited broader impact.
  - Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
- Novelty 1-2 (Low)
  - Definition: Minimal originality, applying standard approaches without real innovation.
  - Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
- Representation Learning
  - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
  - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
- Model Architecture
  - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations.
  - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
- Model Compression
  - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
  - Irrelevant: Straightforward applications of existing compression methods to new tasks.
- Large Language Models (LLMs)
  - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability.
  - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
- AI for Science
  - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
  - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
- Emerging Trends
  - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
  - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.