Previous Day 2025-04-21
Monthly Overview 2025-04
Next Day 2025-04-23

Personalized Daily ArXiv Papers 2025-04-22

[gpt-4o] Prompt Completion Total
Token 51438 7285 58723
Cost $0.13 $0.07 $0.2

Total arXiv papers: 791

Total scanned papers: 462

Total relevant papers: 29

Table of contents with paper titles:

  1. MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core Authors: Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang

  2. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction Authors: Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan

  3. Compute-Optimal LLMs Provably Generalize Better With Scale Authors: Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J. Zico Kolter, Andrew Gordon Wilson

  4. Quantitative Clustering in Mean-Field Transformer Models Authors: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

  5. Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs Authors: Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello

  6. Density Measures for Language Generation Authors: Jon Kleinberg, Fan Wei

  7. AI for the Open-World: the Learning Principles Authors: Jianyu Zhang

  8. Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing Authors: Mehmet Yama\c{c}, Muhammad Numan Yousaf, Serkan Kiranyaz, Moncef Gabbouj

  9. Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator Authors: Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna

  10. CacheFormer: High Attention-Based Segment Caching Authors: Sushant Singh, Ausif Mahmood

  11. On Learning Parallel Pancakes with Mostly Uniform Weights Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Jasper C. H. Lee, Thanasis Pittas

  12. Data Selection for ERMs Authors: Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff

  13. Semi-parametric Memory Consolidation: Towards Brain-like Deep Continual Learning Authors: Geng Liu, Fei Zhu, Rong Feng, Zhiqiang Yi, Shiqi Wang, Gaofeng Meng, Zhaoxiang Zhang

  14. NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models Authors: Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang

  15. SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training Authors: Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

  16. VeLU: Variance-enhanced Learning Unit for Deep Neural Networks Authors: Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicol`e, Stefano Ghidoni, Nassir Navab

  17. FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference Authors: Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

  18. Pets: General Pattern Assisted Architecture For Time Series Analysis Authors: Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu

  19. LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models Authors: Kang He, Kaushik Roy

  20. A Call for New Recipes to Enhance Spatial Reasoning in MLLMs Authors: Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yan xia, Ivan Vuli\'c, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

  21. The Geometry of Self-Verification in a Task-Specific Reasoning Model Authors: Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Vi\'egas, Martin Wattenberg

  22. Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation Authors: Muhammad Haseeb Aslam, Clara Martinez, Marco Pedersoli, Alessandro Koerich, Ali Etemad, Eric Granger

  23. LoRe: Personalizing LLMs via Low-Rank Reward Modeling Authors: Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, Maryam Fazel

  24. Generative System Dynamics in Recurrent Neural Networks Authors: Michele Casoni, Tommaso Guidi, Alessandro Betti, Stefano Melacci, Marco Gori

  25. Linking forward-pass dynamics in Transformers and real-time human processing Authors: Jennifer Hu, Michael A. Lepori, Michael Franke

  26. A discrete physics-informed training for projection-based reduced order models with neural networks Authors: N. Sibuet, S. Ares de Parga, J. R. Bravo, R. Rossi

  27. Leakage and Interpretability in Concept-Based Models Authors: Enrico Parisini, Tapabrata Chakraborti, Chris Harbron, Ben D. MacArthur, Christopher R. S. Banerji

  28. On Revealing the Hidden Problem Structure in Real-World and Theoretical Problems Using Walsh Coefficient Influence Authors: M. W. Przewozniczek, F. Chicano, R. Tin\'os, J. Nalepa, B. Ruszczak, A. M. Wijata

  29. A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm Authors: Kazuhisa Fujita


1. MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

ArXiv ID: 2504.14960

Authors: Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang

Abstract: Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.

Comment: The paper introduces MoE Parallel Folding, a novel parallelism strategy for efficient training of large-scale MoE models. This directly aligns with the interest in Mixture-of-Experts and architectural innovations.

Relevance: 10 Novelty: 8


2. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction

ArXiv ID: 2504.15266

Authors: Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan

Abstract: We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic and memorizes excessively; comparatively, multi-token approaches, namely teacherless training and diffusion models, excel in producing diverse and original output. Secondly, in our tasks, we find that to elicit randomness from the Transformer without hurting coherence, it is better to inject noise right at the input layer (via a method we dub hash-conditioning) rather than defer to temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and softmax-based sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity

Comment: The paper introduces minimal algorithmic tasks to test the creative limits of language models and argues for moving beyond next-token prediction. It aligns with 'Emerging Trends' by challenging established paradigms in LLM training.

Relevance: 9 Novelty: 9


3. Compute-Optimal LLMs Provably Generalize Better With Scale

ArXiv ID: 2504.15208

Authors: Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J. Zico Kolter, Andrew Gordon Wilson

Abstract: Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.

Comment: This paper provides theoretical insights into why larger language models generalize better, aligning with the foundational research on LLM behavior and scaling laws.

Relevance: 9 Novelty: 8


4. Quantitative Clustering in Mean-Field Transformer Models

ArXiv ID: 2504.14697

Authors: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

Abstract: The evolution of tokens through a deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, we establish exponential rates of contraction to a Dirac point mass for any suitably regular initialization under some assumptions on the parameters of transformer models, any suitably regular mean-field initialization synchronizes exponentially fast with some quantitative rates.

Comment: The paper investigates clustering behavior in mean-field transformer models, providing theoretical insights into transformer dynamics, which aligns with the 'Model Architecture' criterion.

Relevance: 9 Novelty: 8


5. Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs

ArXiv ID: 2504.13989

Authors: Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello

Abstract: Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used method to reduce memory usage and inference time, however LLMs present unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization. Our method based on a gradual binary search enables 3-bit quantization for weights, activations, and key-value (KV) caches, resulting in a 40\% increase in accuracy on common benchmarks compared to SoTA methods. We extend the use of rotation matrices to support non-power-of-2 embedding dimensions, similar to the Qwen architecture, by employing the Paley algorithm. We theoretically demonstrates the superiority of Hadamard matrices in reducing outliers.We achieved 3-bit quantization for weights, activations, and KV cache, significantly enhancing model performance. Our experimental results on multiple models family like Mistral, LLaMA, and Qwen demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.

Comment: The paper proposes a novel quantization method for LLMs, addressing challenges in activation quantization and KV cache, which aligns with the 'Model Compression' criterion.

Relevance: 9 Novelty: 8


6. Density Measures for Language Generation

ArXiv ID: 2504.14370

Authors: Jon Kleinberg, Fan Wei

Abstract: The recent successes of large language models (LLMs) have led to a surge of theoretical research into language generation. A recent line of work proposes an abstract view, called language generation in the limit, where generation is seen as a game between an adversary and an algorithm: the adversary generates strings from an unknown language $K$, chosen from a countable collection of candidate languages, and after seeing a finite set of these strings, the algorithm must generate new strings from $K$ that it has not seen before. This formalism highlights a key tension: the trade-off between validity (the algorithm should only produce strings from the language) and breadth (it should be able to produce many strings from the language). This trade-off is central in applied language generation as well, where it appears as a balance between hallucination (generating invalid utterances) and mode collapse (generating only a restricted set of outputs). Despite its importance, this trade-off has been challenging to study quantitatively. We develop ways to quantify this trade-off by formalizing breadth using measures of density. Existing algorithms for language generation in the limit produce output sets that can have zero density in the true language, and this important failure of breadth might seem unavoidable. We show, however, that such a failure is not necessary: we provide an algorithm for language generation in the limit whose outputs have strictly positive density in $K$. We also study the internal representations built by these algorithms, specifically the sequence of hypothesized candidate languages they consider, and show that achieving the strongest form of breadth may require oscillating indefinitely between high- and low-density representations. Our analysis introduces a novel topology on language families, with notions of convergence and limit points playing a key role.

Comment: The paper introduces a theoretical framework for language generation, focusing on the trade-off between validity and breadth. It aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8


7. AI for the Open-World: the Learning Principles

ArXiv ID: 2504.14751

Authors: Jianyu Zhang

Abstract: During the past decades, numerous successes of AI has been made on "specific capabilities", named closed-world, such as artificial environments or specific real-world tasks. This well-defined narrow capability brings two nice benefits, a clear criterion of success and the opportunity to collect a lot of examples. The criteria not only reveal whether a machine has achieved a goal, but reveal how the machine falls short of the goal. As a result, human designers can fix the problems one after the other until the machine is deemed good enough for the task. Furthermore, the large set of collected examples reduces the difficulty of this problem-fixing process (by the central limit theorem). Do the success in closed-world translate into broad open-world, where a machine is required to perform any task that a human could possibly undertake with fewer examples and less priori knowledge from human designers? No. Because competence in a specific task provides little insight in handling other tasks, the valuable criteria for specific tasks become helpless when handling broader unseen tasks. Furthermore, due to the shortage of examples in unseen tasks, central limit theorem does not stand on our side. At the end, human designers lose the oscilloscope to "hack" an AI system for the open-world. Achieving AI for the open-world requires unique learning principles and innovated techniques, which are different from the ones in building AI for the closed-world. This thesis explores necessary learning principles required to construct AI for the open-world, including rich features (analogy a large tool box), disentangled representation (an organized tool box), and inference-time learning (a tool-savvy hand). Driven by the learning principles, this thesis further proposes techniques to use the learning principles, conducts enormous large-scale experiments to verify the learning principles.

Comment: The paper discusses learning principles for open-world AI, aligning with 'Emerging Trends' by addressing foundational challenges in AI learning paradigms.

Relevance: 9 Novelty: 8


8. Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing

ArXiv ID: 2504.13975

Authors: Mehmet Yama\c{c}, Muhammad Numan Yousaf, Serkan Kiranyaz, Moncef Gabbouj

Abstract: Multilayer perceptrons (MLP), or fully connected artificial neural networks, are known for performing vector-matrix multiplications using learnable weight matrices; however, their practical application in many machine learning tasks, especially in computer vision, can be limited due to the high dimensionality of input-output pairs at each layer. To improve efficiency, convolutional operators have been utilized to facilitate weight sharing and local connections, yet they are constrained by limited receptive fields. In this paper, we introduce Multiscale Tensor Summation (MTS) Factorization, a novel neural network operator that implements tensor summation at multiple scales, where each tensor to be summed is obtained through Tucker-decomposition-like mode products. Unlike other tensor decomposition methods in the literature, MTS is not introduced as a network compression tool; instead, as a new backbone neural layer. MTS not only reduces the number of parameters required while enhancing the efficiency of weight optimization compared to traditional dense layers (i.e., unfactorized weight matrices in MLP layers), but it also demonstrates clear advantages over convolutional layers. The proof-of-concept experimental comparison of the proposed MTS networks with MLPs and Convolutional Neural Networks (CNNs) demonstrates their effectiveness across various tasks, such as classification, compression, and signal restoration. Additionally, when integrated with modern non-linear units such as the multi-head gate (MHG), also introduced in this study, the corresponding neural network, MTSNet, demonstrates a more favorable complexity-performance tradeoff compared to state-of-the-art transformers in various computer vision applications. The software implementation of the MTS layer and the corresponding MTS-based networks, MTSNets, is shared at https://github.com/mehmetyamac/MTSNet.

Comment: The introduction of Multiscale Tensor Summation (MTS) as a new neural network layer aligns with architectural innovations. The method offers a novel approach to improve efficiency and parameter optimization.

Relevance: 9 Novelty: 8


9. Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

ArXiv ID: 2504.14365

Authors: Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna

Abstract: Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW

Comment: The paper introduces a flexible N:M sparsity method and a compute-in-memory accelerator, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


10. CacheFormer: High Attention-Based Segment Caching

ArXiv ID: 2504.13981

Authors: Sushant Singh, Ausif Mahmood

Abstract: Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches like Linformer, Longformer, Performer, and Structured state space models (SSMs)., have not fully resolved this problem. All these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the loss in quality due to the effective compression of the long context. Inspired by the cache and virtual memory principle in computers, where in case of a cache miss, not only the needed data is retrieved from the memory, but the adjacent data is also obtained, we apply this concept to handling long contexts by dividing it into small segments. In our design, we retrieve the nearby segments in an uncompressed form when high segment-level attention occurs at the compressed level. Our en-hancements for handling long context include aggregating four attention mechanisms consisting of short sliding window attention, long compressed segmented attention, dynamically retrieving top k high attention uncompressed segments, and overlapping segments in long segment attention to avoid segment fragmentation. These enhancements result in an architecture that outperforms ex-isting SOTA architectures with an average perplexity improvement of 8.5% over similar model sizes.

Comment: The paper introduces CacheFormer, which focuses on improving efficiency in handling long contexts in transformers. This aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8


11. On Learning Parallel Pancakes with Mostly Uniform Weights

ArXiv ID: 2504.15251

Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Jasper C. H. Lee, Thanasis Pittas

Abstract: We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\mathbb{R}^d$. This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$-time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.

Comment: The paper explores the complexity of learning Gaussian Mixture Models (GMMs) with structural assumptions, which is foundational research in representation learning. It provides theoretical insights into the statistical query complexity and quasi-polynomial bounds.

Relevance: 9 Novelty: 8


12. Data Selection for ERMs

ArXiv ID: 2504.14572

Authors: Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff

Abstract: Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.

Comment: The paper explores data selection for empirical risk minimizers, providing theoretical insights into optimizing training data. It aligns with 'Representation Learning' and offers foundational contributions to learning theory.

Relevance: 9 Novelty: 8


13. Semi-parametric Memory Consolidation: Towards Brain-like Deep Continual Learning

ArXiv ID: 2504.14727

Authors: Geng Liu, Fei Zhu, Rong Feng, Zhiqiang Yi, Shiqi Wang, Gaofeng Meng, Zhaoxiang Zhang

Abstract: Humans and most animals inherently possess a distinctive capacity to continually acquire novel experiences and accumulate worldly knowledge over time. This ability, termed continual learning, is also critical for deep neural networks (DNNs) to adapt to the dynamically evolving world in open environments. However, DNNs notoriously suffer from catastrophic forgetting of previously learned knowledge when trained on sequential tasks. In this work, inspired by the interactive human memory and learning system, we propose a novel biomimetic continual learning framework that integrates semi-parametric memory and the wake-sleep consolidation mechanism. For the first time, our method enables deep neural networks to retain high performance on novel tasks while maintaining prior knowledge in real-world challenging continual learning scenarios, e.g., class-incremental learning on ImageNet. This study demonstrates that emulating biological intelligence provides a promising path to enable deep neural networks with continual learning capabilities.

Comment: The paper proposes a biomimetic continual learning framework inspired by human memory systems, which aligns with representation learning and training dynamics in neural networks. The semi-parametric memory consolidation mechanism is a novel contribution.

Relevance: 9 Novelty: 8


14. NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

ArXiv ID: 2504.14569

Authors: Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang

Abstract: Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag: (Normalized Weight and Activation Guided Compression), a unified framework for zero-shot shape preserving compression algorithms. We compressed Llama-2 7B/13B/70B and Llama-3 8/70BB models, using two popular forms of shape-preserving compression, vector quantization NoWag-VQ (NoWag for Vector Quantization), and unstructured/semi-structured pruning NoWag-P (NoWag for Pruning). We found that NoWag-VQ significantly outperforms state-of-the-art zero shot VQ, and that NoWag-P performs competitively against state-of-the-art methods. These results suggest commonalities between these compression paradigms that could inspire future work. Our code is available at https://github.com/LawrenceRLiu/NoWag

Comment: The paper introduces a unified framework for shape-preserving compression of LLMs, addressing sparsity and quantization, which aligns with the model compression criterion.

Relevance: 9 Novelty: 7


15. SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

ArXiv ID: 2504.14519

Authors: Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

Abstract: Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context scenarios, existing pipeline parallelism methods fail to address the substantial activation memory pressure, primarily due to the peak memory consumption resulting from the accumulation of activations across multiple microbatches. Moreover, these approaches inevitably introduce considerable pipeline bubbles, further hindering efficiency. To tackle these challenges, we propose SlimPipe, a novel approach to fine-grained pipeline parallelism that employs uniform sequence slicing coupled with one-forward-one-backward (1F1B) schedule. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. Although the slices are evenly partitioned, the computation cost is not equal across slices due to causal attention. We develop a sophisticated workload redistribution technique to address this load imbalance. SlimPipe achieves (1) near-zero memory overhead and (2) minimal pipeline bubbles simultaneously. The effectiveness of SlimPipe has been proven by thorough testing with diverse model architectures, context window sizes, and SlimPipe-specific configurations. For example, on the Llama 70B model, compared to state-of-the-art methods, SlimPipe significantly boosts the Model FLOPs Utilization (MFU) to up to $1.57\times$ for a context length of 512K. More notably, for a context length of 2048K, it maintains over 45% utilization on 256 NVIDIA Hopper 80GB GPUs, while other approaches either suffer significant performance drops or fail entirely due to memory constraints.

Comment: SlimPipe introduces a novel pipeline parallelism method for LLM training, addressing memory efficiency and scalability, which is relevant to foundational advancements in LLM training.

Relevance: 8 Novelty: 8


16. VeLU: Variance-enhanced Learning Unit for Deep Neural Networks

ArXiv ID: 2504.15051

Authors: Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicol`e, Stefano Ghidoni, Nassir Navab

Abstract: Activation functions are fundamental in deep neural networks and directly impact gradient flow, optimization stability, and generalization. Although ReLU remains standard because of its simplicity, it suffers from vanishing gradients and lacks adaptability. Alternatives like Swish and GELU introduce smooth transitions, but fail to dynamically adjust to input statistics. We propose VeLU, a Variance-enhanced Learning Unit as an activation function that dynamically scales based on input variance by integrating ArcTan-Sin transformations and Wasserstein-2 regularization, effectively mitigating covariate shifts and stabilizing optimization. Extensive experiments on ViT_B16, VGG19, ResNet50, DenseNet121, MobileNetV2, and EfficientNetB3 confirm VeLU's superiority over ReLU, ReLU6, Swish, and GELU on six vision benchmarks. The codes of VeLU are publicly available on GitHub.

Comment: The paper introduces VeLU, a novel activation function that dynamically scales based on input variance, which aligns with foundational research in model architecture and optimization.

Relevance: 8 Novelty: 8


17. FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

ArXiv ID: 2504.14152

Authors: Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

Abstract: Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations to low precision is challenging without degrading model accuracy. We propose fine-grained mixed precision (FGMP) quantization, a post-training mixed-precision quantization hardware-software co-design methodology that maintains accuracy while quantizing the majority of weights and activations to reduced precision. Our work makes the following contributions: 1) We develop a policy that uses the perturbation in each value, weighted by the Fisher information, to select which weight and activation blocks to keep in higher precision. This approach preserves accuracy by identifying which weight and activation blocks need to be retained in higher precision to minimize the perturbation in the model loss. 2) We also propose a sensitivity-weighted clipping approach for fine-grained quantization which helps retain accuracy for blocks that are quantized to low precision. 3) We then propose hardware augmentations to leverage the efficiency benefits of FGMP quantization. Our hardware implementation encompasses i) datapath support for FGMP at block granularity, and ii) a mixed-precision activation quantization unit to assign activation blocks to high or low precision on the fly with minimal runtime and energy overhead. Our design, prototyped using NVFP4 (an FP4 format with microscaling) as the low-precision datatype and FP8 as the high-precision datatype, facilitates efficient FGMP quantization, attaining <1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference and requiring 30% less weight memory.

Comment: The paper proposes a fine-grained mixed-precision quantization method for LLM inference, which is relevant to model compression and efficiency improvements.

Relevance: 8 Novelty: 7


18. Pets: General Pattern Assisted Architecture For Time Series Analysis

ArXiv ID: 2504.14209

Authors: Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu

Abstract: Time series analysis has found widespread applications in areas such as weather forecasting, anomaly detection, and healthcare. However, real-world sequential data often exhibit a superimposed state of various fluctuation patterns, including hourly, daily, and monthly frequencies. Traditional decomposition techniques struggle to effectively disentangle these multiple fluctuation patterns from the seasonal components, making time series analysis challenging. Surpassing the existing multi-period decoupling paradigms, this paper introduces a novel perspective based on energy distribution within the temporal-spectrum space. By adaptively quantifying observed sequences into continuous frequency band intervals, the proposed approach reconstructs fluctuation patterns across diverse periods without relying on domain-specific prior knowledge. Building upon this innovative strategy, we propose Pets, an enhanced architecture that is adaptable to arbitrary model structures. Pets integrates a Fluctuation Pattern Assisted (FPA) module and a Context-Guided Mixture of Predictors (MoP). The FPA module facilitates information fusion among diverse fluctuation patterns by capturing their dependencies and progressively modeling these patterns as latent representations at each layer. Meanwhile, the MoP module leverages these compound pattern representations to guide and regulate the reconstruction of distinct fluctuations hierarchically. Pets achieves state-of-the-art performance across various tasks, including forecasting, imputation, anomaly detection, and classification, while demonstrating strong generalization and robustness.

Comment: The paper proposes a novel architecture for time series analysis using a mixture of predictors, which aligns with the 'Model Architecture' criterion, particularly for its use of MoE.

Relevance: 8 Novelty: 7


19. LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models

ArXiv ID: 2504.14089

Authors: Kang He, Kaushik Roy

Abstract: Large language models (LLMs) have achieved remarkable multi-step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof-finding requires systematic exploration and the maintenance of logical coherence and (2) searching the right combination of premises at each reasoning step is inherently challenging in tasks with large premise space. To address this, we propose LogicTree, an inference-time modular framework employing algorithm-guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step-by-step reasoning. Additionally, we introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference-time computation to achieve higher proof accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.

Comment: The paper proposes LogicTree, a framework for structured proof exploration in LLMs, which aligns with the 'Large Language Models' criterion by addressing reasoning and interpretability challenges.

Relevance: 8 Novelty: 7


20. A Call for New Recipes to Enhance Spatial Reasoning in MLLMs

ArXiv ID: 2504.15037

Authors: Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yan xia, Ivan Vuli\'c, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world, thereby limiting their broader applications. We argue that spatial reasoning capabilities will not naturally emerge from merely scaling existing architectures and training methodologies. Instead, this challenge demands dedicated attention to fundamental modifications in the current MLLM development approach. In this position paper, we first establish a comprehensive framework for spatial reasoning within the context of MLLMs. We then elaborate on its pivotal role in real-world applications. Through systematic analysis, we examine how individual components of the current methodology-from training data to reasoning mechanisms-influence spatial reasoning capabilities. This examination reveals critical limitations while simultaneously identifying promising avenues for advancement. Our work aims to direct the AI research community's attention toward these crucial yet underexplored aspects. By highlighting these challenges and opportunities, we seek to catalyze progress toward achieving human-like spatial reasoning capabilities in MLLMs.

Comment: This position paper highlights the limitations of spatial reasoning in multimodal LLMs and calls for fundamental modifications, aligning with the 'Emerging Trends' criterion for foundational challenges.

Relevance: 8 Novelty: 7


21. The Geometry of Self-Verification in a Task-Specific Reasoning Model

ArXiv ID: 2504.14379

Authors: Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Vi\'egas, Martin Wattenberg

Abstract: How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, resulting in a model that always produces highly structured and easily parse-able chain-of-thought sequences. With this setup, we do a top-down and bottom-up analysis to reverse-engineer how the model verifies its outputs. Our top-down analysis reveals Gated Linear Unit (GLU) weights encoding verification-related tokens, such as success'' orincorrect'', which activate according to the correctness of the model's reasoning steps. Our bottom-up analysis reveals that ``previous-token heads'' are mainly responsible for model verification. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU vectors to localize as few as three attention heads that can disable model verification, pointing to a necessary component of a potentially larger verification circuit.

Comment: The paper provides insights into self-verification mechanisms in reasoning models, aligning with foundational research in understanding LLM behavior and interpretability.

Relevance: 8 Novelty: 7


22. Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation

ArXiv ID: 2504.14307

Authors: Muhammad Haseeb Aslam, Clara Martinez, Marco Pedersoli, Alessandro Koerich, Ali Etemad, Eric Granger

Abstract: Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) architecture, the student performance can surpass the teacher particularly when the network is overparameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple models becomes impractical as the number of models grows. Even distilling an ensemble to a single student model or weight averaging methods first requires training of multiple teacher models and does not fully leverage the inherent stochasticity for generating and distilling diversity in DL models. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications such as wearable devices. This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representation to distill from task-relevant representations only, using student-guided knowledge distillation (SGKD). The student representation at each distillation step is used as authority to guide the distillation process. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at both training and testing time, and incurs negligible computational complexity compared to state-of-the-art ensemble learning and weight averaging methods.

Comment: The paper introduces a novel stochastic self-distillation strategy, which aligns with representation learning by improving knowledge distillation techniques.

Relevance: 8 Novelty: 7


23. LoRe: Personalizing LLMs via Low-Rank Reward Modeling

ArXiv ID: 2504.14439

Authors: Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, Maryam Fazel

Abstract: Personalizing large language models (LLMs) to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction. Traditional reinforcement learning from human feedback (RLHF) approaches often rely on monolithic value representations, limiting their ability to adapt to individual preferences. We introduce a novel framework that leverages low-rank preference modeling to efficiently learn and generalize user-specific reward functions. By representing reward functions in a low-dimensional subspace and modeling individual preferences as weighted combinations of shared basis functions, our approach avoids rigid user categorization while enabling scalability and few-shot adaptation. We validate our method on multiple preference datasets, demonstrating superior generalization to unseen users and improved accuracy in preference prediction tasks.

Comment: The paper proposes a low-rank preference modeling framework for personalizing LLMs, which aligns with foundational research in representation learning and efficiency.

Relevance: 8 Novelty: 7


24. Generative System Dynamics in Recurrent Neural Networks

ArXiv ID: 2504.13951

Authors: Michele Casoni, Tommaso Guidi, Alessandro Betti, Stefano Melacci, Marco Gori

Abstract: In this study, we investigate the continuous time dynamics of Recurrent Neural Networks (RNNs), focusing on systems with nonlinear activation functions. The objective of this work is to identify conditions under which RNNs exhibit perpetual oscillatory behavior, without converging to static fixed points. We establish that skew-symmetric weight matrices are fundamental to enable stable limit cycles in both linear and nonlinear configurations. We further demonstrate that hyperbolic tangent-like activation functions (odd, bounded, and continuous) preserve these oscillatory dynamics by ensuring motion invariants in state space. Numerical simulations showcase how nonlinear activation functions not only maintain limit cycles, but also enhance the numerical stability of the system integration process, mitigating those instabilities that are commonly associated with the forward Euler method. The experimental results of this analysis highlight practical considerations for designing neural architectures capable of capturing complex temporal dependencies, i.e., strategies for enhancing memorization skills in recurrent models.

Comment: The paper investigates the dynamics of RNNs with a focus on oscillatory behavior and stability, which aligns with foundational research in representation learning and training dynamics.

Relevance: 8 Novelty: 7


25. Linking forward-pass dynamics in Transformers and real-time human processing

ArXiv ID: 2504.14107

Authors: Jennifer Hu, Michael A. Lepori, Michael Franke

Abstract: Modern AI models are increasingly being used as theoretical tools to study human cognition. One dominant approach is to evaluate whether human-derived measures (such as offline judgments or real-time processing) are predicted by a model's output: that is, the end-product of forward pass(es) through the network. At the same time, recent advances in mechanistic interpretability have begun to reveal the internal processes that give rise to model outputs, raising the question of whether models and humans might arrive at outputs using similar "processing strategies". Here, we investigate the link between real-time processing in humans and "layer-time" dynamics in Transformer models. Across five studies spanning domains and modalities, we test whether the dynamics of computation in a single forward pass of pre-trained Transformers predict signatures of processing in humans, above and beyond properties of the model's output probability distribution. We consistently find that layer-time dynamics provide additional predictive power on top of output measures. Our results suggest that Transformer processing and human processing may be facilitated or impeded by similar properties of an input stimulus, and this similarity has emerged through general-purpose objectives such as next-token prediction or image recognition. Our work suggests a new way of using AI models to study human cognition: not just as a black box mapping stimuli to responses, but potentially also as explicit processing models.

Comment: The paper links forward-pass dynamics in transformers to human processing, which provides insights into transformer behavior and aligns with foundational research in model architecture.

Relevance: 8 Novelty: 7


26. A discrete physics-informed training for projection-based reduced order models with neural networks

ArXiv ID: 2504.13875

Authors: N. Sibuet, S. Ares de Parga, J. R. Bravo, R. Rossi

Abstract: This paper presents a physics-informed training framework for projection-based Reduced Order Models (ROMs). We extend the PROM-ANN architecture by complementing snapshot-based training with a FEM-based, discrete physics-informed residual loss, bridging the gap between traditional projection-based ROMs and physics-informed neural networks (PINNs). Unlike conventional PINNs that rely on analytical PDEs, our approach leverages FEM residuals to guide the learning of the ROM approximation manifold. Key contributions include: (1) a parameter-agnostic, discrete residual loss applicable to non-linear problems, (2) an architectural modification to PROM-ANN improving accuracy for fast-decaying singular values, and (3) an empirical study on the proposed physics informed training process for ROMs. The method is demonstrated on a non-linear hyperelasticity problem, simulating a rubber cantilever under multi-axial loads. The main accomplishment in regards to the proposed residual-based loss is its applicability on non-linear problems by interfacing with FEM software while maintaining reasonable training times. The modified PROM-ANN outperforms POD by orders of magnitude in snapshot reconstruction accuracy, while the original formulation is not able to learn a proper mapping for this use-case. Finally, the application of physics informed training in ANN-PROM modestly narrows the gap between data reconstruction and ROM accuracy, however it highlights the untapped potential of the proposed residual-driven optimization for future ROM development. This work underscores the critical role of FEM residuals in ROM construction and calls for further exploration on architectures beyond PROM-ANN.

Comment: The paper presents a physics-informed training framework for reduced order models, which aligns with 'AI for Science' as it bridges neural networks and FEM-based residuals. It introduces a novel residual-driven optimization approach.

Relevance: 8 Novelty: 7


27. Leakage and Interpretability in Concept-Based Models

ArXiv ID: 2504.14094

Authors: Enrico Parisini, Tapabrata Chakraborti, Chris Harbron, Ben D. MacArthur, Christopher R. S. Banerji

Abstract: Concept Bottleneck Models aim to improve interpretability by predicting high-level intermediate concepts, representing a promising approach for deployment in high-risk scenarios. However, they are known to suffer from information leakage, whereby models exploit unintended information encoded within the learned concepts. We introduce an information-theoretic framework to rigorously characterise and quantify leakage, and define two complementary measures: the concepts-task leakage (CTL) and interconcept leakage (ICL) scores. We show that these measures are strongly predictive of model behaviour under interventions and outperform existing alternatives in robustness and reliability. Using this framework, we identify the primary causes of leakage and provide strong evidence that Concept Embedding Models exhibit substantial leakage regardless of the hyperparameters choice. Finally, we propose practical guidelines for designing concept-based models to reduce leakage and ensure interpretability.

Comment: The paper introduces an information-theoretic framework to address leakage in concept-based models, which aligns with representation learning and interpretability, making it relevant to foundational research.

Relevance: 8 Novelty: 7


28. On Revealing the Hidden Problem Structure in Real-World and Theoretical Problems Using Walsh Coefficient Influence

ArXiv ID: 2504.13949

Authors: M. W. Przewozniczek, F. Chicano, R. Tin\'os, J. Nalepa, B. Ruszczak, A. M. Wijata

Abstract: Gray-box optimization employs Walsh decomposition to obtain non-linear variable dependencies and utilize them to propose masks of variables that have a joint non-linear influence on fitness value. These masks significantly improve the effectiveness of variation operators. In some problems, all variables are non-linearly dependent, making the aforementioned masks useless. We analyze the features of the real-world instances of such problems and show that many of their dependencies may have noise-like origins. Such noise-caused dependencies are irrelevant to the optimization process and can be ignored. To identify them, we propose extending the use of Walsh decomposition by measuring variable dependency strength that allows the construction of the weighted dynamic Variable Interaction Graph (wdVIG). wdVIGs adjust the dependency strength to mixed individuals. They allow the filtering of irrelevant dependencies and re-enable using dependency-based masks by variation operators. We verify the wdVIG potential on a large benchmark suite. For problems with noise, the wdVIG masks can improve the optimizer's effectiveness. If all dependencies are relevant for the optimization, i.e., the problem is not noised, the influence of wdVIG masks is similar to that of state-of-the-art structures of this kind.

Comment: The paper extends Walsh decomposition for optimization problems, which aligns with foundational research in sparsity and efficiency. The proposed weighted dynamic Variable Interaction Graph (wdVIG) is a novel contribution.

Relevance: 8 Novelty: 7


29. A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm

ArXiv ID: 2504.14814

Authors: Kazuhisa Fujita

Abstract: Artificial neural networks are powerful tools capable of addressing various tasks. Although the backpropagation algorithm has become a standard training method for these neural networks, its lack of biological plausibility has inspired the development of alternative learning approaches. One such alternative is Kaneko's Error Diffusion Learning Algorithm (EDLA), a biologically motivated approach wherein a single global error signal diffuses throughout a network composed of paired excitatory-inhibitory sublayers, thereby eliminating the necessity for layer-wise backpropagation. This study presents a contemporary formulation of the EDLA framework and evaluates its effectiveness through parity check, regression, and image classification tasks. Our experimental results indicate that EDLA networks can consistently achieve high accuracy across these benchmarks, with performance efficiency and convergence speed notably influenced by the choice of learning rate, neuron count, and network depth. Further investigation of the internal representations formed by EDLA networks reveals their capacity for meaningful feature extraction, similar to traditional neural networks. These results suggest that EDLA is a biologically motivated alternative for training feedforward networks and will motivate future work on extending this method to biologically inspired neural networks.

Comment: The paper evaluates the biologically inspired Error Diffusion Learning Algorithm (EDLA), which contributes to foundational research in training dynamics and alternative learning methods. The biologically motivated approach is novel.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: