Personalized Daily Arxiv Papers 01/17/2025

Total cost: $0.980125

Total relevant papers: 25

Paper selection prompt and criteria at the bottom

Table of contents with paper titles:

FASP: Fast and Accurate Structured Pruning of Large Language Models Authors: Hanyu Hu, Pengxiang Zhao, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan
LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading Authors: Kuan-Ming Liu (National Chengchi University, College of Commerce), Ming-Chih Lo (National Yang Ming Chiao Tung University, College of Computer Science)
Towards Understanding Extrapolation: a Causal Lens Authors: Lingjing Kong, Guangyi Chen, Petar Stojanov, Haoxuan Li, Eric P. Xing, Kun Zhang
Mono-Forward: Backpropagation-Free Algorithm for Efficient Neural Network Training Harnessing Local Errors Authors: James Gong, Bruce Li, Waleed Abdulla
Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach Authors: Alireza Ghaffari, Sharareh Younesian, Boxing Chen, Vahid Partovi Nia, Masoud Asgharian
Enhancing Graph Representation Learning with Localized Topological Features Authors: Zuoyu Yan, Qi Zhao, Ze Ye, Tengfei Ma, Liangcai Gao, Zhi Tang, Yusu Wang, Chao Chen
Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging Authors: Anke Tang, Enneng Yang, Li Shen, Yong Luo, Han Hu, Bo Du, Dacheng Tao
On Learning Informative Trajectory Embeddings for Imitation, Classification and Regression Authors: Zichang Ge, Changyu Chen, Arunesh Sinha, Pradeep Varakantham
Fokker-Planck to Callan-Symanzik: evolution of weight matrices under training Authors: Wei Bu, Uri Kol, Ziming Liu
Rational Tuning of LLM Cascades via Probabilistic Modeling Authors: Michael J. Zellinger, Matt Thomson
MatrixNet: Learning over symmetry groups using learned group representations Authors: Lucas Laird, Circe Hsu, Asilata Bapat, Robin Walters
Pruning for Sparse Diffusion Models based on Gradient Flow Authors: Ben Wan, Tianyi Zheng, Zhaoyu Chen, Yuxiao Wang, Jia Wang
Large Language Model is Secretly a Protein Sequence Optimizer Authors: Yinkai Wang, Jiaxing He, Yuanqi Du, Xiaohui Chen, Jianan Canal Li, Li-Ping Liu, Xiaolin Xu, Soha Hassoun
MOGNET: A Mux-residual quantized Network leveraging Online-Generated weights Authors: Van Thien Nguyen, William Guicquero, Gilles Sicard
A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise Authors: Ilias Diakonikolas, Nikos Zarifis
Testing Noise Assumptions of Learning Algorithms Authors: Surbhi Goel, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan
Task Vectors in In-Context Learning: Emergence, Formation, and Benefit Authors: Liu Yang, Ziqian Lin, Kangwook Lee, Dimitris Papailiopoulos, Robert Nowak
Class Incremental Fault Diagnosis under Limited Fault Data via Supervised Contrastive Knowledge Distillation Authors: Hanrong Zhang, Yifei Yao, Zixuan Wang, Jiayuan Su, Mengxuan Li, Peng Peng, Hongwei Wang
Free-Knots Kolmogorov-Arnold Network: On the Analysis of Spline Knots and Advancing Stability Authors: Liangwewi Nathan Zheng, Wei Emma Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks Authors: Pierfrancesco Beneventano, Blake Woodworth
Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML Authors: Tehila Dahan, Kfir Y. Levy
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models Authors: Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li
PAL: Prompting Analytic Learning with Missing Modality for Multi-Modal Class-Incremental Learning Authors: Xianghu Yue, Yiming Chen, Xueyi Zhang, Xiaoxue Gao, Mengling Feng, Mingrui Lao, Huiping Zhuang, Haizhou Li
Reducing the Sensitivity of Neural Physics Simulators to Mesh Topology via Pretraining Authors: Nathan Vaska, Justin Goodwin, Robin Walters, Rajmonda S. Caceres
Generating particle physics Lagrangians with transformers Authors: Yong Sheng Koay, Rikard Enberg, Stefano Moretti, Eliel Camargo-Molina

0. FASP: Fast and Accurate Structured Pruning of Large Language Models

ArXiv ID: 2501.09412 Authors: Hanyu Hu, Pengxiang Zhao, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan

Abstract: arXiv:2501.09412v1 Announce Type: new Abstract: The rapid increase in the size of large language models (LLMs) has significantly escalated their computational and memory demands, posing challenges for efficient deployment, especially on resource-constrained devices. Structured pruning has emerged as an effective model compression method that can reduce these demands while preserving performance. In this paper, we introduce FASP (Fast and Accurate Structured Pruning), a novel structured pruning framework for LLMs that emphasizes both speed and accuracy. FASP employs a distinctive pruning structure that interlinks sequential layers, allowing for the removal of columns in one layer while simultaneously eliminating corresponding rows in the preceding layer without incurring additional performance loss. The pruning metric, inspired by Wanda, is computationally efficient and effectively selects components to prune. Additionally, we propose a restoration mechanism that enhances model fidelity by adjusting the remaining weights post-pruning. We evaluate FASP on the OPT and LLaMA model families, demonstrating superior performance in terms of perplexity and accuracy on downstream tasks compared to state-of-the-art methods. Our approach achieves significant speed-ups, pruning models such as OPT-125M in 17 seconds and LLaMA-30B in 15 minutes on a single NVIDIA RTX 4090 GPU, making it a highly practical solution for optimizing LLMs.

Comment: The paper presents a novel structured pruning framework for LLMs, relevant to model compression and efficiency. Relevance: 10 Novelty: 9
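
The coupled structure described in the abstract (dropping columns in one layer together with the matching rows in the preceding layer) can be illustrated with a small sketch. This is not the FASP implementation: the two-layer ReLU network, the function names, and the exact scoring rule (absolute weight times activation norm, in the spirit of Wanda) are assumptions made for illustration, and the restoration step is omitted.

```python
import math

def hidden_acts(W1, x):
    """ReLU(W1 @ x) for one input vector x; W1 is stored as rows = hidden units."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def forward(W1, W2, x):
    """Two-layer network output W2 @ ReLU(W1 @ x)."""
    h = hidden_acts(W1, x)
    return [sum(w * hj for w, hj in zip(row, h)) for row in W2]

def wanda_style_scores(W1, W2, calib):
    """Hypothetical metric: score hidden unit j by sum_i |W2[i][j]| * ||h_j||_2,
    where the norm is taken over the calibration inputs."""
    acts = [hidden_acts(W1, x) for x in calib]
    n_hidden = len(W1)
    norms = [math.sqrt(sum(a[j] ** 2 for a in acts)) for j in range(n_hidden)]
    return [sum(abs(row[j]) for row in W2) * norms[j] for j in range(n_hidden)]

def prune_coupled(W1, W2, calib, n_remove):
    """Drop the n_remove lowest-scoring hidden units from both layers at once."""
    scores = wanda_style_scores(W1, W2, calib)
    drop = set(sorted(range(len(scores)), key=lambda j: scores[j])[:n_remove])
    keep = [j for j in range(len(W1)) if j not in drop]
    W1_p = [W1[j] for j in keep]                    # remove rows of the earlier layer
    W2_p = [[row[j] for j in keep] for row in W2]   # remove matching columns of the later layer
    return W1_p, W2_p, keep
```

Because a removed column of the later layer never reads the corresponding hidden unit, deleting that unit's row in the earlier layer changes nothing further, which is why the two removals can be paired at no extra cost.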

1. LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading

ArXiv ID: 2501.09636 Authors: Kuan-Ming Liu (National Chengchi University, College of Commerce), Ming-Chih Lo (National Yang Ming Chiao Tung University, College of Computer Science)

Abstract: arXiv:2501.09636v1 Announce Type: new Abstract: Recent advances in deep learning and large language models (LLMs) have facilitated the deployment of the mixture-of-experts (MoE) mechanism in the stock investment domain. While these models have demonstrated promising trading performance, they are often unimodal, neglecting the wealth of information available in other modalities, such as textual data. Moreover, the traditional neural network-based router selection mechanism fails to consider contextual and real-world nuances, resulting in suboptimal expert selection. To address these limitations, we propose LLMoE, a novel framework that employs LLMs as the router within the MoE architecture. Specifically, we replace the conventional neural network-based router with LLMs, leveraging their extensive world knowledge and reasoning capabilities to select experts based on historical price data and stock news. This approach provides a more effective and interpretable selection mechanism. Our experiments on multimodal real-world stock datasets demonstrate that LLMoE outperforms state-of-the-art MoE models and other deep neural network approaches. Additionally, the flexible architecture of LLMoE allows for easy adaptation to various downstream tasks.

Comment: The paper introduces a novel framework using LLMs as routers in MoE, aligning with interests in MoE and LLM architecture innovations. Relevance: 10 Novelty: 9

2. Towards Understanding Extrapolation: a Causal Lens

ArXiv ID: 2501.09163 Authors: Lingjing Kong, Guangyi Chen, Petar Stojanov, Haoxuan Li, Eric P. Xing, Kun Zhang

Abstract: arXiv:2501.09163v1 Announce Type: new Abstract: Canonical work handling distribution shifts typically necessitates an entire target distribution that lands inside the training distribution. However, practical scenarios often involve only a handful of target samples, potentially lying outside the training support, which requires the capability of extrapolation. In this work, we aim to provide a theoretical understanding of when extrapolation is possible and offer principled methods to achieve it without requiring an on-support target distribution. To this end, we formulate the extrapolation problem with a latent-variable model that embodies the minimal change principle in causal mechanisms. Under this formulation, we cast the extrapolation problem into a latent-variable identification problem. We provide realistic conditions on shift properties and the estimation objectives that lead to identification even when only one off-support target sample is available, tackling the most challenging scenarios. Our theory reveals the intricate interplay between the underlying manifold's smoothness and the shift properties. We showcase how our theoretical results inform the design of practical adaptation algorithms. Through experiments on both synthetic and real-world data, we validate our theoretical findings and their practical implications.

Comment: The paper provides a theoretical understanding of extrapolation using a latent-variable model, which aligns with the interest in theoretical insights and emerging trends in AI research. Relevance: 8 Novelty: 9

3. Mono-Forward: Backpropagation-Free Algorithm for Efficient Neural Network Training Harnessing Local Errors

ArXiv ID: 2501.09238 Authors: James Gong, Bruce Li, Waleed Abdulla

Abstract: arXiv:2501.09238v1 Announce Type: new Abstract: Backpropagation is the standard method for achieving state-of-the-art accuracy in neural network training, but it often imposes high memory costs and lacks biological plausibility. In this paper, we introduce the Mono-Forward algorithm, a purely local layerwise learning method inspired by Hinton's Forward-Forward framework. Unlike backpropagation, Mono-Forward optimizes each layer solely with locally available information, eliminating the reliance on global error signals. We evaluated Mono-Forward on multi-layer perceptrons and convolutional neural networks across multiple benchmarks, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. The test results show that Mono-Forward consistently matches or surpasses the accuracy of backpropagation across all tasks, with significantly reduced and more even memory usage, better parallelizability, and a comparable convergence rate.

Comment: The paper introduces a novel training algorithm, Mono-Forward, which is a backpropagation-free method. This aligns with the interest in foundational methods and theoretical insights into neural network training. Relevance: 9 Novelty: 8

4. Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach

ArXiv ID: 2501.09107 Authors: Alireza Ghaffari, Sharareh Younesian, Boxing Chen, Vahid Partovi Nia, Masoud Asgharian

Abstract: arXiv:2501.09107v1 Announce Type: new Abstract: As Large Language Models (LLMs) become increasingly computationally complex, developing efficient deployment strategies, such as quantization, becomes crucial. State-of-the-art Post-training Quantization (PTQ) techniques often rely on calibration processes to maintain the accuracy of these models. However, while these calibration techniques can enhance performance in certain domains, they may not be as effective in others. This paper aims to draw attention to robust statistical approaches that can mitigate such issues. We propose a weight-adaptive PTQ method that can be considered a precursor to calibration-based PTQ methods, guiding the quantization process to preserve the distribution of weights by minimizing the Kullback-Leibler divergence between the quantized weights and the originally trained weights. This minimization ensures that the quantized model retains the Shannon information content of the original model to a great extent, guaranteeing robust and efficient deployment across many tasks. As such, our proposed approach can perform on par with most common calibration-based PTQ methods, establishing a new pre-calibration step for further adjusting the quantized weights with calibration. We show that our pre-calibration results achieve the same accuracy as some existing calibration-based PTQ methods on various LLMs.

Comment: The paper introduces a novel statistical pre-calibration approach for post-training quantization, relevant to model compression. Relevance: 9 Novelty: 8
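
As a rough illustration of choosing quantization parameters by minimizing a Kullback-Leibler divergence between original and quantized weights, the sketch below picks a step size from a candidate list. The histogram-based density estimate, the smoothing constant, the symmetric uniform quantizer, and all function names are assumptions for illustration; the abstract does not specify how the paper computes or minimizes the divergence.

```python
import math

def quantize(weights, scale, bits=4):
    """Symmetric uniform quantize-then-dequantize with step size `scale`."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def histogram(values, lo, hi, n_bins):
    """Normalized histogram on [lo, hi]; out-of-range values go to the edge bins.
    Assumes hi > lo."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(n_bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return [c / len(values) for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with a small smoothing constant to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def pick_scale(weights, candidates, bits=4, n_bins=16):
    """Return (scale, divergence) for the candidate whose quantized weights
    stay closest in distribution to the original weights."""
    lo, hi = min(weights), max(weights)
    p = histogram(weights, lo, hi, n_bins)
    best = None
    for s in candidates:
        q = histogram(quantize(weights, s, bits), lo, hi, n_bins)
        d = kl_divergence(p, q)
        if best is None or d < best[1]:
            best = (s, d)
    return best
```

No calibration data appears anywhere in this selection, which is the sense in which such a step can run before calibration-based adjustment.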

5. Enhancing Graph Representation Learning with Localized Topological Features

ArXiv ID: 2501.09178 Authors: Zuoyu Yan, Qi Zhao, Ze Ye, Tengfei Ma, Liangcai Gao, Zhi Tang, Yusu Wang, Chao Chen

Abstract: arXiv:2501.09178v1 Announce Type: new Abstract: Representation learning on graphs is a fundamental problem that can be crucial in various tasks. Graph neural networks, the dominant approach for graph representation learning, are limited in their representation power. Therefore, it can be beneficial to explicitly extract and incorporate high-order topological and geometric information into these models. In this paper, we propose a principled approach to extract the rich connectivity information of graphs based on the theory of persistent homology. Our method utilizes the topological features to enhance the representation learning of graph neural networks and achieve state-of-the-art performance on various node classification and link prediction benchmarks. We also explore the option of end-to-end learning of the topological features, i.e., treating topological computation as a differentiable operator during learning. Our theoretical analysis and empirical study provide insights and potential guidelines for employing topological features in graph learning tasks.

Comment: The paper enhances graph representation learning with topological features, aligning with representation learning interests. Relevance: 9 Novelty: 8

6. Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

ArXiv ID: 2501.09522 Authors: Anke Tang, Enneng Yang, Li Shen, Yong Luo, Han Hu, Bo Du, Dacheng Tao

Abstract: arXiv:2501.09522v1 Announce Type: new Abstract: Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their specialized capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight interpolation-based methods being the predominant approaches. However, these conventional approaches are not well-suited for scenarios where models become available sequentially, and they often suffer from high memory requirements and potential interference between tasks. In this study, we propose a training-free projection-based continual merging method that processes models sequentially through orthogonal projections of weight matrices and adaptive scaling mechanisms. Our method operates by projecting new parameter updates onto subspaces orthogonal to existing merged parameter updates while using an adaptive scaling mechanism to maintain stable parameter distances, enabling efficient sequential integration of task-specific knowledge. Our approach maintains constant memory complexity with respect to the number of models, minimizes interference between tasks through orthogonal projections, and retains the performance of previously merged models through adaptive task vector scaling. Extensive experiments on CLIP-ViT models demonstrate that our method achieves a 5-8% average accuracy improvement while maintaining robust performance in different task orderings.

Comment: The paper presents a novel method for sequential model merging, which is relevant to model architecture and compression through orthogonal projections and adaptive scaling. Relevance: 9 Novelty: 8
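
The idea of projecting each new update onto the subspace orthogonal to what has already been merged can be sketched on flattened parameter vectors. This is a deliberate simplification: the abstract describes projections of weight matrices and an adaptive scaling mechanism, whereas the sketch below uses a rank-one vector projection and a fixed, hypothetical `scale` argument.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(new_delta, merged_delta):
    """Remove from new_delta its component along merged_delta."""
    denom = dot(merged_delta, merged_delta)
    if denom == 0.0:
        return list(new_delta)
    coeff = dot(new_delta, merged_delta) / denom
    return [n - coeff * m for n, m in zip(new_delta, merged_delta)]

def merge_next(merged_delta, new_delta, scale=1.0):
    """Fold one more task update into the running merged update.
    Only merged_delta is kept between calls, so memory does not grow
    with the number of models merged so far."""
    ortho = project_out(new_delta, merged_delta)
    return [m + scale * o for m, o in zip(merged_delta, ortho)]
```

For example, merging the update `[1.0, 1.0, 0.0]` into a running update `[1.0, 0.0, 0.0]` adds only the orthogonal part `[0.0, 1.0, 0.0]`, leaving the direction already occupied by earlier tasks untouched.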

7. On Learning Informative Trajectory Embeddings for Imitation, Classification and Regression

ArXiv ID: 2501.09327 Authors: Zichang Ge, Changyu Chen, Arunesh Sinha, Pradeep Varakantham

Abstract: arXiv:2501.09327v1 Announce Type: new Abstract: In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at https://github.com/Erasmo1015/vte.

Comment: The paper introduces a novel method for embedding state-action trajectories, which aligns with representation learning by capturing skills and competencies without reward labels. Relevance: 9 Novelty: 8

8. Fokker-Planck to Callan-Symanzik: evolution of weight matrices under training

ArXiv ID: 2501.09659 Authors: Wei Bu, Uri Kol, Ziming Liu

Abstract: arXiv:2501.09659v1 Announce Type: new Abstract: The dynamical evolution of a neural network during training has been an incredibly fascinating subject of study. First-principles derivation of the generic evolution of variables in statistical physics systems has proved useful for describing training dynamics conceptually, which in practice means numerically solving equations such as the Fokker-Planck equation. Simulating entire networks inevitably runs into the curse of dimensionality. In this paper, we use the Fokker-Planck equation to simulate the probability density evolution of individual weight matrices in the bottleneck layers of a simple 2-bottleneck-layered auto-encoder and compare the theoretical evolutions against the empirical ones by examining the output data distributions. We also derive physically relevant partial differential equations such as the Callan-Symanzik and Kardar-Parisi-Zhang equations from the dynamical equation we have.

Comment: The paper explores theoretical insights into neural network training dynamics, relevant to foundational research in model architecture. Relevance: 9 Novelty: 8

9. Rational Tuning of LLM Cascades via Probabilistic Modeling

ArXiv ID: 2501.09345 Authors: Michael J. Zellinger, Matt Thomson

Abstract: arXiv:2501.09345v1 Announce Type: new Abstract: Understanding the reliability of large language models (LLMs) has recently garnered significant attention. Given LLMs' propensity to hallucinate, as well as their high sensitivity to prompt design, it is already challenging to predict the performance of an individual LLM. However, the problem becomes more complex for compound LLM systems such as cascades, where in addition to each model's standalone performance, we must understand how the error rates of different models interact. In this paper, we present a probabilistic model for the joint performance distribution of a sequence of LLMs, which enables a framework for rationally tuning the confidence thresholds of an LLM cascade using continuous optimization. Compared to selecting confidence thresholds using grid search, our parametric Markov-copula model significantly improves runtime scaling with respect to the length of the cascade and the desired resolution of the cost-error curve, turning them from intractable into low-order polynomial. In addition, the optimal thresholds computed using our continuous optimization-based algorithm increasingly outperform those found via grid search as cascade length grows, improving the area under the cost-error curve by 1.9% on average for cascades consisting of at least three models. Overall, our Markov-copula model provides a rational basis for tuning LLM cascade performance and points to the potential of probabilistic methods in analyzing LLM systems.

Comment: The paper presents a probabilistic model for tuning LLM cascades, which aligns with the interest in theoretical insights into LLM behavior. Relevance: 9 Novelty: 8
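
To make concrete what "confidence thresholds of a cascade" control, here is a minimal sketch of a cascade's deferral rule and of how a threshold setting maps to one point on a cost-error curve. The rule used (accept a model's answer when its confidence meets that model's threshold, otherwise defer to the next, with the last model always answering) is an assumed, commonly used convention; the data layout and function names are hypothetical, and the paper's Markov-copula model is not reproduced here.

```python
def run_cascade(answers, thresholds):
    """answers: list of (prediction, confidence, cost) per model, cheapest first.
    thresholds: one value per model except the last, which always answers.
    Returns (prediction, total cost paid along the way)."""
    total_cost = 0.0
    for i, (pred, conf, cost) in enumerate(answers):
        total_cost += cost
        is_last = i == len(answers) - 1
        if is_last or conf >= thresholds[i]:
            return pred, total_cost

def evaluate(queries, thresholds):
    """queries: list of (answers, true_label). Returns (mean cost, error rate)."""
    costs, errors = [], 0
    for answers, truth in queries:
        pred, cost = run_cascade(answers, thresholds)
        costs.append(cost)
        errors += pred != truth
    n = len(queries)
    return sum(costs) / n, errors / n
```

Calling `evaluate` once per threshold combination is the grid-search baseline the abstract contrasts against; the number of combinations grows with both cascade length and grid resolution, which is the scaling problem a continuous optimization over thresholds is meant to avoid.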

10. MatrixNet: Learning over symmetry groups using learned group representations

ArXiv ID: 2501.09571 Authors: Lucas Laird, Circe Hsu, Asilata Bapat, Robin Walters

Abstract: arXiv:2501.09571v1 Announce Type: new Abstract: Group theory has been used in machine learning to provide a theoretically grounded approach for incorporating known symmetry transformations in tasks from robotics to protein modeling. In these applications, equivariant neural networks use known symmetry groups with predefined representations to learn over geometric input data. We propose MatrixNet, a neural network architecture that learns matrix representations of group element inputs instead of using predefined representations. MatrixNet achieves higher sample efficiency and generalization over several standard baselines in prediction tasks over several finite groups and the Artin braid group. We also show that MatrixNet respects group relations, allowing generalization to group elements of greater word length than in the training set.

Comment: MatrixNet introduces a novel architecture for learning group representations, relevant to model architecture and representation learning. Relevance: 8 Novelty: 8

11. Pruning for Sparse Diffusion Models based on Gradient Flow

ArXiv ID: 2501.09464 Authors: Ben Wan, Tianyi Zheng, Zhaoyu Chen, Yuxiao Wang, Jia Wang

Abstract: arXiv:2501.09464v1 Announce Type: new Abstract: Diffusion Models (DMs) have impressive capabilities among generation models, but are limited by slower inference speeds and higher computational costs. Previous works utilize one-shot structure pruning to derive lightweight DMs from pre-trained ones, but this approach often leads to a significant drop in generation quality and may result in the removal of crucial weights. Thus we propose an iterative pruning method based on gradient flow, including the gradient flow pruning process and the gradient flow pruning criterion. We employ a progressive soft pruning strategy to maintain the continuity of the mask matrix and guide it along the gradient flow of the energy function based on the pruning criterion in sparse space, thereby avoiding the sudden information loss typically caused by one-shot pruning. The gradient-flow-based criterion prunes parameters whose removal increases the gradient norm of the loss function and can enable fast convergence for a pruned model in the iterative pruning stage. Our extensive experiments on widely used datasets demonstrate that our method achieves superior performance in efficiency and consistency with pre-trained models.

Comment: The paper focuses on pruning for sparse diffusion models, aligning with model compression through sparsity and pruning techniques. Relevance: 9 Novelty: 7

12. Large Language Model is Secretly a Protein Sequence Optimizer

ArXiv ID: 2501.09274 Authors: Yinkai Wang, Jiaxing He, Yuanqi Du, Xiaohui Chen, Jianan Canal Li, Li-Ping Liu, Xiaolin Xu, Soha Hassoun

Abstract: arXiv:2501.09274v1 Announce Type: new Abstract: We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. Directed evolution has been a dominant paradigm in this field; it iteratively generates variants and selects among them via experimental feedback. We demonstrate that large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers. With a directed evolutionary method, LLMs can perform protein engineering through Pareto and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes.

Comment: The paper explores the use of large language models for protein sequence optimization, aligning with AI for Science and LLMs with a novel application in protein engineering. Relevance: 8 Novelty: 8

13. MOGNET: A Mux-residual quantized Network leveraging Online-Generated weights

ArXiv ID: 2501.09531 Authors: Van Thien Nguyen, William Guicquero, Gilles Sicard

Abstract: arXiv:2501.09531v1 Announce Type: new Abstract: This paper presents a compact model architecture called MOGNET, compatible with resource-limited hardware. MOGNET uses a streamlined Convolutional factorization block based on a combination of 2 point-wise (1x1) convolutions with a group-wise convolution in-between. To further limit the overall model size and reduce the required on-chip memory, the second point-wise convolution's parameters are generated online by a Cellular Automaton structure. In addition, MOGNET enables the use of low-precision weights and activations, by taking advantage of a Multiplexer mechanism with a proper Bitshift rescaling for integrating residual paths without increasing the hardware-related complexity. To efficiently train this model, we also introduce a novel weight ternarization method favoring the balance between quantized levels. Experimental results show that, given a tiny memory budget (sub-2Mb), MOGNET can achieve higher accuracy, with a clear gap of up to 1%, at a similar or even lower model size compared to recent state-of-the-art methods.

Comment: The paper introduces a compact model architecture with quantization and low-precision techniques, relevant to model compression. Relevance: 8 Novelty: 7

14. A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise

ArXiv ID: 2501.09691 Authors: Ilias Diakonikolas, Nikos Zarifis

Abstract: arXiv:2501.09691v1 Announce Type: new Abstract: We study the problem of PAC learning $\gamma$-margin halfspaces in the presence of Massart noise. Without computational considerations, the sample complexity of this learning problem is known to be $\widetilde{\Theta}(1/(\gamma^2 \epsilon))$. Prior computationally efficient algorithms for the problem incur sample complexity $\tilde{O}(1/(\gamma^4 \epsilon^3))$ and achieve 0-1 error of $\eta+\epsilon$, where $\eta<1/2$ is the upper bound on the noise rate. Recent work gave evidence of an information-computation tradeoff, suggesting that a quadratic dependence on $1/\epsilon$ is required for computationally efficient algorithms. Our main result is a computationally efficient learner with sample complexity $\widetilde{\Theta}(1/(\gamma^2 \epsilon^2))$, nearly matching this lower bound. In addition, our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses.

Comment: The paper focuses on theoretical insights into learning algorithms, which aligns with the interest in foundational methods. Relevance: 7 Novelty: 8
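
To give a feel for the three sample-complexity bounds quoted in the abstract, here is a worked comparison at the illustrative values $\gamma = 0.1$ and $\epsilon = 0.01$ (these values are chosen only for the example; logarithmic factors and constants are ignored):

$$
\begin{aligned}
\text{information-theoretic: } & \frac{1}{\gamma^2 \epsilon} = 10^2 \cdot 10^2 = 10^4, \\
\text{prior efficient algorithms: } & \frac{1}{\gamma^4 \epsilon^3} = 10^4 \cdot 10^6 = 10^{10}, \\
\text{this paper: } & \frac{1}{\gamma^2 \epsilon^2} = 10^2 \cdot 10^4 = 10^6.
\end{aligned}
$$

At these values the new bound sits a factor of $1/\epsilon = 10^2$ above the information-theoretic rate, which is the quadratic dependence on $1/\epsilon$ that the abstract says is believed necessary for efficient algorithms, and a factor of $10^4$ below the prior efficient bound.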

15. Testing Noise Assumptions of Learning Algorithms

ArXiv ID: 2501.09189 Authors: Surbhi Goel, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: arXiv:2501.09189v1 Announce Type: new Abstract: We pose a fundamental question in computational learning theory: can we efficiently test whether a training set satisfies the assumptions of a given noise model? This question has remained unaddressed despite decades of research on learning in the presence of noise. In this work, we show that this task is tractable and present the first efficient algorithm to test various noise assumptions on the training data. To model this question, we extend the recently proposed testable learning framework of Rubinfeld and Vasilyan (2023) and require a learner to run an associated test that satisfies the following two conditions: (1) whenever the test accepts, the learner outputs a classifier along with a certificate of optimality, and (2) the test must pass for any dataset drawn according to a specified modeling assumption on both the marginal distribution and the noise model. We then consider the problem of learning halfspaces over Gaussian marginals with Massart noise (where each label can be flipped with probability less than $1/2$ depending on the input features), and give a fully-polynomial time testable learning algorithm. We also show a separation between the classical setting of learning in the presence of structured noise and testable learning. In fact, for the simple case of random classification noise (where each label is flipped with fixed probability $\eta = 1/2$), we show that testable learning requires super-polynomial time while classical learning is trivial.

Comment: The paper presents a theoretical approach to testing noise assumptions in learning algorithms, which aligns with the core topic of theoretical insights into learning models. Relevance: 7 Novelty: 8
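
The two noise models named in the abstract differ only in whether the flip probability may depend on the input. A minimal sketch of label generation under Massart noise follows; the function names and the data layout are hypothetical, and random classification noise is the special case where `eta_fn` returns a constant.

```python
import random

def clean_label(w, x):
    """Label assigned by the halfspace with normal vector w (ties go to +1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def massart_labels(w, xs, eta_fn, eta_max, rng):
    """Flip each clean label independently with probability eta_fn(x),
    which may depend on x but must stay at or below eta_max < 1/2."""
    labels = []
    for x in xs:
        eta_x = eta_fn(x)
        if not 0.0 <= eta_x <= eta_max < 0.5:
            raise ValueError("Massart noise requires 0 <= eta(x) <= eta_max < 1/2")
        y = clean_label(w, x)
        labels.append(-y if rng.random() < eta_x else y)
    return labels
```

A learner sees only the points and the noisy labels, not `eta_fn`, which is why checking from data alone whether the labels are consistent with such a model is a nontrivial question.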

16. Task Vectors in In-Context Learning: Emergence, Formation, and Benefit

ArXiv ID: 2501.09240 Authors: Liu Yang, Ziqian Lin, Kangwook Lee, Dimitris Papailiopoulos, Robert Nowak

Abstract: arXiv:2501.09240v1 Announce Type: new Abstract: In-context learning is a remarkable capability of transformers, referring to their ability to adapt to specific tasks based on a short history or context. Previous research has found that task-specific information is locally encoded within models, though their emergence and functionality remain unclear due to opaque pre-training processes. In this work, we investigate the formation of task vectors in a controlled setting, using models trained from scratch on synthetic datasets. Our findings confirm that task vectors naturally emerge under certain conditions, but the tasks may be relatively weakly and/or non-locally encoded within the model. To promote strong task vectors encoded at a prescribed location within the model, we propose an auxiliary training mechanism based on a task vector prompting loss (TVP-loss). This method eliminates the need to search for task-correlated encodings within the trained model and demonstrably improves robustness and generalization.

Comment: The paper investigates task vectors in transformers, contributing to understanding model architecture and representation learning through task vector prompting loss. Relevance: 8 Novelty: 7

17. Class Incremental Fault Diagnosis under Limited Fault Data via Supervised Contrastive Knowledge Distillation

ArXiv ID: 2501.09525 Authors: Hanrong Zhang, Yifei Yao, Zixuan Wang, Jiayuan Su, Mengxuan Li, Peng Peng, Hongwei Wang

Abstract: arXiv:2501.09525v1 Announce Type: new Abstract: Class-incremental fault diagnosis requires a model to adapt to new fault classes while retaining previous knowledge. However, limited research exists for imbalanced and long-tailed data. Extracting discriminative features from few-shot fault data is challenging, and adding new fault classes often demands costly model retraining. Moreover, incremental training of existing methods risks catastrophic forgetting, and severe class imbalance can bias the model's decisions toward normal classes. To tackle these issues, we introduce a Supervised Contrastive knowledge distiLlation for class Incremental Fault Diagnosis (SCLIFD) framework proposing supervised contrastive knowledge distillation for improved representation learning capability and less forgetting, a novel prioritized exemplar selection method for sample replay to alleviate catastrophic forgetting, and the Random Forest Classifier to address the class imbalance. Extensive experimentation on simulated and real-world industrial datasets across various imbalance ratios demonstrates the superiority of SCLIFD over existing approaches. Our code can be found at https://github.com/Zhang-Henry/SCLIFD_TII.

Comment: The paper focuses on representation learning through supervised contrastive knowledge distillation, relevant to feature learning. Relevance: 8 Novelty: 7

18. Free-Knots Kolmogorov-Arnold Network: On the Analysis of Spline Knots and Advancing Stability

ArXiv ID: 2501.09283 Authors: Liangwewi Nathan Zheng, Wei Emma Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen

Abstract: arXiv:2501.09283v1 Announce Type: new Abstract: Kolmogorov-Arnold Neural Networks (KANs) have gained significant attention in the machine learning community. However, their implementation often suffers from poor training stability and a heavy trainable-parameter count. Furthermore, there is limited understanding of the behavior of the learned activation functions derived from B-splines. In this work, we analyze the behavior of KANs through the lens of spline knots and derive the lower and upper bounds for the number of knots in B-spline-based KANs. To address existing limitations, we propose a novel Free Knots KAN that enhances the performance of the original KAN while reducing the number of trainable parameters to match the trainable parameter scale of standard Multi-Layer Perceptrons (MLPs). Additionally, we introduce a new training strategy to ensure $C^2$ continuity of the learnable spline, resulting in smoother activations compared to the original KAN, and improve training stability by range expansion. The proposed method is comprehensively evaluated on 8 datasets spanning various domains, including image, text, time series, multimodal, and function approximation tasks. The promising results demonstrate the feasibility of KAN-based networks and the effectiveness of the proposed method.

Comment: The paper proposes a novel Free Knots KAN, which is relevant to model architecture and offers theoretical insights into spline-based networks, enhancing training stability and parameter efficiency. Relevance: 8 Novelty: 7

19. Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

ArXiv ID: 2501.09137 Authors: Pierfrancesco Beneventano, Blake Woodworth

Abstract: arXiv:2501.09137v1 Announce Type: new Abstract: We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize -- about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the "Edge of Stability", which induces additional regularization by delaying convergence and may have implications for training more complex models.

Comment: The paper offers theoretical insights into gradient descent dynamics, which is relevant to foundational research in model training and optimization. Relevance: 7 Novelty: 7
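
As a rough illustration of the trade-off described in the abstract above, the following Python sketch runs gradient descent on a scalar toy problem, the loss 0.5 * (a*b - y)^2, with a small and a large stepsize from the same initialization. The width-1 hidden layer, the initial values, the target, and the stepsizes are all assumptions chosen for illustration and are not taken from the paper. At a global minimum (a*b = y) the Hessian of this toy loss has the single nonzero eigenvalue a^2 + b^2, which the sketch uses as the sharpness.

```python
def run_gd(a, b, y, step, n_steps):
    """Plain gradient descent on the toy loss 0.5 * (a * b - y) ** 2."""
    for _ in range(n_steps):
        residual = a * b - y
        # Simultaneous update: both gradients use the old values of a and b.
        a, b = a - step * residual * b, b - step * residual * a
    return a, b


def sharpness_at_minimum(a, b):
    """Largest Hessian eigenvalue of the toy loss at a point where a * b == y."""
    return a * a + b * b


# Hypothetical setup (not from the paper): unbalanced initialization, target y = 1.
A0, B0, Y = 3.0, 0.1, 1.0

# A tiny stepsize stays close to gradient flow; 0.2 is slightly below 2 / sharpness here.
a_small, b_small = run_gd(A0, B0, Y, step=0.001, n_steps=20000)
a_large, b_large = run_gd(A0, B0, Y, step=0.2, n_steps=20000)

sharp_small = sharpness_at_minimum(a_small, b_small)
sharp_large = sharpness_at_minimum(a_large, b_large)

print("small step: a*b =", a_small * b_small, "sharpness =", sharp_small)
print("large step: a*b =", a_large * b_large, "sharpness =", sharp_large)
```

In this toy, one GD step multiplies the imbalance a^2 - b^2 by (1 - step^2 * residual^2), whereas gradient flow conserves it exactly. A larger stepsize therefore removes more of the imbalance and ends at a minimum with smaller a^2 + b^2.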

20. Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML

ArXiv ID: 2501.09621 Authors: Tehila Dahan, Kfir Y. Levy

Abstract: arXiv:2501.09621v1 Announce Type: new Abstract: We address the challenges of Byzantine-robust training in asynchronous distributed machine learning systems, aiming to enhance efficiency amid massive parallelization and heterogeneous computing resources. Asynchronous systems, marked by independently operating workers and intermittent updates, uniquely struggle with maintaining integrity against Byzantine failures, which encompass malicious or erroneous actions that disrupt learning. The inherent delays in such settings not only introduce additional bias to the system but also obscure the disruptions caused by Byzantine faults. To tackle these issues, we adapt the Byzantine framework to asynchronous dynamics by introducing a novel weighted robust aggregation framework. This allows for the extension of robust aggregators and a recent meta-aggregator to their weighted versions, mitigating the effects of delayed updates. By further incorporating a recent variance-reduction technique, we achieve an optimal convergence rate for the first time in an asynchronous Byzantine environment. Our methodology is rigorously validated through empirical and theoretical analysis, demonstrating its effectiveness in enhancing fault tolerance and optimizing performance in asynchronous ML systems.

Comment: The paper presents a novel weighted robust aggregation framework for asynchronous ML, relevant to model architecture and theoretical insights. Relevance: 7 Novelty: 7
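
To make the idea of a weighted robust aggregator concrete, here is a minimal Python sketch of a weighted coordinate-wise median. This is a generic aggregator used as an assumption for illustration; the abstract does not specify the paper's aggregators or how its weights are chosen, and the function name, the example updates, and the weights below are hypothetical.

```python
def weighted_coordinate_median(updates, weights):
    """Aggregate worker updates with a per-coordinate weighted median.

    updates: list of equal-length lists of floats, one per worker.
    weights: list of non-negative floats, one per worker (hypothetical choice,
             e.g. smaller weights for more delayed updates).
    """
    if len(updates) != len(weights) or not updates:
        raise ValueError("need one weight per update and at least one update")
    half_total = sum(weights) / 2.0
    aggregated = []
    for coord in range(len(updates[0])):
        # Sort this coordinate's values and walk until half the weight is covered.
        pairs = sorted((u[coord], w) for u, w in zip(updates, weights))
        cumulative = 0.0
        for value, weight in pairs:
            cumulative += weight
            if cumulative >= half_total:
                aggregated.append(value)
                break
    return aggregated


# Two agreeing workers and one outlier (illustrative numbers only).
example_updates = [[1.0, 2.0], [1.2, 1.8], [100.0, -100.0]]
example_weights = [0.4, 0.4, 0.2]
print(weighted_coordinate_median(example_updates, example_weights))
```

In this example the outlier carries less than half of the total weight, so it cannot move either coordinate of the aggregate.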

21. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

ArXiv ID: 2501.09686 Authors: Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li

Abstract: arXiv:2501.09686v1 Announce Type: new Abstract: Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to reveal a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.

Comment: The paper surveys reinforced reasoning with LLMs, focusing on theoretical insights into reasoning processes, which aligns with the interest in LLM behavior and architecture breakthroughs. Relevance: 8 Novelty: 6

ArXiv ID: 2501.09352 Authors: Xianghu Yue, Yiming Chen, Xueyi Zhang, Xiaoxue Gao, Mengling Feng, Mingrui Lao, Huiping Zhuang, Haizhou Li

Abstract: arXiv:2501.09352v1 Announce Type: new Abstract: Multi-modal class-incremental learning (MMCIL) seeks to leverage multi-modal data, such as audio-visual and image-text pairs, thereby enabling models to learn continuously across a sequence of tasks while mitigating forgetting. While existing studies primarily focus on the integration and utilization of multi-modal information for MMCIL, a critical challenge remains: the issue of missing modalities during incremental learning phases. This oversight can exacerbate severe forgetting and significantly impair model performance. To bridge this gap, we propose PAL, a novel exemplar-free framework tailored to MMCIL under missing-modality scenarios. Concretely, we devise modality-specific prompts to compensate for missing information, facilitating the model to maintain a holistic representation of the data. On this foundation, we reformulate the MMCIL problem into a Recursive Least-Squares task, delivering an analytical linear solution. Building upon these, PAL not only alleviates the inherent under-fitting limitation in analytic learning but also preserves the holistic representation of missing-modality data, achieving superior performance with less forgetting across various multi-modal incremental scenarios. Extensive experiments demonstrate that PAL significantly outperforms competitive methods across various datasets, including UPMC-Food101 and N24News, showcasing its robustness towards modality absence and its anti-forgetting ability to maintain high incremental accuracy.

Comment: The paper introduces a novel framework for multi-modal class-incremental learning with missing modalities, which involves representation learning through modality-specific prompts. Relevance: 7 Novelty: 7

23. Reducing the Sensitivity of Neural Physics Simulators to Mesh Topology via Pretraining

ArXiv ID: 2501.09597 Authors: Nathan Vaska, Justin Goodwin, Robin Walters, Rajmonda S. Caceres

Abstract: arXiv:2501.09597v1 Announce Type: new Abstract: Meshes are used to represent complex objects in high fidelity physics simulators across a variety of domains, such as radar sensing and aerodynamics. There is growing interest in using neural networks to accelerate physics simulations, and also a growing body of work on applying neural networks directly to irregular mesh data. Since multiple mesh topologies can represent the same object, mesh augmentation is typically required to handle topological variation when training neural networks. Due to the sensitivity of physics simulators to small changes in mesh shape, it is challenging to use these augmentations when training neural network-based physics simulators. In this work, we show that variations in mesh topology can significantly reduce the performance of neural network simulators. We evaluate whether pretraining can be used to address this issue, and find that employing an established autoencoder pretraining technique with graph embedding models reduces the sensitivity of neural network simulators to variations in mesh topology. Finally, we highlight future research directions that may further reduce neural simulator sensitivity to mesh topology.

Comment: The paper discusses using autoencoder pretraining to reduce sensitivity in neural network simulators, which aligns with representation learning and model architecture topics. Relevance: 7 Novelty: 6

24. Generating particle physics Lagrangians with transformers

ArXiv ID: 2501.09729 Authors: Yong Sheng Koay, Rikard Enberg, Stefano Moretti, Eliel Camargo-Molina

Abstract: arXiv:2501.09729v1 Announce Type: new Abstract: In physics, Lagrangians provide a systematic way to describe laws governing physical systems. In the context of particle physics, they encode the interactions and behavior of the fundamental building blocks of our universe. By treating Lagrangians as complex, rule-based constructs similar to linguistic expressions, we trained a transformer model -- proven to be effective in natural language tasks -- to predict the Lagrangian corresponding to a given list of particles. We report on the transformer's performance in constructing Lagrangians respecting the Standard Model $\mathrm{SU}(3)\times \mathrm{SU}(2)\times \mathrm{U}(1)$ gauge symmetries. The resulting model is shown to achieve high accuracies (over 90%) with Lagrangians up to six matter fields, with the capacity to generalize beyond the training distribution, albeit within architectural constraints. We show through an analysis of input embeddings that the model has internalized concepts such as group representations and conjugation operations as it learned to generate Lagrangians. We make the model and training datasets available to the community. An interactive demonstration can be found at: https://huggingface.co/spaces/JoseEliel/generate-lagrangians.

Comment: The paper uses transformers to generate particle physics Lagrangians, which is relevant to AI for science by applying foundational architecture to a new domain. Relevance: 7 Novelty: 6

Paper selection prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Representation Learning - Relevant: Feature learning, sparse/contrastive learning, dictionary learning, or theoretical insights into how deep networks encode information. - Irrelevant: Application-only work using standard representation learning without innovative insights.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, and other foundational structures. - Irrelevant: Simply applying existing architectures to new tasks without structural/theoretical innovation.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank, KV cache, or theoretical/algorithmic innovations for efficiency, etc. - Irrelevant: Simply applying existing compression to new tasks.
Large Language Models (LLMs) - Relevant: Strong theoretical insights on LLM behavior, architecture/training breakthroughs (e.g., MoE). - Irrelevant: Domain-specific usage or small tweaks (e.g., instruction tuning), lack of theoretical advancement (e.g., benchmarks/datasets, inference tricks like RAG).
AI for Science - Relevant: Foundational research in molecule/protein modeling (e.g., new training paradigms, advanced generative methods, or theoretical perspectives), or major architecture-level innovation. - Irrelevant: Conventional, domain-limited applications lacking insights on the foundational side.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging assumptions, or broad new paradigms/concepts in AI research. - Irrelevant: Trend-following or incremental extensions on existing methods.

Papers

[PAPER LIST HERE]

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other. E.g., a paper with high relevance can be of low novelty, or vice versa.

Relevance Scoring

Relevance 9-10 (Completely Relevant)
Focus: Fully aligned with core topics; score highest if the paper also contains the keywords below.
Keywords: “Mixture of Experts (MoE),” “Representation Learning,” “Compression,” “Sparse,” “Pruning,” “Quantization,” “Low-rank,” “Theoretical,” “Scalability,” “Foundation Models,” etc.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords (like "MoE").
Relevance 7-8 (Relevant)
Focus: Clearly tied to our main topics, though it may not fully match the interest in foundational methods.
Examples: Pure research on representation/architecture with no other domain focus; significant overlap with MoE.
Relevance 5-6 (Optional)
Focus: Linked to our topics; covers relevant ideas but also includes another area of interest.
Examples: Work referencing MoE in a broader context or centered on another domain like federated learning, online learning, etc.
Relevance 3-4 (Irrelevant)
Focus: Largely outside our interests, with little/no association to our topics.
Examples: application-level tasks like using MoE as a method for medical image segmentation, etc.
Relevance 1-2 (Ignore)
Focus: Purely unrelated to our topics.
Examples: Entirely different fields like reinforcement learning, 3D vision learning, etc.

Novelty Scoring

Note: Foundation vs. Application - Foundational/theoretical papers (new theorems, architectures, or strong methodological insights) are of high novelty. - Subdomain papers and application-focused papers (e.g., "methods for xxx") are lower in novelty.

Novelty 9-10 (Breakthrough)
Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for MoE routing; a novel theoretical result transforming representation learning.
Novelty 7-8 (Improvements)
Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Moderate)
Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods that don’t change the broader landscape (e.g., a standard LLM fine-tuned on a new dataset).
Novelty 1-2 (Low)
Definition: Minimal originality, applying standard approaches without real innovation.
Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies with no methodological advancement.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

ARXIVID: should be the ArXiv ID.
COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
RELEVANCE: should be a score from 1-10.
NOVELTY: should be a score from 1-10.
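
As an illustration of the output format requested above, the following Python sketch parses and validates one JSONL line per paper. The field names come from the instructions; the helper name parse_scores and the validation rules are assumptions, and the sample line reuses the comment and scores shown for 2501.09137 earlier in this digest.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_scores(jsonl_text):
    """Parse JSONL text (one JSON object per line) and check keys and score ranges."""
    records = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            if not 1 <= int(record[key]) <= 10:
                raise ValueError(f"line {line_number}: {key} must be in 1-10")
        records.append(record)
    return records


sample = (
    '{"ARXIVID": "2501.09137", "COMMENT": "The paper offers theoretical insights '
    'into gradient descent dynamics, which is relevant to foundational research '
    'in model training and optimization.", "RELEVANCE": 7, "NOVELTY": 7}'
)
for rec in parse_scores(sample):
    print(rec["ARXIVID"], rec["RELEVANCE"], rec["NOVELTY"])
```

Each record must sit on a single line for the text to be valid JSONL, which is why the sample is built by concatenating string pieces rather than by embedding line breaks.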