Personalized Daily Arxiv Papers 01/28/2025
| | Prompt | Completion | Total |
|---|---|---|---|
| Token | 55530 | 4704 | 60234 |
| Cost | $1.39 | $0.47 | $1.86 |
Total scanned papers: 485
Total relevant papers: 9
Table of contents with paper titles:
- Task Arithmetic in Trust Region: A Training-Free Model Merging Approach to Navigate Knowledge Conflicts Authors: Wenju Sun, Qingyong Li, Wen Wang, Yangli-ao Geng, Boyang Li
- Efficient and Interpretable Neural Networks Using Complex Lehmer Transform Authors: Masoud Ataei, Xiaogang Wang
- An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models Authors: Jaturong Kongmanee
- A Unified Analysis of Stochastic Gradient Descent with Arbitrary Data Permutations and Beyond Authors: Yipeng Li, Xinchen Lyu, Zhenyu Liu
- A New Approach for Knowledge Generation Using Active Inference Authors: Jamshid Ghasimi, Nazanin Movarraei
- Equation discovery framework EPDE: Towards a better equation discovery Authors: Mikhail Maslyaev, Alexander Hvatov
- Risk-Aware Distributional Intervention Policies for Language Models Authors: Bao Nguyen, Binh Nguyen, Duy Nguyen, Viet Anh Nguyen
- FreqMoE: Enhancing Time Series Forecasting through Frequency Decomposition Mixture of Experts Authors: Ziqi Liu
- Self-reflecting Large Language Models: A Hegelian Dialectical Approach Authors: Sara Abdali, Can Goksen, Saeed Amizadeh and Kazuhito Koishida
1. Task Arithmetic in Trust Region: A Training-Free Model Merging Approach to Navigate Knowledge Conflicts
ArXiv ID: 2501.15065
Authors: Wenju Sun, Qingyong Li, Wen Wang, Yangli-ao Geng, Boyang Li
Abstract: Multi-task model merging offers an efficient solution for integrating knowledge from multiple fine-tuned models, mitigating the significant computational and storage demands associated with multi-task training. As a key technique in this field, Task Arithmetic (TA) defines task vectors by subtracting the pre-trained model ($\theta_{\text{pre}}$) from the fine-tuned task models in parameter space, then adjusting the weight between these task vectors and $\theta_{\text{pre}}$ to balance task-generalized and task-specific knowledge. Despite the promising performance of TA, conflicts can arise among the task vectors, particularly when different tasks require distinct model adaptations. In this paper, we formally define this issue as knowledge conflicts, characterized by the performance degradation of one task after merging with a model fine-tuned for another task. Through in-depth analysis, we show that these conflicts stem primarily from the components of task vectors that align with the gradient of task-specific losses at $\theta_{\text{pre}}$. To address this, we propose Task Arithmetic in Trust Region (TATR), which defines the trust region as dimensions in the model parameter space that cause only small changes (corresponding to the task vector components with gradient orthogonal direction) in the task-specific losses. Restricting parameter merging within this trust region, TATR can effectively alleviate knowledge conflicts. Moreover, TATR serves as both an independent approach and a plug-and-play module compatible with a wide range of TA-based methods. Extensive empirical evaluations on eight distinct datasets robustly demonstrate that TATR improves the multi-task performance of several TA-based model merging methods by an observable margin.
Comment: Addresses model merging and introduces Task Arithmetic in Trust Region (TATR), which is highly relevant to model efficiency and potentially representation learning. The analysis of knowledge conflicts and trust regions contributes novel theoretical insights into multi-task model merging.
Relevance: 9 Novelty: 8
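Illustration (not from the paper): the Task Arithmetic baseline described in the abstract of paper 1 can be sketched in a few lines. This is a minimal sketch of plain TA only, assuming model parameters are flattened into lists of floats. The function names `task_vector` and `merge` and the scaling coefficient `lam` are hypothetical, and the TATR trust-region selection is not reproduced because the abstract does not say how it is computed.

```python
# Sketch of plain Task Arithmetic (NOT TATR). All names are illustrative.

def task_vector(theta_ft, theta_pre):
    """Task vector: fine-tuned parameters minus pre-trained parameters."""
    return [f - p for f, p in zip(theta_ft, theta_pre)]


def merge(theta_pre, finetuned_models, lam):
    """Add the scaled sum of all task vectors back onto the pre-trained model."""
    vectors = [task_vector(m, theta_pre) for m in finetuned_models]
    summed = [sum(components) for components in zip(*vectors)]
    return [p + lam * s for p, s in zip(theta_pre, summed)]


# Toy 3-parameter "models" (made-up numbers, for illustration only).
theta_pre = [0.0, 1.0, 2.0]
theta_a = [1.0, 1.0, 2.0]   # fine-tuned on a hypothetical task A
theta_b = [0.0, 1.0, 4.0]   # fine-tuned on a hypothetical task B

merged = merge(theta_pre, [theta_a, theta_b], lam=0.5)
print(merged)  # [0.5, 1.0, 3.0]
```

As summarized in the abstract, TATR would additionally restrict the update to the dimensions inside its trust region; the sketch above applies the update to every dimension.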
2. Efficient and Interpretable Neural Networks Using Complex Lehmer Transform
ArXiv ID: 2501.15223
Authors: Masoud Ataei, Xiaogang Wang
Abstract: We propose an efficient and interpretable neural network with a novel activation function called the weighted Lehmer transform. This new activation function enables adaptive feature selection and extends to the complex domain, capturing phase-sensitive and hierarchical relationships within data. Notably, it provides greater interpretability and transparency compared to existing machine learning models, facilitating a deeper understanding of its functionality and decision-making processes. We analyze the mathematical properties of both real-valued and complex-valued Lehmer activation units and demonstrate their applications in modeling nonlinear interactions. Empirical evaluations demonstrate that our proposed neural network achieves competitive accuracy on benchmark datasets with significantly improved computational efficiency. A single layer of real-valued or complex-valued Lehmer activation units is shown to deliver state-of-the-art performance, balancing efficiency with interpretability.
Comment: The paper introduces a novel activation function based on the Lehmer transform, focusing on efficiency and interpretability of neural networks. This aligns well with Representation Learning and architectural innovation topics, offering theoretical insights.
Relevance: 9 Novelty: 8
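Illustration (not from the paper): the abstract of paper 2 does not give the formula for the weighted Lehmer transform, so the sketch below uses the classical weighted Lehmer mean for positive real inputs, on the assumption that the proposed activation builds on it. The function name `weighted_lehmer_mean` is hypothetical, and the complex-valued extension is not covered.

```python
# Classical weighted Lehmer mean; an assumption about what the paper's
# activation is based on, not the authors' implementation.

def weighted_lehmer_mean(xs, ws, p):
    """Return sum(w * x**p) / sum(w * x**(p - 1)) for positive xs and ws."""
    numerator = sum(w * x ** p for x, w in zip(xs, ws))
    denominator = sum(w * x ** (p - 1) for x, w in zip(xs, ws))
    return numerator / denominator


xs = [1.0, 2.0, 4.0]
ws = [1.0, 1.0, 1.0]

# With equal weights, p = 1 recovers the arithmetic mean, p = 0 the
# harmonic mean, and p = 2 the contraharmonic mean.
for p in (0, 1, 2):
    print(p, weighted_lehmer_mean(xs, ws, p))
```

The exponent `p` moves the output between the smaller and larger inputs, which is one plausible reading of the "adaptive feature selection" mentioned in the abstract.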
3. An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models
ArXiv ID: 2501.15054
Authors: Jaturong Kongmanee
Abstract: This research aims to unravel how large language models (LLMs) iteratively refine token predictions (or, in a general sense, vector predictions). We utilized a logit lens technique to analyze the model's token predictions derived from intermediate representations. Specifically, we focused on how LLMs access and use information from input contexts, and how positioning of relevant information affects the model's token prediction refinement process. Our findings for multi-document question answering task, by varying input context lengths (the number of documents), using GPT-2, revealed that the number of layers between the first layer that the model predicted next tokens correctly and the later layers that the model finalized its correct predictions, as a function of the position of relevant information (i.e., placing the relevant one at the beginning, middle, or end of the input context), has a nearly inverted U shape. We found that the gap between these two layers, on average, diminishes when relevant information is positioned at the beginning or end of the input context, suggesting that the model requires more refinements when processing longer contexts with relevant information situated in the middle, and highlighting which layers are essential for determining the correct output. Our analysis provides insights about how token predictions are distributed across different conditions, and establishes important connections to existing hypotheses and previous findings in AI safety research and development.
Comment: This paper analyzes how LLMs refine token predictions and identify essential layers, contributing theoretical insights into behavior and interpretability of LLMs, directly aligning with the LLM criterion.
Relevance: 10 Novelty: 7
4. A Unified Analysis of Stochastic Gradient Descent with Arbitrary Data Permutations and Beyond
ArXiv ID: 2501.16117
Authors: Yipeng Li, Xinchen Lyu, Zhenyu Liu
Abstract: We aim to provide a unified convergence analysis for permutation-based Stochastic Gradient Descent (SGD), where data examples are permuted before each epoch. By examining the relations among permutations, we categorize existing permutation-based SGD algorithms into four categories: Arbitrary Permutations, Independent Permutations (including Random Reshuffling), One Permutation (including Incremental Gradient, Shuffle One and Nice Permutation) and Dependent Permutations (including GraBs Lu et al., 2022; Cooper et al., 2023). Existing unified analyses failed to encompass the Dependent Permutations category due to the inter-epoch dependencies in its permutations. In this work, we propose a general assumption that captures the inter-epoch permutation dependencies. Using the general assumption, we develop a unified framework for permutation-based SGD with arbitrary permutations of examples, incorporating all the aforementioned representative algorithms. Furthermore, we adapt our framework on example ordering in SGD for client ordering in Federated Learning (FL). Specifically, we develop a unified framework for regularized-participation FL with arbitrary permutations of clients.
Comment: The unified analysis of permutation-based SGD introduces a theoretical framework relevant to training dynamics in neural networks, which aligns with representation learning interests.
Relevance: 9 Novelty: 8
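Illustration (not from the paper): three of the example orderings named in the abstract of paper 4 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the `scheme` strings, and the toy quadratic objective are all hypothetical, "shuffle_once" corresponds to the scheme the abstract lists as "Shuffle One", and the dependent-permutation methods (GraB and variants) are not reproduced.

```python
import random


def epoch_orders(n, epochs, scheme, seed=0):
    """Return one permutation of range(n) per epoch for the given scheme."""
    rng = random.Random(seed)
    fixed = list(range(n))          # incremental gradient: natural order
    if scheme == "shuffle_once":
        rng.shuffle(fixed)          # one random order, reused every epoch
    orders = []
    for _ in range(epochs):
        if scheme == "random_reshuffling":
            perm = list(range(n))
            rng.shuffle(perm)       # fresh independent order each epoch
        elif scheme in ("incremental_gradient", "shuffle_once"):
            perm = list(fixed)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        orders.append(perm)
    return orders


def permutation_sgd(targets, orders, lr=0.1, w=0.0):
    """SGD on the toy objective mean((w - t_i)**2 / 2), one pass per order."""
    for perm in orders:
        for i in perm:
            w -= lr * (w - targets[i])  # gradient of (w - t_i)**2 / 2
    return w


targets = [1.0, 2.0, 3.0]
for scheme in ("random_reshuffling", "incremental_gradient", "shuffle_once"):
    orders = epoch_orders(len(targets), epochs=200, scheme=scheme)
    print(scheme, permutation_sgd(targets, orders, lr=0.05))
```

Every scheme visits each example exactly once per epoch; the schemes differ only in how the permutations of successive epochs relate to one another, which is the property the paper's categories are organized around.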
5. A New Approach for Knowledge Generation Using Active Inference
ArXiv ID: 2501.15105
Authors: Jamshid Ghasimi, Nazanin Movarraei
Abstract: There are various models proposed on how knowledge is generated in the human brain including the semantic networks model. Although this model has been widely studied and even computational models are presented, but, due to various limits and inefficiencies in the generation of different types of knowledge, its application is limited to semantic knowledge because of has been formed according to semantic memory and declarative knowledge and has many limits in explaining various procedural and conditional knowledge. Given the importance of providing an appropriate model for knowledge generation, especially in the areas of improving human cognitive functions or building intelligent machines, improving existing models in knowledge generation or providing more comprehensive models is of great importance. In the current study, based on the free energy principle of the brain, is the researchers proposed a model for generating three types of declarative, procedural, and conditional knowledge. While explaining different types of knowledge, this model is capable to compute and generate concepts from stimuli based on probabilistic mathematics and the action-perception process (active inference). The proposed model is unsupervised learning that can update itself using a combination of different stimuli as a generative model can generate new concepts of unsupervised received stimuli. In this model, the active inference process is used in the generation of procedural and conditional knowledge and the perception process is used to generate declarative knowledge.
Comment: The paper proposes a knowledge generation model based on the free energy principle, exploring active inference and unsupervised learning for generating various types of knowledge. It has potential foundational contributions to representation learning and aligns with cutting-edge theoretical work on cognitive modeling.
Relevance: 8 Novelty: 8
6. Equation discovery framework EPDE: Towards a better equation discovery
ArXiv ID: 2501.14768
Authors: Mikhail Maslyaev, Alexander Hvatov
Abstract: Equation discovery methods hold promise for extracting knowledge from physics-related data. However, existing approaches often require substantial prior information that significantly reduces the amount of knowledge extracted. In this paper, we enhance the EPDE algorithm -- an evolutionary optimization-based discovery framework. In contrast to methods like SINDy, which rely on pre-defined libraries of terms and linearities, our approach generates terms using fundamental building blocks such as elementary functions and individual differentials. Within evolutionary optimization, we may improve the computation of the fitness function as is done in gradient methods and enhance the optimization algorithm itself. By incorporating multi-objective optimization, we effectively explore the search space, yielding more robust equation extraction, even when dealing with complex experimental data. We validate our algorithm's noise resilience and overall performance by comparing its results with those from the state-of-the-art equation discovery framework SINDy.
Comment: This paper proposes improvements to equation discovery using evolutionary optimization, a fundamental topic in representation learning and model interpretability. It introduces noise-resilient methods, which aligns closely with emerging trends in theoretical work.
Relevance: 8 Novelty: 8
7. Risk-Aware Distributional Intervention Policies for Language Models
ArXiv ID: 2501.15758
Authors: Bao Nguyen, Binh Nguyen, Duy Nguyen, Viet Anh Nguyen
Abstract: Language models are prone to occasionally undesirable generations, such as harmful or toxic content, despite their impressive capability to produce texts that appear accurate and coherent. This paper presents a new two-stage approach to detect and mitigate undesirable content generations by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content using activations by minimizing a smooth surrogate of the risk-aware score. Then, for contents that are detected as undesirable, we propose layerwise distributional intervention policies that perturb the attention heads minimally while guaranteeing probabilistically the effectiveness of the intervention. Benchmarks on several language models and datasets show that our method outperforms baselines in reducing the generation of undesirable output.
Comment: Proposes a novel approach to mitigate undesirable content generation in LLMs through activation-level interventions. This aligns with relevance to theoretical insights into LLM behavior and interpretability.
Relevance: 8 Novelty: 7
8. FreqMoE: Enhancing Time Series Forecasting through Frequency Decomposition Mixture of Experts
ArXiv ID: 2501.15125
Authors: Ziqi Liu
Abstract: Long-term time series forecasting is essential in areas like finance and weather prediction. Besides traditional methods that operate in the time domain, many recent models transform time series data into the frequency domain to better capture complex patterns. However, these methods often use filtering techniques to remove certain frequency signals as noise, which may unintentionally discard important information and reduce prediction accuracy. To address this, we propose the Frequency Decomposition Mixture of Experts (FreqMoE) model, which dynamically decomposes time series data into frequency bands, each processed by a specialized expert. A gating mechanism adjusts the importance of each output of expert based on frequency characteristics, and the aggregated results are fed into a prediction module that iteratively refines the forecast using residual connections. Our experiments demonstrate that FreqMoE outperforms state-of-the-art models, achieving the best performance on 51 out of 70 metrics across all tested datasets, while significantly reducing the number of required parameters to under 50k, providing notable efficiency advantages.
Comment: The paper discusses a Mixture of Experts (MoE) architecture, which is inherently relevant to the topic of model architecture. However, it's focused on time-series forecasting, which is an application-driven task. While the model offers efficiency advantages, it does not appear to introduce significant foundational insights into MoE frameworks.
Relevance: 7 Novelty: 6
9. Self-reflecting Large Language Models: A Hegelian Dialectical Approach
ArXiv ID: 2501.14917
Authors: Sara Abdali, Can Goksen, Saeed Amizadeh and Kazuhito Koishida
Abstract: Investigating NLP through a philosophical lens has recently caught researcher's eyes as it connects computational methods with classical schools of philosophy. This paper introduces a philosophical approach inspired by the Hegelian Dialectic for LLMs' self-reflection, utilizing a self-dialectical approach to emulate internal critiques and then synthesize new ideas by resolving the contradicting points. Moreover, this paper investigates the effect of LLMs' temperature for generation by establishing a dynamic annealing approach, which promotes the creativity in the early stages and gradually refines it by focusing on the nuances, as well as a fixed temperature strategy for generation. Our proposed approach is examined to determine its ability to generate novel ideas from an initial proposition. Additionally, a Multi Agent Majority Voting (MAMV) strategy is leveraged to assess the validity and novelty of the generated ideas, which proves beneficial in the absence of domain experts. Our experiments show promise in generating new ideas and provide a stepping-stone for future research.
Comment: The paper discusses the use of a novel philosophical approach (Hegelian Dialectic) for developing self-reflective capabilities in LLMs, which introduces dynamic temperature annealing and an innovative evaluation method (MAMV). However, this leans toward a conceptual and experimental perspective rather than foundational breakthroughs in LLM architecture or theoretical insights.
Relevance: 7 Novelty: 6
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research, avoiding purely application-driven work:
- Representation Learning
  - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
  - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
- Model Architecture
  - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations.
  - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
- Model Compression
  - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
  - Irrelevant: Straightforward applications of existing compression methods to new tasks.
- Large Language Models (LLMs)
  - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability.
  - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
- AI for Science
  - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
  - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
- Emerging Trends
  - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
  - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords for Relevant Domains: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
Hints on Irrelevant Domains: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
Hints on Application Tasks: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL (independent) axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
  - Focus: Fully aligned with core topics with no deviation, score the highest if contains keywords in it.
  - Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
- Relevance 7-8 (Relevant)
  - Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  - Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
- Relevance 5-6 (Borderline)
  - Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  - Examples: Work referencing MoE centered on reinforcement learning.
- Relevance 3-4 (Irrelevant)
  - Focus: Largely outside our interests with no association to our topics.
  - Examples: Application-focused papers like using MoE to solve a problem in the real world.
- Relevance 1-2 (Ignore)
  - Focus: Purely unrelated to our topics. Completely a different domain.
  - Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
  - Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  - Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
- Novelty 7-8 (Improvements)
  - Definition: Substantial insights/enhancements, though not a full paradigm shift.
  - Examples: Modifications on existing methods yielding significantly better results.
- Novelty 5-6 (Borderline)
  - Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  - Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
- Novelty 3-4 (Tangential)
  - Definition: Minor or domain-specific improvements with limited broader impact.
  - Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
- Novelty 1-2 (Low)
  - Definition: Minimal originality, applying standard approaches without real innovation.
  - Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
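The response format requested above can be checked mechanically. The following is a minimal sketch, assuming each response line is one JSON object with exactly the four keys named in the instructions and integer scores; the function name `parse_response` and the sample COMMENT text are hypothetical, while the ArXiv ID and scores in the sample are taken from paper 1 above.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_response(text):
    """Parse a JSONL response and validate keys and 1-10 score ranges."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            is_int = isinstance(score, int) and not isinstance(score, bool)
            if not is_int or not 1 <= score <= 10:
                raise ValueError(f"line {line_no}: {key} must be an integer from 1 to 10")
        records.append(record)
    return records


# One illustrative line; the COMMENT wording is made up for this example.
sample = (
    '{"ARXIVID": "2501.15065", '
    '"COMMENT": "Matches the model efficiency criterion.", '
    '"RELEVANCE": 9, "NOVELTY": 8}\n'
)

for rec in parse_response(sample):
    print(rec["ARXIVID"], rec["RELEVANCE"], rec["NOVELTY"])
```

Rejecting malformed lines early, rather than silently skipping them, keeps a bad model response from dropping papers out of the digest unnoticed.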