Personalized Daily ArXiv Papers 2025-05-01

[gpt-4o]  Prompt  Completion  Total
Token     23745   3079        26824
Cost      $0.06   $0.03       $0.09
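
The usage figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the token counts come from the table, but the per-token prices (2.50 USD per million prompt tokens, 10.00 USD per million completion tokens) are an assumption, since the digest does not state the rates it used.

```python
# Token counts reported in the usage table.
prompt_tokens = 23745
completion_tokens = 3079

# ASSUMED prices in USD per token; the digest does not state them.
PROMPT_PRICE = 2.50 / 1_000_000
COMPLETION_PRICE = 10.00 / 1_000_000

total_tokens = prompt_tokens + completion_tokens
prompt_cost = prompt_tokens * PROMPT_PRICE
completion_cost = completion_tokens * COMPLETION_PRICE
total_cost = prompt_cost + completion_cost

print(total_tokens)
print(f"${prompt_cost:.2f} ${completion_cost:.2f} ${total_cost:.2f}")
```

Under these assumed rates the rounded costs agree with the reported $0.06, $0.03, and $0.09.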

Total arXiv papers: 389

Total scanned papers: 229

Total relevant papers: 11

Table of contents with paper titles:

  1. TT-LoRA MoE: Unifying Parameter-Efficient Fine-Tuning and Sparse Mixture-of-Experts Authors: Pradip Kunwar, Minh N. Vu, Maanak Gupta, Mahmoud Abdelsalam, Manish Bhattarai

  2. Recursive KL Divergence Optimization: A Dynamic Framework for Representation Learning Authors: Anthony D Martin

  3. PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight Authors: Ben Goertzel, Paulos Yibelo

  4. AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization Authors: Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen

  5. Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization Authors: Shuai Gong, Chaoran Cui, Xiaolin Dong, Xiushan Nie, Lei Zhu, Xiaojun Chang

  6. Memorization and Knowledge Injection in Gated LLMs Authors: Xu Pan, Ely Hahami, Zechen Zhang, Haim Sompolinsky

  7. Efficient LLMs with AMP: Attention Heads and MLP Pruning Authors: Leandro Giusti Mugnaini, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Edson Bollis, Lucas Pellicer, Anna Helena Reali Costa, Artur Jordao

  8. Param$\Delta$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost Authors: Sheng Cao, Mingrui Wu, Karthik Prasad, Yuandong Tian, Zechun Liu

  9. Deep Learning Optimization Using Self-Adaptive Weighted Auxiliary Variables Authors: Yaru Liu, Yiqi Gu, Michael K. Ng

  10. Low-rank computation of the posterior mean in Multi-Output Gaussian Processes Authors: Sebastian Esche, Martin Stoll

  11. NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models Authors: Yi Zhou, Wenpeng Xing, Dezhang Kong, Changting Lin, Meng Han


1. TT-LoRA MoE: Unifying Parameter-Efficient Fine-Tuning and Sparse Mixture-of-Experts

ArXiv ID: 2504.21190

Authors: Pradip Kunwar, Minh N. Vu, Maanak Gupta, Mahmoud Abdelsalam, Manish Bhattarai

Abstract: We propose Tensor-Trained Low-Rank Adaptation Mixture of Experts (TT-LoRA MoE), a novel computational framework integrating Parameter-Efficient Fine-Tuning (PEFT) with sparse MoE routing to address scalability challenges in large model deployments. Unlike traditional MoE approaches, which face substantial computational overhead as expert counts grow, TT-LoRA MoE decomposes training into two distinct, optimized stages. First, we independently train lightweight, tensorized low-rank adapters (TT-LoRA experts), each specialized for specific tasks. Subsequently, these expert adapters remain frozen, eliminating inter-task interference and catastrophic forgetting in multi-task settings. A sparse MoE router, trained separately, dynamically leverages base model representations to select exactly one specialized adapter per input at inference time, automating expert selection without explicit task specification. Comprehensive experiments confirm our architecture retains the memory efficiency of low-rank adapters, seamlessly scales to large expert pools, and achieves robust task-level optimization. This structured decoupling significantly enhances computational efficiency and flexibility: it uses only 2% of the parameters of LoRA, 0.3% of Adapters, and 0.03% of AdapterFusion, and outperforms AdapterFusion by 4 value in multi-tasking, enabling practical and scalable multi-task inference deployments.

Comment: The paper introduces TT-LoRA MoE, which integrates sparse Mixture-of-Experts (MoE) with low-rank adaptation, aligning closely with the 'Model Architecture' and 'Model Compression' criteria. It provides a novel approach to scalability and efficiency in multi-task settings.

Relevance: 10 Novelty: 8
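
The inference-time behaviour described in the abstract (a router reads the base model's representation and picks exactly one frozen adapter) might be sketched as follows. This is an illustration only, not the authors' implementation: plain low-rank factors stand in for the tensor-train factorization, the router is assumed to be a single linear layer, and every name and dimension here is hypothetical.

```python
import random

random.seed(0)
DIM, NUM_EXPERTS, RANK = 8, 4, 2  # hypothetical sizes


def rand_matrix(rows, cols, scale=1.0):
    """Random Gaussian matrix stored as a list of rows."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


# Frozen base weight and frozen per-task adapters.
# ASSUMPTION: ordinary LoRA factors (A: down-projection, B: up-projection)
# replace the paper's tensor-train parameterization.
W_base = rand_matrix(DIM, DIM)
adapters = [
    (rand_matrix(RANK, DIM, 0.1), rand_matrix(DIM, RANK, 0.1))
    for _ in range(NUM_EXPERTS)
]

# ASSUMPTION: the router is one linear layer over the base representation.
W_router = rand_matrix(NUM_EXPERTS, DIM)


def route_and_apply(x):
    h = matvec(W_base, x)                    # base model representation
    logits = matvec(W_router, h)             # one score per expert
    expert = max(range(NUM_EXPERTS), key=lambda i: logits[i])  # top-1 choice
    A, B = adapters[expert]
    delta = matvec(B, matvec(A, x))          # low-rank update from the chosen expert
    return expert, [hi + di for hi, di in zip(h, delta)]


x = [random.gauss(0.0, 1.0) for _ in range(DIM)]
expert, y = route_and_apply(x)
print(expert, len(y))
```

Because only one adapter is applied per input, the per-input cost of the adapter path does not grow with the number of experts; only the router's output dimension does.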


2. Recursive KL Divergence Optimization: A Dynamic Framework for Representation Learning

ArXiv ID: 2504.21707

Authors: Anthony D Martin

Abstract: We propose a generalization of modern representation learning objectives by reframing them as recursive divergence alignment processes over localized conditional distributions. While recent frameworks like Information Contrastive Learning (I-Con) unify multiple learning paradigms through KL divergence between fixed neighborhood conditionals, we argue this view underplays a crucial recursive structure inherent in the learning process. We introduce Recursive KL Divergence Optimization (RKDO), a dynamic formalism where representation learning is framed as the evolution of KL divergences across data neighborhoods. This formulation captures contrastive, clustering, and dimensionality reduction methods as static slices while offering a new path to model stability and local adaptation. Our experiments demonstrate that RKDO offers dual efficiency advantages: approximately 30 percent lower loss values compared to static approaches across three different datasets, and 60 to 80 percent reduction in computational resources needed to achieve comparable results. This suggests that RKDO's recursive updating mechanism provides a fundamentally more efficient optimization landscape for representation learning, with significant implications for resource-constrained applications.

Comment: The paper proposes a recursive KL divergence optimization framework for representation learning, which directly aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 8


3. PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight

ArXiv ID: 2504.21029

Authors: Ben Goertzel, Paulos Yibelo

Abstract: We propose a robust transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable response generation. Our PICO (Prompt Isolation and Cybersecurity Oversight) framework structurally separates trusted system instructions from untrusted user inputs through dual channels that are processed independently and merged only by a controlled, gated fusion mechanism. In addition, we integrate a specialized Security Expert Agent within a Mixture-of-Experts (MoE) framework and incorporate a Cybersecurity Knowledge Graph (CKG) to supply domain-specific reasoning. Our training design further ensures that the system prompt branch remains immutable while the rest of the network learns to handle adversarial inputs safely. This PICO framework is presented via a general mathematical formulation, then elaborated in terms of the specifics of transformer architecture, and fleshed out via hypothetical case studies including Policy Puppetry attacks. While the most effective implementation may involve training transformers in a PICO-based way from scratch, we also present a cost-effective fine-tuning approach.

Comment: The paper introduces a secure transformer architecture using a Mixture-of-Experts framework, which is highly relevant to foundational research in model architecture and MoE.

Relevance: 9 Novelty: 8


4. AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

ArXiv ID: 2504.21659

Authors: Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen

Abstract: Recently, long-thought reasoning models achieve strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at https://github.com/StarDewXXX/AdaR1

Comment: The paper proposes a hybrid reasoning optimization framework for LLMs, which aligns with foundational research in LLM efficiency and adaptive reasoning strategies. It introduces a novel bi-level optimization approach.

Relevance: 9 Novelty: 8


5. Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization

ArXiv ID: 2504.21063

Authors: Shuai Gong, Chaoran Cui, Xiaolin Dong, Xiushan Nie, Lei Zhu, Xiaojun Chang

Abstract: Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data while preserving privacy. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. However, such a one-prompt-fits-all learning paradigm typically leads to performance degradation on personalized samples. Although the mixture of experts (MoE) offers a promising solution for specialization, existing MoE-based methods suffer from coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG, which treats multiple prompts as distinct experts. Unlike existing image-level routing designs, TRIP assigns different tokens within an image to specific experts. To ensure communication efficiency, TRIP incorporates a parameter-free routing mechanism based on token clustering and optimal transport. The instance-specific prompt is then synthesized by aggregating experts, weighted by the number of tokens assigned to each. Additionally, TRIP develops an unbiased learning strategy for prompt experts, leveraging the VLM's zero-shot generalization capability. Extensive experiments across four benchmarks demonstrate that TRIP achieves optimal generalization results, with communication of only 1K parameters per round. Our code is available at https://github.com/GongShuai8210/TRIP.

Comment: The paper introduces a token-level prompt mixture framework with parameter-free routing, which aligns with foundational research in model architecture (Mixture of Experts) and efficiency. The token-level routing is a novel contribution.

Relevance: 9 Novelty: 8
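
The aggregation step mentioned in the abstract (experts combined with weights proportional to the number of tokens assigned to each) can be illustrated in a few lines. This sketch covers only that weighting step: the token-to-expert assignment is given as a hypothetical fixed list, standing in for the token clustering and optimal transport routing, and the prompt vectors are made-up values.

```python
from collections import Counter

# Hypothetical prompt experts: one small vector per expert.
prompt_experts = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# ASSUMPTION: routing has already assigned each image token to an expert.
token_to_expert = [0, 0, 2, 2, 2, 1, 2, 0]

# Weight each expert by the share of tokens it received.
counts = Counter(token_to_expert)
total = len(token_to_expert)
weights = [counts.get(i, 0) / total for i in range(len(prompt_experts))]

# Instance-specific prompt: weighted sum of the expert prompts.
dim = len(prompt_experts[0])
instance_prompt = [
    sum(w * expert[d] for w, expert in zip(weights, prompt_experts))
    for d in range(dim)
]

print(weights)
print(instance_prompt)
```

Since the weights are derived from counts rather than from a learned gating network, this step adds no parameters that would need to be communicated between clients.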


6. Memorization and Knowledge Injection in Gated LLMs

ArXiv ID: 2504.21239

Authors: Xu Pan, Ely Hahami, Zechen Zhang, Haim Sompolinsky

Abstract: Large Language Models (LLMs) currently struggle to sequentially add new memories and integrate new knowledge. These limitations contrast with the human ability to continuously learn from new experiences and acquire knowledge throughout life. Most existing approaches add memories either through large context windows or external memory buffers (e.g., Retrieval-Augmented Generation), and studies on knowledge injection rarely test scenarios resembling everyday life events. In this work, we introduce a continual learning framework, Memory Embedded in Gated LLMs (MEGa), which injects event memories directly into the weights of LLMs. Each memory is stored in a dedicated set of gated low-rank weights. During inference, a gating mechanism activates relevant memory weights by matching query embeddings to stored memory embeddings. This enables the model to both recall entire memories and answer related questions. On two datasets - fictional characters and Wikipedia events - MEGa outperforms baseline approaches in mitigating catastrophic forgetting. Our model draws inspiration from the complementary memory system of the human brain.

Comment: The paper introduces a novel framework, MEGa, for continual learning in LLMs using gated low-rank weights, which aligns with the 'Model Compression' criterion due to its focus on low-rank approaches and efficiency. It also touches on foundational aspects of LLM behavior.

Relevance: 9 Novelty: 8
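
The gating idea in the abstract (match a query embedding against stored memory embeddings, then activate that memory's low-rank weights) can be sketched roughly as below. This is not the MEGa implementation: hard top-1 selection by cosine similarity, rank-1 updates, and all embeddings and weights are assumptions chosen to keep the example small.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stored memories: a key embedding plus rank-1 factors (u, v),
# so that the memory's weight update is the outer product of u and v.
memories = [
    {"key": [1.0, 0.0, 0.0], "u": [0.1, 0.0, 0.0], "v": [0.0, 1.0, 0.0]},
    {"key": [0.0, 1.0, 0.0], "u": [0.0, 0.2, 0.0], "v": [1.0, 0.0, 0.0]},
]

# Frozen base weight (identity, for readability).
W = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]


def gated_weights(query):
    """Pick the best-matching memory and add its low-rank update to W."""
    scores = [cosine(query, m["key"]) for m in memories]
    idx = max(range(len(memories)), key=lambda i: scores[i])  # ASSUMED hard gate
    m = memories[idx]
    W_eff = [
        [W[i][j] + m["u"][i] * m["v"][j] for j in range(3)]
        for i in range(3)
    ]
    return idx, W_eff


idx, W_eff = gated_weights([0.1, 0.9, 0.0])
print(idx)
```

Keeping each memory in its own set of weights means that storing a new memory does not overwrite earlier ones, which is the property the abstract ties to reduced catastrophic forgetting.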


7. Efficient LLMs with AMP: Attention Heads and MLP Pruning

ArXiv ID: 2504.21174

Authors: Leandro Giusti Mugnaini, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Edson Bollis, Lucas Pellicer, Anna Helena Reali Costa, Artur Jordao

Abstract: Deep learning drives a new wave in computing systems and triggers the automation of increasingly complex problems. In particular, Large Language Models (LLMs) have significantly advanced cognitive tasks, often matching or even surpassing human-level performance. However, their extensive parameters result in high computational costs and slow inference, posing challenges for deployment in resource-limited settings. Among the strategies to overcome the aforementioned challenges, pruning emerges as a successful mechanism since it reduces model size while maintaining predictive ability. In this paper, we introduce AMP: Attention Heads and MLP Pruning, a novel structured pruning method that efficiently compresses LLMs by removing less critical structures within Multi-Head Attention (MHA) and Multilayer Perceptron (MLP). By projecting the input data onto weights, AMP assesses structural importance and overcomes the limitations of existing techniques, which often fall short in flexibility or efficiency. In particular, AMP surpasses the current state-of-the-art on commonsense reasoning tasks by up to 1.49 percentage points, achieving a 30% pruning ratio with minimal impact on zero-shot task performance. Moreover, AMP also improves inference speeds, making it well-suited for deployment in resource-constrained environments. We confirm the flexibility of AMP on different families of LLMs, including LLaMA and Phi.

Comment: The paper proposes AMP, a structured pruning method targeting Attention Heads and MLPs in LLMs, which aligns with the 'Model Compression' criterion. It offers a novel pruning approach with significant efficiency improvements.

Relevance: 9 Novelty: 8


8. Param$\Delta$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost

ArXiv ID: 2504.21023

Authors: Sheng Cao, Mingrui Wu, Karthik Prasad, Yuandong Tian, Zechun Liu

Abstract: The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces $Param\Delta$, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights ($\Theta_\text{post}$) and base model weights ($\Theta_\text{base}$), and adding this to the updated base model ($\Theta'_\text{base}$), we define $Param\Delta$ Model as: $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$. This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We did analysis on Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate $Param\Delta$ Model effectively replicates traditional post-training. For example, the $Param\Delta$ Model obtained from 70B Llama3-inst, Llama3-base, Llama3.1-base models attains approximately 95% of Llama3.1-inst model's performance on average. $Param\Delta$ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.

Comment: The paper proposes a novel method, ParamΔ, for post-training large language models at zero cost, which aligns with the 'Large Language Models' criterion by introducing a new perspective on leveraging model weights without additional training.

Relevance: 8 Novelty: 8
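
The weight arithmetic in the abstract (post-trained weights minus base weights, added to the updated base weights) is simple enough to show directly. The sketch below applies it per parameter to toy checkpoints stored as dictionaries of float lists; the function name, the dictionary layout, and the numbers are hypothetical, and a real checkpoint would hold tensors with matching names and shapes.

```python
def param_delta(theta_post, theta_base, theta_base_new):
    """Return theta_post - theta_base + theta_base_new, parameter by parameter."""
    merged = {}
    for name, new_weights in theta_base_new.items():
        merged[name] = [
            w_new + (w_post - w_base)
            for w_new, w_post, w_base in zip(
                new_weights, theta_post[name], theta_base[name]
            )
        ]
    return merged


# Toy checkpoints with a single parameter tensor named "w" (made-up values).
theta_base = {"w": [1.0, 2.0]}       # original base model
theta_post = {"w": [1.5, 1.0]}       # post-trained version of that base
theta_base_new = {"w": [2.0, 2.0]}   # updated base model

merged = param_delta(theta_post, theta_base, theta_base_new)
print(merged)  # {'w': [2.5, 1.0]}
```

The operation only makes sense when all three checkpoints share the same architecture and parameter names, which is an assumption of this sketch as well.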


9. Deep Learning Optimization Using Self-Adaptive Weighted Auxiliary Variables

ArXiv ID: 2504.21501

Authors: Yaru Liu, Yiqi Gu, Michael K. Ng

Abstract: In this paper, we develop a new optimization framework for the least squares learning problem via fully connected neural networks or physics-informed neural networks. The gradient descent sometimes behaves inefficiently in deep learning because of the high non-convexity of loss functions and the vanishing gradient issue. Our idea is to introduce auxiliary variables to separate the layers of the deep neural networks and reformulate the loss functions for ease of optimization. We design the self-adaptive weights to preserve the consistency between the reformulated loss and the original mean squared loss, which guarantees that optimizing the new loss helps optimize the original problem. Numerical experiments are presented to verify the consistency and show the effectiveness and robustness of our models over gradient descent.

Comment: The paper introduces a novel optimization framework using self-adaptive weighted auxiliary variables, which aligns with foundational research in training dynamics and optimization for neural networks.

Relevance: 8 Novelty: 7


10. Low-rank computation of the posterior mean in Multi-Output Gaussian Processes

ArXiv ID: 2504.21527

Authors: Sebastian Esche, Martin Stoll

Abstract: Gaussian processes (GP) are a versatile tool in machine learning and computational science. We here consider the case of multi-output Gaussian processes (MOGP) and present low-rank approaches for efficiently computing the posterior mean of a MOGP. Starting from low-rank spatio-temporal data we consider a structured covariance function, assuming separability across space and time. This separability, in turn, gives a decomposition of the covariance matrix into a Kronecker product of individual covariance matrices. Incorporating the typical noise term to the model then requires the solution of a large-scale Stein equation for computing the posterior mean. For this, we propose efficient low-rank methods based on a combination of a LRPCG method with the Sylvester equation solver KPIK adjusted for solving Stein equations. We test the developed method on real world street network graphs by using graph filters as covariance matrices. Moreover, we propose a degree-weighted average covariance matrix, which can be employed under specific assumptions to achieve more efficient convergence.

Comment: The paper presents low-rank methods for efficient computation in multi-output Gaussian processes, aligning with the 'Model Compression' criterion due to its focus on low-rank approaches and efficiency.

Relevance: 8 Novelty: 7
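
The step from a Kronecker-structured covariance to a matrix equation, which the abstract states without detail, can be written out as follows. This is a sketch under assumptions not given in the abstract: a zero prior mean, noise variance $\sigma^2$, column-stacking $\operatorname{vec}$, symmetric spatial and temporal covariance matrices $K_s$ and $K_t$, and a data matrix $Y$ with one row per spatial location and one column per time step.

```latex
\begin{aligned}
K &= K_t \otimes K_s, \\
(K_t \otimes K_s + \sigma^2 I)\,\operatorname{vec}(X) &= \operatorname{vec}(Y) \\
\Longleftrightarrow \quad K_s X K_t + \sigma^2 X &= Y, \\
\text{posterior mean at the training inputs:} \quad M &= K_s X K_t .
\end{aligned}
```

The equivalence uses the identity $(B^\top \otimes A)\operatorname{vec}(X) = \operatorname{vec}(AXB)$ together with $K_t^\top = K_t$. Dividing by $\sigma^2$ puts the matrix equation in Stein form, $X - AXB = C$ with $A = -\sigma^{-2} K_s$, $B = K_t$, and $C = \sigma^{-2} Y$, so the large Kronecker system never has to be formed explicitly.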


11. NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models

ArXiv ID: 2504.21053

Authors: Yi Zhou, Wenpeng Xing, Dezhang Kong, Changting Lin, Meng Han

Abstract: Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we examine activation patterns in response to harmful and harmless prompts to detect neurons that are critical for distinguishing between harmful and harmless inputs; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safe alignment; and Neuron Relearning for Safety Removal, where we fine-tune these selected neurons to restore the model's ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.

Comment: The paper identifies vulnerabilities in LLM safety alignment and proposes a method to induce disalignment, which aligns with the 'Large Language Models' criterion by providing insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
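
A response in the requested format could be consumed as sketched below. The two sample lines reuse IDs and scores from the digest above with abbreviated comments; the helper name `parse_jsonl` and the ranking rule (relevance first, then novelty) are assumptions for illustration, not part of the original pipeline.

```python
import json

# Sample model output: one JSON object per line (comments abbreviated).
response = "\n".join([
    '{"ARXIVID": "2504.21501", "COMMENT": "Optimization with auxiliary variables.", "RELEVANCE": 8, "NOVELTY": 7}',
    '{"ARXIVID": "2504.21190", "COMMENT": "Sparse MoE with low-rank adapters.", "RELEVANCE": 10, "NOVELTY": 8}',
])


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


records = parse_jsonl(response)

# ASSUMED ranking rule: highest relevance first, novelty as tie-breaker.
ranked = sorted(
    records, key=lambda r: (r["RELEVANCE"], r["NOVELTY"]), reverse=True
)

for r in ranked:
    print(r["ARXIVID"], r["RELEVANCE"], r["NOVELTY"])
```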

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: