Personalized Daily ArXiv Papers 2025-08-14
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 34630 | 4045 | 38675 |
| Cost | $0.09 | $0.04 | $0.13 |
Total arXiv papers: 538
Total scanned papers: 320
Total relevant papers: 17
Table of contents with paper titles:
- $\mu$-Parametrization for Mixture of Experts Authors: Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski
- Provable In-Context Vector Arithmetic via Retrieving Task Concepts Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
- EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models Authors: Omar Bazarbachi, Zijun Sun, Yanning Shen
- HKT: A Biologically Inspired Framework for Modular Hereditary Knowledge Transfer in Neural Networks Authors: Yanick Chistian Tchenko, Felix Mohr, Hicham Hadj Abdelkader, Hedi Tabia
- Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization Authors: Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
- DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic Authors: Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Francesca Palermo, Diana Trojaniello, Manuel Roveri
- CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge Authors: Muqing Li, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
- Global Convergence Analysis of Vanilla Gradient Descent for Asymmetric Matrix Completion Authors: Xu Zhang, Shuo Chen, Jinsheng Li, Xiangying Pang, Maoguo Gong
- HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap Authors: Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
- NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation Authors: Devvrat Joshi, Islem Rekik
- Pattern-based Knowledge Component Extraction from Student Code Using Representation Learning Authors: Muntasir Hoq, Griffin Pitts, Andrew Lan, Peter Brusilovsky, Bita Akram
- Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks Authors: Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
- Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models Authors: Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
- Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning Authors: Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang
- Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA Authors: Yuan-Hao Wei, Fu-Hao Deng, Lin-Yong Cui, Yan-Jie Sun
- Improving Diversity in Language Models: When Temperature Fails, Change the Loss Authors: Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, Benjamin Negrevergne
- Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation Authors: Sameer Ambekar, Daniel M. Lang, Julia A. Schnabel
1. $\mu$-Parametrization for Mixture of Experts
ArXiv ID: 2508.09752
Authors: Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski
Abstract: Recent years have seen a growing interest and adoption of LLMs, with $\mu$Transfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture in extremely large models. However, the intersection of these two advancements has remained unexplored. In this work, we derive a $\mu$-Parameterization ($\mu$P) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and experts. We empirically validate our parameterization and further investigate how scaling the number of experts and granularity affects the optimal learning rate.
Comment: The paper provides a theoretical framework for MoE parameterization, aligning with model architecture insights and foundational research.
Relevance: 10 Novelty: 8
2. Provable In-Context Vector Arithmetic via Retrieving Task Concepts
ArXiv ID: 2508.09820
Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
Abstract: In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights.
Comment: The paper provides a theoretical framework for in-context learning in LLMs, focusing on vector arithmetic and task concept retrieval, which is relevant to foundational research in LLMs.
Relevance: 9 Novelty: 8
3. EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models
ArXiv ID: 2508.09471
Authors: Omar Bazarbachi, Zijun Sun, Yanning Shen
Abstract: As Large Language Models (LLMs) become more widely adopted and scale up in size, the computational and memory challenges involved in deploying these massive foundation models have grown increasingly severe. This underscores the urgent need to develop more efficient model variants. Faced with this challenge, the present work introduces EGGS-PTP: an Expander-Graph Guided Structured Post-training Pruning method. The proposed approach leverages graph theory to guide the design of N:M structured pruning, effectively reducing model size and computational demands. By incorporating concepts from expander graphs, EGGS-PTP ensures information flow within the pruned network, preserving essential model functionality. Extensive numerical experiments demonstrate that EGGS-PTP not only achieves significant acceleration and memory savings due to structured sparsity but also outperforms existing structured pruning techniques in terms of accuracy across various LLMs.
Comment: The paper introduces a novel structured pruning method for LLMs using expander graphs, aligning with model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 8
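The N:M structured sparsity pattern mentioned in the abstract can be illustrated with a minimal sketch. Note that this is plain magnitude-based selection, not the expander-graph guided selection that EGGS-PTP proposes; the function name `nm_prune_mask` and the toy weights are assumptions made for illustration only.

```python
import numpy as np


def nm_prune_mask(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Return a 0/1 mask that keeps the n largest-magnitude entries
    in every consecutive group of m entries along each row.

    Illustrative magnitude criterion only (an assumption, not the paper's rule).
    """
    rows, cols = weights.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # Indices of the n largest magnitudes inside every group of m.
    keep = np.argsort(groups, axis=-1)[..., -n:]
    mask = np.zeros_like(groups, dtype=np.int8)
    np.put_along_axis(mask, keep, 1, axis=-1)
    return mask.reshape(rows, cols)


# Toy example: one row, two groups of four weights, 2:4 sparsity.
w = np.array([[0.1, -0.9, 0.3, 0.05, 2.0, -0.2, 0.0, 1.5]])
mask = nm_prune_mask(w, n=2, m=4)
pruned = w * mask
```

Every group of four ends up with exactly two nonzero weights, which is the regularity that hardware with structured-sparsity support can exploit.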
4. HKT: A Biologically Inspired Framework for Modular Hereditary Knowledge Transfer in Neural Networks
ArXiv ID: 2508.09743
Authors: Yanick Chistian Tchenko, Felix Mohr, Hicham Hadj Abdelkader, Hedi Tabia
Abstract: A prevailing trend in neural network research suggests that model performance improves with increasing depth and capacity - often at the cost of integrability and efficiency. In this paper, we propose a strategy to optimize small, deployable models by enhancing their capabilities through structured knowledge inheritance. We introduce Hereditary Knowledge Transfer (HKT), a biologically inspired framework for modular and selective transfer of task-relevant features from a larger, pretrained parent network to a smaller child model. Unlike standard knowledge distillation, which enforces uniform imitation of teacher outputs, HKT draws inspiration from biological inheritance mechanisms - such as memory RNA transfer in planarians - to guide a multi-stage process of feature transfer. Neural network blocks are treated as functional carriers, and knowledge is transmitted through three biologically motivated components: Extraction, Transfer, and Mixture (ETM). A novel Genetic Attention (GA) mechanism governs the integration of inherited and native representations, ensuring both alignment and selectivity. We evaluate HKT across diverse vision tasks, including optical flow (Sintel, KITTI), image classification (CIFAR-10), and semantic segmentation (LiTS), demonstrating that it significantly improves child model performance while preserving its compactness. The results show that HKT consistently outperforms conventional distillation approaches, offering a general-purpose, interpretable, and scalable solution for deploying high-performance neural networks in resource-constrained environments.
Comment: The paper introduces a biologically inspired framework for knowledge transfer in neural networks, which aligns with representation learning and model architecture.
Relevance: 9 Novelty: 8
5. Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization
ArXiv ID: 2508.09330
Authors: Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
Abstract: Synaptic pruning in biological brains removes weak connections to improve efficiency. In contrast, dropout regularization in artificial neural networks randomly deactivates neurons without considering activity-dependent pruning. We propose a magnitude-based synaptic pruning method that better reflects biology by progressively removing low-importance connections during training. Integrated directly into the training loop as a dropout replacement, our approach computes weight importance from absolute magnitudes across layers and applies a cubic schedule to gradually increase global sparsity. At fixed intervals, pruning masks permanently remove low-importance weights while maintaining gradient flow for active ones, eliminating the need for separate pruning and fine-tuning phases. Experiments on multiple time series forecasting models including RNN, LSTM, and Patch Time Series Transformer across four datasets show consistent gains. Our method ranked best overall, with statistically significant improvements confirmed by Friedman tests (p < 0.01). In financial forecasting, it reduced Mean Absolute Error by up to 20% over models with no or standard dropout, and up to 52% in select transformer models. This dynamic pruning mechanism advances regularization by coupling weight elimination with progressive sparsification, offering easy integration into diverse architectures. Its strong performance, especially in financial time series forecasting, highlights its potential as a practical alternative to conventional dropout techniques.
Comment: The paper proposes a synaptic pruning method inspired by biological processes, which is relevant to model compression through sparsity and pruning.
Relevance: 9 Novelty: 8
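As a rough sketch of the two ingredients named in the abstract (a cubic sparsity schedule and global magnitude-based masks), the following assumes the commonly used cubic form s(t) = s_f + (s_i - s_f)(1 - t/T)^3; the abstract does not give the exact formula, so that form, the function names, and the toy layers are all assumptions.

```python
import numpy as np


def cubic_sparsity(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Target global sparsity at a given step (assumed cubic ramp)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def global_magnitude_masks(layers, sparsity):
    """Boolean keep-masks that drop the smallest-magnitude weights across all layers.

    Ties at the threshold are pruned too, so slightly more than the
    requested fraction may be removed.
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    k = int(sparsity * all_mags.size)  # number of weights to prune
    if k == 0:
        return [np.ones_like(w, dtype=bool) for w in layers]
    threshold = np.partition(all_mags, k - 1)[k - 1]
    return [np.abs(w) > threshold for w in layers]


# Toy example: two "layers" with 7 weights in total, pruned at 50% sparsity.
layers = [np.array([0.1, -0.5, 2.0]), np.array([[0.05, 1.0], [-3.0, 0.2]])]
masks = global_magnitude_masks(layers, sparsity=0.5)
```

In a training loop, masks computed at each pruning interval would be combined with the previous ones (logical AND) so that removed weights stay removed, matching the "permanently remove" behaviour the abstract describes.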
6. DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic
ArXiv ID: 2508.09176
Authors: Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Francesca Palermo, Diana Trojaniello, Manuel Roveri
Abstract: The deployment of deep neural networks on resource-constrained devices relies on quantization. While static, uniform quantization applies a fixed bit-width to all inputs, it fails to adapt to their varying complexity. Dynamic, instance-based mixed-precision quantization promises a superior accuracy-efficiency trade-off by allocating higher precision only when needed. However, a critical bottleneck remains: existing methods require a costly dequantize-to-float and requantize-to-integer cycle to change precision, breaking the integer-only hardware paradigm and compromising performance gains. This paper introduces Dynamic Quantization Training (DQT), a novel framework that removes this bottleneck. At the core of DQT is a nested integer representation where lower-precision values are bit-wise embedded within higher-precision ones. This design, coupled with custom integer-only arithmetic, allows for on-the-fly bit-width switching through a near-zero-cost bit-shift operation. This makes DQT the first quantization framework to enable both dequantization-free static mixed-precision of the backbone network, and truly efficient dynamic, instance-based quantization through a lightweight controller that decides at runtime how to quantize each layer. We demonstrate DQT's state-of-the-art performance on ResNet18 on CIFAR-10 and ResNet50 on ImageNet. On ImageNet, our 4-bit dynamic ResNet50 achieves 77.00% top-1 accuracy, an improvement over leading static (LSQ, 76.70%) and dynamic (DQNET, 76.94%) methods at a comparable BitOPs budget. Crucially, DQT achieves this with a bit-width transition cost of only 28.3M simple bit-shift operations, a drastic improvement over the 56.6M costly Multiply-Accumulate (MAC) floating-point operations required by previous dynamic approaches - unlocking a new frontier in efficient, adaptive AI.
Comment: The paper introduces a novel framework for dynamic quantization training, which is relevant to model compression and efficiency.
Relevance: 9 Novelty: 8
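The idea of a nested integer representation, where a lower-precision value is embedded in the top bits of a higher-precision one, can be sketched in a few lines. This is only an illustration under simplifying assumptions (unsigned codes, truncation instead of rounding); the helper names are hypothetical and the paper's actual arithmetic may differ.

```python
def to_lower_precision(code: int, from_bits: int, to_bits: int) -> int:
    """Narrow an unsigned integer code by dropping its least significant bits."""
    if not 0 < to_bits <= from_bits:
        raise ValueError("need 0 < to_bits <= from_bits")
    if not 0 <= code < (1 << from_bits):
        raise ValueError("code does not fit in from_bits")
    # A single right shift; no float conversion is involved.
    return code >> (from_bits - to_bits)


def represented_value(code: int, bits: int) -> float:
    """Map a code to [0, 1) for inspection only; not part of the integer path."""
    return code / (1 << bits)


q8 = 0b10110110                       # 8-bit code (182)
q4 = to_lower_precision(q8, 8, 4)     # top four bits: 0b1011
q2 = to_lower_precision(q8, 8, 2)     # top two bits: 0b10
```

Because the codes are nested, narrowing in two steps (8 to 4 to 2 bits) gives the same result as narrowing directly from 8 to 2 bits, which is what makes switching precision by bit shift consistent.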
7. CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge
ArXiv ID: 2508.09208
Authors: Muqing Li, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
Abstract: The proliferation of large language models (LLMs) has driven the adoption of Mixture-of-Experts (MoE) architectures as a promising solution to scale model capacity while controlling computational costs. However, deploying MoE models in resource-constrained mobile edge computing environments presents significant challenges due to their large memory footprint and dynamic expert activation patterns. To address these challenges, we propose a novel dynamic resource-aware collaborative optimization framework that jointly optimizes expert aggregation granularity and offloading strategies based on real-time device resource states, network conditions, and input characteristics in mobile edge environments, denoted as CoMoE. In CoMoE, we first systematically analyze existing expert aggregation techniques, including expert parameter merging, knowledge distillation, and parameter sharing decomposition, identifying their limitations in dynamic mobile environments. We then investigate expert offloading strategies encompassing expert prediction and prefetching, expert caching and scheduling, and multi-tier storage architectures, revealing the interdependencies between routing decisions and offloading performance. CoMoE incorporates adaptive scheduling mechanisms that respond to user mobility and varying network conditions, enabling efficient MoE deployment across heterogeneous edge devices. Extensive experiments on real mobile edge testbeds demonstrate that CoMoE achieves approximately 70% reduction in memory usage compared to baseline methods and 10.5% lower inference latency than existing expert offloading techniques, while maintaining model performance stability. For large-scale MoE models (e.g., 7.4B-parameter Switch-Base-128), CoMoE reduces memory requirements from 15.6GB to 4.7GB, enabling deployment on resource-constrained mobile edge devices that previously could only support much smaller models.
Comment: The paper proposes a collaborative optimization framework for MoE-based LLMs at the edge, which is relevant to model architecture and efficiency.
Relevance: 9 Novelty: 8
8. Global Convergence Analysis of Vanilla Gradient Descent for Asymmetric Matrix Completion
ArXiv ID: 2508.09685
Authors: Xu Zhang, Shuo Chen, Jinsheng Li, Xiangying Pang, Maoguo Gong
Abstract: This paper investigates the asymmetric low-rank matrix completion problem, which can be formulated as an unconstrained non-convex optimization problem with a nonlinear least-squares objective function, and is solved via gradient descent methods. Previous gradient descent approaches typically incorporate regularization terms into the objective function to guarantee convergence. However, numerical experiments and theoretical analysis of the gradient flow both demonstrate that the elimination of regularization terms in gradient descent algorithms does not adversely affect convergence performance. By introducing the leave-one-out technique, we inductively prove that the vanilla gradient descent with spectral initialization achieves a linear convergence rate with high probability. Besides, we demonstrate that the balancing regularization term exhibits a small norm during iterations, which reveals the implicit regularization property of gradient descent. Empirical results show that our algorithm has a lower computational cost while maintaining comparable completion performance compared to other gradient descent algorithms.
Comment: The paper provides a theoretical analysis of gradient descent for asymmetric low-rank matrix completion, which is relevant to model compression and efficiency.
Relevance: 9 Novelty: 8
9. HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
ArXiv ID: 2508.09591
Authors: Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
Abstract: The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) due to its sparsity, which reduces computational demands while easily scaling the model size. In MoE models, each MoE layer must dynamically choose tokens to activate particular experts for computation, while the activated experts may not be located in the same device or GPU as the token. However, this leads to substantial communication and load imbalances across all GPUs, which obstructs the scalability of distributed systems within a GPU cluster. To this end, we introduce HierMoE to accelerate the training of MoE models by two topology-aware techniques: 1) token deduplication to reduce the communication traffic, and 2) expert swap to balance the workloads among all GPUs. To enable the above two proposed approaches to be more general, we build theoretical models aimed at achieving the best token duplication and expert swap strategy under different model configurations and hardware environments. We implement our prototype HierMoE system atop Megatron-LM and conduct experiments on a 32-GPU cluster with DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results show that our HierMoE achieves $1.55\times$ to $3.32\times$ faster communication and delivers $1.18\times$ to $1.27\times$ faster end-to-end training compared to state-of-the-art MoE training systems, Tutel-2DH, SmartMoE, and Megatron-LM.
Comment: The paper introduces HierMoE, which accelerates MoE training with hierarchical token deduplication and expert swap, aligning with the core topic of Model Architecture and MoE.
Relevance: 9 Novelty: 7
10. NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation
ArXiv ID: 2508.09715
Authors: Devvrat Joshi, Islem Rekik
Abstract: The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus datasets for pneumonia detection, NEURAL achieves a 93.4-97.7% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data. By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at https://github.com/basiralab/NEURAL.
Comment: The paper introduces a novel framework for data compression using attention-guided pruning, which aligns with model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 7
11. Pattern-based Knowledge Component Extraction from Student Code Using Representation Learning
ArXiv ID: 2508.09281
Authors: Muntasir Hoq, Griffin Pitts, Andrew Lan, Peter Brusilovsky, Bita Akram
Abstract: Effective personalized learning in computer science education depends on accurately modeling what students know and what they need to learn. While Knowledge Components (KCs) provide a foundation for such modeling, automated KC extraction from student code is inherently challenging due to insufficient explainability of discovered KCs and the open-endedness of programming problems with significant structural variability across student solutions and complex interactions among programming concepts. In this work, we propose a novel, explainable framework for automated KC discovery through pattern-based KCs: recurring structural patterns within student code that capture the specific programming patterns and language constructs that students must master. Toward this, we train a Variational Autoencoder to generate important representative patterns from student code guided by an explainable, attention-based code representation model that identifies important correct and incorrect pattern implementations from student code. These patterns are then clustered to form pattern-based KCs. We evaluate our KCs using two well-established methods informed by Cognitive Science: learning curve analysis and Deep Knowledge Tracing (DKT). Experimental results demonstrate meaningful learning trajectories and significant improvements in DKT predictive performance over traditional KT methods. This work advances knowledge modeling in CS education by providing an automated, scalable, and explainable framework for identifying granular code patterns and algorithmic constructs, essential for student learning.
Comment: The paper proposes a novel framework for automated knowledge component extraction using representation learning, which aligns with the core topic of representation learning.
Relevance: 9 Novelty: 7
12. Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks
ArXiv ID: 2508.09190
Authors: Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
Abstract: Fine-tuning as a service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce the fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.
Comment: The paper proposes Fine-Grained Safety Neurons for reducing fine-tuning risks in LLMs, which aligns with foundational research in LLM safety and interpretability.
Relevance: 8 Novelty: 7
13. Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
ArXiv ID: 2508.09874
Authors: Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
Abstract: Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
Comment: The paper introduces a novel memory architecture for LLMs, focusing on domain adaptation without changing original model parameters, aligning with foundational research in LLM architecture.
Relevance: 8 Novelty: 7
14. Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning
ArXiv ID: 2508.09883
Authors: Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang
Abstract: Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpora and multistage training combining reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning via only distillation, a reasoning scaling law is still taking shape, increasing computational costs. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities.
Comment: The paper proposes a data-efficient distillation framework for reasoning in LLMs, aligning with foundational research in LLM behavior and efficiency.
Relevance: 8 Novelty: 7
15. Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA
ArXiv ID: 2508.09721
Authors: Yuan-Hao Wei, Fu-Hao Deng, Lin-Yong Cui, Yan-Jie Sun
Abstract: The interpretability of generative models is considered a key factor in demonstrating their effectiveness and controllability. The generated data are believed to be determined by latent variables that are not directly observable. Therefore, disentangling, decoupling, decomposing, causal inference, or performing Independent Component Analysis (ICA) in the latent variable space helps uncover the independent factors that influence the attributes or features affecting the generated outputs, thereby enhancing the interpretability of generative models. As a generative model, Variational Autoencoders (VAEs) combine with variational Bayesian inference algorithms. Using VAEs, the inverse process of ICA can be equivalently framed as a variational inference process. In some studies, Gaussian processes (GPs) have been introduced as priors for each dimension of latent variables in VAEs, structuring and separating each dimension from temporal or spatial perspectives, and encouraging different dimensions to control various attributes of the generated data. However, GPs impose a significant computational burden, resulting in substantial resource consumption when handling large datasets. Essentially, GPs model different temporal or spatial structures through various kernel functions. Structuring the priors of latent variables via kernel functions-so that different kernel functions model the correlations among sequence points within different latent dimensions-is at the core of achieving disentanglement in VAEs. The proposed Structured Kernel Regression VAE (SKR-VAE) leverages this core idea in a more efficient way, avoiding the costly kernel matrix inversion required in GPs. This research demonstrates that, while maintaining ICA performance, SKR-VAE achieves greater computational efficiency and significantly reduced computational burden compared to GP-VAE.
Comment: The paper introduces SKR-VAE, a computationally efficient surrogate for GP-VAEs in ICA, which is relevant to model architecture and efficiency improvements.
Relevance: 8 Novelty: 7
16. Improving Diversity in Language Models: When Temperature Fails, Change the Loss
ArXiv ID: 2508.09654
Authors: Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, Benjamin Negrevergne
Abstract: Increasing diversity in language models is a challenging yet essential objective. A common approach is to raise the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision), while increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques.
Comment: The paper proposes rethinking loss functions in language models to improve diversity, which relates to foundational research in LLMs.
Relevance: 8 Novelty: 7
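For readers unfamiliar with the mechanism this abstract analyzes, the following is a minimal sketch of standard decoding temperature: logits are divided by a temperature T before the softmax. It illustrates only the baseline the paper critiques, not the proposed Precision-Recall loss. The function names and the example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits, chosen only for illustration.
logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
base = softmax_with_temperature(logits, 1.0)  # unscaled softmax
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution
```

Lowering T concentrates mass on the top token, while raising T flattens the distribution. Flattening only redistributes mass according to the model's existing ranking, which is consistent with the abstract's point that a model must be trained toward coverage for temperature to help.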
17. Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation
ArXiv ID: 2508.09223
Authors: Sameer Ambekar, Daniel M. Lang, Julia A. Schnabel
Abstract: Test-time adaptation allows pretrained models to adjust to incoming data streams, addressing distribution shifts between source and target domains. However, standard methods rely on single-dimensional linear classification layers, which often fail to handle diverse and complex shifts. We propose Hierarchical Adaptive Networks with Task Vectors (Hi-Vec), which leverages multiple layers of increasing size for dynamic test-time adaptation. By decomposing the encoder's representation space into such hierarchically organized layers, Hi-Vec, in a plug-and-play manner, allows existing methods to adapt to shifts of varying complexity. Our contributions are threefold: First, we propose dynamic layer selection for automatic identification of the optimal layer for adaptation to each test batch. Second, we propose a mechanism that merges weights from the dynamic layer to other layers, ensuring all layers receive target information. Third, we propose linear layer agreement that acts as a gating function, preventing erroneous fine-tuning by adaptation on noisy batches. We rigorously evaluate the performance of Hi-Vec in challenging scenarios and on multiple target datasets, proving its strong capability to advance state-of-the-art methods. Our results show that Hi-Vec improves robustness, addresses uncertainty, and handles limited batch sizes and increased outlier rates.
Comment: The paper proposes a novel hierarchical adaptive network architecture for test-time adaptation, which aligns with the model architecture criterion.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
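A response in this format can be checked mechanically. The sketch below is one possible validator, assuming one JSON object per line with exactly the four keys listed above and integer scores. The function name `parse_jsonl` and the shortened COMMENT strings are hypothetical; the ArXiv IDs and scores are taken from papers 16 and 17 in this digest.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}

def parse_jsonl(text):
    """Parse JSONL text and validate each record against the prompt's schema."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            # Assumption: scores are integers in the 1-10 range.
            if not isinstance(score, int) or not 1 <= score <= 10:
                raise ValueError(f"line {line_no}: {key} must be an integer from 1 to 10")
        records.append(record)
    return records

# Two example lines; the COMMENT text is abbreviated for illustration.
sample = (
    '{"ARXIVID": "2508.09654", "COMMENT": "Rethinks LM loss functions.", '
    '"RELEVANCE": 8, "NOVELTY": 7}\n'
    '{"ARXIVID": "2508.09223", "COMMENT": "Hierarchical adaptive architecture.", '
    '"RELEVANCE": 8, "NOVELTY": 7}\n'
)
records = parse_jsonl(sample)
```

Validating line by line means a single malformed record can be reported with its line number rather than invalidating the whole response.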
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
  - Focus: Fully aligned with core topics with no deviation; score highest if the paper contains relevant keywords.
  - Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
- Relevance 7-8 (Relevant)
  - Focus: Retains a solid link to the main research area, though it may touch on peripheral elements.
  - Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.
- Relevance 5-6 (Borderline)
  - Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  - Examples: Work referencing MoE but centered on reinforcement learning.
- Relevance 3-4 (Irrelevant)
  - Focus: Largely outside our interests with no association to our topics.
  - Examples: Application-focused papers, such as using MoE to solve a real-world problem.
- Relevance 1-2 (Ignore)
  - Focus: Purely unrelated to our topics. A completely different domain.
  - Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9-10 despite initial appearances. (This is usually a very rare concept that belongs to fundamental research.)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
  - Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  - Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
- Novelty 7-8 (Improvements)
  - Definition: Substantial insights/enhancements, though not a full paradigm shift.
  - Examples: Modifications on existing methods yielding significantly better results.
- Novelty 5-6 (Borderline)
  - Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  - Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
- Novelty 3-4 (Tangential)
  - Definition: Minor or domain-specific improvements with limited broader impact.
  - Examples: Slight modifications to known methods with questionable motivation; purely engineering work like a new benchmark/dataset.
- Novelty 1-2 (Low)
  - Definition: Minimal originality, applying standard approaches without real innovation.
  - Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
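The two rubrics share the same band boundaries (9-10, 7-8, 5-6, 3-4, 1-2) and differ only in their labels. A minimal sketch of a lookup that maps a numeric score to its band label follows; the table and function names are hypothetical and are not part of the prompt itself.

```python
# Each entry is (lower bound of band, label), ordered from highest band to lowest.
RELEVANCE_BANDS = [
    (9, "Completely Relevant"),
    (7, "Relevant"),
    (5, "Borderline"),
    (3, "Irrelevant"),
    (1, "Ignore"),
]
NOVELTY_BANDS = [
    (9, "Breakthrough"),
    (7, "Improvements"),
    (5, "Borderline"),
    (3, "Tangential"),
    (1, "Low"),
]

def band_label(score, bands):
    """Return the label of the band containing an integer score from 1 to 10."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: the bands cover 1 through 10")

# Papers 16 and 17 in this digest were both scored Relevance 8, Novelty 7.
relevance_label = band_label(8, RELEVANCE_BANDS)
novelty_label = band_label(7, NOVELTY_BANDS)
```

Keeping the two tables separate reflects the instruction that relevance and novelty are independent axes: the same number can mean "Relevant" on one scale and "Improvements" on the other.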
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
- Representation Learning
  - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
  - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
- Model Architecture
  - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations.
  - Irrelevant: Merely using existing architectures for a certain task without insights into the structure itself.
- Model Compression
  - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
  - Irrelevant: Straightforward applications of existing compression methods to new tasks.
- Large Language Models (LLMs)
  - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability.
  - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thought, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).
- AI for Science
  - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
  - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
- Emerging Trends
  - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
  - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.