Previous Day 2025-08-26
Monthly Overview 2025-08
Next Day 2025-08-28

Personalized Daily ArXiv Papers 2025-08-27

[gpt-4o] Prompt Completion Total
Token 33367 4084 37451
Cost $0.08 $0.04 $0.12

Total arXiv papers: 491

Total scanned papers: 306

Total relevant papers: 26

Table of contents with paper titles:

  1. Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

  2. Energy-Based Flow Matching for Generating 3D Molecular Structure Authors: Wenyin Zhou, Christopher Iliffe Sprague, Vsevolod Viliuga, Matteo Tadiello, Arne Elofsson, Hossein Azizpour

  3. Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units Authors: Chao Hao, Zezheng Wang, Yanhua Huang, Ruiwen Xu, Wenzhe Niu, Xin Liu, Zitong Yu

  4. SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds Authors: Wuxinlin Cheng, Yupeng Cao, Jinwen Wu, Koduvayur Subbalakshmi, Tian Han, Zhuo Feng

  5. Enabling MoE on the Edge via Importance-Driven Expert Scheduling Authors: Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun xiao, Ligeng Chen, Wei Wang

  6. UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao

  7. APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

  8. What do language models model? Transformers, automata, and the format of thought Authors: Colin Klein

  9. Echoes of the past: A unified perspective on fading memory and echo states Authors: Juan-Pablo Ortega, Florian Rossmannek

  10. Understanding Tool-Integrated Reasoning Authors: Heng Lin, Zhongwen Xu

  11. FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge Authors: Gang Hu, Yinglei Teng, Pengfei Wu, Nan Wang

  12. Vectorized Attention with Learnable Encoding for Quantum Transformer Authors: Ziqing Guo, Ziwen Pan, Alex Khan, Jan Balewski

  13. Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu

  14. Principled Detection of Hallucinations in Large Language Models via Multiple Testing Authors: Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli

  15. Information Templates: A New Paradigm for Intelligent Active Feature Acquisition Authors: Hung-Tien Huang, Dzung Dinh, Junier B. Oliva

  16. Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models Authors: Hung Ming Liu

  17. Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance Authors: Ifrah Tariq, Ernest Fraenkel

  18. Distance-informed Neural Processes Authors: Aishwarya Venkataramanan, Joachim Denzler

  19. Sparse minimum Redundancy Maximum Relevance for feature selection Authors: Peter Naylor, Benjamin Poignard, H\'ector Climente-Gonz\'alez, Makoto Yamada

  20. Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning Authors: Yi Liu, Hongji Zhang, Yiwen Wang, Dimitris Tsaras, Lei Chen, Mingxuan Yuan, Qiang Xu

  21. Get Global Guarantees: On the Probabilistic Nature of Perturbation Robustness Authors: Wenchuan Mu, Kwan Hui Lim

  22. Generalization Bound for a General Class of Neural Ordinary Differential Equations Authors: Madhusudan Verma, Manoj Kumar

  23. On the Generalisation of Koopman Representations for Chaotic System Control Authors: Kyriakos Hjikakou (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands), Juan Diego Cardenas Cartagena (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands), Matthia Sabatelli (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands)

  24. Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction Authors: Congchi Yin, Tianyi Wu, Yankai Shu, Alex Gu, Yunhan Wang, Jun Shao, Xun Jiang, Piji Li

  25. Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics Authors: Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee

  26. Can Structured Templates Facilitate LLMs in Tackling Harder Tasks? : An Exploration of Scaling Laws by Difficulty Authors: Zhichao Yang, Zhaoxin Fan, Gen Li, Yuanze Hu, Xinyu Wang, Ye Qiu, Xin Wang, Yifan Sun, Wenjun Wu


1. Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

ArXiv ID: 2508.18672

Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.

Comment: The paper investigates the optimal sparsity of Mixture-of-Experts models for reasoning tasks, which aligns with foundational research in model architecture and sparsity.

Relevance: 10 Novelty: 8


2. Energy-Based Flow Matching for Generating 3D Molecular Structure

ArXiv ID: 2508.18949

Authors: Wenyin Zhou, Christopher Iliffe Sprague, Vsevolod Viliuga, Matteo Tadiello, Arne Elofsson, Hossein Azizpour

Abstract: Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules' constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to \textit{iteratively} map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as the empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein docking as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.

Comment: The paper focuses on foundational research in molecular modeling using an energy-based perspective, which aligns with AI for Science. It introduces a novel flow matching setup with theoretical justifications.

Relevance: 9 Novelty: 8


3. Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units

ArXiv ID: 2508.18763

Authors: Chao Hao, Zezheng Wang, Yanhua Huang, Ruiwen Xu, Wenzhe Niu, Xin Liu, Zitong Yu

Abstract: This paper investigates the enhancement of reasoning capabilities in language models through token-level multi-model collaboration. Our approach selects the optimal tokens from the next token distributions provided by multiple models to perform autoregressive reasoning. Contrary to the assumption that more models yield better results, we introduce a distribution distance-based dynamic selection strategy (DDS) to optimize the multi-model collaboration process. To address the critical challenge of vocabulary misalignment in multi-model collaboration, we propose the concept of minimal complete semantic units (MCSU), which is simple yet enables multiple language models to achieve natural alignment within the linguistic space. Experimental results across various benchmarks demonstrate the superiority of our method. The code will be available at https://github.com/Fanye12/DDS.

Comment: The paper introduces a novel method for multi-model collaboration in language models, focusing on token-level reasoning and vocabulary alignment, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8


4. SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds

ArXiv ID: 2508.18306

Authors: Wuxinlin Cheng, Yupeng Cao, Jinwen Wu, Koduvayur Subbalakshmi, Tian Han, Zhuo Feng

Abstract: Recent strides in pretrained transformer-based language models have propelled state-of-the-art performance in numerous NLP tasks. Yet, as these models grow in size and deployment, their robustness under input perturbations becomes an increasingly urgent question. Existing robustness methods often diverge between small-parameter and large-scale models (LLMs), and they typically rely on labor-intensive, sample-specific adversarial designs. In this paper, we propose a unified, local (sample-level) robustness framework (SALMAN) that evaluates model stability without modifying internal parameters or resorting to complex perturbation heuristics. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample's susceptibility by comparing input-to-output distance mappings in a near-linear complexity manner. By demonstrating significant gains in attack efficiency and robust training, we position our framework as a practical, model-agnostic tool for advancing the reliability of transformer-based NLP systems.

Comment: The paper presents a novel robustness framework for transformer-based language models, focusing on model stability and interpretability, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8


5. Enabling MoE on the Edge via Importance-Driven Expert Scheduling

ArXiv ID: 2508.18983

Authors: Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun xiao, Ligeng Chen, Wei Wang

Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.

Comment: The paper focuses on deploying MoE on edge devices with a novel importance-driven expert scheduling, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


6. UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

ArXiv ID: 2508.18756

Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao

Abstract: While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory access, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under same computation and parameters but significantly low memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

Comment: The paper presents UltraMemV2, a memory-layer architecture that competes with MoE models, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


7. APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

ArXiv ID: 2508.19087

Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs, however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.

Comment: The paper focuses on quantization and efficiency improvements for LLMs, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


8. What do language models model? Transformers, automata, and the format of thought

ArXiv ID: 2508.18598

Authors: Colin Klein

Abstract: What do large language models actually model? Do they tell us something about human capacities, or are they models of the corpus we've trained them on? I give a non-deflationary defence of the latter position. Cognitive science tells us that linguistic capabilities in humans rely supralinear formats for computation. The transformer architecture, by contrast, supports at best a linear formats for processing. This argument will rely primarily on certain invariants of the computational architecture of transformers. I then suggest a positive story about what transformers are doing, focusing on Liu et al. (2022)'s intriguing speculations about shortcut automata. I conclude with why I don't think this is a terribly deflationary story. Language is not (just) a means for expressing inner state but also a kind of 'discourse machine' that lets us make new language given appropriate context. We have learned to use this technology in one way; LLMs have also learned to use it too, but via very different means.

Comment: The paper discusses the theoretical understanding of transformers and LLMs, which aligns with insights into LLM behavior and architecture.

Relevance: 9 Novelty: 8


9. Echoes of the past: A unified perspective on fading memory and echo states

ArXiv ID: 2508.19145

Authors: Juan-Pablo Ortega, Florian Rossmannek

Abstract: Recurrent neural networks (RNNs) have become increasingly popular in information processing tasks involving time series and temporal data. A fundamental property of RNNs is their ability to create reliable input/output responses, often linked to how the network handles its memory of the information it processed. Various notions have been proposed to conceptualize the behavior of memory in RNNs, including steady states, echo states, state forgetting, input forgetting, and fading memory. Although these notions are often used interchangeably, their precise relationships remain unclear. This work aims to unify these notions in a common language, derive new implications and equivalences between them, and provide alternative proofs to some existing results. By clarifying the relationships between these concepts, this research contributes to a deeper understanding of RNNs and their temporal information processing capabilities.

Comment: The paper unifies various notions of memory in RNNs, contributing to a deeper understanding of their temporal information processing capabilities, which is relevant to representation learning.

Relevance: 9 Novelty: 8


10. Understanding Tool-Integrated Reasoning

ArXiv ID: 2508.19201

Authors: Heng Lin, Zhongwen Xu

Abstract: We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally-intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool usage behavior with early code invocation and much more interactive turns with ASPO. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.

Comment: The paper provides a formal proof for the effectiveness of Tool-Integrated Reasoning in LLMs, offering theoretical insights into model capabilities, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8


11. FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge

ArXiv ID: 2508.18663

Authors: Gang Hu, Yinglei Teng, Pengfei Wu, Nan Wang

Abstract: As FMs drive progress toward Artificial General Intelligence (AGI), fine-tuning them under privacy and resource constraints has become increasingly critical particularly when highquality training data resides on distributed edge devices. Federated Learning (FL) offers a compelling solution through Federated Fine-Tuning (FFT), which enables collaborative model adaptation without sharing raw data. Recent approaches incorporate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low Rank Adaptation (LoRA) to reduce computational overhead. However, LoRA-based FFT faces two major limitations in heterogeneous FL environments: structural incompatibility across clients with varying LoRA configurations and limited adaptability to non-IID data distributions, which hinders convergence and generalization. To address these challenges, we propose FFT MoE, a novel FFT framework that replaces LoRA with sparse Mixture of Experts (MoE) adapters. Each client trains a lightweight gating network to selectively activate a personalized subset of experts, enabling fine-grained adaptation to local resource budgets while preserving aggregation compatibility. To further combat the expert load imbalance caused by device and data heterogeneity, we introduce a heterogeneity-aware auxiliary loss that dynamically regularizes the routing distribution to ensure expert diversity and balanced utilization. Extensive experiments spanning both IID and non-IID conditions demonstrate that FFT MoE consistently outperforms state of the art FFT baselines in generalization performance and training efficiency.

Comment: The paper introduces FFT MoE, a novel framework using sparse Mixture of Experts (MoE) for federated fine-tuning, which aligns with the core topic of model architecture and representation learning.

Relevance: 9 Novelty: 8


12. Vectorized Attention with Learnable Encoding for Quantum Transformer

ArXiv ID: 2508.18464

Authors: Ziqing Guo, Ziwen Pan, Alex Khan, Jan Balewski

Abstract: Vectorized quantum block encoding provides a way to embed classical data into Hilbert space, offering a pathway for quantum models, such as Quantum Transformers (QT), that replace classical self-attention with quantum circuit simulations to operate more efficiently. Current QTs rely on deep parameterized quantum circuits (PQCs), rendering them vulnerable to QPU noise, and thus hindering their practical performance. In this paper, we propose the Vectorized Quantum Transformer (VQT), a model that supports ideal masked attention matrix computation through quantum approximation simulation and efficient training via vectorized nonlinear quantum encoder, yielding shot-efficient and gradient-free quantum circuit simulation (QCS) and reduced classical sampling overhead. In addition, we demonstrate an accuracy comparison for IBM and IonQ in quantum circuit simulation and competitive results in benchmarking natural language processing tasks on IBM state-of-the-art and high-fidelity Kingston QPU. Our noise intermediate-scale quantum friendly VQT approach unlocks a novel architecture for end-to-end machine learning in quantum computing.

Comment: The paper introduces a Vectorized Quantum Transformer, which is a novel architecture combining quantum computing with transformer models.

Relevance: 8 Novelty: 9


13. Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

ArXiv ID: 2508.18609

Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu

Abstract: Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.

Comment: The paper provides insights into post-training quantization effects on LLMs, which is relevant to model compression and understanding LLM behavior.

Relevance: 9 Novelty: 7


14. Principled Detection of Hallucinations in Large Language Models via Multiple Testing

ArXiv ID: 2508.18473

Authors: Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli

Abstract: While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels to the problem of out-of-distribution detection in machine learning models. We propose a multiple-testing-inspired method to solve the hallucination detection problem, and provide extensive experimental results to validate the robustness of our approach against state-of-the-art methods.

Comment: The paper addresses hallucination detection in LLMs using a hypothesis testing approach, which provides theoretical insights into LLM behavior, aligning with the criteria for foundational research in LLMs.

Relevance: 9 Novelty: 7


15. Information Templates: A New Paradigm for Intelligent Active Feature Acquisition

ArXiv ID: 2508.18380

Authors: Hung-Tien Huang, Dzung Dinh, Junier B. Oliva

Abstract: Active feature acquisition (AFA) is an instance-adaptive paradigm in which, at test time, a policy sequentially chooses which features to acquire (at a cost) before predicting. Existing approaches either train reinforcement learning (RL) policies, which deal with a difficult MDP, or greedy policies that cannot account for the joint informativeness of features or require knowledge about the underlying data distribution. To overcome this, we propose Template-based AFA (TAFA), a non-greedy framework that learns a small library of feature templates--a set of features that are jointly informative--and uses this library of templates to guide the next feature acquisitions. Through identifying feature templates, the proposed framework not only significantly reduces the action space considered by the policy but also alleviates the need to estimate the underlying data distribution. Extensive experiments on synthetic and real-world datasets show that TAFA outperforms the existing state-of-the-art baselines while achieving lower overall acquisition cost and computation.

Comment: The paper proposes a new paradigm for active feature acquisition using information templates, relevant to representation learning.

Relevance: 8 Novelty: 8


16. Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

ArXiv ID: 2508.18988

Authors: Hung Ming Liu

Abstract: We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model's representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

Comment: The paper presents a framework for developing an AI Mother Tongue, focusing on symbolic reasoning and interpretability, which aligns with representation learning and emerging trends.

Relevance: 8 Novelty: 8


17. Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance

ArXiv ID: 2508.18638

Authors: Ifrah Tariq, Ernest Fraenkel

Abstract: Immune checkpoint inhibitors (ICIs) have transformed cancer treatment, yet patient responses remain highly variable, and the biological mechanisms underlying resistance are poorly understood. While machine learning models hold promise for predicting responses to ICIs, most existing methods lack interpretability and do not effectively leverage the biological structure inherent to multi-omics data. Here, we introduce the Biologically Disentangled Variational Autoencoder (BDVAE), a deep generative model that integrates transcriptomic and genomic data through modality- and pathway-specific encoders. Unlike existing rigid, pathway-informed models, BDVAE employs a modular encoder architecture combined with variational inference to learn biologically meaningful latent features associated with immune, genomic, and metabolic processes. Applied to a pan-cancer cohort of 366 patients across four cancer types treated with ICIs, BDVAE accurately predicts treatment response (AUC-ROC = 0.94 on unseen test data) and uncovers critical resistance mechanisms, including immune suppression, metabolic shifts, and neuronal signaling. Importantly, BDVAE reveals that resistance spans a continuous biological spectrum rather than strictly binary states, reflecting gradations of tumor dysfunction. Several latent features correlate with survival outcomes and known clinical subtypes, demonstrating BDVAE's capability to generate interpretable, clinically relevant insights. These findings underscore the value of biologically structured machine learning in elucidating complex resistance patterns and guiding precision immunotherapy strategies.

Comment: The paper introduces a Biologically Disentangled Variational Autoencoder, which is a novel architecture for integrating multi-omic data, aligning with Model Architecture.

Relevance: 8 Novelty: 7


18. Distance-informed Neural Processes

ArXiv ID: 2508.18903

Authors: Aishwarya Venkataramanan, Joachim Denzler

Abstract: We propose the Distance-informed Neural Process (DNP), a novel variant of Neural Processes that improves uncertainty estimation by combining global and distance-aware local latent structures. Standard Neural Processes (NPs) often rely on a global latent variable and struggle with uncertainty calibration and capturing local data dependencies. DNP addresses these limitations by introducing a global latent variable to model task-level variations and a local latent variable to capture input similarity within a distance-preserving latent space. This is achieved through bi-Lipschitz regularization, which bounds distortions in input relationships and encourages the preservation of relative distances in the latent space. This modeling approach allows DNP to produce better-calibrated uncertainty estimates and more effectively distinguish in- from out-of-distribution data. Empirical results demonstrate that DNP achieves strong predictive performance and improved uncertainty calibration across regression and classification tasks.

Comment: The paper introduces Distance-informed Neural Processes, a novel variant of Neural Processes, which aligns with Model Architecture by improving uncertainty estimation.

Relevance: 8 Novelty: 7


19. Sparse minimum Redundancy Maximum Relevance for feature selection

ArXiv ID: 2508.18901

Authors: Peter Naylor, Benjamin Poignard, H\'ector Climente-Gonz\'alez, Makoto Yamada

Abstract: We propose a feature screening method that integrates both feature-feature and feature-target relationships. Inactive features are identified via a penalized minimum Redundancy Maximum Relevance (mRMR) procedure, which is the continuous version of the classic mRMR penalized by a non-convex regularizer, and where the parameters estimated as zero coefficients represent the set of inactive features. We establish the conditions under which zero coefficients are correctly identified to guarantee accurate recovery of inactive features. We introduce a multi-stage procedure based on the knockoff filter enabling the penalized mRMR to discard inactive features while controlling the false discovery rate (FDR). Our method performs comparably to HSIC-LASSO but is more conservative in the number of selected features. It only requires setting an FDR threshold, rather than specifying the number of features to retain. The effectiveness of the method is illustrated through simulations and real-world datasets. The code to reproduce this work is available on the following GitHub: https://github.com/PeterJackNaylor/SmRMR.

Comment: The paper proposes a sparse feature selection method using a penalized mRMR procedure, which aligns with Model Compression through sparsity and feature selection.

Relevance: 8 Novelty: 7


20. Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning

ArXiv ID: 2508.18730

Authors: Yi Liu, Hongji Zhang, Yiwen Wang, Dimitris Tsaras, Lei Chen, Mingxuan Yuan, Qiang Xu

Abstract: Estimating the quality of register transfer level (RTL) designs is crucial in the electronic design automation (EDA) workflow, as it enables instant feedback on key metrics like area and delay without the need for time-consuming logic synthesis. While recent approaches have leveraged large language models (LLMs) to derive embeddings from RTL code and achieved promising results, they overlook the structural semantics essential for accurate quality estimation. In contrast, the control data flow graph (CDFG) view exposes the design's structural characteristics more explicitly, offering richer cues for representation learning. In this work, we introduce a novel structure-aware graph self-supervised learning framework, StructRTL, for improved RTL design quality estimation. By learning structure-informed representations from CDFGs, our method significantly outperforms prior art on various quality estimation tasks. To further boost performance, we incorporate a knowledge distillation strategy that transfers low-level insights from post-mapping netlists into the CDFG predictor. Experiments show that our approach establishes new state-of-the-art results, demonstrating the effectiveness of combining structural learning with cross-stage supervision.

Comment: The paper introduces a novel graph-based learning framework for RTL quality estimation, focusing on structural representation learning, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


21. Get Global Guarantees: On the Probabilistic Nature of Perturbation Robustness

ArXiv ID: 2508.19183

Authors: Wenchuan Mu, Kwan Hui Lim

Abstract: In safety-critical deep learning applications, robustness measures the ability of neural models that handle imperceptible perturbations in input data, which may lead to potential safety hazards. Existing pre-deployment robustness assessment methods typically suffer from significant trade-offs between computational cost and measurement precision, limiting their practical utility. To address these limitations, this paper conducts a comprehensive comparative analysis of existing robustness definitions and associated assessment methodologies. We propose tower robustness to evaluate robustness, which is a novel, practical metric based on hypothesis testing to quantitatively evaluate probabilistic robustness, enabling more rigorous and efficient pre-deployment assessments. Our extensive comparative evaluation illustrates the advantages and applicability of our proposed approach, thereby advancing the systematic understanding and enhancement of model robustness in safety-critical deep learning applications.

Comment: The paper proposes a new metric for evaluating robustness in neural models, which aligns with emerging trends in theoretical work.

Relevance: 8 Novelty: 7


22. Generalization Bound for a General Class of Neural Ordinary Differential Equations

ArXiv ID: 2508.18920

Authors: Madhusudan Verma, Manoj Kumar

Abstract: Neural ordinary differential equations (neural ODEs) are a popular type of deep learning model that operate with continuous-depth architectures. To assess how well such models perform on unseen data, it is crucial to understand their generalization error bounds. Previous research primarily focused on the linear case for the dynamics function in neural ODEs - Marion, P. (2023), or provided bounds for Neural Controlled ODEs that depend on the sampling interval Bleistein et al. (2023). In this work, we analyze a broader class of neural ODEs where the dynamics function is a general nonlinear function, either time dependent or time independent, and is Lipschitz continuous with respect to the state variables. We showed that under this Lipschitz condition, the solutions to neural ODEs have solutions with bounded variations. Based on this observation, we establish generalization bounds for both time-dependent and time-independent cases and investigate how overparameterization and domain constraints influence these bounds. To our knowledge, this is the first derivation of generalization bounds for neural ODEs with general nonlinear dynamics.

Comment: The paper provides generalization bounds for neural ODEs, which is relevant to understanding model architecture and emerging trends in theoretical work.

Relevance: 8 Novelty: 7


23. On the Generalisation of Koopman Representations for Chaotic System Control

ArXiv ID: 2508.18954

Authors: Kyriakos Hjikakou (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands), Juan Diego Cardenas Cartagena (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands), Matthia Sabatelli (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands)

Abstract: This paper investigates the generalisability of Koopman-based representations for chaotic dynamical systems, focusing on their transferability across prediction and control tasks. Using the Lorenz system as a testbed, we propose a three-stage methodology: learning Koopman embeddings through autoencoding, pre-training a transformer on next-state prediction, and fine-tuning for safety-critical control. Our results show that Koopman embeddings outperform both standard and physics-informed PCA baselines, achieving accurate and data-efficient performance. Notably, fixing the pre-trained transformer weights during fine-tuning leads to no performance degradation, indicating that the learned representations capture reusable dynamical structure rather than task-specific patterns. These findings support the use of Koopman embeddings as a foundation for multi-task learning in physics-informed machine learning. A project page is available at https://kikisprdx.github.io/.

Comment: The paper explores Koopman-based representations and their generalization for chaotic systems, involving autoencoding and transformers, which aligns with representation learning and model architecture analysis.

Relevance: 8 Novelty: 7


24. Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

ArXiv ID: 2508.19035

Authors: Congchi Yin, Tianyi Wu, Yankai Shu, Alex Gu, Yunhan Wang, Jun Shao, Xun Jiang, Piji Li

Abstract: Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans discovery of real world. We introduce a novel evaluation paradigm, \textit{black-box interaction}, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it in given exploration turns, and reasoning over observed input-output pairs. Leveraging this idea, we build the \textsc{Oracle} benchmark which comprises 6 types of black-box task and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70\% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40\%. Further analysis indicates a universal difficulty among LLMs: They lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.

Comment: The paper introduces a novel evaluation paradigm for LLM reasoning, which could provide theoretical insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 7


25. Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics

ArXiv ID: 2508.18736

Authors: Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee

Abstract: Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71$\times$ higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.

Comment: The paper proposes a new semantic caching system for LLM serving, which is relevant to model compression and efficiency improvements.

Relevance: 8 Novelty: 7


26. Can Structured Templates Facilitate LLMs in Tackling Harder Tasks? : An Exploration of Scaling Laws by Difficulty

ArXiv ID: 2508.19069

Authors: Zhichao Yang, Zhaoxin Fan, Gen Li, Yuanze Hu, Xinyu Wang, Ye Qiu, Xin Wang, Yifan Sun, Wenjun Wu

Abstract: Structured, procedural reasoning is essential for Large Language Models (LLMs), especially in mathematics. While post-training methods have improved LLM performance, they still fall short in capturing deep procedural logic on complex tasks. To tackle the issue, in this paper, we first investigate this limitation and uncover a novel finding: a Scaling Law by Difficulty, which reveals that model performance follows a U-shaped curve with respect to training data complexity -- excessive low-difficulty data impedes abstraction, while high-difficulty data significantly enhances reasoning ability. Motivated by this, we propose the Structured Solution Template (SST) framework, which uses solution templates and a curriculum of varied difficulty to explicitly teach procedural reasoning. Specifically, SST comprises (1) fine-tuning with structured solution-template chains and dynamically weighted loss to prioritize procedural logic, (2) prompt-time injection of solution templates as cognitive scaffolds to guide inference, and (3) integrated curriculum fine-tuning that explicitly teaches the model to self-plan - execute - self-correct. Experiments on GSM8K, AIME24, and new Dynamic En benchmark show that SST significantly improves both accuracy and efficiency, especially on harder problems.

Comment: The paper proposes a framework for improving LLMs' procedural reasoning, which provides insights into training dynamics and aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.