Personalized Daily ArXiv Papers 2025-08-27

[gpt-4o]	Prompt	Completion	Total
Token	33367	4084	37451
Cost	$0.08	$0.04	$0.12

Total arXiv papers: 491

Total scanned papers: 306

Total relevant papers: 26

Table of contents with paper titles:

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Energy-Based Flow Matching for Generating 3D Molecular Structure Authors: Wenyin Zhou, Christopher Iliffe Sprague, Vsevolod Viliuga, Matteo Tadiello, Arne Elofsson, Hossein Azizpour
Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units Authors: Chao Hao, Zezheng Wang, Yanhua Huang, Ruiwen Xu, Wenzhe Niu, Xin Liu, Zitong Yu
SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds Authors: Wuxinlin Cheng, Yupeng Cao, Jinwen Wu, Koduvayur Subbalakshmi, Tian Han, Zhuo Feng
Enabling MoE on the Edge via Importance-Driven Expert Scheduling Authors: Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun Xiao, Ligeng Chen, Wei Wang
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao
APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
What do language models model? Transformers, automata, and the format of thought Authors: Colin Klein
Echoes of the past: A unified perspective on fading memory and echo states Authors: Juan-Pablo Ortega, Florian Rossmannek
Understanding Tool-Integrated Reasoning Authors: Heng Lin, Zhongwen Xu
FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge Authors: Gang Hu, Yinglei Teng, Pengfei Wu, Nan Wang
Vectorized Attention with Learnable Encoding for Quantum Transformer Authors: Ziqing Guo, Ziwen Pan, Alex Khan, Jan Balewski
Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu
Principled Detection of Hallucinations in Large Language Models via Multiple Testing Authors: Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli
Information Templates: A New Paradigm for Intelligent Active Feature Acquisition Authors: Hung-Tien Huang, Dzung Dinh, Junier B. Oliva
Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models Authors: Hung Ming Liu
Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance Authors: Ifrah Tariq, Ernest Fraenkel
Distance-informed Neural Processes Authors: Aishwarya Venkataramanan, Joachim Denzler
Sparse minimum Redundancy Maximum Relevance for feature selection Authors: Peter Naylor, Benjamin Poignard, Héctor Climente-González, Makoto Yamada
Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning Authors: Yi Liu, Hongji Zhang, Yiwen Wang, Dimitris Tsaras, Lei Chen, Mingxuan Yuan, Qiang Xu
Get Global Guarantees: On the Probabilistic Nature of Perturbation Robustness Authors: Wenchuan Mu, Kwan Hui Lim
Generalization Bound for a General Class of Neural Ordinary Differential Equations Authors: Madhusudan Verma, Manoj Kumar
On the Generalisation of Koopman Representations for Chaotic System Control Authors: Kyriakos Hjikakou (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands), Juan Diego Cardenas Cartagena (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands), Matthia Sabatelli (University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands)
Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction Authors: Congchi Yin, Tianyi Wu, Yankai Shu, Alex Gu, Yunhan Wang, Jun Shao, Xun Jiang, Piji Li
Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics Authors: Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee
Can Structured Templates Facilitate LLMs in Tackling Harder Tasks? : An Exploration of Scaling Laws by Difficulty Authors: Zhichao Yang, Zhaoxin Fan, Gen Li, Yuanze Hu, Xinyu Wang, Ye Qiu, Xin Wang, Yifan Sun, Wenjun Wu

1. Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

ArXiv ID: 2508.18672

Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.

Comment: The paper investigates the optimal sparsity of Mixture-of-Experts models for reasoning tasks, which aligns with foundational research in model architecture and sparsity.

Relevance: 10 Novelty: 8
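
To make the abstract's distinction between total parameters, active parameters, and top-k routing concrete, here is a minimal Python sketch. The function names, the softmax over only the selected logits, and the parameter counts are illustrative assumptions; none of them come from the paper or its repository.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_param_counts(num_experts, k, params_per_expert, shared_params):
    """Total vs. active (per-token) parameters of a hypothetical MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active

# Hypothetical configuration: 8 experts, 2 active per token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
total, active = moe_param_counts(num_experts=8, k=2,
                                 params_per_expert=1_000_000,
                                 shared_params=500_000)
print(routes, total, active)
```

Under these assumptions, adding experts raises `total` while leaving `active` unchanged, which is the sparsity axis the abstract says is varied at a fixed compute budget.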

2. Energy-Based Flow Matching for Generating 3D Molecular Structure

ArXiv ID: 2508.18949

Authors: Wenyin Zhou, Christopher Iliffe Sprague, Vsevolod Viliuga, Matteo Tadiello, Arne Elofsson, Hossein Azizpour

Abstract: Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules' constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to iteratively map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein docking as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.

Comment: The paper focuses on foundational research in molecular modeling using an energy-based perspective, which aligns with AI for Science. It introduces a novel flow matching setup with theoretical justifications.

Relevance: 9 Novelty: 8

3. Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units

ArXiv ID: 2508.18763

Authors: Chao Hao, Zezheng Wang, Yanhua Huang, Ruiwen Xu, Wenzhe Niu, Xin Liu, Zitong Yu

Abstract: This paper investigates the enhancement of reasoning capabilities in language models through token-level multi-model collaboration. Our approach selects the optimal tokens from the next token distributions provided by multiple models to perform autoregressive reasoning. Contrary to the assumption that more models yield better results, we introduce a distribution distance-based dynamic selection strategy (DDS) to optimize the multi-model collaboration process. To address the critical challenge of vocabulary misalignment in multi-model collaboration, we propose the concept of minimal complete semantic units (MCSU), which is simple yet enables multiple language models to achieve natural alignment within the linguistic space. Experimental results across various benchmarks demonstrate the superiority of our method. The code will be available at https://github.com/Fanye12/DDS.

Comment: The paper introduces a novel method for multi-model collaboration in language models, focusing on token-level reasoning and vocabulary alignment, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8

4. SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds

ArXiv ID: 2508.18306

Authors: Wuxinlin Cheng, Yupeng Cao, Jinwen Wu, Koduvayur Subbalakshmi, Tian Han, Zhuo Feng

Abstract: Recent strides in pretrained transformer-based language models have propelled state-of-the-art performance in numerous NLP tasks. Yet, as these models grow in size and deployment, their robustness under input perturbations becomes an increasingly urgent question. Existing robustness methods often diverge between small-parameter and large-scale models (LLMs), and they typically rely on labor-intensive, sample-specific adversarial designs. In this paper, we propose a unified, local (sample-level) robustness framework (SALMAN) that evaluates model stability without modifying internal parameters or resorting to complex perturbation heuristics. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample's susceptibility by comparing input-to-output distance mappings in a near-linear complexity manner. By demonstrating significant gains in attack efficiency and robust training, we position our framework as a practical, model-agnostic tool for advancing the reliability of transformer-based NLP systems.

Comment: The paper presents a novel, model-agnostic robustness framework for transformer-based language models, focusing on sample-level stability under input perturbations, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8

5. Enabling MoE on the Edge via Importance-Driven Expert Scheduling

ArXiv ID: 2508.18983

Authors: Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun Xiao, Ligeng Chen, Wei Wang

Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.

Comment: The paper focuses on deploying MoE on edge devices with a novel importance-driven expert scheduling, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8

6. UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

ArXiv ID: 2508.18756

Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao

Abstract: While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budget but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

Comment: The paper presents UltraMemV2, a memory-layer architecture that competes with MoE models, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8

7. APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

ArXiv ID: 2508.19087

Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs; however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.

Comment: The paper focuses on quantization and efficiency improvements for LLMs, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
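
The idea of "dismantling and reassembling matrices at the bit level" can be illustrated with a generic bit-plane decomposition: if A = sum_i 2^i A_i and B = sum_j 2^j B_j with binary matrices A_i and B_j, then AB = sum_{i,j} 2^(i+j) A_i B_j. The sketch below assumes plain unsigned integers and pure Python; it does not model the paper's bipolar-INT format, its GPU kernels, or its memory management, and all function names are hypothetical.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` binary matrices, LSB first."""
    return [[[(v >> b) & 1 for v in row] for row in matrix] for b in range(bits)]

def matmul(a, b):
    """Plain integer matrix product; stands in for a 1-bit hardware kernel."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def bitwise_matmul(a, b, a_bits, b_bits):
    """Multiply a and b by combining the products of their bit planes."""
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i, a_plane in enumerate(bit_planes(a, a_bits)):
        for j, b_plane in enumerate(bit_planes(b, b_bits)):
            partial = matmul(a_plane, b_plane)
            for r in range(rows):
                for c in range(cols):
                    # Shift each 1-bit partial product back to its weight 2^(i+j).
                    out[r][c] += partial[r][c] << (i + j)
    return out

A = [[3, 5], [7, 1]]  # every entry fits in 3 bits
B = [[2, 0], [1, 3]]  # every entry fits in 2 bits
result = bitwise_matmul(A, B, a_bits=3, b_bits=2)
print(result)
```

Because the operand widths `a_bits` and `b_bits` are independent, the same routine covers any pair of precisions, at the cost of `a_bits * b_bits` binary products.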

8. What do language models model? Transformers, automata, and the format of thought

ArXiv ID: 2508.18598

Authors: Colin Klein

Abstract: What do large language models actually model? Do they tell us something about human capacities, or are they models of the corpus we've trained them on? I give a non-deflationary defence of the latter position. Cognitive science tells us that linguistic capabilities in humans rely on supralinear formats for computation. The transformer architecture, by contrast, supports at best linear formats for processing. This argument will rely primarily on certain invariants of the computational architecture of transformers. I then suggest a positive story about what transformers are doing, focusing on Liu et al. (2022)'s intriguing speculations about shortcut automata. I conclude with why I don't think this is a terribly deflationary story. Language is not (just) a means for expressing inner state but also a kind of 'discourse machine' that lets us make new language given appropriate context. We have learned to use this technology in one way; LLMs have learned to use it too, but via very different means.

Comment: The paper discusses the theoretical understanding of transformers and LLMs, which aligns with insights into LLM behavior and architecture.

Relevance: 9 Novelty: 8

9. Echoes of the past: A unified perspective on fading memory and echo states

ArXiv ID: 2508.19145

Authors: Juan-Pablo Ortega, Florian Rossmannek

Abstract: Recurrent neural networks (RNNs) have become increasingly popular in information processing tasks involving time series and temporal data. A fundamental property of RNNs is their ability to create reliable input/output responses, often linked to how the network handles its memory of the information it processed. Various notions have been proposed to conceptualize the behavior of memory in RNNs, including steady states, echo states, state forgetting, input forgetting, and fading memory. Although these notions are often used interchangeably, their precise relationships remain unclear. This work aims to unify these notions in a common language, derive new implications and equivalences between them, and provide alternative proofs to some existing results. By clarifying the relationships between these concepts, this research contributes to a deeper understanding of RNNs and their temporal information processing capabilities.

Comment: The paper unifies various notions of memory in RNNs, contributing to a deeper understanding of their temporal information processing capabilities, which is relevant to representation learning.

Relevance: 9 Novelty: 8
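
As a small numerical illustration of the state-forgetting behaviour the abstract refers to, the sketch below drives a one-dimensional recurrent map with the same input sequence from two different initial states. The map x_{t+1} = tanh(w * x_t + u_t), the weight w = 0.5, and the sine input are assumptions chosen for the example; they are not taken from the paper.

```python
import math

def run_reservoir(x0, inputs, w=0.5):
    """Iterate x_{t+1} = tanh(w * x_t + u_t) and return the final state."""
    x = x0
    for u in inputs:
        x = math.tanh(w * x + u)
    return x

# Same input sequence, two very different starting states.
inputs = [math.sin(0.3 * t) for t in range(50)]
gap = abs(run_reservoir(1.0, inputs) - run_reservoir(-1.0, inputs))
print(gap)
```

Since tanh is 1-Lipschitz and |w| < 1, the distance between the two trajectories shrinks by at least a factor of |w| per step, so the final state depends essentially only on the input history.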

10. Understanding Tool-Integrated Reasoning

ArXiv ID: 2508.19201

Authors: Heng Lin, Zhongwen Xu

Abstract: We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally-intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool usage behavior with early code invocation and much more interactive turns with ASPO. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.

Comment: The paper provides a formal proof for the effectiveness of Tool-Integrated Reasoning in LLMs, offering theoretical insights into model capabilities, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8
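
For readers unfamiliar with the pass@k metric mentioned in the abstract, a common way to estimate it from n sampled solutions, c of which are correct, is 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; whether the paper uses exactly this form is an assumption, and the sample counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(p1, p5)
```

Comparing two models at large k probes whether one can solve problems the other essentially never solves, which is the sense in which the abstract speaks of a capability ceiling.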

11. FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge

ArXiv ID: 2508.18663

Authors: Gang Hu, Yinglei Teng, Pengfei Wu, Nan Wang

Abstract: As foundation models (FMs) drive progress toward Artificial General Intelligence (AGI), fine-tuning them under privacy and resource constraints has become increasingly critical, particularly when high-quality training data resides on distributed edge devices. Federated Learning (FL) offers a compelling solution through Federated Fine-Tuning (FFT), which enables collaborative model adaptation without sharing raw data. Recent approaches incorporate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low Rank Adaptation (LoRA) to reduce computational overhead. However, LoRA-based FFT faces two major limitations in heterogeneous FL environments: structural incompatibility across clients with varying LoRA configurations and limited adaptability to non-IID data distributions, which hinders convergence and generalization. To address these challenges, we propose FFT-MoE, a novel FFT framework that replaces LoRA with sparse Mixture of Experts (MoE) adapters. Each client trains a lightweight gating network to selectively activate a personalized subset of experts, enabling fine-grained adaptation to local resource budgets while preserving aggregation compatibility. To further combat the expert load imbalance caused by device and data heterogeneity, we introduce a heterogeneity-aware auxiliary loss that dynamically regularizes the routing distribution to ensure expert diversity and balanced utilization. Extensive experiments spanning both IID and non-IID conditions demonstrate that FFT-MoE consistently outperforms state-of-the-art FFT baselines in generalization performance and training efficiency.

Comment: The paper introduces FFT-MoE, a novel framework using sparse Mixture of Experts (MoE) for federated fine-tuning, which aligns with the core topic of model architecture and representation learning.

Relevance: 9 Novelty: 8

12. Vectorized Attention with Learnable Encoding for Quantum Transformer

ArXiv ID: 2508.18464

Authors: Ziqing Guo, Ziwen Pan, Alex Khan, Jan Balewski

Abstract: Vectorized quantum block encoding provides a way to embed classical data into Hilbert space, offering a pathway for quantum models, such as Quantum Transformers (QT), that replace classical self-attention with quantum circuit simulations to operate more efficiently. Current QTs rely on deep parameterized quantum circuits (PQCs), rendering them vulnerable to QPU noise, and thus hindering their practical performance. In this paper, we propose the Vectorized Quantum Transformer (VQT), a model that supports ideal masked attention matrix computation through quantum approximation simulation and efficient training via a vectorized nonlinear quantum encoder, yielding shot-efficient and gradient-free quantum circuit simulation (QCS) and reduced classical sampling overhead. In addition, we demonstrate an accuracy comparison for IBM and IonQ in quantum circuit simulation and competitive results in benchmarking natural language processing tasks on IBM state-of-the-art and high-fidelity Kingston QPU. Our noisy intermediate-scale quantum (NISQ) friendly VQT approach unlocks a novel architecture for end-to-end machine learning in quantum computing.

Comment: The paper introduces a Vectorized Quantum Transformer, which is a novel architecture combining quantum computing with transformer models.

Relevance: 8 Novelty: 9

13. Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

ArXiv ID: 2508.18609

Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu

Abstract: Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.

Comment: The paper provides insights into post-training quantization effects on LLMs, which is relevant to model compression and understanding LLM behavior.

Relevance: 9 Novelty: 7
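
The PTQ parameters named in the abstract (bit-width and group size) can be made concrete with a generic group-wise round-to-nearest quantizer. This is a sketch under stated assumptions, not the paper's setup: the asymmetric min/max scheme, the 32 bits of per-group metadata, and the definition of `effective_bits` as stored bits per weight are all illustrative choices.

```python
def quantize_group(values, bits):
    """Asymmetric round-to-nearest quantization of one group.
    Returns the dequantized values so the error can be inspected."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]

def quantize(weights, bits, group_size):
    """Quantize consecutive groups of `group_size` weights independently."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size], bits))
    return out

def effective_bits(bits, group_size, overhead_bits=32):
    """Stored bits per weight, assuming `overhead_bits` of metadata per group
    (for example a 16-bit scale plus a 16-bit zero point)."""
    return bits + overhead_bits / group_size

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical weights with two clusters at very different magnitudes.
weights = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
err_small_groups = mse(weights, quantize(weights, bits=2, group_size=4))
err_large_groups = mse(weights, quantize(weights, bits=2, group_size=8))
print(err_small_groups, err_large_groups, effective_bits(4, 128))
```

In this toy case smaller groups track the local range and reduce the error, while `effective_bits` shows the storage cost of the extra per-group metadata.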

14. Principled Detection of Hallucinations in Large Language Models via Multiple Testing

ArXiv ID: 2508.18473

Authors: Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli

Abstract: While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels to the problem of out-of-distribution detection in machine learning models. We propose a multiple-testing-inspired method to solve the hallucination detection problem, and provide extensive experimental results to validate the robustness of our approach against state-of-the-art methods.

Comment: The paper addresses hallucination detection in LLMs using a hypothesis testing approach, which provides theoretical insights into LLM behavior, aligning with the criteria for foundational research in LLMs.

Relevance: 9 Novelty: 7
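
One generic way to frame detection as multiple hypothesis testing is to convert several detector scores into conformal p-values against calibration data from known-good responses, then apply a Bonferroni correction. The abstract does not say which scores or which correction the authors use, so every name and choice below is an assumption made for illustration.

```python
def conformal_p_value(score, calibration_scores):
    """Share of calibration scores at least as large as `score`.
    Higher scores are assumed to be more suspicious."""
    at_least = sum(1 for s in calibration_scores if s >= score)
    return (at_least + 1) / (len(calibration_scores) + 1)

def flag_hallucination(scores, calibration, alpha=0.1):
    """Reject the null 'response is not a hallucination' if any detector's
    p-value falls below the Bonferroni-corrected level alpha / m."""
    p_values = [conformal_p_value(s, cal) for s, cal in zip(scores, calibration)]
    return min(p_values) <= alpha / len(p_values), p_values

# Two hypothetical detectors, each calibrated on 39 known-good responses.
calibration = [list(range(39)), list(range(39))]
suspicious, p_suspicious = flag_hallucination([100, 20], calibration)
benign, p_benign = flag_hallucination([10, 5], calibration)
print(suspicious, p_suspicious, benign, p_benign)
```

The correction keeps the overall false-alarm rate at or below `alpha` no matter how the detectors' scores depend on each other, which is the main appeal of a testing-based formulation.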

15. Information Templates: A New Paradigm for Intelligent Active Feature Acquisition

ArXiv ID: 2508.18380

Authors: Hung-Tien Huang, Dzung Dinh, Junier B. Oliva

Abstract: Active feature acquisition (AFA) is an instance-adaptive paradigm in which, at test time, a policy sequentially chooses which features to acquire (at a cost) before predicting. Existing approaches either train reinforcement learning (RL) policies, which deal with a difficult MDP, or greedy policies that cannot account for the joint informativeness of features or require knowledge about the underlying data distribution. To overcome this, we propose Template-based AFA (TAFA), a non-greedy framework that learns a small library of feature templates--a set of features that are jointly informative--and uses this library of templates to guide the next feature acquisitions. Through identifying feature templates, the proposed framework not only significantly reduces the action space considered by the policy but also alleviates the need to estimate the underlying data distribution. Extensive experiments on synthetic and real-world datasets show that TAFA outperforms the existing state-of-the-art baselines while achieving lower overall acquisition cost and computation.

Comment: The paper proposes a new paradigm for active feature acquisition using information templates, relevant to representation learning.

Relevance: 8 Novelty: 8

16. Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

ArXiv ID: 2508.18988

Authors: Hung Ming Liu

Abstract: We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model's representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

Comment: The paper presents a framework for developing an AI Mother Tongue, focusing on symbolic reasoning and interpretability, which aligns with representation learning and emerging trends.

Relevance: 8 Novelty: 8

17. Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance

ArXiv ID: 2508.18638

Authors: Ifrah Tariq, Ernest Fraenkel

Abstract: Immune checkpoint inhibitors (ICIs) have transformed cancer treatment, yet patient responses remain highly variable, and the biological mechanisms underlying resistance are poorly understood. While machine learning models hold promise for predicting responses to ICIs, most existing methods lack interpretability and do not effectively leverage the biological structure inherent to multi-omics data. Here, we introduce the Biologically Disentangled Variational Autoencoder (BDVAE), a deep generative model that integrates transcriptomic and genomic data through modality- and pathway-specific encoders. Unlike existing rigid, pathway-informed models, BDVAE employs a modular encoder architecture combined with variational inference to learn biologically meaningful latent features associated with immune, genomic, and metabolic processes. Applied to a pan-cancer cohort of 366 patients across four cancer types treated with ICIs, BDVAE accurately predicts treatment response (AUC-ROC = 0.94 on unseen test data) and uncovers critical resistance mechanisms, including immune suppression, metabolic shifts, and neuronal signaling. Importantly, BDVAE reveals that resistance spans a continuous biological spectrum rather than strictly binary states, reflecting gradations of tumor dysfunction. Several latent features correlate with survival outcomes and known clinical subtypes, demonstrating BDVAE's capability to generate interpretable, clinically relevant insights. These findings underscore the value of biologically structured machine learning in elucidating complex resistance patterns and guiding precision immunotherapy strategies.

Comment: The paper introduces a Biologically Disentangled Variational Autoencoder, which is a novel architecture for integrating multi-omic data, aligning with Model Architecture.