Personalized Daily ArXiv Papers 2025-09-09

[gpt-5]	Prompt	Completion	Total
Token	60727	58670	119397
Cost	$0.08	$0.59	$0.66
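
The two rounded cost columns add to $0.67, one cent more than the reported total of $0.66. A likely explanation is that the total is rounded after summing the unrounded costs. The sketch below checks this under assumed per-million-token rates of $1.25 (prompt) and $10.00 (completion); the digest does not state the rates, and these values are chosen only because they reproduce the table.

```python
# Assumed per-million-token rates in USD (NOT stated in the digest).
PROMPT_RATE = 1.25
COMPLETION_RATE = 10.00

# Token counts copied from the table above.
prompt_tokens = 60727
completion_tokens = 58670

prompt_cost = prompt_tokens * PROMPT_RATE / 1_000_000
completion_cost = completion_tokens * COMPLETION_RATE / 1_000_000
total_cost = prompt_cost + completion_cost  # summed before rounding

print(f"{prompt_cost:.2f} {completion_cost:.2f} {total_cost:.2f}")
```

Under these assumptions each column rounds to the value shown in the table, and the total is rounded only once at the end.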

Total arXiv papers: 788

Total scanned papers: 461

Total relevant papers: 31

Table of contents with paper titles:

Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng
Learning words in groups: fusion algebras, tensor ranks and grokking Authors: Maor Shutman, Oren Louidor, Ran Tessler
Universality of physical neural networks with multivariate nonlinearity Authors: Benjamin Savinson, David J. Norris, Siddhartha Mishra, Samuel Lanthaler
LoaQ: Layer-wise Output Approximation Quantization Authors: Li Lin, Xiaojun Wan
Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix Authors: Mehmet Can Yavuz, Berrin Yanikoglu
From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok
Robust and Adaptive Spectral Method for Representation Multi-Task Learning with Contamination Authors: Yian Huang, Yang Feng, Zhiliang Ying
Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation Authors: Victor Toscano-Duran, Rocio Gonzalez-Diaz, Miguel A. Gutiérrez-Naranjo
TreeGPT: A Novel Hybrid Architecture for Abstract Syntax Tree Processing with Global Parent-Child Aggregation Authors: Zixi Li
HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models Authors: Xin Tong, Zhi Lin, Jingya Wang, Bo Jin
FAVAE-Effective Frequency Aware Latent Tokenizer Authors: Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper
Learning from one graph: transductive learning guarantees via the geometry of small random worlds Authors: Nils Detering, Luca Galimberti, Anastasis Kratsios, Giulia Livieri, A. Martina Neuman
MOSAIC: Minimax-Optimal Sparsity-Adaptive Inference for Change Points in Dynamic Networks Authors: Yingying Fan, Jingyuan Liu, Jinchi Lv, Ao Sun
Dato: A Task-Based Programming Model for Dataflow Accelerators Authors: Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang
ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders Authors: Xiangyu Liu, Haodi Lei, Yi Liu, Yang Liu, Wei Hu
An Improved Template for Approximate Computing Authors: M. Rezaalipour, F. Costa, M. Biasion, R. Otoni, G. A. Constantinides, L. Pozzi
SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning Authors: Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, Guohao Dai
PAC-Bayesian Generalization Bounds for Graph Convolutional Networks on Inductive Node Classification Authors: Huayi Tang, Yong Liu
Long-Range Graph Wavelet Networks Authors: Filippo Guerranti, Fabrizio Forte, Simon Geisler, Stephan Günnemann
FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving Authors: Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee
time2time: Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models Authors: Debdeep Sanyal, Aaryan Nagpal, Dhruv Kumar, Murari Mandal, Saurabh Deshpande
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL Authors: Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, Honggang Zhang
From Long to Short: LLMs Excel at Trimming Own Reasoning Chains Authors: Wei Han, Geng Zhan, Sicheng Yu, Chenyu Wang, Bryan Hooi
Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks Authors: Su Hyeong Lee, Risi Kondor, Richard Ngo
H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers Authors: Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe
Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics Authors: Jiajun Bao, Nicolas Boullé, Toni J. B. Liu, Raphaël Sarfati, Christopher J. Earls
Learning spatially structured open quantum dynamics with regional-attention transformers Authors: Dounan Du, Eden Figueroa
Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation Authors: Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu
Nonnegative matrix factorization and the principle of the common cause Authors: E. Khalafyan, A. E. Allahverdyan, A. Hovhannisyan
On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data Authors: Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin

1. Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

ArXiv ID: 2509.06346

Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng

Abstract: Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts (a small group with outsized impact on performance), leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.

Comment: Model Architecture and Efficiency (MoE): post-training routing strategy that reinforces key experts and dynamically prunes redundant experts to improve accuracy and speed without retraining.

Relevance: 10 Novelty: 8
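
To make the redundancy point in the Ban&Pick abstract concrete, here is a minimal sketch of top-k routing followed by a threshold-based pruning step, so that the number of active experts varies per token. The pruning rule and the `keep_ratio` parameter are hypothetical illustrations; the abstract does not specify the paper's actual sensitivity criterion.

```python
import math


def route(logits, k, keep_ratio):
    """Top-k routing followed by a HYPOTHETICAL pruning rule.

    keep_ratio is an assumed knob (not from the paper): an expert in the
    top-k survives only if its gate weight is at least keep_ratio times
    the largest gate weight for this token.
    Returns {expert_index: renormalized gate weight}.
    """
    # Standard top-k selection over router logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (shifted for stability).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    gates = {i: e / z for i, e in exps.items()}
    # Drop experts whose weight is small relative to the strongest one.
    best = max(gates.values())
    kept = {i: g for i, g in gates.items() if g >= keep_ratio * best}
    # Renormalize so the surviving weights sum to 1.
    z_kept = sum(kept.values())
    return {i: g / z_kept for i, g in kept.items()}


token_logits = [2.0, 0.1, 1.9, -3.0, 0.0]
print(route(token_logits, k=3, keep_ratio=0.5))  # two experts survive
print(route(token_logits, k=3, keep_ratio=0.0))  # plain top-3 routing
```

With `keep_ratio=0.0` the function reduces to ordinary fixed top-k routing, which makes the effect of the pruning step easy to compare.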

2. Learning words in groups: fusion algebras, tensor ranks and grokking

ArXiv ID: 2509.06931

Authors: Maor Shutman, Oren Louidor, Ran Tessler

Abstract: In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available and exhibits grokking while doing so. To explain the mechanism by which this is achieved, we reframe the problem as that of learning a particular $3$-tensor, which we show is typically of low rank. A key insight is that low-rank implementations of this tensor can be obtained by decomposing it along triplets of basic self-conjugate representations of the group and leveraging the fusion structure to rule out many components. Focusing on a phenomenologically similar but more tractable surrogate model, we show that the network is able to find such low-rank implementations (or approximations thereof), thereby using limited width to approximate the word-tensor in a generalizable way. In the case of the simple multiplication word, we further elucidate the form of these low-rank implementations, showing that the network effectively implements efficient matrix multiplication in the sense of Strassen. Our work also sheds light on the mechanism by which a network reaches such a solution under gradient descent.

Comment: Representation Learning/Training Dynamics: explains learning of group word operations via low-rank tensor decompositions and links to grokking.

Relevance: 10 Novelty: 8
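
The "3-tensor" in the abstract of paper 2 can be pictured with a small example. The sketch below assumes one natural encoding (an indicator tensor over one-hot group elements) and uses the cyclic group Z_n with the multiplication word w(a, b) = a * b; the paper's exact construction may differ.

```python
def word_tensor(n):
    """Indicator 3-tensor T[a][b][c] = 1 iff a * b = c in the cyclic group Z_n.

    Z_n is written additively, so the group operation is (a + b) mod n.
    """
    return [[[1 if (a + b) % n == c else 0 for c in range(n)]
             for b in range(n)]
            for a in range(n)]


def evaluate(T, a, b):
    """Contract T with one-hot encodings of a and b and decode the result."""
    fibre = T[a][b]          # contracting with one-hot vectors selects a fibre
    return fibre.index(1)    # the single nonzero entry marks the product


T = word_tensor(5)
print(evaluate(T, 3, 4))  # (3 + 4) mod 5
```

Each fibre `T[a][b]` has exactly one nonzero entry, so the tensor is very sparse; the abstract's claim is the stronger statement that such tensors typically also have low rank.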

3. Universality of physical neural networks with multivariate nonlinearity

ArXiv ID: 2509.05420

Authors: Benjamin Savinson, David J. Norris, Siddhartha Mishra, Samuel Lanthaler

Abstract: The enormous energy demand of artificial intelligence is driving the development of alternative hardware for deep learning. Physical neural networks try to exploit physical systems to perform machine learning more efficiently. In particular, optical systems can calculate with light using negligible energy. While their computational capabilities were long limited by the linearity of optical materials, nonlinear computations have recently been demonstrated through modified input encoding. Despite this breakthrough, our inability to determine if physical neural networks can learn arbitrary relationships between data -- a key requirement for deep learning known as universality -- hinders further progress. Here we present a fundamental theorem that establishes a universality condition for physical neural networks. It provides a powerful mathematical criterion that imposes device constraints, detailing how inputs should be encoded in the tunable parameters of the physical system. Based on this result, we propose a scalable architecture using free-space optics that is provably universal and achieves high accuracy on image classification tasks. Further, by combining the theorem with temporal multiplexing, we present a route to potentially huge effective system sizes in highly practical but poorly scalable on-chip photonic devices. Our theorem and scaling methods apply beyond optical systems and inform the design of a wide class of universal, energy-efficient physical neural networks, justifying further efforts in their development.

Comment: Model Architecture and Representation Learning: proves universality conditions for physical neural networks and proposes a scalable optical architecture.

Relevance: 9 Novelty: 9

4. LoaQ: Layer-wise Output Approximation Quantization

ArXiv ID: 2509.06297

Authors: Li Lin, Xiaojun Wan

Abstract: A natural and intuitive idea in model quantization is to approximate each component's quantized output to match its original. Layer-wise post-training quantization (PTQ), though based on this idea, adopts a strictly local view and can achieve, at best, only activation-aware approximations of weights. As a result, it often leads to insufficient approximations and practical deviations from this guiding intuition. Recent work has achieved a more accurate approximation of linear-layer outputs within the framework of layer-wise PTQ, but such refinements remain inadequate for achieving alignment with the full model output. Based on a deeper understanding of the structural characteristics of mainstream LLMs, we propose $LoaQ$, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency. It better aligns with this intuition and can feature a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation joint quantization. By integrating seamlessly with existing quantization strategies, it further enhances overall quantization quality and shows strong potential to advance the frontier of post-training quantization.

Comment: Model Compression and Efficiency: a layer-wise PTQ method targeting output-level consistency with a simple closed-form solution, improving quantization quality.

Relevance: 10 Novelty: 7

5. Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix

ArXiv ID: 2509.06314

Authors: Mehmet Can Yavuz, Berrin Yanikoglu

Abstract: A central challenge in representation learning is constructing latent embeddings that are both expressive and efficient. In practice, deep networks often produce redundant latent spaces where multiple coordinates encode overlapping information, reducing effective capacity and hindering generalization. Standard metrics such as accuracy or reconstruction loss provide only indirect evidence of such redundancy and cannot isolate it as a failure mode. We introduce a redundancy index, denoted rho(C), that directly quantifies inter-dimensional dependencies by analyzing coupling matrices derived from latent representations and comparing their off-diagonal statistics against a normal distribution via energy distance. The result is a compact, interpretable, and statistically grounded measure of representational quality. We validate rho(C) across discriminative and generative settings on MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100, spanning multiple architectures and hyperparameter optimization strategies. Empirically, low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. Estimator reliability grows with latent dimension, yielding natural lower bounds for reliable analysis. We further show that Tree-structured Parzen Estimators (TPE) preferentially explore low-rho regions, suggesting that rho(C) can guide neural architecture search and serve as a redundancy-aware regularization target. By exposing redundancy as a universal bottleneck across models and tasks, rho(C) offers both a theoretical lens and a practical tool for evaluating and improving the efficiency of learned representations.

Comment: Representation Learning: proposes a redundancy index rho(C) using coupling-matrix off-diagonal statistics (energy distance) to quantify inter-dimensional dependencies in latent spaces.

Relevance: 9 Novelty: 8

6. From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

ArXiv ID: 2509.06938

Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok

Abstract: As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes now poses an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model's hallucination risk.

Comment: Representation Learning: uses sparse autoencoders to analyze transformer internal concept activations and link them to hallucination behavior under input uncertainty.

Relevance: 9 Novelty: 8

7. Robust and Adaptive Spectral Method for Representation Multi-Task Learning with Contamination

ArXiv ID: 2509.06575

Authors: Yian Huang, Yang Feng, Zhiliang Ying

Abstract: Representation-based multi-task learning (MTL) improves efficiency by learning a shared structure across tasks, but its practical application is often hindered by contamination, outliers, or adversarial tasks. Most existing methods and theories assume a clean or near-clean setting, failing when contamination is significant. This paper tackles representation MTL with an unknown and potentially large contamination proportion, while also allowing for heterogeneity among inlier tasks. We introduce a Robust and Adaptive Spectral method (RAS) that can distill the shared inlier representation effectively and efficiently, while requiring no prior knowledge of the contamination level or the true representation dimension. Theoretically, we provide non-asymptotic error bounds for both the learned representation and the per-task parameters. These bounds adapt to inlier task similarity and outlier structure, and guarantee that RAS performs at least as well as single-task learning, thus preventing negative transfer. We also extend our framework to transfer learning with corresponding theoretical guarantees for the target task. Extensive experiments confirm our theory, showcasing the robustness and adaptivity of RAS, and its superior performance in regimes with up to 80% task contamination.

Comment: Representation Learning: robust, adaptive spectral method to extract shared representations in contaminated multi-task settings with non-asymptotic guarantees.

Relevance: 9 Novelty: 8

8. Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation

ArXiv ID: 2509.06694

Authors: Victor Toscano-Duran, Rocio Gonzalez-Diaz, Miguel A. Gutiérrez-Naranjo

Abstract: While it is well-established that artificial neural networks are universal approximators for continuous functions on compact domains, many modern approaches rely on deep or overparameterized architectures that incur high computational costs. In this paper, a new type of small shallow neural network, called the Barycentric Neural Network (BNN), is proposed, which leverages a fixed set of base points and their barycentric coordinates to define both its structure and its parameters. We demonstrate that our BNN enables the exact representation of continuous piecewise linear functions (CPLFs), ensuring strict continuity across segments. Since any continuous function over a compact domain can be approximated arbitrarily well by CPLFs, the BNN naturally emerges as a flexible and interpretable tool for function approximation. Beyond the use of this representation, the main contribution of the paper is the introduction of a new variant of persistent entropy, a topological feature that is stable and scale invariant, called the length-weighted persistent entropy (LWPE), which is weighted by the lifetime of topological features. Our framework, which combines the BNN with a loss function based on our LWPE, aims to provide flexible and geometrically interpretable approximations of nonlinear continuous functions in resource-constrained settings, such as those with limited base points for BNN design and few training epochs. Instead of optimizing internal weights, our approach directly optimizes the base points that define the BNN. Experimental results show that our approach achieves superior and faster approximation performance compared to classical loss functions such as MSE, RMSE, MAE, and log-cosh.

Comment: Model Architecture: introduces a small, shallow Barycentric Neural Network that exactly represents CPLFs and optimizes base points; also proposes a topological loss (length-weighted persistent entropy).

Relevance: 9 Novelty: 8
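
As a reading aid for paper 8, the sketch below evaluates a one-dimensional continuous piecewise linear function directly from its base points using barycentric coordinates. It illustrates the function class the abstract refers to, not the paper's network construction or its LWPE loss; the function name `cplf` and the 1D setting are assumptions made for illustration.

```python
from bisect import bisect_right


def cplf(base_points, x):
    """Evaluate the continuous piecewise linear function through base_points.

    base_points: list of (x_i, y_i) pairs sorted by x_i, with distinct x_i.
    Inside a segment [x_i, x_{i+1}], the point x has barycentric
    coordinates (1 - t, t) with t = (x - x_i) / (x_{i+1} - x_i), and the
    function value is the same convex combination of y_i and y_{i+1}.
    """
    xs = [p[0] for p in base_points]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the domain spanned by the base points")
    # Index of the segment containing x (the right endpoint maps to the last one).
    i = min(bisect_right(xs, x) - 1, len(xs) - 2)
    (x0, y0), (x1, y1) = base_points[i], base_points[i + 1]
    t = (x - x0) / (x1 - x0)
    return (1 - t) * y0 + t * y1


points = [(0.0, 0.0), (1.0, 2.0), (3.0, 0.0)]
print([cplf(points, x) for x in (0.0, 0.5, 1.0, 2.0, 3.0)])
```

Because the function is fully determined by the base points, moving a base point changes the approximation directly, which is the quantity the abstract says is optimized.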

9. TreeGPT: A Novel Hybrid Architecture for Abstract Syntax Tree Processing with Global Parent-Child Aggregation

ArXiv ID: 2509.05550

Authors: Zixi Li

Abstract: We introduce TreeGPT, a novel neural architecture that combines transformer-based attention mechanisms with global parent-child aggregation for processing Abstract Syntax Trees (ASTs) in neural program synthesis tasks. Unlike traditional approaches that rely solely on sequential processing or graph neural networks, TreeGPT employs a hybrid design that leverages both self-attention for capturing local dependencies and a specialized Tree Feed-Forward Network (TreeFFN) for modeling hierarchical tree structures through iterative message passing. The core innovation lies in our Global Parent-Child Aggregation mechanism, formalized as: $$h_i^{(t+1)} = \sigma \Big( h_i^{(0)} + W_{pc} \sum_{(p,c) \in E_i} f(h_p^{(t)}, h_c^{(t)}) + b \Big)$$ where $h_i^{(t)}$ represents the hidden state of node $i$ at iteration $t$, $E_i$ denotes all parent-child edges involving node $i$, and $f(h_p, h_c)$ is an edge aggregation function. This formulation enables each node to progressively aggregate information from the entire tree structure through $T$ iterations. Our architecture integrates optional enhancements including gated aggregation with learnable edge weights, residual connections for gradient stability, and bidirectional propagation for capturing both bottom-up and top-down dependencies. We evaluate TreeGPT on the ARC Prize 2025 dataset, a challenging visual reasoning benchmark requiring abstract pattern recognition and rule inference. Experimental results demonstrate that TreeGPT achieves 96% accuracy, significantly outperforming transformer baselines (1.3%), large-scale models like Grok-4 (15.9%), and specialized program synthesis methods like SOAR (52%) while using only 1.5M parameters. Our comprehensive ablation study reveals that edge projection is the most critical component, with the combination of edge projection and gating achieving optimal performance.

Comment: Model Architecture: introduces a hybrid Transformer + TreeFFN with global parent–child aggregation for AST processing (conditional/dynamic structure-aware network).

Relevance: 9 Novelty: 7
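
The aggregation formula in the TreeGPT abstract can be traced on a toy tree. The sketch below is a simplified reading of that update, with several assumptions that are not in the abstract: scalar hidden states, scalar `w_pc` and `b`, `tanh` as the nonlinearity, and `f(h_p, h_c) = h_p + h_c` as the edge aggregation function.

```python
import math


def edge_fn(h_parent, h_child):
    """ASSUMED edge aggregation function f; the abstract leaves f unspecified."""
    return h_parent + h_child


def aggregate(h0, edges, w_pc, b, steps):
    """Iterate h_i <- tanh(h0_i + w_pc * sum_{(p,c) in E_i} f(h_p, h_c) + b).

    h0:    initial scalar hidden state per node (scalars are a simplification)
    edges: list of (parent, child) index pairs; E_i is every edge touching i
    """
    h = list(h0)
    for _ in range(steps):
        msg = [0.0] * len(h)
        for p, c in edges:
            m = edge_fn(h[p], h[c])
            msg[p] += m  # the edge (p, c) involves the parent ...
            msg[c] += m  # ... and the child
        # Every node is updated from the states of the previous iteration.
        h = [math.tanh(h0[i] + w_pc * msg[i] + b) for i in range(len(h))]
    return h


# Root 0 with children 1 and 2; only node 1 starts with a nonzero state.
edges = [(0, 1), (0, 2)]
h0 = [0.0, 1.0, 0.0]
print(aggregate(h0, edges, w_pc=0.5, b=0.0, steps=1))
print(aggregate(h0, edges, w_pc=0.5, b=0.0, steps=2))
```

After one step node 2 is still at zero, because its only edge carries no signal yet; after two steps the signal from node 1 has reached it through the root, which is the progressive aggregation over iterations that the abstract describes.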

10. HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models

ArXiv ID: 2509.06596

Authors: Xin Tong, Zhi Lin, Jingya Wang, Bo Jin

Abstract: Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token's true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighting of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.

Comment: Model Architecture/Efficiency: parameter-free, instance-level head gating and value calibration at decoding, i.e., dynamic attention head weighting to mitigate hallucinations.

Relevance: 9 Novelty: 7
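
The value-calibration idea in the HAVE abstract (attention weights alone can misstate a token's contribution) can be illustrated for a single head. The multiplicative form, the L2 norm, and the renormalization below are assumptions made for this sketch; the abstract only says that attention is augmented with the magnitude of value vectors.

```python
import math


def calibrated_scores(attn, values):
    """Rescale one head's attention weights by value-vector magnitudes.

    attn:   attention weights over tokens for a single head
    values: one value vector (list of floats) per token
    ASSUMED form: score_i is proportional to attn_i * ||v_i||_2.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    raw = [a * n for a, n in zip(attn, norms)]
    total = sum(raw)
    if total == 0.0:
        return [0.0] * len(raw)  # no token writes anything back
    return [r / total for r in raw]


# Token 0 gets most of the attention but has a small value vector.
attn = [0.8, 0.2]
values = [[1.0, 0.0], [0.0, 4.0]]
print(calibrated_scores(attn, values))
```

In this example the raw attention favors the first token four to one, but after calibration both tokens receive equal weight because the second token's value vector is four times larger.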

11. FAVAE-Effective Frequency Aware Latent Tokenizer

ArXiv ID: 2509.05441

Authors: Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper

Abstract: Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, the reconstructed images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information, when jointly optimized, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image representation, with broader implications for applications in content creation, neural rendering, and medical imaging.

Comment: Model Architecture and Representation Learning: frequency-aware VAE tokenizer with wavelet-based decoupling of low/high frequencies to improve latent representations and high-frequency reconstruction.

Relevance: 9 Novelty: 7
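
For readers unfamiliar with the wavelet split behind paper 11, the sketch below performs a one-level Haar decomposition of a 1D signal into a low-frequency and a high-frequency band and then reconstructs it. The choice of the Haar wavelet and the 1D setting are assumptions for illustration; the abstract states only that the framework is wavelet-based and operates on images.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_split(x):
    """One-level orthonormal Haar transform of an even-length signal.

    Returns (low, high): scaled pairwise sums (smooth content) and scaled
    pairwise differences (sharp transitions).
    """
    if len(x) % 2 != 0:
        raise ValueError("signal length must be even")
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high


def haar_merge(low, high):
    """Inverse of haar_split: rebuild the signal from its two bands."""
    out = []
    for lo, hi in zip(low, high):
        out.extend([(lo + hi) / SQRT2, (lo - hi) / SQRT2])
    return out


signal = [1.0, 1.0, 4.0, 0.0]
low, high = haar_split(signal)
print(low, high)
print(haar_merge(low, high))
```

The flat pair `(1.0, 1.0)` contributes nothing to the high band, while the sharp jump `(4.0, 0.0)` does. Once the bands are separated like this, a loss can weight them independently, which is the kind of decoupling the abstract describes.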

12. Learning from one graph: transductive learning guarantees via the geometry of small random worlds

ArXiv ID: 2509.06894

Authors: Nils Detering, Luca Galimberti, Anastasis Kratsios, Giulia Livieri, A. Martina Neuman

Abstract: Since their introduction by Kipf and Welling in 2017, a primary use of graph convolutional networks is transductive node classification, where missing labels are inferred within a single observed graph and its feature matrix. Despite the widespread use of the network model, the statistical foundations of transductive learning remain limited, as standard inference frameworks typically rely on multiple independent samples rather than a single graph. In this work, we address these gaps by developing new concentration-of-measure tools that leverage the geometric regularities of large graphs via low-dimensional metric embeddings. The emergent regularities are captured using a random graph model; however, the methods remain applicable to deterministic graphs once observed. We establish two principal learning results. The first concerns arbitrary deterministic $k$-vertex graphs, and the second addresses random graphs that share key geometric properties with an Erdős-Rényi graph $\mathbf{G}=\mathbf{G}(k,p)$ in the regime $p \in \mathcal{O}((\log (k)/k)^{1/2})$. The first result serves as the basis for and illuminates the second. We then extend these results to the graph convolutional network setting, where additional challenges arise. Lastly, our learning guarantees remain informative even with a few labelled nodes $N$ and achieve the optimal nonparametric rate $\mathcal{O}(N^{-1/2})$ as $N$ grows.

Comment: Theoretical Foundations: transductive learning guarantees via geometric concentration and extensions to GCNs, providing rates under single-graph settings.

Relevance: 8 Novelty: 8

13. MOSAIC: Minimax-Optimal Sparsity-Adaptive Inference for Change Points in Dynamic Networks

ArXiv ID: 2509.06303

Authors: Yingying Fan, Jingyuan Liu, Jinchi Lv, Ao Sun

Abstract: We propose a new inference framework, named MOSAIC, for change-point detection in dynamic networks with the simultaneous low-rank and sparse-change structure. We establish the minimax rate of detection boundary, which relies on the sparsity of changes. We then develop an eigen-decomposition-based test with screened signals that approaches the minimax rate in theory, with only a minor logarithmic loss. For practical implementation of MOSAIC, we adjust the theoretical test by a novel residual-based technique, resulting in a pivotal statistic that converges to a standard normal distribution via the martingale central limit theorem under the null hypothesis and achieves full power under the alternative hypothesis. We also analyze the minimax rate of testing boundary for dynamic networks without the low-rank structure, which almost aligns with the results in high-dimensional mean-vector change-point inference. We showcase the effectiveness of MOSAIC and verify our theoretical results with several simulation examples and a real data application.

Comment: Sparsity/Low-rank Theory: minimax detection/testing boundaries and near-optimal eigen-based tests for change points in dynamic networks with sparse changes and low-rank structure.

Relevance: 8 Novelty: 8

14. Dato: A Task-Based Programming Model for Dataflow Accelerators

ArXiv ID: 2509.06794

Authors: Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang

Abstract: Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator's spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.

Comment: High Performance Computing: a task-based programming model with first-class stream/sharding types and spatial mapping compiler for dataflow accelerators enabling high utilization.