Previous Day 2025-09-08
Monthly Overview 2025-09
Next Day 2025-09-10

Personalized Daily ArXiv Papers 2025-09-09

[gpt-5] Prompt Completion Total
Token 60727 58670 119397
Cost $0.08 $0.59 $0.66

Total arXiv papers: 788

Total scanned papers: 461

Total relevant papers: 31

Table of contents with paper titles:

  1. Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng

  2. Learning words in groups: fusion algebras, tensor ranks and grokking Authors: Maor Shutman, Oren Louidor, Ran Tessler

  3. Universality of physical neural networks with multivariate nonlinearity Authors: Benjamin Savinson, David J. Norris, Siddhartha Mishra, Samuel Lanthaler

  4. LoaQ: Layer-wise Output Approximation Quantization Authors: Li Lin, Xiaojun Wan

  5. Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix Authors: Mehmet Can Yavuz, Berrin Yanikoglu

  6. From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok

  7. Robust and Adaptive Spectral Method for Representation Multi-Task Learning with Contamination Authors: Yian Huang, Yang Feng, Zhiliang Ying

  8. Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation Authors: Victor Toscano-Duran, Rocio Gonzalez-Diaz, Miguel A. Guti\'errez-Naranjo

  9. TreeGPT: A Novel Hybrid Architecture for Abstract Syntax Tree Processing with Global Parent-Child Aggregation Authors: Zixi Li

  10. HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models Authors: Xin Tong, Zhi Lin, Jingya Wang, Bo Jin

  11. FAVAE-Effective Frequency Aware Latent Tokenizer Authors: Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper

  12. Learning from one graph: transductive learning guarantees via the geometry of small random worlds Authors: Nils Detering, Luca Galimberti, Anastasis Kratsios, Giulia Livieri, A. Martina Neuman

  13. MOSAIC: Minimax-Optimal Sparsity-Adaptive Inference for Change Points in Dynamic Networks Authors: Yingying Fan, Jingyuan Liu, Jinchi Lv, Ao Sun

  14. Dato: A Task-Based Programming Model for Dataflow Accelerators Authors: Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang

  15. ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders Authors: Xiangyu Liu, Haodi Lei, Yi Liu, Yang Liu, Wei Hu

  16. An Improved Template for Approximate Computing Authors: M. Rezaalipour, F. Costa, M. Biasion, R. Otoni, G. A. Constantinides, L. Pozzi

  17. SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning Authors: Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, Guohao Dai

  18. PAC-Bayesian Generalization Bounds for Graph Convolutional Networks on Inductive Node Classification Authors: Huayi Tang, Yong Liu

  19. Long-Range Graph Wavelet Networks Authors: Filippo Guerranti, Fabrizio Forte, Simon Geisler, Stephan G\"unnemann

  20. FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving Authors: Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee

  21. time2time: Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models Authors: Debdeep Sanyal, Aaryan Nagpal, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

  22. Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

  23. Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL Authors: Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, Honggang Zhang

  24. From Long to Short: LLMs Excel at Trimming Own Reasoning Chains Authors: Wei Han, Geng Zhan, Sicheng Yu, Chenyu Wang, Bryan Hooi

  25. Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks Authors: Su Hyeong Lee, Risi Kondor, Richard Ngo

  26. H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers Authors: Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe

  27. Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics Authors: Jiajun Bao, Nicolas Boull\'e, Toni J. B. Liu, Rapha\"el Sarfati, Christopher J. Earls

  28. Learning spatially structured open quantum dynamics with regional-attention transformers Authors: Dounan Du, Eden Figueroa

  29. Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation Authors: Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu

  30. Nonnegative matrix factorization and the principle of the common cause Authors: E. Khalafyan, A. E. Allahverdyan, A. Hovhannisyan

  31. On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data Authors: Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin


1. Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

ArXiv ID: 2509.06346

Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng

Abstract: Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts-a small group with outsized impact on performance-leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under the vLLM.

Comment: Model Architecture and Efficiency (MoE): post-training routing strategy that reinforces key experts and dynamically prunes redundant experts to improve accuracy and speed without retraining.

Relevance: 10 Novelty: 8


2. Learning words in groups: fusion algebras, tensor ranks and grokking

ArXiv ID: 2509.06931

Authors: Maor Shutman, Oren Louidor, Ran Tessler

Abstract: In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available and exhibits grokking while doing so. To explain the mechanism by which this is achieved, we reframe the problem as that of learning a particular $3$-tensor, which we show is typically of low rank. A key insight is that low-rank implementations of this tensor can be obtained by decomposing it along triplets of basic self-conjugate representations of the group and leveraging the fusion structure to rule out many components. Focusing on a phenomenologically similar but more tractable surrogate model, we show that the network is able to find such low-rank implementations (or approximations thereof), thereby using limited width to approximate the word-tensor in a generalizable way. In the case of the simple multiplication word, we further elucidate the form of these low-rank implementations, showing that the network effectively implements efficient matrix multiplication in the sense of Strassen. Our work also sheds light on the mechanism by which a network reaches such a solution under gradient descent.

Comment: Representation Learning/Training Dynamics: explains learning of group word operations via low-rank tensor decompositions and links to grokking.

Relevance: 10 Novelty: 8


3. Universality of physical neural networks with multivariate nonlinearity

ArXiv ID: 2509.05420

Authors: Benjamin Savinson, David J. Norris, Siddhartha Mishra, Samuel Lanthaler

Abstract: The enormous energy demand of artificial intelligence is driving the development of alternative hardware for deep learning. Physical neural networks try to exploit physical systems to perform machine learning more efficiently. In particular, optical systems can calculate with light using negligible energy. While their computational capabilities were long limited by the linearity of optical materials, nonlinear computations have recently been demonstrated through modified input encoding. Despite this breakthrough, our inability to determine if physical neural networks can learn arbitrary relationships between data -- a key requirement for deep learning known as universality -- hinders further progress. Here we present a fundamental theorem that establishes a universality condition for physical neural networks. It provides a powerful mathematical criterion that imposes device constraints, detailing how inputs should be encoded in the tunable parameters of the physical system. Based on this result, we propose a scalable architecture using free-space optics that is provably universal and achieves high accuracy on image classification tasks. Further, by combining the theorem with temporal multiplexing, we present a route to potentially huge effective system sizes in highly practical but poorly scalable on-chip photonic devices. Our theorem and scaling methods apply beyond optical systems and inform the design of a wide class of universal, energy-efficient physical neural networks, justifying further efforts in their development.

Comment: Model Architecture and Representation Learning: proves universality conditions for physical neural networks and proposes a scalable optical architecture.

Relevance: 9 Novelty: 9


4. LoaQ: Layer-wise Output Approximation Quantization

ArXiv ID: 2509.06297

Authors: Li Lin, Xiaojun Wan

Abstract: A natural and intuitive idea in model quantization is to approximate each component's quantized output to match its original. Layer-wise post-training quantization (PTQ), though based on this idea, adopts a strictly local view and can achieve, at best, only activation-aware approximations of weights. As a result, it often leads to insufficient approximations and practical deviations from this guiding intuition. Recent work has achieved a more accurate approximation of linear-layer outputs within the framework of layer-wise PTQ, but such refinements remain inadequate for achieving alignment with the full model output. Based on a deeper understanding of the structural characteristics of mainstream LLMs, we propose $LoaQ$, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency. It better aligns with this intuition and can feature a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation joint quantization. By integrating seamlessly with existing quantization strategies, it further enhances overall quantization quality and shows strong potential to advance the frontier of post-training quantization.

Comment: Model Compression and Efficiency: a layer-wise PTQ method targeting output-level consistency with a simple closed-form solution, improving quantization quality.

Relevance: 10 Novelty: 7


5. Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix

ArXiv ID: 2509.06314

Authors: Mehmet Can Yavuz, Berrin Yanikoglu

Abstract: A central challenge in representation learning is constructing latent embeddings that are both expressive and efficient. In practice, deep networks often produce redundant latent spaces where multiple coordinates encode overlapping information, reducing effective capacity and hindering generalization. Standard metrics such as accuracy or reconstruction loss provide only indirect evidence of such redundancy and cannot isolate it as a failure mode. We introduce a redundancy index, denoted rho(C), that directly quantifies inter-dimensional dependencies by analyzing coupling matrices derived from latent representations and comparing their off-diagonal statistics against a normal distribution via energy distance. The result is a compact, interpretable, and statistically grounded measure of representational quality. We validate rho(C) across discriminative and generative settings on MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100, spanning multiple architectures and hyperparameter optimization strategies. Empirically, low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. Estimator reliability grows with latent dimension, yielding natural lower bounds for reliable analysis. We further show that Tree-structured Parzen Estimators (TPE) preferentially explore low-rho regions, suggesting that rho(C) can guide neural architecture search and serve as a redundancy-aware regularization target. By exposing redundancy as a universal bottleneck across models and tasks, rho(C) offers both a theoretical lens and a practical tool for evaluating and improving the efficiency of learned representations.

Comment: Representation Learning: proposes a redundancy index rho(C) using coupling-matrix off-diagonal statistics (energy distance) to quantify inter-dimensional dependencies in latent spaces.

Relevance: 9 Novelty: 8


6. From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

ArXiv ID: 2509.06938

Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok

Abstract: As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes now poses an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model's hallucination risk.

Comment: Representation Learning: uses sparse autoencoders to analyze transformer internal concept activations and link them to hallucination behavior under input uncertainty.

Relevance: 9 Novelty: 8


7. Robust and Adaptive Spectral Method for Representation Multi-Task Learning with Contamination

ArXiv ID: 2509.06575

Authors: Yian Huang, Yang Feng, Zhiliang Ying

Abstract: Representation-based multi-task learning (MTL) improves efficiency by learning a shared structure across tasks, but its practical application is often hindered by contamination, outliers, or adversarial tasks. Most existing methods and theories assume a clean or near-clean setting, failing when contamination is significant. This paper tackles representation MTL with an unknown and potentially large contamination proportion, while also allowing for heterogeneity among inlier tasks. We introduce a Robust and Adaptive Spectral method (RAS) that can distill the shared inlier representation effectively and efficiently, while requiring no prior knowledge of the contamination level or the true representation dimension. Theoretically, we provide non-asymptotic error bounds for both the learned representation and the per-task parameters. These bounds adapt to inlier task similarity and outlier structure, and guarantee that RAS performs at least as well as single-task learning, thus preventing negative transfer. We also extend our framework to transfer learning with corresponding theoretical guarantees for the target task. Extensive experiments confirm our theory, showcasing the robustness and adaptivity of RAS, and its superior performance in regimes with up to 80\% task contamination.

Comment: Representation Learning: robust, adaptive spectral method to extract shared representations in contaminated multi-task settings with non-asymptotic guarantees.

Relevance: 9 Novelty: 8


8. Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation

ArXiv ID: 2509.06694

Authors: Victor Toscano-Duran, Rocio Gonzalez-Diaz, Miguel A. Guti\'errez-Naranjo

Abstract: While it is well-established that artificial neural networks are \emph{universal approximators} for continuous functions on compact domains, many modern approaches rely on deep or overparameterized architectures that incur high computational costs. In this paper, a new type of \emph{small shallow} neural network, called the \emph{Barycentric Neural Network} ($\BNN$), is proposed, which leverages a fixed set of \emph{base points} and their \emph{barycentric coordinates} to define both its structure and its parameters. We demonstrate that our $\BNN$ enables the exact representation of \emph{continuous piecewise linear functions} ($\CPLF$s), ensuring strict continuity across segments. Since any continuous function over a compact domain can be approximated arbitrarily well by $\CPLF$s, the $\BNN$ naturally emerges as a flexible and interpretable tool for \emph{function approximation}. Beyond the use of this representation, the main contribution of the paper is the introduction of a new variant of \emph{persistent entropy}, a topological feature that is stable and scale invariant, called the \emph{length-weighted persistent entropy} ($\LWPE$), which is weighted by the lifetime of topological features. Our framework, which combines the $\BNN$ with a loss function based on our $\LWPE$, aims to provide flexible and geometrically interpretable approximations of nonlinear continuous functions in resource-constrained settings, such as those with limited base points for $\BNN$ design and few training epochs. Instead of optimizing internal weights, our approach directly \emph{optimizes the base points that define the $\BNN$}. Experimental results show that our approach achieves \emph{superior and faster approximation performance} compared to classical loss functions such as MSE, RMSE, MAE, and log-cosh.

Comment: Model Architecture: introduces a small, shallow Barycentric Neural Network that exactly represents CPLFs and optimizes base points; also proposes a topological loss (length-weighted persistent entropy).

Relevance: 9 Novelty: 8


9. TreeGPT: A Novel Hybrid Architecture for Abstract Syntax Tree Processing with Global Parent-Child Aggregation

ArXiv ID: 2509.05550

Authors: Zixi Li

Abstract: We introduce TreeGPT, a novel neural architecture that combines transformer-based attention mechanisms with global parent-child aggregation for processing Abstract Syntax Trees (ASTs) in neural program synthesis tasks. Unlike traditional approaches that rely solely on sequential processing or graph neural networks, TreeGPT employs a hybrid design that leverages both self-attention for capturing local dependencies and a specialized Tree Feed-Forward Network (TreeFFN) for modeling hierarchical tree structures through iterative message passing. The core innovation lies in our Global Parent-Child Aggregation mechanism, formalized as: $$h_i^{(t+1)} = \sigma \Big( h_i^{(0)} + W_{pc} \sum_{(p,c) \in E_i} f(h_p^{(t)}, h_c^{(t)}) + b \Big)$$ where $h_i^{(t)}$ represents the hidden state of node $i$ at iteration $t$, $E_i$ denotes all parent-child edges involving node $i$, and $f(h_p, h_c)$ is an edge aggregation function. This formulation enables each node to progressively aggregate information from the entire tree structure through $T$ iterations. Our architecture integrates optional enhancements including gated aggregation with learnable edge weights, residual connections for gradient stability, and bidirectional propagation for capturing both bottom-up and top-down dependencies. We evaluate TreeGPT on the ARC Prize 2025 dataset, a challenging visual reasoning benchmark requiring abstract pattern recognition and rule inference. Experimental results demonstrate that TreeGPT achieves 96\% accuracy, significantly outperforming transformer baselines (1.3\%), large-scale models like Grok-4 (15.9\%), and specialized program synthesis methods like SOAR (52\%) while using only 1.5M parameters. Our comprehensive ablation study reveals that edge projection is the most critical component, with the combination of edge projection and gating achieving optimal performance.

Comment: Model Architecture: introduces a hybrid Transformer + TreeFFN with global parent–child aggregation for AST processing (conditional/dynamic structure-aware network).

Relevance: 9 Novelty: 7


10. HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models

ArXiv ID: 2509.06596

Authors: Xin Tong, Zhi Lin, Jingya Wang, Bo Jin

Abstract: Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token's true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighing of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.

Comment: Model Architecture/Efficiency: parameter-free, instance-level head gating and value calibration at decoding—dynamic attention head weighting to mitigate hallucinations.

Relevance: 9 Novelty: 7


11. FAVAE-Effective Frequency Aware Latent Tokenizer

ArXiv ID: 2509.05441

Authors: Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper

Abstract: Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, the reconstructed images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information, when jointly optimized, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image representation, with broader implications for applications in content creation, neural rendering, and medical imaging.

Comment: Model Architecture and Representation Learning: frequency-aware VAE tokenizer with wavelet-based decoupling of low/high frequencies to improve latent representations and high-frequency reconstruction.

Relevance: 9 Novelty: 7


12. Learning from one graph: transductive learning guarantees via the geometry of small random worlds

ArXiv ID: 2509.06894

Authors: Nils Detering, Luca Galimberti, Anastasis Kratsios, Giulia Livieri, A. Martina Neuman

Abstract: Since their introduction by Kipf and Welling in $2017$, a primary use of graph convolutional networks is transductive node classification, where missing labels are inferred within a single observed graph and its feature matrix. Despite the widespread use of the network model, the statistical foundations of transductive learning remain limited, as standard inference frameworks typically rely on multiple independent samples rather than a single graph. In this work, we address these gaps by developing new concentration-of-measure tools that leverage the geometric regularities of large graphs via low-dimensional metric embeddings. The emergent regularities are captured using a random graph model; however, the methods remain applicable to deterministic graphs once observed. We establish two principal learning results. The first concerns arbitrary deterministic $k$-vertex graphs, and the second addresses random graphs that share key geometric properties with an Erd\H{o}s-R\'{e}nyi graph $\mathbf{G}=\mathbf{G}(k,p)$ in the regime $p \in \mathcal{O}((\log (k)/k)^{1/2})$. The first result serves as the basis for and illuminates the second. We then extend these results to the graph convolutional network setting, where additional challenges arise. Lastly, our learning guarantees remain informative even with a few labelled nodes $N$ and achieve the optimal nonparametric rate $\mathcal{O}(N^{-1/2})$ as $N$ grows.

Comment: Theoretical Foundations: transductive learning guarantees via geometric concentration and extensions to GCNs, providing rates under single-graph settings.

Relevance: 8 Novelty: 8


13. MOSAIC: Minimax-Optimal Sparsity-Adaptive Inference for Change Points in Dynamic Networks

ArXiv ID: 2509.06303

Authors: Yingying Fan, Jingyuan Liu, Jinchi Lv, Ao Sun

Abstract: We propose a new inference framework, named MOSAIC, for change-point detection in dynamic networks with the simultaneous low-rank and sparse-change structure. We establish the minimax rate of detection boundary, which relies on the sparsity of changes. We then develop an eigen-decomposition-based test with screened signals that approaches the minimax rate in theory, with only a minor logarithmic loss. For practical implementation of MOSAIC, we adjust the theoretical test by a novel residual-based technique, resulting in a pivotal statistic that converges to a standard normal distribution via the martingale central limit theorem under the null hypothesis and achieves full power under the alternative hypothesis. We also analyze the minimax rate of testing boundary for dynamic networks without the low-rank structure, which almost aligns with the results in high-dimensional mean-vector change-point inference. We showcase the effectiveness of MOSAIC and verify our theoretical results with several simulation examples and a real data application.

Comment: Sparsity/Low-rank Theory: minimax detection/testing boundaries and near-optimal eigen-based tests for change points in dynamic networks with sparse changes and low-rank structure.

Relevance: 8 Novelty: 8


14. Dato: A Task-Based Programming Model for Dataflow Accelerators

ArXiv ID: 2509.06794

Authors: Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang

Abstract: Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator's spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.

Comment: High Performance Computing: a task-based programming model with first-class stream/sharding types and spatial mapping compiler for dataflow accelerators enabling high utilization.

Relevance: 8 Novelty: 7


15. ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders

ArXiv ID: 2509.05309

Authors: Xiangyu Liu, Haodi Lei, Yi Liu, Yang Liu, Wei Hu

Abstract: Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically-guided SAE, called ProtSAE. Unlike existing SAE which requires annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.

Comment: Representation Learning/Interpretability with Sparsity: semantically-guided sparse autoencoders to disentangle features in PLMs while maintaining reconstruction fidelity.

Relevance: 8 Novelty: 7


16. An Improved Template for Approximate Computing

ArXiv ID: 2509.06162

Authors: M. Rezaalipour, F. Costa, M. Biasion, R. Otoni, G. A. Constantinides, L. Pozzi

Abstract: Deploying neural networks on edge devices entails a careful balance between the energy required for inference and the accuracy of the resulting classification. One technique for navigating this tradeoff is approximate computing: the process of reducing energy consumption by slightly reducing the accuracy of arithmetic operators. In this context, we propose a methodology to reduce the area of the small arithmetic operators used in neural networks - i.e., adders and multipliers - via a small loss in accuracy, and show that we improve area savings for the same accuracy loss w.r.t. the state of the art. To achieve our goal, we improve on a boolean rewriting technique recently proposed, called XPAT, where the use of a parametrisable template to rewrite circuits has proved to be highly beneficial. In particular, XPAT was able to produce smaller circuits than comparable approaches while utilising a naive sum of products template structure. In this work, we show that template parameters can act as proxies for chosen metrics and we propose a novel template based on parametrisable product sharing that acts as a close proxy to synthesised area. We demonstrate experimentally that our methodology converges better to low-area solutions and that it can find better approximations than both the original XPAT and two other state-of-the-art approaches.

Comment: Compression/Efficiency: new parametrizable boolean-rewriting template for approximate adders/multipliers improving area savings for given accuracy loss.

Relevance: 8 Novelty: 7


17. SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

ArXiv ID: 2509.05614

Authors: Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, Guohao Dai

Abstract: Pruning accelerates compute-bound models by reducing computation. Recently applied to Vision-Language-Action (VLA) models, existing methods prune tokens using only local info from current action, ignoring global context from prior actions, causing >20% success rate drop and limited speedup. We observe high similarity across consecutive actions and propose leveraging both local (current) and global (past) info for smarter token selection. We introduce SpecPrune-VLA, a training-free method with two-level pruning and heuristic control: (1) Static pruning at action level: uses global history and local context to reduce visual tokens per action; (2) Dynamic pruning at layer level: prunes tokens per layer based on layer-specific importance; (3) Lightweight action-aware controller: classifies actions as coarse/fine-grained (by speed), adjusting pruning aggressiveness since fine-grained actions are pruning-sensitive. Experiments on LIBERO show SpecPrune-VLA achieves 1.46 times speedup on NVIDIA A800 and 1.57 times on NVIDIA GeForce RTX 3090 vs. OpenVLA-OFT, with negligible success rate loss.

Comment: Model Compression and Efficiency: training-free two-level token pruning with an action-aware controller to accelerate VLA models.

Relevance: 8 Novelty: 7


18. PAC-Bayesian Generalization Bounds for Graph Convolutional Networks on Inductive Node Classification

ArXiv ID: 2509.06600

Authors: Huayi Tang, Yong Liu

Abstract: Graph neural networks (GNNs) have achieved remarkable success in processing graph-structured data across various applications. A critical aspect of real-world graphs is their dynamic nature, where new nodes are continually added and existing connections may change over time. Previous theoretical studies, largely based on the transductive learning framework, fail to adequately model such temporal evolution and structural dynamics. In this paper, we presents a PAC-Bayesian theoretical analysis of graph convolutional networks (GCNs) for inductive node classification, treating nodes as dependent and non-identically distributed data points. We derive novel generalization bounds for one-layer GCNs that explicitly incorporate the effects of data dependency and non-stationarity, and establish sufficient conditions under which the generalization gap converges to zero as the number of nodes increases. Furthermore, we extend our analysis to two-layer GCNs, and reveal that it requires stronger assumptions on graph topology to guarantee convergence. This work establishes a theoretical foundation for understanding and improving GNN generalization in dynamic graph environments.

Comment: Representation Learning/Training Dynamics: PAC-Bayesian generalization bounds for GCNs with dependent, non-stationary nodes; foundational theory on generalization.

Relevance: 8 Novelty: 7


19. Long-Range Graph Wavelet Networks

ArXiv ID: 2509.06743

Authors: Filippo Guerranti, Fabrizio Forte, Simon Geisler, Stephan G\"unnemann

Abstract: Modeling long-range interactions, the propagation of information across distant parts of a graph, is a central challenge in graph machine learning. Graph wavelets, inspired by multi-resolution signal processing, provide a principled way to capture both local and global structures. However, existing wavelet-based graph neural networks rely on finite-order polynomial approximations, which limit their receptive fields and hinder long-range propagation. We propose Long-Range Graph Wavelet Networks (LR-GWN), which decompose wavelet filters into complementary local and global components. Local aggregation is handled with efficient low-order polynomials, while long-range interactions are captured through a flexible spectral domain parameterization. This hybrid design unifies short- and long-distance information flow within a principled wavelet framework. Experiments show that LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks, while remaining competitive on short-range datasets.

Comment: Model Architecture: introduces a wavelet-based GNN with spectral-domain parameterization to capture long-range interactions within a unified local/global design.

Relevance: 8 Novelty: 7


20. FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving

ArXiv ID: 2509.06261

Authors: Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee

Abstract: Recent advances in Post-Training Quantization (PTQ) techniques have significantly increased demand for serving quantized large language models (LLMs), enabling higher throughput and substantially reduced memory usage with minimal accuracy loss. Quantized models address memory constraints in LLMs and enhance GPU resource utilization through efficient GPU sharing. However, quantized models have smaller KV block sizes than non-quantized models, causing limited memory efficiency due to memory fragmentation. Also, distinct resource usage patterns between quantized and non-quantized models require efficient scheduling to maximize throughput. To address these challenges, we propose FineServe, an inference serving framework for mixed-precision LLMs. FineServe's key contributions include: (1) KV Slab, a precision-aware adaptive memory management technique dynamically allocating KV cache based on model quantization characteristics, significantly reducing GPU memory fragmentation, and (2) a two-level scheduling framework comprising a global scheduler that places models to GPUs based on request rates, latency SLOs, and memory constraints and efficiency, and a local scheduler that adaptively adjusts batch sizes according to real-time request fluctuations. Experimental results demonstrate that FineServe achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput compared to the state-of-the-art GPU sharing systems.

Comment: High Performance Computing/Efficiency: precision-aware KV cache management (KV Slab) and two-level scheduling for mixed-precision LLM serving to reduce fragmentation and boost throughput.

Relevance: 8 Novelty: 7


21. time2time: Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models

ArXiv ID: 2509.05801

Authors: Debdeep Sanyal, Aaryan Nagpal, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

Abstract: While transformer-based foundation models excel at forecasting routine patterns, two questions remain: do they internalize semantic concepts such as market regimes, or merely fit curves? And can their internal representations be leveraged to simulate rare, high-stakes events such as market crashes? To investigate this, we introduce activation transplantation, a causal intervention that manipulates hidden states by imposing the statistical moments of one event (e.g., a historical crash) onto another (e.g., a calm period) during the forward pass. This procedure deterministically steers forecasts: injecting crash semantics induces downturn predictions, while injecting calm semantics suppresses crashes and restores stability. Beyond binary control, we find that models encode a graded notion of event severity, with the latent vector norm directly correlating with the magnitude of systemic shocks. Validated across two architecturally distinct TSFMs, Toto (decoder only) and Chronos (encoder-decoder), our results demonstrate that steerable, semantically grounded representations are a robust property of large time series transformers. Our findings provide evidence for a latent concept space that governs model predictions, shifting interpretability from post-hoc attribution to direct causal intervention, and enabling semantic "what-if" analysis for strategic stress-testing.

Comment: Representation Learning: causal interventions on Transformer hidden states reveal and control semantic latent concepts in TS foundation models.

Relevance: 8 Novelty: 7


22. Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

ArXiv ID: 2509.06461

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity. (3) Theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.

Comment: Representation Learning/Attention Mechanisms: training-free contrastive attention to isolate task-relevant visual signals and reduce attention entropy effects.

Relevance: 8 Novelty: 7


23. Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL

ArXiv ID: 2509.06024

Authors: Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, Honggang Zhang

Abstract: Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model's genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER) -- a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising the trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that functions both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini level performance on Logictree with 400 trianing steps, while the average confidence of CoT-augmented answers rises by 30%. The model further exhibits generalisation across diverse logical-reasoning datasets, and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available in repository https://github.com/Henryhe09/DRER.

Comment: Training dynamics/representation learning: RL-based reward shaping (Reasoning Quality Reward, Dynamic Length Advantage) to improve CoT reasoning in LLMs.

Relevance: 8 Novelty: 7


24. From Long to Short: LLMs Excel at Trimming Own Reasoning Chains

ArXiv ID: 2509.06174

Authors: Wei Han, Geng Zhan, Sicheng Yu, Chenyu Wang, Bryan Hooi

Abstract: O1/R1 style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs. By applying test-time scaling to generate extended reasoning paths, they establish many SOTAs across a wide range of complex reasoning tasks. However, recent studies show that LRMs are prone to suffer from overthinking -- the tendency to overcomplicate simple problems, leading to excessive strategy switching and long, convoluted reasoning traces that hinder their interpretability. To mitigate this issue, we conduct a systematic investigation into the reasoning efficiency of a broad set of LRMs and uncover a common dilemma: the difficulty in balancing multiple generation objectives such as correctness and brevity. Based on this discovery, we propose a test-time scaling method, EDIT (Efficient Dynamic Inference Trimming), which efficiently guides LRMs to identify the shortest correct reasoning paths at test time. EDIT employs constraint-guided generation while jointly tracking length and answer distributions under varying constraints, allowing it to select responses that strike an optimal balance between conciseness and correctness. Extensive experiments across diverse models and datasets show that EDIT substantially enhance the reasoning efficiency, producing compact yet informative outputs that improve readability and user experience.

Comment: Model Compression/Efficiency: proposes a test-time scaling method (EDIT) that trims reasoning chains to jointly optimize brevity and correctness, improving inference efficiency.

Relevance: 8 Novelty: 7


25. Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

ArXiv ID: 2509.06701

Authors: Su Hyeong Lee, Risi Kondor, Richard Ngo

Abstract: We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi'") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.

Comment: Representation/Theory: probabilistic framework for compositional latent subagents in neural networks via weighted log pooling, yielding insights into internal alignment dynamics.

Relevance: 7 Novelty: 8


26. H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers

ArXiv ID: 2509.06956

Authors: Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe

Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H${2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2

Comment: Compression/Efficiency: dynamic token pruning and recovery modules within transformers to reduce compute while restoring full-length outputs for efficient inference.

Relevance: 7 Novelty: 7


27. Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics

ArXiv ID: 2509.06322

Authors: Jiajun Bao, Nicolas Boull\'e, Toni J. B. Liu, Rapha\"el Sarfati, Christopher J. Earls

Abstract: Large language models (LLMs) have demonstrated emergent in-context learning (ICL) capabilities across a range of tasks, including zero-shot time-series forecasting. We show that text-trained foundation models can accurately extrapolate spatiotemporal dynamics from discretized partial differential equation (PDE) solutions without fine-tuning or natural language prompting. Predictive accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. In multi-step rollouts, where the model recursively predicts future spatial states over multiple time steps, errors grow algebraically with the time horizon, reminiscent of global error accumulation in classical finite-difference solvers. We interpret these trends as in-context neural scaling laws, where prediction quality varies predictably with both context length and output length. To better understand how LLMs are able to internally process PDE solutions so as to accurately roll them out, we analyze token-level output distributions and uncover a consistent ICL progression: beginning with syntactic pattern imitation, transitioning through an exploratory high-entropy phase, and culminating in confident, numerically grounded predictions.

Comment: Representation Learning: analysis of in-context learning scaling laws and token-level dynamics for zero-shot PDE rollout.

Relevance: 7 Novelty: 7


28. Learning spatially structured open quantum dynamics with regional-attention transformers

ArXiv ID: 2509.06871

Authors: Dounan Du, Eden Figueroa

Abstract: Simulating the dynamics of open quantum systems with spatial structure and external control is an important challenge in quantum information science. Classical numerical solvers for such systems require integrating coupled master and field equations, which is computationally demanding for simulation and optimization tasks and often precluding real-time use in network-scale simulations or feedback control. We introduce a regional attention-based neural architecture that learns the spatiotemporal dynamics of structured open quantum systems. The model incorporates translational invariance of physical laws as an inductive bias to achieve scalable complexity, and supports conditioning on time-dependent global control parameters. We demonstrate learning on two representative systems: a driven dissipative single qubit and an electromagnetically induced transparency (EIT) quantum memory. The model achieves high predictive fidelity under both in-distribution and out-of-distribution control protocols, and provides substantial acceleration up to three orders of magnitude over numerical solvers. These results demonstrate that the architecture establishes a general surrogate modeling framework for spatially structured open quantum dynamics, with immediate relevance to large-scale quantum network simulation, quantum repeater and protocol design, real-time experimental optimization, and scalable device modeling across diverse light-matter platforms.

Comment: Model Architecture: introduces a regional-attention Transformer incorporating translational invariance as an inductive bias and conditioning on global controls.

Relevance: 7 Novelty: 7


29. Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation

ArXiv ID: 2509.05605

Authors: Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu

Abstract: Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs' representation space for efficient and tailored preference dataset construction, named Icon$^{2}$. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.

Comment: Representation Learning: leverages layer-wise direction vectors to regulate token representations and synthesize preference data for alignment.

Relevance: 7 Novelty: 7


30. Nonnegative matrix factorization and the principle of the common cause

ArXiv ID: 2509.03652

Authors: E. Khalafyan, A. E. Allahverdyan, A. Hovhannisyan

Abstract: Nonnegative matrix factorization (NMF) is a known unsupervised data-reduction method. The principle of the common cause (PCC) is a basic methodological approach in probabilistic causality, which seeks an independent mixture model for the joint probability of two dependent random variables. It turns out that these two concepts are closely related. This relationship is explored reciprocally for several datasets of gray-scale images, which are conveniently mapped into probability models. On one hand, PCC provides a predictability tool that leads to a robust estimation of the effective rank of NMF. Unlike other estimates (e.g., those based on the Bayesian Information Criteria), our estimate of the rank is stable against weak noise. We show that NMF implemented around this rank produces features (basis images) that are also stable against noise and against seeds of local optimization, thereby effectively resolving the NMF nonidentifiability problem. On the other hand, NMF provides an interesting possibility of implementing PCC in an approximate way, where larger and positively correlated joint probabilities tend to be explained better via the independent mixture model. We work out a clustering method, where data points with the same common cause are grouped into the same cluster. We also show how NMF can be employed for data denoising.

Comment: Representation Learning: connects NMF with the principle of the common cause, including robust effective rank estimation and noise-stable features.

Relevance: 7 Novelty: 7


31. On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data

ArXiv ID: 2509.06505

Authors: Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin

Abstract: The generative adversarial network (GAN) aims to approximate an unknown distribution via a parameterized neural network (NN). While GANs have been widely applied in reinforcement and semisupervised learning as well as computer vision tasks, selecting their parameters often needs an exhaustive search and only a few selection methods can be proved to be theoretically optimal. One of the most promising GAN variants is the Wasserstein GAN (WGAN). Prior work on optimal parameters for WGAN is limited to the linear-quadratic-Gaussian (LQG) setting, where the NN is linear and the data is Gaussian. In this paper, we focus on the characterization of optimal WGAN parameters beyond the LQG setting. We derive closed-form optimal parameters for one-dimensional WGANs when the NN has non-linear activation functions and the data is non-Gaussian. To extend this to high-dimensional WGANs, we adopt the sliced Wasserstein framework and replace the constraint on marginal distributions of the randomly projected data by a constraint on the joint distribution of the original (unprojected) data. We show that the linear generator can be asymptotically optimal for sliced WGAN with non-Gaussian data. Empirical studies show that our closed-form WGAN parameters have good convergence behavior with data under both Gaussian and Laplace distributions. Also, compared to the r principal component analysis (r-PCA) solution, our proposed solution for sliced WGAN can achieve the same performance while requiring less computational resources.

Comment: Representation Learning: theoretical characterization of optimal WGAN parameters beyond LQG and asymptotic optimality in sliced WGANs, offering insights into learned generative mappings.

Relevance: 7 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.