Personalized Daily ArXiv Papers 2025-10-27

[gpt-5]	Prompt	Completion	Total
Token	41714	43116	84830
Cost	$0.05	$0.43	$0.48

Total arXiv papers: 690

Total scanned papers: 296

Total relevant papers: 29

Table of contents with paper titles:

Surrogate-based quantification of policy uncertainty in generative flow networks Authors: Ramón Nartallo-Kaluarachchi, Robert Manson-Sawko, Shashanka Ubaru, Dongsung Huh, Małgorzata J Zimoń, Lior Horesh, Yoshua Bengio
Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection Authors: Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar
ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models Authors: Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, Luca Zappella
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization Authors: Xuan Tang, Jichu Li, Difan Zou
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression Authors: Xi Zhang, Xiaolin Wu, Jiamang Wang, Weisi Lin
Sparser Block-Sparse Attention via Token Permutation Authors: Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu
From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD Authors: Konstantinos Christopher Tsiolis, Alireza Mousavi-Hosseini, Murat A. Erdogdu
Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data Authors: Hancheng Min, Zhihui Zhu, René Vidal
Triangle Multiplication Is All You Need For Biomolecular Structure Representations Authors: Jeffrey Ouyang-Zhang, Pranav Murugan, Daniel J. Diaz, Gianluca Scarpellini, Richard Strong Bowen, Nate Gruver, Adam Klivans, Philipp Krähenbühl, Aleksandra Faust, Maruan Al-Shedivat
Disentangled Representation Learning via Modular Compositional Bias Authors: Whie Jung, Dong Hoon Lee, Seunghoon Hong
Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions Authors: Tobias Schmidt, Steffen Schneider, Matthias Bethge
$\alpha$-LoRA: Effective Fine-Tuning via Base Model Rescaling Authors: Aymane El Firdoussi, El Mahdi Chayti, Mohamed El Amine Seddik, Martin Jaggi
Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation Authors: Enshu Liu, Qian Chen, Xuefei Ning, Shengen Yan, Guohao Dai, Zinan Lin, Yu Wang
Adaptive Graph Mixture of Residual Experts: Unsupervised Learning on Diverse Graphs with Heterogeneous Specialization Authors: Yunlong Chu, Minglai Shao, Zengyi Wo, Bing Hao, Yuhang Liu, Ruijie Wang, Jianxin Li
HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing Authors: Johann Flemming Gloy, Simon Olsson
Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds Authors: Oscar Davis, Michael S. Albergo, Nicholas M. Boffi, Michael M. Bronstein, Avishek Joey Bose
Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds Authors: Emre Sahinoglu, Youbang Sun, Shahin Shahrampour
Correlation Dimension of Auto-Regressive Large Language Models Authors: Xin Du, Kumiko Tanaka-Ishii
Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations Authors: Faisal Hamman, Pasan Dissanayake, Yanjun Fu, Sanghamitra Dutta
Relieving the Over-Aggregating Effect in Graph Transformers Authors: Junshu Sun, Wanxing Chang, Chenxue Yang, Qingming Huang, Shuhui Wang
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set Authors: Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang
Model Merging with Functional Dual Anchors Authors: Kexuan Shi, Yandong Wen, Weiyang Liu
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads Authors: Jiabo Shi, Dimitrios Pezaros, Yehia Elkhatib
Head Pursuit: Probing Attention Specialization in Multimodal Transformers Authors: Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
Memory Constrained Dynamic Subnetwork Update for Transfer Learning Authors: Aël Quélennec, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione
PINN Balls: Scaling Second-Order Methods for PINNs with Domain Decomposition and Adaptive Sampling Authors: Andrea Bonfanti, Ismael Medina, Roman List, Björn Staeves, Roberto Santana, Marco Ellero
Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime Authors: Noah Oberweis, Semih Cayci
On Uncertainty Calibration for Equivariant Functions Authors: Edward Berman, Jacob Ginesin, Marco Pacini, Robin Walters
Neural Mutual Information Estimation with Vector Copulas Authors: Yanzhi Chen, Zijing Ou, Adrian Weller, Michael U. Gutmann

1. Surrogate-based quantification of policy uncertainty in generative flow networks

ArXiv ID: 2510.21523

Authors: Ramón Nartallo-Kaluarachchi, Robert Manson-Sawko, Shashanka Ubaru, Dongsung Huh, Małgorzata J Zimoń, Lior Horesh, Yoshua Bengio

Abstract: Generative flow networks are able to sample, via sequential construction, high-reward, complex objects according to a reward function. However, such reward functions are often estimated approximately from noisy data, leading to epistemic uncertainty in the learnt policy. We present an approach to quantify this uncertainty by constructing a surrogate model composed of a polynomial chaos expansion, fit on a small ensemble of trained flow networks. This model learns the relationship between reward functions, parametrised in a low-dimensional space, and the probability distributions over actions at each step along a trajectory of the flow network. The surrogate model can then be used for inexpensive Monte Carlo sampling to estimate the uncertainty in the policy given uncertain rewards. We illustrate the performance of our approach on a discrete and continuous grid-world, symbolic regression, and a Bayesian structure learning task.

Comment: Author match

2. Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

ArXiv ID: 2510.21049

Authors: Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar

Abstract: Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks--safety detection and hallucination detection--evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.

Comment: Author match
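
To make the "low-FPR operating point" idea concrete, here is a minimal sketch of how recall (TPR) at a fixed false-positive budget can be computed from detector scores. The function name `tpr_at_fpr` and the toy scores are illustrative assumptions, not taken from the paper.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest true-positive rate achievable with false-positive rate <= max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    # Try each observed score as a threshold: predict positive when score >= threshold.
    for thr in sorted(set(scores)):
        fpr = sum(s >= thr for s in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(s >= thr for s in pos) / len(pos)
            best = max(best, tpr)
    return best

# Hypothetical detector scores (higher = more likely unsafe) and ground-truth labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

strict = tpr_at_fpr(scores, labels, max_fpr=0.0)    # no false positives allowed
relaxed = tpr_at_fpr(scores, labels, max_fpr=0.25)  # one in four negatives may be flagged
print(strict, relaxed)
```

On this toy data a single high-scoring negative halves the recall at the strictest budget, which is the kind of effect that overall accuracy does not show.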

3. ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

ArXiv ID: 2510.21450

Authors: Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, Luca Zappella

Abstract: Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.

Comment: High Performance Computing: algorithm to parallelize nonlinear RNN training via Newton iterations and parallel reductions, enabling large-scale sequence model training.

Relevance: 10 Novelty: 9
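
The idea of casting a nonlinear recurrence as one system of equations and solving it with Newton's iterations can be sketched on a scalar toy RNN. Everything below (the cell `h_t = tanh(a * h_{t-1} + x_t)`, the function names, the tolerance) is an assumption chosen for illustration; it is not the ParaRNN implementation. The inner linear recurrence is written as a plain loop here; it is the step that a parallel reduction (scan) would replace.

```python
import math

def sequential_rnn(xs, a, h0=0.0):
    """Reference: apply h_t = tanh(a * h_{t-1} + x_t) one step at a time."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(a * h + x)
        hs.append(h)
    return hs

def newton_rnn(xs, a, h0=0.0, iters=50, tol=1e-12):
    """Solve F(h)_t = h_t - tanh(a * h_{t-1} + x_t) = 0 for all t with Newton's method."""
    hs = [0.0] * len(xs)  # initial guess for the whole trajectory
    for _ in range(iters):
        prev = [h0] + hs[:-1]
        f = [math.tanh(a * p + x) for p, x in zip(prev, xs)]
        resid = [h - fv for h, fv in zip(hs, f)]
        if max(abs(r) for r in resid) < tol:
            break
        # Sub-diagonal of the Jacobian: d f_t / d h_{t-1} = a * (1 - f_t^2).
        jac = [a * (1.0 - fv * fv) for fv in f]
        # Newton step solves the LINEAR recurrence d_t = jac_t * d_{t-1} - resid_t.
        # A linear recurrence like this is what a parallel scan can evaluate.
        d, delta = 0.0, []
        for j, r in zip(jac, resid):
            d = j * d - r
            delta.append(d)
        hs = [h + dl for h, dl in zip(hs, delta)]
    return hs

xs = [0.3, -1.2, 0.8, 0.1, -0.5, 1.5, -0.7, 0.2]
ref = sequential_rnn(xs, a=0.9)
par = newton_rnn(xs, a=0.9)
max_err = max(abs(r - p) for r, p in zip(ref, par))
print(max_err)
```

Each Newton step touches every time step at once, so the sequential dependency is confined to a linear recurrence rather than to the nonlinear cell itself.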

4. A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization

ArXiv ID: 2510.21314

Authors: Xuan Tang, Jichu Li, Difan Zou

Abstract: The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their full-precision counterparts provided mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weights and second-moment quantization due to its reliance on $\beta_2 \to 1$, while Muon requires weaker error control and is thus potentially more robust. These results narrow the gap between empirical success and theoretical understanding of low-precision training methods. Numerical experiments on synthetic and real-world data corroborate our theory.

Comment: Compression/Efficiency: first convergence analysis of adaptive optimizers under floating‑point quantization (gradients/weights/states) for low‑precision training.

Relevance: 10 Novelty: 9
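
As a rough illustration of what "floating-point quantization with a given mantissa length" means, the sketch below rounds the mantissa of a Python float to a chosen number of bits. This is a generic rounding model assumed for illustration (exponent range is not limited, and `quantize_mantissa` is a hypothetical helper); the paper's exact quantization model may differ.

```python
import math

def quantize_mantissa(x, mantissa_bits):
    """Round x to `mantissa_bits` bits of mantissa, keeping its exponent."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

# Relative rounding error is at most 2**-mantissa_bits under this model.
for bits in (3, 7, 10):
    q = quantize_mantissa(0.1, bits)
    print(bits, q, abs(q - 0.1) / 0.1)
```

Under this model each extra mantissa bit halves the worst-case relative error, which gives some intuition for why a mantissa length growing logarithmically with the iteration count can be enough to control accumulated error.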

5. Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression

ArXiv ID: 2510.20984

Authors: Xi Zhang, Xiaolin Wu, Jiamang Wang, Weisi Lin

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints. Our source code is available on GitHub: https://github.com/xzhang9308/GLVQ.

Comment: Model Compression and Efficiency: introduces grouped lattice vector quantization with learnable generation matrices and Babai rounding for low-bit LLMs.

Relevance: 10 Novelty: 8
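
Babai rounding and the "decoding is a matrix-vector multiplication" remark can be sketched on a two-dimensional lattice. The generation matrix `G`, the weight pair `w`, and the helper names below are made-up values for illustration, not anything from the GLVQ code; a learned matrix would take the place of `G`.

```python
def babai_round(G, w):
    """Approximate nearest lattice point: integer coordinates z = round(G^{-1} w)."""
    (a, b), (c, d) = G
    det = a * d - b * c
    # Inverse of a 2x2 matrix, applied to w.
    u0 = (d * w[0] - b * w[1]) / det
    u1 = (-c * w[0] + a * w[1]) / det
    return [round(u0), round(u1)]

def decode(G, z):
    """Reconstruct the quantized weights as the matrix-vector product G z."""
    return [G[0][0] * z[0] + G[0][1] * z[1],
            G[1][0] * z[0] + G[1][1] * z[1]]

G = [[0.5, 0.25],
     [0.0, 0.5]]   # hypothetical generation matrix (columns are lattice basis vectors)
w = [0.6, 0.4]     # hypothetical pair of weights to quantize

z = babai_round(G, w)
w_hat = decode(G, z)
print(z, w_hat)
```

Only the integer codes `z` and the small matrix `G` need to be stored per group, which is where the compression comes from in a scheme of this kind.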

6. Sparser Block-Sparse Attention via Token Permutation

ArXiv ID: 2510.21270

Authors: Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn

Comment: Matches Compression/Efficiency: block-sparse attention enhanced via token permutation and custom kernels, improving long-context LLM prefilling speed/accuracy.

Relevance: 10 Novelty: 8
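
The "permutation properties of attention" can be checked numerically: without a causal mask, reordering keys and values together leaves each query's output unchanged, and reordering queries only reorders the output rows. The sketch below uses plain dense softmax attention on toy vectors; it is an assumed illustration of that property, not the PBS-Attn kernel, and the permutation `perm` is arbitrary.

```python
import math

def attention(Q, K, V):
    """Plain softmax attention over lists of vectors (one output row per query)."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

perm = [2, 0, 3, 1]  # hypothetical reordering of key/value positions
K_perm = [K[i] for i in perm]
V_perm = [V[i] for i in perm]

base = attention(Q, K, V)
permuted = attention(Q, K_perm, V_perm)
max_diff = max(abs(a - b) for r1, r2 in zip(base, permuted) for a, b in zip(r1, r2))
print(max_diff)
```

Because the result does not depend on key order, keys that matter to the same queries can in principle be grouped into the same blocks so that more blocks can be skipped.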

7. From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD

ArXiv ID: 2510.21020

Authors: Konstantinos Christopher Tsiolis, Alireza Mousavi-Hosseini, Murat A. Erdogdu

Abstract: To understand feature learning dynamics in neural networks, recent theoretical works have focused on gradient-based learning of Gaussian single-index models, where the label is a nonlinear function of a latent one-dimensional projection of the input. While the sample complexity of online SGD is determined by the information exponent of the link function, recent works improved this by performing multiple gradient steps on the same sample with different learning rates -- yielding a non-correlational update rule -- and instead are limited by the (potentially much smaller) generative exponent. However, this picture is only valid when these learning rates are sufficiently large. In this paper, we characterize the relationship between learning rate(s) and sample complexity for a broad class of gradient-based algorithms that encapsulates both correlational and non-correlational updates. We demonstrate that, in certain cases, there is a phase transition from an "information exponent regime" with small learning rate to a "generative exponent regime" with large learning rate. Our framework covers prior analyses of one-pass SGD and SGD with batch reuse, while also introducing a new layer-wise training algorithm that leverages a two-timescales approach (via different learning rates for each layer) to go beyond correlational queries without reusing samples or modifying the loss from squared error. Our theoretical study demonstrates that the choice of learning rate is as important as the design of the algorithm in achieving statistical and computational efficiency.

Comment: Representation Learning: theoretical analysis of SGD dynamics showing learning-rate-induced phase transitions; introduces a two-timescale layer-wise training algorithm.

Relevance: 9 Novelty: 8

8. Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data

ArXiv ID: 2510.21078

Authors: Hancheng Min, Zhihui Zhu, René Vidal

Abstract: Among many mysteries behind the success of deep networks lies the exceptional discriminative power of their learned representations as manifested by the intriguing Neural Collapse (NC) phenomenon, where simple feature structures emerge at the last layer of a trained neural network. Prior works on the theoretical understandings of NC have focused on analyzing the optimization landscape of matrix-factorization-like problems by considering the last-layer features as unconstrained free optimization variables and showing that their global minima exhibit NC. In this paper, we show that gradient flow on a two-layer ReLU network for classifying orthogonally separable data provably exhibits NC, thereby advancing prior results in two ways: First, we relax the assumption of unconstrained features, showing the effect of data structure and nonlinear activations on NC characterizations. Second, we reveal the role of the implicit bias of the training dynamics in facilitating the emergence of NC.

Comment: Representation Learning: proves emergence of Neural Collapse under gradient flow in two-layer ReLU networks with orthogonally separable data.

Relevance: 9 Novelty: 8

9. Triangle Multiplication Is All You Need For Biomolecular Structure Representations

ArXiv ID: 2510.18870

Authors: Jeffrey Ouyang-Zhang, Pranav Murugan, Daniel J. Diaz, Gianluca Scarpellini, Richard Strong Bowen, Nate Gruver, Adam Klivans, Philipp Krähenbühl, Aleksandra Faust, Maruan Al-Shedivat

Abstract: AlphaFold has transformed protein structure prediction, but emerging applications such as virtual ligand screening, proteome-wide folding, and de novo binder design demand predictions at a massive scale, where runtime and memory costs become prohibitive. A major bottleneck lies in the Pairformer backbone of AlphaFold3-style models, which relies on computationally expensive triangular primitives, especially triangle attention, for pairwise reasoning. We introduce Pairmixer, a streamlined alternative that eliminates triangle attention while preserving higher-order geometric reasoning capabilities that are critical for structure prediction. Pairmixer substantially improves computational efficiency, matching state-of-the-art structure predictors across folding and docking benchmarks, delivering up to 4x faster inference on long sequences while reducing training cost by 34%. Its efficiency alleviates the computational burden of downstream applications such as modeling large protein complexes, high-throughput ligand and binder screening, and hallucination-based design. Within BoltzDesign, for example, Pairmixer delivers over 2x faster sampling and scales to sequences ~30% longer than the memory limits of Pairformer.

Comment: Matches Model Architecture and Efficiency: replaces triangle attention with a streamlined module (Pairmixer) preserving higher-order reasoning while reducing compute/memory.

Relevance: 9 Novelty: 8

10. Disentangled Representation Learning via Modular Compositional Bias

ArXiv ID: 2510.21402

Authors: Whie Jung, Dong Hoon Lee, Seunghoon Hong

Abstract: Recent disentangled representation learning (DRL) methods heavily rely on factor-specific strategies, either learning objectives for attributes or model architectures for objects, to embed inductive biases. Such divergent approaches result in significant overhead when novel factors of variation do not align with prior assumptions, such as statistical independence or spatial exclusivity, or when multiple factors coexist, as practitioners must redesign architectures or objectives. To address this, we propose a compositional bias, a modular inductive bias decoupled from both objectives and architectures. Our key insight is that different factors obey distinct recombination rules in the data distribution: global attributes are mutually exclusive, e.g., a face has one nose, while objects share a common support (any subset of objects can co-exist). We therefore randomly remix latents according to factor-specific rules, i.e., a mixing strategy, and force the encoder to discover whichever factor structure the mixing strategy reflects through two complementary objectives: (i) a prior loss that ensures every remix decodes into a realistic image, and (ii) the compositional consistency loss introduced by Wiedemer et al. (arXiv:2310.05327), which aligns each composite image with its corresponding composite latent. Under this general framework, simply adjusting the mixing strategy enables disentanglement of attributes, objects, and even both, without modifying the objectives or architectures. Extensive experiments demonstrate that our method shows competitive performance in both attribute and object disentanglement, and uniquely achieves joint disentanglement of global style and objects. Code is available at https://github.com/whieya/Compositional-DRL.

Comment: Matches Representation Learning: modular compositional bias enabling disentanglement of attributes/objects without architecture/objective redesign.

Relevance: 9 Novelty: 8

11. Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions

ArXiv ID: 2510.21706

Authors: Tobias Schmidt, Steffen Schneider, Matthias Bethge

Abstract: We propose Equivariance by Contrast (EbC) to learn equivariant embeddings from observation pairs $(\mathbf{y}, g \cdot \mathbf{y})$, where $g$ is drawn from a finite group acting on the data. Our method jointly learns a latent space and a group representation in which group actions correspond to invertible linear maps -- without relying on group-specific inductive biases. We validate our approach on the infinite dSprites dataset with structured transformations defined by the finite group $G:= (R_m \times \mathbb{Z}_n \times \mathbb{Z}_n)$, combining discrete rotations and periodic translations. The resulting embeddings exhibit high-fidelity equivariance, with group operations faithfully reproduced in latent space. On synthetic data, we further validate the approach on the non-abelian orthogonal group $O(n)$ and the general linear group $GL(n)$. We also provide a theoretical proof for identifiability. While broad evaluation across diverse group types on real-world data remains future work, our results constitute the first successful demonstration of general-purpose encoder-only equivariant learning from group action observations alone, including non-trivial non-abelian groups and a product group motivated by modeling affine equivariances in computer vision.

Comment: Representation Learning: identifiable equivariant embeddings from finite group actions without inductive biases; theory plus general‑purpose method.

Relevance: 9 Novelty: 8

12. $\alpha$-LoRA: Effective Fine-Tuning via Base Model Rescaling

ArXiv ID: 2510.21345

Authors: Aymane El Firdoussi, El Mahdi Chayti, Mohamed El Amine Seddik, Martin Jaggi

Abstract: Fine-tuning has proven to be highly effective in adapting pre-trained models to perform better on new desired tasks with minimal data samples. Among the most widely used approaches are reparameterization methods, which update a target module by augmenting its frozen weight matrix with an additional trainable weight matrix. The most prominent example is Low-Rank Adaptation (LoRA), which has gained significant attention in recent years. In this paper, we introduce a new class of reparameterization methods for transfer learning, designed to enhance the generalization ability of fine-tuned models. We establish the effectiveness of our approach in a high-dimensional binary classification setting using tools from Random Matrix Theory, and further validate our theoretical findings through more realistic experiments, such as fine-tuning LLMs.

Comment: Low‑rank/Compression: new reparameterization (α‑LoRA) via base model rescaling with theory (RMT) to improve fine‑tuning generalization.

Relevance: 9 Novelty: 8

13. Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

ArXiv ID: 2510.21003

Authors: Enshu Liu, Qian Chen, Xuefei Ning, Shengen Yan, Guohao Dai, Zinan Lin, Yu Wang

Abstract: Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model which provides the ground truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel conditional score distillation loss to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline DD1, DD2 reduces the gap between the one-step sampling and original AR model by 67%, with up to 12.3$\times$ training speed-up simultaneously. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at https://github.com/imagination-research/Distilled-Decoding-2.

Comment: Compression/Efficiency: introduces conditional score distillation for one-step sampling of image autoregressive models, substantially accelerating generation.

Relevance: 9 Novelty: 8

14. Adaptive Graph Mixture of Residual Experts: Unsupervised Learning on Diverse Graphs with Heterogeneous Specialization

ArXiv ID: 2510.21207

Authors: Yunlong Chu, Minglai Shao, Zengyi Wo, Bing Hao, Yuhang Liu, Ruijie Wang, Jianxin Li

Abstract: Graph Neural Networks (GNNs) face a fundamental adaptability challenge: their fixed message-passing architectures struggle with the immense diversity of real-world graphs, where optimal computational strategies vary by local structure and task. While Mixture-of-Experts (MoE) offers a promising pathway to adaptability, existing graph MoE methods remain constrained by their reliance on supervised signals and instability when training heterogeneous experts. We introduce ADaMoRE (Adaptive Mixture of Residual Experts), a principled framework that enables robust, fully unsupervised training of heterogeneous MoE on graphs. ADaMoRE employs a backbone-residual expert architecture where foundational encoders provide stability while specialized residual experts capture diverse computational patterns. A structurally-aware gating network performs fine-grained node routing. The entire architecture is trained end-to-end using a unified unsupervised objective, which integrates a primary reconstruction task with an information-theoretic diversity regularizer to explicitly enforce functional specialization among the experts. Theoretical analysis confirms our design improves data efficiency and training stability. Extensive evaluation across 16 benchmarks validates ADaMoRE's state-of-the-art performance in unsupervised node classification and few-shot learning, alongside superior generalization, training efficiency, and faster convergence on diverse graphs and tasks.

Comment: Model Architecture (MoE): proposes an unsupervised Adaptive Graph Mixture of Residual Experts with structurally-aware gating and diversity regularization.

Relevance: 9 Novelty: 8

15. HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing

ArXiv ID: 2510.21542

Authors: Johann Flemming Gloy, Simon Olsson

Abstract: Flow and diffusion-based models have emerged as powerful tools for scientific applications, particularly for sampling non-normalized probability distributions, as exemplified by Boltzmann Generators (BGs). A critical challenge in deploying these models is their reliance on sample likelihood computations, which scale prohibitively with system size $n$, often rendering them infeasible for large-scale problems. To address this, we introduce HollowFlow, a flow-based generative model leveraging a novel non-backtracking graph neural network (NoBGNN). By enforcing a block-diagonal Jacobian structure, HollowFlow likelihoods are evaluated with a constant number of backward passes in $n$, yielding speed-ups of up to $\mathcal{O}(n^2)$: a significant step towards scaling BGs to larger systems. Crucially, our framework generalizes: any equivariant GNN or attention-based architecture can be adapted into a NoBGNN. We validate HollowFlow by training BGs on two different systems of increasing size. For both systems, the sampling and likelihood evaluation time decreases dramatically, following our theoretical scaling laws. For the larger system we obtain a $10^2\times$ speed-up, clearly illustrating the potential of HollowFlow-based approaches for high-dimensional scientific problems previously hindered by computational bottlenecks.

Comment: Efficiency/HPC: enforces block-diagonal Jacobians via non-backtracking GNNs to achieve scalable likelihood evaluation in flow models; adaptable to equivariant GNNs/attention.

Relevance: 8 Novelty: 8

16. Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds

ArXiv ID: 2510.21608

Authors: Oscar Davis, Michael S. Albergo, Nicholas M. Boffi, Michael M. Bronstein, Avishek Joey Bose

Abstract: Geometric data and purpose-built generative models on them have become ubiquitous in high-impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference -- requiring many steps of complex numerical simulation -- as they are derived from dynamical measure transport frameworks such as diffusion and flow-matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few-step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self-distillation-based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few-step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state-of-the-art sample quality for single- and few-step evaluations, and superior or competitive log-likelihoods using the implicit probability flow.

Comment: Matches Model Architecture and Efficiency: few-step generative modeling generalized to Riemannian manifolds (self-distillation-based GFMs), reducing inference steps.

Relevance: 8 Novelty: 8

17. Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds

ArXiv ID: 2510.21468

Authors: Emre Sahinoglu, Youbang Sun, Shahin Shahrampour

Abstract: This work addresses the finite-time analysis of nonsmooth nonconvex stochastic optimization under Riemannian manifold constraints. We adapt the notion of Goldstein stationarity to the Riemannian setting as a performance metric for nonsmooth optimization on manifolds. We then propose a Riemannian Online to NonConvex (RO2NC) algorithm, for which we establish the sample complexity of $O(\epsilon^{-3}\delta^{-1})$ in finding $(\delta,\epsilon)$-stationary points. This result is the first-ever finite-time guarantee for fully nonsmooth, nonconvex optimization on manifolds and matches the optimal complexity in the Euclidean setting. When gradient information is unavailable, we develop a zeroth-order version of the RO2NC algorithm (ZO-RO2NC), for which we establish the same sample complexity. The numerical results support the theory and demonstrate the practical effectiveness of the algorithms.

Comment: Optimization/Training Theory: finite-time guarantees for nonsmooth nonconvex stochastic optimization on Riemannian manifolds, including a zeroth-order variant.

Relevance: 8 Novelty: 8
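
For readers unfamiliar with the stationarity notion named in the abstract, the standard Euclidean definition of Goldstein $(\delta,\epsilon)$-stationarity for a Lipschitz function $f$ is recalled below. This is the usual Euclidean form only; the Riemannian adaptation is the paper's contribution and is not reproduced here.

```latex
% Goldstein \delta-subdifferential: convex hull of Clarke subgradients in a \delta-ball
\partial_\delta f(x) \;=\; \operatorname{conv}\Big( \bigcup_{y \,:\, \|y - x\| \le \delta} \partial f(y) \Big)

% x is (\delta,\epsilon)-stationary if some element of this set has norm at most \epsilon
\min_{g \in \partial_\delta f(x)} \|g\| \;\le\; \epsilon
```

The $O(\epsilon^{-3}\delta^{-1})$ sample complexity quoted above is stated in terms of these two tolerances.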

18. Correlation Dimension of Auto-Regressive Large Language Models

ArXiv ID: 2510.21258

Authors: Xin Du, Kumiko Tanaka-Ishii

Abstract: Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors -- such as repetition and incoherence -- even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. The method is computationally efficient, robust to model quantization (down to 4-bit precision), broadly applicable across autoregressive architectures (e.g., Transformer and Mamba), and provides fresh insight into the generative dynamics of LLMs.

Comment: Representation Learning: introduces correlation-dimension metric to quantify long-range structural complexity and generative dynamics in autoregressive LLMs.