Personalized Daily ArXiv Papers 2025-10-03

[gpt-5]	Prompt	Completion	Total
Token	63152	63906	127058
Cost	$0.08	$0.64	$0.72

Total arXiv papers: 736

Total scanned papers: 466

Total relevant papers: 41

Table of contents with paper titles:

Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study Authors: Lena Podina, Christina Humer, Alexandre Duval, Victor Schmidt, Ali Ramlaoui, Shahana Chatterjee, Yoshua Bengio, Alex Hernandez-Garcia, David Rolnick, Félix Therrien
Support Basis: Fast Attention Beyond Bounded Entries Authors: Maryam Aliakbarpour, Vladimir Braverman, Junze Yin, Haochen Zhang
The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM Authors: Kwanhee Lee, Hyeondo Jang, Dongyeop Lee, Dan Alistarh, Namhoon Lee
Self-Supervised Representation Learning as Mutual Information Maximization Authors: Akhlaqur Rahman Sabby, Yi Sui, Tongzi Wu, Jesse C. Cresswell, Ga Wu
Budgeted Broadcast: An Activity-Dependent Pruning Rule for Neural Network Efficiency Authors: Yaron Meirovitch, Fuming Yang, Jeff Lichtman, Nir Shavit
RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models Authors: Zukang Xu, Xing Hu, Qiang Wu, Dawei Yang
Randomized Gradient Subspaces for Efficient Large Language Model Training Authors: Sahar Rajabi, Nayeema Nonta, Samanvay Vajpayee, Sirisha Rambhatla
CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning Authors: Ryan Y. Lin, Siddhartha Ojha, Nicholas Bai
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models Authors: Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression Authors: Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang
Drop-Muon: Update Less, Converge Faster Authors: Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik
Posterior Collapse as a Phase Transition in Variational Autoencoders Authors: Zhen Li, Fan Zhang, Zheng Zhang, Yu Chen
Low Rank Gradients and Where to Find Them Authors: Rishi Sonthalia, Michael Murray, Guido Montúfar
Rethinking the shape convention of an MLP Authors: Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu
Precise Dynamics of Diagonal Linear Networks: A Unifying Analysis by Dynamical Mean-Field Theory Authors: Sota Nishiyama, Masaaki Imaizumi
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold Authors: Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu
Flock: A Knowledge Graph Foundation Model via Learning on Random Walks Authors: Jinwoo Kim, Xingyue Huang, Krzysztof Olejniczak, Kyungbin Min, Michael Bronstein, Seunghoon Hong, İsmail İlkan Ceylan
Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models Authors: Runqian Wang, Yilun Du
HiSpec: Hierarchical Speculative Decoding for LLMs Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das
KaVa: Latent Reasoning via Compressed KV-Cache Distillation Authors: Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi
Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning Authors: Jinshu Huang, Haibin Su, Xue-Cheng Tai, Chunlin Wu
How Do Language Models Compose Functions? Authors: Apoorv Khandelwal, Ellie Pavlick
Transformers Discover Molecular Structure Without Graph Priors Authors: Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, Aditi S. Krishnapriyan
Constrained Adaptive Rejection Sampling Authors: Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D'Antoni
Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting Authors: Jiping Li, Rishi Sonthalia
Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed Authors: Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Liu, Sanmi Koyejo
Uniform-in-time convergence bounds for Persistent Contrastive Divergence Algorithms Authors: Paul Felix Valsecchi Oliva, O. Deniz Akyildiz, Andrew Duncan
To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking Authors: Hannah Lawrence, Elyssa Hofgard, Vasco Portilheiro, Yuxuan Chen, Tess Smidt, Robin Walters
Robust Tangent Space Estimation via Laplacian Eigenvector Gradient Orthogonalization Authors: Dhruv Kohli, Sawyer J. Robertson, Gal Mishne, Alexander Cloninger
DeMuon: A Decentralized Muon for Matrix Optimization over Graphs Authors: Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson
Learning Time-Series Representations by Hierarchical Uniformity-Tolerance Latent Balancing Authors: Amin Jalali, Milad Soltany, Michael Greenspan, Ali Etemad
Flatness-Aware Stochastic Gradient Langevin Dynamics Authors: Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim
Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead Authors: Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, Newsha Ardalani
Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value Authors: Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu
Quantum-inspired Benchmark for Estimating Intrinsic Dimension Authors: Aritra Das, Joseph T. Iosue, Victor V. Albert
PENEX: AdaBoost-Inspired Neural Network Regularization Authors: Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach
xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity Authors: Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter
Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport Authors: Shaan Shah, Meenakshi Khosla
Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories Authors: Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks Authors: Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim
Learning Model Representations Using Publicly Available Model Hubs Authors: Damian Falk, Konstantin Schürholt, Konstantinos Tzevelekakis, Léo Meynent, Damian Borth

1. Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study

ArXiv ID: 2510.02142

Authors: Lena Podina, Christina Humer, Alexandre Duval, Victor Schmidt, Ali Ramlaoui, Shahana Chatterjee, Yoshua Bengio, Alex Hernandez-Garcia, David Rolnick, Félix Therrien

Abstract: Efficient and inexpensive energy storage is essential for accelerating the adoption of renewable energy and ensuring a stable supply, despite fluctuations in sources such as wind and solar. Electrocatalysts play a key role in hydrogen energy storage (HES), allowing the energy to be stored as hydrogen. However, the development of affordable and high-performance catalysts for this process remains a significant challenge. We introduce Catalyst GFlowNet, a generative model that leverages machine learning-based predictors of formation and adsorption energy to design crystal surfaces that act as efficient catalysts. We demonstrate the performance of the model through a proof-of-concept application to the hydrogen evolution reaction, a key reaction in HES, for which we successfully identified platinum as the most efficient known catalyst. In future work, we aim to extend this approach to the oxygen evolution reaction, where current optimal catalysts are expensive metal oxides, and open the search space to discover new materials. This generative modeling framework offers a promising pathway for accelerating the search for novel and efficient catalysts.

Comment: Author match

2. Support Basis: Fast Attention Beyond Bounded Entries

ArXiv ID: 2510.01643

Authors: Maryam Aliakbarpour, Vladimir Braverman, Junze Yin, Haochen Zhang

Abstract: The quadratic complexity of softmax attention remains a central bottleneck in scaling large language models (LLMs). [Alman and Song, NeurIPS 2023] proposed a sub-quadratic attention approximation algorithm, but it works only under the restrictive bounded-entry assumption. Since this assumption rarely holds in practice, its applicability to modern LLMs is limited. In this paper, we introduce support-basis decomposition, a new framework for efficient attention approximation beyond bounded entries. We empirically demonstrate that the entries of the query and key matrices exhibit sub-Gaussian behavior. Our approach uses this property to split large and small entries, enabling exact computation on sparse components and polynomial approximation on dense components. We establish rigorous theoretical guarantees, proving a sub-quadratic runtime, and extend the method to a multi-threshold setting that eliminates all distributional assumptions. Furthermore, we provide the first theoretical justification for the empirical success of polynomial attention [Kacham, Mirrokni, and Zhong, ICML 2024], showing that softmax attention can be closely approximated by a combination of multiple polynomial attentions with sketching.

Comment: Efficient attention approximation with sub-quadratic runtime beyond bounded entries; rigorous guarantees and justification of polynomial attention.

Relevance: 10 Novelty: 9

3. The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM

ArXiv ID: 2510.01650

Authors: Kwanhee Lee, Hyeondo Jang, Dongyeop Lee, Dan Alistarh, Namhoon Lee

Abstract: Neural network pruning is a promising technique to mitigate the excessive computational and memory requirements of large language models (LLMs). Despite its promise, however, progress in this area has diminished, as conventional methods are seemingly unable to surpass moderate sparsity levels (50-60%) without severely degrading model accuracy. This work breaks through the current impasse, presenting a principled and effective method called $\texttt{Elsa}$, which achieves extreme sparsity levels of up to 90% while retaining high model fidelity. This is done by identifying several limitations in current practice, all of which can be traced back to their reliance on a surrogate objective formulation. $\texttt{Elsa}$ tackles this issue directly and effectively via standard and well-established constrained optimization techniques based on ADMM. Our extensive experiments across a wide range of models and scales show that $\texttt{Elsa}$ achieves substantial improvements over existing methods; e.g., it achieves 7.8$\times$ less perplexity than the best existing method on LLaMA-2-7B at 90% sparsity. Furthermore, we present $\texttt{Elsa}_{\text{-L}}$, a quantized variant that scales to extremely large models (27B), and establish its theoretical convergence guarantees. These results highlight meaningful progress in advancing the frontier of LLM sparsity, while promising that significant opportunities for further advancement may remain in directions that have so far attracted limited exploration.

Comment: Model Compression and Efficiency — extreme sparsity/pruning for LLMs via surrogate-free ADMM; includes quantized variant and convergence guarantees.

Relevance: 10 Novelty: 9
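
The abstract for paper 3 names ADMM as the optimization tool but gives no algorithmic detail. The following is a minimal sketch of the generic ADMM splitting for a sparsity-constrained problem, not the Elsa method itself. The function names, the toy loss supplied through `grad_loss`, and the values of `rho`, `lr`, and the step counts are all assumptions made for illustration.

```python
def project_top_k(values, k):
    """Euclidean projection onto vectors with at most k nonzeros:
    keep the k largest-magnitude entries, zero the rest."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def admm_sparsify(grad_loss, init, k, rho=1.0, lr=0.1, outer_steps=50, inner_steps=10):
    """Scaled-form ADMM for: minimize loss(x) subject to x having at most k nonzeros.

    grad_loss is a hypothetical callable returning the gradient of the loss at x.
    """
    x = list(init)
    z = project_top_k(x, k)      # sparse copy of the weights
    u = [0.0] * len(x)           # scaled dual variable
    for _ in range(outer_steps):
        # x-update: a few gradient steps on loss(x) + (rho / 2) * ||x - z + u||^2
        for _ in range(inner_steps):
            g = grad_loss(x)
            x = [xi - lr * (gi + rho * (xi - zi + ui))
                 for xi, gi, zi, ui in zip(x, g, z, u)]
        # z-update: project onto the sparsity constraint
        z = project_top_k([xi + ui for xi, ui in zip(x, u)], k)
        # dual update: accumulate the disagreement between x and z
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z
```

Because the sparsity constraint is non-convex, this generic scheme is a heuristic in general; the convergence guarantees mentioned in the abstract belong to the paper's own formulation.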

4. Self-Supervised Representation Learning as Mutual Information Maximization

ArXiv ID: 2510.01345

Authors: Akhlaqur Rahman Sabby, Yi Sui, Tongzi Wu, Jesse C. Cresswell, Ga Wu

Abstract: Self-supervised representation learning (SSRL) has demonstrated remarkable empirical success, yet its underlying principles remain insufficiently understood. While recent works attempt to unify SSRL methods by examining their information-theoretic objectives or summarizing their heuristics for preventing representation collapse, architectural elements like the predictor network, stop-gradient operation, and statistical regularizer are often viewed as empirically motivated additions. In this paper, we adopt a first-principles approach and investigate whether the learning objective of an SSRL algorithm dictates its possible optimization strategies and model design choices. In particular, by starting from a variational mutual information (MI) lower bound, we derive two training paradigms, namely Self-Distillation MI (SDMI) and Joint MI (JMI), each imposing distinct structural constraints and covering a set of existing SSRL algorithms. SDMI inherently requires alternating optimization, making stop-gradient operations theoretically essential. In contrast, JMI admits joint optimization through symmetric architectures without such components. Under the proposed formulation, predictor networks in SDMI and statistical regularizers in JMI emerge as tractable surrogates for the MI objective. We show that many existing SSRL methods are specific instances or approximations of these two paradigms. This paper provides a theoretical explanation behind the choices of different architectural components of existing SSRL methods, beyond heuristic conveniences.

Comment: Theoretical unification of self-supervised representation learning via MI; explains stop-gradient and predictor networks from first principles.

Relevance: 10 Novelty: 8

5. Budgeted Broadcast: An Activity-Dependent Pruning Rule for Neural Network Efficiency

ArXiv ID: 2510.01263

Authors: Yaron Meirovitch, Fuming Yang, Jeff Lichtman, Nir Shavit

Abstract: Most pruning methods remove parameters ranked by impact on loss (e.g., magnitude or gradient). We propose Budgeted Broadcast (BB), which gives each unit a local traffic budget (the product of its long-term on-rate $a_i$ and fan-out $k_i$). A constrained-entropy analysis shows that maximizing coding entropy under a global traffic budget yields a selectivity-audience balance, $\log\frac{1-a_i}{a_i}=\beta k_i$. BB enforces this balance with simple local actuators that prune either fan-in (to lower activity) or fan-out (to reduce broadcast). In practice, BB increases coding entropy and decorrelation and improves accuracy at matched sparsity across Transformers for ASR, ResNets for face identification, and 3D U-Nets for synapse prediction, sometimes exceeding dense baselines. On electron microscopy images, it attains state-of-the-art F1 and PR-AUC under our evaluation protocol. BB is easy to integrate and suggests a path toward learning more diverse and efficient representations.

Comment: Matches Model Compression and Efficiency: introduces an activity-dependent pruning rule with constrained-entropy analysis to balance fan-in/fan-out (sparsity/pruning) for efficiency.

Relevance: 10 Novelty: 8
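
The balance condition quoted in the abstract for paper 5, $\log\frac{1-a_i}{a_i}=\beta k_i$, can be solved for the on-rate in closed form: $a_i = 1/(1+e^{\beta k_i})$. The sketch below only evaluates that relation. The function names and the example value of `beta` are assumptions, and the pruning actuators described in the abstract are not implemented.

```python
import math


def target_on_rate(fan_out, beta):
    """On-rate a that satisfies log((1 - a) / a) = beta * fan_out."""
    return 1.0 / (1.0 + math.exp(beta * fan_out))


def balance_residual(on_rate, fan_out, beta):
    """Left side minus right side of the balance condition.

    Zero means the unit sits on the balance. A negative value means the unit
    is more active than its fan-out allows under this beta.
    """
    return math.log((1.0 - on_rate) / on_rate) - beta * fan_out


# Units that broadcast to more targets are expected to fire less often.
for k in (0, 2, 8):
    print(k, round(target_on_rate(k, beta=0.5), 4))
```

A negative residual can be brought back toward zero either by lowering the unit's activity or by lowering its fan-out, which matches the two actuators the abstract describes.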

6. RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models

ArXiv ID: 2510.01240

Authors: Zukang Xu, Xing Hu, Qiang Wu, Dawei Yang

Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their exponentially increasing parameters pose significant challenges for deployment on resource-constrained devices. Vector Quantization (VQ) shows great promise for low-bit quantization (e.g., 2 to 4 bits), but existing work faces two key challenges: unconstrained direction error and suboptimal bit allocation. In this paper, we propose RSAVQ, a novel VQ framework to enhance extremely low-bit quantization for LLMs. RSAVQ introduces two geometry-driven innovations that effectively mitigate the above limitations: (1) Error Direction Sensitivity Guidance (EDSG), which leverages the Fisher Information Matrix (FIM)-induced Riemannian metric to project quantization errors onto low-sensitivity directions in the parameter space. Specifically, this projection is performed along the negative natural gradient direction, which effectively suppresses error expansion. (2) Weight Channel Sensitivity Guidance (WCSG), which constructs a channel-wise sensitivity metric via FIM curvature analysis to dynamically guide bit resource allocation. The approach facilitates a globally optimal quantization solution within prescribed bit constraints. Experiments demonstrate that RSAVQ outperforms existing methods for LLMs. For example, in 2-bit quantization of LLaMA-3 8B, RSAVQ leads baselines like VPTQ and QuIP# by 0.4 in perplexity (PPL) and 1.5 in zero-shot accuracy. This work offers a practical solution for constrained environments and a theoretical bridge between information geometry and the quantization of neural networks, advancing efficient deep learning.

Comment: Matches Model Compression and Efficiency: low-bit vector quantization for LLMs using Fisher-information (Riemannian) sensitivity guidance and channel-wise bit allocation.

Relevance: 10 Novelty: 8

7. Randomized Gradient Subspaces for Efficient Large Language Model Training

ArXiv ID: 2510.01878

Authors: Sahar Rajabi, Nayeema Nonta, Samanvay Vajpayee, Sirisha Rambhatla

Abstract: Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent work mitigates this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of gradient space and its underlying subspaces. We find that while a small subspace captures most gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit this subspace structure and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.

Comment: High Performance Computing/Efficiency: randomized gradient subspace methods (GrassWalk/GrassJump) reduce optimizer memory for LLM pretraining by leveraging near-flat curvature.

Relevance: 10 Novelty: 8
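
The abstract for paper 7 talks about projecting gradients into a low-dimensional subspace and about how much gradient energy that subspace captures. As a rough illustration of those two quantities only, the sketch below draws a random orthonormal basis and measures the captured energy of a gradient vector. It does not implement GrassWalk or GrassJump; the function names and the use of a flat vector instead of a weight matrix are assumptions.

```python
import random


def random_orthonormal_basis(dim, rank, seed=0):
    """Orthonormalize Gaussian vectors with Gram-Schmidt.

    Returns `rank` mutually orthogonal unit vectors of length `dim`.
    """
    rng = random.Random(seed)
    basis = []
    while len(basis) < rank:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-12:  # skip the (very unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis


def project(grad, basis):
    """Coordinates of grad in the subspace: `rank` numbers instead of `dim`."""
    return [sum(g * b for g, b in zip(grad, vec)) for vec in basis]


def captured_energy(grad, basis):
    """Fraction of the squared gradient norm that lies inside the subspace."""
    coords = project(grad, basis)
    return sum(c * c for c in coords) / sum(g * g for g in grad)
```

An optimizer that keeps its state only for the projected coordinates stores `rank` values per gradient instead of `dim`; whatever falls outside the subspace is the "residual bulk" the abstract refers to.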

8. CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning

ArXiv ID: 2510.01634

Authors: Ryan Y. Lin, Siddhartha Ojha, Nicholas Bai

Abstract: Transformers achieve strong performance across diverse domains but implicitly assume Euclidean geometry in their attention mechanisms, limiting their effectiveness on data with non-Euclidean structure. While recent extensions to hyperbolic and spherical spaces show promise for hierarchical and cyclical patterns, respectively, they require committing to a single geometry a priori, reducing flexibility when data exhibits mixed geometric properties. We introduce the Curvature-Adaptive Transformer (CAT), a novel architecture that dynamically learns per-token routing across three geometric attention branches through a lightweight, differentiable gating mechanism. Unlike fixed-geometry approaches, CAT enables adaptive geometric specialization, routing tokens to the appropriate curvature based on their local relational structure. The routing network provides interpretable curvature preferences while each branch employs geometry-specific operations optimized for its respective manifold. On knowledge graph completion benchmarks (FB15k-237, WN18RR), CAT achieves approximately 10% improvements in MRR and Hits@10 over fixed-geometry baselines with minimal overhead (5% parameter increase, comparable inference time). These results demonstrate that learned geometric adaptation outperforms any single fixed geometry for complex relational reasoning, establishing CAT as a scalable and interpretable foundation for mixture-of-geometry architectures across language, vision, and multimodal domains.

Comment: Model Architecture: conditional routing across geometry-specific attention branches (mixture-of-geometry/MoE-like) enabling curvature-adaptive Transformers.

Relevance: 10 Novelty: 8

9. ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

ArXiv ID: 2510.01290

Authors: Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna

Abstract: The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.

Comment: Compression/Efficiency/HPC: thought-adaptive KV-cache compression with hybrid quantization–eviction and a PagedAttention-extended kernel for memory reuse.

Relevance: 10 Novelty: 8

10. Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression

ArXiv ID: 2510.01450

Authors: Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang

Abstract: Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2 d)$ and $\Theta(n d^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experiment results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at https://github.com/Yifei-Zuo/Flash-LLA.

Comment: Model Architecture: proposes a new attention mechanism (Local Linear Attention) as an alternative to Softmax/linear attention in Transformers; High-Performance Computing/Efficiency: introduces memory-efficient primitives and a hardware-efficient blockwise algorithm (FlashLLA) with custom kernels to reduce O(n^2 d) and O(n d^2) costs.

Relevance: 10 Novelty: 8
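
For paper 10, the nonparametric-statistics connection can be illustrated with textbook estimators. Softmax attention is a kernel-weighted average of values (a Nadaraya-Watson, locally constant fit), while local linear regression fits a weighted line around the query instead. The sketch below contrasts the two for scalar keys and values. It is an assumption that this captures the spirit of LLA; the paper's actual formulation, any regularization, and the FlashLLA kernels are not reproduced here, and the function names are hypothetical.

```python
import math


def softmax_read(query, keys, values):
    """Softmax attention for scalar keys: a kernel-weighted average of values."""
    weights = [math.exp(query * k) for k in keys]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def local_linear_read(query, keys, values):
    """Weighted least-squares fit of v ~ a + b * (k - query) using the same
    exponential weights; the prediction at the query is the intercept a."""
    weights = [math.exp(query * k) for k in keys]
    centered = [k - query for k in keys]
    s0 = sum(weights)
    s1 = sum(w * c for w, c in zip(weights, centered))
    s2 = sum(w * c * c for w, c in zip(weights, centered))
    t0 = sum(w * v for w, v in zip(weights, values))
    t1 = sum(w * c * v for w, c, v in zip(weights, centered, values))
    # Solve the 2x2 normal equations for the intercept (needs >= 2 distinct keys).
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


keys = [-1.0, 0.0, 0.5, 2.0]
values = [2.0 * k + 1.0 for k in keys]  # values are an exact linear function of keys
print(softmax_read(0.3, keys, values), local_linear_read(0.3, keys, values))
```

When the values are an exact linear function of the keys, the local linear read recovers the target at the query up to rounding, whereas the weighted average is pulled toward the heavily weighted keys. This is the kind of bias difference a bias-variance analysis would quantify.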

11. Drop-Muon: Update Less, Converge Faster

ArXiv ID: 2510.02239

Authors: Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik

Abstract: Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce a non-Euclidean Randomized Progressive Training method, Drop-Muon, a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise $(L^0, L^1)$-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to $1.4\times$ faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.

Comment: Training efficiency criterion: randomized progressive layer updates with non-Euclidean optimization and convergence theory, reducing update cost.

Relevance: 9 Novelty: 9

12. Posterior Collapse as a Phase Transition in Variational Autoencoders

ArXiv ID: 2510.01621

Authors: Zhen Li, Fan Zhang, Zheng Zhang, Yu Chen

Abstract: We investigate the phenomenon of posterior collapse in variational autoencoders (VAEs) from the perspective of statistical physics, and reveal that it constitutes a phase transition governed jointly by data structure and model hyper-parameters. By analyzing the stability of the trivial solution associated with posterior collapse, we identify a critical hyper-parameter threshold. This critical boundary, separating meaningful latent inference from collapse, is characterized by a discontinuity in the KL divergence between the approximate posterior and the prior distribution. We validate this critical behavior on both synthetic and real-world datasets, confirming the existence of a phase transition. Our results demonstrate that posterior collapse is not merely an optimization failure, but rather an emerging phase transition arising from the interplay between data structure and variational constraints. This perspective offers new insights into the trainability and representational capacity of deep generative models.

Comment: Representation Learning: theoretical analysis of VAEs’ training dynamics, framing posterior collapse as a phase transition with a critical boundary.

Relevance: 9 Novelty: 8
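
The quantity that the abstract for paper 12 tracks is the KL divergence between the approximate posterior and the prior. The abstract does not state the model family, so the sketch below assumes the common setup of a diagonal Gaussian posterior and a standard normal prior, for which the KL term has a closed form and is exactly zero at the collapsed ("trivial") solution. The function name is hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Per dimension: 0.5 * (mean^2 + var - log(var) - 1).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, log_var))


# Collapsed posterior: every latent dimension matches the prior, so KL is zero.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Informative posterior: a shifted mean makes the KL term strictly positive.
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

Monitoring this value per latent dimension while sweeping a hyper-parameter is one simple way to look for the kind of abrupt change the abstract describes.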

13. Low Rank Gradients and Where to Find Them

ArXiv ID: 2510.01303

Authors: Rishi Sonthalia, Michael Murray, Guido Montúfar

Abstract: This paper investigates low-rank structure in the gradients of the training loss for two-layer neural networks while relaxing the usual isotropy assumptions on the training data and parameters. We consider a spiked data model in which the bulk can be anisotropic and ill-conditioned, we do not require independent data and weight matrices, and we analyze both the mean-field and neural-tangent-kernel scalings. We show that the gradient with respect to the input weights is approximately low rank and is dominated by two rank-one terms: one aligned with the bulk data-residue, and another aligned with the rank-one spike in the input data. We characterize how properties of the training data, the scaling regime and the activation function govern the balance between these two components. Additionally, we demonstrate that standard regularizers, such as weight decay, input noise and Jacobian penalties, selectively modulate these components. Experiments on synthetic and real data corroborate our theoretical predictions.

Comment: Compression/Efficiency: identifies approximate low-rank structure in gradients; Representation Learning/Training Dynamics: links data/activation/regularizers to gradient rank components.

Relevance: 9 Novelty: 8

14. Rethinking the shape convention of an MLP

ArXiv ID: 2510.01796

Authors: Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu

Abstract: Multi-layer perceptrons (MLPs) conventionally follow a narrow-wide-narrow design where skip connections operate at the input/output dimensions while processing occurs in expanded hidden spaces. We challenge this convention by proposing wide-narrow-wide (Hourglass) MLP blocks where skip connections operate at expanded dimensions while residual computation flows through narrow bottlenecks. This inversion leverages higher-dimensional spaces for incremental refinement while maintaining computational efficiency through parameter-matched designs. Implementing Hourglass MLPs requires an initial projection to lift input signals to expanded dimensions. We propose that this projection can remain fixed at random initialization throughout training, enabling efficient training and inference implementations. We evaluate both architectures on generative tasks over popular image datasets, characterizing performance-parameter Pareto frontiers through systematic architectural search. Results show that Hourglass architectures consistently achieve superior Pareto frontiers compared to conventional designs. As parameter budgets increase, optimal Hourglass configurations favor deeper networks with wider skip connections and narrower bottlenecks, a scaling pattern distinct from conventional MLPs. Our findings suggest reconsidering skip connection placement in modern architectures, with potential applications extending to Transformers and other residual networks.

Comment: Model Architecture: rethinks MLP shape/skip placement with hourglass blocks and fixed random expansion; provides scaling insights applicable to residual networks/Transformers.

Relevance: 9 Novelty: 8
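
The block shape described in the abstract for paper 14 (skip connection at the wide dimension, residual computation through a narrow bottleneck, preceded by a fixed random lifting projection) can be sketched in a few lines. This is a forward-pass illustration only. The ReLU activation, the absence of normalization and of an output head, the initialization scales, and all function names are assumptions rather than details taken from the paper.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def relu(vec):
    return [max(0.0, x) for x in vec]


def hourglass_block(h, down, up):
    """Wide -> narrow -> wide residual block; the skip carries the wide vector h."""
    bottleneck = relu(matvec(down, h))   # narrow: len(down) units
    update = matvec(up, bottleneck)      # back to the wide dimension
    return [a + b for a, b in zip(h, update)]


def forward(x, lift, blocks):
    """Lift the input once with a fixed random projection, then apply the blocks.

    `lift` would be excluded from training; only the (down, up) pairs are learned.
    """
    h = matvec(lift, x)
    for down, up in blocks:
        h = hourglass_block(h, down, up)
    return h


rng = random.Random(0)
input_dim, wide, narrow = 2, 6, 3
lift = random_matrix(wide, input_dim, rng, 1.0)
blocks = [(random_matrix(narrow, wide, rng, 0.1), random_matrix(wide, narrow, rng, 0.1))
          for _ in range(2)]
print(len(forward([1.0, -1.0], lift, blocks)))
```

Each block holds `2 * wide * narrow` weights, the same count as a conventional narrow-wide-narrow block with the two dimensions swapped, which is what makes parameter-matched comparisons straightforward.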

15. Precise Dynamics of Diagonal Linear Networks: A Unifying Analysis by Dynamical Mean-Field Theory

ArXiv ID: 2510.01930

Authors: Sota Nishiyama, Masaaki Imaizumi

Abstract: Diagonal linear networks (DLNs) are a tractable model that captures several nontrivial behaviors in neural network training, such as initialization-dependent solutions and incremental learning. These phenomena are typically studied in isolation, leaving the overall dynamics insufficiently understood. In this work, we present a unified analysis of various phenomena in the gradient flow dynamics of DLNs. Using Dynamical Mean-Field Theory (DMFT), we derive a low-dimensional effective process that captures the asymptotic gradient flow dynamics in high dimensions. Analyzing this effective process yields new insights into DLN dynamics, including loss convergence rates and their trade-off with generalization, and systematically reproduces many of the previously observed phenomena. These findings deepen our understanding of DLNs and demonstrate the effectiveness of the DMFT approach in analyzing high-dimensional learning dynamics of neural networks.

Comment: Matches Representation Learning: theoretical analysis of gradient-flow dynamics in diagonal linear networks via Dynamical Mean-Field Theory.

Relevance: 9 Novelty: 8

16. StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold

ArXiv ID: 2510.01938

Authors: Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu

Abstract: Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $USV^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter's input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.

Comment: Model Compression and Efficiency: advances LoRA via U S V^T factorization with Stiefel manifold constraints and Riemannian optimization for low-rank adapters.

Relevance: 9 Novelty: 8
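
The three-factor adapter in the StelLA abstract above can be illustrated with a small pure-Python sketch. This is not the authors' implementation: it assumes a diagonal $S$, uses made-up sizes and gradients, and swaps in a plain Euclidean step followed by a Gram-Schmidt (QR-style) retraction, leaving out the tangent-space projection a full Riemannian optimizer would normally apply. All function names are hypothetical.

```python
import math

def gram_schmidt(cols):
    """Orthonormalize a list of column vectors (a QR-style retraction)."""
    basis = []
    for v in cols:
        w = list(v)
        for q in basis:
            dot = sum(a * b for a, b in zip(w, q))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        basis.append([a / norm for a in w])
    return basis

def retracted_step(cols, grads, lr):
    """Euclidean gradient step, then pull the factor back onto the Stiefel manifold."""
    stepped = [[x - lr * g for x, g in zip(c, gc)] for c, gc in zip(cols, grads)]
    return gram_schmidt(stepped)

def adapter_delta(U, S, V):
    """Delta W = U diag(S) V^T; U, V are lists of columns, S is assumed diagonal."""
    rows, cols = len(U[0]), len(V[0])
    return [[sum(S[k] * U[k][i] * V[k][j] for k in range(len(S)))
             for j in range(cols)] for i in range(rows)]

# Rank-2 adapter for a 4x3 weight matrix (hypothetical sizes).
U = gram_schmidt([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
V = gram_schmidt([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
S = [0.5, 0.1]
grad_U = [[0.3, -0.2, 0.5, 0.1], [0.4, 0.1, -0.3, 0.2]]  # made-up gradient
U = retracted_step(U, grad_U, lr=0.1)
delta_W = adapter_delta(U, S, V)
```

After the step, the columns of `U` are orthonormal again, so the subspace (`U`, `V`) and the scale (`S`) stay separated as the abstract describes.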

17. Flock: A Knowledge Graph Foundation Model via Learning on Random Walks

ArXiv ID: 2510.01510

Authors: Jinwoo Kim, Xingyue Huang, Krzysztof Olejniczak, Kyungbin Min, Michael Bronstein, Seunghoon Hong, İsmail İlkan Ceylan

Abstract: We study the problem of zero-shot link prediction on knowledge graphs (KGs), which requires models to generalize over novel entities and novel relations. Knowledge graph foundation models (KGFMs) address this task by enforcing equivariance over both nodes and relations, learning from structural properties of nodes and relations, which are then transferable to novel graphs with similar structural properties. However, the conventional notion of deterministic equivariance imposes inherent limits on the expressive power of KGFMs, preventing them from distinguishing structurally similar but semantically distinct relations. To overcome this limitation, we introduce probabilistic node-relation equivariance, which preserves equivariance in distribution while incorporating a principled randomization to break symmetries during inference. Building on this principle, we present Flock, a KGFM that iteratively samples random walks, encodes them into sequences via a recording protocol, embeds them with a sequence model, and aggregates representations of nodes and relations via learned pooling. Crucially, Flock respects probabilistic node-relation equivariance and is a universal approximator for isomorphism-invariant link-level functions over KGs. Empirically, Flock perfectly solves our new diagnostic dataset Petals where current KGFMs fail, and achieves state-of-the-art performances on entity- and relation prediction tasks on 54 KGs from diverse domains.

Comment: Model Architecture: introduces probabilistic node–relation equivariance and random-walk sequence modeling with universality guarantees for KG link functions.

Relevance: 9 Novelty: 8

18. Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

ArXiv ID: 2510.02300

Authors: Runqian Wang, Yilun Du

Abstract: We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.

Comment: Model Architecture/Representation Learning: implicit energy-based model learning an equilibrium gradient with optimization-driven sampling and adaptive compute—foundational alternative to diffusion/flow.

Relevance: 9 Novelty: 8
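
The optimization-based sampling described in the EqM abstract above (gradient descent on a landscape, with adjustable step sizes and adaptive compute) can be sketched on a one-dimensional toy problem. The analytic gradient below is an assumption that stands in for a learned network; the energy, the mode locations, and the function names are all hypothetical and do not come from the paper.

```python
def toy_gradient(x, modes=(-2.0, 2.0)):
    """Gradient of the toy energy 0.5 * min_m (x - m)^2 (stand-in for a learned model)."""
    nearest = min(modes, key=lambda m: abs(x - m))
    return x - nearest

def sample_by_descent(grad_fn, x0, step_size=0.2, max_steps=200, tol=1e-6):
    """Run gradient descent from x0; stop early once the gradient is small."""
    x, steps = x0, 0
    while steps < max_steps:
        g = grad_fn(x)
        if abs(g) < tol:   # adaptive compute: easy samples finish sooner
            break
        x -= step_size * g
        steps += 1
    return x, steps

# Same starting point, two step sizes chosen at inference time.
x_slow, steps_slow = sample_by_descent(toy_gradient, x0=0.5, step_size=0.2)
x_fast, steps_fast = sample_by_descent(toy_gradient, x0=0.5, step_size=0.8)
```

Both runs end at the same minimum, but the larger step size reaches it in fewer iterations, which is the kind of inference-time knob the abstract refers to.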

19. HiSpec: Hierarchical Speculative Decoding for LLMs

ArXiv ID: 2510.01336

Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das

Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.

Comment: Compression/Efficiency/HPC: hierarchical speculative decoding using early-exit intermediate verification with KV-cache/hidden-state reuse for high-throughput inference.

Relevance: 9 Novelty: 8
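
The draft, intermediate-verify, target-verify pipeline in the HiSpec abstract above can be sketched with toy deterministic "models". This is a simplification under stated assumptions: the three models are hypothetical functions from a token list to the next token, acceptance is greedy, and the target checks every block rather than validating periodically. KV-cache and hidden-state reuse are not modeled.

```python
def target(seq):
    """Toy 'large' model: deterministic next token (hypothetical)."""
    return (3 * sum(seq) + len(seq)) % 7

def intermediate(seq):
    """Toy early-exit verifier: disagrees with the target at every 6th position."""
    return target(seq) if len(seq) % 6 else (target(seq) + 1) % 7

def draft(seq):
    """Toy draft model: disagrees with the target at every 3rd position."""
    return target(seq) if len(seq) % 3 else (target(seq) + 1) % 7

def greedy_accept(model, prefix, proposal):
    """Keep the longest run of proposed tokens that `model` would also pick."""
    accepted = []
    for tok in proposal:
        if model(prefix + accepted) != tok:
            break
        accepted.append(tok)
    return accepted

def speculative_decode(draft, intermediate, target, prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        proposal = []
        for _ in range(k):                      # 1) draft speculates k tokens
            proposal.append(draft(out + proposal))
        proposal = greedy_accept(intermediate, out, proposal)  # 2) cheap early filter
        out += greedy_accept(target, out, proposal)            # 3) target verification
        out.append(target(out))                 # target always contributes one token
    return out[len(prompt):len(prompt) + n_tokens]

def greedy_decode(model, prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(model(out))
    return out[len(prompt):]

tokens = speculative_decode(draft, intermediate, target, [1, 2], n_tokens=20)
```

Because every emitted token is confirmed by `target`, the output matches plain greedy decoding with the target model; the intermediate verifier only changes how many draft tokens reach the expensive check.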

20. KaVa: Latent Reasoning via Compressed KV-Cache Distillation

ArXiv ID: 2510.02312

Authors: Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi

Abstract: Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work, we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.

Comment: Model Compression and Efficiency: compressed KV-cache distillation to supervise latent reasoning, leveraging cache-aware signals for efficient inference and memory savings.

Relevance: 9 Novelty: 8

21. Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning

ArXiv ID: 2510.02049

Authors: Jinshu Huang, Haibin Su, Xue-Cheng Tai, Chunlin Wu

Abstract: In deep learning, dense layer connectivity has become a key design principle in deep neural networks (DNNs), enabling efficient information flow and strong performance across a range of applications. In this work, we model densely connected DNNs mathematically and analyze their learning problems in the deep-layer limit. For broad applicability, we present our analysis in a framework setting of DNNs with densely connected layers and general non-local feature transformations (with local feature transformations as special cases) within layers, which is called the dense non-local (DNL) framework and includes standard DenseNets and variants as special examples. In this formulation, the densely connected networks are modeled as nonlinear integral equations, in contrast to the ordinary differential equation viewpoint commonly adopted in prior works. We study the associated training problems from an optimal control perspective and prove convergence results from the network learning problem to its continuous-time counterpart. In particular, we show the convergence of optimal values and the subsequence convergence of minimizers, using a piecewise linear extension and $\Gamma$-convergence analysis. Our results provide a mathematical foundation for understanding densely connected DNNs and further suggest that such architectures can offer stability of training deep models.

Comment: Model Architecture: theoretical modeling of densely connected networks (DenseNet-style) via nonlinear integral equations with convergence (Γ-convergence) results for training.

Relevance: 9 Novelty: 8

22. How Do Language Models Compose Functions?

ArXiv ID: 2510.01685

Authors: Apoorv Khandelwal, Ellie Pavlick

Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap": i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the direct ("idiomatic") mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions .

Comment: Representation Learning: mechanistic analysis of compositionality in LLMs via logit-lens, identifying processing pathways and linking them to embedding space geometry.

Relevance: 9 Novelty: 7

23. Transformers Discover Molecular Structure Without Graph Priors

ArXiv ID: 2510.02259

Authors: Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, Aditi S. Krishnapriyan

Abstract: Graph Neural Networks (GNNs) are the dominant architecture for molecular machine learning, particularly for molecular property prediction and machine learning interatomic potentials (MLIPs). GNNs perform message passing on predefined graphs often induced by a fixed radius cutoff or k-nearest neighbor scheme. While this design aligns with the locality present in many molecular tasks, a hard-coded graph can limit expressivity due to the fixed receptive field and slows down inference with sparse graph operations. In this work, we investigate whether pure, unmodified Transformers trained directly on Cartesian coordinates, without predefined graphs or physical priors, can approximate molecular energies and forces. As a starting point for our analysis, we demonstrate how to train a Transformer to competitive energy and force mean absolute errors under a matched training compute budget, relative to a state-of-the-art equivariant GNN on the OMol25 dataset. We discover that the Transformer learns physically consistent patterns, such as attention weights that decay inversely with interatomic distance, and flexibly adapts them across different molecular environments due to the absence of hard-coded biases. The use of a standard Transformer also unlocks predictable improvements with respect to scaling training resources, consistent with empirical scaling laws observed in other domains. Our results demonstrate that many favorable properties of GNNs can emerge adaptively in Transformers, challenging the necessity of hard-coded graph inductive biases and pointing toward standardized, scalable architectures for molecular modeling.

Comment: Model Architecture / Representation Learning: shows pure Transformers (no graph priors) learn distance-aware structure for molecular modeling, with scaling and attention analysis.

Relevance: 9 Novelty: 7

24. Constrained Adaptive Rejection Sampling

ArXiv ID: 2510.01902

Authors: Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D'Antoni

Abstract: Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding (GCD) methods enforce validity during decoding but distort the LM's distribution, while rejection sampling (RS) preserves fidelity but wastes computation by discarding invalid outputs. Both extremes are problematic in domains such as program fuzzing, where both validity and diversity of samples are essential. We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sample-efficiency of RS without distributional distortion. CARS begins with unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws. This adaptive pruning ensures that prefixes proven invalid are never revisited, acceptance rates improve monotonically, and the resulting samples exactly follow the constrained distribution. In experiments on a variety of domains (e.g., program fuzzing and molecular generation), CARS consistently achieves higher efficiency, measured in the number of LM forward passes per valid sample, while also producing stronger sample diversity than both GCD and methods that approximate the LM's distribution.

Comment: Compression/Efficiency: algorithmic innovation for constrained decoding via adaptive rejection sampling that preserves the exact distribution while improving sample efficiency.

Relevance: 8 Novelty: 8
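
The trie-and-subtract idea in the CARS abstract above can be sketched on a toy problem. The code below is an illustration under assumptions, not the authors' algorithm: the "LM" is a hypothetical fixed distribution over two tokens, sequences have fixed length 3, validity is only checked on complete strings, and the trie is flattened into a dictionary keyed by prefix.

```python
import random

LENGTH = 3

def lm_probs(prefix):
    """Toy LM (hypothetical): the same next-token distribution after every prefix."""
    return {"a": 0.7, "b": 0.3}

def is_valid(seq):
    """Toy constraint: the string must not contain 'aa'."""
    return "aa" not in seq

class AdaptiveRejectionSampler:
    def __init__(self, rng):
        self.rng = rng
        self.dead = {}       # prefix -> conditional LM mass already proven invalid
        self.rejections = 0

    def _alive(self, prefix):
        remaining = 1.0 - self.dead.get(prefix, 0.0)
        return remaining if remaining > 1e-12 else 0.0

    def _draw(self):
        prefix = ""
        while len(prefix) < LENGTH:
            probs = lm_probs(prefix)
            tokens = list(probs)
            # Subtract the mass of known-invalid continuations from each token.
            weights = [probs[t] * self._alive(prefix + t) for t in tokens]
            prefix += self.rng.choices(tokens, weights)[0]
        return prefix

    def _mark_invalid(self, seq):
        self.dead[seq] = 1.0
        for cut in range(LENGTH - 1, -1, -1):    # propagate from leaf to root
            prefix = seq[:cut]
            self.dead[prefix] = sum(p * self.dead.get(prefix + t, 0.0)
                                    for t, p in lm_probs(prefix).items())

    def sample(self):
        while self._alive("") > 0.0:
            seq = self._draw()
            if is_valid(seq):
                return seq
            self.rejections += 1
            self._mark_invalid(seq)
        raise ValueError("no valid sequence exists")

sampler = AdaptiveRejectionSampler(random.Random(0))
samples = [sampler.sample() for _ in range(2000)]
```

Each draw comes from the LM restricted to strings not yet proven invalid, so accepted samples still follow the LM's distribution conditioned on validity, and each invalid string can be rejected at most once.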

25. Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting

ArXiv ID: 2510.01414

Authors: Jiping Li, Rishi Sonthalia

Abstract: This paper analyzes the generalization error of minimum-norm interpolating solutions in linear regression using spiked covariance data models. The paper characterizes how varying spike strengths and target-spike alignments can affect risk, especially in overparameterized settings. The study presents an exact expression for the generalization error, leading to a comprehensive classification of benign, tempered, and catastrophic overfitting regimes based on spike strength, the aspect ratio $c=d/n$ (particularly as $c \to \infty$), and target alignment. Notably, in well-specified aligned problems, increasing spike strength can surprisingly induce catastrophic overfitting before achieving benign overfitting. The paper also reveals that target-spike alignment is not always advantageous, identifying specific, sometimes counterintuitive, conditions for its benefit or detriment. Alignment with the spike being detrimental is empirically demonstrated to persist in nonlinear models.

Comment: Generalization theory in overparameterized spiked regression, classifying benign vs catastrophic overfitting—training dynamics/representation theory.

Relevance: 8 Novelty: 8

26. Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

ArXiv ID: 2510.01494

Authors: Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Liu, Sanmi Koyejo

Abstract: The field of adversarial robustness has long established that adversarial examples can successfully transfer between image classifiers and that text jailbreaks can successfully transfer between language models (LMs). However, a pair of recent studies reported being unable to successfully transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction regarding the transferability of attacks against machine learning models: attacks in the input data-space can transfer, whereas attacks in model representation space do not, at least not without geometric alignment of representations. We then provide theoretical and empirical evidence of this hypothesis in four different settings. First, we mathematically prove this distinction in a simple setting where two networks compute the same input-output map but via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks, but fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that successfully transfer to new VLMs, and we show that representation space attacks can transfer when VLMs' latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but contingent on their operational domain - the shared data-space versus models' unique representation spaces - a critical insight for building more robust models.

Comment: Representation learning insight: analyzes how latent geometry vs shared data-space affects adversarial transfer with theory and experiments.

Relevance: 8 Novelty: 8

27. Uniform-in-time convergence bounds for Persistent Contrastive Divergence Algorithms

ArXiv ID: 2510.01944

Authors: Paul Felix Valsecchi Oliva, O. Deniz Akyildiz, Andrew Duncan

Abstract: We propose a continuous-time formulation of persistent contrastive divergence (PCD) for maximum likelihood estimation (MLE) of unnormalised densities. Our approach expresses PCD as a coupled, multiscale system of stochastic differential equations (SDEs), which perform optimisation of the parameter and sampling of the associated parametrised density, simultaneously. From this novel formulation, we are able to derive explicit bounds for the error between the PCD iterates and the MLE solution for the model parameter. This is made possible by deriving uniform-in-time (UiT) bounds for the difference in moments between the multiscale system and the averaged regime. An efficient implementation of the continuous-time scheme is introduced, leveraging a class of explicit, stable integrators, stochastic orthogonal Runge-Kutta Chebyshev (S-ROCK), for which we provide explicit error estimates in the long-time regime. This leads to a novel method for training energy-based models (EBMs) with explicit error guarantees.

Comment: Representation Learning/Training Dynamics: theoretical uniform-in-time convergence bounds for PCD in EBMs with an efficient continuous-time SDE formulation and stable S-ROCK integrators.

Relevance: 8 Novelty: 8

28. To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking

ArXiv ID: 2510.01349

Authors: Hannah Lawrence, Elyssa Hofgard, Vasco Portilheiro, Yuxuan Chen, Tess Smidt, Robin Walters

Abstract: Symmetry-aware methods for machine learning, such as data augmentation and equivariant architectures, encourage correct model behavior on all transformations (e.g. rotations or permutations) of the original dataset. These methods can improve generalization and sample efficiency, under the assumption that the transformed datapoints are highly probable, or "important", under the test distribution. In this work, we develop a method for critically evaluating this assumption. In particular, we propose a metric to quantify the amount of anisotropy, or symmetry-breaking, in a dataset, via a two-sample neural classifier test that distinguishes between the original dataset and its randomly augmented equivalent. We validate our metric on synthetic datasets, and then use it to uncover surprisingly high degrees of alignment in several benchmark point cloud datasets. We show theoretically that distributional symmetry-breaking can actually prevent invariant methods from performing optimally even when the underlying labels are truly invariant, as we show for invariant ridge regression in the infinite feature limit. Empirically, we find that the implication for symmetry-aware methods is dataset-dependent: equivariant methods still impart benefits on some anisotropic datasets, but not others. Overall, these findings suggest that understanding equivariance -- both when it works, and why -- may require rethinking symmetry biases in the data.

Comment: Representation Learning: proposes a metric for distributional symmetry-breaking and theory showing when equivariant methods can underperform.

Relevance: 8 Novelty: 8
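
The two-sample test in the abstract above (distinguish a dataset from its randomly augmented copy) can be sketched in two dimensions. This is a deliberately simplified stand-in: the paper uses a neural classifier, whereas the code below assumes a hand-picked orientation feature and a one-threshold classifier, and the synthetic datasets are invented for illustration.

```python
import math
import random

def rotate(point, angle):
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def orientation_feature(point):
    """cos(2 * theta) of the point's angle (assumed feature, replaces a learned network)."""
    x, y = point
    r2 = x * x + y * y
    return (x * x - y * y) / r2 if r2 > 0 else 0.0

def two_sample_accuracy(data, rng):
    """Held-out accuracy of telling original points (0) from randomly rotated ones (1)."""
    augmented = [rotate(p, rng.uniform(0.0, 2.0 * math.pi)) for p in data]
    samples = [(orientation_feature(p), 0) for p in data]
    samples += [(orientation_feature(p), 1) for p in augmented]
    rng.shuffle(samples)
    half = len(samples) // 2
    train, test = samples[:half], samples[half:]
    mean0 = sum(f for f, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    mean1 = sum(f for f, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    threshold = 0.5 * (mean0 + mean1)

    def predict(f):
        return 0 if (f > threshold) == (mean0 > mean1) else 1

    return sum(predict(f) == y for f, y in test) / len(test)

rng = random.Random(0)
aligned = [(rng.gauss(0, 1), rng.gauss(0, 0.05)) for _ in range(2000)]   # lies along the x-axis
isotropic = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]    # rotation-invariant
acc_aligned = two_sample_accuracy(aligned, rng)
acc_isotropic = two_sample_accuracy(isotropic, rng)
```

Accuracy near 0.5 means the classifier cannot tell the data from its rotated copy (no detectable symmetry-breaking), while accuracy well above 0.5 indicates a canonical orientation in the dataset.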

29. Robust Tangent Space Estimation via Laplacian Eigenvector Gradient Orthogonalization

ArXiv ID: 2510.02308

Authors: Dhruv Kohli, Sawyer J. Robertson, Gal Mishne, Alexander Cloninger

Abstract: Estimating the tangent spaces of a data manifold is a fundamental problem in data analysis. The standard approach, Local Principal Component Analysis (LPCA), struggles in high-noise settings due to a critical trade-off in choosing the neighborhood size. Selecting an optimal size requires prior knowledge of the geometric and noise characteristics of the data that are often unavailable. In this paper, we propose a spectral method, Laplacian Eigenvector Gradient Orthogonalization (LEGO), that utilizes the global structure of the data to guide local tangent space estimation. Instead of relying solely on local neighborhoods, LEGO estimates the tangent space at each data point by orthogonalizing the gradients of low-frequency eigenvectors of the graph Laplacian. We provide two theoretical justifications of our method. First, a differential geometric analysis on a tubular neighborhood of a manifold shows that gradients of the low-frequency Laplacian eigenfunctions of the tube align closely with the manifold's tangent bundle, while eigenfunctions with high gradient in directions orthogonal to the manifold lie deeper in the spectrum. Second, a random matrix theoretic analysis also demonstrates that low-frequency eigenvectors are robust to sub-Gaussian noise. Through comprehensive experiments, we demonstrate that LEGO yields tangent space estimates that are significantly more robust to noise than those from LPCA, resulting in marked improvements in downstream tasks such as manifold learning, boundary detection, and local intrinsic dimension estimation.

Comment: Manifold/representation learning via Laplacian eigenvector gradient orthogonalization with theoretical robustness to noise.