Personalized Daily ArXiv Papers 2025-10-30

[gpt-5]	Prompt	Completion	Total
Token	37095	36561	73656
Cost	$0.05	$0.37	$0.41

Total arXiv papers: 519

Total scanned papers: 308

Total relevant papers: 25

Table of contents with paper titles:

INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats Authors: Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation Authors: Inclusion AI, :, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He
CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices Authors: Xuchen Feng, Siyu Liao
How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs Authors: Samet Demir, Zafer Dogan
Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information Authors: Yuan Cheng, Yu Huang, Zhe Xiong, Yingbin Liang, Vincent Y. F. Tan
Sequences of Logits Reveal the Low Rank Structure of Language Models Authors: Noah Golowich, Allen Liu, Abhishek Shetty
The Neural Differential Manifold: An Architecture with Explicit Geometric Structure Authors: Di Zhang
Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy Authors: Phuc Tran, Nisheeth K. Vishnoi, Van H. Vu
IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning Authors: Xiandong Zou, Pan Zhou
Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning Authors: Arani Roy, Marco P. Apolinario, Shristi Das Biswas, Kaushik Roy
Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers Authors: Rabin Adhikari
From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning Authors: Junsoo Oh, Jerry Song, Chulhee Yun
A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory Authors: Adrien Weihs, Jingmin Sun, Zecheng Zhang, Hayden Schaeffer
Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box Reservoirs Authors: Andrew Clark, Jack Moursounidis, Osmaan Rasouli, William Gan, Cooper Doyle, Anna Leontjeva
Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU Authors: Jingzhou Liu
Mechanistic Interpretability of RNNs emulating Hidden Markov Models Authors: Elia Torre, Michele Viscione, Lucas Pompe, Benjamin F Grewe, Valerio Mante
BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training Authors: Wenjie Zhou, Bohan Wang, Wei Chen, Xueqi Cheng
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought Authors: Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song
Lipschitz-aware Linearity Grafting for Certified Robustness Authors: Yongjin Han, Suhyun Kim
Are Language Models Efficient Reasoners? A Perspective from Logic Programming Authors: Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Sch\"olkopf
Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms Authors: Bernhard Klein
Confidence is Not Competence Authors: Debdeep Sanyal, Manya Pandey, Dhruv Kumar, Saurabh Deshpande, Murari Mandal
Continual Low-Rank Adapters for LLM-based Generative Recommender Systems Authors: Hyunsik Yoo, Ting-Wei Li, SeongKu Kang, Zhining Liu, Charlie Xu, Qilin Qi, Hanghang Tong
What Really Matters in Matrix-Whitening Optimizers? Authors: Kevin Frans, Pieter Abbeel, Sergey Levine
TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting Authors: Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter

1. INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

ArXiv ID: 2510.25602

Authors: Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo

Abstract: Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage , though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.

Comment: Compression/Efficiency: comprehensive study of low-bit quantization formats (INT vs FP) at fine-grained levels with new training method for MXINT8.

Relevance: 10 Novelty: 8

2. Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

ArXiv ID: 2510.24821

Authors: Inclusion AI, :, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He

Abstract: We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.

Comment: Model Architecture: sparse Mixture-of-Experts (MoE) unified multimodal model with only 6.1B active parameters per token.

Relevance: 10 Novelty: 7

3. CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices

ArXiv ID: 2510.25323

Authors: Xuchen Feng, Siyu Liao

Abstract: Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ using $m$ diagonal matrices and $m-1$ circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$ and that of computing the log-determinant from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.

Comment: Model Architecture and Efficiency: introduces invertible linear layers via circulant–diagonal decomposition with FFT, reducing parameters and log-det/inversion cost for normalizing flows.

Relevance: 9 Novelty: 8

4. How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

ArXiv ID: 2510.25753

Authors: Samet Demir, Zafer Dogan

Abstract: Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.

Comment: Strongly matches architecture/representation learning criteria with theoretical analysis of ICL in Transformers including nonlinear MLP heads and multi-source data mixing.

Relevance: 9 Novelty: 8

5. Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information

ArXiv ID: 2510.25542

Authors: Yuan Cheng, Yu Huang, Zhe Xiong, Yingbin Liang, Vincent Y. F. Tan

Abstract: Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) -- which involve multiple parents per node -- remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.

Comment: Strongly matches architecture/theory criteria by proving multi-head Transformers learn DAG structure via a kernel-guided mutual information objective.

Relevance: 9 Novelty: 8

6. Sequences of Logits Reveal the Low Rank Structure of Language Models

ArXiv ID: 2510.24966

Authors: Noah Golowich, Allen Liu, Abhishek Shetty

Abstract: A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.

Comment: Representation Learning + Compression/Efficiency: demonstrates and exploits low-rank structure in LM logits with a model-agnostic abstraction and theory.

Relevance: 9 Novelty: 8

7. The Neural Differential Manifold: An Architecture with Explicit Geometric Structure

ArXiv ID: 2510.25113

Authors: Di Zhang

Abstract: This paper introduces the Neural Differential Manifold (NDM), a novel neural network architecture that explicitly incorporates geometric structure into its fundamental design. Departing from conventional Euclidean parameter spaces, the NDM re-conceptualizes a neural network as a differentiable manifold where each layer functions as a local coordinate chart, and the network parameters directly parameterize a Riemannian metric tensor at every point. The architecture is organized into three synergistic layers: a Coordinate Layer implementing smooth chart transitions via invertible transformations inspired by normalizing flows, a Geometric Layer that dynamically generates the manifold's metric through auxiliary sub-networks, and an Evolution Layer that optimizes both task performance and geometric simplicity through a dual-objective loss function. This geometric regularization penalizes excessive curvature and volume distortion, providing intrinsic regularization that enhances generalization and robustness. The framework enables natural gradient descent optimization aligned with the learned manifold geometry and offers unprecedented interpretability by endowing internal representations with clear geometric meaning. We analyze the theoretical advantages of this approach, including its potential for more efficient optimization, enhanced continual learning, and applications in scientific discovery and controllable generative modeling. While significant computational challenges remain, the Neural Differential Manifold represents a fundamental shift towards geometrically structured, interpretable, and efficient deep learning systems.

Comment: Model Architecture: proposes a neural architecture as a differentiable manifold with learned Riemannian metric and geometry-regularized optimization (natural-gradient aligned).

Relevance: 9 Novelty: 8

8. Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy

ArXiv ID: 2510.25670

Authors: Phuc Tran, Nisheeth K. Vishnoi, Van H. Vu

Abstract: A central challenge in machine learning is to understand how noise or measurement errors affect low-rank approximations, particularly in the spectral norm. This question is especially important in differentially private low-rank approximation, where one aims to preserve the top-$p$ structure of a data-derived matrix while ensuring privacy. Prior work often analyzes Frobenius norm error or changes in reconstruction quality, but these metrics can over- or under-estimate true subspace distortion. The spectral norm, by contrast, captures worst-case directional error and provides the strongest utility guarantees. We establish new high-probability spectral-norm perturbation bounds for symmetric matrices that refine the classical Eckart--Young--Mirsky theorem and explicitly capture interactions between a matrix $A \in \mathbb{R}^{n \times n}$ and an arbitrary symmetric perturbation $E$. Under mild eigengap and norm conditions, our bounds yield sharp estimates for $|(A + E)_p - A_p|$, where $A_p$ is the best rank-$p$ approximation of $A$, with improvements of up to a factor of $\sqrt{n}$. As an application, we derive improved utility guarantees for differentially private PCA, resolving an open problem in the literature. Our analysis relies on a novel contour bootstrapping method from complex analysis and extends it to a broad class of spectral functionals, including polynomials and matrix exponentials. Empirical results on real-world datasets confirm that our bounds closely track the actual spectral error under diverse perturbation regimes.

Comment: Compression/Efficiency: new spectral-norm perturbation bounds for low-rank approximation, improving theoretical guarantees (e.g., DP-PCA utility).

Relevance: 8 Novelty: 9

9. IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning

ArXiv ID: 2510.25262

Authors: Xiandong Zou, Pan Zhou

Abstract: Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.

Comment: Model Architecture (Normalization) and Representation Learning: IB-inspired normalization controlling task-relevant information with theory on IB value and generalization.

Relevance: 9 Novelty: 7

10. Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning

ArXiv ID: 2510.25594

Authors: Arani Roy, Marco P. Apolinario, Shristi Das Biswas, Kaushik Roy

Abstract: Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, specially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.

Comment: Compression/Efficiency and Architecture: structured local learning on low-rank manifolds (SVD) with aligned feedback, reducing parameters and avoiding BP while maintaining accuracy.

Relevance: 9 Novelty: 7

11. Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

ArXiv ID: 2510.25013

Authors: Rabin Adhikari

Abstract: Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task -- a benchmark for studying coreference -- like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.

Comment: Strongly matches representation learning criterion via mechanistic interpretability of attention-only transformers and emergence of minimal circuits for IOI.

Relevance: 9 Novelty: 7

12. From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning

ArXiv ID: 2510.24812

Authors: Junsoo Oh, Jerry Song, Chulhee Yun

Abstract: Weak-to-strong generalization refers to the phenomenon where a stronger model trained under supervision from a weaker one can outperform its teacher. While prior studies aim to explain this effect, most theoretical insights are limited to abstract frameworks or linear/random feature models. In this paper, we provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). We consider structured data composed of label-dependent signals of varying difficulty and label-independent noise, and analyze gradient descent dynamics when the strong model is trained on data labeled by the pretrained weak model. Our analysis identifies two regimes -- data-scarce and data-abundant -- based on the signal-to-noise characteristics of the dataset, and reveals distinct mechanisms of weak-to-strong generalization. In the data-scarce regime, generalization occurs via benign overfitting or fails via harmful overfitting, depending on the amount of data, and we characterize the transition boundary. In the data-abundant regime, generalization emerges in the early phase through label correction, but we observe that overtraining can subsequently degrade performance.

Comment: Matches representation learning criterion with a theoretical analysis of feature learning and training dynamics (weak-to-strong generalization) in CNNs.

Relevance: 8 Novelty: 8

13. A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory

ArXiv ID: 2510.25379

Authors: Adrien Weihs, Jingmin Sun, Zecheng Zhang, Hayden Schaeffer

Abstract: While many problems in machine learning focus on learning mappings between finite-dimensional spaces, scientific applications require approximating mappings between function spaces, i.e., operators. We study the problem of learning collections of operators and provide both theoretical and empirical advances. We distinguish between two regimes: (i) multiple operator learning, where a single network represents a continuum of operators parameterized by a parametric function, and (ii) learning several distinct single operators, where each operator is learned independently. For the multiple operator case, we introduce two new architectures, $\mathrm{MNO}$ and $\mathrm{MONet}$, and establish universal approximation results in three settings: continuous, integrable, or Lipschitz operators. For the latter, we further derive explicit scaling laws that quantify how the network size must grow to achieve a target approximation accuracy. For learning several single operators, we develop a framework for balancing architectural complexity across subnetworks and show how approximation order determines computational efficiency. Empirical experiments on parametric PDE benchmarks confirm the strong expressive power and efficiency of the proposed architectures. Overall, this work establishes a unified theoretical and practical foundation for scalable neural operator learning across multiple operators.

Comment: Matches model architecture and efficiency theory criteria with new multi-operator neural operator architectures (MNO/MONet) and explicit approximation/scaling laws.

Relevance: 8 Novelty: 8

14. Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box Reservoirs

ArXiv ID: 2510.25074

Authors: Andrew Clark, Jack Moursounidis, Osmaan Rasouli, William Gan, Cooper Doyle, Anna Leontjeva

Abstract: We introduce Bounded Numerical Differentiation (BOND), a perturbative method for estimating partial derivatives across network structures with inaccessible computational graphs. BOND demonstrates improved accuracy and scalability from existing perturbative methods, enabling new explorations of trainable architectures that integrate black-box functions. We observe that these black-box functions, realized in our experiments as fixed, untrained networks, can enhance model performance without increasing the number of trainable parameters. This improvement is achieved without extensive optimization of the architecture or properties of the black-box function itself. Our findings highlight the potential of leveraging fixed, non-trainable modules to expand model capacity, suggesting a path toward combining analogue and digital devices as a mechanism for scaling networks.

Comment: Matches architecture/systems criteria by enabling training with black-box modules via Bounded Numerical Differentiation, supporting hybrid analogue–digital compositions.

Relevance: 8 Novelty: 8

15. Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU

ArXiv ID: 2510.25060

Authors: Jingzhou Liu

Abstract: In this work, we study the nonlinear dynamics of a shallow neural network trained with mean-squared loss and leaky ReLU activation. Under Gaussian inputs and equal layer width k, (1) we establish, based on the equivariant gradient degree, a theoretical framework, applicable to any number of neurons k>= 4, to detect bifurcation of critical points with associated symmetries from global minimum as leaky parameter $\alpha$ varies. Typically, our analysis reveals that a multi-mode degeneracy consistently occurs at the critical number 0, independent of k. (2) As a by-product, we further show that such bifurcations are width-independent, arise only for nonnegative $\alpha$ and that the global minimum undergoes no further symmetry-breaking instability throughout the engineering regime $\alpha$ in range (0,1). An explicit example with k=5 is presented to illustrate the framework and exhibit the resulting bifurcation together with their symmetries.

Comment: Representation Learning/Training Dynamics: theoretical bifurcation analysis of shallow networks with tunable leaky ReLU revealing symmetry-breaking and landscape structure.