This is a remedial run for missed papers from 03/13/2026 to 03/13/2026.

Results generated on 03/21/2026.

Personalized Daily ArXiv Papers 2026-03-14

[gpt-5.4]	Prompt	Completion	Total
Token	113479	5647	119126
Cost	$0.28	$0.08	$0.37

Table of contents with paper titles:

NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL Authors: Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, Peter-Jan Gootzen, Aamir Shafi, Assaf Ravid, Salvatore Di Girolamo, Manjunath Gorentla Venkata, Gil Bloch
A theory of learning data statistics in diffusion models, from easy to hard Authors: Lorenzo Bardone, Claudia Merger, Sebastian Goldt
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems Authors: Ann Dooms
From Gradients to Riccati Geometry: Kalman World Models for Single-Pass Learning Authors: Andrew Kiruluta
As Language Models Scale, Low-order Linear Depth Dynamics Emerge Authors: Buddhika Nettasinghe, Geethu Joseph
Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding Authors: Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen
When Drafts Evolve: Speculative Decoding Meets Online Learning Authors: Yu-Yang Qian, Hao-Cong Wu, Yichao Fu, Hao Zhang, Peng Zhao
Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy Authors: Piyush Sao
State-space models through the lens of ensemble control Authors: Ye Feng, Jianfeng Lu
Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics Authors: Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez
MXNorm: Reusing MXFP block scales for efficient tensor normalisation Authors: Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi
LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing Authors: Jiawei Hao, Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Dan Zeng
ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training Authors: Jie Ji, Gen Li, Kaiyuan Deng, Fatemeh Afghah, Xiaolong Ma
Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces Authors: Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David R. Mortensen, David Harwath
Upper Bounds for Local Learning Coefficients of Three-Layer Neural Networks Authors: Yuki Kurumadani
Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity Authors: Donglin Yu
Probabilistic Gaussian Homotopy: A Probability-Space Continuation Framework for Nonconvex Optimization Authors: Eshed Gal, Samy Wu Fung, Eldad Haber
Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE Authors: Faris Chaudhry
Equivalence of approximation by networks of single- and multi-spike neurons Authors: Dominik Dold, Philipp Christian Petersen
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis Authors: Chen Feng, Zhuo Zhi, Zhao Huang, Jiawei Ge, Ling Xiao, Nicu Sebe, Georgios Tzimiropoulos, Ioannis Patras
SRAM-Based Compute-in-Memory Accelerator for Linear-decay Spiking Neural Networks Authors: Hongyang Shang, Shuai Dong, Yahan Yang, Junyi Yang, Peng Zhou, Arindam Basu
Scalable Machines with Intrinsic Higher Mental-State Dynamics Authors: Ahsan Adeel, M. Bilal
Modality-free Graph In-context Alignment Authors: Wei Zhuo, Siqiang Luo
PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization Authors: Swadhin Pradhan, Shazal Irshad, Jerome Henry
Maximizing Incremental Information Entropy for Contrastive Learning Authors: Jiansong Zhang, Zhuoqin Yang, Xu Wu, Xiaoling Luo, Peizhong Liu, Linlin Shen
Resolving Interference (RI): Disentangling Models for Improved Model Merging Authors: Pratik Ramesh, George Stoica, Arun Iyer, Leshem Choshen, Judy Hoffman
Deep Invertible Autoencoders for Dimensionality Reduction of Dynamical Systems Authors: Nicolò Botteghi, Silke Glas, Christoph Brune
Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs Authors: Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No
Representation Learning for Spatiotemporal Physical Systems Authors: Helen Qu, Rudy Morel, Michael McCabe, Alberto Bietti, François Lanusse, Shirley Ho, Yann LeCun
Orla: A Library for Serving LLM-Based Multi-Agent Systems Authors: Rana Shahout, Hayder Tirmazi, Minlan Yu, Michael Mitzenmacher
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning Authors: Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim

1. NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

ArXiv ID: 2603.13606

Authors: Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, Peter-Jan Gootzen, Aamir Shafi, Assaf Ravid, Salvatore Di Girolamo, Manjunath Gorentla Venkata, Gil Bloch

Abstract: Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes leverage Device API for both intra- and inter-node communications, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.

Comment: High-performance computing for MoE: unified NCCL expert-parallel dispatch/combine API with topology-aware low-latency and high-throughput modes.

Relevance: 10 Novelty: 8
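
Sketch: the dispatch and combine phases named in the abstract can be illustrated with a single-process NumPy simulation. This is only a toy model of the data movement, under the assumption of top-k routing with per-token weights. The function names `dispatch` and `combine` and the scalar "experts" are hypothetical and do not reflect the ncclEpDispatch/ncclEpCombine signatures, which the abstract does not give.

```python
import numpy as np

def dispatch(tokens, topk_ids, num_experts):
    """Group token copies by destination expert (the 'dispatch' phase)."""
    per_expert = []
    for e in range(num_experts):
        # tok_idx: which tokens chose expert e; slot_idx: in which top-k slot
        tok_idx, slot_idx = np.nonzero(topk_ids == e)
        per_expert.append((tok_idx, slot_idx, tokens[tok_idx]))
    return per_expert

def combine(expert_outputs, topk_weights, num_tokens, hidden):
    """Scatter expert outputs back to token order, weighted by routing weights."""
    out = np.zeros((num_tokens, hidden))
    for tok_idx, slot_idx, y in expert_outputs:
        np.add.at(out, tok_idx, topk_weights[tok_idx, slot_idx][:, None] * y)
    return out

rng = np.random.default_rng(0)
T, H, E, K = 8, 4, 4, 2                      # tokens, hidden size, experts, top-k
tokens = rng.normal(size=(T, H))
# Each token picks K distinct experts (hypothetical router output).
topk_ids = np.stack([rng.choice(E, size=K, replace=False) for _ in range(T)])
topk_weights = rng.random((T, K))
topk_weights /= topk_weights.sum(axis=1, keepdims=True)

# Toy expert e simply multiplies its input by (e + 1).
expert_scale = np.arange(1, E + 1, dtype=float)
routed = dispatch(tokens, topk_ids, E)
expert_outputs = [(ti, si, expert_scale[e] * x) for e, (ti, si, x) in enumerate(routed)]
combined = combine(expert_outputs, topk_weights, T, H)

# Dense reference: weighted sum of the chosen experts' outputs per token.
reference = (topk_weights * expert_scale[topk_ids]).sum(axis=1, keepdims=True) * tokens
```

In a real expert-parallel deployment each entry of `routed` would travel to the rank hosting that expert; here everything stays in one process so the result can be checked against the dense reference.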

2. A theory of learning data statistics in diffusion models, from easy to hard

ArXiv ID: 2603.12901

Authors: Lorenzo Bardone, Claudia Merger, Sebastian Goldt

Abstract: While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.

Comment: Theory for representation learning in diffusion models: proves easy-to-hard learning of low- vs high-order data statistics via a diffusion information exponent.

Relevance: 9 Novelty: 9

3. Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems

ArXiv ID: 2603.13069

Authors: Ann Dooms

Abstract: What is a diffusion model actually doing when it turns noise into a photograph? We show that the deterministic DDIM reverse chain operates as a Partitioned Iterated Function System (PIFS) and that this framework serves as a unified design language for denoising diffusion model schedules, architectures, and training objectives. From the PIFS structure we derive three computable geometric quantities: a per-step contraction threshold $L^*_t$, a diagonal expansion function $f_t(λ)$, and a global expansion threshold $λ^{*}$. These quantities require no model evaluation and fully characterize the denoising dynamics. They structurally explain the two-regime behavior of diffusion models: global context assembly at high noise via diffuse cross-patch attention and fine-detail synthesis at low noise via patch-by-patch suppression release in strict variance order. Self-attention emerges as the natural primitive for PIFS contraction. The Kaplan-Yorke dimension of the PIFS attractor is determined analytically through a discrete Moran equation on the Lyapunov spectrum. Through the study of the fractal geometry of the PIFS, we derive three optimal design criteria and show that four prominent empirical design choices (the cosine schedule offset, resolution-dependent logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling) each arise as approximate solutions to our explicit geometric optimization problems, tuning theory into practice.

Comment: Theoretical reinterpretation of diffusion models as partitioned iterated function systems, yielding computable geometric design criteria for schedules and objectives.

Relevance: 9 Novelty: 9

4. From Gradients to Riccati Geometry: Kalman World Models for Single-Pass Learning

ArXiv ID: 2603.13423

Authors: Andrew Kiruluta

Abstract: Backpropagation dominates modern machine learning, yet it is not the only principled method for optimizing dynamical systems. We propose Kalman World Models (KWM), a class of learned state-space models trained via recursive Bayesian filtering rather than reverse-mode automatic differentiation. We replace gradient-descent parameter updates with Kalman-style gain adaptation. Training becomes online filtering; error signals become innovations. We further extend this framework to transformer-based large language models (LLMs), where internal activations are treated as latent dynamical states corrected via innovation terms. This yields a gradient-free training and adaptation paradigm grounded in control theory. We derive stability conditions, analyze computational complexity, and provide empirical results on sequence modeling tasks demonstrating competitive performance with improved robustness and continual adaptation properties.

Comment: Proposes a gradient-free training paradigm for state-space models and transformers using Kalman-style recursive filtering, with stability and complexity analysis.

Relevance: 9 Novelty: 9
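
Sketch: the idea that "training becomes online filtering; error signals become innovations" has a well-known minimal instance, namely fitting a linear model with a Kalman filter over its parameters (equivalent to recursive least squares). The code below is that textbook case, not the KWM algorithm from the paper. The function name `kalman_fit`, the static-parameter assumption, and the noise settings are all illustrative choices.

```python
import numpy as np

def kalman_fit(X, y, obs_var=1e-2, prior_var=1e2):
    """Single-pass fit of y ~ x @ theta, treating theta as a static latent state."""
    d = X.shape[1]
    theta = np.zeros(d)               # posterior mean of the parameters
    P = prior_var * np.eye(d)         # posterior covariance of the parameters
    for x_t, y_t in zip(X, y):
        innovation = y_t - x_t @ theta        # prediction error on this sample
        S = x_t @ P @ x_t + obs_var           # innovation variance
        K = P @ x_t / S                       # Kalman gain
        theta = theta + K * innovation        # gain-weighted correction, no gradient step
        P = P - np.outer(K, x_t) @ P          # covariance update
    return theta, P

rng = np.random.default_rng(0)
true_theta = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ true_theta + 0.1 * rng.normal(size=500)

theta_hat, P_hat = kalman_fit(X, y)
```

Each sample is visited once, and the gain `K` plays the role a learning rate would play in gradient descent, shrinking automatically as the covariance `P` contracts.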

5. As Language Models Scale, Low-order Linear Depth Dynamics Emerge

ArXiv ID: 2603.12541

Authors: Buddhika Nettasinghe, Geethu Joseph

Abstract: Large language models are often viewed as high-dimensional nonlinear systems and treated as black boxes. Here, we show that transformer depth dynamics admit accurate low-order linear surrogates within context. Across tasks including toxicity, irony, hate speech and sentiment, a 32-dimensional linear surrogate reproduces the layerwise sensitivity profile of GPT-2-large with near-perfect agreement, capturing how the final output shifts under additive injections at each layer. We then uncover a surprising scaling principle: for a fixed-order linear surrogate, agreement with the full model improves monotonically with model size across the GPT-2 family. This linear surrogate also enables principled multi-layer interventions that require less energy than standard heuristic schedules when applied to the full model. Together, our results reveal that as language models scale, low-order linear depth dynamics emerge within contexts, offering a systems-theoretic foundation for analyzing and controlling them.

Comment: Core architecture analysis: identifies low-order linear surrogate dynamics emerging across transformer depth as models scale.

Relevance: 9 Novelty: 8

6. Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding

ArXiv ID: 2603.13459

Authors: Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen

Abstract: In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model's inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a task representation space and a sample representation space. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single-value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few-shot classification and a newly designed pseudo-arithmetic task. Code: https://github.com/McGuinnessChen/dual-representation-space-encoding

Comment: Model architecture: separates context and sample encoding into dual representation spaces to reconcile in-context and in-weight learning.

Relevance: 9 Novelty: 8

7. When Drafts Evolve: Speculative Decoding Meets Online Learning

ArXiv ID: 2603.12617

Authors: Yu-Yang Qian, Hao-Cong Wu, Yichao Fu, Hao Zhang, Peng Zhao

Abstract: Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model. However, due to limited model capacity, drafts often struggle to approximate the target distribution, resulting in shorter acceptance lengths and diminished speedup. A key yet under-explored observation is that speculative decoding inherently provides verification feedback that quantifies the deviation between the draft and target models at no additional cost. This process naturally forms an iterative "draft commits, feedback provides, draft adapts" evolving loop, which precisely matches the online learning paradigm. Motivated by this connection, we propose OnlineSpec, a unified framework that systematically leverages interactive feedback to continuously evolve draft models. Grounded in dynamic regret minimization, we establish a formal link between online learning performance and the speculative system's acceleration rate, and develop novel algorithms via modern online learning techniques, including optimistic online learning that adaptively reuses historical gradients as predictive update hints, and online ensemble learning that dynamically maintains multiple draft models. Our algorithms are equipped with theoretical justifications and improved acceleration rates, achieving up to 24% speedup over seven benchmarks and three foundation models.

Comment: Inference efficiency: speculative decoding cast as online learning, with regret-based algorithms that adapt draft models from verification feedback.

Relevance: 9 Novelty: 8
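
Sketch: the draft/verify/adapt loop can be illustrated for a single token position with the standard speculative sampling rule (accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0)). The adaptation step below is plain gradient descent on the draft logits and assumes the full target distribution is available as feedback. It is an illustrative stand-in, not the OnlineSpec algorithms, and the toy distributions are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def speculative_sample(p, q, n, rng):
    """Draw n tokens: propose from draft q, verify against target p."""
    drafts = rng.choice(len(q), size=n, p=q)
    accept = rng.random(n) < np.minimum(1.0, p[drafts] / q[drafts])
    residual = np.maximum(p - q, 0.0)
    # If p == q the residual is empty; nothing is rejected, so any fallback works.
    residual = residual / residual.sum() if residual.sum() > 0 else p
    resampled = rng.choice(len(p), size=n, p=residual)
    return np.where(accept, drafts, resampled), accept

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])    # toy target distribution
logits = np.zeros(4)                     # toy draft model, initially uniform
q = softmax(logits)

tokens, accept = speculative_sample(p, q, 20000, rng)
empirical = np.bincount(tokens, minlength=4) / len(tokens)
rate_before = np.minimum(p, q).sum()     # expected acceptance rate, 1 - TV(p, q)

# Online adaptation: cross-entropy gradient w.r.t. draft logits is (q - p).
for _ in range(300):
    logits -= 1.0 * (softmax(logits) - p)
rate_after = np.minimum(p, softmax(logits)).sum()
```

The accepted-or-resampled tokens follow the target distribution regardless of the draft, so adapting the draft changes only the acceptance rate, which is the quantity that governs speedup.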

8. Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy

ArXiv ID: 2603.13552

Authors: Piyush Sao

Abstract: Optimization analyses for cross-entropy training rely on local Taylor models of the loss to predict whether a proposed step will decrease the objective. These surrogates are reliable only inside the Taylor convergence radius of the true loss along the update direction. That radius is set not by real-line curvature alone but by the nearest complex singularity. For cross-entropy, the softmax partition function $F=\sum_j \exp(z_j)$ has complex zeros ("ghosts of softmax") that induce logarithmic singularities in the loss and cap this radius. To make this geometry usable, we derive closed-form expressions under logit linearization along the proposed update direction. In the binary case, the exact radius is $ρ^*=\sqrt{δ^2+π^2}/Δ_a$. In the multiclass case, we obtain the lower bound $ρ_a=π/Δ_a$, where $Δ_a=\max_k a_k-\min_k a_k$ is the spread of directional logit derivatives $a_k=\nabla z_k\cdot v$. This bound costs one Jacobian-vector product and reveals what makes a step fragile: samples that are both near a decision flip and highly sensitive to the proposed direction tighten the radius. The normalized step size $r=τ/ρ_a$ separates safe from dangerous updates. Across six tested architectures and multiple step directions, no model fails for $r<1$, yet collapse appears once $r\ge 1$. Temperature scaling confirms the mechanism: normalizing by $ρ_a$ shrinks the onset-threshold spread from standard deviation $0.992$ to $0.164$. A controller that enforces $τ\le ρ_a$ survives learning-rate spikes up to $10{,}000\times$ in our tests, where gradient clipping still collapses. Together, these results identify a geometric constraint on cross-entropy optimization that operates through Taylor convergence rather than Hessian curvature.

Comment: Optimization theory for transformers trained with cross-entropy: derives complex-singularity step-size bounds from softmax geometry with a cheap JVP-based safety criterion.

Relevance: 9 Novelty: 8
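
Sketch: the two radius formulas quoted in the abstract can be checked numerically under logit linearization, z_k(t) = z_k + a_k t. The code assumes that δ denotes the logit gap z_1 - z_2 in the binary case (the abstract does not define it), and the helper names are hypothetical. It verifies that the linearized partition function vanishes at a complex step whose modulus equals the binary radius.

```python
import numpy as np

def binary_radius(delta, spread):
    """Binary-case radius from the abstract: sqrt(delta^2 + pi^2) / spread."""
    return np.sqrt(delta**2 + np.pi**2) / abs(spread)

def multiclass_radius_bound(a):
    """Lower bound pi / (max_k a_k - min_k a_k) on the convergence radius."""
    return np.pi / (a.max() - a.min())

# Two-class example: logits z and directional logit derivatives a (toy values).
z = np.array([2.0, 0.5])
a = np.array([1.0, -0.5])
delta = z[0] - z[1]          # assumed meaning of delta: the logit gap
spread = a[0] - a[1]         # Delta_a for two classes

# Nearest complex zero of F(t) = exp(z1 + a1 t) + exp(z2 + a2 t):
# it requires delta + spread * t = i * pi, so t = (-delta + i * pi) / spread.
t_star = (-delta + 1j * np.pi) / spread
F_at_zero = np.exp(z[0] + a[0] * t_star) + np.exp(z[1] + a[1] * t_star)

rho_exact = binary_radius(delta, spread)
rho_bound = multiclass_radius_bound(a)

# Normalized step size r = tau / rho_a for a proposed step tau.
tau = 1.0
r = tau / rho_bound
```

For a general network the vector `a` would come from one Jacobian-vector product of the logits along the update direction; here it is supplied directly.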

9. State-space models through the lens of ensemble control

ArXiv ID: 2603.13587

Authors: Ye Feng, Jianfeng Lu

Abstract: State-space models (SSMs) are effective architectures for sequential modeling, but a rigorous theoretical understanding of their training dynamics is still lacking. In this work, we formulate the training of SSMs as an ensemble optimal control problem, where a shared control law governs a population of input-dependent dynamical systems. We derive Pontryagin's maximum principle (PMP) for this ensemble control formulation, providing necessary conditions for optimality. Motivated by these conditions, we introduce an algorithm based on the method of successive approximations. We prove convergence of this iterative scheme along a subsequence and establish sufficient conditions for global optimality. The resulting framework provides a control-theoretic perspective on SSM training.

Comment: Provides a control-theoretic foundation for state-space models by casting training as an ensemble optimal control problem and deriving PMP-based optimality conditions.

Relevance: 9 Novelty: 8

10. Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

ArXiv ID: 2603.13085

Authors: Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez

Abstract: Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analyses through the Neural Tangent Kernel (NTK) framework show that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number, requiring width $m = Ω(κ^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6-9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.

Comment: NTK-based theory for linearized attention showing non-convergence and introducing influence malleability as a core property.

Relevance: 9 Novelty: 8

11. MXNorm: Reusing MXFP block scales for efficient tensor normalisation

ArXiv ID: 2603.13180

Authors: Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi

Abstract: Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in performance on reductions and elementwise computations, which are still being performed in higher precision. In this work, we propose MXNorm, a drop-in replacement for RMSNorm that estimates the RMS using only the block scales calculated as part of the MXFP8 cast and enables a 32x decrease in the size of reduction needed for normalization. We validate our approximation method on pre-training of Llama 3 models of 125M, 1B and 8B parameters, finding minimal loss of training accuracy compared to a baseline using RMSNorm with MXFP8 matmuls. We also show practical kernel speedups using only torch.compile of up to 2.4x for MXNorm over RMSNorm, corresponding to a 1.3% speedup in Llama 3 8B transformer layers in MXFP8 and a 2.6% speedup in NVFP4.

Comment: Model efficiency: normalization redesign that reuses MXFP block scales to cut reduction cost and speed low-precision transformer training.

Relevance: 9 Novelty: 7
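
Sketch: the 32x smaller reduction comes from summing over one shared scale per 32-element block instead of over every element. The code below computes power-of-two block scales following the MX convention (shared exponent = floor(log2(block absmax)) minus the element format's maximum exponent, 8 for E4M3) and forms a root-mean-square over the scales only. How the paper turns the scales into an RMS estimate is not stated in the abstract, so `rms_from_scales` is an uncalibrated assumption that tracks the true RMS only up to a distribution-dependent constant.

```python
import numpy as np

BLOCK = 32        # elements sharing one scale in MX formats
E4M3_EMAX = 8     # maximum exponent of the FP8 E4M3 element format

def mx_block_scales(x):
    """Power-of-two shared scale for each 32-element block of x."""
    blocks = x.reshape(-1, BLOCK)
    absmax = np.maximum(np.abs(blocks).max(axis=1), np.finfo(np.float64).tiny)
    # frexp gives absmax = m * 2**e with m in [0.5, 1), so floor(log2(absmax)) = e - 1.
    _, e = np.frexp(absmax)
    return np.ldexp(1.0, (e - 1) - E4M3_EMAX)

def rms_from_scales(scales):
    """Hypothetical, uncalibrated RMS proxy: reduce over block scales only."""
    return np.sqrt(np.mean(scales**2))

rng = np.random.default_rng(0)
x = rng.normal(size=4096)

scales = mx_block_scales(x)
proxy = rms_from_scales(scales)
proxy_scaled = rms_from_scales(mx_block_scales(4.0 * x))
true_rms = np.sqrt(np.mean(x**2))
```

Because the scales are powers of two, the proxy is quantized; scaling the input by a power of two scales the proxy by exactly the same factor, while other rescalings are followed only approximately.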

12. LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing

ArXiv ID: 2603.12645

Authors: Jiawei Hao, Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Dan Zeng

Abstract: Mixture-of-Experts (MoE) based Large Language Models (LLMs) have demonstrated impressive performance and computational efficiency. However, their deployment is often constrained by substantial memory demands, primarily due to the need to load numerous expert modules. While existing expert compression techniques like pruning or merging attempt to mitigate this, they often suffer from irreversible knowledge loss or high training overhead. In this paper, we propose a novel expert compression paradigm termed expert replacing, which replaces redundant experts with parameter-efficient modules and recovers their capabilities with low training costs. We find that even a straightforward baseline of this paradigm yields promising performance. Building on this foundation, we introduce LightMoE, a framework that enhances the paradigm by introducing adaptive expert selection, hierarchical expert construction, and an annealed recovery strategy. Experimental results show that LightMoE matches the performance of LoRA fine-tuning at a 30% compression ratio. Even under a more aggressive 50% compression rate, it outperforms existing methods and achieves average performance improvements of 5.6% across five diverse tasks. These findings demonstrate that LightMoE strikes a superior balance among memory efficiency, training efficiency, and model performance.

Comment: Model compression for MoE: replaces redundant experts with parameter-efficient modules to reduce memory without full expert merging/pruning.

Relevance: 9 Novelty: 7

13. ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training

ArXiv ID: 2603.13115

Authors: Jie Ji, Gen Li, Kaiyuan Deng, Fatemeh Afghah, Xiaolong Ma

Abstract: Deep learning models, despite their impressive achievements, suffer from high computational costs and memory requirements, limiting their usability in resource-constrained environments. Sparse neural networks significantly alleviate these constraints by dramatically reducing parameter count and computational overhead. However, existing sparse training methods often experience chaotic and noisy gradient signals, severely hindering convergence and generalization performance, particularly at high sparsity levels. To tackle this critical challenge, we propose Zero-Order Sharpness-Aware Minimization (ZO-SAM), a novel optimization framework that strategically integrates zero-order optimization within the SAM approach. Unlike traditional SAM, ZO-SAM requires only a single backpropagation step during perturbation, selectively utilizing zero-order gradient estimations. This innovative approach reduces the backpropagation computational cost by half compared to conventional SAM, significantly lowering gradient variance and effectively eliminating associated computational overhead. By harnessing SAM's capacity for identifying flat minima, ZO-SAM stabilizes the training process and accelerates convergence. These efficiency gains are particularly important in sparse training scenarios, where computational cost is the primary bottleneck that limits the practicality of SAM. Moreover, models trained with ZO-SAM exhibit improved robustness under distribution shift, further broadening its practicality in real-world deployments.

Comment: Optimization method for efficient sparse training: zero-order SAM cuts backprop cost while stabilizing high-sparsity learning.

Relevance: 9 Novelty: 7
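
Sketch: standard SAM needs two gradient computations per step, one to build the perturbation and one at the perturbed point. One way to halve the backpropagation cost, consistent with the abstract, is to estimate the perturbation direction from loss evaluations alone and keep a single true gradient. The toy quadratic, the two-point random-direction estimator, and all hyperparameters below are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

A = np.diag([10.0, 1.0])    # toy quadratic loss 0.5 * w^T A w

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    """Stand-in for one backpropagation pass."""
    return A @ w

def zo_gradient(w, rng, num_dirs=4, mu=1e-3):
    """Zero-order gradient estimate from central differences along random directions."""
    g = np.zeros_like(w)
    for _ in range(num_dirs):
        u = rng.normal(size=w.shape)
        g += (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu) * u
    return g / num_dirs

def zo_sam_step(w, rng, lr=0.05, rho=0.05):
    g_zo = zo_gradient(w, rng)                          # forward passes only
    eps = rho * g_zo / (np.linalg.norm(g_zo) + 1e-12)   # SAM-style ascent perturbation
    return w - lr * grad(w + eps)                       # the single true gradient

rng = np.random.default_rng(0)
w = np.array([1.0, 1.0])
initial_loss = loss(w)
for _ in range(200):
    w = zo_sam_step(w, rng)
final_loss = loss(w)
```

The iterate settles in a neighborhood of the minimum whose size is set by `rho`, as with ordinary SAM; the zero-order estimate only needs to point roughly uphill because it is normalized before use.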

14. Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces

ArXiv ID: 2603.12642

Authors: Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David R. Mortensen, David Harwath

Abstract: Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors such as voicing, bilabiality, and nasality vectors are superposed in the S3M representation of [m]. We extend this view by proposing that phonological information from a sequence of neighboring phones is also compositionally encoded in a single frame, such that vectors corresponding to previous, current, and next phones are superposed within a single frame-level representation. We show that this structure has several properties, including orthogonality between relative positions, and emergence of implicit phonetic boundaries. Together, our findings advance our understanding of context-dependent S3M representations.

Comment: Representation learning analysis: shows self-supervised speech models encode neighboring phonetic context in position-dependent orthogonal subspaces.