This is a remedial run for missed papers from 03/17/2026 to 03/17/2026.

Results generated on 03/21/2026.

Personalized Daily ArXiv Papers 2026-03-18

[gpt-5.4]	Prompt	Completion	Total
Token	125187	5514	130701
Cost	$0.31	$0.08	$0.4

Table of contents with paper titles:

Functorial Neural Architectures from Higher Inductive Types Authors: Karen Sargsyan
Self-Regularized Learning Methods Authors: Max Schölpple, Liu Fanghui, Ingo Steinwart
Transformers are Bayesian Networks Authors: Gregory Coppola
MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models Authors: Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan
SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds Authors: Viktor Stein, Wuchen Li, Gabriele Steidl
High-dimensional estimation with missing data: Statistical and computational limits Authors: Kabir Aladin Verchand, Ankit Pensia, Saminul Haque, Rohith Kuditipudi
BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization Authors: Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU Authors: Ruijia Yang, Zeyi Wen
High-Dimensional Gaussian Mean Estimation under Realizable Contamination Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas
Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency Authors: Lucas Bandarkar, Alan Ansell, Trevor Cohn
Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets Authors: Kristi Topollai, Anna Choromanska
GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators Authors: Mattia Rigotti, Nicholas Thumiger, Thomas Frick
Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks Authors: Xavier Gonzalez
Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models Authors: Rishaank Gupta
NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference Authors: Zhaohui Geoffrey Wang
Decoding the Critique Mechanism in Large Reasoning Models Authors: Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan
Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization Authors: Wenhao Zhao, Qiran Zou, Rushi Shah, Yudi Wu, Zhouhan Lin, Dianbo Liu
Self-Aware Markov Models for Discrete Reasoning Authors: Gregor Kornhardt, Jannis Chemseddine, Christian Wald, Gabriele Steidl
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective Authors: Noppanat Wadlom, Junyi Shen, Yao Lu
Demystifing Video Reasoning Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
Transformers Can Learn Rules They've Never Seen: Proof of Computation Beyond Interpolation Authors: Andy Gray
SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit Authors: Yibo Li, Qiongxiu Li
NeuronSpark: A Spiking Neural Network Language Model with Selective State Space Dynamics Authors: Zhengzheng Tang
Trained Persistent Memory for Frozen Encoder--Decoder LLMs: Six Architectural Methods Authors: Hong Jeong
Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function Authors: Amon Lahr, Anna Scampicchio, Johannes Köhler, Melanie N. Zeilinger
Online Semi-infinite Linear Programming: Efficient Algorithms via Function Approximation Authors: Yiming Zong, Jiashuo Jiang
SF-Mamba: Rethinking State Space Model for Vision Authors: Masakazu Yoshimura, Teruaki Hayashi, Yuki Hoshino, Wei-Yao Wang, Takeshi Ohashi
V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising Authors: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning Authors: Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki
Grid-World Representations in Transformers Reflect Predictive Geometry Authors: Sasha Brenner, Thomas R. Knösche, Nico Scherf
Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing Authors: Parsa Mirtaheri, Mikhail Belkin
PRISM: Demystifying Retention and Interaction in Mid-Training Authors: Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda
Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits Authors: Jia Qing Yap
Dependence Fidelity and Downstream Inference Stability in Generative Models Authors: Nazia Riasat
Parallel In-context Learning for Large Vision Language Models Authors: Shin'ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa
Efficient Reasoning on the Edge Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi
Implementation of tangent linear and adjoint models for neural networks based on a compiler library tool Authors: Sa Xiao, Hao Jing, Honglu Sun, Haoyu Li

1. Functorial Neural Architectures from Higher Inductive Types

ArXiv ID: 2603.16123

Authors: Karen Sargsyan

Abstract: Neural networks systematically fail at compositional generalization -- producing correct outputs for novel combinations of known parts. We show that this failure is architectural: compositional generalization is equivalent to functoriality of the decoder, and this perspective yields both guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2-cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), while softmax self-attention is not functorial for any non-trivial compositional task. Both results are formalized in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus ($\mathbb{Z}^2$), functorial decoders outperform non-functorial ones by 2-2.7x; on $S^1 \vee S^1$ ($F_2$), the type-A/B gap widens to 5.5-10x; on the Klein bottle ($\mathbb{Z} \rtimes \mathbb{Z}$), a learned 2-cell closes a 46% error gap on words exercising the group relation.

Comment: Introduces a new architecture class with formal compositional-generalization guarantees via functoriality, and proves self-attention is non-functorial for nontrivial compositional tasks.

Relevance: 10 Novelty: 10
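
Illustrative sketch (not the paper's architecture): the abstract's claim that decoders built by structural concatenation are "compositional by construction" can be seen in a toy setting. The `GENERATORS` table and `decode` function below are hypothetical stand-ins for learned generator networks; the point is only that decoding a concatenated word equals concatenating the decodings.

```python
from typing import Callable, Dict, List

Segment = List[float]

# Hypothetical stand-ins for learned generator networks: each path
# constructor (here two loops "a" and "b") maps to a fixed segment.
GENERATORS: Dict[str, Callable[[], Segment]] = {
    "a": lambda: [1.0, 0.0],
    "b": lambda: [0.0, 1.0],
}


def decode(word: str) -> Segment:
    """Decode a word by concatenating independently generated segments."""
    out: Segment = []
    for symbol in word:
        out.extend(GENERATORS[symbol]())
    return out


u, v = "ab", "ba"
# Monoid-homomorphism property: decoding a concatenation equals
# concatenating the decodings, and the empty word maps to the empty output.
assert decode(u + v) == decode(u) + decode(v)
assert decode("") == []
```

Because each segment depends only on its own symbol, the property holds for every pair of words, including combinations never seen together. A decoder whose output for one segment depends on the other segments has no such guarantee.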

2. Self-Regularized Learning Methods

ArXiv ID: 2603.17160

Authors: Max Schölpple, Liu Fanghui, Ingo Steinwart

Abstract: We introduce a general framework for analyzing learning algorithms based on the notion of self-regularization, which captures implicit complexity control without requiring explicit regularization. This is motivated by previous observations that many algorithms, such as gradient-descent based learning, exhibit implicit regularization. In a nutshell, for a self-regularized algorithm the complexity of the predictor is inherently controlled by that of the simplest comparator achieving the same empirical risk. This framework is sufficiently rich to cover both classical regularized empirical risk minimization and gradient descent. Building on self-regularization, we provide a thorough statistical analysis of such algorithms including minmax-optimal rates, where it suffices to show that the algorithm is self-regularized -- all further requirements stem from the learning problem itself. Finally, we discuss the problem of data-dependent hyperparameter selection, providing a general result which yields minmax-optimal rates up to a double logarithmic factor and covers data-driven early stopping for RKHS-based gradient descent.

Comment: Provides a general theoretical framework for implicit regularization via self-regularization, covering gradient descent and yielding optimal statistical rates.

Relevance: 10 Novelty: 9

3. Transformers are Bayesian Networks

ArXiv ID: 2603.17063

Authors: Gregory Coppola

Abstract: Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network. We establish this in five ways. First, we prove that every sigmoid transformer with any weights implements weighted loopy belief propagation on its implicit factor graph. One layer is one round of BP. This holds for any weights -- trained, random, or constructed. Formally verified against standard mathematical axioms. Second, we give a constructive proof that a transformer can implement exact belief propagation on any declared knowledge base. On knowledge bases without circular dependencies this yields provably correct probability estimates at every node. Formally verified against standard mathematical axioms. Third, we prove uniqueness: a sigmoid transformer that produces exact posteriors necessarily has BP weights. There is no other path through the sigmoid architecture to exact posteriors. Formally verified against standard mathematical axioms. Fourth, we delineate the AND/OR boolean structure of the transformer layer: attention is AND, the FFN is OR, and their strict alternation is Pearl's gather/update algorithm exactly. Fifth, we confirm all formal results experimentally, corroborating the Bayesian network characterization in practice. We also establish the practical viability of loopy belief propagation despite the current lack of a theoretical convergence guarantee. We further prove that verifiable inference requires a finite concept space. Any finite verification procedure can distinguish at most finitely many concepts. Without grounding, correctness is not defined. Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts. Formally verified against standard mathematical axioms.

Comment: Theoretical characterization of transformer layers as loopy belief propagation in Bayesian networks, with uniqueness results.

Relevance: 9 Novelty: 9

4. MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

ArXiv ID: 2603.16077

Authors: Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan

Abstract: Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. Second, we find that the function form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8$\times$ more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves 7.77 perplexity on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When extending the model size to 1.1B parameters, our model further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.

Comment: Compute-optimal diffusion language modeling via binary subtoken encoding, index shuffling, and scaling-law analysis.

Relevance: 9 Novelty: 8
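
Illustrative sketch: the abstract names Binary Encoding and Index Shuffling but does not spell out their form. The code below assumes one plausible reading: a fixed random permutation of token indices followed by a fixed-width binary expansion into sub-tokens. All function names and the seed are hypothetical, and the vocabulary size is only an example.

```python
import math
import random
from typing import List


def make_shuffle(vocab_size: int, seed: int = 0) -> List[int]:
    """Fixed random permutation of token indices (assumed form of index shuffling)."""
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    return perm


def encode(token: int, perm: List[int], n_bits: int) -> List[int]:
    """Map a token id to n_bits binary sub-tokens, most significant bit first."""
    idx = perm[token]
    return [(idx >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def decode(bits: List[int], inverse: List[int]) -> int:
    """Invert encode(): rebuild the shuffled index, then undo the permutation."""
    idx = 0
    for bit in bits:
        idx = (idx << 1) | bit
    return inverse[idx]


vocab_size = 50257  # GPT-2 BPE vocabulary size, used only as an example
n_bits = math.ceil(math.log2(vocab_size))  # 16 sub-tokens per token
perm = make_shuffle(vocab_size)
inverse = [0] * vocab_size
for token, idx in enumerate(perm):
    inverse[idx] = token
```

Under this reading, each token becomes a sequence of `n_bits` binary sub-tokens that can be masked individually, and the shuffle removes any dependence on the tokenizer's original index order.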

5. SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds

ArXiv ID: 2603.16535

Authors: Viktor Stein, Wuchen Li, Gabriele Steidl

Abstract: Transformers owe much of their empirical success in natural language processing to the self-attention blocks. Recent perspectives interpret attention blocks as interacting particle systems, whose mean-field limits correspond to gradient flows of interaction energy functionals on probability density spaces equipped with Wasserstein-$2$-type metrics. We extend this viewpoint by introducing accelerated attention blocks derived from inertial Nesterov-type dynamics on density spaces. In our proposed architecture, tokens carry both spatial (feature) and velocity variables. The time discretization and the approximation of accelerated density dynamics yield Hamiltonian momentum attention blocks, which constitute the proposed accelerated attention architectures. In particular, for linear self-attention, we show that the attention blocks approximate a Stein variational gradient flow, using a bilinear kernel, of a potential energy. In this setting, we prove that elliptically contoured probability distributions are preserved by the accelerated attention blocks. We present implementable particle-based algorithms and demonstrate that the proposed accelerated attention blocks converge faster than the classical attention blocks while preserving the number of oracle calls.

Comment: New attention architecture derived from inertial dynamics on density manifolds, yielding accelerated momentum attention blocks.

Relevance: 9 Novelty: 8

6. High-dimensional estimation with missing data: Statistical and computational limits

ArXiv ID: 2603.16712

Authors: Kabir Aladin Verchand, Ankit Pensia, Saminul Haque, Rohith Kuditipudi

Abstract: We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

Comment: Statistical-computational limits for high-dimensional estimation with missing data, including information-computation gaps.

Relevance: 9 Novelty: 8

7. BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

ArXiv ID: 2603.16590

Authors: Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu

Abstract: Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduce storage and runtime overhead, and we incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

Comment: Quantization method tailored to MXFP4 with block-wise affine transforms and Kronecker-efficient parameterization.

Relevance: 9 Novelty: 8
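
Illustrative sketch: the abstract's point that a global rotation moves outlier energy across quantization blocks, while a block-aligned transform does not, can be checked on a four-element toy vector. This is not BATQuant itself: the block size of 2 is a toy value (MXFP formats use blocks of 32 elements), and the block-wise transform here is a fixed orthogonal matrix rather than a learned affine one.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def block_max_abs(x: List[float], block: int) -> List[float]:
    """Largest magnitude per block; this is what sets each block's scale."""
    return [max(abs(v) for v in x[i:i + block]) for i in range(0, len(x), block)]


BLOCK = 2                      # toy block size, an assumption for readability
x = [8.0, 0.0, 0.1, -0.1]      # one outlier, confined to the first block

# Global rotation: normalized 4x4 Hadamard matrix mixing all elements.
H4: Matrix = [[0.5 * v for v in row] for row in
              [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]

# Block-diagonal rotation: a 2x2 Hadamard applied inside each block only.
r = 1.0 / math.sqrt(2.0)
B: Matrix = [[r, r, 0, 0], [r, -r, 0, 0], [0, 0, r, r], [0, 0, r, -r]]

before = block_max_abs(x, BLOCK)                  # [8.0, 0.1]
global_rot = block_max_abs(matvec(H4, x), BLOCK)  # outlier leaks into block 2
blockwise = block_max_abs(matvec(B, x), BLOCK)    # block 2 stays small
```

After the global rotation both blocks need a scale near 4, so the small values in the second block lose most of their resolution. The block-diagonal transform leaves the second block's scale close to its original 0.1.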

8. An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

ArXiv ID: 2603.16428

Authors: Ruijia Yang, Zeyi Wen

Abstract: Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory-intensive property exceeds the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) A lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O. (2) A highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage. (3) Optimized Triton kernels that address key bottlenecks, together with integrated advanced I/O. This collaborative design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining >95% peak performance on both NVIDIA and AMD GPUs.

Comment: Single-GPU fine-tuning system with heterogeneous memory management, asynchronous CPU/GPU overlap, and kernel co-design.

Relevance: 9 Novelty: 8

9. High-Dimensional Gaussian Mean Estimation under Realizable Contamination

ArXiv ID: 2603.16798

Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas

Abstract: We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed the realizable $ε$-contamination model. In this model an adversary can choose a function $r(x)$ between 0 and $ε$ and each sample $x$ goes missing with probability $r(x)$. Recent work (Ma et al., 2024) proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR) -- where missingness is independent of the data -- and Missing Not At Random (MNAR) -- where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under $ε$-realizable contamination.

Comment: SQ lower bounds and matching tradeoffs for Gaussian mean estimation under realizable contamination.

Relevance: 9 Novelty: 8
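
Illustrative sketch: the contamination model in the abstract (each sample $x$ goes missing with probability $r(x)$, with $0 \le r(x) \le ε$) is easy to simulate in one dimension. The particular adversary below, which drops only samples above the true mean, is a hypothetical choice made to show why the naive mean of the observed samples is biased. The paper's estimators are not reproduced here.

```python
import random
from typing import Callable, List, Optional

EPS = 0.5  # contamination level; any value in (0, 1) is allowed by the model


def adversary(x: float) -> float:
    """One admissible r(x): drop samples above the true mean with probability EPS."""
    return EPS if x > 0.0 else 0.0


def contaminate(samples: List[float],
                r: Callable[[float], float],
                rng: random.Random) -> List[Optional[float]]:
    """Each sample x independently goes missing (None) with probability r(x)."""
    return [None if rng.random() < r(x) else x for x in samples]


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # true mean is 0
observed = [x for x in contaminate(samples, adversary, rng) if x is not None]
naive_mean = sum(observed) / len(observed)
```

For this adversary the observed density is proportional to $φ(x)(1 - ε\,\mathbf{1}[x > 0])$, so the naive mean has expectation $-\frac{ε/\sqrt{2π}}{1 - ε/2}$, roughly $-0.27$ at $ε = 0.5$, no matter how many samples are drawn.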

10. Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency

ArXiv ID: 2603.17102

Authors: Lucas Bandarkar, Alan Ansell, Trevor Cohn

Abstract: Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While typically studied as a problem to be mitigated, in this work, we propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing for sets of languages where the model correctly recalls information with routing for languages where it fails. This allows us to isolate model components that play a functional role in answering about a piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate "success" and "failure" activation buckets and then (2) applying a statistical contrastive analysis to the MoE router logits to identify experts important for knowledge. To validate the necessity of this small number of experts for answering a knowledge question, we deactivate them and re-ask the question. We find that despite only deactivating about 20 out of 6000 experts, the model no longer answers correctly in over 40% of cases. Generally, this method provides a realistic and scalable knowledge localization approach to address increasingly complex LLMs.

Comment: MoE interpretability method that localizes factual knowledge by contrasting cross-lingual router behavior and causally validating expert necessity.

Relevance: 9 Novelty: 8
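
Illustrative sketch: the two-stage procedure in the abstract can be mimicked on toy router logits. The abstract does not specify the "statistical contrastive analysis", so the code below assumes the simplest version, a per-expert difference of mean logits between the success and failure buckets. Deactivating an expert by setting its logit to negative infinity is likewise an assumption, and all names and numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def top_contrastive_experts(success: Sequence[Sequence[float]],
                            failure: Sequence[Sequence[float]],
                            k: int) -> List[int]:
    """Rank experts by mean router logit in 'success' minus mean in 'failure'."""
    n_experts = len(success[0])
    scores = []
    for e in range(n_experts):
        diff = mean(row[e] for row in success) - mean(row[e] for row in failure)
        scores.append((diff, e))
    scores.sort(reverse=True)
    return [e for _, e in scores[:k]]


def deactivate(logits: Sequence[float], experts: Sequence[int]) -> List[float]:
    """Mask the chosen experts so a top-k router can never select them."""
    masked = list(logits)
    for e in experts:
        masked[e] = float("-inf")
    return masked


# Toy router logits over 4 experts; one row per (question, language) query.
success = [[2.0, 0.1, 0.0, 1.5], [1.8, 0.0, 0.2, 1.4]]
failure = [[0.2, 0.1, 0.1, 1.5], [0.1, 0.2, 0.0, 1.3]]

important = top_contrastive_experts(success, failure, k=1)
masked = deactivate(success[0], important)
```

In this toy data expert 0 fires strongly only when the answer is correct, so it is the one selected and masked. Expert 3 is active in both buckets and therefore carries no contrastive signal.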

11. Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets

ArXiv ID: 2603.16731

Authors: Kristi Topollai, Anna Choromanska

Abstract: Quantizing optimizer states is becoming an important ingredient of memory-efficient large-scale pre-training, but the resulting optimizer dynamics remain only partially understood. We study low-precision exponential moving average (EMA) optimizer states and show how quantization can cause many nominal updates to round back to the same stored value, making the state effectively stale and slowing adaptation beyond what the nominal decay would suggest. We then develop a simple predictive model of stalling that estimates one-step stalling probabilities and characterizes how stalling builds up over time after the initialization. This perspective provides a mechanistic explanation for why optimizer-state resets help in low precision: once a quantized EMA becomes effectively stale, resetting it can temporarily restore responsiveness. Motivated by this picture, we derive a simple theory-guided method for choosing useful reset periods, showing that in low precision the key question is not only whether resets help, but when they should be applied. Experiments in controlled simulations and LLM pre-training show that suitable reset schedules recover the performance lost to low-precision state storage while substantially reducing optimizer-state memory.

Comment: Analyzes low-precision optimizer-state dynamics in LLM pretraining, explaining EMA staleness and deriving theory-guided reset schedules for memory-efficient training.

Relevance: 9 Novelty: 8
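
Illustrative sketch: the staleness effect described in the abstract appears whenever an EMA update is smaller than half the spacing of the storage grid, because round-to-nearest then returns the previously stored value. The uniform grid below is a simplifying assumption (real low-precision formats are floating-point), and the constants are hypothetical.

```python
def quantize(x: float, step: float) -> float:
    """Round to the nearest multiple of `step` (uniform grid, an assumption)."""
    return round(x / step) * step


def ema_step(m: float, g: float, beta: float) -> float:
    """One exponential-moving-average update of state m toward gradient g."""
    return beta * m + (1.0 - beta) * g


beta, step = 0.999, 2.0 ** -4   # slow decay, coarse storage grid
m_full, m_quant = 0.0, 0.0      # full-precision and quantized states
stalled = 0
for _ in range(1000):
    m_full = ema_step(m_full, 1.0, beta)
    candidate = quantize(ema_step(m_quant, 1.0, beta), step)
    stalled += int(candidate == m_quant)   # update rounded back to same value
    m_quant = candidate
```

Here each nominal update is $(1 - β)(g - m) = 0.001$, far below `step / 2`, so the quantized state never leaves zero while the full-precision EMA reaches about 0.63 after 1000 steps. The general stall condition in this toy model is $|(1 - β)(g - m)| < \text{step}/2$.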

12. GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

ArXiv ID: 2603.16849

Authors: Mattia Rigotti, Nicholas Thumiger, Thomas Frick

Abstract: Adapting transformer positional encoding to meshes and graph-structured data presents significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and can inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate methods sacrifice gauge symmetry by design. Both failure modes cause catastrophic generalization in inductive learning, where models trained with one set of numerical choices fail when encountering different spectral decompositions of similar graphs or discretizations of the same mesh. We propose GIST (Gauge-Invariant Spectral Transformers), a new graph transformer architecture that resolves this challenge by achieving end-to-end $\mathcal{O}(N)$ complexity through random projections while algorithmically preserving gauge invariance via inner-product-based attention on the projected embeddings. We prove GIST achieves discretization-invariant learning with bounded mismatch error, enabling parameter transfer across arbitrary mesh resolutions for neural operator applications. Empirically, GIST matches state-of-the-art on standard graph benchmarks (e.g., achieving 99.50% micro-F1 on PPI) while uniquely scaling to mesh-based Neural Operator benchmarks with up to 750K nodes, achieving state-of-the-art aerodynamic prediction on the challenging DrivAerNet and DrivAerNet++ datasets.

Comment: Introduces a graph transformer with O(N) spectral positional encoding that preserves gauge invariance and includes theory for discretization-invariant neural operators.

Relevance: 9 Novelty: 8
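
Illustrative sketch: the gauge freedom mentioned in the abstract comes from eigenvectors being defined only up to sign (and up to rotation within a degenerate eigenspace). Attention scores built from inner products of spectral embeddings are unchanged by such transformations, as the toy check below shows. It does not model GIST's random projections, and the embeddings and gauge parameters are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def apply_gauge(phi: Vector, signs: Vector, theta: float) -> Vector:
    """Flip eigenvector signs, then rotate the first two coordinates by theta.

    Both steps are orthogonal maps, mimicking the arbitrary choices a
    numerical eigensolver can make.
    """
    x = [s * p for s, p in zip(signs, phi)]
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]] + x[2:]


phi_i = [0.5, -1.0, 2.0]    # spectral embedding of node i (toy values)
phi_j = [1.5, 0.25, -0.5]   # spectral embedding of node j (toy values)
signs = [1.0, -1.0, -1.0]

g_i = apply_gauge(phi_i, signs, theta=0.7)
g_j = apply_gauge(phi_j, signs, theta=0.7)

# The raw coordinates change, but the inner product does not.
assert g_i != phi_i
assert math.isclose(dot(phi_i, phi_j), dot(g_i, g_j), abs_tol=1e-12)
```

A model that consumes the raw coordinates sees a different input for each solver choice, whereas one that consumes only inner products sees the same input every time.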

13. Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks

ArXiv ID: 2603.16850

Authors: Xavier Gonzalez

Abstract: Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton's method using a parallel associative scan. However, these parallel Newton methods struggled with limitations, primarily inefficiency, instability, and lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method's approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the Largest Lyapunov Exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.

Comment: Develops parallel Newton and quasi-Newton methods to remove sequential bottlenecks in dynamical systems, with convergence theory tied to Lyapunov stability.

Relevance: 9 Novelty: 8
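
Illustrative sketch: the abstract describes reframing a sequential recurrence as a system of equations over the whole trajectory. The code below shows the simplest member of that family, a Jacobi (Picard-style) fixed-point iteration, rather than the thesis's Newton, quasi-Newton, or trust-region methods. The recurrence `f` and the input sequence are hypothetical.

```python
import math
from typing import List, Tuple


def f(s: float, x: float) -> float:
    """Toy contractive recurrence s_t = f(s_{t-1}, x_t)."""
    return math.tanh(0.5 * s + x)


def sequential(s0: float, xs: List[float]) -> List[float]:
    """Standard evaluation: one step after another."""
    states, s = [], s0
    for x in xs:
        s = f(s, x)
        states.append(s)
    return states


def jacobi(s0: float, xs: List[float], tol: float = 1e-12) -> Tuple[List[float], int]:
    """Update all time steps at once from the previous iterate.

    Each sweep is independent across t, so it could run in parallel.
    After k sweeps the first k states are exact, so at most len(xs)
    sweeps are ever needed.
    """
    states = [0.0] * len(xs)
    for sweep in range(1, len(xs) + 1):
        prev = [s0] + states[:-1]
        new = [f(p, x) for p, x in zip(prev, xs)]
        if max(abs(a - b) for a, b in zip(new, states)) < tol:
            return new, sweep
        states = new
    return states, len(xs)


xs = [0.1 * math.sin(t) for t in range(50)]
reference = sequential(0.0, xs)
parallel, sweeps = jacobi(0.0, xs)
```

Because this `f` contracts errors by at least a factor of two per step, the iteration reaches tolerance in fewer sweeps than the sequence length. For an expanding recurrence the early exit would not trigger and the method would degrade to `len(xs)` sweeps, which matches the abstract's point that dynamical stability decides whether parallelization helps.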

14. Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models

ArXiv ID: 2603.16440

Authors: Rishaank Gupta

Abstract: Large language model compression has made substantial progress through pruning, quantization, and low-rank decomposition, yet a fundamental limitation persists across all existing methods: compression budgets are allocated without any representation of what individual model components functionally encode. We term this the capability-blind compression problem and argue it is a root cause of two well-documented failures -- the insensitivity of perplexity-based evaluation to reasoning capability loss, and the abrupt phase transitions in model performance recently characterized by Ma et al. (2026). We propose Capability-Guided Compression (CGC), a framework that addresses this by using Sparse Autoencoder (SAE)-derived capability density maps to allocate differential compression budgets across transformer components. Capability density is a formally defined scalar measure combining the feature breadth, activation entropy, and cross-input consistency of a component's SAE feature activation distribution. We prove theoretically that components with higher capability density exhibit lower structural redundancy and reach their individual phase transition points at lower compression ratios, providing the first pre-compression mechanism for component-level phase transition prediction. Experiments on GPT-2 Medium confirm that capability density is statistically independent of Wanda importance scores (Spearman rho = -0.054, n = 384 heads), establishing it as a genuinely novel compression signal orthogonal to all existing importance metrics. We report a negative result on PPL-based compression comparison and provide a principled diagnosis identifying GPT-2 Medium as an insufficient test bed for the full CGC hypothesis. The theoretical framework, density formalism, and orthogonality finding constitute a foundation for capability-aware compression research.

Comment: Compression framework allocates pruning budgets using SAE-derived capability density, linking interpretability to component-level compression sensitivity.

Relevance: 9 Novelty: 8

15. NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference

ArXiv ID: 2603.18046

Authors: Zhaohui Geoffrey Wang

Abstract: When users query proprietary LLM APIs, they receive outputs with no cryptographic assurance that the claimed model was actually used. Service providers could substitute cheaper models, apply aggressive quantization, or return cached responses -- all undetectable by users paying premium prices for frontier capabilities. We present NANOZK, a zero-knowledge proof system that makes LLM inference verifiable: users can cryptographically confirm that outputs correspond to the computation of a specific model. Our approach exploits the fact that transformer inference naturally decomposes into independent layer computations, enabling a layerwise proof framework where each layer generates a constant-size proof regardless of model width. This decomposition sidesteps the scalability barrier facing monolithic approaches and enables parallel proving. We develop lookup table approximations for non-arithmetic operations (softmax, GELU, LayerNorm) that introduce zero measurable accuracy loss, and introduce Fisher information-guided verification for scenarios where proving all layers is impractical. On transformer models up to d=128, NANOZK generates constant-size layer proofs of 5.5KB (2.1KB attention + 3.5KB MLP) with 24 ms verification time. Compared to EZKL, NANOZK achieves 70x smaller proofs and 5.7x faster proving time at d=128, while maintaining formal soundness guarantees (epsilon < 1e-37). Lookup approximations preserve model perplexity exactly, enabling verification without quality compromise.

Comment: Systems/methodology contribution for verifiable transformer inference via layerwise zero-knowledge proofs with constant-size per-layer proofs.

Relevance: 8 Novelty: 9

16. Decoding the Critique Mechanism in Large Reasoning Models

ArXiv ID: 2603.16331

Authors: Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan

Abstract: Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating through the chain-of-thought (CoT), resulting in an incorrect intermediate conclusion, the model still reaches the correct final answer. This recovery implies that the model must possess an internal mechanism to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at https://github.com/mail-research/lrm-critique-vectors.

Comment: Representation-learning analysis of hidden critique behavior in reasoning models via an interpretable latent critique vector.