Personalized Monthly Topic Summary 2026/03
| Metric | Value |
|---|---|
| Total Papers | 546 |
| Model Architecture | 152 |
| Model Compression and Efficiency | 159 |
| High Performance Computing | 78 |
| Representation Learning | 134 |
| Other Foundational Research | 23 |
Model Architecture (152)
- Functorial Neural Architectures from Higher Inductive Types - Score: 20 (R=10, N=10) - Date: 2026-03-18 - Comment: Introduces a new architecture class with formal compositional-generalization guarantees via functoriality, and proves self-attention is non-functorial for nontrivial compositional tasks.
- The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks - Score: 20.0 (R=0, N=0) - Date: 2026-03-06 - Comment: Author match
- Any-Subgroup Equivariant Networks via Symmetry Breaking - Score: 19 (R=10, N=9) - Date: 2026-03-23 - Comment: Architecture theory for equivariant networks: a single model attains any subgroup equivariance through symmetry-breaking inputs with universality guarantees.
- ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit - Score: 19 (R=10, N=9) - Date: 2026-03-19 - Comment: Theoretical characterization of large-scale ResNet training dynamics with rigorous convergence rates in the joint infinite depth-width-dimension limit.
- Learning to Recall with Transformers Beyond Orthogonal Embeddings - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: Transformer theory under finite data and non-orthogonal embeddings, yielding explicit storage-capacity scalings.
- Mamba-3: Improved Sequence Modeling using State Space Principles - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: State-space sequence architecture with complex recurrence and MIMO design improving the performance-efficiency frontier.
- M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling - Score: 19 (R=10, N=9) - Date: 2026-03-16 - Comment: Introduces matrix-valued nonlinear recurrent layers as a scalable core architecture with stronger expressivity than standard transformer blocks.
- Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks - Score: 19 (R=10, N=9) - Date: 2026-03-13 - Comment: Provides a proof that attention sinks are functionally necessary in softmax Transformers for trigger-conditional computation.
- Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation - Score: 19 (R=10, N=9) - Date: 2026-03-06 - Comment: Model Architecture (MoE): universal expert pool with virtual width (depth–width transformation), staggered rotational sharing, and depth-aware load balancing/routing.
- Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget - Score: 19 (R=10, N=9) - Date: 2026-03-05 - Comment: Conditional routing that replaces Transformer MLPs with linear surrogates when possible; dynamic networks/efficiency and architectural analysis.
- Recursive Models for Long-Horizon Reasoning - Score: 19 (R=10, N=9) - Date: 2026-03-03 - Comment: Model Architecture: formalizes recursive models enabling long-horizon reasoning with provable reductions in active context requirements beyond single-sequence methods.
- Transformers are Stateless Differentiable Neural Computers - Score: 18 (R=10, N=8) - Date: 2026-03-23 - Comment: Model architecture/theory: formally derives causal Transformers as stateless differentiable neural computers with external memory semantics.
- Path-Constrained Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-03-19 - Comment: MoE architecture innovation: constraining cross-layer expert path space by sharing routers across layers.
- Learning When to Attend: Conditional Memory Access for Long-Context LLMs - Score: 18 (R=10, N=8) - Date: 2026-03-19 - Comment: Conditional attention architecture for long-context LLMs that learns token-wise global memory access.
- Mixture-of-Depths Attention - Score: 18 (R=10, N=8) - Date: 2026-03-17 - Comment: Introduces a new transformer attention primitive that mixes current-layer and cross-layer KV access, with an accompanying hardware-efficient algorithm nearly matching FlashAttention-2 efficiency.
- PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers - Score: 18 (R=10, N=8) - Date: 2026-03-16 - Comment: Replaces transformer attention with a learnable Fourier-solved PDE state-space block, a core architectural innovation for efficient spatial mixing.
- Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing - Score: 18 (R=10, N=8) - Date: 2026-03-13 - Comment: Model architecture innovation: threshold-based MoE routing gives causal dynamic computation allocation with load balancing without auxiliary losses.
- Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Model Architecture/Efficiency: MoE scaling law optimizing expert vs. attention FLOPs; explicit formula for optimal compute allocation under sparsity.
- Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Exact theory of transformer position bias at initialization; matches Model Architecture: analysis/innovations on transformers and training dynamics.
- MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: HPC/efficiency for MoE: speculative decoding as lookahead for memory management with dynamic partitioning and async prefetch/eviction.
- ConFu: Contemplate the Future for Better Speculative Sampling - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Speculative decoding with contemplate tokens and MoE gating to boost acceptance; matches Model Compression and Efficiency and Mixture-of-Experts.
- On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Optimizer/Scaling Theory: introduces operator-norm-based geometry with mean-normalized, layerwise composable norms enabling width-independent smoothness and learning-rate transfer; proposes row/column-normalized optimizers (MOGA).
- Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Model Architecture (MoE): Bayesian variational routing confined to expert selection for calibrated, uncertainty-aware MoE Transformers with <1% extra FLOPs.
- Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Model Architecture: theoretical expressivity/efficiency benefits of hybrid Transformer+SSM models over non-hybrids.
- Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: Training dynamics theory: gradient flow on value–softmax drives low-entropy outputs, explaining attention phenomena.
- Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: Theory of model architecture/expressivity: Lie-algebraic analysis of depth in parallelizable sequence models (Transformers/SSMs).
- PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Model Architecture/Efficiency: training-free, adapter-free 2D-to-3D lifting operator (PlaneCycle) enabling 3D fusion while reusing 2D backbones.
- Data-Aware Random Feature Kernel for Transformers - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Matches Compression/Efficiency and Model Architecture: data-aware random-feature attention (learned covariance) enabling importance-sampled linear attention (DARKFormer).
- The Expressive Limits of Diagonal SSMs for State-Tracking - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Strongly matches Model Architecture (theoretical analysis): expressivity limits of diagonal SSMs for state-tracking with precise group-theoretic characterization.
- Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Training Dynamics of Self-Attention: structure-aware preconditioned gradient descent with spectral initialization yields geometric-rate global convergence.
- TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: High Performance Computing + MoE: heterogeneous GPU–CPU–DIMM-NDP offloading with bottleneck-aware expert scheduling for high-throughput MoE inference.
- Expert Divergence Learning for MoE-based Language Models - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Architecture (MoE): encourages expert specialization via label-driven Jensen–Shannon divergence on routing distributions.
- Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency + MoE: token-aware adaptive error compensation using routed low-rank mixture-of-experts for PTQ of VLMs.
- Transformers are Bayesian Networks - Score: 18 (R=9, N=9) - Date: 2026-03-18 - Comment: Theoretical characterization of transformer layers as loopy belief propagation in Bayesian networks, with uniqueness results.
- A Family of LLMs Liberated from Static Vocabularies - Score: 18 (R=9, N=9) - Date: 2026-03-17 - Comment: Core transformer architecture redesign replacing static token vocabularies with hierarchical byte-level encoding/decoding.
- Local Urysohn Width: A Topological Complexity Measure for Classification - Score: 18 (R=9, N=9) - Date: 2026-03-17 - Comment: Develops a new theoretical complexity measure for classification based on local Urysohn width, with hierarchy and sample-complexity results.
- From Gradients to Riccati Geometry: Kalman World Models for Single-Pass Learning - Score: 18 (R=9, N=9) - Date: 2026-03-14 - Comment: Proposes a gradient-free training paradigm for state-space models and transformers using Kalman-style recursive filtering, with stability and complexity analysis.
- Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding - Score: 18 (R=9, N=9) - Date: 2026-03-09 - Comment: High Performance Computing and Architecture: formal analysis and lower bounds for grammar-constrained decoding; connects to Transformers/MoE with latency envelopes.
- Exclusive Self Attention - Score: 17 (R=10, N=7) - Date: 2026-03-11 - Comment: Model Architecture: Exclusive Self Attention modifies Transformer attention to exclude self-position information, improving long-sequence modeling.
- On the Ability of Transformers to Verify Plans - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Transformer theory: introduces C*-RASP and proves length-generalization guarantees for plan verification with growing vocabulary size.
- Neural Dynamics Self-Attention for Spiking Transformers - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Introduces a new spiking self-attention mechanism that adds locality bias and removes explicit attention-matrix storage to cut inference memory.
- Speculating Experts Accelerates Inference for Mixture-of-Experts - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: MoE inference systems method that speculates future experts to overlap CPU-GPU transfers with compute under expert offloading.
- Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Training efficiency method: a lower-overhead whitening optimizer for faster transformer training with convergence analysis.
- CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: KV-cache-efficient attention architecture conversion: covariance-aware factorization and nonuniform rank allocation for converting GQA to MLA.
- Attention Sinks Induce Gradient Sinks - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Mechanistic Transformer analysis linking attention sinks to gradient sinks and massive activations through backpropagation dynamics.
- Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: MoE interpretability method that localizes factual knowledge by contrasting cross-lingual router behavior and causally validating expert necessity.
- GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Introduces a graph transformer with O(N) spectral positional encoding that preserves gauge invariance and includes theory for discretization-invariant neural operators.
- SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: New attention architecture derived from inertial dynamics on density manifolds, yielding accelerated momentum attention blocks.
- MoLoRA: Composable Specialization via Per-Token Adapter Routing - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Model architecture: per-token adapter routing with Mixture-of-LoRA enables composable specialization within a single sequence.
- Directional Routing in Transformers - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Proposes a lightweight transformer routing mechanism where attention heads use learned suppression directions controlled by a shared router, yielding a core architectural change analyzed mechanistically.
- Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Presents gauge-equivariant intrinsic neural operators, a core operator-learning architecture with strong geometry-consistency guarantees.
- Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Training dynamics analysis: spectral-edge SVD reveals low-rank signal-noise structure and phase transitions in transformer optimization trajectories.
- Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Optimization theory for transformers trained with cross-entropy: derives complex-singularity step-size bounds from softmax geometry with a cheap JVP-based safety criterion.
- Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: NTK-based theory for linearized attention showing non-convergence and introducing influence malleability as a core property.
- As Language Models Scale, Low-order Linear Depth Dynamics Emerge - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Core architecture analysis: identifies low-order linear surrogate dynamics emerging across transformer depth as models scale.
- Marginals Before Conditionals - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Training Dynamics/Representation: Minimal conditional learning task revealing plateau/transition and selector-routing head dynamics.
- RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: High-Performance Computing: General operator fusion for cascaded reductions (e.g., safe softmax+GEMM in attention) with formal analysis and auto kernel generation.
- From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture/Representation Learning: a hierarchical masked autoencoder with a cascaded decoder and progressive masking curriculum for multi-granular representation learning.
- Quantifying the Necessity of Chain of Thought through Opaque Serial Depth - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Representation Learning/Architecture Theory: formalizes opaque serial depth to bound non-externalized reasoning in neural nets; includes analysis showing Mixture-of-Experts likely has lower opaque depth than dense models.
- A Variational Latent Equilibrium for Learning in Cortex - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Training Dynamics/Architecture: proposes a variational latent equilibrium framework approximating BPTT with fully local dynamics, unifying energy-based spatiotemporal credit assignment.
- Generalized Reduction to the Isotropy for Flexible Equivariant Neural Fields - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture: general orbit-equivalence reduction enabling flexible equivariant neural fields under arbitrary group actions.
- Permutation-Equivariant 2D State Space Models: Theory and Canonical Architecture for Multivariate Time Series - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture and Theory: derives canonical permutation-equivariant 2D state-space form and proposes VI 2D SSM/Mamba, eliminating variable-axis ordering and reducing dependency depth.
- RAC: Rectified Flow Auto Coder - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: Architecture: Rectified Flow-based autoencoder enabling multi-step, bidirectional inference and reduced parameters.
- Functionality-Oriented LLM Merging on the Fisher--Rao Manifold - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Model Architecture/Systems: functionality-oriented LLM merging via Fisher–Rao Karcher mean with a practical fixed-point algorithm; prevents collapse and scales to N>2 experts.
- The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Model Architecture/Training Dynamics: shows how CNN locality and weight sharing reshape implicit regularization at EoS, explaining superior generalization.
- CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Architecture: new symplectic Causal Hamiltonian Learning Unit conserving phase-space volume to stabilize long-horizon memory.
- Spectral Condition for $\mu$P under Width-Depth Scaling - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: High Performance Computing/Training Dynamics: unified spectral μP condition for stable width–depth scaling and hyperparameter transfer across optimizers.
- Memory Caching: RNNs with Growing Memory - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Matches Model Architecture and Efficiency criteria: introduces Memory Caching to grow RNN effective memory with sequence length, interpolating between RNN and Transformer memory-compute trade-offs.
- Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Representation/post-training insight: shows large pretrained models contain dense nearby task experts, enabling parallel random perturbation selection and ensembling.
- Chemical Reaction Networks Learn Better than Spiking Neural Networks - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Theoretical architecture result proving stronger expressivity of chemical reaction networks than spiking neural networks, with regret and VC-dimension analysis.
- AIMER: Calibration-Free Task-Agnostic MoE Pruning - Score: 16 (R=9, N=7) - Date: 2026-03-20 - Comment: Calibration-free pruning criterion for MoE experts, directly addressing model compression and serving efficiency.
- LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing - Score: 16 (R=9, N=7) - Date: 2026-03-14 - Comment: Model compression for MoE: replaces redundant experts with parameter-efficient modules to reduce memory without full expert merging/pruning.
- The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers - Score: 16 (R=9, N=7) - Date: 2026-03-12 - Comment: Representation/Architecture Analysis: Identifies binary routing in Transformer FFNs, explaining conditional computation behavior.
- SCORE: Replacing Layer Stacking with Contractive Recurrent Depth - Score: 16 (R=9, N=7) - Date: 2026-03-12 - Comment: Model Architecture: Replaces layer stacking with contractive recurrent depth (ODE-inspired shared block) across MLP/GNN/Transformer.
- Routing without Forgetting - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Model Architecture: embeds energy-based associative retrieval (Modern Hopfield) within transformers for input-conditioned dynamic routing in online continual learning without gradient specialization.
- Warm Starting State-Space Models with Automata Learning - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model architecture/theory: proves exact realization of Moore machines as state-space models and uses symbolic automata to warm-start SSMs.
- Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Architecture/efficiency: single dense Transformer encoder unifying modalities, replacing MoE/routing with shared parameters.
- Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts - Score: 16 (R=9, N=7) - Date: 2026-03-05 - Comment: Systematic study of ensembling/merging/routing among parameter-efficient experts; experts/routing (MoE-style) for multi-task efficiency.
- TiledAttention: a CUDA Tile SDPA Kernel for PyTorch - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: High Performance Computing: editable CUDA tile SDPA kernel enabling schedule-level research with online softmax and tiled KV streaming for attention efficiency.
- CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning - Score: 16 (R=9, N=7) - Date: 2026-03-02 - Comment: Model Architecture: Mixture-of-Experts with stage-aligned experts and routing for hybrid-capabilities reasoning.
- Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training - Score: 16 (R=9, N=7) - Date: 2026-03-02 - Comment: Representation Learning/Training Dynamics: analyzes optimizer-induced low-dimensional drift and transverse dynamics in transformer parameter trajectories.
- NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Transformer architecture innovation: Gibbs/Ising attention with game-theoretic token valuation and convergence analysis.
- AS2 -- Attention-Based Soft Answer Sets: An End-to-End Differentiable Neuro-Soft-Symbolic Reasoning Architecture - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Model architecture: proposes a fully differentiable neuro-symbolic reasoning architecture using a soft fixed-point approximation to ASP consequence operators.
- An SO(3)-equivariant reciprocal-space neural potential for long-range interactions - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Core architecture innovation: SO(3)-equivariant reciprocal-space message passing to model long-range interactions consistently.
- Variational Phasor Circuits for Phase-Native Brain-Computer Interface Classification - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Introduces a new phase-native classifier architecture on the S^1 manifold using trainable phase shifts, unitary mixing, and interference instead of dense real-valued layers.
- LoST: Level of Semantics Tokenization for 3D Shapes - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core architecture/tokenization design for generative 3D models by ordering tokens by semantic salience rather than geometric level-of-detail.
- Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core architecture innovation for flow-matching control: replacing fixed-time integration with time-unconditional optimization for adaptive compute and OOD detection.
- Gaussian Process Limit Reveals Structural Benefits of Graph Transformers - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Model architecture theory: derives GP limits for graph transformers and proves structural anti-oversmoothing benefits over graph convolutions.
- Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core architecture work on scalable continuous SE(3)-equivariant kernels using coordinate-based convolution design.
- The Phasor Transformer: Resolving Attention Bottlenecks on the Unit Circle - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Model architecture/efficiency: replaces quadratic attention with unit-circle phase blocks plus DFT-based global token mixing in O(N log N).
- Transformers Can Learn Rules They've Never Seen: Proof of Computation Beyond Interpolation - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Gives a theoretical and empirical analysis of transformers' ability to compute unseen rules beyond interpolation, including circuit-level evidence.
- Demystifing Video Reasoning - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Mechanistic analysis of diffusion-transformer reasoning that identifies denoising-step dynamics and layer specialization as the core substrate.
- Self-Aware Markov Models for Discrete Reasoning - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Proposes a discrete reasoning architecture with self-correcting remasking and adaptive stopping, extending masked diffusion-style models with dynamic computation.
- NeuronSpark: A Spiking Neural Network Language Model with Selective State Space Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Introduces a pure spiking language-model architecture with selective state-space dynamics and custom training/stabilization methods.
- Deriving Hyperparameter Scaling Laws via Modern Optimization Theory - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Optimization-theoretic derivation of hyperparameter scaling laws for learning rate, momentum, and batch size.
- PhasorFlow: A Python Library for Unit Circle Based Computing - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Core architecture proposal: unit-circle/phasor computation framework with variational phasor circuits and a DFT-based transformer alternative to attention.
- Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Model architecture: memory-augmented transformer designed for unlearning by deleting instance-specific keys instead of updating weights.
- Ablate and Rescue: A Causal Analysis of Residual Stream Hyper-Connections - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Mechanistic analysis of multi-stream transformer residual architectures using causal stream ablation-and-rescue interventions.
- Universe Routing: Why Self-Evolving Agents Need Epistemic Control - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Conditional/modular architecture idea: explicit hard routing across epistemically incompatible solvers, with MoE-style comparison and continual expansion results.
- Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Proposes a theoretical framework and scaling law for when smaller orchestrated models can outperform larger ones, directly addressing foundational model-scaling assumptions.
- D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Agent memory architecture with reward-prediction-error routing that cuts long-term memory write complexity from O(N^2) to selective O(1)/O(N) paths.
- Towards One-for-All Anomaly Detection for Tabular Data - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Generalist tabular anomaly detection architecture using transferable neighbor-distance representations and MoE fusion across unseen datasets.
- From Specification to Architecture: A Theory Compiler for Knowledge-Guided Machine Learning - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Foundational architecture-generation agenda: compiling typed domain theories into provably theory-consistent model architectures.
- Sampling Boltzmann distributions via normalizing flow approximation of transport maps - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Puts normalizing-flow Boltzmann sampling on firm mathematical footing with existence and approximation results for low-regularity targets.
- Equivalence of approximation by networks of single- and multi-spike neurons - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Architecture theory for spiking networks: proves approximation-equivalence between single-spike and multi-spike neuron networks up to linear overhead.
- Scalable Machines with Intrinsic Higher Mental-State Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Presents a core architectural modification to transformer attention via triadic modulation loops that pre-select relevant information with claimed linear-time scaling.
- HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal Understanding - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Dynamic composition architecture with typed causal primitives and routing into differentiable execution graphs directly targets core model architecture design.
- Separable neural architectures as a primitive for unified predictive and generative intelligence - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Introduces separable neural architectures as a core architectural primitive that factorizes high-dimensional mappings via controlled interaction order and tensor rank.
- Geometry-Aware Probabilistic Circuits via Voronoi Tessellations - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Probabilistic modeling architecture: geometry-aware probabilistic circuits with Voronoi-structured sum nodes and tractability conditions.
- Flowers: A Warp Drive for Neural PDE Solvers - Score: 16 (R=8, N=8) - Date: 2026-03-06 - Comment: Model Architecture: warp-based operator network (no attention/Fourier/convolution) enabling linear-cost global interactions for PDE solution operators.
- Scalable Prompt Routing via Fine-Grained Latent Task Discovery - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Uses fine-grained latent task discovery plus a mixture-of-experts router, making the main contribution a core conditional architecture.
- DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Hardware-aware Transformer efficiency method: distribution-aware piecewise activations for faster on-device inference and training.
- Towards Solving Polynomial-Objective Integer Programming with Hypergraph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Hypergraph neural network architecture for polynomial-objective integer programs, explicitly modeling high-degree term-variable-constraint interactions.
- Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Mixture-of-Experts post-training recipe combining Cascade RL with multi-domain on-policy distillation for a compact high-capacity model.
- DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Targets efficient MoE inference with dynamic expert orchestration and mixed-precision quantization on edge hardware.
- Transformers Learn Robust In-Context Regression under Distributional Uncertainty - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Analyzes Transformer in-context regression under broad distributional uncertainty, probing a core capability of the architecture.
- TARo: Token-level Adaptive Routing for LLM Test-time Alignment - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Token-level adaptive routing is a conditional/dynamic network mechanism for inference-time control of LLM reasoning.
- From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Mechanistic representation analysis of MLLMs, pinpointing how segmentation information degrades in the adapter and is recovered through attention dynamics in later layers.
- Dependence Fidelity and Downstream Inference Stability in Generative Models - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Foundational theory for generative models: shows marginal matching can fail to preserve dependence structure and gives covariance-level guarantees for downstream inference stability.
- Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Uses sparse autoencoders to decode steering vectors in a 35B MoE, probing and causally intervening on internal behavioral representations.
- Parallel In-context Learning for Large Vision Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Efficiency method for Transformer-based multimodal in-context learning: parallel chunking plus Product-of-Experts aggregation reduces quadratic context-cost at inference.
- Selective Memory for Artificial Intelligence: Write-Time Gating with Hierarchical Archiving - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Write-time gating with hierarchical archival is a memory-architecture contribution for selective external knowledge storage and retrieval efficiency.
-
Tackling Over-smoothing on Hypergraphs: A Ricci Flow-guided Neural Diffusion Approach - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Theoretical and methodological hypergraph architecture work: Ricci-flow-guided neural diffusion to control message passing and mitigate over-smoothing.
-
CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Introduces a new Transformer-based continual-learning architecture with dynamic neuron thresholds and gated head selection.
-
AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Proposes replacing transformer backbones with deep state space models in a vision-language-action architecture for efficient multimodal sequence modeling.
-
Masked BRep Autoencoder via Hierarchical Graph Transformer - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Core architecture and representation learning: masked graph autoencoder with hierarchical graph Transformer for self-supervised CAD representation learning.
-
AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Low-rank adapter method with zero initialization and a rank-capacity theory for frozen Vision Transformers.
-
On the Degrees of Freedom of Gridded Control Points in Learning-Based Medical Image Registration - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Sparse control-point deformation with cross-attention targets core architecture/memory efficiency for 3D registration.
-
WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Model architecture: system-aware Mixture-of-Experts with structural embeddings for scalable world models across heterogeneous robots.
-
Representation Alignment for Just Image Transformers is not Easier than You Think - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Core architecture/training insight: analyzes why representation alignment fails in pixel-space diffusion transformers and introduces a corrected alignment method.
-
Human-like Object Grouping in Self-supervised Vision Transformers - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Representation analysis in vision transformers: quantifies object-centric patch similarity and links Gram structure to human-like grouping.
-
Exploring the Dimensions of a Variational Neuron - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Introduces a neuron-level variational computational unit with explicit prior/posterior and analyzes latent dimensionality as a core architectural primitive.
-
PA-Net: Precipitation-Adaptive Mixture-of-Experts for Long-Tail Rainfall Nowcasting - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Precipitation-adaptive MoE dynamically allocates experts by token intensity, a clear conditional-network architectural idea.
-
Routing Channel-Patch Dependencies in Time Series Forecasting with Graph Spectral Decomposition - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Core architecture method for adaptive routing of channel dependencies using graph spectral decomposition and frequency-specific experts.
-
Deep Invertible Autoencoders for Dimensionality Reduction of Dynamical Systems - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Core autoencoder architecture contribution: invertible autoencoders for dimensionality reduction that mitigate projection-error plateaus as latent dimension grows.
-
Event-Driven Video Generation - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Core architecture innovation for video transformers: event-gated sampling adds explicit interaction structure to DiT generation.
-
NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: MoE-style PEFT architecture with context-aware neuromodulation gating and orthogonality regularization for better expert separation.
-
Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Mechanistic interpretability of transformers by localizing demographic bias to individual attention heads in CLIP's vision encoder.
-
Context-dependent manifold learning: A neuromodulated constrained autoencoder approach - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Autoencoder architecture for context-dependent manifold learning using neuromodulated geometric constraints.
-
Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Riemannian geometry-preserving VAE for SPD matrices — matches Model Architecture (Autoencoders) and Representation Learning on manifolds.
-
ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Model Architecture/Efficiency: Mixture-of-LoRAs with reinforcement-based router enabling dynamic conditional routing in finetuning.
-
Bridging Domains through Subspace-Aware Model Merging - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Model Architecture: subspace-aware model merging (SCORE) resolving singular subspace conflicts via shared orthogonal basis and pruning.
-
Recursive Inference Machines for Neural Reasoning - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Model Architecture — introduces Recursive Inference Machines that embed recursive inference mechanisms; generalizes TRMs with a reweighting component for neural reasoning.
-
Symbol-Equivariant Recurrent Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture: enforces permutation equivariance in recurrent reasoning models via symbol-equivariant layers for symmetry-aware reasoning.
-
Phase-Type Variational Autoencoders for Heavy-Tailed Data - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture: introduces a Phase-Type (CTMC absorption-time) decoder in VAEs for heavy-tailed generative modeling.
-
Invariant-Stratified Propagation for Expressive Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture: Invariant-Stratified Propagation (ISP) with a WL variant and neural implementation for more expressive GNNs.
-
Spectral Attention Steering for Prompt Highlighting - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: training-free attention steering via spectral key editing compatible with FlashAttention; query-adaptive expert routing.
-
Polynomial Mixing for Efficient Self-supervised Speech Encoders - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: Polynomial Mixer as a linear-time token-mixing replacement for self-attention in encoders.
-
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches Model Architecture criterion: multi-resolution Vision Transformer with shared world-coordinate embeddings and extended RoPE for scale-consistent attention.
-
Intrinsic Lorentz Neural Network - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Model Architecture: fully intrinsic hyperbolic (Lorentz) neural network with novel point-to-hyperplane layer and intrinsic normalization/operators.
-
ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches Model Architecture/Efficiency criterion: conditional/dynamic routing between Fast and Slow agents with free-energy-based fusion for test-time compute scaling in LLM reasoning.
-
Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Model Architecture (MoE): integrates DINT attention with a Sparse Mixture-of-Experts for modality-shared and routed experts in a multimodal foundation model.
Model Compression and Efficiency (159)
-
AI+HW 2035: Shaping the Next Decade - Score: 20.0 (R=0, N=0) - Date: 2026-03-06 - Comment: Author match
-
A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks - Score: 19 (R=10, N=9) - Date: 2026-03-19 - Comment: Foundational sparsity theory for infinite-width ReLU networks using dual certificates in TV-regularized training.
-
The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training - Score: 19 (R=10, N=9) - Date: 2026-03-12 - Comment: Analyzes anisotropy and mean-bias as rank-one driver of FP4 instability and proposes mean subtraction — matches Model Compression and Efficiency: quantization stability.
-
SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity - Score: 19 (R=10, N=9) - Date: 2026-03-06 - Comment: Model Compression and Efficiency + Systems: enables (2N−2):2N structured sparsity (e.g., 6:8) on 2:4 Sparse Tensor Cores via sliding-window decomposition and activation lifting, achieving near-theoretical speedups with preserved accuracy.
-
WaterSIC: information-theoretically (near) optimal linear layer quantization - Score: 19 (R=10, N=9) - Date: 2026-03-06 - Comment: Model Compression and Efficiency — Quantization: proposes WaterSIC, an information-theoretically near-optimal linear layer quantizer with waterfilling-style rate allocation and provable 0.255-bit rate gap.
-
Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization - Score: 19 (R=10, N=9) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: curvature-aware MDL framework for layer-adaptive capacity allocation/pruning (e.g., expert slots, LoRA ranks) with closed-form solutions and regret bounds.
-
On De-Individuated Neurons: Continuous Symmetries Enable Dynamic Topologies - Score: 19 (R=10, N=9) - Date: 2026-03-02 - Comment: Matches Model Architecture and Compression/Efficiency criteria: introduces isotropic activation primitives enabling dynamic topology (neurogenesis/degeneration) and exact connectivity pruning with sparsity.
-
Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys - Score: 18 (R=10, N=8) - Date: 2026-03-16 - Comment: Model compression and efficiency: unifies KV-cache compression and sparse attention retrieval via self-indexing 1-bit quantized keys with custom CUDA integration.
-
Leech Lattice Vector Quantization for Efficient LLM Compression - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Model compression and efficiency: high-dimensional Leech lattice vector quantization with codebook-free indexing and parallel dequantization.
-
LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: KV cache eviction with learned importance prediction without draft generation — matches Model Compression and Efficiency: cache/memory optimization for LLM inference.
-
Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Model Compression and Efficiency — differentiable L0 sparsity via relaxed Bernoulli gates to discover Strong Lottery Tickets without training weights.
-
Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Compression/Efficiency: proposes Overflow-Aware Scaling and Macro Block Scaling to improve 4-bit MXFP4 quantization fidelity for LLMs without hardware changes.
-
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: Model Compression and Efficiency (MoE): non-uniform layer-wise expert pruning using a stable ESAP proxy and evolutionary search to optimize memory/throughput under a budget.
-
Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices - Score: 18 (R=10, N=8) - Date: 2026-03-06 - Comment: HPC/Memory Optimization + Compression: persistent 4-bit KV-cache with direct restoration eliminates re-prefill, enabling multi-agent edge inference; up to 136x TTFB reduction and 4x memory density.
-
Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection - Score: 18 (R=10, N=8) - Date: 2026-03-06 - Comment: Model Compression and Efficiency — KV cache/memory optimization via low-dimensional queries/keys and SVD compression; theoretical log(N) selection dimension; 75% key cache savings.
-
One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache - Score: 18 (R=10, N=8) - Date: 2026-03-06 - Comment: Model Compression and Efficiency: token-wise adaptive low-rank KV-cache compression with dynamic per-token rate allocation (post-training), orthogonal to pruning.
-
Dissecting Quantization Error: A Concentration-Alignment Perspective - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Provides a principled SQNR-based theory of quantization error (concentration+alignment) and introduces CAT transforms—model compression/quantization.
-
Spectral Surgery: Training-Free Refinement of LoRA via Gradient-Guided Singular Value Reweighting - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Model Compression and Efficiency: low-rank LoRA refinement via SVD-based singular value reweighting; training-free parameter editing.
-
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Matches Model Architecture and Efficiency: tokenizer-free hierarchical byte-level LM with compression-driven segmentation and Top-K selection for a static compute graph.
-
Multi-Head Low-Rank Attention - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Compression/Efficiency and HPC: low-rank attention with partitionable latent heads enabling TP-friendly decoding and reduced KV cache I/O.
-
3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: introduces a 3-block ADMM for sparse+low-rank LLM decomposition and transformer-level matching refinement with convergence guarantees.
-
Attn-QAT: 4-Bit Attention With Quantization-Aware Training - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency—4-bit quantization-aware training for attention (FP4) with stable backward recomputation and fused kernels.
-
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: low-rank approximation of optimizer states to cut memory while maintaining performance in LLM training.
-
GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: zero-finetuning post-hoc blockwise compensation via Gram-matrix linear reconstruction to restore compressed network behavior.
-
Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition - Score: 18 (R=9, N=9) - Date: 2026-03-23 - Comment: Compression theory for matrix multiplication: derives optimal scalar quantization densities and phase transitions with closed-form analysis.
-
ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning - Score: 17 (R=10, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: improves one-shot LLM pruning (SparseGPT) via loss-driven two-level reordering of columns/blocks to reduce pruning error.
-
TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Model compression/efficiency: proposes on-the-fly activation-aware test-time quantization that adapts per prompt without retraining.
-
Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression - Score: 17 (R=9, N=8) - Date: 2026-03-20 - Comment: Model compression and efficiency: provides theory and experiments on compression order in joint pruning–quantization, including the Progressive Intensity Hypothesis.
-
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: High-performance systems: hardware-aware lossless compression with fused decompression-GEMM for faster, memory-efficient LLM inference on GPUs.
-
High-Dimensional Gaussian Mean Estimation under Realizable Contamination - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: SQ lower bounds and matching tradeoffs for Gaussian mean estimation under realizable contamination.
-
Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Analyzes low-precision optimizer-state dynamics in LLM pretraining, explaining EMA staleness and deriving theory-guided reset schedules for memory-efficient training.
-
High-dimensional estimation with missing data: Statistical and computational limits - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Statistical-computational limits for high-dimensional estimation with missing data, including information-computation gaps.
-
BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Quantization method tailored to MXFP4 with block-wise affine transforms and Kronecker-efficient parameterization.
-
Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Compression framework allocates pruning budgets using SAE-derived capability density, linking interpretability to component-level compression sensitivity.
-
MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Compute-optimal diffusion language modeling via binary subtoken encoding, index shuffling, and scaling-law analysis.
-
Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Introduces adaptive latent-space reasoning with dynamic halting, a core architectural efficiency idea for implicit reasoning in LLMs.
-
Spiking Layer-Adaptive Magnitude-based Pruning - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Introduces a theory-guided pruning framework for temporal SNNs with time-aware layer importance and distortion-constrained sparsity allocation.
-
Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Provides theory for dataset distillation showing efficient encoding of low-dimensional task structure under gradient-based training of neural networks.
-
FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Inference efficiency: training-free retrieval-style replacement for the LM output head that reduces classification-head compute.
-
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Model efficiency: training-free LVLM token pruning that corrects attention shift and merges redundant tokens while remaining KV-cache compatible.
-
Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Inference-time KV-cache memory management architecture with selective forgetting/compression and theoretical interference reduction.
-
Enhancing LLM Training via Spectral Clipping - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Spectral clipping is a general optimizer-side efficiency/stability method for LLM training with theory and scalable Newton-Schulz implementation.
-
GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Memory-efficient architecture: writes long context into compact prefix memory via test-time gradient descent instead of large KV caches.
-
Effective Sparsity: A Unified Framework via Normalized Entropy and the Effective Number of Nonzeros - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Defines effective sparsity via normalized-entropy regularizers with RIP-based recovery guarantees, offering a new theoretical sparsity framework.
-
When Drafts Evolve: Speculative Decoding Meets Online Learning - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Inference efficiency: speculative decoding cast as online learning, with regret-based algorithms that adapt draft models from verification feedback.
-
GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Compression methodology: structured LLM pruning guided by cross-distribution neuron sensitivity to improve post-pruning generalization.
-
Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Model compression and dynamic networks: unified utility metric for structural pruning and routing based on alternating gradient flow.
-
HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Compression methodology: end-to-end multi-granular stochastic auto-pruning for ViTs across heads, FFNs, and intra-block dimensions.
-
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Attention efficiency: cross-layer reuse of sparse attention top-k indices cuts indexer cost with training-free and training-aware configurations.
-
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Inference efficiency for transformers: training-free decoding acceleration using stable within-sentence attention support and sparse memory refresh.
-
LongFlow: Efficient KV Cache Compression for Reasoning Models - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Inference efficiency: KV-cache compression for long-output reasoning models with negligible-overhead importance estimation and fused custom kernel.
-
A New Tensor Network: Tubal Tensor Train and Its Applications - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Model Compression/Low-Rank: introduces the Tubal Tensor Train (TTT) tensor network with TTT-SVD/ATCU algorithms and error bounds.
-
ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Efficiency: training-free early-skipping for diffusion LLMs using intermediate tensor variation/confidence to skip token compute, yielding substantial inference speedups.
-
Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture + HPC: introduces a TTC layer performing finite-horizon LQR planning within neural networks and a fused CUDA solver for hardware-efficient inference-time control.
-
Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: Efficiency/HPC: memory-efficient optimization via mask traversal with improved nonconvex convergence ($O(\epsilon^{-3})$).
-
Preserving Continuous Symmetry in Discrete Spaces: Geometric-Aware Quantization for SO(3)-Equivariant GNNs - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Compression/Efficiency—geometric-aware low-bit quantization for SO(3)-equivariant GNNs that preserves symmetry via magnitude-direction decoupling and symmetry-aware training.
-
$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Inference-time optimization—introduces differentiable test-time gradient descent over token logits to refine LLM decoding; theoretical link to KL-regularized RL.
-
Stacked from One: Multi-Scale Self-Injection for Context Window Extension - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Model Architecture and Efficiency — two stacked short-context LLMs with multi-grained compression and self-injection for long-context extension, reducing memory and accelerating inference.
-
NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training - Score: 17 (R=9, N=8) - Date: 2026-03-05 - Comment: Compression/Efficiency + Training Dynamics: optimizer with nuclear-norm-constrained updates to induce low-rank weight structure for better LLM compressibility.
-
Never Saddle for Reparameterized Steepest Descent as Mirror Flow - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Training Dynamics and Optimization Geometry — introduces steepest mirror flows explaining implicit bias, sparsity, and saddle escape (insights into Adam/AdamW vs. SGD).
-
FreeAct: Freeing Activations for LLM Quantization - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: dynamic activation-side transformations (beyond one-to-one orthogonal mappings) for improved LLM quantization.
-
Scalable Multi-Task Low-Rank Model Adaptation - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Strongly matches Model Compression/Efficiency (low-rank): scalable multi-task LoRA with spectral-aware regularization, block-level adaptation, and fine-grained routing.
-
A Decomposition Framework for Certifiably Optimal Orthogonal Sparse PCA - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Strongly matches Sparsity/Representation Learning: certifiably optimal orthogonal Sparse PCA with BnB acceleration and block decomposition.
-
Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency — replaces MHA with Multi-Head Latent Attention in Whisper decoder to shrink KV cache by up to 87.5% with minimal fine-tuning.
-
Weight Updates as Activation Shifts: A Principled Framework for Steering - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: establishes equivalence between activation steering and weight updates and introduces a parameter-efficient joint adaptation method.
-
Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Strongly matches Model Compression/Efficiency: training-free KV cache compression for VLM-based GUI agents with saliency/trajectory-aware scoring.
-
Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: structured pruning viewed as search over causal abstractions with closed-form interventional risk criteria (sparsity/pruning).
-
Computation-Utility-Privacy Tradeoffs in Bayesian Estimation - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational theory for differentially private Bayesian estimation, giving efficient near-Bayes-optimal algorithms and computational-statistical lower bounds.
-
Massive Redundancy in Gradient Transport Enables Sparse Online Learning - Score: 17 (R=8, N=9) - Date: 2026-03-17 - Comment: Shows strong redundancy in online gradient transport and proposes sparse propagation schemes that retain most adaptation ability, a foundational efficiency result for recurrent and transformer training dynamics.
-
Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Theory of compression dynamics: identifies pruning-induced phase transitions in fully connected networks with statistical-mechanics analysis.
-
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2026-03-17 - Comment: Model compression: asymmetric text-visual pruning for LVLMs based on modality-specific sensitivity analysis and adaptive token calibration.
-
SVD Contextual Sparsity Predictors for Fast LLM Inference - Score: 16 (R=9, N=7) - Date: 2026-03-16 - Comment: Uses training-free SVD-based contextual sparsity predictors for conditional FFN execution, directly targeting fast LLM inference.
-
MXNorm: Reusing MXFP block scales for efficient tensor normalisation - Score: 16 (R=9, N=7) - Date: 2026-03-14 - Comment: Model efficiency: normalization redesign that reuses MXFP block scales to cut reduction cost and speed low-precision transformer training.
-
ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training - Score: 16 (R=9, N=7) - Date: 2026-03-14 - Comment: Optimization method for efficient sparse training: zero-order SAM cuts backprop cost while stabilizing high-sparsity learning.
-
Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE - Score: 16 (R=9, N=7) - Date: 2026-03-13 - Comment: Transformer efficiency: analyzes partial RoPE as a core positional-encoding design that preserves convergence while greatly reducing cache memory.
-
GAST: Gradient-aligned Sparse Tuning of Large Language Models with Data-layer Selection - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Model Compression/Efficiency: gradient-aligned sparse tuning with joint layer selection and data selection in a unified optimization for PEFT.
-
Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: High-performance inference efficiency — KV cache compression with Compressed PagedAttention and scheduling for high-concurrency LLM inference.
-
ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Model Compression and Efficiency: adaptive KV-cache management with dynamic precision allocation, quantization, and eviction based on per-layer attention statistics for long-context inference.
-
Stem: Rethinking Causal Information Flow in Sparse Attention - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: proposes position-dependent sparse attention (Token Position-Decay) with an output-aware token metric to reduce prefill compute in causal Transformers.
-
FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: proposes dynamic sparse attention (instantaneous pattern discovery + thresholding) to accelerate long-context prefilling.
-
Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: adaptive visual token pruning based on singular value spectrum (low-rank/spectral energy) for compute-efficient VLM inference.
-
POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation - Score: 16 (R=9, N=7) - Date: 2026-03-06 - Comment: High Performance Computing/Efficiency: scalable orthogonal-equivalence reparameterization (POET-X) that reduces memory and compute for LLM training while preserving stability.
-
InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context - Score: 16 (R=9, N=7) - Date: 2026-03-06 - Comment: Model Efficiency: information-flow-guided selective KV recomputation and RoPE-consistent chunk reordering for long-context inference.
-
EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs - Score: 16 (R=9, N=7) - Date: 2026-03-05 - Comment: Model Compression/Efficiency: early-stage visual token pruning inside the encoder (layer-wise, similarity/diversity/attention-guided) for MLLMs.
-
SageBwd: A Trainable Low-bit Attention - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: quantization of attention (INT8) during training with stability analysis (QK-norm, K-/Q-smoothing) and identification of backward-pass gradient as primary error source.
-
Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: Model Compression and Efficiency/HPC: applies low-bit quantization specifically to speculative verification to overcome memory bandwidth limits, improving end-to-end throughput.
-
LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding - Score: 16 (R=9, N=7) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: new training objective directly optimizing acceptance rate in speculative decoding for faster inference.
-
Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Model compression/efficiency: introduces a new distillation objective for discrete diffusion models using discrete MMD, tackling a known methodological gap in fast sampling.
-
Minimax Generalized Cross-Entropy - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Proposes a new convex minimax formulation of generalized cross-entropy with theoretical error bounds and efficient bilevel optimization via implicit differentiation.
-
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Model compression/efficiency for LLM RL via layerwise representation perturbations that stabilize off-policy updates by controlling heavy-tailed importance ratios.
-
Computational and Statistical Hardness of Calibration Distance - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Theoretical hardness and approximation results for calibration distance, a foundational learning-theoretic problem.
-
RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Mixed-precision quantization via RL for per-layer bit allocation with zero-shot transfer across LLM families.
-
How do LLMs Compute Verbal Confidence - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Mechanistic representation analysis of how LLMs compute and cache verbal confidence beyond token log-probabilities.
-
Flow Matching Policy with Entropy Regularization - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core algorithmic innovation for generative policies: flow-matching policy optimization with a tractable entropy regularizer and much cheaper training than diffusion policies.
-
rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Theoretical robust-learning framework replacing cross-entropy with minimum-divergence estimation, with consistency and robustness guarantees.
-
Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Foundational analysis of vector quantization collapse mechanisms, identifying token/embedding collapse causes and proposing diversity-preserving fixes.
-
SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Scalable gradient inversion for transformers via sparse recovery using head-wise geometric structure and subspace-guided OMP.
-
Online Semi-infinite Linear Programming: Efficient Algorithms via Function Approximation - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Online semi-infinite LP with function approximation giving regret bounds independent of the number of constraints.
-
Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Proposes an information-density-driven masking and noise scheduling paradigm for training diffusion LLMs.
-
More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Theoretical analysis of beam search overestimation bias with explicit critical-width scaling laws for LLM inference.
-
SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Uses sparse transcoders to identify knowledge circuits and perform sparse neuron-level interventions for lifelong knowledge editing, targeting representation-level structure rather than dense black-box updates.
-
Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Theoretical identifiability for robust prediction under latent shift, replacing completeness with a weaker cross-domain rank condition.
-
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Improves VLM efficiency with a spatial-on-demand architecture that retrieves high-resolution crops only when needed, reducing unnecessary visual compute.
-
On the (Generative) Linear Sketching Problem - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Studies linear sketch recovery through generative priors and proposes a training-without-ground-truth framework for efficient sketch inversion.
-
ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order Optimization - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Efficiency-focused test-time adaptation with zeroth-order optimization, enabling forward-only adaptation for high-dimensional and quantized models.
-
Interleaved Resampling and Refitting: Data and Compute-Efficient Evaluation of Black-Box Predictors - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Develops a black-box, data- and compute-efficient procedure for excess-risk evaluation with high-probability guarantees via interleaved resampling/refitting.
-
TMPDiff: Temporal Mixed-Precision for Diffusion Models - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Model compression and efficiency: introduces timestep-wise mixed-precision quantization for diffusion inference with a principled search algorithm over temporal precision allocation.
-
PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Protocol-aware tokenization for network traces defines a modality-native foundation model design that greatly improves efficiency over generic tokenization.
-
Probabilistic Gaussian Homotopy: A Probability-Space Continuation Framework for Nonconvex Optimization - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Foundational nonconvex optimization method: probability-space homotopy with Boltzmann-weighted gradient aggregation and a derived annealed minimizer dynamics.
-
Upper Bounds for Local Learning Coefficients of Three-Layer Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Theory: derives upper bounds for local learning coefficients at singular points in three-layer neural networks.
-
One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Model efficiency via one-step self-distillation for diffusion/flow visuomotor policies, reducing iterative sampling cost by 100x.
-
A Quantitative Characterization of Forgetting in Post-Training - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Theoretical analysis of forgetting in post-training, deriving objective-dependent conditions for mass forgetting and component drift.
-
Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Provides theory for prior-data fitted networks, proving inconsistency and proposing a calibrated posterior correction with Bernstein-von Mises guarantees.
-
Truth as a Compression Artifact in Language Model Training - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Representation-learning insight: argues truth preference emerges from compression structure, supported by controlled transformer training studies.
-
On-Policy Self-Distillation for Reasoning Compression - Score: 16 (R=8, N=8) - Date: 2026-03-06 - Comment: Model Compression and Efficiency: on-policy self-distillation to compress chain-of-thought reasoning tokens while maintaining/improving accuracy.
-
Accelerating Single-Pass SGD for Generalized Linear Prediction - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Matches Algorithmic Efficiency/HPC: first momentum-accelerated single-pass SGD for GLMs with sharp excess risk bounds in streaming.
-
GPU-friendly and Linearly Convergent First-order Methods for Certifying Optimal $k$-sparse GLMs - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Model Compression/Efficiency and HPC: GPU-friendly, linearly convergent proximal framework for certifying optimal k-sparse GLMs with specialized perspective-prox operators and duality-gap restarts.
-
Growing Networks with Autonomous Pruning - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Model compression/efficiency: studies dynamically growing networks with autonomous pruning during training to reach sparse architectures.
-
Warm-Start Flow Matching for Guaranteed Fast Text/Image Generation - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Generative-model efficiency: warm-start flow matching cuts sampling steps with a formal guaranteed speed-up mechanism.
-
Spectral Tempering for Embedding Compression in Dense Passage Retrieval - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Presents a learning-free eigenspectrum-based method for adaptive embedding compression, directly addressing model efficiency via spectral analysis.
-
Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Core tokenizer methodology: replaces frequency-based BPE merging with a statistically grounded significance-gain criterion and evaluates effects on Transformer LM efficiency.
-
UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Inference efficiency method: adaptive KV-cache/context allocation driven by token-level uncertainty for long-context decoding.
-
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Unified token pruning across both ViT and LLM with learned spatio-temporal scoring for video VLM efficiency.
-
Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Compression method: calibration-free mixed-precision quantization driven by dual numerical and structural layer sensitivity.
-
KANtize: Exploring Low-bit Quantization of Kolmogorov-Arnold Networks for Efficient Inference - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Low-bit quantization of KANs using quantized spline tables for major inference-efficiency gains.
-
Implementation of tangent linear and adjoint models for neural networks based on a compiler library tool - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Compiler/runtime tool for integrating neural networks with numerical models, including tangent linear and adjoint support for efficient heterogeneous execution.
-
Efficient Reasoning on the Edge - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Systems-level efficiency for on-device reasoning via dynamic adapter switching, KV-cache sharing, and budget-forced reasoning compression.
-
SF-Mamba: Rethinking State Space Model for Vision - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: State-space vision architecture redesign with patch swapping and batch folding for higher GPU-parallel efficiency.
-
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Model compression and efficiency: latency-guided hardware-in-the-loop architecture search for on-device LLM design under deployment constraints.
-
Effective Distillation to Hybrid xLSTM Architectures - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Model compression and efficiency: distillation pipeline from transformer teachers into sub-quadratic hybrid xLSTM students for efficient inference.
-
Controlled Langevin Dynamics for Sampling of Feedforward Neural Networks Trained with Minibatches - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Introduces controlled minibatch pseudo-Langevin dynamics for scalable Boltzmann sampling of neural-network parameters, addressing a core training/sampling methodology issue.
-
PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Efficiency-focused methodology: zero-shot NAS jointly searching architecture, pruning, and quantization for constrained deployment.
-
SimCert: Probabilistic Certification for Behavioral Similarity in Deep Neural Network Compression - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Probabilistic certification framework for preserving behavior under pruning and quantization in compressed networks.
-
DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Training-free multimodal token compression using dynamic audio-driven semantic chunking for efficient long-context omnimodal inference.
-
SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Combines spiking computation, quantization-aware training, and adaptive early exits into a unified efficient inference architecture.
-
High-Fidelity Compression of Seismic Velocity Models via SIREN Auto-Decoders - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: SIREN auto-decoder for high-fidelity neural compression is a direct model compression/autoencoder-style representation contribution.
-
Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Proposes an inference-time early-exit mechanism for reasoning models based on monitoring high-entropy path deviation as a signal of overthinking.
-
True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Quantization method: true 4-bit training on commodity CPUs with soft weight clipping and dynamic scaling reaching near full-precision parity.
-
IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Adaptive LoRA rank allocation using integrated gradients with a theoretical quadrature error bound targets compression/efficiency at the method level.
-
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Inference-efficiency method for reasoning models: learns optimal early-exit points to cut Chain-of-Thought compute.
-
Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Explicit kernel-basis construction for equivariant CNNs that avoids Clebsch-Gordan coefficients and generalizes across symmetry groups.
-
Efficient Reasoning with Balanced Thinking - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Efficiency for transformers/LRMs: training-free hidden-state steering to adapt reasoning compute between overthinking and underthinking.
-
BiGain: Unified Token Compression for Joint Generation and Classification - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Training-free token compression for diffusion backbones using frequency-aware merging/downsampling, directly addressing efficient model computation.
-
Quantization Robustness of Monotone Operator Equilibrium Networks - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Model Compression/Efficiency: Provable quantization robustness for monotone operator equilibrium networks; links precision, perturbation, and convergence.
-
On Catastrophic Forgetting in Low-Rank Decomposition-Based Parameter-Efficient Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Model Compression/Efficiency: analyzes catastrophic forgetting in low-rank PEFT (e.g., LoRA, tensor decompositions) via update subspace geometry; guidance for efficient continual adaptation.
-
Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Model Efficiency: parameter- and data-efficient adaptation of draft models for speculative decoding using a decoupled shared/private architecture and targeted data regeneration/selection.
-
Evolving Prompt Adaptation for Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Compression/Efficiency: parameter-efficient adaptation with low-rank updates decoupled into direction/magnitude to preserve pretraining knowledge; adds feature geometric regularization.
-
DendroNN: Dendrocentric Neural Networks for Energy-Efficient Classification of Event-Based Data - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Model Architecture and Efficiency: dendrite-inspired DendroNN with event-driven routing, dynamic/static sparsity and intrinsic quantization; includes asynchronous hardware design for low-power spatiotemporal processing.
-
HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: hierarchical, preference-conditioned structured pruning with VLM-aware sensitivity signals and plan-level GRPO.
-
Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Compression/sparsity: pruning to extract bias-invariant subnetworks from vanilla models without retraining.
-
MCEL: Margin-Based Cross-Entropy Loss for Error-Tolerant Quantized Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Compression/Efficiency: introduces a margin-based cross-entropy loss to improve robustness of quantized NNs to bit-flip errors without error-aware training.
-
Rethinking Representativeness and Diversity in Dynamic Data Selection - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning and Efficiency: dynamic data selection using sparse autoencoder factors for representativeness and process-level diversity, yielding >2× training speedups.
-
Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning: training dynamics in deep linear networks, where depth-induced coupling promotes low-rank implicit bias and mitigates plasticity loss.
-
Nonconvex Latent Optimally Partitioned Block-Sparse Recovery via Log-Sum and Minimax Concave Penalties - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Sparsity/Compression: nonconvex block-sparse recovery with unknown partitions using log-sum and MCP penalties with ADMM optimization.
-
MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Compression/Efficiency: adaptive LoRA rank search via dual scaling laws to align modality-specific convergence and maximize MLLM fine-tuning performance.
-
Polynomial Surrogate Training for Differentiable Ternary Logic Gate Networks - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: polynomial surrogate training for differentiable ternary logic-gate networks with bounded hardening gap and large parameter reduction.
-
Stateful Token Reduction for Long-Video Hybrid VLMs - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Compression/Efficiency: query-conditioned token reduction for hybrid attention–Mamba VLMs with progressive scheduling and unified scoring.
-
Task-Centric Acceleration of Small-Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: task-adaptive sequence compression via tokenizer expansion (TASC-ft) and training-free n-gram speculative decoding (TASC-spec) to accelerate SLM inference.
-
KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches High Performance Computing/Efficiency criterion: KV-cache-centric memory management (construction, recomputation, balanced loading) to reduce LLM inference latency.
High Performance Computing (78)
-
ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context - Score: 20.0 (R=0, N=0) - Date: 2026-03-03 - Comment: Author match
-
The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference - Score: 19 (R=10, N=9) - Date: 2026-03-23 - Comment: Transformer systems insight showing KV cache is exactly reconstructible from residual streams, yielding a new bounded-memory inference scheme.
-
Deep learning and the rate of approximation by flows - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: Gives a theoretical characterization of deep residual network approximation via geodesic distance on a sub-Finsler manifold of diffeomorphisms.
-
Why Are Linear RNNs More Parallelizable? - Score: 19 (R=10, N=9) - Date: 2026-03-05 - Comment: Strong match to Model Architecture and High-Performance Computing theory by characterizing LRNNs’ parallelizability via complexity classes and expressivity trade-offs.
-
NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL - Score: 18 (R=10, N=8) - Date: 2026-03-14 - Comment: High-performance computing for MoE: unified NCCL expert-parallel dispatch/combine API with topology-aware low-latency and high-throughput modes.
-
MoEless: Efficient MoE LLM Serving via Serverless Computing - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: High Performance Computing / MoE Systems: serverless MoE serving with expert load prediction and elastic scaling/placement to reduce latency/cost.
-
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $\lambda$-Calculus - Score: 18 (R=9, N=9) - Date: 2026-03-23 - Comment: Model architecture/systems: replaces free-form recursive control with a typed λ-calculus runtime for long-context reasoning, with formal guarantees on termination and cost.
-
Rigorous Asymptotics for First-Order Algorithms Through the Dynamical Cavity Method - Score: 18 (R=9, N=9) - Date: 2026-03-16 - Comment: Provides a rigorous formalization of the dynamical cavity method for first-order algorithms, yielding asymptotic theory for optimization dynamics.
-
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems - Score: 18 (R=9, N=9) - Date: 2026-03-14 - Comment: Theoretical reinterpretation of diffusion models as partitioned iterated function systems, yielding computable geometric design criteria for schedules and objectives.
-
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits - Score: 17 (R=9, N=8) - Date: 2026-03-20 - Comment: Systems-level benchmark for GPU kernel optimization with analytically derived speed-of-light hardware bounds, directly matching HPC methodology.
-
Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Develops parallel Newton and quasi-Newton methods to remove sequential bottlenecks in dynamical systems, with convergence theory tied to Lyapunov stability.
-
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Single-GPU fine-tuning system with heterogeneous memory management, asynchronous CPU/GPU overlap, and kernel co-design.
-
Determinism in the Undetermined: Deterministic Output in Charge-Conserving Continuous-Time Neuromorphic Systems with Temporal Stochasticity - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Provides a theoretical foundation for charge-conserving continuous-time SNNs, proving spike-timing-invariant computation and exact correspondence to quantized ANNs.
-
FlashSampling: Fast and Memory-Efficient Exact Sampling - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Presents an exact systems-level decoding primitive that fuses categorical sampling into the LM-head matmul to eliminate logits materialization and reduce memory traffic.
-
Grokking as a Variance-Limited Phase Transition: Spectral Gating and the Epsilon-Stability Threshold - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Training-dynamics theory of grokking as a variance-limited phase transition governed by optimizer-induced spectral gating.
-
High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Establishes the first uniform-in-time high-probability SGD bounds under PL with Markovian noise, a foundational optimization theory result.
-
State-space models through the lens of ensemble control - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Provides a control-theoretic foundation for state-space models by casting training as an ensemble optimal control problem and deriving PMP-based optimality conditions.
-
AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Systems co-design for dynamic sparse models: token-level pre-gating and fused kernels to make dynamic LoRA/MoE-style adapter inference efficient.
-
Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Optimization/Systems: new optimizer combining spectral constraints with Shampoo-style preconditioning for faster, stable training.
-
The Missing Memory Hierarchy: Demand Paging for LLM Context Windows - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Systems/Memory Optimization: introduces demand paging and multi-level memory hierarchy for LLM context windows, directly addressing context efficiency.
-
A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: HPC/systems: FPGA accelerator and memory optimization for linear attention decode by keeping recurrent state on-chip.
-
SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training - Score: 17 (R=9, N=8) - Date: 2026-03-05 - Comment: Matches High Performance Computing/Distributed Training: integrity verification for pipeline parallel training with convergence guarantees in untrusted settings.
-
Hyperagents - Score: 17 (R=8, N=9) - Date: 2026-03-23 - Comment: Proposes a self-referential architecture where the meta-level modification mechanism is itself editable, a foundational systems design for open-ended self-improvement.
-
Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational identifiability theory for recovering latent actions and dynamics from offline trajectories using demonstrator diversity.
-
Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational theory for generative causal models using sheaf/cohomology and an O(1)-memory reverse-mode differentiation bridge via Sinkhorn-IFT-VJP.
-
NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference - Score: 17 (R=8, N=9) - Date: 2026-03-18 - Comment: Systems/methodology contribution for verifiable transformer inference via layerwise zero-knowledge proofs with constant-size per-layer proofs.
-
Sinkhorn-Drifting Generative Models - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Generative modeling theory: links drifting dynamics to Sinkhorn-divergence gradient flows and resolves equilibrium identifiability.
-
Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Reintroduces explicit Markov states into LLM RL post-training with theoretical sample-complexity guarantees, directly targeting foundational training dynamics.
-
Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Theoretical framework for two-time-scale population dynamics of neural network training, linking population methods to replicator-mutator and bilevel optimization.
-
Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Finite-time theory for stochastic approximation under heavy-tailed and long-range dependent noise, extending core optimization analysis beyond classical assumptions.
-
Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Systems-level method for verifiable large-model inference using lightweight sampling-based proofs with execution-trace commitments.
-
Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Systems-level training architecture proposing depth-independent memory scaling near 2x inference footprint with exact gradient accumulation.
-
VideoAtlas: Navigating Long-Form Video in Logarithmic Compute - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Systems-level hierarchical video representation enabling logarithmic-compute navigation and cache reuse for long-context multimodal models.
-
RHYME-XT: A Neural Operator for Spatiotemporal Control Systems - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Neural-operator architecture for spatiotemporal control systems combining learned Galerkin projection with direct flow-map learning.
-
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Truncated backpropagation for recurrent video diffusion decoding with constant-memory training and theory.
-
Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Distribution-free dual-form uncertainty bounds for multi-output kernel regression, with a GP-compatible structure that is directly usable in downstream optimization.
-
Trained Persistent Memory for Frozen Encoder--Decoder LLMs: Six Architectural Methods - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Proposes six architectural methods for differentiable persistent latent memory in frozen encoder-decoder LLMs.
-
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Workflow-aware LLM serving system that introduces cross-call caching and cache-aware scheduling for agentic workloads.
-
Parallelised Differentiable Straightest Geodesics for 3D Meshes - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Provides differentiable and parallel straightest-geodesic operators for meshes, enabling new geometry-aware learning primitives.
-
Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Proposes a foundational cognitive architecture for autonomous learning with observation, action, and meta-control systems.
-
Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Systems-level synchronization method for multi-agent LLMs by adapting MESI-style cache coherence to artifact sharing.
-
LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Recasts the LLM itself as the graph message-passing operator on text-rich graphs, changing the core aggregation mechanism.
-
Fold-CP: A Context Parallelism Framework for Biomolecular Modeling - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: High-performance computing contribution: context parallelism with custom primitives for scaling biomolecular model attention and triangular updates across GPUs.
-
Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Reframes diffusion sampling as graph-theoretic planning with a low-dimensional state proxy to allocate compute adaptively during generation.
-
SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Systems/theory for agent memory: derives retrieval and lifecycle mechanisms from information geometry and sheaf cohomology rather than heuristics.
-
Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Combines causal representation learning with sparse attention and proves identifiability conditions for disentangled system representations.
-
Convergence of Two Time-Scale Stochastic Approximation: A Martingale Approach - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Derives new almost-sure convergence and rate results for two time-scale stochastic approximation under broader noise conditions.
-
OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Introduces unified KV-cache management across tasks and time for VLA transformers, a systems-level inference innovation for multi-task parallelism.
-
Structure-Dependent Regret and Constraint Violation Bounds for Online Convex Optimization with Time-Varying Constraints - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Derives structure-dependent regret and constraint-violation bounds for online convex optimization with time-varying constraints, adapting updates to regularity in constraint drift.
-
AEX: Non-Intrusive Multi-Hop Attestation and Provenance for LLM APIs - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Systems-level protocol for signed attestation and provenance at the LLM API boundary, addressing verification of request-output relations.
-
The Institutional Scaling Law: Non-Monotonic Fitness, Capability-Trust Divergence, and Symbiogenetic Scaling in Generative AI - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Theoretical scaling-law work on non-monotonic model/system scaling and orchestration of domain-specific models.
-
SRAM-Based Compute-in-Memory Accelerator for Linear-decay Spiking Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Algorithm-hardware co-design for compute-in-memory SNNs that removes the state-update bottleneck via in-memory parallel decay.
-
Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Inference systems contribution: proves modality-boundary partitioning minimizes transfer under KV caching and enables cost-efficient cross-tier heterogeneous serving.
-
KernelFoundry: Hardware-aware evolutionary GPU kernel optimization - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Systems-level GPU optimization: evolutionary MAP-Elites framework for hardware-aware kernel search and prompt co-evolution.
-
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Distributed systems contribution: disaggregated serving architecture for any-to-any multimodal models with flexible computation-graph execution.
-
AutoScout: Structured Optimization for Automating ML System Configuration - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Systems-level optimizer for ML configuration spaces with hierarchical mixed discrete/continuous decisions and multi-fidelity profiling.
-
Large Spikes in Stochastic Gradient Descent: A Large-Deviations View - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Training Dynamics: Large-deviations theory for SGD catapult spikes with explicit kernel/learning-rate criterion.
-
Riemannian Optimization in Modular Systems - Score: 16 (R=8, N=8) - Date: 2026-03-05 - Comment: Proposes layerwise Riemannian metrics and composable modules with contraction guarantees, a principled treatment of optimization/training dynamics for neural architectures.
-
D-Mem: A Dual-Process Memory System for LLM Agents - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Systems-level memory architecture for LLM agents with dynamic quality gating between retrieval and full-deliberation modes.
-
SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Production-oriented training framework for speculative decoding with hybrid parallelism and optimized kernels, matching large-model systems work.
-
Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Systems-level method for distributed large-batch training that jointly optimizes batch size for time, cost, and generalization.
-
Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Systems-level memory architecture replacing in-context storage with hash-addressed knowledge objects for persistent O(1) retrieval.
-
100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Systems-level efficiency method using lightweight proxy models to approximate expensive LLM-backed SQL operators at large scale.
-
Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Develops Byzantine-robust distributed optimization with compressed communication using double momentum and variance reduction, directly targeting scalable training methodology.
-
MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Training-system modeling for heterogeneous accelerators, including activation checkpointing and layer-fusion co-design.
-
Align Forward, Adapt Backward: Closing the Discretization Gap in Logic Gate Networks - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Analyzes training-inference discretization gaps in hard vs. soft component selection and proposes a new gradient estimator for aligned conditional computation.
-
Exploiting temporal parallelism for LSTM Autoencoder acceleration on FPGA - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Presents a systems-level FPGA dataflow design that exploits temporal parallelism across timesteps and layers for efficient LSTM autoencoder inference.
-
Orla: A Library for Serving LLM-Based Multi-Agent Systems - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Introduces systems-level mechanisms for multi-agent LLM serving, especially workflow orchestration and KV-cache management across workflow boundaries.
-
Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Training-free parallel decoding for diffusion LLMs using self-attention-induced dependency graphs and independent-set selection.
-
Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Distributed optimization theory: Byzantine-robust training under generalized (L0,L1)-smoothness with convergence guarantees.
-
TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Systems analysis methodology: decomposes LLM inference host-side overhead into actionable components and characterizes host-device boundedness.
-
SpectralGuard: Detecting Memory Collapse Attacks in State Space Models - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Systems/theory for state-space models: spectral-radius analysis of memory collapse with a real-time architectural monitor.
-
Multi-DNN Inference of Sparse Models on Edge SoCs - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Systems/Efficiency: model stitching recombines subgraphs from sparse models for multi-DNN inference on edge SoCs without retraining, improving throughput and memory use.
-
FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: High Performance Computing/Systems: introduces flexible resource isolation (Flex-Mem/Flex-NPU), LLM-aware memory management, and a secure inference pipeline for on-device LLM serving.
-
Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Representation Learning/Optimization theory: adversarially-aligned Jacobian regularization controls sensitivity along adversarial directions, improving minimax stability with less expressivity loss.
-
Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Model Efficiency/memory optimization: introduces indexed external memory with RL-optimized read/write under context budgets, plus theoretical bounds on in-context computation.
-
stratum: A System Infrastructure for Massive Agent-Centric ML Workloads - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: High Performance Computing: unified system infrastructure compiling and executing large batches of agent-generated ML pipelines efficiently.
-
Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: High Performance Computing: adaptive prefetching to reduce communication in distributed GNN training using an LLM-based controller.
Representation Learning (134)
-
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels - Score: 20.0 (R=0, N=0) - Date: 2026-03-23 - Comment: Author match
-
Statistical and structural identifiability in representation learning - Score: 19 (R=10, N=9) - Date: 2026-03-13 - Comment: Representation learning theory: formalizes statistical vs structural identifiability and proves near-identifiability beyond last-layer representations.
-
Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Matches Representation Learning/Theory: directional neural collapse (decision-axis variance) explains few-shot transfer with sharp bounds and multitask geometry.
-
InfoNCE Induces Gaussian Distribution - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Representation Learning: theoretical analysis showing InfoNCE induces Gaussian structure in learned features.
-
Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients - Score: 18 (R=9, N=9) - Date: 2026-03-16 - Comment: Representation learning: unsupervised sparse dictionary decomposition of per-document training gradients to discover interpretable behavior atoms and steering directions.
-
A theory of learning data statistics in diffusion models, from easy to hard - Score: 18 (R=9, N=9) - Date: 2026-03-14 - Comment: Theory for representation learning in diffusion models: proves easy-to-hard learning of low- vs high-order data statistics via a diffusion information exponent.
-
Solving adversarial examples requires solving exponential misalignment - Score: 18 (R=9, N=9) - Date: 2026-03-05 - Comment: Representation Learning/Theory: introduces perceptual manifold dimensionality as a geometric account of adversarial vulnerability and robustness.
-
Quantitative Convergence of Wasserstein Gradient Flows of Kernel Mean Discrepancies - Score: 18 (R=9, N=9) - Date: 2026-03-03 - Comment: Representation Learning/Training Dynamics: quantitative convergence of Wasserstein gradient flows (MMD/Sobolev) linking to infinite-width shallow nets.
-
Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture -- Bridging Predictive and Generative Self-Supervised Learning - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Representation learning: gives a variational reformulation of JEPA as an explicit latent-variable model, removing heuristic anti-collapse regularization.
-
Only relative ranks matter in weight-clustered large language models - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Model compression/representation learning: shows clustered LLM weights preserve performance primarily through relative rank structure rather than exact magnitudes.
-
A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Theory of statistical-to-computational gaps in high-dimensional learning via a unifying noise sensitivity exponent.
-
Self-Distillation of Hidden Layers for Self-Supervised Representation Learning - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Self-supervised representation learning through hidden-layer self-distillation instead of only final-layer targets.
-
IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Batch-independent collapse prevention for self-supervised representation learning via dataset-level auxiliary embeddings.
-
Power-Law Spectrum of the Random Feature Model - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Derives power-law spectral preservation results for random feature models, directly addressing representation structure in core architectures.
-
Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Representation learning analysis: identifies which next-token gradient components cause transformers to develop seemingly redundant abstract features.
-
The Phenomenology of Hallucinations - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Representation-level theory of hallucination: uncertainty is internally encoded but weakly coupled to logits, explaining failure to abstain.
-
On Interpolation Formulas Describing Neural Network Generalization - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Theory of training dynamics: extends Domingos-style kernel interpolation to stochastic gradient training with optimizer-specific path kernels.
-
Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Model architecture: separates context and sample encoding into dual representation spaces to reconcile in-context and in-weight learning.
-
Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Theory of neural operators: extends DeepONet universal approximation from Banach-function settings to general locally convex spaces.
-
Disentangled Representation Learning through Unsupervised Symmetry Group Discovery - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Representation learning theory: unsupervised symmetry group discovery with identifiability guarantees for symmetry-based disentanglement.
-
Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Representation Learning: proposes iterative selection of Gaussian mixture priors for VAEs to provably avoid posterior collapse across architectures.
-
Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Representation learning/interpretability: sparse autoencoders + causal DAG structure learning to reveal concept interactions in LLMs.
-
Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Matches Representation Learning: mechanistic interpretability using sparse autoencoders to reveal causal feature hierarchies inside a transformer TSFM.
-
Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Theoretical foundations of representation/training dynamics behind prompt comprehension, ICL, and CoT in LLMs.
-
From Data Statistics to Feature Geometry: How Correlations Shape Superposition - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Representation Learning: analyzes superposition under correlated features, introducing BOWS to reveal constructive interference and feature geometry beyond the sparse/independent case.
-
Memorization capacity of deep ReLU neural networks characterized by width and depth - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Theory/Representation: characterizes memorization capacity via a tight width–depth tradeoff (W^2 L^2 ~ N log(1/δ)) for ReLU networks, advancing foundational understanding.
-
An Empirical Study and Theoretical Explanation on Task-Level Model-Merging Collapse - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Representation Learning: theoretical limits on model merging via rate–distortion, linking representational incompatibility to task-level collapse; fundamental analysis of mergeability.
-
Causal Interpretation of Neural Network Computations with Contribution Decomposition - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: Representation Learning: uses sparse autoencoders to causally decompose hidden-neuron contributions, enabling mechanistic interpretability and controllable interventions.
-
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs - Score: 17 (R=9, N=8) - Date: 2026-03-05 - Comment: Representation learning/training dynamics: finds a robust sparsity–difficulty relation in LLM hidden states and exploits it for curriculum ICL.
-
Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD? - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Optimization/Training Dynamics: compute-optimal scaling laws for signSGD under power-law random features, revealing noise-reshaping/drift-normalization effects.
-
Diagnosing Generalization Failures from Representational Geometry Markers - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Representation Learning: uses representational geometry markers (manifold dimensionality/utility) to predict OOD generalization and guide model selection.
-
Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Matches Representation Learning/Training dynamics theory: analyzes data quality and synergistic effects across pretraining, SFT, and RL with transformer models.
-
Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Representation Learning/Training Dynamics: Singular Learning Theory explains grokking as a phase transition via the local learning coefficient.
-
NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Architecture/Representation Learning: width-agnostic generation of neural weights via tokenized patches and GHN-based structural alignment to resolve permutation symmetries.
-
Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Matches Representation Learning criterion: derives necessary geometric constraints (linear, orthogonal per-concept factors) for compositional generalization with empirical support.
-
Provable Subspace Identification of Nonlinear Multi-view CCA - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Representation Learning Theory: provable identifiability and finite-sample guarantees for nonlinear multi-view CCA subspace recovery.
-
Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination - Score: 17 (R=8, N=9) - Date: 2026-03-23 - Comment: Representation learning/theory: proposes a unified geometric uncertainty principle linking adversarial fragility and LLM hallucination through input-gradient coupling.
-
Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational theory for RL/MDPs: operator-theoretic derivation of policy-gradient results for general state/action spaces with unbounded costs.
-
Language Generation with Replay: A Learning-Theoretic View of Model Collapse - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Learning theory for representation/data dynamics: formal characterization of model collapse under replayed self-generated text.
-
On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD - Score: 16 (R=9, N=7) - Date: 2026-03-12 - Comment: Matches Training Dynamics/Representation: theoretical analysis of label-noise SGD in two-layer linear networks revealing phase behavior and links to SAM.
-
SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Differentiable programming foundations: consolidates soft relaxations (e.g., sorting, indexing, fuzzy logic) to provide informative gradients in AD frameworks.
-
Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers - Score: 16 (R=9, N=7) - Date: 2026-03-06 - Comment: Representation Learning/Training Dynamics: theoretical mechanism for analogical reasoning in transformers via aligned representations and curriculum-dependent emergence.
-
Stable and Steerable Sparse Autoencoders with Weight Regularization - Score: 16 (R=9, N=7) - Date: 2026-03-05 - Comment: Matches Representation Learning and Sparsity: stability/steerability of sparse autoencoders via L2/L1 weight regularization, tied init, and unit-norm decoders.
-
The Lattice Representation Hypothesis of Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: Representation Learning: posits a concept lattice geometry in LLM embeddings enabling meet/join via linear attribute directions and thresholds.
-
Spectral Alignment in Forward-Backward Representations via Temporal Abstraction - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Representation learning theory: analyzes spectral mismatch in forward-backward successor representations and shows temporal abstraction acts as a low-pass filter with value-error bounds.
-
Pitfalls in Evaluating Interpretability Agents - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Provides foundational analysis of how to evaluate autonomous interpretability agents, introducing an intrinsic criterion based on functional interchangeability of model components.
-
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Representation analysis of CLIP projectors that identifies an aligned isotropic subspace and yields a training-free spectral decomposition method.
-
RiboSphere: Learning Unified and Efficient Representations of RNA Structures - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Model architecture and representation learning through a discrete geometric autoencoding framework combining vector quantization, SE(3)-invariant transformers, and flow matching.
-
Secure Linear Alignment of Large Language Models - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Studies representational convergence via linear alignment between independently trained LLMs, directly probing shared representations.
-
Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization method - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning/training dynamics theory: first unconditional error analysis for Adam via uniform a priori bounds in strongly convex stochastic optimization.
-
Seasoning Generative Models for a Generalization Aftertaste - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning theory: proves discriminator-guided refinement can improve generative-model generalization, with bounds governed by discriminator-class complexity.
-
Learning Decision-Sufficient Representations for Linear Optimization - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning: develops decision-sufficient compressed representations with hardness results, polynomial algorithms, and PAC bounds tied to intrinsic decision-relevant dimension.
-
From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning: unsupervised corpus-scale concept discovery via a contrastive associative-memory objective that isolates transition structure rather than topical semantics.
-
Discovering Decoupled Functional Modules in Large Language Models - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Representation-learning interpretability method that discovers decoupled cross-layer functional modules in LLMs with an unsupervised objective.
-
Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Introduces a new internal-noise framework via variational kernel design, deriving Gaussian Chaos Noise with theoretical guarantees on representation distortion.
-
Learning Permutation Distributions via Reflected Diffusion on Ranks - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core generative modeling contribution: a new diffusion framework on permutations using soft-rank forward processes and generalized PL denoisers.
-
Decoding the Critique Mechanism in Large Reasoning Models - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Representation-learning analysis of hidden critique behavior in reasoning models via an interpretable latent critique vector.
-
W2T: LoRA Weights Already Know What They Can Do - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Weight-space representation learning for LoRA adapters using a canonical factorization that removes decomposition ambiguity.
-
Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Representation/learning dynamics analysis: information-theoretic framework explaining reasoning via uncertainty externalization and information allocation.
-
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Improves KAN symbolic extraction with in-context operator selection and sparse gated operator layers, directly targeting core architecture interpretability/representation.
-
Interpretable Classification of Time Series Using Euler Characteristic Surfaces - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Introduces Euler Characteristic Surfaces as a stable, computationally efficient topological representation for time series, with a proved stability theorem.
-
$K-$means with learned metrics - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Theoretical foundation for representation/metric learning: continuity and stability of k-means under learned metrics via measured Gromov-Hausdorff topology.
-
Windowed Fourier Propagator: A Frequency-Local Neural Operator for Wave Equations in Inhomogeneous Media - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Presents a frequency-local neural operator that preserves superposition, a methodological advance in representation for wave dynamics.
-
Not All Latent Spaces Are Flat: Hyperbolic Concept Control - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Representation-space innovation: hyperbolic concept steering for generative models using parallel transport instead of Euclidean latent control.
-
Modality-free Graph In-context Alignment - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Method for graph foundation models: parameter-update-free in-context alignment across heterogeneous domains via gradient fingerprints.
-
Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Representation learning analysis: shows self-supervised speech models encode neighboring phonetic context in position-dependent orthogonal subspaces.
-
Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Theoretical representation-learning result linking InfoNCE temperature schedules to Langevin simulated annealing with asymptotic and finite-time guarantees.
-
Diffusion Models Generalize but Not in the Way You Might Think - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Foundational analysis of memorization and generalization dynamics in diffusion models across noise levels and denoising trajectories.
-
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Training objective methodology for language models: sequence-level feature matching through energy-based fine-tuning with theoretical grounding.
-
On-Average Stability of Multipass Preconditioned SGD and Effective Dimension - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Foundational optimization theory: multipass PSGD stability analysis with effective-dimension-dependent excess risk bounds and matching lower bounds.
-
Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Representation learning and mechanistic interpretability study using exhaustive circuit tracing and higher-order ablations to characterize internal feature organization.
-
Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Advances probabilistic latent variable modeling with a new proximal variational inference objective and convergence analysis to reduce amortization error.
-
Harnessing Data Asymmetry: Manifold Learning in the Finsler World - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Foundational representation learning: extends manifold learning from symmetric Riemannian to asymmetric Finsler geometry with generalized t-SNE/UMAP.
-
Factorized Neural Implicit DMD for Parametric Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Representation Learning/Architecture: factorized neural implicit DMD that parameterizes Koopman spectral decomposition for stable long-horizon rollouts and spectral analysis.
-
Training Language Models via Neural Cellular Automata - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Training dynamics/representation learning: synthetic pre-pretraining with neural cellular automata enabling transfer and efficiency.
-
A Gaussian Comparison Theorem for Training Dynamics in Machine Learning - Score: 16 (R=8, N=8) - Date: 2026-03-11 - Comment: Representation Learning/Training Dynamics: theoretical comparison (via Gordon’s theorem) linking training dynamics to a surrogate system; validates DMF and refines non-asymptotic behavior.
-
COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-09 - Comment: Model Architecture / Representation Learning — training-free activation steering approximating one-step learning dynamics for in-context control of LLM internal representations.
-
Why Is RLHF Alignment Shallow? A Gradient Analysis - Score: 16 (R=8, N=8) - Date: 2026-03-06 - Comment: Representation Learning/Training Dynamics: gradient analysis of RLHF showing shallow alignment and proposing a recovery-penalty objective to distribute gradients across positions.
-
Semi-Supervised Generative Learning via Latent Space Distribution Matching - Score: 16 (R=8, N=8) - Date: 2026-03-05 - Comment: Latent Space Distribution Matching with Wasserstein bounds; connects to LDMs (representation learning/generative modeling theory).
-
Surprisal-Rényi Free Energy - Score: 16 (R=8, N=8) - Date: 2026-03-05 - Comment: Matches Representation Learning/Training Objectives: introduces Surprisal-Rényi Free Energy interpolating KLs with variance/tail sensitivity and MDL interpretation.
-
Random Features for Operator-Valued Kernels: Bridging Kernel Methods and Neural Operators - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Representation Learning/Theory — generalization analysis of random features for operator-valued kernels, linking to NTK and neural operators with optimal/minimax rates.
-
What Is the Geometry of the Alignment Tax? - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Representation Learning theory: geometric characterization of safety–capability tradeoffs in representation subspaces with scaling predictions.
-
Universality of Shallow and Deep Neural Networks on Non-Euclidean Spaces - Score: 16 (R=8, N=8) - Date: 2026-03-02 - Comment: Model Architecture: theoretical universality for deep narrow networks on general topological spaces; Representation Learning: foundational approximation results beyond Euclidean inputs.
-
CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Representation learning: quantifies fact entanglement in LLM hidden representations using forward activations to predict edit ripple effects efficiently.
-
Hierarchical Latent Structure Learning through Online Inference - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Online hierarchical latent-variable inference via nested CRP plus sequential Monte Carlo for representation learning in sequential data.
-
Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Uses activation probing to detect motivated reasoning from internal representations, directly probing how LLMs encode decision dynamics.
-
PRISM: Demystifying Retention and Interaction in Mid-Training - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Foundational empirical analysis of mid-training, characterizing weight-space and representation changes and their interaction with later RL.
-
V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Systematic architectural study of representation-aligned co-denoising, isolating key design ingredients for dual-stream diffusion.
-
Grid-World Representations in Transformers Reflect Predictive Geometry - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Representation learning study showing transformer hidden states align with analytically derived predictive geometry in a controlled setting.
-
Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Representation learning framework that explicitly decomposes embedding utility into alignment and complementarity for interpretable feature discovery from event sequences.
-
Mechanistic Origin of Moral Indifference in Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Representation learning analysis using sparse autoencoders to isolate and reshape mono-semantic moral features in LLM latent space.
-
TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Model compression: data-free tabular knowledge distillation built around interaction-diverse synthetic query generation from learned feature bins.
-
Mechanistic Foundations of Goal-Directed Control - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Mechanistic interpretability: analyzes emergence of goal-directed control circuits, gating thresholds, and phase transitions with closed-form predictions.
-
ES-Merging: Biological MLLM Merging via Embedding Space Signals - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Uses embedding-space response signals to estimate layer- and element-wise model merging coefficients, making merging representation-aware rather than parameter-heuristic.
-
Is the reconstruction loss culprit? An attempt to outperform JEPA - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Predictive representation learning: gated predictive autoencoders isolate predictable components to challenge JEPA-style objectives.
-
Soft Mean Expected Calibration Error (SMECE): A Calibration Metric for Probabilistic Labels - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Foundational calibration metric replacing hard-label bin frequencies with mean probabilistic labels, extending ECE correctly.
-
U-Face: An Efficient and Generalizable Framework for Unsupervised Facial Attribute Editing via Subspace Learning - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Representation learning via latent subspace learning for disentangled editing, with an autoencoder view and convergence-backed alternating optimization.
-
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Representation learning via a bottleneck-token reconstruction objective explicitly targeting what-is-where compositional scene state encoding.
-
Resolving Interference (RI): Disentangling Models for Improved Model Merging - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Core methodology for model merging: reduces cross-task interference by functionally orthogonalizing constituent models using unlabeled auxiliary data.
-
Representation Learning for Spatiotemporal Physical Systems - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Directly studies representation learning by comparing self-supervised objectives for physically meaningful latent representations, highlighting latent-space methods like JEPA.
-
Maximizing Incremental Information Entropy for Contrastive Learning - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Representation learning: contrastive objective that explicitly maximizes incremental entropy with an information-bottleneck formulation.
-
Probing Length Generalization in Mamba via Image Reconstruction - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Core architecture analysis: probes Mamba length generalization failure modes and introduces a length-adaptive variant.
-
Revisiting Model Stitching In the Foundation Model Era - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Representation learning via model stitching: a systematic study of cross-model feature compatibility in heterogeneous vision foundation models.
-
A Geometrically-Grounded Drive for MDL-Based Optimization in Deep Learning - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Representation learning/compression: integrates MDL directly into training dynamics with a theoretical geometric optimization framework.
-
Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Activation engineering method that improves steering vectors via cross-layer representation evolution, directly targeting core representation/control methodology in LLMs.
-
OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Uses sparse autoencoders to disentangle superposed features and applies orthogonal projection for concept erasure, directly targeting representation structure.
-
A Stable Neural Statistical Dependence Estimator for Autoencoder Feature Analysis - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Representation analysis: stable neural statistical dependence estimator for quantifying input-latent-reconstruction dependence in autoencoders.
-
A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Representation learning theory: universal nearest-neighbor intrinsic dimensionality estimator with distribution-free consistency.
-
Digging Deeper: Learning Multi-Level Concept Hierarchies - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Proposes MLCS and Deep-HiCEMs for hierarchical concepts and interventions — matches Representation Learning (concept/dictionary learning) and architecture innovation.
-
Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Matches Training Dynamics/Optimization: theoretical reinterpretation of SAM and a new XSAM update that improves generalization with minimal overhead.
-
What is Missing? Explaining Neurons Activated by Absent Concepts - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning — identifies and explains neurons encoding absences via extensions to attribution/feature visualization.
-
Curveball Steering: The Right Direction To Steer Isn't Always Linear - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning: geometry-aware nonlinear activation steering via polynomial kernel PCA, challenging the linear representation hypothesis.
-
Transductive Generalization via Optimal Transport and Its Application to Graph Node Classification - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning: introduces OT-based, representation-dependent transductive generalization bounds and analyzes how GNN aggregation transforms representation distributions with depth.
-
An accurate flatness measure to estimate the generalization performance of CNN models - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning/Training Dynamics: derives an exact, architecture-aware Hessian-trace-based flatness measure for CNNs (with GAP), robustly linked to generalization.
-
Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning and Efficiency: layer- and token-wise analysis of dLLMs vs AR LMs; introduces inference-time layer skipping achieving FLOPs reductions without KV-cache tricks.
-
Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Representation Learning — probes frozen foundation-model features for continuous geometry, with layer-wise signal localization and objective/architecture comparisons.
-
Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Representation Learning — sparse auto-encoder yields interpretable visual words and enables sparse inverted-index retrieval (sparse coding aligning with efficiency/interpretability).
-
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Representation Learning / Mechanistic Interpretability: disentangled safety subspaces (recognition vs execution) with causal steering in LLMs.
-
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning: activation/attention probing analyzes belief dynamics; Efficiency: probe-guided early-exit enables adaptive computation with large token savings.
-
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Model Compression/Efficiency + Representation Learning: CompACT discrete tokenizer compresses each observation to ~8 tokens for world models, enabling orders-of-magnitude faster planning with preserved task information.
-
How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression? - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning — analyzes implicit bias/training dynamics of gradient descent in shallow ReLU models, quantifying deviation from minimum-l2 solution.
-
Understanding the Dynamics of Demonstration Conflict in In-Context Learning - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning/Training Dynamics: empirical analysis of in-context learning under conflicting demonstrations; identifies and validates phase-specific attention heads causing failures.
-
Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning/Interpretability: Delta-Crosscoder with sparsity and delta-based loss to isolate causal latent directions that differ after fine-tuning.
-
Efficient Refusal Ablation in LLM through Optimal Transport - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Representation Learning by transforming activation distributions with optimal transport and revealing layer-localized safety representations.
-
Towards Improved Sentence Representations using Token Graphs - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Model Architecture/Representation Learning: structure-aware pooling via token-similarity graphs and a lightweight GNN over frozen LLM outputs.
-
StructLens: A Structural Lens for Language Models via Maximum Spanning Trees - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Structure-aware inter-layer analysis via MSTs over residual streams; aids layer pruning (representation learning and model compression).
-
Controlling Chat Style in Language Models via Single-Direction Editing - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Representation Engineering: linear-direction editing in activation space for precise, training-free style control and composition.
-
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Analyzes internal LLM representations via geometric consistency over conversational history (representation learning/training dynamics).
-
On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Representation Learning/Training Dynamics: provable slow convergence of robustness margin in non-linear ReLU networks.
-
Discrete World Models via Regularization - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Matches Representation Learning with sparsity: unsupervised Boolean world models via entropy/independence/locality regularizers and robust discrete optimization.
-
Rate-Distortion Signatures of Generalization and Information Trade-offs - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Representation Learning: uses rate–distortion theory to analyze accuracy–robustness/generalization trade-offs across models.
-
Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Representation Learning — introduces trajectory-based analysis of layer-wise representation displacement to distinguish valid vs. spurious reasoning (tested on dense and MoE LLMs).
-
Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Representation Learning: critical analysis of identifiability metrics with taxonomy and stress-testing suite.
-
A Mixed Diet Makes DINO An Omnivorous Vision Encoder - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches Representation Learning criterion: cross-modal alignment with a distillation objective to learn a modality-agnostic embedding space anchored to a frozen DINOv2 teacher.
Other Foundational Research (23)
-
AI Must Embrace Specialization via Superhuman Adaptable Intelligence - Score: 20.0 (R=0, N=0) - Date: 2026-03-02 - Comment: Author match
-
Self-Regularized Learning Methods - Score: 19 (R=10, N=9) - Date: 2026-03-18 - Comment: Provides a general theoretical framework for implicit regularization via self-regularization, covering gradient descent and yielding optimal statistical rates.
-
Lost in Aggregation: On a Fundamental Expressivity Limit of Message-Passing Graph Neural Networks - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: Proves a fundamental expressivity limit of message-passing GNNs under generic aggregation, separating them sharply from graph isomorphism procedures.
-
Neural Networks as Local-to-Global Computations - Score: 18 (R=9, N=9) - Date: 2026-03-17 - Comment: Reinterprets feedforward ReLU networks as local-to-global sheaf computations with harmonic extension and bidirectional heat-equation dynamics.
-
Non-Euclidean Gradient Descent Operates at the Edge of Stability - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Training Dynamics: generalizes Edge-of-Stability theory to non-Euclidean norms with a geometry-aware sharpness measure across optimizers.
-
Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Matches Training Dynamics theory: GRPO policy gradient as a U-statistic with MSE bounds, oracle equivalence, and a universal group-size scaling law.
-
Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Provable algorithmic gains from autocurriculum for reasoning-model SFT and RL fine-tuning.
-
The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Theoretical study of geometric limits of causal interventions in continuous generative models, introducing manifold tearing and a causal uncertainty principle.
-
Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Test-time reinforcement learning for unified multimodal models, with metacognitive monitoring signals enabling parameter updates and self-improvement at inference time.
-
Transition Flow Matching - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Foundational generative modeling: directly learning transition flow as a global quantity enables single-step or arbitrary-time generation with theoretical unification.
-
Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Theoretical optimization: variance-reduced forward-reflected-backward splitting with new biased and unbiased estimators plus convergence and oracle complexity guarantees.
-
Preconditioned One-Step Generative Modeling for Bayesian Inverse Problems in Function Spaces - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Introduces a neural-operator-based one-step generative sampler for Bayesian inverse problems with function-space stability analysis.
-
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Foundational analysis of why ideal noise-correction fails, linking optimization dynamics, convergence states, and information-theoretic limits.
-
Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Unifies major membership inference attacks under an exponential-family likelihood-ratio framework and introduces Bayesian variance estimation for low-shadow-model regimes.
-
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Training Dynamics/Optimization for large models: HTMuon encourages heavy-tailed spectra with theory (Schatten‑q steepest descent) and improved LLM pretraining.
-
Inducing Sustained Creativity and Diversity in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Novel decoding method for sustained diversity and creativity in LLM generation, targeting inference-time behavior rather than application tuning.
-
Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Controlled methodological study of 51 post-training algorithms uncovering scale-dependent ranking inversions and isolating algorithmic effects.
-
Optimal Splitting of Language Models from Mixtures to Specialized Domains - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Scaling-law method for optimal compute allocation between pretraining and specialization when splitting language models into domain-specific models.
-
Foundations of Schrödinger Bridges for Generative Modeling - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Builds mathematical foundations for Schrödinger bridges as a unifying framework behind diffusion, score, and flow-based generative models.
-
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Studies learning-rate scheduling as a foundational training-dynamics question, linking no-decay pretraining to flatter minima and better downstream adaptability.
-
Towards Understanding Adam Convergence on Highly Degenerate Polynomials - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Training dynamics — theoretical analysis of Adam’s auto-convergence and stability regimes on degenerate polynomials.
-
DC-Merge: Improving Model Merging with Directional Consistency - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Model merging/parameter-space geometry: enforces directional consistency via singular-space smoothing and orthogonal subspace alignment.
-
Ensembling Language Models with Sequential Monte Carlo - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: High-Performance Computing/Algorithms — Sequential Monte Carlo decoding to sample from f-ensemble LM distributions in a shared byte space, enabling principled ensembling across vocabularies.