Personalized Monthly Topic Summary 2026/03

Metric                              Value
Total Papers                        546
Model Architecture                  152
Model Compression and Efficiency    159
High Performance Computing          78
Representation Learning             134
Other Foundational Research         23

Model Architecture (152)

  1. Functorial Neural Architectures from Higher Inductive Types - Score: 20 (R=10, N=10) - Date: 2026-03-18 - Comment: Introduces a new architecture class with formal compositional-generalization guarantees via functoriality, and proves self-attention is non-functorial for nontrivial compositional tasks.

  2. The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks - Score: 20.0 (R=0, N=0) - Date: 2026-03-06 - Comment: Author match

  3. Any-Subgroup Equivariant Networks via Symmetry Breaking - Score: 19 (R=10, N=9) - Date: 2026-03-23 - Comment: Architecture theory for equivariant networks: a single model attains any subgroup equivariance through symmetry-breaking inputs with universality guarantees.

  4. ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit - Score: 19 (R=10, N=9) - Date: 2026-03-19 - Comment: Theoretical characterization of large-scale ResNet training dynamics with rigorous convergence rates in the joint infinite depth-width-dimension limit.

  5. Learning to Recall with Transformers Beyond Orthogonal Embeddings - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: Transformer theory under finite data and non-orthogonal embeddings, yielding explicit storage-capacity scalings.

  6. Mamba-3: Improved Sequence Modeling using State Space Principles - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: State-space sequence architecture with complex recurrence and MIMO design improving the performance-efficiency frontier.

  7. M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling - Score: 19 (R=10, N=9) - Date: 2026-03-16 - Comment: Introduces matrix-valued nonlinear recurrent layers as a scalable core architecture with stronger expressivity than standard transformer blocks.

  8. Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks - Score: 19 (R=10, N=9) - Date: 2026-03-13 - Comment: Provides a proof that attention sinks are functionally necessary in softmax Transformers for trigger-conditional computation.

  9. Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation - Score: 19 (R=10, N=9) - Date: 2026-03-06 - Comment: Model Architecture (MoE): universal expert pool with virtual width (depth–width transformation), staggered rotational sharing, and depth-aware load balancing/routing.

  10. Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget - Score: 19 (R=10, N=9) - Date: 2026-03-05 - Comment: Conditional routing that replaces Transformer MLPs with linear surrogates when possible—dynamic networks/efficiency and architectural analysis.

  11. Recursive Models for Long-Horizon Reasoning - Score: 19 (R=10, N=9) - Date: 2026-03-03 - Comment: Model Architecture — formalizes recursive models enabling long-horizon reasoning with provable reductions in active context requirements beyond single-sequence methods.

  12. Transformers are Stateless Differentiable Neural Computers - Score: 18 (R=10, N=8) - Date: 2026-03-23 - Comment: Model architecture/theory: formally derives causal Transformers as stateless differentiable neural computers with external memory semantics.

  13. Path-Constrained Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-03-19 - Comment: MoE architecture innovation: constraining cross-layer expert path space by sharing routers across layers.

  14. Learning When to Attend: Conditional Memory Access for Long-Context LLMs - Score: 18 (R=10, N=8) - Date: 2026-03-19 - Comment: Conditional attention architecture for long-context LLMs that learns token-wise global memory access.

  15. Mixture-of-Depths Attention - Score: 18 (R=10, N=8) - Date: 2026-03-17 - Comment: Introduces a new transformer attention primitive that mixes current-layer and cross-layer KV access, with an accompanying hardware-efficient algorithm nearly matching FlashAttention-2 efficiency.

  16. PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers - Score: 18 (R=10, N=8) - Date: 2026-03-16 - Comment: Replaces transformer attention with a learnable Fourier-solved PDE state-space block, a core architectural innovation for efficient spatial mixing.

  17. Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing - Score: 18 (R=10, N=8) - Date: 2026-03-13 - Comment: Model architecture innovation: threshold-based MoE routing gives causal dynamic computation allocation with load balancing without auxiliary losses.

  18. Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Model Architecture/Efficiency: MoE scaling law optimizing expert vs. attention FLOPs; explicit formula for optimal compute allocation under sparsity.

  19. Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Exact theory of transformer position bias at initialization — matches Model Architecture: analysis/innovations on transformers and training dynamics.

  20. MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: HPC/efficiency for MoE: speculative decoding as lookahead for memory management with dynamic partitioning and async prefetch/eviction.

  21. ConFu: Contemplate the Future for Better Speculative Sampling - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Speculative decoding with contemplate tokens and MoE gating to boost acceptance — matches Model Compression and Efficiency and Mixture-of-Experts.

  22. On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Optimizer/Scaling Theory: introduces operator-norm-based geometry with mean-normalized, layerwise composable norms enabling width-independent smoothness and learning-rate transfer; proposes row/column-normalized optimizers (MOGA).

  23. Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Model Architecture (MoE): Bayesian variational routing confined to expert selection for calibrated, uncertainty-aware MoE Transformers with <1% extra FLOPs.

  24. Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Model Architecture — theoretical expressivity/efficiency benefits of hybrid Transformer+SSM models over non-hybrids.

  25. Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: Training dynamics theory: gradient flow on value–softmax drives low-entropy outputs, explaining attention phenomena.

  26. Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: Theory of model architecture/expressivity: Lie-algebraic analysis of depth in parallelizable sequence models (Transformers/SSMs).

  27. PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Model Architecture/Efficiency: training-free, adapter-free 2D-to-3D lifting operator (PlaneCycle) enabling 3D fusion while reusing 2D backbones.

  28. Data-Aware Random Feature Kernel for Transformers - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Matches Compression/Efficiency and Model Architecture: data-aware random-feature attention (learned covariance) enabling importance-sampled linear attention (DARKFormer).

  29. The Expressive Limits of Diagonal SSMs for State-Tracking - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Strongly matches Model Architecture (theoretical analysis): expressivity limits of diagonal SSMs for state-tracking with precise group-theoretic characterization.

  30. Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Training Dynamics of Self-Attention — structure-aware preconditioned gradient descent with spectral initialization yields geometric-rate global convergence.

  31. TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: High Performance Computing + MoE: heterogeneous GPU–CPU–DIMM-NDP offloading with bottleneck-aware expert scheduling for high-throughput MoE inference.

  32. Expert Divergence Learning for MoE-based Language Models - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Architecture (MoE): encourages expert specialization via label-driven Jensen–Shannon divergence on routing distributions.

  33. Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency + MoE: token-aware adaptive error compensation using routed low-rank mixture-of-experts for PTQ of VLMs.

  34. Transformers are Bayesian Networks - Score: 18 (R=9, N=9) - Date: 2026-03-18 - Comment: Theoretical characterization of transformer layers as loopy belief propagation in Bayesian networks, with uniqueness results.

  35. A Family of LLMs Liberated from Static Vocabularies - Score: 18 (R=9, N=9) - Date: 2026-03-17 - Comment: Core transformer architecture redesign replacing static token vocabularies with hierarchical byte-level encoding/decoding.

  36. Local Urysohn Width: A Topological Complexity Measure for Classification - Score: 18 (R=9, N=9) - Date: 2026-03-17 - Comment: Develops a new theoretical complexity measure for classification based on local Urysohn width, with hierarchy and sample-complexity results.

  37. From Gradients to Riccati Geometry: Kalman World Models for Single-Pass Learning - Score: 18 (R=9, N=9) - Date: 2026-03-14 - Comment: Proposes a gradient-free training paradigm for state-space models and transformers using Kalman-style recursive filtering, with stability and complexity analysis.

  38. Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding - Score: 18 (R=9, N=9) - Date: 2026-03-09 - Comment: High Performance Computing and Architecture: formal analysis and lower bounds for grammar-constrained decoding; connects to Transformers/MoE with latency envelopes.

  39. Exclusive Self Attention - Score: 17 (R=10, N=7) - Date: 2026-03-11 - Comment: Model Architecture: Exclusive Self Attention modifies Transformer attention to exclude self-position information, improving long-sequence modeling.

  40. On the Ability of Transformers to Verify Plans - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Transformer theory: introduces C*-RASP and proves length-generalization guarantees for plan verification with growing vocabulary size.

  41. Neural Dynamics Self-Attention for Spiking Transformers - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Introduces a new spiking self-attention mechanism that adds locality bias and removes explicit attention-matrix storage to cut inference memory.

  42. Speculating Experts Accelerates Inference for Mixture-of-Experts - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: MoE inference systems method that speculates future experts to overlap CPU-GPU transfers with compute under expert offloading.

  43. Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Training efficiency method: a lower-overhead whitening optimizer for faster transformer training with convergence analysis.

  44. CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: KV-cache-efficient attention architecture conversion: covariance-aware factorization and nonuniform rank allocation for converting GQA to MLA.

  45. Attention Sinks Induce Gradient Sinks - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Mechanistic Transformer analysis linking attention sinks to gradient sinks and massive activations through backpropagation dynamics.

  46. Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: MoE interpretability method that localizes factual knowledge by contrasting cross-lingual router behavior and causally validating expert necessity.

  47. GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Introduces a graph transformer with O(N) spectral positional encoding that preserves gauge invariance and includes theory for discretization-invariant neural operators.

  48. SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: New attention architecture derived from inertial dynamics on density manifolds, yielding accelerated momentum attention blocks.

  49. MoLoRA: Composable Specialization via Per-Token Adapter Routing - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Model architecture: per-token adapter routing with Mixture-of-LoRA enables composable specialization within a single sequence.

  50. Directional Routing in Transformers - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Proposes a lightweight transformer routing mechanism where attention heads use learned suppression directions controlled by a shared router, yielding a core architectural change analyzed mechanistically.

  51. Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Presents gauge-equivariant intrinsic neural operators, a core operator-learning architecture with strong geometry-consistency guarantees.

  52. Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Training dynamics analysis: spectral-edge SVD reveals low-rank signal-noise structure and phase transitions in transformer optimization trajectories.

  53. Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Optimization theory for transformers trained with cross-entropy: derives complex-singularity step-size bounds from softmax geometry with a cheap JVP-based safety criterion.

  54. Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: NTK-based theory for linearized attention showing non-convergence and introducing influence malleability as a core property.

  55. As Language Models Scale, Low-order Linear Depth Dynamics Emerge - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Core architecture analysis: identifies low-order linear surrogate dynamics emerging across transformer depth as models scale.

  56. Marginals Before Conditionals - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Training Dynamics/Representation: Minimal conditional learning task revealing plateau/transition and selector-routing head dynamics.

  57. RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: High-Performance Computing: General operator fusion for cascaded reductions (e.g., safe softmax+GEMM in attention) with formal analysis and auto kernel generation.

  58. From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture/Representation Learning: a hierarchical masked autoencoder with a cascaded decoder and progressive masking curriculum for multi-granular representation learning.

  59. Quantifying the Necessity of Chain of Thought through Opaque Serial Depth - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Representation Learning/Architecture Theory: formalizes opaque serial depth to bound non-externalized reasoning in neural nets; includes analysis showing Mixture-of-Experts likely has lower opaque depth than dense models.

  60. A Variational Latent Equilibrium for Learning in Cortex - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Training Dynamics/Architecture: proposes a variational latent equilibrium framework approximating BPTT with fully local dynamics, unifying energy-based spatiotemporal credit assignment.

  61. Generalized Reduction to the Isotropy for Flexible Equivariant Neural Fields - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture — general orbit-equivalence reduction enabling flexible equivariant neural fields under arbitrary group actions.

  62. Permutation-Equivariant 2D State Space Models: Theory and Canonical Architecture for Multivariate Time Series - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture and Theory: derives canonical permutation-equivariant 2D state-space form and proposes VI 2D SSM/Mamba, eliminating variable-axis ordering and reducing dependency depth.

  63. RAC: Rectified Flow Auto Coder - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: Architecture: Rectified Flow-based autoencoder enabling multi-step, bidirectional inference and reduced parameters.

  64. Functionality-Oriented LLM Merging on the Fisher--Rao Manifold - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Model Architecture/Systems — functionality-oriented LLM merging via Fisher–Rao Karcher mean with a practical fixed-point algorithm; prevents collapse and scales to N>2 experts.

  65. The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Model Architecture/Training Dynamics: shows how CNN locality and weight sharing reshape implicit regularization at EoS, explaining superior generalization.

  66. CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Architecture—new symplectic Causal Hamiltonian Learning Unit conserving phase-space volume to stabilize long-horizon memory.

  67. Spectral Condition for $\mu$P under Width-Depth Scaling - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: High Performance Computing/Training Dynamics: unified spectral μP condition for stable width–depth scaling and hyperparameter transfer across optimizers.

  68. Memory Caching: RNNs with Growing Memory - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Matches Model Architecture and Efficiency criteria: introduces Memory Caching to grow RNN effective memory with sequence length, interpolating between RNN and Transformer memory-compute trade-offs.

  69. Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Representation/post-training insight: shows large pretrained models contain dense nearby task experts, enabling parallel random perturbation selection and ensembling.

  70. Chemical Reaction Networks Learn Better than Spiking Neural Networks - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Theoretical architecture result proving stronger expressivity of chemical reaction networks than spiking neural networks, with regret and VC-dimension analysis.

  71. AIMER: Calibration-Free Task-Agnostic MoE Pruning - Score: 16 (R=9, N=7) - Date: 2026-03-20 - Comment: Calibration-free pruning criterion for MoE experts, directly addressing model compression and serving efficiency.

  72. LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing - Score: 16 (R=9, N=7) - Date: 2026-03-14 - Comment: Model compression for MoE: replaces redundant experts with parameter-efficient modules to reduce memory without full expert merging/pruning.

  73. The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers - Score: 16 (R=9, N=7) - Date: 2026-03-12 - Comment: Representation/Architecture Analysis: Identifies binary routing in Transformer FFNs, explaining conditional computation behavior.

  74. SCORE: Replacing Layer Stacking with Contractive Recurrent Depth - Score: 16 (R=9, N=7) - Date: 2026-03-12 - Comment: Model Architecture: Replaces layer stacking with contractive recurrent depth (ODE-inspired shared block) across MLP/GNN/Transformer.

  75. Routing without Forgetting - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Model Architecture: embeds energy-based associative retrieval (Modern Hopfield) within transformers for input-conditioned dynamic routing in online continual learning without gradient specialization.

  76. Warm Starting State-Space Models with Automata Learning - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model architecture/theory: proves exact realization of Moore machines as state-space models and uses symbolic automata to warm-start SSMs.

  77. Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Architecture/efficiency: single dense Transformer encoder unifying modalities, replacing MoE/routing with shared parameters.

  78. Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts - Score: 16 (R=9, N=7) - Date: 2026-03-05 - Comment: Systematic study of ensembling/merging/routing among parameter-efficient experts—experts/routing (MoE-style) for multi-task efficiency.

  79. TiledAttention: a CUDA Tile SDPA Kernel for PyTorch - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: High Performance Computing: editable CUDA tile SDPA kernel enabling schedule-level research with online softmax and tiled KV streaming for attention efficiency.

  80. CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning - Score: 16 (R=9, N=7) - Date: 2026-03-02 - Comment: Model Architecture: Mixture-of-Experts with stage-aligned experts and routing for hybrid-capabilities reasoning.

  81. Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training - Score: 16 (R=9, N=7) - Date: 2026-03-02 - Comment: Representation Learning/Training Dynamics: analyzes optimizer-induced low-dimensional drift and transverse dynamics in transformer parameter trajectories.

  82. NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Transformer architecture innovation: Gibbs/Ising attention with game-theoretic token valuation and convergence analysis.

  83. AS2 -- Attention-Based Soft Answer Sets: An End-to-End Differentiable Neuro-Soft-Symbolic Reasoning Architecture - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Model architecture: proposes a fully differentiable neuro-symbolic reasoning architecture using a soft fixed-point approximation to ASP consequence operators.

  84. An SO(3)-equivariant reciprocal-space neural potential for long-range interactions - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Core architecture innovation: SO(3)-equivariant reciprocal-space message passing to model long-range interactions consistently.

  85. Variational Phasor Circuits for Phase-Native Brain-Computer Interface Classification - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Introduces a new phase-native classifier architecture on the S^1 manifold using trainable phase shifts, unitary mixing, and interference instead of dense real-valued layers.

  86. LoST: Level of Semantics Tokenization for 3D Shapes - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core architecture/tokenization design for generative 3D models by ordering tokens by semantic salience rather than geometric level-of-detail.

  87. Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core architecture innovation for flow-matching control: replacing fixed-time integration with time-unconditional optimization for adaptive compute and OOD detection.

  88. Gaussian Process Limit Reveals Structural Benefits of Graph Transformers - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Model architecture theory: derives GP limits for graph transformers and proves structural anti-oversmoothing benefits over graph convolutions.

  89. Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core architecture work on scalable continuous SE(3)-equivariant kernels using coordinate-based convolution design.

  90. The Phasor Transformer: Resolving Attention Bottlenecks on the Unit Circle - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Model architecture/efficiency: replaces quadratic attention with unit-circle phase blocks plus DFT-based global token mixing in O(N log N).

  91. Transformers Can Learn Rules They've Never Seen: Proof of Computation Beyond Interpolation - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Gives a theoretical and empirical analysis of transformers' ability to compute unseen rules beyond interpolation, including circuit-level evidence.

  92. Demystifing Video Reasoning - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Mechanistic analysis of diffusion-transformer reasoning that identifies denoising-step dynamics and layer specialization as the core substrate.

  93. Self-Aware Markov Models for Discrete Reasoning - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Proposes a discrete reasoning architecture with self-correcting remasking and adaptive stopping, extending masked diffusion-style models with dynamic computation.

  94. NeuronSpark: A Spiking Neural Network Language Model with Selective State Space Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Introduces a pure spiking language-model architecture with selective state-space dynamics and custom training/stabilization methods.

  95. Deriving Hyperparameter Scaling Laws via Modern Optimization Theory - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Optimization-theoretic derivation of hyperparameter scaling laws for learning rate, momentum, and batch size.

  96. PhasorFlow: A Python Library for Unit Circle Based Computing - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Core architecture proposal: unit-circle/phasor computation framework with variational phasor circuits and a DFT-based transformer alternative to attention.

  97. Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Model architecture: memory-augmented transformer designed for unlearning by deleting instance-specific keys instead of updating weights.

  98. Ablate and Rescue: A Causal Analysis of Residual Stream Hyper-Connections - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Mechanistic analysis of multi-stream transformer residual architectures using causal stream ablation-and-rescue interventions.

  99. Universe Routing: Why Self-Evolving Agents Need Epistemic Control - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Conditional/modular architecture idea: explicit hard routing across epistemically incompatible solvers, with MoE-style comparison and continual expansion results.

  100. Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Proposes a theoretical framework and scaling law for when smaller orchestrated models can outperform larger ones, directly addressing foundational model-scaling assumptions.

  101. D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Agent memory architecture with reward-prediction-error routing that cuts long-term memory write complexity from O(N^2) to selective O(1)/O(N) paths.

  102. Towards One-for-All Anomaly Detection for Tabular Data - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Generalist tabular anomaly detection architecture using transferable neighbor-distance representations and MoE fusion across unseen datasets.

  103. From Specification to Architecture: A Theory Compiler for Knowledge-Guided Machine Learning - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Foundational architecture-generation agenda: compiling typed domain theories into provably theory-consistent model architectures.

  104. Sampling Boltzmann distributions via normalizing flow approximation of transport maps - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Puts normalizing-flow Boltzmann sampling on firm mathematical footing with existence and approximation results for low-regularity targets.

  105. Equivalence of approximation by networks of single- and multi-spike neurons - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Architecture theory for spiking networks: proves approximation-equivalence between single-spike and multi-spike neuron networks up to linear overhead.

  106. Scalable Machines with Intrinsic Higher Mental-State Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Presents a core architectural modification to transformer attention via triadic modulation loops that pre-select relevant information with claimed linear-time scaling.

  107. HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal Understanding - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Dynamic composition architecture with typed causal primitives and routing into differentiable execution graphs directly targets core model architecture design.

  108. Separable neural architectures as a primitive for unified predictive and generative intelligence - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Introduces separable neural architectures as a core architectural primitive that factorizes high-dimensional mappings via controlled interaction order and tensor rank.

  109. Geometry-Aware Probabilistic Circuits via Voronoi Tessellations - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Probabilistic modeling architecture: geometry-aware probabilistic circuits with Voronoi-structured sum nodes and tractability conditions.

  110. Flowers: A Warp Drive for Neural PDE Solvers - Score: 16 (R=8, N=8) - Date: 2026-03-06 - Comment: Model Architecture: warp-based operator network (no attention/Fourier/convolution) enabling linear-cost global interactions for PDE solution operators.

  111. Scalable Prompt Routing via Fine-Grained Latent Task Discovery - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Uses fine-grained latent task discovery plus a mixture-of-experts router, making the main contribution a core conditional architecture.

  112. DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Hardware-aware Transformer efficiency method: distribution-aware piecewise activations for faster on-device inference and training.

  113. Towards Solving Polynomial-Objective Integer Programming with Hypergraph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Hypergraph neural network architecture for polynomial-objective integer programs, explicitly modeling high-degree term-variable-constraint interactions.

  114. Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Mixture-of-Experts post-training recipe combining Cascade RL with multi-domain on-policy distillation for a compact high-capacity model.

  115. DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Targets efficient MoE inference with dynamic expert orchestration and mixed-precision quantization on edge hardware.

  116. Transformers Learn Robust In-Context Regression under Distributional Uncertainty - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Analyzes Transformer in-context regression under broad distributional uncertainty, probing a core capability of the architecture.

  117. TARo: Token-level Adaptive Routing for LLM Test-time Alignment - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Token-level adaptive routing is a conditional/dynamic network mechanism for inference-time control of LLM reasoning.

  118. From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Mechanistic representation analysis of MLLMs, pinpointing how segmentation information degrades in the adapter and is recovered through attention dynamics in later layers.

  119. Dependence Fidelity and Downstream Inference Stability in Generative Models - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Foundational theory for generative models: shows marginal matching can fail to preserve dependence structure and gives covariance-level guarantees for downstream inference stability.

  120. Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Uses sparse autoencoders to decode steering vectors in a 35B MoE, probing and causally intervening on internal behavioral representations.

  121. Parallel In-context Learning for Large Vision Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Efficiency method for Transformer-based multimodal in-context learning: parallel chunking plus Product-of-Experts aggregation reduces the quadratic context cost at inference.

  122. Selective Memory for Artificial Intelligence: Write-Time Gating with Hierarchical Archiving - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Write-time gating with hierarchical archival is a memory-architecture contribution for selective external knowledge storage and retrieval efficiency.

  123. Tackling Over-smoothing on Hypergraphs: A Ricci Flow-guided Neural Diffusion Approach - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Theoretical and methodological hypergraph architecture work: Ricci-flow-guided neural diffusion to control message passing and mitigate over-smoothing.

  124. CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Introduces a new Transformer-based continual-learning architecture with dynamic neuron thresholds and gated head selection.

  125. AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Proposes replacing transformer backbones with deep state space models in a vision-language-action architecture for efficient multimodal sequence modeling.

  126. Masked BRep Autoencoder via Hierarchical Graph Transformer - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Core architecture and representation learning: masked graph autoencoder with hierarchical graph Transformer for self-supervised CAD representation learning.

  127. AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Low-rank adapter method with zero initialization and a rank-capacity theory for frozen Vision Transformers.

  128. On the Degrees of Freedom of Gridded Control Points in Learning-Based Medical Image Registration - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Sparse control-point deformation with cross-attention targets core architecture/memory efficiency for 3D registration.

  129. WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Model architecture: system-aware Mixture-of-Experts with structural embeddings for scalable world models across heterogeneous robots.

  130. Representation Alignment for Just Image Transformers is not Easier than You Think - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Core architecture/training insight: analyzes why representation alignment fails in pixel-space diffusion transformers and introduces a corrected alignment method.

  131. Human-like Object Grouping in Self-supervised Vision Transformers - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Representation analysis in vision transformers: quantifies object-centric patch similarity and links Gram structure to human-like grouping.

  132. Exploring the Dimensions of a Variational Neuron - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Introduces a neuron-level variational computational unit with explicit prior/posterior and analyzes latent dimensionality as a core architectural primitive.

  133. PA-Net: Precipitation-Adaptive Mixture-of-Experts for Long-Tail Rainfall Nowcasting - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Precipitation-adaptive MoE dynamically allocates experts by token intensity, a clear conditional-network architectural idea.

  134. Routing Channel-Patch Dependencies in Time Series Forecasting with Graph Spectral Decomposition - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Core architecture method for adaptive routing of channel dependencies using graph spectral decomposition and frequency-specific experts.

  135. Deep Invertible Autoencoders for Dimensionality Reduction of Dynamical Systems - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Core autoencoder architecture contribution: invertible autoencoders for dimensionality reduction that mitigate projection-error plateaus as latent dimension grows.

  136. Event-Driven Video Generation - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Core architecture innovation for video transformers: event-gated sampling adds explicit interaction structure to DiT generation.

  137. NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: MoE-style PEFT architecture with context-aware neuromodulation gating and orthogonality regularization for better expert separation.

  138. Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Mechanistic interpretability of transformers by localizing demographic bias to individual attention heads in CLIP's vision encoder.

  139. Context-dependent manifold learning: A neuromodulated constrained autoencoder approach - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Autoencoder architecture for context-dependent manifold learning using neuromodulated geometric constraints.

  140. Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Riemannian geometry-preserving VAE for SPD matrices — matches Model Architecture (Autoencoders) and Representation Learning on manifolds.

  141. ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Model Architecture/Efficiency: Mixture-of-LoRAs with reinforcement-based router enabling dynamic conditional routing in finetuning.

  142. Bridging Domains through Subspace-Aware Model Merging - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Model Architecture: subspace-aware model merging (SCORE) resolving singular subspace conflicts via shared orthogonal basis and pruning.

  143. Recursive Inference Machines for Neural Reasoning - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Model Architecture — introduces Recursive Inference Machines that embed recursive inference mechanisms; generalizes TRMs with a reweighting component for neural reasoning.

  144. Symbol-Equivariant Recurrent Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture: enforces permutation equivariance in recurrent reasoning models via symbol-equivariant layers for symmetry-aware reasoning.

  145. Phase-Type Variational Autoencoders for Heavy-Tailed Data - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture: introduces a Phase-Type (CTMC absorption-time) decoder in VAEs for heavy-tailed generative modeling.

  146. Invariant-Stratified Propagation for Expressive Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture: Invariant-Stratified Propagation (ISP) with a WL variant and neural implementation for more expressive GNNs.

  147. Spectral Attention Steering for Prompt Highlighting - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: training-free attention steering via spectral key editing compatible with FlashAttention; query-adaptive expert routing.

  148. Polynomial Mixing for Efficient Self-supervised Speech Encoders - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: Polynomial Mixer as a linear-time token-mixing replacement for self-attention in encoders.

  149. MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches Model Architecture criterion: multi-resolution Vision Transformer with shared world-coordinate embeddings and extended RoPE for scale-consistent attention.

  150. Intrinsic Lorentz Neural Network - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Model Architecture: fully intrinsic hyperbolic (Lorentz) neural network with novel point-to-hyperplane layer and intrinsic normalization/operators.

  151. ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches Model Architecture/Efficiency criterion: conditional/dynamic routing between Fast and Slow agents with free-energy-based fusion for test-time compute scaling in LLM reasoning.

  152. Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Model Architecture (MoE): integrates DINT attention with a Sparse Mixture-of-Experts for modality-shared and routed experts in a multimodal foundation model.

Model Compression and Efficiency (159)

  1. AI+HW 2035: Shaping the Next Decade - Score: 20.0 (R=0, N=0) - Date: 2026-03-06 - Comment: Author match

  2. A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks - Score: 19 (R=10, N=9) - Date: 2026-03-19 - Comment: Foundational sparsity theory for infinite-width ReLU networks using dual certificates in TV-regularized training.

  3. The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training - Score: 19 (R=10, N=9) - Date: 2026-03-12 - Comment: Analyzes anisotropy and mean bias as a rank-one driver of FP4 instability and proposes mean subtraction; matches Model Compression and Efficiency: quantization stability.

  4. SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity - Score: 19 (R=10, N=9) - Date: 2026-03-06 - Comment: Model Compression and Efficiency + Systems: enables (2N−2):2N structured sparsity (e.g., 6:8) on 2:4 Sparse Tensor Cores via sliding-window decomposition and activation lifting, achieving near-theoretical speedups with preserved accuracy.

  5. WaterSIC: information-theoretically (near) optimal linear layer quantization - Score: 19 (R=10, N=9) - Date: 2026-03-06 - Comment: Model Compression and Efficiency — Quantization: proposes WaterSIC, an information-theoretically near-optimal linear layer quantizer with waterfilling-style rate allocation and provable 0.255-bit rate gap.

  6. Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization - Score: 19 (R=10, N=9) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: curvature-aware MDL framework for layer-adaptive capacity allocation/pruning (e.g., expert slots, LoRA ranks) with closed-form solutions and regret bounds.

  7. On De-Individuated Neurons: Continuous Symmetries Enable Dynamic Topologies - Score: 19 (R=10, N=9) - Date: 2026-03-02 - Comment: Matches Model Architecture and Compression/Efficiency criteria: introduces isotropic activation primitives enabling dynamic topology (neurogenesis/degeneration) and exact connectivity pruning with sparsity.

  8. Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys - Score: 18 (R=10, N=8) - Date: 2026-03-16 - Comment: Model compression and efficiency: unifies KV-cache compression and sparse attention retrieval via self-indexing 1-bit quantized keys with custom CUDA integration.

  9. Leech Lattice Vector Quantization for Efficient LLM Compression - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: Model compression and efficiency: high-dimensional Leech lattice vector quantization with codebook-free indexing and parallel dequantization.

  10. LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation - Score: 18 (R=10, N=8) - Date: 2026-03-12 - Comment: KV cache eviction with learned importance prediction without draft generation — matches Model Compression and Efficiency: cache/memory optimization for LLM inference.

  11. Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Model Compression and Efficiency — differentiable L0 sparsity via relaxed Bernoulli gates to discover Strong Lottery Tickets without training weights.

  12. Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction - Score: 18 (R=10, N=8) - Date: 2026-03-11 - Comment: Compression/Efficiency: proposes Overflow-Aware Scaling and Macro Block Scaling to improve 4-bit MXFP4 quantization fidelity for LLMs without hardware changes.

  13. EvoESAP: Non-Uniform Expert Pruning for Sparse MoE - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: Model Compression and Efficiency (MoE): non-uniform layer-wise expert pruning using a stable ESAP proxy and evolutionary search to optimize memory/throughput under a budget.

  14. Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices - Score: 18 (R=10, N=8) - Date: 2026-03-06 - Comment: HPC/Memory Optimization + Compression: persistent 4-bit KV-cache with direct restoration eliminates re-prefill, enabling multi-agent edge inference; up to 136x TTFB reduction and 4x memory density.

  15. Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection - Score: 18 (R=10, N=8) - Date: 2026-03-06 - Comment: Model Compression and Efficiency — KV cache/memory optimization via low-dimensional queries/keys and SVD compression; theoretical log(N) selection dimension; 75% key cache savings.

  16. One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache - Score: 18 (R=10, N=8) - Date: 2026-03-06 - Comment: Model Compression and Efficiency: token-wise adaptive low-rank KV-cache compression with dynamic per-token rate allocation (post-training), orthogonal to pruning.

  17. Dissecting Quantization Error: A Concentration-Alignment Perspective - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Provides a principled SQNR-based theory of quantization error (concentration+alignment) and introduces CAT transforms—model compression/quantization.

  18. Spectral Surgery: Training-Free Refinement of LoRA via Gradient-Guided Singular Value Reweighting - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Model Compression and Efficiency: low-rank LoRA refinement via SVD-based singular value reweighting; training-free parameter editing.

  19. ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Matches Model Architecture and Efficiency: tokenizer-free hierarchical byte-level LM with compression-driven segmentation and Top-K selection for a static compute graph.

  20. Multi-Head Low-Rank Attention - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Compression/Efficiency and HPC: low-rank attention with partitionable latent heads enabling TP-friendly decoding and reduced KV cache I/O.

  21. 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: introduces a 3-block ADMM for sparse+low-rank LLM decomposition and transformer-level matching refinement with convergence guarantees.

  22. Attn-QAT: 4-Bit Attention With Quantization-Aware Training - Score: 18 (R=10, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency—4-bit quantization-aware training for attention (FP4) with stable backward recomputation and fused kernels.

  23. Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: low-rank approximation of optimizer states to cut memory while maintaining performance in LLM training.

  24. GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: zero-finetuning post-hoc blockwise compensation via Gram-matrix linear reconstruction to restore compressed network behavior.

  25. Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition - Score: 18 (R=9, N=9) - Date: 2026-03-23 - Comment: Compression theory for matrix multiplication: derives optimal scalar quantization densities and phase transitions with closed-form analysis.

  26. ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning - Score: 17 (R=10, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: improves one-shot LLM pruning (SparseGPT) via loss-driven two-level reordering of columns/blocks to reduce pruning error.

  27. TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Model compression/efficiency: proposes on-the-fly activation-aware test-time quantization that adapts per prompt without retraining.

  28. Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression - Score: 17 (R=9, N=8) - Date: 2026-03-20 - Comment: Model compression and efficiency: provides theory and experiments on compression order in joint pruning–quantization, including the Progressive Intensity Hypothesis.

  29. ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: High-performance systems: hardware-aware lossless compression with fused decompression-GEMM for faster, memory-efficient LLM inference on GPUs.

  30. High-Dimensional Gaussian Mean Estimation under Realizable Contamination - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: SQ lower bounds and matching tradeoffs for Gaussian mean estimation under realizable contamination.

  31. Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Analyzes low-precision optimizer-state dynamics in LLM pretraining, explaining EMA staleness and deriving theory-guided reset schedules for memory-efficient training.

  32. High-dimensional estimation with missing data: Statistical and computational limits - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Statistical-computational limits for high-dimensional estimation with missing data, including information-computation gaps.

  33. BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Quantization method tailored to MXFP4 with block-wise affine transforms and Kronecker-efficient parameterization.

  34. Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Compression framework allocates pruning budgets using SAE-derived capability density, linking interpretability to component-level compression sensitivity.

  35. MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Compute-optimal diffusion language modeling via binary subtoken encoding, index shuffling, and scaling-law analysis.

  36. Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Introduces adaptive latent-space reasoning with dynamic halting, a core architectural efficiency idea for implicit reasoning in LLMs.

  37. Spiking Layer-Adaptive Magnitude-based Pruning - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Introduces a theory-guided pruning framework for temporal SNNs with time-aware layer importance and distortion-constrained sparsity allocation.

  38. Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Provides theory for dataset distillation showing efficient encoding of low-dimensional task structure under gradient-based training of neural networks.

  39. FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Inference efficiency: training-free retrieval-style replacement for the LM output head that reduces classification-head compute.

  40. ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Model efficiency: training-free LVLM token pruning that corrects attention shift and merges redundant tokens while remaining KV-cache compatible.

  41. Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Inference-time KV-cache memory management architecture with selective forgetting/compression and theoretical interference reduction.

  42. Enhancing LLM Training via Spectral Clipping - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Spectral clipping is a general optimizer-side efficiency/stability method for LLM training with theory and scalable Newton-Schulz implementation.

  43. GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Memory-efficient architecture: writes long context into compact prefix memory via test-time gradient descent instead of large KV caches.

  44. Effective Sparsity: A Unified Framework via Normalized Entropy and the Effective Number of Nonzeros - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Defines effective sparsity via normalized-entropy regularizers with RIP-based recovery guarantees, offering a new theoretical sparsity framework.

  45. When Drafts Evolve: Speculative Decoding Meets Online Learning - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Inference efficiency: speculative decoding cast as online learning, with regret-based algorithms that adapt draft models from verification feedback.

  46. GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Compression methodology: structured LLM pruning guided by cross-distribution neuron sensitivity to improve post-pruning generalization.

  47. Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Model compression and dynamic networks: unified utility metric for structural pruning and routing based on alternating gradient flow.

  48. HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Compression methodology: end-to-end multi-granular stochastic auto-pruning for ViTs across heads, FFNs, and intra-block dimensions.

  49. IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Attention efficiency: cross-layer reuse of sparse attention top-k indices cuts indexer cost with training-free and training-aware configurations.

  50. Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Inference efficiency for transformers: training-free decoding acceleration using stable within-sentence attention support and sparse memory refresh.

  51. LongFlow: Efficient KV Cache Compression for Reasoning M - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Inference efficiency: KV-cache compression for long-output reasoning models with negligible-overhead importance estimation and fused custom kernel.

  52. A New Tensor Network: Tubal Tensor Train and Its Applications - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Model Compression/Low-Rank: introduces the Tubal Tensor Train (TTT) tensor network with TTT-SVD/ATCU algorithms and error bounds.

  53. ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Efficiency: training-free early-skipping for diffusion LLMs using intermediate tensor variation/confidence to skip token compute, yielding substantial inference speedups.

  54. Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Model Architecture + HPC: introduces a TTC layer performing finite-horizon LQR planning within neural networks and a fused CUDA solver for hardware-efficient inference-time control.

  55. Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: Efficiency/HPC: memory-efficient optimization via mask traversal with improved nonconvex convergence ($O(\epsilon^{-3})$).

  56. Preserving Continuous Symmetry in Discrete Spaces: Geometric-Aware Quantization for SO(3)-Equivariant GNNs - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Compression/Efficiency—geometric-aware low-bit quantization for SO(3)-equivariant GNNs that preserves symmetry via magnitude-direction decoupling and symmetry-aware training.

  57. $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Inference-time optimization—introduces differentiable test-time gradient descent over token logits to refine LLM decoding; theoretical link to KL-regularized RL.

  58. Stacked from One: Multi-Scale Self-Injection for Context Window Extension - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Model Architecture and Efficiency — two stacked short-context LLMs with multi-grained compression and self-injection for long-context extension, reducing memory and accelerating inference.

  59. NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training - Score: 17 (R=9, N=8) - Date: 2026-03-05 - Comment: Compression/Efficiency + Training Dynamics: optimizer with nuclear-norm-constrained updates to induce low-rank weight structure for better LLM compressibility.

  60. Never Saddle for Reparameterized Steepest Descent as Mirror Flow - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Training Dynamics and Optimization Geometry — introduces steepest mirror flows explaining implicit bias, sparsity, and saddle escape (insights into Adam/AdamW vs. SGD).

  61. FreeAct: Freeing Activations for LLM Quantization - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: dynamic activation-side transformations (beyond one-to-one orthogonal mappings) for improved LLM quantization.

  62. Scalable Multi-Task Low-Rank Model Adaptation - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Strongly matches Model Compression/Efficiency (low-rank): scalable multi-task LoRA with spectral-aware regularization, block-level adaptation, and fine-grained routing.

  63. A Decomposition Framework for Certifiably Optimal Orthogonal Sparse PCA - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Strongly matches Sparsity/Representation Learning: certifiably optimal orthogonal Sparse PCA with BnB acceleration and block decomposition.

  64. Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Compression and Efficiency — replaces MHA with Multi-Head Latent Attention in Whisper decoder to shrink KV cache by up to 87.5% with minimal fine-tuning.

  65. Weight Updates as Activation Shifts: A Principled Framework for Steering - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: establishes equivalence between activation steering and weight updates and introduces a parameter-efficient joint adaptation method.

  66. Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Strongly matches Model Compression/Efficiency: training-free KV cache compression for VLM-based GUI agents with saliency/trajectory-aware scoring.

  67. Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: structured pruning viewed as search over causal abstractions with closed-form interventional risk criteria (sparsity/pruning).

  68. Computation-Utility-Privacy Tradeoffs in Bayesian Estimation - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational theory for differentially private Bayesian estimation, giving efficient near-Bayes-optimal algorithms and computational-statistical lower bounds.

  69. Massive Redundancy in Gradient Transport Enables Sparse Online Learning - Score: 17 (R=8, N=9) - Date: 2026-03-17 - Comment: Shows strong redundancy in online gradient transport and proposes sparse propagation schemes that retain most adaptation ability, a foundational efficiency result for recurrent and transformer training dynamics.

  70. Pruning-induced phases in fully-connected neural networks: the eumentia, the dementia, and the amentia - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Theory of compression dynamics: identifies pruning-induced phase transitions in fully connected networks with statistical-mechanics analysis.

  71. Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2026-03-17 - Comment: Model compression: asymmetric text-visual pruning for LVLMs based on modality-specific sensitivity analysis and adaptive token calibration.

  72. SVD Contextual Sparsity Predictors for Fast LLM Inference - Score: 16 (R=9, N=7) - Date: 2026-03-16 - Comment: Uses training-free SVD-based contextual sparsity predictors for conditional FFN execution, directly targeting fast LLM inference.

  73. MXNorm: Reusing MXFP block scales for efficient tensor normalisation - Score: 16 (R=9, N=7) - Date: 2026-03-14 - Comment: Model efficiency: normalization redesign that reuses MXFP block scales to cut reduction cost and speed low-precision transformer training.

  74. ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training - Score: 16 (R=9, N=7) - Date: 2026-03-14 - Comment: Optimization method for efficient sparse training: zero-order SAM cuts backprop cost while stabilizing high-sparsity learning.

  75. Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE - Score: 16 (R=9, N=7) - Date: 2026-03-13 - Comment: Transformer efficiency: analyzes partial RoPE as a core positional-encoding design that preserves convergence while greatly reducing cache memory.

  76. GAST: Gradient-aligned Sparse Tuning of Large Language Models with Data-layer Selection - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Model Compression/Efficiency: gradient-aligned sparse tuning with joint layer selection and data selection in a unified optimization for PEFT.

  77. Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: High-performance inference efficiency: KV cache compression with Compressed PagedAttention and scheduling for high-concurrency LLM inference.

  78. ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Model Compression and Efficiency: adaptive KV-cache management with dynamic precision allocation, quantization, and eviction based on per-layer attention statistics for long-context inference.

  79. Stem: Rethinking Causal Information Flow in Sparse Attention - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: proposes position-dependent sparse attention (Token Position-Decay) with an output-aware token metric to reduce prefill compute in causal Transformers.

  80. FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: proposes dynamic sparse attention (instantaneous pattern discovery + thresholding) to accelerate long-context prefilling.

  81. Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: adaptive visual token pruning based on singular value spectrum (low-rank/spectral energy) for compute-efficient VLM inference.

  82. POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation - Score: 16 (R=9, N=7) - Date: 2026-03-06 - Comment: High Performance Computing/Efficiency: scalable orthogonal-equivalence reparameterization (POET-X) that reduces memory and compute for LLM training while preserving stability.

  83. InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context - Score: 16 (R=9, N=7) - Date: 2026-03-06 - Comment: Model Efficiency: information-flow-guided selective KV recomputation and RoPE-consistent chunk reordering for long-context inference.

  84. EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs - Score: 16 (R=9, N=7) - Date: 2026-03-05 - Comment: Model Compression/Efficiency: early-stage visual token pruning inside the encoder (layer-wise, similarity/diversity/attention-guided) for MLLMs.

  85. SageBwd: A Trainable Low-bit Attention - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: Model Compression and Efficiency: quantization of attention (INT8) during training with stability analysis (QK-norm, K-/Q-smoothing) and identification of backward-pass gradient as primary error source.

  86. Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: Model Compression and Efficiency/HPC: applies low-bit quantization specifically to speculative verification to overcome memory bandwidth limits, improving end-to-end throughput.

  87. LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding - Score: 16 (R=9, N=7) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: new training objective directly optimizing acceptance rate in speculative decoding for faster inference.

  88. Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Model compression/efficiency: introduces a new distillation objective for discrete diffusion models using discrete MMD, tackling a known methodological gap in fast sampling.

  89. Minimax Generalized Cross-Entropy - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Proposes a new convex minimax formulation of generalized cross-entropy with theoretical error bounds and efficient bilevel optimization via implicit differentiation.

  90. Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Model compression/efficiency for LLM RL via layerwise representation perturbations that stabilize off-policy updates by controlling heavy-tailed importance ratios.

  91. Computational and Statistical Hardness of Calibration Distance - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Theoretical hardness and approximation results for calibration distance, a foundational learning-theoretic problem.

  92. RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Mixed-precision quantization via RL for per-layer bit allocation with zero-shot transfer across LLM families.

  93. How do LLMs Compute Verbal Confidence - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Mechanistic representation analysis of how LLMs compute and cache verbal confidence beyond token log-probabilities.

  94. Flow Matching Policy with Entropy Regularization - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core algorithmic innovation for generative policies: flow-matching policy optimization with a tractable entropy regularizer and much cheaper training than diffusion policies.

  95. rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Theoretical robust-learning framework replacing cross-entropy with minimum-divergence estimation, with consistency and robustness guarantees.

  96. Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Foundational analysis of vector quantization collapse mechanisms, identifying token/embedding collapse causes and proposing diversity-preserving fixes.

  97. SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Scalable gradient inversion for transformers via sparse recovery using head-wise geometric structure and subspace-guided OMP.

  98. Online Semi-infinite Linear Programming: Efficient Algorithms via Function Approximation - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Online semi-infinite LP with function approximation giving regret bounds independent of the number of constraints.

  99. Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Proposes an information-density-driven masking and noise scheduling paradigm for training diffusion LLMs.

  100. More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Theoretical analysis of beam search overestimation bias with explicit critical-width scaling laws for LLM inference.

  101. SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Uses sparse transcoders to identify knowledge circuits and perform sparse neuron-level interventions for lifelong knowledge editing, targeting representation-level structure rather than dense black-box updates.

  102. Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Theoretical identifiability for robust prediction under latent shift, replacing completeness with a weaker cross-domain rank condition.

  103. Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Improves VLM efficiency with a spatial-on-demand architecture that retrieves high-resolution crops only when needed, reducing unnecessary visual compute.

  104. On the (Generative) Linear Sketching Problem - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Studies linear sketch recovery through generative priors and proposes a training-without-ground-truth framework for efficient sketch inversion.

  105. ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order Optimization - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Efficiency-focused test-time adaptation with zeroth-order optimization, enabling forward-only adaptation for high-dimensional and quantized models.

  106. Interleaved Resampling and Refitting: Data and Compute-Efficient Evaluation of Black-Box Predictors - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Develops a black-box, data- and compute-efficient procedure for excess-risk evaluation with high-probability guarantees via interleaved resampling/refitting.

  107. TMPDiff: Temporal Mixed-Precision for Diffusion Models - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Model compression and efficiency: introduces timestep-wise mixed-precision quantization for diffusion inference with a principled search algorithm over temporal precision allocation.

  108. PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Protocol-aware tokenization for network traces defines a modality-native foundation model design that greatly improves efficiency over generic tokenization.

  109. Probabilistic Gaussian Homotopy: A Probability-Space Continuation Framework for Nonconvex Optimization - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Foundational nonconvex optimization method: probability-space homotopy with Boltzmann-weighted gradient aggregation and a derived annealed minimizer dynamics.

  110. Upper Bounds for Local Learning Coefficients of Three-Layer Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Theory: derives upper bounds for local learning coefficients at singular points in three-layer neural networks.

  111. One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Model efficiency via one-step self-distillation for diffusion/flow visuomotor policies, reducing iterative sampling cost by 100x.

  112. A Quantitative Characterization of Forgetting in Post-Training - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Theoretical analysis of forgetting in post-training, deriving objective-dependent conditions for mass forgetting and component drift.

  113. Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Provides theory for prior-data fitted networks, proving inconsistency and proposing a calibrated posterior correction with Bernstein-von Mises guarantees.

  114. Truth as a Compression Artifact in Language Model Training - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Representation-learning insight: argues truth preference emerges from compression structure, supported by controlled transformer training studies.

  115. On-Policy Self-Distillation for Reasoning Compression - Score: 16 (R=8, N=8) - Date: 2026-03-06 - Comment: Model Compression and Efficiency: on-policy self-distillation to compress chain-of-thought reasoning tokens while maintaining/improving accuracy.

  116. Accelerating Single-Pass SGD for Generalized Linear Prediction - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Matches Algorithmic Efficiency/HPC: first momentum-accelerated single-pass SGD for GLMs with sharp excess risk bounds in streaming.

  117. GPU-friendly and Linearly Convergent First-order Methods for Certifying Optimal $k$-sparse GLMs - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Model Compression/Efficiency and HPC: GPU-friendly, linearly convergent proximal framework for certifying optimal k-sparse GLMs with specialized perspective-prox operators and duality-gap restarts.

  118. Growing Networks with Autonomous Pruning - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Model compression/efficiency: studies dynamically growing networks with autonomous pruning during training to reach sparse architectures.

  119. Warm-Start Flow Matching for Guaranteed Fast Text/Image Generation - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Generative-model efficiency: warm-start flow matching cuts sampling steps with a formal guaranteed speed-up mechanism.

  120. Spectral Tempering for Embedding Compression in Dense Passage Retrieval - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Presents a learning-free eigenspectrum-based method for adaptive embedding compression, directly addressing model efficiency via spectral analysis.

  121. Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Core tokenizer methodology: replaces frequency-based BPE merging with a statistically grounded significance-gain criterion and evaluates effects on Transformer LM efficiency.

  122. UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Inference efficiency method: adaptive KV-cache/context allocation driven by token-level uncertainty for long-context decoding.

  123. Unified Spatio-Temporal Token Scoring for Efficient Video VLMs - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Unified token pruning across both ViT and LLM with learned spatio-temporal scoring for video VLM efficiency.

  124. Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Compression method: calibration-free mixed-precision quantization driven by dual numerical and structural layer sensitivity.

  125. KANtize: Exploring Low-bit Quantization of Kolmogorov-Arnold Networks for Efficient Inference - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Low-bit quantization of KANs using quantized spline tables for major inference-efficiency gains.

  126. Implementation of tangent linear and adjoint models for neural networks based on a compiler library tool - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Compiler/runtime tool for integrating neural networks with numerical models, including tangent linear and adjoint support for efficient heterogeneous execution.

  127. Efficient Reasoning on the Edge - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Systems-level efficiency for on-device reasoning via dynamic adapter switching, KV-cache sharing, and budget-forced reasoning compression.

  128. SF-Mamba: Rethinking State Space Model for Vision - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: State-space vision architecture redesign with patch swapping and batch folding for higher GPU-parallel efficiency.

  129. MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Model compression and efficiency: latency-guided hardware-in-the-loop architecture search for on-device LLM design under deployment constraints.

  130. Effective Distillation to Hybrid xLSTM Architectures - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Model compression and efficiency: distillation pipeline from transformer teachers into sub-quadratic hybrid xLSTM students for efficient inference.

  131. Controlled Langevin Dynamics for Sampling of Feedforward Neural Networks Trained with Minibatches - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Introduces controlled minibatch pseudo-Langevin dynamics for scalable Boltzmann sampling of neural-network parameters, addressing a core training/sampling methodology issue.

  132. PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Efficiency-focused methodology: zero-shot NAS jointly searching architecture, pruning, and quantization for constrained deployment.

  133. SimCert: Probabilistic Certification for Behavioral Similarity in Deep Neural Network Compression - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Probabilistic certification framework for preserving behavior under pruning and quantization in compressed networks.

  134. DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Training-free multimodal token compression using dynamic audio-driven semantic chunking for efficient long-context omnimodal inference.

  135. SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Combines spiking computation, quantization-aware training, and adaptive early exits into a unified efficient inference architecture.

  136. High-Fidelity Compression of Seismic Velocity Models via SIREN Auto-Decoders - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: SIREN auto-decoder for high-fidelity neural compression is a direct model compression/autoencoder-style representation contribution.

  137. Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Proposes an inference-time early-exit mechanism for reasoning models based on monitoring high-entropy path deviation as a signal of overthinking.

  138. True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Quantization method: true 4-bit training on commodity CPUs with soft weight clipping and dynamic scaling reaching near full-precision parity.

  139. IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Adaptive LoRA rank allocation using integrated gradients with a theoretical quadrature error bound targets compression/efficiency at the method level.

  140. TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Inference-efficiency method for reasoning models: learns optimal early-exit points to cut Chain-of-Thought compute.

  141. Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Explicit kernel-basis construction for equivariant CNNs that avoids Clebsch-Gordan coefficients and generalizes across symmetry groups.

  142. Efficient Reasoning with Balanced Thinking - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Efficiency for transformers/LRMs: training-free hidden-state steering to adapt reasoning compute between overthinking and underthinking.

  143. BiGain: Unified Token Compression for Joint Generation and Classification - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Training-free token compression for diffusion backbones using frequency-aware merging/downsampling, directly addressing efficient model computation.

  144. Quantization Robustness of Monotone Operator Equilibrium Networks - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Model Compression/Efficiency: Provable quantization robustness for monotone operator equilibrium networks; links precision, perturbation, and convergence.

  145. On Catastrophic Forgetting in Low-Rank Decomposition-Based Parameter-Efficient Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Model Compression/Efficiency: analyzes catastrophic forgetting in low-rank PEFT (e.g., LoRA, tensor decompositions) via update subspace geometry; guidance for efficient continual adaptation.

  146. Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Model Efficiency: parameter- and data-efficient adaptation of draft models for speculative decoding using a decoupled shared/private architecture and targeted data regeneration/selection.

  147. Evolving Prompt Adaptation for Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Compression/Efficiency: parameter-efficient adaptation with low-rank updates decoupled into direction/magnitude to preserve pretraining knowledge; adds feature geometric regularization.

  148. DendroNN: Dendrocentric Neural Networks for Energy-Efficient Classification of Event-Based Data - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Model Architecture and Efficiency: dendrite-inspired DendroNN with event-driven routing, dynamic/static sparsity and intrinsic quantization; includes asynchronous hardware design for low-power spatiotemporal processing.

  149. HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Model Compression and Efficiency: hierarchical, preference-conditioned structured pruning with VLM-aware sensitivity signals and plan-level GRPO.

  150. Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Compression/sparsity: pruning to extract bias-invariant subnetworks from vanilla models without retraining.

  151. MCEL: Margin-Based Cross-Entropy Loss for Error-Tolerant Quantized Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Compression/Efficiency: introduces a margin-based cross-entropy loss to improve robustness of quantized NNs to bit-flip errors without error-aware training.

  152. Rethinking Representativeness and Diversity in Dynamic Data Selection - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning and Efficiency: dynamic data selection using sparse autoencoder factors for representativeness and process-level diversity, yielding >2× training speedups.

  153. Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning: training dynamics in deep linear networks; depth-induced coupling promotes low-rank implicit bias and mitigates plasticity loss.

  154. Nonconvex Latent Optimally Partitioned Block-Sparse Recovery via Log-Sum and Minimax Concave Penalties - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Sparsity/Compression: nonconvex block-sparse recovery with unknown partitions using log-sum and MCP penalties with ADMM optimization.

  155. MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Compression/Efficiency: adaptive LoRA rank search via dual scaling laws to align modality-specific convergence and maximize MLLM fine-tuning performance.

  156. Polynomial Surrogate Training for Differentiable Ternary Logic Gate Networks - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Architecture/Efficiency: polynomial surrogate training for differentiable ternary logic-gate networks with bounded hardening gap and large parameter reduction.

  157. Stateful Token Reduction for Long-Video Hybrid VLMs - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Model Compression/Efficiency: query-conditioned token reduction for hybrid attention–Mamba VLMs with progressive scheduling and unified scoring.

  158. Task-Centric Acceleration of Small-Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Model Compression and Efficiency: task-adaptive sequence compression via tokenizer expansion (TASC-ft) and training-free n-gram speculative decoding (TASC-spec) to accelerate SLM inference.

  159. KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches High Performance Computing/Efficiency criterion: KV-cache-centric memory management (construction, recomputation, balanced loading) to reduce LLM inference latency.

High Performance Computing (78)

  1. ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context - Score: 20.0 (R=0, N=0) - Date: 2026-03-03 - Comment: Author match

  2. The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference - Score: 19 (R=10, N=9) - Date: 2026-03-23 - Comment: Transformer systems insight showing KV cache is exactly reconstructible from residual streams, yielding a new bounded-memory inference scheme.

  3. Deep learning and the rate of approximation by flows - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: Gives a theoretical characterization of deep residual network approximation via geodesic distance on a sub-Finsler manifold of diffeomorphisms.

  4. Why Are Linear RNNs More Parallelizable? - Score: 19 (R=10, N=9) - Date: 2026-03-05 - Comment: Strong match to Model Architecture and High-Performance Computing theory by characterizing LRNNs’ parallelizability via complexity classes and expressivity trade-offs.

  5. NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL - Score: 18 (R=10, N=8) - Date: 2026-03-14 - Comment: High-performance computing for MoE: unified NCCL expert-parallel dispatch/combine API with topology-aware low-latency and high-throughput modes.

  6. MoEless: Efficient MoE LLM Serving via Serverless Computing - Score: 18 (R=10, N=8) - Date: 2026-03-09 - Comment: High Performance Computing / MoE Systems: serverless MoE serving with expert load prediction and elastic scaling/placement to reduce latency/cost.

  7. The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $\lambda$-Calculus - Score: 18 (R=9, N=9) - Date: 2026-03-23 - Comment: Model architecture/systems: replaces free-form recursive control with a typed λ-calculus runtime for long-context reasoning, with formal guarantees on termination and cost.

  8. Rigorous Asymptotics for First-Order Algorithms Through the Dynamical Cavity Method - Score: 18 (R=9, N=9) - Date: 2026-03-16 - Comment: Provides a rigorous formalization of the dynamical cavity method for first-order algorithms, yielding asymptotic theory for optimization dynamics.

  9. Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems - Score: 18 (R=9, N=9) - Date: 2026-03-14 - Comment: Theoretical reinterpretation of diffusion models as partitioned iterated function systems, yielding computable geometric design criteria for schedules and objectives.

  10. SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits - Score: 17 (R=9, N=8) - Date: 2026-03-20 - Comment: Systems-level benchmark for GPU kernel optimization with analytically derived speed-of-light hardware bounds, directly matching HPC methodology.

  11. Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Develops parallel Newton and quasi-Newton methods to remove sequential bottlenecks in dynamical systems, with convergence theory tied to Lyapunov stability.

  12. An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU - Score: 17 (R=9, N=8) - Date: 2026-03-18 - Comment: Single-GPU fine-tuning system with heterogeneous memory management, asynchronous CPU/GPU overlap, and kernel co-design.

  13. Determinism in the Undetermined: Deterministic Output in Charge-Conserving Continuous-Time Neuromorphic Systems with Temporal Stochasticity - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Provides a theoretical foundation for charge-conserving continuous-time SNNs, proving spike-timing-invariant computation and exact correspondence to quantized ANNs.

  14. FlashSampling: Fast and Memory-Efficient Exact Sampling - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Presents an exact systems-level decoding primitive that fuses categorical sampling into the LM-head matmul to eliminate logits materialization and reduce memory traffic.

  15. Grokking as a Variance-Limited Phase Transition: Spectral Gating and the Epsilon-Stability Threshold - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Training-dynamics theory of grokking as a variance-limited phase transition governed by optimizer-induced spectral gating.

  16. High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Establishes the first uniform-in-time high-probability SGD bounds under PL with Markovian noise, a foundational optimization theory result.

  17. State-space models through the lens of ensemble control - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Provides a control-theoretic foundation for state-space models by casting training as an ensemble optimal control problem and deriving PMP-based optimality conditions.

  18. AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Systems co-design for dynamic sparse models: token-level pre-gating and fused kernels to make dynamic LoRA/MoE-style adapter inference efficient.

  19. Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Optimization/Systems: new optimizer combining spectral constraints with Shampoo-style preconditioning for faster, stable training.

  20. The Missing Memory Hierarchy: Demand Paging for LLM Context Windows - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Systems/Memory Optimization: introduces demand paging and multi-level memory hierarchy for LLM context windows, directly addressing context efficiency.

  21. A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: HPC/systems: FPGA accelerator and memory optimization for linear attention decode by keeping recurrent state on-chip.

  22. SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training - Score: 17 (R=9, N=8) - Date: 2026-03-05 - Comment: Matches High Performance Computing/Distributed Training: integrity verification for pipeline parallel training with convergence guarantees in untrusted settings.

  23. Hyperagents - Score: 17 (R=8, N=9) - Date: 2026-03-23 - Comment: Proposes a self-referential architecture where the meta-level modification mechanism is itself editable, a foundational systems design for open-ended self-improvement.

  24. Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational identifiability theory for recovering latent actions and dynamics from offline trajectories using demonstrator diversity.

  25. Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational theory for generative causal models using sheaf/cohomology and an O(1)-memory reverse-mode differentiation bridge via Sinkhorn-IFT-VJP.

  26. NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference - Score: 17 (R=8, N=9) - Date: 2026-03-18 - Comment: Systems/methodology contribution for verifiable transformer inference via layerwise zero-knowledge proofs with constant-size per-layer proofs.

  27. Sinkhorn-Drifting Generative Models - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Generative modeling theory: links drifting dynamics to Sinkhorn-divergence gradient flows and resolves equilibrium identifiability.

  28. Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Reintroduces explicit Markov states into LLM RL post-training with theoretical sample-complexity guarantees, directly targeting foundational training dynamics.

  29. Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Theoretical framework for two-time-scale population dynamics of neural network training, linking population methods to replicator-mutator and bilevel optimization.

  30. Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Finite-time theory for stochastic approximation under heavy-tailed and long-range dependent noise, extending core optimization analysis beyond classical assumptions.

  31. Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Systems-level method for verifiable large-model inference using lightweight sampling-based proofs with execution-trace commitments.

  32. Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Systems-level training architecture proposing depth-independent memory scaling near 2x inference footprint with exact gradient accumulation.

  33. VideoAtlas: Navigating Long-Form Video in Logarithmic Compute - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Systems-level hierarchical video representation enabling logarithmic-compute navigation and cache reuse for long-context multimodal models.

  34. RHYME-XT: A Neural Operator for Spatiotemporal Control Systems - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Neural-operator architecture for spatiotemporal control systems combining learned Galerkin projection with direct flow-map learning.

  35. ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Truncated backpropagation for recurrent video diffusion decoding with constant-memory training and theory.

  36. Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Distribution-free dual-form uncertainty bounds for multi-output kernel regression, with a GP-compatible structure that is directly usable in downstream optimization.

  37. Trained Persistent Memory for Frozen Encoder--Decoder LLMs: Six Architectural Methods - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Proposes six architectural methods for differentiable persistent latent memory in frozen encoder-decoder LLMs.

  38. Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Workflow-aware LLM serving system that introduces cross-call caching and cache-aware scheduling for agentic workloads.

  39. Parallelised Differentiable Straightest Geodesics for 3D Meshes - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Provides differentiable and parallel straightest-geodesic operators for meshes, enabling new geometry-aware learning primitives.

  40. Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Proposes a foundational cognitive architecture for autonomous learning with observation, action, and meta-control systems.

  41. Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Systems-level synchronization method for multi-agent LLMs by adapting MESI-style cache coherence to artifact sharing.

  42. LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Recasts the LLM itself as the graph message-passing operator on text-rich graphs, changing the core aggregation mechanism.

  43. Fold-CP: A Context Parallelism Framework for Biomolecular Modeling - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: High-performance computing contribution: context parallelism with custom primitives for scaling biomolecular model attention and triangular updates across GPUs.

  44. Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Reframes diffusion sampling as graph-theoretic planning with a low-dimensional state proxy to allocate compute adaptively during generation.

  45. SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Systems/theory for agent memory: derives retrieval and lifecycle mechanisms from information geometry and sheaf cohomology rather than heuristics.

  46. Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Combines causal representation learning with sparse attention and proves identifiability conditions for disentangled system representations.

  47. Convergence of Two Time-Scale Stochastic Approximation: A Martingale Approach - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Derives new almost-sure convergence and rate results for two time-scale stochastic approximation under broader noise conditions.

  48. OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Introduces unified KV-cache management across tasks and time for VLA transformers, a systems-level inference innovation for multi-task parallelism.

  49. Structure-Dependent Regret and Constraint Violation Bounds for Online Convex Optimization with Time-Varying Constraints - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Derives structure-dependent regret and constraint-violation bounds for online convex optimization with time-varying constraints, adapting updates to regularity in constraint drift.

  50. AEX: Non-Intrusive Multi-Hop Attestation and Provenance for LLM APIs - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Systems-level protocol for signed attestation and provenance at the LLM API boundary, addressing verification of request-output relations.

  51. The Institutional Scaling Law: Non-Monotonic Fitness, Capability-Trust Divergence, and Symbiogenetic Scaling in Generative AI - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Theoretical scaling-law work on non-monotonic model/system scaling and orchestration of domain-specific models.

  52. SRAM-Based Compute-in-Memory Accelerator for Linear-decay Spiking Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Algorithm-hardware co-design for compute-in-memory SNNs that removes the state-update bottleneck via in-memory parallel decay.

  53. Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Inference systems contribution: proves modality-boundary partitioning minimizes transfer under KV caching and enables cost-efficient cross-tier heterogeneous serving.

  54. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Systems-level GPU optimization: evolutionary MAP-Elites framework for hardware-aware kernel search and prompt co-evolution.

  55. Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Distributed systems contribution: disaggregated serving architecture for any-to-any multimodal models with flexible computation-graph execution.

  56. AutoScout: Structured Optimization for Automating ML System Configuration - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Systems-level optimizer for ML configuration spaces with hierarchical mixed discrete/continuous decisions and multi-fidelity profiling.

  57. Large Spikes in Stochastic Gradient Descent: A Large-Deviations View - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Training Dynamics: Large-deviations theory for SGD catapult spikes with explicit kernel/learning-rate criterion.

  58. Riemannian Optimization in Modular Systems - Score: 16 (R=8, N=8) - Date: 2026-03-05 - Comment: Optimization/training dynamics: proposes layerwise Riemannian metrics and composable modules with contraction guarantees, a principled approach for neural architectures.

  59. D-Mem: A Dual-Process Memory System for LLM Agents - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Systems-level memory architecture for LLM agents with dynamic quality gating between retrieval and full-deliberation modes.

  60. SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Production-oriented training framework for speculative decoding with hybrid parallelism and optimized kernels, matching large-model systems work.

  61. Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Systems-level method for distributed large-batch training that jointly optimizes batch size for time, cost, and generalization.

  62. Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory - Score: 15 (R=8, N=7) - Date: 2026-03-19 - Comment: Systems-level memory architecture replacing in-context storage with hash-addressed knowledge objects for persistent O(1) retrieval.

  63. 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Systems-level efficiency method using lightweight proxy models to approximate expensive LLM-backed SQL operators at large scale.

  64. Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Develops Byzantine-robust distributed optimization with compressed communication using double momentum and variance reduction, directly targeting scalable training methodology.

  65. MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Training-system modeling for heterogeneous accelerators, including activation checkpointing and layer-fusion co-design.

  66. Align Forward, Adapt Backward: Closing the Discretization Gap in Logic Gate Networks - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Analyzes training-inference discretization gaps in hard vs. soft component selection and proposes a new gradient estimator for aligned conditional computation.

  67. Exploiting temporal parallelism for LSTM Autoencoder acceleration on FPGA - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Presents a systems-level FPGA dataflow design that exploits temporal parallelism across timesteps and layers for efficient LSTM autoencoder inference.

  68. Orla: A Library for Serving LLM-Based Multi-Agent Systems - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Introduces systems-level mechanisms for multi-agent LLM serving, especially workflow orchestration and KV-cache management across workflow boundaries.

  69. Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Training-free parallel decoding for diffusion LLMs using self-attention-induced dependency graphs and independent-set selection.

  70. Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Distributed optimization theory: Byzantine-robust training under generalized $(L_0, L_1)$-smoothness with convergence guarantees.

  71. TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Systems analysis methodology: decomposes LLM inference host-side overhead into actionable components and characterizes host-device boundedness.

  72. SpectralGuard: Detecting Memory Collapse Attacks in State Space Models - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Systems/theory for state-space models: spectral-radius analysis of memory collapse with a real-time architectural monitor.

  73. Multi-DNN Inference of Sparse Models on Edge SoCs - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Systems/Efficiency: model stitching recombines subgraphs from sparse models for multi-DNN inference on edge SoCs without retraining, improving throughput and memory use.

  74. FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: High Performance Computing/Systems: introduces flexible resource isolation (Flex-Mem/Flex-NPU), LLM-aware memory management, and a secure inference pipeline for on-device LLM serving.

  75. Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Representation Learning/Optimization theory: adversarially-aligned Jacobian regularization controls sensitivity along adversarial directions, improving minimax stability with less expressivity loss.

  76. Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Model Efficiency and memory optimization: introduces indexed external memory with RL-optimized read/write under context budgets, plus theoretical bounds on in-context computation.

  77. stratum: A System Infrastructure for Massive Agent-Centric ML Workloads - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: High Performance Computing: unified system infrastructure compiling and executing large batches of agent-generated ML pipelines efficiently.

  78. Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: High Performance Computing: adaptive prefetching to reduce communication in distributed GNN training using an LLM-based controller.

Representation Learning (134)

  1. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels - Score: 20.0 (R=0, N=0) - Date: 2026-03-23 - Comment: Author match

  2. Statistical and structural identifiability in representation learning - Score: 19 (R=10, N=9) - Date: 2026-03-13 - Comment: Representation learning theory: formalizes statistical vs structural identifiability and proves near-identifiability beyond last-layer representations.

  3. Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning - Score: 18 (R=10, N=8) - Date: 2026-03-05 - Comment: Matches Representation Learning/Theory: directional neural collapse (decision-axis variance) explains few-shot transfer with sharp bounds and multitask geometry.

  4. InfoNCE Induces Gaussian Distribution - Score: 18 (R=10, N=8) - Date: 2026-03-02 - Comment: Representation Learning: theoretical analysis showing InfoNCE induces Gaussian structure in learned features.

  5. Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients - Score: 18 (R=9, N=9) - Date: 2026-03-16 - Comment: Representation learning: unsupervised sparse dictionary decomposition of per-document training gradients to discover interpretable behavior atoms and steering directions.

  6. A theory of learning data statistics in diffusion models, from easy to hard - Score: 18 (R=9, N=9) - Date: 2026-03-14 - Comment: Theory for representation learning in diffusion models: proves easy-to-hard learning of low- vs high-order data statistics via a diffusion information exponent.

  7. Solving adversarial examples requires solving exponential misalignment - Score: 18 (R=9, N=9) - Date: 2026-03-05 - Comment: Representation Learning/Theory: introduces perceptual manifold dimensionality as a geometric account of adversarial vulnerability and robustness.

  8. Quantitative Convergence of Wasserstein Gradient Flows of Kernel Mean Discrepancies - Score: 18 (R=9, N=9) - Date: 2026-03-03 - Comment: Representation Learning/Training Dynamics: quantitative convergence of Wasserstein gradient flows (MMD/Sobolev) linking to infinite-width shallow nets.

  9. Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture -- Bridging Predictive and Generative Self-Supervised Learning - Score: 17 (R=9, N=8) - Date: 2026-03-23 - Comment: Representation learning: gives a variational reformulation of JEPA as an explicit latent-variable model, removing heuristic anti-collapse regularization.

  10. Only relative ranks matter in weight-clustered large language models - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Model compression/representation learning: shows clustered LLM weights preserve performance primarily through relative rank structure rather than exact magnitudes.

  11. A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models - Score: 17 (R=9, N=8) - Date: 2026-03-19 - Comment: Theory of statistical-to-computational gaps in high-dimensional learning via a unifying noise sensitivity exponent.

  12. Self-Distillation of Hidden Layers for Self-Supervised Representation Learning - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Self-supervised representation learning through hidden-layer self-distillation instead of only final-layer targets.

  13. IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning - Score: 17 (R=9, N=8) - Date: 2026-03-17 - Comment: Batch-independent collapse prevention for self-supervised representation learning via dataset-level auxiliary embeddings.

  14. Power-Law Spectrum of the Random Feature Model - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Derives power-law spectral preservation results for random feature models, directly addressing representation structure in core architectures.

  15. Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Representation learning analysis: identifies which next-token gradient components cause transformers to develop seemingly redundant abstract features.

  16. The Phenomenology of Hallucinations - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Representation-level theory of hallucination: uncertainty is internally encoded but weakly coupled to logits, explaining failure to abstain.

  17. On Interpolation Formulas Describing Neural Network Generalization - Score: 17 (R=9, N=8) - Date: 2026-03-16 - Comment: Theory of training dynamics: extends Domingos-style kernel interpolation to stochastic gradient training with optimizer-specific path kernels.

  18. Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding - Score: 17 (R=9, N=8) - Date: 2026-03-14 - Comment: Model architecture: separates context and sample encoding into dual representation spaces to reconcile in-context and in-weight learning.

  19. Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Theory of neural operators: extends DeepONet universal approximation from Banach-function settings to general locally convex spaces.

  20. Disentangled Representation Learning through Unsupervised Symmetry Group Discovery - Score: 17 (R=9, N=8) - Date: 2026-03-13 - Comment: Representation learning theory: unsupervised symmetry group discovery with identifiability guarantees for symmetry-based disentanglement.

  21. Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Representation Learning: proposes iterative selection of Gaussian mixture priors for VAEs to provably avoid posterior collapse across architectures.

  22. Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Representation learning/interpretability: sparse autoencoders + causal DAG structure learning to reveal concept interactions in LLMs.

  23. Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Matches Representation Learning: mechanistic interpretability using sparse autoencoders to reveal causal feature hierarchies inside a transformer TSFM.

  24. Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought - Score: 17 (R=9, N=8) - Date: 2026-03-12 - Comment: Theoretical foundations of representation/training dynamics behind prompt comprehension, ICL, and CoT in LLMs.

  25. From Data Statistics to Feature Geometry: How Correlations Shape Superposition - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Representation Learning: analyzes superposition under correlated features, introducing BOWS to reveal constructive interference and feature geometry beyond the sparse/independent case.

  26. Memorization capacity of deep ReLU neural networks characterized by width and depth - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Theory/Representation: characterizes memorization capacity via a tight width–depth tradeoff (W^2 L^2 ~ N log(1/δ)) for ReLU networks, advancing foundational understanding.

  27. An Empirical Study and Theoretical Explanation on Task-Level Model-Merging Collapse - Score: 17 (R=9, N=8) - Date: 2026-03-11 - Comment: Representation Learning: theoretical limits on model merging via rate–distortion, linking representational incompatibility to task-level collapse; fundamental analysis of mergeability.

  28. Causal Interpretation of Neural Network Computations with Contribution Decomposition - Score: 17 (R=9, N=8) - Date: 2026-03-09 - Comment: Representation Learning: uses sparse autoencoders to causally decompose hidden-neuron contributions, enabling mechanistic interpretability and controllable interventions.

  29. Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs - Score: 17 (R=9, N=8) - Date: 2026-03-05 - Comment: Representation learning/training dynamics: finds a robust sparsity–difficulty relation in LLM hidden states and exploits it for curriculum ICL.

  30. Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD? - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Optimization/Training Dynamics: compute-optimal scaling laws for signSGD under power-law random features, revealing noise-reshaping/drift-normalization effects.

  31. Diagnosing Generalization Failures from Representational Geometry Markers - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Representation Learning: uses representational geometry markers (manifold dimensionality/utility) to predict OOD generalization and guide model selection.

  32. Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Matches Representation Learning/Training dynamics theory: analyzes data quality and synergistic effects across pretraining, SFT, and RL with transformer models.

  33. Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Representation Learning/Training Dynamics: Singular Learning Theory explains grokking as a phase transition via the local learning coefficient.

  34. NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Model Architecture/Representation Learning: width-agnostic generation of neural weights via tokenized patches and GHN-based structural alignment to resolve permutation symmetries.

  35. Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Matches Representation Learning criterion: derives necessary geometric constraints (linear, orthogonal per-concept factors) for compositional generalization with empirical support.

  36. Provable Subspace Identification of Nonlinear Multi-view CCA - Score: 17 (R=9, N=8) - Date: 2026-03-02 - Comment: Representation Learning Theory: provable identifiability and finite-sample guarantees for nonlinear multi-view CCA subspace recovery.

  37. Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination - Score: 17 (R=8, N=9) - Date: 2026-03-23 - Comment: Representation learning/theory: proposes a unified geometric uncertainty principle linking adversarial fragility and LLM hallucination through input-gradient coupling.

  38. Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Foundational theory for RL/MDPs: operator-theoretic derivation of policy-gradient results for general state/action spaces with unbounded costs.

  39. Language Generation with Replay: A Learning-Theoretic View of Model Collapse - Score: 17 (R=8, N=9) - Date: 2026-03-13 - Comment: Learning theory for representation/data dynamics: formal characterization of model collapse under replayed self-generated text.

  40. On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD - Score: 16 (R=9, N=7) - Date: 2026-03-12 - Comment: Matches Training Dynamics/Representation: theoretical analysis of label-noise SGD in two-layer linear networks revealing phase behavior and links to SAM.

  41. SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients - Score: 16 (R=9, N=7) - Date: 2026-03-11 - Comment: Differentiable programming foundations: consolidates soft relaxations (e.g., sorting, indexing, fuzzy logic) to provide informative gradients in AD frameworks.

  42. Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers - Score: 16 (R=9, N=7) - Date: 2026-03-06 - Comment: Representation Learning/Training Dynamics: theoretical mechanism for analogical reasoning in transformers via aligned representations and curriculum-dependent emergence.

  43. Stable and Steerable Sparse Autoencoders with Weight Regularization - Score: 16 (R=9, N=7) - Date: 2026-03-05 - Comment: Matches Representation Learning and Sparsity: stability/steerability of sparse autoencoders via L2/L1 weight regularization, tied init, and unit-norm decoders.

  44. The Lattice Representation Hypothesis of Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-03-03 - Comment: Representation Learning: posits a concept lattice geometry in LLM embeddings enabling meet/join via linear attribute directions and thresholds.

  45. Spectral Alignment in Forward-Backward Representations via Temporal Abstraction - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Representation learning theory: analyzes spectral mismatch in forward-backward successor representations and shows temporal abstraction acts as a low-pass filter with value-error bounds.

  46. Pitfalls in Evaluating Interpretability Agents - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Provides foundational analysis of how to evaluate autonomous interpretability agents, introducing an intrinsic criterion based on functional interchangeability of model components.

  47. IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Representation analysis of CLIP projectors that identifies an aligned isotropic subspace and yields a training-free spectral decomposition method.

  48. RiboSphere: Learning Unified and Efficient Representations of RNA Structures - Score: 16 (R=8, N=8) - Date: 2026-03-23 - Comment: Model architecture and representation learning through a discrete geometric autoencoding framework combining vector quantization, SE(3)-invariant transformers, and flow matching.

  49. Secure Linear Alignment of Large Language Models - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Studies representational convergence via linear alignment between independently trained LLMs, directly probing shared representations.

  50. Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization method - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning/training dynamics theory: first unconditional error analysis for Adam via uniform a priori bounds in strongly convex stochastic optimization.

  51. Seasoning Generative Models for a Generalization Aftertaste - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning theory: proves discriminator-guided refinement can improve generative-model generalization, with bounds governed by discriminator-class complexity.

  52. Learning Decision-Sufficient Representations for Linear Optimization - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning: develops decision-sufficient compressed representations with hardness results, polynomial algorithms, and PAC bounds tied to intrinsic decision-relevant dimension.

  53. From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory - Score: 16 (R=8, N=8) - Date: 2026-03-20 - Comment: Representation learning: unsupervised corpus-scale concept discovery via a contrastive associative-memory objective that isolates transition structure rather than topical semantics.

  54. Discovering Decoupled Functional Modules in Large Language Models - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Representation-learning interpretability method that discovers decoupled cross-layer functional modules in LLMs with an unsupervised objective.

  55. Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Introduces a new internal-noise framework via variational kernel design, deriving Gaussian Chaos Noise with theoretical guarantees on representation distortion.

  56. Learning Permutation Distributions via Reflected Diffusion on Ranks - Score: 16 (R=8, N=8) - Date: 2026-03-19 - Comment: Core generative modeling contribution: a new diffusion framework on permutations using soft-rank forward processes and generalized PL denoisers.

  57. Decoding the Critique Mechanism in Large Reasoning Models - Score: 16 (R=8, N=8) - Date: 2026-03-18 - Comment: Representation-learning analysis of hidden critique behavior in reasoning models via an interpretable latent critique vector.

  58. W2T: LoRA Weights Already Know What They Can Do - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Weight-space representation learning for LoRA adapters using a canonical factorization that removes decomposition ambiguity.

  59. Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Representation/learning dynamics analysis: information-theoretic framework explaining reasoning via uncertainty externalization and information allocation.

  60. In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Improves KAN symbolic extraction with in-context operator selection and sparse gated operator layers, directly targeting core architecture interpretability/representation.

  61. Interpretable Classification of Time Series Using Euler Characteristic Surfaces - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Introduces Euler Characteristic Surfaces as a stable, computationally efficient topological representation for time series, with a proved stability theorem.

  62. $K-$means with learned metrics - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Theoretical foundation for representation/metric learning: continuity and stability of k-means under learned metrics via measured Gromov-Hausdorff topology.

  63. Windowed Fourier Propagator: A Frequency-Local Neural Operator for Wave Equations in Inhomogeneous Media - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Presents a frequency-local neural operator that preserves superposition, a methodological advance in representation for wave dynamics.

  64. Not All Latent Spaces Are Flat: Hyperbolic Concept Control - Score: 16 (R=8, N=8) - Date: 2026-03-16 - Comment: Representation-space innovation: hyperbolic concept steering for generative models using parallel transport instead of Euclidean latent control.

  65. Modality-free Graph In-context Alignment - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Method for graph foundation models: parameter-update-free in-context alignment across heterogeneous domains via gradient fingerprints.

  66. Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Representation learning analysis: shows self-supervised speech models encode neighboring phonetic context in position-dependent orthogonal subspaces.

  67. Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Theoretical representation-learning result linking InfoNCE temperature schedules to Langevin simulated annealing with asymptotic and finite-time guarantees.

  68. Diffusion Models Generalize but Not in the Way You Might Think - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Foundational analysis of memorization and generalization dynamics in diffusion models across noise levels and denoising trajectories.

  69. Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Training objective methodology for language models: sequence-level feature matching through energy-based fine-tuning with theoretical grounding.

  70. On-Average Stability of Multipass Preconditioned SGD and Effective Dimension - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Foundational optimization theory: multipass PSGD stability analysis with effective-dimension-dependent excess risk bounds and matching lower bounds.

  71. Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Representation learning and mechanistic interpretability study using exhaustive circuit tracing and higher-order ablations to characterize internal feature organization.

  72. Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Advances probabilistic latent variable modeling with a new proximal variational inference objective and convergence analysis to reduce amortization error.

  73. Harnessing Data Asymmetry: Manifold Learning in the Finsler World - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Foundational representation learning: extends manifold learning from symmetric Riemannian to asymmetric Finsler geometry with generalized t-SNE/UMAP.

  74. Factorized Neural Implicit DMD for Parametric Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Representation Learning/Architecture: factorized neural implicit DMD that parameterizes Koopman spectral decomposition for stable long-horizon rollouts and spectral analysis.

  75. Training Language Models via Neural Cellular Automata - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Training dynamics/representation learning: synthetic pre-pretraining with neural cellular automata enabling transfer and efficiency.

  76. A Gaussian Comparison Theorem for Training Dynamics in Machine Learning - Score: 16 (R=8, N=8) - Date: 2026-03-11 - Comment: Representation Learning/Training Dynamics: theoretical comparison (via Gordon’s theorem) linking training dynamics to a surrogate system; validates DMF and refines non-asymptotic behavior.

  77. COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics - Score: 16 (R=8, N=8) - Date: 2026-03-09 - Comment: Model Architecture / Representation Learning — training-free activation steering approximating one-step learning dynamics for in-context control of LLM internal representations.

  78. Why Is RLHF Alignment Shallow? A Gradient Analysis - Score: 16 (R=8, N=8) - Date: 2026-03-06 - Comment: Representation Learning/Training Dynamics: gradient analysis of RLHF showing shallow alignment and proposing a recovery-penalty objective to distribute gradients across positions.

  79. Semi-Supervised Generative Learning via Latent Space Distribution Matching - Score: 16 (R=8, N=8) - Date: 2026-03-05 - Comment: Latent Space Distribution Matching with Wasserstein bounds; connects to LDMs—representation learning/generative modeling theory.

  80. Surprisal-Rényi Free Energy - Score: 16 (R=8, N=8) - Date: 2026-03-05 - Comment: Matches Representation Learning/Training Objectives: introduces Surprisal-Rényi Free Energy interpolating KLs with variance/tail sensitivity and MDL interpretation.

  81. Random Features for Operator-Valued Kernels: Bridging Kernel Methods and Neural Operators - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Representation Learning/Theory — generalization analysis of random features for operator-valued kernels, linking to NTK and neural operators with optimal/minimax rates.

  82. What Is the Geometry of the Alignment Tax? - Score: 16 (R=8, N=8) - Date: 2026-03-03 - Comment: Representation Learning theory: geometric characterization of safety–capability tradeoffs in representation subspaces with scaling predictions.

  83. Universality of Shallow and Deep Neural Networks on Non-Euclidean Spaces - Score: 16 (R=8, N=8) - Date: 2026-03-02 - Comment: Model Architecture: theoretical universality for deep narrow networks on general topological spaces; Representation Learning: foundational approximation results beyond Euclidean inputs.

  84. CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Representation learning: quantifies fact entanglement in LLM hidden representations using forward activations to predict edit ripple effects efficiently.

  85. Hierarchical Latent Structure Learning through Online Inference - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Online hierarchical latent-variable inference via nested CRP plus sequential Monte Carlo for representation learning in sequential data.

  86. Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Uses activation probing to detect motivated reasoning from internal representations, directly probing how LLMs encode decision dynamics.

  87. PRISM: Demystifying Retention and Interaction in Mid-Training - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Foundational empirical analysis of mid-training, characterizing weight-space and representation changes and their interaction with later RL.

  88. V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Systematic architectural study of representation-aligned co-denoising, isolating key design ingredients for dual-stream diffusion.

  89. Grid-World Representations in Transformers Reflect Predictive Geometry - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Representation learning study showing transformer hidden states align with analytically derived predictive geometry in a controlled setting.

  90. Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Representation learning framework that explicitly decomposes embedding utility into alignment and complementarity for interpretable feature discovery from event sequences.

  91. Mechanistic Origin of Moral Indifference in Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Representation learning analysis using sparse autoencoders to isolate and reshape mono-semantic moral features in LLM latent space.

  92. TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Model compression: data-free tabular knowledge distillation built around interaction-diverse synthetic query generation from learned feature bins.

  93. Mechanistic Foundations of Goal-Directed Control - Score: 15 (R=8, N=7) - Date: 2026-03-17 - Comment: Mechanistic interpretability: analyzes emergence of goal-directed control circuits, gating thresholds, and phase transitions with closed-form predictions.

  94. ES-Merging: Biological MLLM Merging via Embedding Space Signals - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Uses embedding-space response signals to estimate layer- and element-wise model merging coefficients, making merging representation-aware rather than parameter-heuristic.

  95. Is the reconstruction loss culprit? An attempt to outperform JEPA - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Predictive representation learning: gated predictive autoencoders isolate predictable components to challenge JEPA-style objectives.

  96. Soft Mean Expected Calibration Error (SMECE): A Calibration Metric for Probabilistic Labels - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Foundational calibration metric replacing hard-label bin frequencies with mean probabilistic labels, extending ECE correctly.

  97. U-Face: An Efficient and Generalizable Framework for Unsupervised Facial Attribute Editing via Subspace Learning - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Representation learning via latent subspace learning for disentangled editing, with an autoencoder view and convergence-backed alternating optimization.

  98. Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition - Score: 15 (R=8, N=7) - Date: 2026-03-16 - Comment: Representation learning via a bottleneck-token reconstruction objective explicitly targeting what-is-where compositional scene state encoding.

  99. Resolving Interference (RI): Disentangling Models for Improved Model Merging - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Core methodology for model merging: reduces cross-task interference by functionally orthogonalizing constituent models using unlabeled auxiliary data.

  100. Representation Learning for Spatiotemporal Physical Systems - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Directly studies representation learning by comparing self-supervised objectives for physically meaningful latent representations, highlighting latent-space methods like JEPA.

  101. Maximizing Incremental Information Entropy for Contrastive Learning - Score: 15 (R=8, N=7) - Date: 2026-03-14 - Comment: Representation learning: contrastive objective that explicitly maximizes incremental entropy with an information-bottleneck formulation.

  102. Probing Length Generalization in Mamba via Image Reconstruction - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Core architecture analysis: probes Mamba length generalization failure modes and introduces a length-adaptive variant.

  103. Revisiting Model Stitching In the Foundation Model Era - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Representation learning via model stitching: a systematic study of cross-model feature compatibility in heterogeneous vision foundation models.

  104. A Geometrically-Grounded Drive for MDL-Based Optimization in Deep Learning - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Representation learning/compression: integrates MDL directly into training dynamics with a theoretical geometric optimization framework.

  105. Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Activation engineering method that improves steering vectors via cross-layer representation evolution, directly targeting core representation/control methodology in LLMs.

  106. OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Uses sparse autoencoders to disentangle superposed features and applies orthogonal projection for concept erasure, directly targeting representation structure.

  107. A Stable Neural Statistical Dependence Estimator for Autoencoder Feature Analysis - Score: 15 (R=8, N=7) - Date: 2026-03-13 - Comment: Representation analysis: stable neural statistical dependence estimator for quantifying input-latent-reconstruction dependence in autoencoders.

  108. A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Representation learning theory: universal nearest-neighbor intrinsic dimensionality estimator with distribution-free consistency.

  109. Digging Deeper: Learning Multi-Level Concept Hierarchies - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Proposes MLCS and Deep-HiCEMs for hierarchical concepts and interventions — matches Representation Learning (concept/dictionary learning) and architecture innovation.

  110. Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation - Score: 15 (R=8, N=7) - Date: 2026-03-12 - Comment: Matches Training Dynamics/Optimization: theoretical reinterpretation of SAM and a new XSAM update that improves generalization with minimal overhead.

  111. What is Missing? Explaining Neurons Activated by Absent Concepts - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning — identifies and explains neurons encoding absences via extensions to attribution/feature visualization.

  112. Curveball Steering: The Right Direction To Steer Isn't Always Linear - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning: geometry-aware nonlinear activation steering via polynomial kernel PCA, challenging the linear representation hypothesis.

  113. Transductive Generalization via Optimal Transport and Its Application to Graph Node Classification - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning: introduces OT-based, representation-dependent transductive generalization bounds and analyzes how GNN aggregation transforms representation distributions with depth.

  114. An accurate flatness measure to estimate the generalization performance of CNN models - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning/Training Dynamics: derives an exact, architecture-aware Hessian-trace-based flatness measure for CNNs (with GAP), robustly linked to generalization.

  115. Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Representation Learning and Efficiency: layer- and token-wise analysis of dLLMs vs AR LMs; introduces inference-time layer skipping achieving FLOPs reductions without KV-cache tricks.

  116. Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Representation Learning — probes frozen foundation-model features for continuous geometry, with layer-wise signal localization and objective/architecture comparisons.

  117. Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Representation Learning — sparse auto-encoder yields interpretable visual words and enables sparse inverted-index retrieval (sparse coding aligning with efficiency/interpretability).

  118. Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Representation Learning / Mechanistic Interpretability: disentangled safety subspaces (recognition vs execution) with causal steering in LLMs.

  119. Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning: activation/attention probing analyzes belief dynamics; Efficiency: probe-guided early-exit enables adaptive computation with large token savings.

  120. Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Model Compression/Efficiency + Representation Learning: CompACT discrete tokenizer compresses each observation to ~8 tokens for world models, enabling orders-of-magnitude faster planning with preserved task information.

  121. How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression? - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning — analyzes implicit bias/training dynamics of gradient descent in shallow ReLU models, quantifying deviation from minimum-l2 solution.

  122. Understanding the Dynamics of Demonstration Conflict in In-Context Learning - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning/Training Dynamics—empirical analysis of in-context learning under conflicting demonstrations; identifies and validates phase-specific attention heads causing failures.

  123. Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: Representation Learning/Interpretability—Delta-Crosscoder with sparsity and delta-based loss to isolate causal latent directions differing after fine-tuning.

  124. Efficient Refusal Ablation in LLM through Optimal Transport - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Representation Learning by transforming activation distributions with optimal transport and revealing layer-localized safety representations.

  125. Towards Improved Sentence Representations using Token Graphs - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Matches Model Architecture/Representation Learning: structure-aware pooling via token-similarity graphs and a lightweight GNN over frozen LLM outputs.

  126. StructLens: A Structural Lens for Language Models via Maximum Spanning Trees - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Structure-aware inter-layer analysis via MSTs over residual streams; aids layer pruning—representation learning and model compression.

  127. Controlling Chat Style in Language Models via Single-Direction Editing - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Representation Engineering: linear-direction editing in activation space for precise, training-free style control and composition.

  128. Old Habits Die Hard: How Conversational History Geometrically Traps LLMs - Score: 15 (R=8, N=7) - Date: 2026-03-05 - Comment: Analyzes internal LLM representations via geometric consistency over conversational history—representation learning/training dynamics.

  129. On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Representation Learning/Training Dynamics—provable slow convergence of robustness margin in non-linear ReLU networks.

  130. Discrete World Models via Regularization - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Matches Representation Learning with sparsity: unsupervised Boolean world models via entropy/independence/locality regularizers and robust discrete optimization.

  131. Rate-Distortion Signatures of Generalization and Information Trade-offs - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Representation Learning—uses rate–distortion theory to analyze accuracy–robustness/generalization trade-offs across models.

  132. Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning - Score: 15 (R=8, N=7) - Date: 2026-03-03 - Comment: Representation Learning — introduces trajectory-based analysis of layer-wise representation displacement to distinguish valid vs. spurious reasoning (tested on dense and MoE LLMs).

  133. Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Representation Learning: critical analysis of identifiability metrics with taxonomy and stress-testing suite.

  134. A Mixed Diet Makes DINO An Omnivorous Vision Encoder - Score: 15 (R=8, N=7) - Date: 2026-03-02 - Comment: Matches Representation Learning criterion: cross-modal alignment with a distillation objective to learn a modality-agnostic embedding space anchored to a frozen DINOv2 teacher.

Other Foundational Research (23)

  1. AI Must Embrace Specialization via Superhuman Adaptable Intelligence - Score: 20.0 (R=0, N=0) - Date: 2026-03-02 - Comment: Author match

  2. Self-Regularized Learning Methods - Score: 19 (R=10, N=9) - Date: 2026-03-18 - Comment: Provides a general theoretical framework for implicit regularization via self-regularization, covering gradient descent and yielding optimal statistical rates.

  3. Lost in Aggregation: On a Fundamental Expressivity Limit of Message-Passing Graph Neural Networks - Score: 19 (R=10, N=9) - Date: 2026-03-17 - Comment: Proves a fundamental expressivity limit of message-passing GNNs under generic aggregation, separating them sharply from graph isomorphism procedures.

  4. Neural Networks as Local-to-Global Computations - Score: 18 (R=9, N=9) - Date: 2026-03-17 - Comment: Reinterprets feedforward ReLU networks as local-to-global sheaf computations with harmonic extension and bidirectional heat-equation dynamics.

  5. Non-Euclidean Gradient Descent Operates at the Edge of Stability - Score: 17 (R=9, N=8) - Date: 2026-03-06 - Comment: Training Dynamics: generalizes Edge-of-Stability theory to non-Euclidean norms with a geometry-aware sharpness measure across optimizers.

  6. Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic - Score: 17 (R=9, N=8) - Date: 2026-03-03 - Comment: Matches Training Dynamics theory: GRPO policy gradient as a U-statistic with MSE bounds, oracle equivalence, and a universal group-size scaling law.

  7. Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Provable algorithmic gains from autocurriculum for reasoning-model SFT and RL fine-tuning.

  8. The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions - Score: 17 (R=8, N=9) - Date: 2026-03-19 - Comment: Theoretical study of geometric limits of causal interventions in continuous generative models, introducing manifold tearing and a causal uncertainty principle.

  9. Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Test-time reinforcement learning for unified multimodal models, with metacognitive monitoring signals enabling parameter updates and self-improvement at inference time.

  10. Transition Flow Matching - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Foundational generative modeling: directly learning transition flow as a global quantity enables single-step or arbitrary-time generation with theoretical unification.

  11. Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Theoretical optimization: variance-reduced forward-reflected-backward splitting with new biased and unbiased estimators plus convergence and oracle complexity guarantees.

  12. Preconditioned One-Step Generative Modeling for Bayesian Inverse Problems in Function Spaces - Score: 16 (R=8, N=8) - Date: 2026-03-17 - Comment: Introduces a neural-operator-based one-step generative sampler for Bayesian inverse problems with function-space stability analysis.

  13. Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis - Score: 16 (R=8, N=8) - Date: 2026-03-14 - Comment: Foundational analysis of why ideal noise-correction fails, linking optimization dynamics, convergence states, and information-theoretic limits.

  14. Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA - Score: 16 (R=8, N=8) - Date: 2026-03-13 - Comment: Unifies major membership inference attacks under an exponential-family likelihood-ratio framework and introduces Bayesian variance estimation for low-shadow-model regimes.

  15. HTMuon: Improving Muon via Heavy-Tailed Spectral Correction - Score: 16 (R=8, N=8) - Date: 2026-03-12 - Comment: Training Dynamics/Optimization for large models: HTMuon encourages heavy-tailed spectra with theory (Schatten-q steepest descent) and improved LLM pretraining.

  16. Inducing Sustained Creativity and Diversity in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Novel decoding method for sustained diversity and creativity in LLM generation, targeting inference-time behavior rather than application tuning.

  17. Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions - Score: 15 (R=8, N=7) - Date: 2026-03-23 - Comment: Controlled methodological study of 51 post-training algorithms uncovering scale-dependent ranking inversions and isolating algorithmic effects.

  18. Optimal Splitting of Language Models from Mixtures to Specialized Domains - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Scaling-law method for optimal compute allocation between pretraining and specialization when splitting language models into domain-specific models.

  19. Foundations of Schrödinger Bridges for Generative Modeling - Score: 15 (R=8, N=7) - Date: 2026-03-20 - Comment: Builds mathematical foundations for Schrödinger bridges as a unifying framework behind diffusion, score, and flow-based generative models.

  20. Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2026-03-18 - Comment: Studies learning-rate scheduling as a foundational training-dynamics question, linking no-decay pretraining to flatter minima and better downstream adaptability.

  21. Towards Understanding Adam Convergence on Highly Degenerate Polynomials - Score: 15 (R=8, N=7) - Date: 2026-03-11 - Comment: Training dynamics — theoretical analysis of Adam’s auto-convergence and stability regimes on degenerate polynomials.

  22. DC-Merge: Improving Model Merging with Directional Consistency - Score: 15 (R=8, N=7) - Date: 2026-03-09 - Comment: Model merging/parameter-space geometry: enforces directional consistency via singular-space smoothing and orthogonal subspace alignment.

  23. Ensembling Language Models with Sequential Monte Carlo - Score: 15 (R=8, N=7) - Date: 2026-03-06 - Comment: High-Performance Computing/Algorithms — Sequential Monte Carlo decoding to sample from f-ensemble LM distributions in a shared byte space, enabling principled ensembling across vocabularies.