Personalized Monthly Topic Summary 2026/02

Metric	Value
Total Papers	635
Model Architecture	186
Model Compression and Efficiency	221
High Performance Computing	42
Representation Learning	177
Other Foundational Research	9

Model Architecture (186)

Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds - Score: 19 (R=10, N=9) - Date: 2026-02-23 - Comment: Model Architecture (MoE): Grassmannian routing with Matrix Bingham distributions enabling concentration-controlled sparsity and provable resistance to expert collapse.
Approximation Theory for Lipschitz Continuous Transformers - Score: 19 (R=10, N=9) - Date: 2026-02-18 - Comment: Model Architecture/Theory: constructs Lipschitz-continuous Transformer blocks via gradient-flow Euler steps and proves universal approximation under Lipschitz constraints.
Stabilizing Native Low-Rank LLM Pretraining - Score: 19 (R=10, N=9) - Date: 2026-02-16 - Comment: Low‑rank Architecture/Training: native low‑rank transformer pretraining stabilized by spectral renormalization with orthogonalization (Spectron) and compute‑optimal scaling laws.
RAM-Net: Expressive Linear Attention with Selectively Addressable Memory - Score: 19 (R=10, N=9) - Date: 2026-02-13 - Comment: Matches Model Architecture and Efficiency: RAM-Net introduces selectively addressable sparse memory enabling expressive linear attention with random access.
Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers - Score: 19 (R=10, N=9) - Date: 2026-02-12 - Comment: Model Architecture theory: rigorous bounds on RoPE base for long-context Transformers, linking aliasing, depth, and precision constraints.
Versor: A Geometric Sequence Architecture - Score: 19 (R=10, N=9) - Date: 2026-02-12 - Comment: Strong Model Architecture match: introduces a new CGA-based sequence architecture (Versor) with O(L) complexity and interpretable attention via geometric operations.
Free Energy Mixer - Score: 19 (R=10, N=9) - Date: 2026-02-11 - Comment: Model Architecture: introduces Free Energy Mixer, a value-aware per-channel read mechanism plug-and-play with attention/SSMs for selection vs averaging.
OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale - Score: 19 (R=10, N=9) - Date: 2026-02-07 - Comment: Model architecture (MoE) + systems co-design: vector-level atomic experts with Cartesian Product Router and expert-centric scheduling to scale fine-grained MoE efficiently—strong core-topic match.
SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel - Score: 19 (R=10, N=9) - Date: 2026-02-07 - Comment: Linear attention architecture: geometry-aware spherical Yat-kernel with positive random feature approximation achieving near-softmax performance at O(L) time—foundational attention efficiency.
ZeroS: Zero-Sum Linear Attention for Efficient Transformers - Score: 19 (R=10, N=9) - Date: 2026-02-06 - Comment: Introduces Zero-Sum Linear Attention achieving O(N) complexity with contrastive capabilities via zero-sum residuals; core Transformer efficiency/attention innovation.
Unifying approach to uniform expressivity of graph neural networks - Score: 18 (R=10, N=8) - Date: 2026-02-23 - Comment: Model Architecture/Expressivity: introduces Template GNNs with matching logic and equivalence to analyze and unify GNN expressivity.
Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers - Score: 18 (R=10, N=8) - Date: 2026-02-23 - Comment: Model Architecture: TurboConn adds dense backward cross-token residuals to increase effective computational depth in Transformers.
A Theoretical Framework for Modular Learning of Robust Generative Models - Score: 18 (R=10, N=8) - Date: 2026-02-20 - Comment: Matches: Model Architecture (Mixture-of-Experts) — theoretical framework and minimax robust gating with generalization bounds and modularity.
MoE-Spec: Expert Budgeting for Efficient Speculative Decoding - Score: 18 (R=10, N=8) - Date: 2026-02-19 - Comment: Mixture-of-Experts + Efficiency: verification-time expert budgeting for MoE speculative decoding to cap expert capacity and improve throughput without retraining.
Avey-B - Score: 18 (R=10, N=8) - Date: 2026-02-18 - Comment: Model Architecture: proposes an attention-free encoder-only alternative with decoupled static/dynamic parameterizations, stability-oriented normalization, and neural compression for efficient long-context encoding.
ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns - Score: 18 (R=10, N=8) - Date: 2026-02-18 - Comment: MoE Architecture: training-free dense-to-MoE conversion using GLU activation patterns to form shared and routed experts without breaking activation regularities.
Divine Benevolence is an $x^2$: GLUs scale asymptotically faster than MLPs - Score: 18 (R=10, N=8) - Date: 2026-02-17 - Comment: Model Architecture: theoretical scaling-law advantage of GLUs (quadratic approximation order) over MLPs; introduces Gated Quadratic Unit with steeper L(P) slope.
Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models - Score: 18 (R=10, N=8) - Date: 2026-02-17 - Comment: Model Architecture (MoE): introduces geometry-preserving spherical barycentric aggregation for MoE embeddings to respect hyperspherical manifold structure.
SLA2: Sparse-Linear Attention with Learnable Routing and QAT - Score: 18 (R=10, N=8) - Date: 2026-02-16 - Comment: Matches Model Architecture and Efficiency: improved sparse-linear attention with learnable routing and quantization-aware training for major speedups while preserving quality.
HyperMLP: An Integrated Perspective for Sequence Modeling - Score: 18 (R=10, N=8) - Date: 2026-02-16 - Comment: Matches Model Architecture: reinterprets attention as a dynamic MLP and proposes HyperMLP/HyperGLU with theory and empirical gains over softmax attention.
Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers - Score: 18 (R=10, N=8) - Date: 2026-02-16 - Comment: Model Architecture (MoE): identifies a pre-routing bottleneck from multi-head attention causing route collisions and proposes head-wise routing (MH-MoE) to mitigate catastrophic forgetting.
LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training - Score: 18 (R=10, N=8) - Date: 2026-02-13 - Comment: Matches High Performance Computing and MoE Architecture: introduces Fully Sharded Expert Parallelism and adaptive expert re-layout for load-balanced MoE training.
Krause Synchronization Transformers - Score: 18 (R=10, N=8) - Date: 2026-02-13 - Comment: Model Architecture and Efficiency: localized, selectively sparse attention (Krause Attention) with linear time complexity.
Retrieval-Aware Distillation for Transformer-SSM Hybrids - Score: 18 (R=10, N=8) - Date: 2026-02-13 - Comment: Model Architecture/Efficiency: retrieval-aware distillation to build Transformer–SSM hybrids by preserving only retrieval-critical heads; 5–6x memory savings.
MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models - Score: 18 (R=10, N=8) - Date: 2026-02-13 - Comment: Mixture-of-Experts Efficiency: fine-tuning to reduce experts-per-sequence and cache preferred experts, cutting CPU–GPU transfers and boosting throughput up to 14.7x.
MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs - Score: 18 (R=10, N=8) - Date: 2026-02-12 - Comment: Model Architecture: targets MoE with routing-stable knowledge editing; Efficiency: block-structured updates solved via BCD for compute/memory efficiency.
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters - Score: 18 (R=10, N=8) - Date: 2026-02-12 - Comment: Strong Model Architecture match (sparse MoE with 11B active params) and Efficiency (interleaved sliding-window/full attention, MTP-3) plus scalable training systems.
Gated Removal of Normalization in Transformers Enables Stable Training and Efficient Inference - Score: 18 (R=10, N=8) - Date: 2026-02-12 - Comment: Model Architecture: introduces TaperNorm to remove per-token normalization; Efficiency: enables folding scalings into linear projections for faster inference with theoretical justification.
Effective MoE-based LLM Compression by Exploiting Heterogeneous Inter-Group Experts Routing Frequency and Information Density - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Model Architecture + Compression: MoE-aware SVD compression with routing-frequency and information-density–guided rank allocation plus sparse residual reconstruction.
Generalizing GNNs with Tokenized Mixture of Experts - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Matches Model Architecture (Mixture-of-Experts): tokenized MoE encoder with vector-quantized interface and Lipschitz-regularized head to improve GNN generalization/robustness.
Noise Stability of Transformer Models - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Matches Representation Learning and Model Architecture: introduces noise stability as a simplicity/robustness metric for Transformers, with theory and a regularizer that accelerates grokking/training.
SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Matches Model Architecture (MoE) and Efficiency: similarity-based dynamic expert re-routing to reduce active experts in batch decoding; includes custom CUDA and vLLM integration.
DirMoE: Dirichlet-routed Mixture of Experts - Score: 18 (R=10, N=8) - Date: 2026-02-10 - Comment: Model Architecture: Mixture-of-Experts with differentiable Dirichlet/Bernoulli routing (DirMoE) decoupling selection and contribution.
Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models - Score: 18 (R=10, N=8) - Date: 2026-02-10 - Comment: MoE Architecture/Theory: stable batch MM algorithm with convergence guarantees for softmax-gated multinomial-logistic MoE; finite-sample rates and near-optimal expert selection.
Spectral Gating Networks - Score: 18 (R=10, N=8) - Date: 2026-02-10 - Comment: Model Architecture: Spectral Gating Networks add a compact learnable Fourier pathway with gates to FFN/MLP layers under fixed budgets.
MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Model Architecture/Efficiency: Mixture of Slimmable Experts (MoSE) introduces slimmable experts within MoE for conditional widths, enabling continuous accuracy–compute trade-offs from a single pretrained model.
Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism - Score: 18 (R=10, N=8) - Date: 2026-02-05 - Comment: Direct hit on Model Architecture (Mixture-of-Experts) and High Performance Computing: proposes a new MoE variant with deterministic O(1) communication Head Parallelism for distributed training.
Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration - Score: 18 (R=10, N=8) - Date: 2026-02-05 - Comment: Model Architecture: analysis of multi-expert (MoE) orchestration and routing with causal attribution disentanglement.
From Dead Neurons to Deep Approximators: Deep Bernstein Networks as a Provable Alternative to Residual Layers - Score: 18 (R=10, N=8) - Date: 2026-02-05 - Comment: Model architecture: Bernstein activation-based deep networks as residual-free alternatives with provable trainability and exponential approximation rates.
Learning Tangent Bundles and Characteristic Classes with Autoencoder Atlases - Score: 18 (R=9, N=9) - Date: 2026-02-28 - Comment: Matches Model Architecture and Representation Learning: multi-chart autoencoders as atlases, tangent bundle recovery, and topological invariants (characteristic classes).
Why ReLU? A Bit-Model Dichotomy for Deep Network Training - Score: 18 (R=9, N=9) - Date: 2026-02-24 - Comment: Theoretical foundations/architecture: bit-model complexity dichotomy showing ReLU yields tractable ERM vs. polynomial activations (#P-hard).
Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction - Score: 18 (R=9, N=9) - Date: 2026-02-13 - Comment: Efficiency/Conditional Networks: consolidation-based routing that provably reduces attention compute over training with adaptive memory consolidation.
Free(): Learning to Forget in Malloc-Only Reasoning Models - Score: 18 (R=9, N=9) - Date: 2026-02-10 - Comment: Model Architecture/Efficiency: plug-and-play Free-Module (LoRA) enabling self-forgetting to prune useless context during reasoning.
Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization - Score: 17 (R=10, N=7) - Date: 2026-02-17 - Comment: Model Architecture (MoE): intra-layer specialization and cross-layer coupling losses to improve expert specialization and routing efficiency without architectural changes.
SD-MoE: Spectral Decomposition for Effective Expert Specialization - Score: 17 (R=10, N=7) - Date: 2026-02-16 - Comment: Model Architecture (MoE): spectral decomposition of parameters/gradients to decouple dominant subspaces and improve expert specialization with minimal overhead.
The Laplacian Mechanism Improves Transformers by Reshaping Token Geometry - Score: 17 (R=10, N=7) - Date: 2026-02-11 - Comment: Model Architecture + Representation Geometry: modifies attention into a Laplacian mechanism to directly control token variance and induce neural collapse–like geometry.
XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference - Score: 17 (R=10, N=7) - Date: 2026-02-11 - Comment: Matches Model Architecture (MoE) and Efficiency: in-batch expert sharing via greedy optimization to reduce expert activation and improve throughput without retraining.
SpecMD: A Comprehensive Study On Speculative Expert Prefetching - Score: 17 (R=10, N=7) - Date: 2026-02-06 - Comment: Model Architecture and Efficiency (MoE): standardized benchmarking for MoE expert caching and a novel eviction policy tailored to expert access patterns, improving TTFT and hit rates.
Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability - Score: 17 (R=9, N=8) - Date: 2026-02-28 - Comment: Matches Representation Learning/Training Dynamics: single-pass Koopman spectral profiling predicts transformer divergence and shaping stabilizes training across architectures (incl. MoE/SSMs).
Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement - Score: 17 (R=9, N=8) - Date: 2026-02-27 - Comment: High Performance Computing/Efficiency — optimizer-level acceleration by emphasizing flat-direction dynamics via a Riemannian ODE framework; applicable to Dense and MoE pretraining.
Support Tokens, Stability Margins, and a New Foundation for Robust LLMs - Score: 17 (R=9, N=8) - Date: 2026-02-27 - Comment: Model Architecture/Training Dynamics: probabilistic reformulation of self-attention with log-barrier MAP objective for robust LLMs (support tokens, stability margins).
Adaptation to Intrinsic Dependence in Diffusion Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Model Architecture/Inference Efficiency: distribution-agnostic randomized unmasking schedules for diffusion language models with KL convergence scaling to total correlation.
Path-conditioned training: a principled way to rescale ReLU neural networks - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Model Architecture/Optimization Theory: path-conditioned rescaling of ReLU networks via path-lifting and kernel alignment; principled conditioning improving training.
Incremental Learning of Sparse Attention Patterns in Transformers - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Training Dynamics/Representation Learning: analyzes staged emergence of sparse attention patterns in transformers with differential equation modeling and convergence results.
Toward Manifest Relationality in Transformers via Symmetry Reduction - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Model Architecture: symmetry-reduced Transformer operating on invariant relational quantities to remove redundant degrees of freedom and analyze optimization.
One-step Language Modeling via Continuous Denoising - Score: 17 (R=9, N=8) - Date: 2026-02-20 - Comment: Model Architecture and Efficiency: flow-based continuous denoising for language modeling with few-/one-step generation via distilled flow map; challenges discrete diffusion assumptions.
From Growing to Looping: A Unified View of Iterative Computation in LLMs - Score: 17 (R=9, N=8) - Date: 2026-02-19 - Comment: Model Architecture: unifies and analyzes looping and depth growth to induce iterative computation in LLMs; shows composability and inference-time looping benefits.
Synthesis and Verification of Transformer Programs - Score: 17 (R=9, N=8) - Date: 2026-02-19 - Comment: Matches Model Architecture/Analysis: formal verification and synthesis of Transformer programs (C-RASP) via SMT-backed model checking and learning.
Surgical Activation Steering via Generative Causal Mediation - Score: 17 (R=9, N=8) - Date: 2026-02-19 - Comment: Representation Learning/Architecture Control: uses generative causal mediation to localize and steer sparse attention heads for long-form behaviors via targeted activation interventions.
Beyond ReLU: Bifurcation, Oversmoothing, and Topological Priors - Score: 17 (R=9, N=8) - Date: 2026-02-18 - Comment: Model Architecture/Theory: introduces a class of non-monotone activations to induce bifurcations that mitigate GNN oversmoothing, with initialization derived from theory.
PolyNODE: Variable-dimension Neural ODEs on M-polyfolds - Score: 17 (R=9, N=8) - Date: 2026-02-18 - Comment: Model Architecture: extends Neural ODEs to variable-dimension flows on M-polyfolds (PolyNODE), enabling dimensional bottlenecks and new autoencoder designs.
Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: High Performance Computing/Training dynamics: optimal batch size scheduling via functional scaling laws; validated on Dense and MoE LLM pretraining.
Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Compression/Efficiency: modifies Transformer attention mask for block-wise causal masking to ease and accelerate soft prompt (context) compression at high ratios.
Rational Neural Networks have Expressivity Advantages - Score: 17 (R=9, N=8) - Date: 2026-02-16 - Comment: Model Architecture: introduces trainable low-degree rational activation functions with provable expressivity/parameter-efficiency advantages, extending to transformer-style nonlinearities.
Prototype Transformer: Towards Language Model Architectures Interpretable by Design - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Model Architecture: prototype-based autoregressive LM replacing self-attention with two-way prototype communication; linear sequence scaling and interpretability.
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Model Architecture + Efficiency: hybrid sparse (InfLLM-V2) and linear (Lightning) attention with hybrid positional encoding enabling 256K–1M context at up to 3.5x speed.
SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Model Architecture — introduces SpiralFormer, a looped Transformer with multi-resolution recursion enabling hierarchical dependencies and improved parameter/compute efficiency.
Towards Compressive and Scalable Recurrent Memory - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Model Architecture/Efficiency: HiPPO-grounded Elastic Memory with polynomial sampling for long-context recurrent memory; large memory and speed advantages.
LUCID: Attention with Preconditioned Representations - Score: 17 (R=9, N=8) - Date: 2026-02-12 - Comment: Model Architecture: modifies attention via preconditioned probabilities to improve long-context retrieval without extra complexity.
Learning to Remember, Learn, and Forget in Attention-Based Models - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Model Architecture and Training Dynamics: Bayesian metaplasticity for attention (Palimpsa), unifying and extending gated linear attention and linking to Mamba2.
ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Model Architecture/Training Dynamics: learns and reassigns residual connections to improve depth utilization with negligible overhead.
Discovering Interpretable Algorithms by Decompiling Transformers to RASP - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Matches Interpretability/Model Analysis: reparameterizes Transformers as RASP and extracts minimal causal sub-programs, directly connecting trained models to interpretable algorithms.
Latent Reasoning with Supervised Thinking States - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Model Architecture: latent reasoning with supervised thinking tokens injected during input processing, reducing CoT cost while preserving reasoning ability.
Gaussian Match-and-Copy: A Minimalist Benchmark for Studying Transformer Induction - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Matches Model Architecture/Training Dynamics: minimalist Transformer benchmark to isolate induction and analysis showing max-margin implicit bias for hard match selection.
StretchTime: Adaptive Time Series Forecasting via Symplectic Attention - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Model Architecture: proposes Symplectic Positional Embeddings generalizing RoPE to Sp(2,R) with adaptive time-warped attention.
Constructive conditional normalizing flows - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Matches Model Architecture/Theory: constructive conditional normalizing flows via perceptron-driven continuity equations with explicit implementable decomposition.
Radial Müntz-Szász Networks: Neural Architectures with Learnable Power Bases for Multidimensional Singularities - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Model Architecture: introduces Radial Müntz–Szász Networks with learnable radial power bases and a log-primitive, with closed-form gradients enabling physics-informed training and extreme parameter efficiency.
SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Model Architecture: two-stream SiameseNorm reconciles Pre-/Post-Norm, preserving stability and expressivity.
Improving Variable-Length Generation in Diffusion Language Models via Length Regularization - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Model Architecture/Efficiency: length-regularized inference for diffusion LMs enabling reliable variable-length generation without retraining.
Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Training Dynamics/Architecture Theory: derives a universal −3/2 depth-scaling law for learning rates via effective depth across CNNs/ResNets/Transformers under μP.
Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Model Architecture/Inference-time Computation: decouples reasoning into latent thought vectors and a decoder, enabling gradient-based refinement over a learned latent manifold.
Transport and Merge: Cross-Architecture Merging for Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-07 - Comment: Model architecture/training: cross-architecture weight-space merging via optimal transport-aligned activations to infer neuron correspondences—general method for knowledge transfer.
Pseudo-Invertible Neural Networks - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Matches Model Architecture/Representation: defines SPNNs with tractable non-linear pseudo-inverse enabling Non-Linear Back-Projection for zero-shot inversion.
Orthogonal Self-Attention - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Matches Model Architecture: Orthogonal Self-Attention with orthogonal parametrization and linear-time scaling enabling stable skipless Transformers.
Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Representation Learning: mechanistic analysis of structural inductive biases and attention-driven geometry in Transformers at initialization, with training dynamics insights (SeedPrint, attention-sink link).
Learning Compact Boolean Networks - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Develops compact Boolean network architectures (learned connections, compact convolutions, adaptive discretization) targeting computational efficiency.
On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Matches Training Dynamics: establishes superlinear relation between SGD noise covariance and curvature via activity–weight duality with theoretical bounds.
Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Matches Training Dynamics: exactly solvable model clarifies when SGD prefers flat vs. sharp minima based on label-noise anisotropy.
A logical re-conception of neural networks: Hamiltonian bitwise part-whole architecture - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Introduces a new architecture (graph-Hamiltonian operator) with radically low-precision arithmetic and linear scaling; matches Model Architecture and Efficiency criteria.
Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Theoretical analysis showing multi-layer linearized cross-attention is provably Bayes-optimal for multimodal ICL; architecture-level insight.
Billion-Scale Graph Foundation Models - Score: 17 (R=9, N=8) - Date: 2026-02-05 - Comment: HPC and architecture: scalable GraphBFF Transformer for billion-scale graphs with data batching, pretraining/fine-tuning recipes, and scaling laws.
YuriiFormer: A Suite of Nesterov-Accelerated Transformers - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Architecture: optimization-theoretic view yields a Nesterov-accelerated Transformer that preserves attention/MLP oracles and improves training/performance.
Names Don't Matter: Symbol-Invariant Transformer for Open-Vocabulary Learning - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Architecture: symbol-invariant Transformer with provable invariance to renaming of interchangeable tokens via aggregated attention across streams.
Sparse Attention as Compact Kernel Regression - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Architecture: provides a kernel-theoretic framework for sparse attention (entmax/compact kernels), offering principled attention design alternatives.
Exact closed-form Gaussian moments of residual layers - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Architecture/Training Dynamics: exact closed-form Gaussian mean/covariance propagation through residual layers for common activations.
Symmetry Breaking in Transformers for Efficient and Interpretable Training - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Architecture: symmetry-breaking biases in attention to remove rotational redundancy, improving optimizer performance and interpretability.
Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention - Score: 16 (R=9, N=7) - Date: 2026-02-28 - Comment: Model Architecture: proposes Affine-Scaled Attention that relaxes softmax unit-sum via input-dependent scaling/bias, improving stability and training dynamics.
Ruyi2 Technical Report - Score: 16 (R=9, N=7) - Date: 2026-02-28 - Comment: Model Architecture + HPC: variable-depth early-exit ‘Familial Model’ with 3D parallel training and parameter sharing for efficient train/deploy.
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models - Score: 16 (R=9, N=7) - Date: 2026-02-27 - Comment: Representation Learning/Architecture analysis: theoretical effects of fine-tuning on in-context learning in linear attention models; value-only updates preserve ICL.
Learning Physical Operators using Neural Operators - Score: 16 (R=9, N=7) - Date: 2026-02-27 - Comment: Model Architecture: physics-informed neural operators trained via operator splitting with a modular mixture-of-experts and neural ODE formulation for generalization across regimes.
Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series - Score: 16 (R=9, N=7) - Date: 2026-02-24 - Comment: Model Architecture and Efficiency: replaces attention with a centralized aggregation (CoTAR) achieving linear complexity and improved channel dependency modeling.
2Mamba2Furious: Linear in Complexity, Competitive in Accuracy - Score: 16 (R=9, N=7) - Date: 2026-02-20 - Comment: Model Architecture and Efficiency: modified Mamba-2 (linear attention/SSM) to approach softmax accuracy while maintaining linear complexity.
Arcee Trinity Large Technical Report - Score: 16 (R=9, N=7) - Date: 2026-02-20 - Comment: Model Architecture (MoE): introduces a sparse MoE LLM with sigmoid routing and a new MoE load-balancing method (SMEBU); High-Performance Training details at scale.
Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Model Architecture analysis: uncovers residual-path causal shift in Transformers and proposes residual attenuation/gating mitigation.
Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Model Architecture / PEFT: Mixture of Space experts with lightweight routing extends LoRA to heterogeneous geometries for curvature-aware adaptation.
Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Matches 'Model Architecture/Representation Learning (Autoencoders/Sparsity)': rigorous sanity checks showing current SAEs often match random baselines.
Position Encoding with Random Float Sampling Enhances Length Generalization of Transformers - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Model Architecture (Transformers): Random Float Sampling for position encoding improves length generalization; applicable to sinusoidal, RoPE, and ALiBi.
You Can Learn Tokenization End-to-End with Reinforcement Learning - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Model Architecture/Training Pipeline: learns tokenization end-to-end via score-function (REINFORCE) with variance reduction, replacing hardcoded tokenizers.
AllMem: A Memory-centric Recipe for Efficient Long-context Modeling - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: High Performance Computing / Model Architecture: hybrid sliding-window attention with non-linear test-time memory and memory-efficient fine-tuning for long-context scaling with reduced compute/memory.
HLA: Hadamard Linear Attention - Score: 16 (R=9, N=7) - Date: 2026-02-13 - Comment: Model Architecture/Efficiency: proposes a new linear attention (Hadamard Linear Attention) with an efficient scheme approximating softmax.
Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking - Score: 16 (R=9, N=7) - Date: 2026-02-12 - Comment: Model Architecture: proposes Kalman Linear Attention with probabilistic state-tracking; HPC/Efficiency: parallelizable via associative scan while retaining linear complexity.
Gradient Residual Connections - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Model Architecture: introduces gradient-based residual connections to better approximate high-frequency functions while retaining standard skips.
Understanding Dynamic Compute Allocation in Recurrent Transformers - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Model Architecture/Adaptive Computation: unified recurrent Transformer enabling per-token variable depth; analyzes compute allocation vs complexity.
Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Matches Model Architecture (MoE): analyzes router safety with Router Safety importance score (RoSais) and F-SOUR to expose unsafe routing configurations in MoE LLMs.
Diffusion-Inspired Reconfiguration of Transformers for Uncertainty Calibration - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: Matches Model Architecture: reconfigures transformer blocks as probabilistic mappings compiled onto a diffusion-like path for principled uncertainty propagation.
FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: Model Architecture/Efficiency: federated MoE with rank-heterogeneous experts (low-rank adapters) reducing parameters.
Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: High Performance Computing: restructures Transformer computation to minimize cross-GPU synchronization for faster multi-GPU inference.
From Kepler to Newton: Inductive Biases Guide Learned World Models in Transformers - Score: 16 (R=9, N=7) - Date: 2026-02-09 - Comment: Inductive Bias/Architecture: shows how minimal biases (spatial smoothness, stability via noisy contexts, temporal locality via restricted attention) guide transformers from curve-fitting to learning Newtonian world models.
Fine-Grained Model Merging via Modular Expert Recombination - Score: 16 (R=9, N=7) - Date: 2026-02-09 - Comment: Model Architecture: fine-grained, component-wise model merging with a reusable modular expert library and input-aware routing (conditional/dynamic networks).
Revisiting the Shape Convention of Transformer Language Models - Score: 16 (R=9, N=7) - Date: 2026-02-09 - Comment: Model Architecture/Efficiency: replaces Transformer FFN with deeper hourglass FFNs and rebalances attention vs FFN under fixed budgets, challenging the narrow–wide–narrow MLP convention.
A Multiplicative Neural Network Architecture: Locality and Regularity of Approximation - Score: 16 (R=9, N=7) - Date: 2026-02-09 - Comment: Model Architecture: proposes a multiplicative neural network with universal approximation and locality/regularity analysis.
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models - Score: 16 (R=9, N=7) - Date: 2026-02-06 - Comment: Model Architecture (MoE): routing-aware expert-level safety alignment with targeted expert repair and routing consistency; addresses expert/routing dynamics in MoE.
Rational ANOVA Networks - Score: 16 (R=9, N=7) - Date: 2026-02-05 - Comment: Model Architecture — Rational-ANOVA Network with explicit low-order interactions and stable Padé-style rational units for learnable nonlinearities and extrapolation.
Hierarchical Shift Mixing -- Beyond Dense Attention in Transformers - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Model Architecture/Efficiency: Hierarchical Shift Mixing distributes token interactions across layers for linear-time mixing; hybrid with softmax attention reduces cost.
Stabilizing Transformer Training Through Consensus - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Model Architecture: consensus mechanism as an attention replacement/hybrid to stabilize transformer training across learning rates.
SpanNorm: Reconciling Training Stability and Performance in Deep Transformers - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Model Architecture/training stability: SpanNorm reconciles PreNorm/PostNorm with a spanning residual and PostNorm-style output; theory for bounded variance; applicable to dense and MoE.
Efficient Encoder-Free Fourier-based 3D Large Multimodal Model - Score: 16 (R=8, N=8) - Date: 2026-02-27 - Comment: Model Architecture and Efficiency: encoder-free 3D LMM with FFT-based tokenizer approximating self-attention and Fourier-augmented LoRA adapters.
RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format - Score: 16 (R=8, N=8) - Date: 2026-02-27 - Comment: Model Architecture/Efficiency: gradient-free task-vector merging via null-space projection and instruction-attention scaling preserves reasoning format.
Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Model Architecture: proposes interpretable, identifiable networks as hierarchical compositions of utility-maximization blocks with theory.
Training-Free Cross-Architecture Merging for Graph Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Model Architecture and Efficiency: training-free cross-architecture GNN merging via a shared operator family (UMPM) and message alignment, avoiding retraining.
Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors - Score: 16 (R=8, N=8) - Date: 2026-02-23 - Comment: Representation Learning/Training Dynamics: theoretical analysis of attention-based regressors (PCC plateau) and architecture fix (Extrapolative Correlation Attention) addressing softmax/convex-hull limits.
Beyond Learning: A Training-Free Alternative to Model Adaptation - Score: 16 (R=8, N=8) - Date: 2026-02-19 - Comment: Matches Model Architecture and Efficiency: training-free module transplantation via activation-selected internal modules for immediate capability transfer.
Size Transferability of Graph Transformers with Convolutional Positional Encodings - Score: 16 (R=8, N=8) - Date: 2026-02-18 - Comment: Model Architecture/Theory: links Graph Transformers with GNN positional encodings to manifold neural networks, establishing size transferability guarantees.
UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: Model Architecture/Representation: unified discrete visual tokenizer with massive binary codebook and SigLu activation; conv–attention hybrid and staged training.
Which Algorithms Can Graph Neural Networks Learn? - Score: 16 (R=8, N=8) - Date: 2026-02-16 - Comment: Representation/Architecture theory: provides conditions for MPNNs to learn algorithms and generalize to arbitrary sizes; includes impossibility results and more expressive MPNN-like variants.
From Buffers to Registers: Unlocking Fine-Grained FlashAttention with Hybrid-Bonded 3D NPU Co-Design - Score: 16 (R=8, N=8) - Date: 2026-02-12 - Comment: High Performance Computing: accelerator/system co-design (3D-Flow) enabling fine-grained FlashAttention with register-level vertical dataflow to reduce SRAM energy and increase throughput.
Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets - Score: 16 (R=8, N=8) - Date: 2026-02-12 - Comment: Model Architecture: proposes a GFlowNets-based span-generation framework with a DAG state space and dynamic vocabulary for language modeling.
Escaping Spectral Bias without Backpropagation: Fast Implicit Neural Representations with Extreme Learning Machines - Score: 16 (R=8, N=8) - Date: 2026-02-10 - Comment: Model Architecture/Efficiency: backprop-free INRs via Extreme Learning Machines with domain decomposition and partition-of-unity; spectral Barron norm analysis guides adaptive refinement.
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning - Score: 16 (R=8, N=8) - Date: 2026-02-10 - Comment: Model Architecture/Reasoning: latent reasoning interface decouples chemical computation from text, using continuous latent dynamics instead of CoT.
Algebraic Robustness Verification of Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-02-09 - Comment: Theory for Robustness Verification: formulates verification via ED degree/discriminants and provides an exact certification algorithm via homotopy; architecture-dependent complexity measure.
Breaking Symmetry Bottlenecks in GNN Readouts - Score: 16 (R=8, N=8) - Date: 2026-02-06 - Comment: Model Architecture/Expressivity: proves averaging readouts in GNNs erase symmetry-aware components and introduces projector-based invariant readouts to break this bottleneck.
Limitations of SGD for Multi-Index Models Beyond Statistical Queries - Score: 16 (R=8, N=8) - Date: 2026-02-06 - Comment: Training Dynamics: non-SQ framework proving limitations of vanilla SGD for multi-index models including neural nets.
TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training - Score: 16 (R=8, N=8) - Date: 2026-02-02 - Comment: High Performance Computing — tensorized gradient orthonormalization generalizing Muon with improved convergence for LLM pre-training.
Matterhorn: Efficient Analog Sparse Spiking Transformer Architecture with Masked Time-To-First-Spike Encoding - Score: 16 (R=8, N=8) - Date: 2026-02-02 - Comment: Model Architecture and Efficiency: sparse spiking Transformer with masked time-to-first-spike encoding and compute-in-memory synapses to cut data-movement energy.
pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: MoE/PEFT architecture: mixture-of-experts prompt tuning with learnable dispatcher to combine diverse domain experts.
Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Model architecture: identifies activation subspace bottlenecks in SSMs and introduces a steered/test-time intervention and Stable-Mamba variant.
WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Model architecture: Wavelet-based SSMs (WaveSSM) providing localized temporal bases for long-range sequence modeling.
A Computationally Efficient Multidimensional Vision Transformer - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Model Architecture and Efficiency — introduces a tensor cosine product (Cproduct) ViT with multilinear structure and 1/C parameter reduction enabling efficient attention.
VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Model Architecture/Efficiency: introduces vector-quantized graph tokens and token-level attention to reduce Graph Transformer complexity and improve OOD generalization.
Laplacian Multi-scale Flow Matching for Generative Modeling - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Matches Model Architecture/Efficiency: Laplacian multi-scale flow matching with parallel mixture-of-transformers and causal attention for faster, high-quality generation.
Insertion Based Sequence Generation with Learnable Order Dynamics - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Matches Model Architecture: introduces learnable order dynamics for insertion-based masked diffusion via discrete flow matching.
Transformers for dynamical systems learn transfer operators in-context - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Representation Learning/Architecture: elucidates in-context learning in transformers as transfer-operator forecasting with discovery of double-descent tradeoffs.
Advection-Diffusion on Graphs: A Bakry-Emery Laplacian for Spectral Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Model Architecture (GNNs): Bakry-Emery Laplacian with learnable potential yielding adaptive advection–diffusion in spectral GNNs (mu-ChebNet).
PHAST: Port-Hamiltonian Architecture for Structured Temporal Dynamics Forecasting - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Model Architecture: introduces a port-Hamiltonian neural architecture with low-rank PSD/SPD parameterizations and stable integrators for long-horizon dynamics.
Provable Adversarial Robustness in In-Context Learning - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Theory of In-Context Learning: provable adversarial robustness bounds for linear self-attention Transformers under Wasserstein shifts (capacity/sample complexity).
Be Wary of Your Time Series Preprocessing - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: Representation Learning — introduces a formal expressivity framework analyzing how normalization (Standard/Min-Max) impacts Transformer representations for time series; Model Architecture — theoretical bounds on preprocessing effects in Transformer-based models.
CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Model Efficiency: cross-layer attention aggregation to stabilize token-importance ranking and accelerate LLM prefill, reducing TTFT substantially.
Complex-Valued Unitary Representations as Classification Heads for Improved Uncertainty Quantification in Deep Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Model Architecture: introduces a complex-valued unitary classification head (Cayley-parameterized) that improves calibration over standard softmax.
Spectral Convolution on Orbifolds for Geometric Deep Learning - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Architecture: extends spectral convolution to orbifolds as a foundational GDL building block.
Use What You Know: Causal Foundation Models with Partial Graphs - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Architecture/Attention: conditions causal foundation models on partial graphs via learnable attention biases to leverage domain knowledge.
Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Architecture/Inference-time computation: inner-loop reapplication of transformer blocks to extend refinement without training.
Synaptic Activation and Dual Liquid Dynamics for Interpretable Bio-Inspired Models - Score: 15 (R=8, N=7) - Date: 2026-02-16 - Comment: Conditional/Dynamic Networks: introduces liquid-capacitance dynamics with chemical synapses and synaptic activation for interpretable recurrent policies.
Enforcing Reciprocity in Operator Learning for Seismic Wave Propagation - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Model Architecture: transformer-based neural operator with reciprocity hard-coded via cross-attention and commutative operations.
Neural Additive Experts: Context-Gated Experts for Controllable Model Additivity - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Matches Model Architecture: Neural Additive Experts (context-gated experts) balancing interpretability and accuracy via controllable additivity (MoE-like gating).
C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Model Architecture: proposes C^2RoPE positional encoding and Chebyshev causal masking to improve spatio-temporal reasoning in multimodal Transformers.
Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Model Architecture/Efficiency: recursive Transformer with connector and monotonic recursion loss for compute-adaptive refinement.
Regime Change Hypothesis: Foundations for Decoupled Dynamics in Neural Network Training - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Matches Representation Learning/Training Dynamics: theoretical and empirical study of activation-pattern stability and two-timescale behavior across architectures including Transformers.
The Median is Easier than it Looks: Approximation with a Constant-Depth, Linear-Width ReLU Network - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Matches Model Architecture theory: expressivity/depth–width tradeoffs for ReLU networks to approximate the median with constant depth and linear width.
CauScale: Neural Causal Discovery at Scale - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: High Performance Computing/Architecture Scaling: neural causal discovery with reduction unit and tied attention to cut memory/time, scaling to 1k-node graphs.
Time-Delayed Transformers for Data-Driven Modeling of Low-Dimensional Dynamics - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Model Architecture: proposes a minimal time-delayed transformer linking to TD-DMD with linear complexity; Representation Learning for nonlinear/chaotic dynamics.
HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Model Architecture: introduces a fully hyperbolic transformer and a geometry-aware pooling operator (Outward Einstein Midpoint) that preserves hierarchy; Representation Learning via hyperbolic norms encoding specificity.
The Quantum Sieve Tracer: A Hybrid Framework for Layer-Wise Activation Tracing in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Mechanistic Interpretability: hybrid quantum-classical activation tracing to disentangle sparse semantic signals from polysemantic noise in LLM attention circuits.
Weisfeiler and Lehman Go Categorical - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Model Architecture theory: categorical WL framework deriving hypergraph neural architectures with provable expressivity gains.
Diffeomorphism-Equivariant Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Model Architecture/Equivariance: induces diffeomorphism equivariance via energy-based canonicalisation, extending symmetry handling to infinite-dimensional groups.
Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Architecture/Training Dynamics: training-free Selective Layer Restoration (restores chosen layers to their pretrained weights) to recover diversity without quality loss; layerwise functional roles.
HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: MoE Efficiency: training-free online expand–reduce control for multi-path decoding in mixture-of-experts LLMs, reallocating compute under fixed budgets.
SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Model Architecture/Efficiency: scalable in-context hypernetwork generating LoRA adapters in a single pass for fast adaptation without fine-tuning.
Accelerating Vision Transformers on Brain Processing Unit - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Model Architecture and Efficiency: restructures ViT (linear/LN→conv ops) to exploit CNN-optimized BPU hardware with INT8 acceleration, enabling weight transfer without retraining.
Determining Energy Efficiency Sweet Spots in Production LLM Inference - Score: 15 (R=8, N=7) - Date: 2026-02-07 - Comment: HPC/Efficiency: analytical model tying Transformer compute/memory complexity to energy for inference, identifying sequence-length “sweet spots.”
Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Optimization/Training Dynamics: theory-driven adaptive warm-up scheduling for norm-constrained optimizers with convergence guarantees and practical scheduler for LLM pretraining.
Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Matches Model Architecture/Representation: bilinear MLPs induce non-mixing representations enabling structural disentanglement and editability.
Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Matches Architecture/Efficiency (Sparsity): block-sparse document attention mask for RAG to prevent harmful cross-document interactions.
MirrorLA: Reflecting Feature Map for Vision Linear Attention - Score: 15 (R=8, N=7) - Date: 2026-02-05 - Comment: Model Architecture: geometric linear attention via learnable Householder reflections to preserve information under non-negativity constraints.
SOMBRERO: Measuring and Steering Boundary Placement in End-to-End Hierarchical Sequence Models - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Model Architecture: router-agnostic boundary quality metric and boundary-steering loss for end-to-end hierarchical sequence models to align compute with surprisal.
NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Model Architecture: encoder-free text-graph modeling by repurposing self-attention and positional IDs to natively encode topology in LMs.
Shattered Compositionality: Counterintuitive Learning Dynamics of Transformers for Arithmetic - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Representation Learning/Training Dynamics: analysis of how transformers acquire arithmetic skills, revealing non-human compositionality dynamics.

Model Compression and Efficiency (221)

Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA - Score: 20.0 (R=0, N=0) - Date: 2026-02-27 - Comment: Author match
stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation - Score: 20.0 (R=0, N=0) - Date: 2026-02-11 - Comment: Author match
Fast KV Compaction via Attention Matching - Score: 19 (R=10, N=9) - Date: 2026-02-19 - Comment: Matches Compression/Efficiency: fast KV-cache compaction via attention matching with per-head preservation and closed-form subproblems enabling strong quality-time tradeoffs.
WildCat: Near-Linear Attention in Theory and Practice - Score: 19 (R=10, N=9) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: near-linear attention via coreset selection (randomly pivoted Cholesky) with strong approximation guarantees and practical GPU implementation.
RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs - Score: 19 (R=10, N=9) - Date: 2026-02-07 - Comment: Model compression/quantization: residual-aware binarization training that enforces a hierarchical error-correcting structure across binary paths, enabling accurate 2-bit, matmul-free LLM inference.
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models - Score: 19 (R=10, N=9) - Date: 2026-02-05 - Comment: Model Compression and Efficiency: 2-bit LLM quantization via variable bit-plane grids with second-order refinement and theory.
GeoIB: Geometry-Aware Information Bottleneck via Statistical-Manifold Compression - Score: 19 (R=10, N=9) - Date: 2026-02-05 - Comment: Representation Learning/Compression — Geometry-aware Information Bottleneck replacing MI estimation with Fisher–Rao and Jacobian-based controls plus natural-gradient updates.
Float8@2bits: Entropy Coding Enables Data-Free Model Compression - Score: 19 (R=10, N=9) - Date: 2026-02-02 - Comment: Model compression and efficiency: extreme-rate post-training compression via entropy coding decoupled from precision (data-free), achieving SOTA at ≤4 bits.
Breaking the Blocks: Continuous Low-Rank Decomposed Scaling for Unified LLM Quantization and Adaptation - Score: 19 (R=10, N=9) - Date: 2026-02-02 - Comment: Compression/Efficiency — unified low-rank decomposed element-wise scaling enabling quantization, joint QAT, and high-rank multiplicative PEFT with no extra inference cost.
FlashOptim: Optimizers for Memory Efficient Training - Score: 18 (R=10, N=8) - Date: 2026-02-28 - Comment: Matches Compression/Efficiency and HPC: optimizer-state quantization via companding and bounded master weight splitting cuts per-parameter training memory to ~7 bytes while preserving quality.
S2O: Early Stopping for Sparse Attention via Online Permutation - Score: 18 (R=10, N=8) - Date: 2026-02-28 - Comment: Matches Compression/Efficiency: online permutation plus early-stopping for sparse attention substantially raises effective sparsity and delivers large end-to-end speedups for long-context transformers.
veScale-FSDP: Flexible and High-Performance FSDP at Scale - Score: 18 (R=10, N=8) - Date: 2026-02-28 - Comment: High Performance Computing: redesigned FSDP with RaggedShard and structure-aware planning; supports block-wise quantized training and non-element-wise optimizers at scale.
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-02-27 - Comment: Model compression and efficiency: hardware-aware inner-dimension groupwise KV-cache quantization with hybrid schemes and normalization to speed LLM decoding.
PRAC: Principal-Random Subspace for LLM Activation Compression and Memory-Efficient Training - Score: 18 (R=10, N=8) - Date: 2026-02-27 - Comment: Model Compression and Efficiency: activation compression via principal (SVD) + random orthogonal subspace with unbiased low-variance gradient estimation, reducing activation memory in LLM training.
pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training - Score: 18 (R=10, N=8) - Date: 2026-02-27 - Comment: Model Compression and Efficiency: decoupled linear QAT with a dominant 1-bit branch plus compact high-precision branch (and sparse experts) to overcome democratization and enable sub-2-bit LLMs.
A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs - Score: 18 (R=10, N=8) - Date: 2026-02-24 - Comment: Model Compression and Efficiency — MoE inference-time load balancing via expert replication and quantization; training-free, systems-level improvement for Sparse MoE LLMs.
Cut Less, Fold More: Model Compression through the Lens of Projection Geometry - Score: 18 (R=10, N=8) - Date: 2026-02-23 - Comment: Model Compression and Efficiency: geometry-aware, calibration-free compression; formalizes pruning vs low-rank folding as orthogonal projections with theoretical and large-scale empirical support.
ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs - Score: 18 (R=10, N=8) - Date: 2026-02-23 - Comment: Compression/Efficiency: hardware-aligned mixed-precision quantization with block-wise partitioning and global bitwidth allocation under memory budget.
Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression - Score: 18 (R=10, N=8) - Date: 2026-02-20 - Comment: Model Compression and Efficiency: analyzes sign persistence as a bottleneck for sub-bit compression; introduces gap-based init and outward-drift regularizer to reduce effective sign flips.
Beyond SGD, Without SVD: Proximal Subspace Iteration LoRA with Diagonal Fractional K-FAC - Score: 18 (R=10, N=8) - Date: 2026-02-19 - Comment: Compression/Efficiency: advances LoRA optimization via proximal subspace iteration (LoRSum) and memory-efficient preconditioning (diagonal K-FAC/Shampoo) without full SVD.
COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression - Score: 18 (R=10, N=8) - Date: 2026-02-18 - Comment: Model Compression and Efficiency: training-free sparse factorization for Transformer compression using orthogonal dictionaries with closed-form Procrustes updates and one-shot dynamic layer-wise budget allocation.
WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity - Score: 18 (R=10, N=8) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: training-free, weight-aware mixed-granularity activation sparsity with improved sparse kernels for LLM inference.
S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations - Score: 18 (R=10, N=8) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: reduces activation outliers via selective spectral decay tied to dominant singular values, yielding quantization-friendly activations (PTQ/QAT gains).
LCSB: Layer-Cyclic Selective Backpropagation for Memory-Efficient On-Device LLM Fine-Tuning - Score: 18 (R=10, N=8) - Date: 2026-02-16 - Comment: Memory Optimization/Efficiency: selective per-step layer backpropagation for LoRA fine-tuning on-device, with BCD interpretation and improved stability.
Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration - Score: 18 (R=10, N=8) - Date: 2026-02-13 - Comment: Model Compression and Efficiency + MoE: heterogeneous MoE expert pruning, windowed attention replacement, and FP8 KV-cache quantization via post-training NAS for inference acceleration.
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-02-13 - Comment: Matches Compression/Efficiency and MoE: KLT-guided SVD plus bias-corrected vector quantization for ultra-low-bit MoE LLMs.
ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression - Score: 18 (R=10, N=8) - Date: 2026-02-12 - Comment: Model Compression/Efficiency: training-free sparse factorization with calibration-guided knapsack allocation under global budget; strong compression innovation.
Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: structured sparsity via stochastic gates to prune rows/columns, cutting 20–40% params and inference time with theory.
FlattenGPT: Depth Compression for Transformer with Layer Flattening - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: depth compression via layer flattening to merge adjacent Transformer blocks, preserving structure and improving inference efficiency.
Prism: Spectral-Aware Block-Sparse Attention - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Matches Efficiency (Attention): spectral-aware block-sparse attention with theory on RoPE-induced pooling artifacts and a training-free block selection method yielding large speedups.
ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Matches Model Compression and Efficiency: training-free KV-cache compression via Euclidean distance scoring and windowed variants for long-context robustness.
Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: pre-hoc KV-cache sparsity with MI-based accuracy guarantees for long-context inference.
DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity - Score: 18 (R=10, N=8) - Date: 2026-02-11 - Comment: Model Compression and Efficiency + HPC: residual-based KV cache compression leveraging long-range similarity and a specialized sparse inference engine for speedups.
ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs - Score: 18 (R=10, N=8) - Date: 2026-02-10 - Comment: Efficiency/Systems: drift-robust GPU-native KV-cache top-k retrieval with collision candidates and quantized reranking; supports UVA offloading and million-token contexts.
POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Compression/Efficiency: online structural pruning with context-conditioned dynamic masks for LLMs/MoEs/VLMs; plug-and-play efficient inference.
NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Compression/Efficiency: PTQ to sub-1-bit via low-rank binary factorization with ADMM initialization and reconstruction; state-of-the-art ultra-low-bit LLM quantization.
Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Compression/Efficiency and Representation Learning: proves emergent low-rank/invariant subspace training dynamics in MLPs and motivates effective low-rank parameterizations.
Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Low-Rank/Training Dynamics: learning-rate scaling laws across LoRA ranks (μA) with transfer to full finetuning; hyperparameter transfer theory for low-rank adaptation.
To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Compression/Efficiency: combines 2:4 structured weight sparsity with v:n:m activation sparsity and sparse-to-dense training to accelerate LLM pretraining with maintained quality.
Compressing LLMs with MoP: Mixture of Pruners - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Compression/Efficiency: structured pruning via a mixture-of-pruners combining depth and width pruning, yielding latency reductions and improved accuracy under fixed compression.
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers - Score: 18 (R=10, N=8) - Date: 2026-02-06 - Comment: Closed-form one-shot structured pruning for ViTs with representation-preserving compensation using unlabeled calibration; directly targets structured pruning/efficiency.
Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs - Score: 18 (R=10, N=8) - Date: 2026-02-06 - Comment: Compression/Efficiency: hierarchical top-p sparse attention optimizing selection cost and attention compute for long-context LLMs.
CoSA: Compressed Sensing-Based Adaptation of Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-02-06 - Comment: PEFT/Compression: compressed sensing-based adaptation replaces low-rank updates with random projections plus compact core, improving expressivity under parameter efficiency constraints.
Online Vector Quantized Attention - Score: 18 (R=10, N=8) - Date: 2026-02-05 - Comment: Model Architecture + Efficiency: online vector-quantized attention with linear compute/constant memory and sparse memory updates for long-context tasks.
MixQuant: Pushing the Limits of Block Rotations in Post-Training Quantization - Score: 18 (R=10, N=8) - Date: 2026-02-02 - Comment: Model Compression and Efficiency: PTQ with block rotations analyzed non-asymptotically; introduces permutation-based mass diffusion for outlier suppression.
ARO: A New Lens On Matrix Optimization For Large Models - Score: 18 (R=9, N=9) - Date: 2026-02-11 - Comment: High Performance Computing/Efficiency: new matrix optimizer (gradient rotation) for faster LLM training beyond orthogonalization/whitening.
Semantic Rate Distortion and Posterior Design: Compute Constraints, Multimodality, and Strategic Inference - Score: 18 (R=9, N=9) - Date: 2026-02-06 - Comment: Compression/Efficiency and Representation Learning: semantic rate–compute tradeoffs with strategic posterior design, showing compute as implicit rate and multimodal benefits.
1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization - Score: 17 (R=10, N=7) - Date: 2026-02-18 - Comment: Model Compression and Efficiency: low-bit QAT with k-means weight quantization; demonstrates efficient 1-bit weight regimes under fixed memory budgets.
Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization - Score: 17 (R=10, N=7) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: post-training quantization for LLMs with activation-guided structured regularization to suppress outliers without added inference latency.
Regularized Calibration with Successive Rounding for Post-Training Quantization - Score: 17 (R=10, N=7) - Date: 2026-02-07 - Comment: Compression/Efficiency: PTQ with regularized asymmetric calibration and a successive rounding + bounded-search procedure for LLM quantization.
NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion - Score: 17 (R=9, N=8) - Date: 2026-02-28 - Comment: Matches Model Compression/Efficiency via low-rank adaptation and Model Architecture: a non-linear adapter (gating + structural dropout) breaking LoRA’s linear rank ceiling.
SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning - Score: 17 (R=9, N=8) - Date: 2026-02-28 - Comment: Matches Model Compression and Efficiency: model-driven KV cache management (auxiliary parallel task) for long-horizon reasoning; memory/context optimization.
Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators - Score: 17 (R=9, N=8) - Date: 2026-02-27 - Comment: High Performance Computing/Efficiency — vectorized constrained decoding via CSR sparse ops (STATIC) for accelerator-friendly trie operations, enabling production-scale constrained generative retrieval.
Celo2: Towards Learned Optimization Free Lunch - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Matches Efficiency/Training Dynamics: simple normalized learned optimizer meta-trained with tiny compute, scaling out-of-distribution to billion-parameter pretraining.
IDLM: Inverse-distilled Diffusion Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Matches Model Compression/Efficiency: inverse distillation reduces DLM sampling steps 4–64× with theoretical uniqueness and gradient-stable relaxations.
PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Model Architecture + Compression/Efficiency: replaces vector quantization with a differentiable PCA bottleneck (Oja’s rule), yielding stable, bit-efficient autoencoders.
Neural-HSS: Hierarchical Semi-Separable Neural PDE Solver - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Compression/Efficiency + Model Architecture: leverages HSS low-rank structure for parameter/data efficiency, with theoretical guarantees and links to FNO/convolutions.
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Efficiency: train-dense, infer-sparse attention via recurrence-augmented attention; reduces FLOPs and KV cache with minimal accuracy loss.
Calibrated Adaptation: Bayesian Stiefel Manifold Priors for Reliable Parameter-Efficient Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Parameter-Efficient Fine-Tuning: Bayesian adapters with Matrix Langevin priors on the Stiefel manifold for calibrated low-rank adaptation and uncertainty, with intrinsic manifold inference.
GeneZip: Region-Aware Compression for Long Context DNA Modeling - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Compression/Efficiency: region-aware DNA compression with dynamic routing enabling long-context training with major compute savings.
Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization - Score: 17 (R=9, N=8) - Date: 2026-02-20 - Comment: Model Compression and Efficiency: ZO fine-tuning via subspace projection and spectral gradient orthogonalization (low-rank Muon) improving query/memory efficiency.
Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2026-02-20 - Comment: Model Efficiency: algorithmic improvements to speculative decoding (delayed tree expansion, dynamic selector) for faster LLM sampling.
Continuous-Time Piecewise-Linear Recurrent Neural Networks - Score: 17 (R=9, N=8) - Date: 2026-02-18 - Comment: Model Architecture: introduces continuous-time piecewise-linear RNNs with a training/simulation algorithm that exploits PL structure, improving tractability and efficiency over Neural ODEs.
Uniform error bounds for quantized dynamical models - Score: 17 (R=9, N=8) - Date: 2026-02-18 - Comment: Compression/quantization theory: uniform error bounds for quantized dynamical models, with complexity scaling in bits (hardware–statistical link).
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers - Score: 17 (R=9, N=8) - Date: 2026-02-18 - Comment: High Performance Computing/Optimization: masked adaptive updates (Magma) provide a simple, efficient optimizer improving LLM pretraining with curvature-regularization effects.
Algorithmic Simplification of Neural Networks with Mosaic-of-Motifs - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Compression/Efficiency: constrained parameterization (Mosaic-of-Motifs) leveraging reusable motifs to reduce algorithmic (Kolmogorov) complexity of neural weights.
Unbiased Approximate Vector-Jacobian Products for Efficient Backpropagation - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Efficiency: unbiased randomized approximate vector–Jacobian products for backprop to reduce compute/memory with variance-optimal estimators.
MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: training-free sparse denoising guided by the first All-[MASK] attention to prune KV cache accesses, delivering large long-context speedups.
The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Matches 'Compression/Efficiency (Quantization)': theoretical decomposition showing precision reduction can increase net energy in multi-hop reasoning due to dequantization and sequential amortization effects.
SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Matches 'Compression/Efficiency (Sparse Attention)': trainable hybrid Top-k+Top-p masking with distillation fine-tuning achieving 95% sparsity and large speedups.
FUTON: Fourier Tensor Network for Implicit Neural Representations - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Model Architecture: Fourier Tensor Network with low-rank tensor parameterization for INRs; exploits low-rank structure for efficiency and generalization.
General learned delegation by clones - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Conditional/Dynamic Networks & Efficiency: learned delegation by spawning coordinated clones to allocate compute across branches under a global reward.
Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2026-02-16 - Comment: Matches Model Compression and Efficiency/HPC: structured backprop exploiting LoRA low-rank to cut memory with exact gradients for on-device fine-tuning.
Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-16 - Comment: Compression/Efficiency: attention‑driven self‑compression that progressively reduces vision tokens within the LLM, FlashAttention‑compatible, cutting FLOPs and KV‑cache.
QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching - Score: 17 (R=9, N=8) - Date: 2026-02-16 - Comment: Compression/Efficiency + Low‑rank + Quantization: one‑shot post‑training elastic multi‑bit switching with cascaded low‑rank adapters (MB‑CLoRA) and multi‑bit token merging; supports mixed precision and KV‑cache efficiency.
Improved state mixing in higher-order and block diagonal linear recurrent networks - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Model Architecture — introduces higher-order and block-diagonal linear recurrent units (H-LRU, BD-LRU) with structured state mixing and parallel-scan implementation to boost expressivity at LRNN efficiency.
GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection & Truncation - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Model Compression and Efficiency: structured pruning of Mamba2 state dimension via forward-only controllability/observability (balanced truncation-inspired).
HiFloat4 Format for Language Model Inference - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Model Compression and Efficiency — introduces a hierarchical 4-bit block floating-point format (HiF4) enabling mostly fixed-point matmuls and improved inference efficiency on LLMs.
Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: High Performance Computing/Efficiency: spike-aware optimizer that shapes low-rank spectral components, accelerates LLM training, and cuts optimizer state memory.
SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining - Score: 17 (R=9, N=8) - Date: 2026-02-12 - Comment: Model Compression and Efficiency: hardware-aware FP8 quantization, KV-cache handling, and pipeline/dataflow co-optimization for long-context MLA decoding.
$\mu$pscaling small models: Principled warm starts and hyperparameter transfer - Score: 17 (R=9, N=8) - Date: 2026-02-12 - Comment: High Performance Computing/Scaling: principled μP-based upscaling and hyperparameter transfer (μTransfer) enabling efficient training of widened models.
Rank-Accuracy Trade-off for LoRA: A Gradient-Flow Analysis - Score: 17 (R=9, N=8) - Date: 2026-02-12 - Comment: Model Compression and Efficiency: theoretical gradient-flow analysis of low-rank (LoRA) updates quantifying rank–accuracy trade-offs.
QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Compression/Efficiency: training-free sparse attention via query-oriented KV selection to accelerate prefill while preserving accuracy.
Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: optimizes Transformer KV cache eviction via head-level budget allocation using convex-hull relaxation and a marginal-utility greedy solver.
Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Model Architecture and Efficiency: compressive chunk-wise memory with learned compressor and dynamic gating for selective recall, enabling long-context reasoning with reduced memory and faster inference.
Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: High Performance Computing/Efficiency: shows SGD matches/outperforms AdamW in RL for LLMs with extreme update sparsity, reducing memory; training dynamics insight.
DynamiQ: Accelerating Gradient Synchronization using Compressed Multi-hop All-reduce - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: HPC + Compression: quantization co-designed for multi-hop all-reduce with a fused decompress-accumulate-recompress kernel to accelerate distributed training.
OJBKQ: Objective-Joint Babai-Klein Quantization - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Model Compression and Efficiency: post-training quantization via joint BILS solved with extended Babai/Klein algorithms, improving 3–4 bit LLM PTQ.
Dense Neural Networks are not Universal Approximators - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Matches Architecture Theory/Sparsity: proves limits of dense networks under natural constraints, motivating sparse connectivity for universality.
BitLogic: Training Framework for Gradient-Based FPGA-Native Neural Networks - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Model Architecture/Hardware Efficiency: differentiable LUT-based FPGA-native neural networks with gradient training and RTL export, replacing MACs with LUT nodes for sparse, binary computation.
Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Low-Rank/Training Dynamics: theoretical analysis of SpecGF/Muon in LoRA-style matrix factorization showing uniform spectral growth and global convergence properties.
EUGens: Efficient, Unified, and General Dense Layers - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Model Architecture and Efficiency: introduces EUGens, a new dense layer class using random features to approximate FFLs, reducing inference from quadratic to linear time and enabling backprop-free layer-wise transfer.
Shared LoRA Subspaces for almost Strict Continual Learning - Score: 17 (R=9, N=8) - Date: 2026-02-07 - Comment: Model compression/PEFT: learns a single shared low-rank subspace updated across tasks for near-strict continual learning, delivering large parameter/memory savings—strong PEFT innovation.
Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance - Score: 17 (R=9, N=8) - Date: 2026-02-07 - Comment: Efficiency: variational training for speculative decoding drafts to maximize acceptance probability, improving inference speed without retraining the target model.
CSRv2: Unlocking Ultra-Sparse Embeddings - Score: 17 (R=9, N=8) - Date: 2026-02-07 - Comment: Model Compression and Efficiency: ultra-sparse embeddings with progressive k-annealing and supervised contrastive training; large compute/memory savings.
Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions - Score: 17 (R=9, N=8) - Date: 2026-02-07 - Comment: PEFT/low-rank: replaces explicit LoRA bases with nonlinear RBF-generated bases from latents, boosting effective rank under tight parameter budgets—architectural efficiency in adapters.
When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging - Score: 17 (R=9, N=8) - Date: 2026-02-07 - Comment: Model Efficiency/Representation: training-free spectral calibration for model merging by rescaling inflated singular values to correct shared subspace over-accumulation.
Inverse Depth Scaling From Most Layers Being Similar - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Training Dynamics/Scaling Laws: analyzes depth contributions in residual LLMs, showing inverse loss–depth scaling and layer similarity.
Orthogonal Model Merging - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Matches Model Architecture/Efficiency: orthogonal manifold-based model merging (OrthoMerge) preserving weight geometry; extends to non-OFT via orthogonal-residual decoupling.
Price of universality in vector quantization is at most 0.11 bit - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Theoretical result for universal vector quantization codebooks for weight-only quantization; directly targets model compression/quantization.
Path-Guided Flow Matching for Dataset Distillation - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Matches Compression/Efficiency: dataset distillation via flow matching in VAE latent space with ODE-consistent path guidance for fast deterministic synthesis.
TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Compression/Efficiency: backpropagation-free, attention-aware PTQ with inter-layer error compensation and fast joint channel quantization for LLMs.
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Compression/Efficiency: hybrid-head sparse decoding with HardKuma selects/reuses tokens to accelerate KV attention without quality loss.
Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: PEFT/Training Dynamics: unified view identifying layerwise residual signal, activation energy, and coupling; introduces Layer Card to guide layer placement for LoRA under compute constraints.
The Key to State Reduction in Linear Attention: A Rank-based Perspective - Score: 17 (R=9, N=8) - Date: 2026-02-05 - Comment: Compression/Efficiency: rank-based analysis of linear attention with structured pruning of Q/K states (hardware-aware, CUDA-compatible) for memory/speed gains.
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration - Score: 17 (R=9, N=8) - Date: 2026-02-05 - Comment: Compression/Efficiency: training-free sparse attention patterns and efficient sparse kernels for accelerating VAR generation.
Proxy Compression for Language Modeling - Score: 17 (R=9, N=8) - Date: 2026-02-05 - Comment: Model Compression and Efficiency: proxy compression scheme aligning compressed inputs with raw bytes for compute-efficient LM training.
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: High Performance Computing/Efficiency — ternary speculative decoding with lightweight proxy verification reduces target invocations and speeds LLM inference.
Learnable Permutation for Structured Sparsity on Transformer Models - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model compression and efficiency: structured sparsity/pruning in Transformers via end-to-end learnable weight permutation with a differentiable bipartite matching solver.
Residual Context Diffusion Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model architecture/efficiency: diffusion LLM decoding with residual context recycling to reduce wasted computation and improve accuracy with fewer denoising steps.
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Compression and Efficiency: quantization-aware training + confidence-gated distillation for INT4 VLMs under an Information Bottleneck view.
Layerwise Progressive Freezing Enables STE-Free Training of Deep Binary Neural Networks - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Compression/Efficiency — STE-free training of binary neural networks via progressive freezing (StoMPP), advancing BNN training at scale.
EUGens: Efficient, Unified, and General Dense Layers - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Architecture and Efficiency: new dense layer class approximating FFLs via random features, reducing inference from quadratic to linear; backprop-free layer-wise knowledge transfer.
AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning - Score: 16 (R=9, N=7) - Date: 2026-02-27 - Comment: Model Compression and Efficiency: joint optimization of mixed-precision quantization and per-layer LoRA ranks via evolutionary + Bayesian search for memory-constrained fine-tuning.
Scaling Laws for Precision in High-Dimensional Linear Regression - Score: 16 (R=9, N=7) - Date: 2026-02-24 - Comment: Model Compression and Efficiency: provides theoretical scaling laws for low-precision (quantized) training, linking precision to effective model/data size.
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference - Score: 16 (R=9, N=7) - Date: 2026-02-24 - Comment: Compression/Efficiency: dual-stage token reduction (vision-side compression + text-guided pruning) for VLM training/inference.
Bayesian Lottery Ticket Hypothesis - Score: 16 (R=9, N=7) - Date: 2026-02-24 - Comment: Matches Sparsity/Pruning: extends the Lottery Ticket Hypothesis to Bayesian NNs and analyzes effective pruning criteria for BNNs.
Dual Length Codes for Lossless Compression of BFloat16 - Score: 16 (R=9, N=7) - Date: 2026-02-23 - Comment: High Performance Computing/Compression: lossless coding scheme for BFloat16 tensors to reduce communication bandwidth with fast decoding and simple hardware.
LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights - Score: 16 (R=9, N=7) - Date: 2026-02-20 - Comment: Matches Model Compression and Efficiency: PEFT via cross-layer Tucker decomposition of pre-trained attention weights (low-rank/tensor).
D2-LoRA: A Synergistic Approach to Differential and Directional Low-Rank Adaptation - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Matches 'Compression/Efficiency (Low-rank)': D2-LoRA introduces differential+directional low-rank adaptation with mergeability and improved accuracy.
TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design - Score: 16 (R=9, N=7) - Date: 2026-02-16 - Comment: Compression/Efficiency & HPC: microscaling low-precision compute, LUT-based nonlinear ops, and memory-aware scheduling via SW–HW co-design for LLM inference.
Learning to Evict from Key-Value Cache - Score: 16 (R=9, N=7) - Date: 2026-02-12 - Comment: Model Compression and Efficiency: learns adaptive KV cache eviction policies to reduce memory/computation during LLM inference without modifying model architecture.
Beyond Student: An Asymmetric Network for Neural Network Inheritance - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: asymmetric low-rank decomposition (SVD-initialized) to inherit teacher knowledge and reconstruct lightweight, expressive networks.
Sparse Layer Sharpness-Aware Minimization for Efficient Fine-Tuning - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Model Compression and Efficiency with Sparsity: sparse layer selection for SAM via multi-armed bandits to cut backprop compute during fine-tuning.
CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: risk-adaptive, head-aware KV cache compression using offline bandits and entropy/perplexity–gated thresholds for long-context LLMs.
FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Matches Model Compression and Efficiency: training-free spatiotemporal token merging for video LLMs to reduce visual tokens while preserving accuracy.
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-02-11 - Comment: Model Compression and Efficiency: layer-wise pruning via cooperative-game Shapley at scale using surrogate performance predictors and stratified sampling.
ODELoRA: Training Low-Rank Adaptation by Solving Ordinary Differential Equations - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: Model Compression and Efficiency: LoRA training via ODE-based dynamics emulating full fine-tuning with convergence guarantees.
On the Importance of a Multi-Scale Calibration for Quantization - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: Model Compression and Efficiency: PTQ for LLMs with multi-scale sequence-length-aware Hessian calibration (Matryoshka Calibration).
Efficient Post-Training Pruning of Large Language Models with Statistical Correction - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: Model Compression and Efficiency: post-training pruning for LLMs with channel-wise statistical calibration and analytic energy compensation (no retraining).
SpecAttn: Co-Designing Sparse Attention with Self-Speculative Decoding - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: Model Compression/Efficiency: verification-guided sparse attention co-designed with self-speculative decoding to reduce KV-cache usage and accelerate long-context inference.
PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference - Score: 16 (R=9, N=7) - Date: 2026-02-09 - Comment: High-Performance Computing: kernel-level attention packing and KV-cache reorganization for heterogeneous batched LLM inference (compute- and I/O-aware execution).
SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration - Score: 16 (R=9, N=7) - Date: 2026-02-07 - Comment: Inference efficiency: training-free speculative decoding by constructing a compatible draft via FIT-based layer pruning—plug-and-play acceleration without extra training.
FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion - Score: 16 (R=9, N=7) - Date: 2026-02-07 - Comment: Model Efficiency: proposes attention output caching that reduces KV-cache access in block diffusion, complementary to sparse attention for long-context generation.
Multi-Token Prediction via Self-Distillation - Score: 16 (R=9, N=7) - Date: 2026-02-06 - Comment: Matches Compression/Efficiency: converts AR LMs to multi-token prediction via online self-distillation for standalone faster decoding.
Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog - Score: 16 (R=9, N=7) - Date: 2026-02-06 - Comment: Model Compression and Efficiency: iterative prune–tune loop enabling substantial LLM pruning while preserving reasoning performance.
Topology-Aware Revival for Efficient Sparse Training - Score: 16 (R=9, N=7) - Date: 2026-02-06 - Comment: Matches Compression/Efficiency (Sparsity): topology-aware one-shot revival improving static sparse training without dynamic rewiring.
Greedy-Gnorm: A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning - Score: 16 (R=9, N=7) - Date: 2026-02-05 - Comment: Model Compression — dynamic attention-head pruning via gradient-matrix norm scoring with iterative re-evaluation, improving over entropy-based pruning.
FlexLoRA: Entropy-Guided Flexible Low-Rank Adaptation - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Model Compression/Efficiency: flexible low-rank adaptation with entropy-guided rank importance, global rank pruning/expansion, and zero-impact initialization for PEFT.
DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Model Compression and Efficiency: adaptive, training-free inference-time pruning of FFN neurons using attention-guided dynamic masks (sparsity/pruning).
Is Hierarchical Quantization Essential for Optimal Reconstruction? - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Model Compression and Efficiency: shows capacity-matched single-level VQ-VAE can match hierarchical reconstruction when mitigating codebook collapse (quantization/autoencoders).
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing - Score: 16 (R=8, N=8) - Date: 2026-02-18 - Comment: High Performance Computing/Optimization: second-order constrained LLM editing using K-FAC and matrix-free low-curvature projections to preserve capabilities.
Steady-State Behavior of Constant-Stepsize Stochastic Approximation: Gaussian Approximation and Tail Bounds - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: Matches 'Training Dynamics': non-asymptotic Gaussian approximation and tail bounds for constant-stepsize SA/SGD steady states (i.i.d. and Markovian noise).
Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: Training Efficiency / Optimization Theory: worst-case analysis of stochastically preconditioned SGD under heavy-tailed noise showing normalization superiority over clipping.
MonarchRT: Efficient Attention for Real-Time Video Generation - Score: 16 (R=8, N=8) - Date: 2026-02-13 - Comment: Model Efficiency: structured attention via Monarch matrices with custom kernels, achieving large speedups and high sparsity while preserving video diffusion quality.
Efficient Analysis of the Distilled Neural Tangent Kernel - Score: 16 (R=8, N=8) - Date: 2026-02-13 - Comment: Efficiency for NTK computation: NTK-tuned dataset distillation (DNTK) drastically reduces Jacobian evaluations while preserving kernel structure/performance.
dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning - Score: 16 (R=8, N=8) - Date: 2026-02-12 - Comment: Model Architecture/Efficiency: tokenizer-free autoregressive model with differentiable dynamic chunking (conditional compression) enabling speedups and hierarchical representations.
Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning - Score: 16 (R=8, N=8) - Date: 2026-02-12 - Comment: Matches Efficiency/Decoding algorithms with a GPU-friendly Sequential Monte Carlo scheme for sequence-level power sampling at near-standard latency.
Model soups need only one ingredient - Score: 16 (R=8, N=8) - Date: 2026-02-11 - Comment: Model Compression/Efficiency: single-checkpoint weight-space ensembling via SVD and entropy-ranked reweighting.
Riemannian MeanFlow - Score: 16 (R=8, N=8) - Date: 2026-02-10 - Comment: Matches Model Architecture/Efficiency: learns manifold flow maps (few/one-step generations) for generative modeling on Riemannian manifolds, reducing compute.
Physical Analog Kolmogorov-Arnold Networks based on Reconfigurable Nonlinear-Processing Units - Score: 16 (R=8, N=8) - Date: 2026-02-10 - Comment: Matches Model Architecture/Efficiency/Hardware: analog KAN with reconfigurable nonlinear-processing units enabling energy/latency-efficient inference.
Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization - Score: 16 (R=8, N=8) - Date: 2026-02-09 - Comment: Optimization/Efficiency: decouples variance adaptation and scale-invariant terms (DeVA), bridging Adam-like methods with matrix spectral optimizers for faster large-scale training.
Shiva-DiT: Residual-Based Differentiable Top-$k$ Selection for Efficient Diffusion Transformers - Score: 16 (R=8, N=8) - Date: 2026-02-07 - Comment: Efficiency/pruning: residual-based differentiable top-k token selection with static-budget compliance and adaptive routing for DiTs—algorithmic efficiency under strict hardware constraints.
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching - Score: 16 (R=8, N=8) - Date: 2026-02-07 - Comment: Model Efficiency: learnable feature caching compatible with step distillation for video DiTs, plus a conservative Restricted MeanFlow for stable high compression.
Temporal Pair Consistency for Variance-Reduced Flow Matching - Score: 16 (R=8, N=8) - Date: 2026-02-07 - Comment: Generative Modeling/Training Dynamics: estimator-level variance reduction for flow matching by coupling timestep predictions (Temporal Pair Consistency).
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps - Score: 16 (R=8, N=8) - Date: 2026-02-06 - Comment: Proposes a new stochastic flow map architecture enabling efficient reward alignment (search/SMC/guidance) at inference; architecture designed for adaptability/efficiency.
Logarithmic-time Schedules for Scaling Language Models with Momentum - Score: 16 (R=8, N=8) - Date: 2026-02-06 - Comment: Matches HPC/Efficiency: time-varying optimizer schedules (log-time β1/β2/weight decay) and an AdamW-like variant to improve large-scale LM training efficiency.
Learning to Reason in 13 Parameters - Score: 16 (R=8, N=8) - Date: 2026-02-05 - Comment: Compression/Efficiency: extreme low-rank adapters (TinyLoRA) scaling to near-parameterless updates, demonstrating minimal-parameter reasoning improvements.
Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators - Score: 15 (R=8, N=7) - Date: 2026-02-28 - Comment: Compression/Efficiency + HPC Hardware: runtime-reconfigurable multi-precision quantized multiplication via bitwise systolic array supporting mixed-precision QNNs.
LUMOS: Democratizing SciML Workflows with L0-Regularized Learning for Unified Feature and Parameter Adaptation - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Model Compression and Efficiency: L0-regularized unified feature selection and pruning with semi-stochastic gating for sparsity and speedup.
GRAU: Generic Reconfigurable Activation Unit Design for Neural Network Hardware Accelerators - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Matches Efficiency/Hardware: reconfigurable piecewise-linear activation with power-of-two slopes enabling mixed-precision and >90% LUT reduction.
Orthogonal Weight Modification Enhances Learning Scalability and Convergence Efficiency without Gradient Backpropagation - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Model Compression/Efficiency — low-rank perturbation-based updates with orthogonality to reduce gradient estimate variance; High Performance Computing — O(1) parallel-time weight updates enabling deep non-BP training.
Dirichlet Scale Mixture Priors for Bayesian Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Matches Sparsity/Compression and Representation: Dirichlet scale mixture priors impose structured shrinkage in BNNs, enabling sparsity and pruning with robustness benefits.
Relational Feature Caching for Accelerating Diffusion Transformers - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Matches Model Compression/Efficiency: relational feature caching and error-aware cache scheduling accelerate Diffusion Transformers by reducing redundant compute.
Information-Guided Noise Allocation for Efficient Diffusion Training - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Model Efficiency: information-guided, data-adaptive noise scheduling for diffusion training that reallocates compute to informative noise regions.
Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Compression/Efficiency: low-rank compression plus gradient/logit alignment to build influence-preserving proxies for scalable LLM data selection.
Revisiting Weight Regularization for Low-Rank Continual Learning - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: Model Compression and Efficiency: low-rank adapters with weight regularization (EWC) for parameter-efficient continual learning.
Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: Matches Model Compression/Efficiency: dataless task-vector disentanglement using K-FAC to control representation drift in modular adaptation.
Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: High Performance Computing/Optimization: introduces NAMO/NAMO-D integrating orthogonalized momentum with Adam-style noise adaptation, with convergence guarantees for large-model training.
Training Large Reasoning Models Efficiently via Progressive Thought Encoding - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: Matches Model Compression and Efficiency: parameter-efficient fine-tuning enabling fixed-size caches and reduced memory for RL training/inference.
Efficient Remote Prefix Fetching with GPU-native Media ASICs - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: High Performance Computing/Efficiency: systems-level KV cache reuse and compression for faster LLM inference.
Neighborhood Stability as a Measure of Nearest Neighbor Searchability - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: HPC/Efficiency for ANN: introduces neighborhood stability measures (clustering-NSM, point-NSM) to predict searchability and ANNS accuracy from nearest-neighbor structure.
Subtractive Modulative Network with Learnable Periodic Activations - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Model Architecture: introduces a parameter-efficient INR (SMN) with learnable periodic activations and modulatory filters inspired by subtractive synthesis.
HAWX: A Hardware-Aware FrameWork for Fast and Scalable ApproXimation of DNNs - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Matches Compression/Efficiency (hardware-aware): multi-level sensitivity scoring and predictive models to integrate approximate computing blocks for scalable DNN approximation.
Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Matches Compression/Efficiency: query-conditioned selector-based soft context compression for RAG that reduces computation while preserving performance.
Spanning the Visual Analogy Space with a Weight Basis of LoRAs - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Low-rank architecture innovation: learnable basis of LoRA modules with dynamic composition for conditional specialization (aligns with low-rank/architecture efficiency).
The Equalizer: Introducing Shape-Gain Decomposition in Neural Audio Codecs - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Compression/efficiency: explicit shape–gain decomposition in neural audio codecs to reduce bitrate and complexity.
FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: HPC/Efficiency: GPU memory hierarchy optimizations and streaming schedules (2.5D textures) to run large/multi-DNN workloads on mobile with strong speed/memory gains.
Fast and Effective On-policy Distillation from Reasoning Prefixes - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Efficiency: modifies on-policy distillation to supervise only prefixes, reducing training FLOPs 2x–47x while matching full OPD performance.
Scaling Beyond Masked Diffusion Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Matches 'Model Architecture and Scaling': scaling study of discrete diffusion LMs with FLOPs-efficient training objective and speed–quality Pareto analysis.
LRD-MPC: Efficient MPC Inference through Low-rank Decomposition - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: low-rank decomposition for linear layers under MPC with truncation skipping and pipelined concatenation to reduce communication/compute.
OneLatent: Single-Token Compression for Visual Latent Reasoning - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: compresses chain-of-thought into a single latent token using visualized CoT and hidden-state supervision.
HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: 1-bit post-training quantization for VLA models using Hessian-guided salience and sparse orthogonal transforms.
Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: High Performance Computing / Training Efficiency: zero-order fine-tuning with learnable direction sampling policy that reduces variance and dimensionality dependence, with theory.
Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: addresses LoRA rank heterogeneity with rank-partitioned aggregation to prevent rank collapse in federated adaptation.
AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Matches 'Efficiency/Cache': adaptive cache correction for Diffusion Transformers enabling activation reuse without retraining.
IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Compression and Efficiency: visual token pruning for MLLMs via MMR balancing importance/diversity; attention-map free and FlashAttention compatible.
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-16 - Comment: Efficiency: leverages video codec motion vectors/residuals with lightweight transformer encoders to cut tokens and compute for VideoLMs.
Quantization-Robust LLM Unlearning via Low-Rank Adaptation - Score: 15 (R=8, N=7) - Date: 2026-02-16 - Comment: Compression/Efficiency + Low‑rank: uses LoRA to concentrate unlearning updates that survive 4‑bit PTQ, improving quantization‑robust unlearning.
FlashSchNet: Fast and Accurate Coarse-Grained Neural Network Molecular Dynamics - Score: 15 (R=8, N=7) - Date: 2026-02-16 - Comment: Compression/Efficiency + HPC: IO-aware fused kernels (flash radial basis/message passing/aggregation) and channel‑wise 16‑bit quantization to cut HBM traffic and atomics for GNNs.
Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation - Score: 15 (R=8, N=7) - Date: 2026-02-16 - Comment: Low-rank/Compression: progressive LoRA rank scheduling and layer-unfreezing curriculum for preference optimization in text-to-image models.
The Appeal and Reality of Recycling LoRAs with Adaptive Merging - Score: 15 (R=8, N=7) - Date: 2026-02-16 - Comment: Matches Model Compression and Efficiency: empirical and methodological study of adaptive merging of LoRAs (low-rank adapters) including a new merging approach.
Manifold-Aware Temporal Domain Generalization for Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Model Compression and Efficiency — Manifold-aware Temporal LoRA constrains temporal updates to a shared low-dimensional manifold within a low-rank adaptation subspace; also offers insights into temporal representation dynamics.
Beyond Parameter Arithmetic: Sparse Complementary Fusion for Distribution-Aware Model Merging - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Model Compression and Efficiency — proposes sparse, distribution-aware weight-space merging via reverse KL to control interference and fuse capabilities without retraining.
ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Conditional/Dynamic Networks: confidence-aware routing between discrete CoT and latent reasoning to improve efficiency and accuracy.
Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Compression/Efficiency: stochastic quantization and embedding projection for private, communication-efficient split inference.
RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: High Performance Computing — Roofline-based benchmarking and a new Relative Inference Potential metric to analyze and compare on-device LLM efficiency under hardware constraints.
Flow caching for autoregressive video generation - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Model Compression and Efficiency: introduces chunk-wise flow caching and KV-cache compression for autoregressive video generation to reduce runtime/memory.
From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Matches Compression/Efficiency (dynamic channel pruning) and Representation Learning by reusing pruning masks to estimate auxiliary signal properties.
When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Model Architecture/Efficiency: conditional/dynamic recurrent memory with learnable update/exit gates for long-context reasoning, enabling early exit and reduced compute.
1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Matches Model Compression/Efficiency via a novel low-rank complex adapter with only ~1% parameters and theory-guided optimization addressing low-rank convergence issues.
Compute Only Once: UG-Separation for Efficient Large Recommendation Models - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Matches Model Compression and Efficiency: introduces a caching-like reusable computation mechanism via user-group separation in dense interaction models, plus weight-only quantization (W8A16) to cut memory bandwidth.
Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Matches High-Performance Computing/Efficiency via hardware–software co-design scaling laws coupling training loss scaling with roofline latency modeling for on-device LLMs.
Reverse-Engineering Model Editing on Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Model Editing/Compression: exploits low-rank structure of edit updates and proposes subspace camouflage; ties to representation and low-rank updates.
LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Matches Model Compression and Efficiency: data-free, layer-wise adaptive rescaling veneer for model merging that respects layer heterogeneity in ViTs.
Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Optimization/Training Efficiency: clarifies Shampoo vs. Muon by decomposing updates, linking to spectral descent, and demonstrating higher token efficiency.
Sparsity-Aware Evolution for Model Merging - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Compression/Efficiency: sparsity-aware evolutionary model merging via iterative pruning–merging and sparsity-weighted selection.
FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Optimization/Training: Frobenius-Isometry Reinitialization balancing stability-plasticity via constrained objective.
Collaborative and Efficient Fine-tuning: Leveraging Task Similarity - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Matches Model Compression and Efficiency: collaborative low-rank adaptation (LoRA) with shared/personalized adapters; includes theory for heterogeneous linear regression.
M-Loss: Quantifying Model Merging Compatibility with Limited Unlabeled Data - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Matches Model Compression/Efficiency: metric (M-Loss) to assess and guide model merging with limited unlabeled data; links to pruning/parameter significance.
Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: High Performance Computing/Efficiency: compiler-assisted speculative sampling with heterogeneous partitioning for edge LLM inference.
LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Model Compression and Efficiency: modality-aware quantization plus gradient-free test-time adaptation for VLMs on edge devices.
Steer2Adapt: Dynamically Composing Steering Vectors Elicits Efficient Adaptation of LLMs - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Matches Conditional/Dynamic Networks: composes steering vectors from a reusable low-dim prior for efficient inference-time LLM adaptation.
Robustness Beyond Known Groups with Low-rank Adaptation - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Low-rank Adaptation: restricts adaptation to a low-dimensional error subspace via low-rank logit adjustments to improve worst-group performance without backbone changes.
SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Model Architecture/Efficiency: modular separation of perception and reasoning in VLMs enabling test-time scaling and asymmetric compute allocation.
Internalizing LLM Reasoning via Discovery and Replay of Latent Actions - Score: 15 (R=8, N=7) - Date: 2026-02-07 - Comment: Representation Learning/Efficiency: dynamic inference-time activation steering (latent trajectory control) with a sparse control basis to internalize reasoning and cut token compute.
How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Compression/Efficiency: studies sparsity-inducing activations and EoC initialization variance to enable high hidden-layer sparsity and improve training stability in DNNs/CNNs.
Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Matches Compression/Efficiency: Bayesian stopping policy for multi-sample consistency to reduce LLM inference calls while preserving accuracy.
Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Matches HPC/Efficiency: Late-to-Early Training transfers late-layer knowledge to early steps/layers to accelerate LLM pretraining.
Mano: Restriking Manifold Optimization for LLM Training - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Optimization for large-scale training: manifold-aware optimizer (momentum on oblique manifold) improving LLM training efficiency compared to AdamW/Muon.
Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Model compression/efficiency: LoRA reparameterized as a sparse Rank-1 expert pool with orthogonalization to mitigate forgetting in continual learning.

High Performance Computing (42)

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents - Score: 20.0 (R=0, N=0) - Date: 2026-02-11 - Comment: Author match
SeedFlood: A Step Toward Scalable Decentralized Training of LLMs - Score: 19 (R=10, N=9) - Date: 2026-02-23 - Comment: HPC/Distributed Training: seed-reconstructible zeroth-order updates enable near-zero-size messages and model-size-independent communication.
From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency - Score: 19 (R=10, N=9) - Date: 2026-02-11 - Comment: High Performance Computing: two-sided low-rank communication for Adam-family optimizers reducing per-step payload from O(mn) to O(r^2) with randomized refresh; strong distributed training efficiency gains.
Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation - Score: 19 (R=10, N=9) - Date: 2026-02-02 - Comment: Quantization for efficient large-scale training: fully NVFP4 training with an unbiased micro-scaled quantizer (MS-EDEN) improving gradient estimation; systems-level kernels on Blackwell GPUs.
SOCKET: SOft Collision Kernel EsTimator for Sparse Attention - Score: 18 (R=10, N=8) - Date: 2026-02-09 - Comment: Compression/Efficiency: sparse attention via soft LSH scoring kernel for top-k token selection; systems-level acceleration with custom CUDA/Triton yielding up to 1.5× throughput over FlashAttention.
LoRDO: Distributed Low-Rank Optimization with Infrequent Communication - Score: 18 (R=10, N=8) - Date: 2026-02-06 - Comment: High-Performance/Distributed Training: unifies low-rank optimization with infrequent communication, restoring subspace exploration and cutting communication in DDP for foundation models.
Unlocked Backpropagation using Wave Scattering - Score: 18 (R=9, N=9) - Date: 2026-02-12 - Comment: High Performance Computing/Optimization: reformulates backprop as a hyperbolic initial value (wave scattering) to unlock forward–backward dependency, yielding fully unlocked training algorithms.
ECHO: Encoding Communities via High-order Operators - Score: 17 (R=9, N=8) - Date: 2026-02-28 - Comment: Matches Model Architecture and High-Performance Computing/Efficiency: introduces a conditional Topology-Aware Router (dynamic routing) and memory-sharded full-batch contrastive training with chunked O(N·K) similarity to bypass O(N^2) memory, a systems-level algorithmic optimization for scalable GNNs.
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? - Score: 17 (R=9, N=8) - Date: 2026-02-27 - Comment: Model Architecture/Training Dynamics: data-centric supervision (NAP) to enable truly parallel non-autoregressive decoding in DLMs.
RPU -- A Reasoning Processing Unit - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Matches High Performance Computing: chiplet-based, bandwidth-first architecture with decoupled pipelines to overcome memory-wall bottlenecks in LLM inference.
JPmHC Dynamical Isometry via Orthogonal Hyper-Connections - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Model Architecture/Stability: orthogonality-constrained hyper-connections preserving Jacobian spectrum; manifold-constrained mixers with memory-efficient implicit differentiation.
UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Architecture/Efficiency/HPC: linear-scaling equivariant transformer (E2Former-V2) with sparsification and long–short range modeling for higher throughput.
FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving - Score: 17 (R=9, N=8) - Date: 2026-02-19 - Comment: Matches High Performance Computing: operator-level preemption and event-driven scheduling for LLM serving to mitigate HoL blocking and optimize TTFT-goodput.
KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: HPC/Systems: memory-augmented in-context RL for cross-task CUDA kernel optimization with a persistent knowledge base for improved GPU performance.
PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: HPC/Systems: shared prefill module and KV-cache reuse across multiple LLMs in disaggregated serving, with routing; 4.5x lower p95 latency and 3.9x higher throughput.
PRISM: Parallel Residual Iterative Sequence Model - Score: 17 (R=9, N=8) - Date: 2026-02-12 - Comment: Model Architecture: introduces a parallel residual iterative sequence model resolving expressivity–efficiency tension; HPC: parallelizable multi-step refinement without serial dependencies.
V-ABFT: Variance-Based Adaptive Threshold for Fault-Tolerant Matrix Multiplication in Mixed-Precision Deep Learning - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: High Performance Computing: variance-based adaptive thresholds for ABFT in mixed-precision GEMM, enabling finer SDC detection in DL training/inference.
Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Training dynamics: optimal learning-rate schedules under functional scaling laws applicable to LLM pretraining; theory-driven.
High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Training Dynamics Theory: DMFT-based high-dimensional limit for stochastic gradient flow covering GLMs and two-layer nets; unifies prior SGD dynamics frameworks.
AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: High Performance Computing: asynchronous data and pipeline parallelism with sparse averaging and convergence guarantees to reduce communication.
Asynchronous Heavy-Tailed Optimization - Score: 16 (R=9, N=7) - Date: 2026-02-23 - Comment: High-Performance/Distributed Training: asynchronous optimization under heavy-tailed gradient noise with delay-aware scheduling and compensation, with convergence guarantees.
Predicting LLM Output Length via Entropy-Guided Representations - Score: 16 (R=9, N=7) - Date: 2026-02-13 - Comment: Matches Efficiency/HPC: reuses model hidden states for entropy-guided and progressive length prediction to improve batched inference throughput.
tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models - Score: 16 (R=9, N=7) - Date: 2026-02-10 - Comment: High-Performance Computing + Efficiency: systems-level co-training of multiple LoRA adapters via an elastic shared super-model with fused low-rank kernels and adaptive scheduling.
Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers - Score: 16 (R=9, N=7) - Date: 2026-02-09 - Comment: High Performance Computing: distributed training innovation for matrix-based optimizers with asynchronous scheduling and load-balanced partitioning.
Training-Free Generative Modeling via Kernelized Stochastic Interpolants - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Model Architecture/Efficiency — training-free generative modeling via kernelized stochastic interpolants, replacing neural training with linear systems and specialized integrators.
K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: High Performance Computing: co-evolving world model guides LLM-based search for GPU kernel optimization, yielding large speedups (incl. MoE kernels).
The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety - Score: 16 (R=8, N=8) - Date: 2026-02-18 - Comment: Training Dynamics: geometric analysis reveals curvature-driven alignment collapse under fine-tuning, with an instability condition and quartic scaling law.
KoopGen: Koopman Generator Networks for Representing and Predicting Dynamical Systems with Continuous Spectra - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: Model Architecture: neural Koopman generator with operator-theoretic constraints (skew-/self-adjoint decomposition) for representing continuous-spectrum dynamics.
MergePipe: A Budget-Aware Parameter Management System for Scalable LLM Merging - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: High Performance Computing / Systems: catalog-driven, budget-aware parameter management and streaming execution for scalable LLM merging with drastic I/O reductions.
Training deep physical neural networks with local physical information bottleneck - Score: 16 (R=8, N=8) - Date: 2026-02-11 - Comment: HPC/Training Methods: local physical information bottleneck enabling scalable training of physical neural networks on analog substrates.
LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure - Score: 15 (R=8, N=7) - Date: 2026-02-28 - Comment: High Performance Computing: unified runtime-driven simulator modeling heterogeneous/disaggregated LLM serving (batching, routing, offloading, memory, power).
GetBatch: Distributed Multi-Object Retrieval for ML Data Loading - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: HPC/Systems: batch multi-object retrieval API for ML data loading, reducing latency and tail effects during training.
Learning Long-Range Dependencies with Temporal Predictive Coding - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Training/Efficiency: combines Temporal Predictive Coding with approximate RTRL for local, parallelizable spatio-temporal credit assignment as an alternative to BPTT.
Distributed physics-informed neural networks via domain decomposition for fast flow reconstruction - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Matches High Performance Computing: distributed PINNs via domain decomposition with CUDA graphs/JIT and global pressure consistency enforcement.
Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misalignment - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: HPC/Distributed Training: introduces lightweight metrics to diagnose worker-level optimization misalignment in synchronous data-parallel fine-tuning.
HyFunc: Accelerating LLM-based Function Calls for Agentic AI through Hybrid-Model Cascade and Dynamic Templating - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Model Efficiency/Systems: hybrid-model cascade and dynamic templating inside vLLM to reduce function-calling latency and redundant large-model generation.
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: High Performance Computing/Efficiency: unified training-serving system for speculative decoding with online adaptation and asynchronous RL updates.
TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Matches High Performance Computing: serving-engine-native speculative decoding with online draft adaptation and heterogeneous GPU mapping for inference efficiency.
Finding Structure in Continual Learning - Score: 15 (R=8, N=7) - Date: 2026-02-05 - Comment: Optimization/Training Dynamics: reframes continual learning via Douglas–Rachford splitting to decouple plasticity from stability without replay/regularization.
FOCUS: DLLMs Know How to Tame Their Compute Bound - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: High Performance Computing: DLLM inference system that dynamically focuses compute on decodable tokens to boost throughput.
HetCCL: Accelerating LLM Training with Heterogeneous GPUs - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: High Performance Computing: cross-vendor collective communication enabling distributed training over heterogeneous GPUs without driver changes.
Towards Resiliency in Large Language Model Serving with KevlarFlow - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: High Performance Computing: systems-level innovations for resilient LLM serving (decoupled model-parallel init, dynamic rerouting, background KV-cache replication).

Representation Learning (177)

Radial-VCReg: More Informative Representation Learning Through Radial Gaussianization - Score: 20.0 (R=0, N=0) - Date: 2026-02-17 - Comment: Author match
Causal-JEPA: Learning World Models through Object-Level Latent Interventions - Score: 20.0 (R=0, N=0) - Date: 2026-02-13 - Comment: Author match
Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures - Score: 20.0 (R=0, N=0) - Date: 2026-02-11 - Comment: Author match
Synthesizable Molecular Generation via Soft-constrained GFlowNets with Rich Chemical Priors - Score: 20.0 (R=0, N=0) - Date: 2026-02-05 - Comment: Author match
A unified theory of feature learning in RNNs and DNNs - Score: 19 (R=10, N=9) - Date: 2026-02-18 - Comment: Representation learning/training dynamics: unified mean-field theory linking RNNs and DNNs via representational kernels in the μP regime.
CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs - Score: 18 (R=10, N=8) - Date: 2026-02-07 - Comment: Model Architecture/Efficiency: modifies RoPE with soft low-frequency clipping (CoPE) to improve long-context length generalization and mitigate OOD artifacts.
Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs - Score: 18 (R=9, N=9) - Date: 2026-02-28 - Comment: Matches Representation Learning/Theory: frames modality collapse as mismatched decoding with GMI bounds; shows encoder retains attributes but decoder scoring rule limits accessible information; objective-level fix validated.
On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking - Score: 18 (R=9, N=9) - Date: 2026-02-20 - Comment: Representation Learning/Training Dynamics: mechanistic theory of modular addition, Fourier features, lottery-ticket selection, and grokking stages.
Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment - Score: 18 (R=9, N=9) - Date: 2026-02-16 - Comment: Matches Representation Learning: theoretical analysis of deep Jacobian spectra (scaling, separation, and singular-vector alignment) explaining implicit bias and low-rank behavior.
Deep Learning of Compositional Targets with Hierarchical Spectral Methods - Score: 18 (R=9, N=9) - Date: 2026-02-12 - Comment: Strong Representation Learning/training-dynamics match: theoretical sample-complexity separation showing depth advantage via hierarchical spectral estimators.
A Random Matrix Theory of Masked Self-Supervised Regression - Score: 18 (R=9, N=9) - Date: 2026-02-02 - Comment: Representation Learning: high-dimensional random matrix theory for masked self-supervised regression with BBP-type phase transition and explicit generalization error.
Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks - Score: 17 (R=9, N=8) - Date: 2026-02-27 - Comment: Representation learning/theory: large-deviation rate functions for wide Bayesian NNs capturing feature learning beyond fixed-kernel NNGP.
A 1/R Law for Kurtosis Contrast in Balanced Mixtures - Score: 17 (R=9, N=8) - Date: 2026-02-27 - Comment: Matches Representation Learning: theoretical analysis of kurtosis-based ICA with a 1/R redundancy law and purification restoring contrast.
Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Representation/approximation theory: shows smooth activations enable depth-constant, minimax-optimal rates (smoothness adaptivity).
Manifold-Aligned Generative Transport - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Model Architecture/Representation Learning: proposes a one-shot manifold-aligned generative transport with theoretical Wasserstein bounds.
Regularity of Second-Order Elliptic PDEs in Spectral Barron Spaces - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Theory/Representation: proves Barron-space regularity gains for elliptic PDEs and dimension-independent two-layer cosine-network approximation.
Attention Deficits in Language Models: Causal Explanations for Procedural Hallucinations - Score: 17 (R=9, N=8) - Date: 2026-02-24 - Comment: Matches Representation Learning/Training Dynamics: causal analysis of LLM readout failures (gating vs. binding), with probes and mutual-information diagnostics.
Topological Exploration of High-Dimensional Empirical Risk Landscapes: general approach, and applications to phase retrieval - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Training dynamics/representation theory: Kac–Rice analysis of high-dimensional loss landscapes and Hessian spectra (foundational landscape insights).
Bayesian Optimality of In-Context Learning with Selective State Spaces - Score: 17 (R=9, N=8) - Date: 2026-02-23 - Comment: Model Architecture and Representation Learning: theoretical framing of ICL as Bayes-optimal inference with selective SSMs, separating from ERM/implicit GD and demonstrating statistical efficiency.
Canonicalizing Multimodal Contrastive Representation Learning - Score: 17 (R=9, N=8) - Date: 2026-02-20 - Comment: Matches Representation Learning: discovers orthogonal canonical maps aligning independent multimodal contrastive encoders; theory and practice.
Learning with Boolean threshold functions - Score: 17 (R=9, N=8) - Date: 2026-02-20 - Comment: Model Architecture/Representation: trains Boolean-threshold networks via projection-based constraint satisfaction, yielding sparse ±1-weight logical circuits.
VP-VAE: Rethinking Vector Quantization via Adaptive Vector Perturbation - Score: 17 (R=9, N=8) - Date: 2026-02-20 - Comment: Model Architecture/Representation Learning: replaces codebook in VQ-VAE with adaptive latent perturbations (VP-VAE) and introduces FSP for fixed quantizers, stabilizing training and improving token usage.
Early-Warning Signals of Grokking via Loss-Landscape Geometry - Score: 17 (R=9, N=8) - Date: 2026-02-20 - Comment: Representation Learning/Training dynamics: proposes commutator-defect curvature as a causal early-warning signal for grokking in transformers.
Optimizer choice matters for the emergence of Neural Collapse - Score: 17 (R=9, N=8) - Date: 2026-02-19 - Comment: Representation Learning/Training Dynamics: provides theoretical and empirical analysis of optimizer-dependent Neural Collapse and the role of weight-decay coupling and momentum.
How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning - Score: 17 (R=9, N=8) - Date: 2026-02-18 - Comment: Representation learning analysis: layer-wise PID quantifies vision, language, and synergy flows in multimodal Transformers; training dynamics insights.
Logit Distance Bounds Representational Similarity - Score: 17 (R=9, N=8) - Date: 2026-02-18 - Comment: Representation learning theory: establishes a logit-distance that bounds linear representational dissimilarity; implications for distillation beyond KL.
Symmetry in language statistics shapes the geometry of model representations - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Representation Learning Theory: links translation symmetries in language statistics to emergent geometric structures in embeddings across models.
BitDance: Scaling Autoregressive Generative Models with Binary Tokens - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Model Architecture and Efficiency: binary-token latent representation with diffusion head and parallel next-patch decoding for fast, scalable AR generation.
A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Representation Learning/Training Dynamics Theory: extends NTK to non-random (pretrained) inits and analyzes early-stopping convergence for LLM fine-tuning.
Text Has Curvature - Score: 17 (R=9, N=8) - Date: 2026-02-17 - Comment: Representation Learning/Geometry: introduces a text-native curvature signal (Texture) and uses it for compression/routing without geometric training.
From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Representation Learning: introduces hierarchical sparse autoencoders to discover multi-scale, monosemantic feature hierarchies in LLMs.
Sparse Semantic Dimension as a Generalization Certificate for LLMs - Score: 17 (R=9, N=8) - Date: 2026-02-13 - Comment: Representation Learning: proposes Sparse Semantic Dimension using sparse autoencoder feature vocabularies to certify LLM generalization and reveals scaling laws of learned features.
Theoretical Analysis of Contrastive Learning under Imbalanced Data: From Training Dynamics to a Pruning Solution - Score: 17 (R=9, N=8) - Date: 2026-02-12 - Comment: Representation Learning theory under imbalance with analysis of training dynamics plus a pruning-based remedy linking sparsity to improved features.
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Matches Representation Learning/Interpretability and Architecture: proves invariant subspace necessity for transformer linear interfaces, explaining success of linear probes and sparse autoencoders.
Mutual Information Collapse Explains Disentanglement Failure in $\beta$-VAEs - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Matches Representation Learning: analyzes MI collapse in β-VAEs and proposes λβ-VAE to decouple KL pressure from latent informativeness, stabilizing disentanglement.
Linearization Explains Fine-Tuning in Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Representation Learning/Training dynamics: linearization/NTK lens explains PEFT, with spectral insights and layer selection bounds.
The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Representation Learning: identifies a low-dimensional correctness manifold in LMs with linear separability and causal activation steering; output uncertainty fails to capture it.
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-02-11 - Comment: Representation Learning and Efficient Scaling: formalizes modality gap geometry and introduces ReAlign (training-free) and ReVision (scalable pretraining paradigm) to align modalities using unpaired data.
Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Representation Learning/Training Dynamics: fast–slow analysis of SGD in infinite-width 2-layer nets explaining feature unlearning with scaling laws.
LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery - Score: 17 (R=9, N=8) - Date: 2026-02-10 - Comment: Representation Learning/Architecture: unified vision-language sparse autoencoder with shared dictionary via learned OT alignment for interpretable concepts.
Learning a Generative Meta-Model of LLM Activations - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Representation Learning/Interpretability: trains diffusion meta-models on LLM activations to learn a prior over internal states, improving intervention fidelity and sparsity of concepts.
Disentanglement by means of action-induced representations - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Representation Learning: introduces action-induced representations with provable disentanglement and a variational AIR architecture (VAIR).
Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings - Score: 17 (R=9, N=8) - Date: 2026-02-09 - Comment: Representation Learning: aligned sparse autoencoder with an iso-energy inductive bias to analyze and disentangle VLM embedding geometry (bimodal vs. unimodal atoms).
Optimal scaling laws in learning hierarchical multi-index models - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Representation Learning/Training Dynamics: sharp scaling laws and phase-transition analysis for two-layer nets with a spectral estimator achieving optimal rates.
Fluid Representations in Reasoning Models - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Matches Representation Learning: mechanistic analysis of internal representations and in-context refinement in reasoning LMs (QwQ-32B).
Decomposing Query-Key Feature Interactions Using Contrastive Covariances - Score: 17 (R=9, N=8) - Date: 2026-02-05 - Comment: Representation learning/interpretability: low-rank decomposition of the query–key space via contrastive covariances to attribute attention mechanisms.
Rethinking Weight Tying: Pseudo-Inverse Tying for Stable LM Training and Updates - Score: 17 (R=9, N=8) - Date: 2026-02-05 - Comment: Model Architecture: pseudo-inverse-consistent tying of embedding/unembedding for stable LM training and interventions.
Provable Target Sample Complexity Improvements as Pre-Trained Models Scale - Score: 17 (R=9, N=8) - Date: 2026-02-05 - Comment: Representation Learning/Theory: PEFT-inspired caulking framework proving reduced downstream sample complexity as pre-trained models scale.
Perplexity Cannot Always Tell Right from Wrong - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Representation Learning (training dynamics/metric analysis): rigorous analysis of perplexity as a model selection metric for Transformer LMs.
Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks - Score: 17 (R=9, N=8) - Date: 2026-02-02 - Comment: Model Architecture/Representation Learning (training dynamics): optimization, generalization, and differential-privacy theory for training Kolmogorov–Arnold Networks.
Statistical-Computational Trade-offs in Learning Multi-Index Models via Harmonic Analysis - Score: 17 (R=8, N=9) - Date: 2026-02-11 - Comment: Representation Learning Theory: harmonic-analytic characterization and SQ/LDP lower bounds for learning multi-index models, with spectral algorithms achieving near-optimal trade-offs.
Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach - Score: 17 (R=8, N=9) - Date: 2026-02-07 - Comment: Generative Modeling Theory: principled conditional diffusion guidance via Doob’s h-transform with martingale-based estimators and non-asymptotic guarantees.
Gradient Flow Through Diagram Expansions: Learning Regimes and Explicit Solutions - Score: 17 (R=8, N=9) - Date: 2026-02-05 - Comment: Matches Representation Learning: theoretical analysis of training dynamics via gradient flow with explicit regimes/solutions, offering fundamental insights into learning behavior.
The Anxiety of Influence: Bloom Filters in Transformer Attention Heads - Score: 16 (R=9, N=7) - Date: 2026-02-20 - Comment: Representation Learning/Mechanistic interpretability: shows specific transformer heads act as membership testers (Bloom filter-like) and analyzes capacity/behavior.
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking - Score: 16 (R=9, N=7) - Date: 2026-02-20 - Comment: Matches Representation Learning: geometric analysis of training dynamics (low-dimensional subspace, curvature) explaining grokking in Transformers.
Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks - Score: 16 (R=9, N=7) - Date: 2026-02-19 - Comment: Representation Learning/Training Dynamics: geometric analysis of capability emergence, scale-invariant representation collapse, and top-down layer reorganization across model scales.
The Information Geometry of Softmax: Probing and Steering - Score: 16 (R=9, N=7) - Date: 2026-02-18 - Comment: Representation learning and geometry: information geometry of softmax representations with a principled steering method (dual steering).
Sufficient Conditions for Stability of Minimum-Norm Interpolating Deep ReLU Networks - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Representation Learning / Training Dynamics: stability analysis of minimum-norm interpolating deep ReLU networks with a low-rank layer condition.
Singular Vectors of Attention Heads Align with Features - Score: 16 (R=9, N=7) - Date: 2026-02-17 - Comment: Representation Learning/Mechanistic Interpretability: theoretical and empirical evidence that singular vectors of attention align with features; proposes testable predictions.
A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models - Score: 16 (R=9, N=7) - Date: 2026-02-16 - Comment: Model Architecture: theoretical analysis of selective SSMs (Mamba), showing input-dependent gating performs feature selection and establishing generalization/convergence bounds.
Deep networks learn to parse uniform-depth context-free languages from local statistics - Score: 16 (R=9, N=7) - Date: 2026-02-09 - Comment: Representation Learning: theoretical and empirical insights into how deep nets learn hierarchical structure from local statistics in PCFGs.
Mechanisms of AI Protein Folding in ESMFold - Score: 16 (R=9, N=7) - Date: 2026-02-06 - Comment: Matches Representation Learning: causal mechanistic dissection of ESMFold, identifying staged development of biochemical and spatial features.
Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-02-06 - Comment: Representation Learning: geometric and mechanistic analysis of depth-wise transitions in LLMs from context processing to prediction formation.
LORE: Jointly Learning the Intrinsic Dimensionality and Relative Similarity Structure From Ordinal Data - Score: 16 (R=9, N=7) - Date: 2026-02-05 - Comment: Representation learning with low-rank structure: jointly learns intrinsic dimensionality and ordinal embedding via Schatten-p regularization and IRLS optimization with guarantees.
Language Model Circuits Are Sparse in the Neuron Basis - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Representation Learning: empirical analysis of sparsity in the neuron basis and an end-to-end circuit tracing pipeline without SAEs.
Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features - Score: 16 (R=9, N=7) - Date: 2026-02-02 - Comment: Representation learning and sparsity: weight-based interpretability of Sparse Autoencoder features, revealing functional roles beyond activation patterns.
A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Representation learning/training dynamics: analytical theory linking pretraining initialization to feature reuse/refinement in fine-tuning.
On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Representation Learning/Uncertainty: establishes equivalence between RND, deep ensembles, and Bayesian inference in the NTK limit, providing a principled theoretical link.
I Dropped a Neural Net - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Matches Representation/Training Dynamics: reconstructs exact layer order of a shuffled ResNet via dynamic-isometry-driven signals, offering structural insights.
A Markovian View of Iterative-Feedback Loops in Image Generative Models: Neural Resonance and Model Collapse - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Training Dynamics/Theory: Markov-chain view of iterative feedback in generative models, explaining collapse via neural resonance with diagnostic taxonomy.
Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning - Score: 16 (R=8, N=8) - Date: 2026-02-20 - Comment: Representation Learning/Training Dynamics: rigorous theory for self-distillation in ridge regression with closed-form optimal mixing and precise asymptotics; one-shot tuning method.
Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees - Score: 16 (R=8, N=8) - Date: 2026-02-20 - Comment: Representation Learning: mechanistic interpretability with algorithms and provable robustness/minimality guarantees for circuit discovery.
Panini: Continual Learning in Token Space via Structured Memory - Score: 16 (R=8, N=8) - Date: 2026-02-18 - Comment: Continual Learning/Representation Learning: continual learning carried out in token space via a structured memory.
Universal priors: solving empirical Bayes via Bayesian inference and pretraining - Score: 16 (R=8, N=8) - Date: 2026-02-18 - Comment: Matches Representation Learning: provides a theoretical account (universal priors, posterior contraction) for adaptation and length generalization in pretrained transformers, offering foundational insights into training/generalization dynamics.
Drift-Diffusion Matching: Embedding dynamics in latent manifolds of asymmetric neural networks - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: Matches 'Model Architecture' and 'Representation Learning': asymmetric continuous-time RNNs trained to embed arbitrary SDE dynamics with analyses of encoding and time-irreversibility.
On the Stability of Nonlinear Dynamics in GD and SGD: Beyond Quadratic Potentials - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: Matches 'Training Dynamics': nonlinear stability criteria for GD/SGD beyond linearization, including stochastic effects and oscillations.
Finding Highly Interpretable Prompt-Specific Circuits in Language Models - Score: 16 (R=8, N=8) - Date: 2026-02-17 - Comment: Representation Learning/Mechanistic Interpretability: ACC++ extracts prompt-specific causal communication circuits in attention without SAEs or activation patching.
Transporting Task Vectors across Different Architectures without Training - Score: 16 (R=8, N=8) - Date: 2026-02-16 - Comment: Matches Representation Learning: training-free transport of task vectors across heterogeneous architectures via functional alignment of intermediate representations.
The Implicit Bias of Logit Regularization - Score: 16 (R=8, N=8) - Date: 2026-02-13 - Comment: Representation Learning: theoretical analysis of logit regularization’s implicit bias and training dynamics.
Protein Circuit Tracing via Cross-layer Transcoders - Score: 16 (R=8, N=8) - Date: 2026-02-13 - Comment: Matches Representation Learning and Compression: cross-layer sparse transcoders recover and compress computational circuits across layers in pLMs.
Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models - Score: 16 (R=8, N=8) - Date: 2026-02-12 - Comment: Representation Learning/Training Dynamics: RL credit assignment over latent thought trajectories for LoopLMs, aligning objective with internal computation.
Barycentric alignment for instance-level comparison of neural representations - Score: 16 (R=8, N=8) - Date: 2026-02-11 - Comment: Matches Representation Learning: barycentric alignment that quotients symmetries to build a universal embedding enabling instance-level representational comparison across models/brains.
A Graphop Analysis of Graph Neural Networks on Sparse Graphs: Generalization and Universal Approximation - Score: 16 (R=8, N=8) - Date: 2026-02-10 - Comment: Model Architecture Theory: unified compact metric space and graphop analysis yielding stronger generalization and universal approximation results for MPNNs on sparse graphs.
A Thermodynamic Theory of Learning Part II: Critical Period Closure and Continual Learning Failure - Score: 16 (R=8, N=8) - Date: 2026-02-10 - Comment: Training Dynamics Theory: thermodynamic perspective introduces critical period closure explaining continual learning failures under finite dissipation.
Featured Reproducing Kernel Banach Spaces for Learning and Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-02-10 - Comment: Representation Learning/Theory: establishes featured reproducing kernel Banach spaces, extending representer theorems and linking fixed-architecture neural networks to kernel methods beyond Hilbert spaces.
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders - Score: 16 (R=8, N=8) - Date: 2026-02-06 - Comment: Representation Learning: uses sparse autoencoders for mechanistic interpretability in diffusion language models; analyzes layer-wise effects and interventions.
Subliminal Effects in Your Data: A General Mechanism via Log-Linearity - Score: 16 (R=8, N=8) - Date: 2026-02-06 - Comment: Representation Learning/Theory: introduces logit-linear-selection mechanism explaining hidden dataset signals across models.
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model - Score: 16 (R=8, N=8) - Date: 2026-02-05 - Comment: Training Dynamics/Theory: optimal learning-rate (and momentum) schedules and compute-optimal scaling laws in a solvable model.
Continual Learning through Control Minimization - Score: 16 (R=8, N=8) - Date: 2026-02-05 - Comment: Continual learning/training dynamics: control-theoretic formulation yielding continual-natural gradient without curvature storage; foundational algorithmic innovation.
Supervised Learning as Lossy Compression: Characterizing Generalization and Sample Complexity via Finite Blocklength Analysis - Score: 16 (R=8, N=8) - Date: 2026-02-05 - Comment: Representation Learning/Theory: reframes supervised learning as finite-blocklength lossy compression, yielding explicit generalization/sample complexity bounds.
SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport - Score: 15 (R=8, N=7) - Date: 2026-02-28 - Comment: Matches Representation Learning: semi-supervised alignment of frozen unimodal encoders using OT-based divergence to transfer relational structure with minimal paired data.
Certified Circuits: Stability Guarantees for Mechanistic Circuits - Score: 15 (R=8, N=7) - Date: 2026-02-28 - Comment: Matches Representation Learning/Mechanistic Interpretability: certifies stability of discovered circuits via randomized subsampling, yielding provably robust subnet explanations.
Reinforcement-aware Knowledge Distillation for LLM Reasoning - Score: 15 (R=8, N=7) - Date: 2026-02-28 - Comment: Representation Learning/Training Dynamics: RL-aware distillation with a trust-region ratio objective (PPO/GRPO-style) replacing KL for on-policy teacher guidance.
UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs - Score: 15 (R=8, N=7) - Date: 2026-02-28 - Comment: Training Dynamics/Representation Learning: MI-based reward within GRPO to optimize pass@k by promoting diverse, skill-specific trajectories.
Model Agreement via Anchoring - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Training dynamics/generalization: anchoring technique to bound independent model disagreement across common algorithms.
Differentiable Zero-One Loss via Hypersimplex Projections - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Optimization/Representation: differentiable zero-one loss via hypersimplex projection (Soft-Binary-Argmax) enabling gradient-based training with geometric consistency.
Takeuchi's Information Criteria as Generalization Measures for DNNs Close to NTK Regime - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Representation Learning/Training Dynamics: evaluates Takeuchi’s Information Criterion as a generalization measure for DNNs near the NTK regime with theoretical and large-scale empirical support.
Latent Matters: Learning Deep State-Space Models - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Representation Learning/Architecture: constrained optimization framework for DSSMs and EKVAE combining amortized VI with Kalman filtering/smoothing to better learn dynamics.
Generalization Bounds of Stochastic Gradient Descent in Homogeneous Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Training dynamics/generalization theory: stability-based bounds for SGD in homogeneous neural networks allowing slower stepsize decay.
IBCircuit: Towards Holistic Circuit Discovery with Information Bottleneck - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Representation Learning/Mechanistic Interpretability: Information Bottleneck-based end-to-end optimization to discover faithful, minimal circuits without handcrafted corruptions.
Causality $\neq$ Invariance: Function and Concept Vectors in LLMs - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Representation learning/mechanistic interpretability: discovers format-invariant Concept Vectors distinct from Function Vectors and demonstrates causal steering.
Disentangling Shared and Target-Enriched Topics via Background-Contrastive Non-negative Matrix Factorization - Score: 15 (R=8, N=7) - Date: 2026-02-27 - Comment: Representation Learning: background-contrastive NMF with shared bases to disentangle target-specific topics; scalable multiplicative updates on GPU.
Understanding the Curse of Unrolling - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Representation Learning (training dynamics): non-asymptotic analysis of algorithm unrolling explains divergence and proposes truncation to stabilize and reduce memory.
Grokking Finite-Dimensional Algebra - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Representation Learning and Training Dynamics: studies grokking across algebraic structures, linking generalization to structure tensor rank/sparsity and implicit low-rank bias.
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Representation Learning/Interpretability: self-supervised disentanglement of goal vs. framing factors in LLM activations with theoretical guarantees and efficient detection.
Partial Soft-Matching Distance for Neural Representational Comparison with Partial Unit Correspondence - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Representation Learning: introduces a partial optimal transport-based soft-matching distance for neural representational comparison with theory and efficient ranking.
Spectral bias in physics-informed and operator learning: Analysis and mitigation guidelines - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Representation/training dynamics: analyzes spectral bias in PINNs/neural operators and proposes optimization and loss strategies to mitigate it.
Spilled Energy in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-24 - Comment: Representation Learning/Architecture Analysis: reinterprets LLM softmax as EBM and proposes training-free energy metrics for hallucination detection from logits.
A Geometric Probe of the Accuracy-Robustness Trade-off: Sharp Boundaries in Symmetry-Breaking Dimensional Expansion - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Representation Learning/Training Dynamics: geometric explanation of accuracy–robustness trade-off via symmetry-breaking dimensional expansion and mask projection.
ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Representation Learning: feature visualization for LLM directions via hybrid prompt optimization tailored to discrete text.
Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs - Score: 15 (R=8, N=7) - Date: 2026-02-23 - Comment: Representation Learning/Mechanistic Interpretability: identifies a universal activation subspace driving clarification-seeking and turn amplification across prompts and models.
ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: Representation Engineering: unified ODE-based activation steering framework for LLM alignment with multi-step adaptive control.
Discovering Universal Activation Directions for PII Leakage in Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-20 - Comment: Representation Learning: identifies universal latent activation directions in LLM residual streams that modulate PII leakage via mechanistic interpretability.
Are Object-Centric Representations Better At Compositional Generalization? - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Representation Learning: rigorous evaluation of object-centric versus dense vision encoders for compositional generalization under controlled data/compute regimes.
FEKAN: Feature-Enriched Kolmogorov-Arnold Networks - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Matches Model Architecture: proposes a KAN variant (FEKAN) that improves efficiency and representation capacity with theoretical guarantees, without increasing parameters.
The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Representation Learning/Training Dynamics: analyzes implicit bias of momentum-based optimizers (Muon, Adam, Signum) in homogeneous neural networks, linking to margin maximization norms.
Geometric Neural Operators via Lie Group-Constrained Latent Dynamics - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Matches Model Architecture/Representation Learning: Lie group-constrained latent dynamics with low-rank parameterization to enforce geometric inductive biases in neural operators.
On the Power of Source Screening for Learning Shared Feature Extractors - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Representation Learning: theory and algorithms for source screening to learn a shared low-dimensional subspace, achieving minimax-optimal subspace estimation from selected sources.
LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Model Architecture/Representation Learning: introduces a learnable discrete tokenizer (LGQ) with differentiable soft assignments and utilization regularizers to address codebook collapse and scalability.
Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-19 - Comment: Matches Representation Learning: geometric analysis of latent subspaces for personality steering and effects of orthonormalization.
Neural-POD: A Plug-and-Play Neural Operator Framework for Infinite-Dimensional Functional Nonlinear Proper Orthogonal Decomposition - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Representation Learning: proposes a neural-operator framework to learn nonlinear, orthogonal basis functions (resolution-invariant) as a POD alternative.
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Representation learning/training dynamics: shows cross-modal training corrects positional binding shortcuts and improves OOD generalization.
Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Model efficiency/architecture: decoupled representation refinement to encode rich features into compact embeddings for fast INR inference.
Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Representation Learning + Efficiency: concept-level training data attribution leveraging probes/sparse autoencoder features with scalable approximations.
The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Representation Learning/Calibration theory: properties and characterizations of temperature scaling, including entropy monotonicity and information projection.
Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Representation Learning/Training Dynamics: probes internal representations to diagnose and localize knowledge conflict signals in MLLMs.
Revisiting the Platonic Representation Hypothesis: An Aristotelian View - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Representation Learning: permutation-calibrated representational similarity metrics with statistical guarantees; proposes Aristotelian hypothesis on local neighborhoods.
The geometry of invariant learning: an information-theoretic analysis of data augmentation and generalization - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Representation Learning: information-theoretic generalization bounds for data augmentation with a geometric group-diameter control.
Metabolic cost of information processing in Poisson variational autoencoders - Score: 15 (R=8, N=7) - Date: 2026-02-17 - Comment: Representation Learning / Autoencoders: Poisson VAE links KL to firing rates yielding metabolic cost and emergent sparse coding, contrasting Gaussian VAEs.
SWING: Unlocking Implicit Graph Representations for Graph Random Features - Score: 15 (R=8, N=7) - Date: 2026-02-16 - Comment: Representation/Efficiency: introduces SWING for Graph Random Features on implicit graphs using linearized kernels via random features and importance sampling without graph materialization; accelerator-friendly.
The Observer Effect in World Models: Invasive Adaptation Corrupts Latent Physics - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Matches Representation Learning: proposes non-invasive linear-probe evaluation (PhyIP) to assess latent physical structure without adaptation.
TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Model Architecture and Representation Learning: extends VAE with adaptable task-specific priors (Task-Amortized VAE) enabling flexible contextual inference.
In-Context Function Learning in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Matches Representation Learning/Training Dynamics: GP-based analysis of in-context learning and inductive biases with methods to steer them.
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification - Score: 15 (R=8, N=7) - Date: 2026-02-13 - Comment: Matches Representation Learning: hierarchical sparse coding with concept embeddings leveraging VLM latent space for interpretable classification.
Weight Decay Improves Language Model Plasticity - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Representation Learning/Training Dynamics: shows how pretraining weight decay shapes linear separability, attention regularization, and downstream plasticity.
Generalized Robust Adaptive-Bandwidth Multi-View Manifold Learning in High Dimensions with Noise - Score: 15 (R=8, N=7) - Date: 2026-02-12 - Comment: Representation Learning: diffusion geometry/manifold learning with adaptive bandwidth selection and theoretical guarantees under heterogeneous high-dimensional noise.
Step-resolved data attribution for looped transformers - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning/Interpretability in Transformers: step-decomposed training-data influence for looped transformers with scalable TensorSketch implementation.
When Less is More: The LLM Scaling Paradox in Context Compression - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Matches Representation Learning: analysis of scaling in compressor–decoder context compression revealing knowledge overwriting and semantic drift tied to embedding rank and entropy.
Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning/Interpretability: geometric principle for circuit discovery and activation steering in transformers (read-write duality).
Self-Supervised Learning as Discrete Communication - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning: frames SSL as discrete communication over a fixed-capacity binary channel with coding-rate regularization and periodic head reinitialization.
Towards Uniformity and Alignment for Multimodal Representation Learning - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning: decouples alignment and uniformity across modalities with theory via global Hölder divergence.
Is Memorization Helpful or Harmful? Prior Information Sets the Threshold - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Matches Representation Learning/Training Theory: Bayesian analysis in overparameterized linear models linking memorization vs overfitting to prior Fisher information thresholds.
Effective Reasoning Chains Reduce Intrinsic Dimensionality - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning: quantifies how reasoning strategies reduce intrinsic dimensionality and links this reduction to generalization.
Spectral Disentanglement and Enhancement: A Dual-domain Contrastive Framework for Representation Learning - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning: SVD-based spectral disentanglement with dual-domain (feature+spectrum) contrastive objectives and curriculum spectral enhancement.
Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Model Architecture and Representation Learning: introduces discrete concept prediction with vector quantization atop NTP to form a harder pretraining objective.
Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning / Training dynamics: causal do-interventions over latent chain-of-thought steps to analyze necessity, influence propagation, and mode commitment.
Emergent Misalignment is Easy, Narrow Misalignment is Hard - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning: isolates linear representations underlying emergent misalignment and generalization in LLMs.
Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Matches Representation Learning: uncovers and causally validates a conceptual subspace mediating in-context inference in LLMs; analyzes layer-wise construction/use.
On the Infinite Width and Depth Limits of Predictive Coding Networks - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning/Training Dynamics: infinite-width/depth analysis of predictive coding networks showing BP equivalence.
Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning/Analysis: training-free few-shot segmentation with DINOv3 features revealing a "Semantic Selection Gap" across layers and strong last-layer baseline.
Your Language Model Secretly Contains Personality Subnetworks - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Sparsity/Subnetworks + Representation Learning: identifies and isolates persona-specific subnetworks via masking and contrastive pruning without training.
The Geometry of Representational Failures in Vision Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-11 - Comment: Representation Learning: mechanistic analysis of concept-vector geometry in VLMs with steering interventions linking representational overlap to failure modes.
Towards Understanding Multimodal Fine-Tuning: Spatial Features - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Representation Learning: mechanistic analysis of multimodal fine-tuning via stage-wise model diffing, revealing spatially grounded features and causal attention heads.
Mutual information and task-relevant latent dimensionality - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Representation Learning: MI-based hybrid critic to infer task-relevant latent dimensionality, preserving latent geometry.
Interpreting Physics in Video World Models - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Representation Learning/Interpretability: layerwise analysis of video transformers reveals a Physics Emergence Zone and distributed geometric encodings of motion variables.
Endogenous Resistance to Activation Steering in Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Representation Learning/Mechanistic Interpretability: identifies internal circuits via sparse autoencoder latents and analyzes resistance to activation steering in LLMs.
Vision Transformer Finetuning Benefits from Non-Smooth Components - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Representation Learning/training dynamics: analyzes ViT plasticity (non-smoothness) to guide finetuning component selection.
Explaining Grokking in Transformers through the Lens of Inductive Bias - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Representation Learning/Training Dynamics: analyzes grokking in transformers via inductive biases (e.g., LayerNorm placement) and relates generalization to feature compressibility.
Same Answer, Different Representations: Hidden instability in VLMs - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Representation Learning: probes internal representation drift, spectral sensitivity, and spatial smoothness in VLMs beyond output invariance.
Multi-Way Representation Alignment - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Representation Learning: multi-model latent space alignment beyond pairwise methods via GPA/GCPA, preserving geometry and improving any-to-any retrieval.
Optimal rates for density and mode estimation with expand-and-sparsify representations - Score: 15 (R=8, N=7) - Date: 2026-02-09 - Comment: Representation Learning/Theory: analyzes expand-and-sparsify sparse representations, proving minimax-optimal rates for density and mode estimation.
Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering - Score: 15 (R=8, N=7) - Date: 2026-02-07 - Comment: Representation Learning/Efficiency: calibration-aware inference-time steering via residual activation probes; improves accuracy without retraining.
NEX: Neuron Explore-Exploit Scoring for Label-Free Chain-of-Thought Selection and Model Ranking - Score: 15 (R=8, N=7) - Date: 2026-02-07 - Comment: Representation Learning/Efficiency: white-box neuron-phase analysis to rank CoT samples and models without labels, reducing selection compute.
Refine and Purify: Orthogonal Basis Optimization with Null-Space Denoising for Conditional Representation Learning - Score: 15 (R=8, N=7) - Date: 2026-02-07 - Comment: Representation Learning: orthogonal basis optimization and null-space projection for conditional subspaces (sparse/low-interference representations).
Joint Embedding Variational Bayes - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Matches Representation Learning: variational joint embedding maximizing a symmetric conditional ELBO with heavy-tailed likelihood on embeddings.
Smoothness Errors in Dynamics Models and How to Avoid Them - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Model Architecture and Representation Learning: relaxed unitary graph/mesh convolutions to balance smoothing, addressing over-smoothing in dynamics models.
Disentangled Representation Learning via Flow Matching - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Flow matching framework for disentangled representation learning with explicit orthogonality regularizer; matches Representation Learning criterion.
Improving Set Function Approximation with Quasi-Arithmetic Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Model Architecture: introduces a learnable set aggregation (Neuralized Kolmogorov Mean) and quasi-arithmetic neural networks with theory on universal approximation and structured latent representations for set functions.
Phaedra: Learning High-Fidelity Discrete Tokenization for the Physical Science - Score: 15 (R=8, N=7) - Date: 2026-02-06 - Comment: Matches Representation Learning/Efficiency: tokenizer design (Phaedra) tailored for physical data preserving spectral/physical fidelity for scalable discrete representations.
Towards Understanding and Avoiding Limitations of Convolutions on Graphs - Score: 15 (R=8, N=7) - Date: 2026-02-05 - Comment: Representation Learning: theoretical analysis of rank collapse/over-smoothing in MPNNs; Model Architecture: proposes MRS/MIMO-GC/LMGC to mitigate these limitations.
SEIS: Subspace-based Equivariance and Invariance Scores for Neural Representations - Score: 15 (R=8, N=7) - Date: 2026-02-05 - Comment: Representation Learning: SEIS subspace metrics disentangling equivariance vs invariance layer-wise without labels or explicit transformation knowledge.
A Hitchhiker's Guide to Poisson Gradient Estimation - Score: 15 (R=8, N=7) - Date: 2026-02-05 - Comment: Training dynamics and representation learning: improved Poisson gradient estimation (modified EAT) evaluated in VAEs with Poisson latents; practical guidance and theory.
Understanding Generalization from Embedding Dimension and Distributional Convergence - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Representation Learning: generalization bounds via intrinsic embedding dimension and Lipschitz sensitivity, offering representation-centric diagnostics.
Is Softmax Loss All You Need? A Principled Analysis of Softmax-family Loss - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Representation Learning/Training Dynamics: principled analysis of Softmax-family losses: consistency, gradients, and efficiency–effectiveness trade-offs for large-class learning.
Local Intrinsic Dimension of Representations Predicts Alignment and Generalization in AI Models and Human Brain - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Representation Learning: links local intrinsic dimension to generalization and model–model/brain alignment, offering a geometric descriptor.
Stabilizing Consistency Training: A Flow Map Analysis and Self-Distillation - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Flow-map-based theory for consistency model training and a stabilized self-distillation objective; matches Representation Learning (training dynamics/stability) and model training methodology.
Context Structure Reshapes the Representational Geometry of Language Models - Score: 15 (R=8, N=7) - Date: 2026-02-02 - Comment: Representation Learning: analyzes how context structure reshapes representational geometry (trajectory straightening) in LLMs across tasks.

Other Foundational Research (9)

Deriving Neural Scaling Laws from the statistics of natural language - Score: 18 (R=9, N=9) - Date: 2026-02-11 - Comment: Foundational Theory: derives neural scaling law exponents from measurable language statistics (correlation decay and conditional entropy) with parameter-free predictions.
Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization - Score: 17 (R=9, N=8) - Date: 2026-02-06 - Comment: Optimization/Training Dynamics: tight long-term tail decay analysis for (clipped) SGD in non-convex settings via large deviations, offering stronger run-level guarantees.
Online Realizable Regression and Applications for ReLU Networks - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Theory/Training Dynamics: bounds for realizable online regression under approximate metric losses with applications to bounded-norm ReLU networks.
Implicit Bias and Convergence of Matrix Stochastic Mirror Descent - Score: 16 (R=8, N=8) - Date: 2026-02-24 - Comment: Training Dynamics/Implicit Bias: proves convergence and implicit bias for matrix-valued stochastic mirror descent, extending classic results to multi-output settings.
Hierarchical Zero-Order Optimization for Deep Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-02-12 - Comment: High Performance Computing/Optimization: hierarchical zeroth-order training reducing query complexity from O(ML^2) to O(ML log L) with stability analysis; applicable to deep networks.
A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula - Score: 16 (R=8, N=8) - Date: 2026-02-11 - Comment: Training Dynamics/Theory: finite-sample analysis of iterative self-improvement and easy-to-hard curricula with explicit guarantees and feedback-loop characterization.
Functional Central Limit Theorem for Stochastic Gradient Descent - Score: 15 (R=8, N=7) - Date: 2026-02-18 - Comment: Training Dynamics/Theory: functional CLT for SGD trajectories characterizing temporal fluctuations around minimizers.
Towards Robust Scaling Laws for Optimizers - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: HPC/Training Dynamics: optimizer-aware scaling laws with shared exponents and theoretical grounding on convex quadratics.
Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate - Score: 15 (R=8, N=7) - Date: 2026-02-10 - Comment: Matches Training Dynamics: convexity/Lipschitz-based scaling laws linking loss and learning rate across model sizes/horizons.