
Personalized Monthly Topic Summary 2025/11

Metric	Value
Total Papers	467
Model Architecture	142
Model Compression and Efficiency	163
High Performance Computing	44
Representation Learning	111
Other Foundational Research	7

Model Architecture (142)

MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts - Score: 19 (R=10, N=9) - Date: 2025-11-27 - Comment: Architecture/Efficiency: training-free metamorphosis of dense MLPs into static MoE; introduces structured sparsity (Fractal Fade) and variance-preserving pruning.
Equivalence of Context and Parameter Updates in Modern Transformer Blocks - Score: 19 (R=10, N=9) - Date: 2025-11-25 - Comment: Representation Learning/Architecture Analysis: proves context effects equal rank-1 MLP weight patches (plus RMSNorm) across modern transformer blocks incl. MoE; constructive multi-layer algorithm.
Compiling to linear neurons - Score: 19 (R=10, N=9) - Date: 2025-11-19 - Comment: Model architecture: a programming language that compiles discrete algorithms into linear neurons, enabling direct programming within differentiable networks.
Multistability of Self-Attention Dynamics in Transformers - Score: 19 (R=10, N=9) - Date: 2025-11-17 - Comment: Representation/Training Dynamics: theoretical analysis of self-attention as multiagent Oja flow with equilibrium taxonomy.
The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms - Score: 19 (R=10, N=9) - Date: 2025-11-07 - Comment: Model Architecture: establishes a strong lottery ticket existence result for multi-head attention in transformers, advancing sparsity/lottery-ticket theory for this core component.
Higher-order Linear Attention - Score: 19 (R=10, N=9) - Date: 2025-11-04 - Comment: Model Architecture/Efficiency: introduces higher-order linear-time attention with constant-size state; HPC: exact chunk-parallel scans for streaming recurrence.
Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems - Score: 19 (R=10, N=9) - Date: 2025-11-03 - Comment: Model Architecture: theoretical analysis of Mixture-of-Transformers (transformer-level experts with gating), proving specialization and faster convergence for MoE-style models.
On the Role of Hidden States of Modern Hopfield Network in Transformer - Score: 18 (R=10, N=8) - Date: 2025-11-27 - Comment: Model Architecture: introduces Modern Hopfield Attention by adding MHN-derived hidden states to Transformers to mitigate rank collapse and token uniformity.
Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models - Score: 18 (R=10, N=8) - Date: 2025-11-26 - Comment: Strongly matches MoE and Pruning: hierarchical, generalizable expert selection (‘cluster-then-select’) for pruning sparse MoE models across domains.
Exploiting the Experts: Unauthorized Compression in MoE-LLMs - Score: 18 (R=10, N=8) - Date: 2025-11-26 - Comment: Model Architecture + Compression/Efficiency: MoE expert attribution and pruning under task use; analyzes prunability and proposes defenses (entangled expert training) against unauthorized MoE compression.
AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert - Score: 18 (R=10, N=8) - Date: 2025-11-26 - Comment: Matches Model Architecture: MoE with budget-aware on-demand expert allocation per token (dynamic routing).
Selective Rotary Position Embedding - Score: 18 (R=10, N=8) - Date: 2025-11-24 - Comment: Model Architecture: Selective (input-dependent) Rotary Position Embeddings generalizing RoPE across softmax/linear transformers and SSMs with analysis of implicit rotations/forgetting.
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2025-11-19 - Comment: Matches MoE + Compression/Efficiency: speculative expert prefetching and quantized offloading to hide PCIe I/O, accelerating MoE inference on limited hardware.
Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-11-18 - Comment: High-Performance MoE Systems: pre-attention expert prediction/prefetching with lightweight linear routers and ranking-aware loss enabling first-layer prefetch and high-accuracy routing.
Optimizing Mixture of Block Attention - Score: 18 (R=10, N=8) - Date: 2025-11-17 - Comment: Model Architecture + HPC: theoretical analysis and CUDA kernel (FlashMoBA) for efficient Mixture of Block Attention with long-context speedups.
FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models - Score: 18 (R=10, N=8) - Date: 2025-11-17 - Comment: HPC + Model Architecture: MoE architectural modifications enabling overlap of computation and blocking communication for distributed efficiency.
$\pi$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling - Score: 18 (R=10, N=8) - Date: 2025-11-17 - Comment: Model Architecture/Efficiency: periodic sparse Transformer attention (π-Attention) for long-context with linear-time footprint.
BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference - Score: 18 (R=10, N=8) - Date: 2025-11-14 - Comment: Compression/Efficiency + Systems for MoE: exploits expert redundancy to accelerate memory-constrained MoE inference and mitigate PCIe offloading stalls when prefetch fails.
Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off - Score: 18 (R=10, N=8) - Date: 2025-11-14 - Comment: Model Architecture and Efficiency: introduces principled structural sparsity in multi-head attention, reducing complexity by a factor of H.
Selective Sinkhorn Routing for Improved Sparse Mixture of Experts - Score: 18 (R=10, N=8) - Date: 2025-11-13 - Comment: Model architecture (MoE): lightweight optimal-transport-based selective Sinkhorn routing that removes auxiliary load-balancing losses.
A Circular Argument: Does RoPE need to be Equivariant for Vision? - Score: 18 (R=10, N=8) - Date: 2025-11-12 - Comment: Model Architecture: formal analysis of RoPE equivariance and a new positional encoding (Spherical RoPE) for M-dimensional data, challenging the necessity of strict equivariance.
Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs - Score: 18 (R=10, N=8) - Date: 2025-11-11 - Comment: Model Architecture (MoE): aligns routing weights with task manifolds via manifold regularization, improving generalization with lightweight router fine-tuning.
Route Experts by Sequence, not by Token - Score: 18 (R=10, N=8) - Date: 2025-11-11 - Comment: Model Architecture: MoE routing innovation (sequence-level TopK) enabling dynamic expert allocation under fixed budget, improving efficiency at high sparsity.
How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy - Score: 18 (R=10, N=8) - Date: 2025-11-11 - Comment: Model Architecture/Efficiency: Random Batch Attention, a linear-time self-attention with theoretical expressivity and parallelization benefits.
PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference - Score: 18 (R=10, N=8) - Date: 2025-11-11 - Comment: Matches Model Compression and Architecture: training-free MoE compression via sparse expert merging and bit-packed inference.
GMoPE: A Prompt-Expert Mixture Framework for Graph Foundation Models - Score: 18 (R=10, N=8) - Date: 2025-11-07 - Comment: Model Architecture: Mixture-of-Experts framework for graph foundation models with structure-aware routing and prompt-expert vectors; prompt-only fine-tuning improves efficiency.
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error - Score: 18 (R=10, N=8) - Date: 2025-11-05 - Comment: High Performance Computing + Low-Precision Training: FP8-centric, quantization-consistent dataflow for MoE with fused operators, eliminating cast overhead and avoiding double quantization error.
Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining - Score: 18 (R=10, N=8) - Date: 2025-11-05 - Comment: MoE Efficiency: batch-aware opportunistic expert activation to reduce activated experts and decode latency without retraining.
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants - Score: 18 (R=10, N=8) - Date: 2025-11-05 - Comment: High Performance Computing: compiler-native PyTorch extensions that automatically generate fused, FlashAttention-style kernels for diverse attention variants with tiling and fusion.
LongCat-Flash-Omni Technical Report - Score: 18 (R=10, N=8) - Date: 2025-11-05 - Comment: Model Architecture (MoE) + HPC: Shortcut-connected MoE with zero-computation experts and modality-decoupled parallelism for efficient large-scale multimodal training.
Quantitative Bounds for Length Generalization in Transformers - Score: 18 (R=10, N=8) - Date: 2025-11-03 - Comment: Model Architecture/Theory: quantitative bounds for length generalization in Transformers, analyzing precision and depth cases.
Deep Learning as a Convex Paradigm of Computation: Minimizing Circuit Size with ResNets - Score: 18 (R=9, N=9) - Date: 2025-11-27 - Comment: Foundational Theory/Architecture: relates a ResNet-induced norm to circuit complexity in a convex HTMC regime, explaining Occam-like computation via ResNets.
Softmax Transformers are Turing-Complete - Score: 18 (R=9, N=9) - Date: 2025-11-26 - Comment: Matches Model Architecture analysis: proves Turing-completeness for length-generalizable softmax CoT transformers (theoretical foundation of Transformers).
Operator Learning at Machine Precision - Score: 18 (R=9, N=9) - Date: 2025-11-26 - Comment: Strongly matches Model Architecture (Operator Learning) and MoE-like aggregation: CHONKNORIS regresses Cholesky factors of Newton–Kantorovich updates to achieve machine-precision; FONKNORIS aggregates multiple experts.
Categorical Equivariant Deep Learning: Category-Equivariant Neural Networks and Universal Approximation Theorems - Score: 18 (R=9, N=9) - Date: 2025-11-25 - Comment: Model Architecture/Theory: category-equivariant neural networks with general equivariant universal approximation theorems.
Internalizing Tools as Morphisms in Graded Transformers - Score: 18 (R=9, N=9) - Date: 2025-11-25 - Comment: Model Architecture: graded transformers with typed morphisms and utility-driven differentiable routing yielding sparse, interpretable internal computation.
Compiling to recurrent neurons - Score: 18 (R=9, N=9) - Date: 2025-11-20 - Comment: Model Architecture: introduces a typed language that compiles iteration into linear recurrent neurons, enabling first-class control flow within differentiable networks with formal correctness.
Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions - Score: 18 (R=9, N=9) - Date: 2025-11-13 - Comment: Model Architecture: introduces Branching Flows—flow matching with stochastic splits/deletions to handle variable-length outputs across discrete/continuous/manifold spaces.
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization - Score: 18 (R=9, N=9) - Date: 2025-11-11 - Comment: Matches Representation Learning/Architecture theory: provable chain-of-thought length generalization in transformers via attention concentration.
Next-Latent Prediction Transformers Learn Compact World Models - Score: 18 (R=9, N=9) - Date: 2025-11-11 - Comment: Representation Learning/Architecture: Next-Latent Prediction objective induces compact belief-state latents and transition dynamics in Transformers.
FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning - Score: 17 (R=10, N=7) - Date: 2025-11-25 - Comment: Strong match to Model Architecture (MoE) and Efficiency: dynamic expert activation and routing-aware token pruning for MoE-based MLLMs.
MoETTA: Test-Time Adaptation Under Mixed Distribution Shifts with MoE-LayerNorm - Score: 17 (R=10, N=7) - Date: 2025-11-20 - Comment: Matches Criterion 1 (Model Architecture): Mixture-of-Experts-based TTA with expert-level adaptation to heterogeneous mixed distribution shifts; introduces MoE-LayerNorm expertization for conditional updates.
Bayesian Mixture of Experts For Large Language Models - Score: 17 (R=10, N=7) - Date: 2025-11-13 - Comment: Matches Mixture-of-Experts criterion directly: Bayesian uncertainty via structured Laplace on expert layers in MoE LLMs.
Subjective Depth and Timescale Transformers: Learning Where and When to Compute - Score: 17 (R=9, N=8) - Date: 2025-11-27 - Comment: Conditional/Dynamic Networks: Bayesian-surprise-driven routing for where/when to compute in decoder-only Transformers, reducing self-attention and KV-cache costs.
HVAdam: A Full-Dimension Adaptive Optimizer - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: Matches Optimizers/Training Dynamics: introduces a tunable-adaptivity optimizer with convergence guarantees bridging SGD and Adam.
CAMformer: Associative Memory is All You Need - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: High Performance Computing/Architecture: analog BA-CAM associative memory for constant-time attention similarity with hierarchical top-k filtering.
Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: Model Architecture and Efficiency: explores latency-optimal depth–width ratios and operator choices; evolutionary search for hybrid SLMs optimized for real-device latency.
AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens - Score: 17 (R=9, N=8) - Date: 2025-11-25 - Comment: Model Architecture/Efficiency: unified adaptive Transformer controlling width, depth, and tokens with joint training for Pareto-efficient inference.
Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently - Score: 17 (R=9, N=8) - Date: 2025-11-25 - Comment: Representation Learning/Training Dynamics: theoretical analysis showing transformers learn k-sparse Boolean functions via RL vs SFT with CoT-style supervision.
Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence - Score: 17 (R=9, N=8) - Date: 2025-11-20 - Comment: Matches Criterion 1 (Model Architecture) and Criterion 4 (Representation Learning): proposes dynamic nested hierarchies that adapt optimization levels/structure with convergence and expressivity analysis for lifelong learning.
QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Algorithm-Architecture Co-Design: cache-local deformable attention accelerator with schedule-aware prefetching and fused kernels achieving large throughput/energy gains and mixed-precision quantization.
Stratified Knowledge-Density Super-Network for Scalable Vision Transformers - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Model Compression and Architecture: ViT super-network via weighted PCA attention contraction and importance-aware dropout for stratified knowledge and flexible subnets.
The Anatomy of a Triton Attention Kernel - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: High Performance Computing: Triton-based paged attention kernel with auto-tuning and cross-vendor portability achieving SOTA inference efficiency.
Decoupling Positional and Symbolic Attention Behavior in Transformers - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Strongly matches Model Architecture analysis and Representation Learning: theoretical and empirical dissection of Transformer attention with RoPE, defining metrics for positional vs symbolic head behavior and causal control via frequency access.
MMA-Sim: Bit-Accurate Reference Model of Tensor Cores and Matrix Cores - Score: 17 (R=9, N=8) - Date: 2025-11-17 - Comment: High-Performance Computing: bit-accurate reference model of Tensor/Matrix Cores revealing undocumented arithmetic behaviors affecting DNN stability.
Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs - Score: 17 (R=9, N=8) - Date: 2025-11-17 - Comment: Model Architecture: parameter-space alignment exploiting Transformer symmetries (permutation/rotation/scaling) to enable reliable model merging/skill transfer.
Rethinking Visual Information Processing in Multimodal LLMs - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: Model Architecture: LLaViT introduces modality-specific QKV, bidirectional attention over visual tokens, and multi-scale representations to better integrate vision into LLMs.
Fractional neural attention for efficient multiscale sequence processing - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: Model Architecture: replaces standard self-attention with Fractional Neural Attention based on fractional Laplacian diffusion for multiscale dependencies; theory links to larger spectral gaps and shorter path lengths (efficiency).
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models - Score: 17 (R=9, N=8) - Date: 2025-11-13 - Comment: Model Architecture and Efficiency: selective latent iterations only at hard tokens via a learned decider, LoRA-based refinement, and duo-causal attention over iteration depth.
Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Model Architecture/Efficiency: retrofits recurrence into pretrained LMs with a recurrence curriculum to decouple test-time compute from parameters/training compute.
Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas? - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Training dynamics theory: maps SGD with weight decay in scale-invariant nets to thermodynamic variables, informing hyperparameter design.
TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Matches Model Architecture and Compression/Efficiency: Tucker low-rank PEFT with hierarchical tensor experts and efficient routing (MoE-like).
Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Model Architecture and Efficiency: replaces self-attention with stacked target-to-history cross-attention for linear complexity; batching and length extrapolation for 10k sequences.
Efficient Linear Attention for Multivariate Time Series Modeling via Entropy Equality - Score: 17 (R=9, N=8) - Date: 2025-11-07 - Comment: Model Architecture and Efficiency: introduces a theoretically grounded linear attention via entropy-equality, achieving linear complexity and balanced attention weights.
The Curved Spacetime of Transformer Architectures - Score: 17 (R=9, N=8) - Date: 2025-11-06 - Comment: Representation Learning / Architecture analysis: geometric framework analyzing attention as curvature and parallel transport in Transformers.
ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Model Architecture: external explicit memory bank with differentiable two-stage retrieval and conditional routing (product-key filtering + Gumbel-Softmax) for interpretable, updatable knowledge.
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Model Architecture and Efficiency: introduces Visual-Contrast Attention that replaces MHSA, reducing complexity from quadratic to O(N n C) with architectural modifications enabling sparse, contrastive interactions.
Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Model Architecture/Theory: reframes attention via an energy-based principle and proposes new attention variants inspired by optimization methods.
Fermions and Supersymmetry in Neural Network Field Theories - Score: 17 (R=8, N=9) - Date: 2025-11-24 - Comment: Foundational Architecture Theory: Grassmann-valued neural networks realizing fermionic field theories, infinite-width limits, and supersymmetry via super-affine transformations.
Controlling changes to attention logits - Score: 16 (R=9, N=7) - Date: 2025-11-27 - Comment: Model Architecture: transformer attention stabilization by controlling changes to attention logits via parameter-dependent learning rates (QK dynamics).
OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs - Score: 16 (R=9, N=7) - Date: 2025-11-26 - Comment: Matches Model Architecture: leverages MoE router scores to build self-supervised preference hierarchies for alignment in multimodal MoE LLMs.
Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers - Score: 16 (R=9, N=7) - Date: 2025-11-26 - Comment: Model Architecture/Efficiency: deterministic continuous blending to stably replace self-attention with efficient alternatives in pretrained Transformers.
Progressive Localisation in Localist LLMs - Score: 16 (R=9, N=7) - Date: 2025-11-26 - Comment: Model Architecture: progressively localizing attention across layers to improve interpretability while retaining performance—an architectural scheduling insight for LLMs.
Shape-Adapting Gated Experts: Dynamic Expert Routing for Colonoscopic Lesion Segmentation - Score: 16 (R=9, N=7) - Date: 2025-11-25 - Comment: Model Architecture: dynamic expert routing with hierarchical gating (MoE-like) across CNN/Transformer paths for adaptive computation.
Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required? - Score: 16 (R=9, N=7) - Date: 2025-11-24 - Comment: Model Architecture and Efficiency: sparse Mixture-of-Experts treating channels as experts to reduce cross-channel attention cost in multi-channel ViTs.
Gradient flow for deep equilibrium single-index models - Score: 16 (R=9, N=7) - Date: 2025-11-24 - Comment: Training Dynamics/Model Architecture: theoretical analysis of gradient flow and convergence for deep equilibrium (DEQ) and single-index models, including conservation law and linear convergence.
Self-Adaptive Graph Mixture of Models - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Model Architecture: graph Mixture-of-Models with topology-aware gating and expert pruning; mixture approach across heterogeneous GNNs.
Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: quantization (4-bit) and uncertainty-driven routing in a quantized Mixture-of-Experts to stabilize latency and save energy.
Do traveling waves make good positional encodings? - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Model Architecture: new positional encoding (RollPE) framed as traveling waves with equivalence/links to RoPE.
A Unified Geometric Field Theory Framework for Transformers: From Manifold Embeddings to Kernel Modulation - Score: 16 (R=9, N=7) - Date: 2025-11-12 - Comment: Model Architecture Theory: unified geometric/field-theoretic framework linking positional encodings, kernel operators, and attention in Transformers.
A General Method for Proving Networks Universal Approximation Property - Score: 16 (R=9, N=7) - Date: 2025-11-12 - Comment: Model Architecture/Theory: a general modular framework (UAM) to prove universal approximation across diverse architectures.
Minimum Width of Deep Narrow Networks for Universal Approximation - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Model Architecture: theoretical bounds on minimum width for universal approximation in deep narrow networks across activations.
Learning to Focus: Focal Attention for Selective and Scalable Transformers - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Model Architecture/Efficiency: Focal Attention sharpens softmax via temperature control (fixed or learnable), improving scaling and long-context performance.
Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Model Architecture and Sparsity: multi-expert sparse autoencoder with multiple expert activation and feature scaling to reduce redundancy and improve specialization.
How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need? - Score: 16 (R=9, N=7) - Date: 2025-11-10 - Comment: Compression/Efficiency: aggressive token merging for 3D Transformers, reducing tokens by 90–95%, with architectural efficiency insights.
AILA--First Experiments with Localist Language Models - Score: 16 (R=9, N=7) - Date: 2025-11-06 - Comment: Model Architecture and Representation Learning: introduces a controllable locality dial in transformers to interpolate between localist and distributed representations without retraining.
Apriel-H1: Towards Efficient Enterprise Reasoning Models - Score: 16 (R=9, N=7) - Date: 2025-11-05 - Comment: Model Architecture and Efficiency: hybrid SSM–Transformer via distillation replacing attention layers with Mamba to reduce KV-cache needs and boost throughput.
Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects - Score: 16 (R=9, N=7) - Date: 2025-11-05 - Comment: High Performance Computing: NUMA-aware attention scheduling (Swizzled Head-first Mapping) exploiting intra-chiplet locality on disaggregated GPUs.
From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators - Score: 16 (R=9, N=7) - Date: 2025-11-05 - Comment: Conditional/Dynamic Networks: skip-block routing that ranks tokens and prunes computation in later layers of Transformer-based neural operators to cut FLOPs.
Soft Task-Aware Routing of Experts for Equivariant Representation Learning - Score: 16 (R=9, N=7) - Date: 2025-11-04 - Comment: MoE/Representation Learning: soft task-aware routing of expert projection heads to disentangle invariant vs equivariant representations.
Elastic Architecture Search for Efficient Language Models - Score: 16 (R=9, N=7) - Date: 2025-11-04 - Comment: Model Architecture/Efficiency: NAS for compact transformer LMs with dynamic modules (heads/dimensions) and per-block distillation.
Terminal Velocity Matching - Score: 16 (R=8, N=8) - Date: 2025-11-26 - Comment: Matches Model Architecture/Training: introduces Terminal Velocity Matching, a generalization of flow matching enabling one/few-step generative models with efficient kernels.
The Alexander-Hirschowitz theorem for neurovarieties - Score: 16 (R=8, N=8) - Date: 2025-11-26 - Comment: Model Architecture Theory: algebraic-geometry analysis of polynomial neural networks (neurovarieties) establishing identifiability and expected dimension.
Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport - Score: 16 (R=8, N=8) - Date: 2025-11-26 - Comment: Matches Model Architecture: optimal transport-based masked fusion for continual model merging that preserves task-specific structure.
Gate-level boolean evolutionary geometric attention neural networks - Score: 16 (R=8, N=8) - Date: 2025-11-25 - Comment: Model Architecture: Boolean-domain Transformer with XNOR-based Boolean attention and Boolean RoPE; emphasizes hardware-efficient discrete computation.
Deep Improvement Supervision - Score: 16 (R=8, N=8) - Date: 2025-11-24 - Comment: Model Architecture/Training Efficiency: proposes a new supervision scheme for tiny recursive models that cuts forward passes 18x; insights into latent reasoning akin to classifier-free guidance.
Walrus: A Cross-Domain Foundation Model for Continuum Dynamics - Score: 16 (R=8, N=8) - Date: 2025-11-21 - Comment: High Performance Computing + Model Architecture: transformer-based foundation model for continuum dynamics with harmonic-analysis stabilization, load-balanced distributed 2D/3D training, and compute-adaptive tokenization.
Symmetry-Aware Graph Metanetwork Autoencoders: Model Merging through Parameter Canonicalization - Score: 16 (R=8, N=8) - Date: 2025-11-18 - Comment: Model Architecture/Representation: parameter-space canonicalization exploiting permutation and scaling symmetries via ScaleGMN autoencoders for robust model merging.
Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: Model Architecture: adaptive initial residual connections in GNNs with theoretical guarantees preventing oversmoothing (Dirichlet energy bounded away from zero).
Physics-Informed Design of Input Convex Neural Networks for Consistency Optimal Transport Flow Matching - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: Model Architecture and Efficiency: physics-informed PICNN for OT flow matching with HJ residual; supports one-step and ODE sampling from the same potential.
Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning - Score: 15 (R=8, N=7) - Date: 2025-11-27 - Comment: Model Architecture: integrates graph topology into selected Transformer layers (hop-free), avoiding over-smoothing and exceeding GNN/Transformer expressivity.
Short-Range Oversquashing - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Matches Model Architecture analysis: disentangles short-range oversquashing (bottlenecks) vs long-range vanishing gradients in MPNNs and shows transformer advantages.
Rethinking Message Passing Neural Networks with Diffusion Distance-guided Stress Majorization - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Matches Model Architecture: new MPNN objective via diffusion distance–guided stress majorization with orthogonal regularization to address over-smoothing/correlation.
Physics-informed Neural Operator Learning for Nonlinear Grad-Shafranov Equation - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Model Architecture/Efficiency: Physics-Informed Neural Operator with a Transformer–KAN variant and semi-supervised physics constraints achieving millisecond inference and robust generalization.
Resolving Node Identifiability in Graph Neural Processes via Laplacian Spectral Encodings - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Representation/Architecture: invariant Laplacian spectral encodings with theory surpassing WL limits for node identifiability.
Learning Solution Operators for Partial Differential Equations via Monte Carlo-Type Approximation - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Matches Model Architecture: Monte Carlo-type neural operator that directly approximates kernel integrals, enabling resolution generalization.
SAOT: An Enhanced Locality-Aware Spectral Transformer for Solving PDEs - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Model Architecture/Efficiency: Wavelet Attention (linear complexity) fused with Fourier Attention in a spectral Transformer (SAOT).
Reduced-Basis Deep Operator Learning for Parametric PDEs with Independently Varying Boundary and Source Data - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Matches Model Architecture and Efficiency: reduced-basis DeepONet with label-free residual training and certified RB trunk for parametric PDE operators.
DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Architecture/Efficiency: continual transformer for streaming with linear per-layer compute and redundancy-free inference.
AutoSAGE: Input-Aware CUDA Scheduling for Sparse GNN Aggregation (SpMM/SDDMM) and CSR Attention - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: High Performance Computing: input-aware CUDA scheduler for sparse SpMM/SDDMM and CSR attention with on-device micro-probes and caching for kernel selection.
Spanning Tree Autoregressive Visual Generation - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Model Architecture: proposes spanning-tree autoregressive ordering for visual generation to retain bidirectional context and flexible conditioning without changing core AR architecture.
Topologic Attention Networks: Attending to Direct and Indirect Neighbors through Gaussian Belief Propagation - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Model Architecture: introduces topologic attention derived from Gaussian belief propagation to attend over indirect neighbors, extending GNN receptive fields with improved scalability.
ManifoldFormer: Geometric Deep Learning for Neural Dynamics on Riemannian Manifolds - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Model Architecture and Representation Learning: introduces a Riemannian VAE and geodesic-aware attention Transformer operating on manifolds, plus neural-ODE dynamics for manifold-constrained evolution.
ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Matches Model Architecture criterion: reformulates ViT as an ODE system with stability guarantees and a plug-and-play attention layer, plus teacher-student guidance of continuous trajectories.
Gauge-Equivariant Graph Networks via Self-Interference Cancellation - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Model Architecture: gauge-equivariant GNN with projection-based self-interference cancellation for heterophilous graphs.
GLOBE: Accurate and Generalizable PDE Surrogates using Domain-Inspired Architectures and Equivariances - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Model Architecture: domain-inspired, equivariant neural surrogate for PDEs with global receptive field and compact parameterization (117k params).
Splat Regression Models - Score: 15 (R=8, N=7) - Date: 2025-11-19 - Comment: Model Architecture: introduces Splat Regression Models (mixtures of anisotropic splats) with WFR gradient flows; unifies Gaussian Splatting.
Complex-Weighted Convolutional Networks: Provable Expressiveness via Complex Diffusion - Score: 15 (R=8, N=7) - Date: 2025-11-19 - Comment: Model Architecture/Theory: complex-weighted diffusion on graphs with provable expressiveness; new GNN framework (CWCN).
Tab-PET: Graph-Based Positional Encodings for Tabular Transformers - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Model Architecture: graph-derived positional encodings for tabular Transformers with theory (effective-rank reduction).
Are Graph Transformers Necessary? Efficient Long-Range Message Passing with Fractal Nodes in MPNNs - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Model Architecture: augments MPNNs with fractal nodes to enable efficient long-range message passing and mitigate over-squashing, challenging the need for graph Transformers.
X-VMamba: Explainable Vision Mamba - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Model Architecture/Interpretability: controllability-based Jacobian/Gramian analysis for Vision Mamba (SSMs), giving attention-like insights with linear complexity.
BlinDNO: A Distributional Neural Operator for Dynamical System Reconstruction from Time-Label-Free data - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Model Architecture: permutation-invariant distribution-to-function neural operator with attention-based mixer for inverse dynamics without time labels.
Sumudu Neural Operator for ODEs and PDEs - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Model Architecture: proposes Sumudu Neural Operator, a new operator class for ODE/PDE solving and zero-shot super-resolution.
PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Model Architecture/Representation Learning: analyzes multimodal RoPE’s induced time kernel and introduces a training-free, multi-phase aggregation to stabilize attention under temporal shifts (theoretical Lipschitz stability).
Test-Time Steering for Lossless Text Compression via Weighted Product of Experts - Score: 15 (R=8, N=7) - Date: 2025-11-17 - Comment: Compression/Efficiency: weighted product-of-experts to steer neural compressors at test time with performance guarantees.
Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning - Score: 15 (R=8, N=7) - Date: 2025-11-14 - Comment: Model Architecture: differentiable dual-memory architecture with orthogonal regularization and adaptive pruning/expansion for continual learning.
Walsh-Hadamard Neural Operators for Solving PDEs with Discontinuous Coefficients - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Model Architecture: introduces a Walsh-Hadamard Neural Operator, a new spectral operator architecture complementary to FNOs for PDEs with discontinuities.
Recursive Dynamics in Fast-Weights Homeostatic Reentry Networks: Toward Reflective Intelligence - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Model Architecture: fast-weights with homeostatic reentry enabling internal recurrence and controlled reflective dynamics.
Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Model Architecture: adaptive-depth GNN selecting node-specific aggregation depths to handle varying homophily.
Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Model Architecture/Representation Learning: shows positional encoding enables universal approximation for vocabulary in-context learning in Transformers with theoretical conditions.
Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Model Architecture and Efficiency: reformulates Physics-Attention as linear attention and proposes a linear-attention neural operator with reduced complexity.
Mixtures of SubExperts for Large Language Continual Learning - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Matches Model Architecture: sparse Mixture-of-SubExperts for PEFT-based continual learning in LLMs.
From Kernels to Attention: A Transformer Framework for Density and Score Estimation - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Model Architecture and Representation Learning: transformer with permutation/affine equivariance linking attention to KDE for density/score estimation.
Discrete Bayesian Sample Inference for Graph Generation - Score: 15 (R=8, N=7) - Date: 2025-11-06 - Comment: Model Architecture/Generative Modeling: Bayesian Sample Inference for discrete graphs with SDE formulation linking to diffusion/flow, enabling one-shot graph generation.
A Non-Adversarial Approach to Idempotent Generative Modelling - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Model Architecture/Training: non-adversarial idempotent generative modeling (IMLE + reconstruction) improving manifold projection and sample quality.
OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Matches Model Architecture: proposes conditioned continuous neural fields with multimodal crosstalk blocks and iterative cross-modal refinement for robust multimodal spatiotemporal learning.
DoFlow: Causal Generative Flows for Interventional and Counterfactual Time-Series Prediction - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Matches Model Architecture/Representation: flow-based generative model structured by a causal DAG enabling interventional and counterfactual forecasting.
Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Model Architecture/Theory: proposes structured world models from HMMs and switching LDS (and controlled variants) as modular building blocks; discusses scalable structure learning.
EchoLSTM: A Self-Reflective Recurrent Network for Stabilizing Long-Range Memory - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Model Architecture: introduces output-conditioned gating in an LSTM (with attention) to stabilize long-range memory via self-reflective feedback.
Optimal Attention Temperature Enhances In-Context Learning under Distribution Shift - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Training dynamics/architecture: theoretical and empirical analysis of optimal attention temperature to improve ICL robustness under distribution shift.
One model to solve them all: 2BSDE families via neural operators - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Model Architecture: generative neural operator variant (via Kolmogorov–Arnold networks) with approximation guarantees for families of 2BSDEs.
Hydra: Dual Exponentiated Memory for Multivariate Time Series Analysis - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Model Architecture and training efficiency: dual-headed 2D recurrent memory with a 2D chunk-wise training algorithm for multivariate time series.

Model Compression and Efficiency (163)

Virtual Width Networks - Score: 19 (R=10, N=9) - Date: 2025-11-18 - Comment: Model Architecture and Efficiency: Virtual Width Networks decouple representational width from backbone compute, with scaling relation for loss reduction.
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference - Score: 19 (R=10, N=9) - Date: 2025-11-11 - Comment: Compression/Efficiency and HPC: Shared KV Attention transforming memory-bound KV cache ops to compute-bound GEMMs with MoE-inspired sparse attention and disaggregated infrastructure.
Simplex-FEM Networks (SiFEN): Learning A Triangulated Function Approximator - Score: 19 (R=10, N=9) - Date: 2025-11-10 - Comment: Model Architecture (FEM-based piecewise-polynomial network on learned simplicial mesh) with explicit sparsity/locality for efficiency.
TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training - Score: 19 (R=10, N=9) - Date: 2025-11-07 - Comment: Model Compression and Efficiency: distributed training-time sparsification via independent subnetwork training and aggregation enabling zero-cost, structured pruning; also an HPC-oriented distributed framework.
Continuous Autoregressive Language Models - Score: 19 (R=10, N=9) - Date: 2025-11-04 - Comment: Model Architecture and Efficiency: replaces next-token with next-vector prediction via high-fidelity autoencoding to increase semantic bandwidth and reduce generation steps.
TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control - Score: 19 (R=10, N=9) - Date: 2025-11-04 - Comment: Compression/Efficiency: end-to-end 4-bit fully-quantized training using NVFP4 with new double-block quantization, oscillation suppression, and outlier control.
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference - Score: 18 (R=10, N=8) - Date: 2025-11-27 - Comment: Quantization/Efficiency: fully integer attention (IndexSoftmax, LUT-based) eliminating dequantize/softmax bottleneck; plug-and-play without retraining.
Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning - Score: 18 (R=10, N=8) - Date: 2025-11-26 - Comment: Strongly matches Compression/Efficiency: information-theoretic adaptive structural pruning for VLMs (eRank, KS distance) plus training-free low-rank FFN compression.
SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression - Score: 18 (R=10, N=8) - Date: 2025-11-26 - Comment: Model Compression and Efficiency: decompression-free KV-cache compression via orthogonal rotation and pruning with runtime-tunable compression level.
Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost - Score: 18 (R=10, N=8) - Date: 2025-11-26 - Comment: Strongly matches Compression/Efficiency and Systems: 2-bit KV cache quantization with dynamic channel-wise precision boosts and page-centric kernels/layout for high-throughput inference.
GateRA: Token-Aware Modulation for Parameter-Efficient Fine-Tuning - Score: 18 (R=10, N=8) - Date: 2025-11-26 - Comment: Model Architecture + Efficiency: token-aware gating of PEFT branches (LoRA/DoRA/HiRA) with entropy regularization, yielding dynamic, conditional updates at token level.
Adaptive Layer-Wise Transformations for Post-Training Quantization of Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-11-25 - Comment: Strong match to Model Compression and Efficiency: adaptive layer-wise transformations for post-training quantization of LLMs (addresses outliers heterogeneously).
$A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving - Score: 18 (R=10, N=8) - Date: 2025-11-25 - Comment: Compression/Efficiency: attention-aware KV cache fusion for LLM serving to cut latency/memory.
Change-of-Basis Pruning via Rotational Invariance - Score: 18 (R=10, N=8) - Date: 2025-11-21 - Comment: Strongly matches Model Compression/Efficiency criterion: change-of-basis structured pruning enabled by rotationally invariant activations (TSRAs) to concentrate importance and prune effectively.
Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference - Score: 18 (R=10, N=8) - Date: 2025-11-21 - Comment: Model Compression and Efficiency: adaptive, runtime expert quantization for MoE inference under strict HBM budgets.
Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs - Score: 18 (R=10, N=8) - Date: 2025-11-20 - Comment: Model Compression and Efficiency: hardware-agnostic low-bit quantization via progressive fake quantization and reverse pruning for robust deployment.
Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-11-20 - Comment: Model Compression and Efficiency: progressive KV-cache and context compression via summary tokens enabling near-linear long-context Transformer inference.
SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization - Score: 18 (R=10, N=8) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: ultra-low-bit LLM quantization using spectral (Fourier) decomposition and adaptive truncation for weights/activations.
EcoSpa: Efficient Transformer Training with Coupled Sparsity - Score: 18 (R=10, N=8) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: structured sparsity with coupled pruning of multiplicatively interacting Transformer weight pairs for efficient training/inference.
EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training - Score: 18 (R=10, N=8) - Date: 2025-11-14 - Comment: Compression/Efficiency and HPC systems: entropy-driven dynamic gradient compression for distributed LLM training.
BayesQ: Uncertainty-Guided Bayesian Quantization - Score: 18 (R=10, N=8) - Date: 2025-11-13 - Comment: Matches Model Compression and Efficiency: Bayesian post-training quantization optimizing posterior-expected loss with mixed-precision allocation.
Extreme Model Compression with Structured Sparsity at Low Precision - Score: 18 (R=10, N=8) - Date: 2025-11-12 - Comment: Direct hit on 'Model Compression and Efficiency': combines structured sparsity with low-bit quantization using a training-time angular-alignment regularizer for extreme compression.
Rethinking Parameter Sharing as Graph Coloring for Structured Compression - Score: 18 (R=10, N=8) - Date: 2025-11-11 - Comment: Model Compression and Efficiency: cross-layer parameter sharing cast as graph coloring with Hessian-based geometric criterion (structured compression).
Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving - Score: 18 (R=10, N=8) - Date: 2025-11-11 - Comment: Compression/Efficiency: adaptive layer- and time-aware KV cache pruning with relevance-aware retention for long-form LLM reasoning.
MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling - Score: 18 (R=10, N=8) - Date: 2025-11-11 - Comment: Model Compression and Efficiency: FP8 training with microscaling and automatic scaling for throughput and numerical stability.
Attention and Compression is all you need for Controllably Efficient Language Models - Score: 18 (R=10, N=8) - Date: 2025-11-10 - Comment: Matches Model Architecture and Efficiency: Compress & Attend Transformer uses dense attention over compressed context for controllable compute-memory tradeoffs and test-time adaptivity via multi-chunk training.
Block Rotation is All You Need for MXFP4 Quantization - Score: 18 (R=10, N=8) - Date: 2025-11-07 - Comment: Model Compression: PTQ under MXFP4 (FP4) with a block-rotation strategy resolving incompatibility with power-of-two block scaling.
DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization - Score: 18 (R=10, N=8) - Date: 2025-11-07 - Comment: Model Compression and Efficiency: distribution-aware rotational calibration (DartQuant) with efficient QR-orth optimization for LLM quantization at large scale.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators - Score: 18 (R=10, N=8) - Date: 2025-11-06 - Comment: Model Compression/Efficiency and High Performance Computing: deployable sparse KV attention/KV-cache compression compatible with static-graph, continuous-batching accelerators at 128k contexts.
Memory-Efficient Training with In-Place FFT Implementation - Score: 18 (R=10, N=8) - Date: 2025-11-04 - Comment: High Performance Computing: first real-domain fully in-place FFT that preserves memory layout, eliminating intermediate buffers to reduce training memory usage.
On the Origin of Algorithmic Progress in AI - Score: 17 (R=9, N=8) - Date: 2025-11-27 - Comment: Model Compression and Efficiency: analyzes scale-dependent algorithmic efficiency via compute-optimal scaling laws (LSTM→Transformer) explaining large training FLOP gains.
SUPN: Shallow Universal Polynomial Networks - Score: 17 (R=9, N=8) - Date: 2025-11-27 - Comment: Model Architecture/Efficiency: shallow universal polynomial networks replace deep stacks with a single polynomial layer, with approximation guarantees and fewer parameters.
G-Net: A Provably Easy Construction of High-Accuracy Random Binary Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-11-27 - Comment: Compression/Efficiency & Quantization: randomized binary neural networks (EHD G-Nets) with theoretical accuracy guarantees, bridging NNs and hyperdimensional computing.
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression - Score: 17 (R=9, N=8) - Date: 2025-11-27 - Comment: Model Architecture/Efficiency: a fading-memory layer solving online ridge regression at test time via gated adaptive regularization and Chebyshev iteration; long-context gains.
Length-MAX Tokenizer for Language Models - Score: 17 (R=9, N=8) - Date: 2025-11-27 - Comment: Model Compression and Efficiency: tokenizer optimizing average token length to cut token count and KV-cache, reducing training steps and inference latency.
CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: Matches Compression/Efficiency: calibration-free post-training quantization via learned (structured/dual) transformations and adaptive rounding without calibration data.
ModHiFi: Identifying High Fidelity predictive components for Model Modification - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: Strongly matches Model Compression and Efficiency: data-/gradient-free component importance (Subset Fidelity) enabling structured pruning and unlearning without training data.
VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: Model Compression and Efficiency: storage-aware activation sparsification via neuron chunking that couples neuron importance with flash I/O latency.
Layer-Wise High-Impact Parameter Ratio Optimization in Post-Training Quantization for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-11-25 - Comment: Compression/Efficiency: layer-wise optimization of high-impact parameter ratios for extreme PTQ of LLMs with inter-layer dependencies.
PocketLLM: Ultimate Compression of Large Language Models via Meta Networks - Score: 17 (R=9, N=8) - Date: 2025-11-25 - Comment: Matches Model Compression and Efficiency: compresses LLM weights via latent codebook + meta-network decoder enabling extreme model compression.
Evolution Strategies at the Hyperscale - Score: 17 (R=9, N=8) - Date: 2025-11-21 - Comment: High Performance Computing + Efficiency: low-rank evolution strategies enabling scalable, backprop-free training for large networks.
D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models - Score: 17 (R=9, N=8) - Date: 2025-11-20 - Comment: Model Compression and Efficiency: first data-free quantization framework tailored for CLIP with semantic/diverse synthetic data.
AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs - Score: 17 (R=9, N=8) - Date: 2025-11-20 - Comment: Strongly matches compression/efficiency criterion via adaptive object-aware token compression for MLLMs.
Compute-in-Memory Implementation of State Space Models for Event Sequence Processing - Score: 17 (R=9, N=8) - Date: 2025-11-20 - Comment: High Performance Computing/Efficiency: algorithm–hardware co-design mapping state space models onto memristor-based CIM with reparameterization for real-valued coefficients.
CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design - Score: 17 (R=9, N=8) - Date: 2025-11-19 - Comment: High Performance Computing/Efficiency: algorithm–system co-design for KVCache offloading (GPU-centric sync, zero-copy, on-GPU caching).
Improved Convergence in Parameter-Agnostic Error Feedback through Momentum - Score: 17 (R=9, N=8) - Date: 2025-11-19 - Comment: High Performance Computing and Efficiency: parameter-agnostic error feedback with momentum for compressed distributed training; provides convergence bounds without problem-dependent tuning.
10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training - Score: 17 (R=9, N=8) - Date: 2025-11-19 - Comment: Matches High Performance Computing: heterogeneous GPU/CPU/NVMe tensor caching, prefetching, and buffer reuse to accelerate LLM training.
Weight-sparse transformers have interpretable circuits - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Model Compression and Efficiency + Representation Learning: train/prune weight-sparse Transformers to yield interpretable circuits and analyze capability–interpretability tradeoffs.
OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: quantization framework (SEFP) that enables switching among precisions after a single tuning pass, with robustness across bit-widths.
Connectivity-Guided Sparsification of 2-FWL GNNs: Preserving Full Expressivity with Improved Efficiency - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Model Compression/Efficiency: topology-guided sparsification of higher-order GNNs preserving full 2-FWL expressivity with theory.
Fast and Expressive Multi-Token Prediction with Probabilistic Circuits - Score: 17 (R=9, N=8) - Date: 2025-11-17 - Comment: Compression/Efficiency: probabilistic-circuit-based multi-token prediction exploring expressiveness–latency trade-offs and partial layer sharing for faster LLM decoding.
SVD-NO: Learning PDE Solution Operators with SVD Integral Kernels - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: Architecture + Low-rank Efficiency: parameterizes neural operator kernels via SVD with learned singular functions and values, enforcing orthonormality for an expressive yet efficient low-rank operator.
Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: Model Compression and Efficiency: jointly explores weight pruning and coreset selection with a new SWaST mechanism and state preservation to stabilize simultaneous reduction.
DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones - Score: 17 (R=9, N=8) - Date: 2025-11-13 - Comment: Model Compression and Efficiency/HPC: adaptive KV-cache clustering, continuity-centric flash management, and cache virtualization for accurate, low-latency long-sequence decoding on smartphones.
NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization - Score: 17 (R=9, N=8) - Date: 2025-11-12 - Comment: HPC/Efficiency: reformulates CLIP contrastive normalizer via convex/variational analysis and learns a neural normalizer, enabling efficient large-scale training with small batches.
The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Compression/Efficiency + Representation: discovers few-layer dominance in time-series models and proposes retaining dominant layers, yielding large parameter reduction and speedups.
LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Compression/Efficiency: training-free instance-level dynamic selection and merging of multiple LoRA adapters at inference time.
P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Matches HPC and Compression/Efficiency: NPU–PIM co-design with mixed-precision quantization and operator fusion for LLM inference.
MobileLLM-Pro Technical Report - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Compression/Efficiency: on-device LLM with implicit positional distillation for long context, specialist model merging without parameter growth, and 4-bit QAT.
Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: High Performance Computing/Efficiency: precision-scalable microscaling datapaths with optimized reduction tree and NPU integration for mixed-precision MACs.
Linear Gradient Prediction with Control Variates - Score: 17 (R=9, N=8) - Date: 2025-11-10 - Comment: Training efficiency: control-variate-based linear gradient prediction (NTK-inspired) enabling unbiased updates without full backpropagation.
Deep Progressive Training: scaling up depth capacity of zero/one-layer models - Score: 17 (R=9, N=8) - Date: 2025-11-10 - Comment: High Performance Computing/Efficiency: progressive depth expansion (zero/one-layer) with theoretical guidance for compute-efficient training of deep models.
FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow - Score: 17 (R=9, N=8) - Date: 2025-11-10 - Comment: HPC/Systems for sparse DL: a compiler enabling cross-expression fusion, dataflow ordering, and sparsity blocking on reconfigurable dataflow architectures.
Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-11-07 - Comment: Matches Model Compression and Efficiency: low-rank tensor decompositions for CNN compression with data-informed norms and new ALS algorithms minimizing function-space error; reduces or removes fine-tuning.
Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing - Score: 17 (R=9, N=8) - Date: 2025-11-07 - Comment: Model Compression and Efficiency: autoregressive-aware split computing with mixed-precision quantization and adaptive intermediate compression under memory/latency constraints for LLMs.
Towards Scalable Backpropagation-Free Gradient Estimation - Score: 17 (R=9, N=8) - Date: 2025-11-06 - Comment: HPC/Efficiency: backpropagation-free gradient estimation via forward-mode with reduced bias/variance, aiming to scale training without backward passes or activation storage.
Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Model Compression and Efficiency: selectively sparsified low-rank storage of fine-tuning updates leveraging interleaved singular vector importance (sparsity + low-rank).
A new class of Markov random fields enabling lightweight sampling - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Matches Efficiency and Model Architecture: introduces a new class of MRFs via mapping from GMRFs enabling lightweight, much faster sampling than Gibbs.
Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Matches Efficiency/Optimization: Bayesian natural gradient fine-tuning via Kalman filtering approximates NGD for large CLIP models, improving training efficiency and robustness.
MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: High-Efficiency Training: memory-efficient LLM optimization via module-wise importance sampling with variance reduction, convergence guarantees, and favorable memory analysis.
A Saddle Point Remedy: Power of Variable Elimination in Non-convex Optimization - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Optimization Theory: explains why variable elimination (VarPro) reshapes non-convex landscapes (saddle-to-maxima) and guides robust, efficient training algorithm design.
Energy-Efficient Deep Learning Without Backpropagation: A Rigorous Evaluation of Forward-Only Algorithms - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Training/Efficiency: forward-only learning as a backprop-free alternative with hardware-validated energy and speed gains.
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Model Compression and Efficiency: KV-cache management leveraging head-wise temporal stability to offload/re-rank pages for memory/latency gains in LLM serving.
Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: High Performance Computing: training efficiency via shared-prefix reuse (tree packing + gradient restoration) for agentic LLMs.
Reject Only Critical Tokens: Pivot-Aware Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Efficiency: pivot-aware speculative decoding that rejects only utility-critical tokens via a lightweight classifier, yielding higher acceptance and speedups.
Calibrating and Rotating: A Unified Framework for Weight Conditioning in PEFT - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Matches Compression/Efficiency: PEFT via learnable weight conditioning (diagonal calibration and orthogonal rotations), clarifying DoRA and improving LoRA.
CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Model Compression and Efficiency: on-the-fly speculative decoding with layer sparsity and activation quantization plus dynamic cascade routing for faster inference.
SpecAttn: Speculating Sparse Attention - Score: 17 (R=9, N=8) - Date: 2025-11-03 - Comment: Compression/Efficiency: training-free sparse attention via speculative decoding, with KV-cache pruning and alignment—algorithmic inference efficiency improvement.
FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2025-11-03 - Comment: Compression/Efficiency: gradient-free, single-forward-pass parameter selection (magnitude × activation) for memory-efficient PEFT.
Tokenisation over Bounded Alphabets is Hard - Score: 17 (R=8, N=9) - Date: 2025-11-20 - Comment: Matches algorithmic/theoretical efficiency criterion via hardness and approximability results for tokenizer design in foundation models.
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs - Score: 16 (R=9, N=7) - Date: 2025-11-27 - Comment: Model Compression/Efficiency: Progressive Visual Compression (refined patch embedding + windowed token compression) for native-resolution ViT encoding, reducing TTFT.
Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models - Score: 16 (R=9, N=7) - Date: 2025-11-27 - Comment: Efficiency via pruning: entropy-guided, adaptive block-level pruning for diffusion/flow generative models; zero-shot adaptive schedule.
EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning - Score: 16 (R=9, N=7) - Date: 2025-11-26 - Comment: Compression/Efficiency: propagation-aware pruning (Foresight Mask) integrated with LoRA via a one-step Partial Brain Surgeon update to produce sparse, domain-adapted experts.
FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning - Score: 16 (R=9, N=7) - Date: 2025-11-26 - Comment: Model Compression and Efficiency: single-step RL for discovering non-uniform layer-wise sparsity allocations for LLM pruning.
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach - Score: 16 (R=9, N=7) - Date: 2025-11-24 - Comment: Model Compression and Efficiency: proposes frequency-domain, outlier-KV-aware KV cache compression for multimodal LLMs with dynamic per-layer budget; compatible with FlashAttention kernels.
PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants - Score: 16 (R=9, N=7) - Date: 2025-11-21 - Comment: High Performance Computing: fused CUDA operators and kernel-level optimizations for KAN variants to boost GPU utilization.
Weight Variance Amplifier Improves Accuracy in High-Sparsity One-Shot Pruning - Score: 16 (R=9, N=7) - Date: 2025-11-20 - Comment: Matches Compression/Efficiency: proposes a variance-amplifying regularizer to improve robustness under high-sparsity one-shot pruning, directly addressing pruning and sparsity.
Likelihood-guided Regularization in Attention Based Models - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Compression/Efficiency: likelihood-guided variational Ising regularization for ViTs enabling structured sparsity and dynamic pruning.
CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Compression/Efficiency: curvature-adaptive optimizer using periodic low-rank Hessian sketching to precondition gradients with theoretical guarantees.
BitSnap: Checkpoint Sparsification and Quantization in LLM Training - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: checkpoint sparsification and quantization tailored to LLM training for storage/memory/fault-tolerance efficiency.
Coordinate Descent for Network Linearization - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: discrete optimization (coordinate descent) to reduce ReLU count for sparsity/latency in private inference.
Beyond One-Way Pruning: Bidirectional Pruning-Regrowth for Extreme Accuracy-Sparsity Tradeoff - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: bidirectional pruning–regrowth strategy to recover accuracy at extreme sparsity, improving accuracy–sparsity tradeoff.
On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization - Score: 16 (R=9, N=7) - Date: 2025-11-17 - Comment: Compression/Efficiency/HPC: backprop-free zeroth-order on-device fine-tuning enabling larger models under strict memory constraints.
Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning - Score: 16 (R=9, N=7) - Date: 2025-11-13 - Comment: Matches Model Compression and Efficiency: adaptive visual token and KV cache pruning for VideoLLMs (sparsity/pruning).
Alignment-Aware Quantization for LLM Safety - Score: 16 (R=9, N=7) - Date: 2025-11-13 - Comment: Model compression and efficiency: post-training quantization with alignment-preserving contrastive loss to retain safety alignment under low-bit PTQ.
Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits - Score: 16 (R=9, N=7) - Date: 2025-11-13 - Comment: Model compression/efficiency: dynamic structured pruning with alignment-aware circuit preservation for safe LLM inference.
A Generalized Spectral Framework to Expain Neural Scaling and Compression Dynamics - Score: 16 (R=9, N=7) - Date: 2025-11-12 - Comment: Develops a unified spectral framework connecting learning dynamics and compression—fits 'Representation Learning' and 'Compression/Efficiency' theory.
SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs - Score: 16 (R=9, N=7) - Date: 2025-11-12 - Comment: Representation Learning + Sparsity: introduces a benchmark for interaction sparsity across SAEs and proposes Staircase SAEs to enforce sparse cross-layer connectivity.
MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Model Compression and Efficiency: structured block pruning guided by mutual information plus progressive channel slicing and KD.
QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Compression/Efficiency/HPC: quantization-enabled circuit sharing for nonlinear ops in Transformers on FPGAs, reducing latency and resources.
Rank-1 LoRAs Encode Interpretable Reasoning Signals - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Model Compression and Efficiency: exploits low-rank (rank-1) LoRA adapters; Representation Learning: analyzes interpretable features via sparse autoencoders.
CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Model Compression and Efficiency: pruning via Hessian-vector approximation and cyclic pair merging for resource-constrained deployment.
APP: Accelerated Path Patching with Task-Specific Pruning - Score: 16 (R=9, N=7) - Date: 2025-11-10 - Comment: Matches Model Compression/Efficiency: contrastive attention-head pruning (sparsity/pruning) to reduce search space and compute for circuit discovery; architecture-level head selection informed by causal mediation.
Efficient Neural Networks with Discrete Cosine Transform Activations - Score: 16 (R=9, N=7) - Date: 2025-11-07 - Comment: Model Compression and Efficiency: DCT-parameterized adaptive activations enable structured, coefficient-level pruning and compact, interpretable networks.
Efficiently Training A Flat Neural Network Before It has been Quantizated - Score: 16 (R=9, N=7) - Date: 2025-11-05 - Comment: Compression/Efficiency: pre-conditioning for PTQ via noise-injection to reach flat minima, modeling activation/weight quantization errors.
Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding - Score: 16 (R=9, N=7) - Date: 2025-11-04 - Comment: Model Compression and Efficiency: theoretical and empirical analysis of quantized training with stochastic rounding, highlighting batch size interactions and variance sources.
FLoRA: Fused forward-backward adapters for parameter efficient fine-tuning and reducing inference-time latencies of LLMs - Score: 16 (R=9, N=7) - Date: 2025-11-04 - Comment: Matches Compression/Efficiency: fused forward–backward adapters for PEFT that also reduce inference latency.
FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection - Score: 16 (R=8, N=8) - Date: 2025-11-26 - Comment: Matches Compression/Efficiency: DNN-free coreset selection via frequency-domain distribution matching (Characteristic Function Distance) with topology-aware constraints.
Flow Map Distillation Without Data - Score: 16 (R=8, N=8) - Date: 2025-11-25 - Comment: Matches Model Compression/Efficiency: data-free distillation of flow maps to accelerate sampling (algorithmic efficiency for generative models).
Efficient Penalty-Based Bilevel Methods: Improved Analysis, Novel Updates, and Flatness Condition - Score: 16 (R=8, N=8) - Date: 2025-11-24 - Comment: Matches Optimization/Efficiency for training: improved penalty-based bilevel methods with larger steps, single-loop updates, and a flatness condition.
AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training - Score: 16 (R=8, N=8) - Date: 2025-11-19 - Comment: Matches Training Dynamics/Efficiency: replaces L2 weight decay in AdamW with decoupled Huber decay, improving convergence and inducing sparsity useful for pruning.
On the Gradient Complexity of Private Optimization with Private Oracles - Score: 16 (R=8, N=8) - Date: 2025-11-19 - Comment: Efficiency Theory: lower bounds on DP optimization gradient complexity and limits of gradient quantization.
ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation - Score: 16 (R=8, N=8) - Date: 2025-11-12 - Comment: Matches Compression/Efficiency and Representation Learning via multimodal data condensation in ImageBind space with characteristic-function loss for exact moment matching.
C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: High Performance Computing/Efficiency: cascaded LLM inference with probabilistic cost constraints and conformal guarantees; self-supervised optimization.
DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: Matches Efficiency/HPC: optimizer with dynamic Kronecker approximation of Fisher for scalable gradient preconditioning.
Accelerated Frank-Wolfe Algorithms: Complementarity Conditions and Sparsity - Score: 16 (R=8, N=8) - Date: 2025-11-05 - Comment: Optimization for Sparsity/Low-rank: accelerated Frank–Wolfe with complementarity conditions yielding sparsity- and rank-aware complexity guarantees.
Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold - Score: 16 (R=8, N=8) - Date: 2025-11-05 - Comment: Training Dynamics/Optimization: shows Adam minimizes a distinct sharpness measure via adaptive updates (SDE analysis), contrasting SGD and extending to other adaptive methods.
Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving - Score: 16 (R=8, N=8) - Date: 2025-11-05 - Comment: High Performance Computing and Efficiency: multi-agent Graph-CoT decomposition plus LLM serving optimizations (graph-specific KV-cache management, priority eviction, pipelining) for lower tokens/latency and higher throughput.
Real-time Continual Learning on Intel Loihi 2 - Score: 16 (R=8, N=8) - Date: 2025-11-05 - Comment: Matches Architecture+HPC/Efficiency: SNN with local learning, normalization, neurogenesis on neuromorphic hardware (Loihi 2) enabling real-time, energy-efficient continual learning.
Frequency-Aware Token Reduction for Efficient Vision Transformer - Score: 15 (R=8, N=7) - Date: 2025-11-27 - Comment: Model Compression and Efficiency: frequency-aware token reduction for ViTs that mitigates rank collapse/over-smoothing while lowering compute.
A Fully Probabilistic Tensor Network for Regularized Volterra System Identification - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Compression/Efficiency: CP tensor network representation of Volterra kernels with sparsity-inducing Bayesian priors for automatic rank selection and reduced complexity.
On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Matches Compression/Efficiency: on-demand multi-task sparsity with block-level reuse to minimize task-switching I/O on edge devices.
Comprehensive Design Space Exploration for Tensorized Neural Network Hardware Accelerators - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Closely matches 'Model Compression and Efficiency' via tensor decomposition (low-rank/tensorized models) and latency-driven optimization; also matches 'High Performance Computing' with a unified co-exploration of contraction paths, hardware architecture, and dataflow mapping for accelerators.
Adaptive Mesh-Quantization for Neural PDE Solvers - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Model Compression and Efficiency: adaptive bit-width quantization across mesh nodes/edges/clusters for neural PDE solvers via auxiliary difficulty predictor.
Accelerating Time Series Foundation Models with Speculative Decoding - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Efficiency: adapts speculative decoding to continuous autoregressive time-series foundation models to cut sequential passes.
Learning Straight Flows: Variational Flow Matching for Efficient Generation - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Model Architecture/Efficiency: variational flow matching with latent code to enforce straight trajectories for near one-step generation.
DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Matches High Performance Computing/Efficiency: digital in-memory stochastic computing with compressed Bent-Pyramid format for efficient matrix operations.
Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Matches Model Efficiency/HPC: proposes energy scaling laws to predict GPU energy for diffusion inference across models and hardware.
NTK-Guided Implicit Neural Teaching - Score: 15 (R=8, N=7) - Date: 2025-11-20 - Comment: Model Compression and Efficiency: NTK-guided coordinate selection accelerates INR training via algorithmic efficiency gains.
Parameter Importance-Driven Continual Learning for Foundation Models - Score: 15 (R=8, N=7) - Date: 2025-11-20 - Comment: Compression/Efficiency: updates only 0.1% most important parameters via Fisher/second-order estimators to mitigate forgetting in continual post-training.
EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control - Score: 15 (R=8, N=7) - Date: 2025-11-20 - Comment: Training Dynamics: entropy stabilization in RLHF-style LLM training via proportional–integral control on loss coefficients.
Credal Ensemble Distillation for Uncertainty Quantification - Score: 15 (R=8, N=7) - Date: 2025-11-20 - Comment: Matches model compression/efficiency criterion by distilling deep ensembles into a single credal model with calibrated uncertainty.
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training - Score: 15 (R=8, N=7) - Date: 2025-11-19 - Comment: Matches Model Efficiency/HPC: distribution-aware speculative decoding using rollout-history-based drafter to speed RL rollouts without altering outputs.
SLMQuant: Benchmarking Small Language Model Quantization for Practical Deployment - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Model Compression and Efficiency: first benchmark focused on quantization of small language models; identifies sensitivity differences and design principles.
BSO: Binary Spiking Online Optimization Algorithm - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Compression/Efficiency: memory-efficient online training for Binary Spiking Neural Networks via flip-signal updates and temporal-aware thresholds with regret bounds.
Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Compression/Efficiency: LLM-as-a-compressor for prompt compression with GRPO/SFT to balance target compression rate and downstream fidelity.
Retrofit: Continual Learning with Bounded Forgetting for Security Applications - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Compression/Efficiency: parameter-level merging with low-rank and sparse updates enabling data-free continual learning with bounded forgetting.
Private Zeroth-Order Optimization with Public Data - Score: 15 (R=8, N=7) - Date: 2025-11-17 - Comment: Compression/Efficiency: private zeroth-order optimization leveraging public data to improve DP training utility and speed.
Black-Box On-Policy Distillation of Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-11-14 - Comment: Model Compression and Efficiency: introduces on-policy black-box distillation via a GAN-style discriminator, a new distillation paradigm without access to teacher logits/params.
Steering Pretrained Drafters during Speculative Decoding - Score: 15 (R=8, N=7) - Date: 2025-11-14 - Comment: Model Efficiency/HPC: improves speculative decoding via a lightweight dynamic alignment (steering vector) to increase token acceptance with negligible overhead.
Efficient Hyperdimensional Computing with Modular Composite Representations - Score: 15 (R=8, N=7) - Date: 2025-11-14 - Comment: Representation Learning + Efficiency/HW: introduces modular composite high-dimensional representations with analysis of capacity/accuracy and a dedicated accelerator for efficient implementation.
Factorization-in-Loop: Proximal Fill-in Minimization for Sparse Matrix Reordering - Score: 15 (R=8, N=7) - Date: 2025-11-13 - Comment: Matches High Performance Computing: learning-based sparse matrix reordering with proximal factorization-in-loop to directly minimize fill-in.
SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder - Score: 15 (R=8, N=7) - Date: 2025-11-13 - Comment: Representation Learning: leverages Sparse Autoencoders to decompose LLM representations into interpretable preference features. Compression/Efficiency: builds a lightweight reward model with <1% trainable parameters.
Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2025-11-12 - Comment: Curvature-aware zeroth-order fine-tuning with low-rank block-diagonal preconditioning and variance reduction—fits 'High Performance Computing / Efficiency' (memory and compute-efficient training of LLMs).
Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Model Architecture + Compression/Efficiency: proposes an equivariant PEFT adapter with per-order scalar gating that preserves symmetry for equivariant GNNs.
An Efficient Gradient-Aware Error-Bounded Lossy Compressor for Federated Learning - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Matches Compression/Efficiency for distributed training: gradient-aware error-bounded lossy compression tailored to FL communication.
Another BRIXEL in the Wall: Towards Cheaper Dense Features - Score: 15 (R=8, N=7) - Date: 2025-11-10 - Comment: Compression/Efficiency: knowledge distillation to approximate high-resolution dense feature maps at lower compute for vision foundation models.
Less Is More: Generating Time Series with LLaMA-Style Autoregression in Simple Factorized Latent Spaces - Score: 15 (R=8, N=7) - Date: 2025-11-10 - Comment: Model Architecture (disentangled/quantized latent space with AR Transformer) and Compression/Efficiency (discrete tokens for fast, arbitrary-length generation).
MDM: Manhattan Distance Mapping of DNN Weights for Parasitic-Resistance-Resilient Memristive Crossbars - Score: 15 (R=8, N=7) - Date: 2025-11-10 - Comment: Hardware-aware efficiency: weight mapping leveraging structured sparsity and spatial reordering to mitigate parasitic resistance in memristive CIM crossbars.
Optimizing Reasoning Efficiency through Prompt Difficulty Prediction - Score: 15 (R=8, N=7) - Date: 2025-11-07 - Comment: Conditional/Dynamic Networks and Efficiency: difficulty-aware routing to assign problems to the smallest adequate reasoning model to cut compute cost.
RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse - Score: 15 (R=8, N=7) - Date: 2025-11-06 - Comment: Model Efficiency/HPC: accuracy-preserving context reuse and caching for RAG to improve LLM prefill efficiency without degrading accuracy.
In Good GRACEs: Principled Teacher Selection for Knowledge Distillation - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Compression/Efficiency: principled teacher selection metric (GRACE) for knowledge distillation using student gradient properties without verifier/logits.
ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Optimization/Efficiency: zeroth-order finetuning with cone-restricted adaptive direction sampling (low-memory training with faster convergence).
In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Compression/Efficiency and HPC: in situ training of implicit neural compressors with sketch-based regularization (JL-motivated) to prevent forgetting under memory limits.
Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Inference Efficiency: adaptive reward prediction to control thinking length, early stop unpromising chains, and optimize model/compute selection.
When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Inference Efficiency: retrieval-enhanced speculative decoding with adaptive entropy trigger, feedback-driven candidate selection, and relaxed verification for speedup.
More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Model Compression and Efficiency: early-exit networks enhanced with hyperbolic geometry and hierarchical entailment loss to enforce coherent multi-exit representations and adaptive computation.
DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Model Compression and Efficiency: proposes a training-free decoding framework that selectively branches at high-entropy tokens with early stopping to shorten chain-of-thought while improving accuracy.
Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Matches Test-time Efficiency/HPC: formulates test-time scaling as an optimizable multi-LLM collaboration graph and searches architectures under compute budgets.
LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Matches Model Compression/Efficiency: LUT-based neurons within ViT to cut multiplications/memory with an FPGA accelerator.
Diluting Restricted Boltzmann Machines - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Matches Compression/Efficiency: pruning and sparsity analysis in RBMs, including limits of extreme pruning and lottery-ticket-like behavior.
H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Matches Compression/Efficiency and Systems: hierarchical dual-stage cache with lightweight similarity for accelerating diffusion model inference.
Category-Aware Semantic Caching for Heterogeneous LLM Workloads - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Efficiency/HPC (serving): category-aware semantic cache with adaptive thresholds/TTLs and hybrid in-memory HNSW to lower miss cost.

High Performance Computing (44)

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics - Score: 20.0 (R=0, N=0) - Date: 2025-11-13 - Comment: Author match
Parallel Sampling via Autospeculation - Score: 19 (R=10, N=9) - Date: 2025-11-12 - Comment: High Performance Computing: speculative rejection sampling (autospeculation) achieving parallel sampling in Õ(n^{1/2}) time for autoregressive and diffusion models.
Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch - Score: 18 (R=10, N=8) - Date: 2025-11-25 - Comment: HPC/Systems: TP-invariant matmul/reduction kernels (tree-based) enabling bitwise deterministic inference across tensor parallel sizes.
Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference - Score: 18 (R=10, N=8) - Date: 2025-11-13 - Comment: Model architecture and efficiency: Mixture-of-Channels sparsifies FFNs by activating top-K channels per token to cut activation memory and improve throughput.
Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits - Score: 18 (R=10, N=8) - Date: 2025-11-05 - Comment: High Performance Computing/Memory Optimization: CXL-enabled processing-near-memory KV-cache management and hybrid execution for 1M-token LLM inference.
Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer - Score: 18 (R=9, N=9) - Date: 2025-11-11 - Comment: Optimization theory: rigorous convergence rates and the Adam symmetry theorem for SGD-Adam on strongly convex problems.
ROOT: Robust Orthogonalized Optimizer for Neural Network Training - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: Matches Optimizers/Training Stability: robust orthogonalized optimizer with dimension-robust orthogonalization and proximal noise suppression for large-scale training.
Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone - Score: 17 (R=9, N=8) - Date: 2025-11-21 - Comment: Efficiency/Architecture: diffusion LM with bidirectional Mamba backbone for linear-time, high-throughput generation.
A Tensor Compiler for Processing-In-Memory Architectures - Score: 17 (R=9, N=8) - Date: 2025-11-20 - Comment: High Performance Computing: data-centric ML compiler co-optimizing data rearrangements and compute for PIM backends to accelerate LLM kernels.
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels - Score: 17 (R=9, N=8) - Date: 2025-11-19 - Comment: High Performance Computing: unified primitives for overlapped multi-GPU kernels enabling compute-communication overlap across parallelism modes.
Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels - Score: 17 (R=9, N=8) - Date: 2025-11-19 - Comment: High Performance Computing: compiler-composed nanokernels generating production-quality matmul microkernels from MLIR, reducing reliance on hand-tuned libraries.
T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Model Compression and Efficiency + Systems co-design: CPU-only ternary LLM inference via in-register SIMD LUT generation eliminating memory bottlenecks.
ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: High Performance Computing: dynamic, layer-wise parallel strategy switching for LLM training with sequence-aware memory/time cost models and hot-switching runtime.
Scalable Synthesis of distributed LLM workloads through Symbolic Tensor Graphs - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: High Performance Computing: introduces a symbolic tensor-graph generator to synthesize high-fidelity distributed LLM execution traces and explore parallelization strategies at massive scale.
TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: High Performance Computing: topology-aware weight pipeline parallelism that reduces cross-node traffic and overlaps compute/communication to scale long-context LLM training.
LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication - Score: 17 (R=9, N=8) - Date: 2025-11-13 - Comment: High Performance Computing: introduces a hierarchical NVSHMEM-based all-reduce (NVRAR) to accelerate multi-node LLM inference and reduce batch latency.
HipKittens: Fast and Furious AMD Kernels - Score: 17 (R=9, N=8) - Date: 2025-11-12 - Comment: HPC: tile-based programming framework for AMD GPUs enabling high-performance AI kernels across vendors (systems-level innovation).
Streaming Tensor Program: A streaming abstraction for dynamic parallelism - Score: 17 (R=9, N=8) - Date: 2025-11-12 - Comment: High Performance Computing: new streaming abstraction (STeP) with dynamic tiling/parallelization and explicit memory hierarchy for dynamic tensor workloads on accelerators.
Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: High Performance Computing/Distributed Inference: Federated Attention integrates FL principles into self-attention with KV aggregation and analyzes quality–communication trade-offs.
From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Systems/HPC: operator-level autoscaling for LLM inference with fine-grained scaling, batching, and placement; significant resource and energy savings.
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: HPC/Systems: in-kernel communication and fine-grained dataflow synchronization to eliminate BSP taxes and improve distributed LLM efficiency.
Language Modeling With Factorization Memory - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Model Architecture: Factorization Memory RNN with sparse memory activation; parallel-trainable and O(1) compute/memory at inference for long contexts.
Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Compression/Efficiency + Systems: virtualized multi-LoRA unifying fine-tuning and serving with shared base model and merged forward kernels for high-throughput co-serving.
When is a System Discoverable from Data? Discovery Requires Chaos - Score: 17 (R=8, N=9) - Date: 2025-11-13 - Comment: Matches Representation Learning/Theory: identifiability/discoverability conditions for dynamical systems from data; links chaos to unique discovery.
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design - Score: 16 (R=9, N=7) - Date: 2025-11-24 - Comment: High Performance Computing and Model Architecture: large-scale MoE pretraining on AMD Pollara with detailed systems/networking microbenchmarks and MI300X-aware transformer/MoE sizing rules for throughput/latency.
Harli: Harvest Underutilized Resources in LLM Serving with Finetuning Tasks - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: High Performance Computing: systems-level co-location of PEFT finetuning with decode instances via unified memory allocator, latency predictor, and scheduler.
Motif 2 12.7B technical report - Score: 16 (R=9, N=7) - Date: 2025-11-12 - Comment: Model Architecture: introduces Grouped Differential Attention; HPC/Systems: custom kernels and optimized distributed training pipeline for a 12.7B foundation model.
Descend or Rewind? Stochastic Gradient Descent Unlearning - Score: 16 (R=8, N=8) - Date: 2025-11-21 - Comment: Training dynamics/optimization: certified SGD unlearning guarantees for stochastic D2D/R2D across convex and nonconvex losses.
Principled Coarse-Grained Acceptance for Speculative Decoding in Speech - Score: 16 (R=8, N=8) - Date: 2025-11-19 - Comment: Efficiency/HPC: proposes coarse-grained acceptance for speculative decoding to increase throughput with exactness guarantees at the group level.
Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression - Score: 16 (R=8, N=8) - Date: 2025-11-18 - Comment: Training dynamics/scaling laws: theoretical analysis of multi-epoch data reuse (effective reuse rate) informing data-scaling under limited data.
DIGing–SGLD: Decentralized and Scalable Langevin Sampling over Time–Varying Networks - Score: 16 (R=8, N=8) - Date: 2025-11-18 - Comment: High Performance Computing: decentralized SGLD with gradient tracking over time-varying networks and finite-time convergence guarantees.
TNT: Improving Chunkwise Training for Test-Time Memorization - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: High Performance Computing: training paradigm enabling massive context parallelization for RNNs via hierarchical memory and chunk decoupling.
Accelerating Sparse Convolutions in Voxel-Based Point Cloud Networks - Score: 15 (R=8, N=7) - Date: 2025-11-27 - Comment: High Performance Computing: GPU systems-level sparse convolution engine exploiting voxel coordinate properties to accelerate kernel map construction and inference.
Row-stochastic matrices can provably outperform doubly stochastic matrices in decentralized learning - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Matches High Performance Computing/Distributed Training: theoretical framework showing when row-stochastic mixing outperforms doubly stochastic in decentralized learning.
Generative Caching for Structurally Similar Prompts and Responses - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Efficiency/Systems: generative caching that synthesizes variation-aware responses for structurally similar prompts to reduce latency.
Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Matches High Performance Computing criterion: systems-level acceleration of RL training via adaptive speculative decoding and a memory-efficient CUDA Graphs rollout engine.
What happens when nanochat meets DiLoCo? - Score: 15 (R=8, N=7) - Date: 2025-11-19 - Comment: Matches High Performance Computing/Distributed Training: analyzes communication-constrained local-update training (DiLoCo) vs DDP and reveals representation drift.
Synera: Synergistic LLM Serving across Device and Cloud at Scale - Score: 15 (R=8, N=7) - Date: 2025-11-13 - Comment: High Performance Computing: device–cloud synergistic LLM serving with communication-efficient selective offloading, stall-free parallel inference, and scalable batching.
On the Convergence and Stability of Distributed Sub-model Training - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: HPC/Distributed training: shuffled sub-model training with convergence and stability (generalization) analysis.
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing - Score: 15 (R=8, N=7) - Date: 2025-11-10 - Comment: High-performance inference systems: adaptive SM-level GPU multiplexing with an attention-aware roofline model and optimizer for LLM serving latency/throughput.
PerfDojo: Automated ML Library Generation for Heterogeneous Architectures - Score: 15 (R=8, N=7) - Date: 2025-11-07 - Comment: High Performance Computing: systems-level optimization for heterogeneous ML kernels via an LLM+RL framework (PerfDojo) with a human-readable IR enabling performance portability.
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: HPC/Systems: multi-agent, hardware-feedback-driven CUDA kernel generation/optimization improving kernel performance and generalization across GPUs.
AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: HPC/Distributed training: heterogeneity-aware scheduling for fully asynchronous RL training of LLMs using MILP and graph partitioning.
Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Training Dynamics/Efficiency: greedy layer-wise training using deterministic information bottleneck to avoid backprop and reduce memory.

Representation Learning (111)

Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit - Score: 19 (R=10, N=9) - Date: 2025-11-21 - Comment: Representation Learning: provable sample/time-optimal learning of multi-index models by two-layer nets via gradient descent.
Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States - Score: 19 (R=10, N=9) - Date: 2025-11-20 - Comment: Representation Learning/Theory: proves generic injectivity and bi-Lipschitz properties of Transformer sequence-level states; quantization effects analyzed.
Autoencoding Dynamics: Topological Limitations and Capabilities - Score: 18 (R=10, N=8) - Date: 2025-11-10 - Comment: Representation Learning (theoretical/topological limits and capabilities of autoencoders, including dynamics on manifolds).
On the Emergence of Induction Heads for In-Context Learning - Score: 18 (R=10, N=8) - Date: 2025-11-05 - Comment: Representation Learning/Training Dynamics: theoretical and empirical analysis of induction-head emergence with provable low-dimensional subspace constraints and scaling of emergence time.
In-Context Compositional Learning via Sparse Coding Transformer - Score: 17 (R=9, N=8) - Date: 2025-11-26 - Comment: Model Architecture/Representation Learning: reformulates attention as sparse coding with encoding/decoding dictionaries and sparse coefficients for compositional generalization.
A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias - Score: 17 (R=9, N=8) - Date: 2025-11-24 - Comment: Training Dynamics Theory: unified stability analysis of SGD vs SAM using a data-coherence curvature measure, explaining flatness preference and simplicity bias.
Sparse Autoencoders are Topic Models - Score: 17 (R=9, N=8) - Date: 2025-11-21 - Comment: Representation Learning/Sparsity: theoretical reinterpretation of sparse autoencoders as topic models and a new SAE-TM framework for thematic analysis.
Structured Contrastive Learning for Interpretable Latent Representations - Score: 17 (R=9, N=8) - Date: 2025-11-20 - Comment: Matches representation-learning criterion by structuring latent spaces into invariant/variant/free components with contrastive training.
Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings - Score: 17 (R=9, N=8) - Date: 2025-11-20 - Comment: Model Architecture and Representation Learning: modifies Transformer information flow with hierarchical prepended summary tokens and mean pooling to improve embeddings, especially for long context.
A Disentangled Low-Rank RNN Framework for Uncovering Neural Connectivity and Dynamics - Score: 17 (R=9, N=8) - Date: 2025-11-19 - Comment: Representation learning and low-rank architectures: introduces a disentangled low-rank RNN (VAE-based) with partial correlation penalty for interpretable latent dynamics.
On the Fundamental Limits of LLMs at Scale - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Representation Learning/Training Limits: unified theoretical framework on fundamental scaling ceilings (computability, information-theoretic, geometry) with mitigation paths (sparse/hierarchical attention).
Training Instabilities Induce Flatness Bias in Gradient Descent - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Training Dynamics/Representation Learning: shows that large-stepsize-induced instabilities bias GD toward flatter minima (RPE mechanism), extending to SGD/Adam and improving generalization.
PCA++: How Uniformity Induces Robustness to Background Noise in Contrastive Learning - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Representation Learning: proposes uniformity-constrained contrastive PCA (closed-form generalized eigenproblem) with high-dimensional analysis clarifying uniformity’s role.
Understanding InfoNCE: Transition Probability Matrix Induced Feature Clustering - Score: 17 (R=9, N=8) - Date: 2025-11-18 - Comment: Representation Learning: theoretical analysis of InfoNCE via transition matrices and a new SC-InfoNCE objective.
On the Convergence of Overparameterized Problems: Inherent Properties of the Compositional Structure of Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: Training Dynamics/Representation Learning: theoretical analysis of gradient flow and convergence in overparameterized neural compositions, characterizing saddles and initialization effects.
Koopman Invariants as Drivers of Emergent Time-Series Clustering in Joint-Embedding Predictive Architectures - Score: 17 (R=9, N=8) - Date: 2025-11-14 - Comment: Representation Learning: theoretical link showing JEPAs learn Koopman invariant subspaces under near-identity predictors, explaining emergent regime clustering.
Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Representation Learning Theory: proposes a depth-induced NTK capturing depth effects beyond infinite-width NTK, with analysis of spectrum and training invariance.
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability - Score: 17 (R=9, N=8) - Date: 2025-11-11 - Comment: Representation Learning: advances Sparse Autoencoders with a temporal contrastive loss to disentangle semantic vs. syntactic features for interpretability.
Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs - Score: 17 (R=9, N=8) - Date: 2025-11-10 - Comment: Matches Representation Learning/Training Dynamics: theoretical mechanism and empirical analysis of semantic calibration emerging from next-token training; explains when calibration holds and when it breaks.
Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models - Score: 17 (R=9, N=8) - Date: 2025-11-07 - Comment: Optimization Theory: non-asymptotic convergence and uniform-stability generalization bounds for stochastic Gauss-Newton in overparameterized DNNs.
High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes - Score: 17 (R=9, N=8) - Date: 2025-11-07 - Comment: Training Dynamics Theory: high-dimensional scaling limits comparing SGD, momentum, and adaptive step-sizes with rigorous implications for optimization and generalization.
An Augmentation Overlap Theory of Contrastive Learning - Score: 17 (R=9, N=8) - Date: 2025-11-07 - Comment: Representation Learning: provides tight bounds and an augmentation-overlap theory for contrastive learning, plus an unsupervised evaluation metric aligned with downstream performance.
Bulk-boundary decomposition of neural networks - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Training Dynamics/Representation Learning: bulk–boundary decomposition and field-theoretic formulation of SGD dynamics, separating architecture-driven and data-driven effects.
The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Representation Learning/Theory: shows normalization yields exponential Lipschitz reduction, explaining smoother optimization and better generalization (capacity control).
Priors in Time: Missing Inductive Biases for Language Model Interpretability - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Matches Representation Learning: critiques SAE priors and introduces Temporal Feature Analysis with temporal inductive bias for activation decomposition.
A Proof of Learning Rate Transfer under μP - Score: 17 (R=9, N=8) - Date: 2025-11-04 - Comment: Training dynamics/parameterization theory: first proof of learning-rate transfer under μP, contrasting with SP/NTP.
Panprediction: Optimal Predictions for Any Downstream Task and Loss - Score: 17 (R=9, N=8) - Date: 2025-11-03 - Comment: Representation Learning — theoretical panprediction framework with sample complexity bounds via calibration; foundational generalization across tasks/losses.
Why Less is More (Sometimes): A Theory of Data Curation - Score: 17 (R=8, N=9) - Date: 2025-11-06 - Comment: Representation Learning / Training dynamics: theoretical scaling laws explaining when curated subsets outperform full datasets.
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits - Score: 16 (R=9, N=7) - Date: 2025-11-26 - Comment: Representation Learning/Mechanistic Interpretability: decomposes heads/MLPs into singular directions revealing low-rank subspace computations.
Understanding Counting Mechanisms in Large Language and Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2025-11-26 - Comment: Matches Representation Learning: mechanistic interpretability of numerical representations via causal mediation and activation patching.
Understanding the Staged Dynamics of Transformers in Learning Latent Structure - Score: 16 (R=9, N=7) - Date: 2025-11-25 - Comment: Representation Learning: analyzes staged training dynamics of transformers learning latent structure.
MuM: Multi-View Masked Image Modeling for 3D Vision - Score: 16 (R=9, N=7) - Date: 2025-11-24 - Comment: Matches Representation Learning: multi-view masked autoencoding architecture with inter-frame attention tailored for 3D geometric features.
Clifford Algebraic Rotor Embeddings: Maybe embeddings should start to CARE - Score: 16 (R=9, N=7) - Date: 2025-11-18 - Comment: Model Architecture: generalization of Rotary Positional Embeddings via quaternions and Clifford algebra (CARE).
Adaptive Symmetrization of the KL Divergence - Score: 16 (R=9, N=7) - Date: 2025-11-17 - Comment: Training Objective/Representation Learning: adaptive optimization of Jeffreys (symmetric) KL via proxy-assisted constrained training, bridging NFs and EBMs.
Semi-Unified Sparse Dictionary Learning with Learnable Top-K LISTA and FISTA Encoders - Score: 16 (R=9, N=7) - Date: 2025-11-14 - Comment: Representation Learning and Sparsity: semi-unified sparse dictionary learning with learnable Top-K LISTA/LISTAConv encoders and PALM-style convergence analysis.
Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders - Score: 16 (R=9, N=7) - Date: 2025-11-13 - Comment: Matches Representation Learning with Sparse Autoencoders and group equivariance; architectural innovation for symmetry-aware SAEs.
Understanding the role of depth in the neural tangent kernel for overparameterized neural networks - Score: 16 (R=9, N=7) - Date: 2025-11-11 - Comment: Matches Representation Learning/training dynamics: analysis of NTK behavior with increasing depth in overparameterized networks.
SiamMM: A Mixture Model Perspective on Deep Unsupervised Learning - Score: 16 (R=9, N=7) - Date: 2025-11-10 - Comment: Representation learning: connects clustering-based self-supervised learning to classical mixture models and proposes the SiamMM architecture.
LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS - Score: 16 (R=9, N=7) - Date: 2025-11-05 - Comment: Mechanistic interpretability: reformulates CCS probing as a contrastive eigenproblem with closed-form solutions and multi-variable extension.
Calibration Across Layers: Understanding Calibration Evolution in LLMs - Score: 16 (R=9, N=7) - Date: 2025-11-05 - Comment: Representation Learning/Training Dynamics: analyzes calibration evolution across depth and identifies a low-dimensional calibration direction in the residual stream improving ECE/MCE.
Atlas-Alignment: Making Interpretability Transferable Across Language Models - Score: 16 (R=9, N=7) - Date: 2025-11-04 - Comment: Representation Learning/Interpretability: aligns unknown model latents to a labeled Concept Atlas via lightweight representational alignment for semantic retrieval and steering.
Operationalizing Quantized Disentanglement - Score: 16 (R=8, N=8) - Date: 2025-11-27 - Comment: Representation Learning: operationalizes quantized disentanglement via axis-aligned density discontinuities (“cliffs”) with independence constraints.
Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning - Score: 16 (R=8, N=8) - Date: 2025-11-26 - Comment: Matches Training Dynamics/Representation Learning: principled reward modification (differential smoothing) to counter RL-induced diversity collapse with theory.
MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings - Score: 16 (R=8, N=8) - Date: 2025-11-25 - Comment: Model Architecture: Transformer with input-dependent positional embeddings to disentangle structure vs content and learn cognitive maps (OOD generalization).
Towards a Unified Analysis of Neural Networks in Nonparametric Instrumental Variable Regression: Optimization and Generalization - Score: 16 (R=8, N=8) - Date: 2025-11-19 - Comment: Optimization/Training Dynamics: mean-field analysis and global convergence for neural 2SLS in NPIV (bilevel MFLD) with generalization trade-offs.
On the Dimension-Free Approximation of Deep Neural Networks for Symmetric Korobov Functions - Score: 16 (R=8, N=8) - Date: 2025-11-18 - Comment: Foundational approximation/generalization theory: dimension-free approximation rates for symmetric functions with symmetric DNNs.
On the Entropy Calibration of Language Models - Score: 16 (R=8, N=8) - Date: 2025-11-18 - Comment: Representation Learning/Training Dynamics - entropy calibration theory and scaling for LMs with implications for sampling/truncation strategies.
Training Language Models to Explain Their Own Computations - Score: 16 (R=8, N=8) - Date: 2025-11-13 - Comment: Representation Learning: trains LMs to produce faithful natural-language explanations of their own features/causal activations, leveraging privileged internal access.
Coherence Mechanisms for Provable Self-Improvement - Score: 16 (R=8, N=8) - Date: 2025-11-12 - Comment: Representation Learning/Training Dynamics: projection-based coherence mechanisms with monotonic improvement guarantees via Bregman divergence.
Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: Representation Learning/Training Dynamics: provides provable benefits of curriculum post-training and test-time scaling for Transformer tree reasoning, reducing sample complexity.
A Provably-Correct and Robust Convex Model for Smooth Separable NMF - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: Representation Learning: convex, provably-correct model for smooth separable NMF with robustness guarantees.
Diversified Flow Matching with Translation Identifiability - Score: 16 (R=8, N=8) - Date: 2025-11-11 - Comment: Generative modeling/representation: ODE-based diversified flow matching with translation identifiability guarantees.
Robustness of Minimum-Volume Nonnegative Matrix Factorization under an Expanded Sufficiently Scattered Condition - Score: 16 (R=8, N=8) - Date: 2025-11-07 - Comment: Matches Representation Learning: theoretical robustness guarantees for NMF (dictionary/factorization learning) under expanded sufficiently scattered condition.
Flat Minima and Generalization: Insights from Stochastic Convex Optimization - Score: 16 (R=8, N=8) - Date: 2025-11-06 - Comment: Representation Learning/Training Dynamics: theoretical analysis showing flat minima can generalize poorly and examining SA-GD/SAM generalization via stability.
Precise asymptotic analysis of Sobolev training for random feature models - Score: 16 (R=8, N=8) - Date: 2025-11-06 - Comment: Representation Learning / Training dynamics: precise asymptotic theory of Sobolev training for overparameterized random feature models.
Redundancy Maximization as a Principle of Associative Memory Learning - Score: 16 (R=8, N=8) - Date: 2025-11-05 - Comment: Representation Learning/Architecture: introduces redundancy maximization as an information-theoretic local learning principle for associative memory, greatly increasing Hopfield capacity.
Efficient Vector Symbolic Architectures from Histogram Recovery - Score: 16 (R=8, N=8) - Date: 2025-11-05 - Comment: Representation Learning: coding-theoretic vector symbolic architecture with efficient binding and provable recovery via histogram recovery and list-decoding, enabling robust compositional representations.
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering - Score: 16 (R=8, N=8) - Date: 2025-11-05 - Comment: Representation Learning/Training Dynamics: a unified Bayesian account linking in-context learning and activation steering, predictive of internal belief updates.
Bridging Lifelong and Multi-Task Representation Learning via Algorithm and Complexity Measure - Score: 16 (R=8, N=8) - Date: 2025-11-04 - Comment: Representation Learning: proposes a simple lifelong representation learning algorithm with sample complexity bounds via a new task-eluder dimension.
Regularization Implies balancedness in the deep linear network - Score: 16 (R=8, N=8) - Date: 2025-11-04 - Comment: Representation Learning: theoretical training dynamics in deep linear networks showing L2 regularization induces balancedness via GIT.
Visualizing LLM Latent Space Geometry Through Dimensionality Reduction - Score: 15 (R=8, N=7) - Date: 2025-11-27 - Comment: Representation learning/interpretability: analyzes Transformer latent geometry across layers (attention vs MLP) via dimensionality reduction.
FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning - Score: 15 (R=8, N=7) - Date: 2025-11-27 - Comment: Representation Learning: singular value-adaptive (feature-adaptive) noise injection for contrastive multimodal learning to improve robustness/generalization.
Probabilistic Hash Embeddings for Online Learning of Categorical Features - Score: 15 (R=8, N=7) - Date: 2025-11-27 - Comment: Compression/Efficiency + representation: probabilistic hash embeddings with Bayesian online learning for streaming categorical features; memory-bounded and order-invariant.
Representation Interventions Enable Lifelong Unstructured Knowledge Control - Score: 15 (R=8, N=7) - Date: 2025-11-27 - Comment: Representation Learning/Editing: intervention-based knowledge control with paraphrase-robust, edit-localized modules and a query-adaptive router; preserves base weights.
Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Matches Representation Learning: targeted Jacobian regularization in disentangled latent space to enforce shortcut invariance and OOD robustness.
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Matches Model Architecture and Representation Learning: continuous visual tokens enabling dense perceptual reasoning within VLMs.
Equivariant Deep Equilibrium Models for Imaging Inverse Problems - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Model Architecture: deep equilibrium models with implicit differentiation; Representation Learning: equivariant training approximating proximal maps of invariant priors.
From Tables to Signals: Revealing Spectral Adaptivity in TabPFN - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Matches Representation Learning: frequency-domain analysis revealing spectral adaptivity in TabPFN and its inductive biases.
Controllability Analysis of State Space-based Language Model - Score: 15 (R=8, N=7) - Date: 2025-11-25 - Comment: Representation Learning/Training Dynamics: controllability-based Influence Score to quantify token impact in SSM (Mamba) LMs.
InTAct: Interval-based Task Activation Consolidation for Continual Learning - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Representation Learning/Continual Learning: constrains shared-layer activation intervals to stabilize representations and reduce drift without freezing parameters or replay.
Self-Supervised Learning by Curvature Alignment - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Representation Learning: introduces curvature-regularized SSL (and RKHS variant) aligning local manifold geometry alongside redundancy reduction.
Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation - Score: 15 (R=8, N=7) - Date: 2025-11-24 - Comment: Representation Learning: introduces a context-compression pretext objective that trains LLMs to produce compact memory tokens for holistic embeddings, further improved with contrastive learning.
Anatomy of an Idiom: Tracing Non-Compositionality in Language Models - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Representation Learning/Interpretability: circuit discovery for idiom processing in transformers, identifying reusable “Idiom Heads” and attention mechanisms.
iLTM: Integrated Large Tabular Model - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Model Architecture: integrated tabular foundation model combining tree-derived embeddings, dimensionality-agnostic reps, a meta-trained hypernetwork, MLPs, and retrieval.
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Matches Representation Learning criterion: analyzes intrinsic dimension of text using SAEs and links causal linguistic features to representational complexity.
Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation - Score: 15 (R=8, N=7) - Date: 2025-11-21 - Comment: Model Compression and Efficiency: feature-only knowledge distillation framework challenging logit-based losses.
SymLoc: Symbolic Localization of Hallucination across HaluEval and TruthfulQA - Score: 15 (R=8, N=7) - Date: 2025-11-20 - Comment: Representation Learning: symbolic, layer-wise localization of hallucination with attention variance analysis provides insight into internal processing.
DeepDefense: Layer-Wise Gradient-Feature Alignment for Building Robust Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-11-20 - Comment: Representation Learning/Training Dynamics: proposes Gradient-Feature Alignment regularization with theory on tangential vs radial perturbations to improve robustness.
Task Addition and Weight Disentanglement in Closed-Vocabulary Models - Score: 15 (R=8, N=7) - Date: 2025-11-19 - Comment: Representation Learning/Model Editing: extends task arithmetic to closed-vocabulary vision transformers; studies weight disentanglement emerging from pretraining.
Exploring Transferability of Self-Supervised Learning by Task Conflict Calibration - Score: 15 (R=8, N=7) - Date: 2025-11-19 - Comment: Representation Learning: explicitly models SSL transferability via multi-task construction and Task Conflict Calibration with causal factor extraction and bi-level optimization.
Contrastive Entropy Bounds for Density and Conditional Density Decomposition - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Representation Learning Theory: Hilbert-space operator formulation connecting autoencoders and MDNs with trace/nuclear-norm objectives and contrastive entropy bounds.
Genomic Next-Token Predictors are In-Context Learners - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Representation Learning/Training dynamics: demonstrates emergent in-context learning from next-token prediction in genomic sequences under controlled tasks.
To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance - Score: 15 (R=8, N=7) - Date: 2025-11-18 - Comment: Representation Learning: controlled contrastive module to tune cross-modal alignment strength; identifies optimal alignment vs redundancy trade-offs.
From Parameter to Representation: A Closed-Form Approach for Controllable Model Merging - Score: 15 (R=8, N=7) - Date: 2025-11-17 - Comment: Model Architecture/Representation: closed-form, preference-controllable model merging via optimal linear transformation in representation space.
Continuum Dropout for Neural Differential Equations - Score: 15 (R=8, N=7) - Date: 2025-11-14 - Comment: Training Dynamics/Regularization for Neural Differential Equations: introduces a principled continuous-time dropout mechanism improving generalization and uncertainty quantification.
Generalizing PDE Emulation with Equation-Aware Neural Operators - Score: 15 (R=8, N=7) - Date: 2025-11-14 - Comment: Model Architecture and Representation Learning: equation-aware neural operator conditioned on PDE term encodings to generalize across PDE families.
Generalization Can Emerge in Tabular Foundation Models From a Single Table - Score: 15 (R=8, N=7) - Date: 2025-11-14 - Comment: Representation Learning: analyses/pretrains tabular foundation models showing generalization can emerge from self-supervision on a single table; highlights task-construction effects.
Abstract Gradient Training: A Unified Certification Framework for Data Poisoning, Unlearning, and Differential Privacy - Score: 15 (R=8, N=7) - Date: 2025-11-13 - Comment: Representation learning/training dynamics: unified certification via parameter-space bounds for first-order optimizers covering poisoning, unlearning, and DP.
Multi-step Predictive Coding Leads To Simplicity Bias - Score: 15 (R=8, N=7) - Date: 2025-11-13 - Comment: Matches Representation Learning/training dynamics: theory showing when multi-step predictive coding yields low-dimensional latent structure.
Unsupervised Feature Selection Through Group Discovery - Score: 15 (R=8, N=7) - Date: 2025-11-13 - Comment: Matches Representation Learning with unsupervised feature selection using group discovery and group sparsity regularization.
GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs - Score: 15 (R=8, N=7) - Date: 2025-11-13 - Comment: Matches Representation Learning: manifold-aware geodesic aggregation to mitigate semantic drift in TAGs; architecture-level change to message passing.
Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning - Score: 15 (R=8, N=7) - Date: 2025-11-12 - Comment: Matches Model Architecture/Representation Learning via sensitivity-aware task vector insertion (where and what) using activation clustering and RL.
Sampling and Loss Weights in Multi-Domain Training - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Training Dynamics: theoretical analysis of sampling vs. loss weights in multi-domain training to reduce gradient variance and generalization gap.
Rep2Text: Decoding Full Text from a Single LLM Token Representation - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Representation Learning: decodes inputs from a single last-token representation and analyzes information bottleneck in LLM internals.
How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Representation/Architecture analysis: information-theoretic estimation of GNN width/depth to mitigate over-squashing via channel capacity.
Non-Negative Stiefel Approximating Flow: Orthogonalish Matrix Optimization for Interpretable Embeddings - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Matches Representation Learning with sparsity: interpretable embeddings via non-negative, near-Stiefel constrained factorization.
First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation - Score: 15 (R=8, N=7) - Date: 2025-11-11 - Comment: Representation Learning/Training Dynamics: analyzes which layers best estimate data influence and proposes improved cross-layer aggregation.
Sharp Minima Can Generalize: A Loss Landscape Perspective On Data - Score: 15 (R=8, N=7) - Date: 2025-11-10 - Comment: Representation Learning: analyzes loss landscape and training dynamics, showing how dataset size reshapes flat/sharp minima and generalization.
When Data Falls Short: Grokking Below the Critical Threshold - Score: 15 (R=8, N=7) - Date: 2025-11-10 - Comment: Representation Learning/Training Dynamics (grokking under data scarcity and knowledge transfer via distillation).
Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness - Score: 15 (R=8, N=7) - Date: 2025-11-07 - Comment: Representation Learning: theoretically grounded embedding-space regularization to suppress spurious features and improve worst-group robustness.
Sketch-Augmented Features Improve Learning Long-Range Dependencies in Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-11-07 - Comment: Model Architecture/Representation Learning: injects sketched random global feature embeddings into GNNs to mitigate over-squashing/over-smoothing and capture long-range dependencies.
The stability of shallow neural networks on spheres: A sharp spectral analysis - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Training Dynamics/Representation Learning: sharp spectral analysis of mass/stiffness matrices for shallow ReLU^k networks on spheres, linking approximation power to numerical stability.
Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Representation/Parameter-Space Analysis: shows shared low-dimensional subspaces and linear mode connectivity underlying emergent misalignment across tasks, informing weight-space interpretability.
Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Representation Learning/Analysis: rank-2 subspace disentanglement to quantify interactions between parametric and context knowledge across multi-step NLE generation.
Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Representation Learning/Theory: proposes semantic information theory for LLMs (directed rate-distortion/reward, semantic flow) with structure-agnostic measures.
How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Representation Learning/Training Dynamics: quantitative analysis linking attention-induced interference to deterministic failure patterns in LLMs using a statistical physics model.
Mutual Information guided Visual Contrastive Learning - Score: 15 (R=8, N=7) - Date: 2025-11-05 - Comment: Representation Learning: mutual-information-guided positive selection/augmentation for contrastive learning, reducing hand-crafted heuristics.
Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Representation Learning/Architecture: parallel decoupled embeddings via learnable prefixes with mutual information minimization to diversify representations.
Analyzing the Power of Chain of Thought through Memorization Capabilities - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Representation Learning/Training Dynamics: theoretical analysis of transformers’ memorization capacity with and without Chain-of-Thought.
ParaScopes: What do Language Models Activations Encode About Future Text? - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Representation Learning/Interpretability: probes residual stream to decode multi-token future information and planning signals.
Feature-Function Curvature Analysis: A Geometric Framework for Explaining Differentiable Models - Score: 15 (R=8, N=7) - Date: 2025-11-04 - Comment: Representation Learning: analyzes training dynamics and learned function geometry, providing insights into how networks encode and evolve features.

Other Foundational Research (7)

Global Convergence of Four-Layer Matrix Factorization under Random Initialization - Score: 18 (R=9, N=9) - Date: 2025-11-14 - Comment: Training Dynamics/Theory: first polynomial-time global convergence guarantee for gradient descent on four-layer matrix factorization under random initialization.
A Fully Polynomial-Time Algorithm for Robustly Learning Halfspaces over the Hypercube - Score: 18 (R=9, N=9) - Date: 2025-11-11 - Comment: Learning Theory: fully polynomial-time robust algorithm for learning halfspaces over the hypercube under contamination.
ODE approximation for the Adam algorithm: General and overparametrized setting - Score: 17 (R=9, N=8) - Date: 2025-11-07 - Comment: Training Dynamics: ODE-based analysis of Adam shows convergence to zeros of an Adam vector field; Lyapunov characterization in overparameterized settings.
Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? - Score: 17 (R=9, N=8) - Date: 2025-11-05 - Comment: Training Dynamics/Optimization Theory: isotropic curvature model to analyze single-iteration updates and explain when gradient orthogonalization is optimal, informing optimizer design.
Sample Complexity of Agnostic Multiclass Classification: Natarajan Dimension Strikes Back - Score: 17 (R=8, N=9) - Date: 2025-11-18 - Comment: Foundational Learning Theory: near-tight agnostic multiclass sample complexity in terms of DS and Natarajan dimensions.
Almost Sure Convergence Analysis of Differentially Private Stochastic Gradient Methods - Score: 16 (R=8, N=8) - Date: 2025-11-21 - Comment: Training Dynamics/Optimization Theory: establishes almost sure convergence of DP-SGD and momentum variants under standard assumptions.
How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining - Score: 15 (R=8, N=7) - Date: 2025-11-26 - Comment: Training Dynamics (foundation model pretraining): analyzes LR decay compatibility with data curricula; proposes moderated decay/model averaging.