Personalized Monthly Topic Summary 2025/12
| Metric | Value |
|---|---|
| Total Papers | 512 |
| Model Architecture | 125 |
| Model Compression and Efficiency | 183 |
| High Performance Computing | 52 |
| Representation Learning | 144 |
| Other Foundational Research | 8 |
Model Architecture (125)
-
Sliding Window Recurrences for Sequence Models - Score: 20.0 (R=0, N=0) - Date: 2025-12-17 - Comment: Author match
-
Closing the Train-Test Gap in World Models for Gradient-Based Planning - Score: 20.0 (R=0, N=0) - Date: 2025-12-11 - Comment: Author match
-
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds - Score: 19 (R=10, N=9) - Date: 2025-12-31 - Comment: Representation Learning/Training Dynamics: first-order gradient analysis of attention with advantage-based routing and EM-like specialization mechanism explaining how cross-entropy shapes internal geometry.
-
The Bayesian Geometry of Transformer Attention - Score: 19 (R=10, N=9) - Date: 2025-12-31 - Comment: Representation Learning/Transformer internals: introduces “Bayesian wind tunnels” and a geometric mechanism showing how attention implements Bayesian inference.
-
End-to-End Test-Time Training for Long Context - Score: 19 (R=10, N=9) - Date: 2025-12-30 - Comment: Treats long-context handling as test-time training with meta-learned initialization on a standard sliding-window Transformer; achieves constant-latency inference — strong efficiency/training-dynamics innovation.
-
How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models - Score: 19 (R=10, N=9) - Date: 2025-12-19 - Comment: Matches Model Architecture and Representation Learning—unified attention/SSM framework with head-count and gradient propagation theory.
-
SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations - Score: 19 (R=10, N=9) - Date: 2025-12-17 - Comment: MoE Efficiency/HPC—memory-efficient MoE backward/forward, IO-overlap GPU kernels, and tile-aware token rounding for reduced padding.
-
Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective - Score: 19 (R=10, N=9) - Date: 2025-12-15 - Comment: Model Architecture: theoretical analysis of softmax attention showing large-prompt linearization with non-asymptotic concentration bounds, enabling training-dynamics analysis.
-
LUNA: Linear Universal Neural Attention with Generalization Guarantees - Score: 19 (R=10, N=9) - Date: 2025-12-10 - Comment: Matches Model Architecture and Efficiency: linear attention with learned positive-definite kernel feature maps and streaming computation; retains linear time/memory.
-
Group Representational Position Encoding - Score: 19 (R=10, N=9) - Date: 2025-12-09 - Comment: Model Architecture: unified positional encoding framework (group actions) subsuming RoPE/ALiBi with new multiplicative/additive families and efficient implementations for long context.
-
A Mathematical Theory of Top-$k$ Sparse Attention via Total Variation Distance - Score: 19 (R=10, N=9) - Date: 2025-12-09 - Comment: Strong match to Model Architecture/Efficiency: rigorous theory for Top-k sparse attention with certified TV bounds and output error factorization.
-
Network of Theseus (like the ship) - Score: 19 (R=10, N=9) - Date: 2025-12-05 - Comment: Matches Model Architecture: progressive architecture conversion using representational similarity alignment to decouple optimization from deployment, enabling new accuracy–efficiency tradeoffs.
-
The Mean-Field Dynamics of Transformers - Score: 19 (R=10, N=9) - Date: 2025-12-02 - Comment: Representation Learning: mean-field theory for Transformer attention (Wasserstein gradient flows, clustering/phase transition) elucidates training dynamics and representation collapse in deep attention.
-
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models - Score: 19 (R=10, N=9) - Date: 2025-12-01 - Comment: Model Architecture and Efficiency: Hierarchical Sparse Attention enabling ultra-long (up to 16M) context with sparsity and length generalization; MoE-based ultra-long LLM.
-
Learning When Not to Attend Globally - Score: 18 (R=10, N=8) - Date: 2025-12-31 - Comment: Model Architecture/Efficiency: conditional attention via per-head binary router switching between global and sliding-window attention to cut full attention usage.
-
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss - Score: 18 (R=10, N=8) - Date: 2025-12-30 - Comment: Direct MoE advancement: auxiliary expert–router coupling loss aligning router embeddings with expert capabilities; computationally efficient and scales with experts, matching the MoE architecture criterion.
-
Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism - Score: 18 (R=10, N=8) - Date: 2025-12-30 - Comment: HPC + MoE: fine-grained scheduling for disaggregated expert parallelism to optimize MoE inference throughput with algorithmic solver.
-
RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks - Score: 18 (R=10, N=8) - Date: 2025-12-26 - Comment: Model Architecture + Efficiency: reversible Transformer blocks for MoE enable activation reconstruction during backprop, greatly reducing memory for full-parameter fine-tuning.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence - Score: 18 (R=10, N=8) - Date: 2025-12-26 - Comment: Model architecture (MoE hybrid Mamba-Transformer) and systems innovations (NVFP4, LatentMoE, MTP layers) for efficiency and 1M-token context.
-
UCCL-EP: Portable Expert-Parallel Communication - Score: 18 (R=10, N=8) - Date: 2025-12-25 - Comment: Matches High Performance Computing and MoE: portable expert-parallel communication for Mixture-of-Experts using a GPU–CPU control channel and GPUDirect RDMA.
-
How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2025-12-25 - Comment: Model Architecture (MoE): adaptive expert expansion via semantic drift detection and dynamic routing by confidence mass to optimize expert specialization.
-
Sprecher Networks: A Parameter-Efficient Kolmogorov-Arnold Architecture - Score: 18 (R=10, N=8) - Date: 2025-12-23 - Comment: Model Architecture and Efficiency: KAS-inspired Sprecher Networks with shared learnable splines and O(LN+LG) scaling; reduced memory via sequential evaluation.
-
Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing - Score: 18 (R=10, N=8) - Date: 2025-12-23 - Comment: High Performance Computing/Systems for MoE: heterogeneous CPU/GPU expert placement, serverless offloading, and optimization for cost/latency.
-
Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap - Score: 18 (R=10, N=8) - Date: 2025-12-23 - Comment: Model Architecture (MoE/MoA) and Efficiency: tree-structured routing, adaptive pruning, and prefill–decode overlap for low-latency serving.
-
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: Model Architecture + Efficiency: AR-to-diffusion LM conversion with block-wise attention (preserves AR weights, enables KV caching) and position-dependent masking for faster parallel generation.
-
Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: MoE Architecture/Efficiency: dynamic Top-p routing with PI control for target sparsity and layer-wise routing normalization for controllable expert activation.
-
Improving Recursive Transformers with Mixture of LoRAs - Score: 18 (R=10, N=8) - Date: 2025-12-16 - Comment: Model Architecture and Compression/Efficiency: conditional computation via Mixture of LoRAs inside shared FFN for recursive transformers; expert merging for deployment.
-
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics - Score: 18 (R=10, N=8) - Date: 2025-12-16 - Comment: Model Architecture/Efficiency: exact linear-time attention via continuous-time dynamics (error-free linear attention) with theoretical foundations.
-
StructuredDNA: A Bio-Physical Framework for Energy-Aware Transformer Routing - Score: 18 (R=10, N=8) - Date: 2025-12-11 - Comment: Model Architecture and Efficiency: sparse single-expert Transformer routing replacing dense MoE via an energy-minimization routing layer.
-
GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: Model Architecture/Efficiency: proposes linear-time sliding-window attention with learnable gating to stabilize associative memory; FlashAttention-compatible fused kernel for I/O-efficient implementation.
-
Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems - Score: 18 (R=10, N=8) - Date: 2025-12-05 - Comment: Model Architecture (MoE) + HPC/Efficiency: context-aware expert placement using prefill activations, CXL-attached near-data processing, and per-expert mixed-precision quantization to cut cross-device transfers.
-
Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium - Score: 18 (R=10, N=8) - Date: 2025-12-01 - Comment: Model Architecture: introduces Equilibrium Transformers with iterative latent refinement via learned energy minimization, a closed-loop alternative to standard autoregression.
-
In-Context Multi-Operator Learning with DeepOSets - Score: 18 (R=9, N=9) - Date: 2025-12-19 - Comment: Matches Model Architecture with a non-attention, non-autoregressive design (DeepOSets) that exhibits in-context learning and provides a universal operator-approximation theory.
-
Rates and architectures for learning geometrically non-trivial operators - Score: 18 (R=9, N=9) - Date: 2025-12-11 - Comment: Model Architecture + Representation Learning: theory and architectures for learning geometric integral operators; proposes an architecture reminiscent of cross-attention with superalgebraic sample efficiency.
-
Provable Long-Range Benefits of Next-Token Prediction - Score: 18 (R=9, N=9) - Date: 2025-12-09 - Comment: Theory/Training Dynamics: complexity-theoretic guarantees that next-token training yields long-range k-token indistinguishability with polynomial-size RNNs.
-
Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning - Score: 17 (R=10, N=7) - Date: 2025-12-26 - Comment: Model architecture (MoE): hybrid Mamba-Transformer with expert sparsity to reduce activated parameters; efficiency gains in throughput and long-context support.
-
SHRP: Specialized Head Routing and Pruning for Efficient Encoder Compression - Score: 17 (R=10, N=7) - Date: 2025-12-25 - Comment: Matches Compression/Efficiency: structured pruning of attention heads via usage-driven routing; treats heads as experts to enable deterministic pruning with high retention.
-
MoE Pathfinder: Trajectory-driven Expert Pruning - Score: 17 (R=10, N=7) - Date: 2025-12-23 - Comment: Model Architecture + Compression: MoE expert pruning via global trajectory/path planning using multi-signal importance, yielding non-uniform layerwise retention.
-
Theoretical Foundations of Scaling Law in Familial Models - Score: 17 (R=9, N=8) - Date: 2025-12-31 - Comment: HPC/Scaling Laws & Dynamic Architectures: extends neural scaling laws to early-exit/relay familial models via a unified L(N,D,G) with granularity as a scaling variable and IsoFLOP-controlled experiments.
-
The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Foundational scaling-law theory for ensembling LLMs (multi-model collaboration), directly addressing model architecture/ensembling limits and diversity effects.
-
Visual Language Hypothesis - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Matches: Representation Learning (topological theory of semantics) and Model Architecture (architectural requirements for topology change).
-
On the Existence and Behaviour of Secondary Attention Sinks - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Mechanistic analysis of attention sinks (secondary sinks) in Transformers — deep representation/architecture insight into attention behavior and MLP roles.
-
Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Representation Learning: causal autoencoder with ANM inductive bias and identifiability analysis; architectural innovation via differentiable ANM layer in WAE.
-
Forward Only Learning for Orthogonal Neural Networks of any Depth - Score: 17 (R=9, N=8) - Date: 2025-12-26 - Comment: Model efficiency/training algorithm: forward-only learning eliminating backprop for orthogonal networks; scales to deep/convolutional architectures, reducing memory/compute.
-
Parallel Token Prediction for Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-25 - Comment: Matches Model Architecture/Efficiency: introduces Parallel Token Prediction to jointly predict dependent tokens in a single transformer call with a universality proof, reducing autoregressive decoding latency without independence assumptions.
-
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures - Score: 17 (R=9, N=8) - Date: 2025-12-24 - Comment: Representation Learning/Training Dynamics: unified theory of simplicity bias via saddle-to-saddle dynamics across FCN, CNN, and attention architectures.
-
Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs - Score: 17 (R=9, N=8) - Date: 2025-12-23 - Comment: High Performance Computing: joint optimization of software pipelining and warp specialization via constraint solving, delivering provably optimal GPU schedules (e.g., for Flash Attention).
-
NRGPT: An Energy-based Alternative for GPT - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Model Architecture/Generative modeling: reframes GPT as an energy-based model with inference as energy landscape exploration, with theory and experiments.
-
Time-Frequency Analysis for Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Matches Model Architecture and Theory by introducing time-frequency-windowed units and proving dimension-independent Sobolev approximation rates for shallow nets.
-
Bidirectional Normalizing Flow: From Data to Noise and Back - Score: 17 (R=9, N=8) - Date: 2025-12-12 - Comment: Model Architecture/Efficiency: relaxes exact invertibility in normalizing flows to enable faster sampling with improved flexibility.
-
Stronger Normalization-Free Transformers - Score: 17 (R=9, N=8) - Date: 2025-12-12 - Comment: Model Architecture: introduces Derf, a point-wise function enabling stronger normalization-free Transformers, outperforming LayerNorm/RMSNorm/DyT across domains.
-
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-10 - Comment: Architecture + efficiency: combines sliding-window attention with linear attention (Gated DeltaNet) for linear complexity, constant KV cache, and faster inference.
-
Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis - Score: 17 (R=9, N=8) - Date: 2025-12-10 - Comment: Training dynamics/Architecture growth: analyzes and improves gradual depth stacking to counter the Transformer curse of depth with mechanistic insights.
-
GSPN-2: Efficient Parallel Sequence Modeling - Score: 17 (R=9, N=8) - Date: 2025-12-10 - Comment: Matches HPC and Model Architecture: algorithm–system co-design for efficient global context modeling (GSPN-2) with fused kernels and compact channel propagation as an alternative to self-attention.
-
Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Strong match to HPC/Model Architecture (MoE): RL training pipeline for hundred-billion-scale MoE with router replay and high-throughput system.
-
GradientSpace: Unsupervised Data Clustering for Improved Instruction Tuning - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Mixture-of-Experts/efficiency criterion: clusters samples in full gradient space to train specialized LoRA experts with a lightweight router for single-expert routing.
-
Vector Quantization using Gaussian Variational Autoencoder - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Autoencoders/Quantization: converts Gaussian VAE to VQ-VAE without training via Gaussian codebooks; strong theory and practical gains across UNet/ViT.
-
BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Efficiency/HW–algorithm co-design criterion: attention accelerator with bit-serial stage fusion, adaptive token selection, and early termination to reduce memory and compute.
-
Learnability Window in Gated Recurrent Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-12-08 - Comment: Matches Model Architecture/Training Dynamics: theoretical analysis linking gating spectra in RNNs to gradient transport and the learnability window under heavy-tailed noise.
-
Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding - Score: 17 (R=9, N=8) - Date: 2025-12-04 - Comment: Model Architecture and Training Dynamics: theoretical analysis of self-attention token dynamics and positional encodings; proposes refinements to mitigate collapse.
-
Solving Neural Min-Max Games: The Role of Architecture, Initialization & Dynamics - Score: 17 (R=9, N=8) - Date: 2025-12-02 - Comment: Training Dynamics/Theory: proves global convergence to Nash equilibria in nonconvex min-max games with two-layer nets via hidden convexity and overparameterization.
-
Constructing Efficient Fact-Storing MLPs for Transformers - Score: 17 (R=9, N=8) - Date: 2025-12-02 - Comment: Representation Learning/Model Architecture: explicit constructions of fact-storing MLPs with asymptotically optimal facts-per-parameter and analysis of encoder–decoder mechanisms within Transformers.
-
LFM2 Technical Report - Score: 17 (R=9, N=8) - Date: 2025-12-01 - Comment: Matches Model Architecture (MoE, hybrid attention+convolution) and Compression/Efficiency (edge-latency/memory-constrained design, on-device deployment).
-
Towards Understanding Transformers in Learning Random Walks - Score: 17 (R=9, N=8) - Date: 2025-12-01 - Comment: Representation Learning/Theory: interpretable analysis of transformer attention and training dynamics on random walks with optimality guarantees.
-
Transformer Reconstructed with Dynamic Value Attention - Score: 16 (R=9, N=7) - Date: 2025-12-30 - Comment: Matches: Model Architecture (Dynamic Value Attention replacing multi-head+FFN) and Efficiency (reduced heads and training time).
-
Distilling to Hybrid Attention Models via KL-Guided Layer Selection - Score: 16 (R=9, N=7) - Date: 2025-12-25 - Comment: Matches Compression/Efficiency and Architecture: KL-guided layer selection for distilling to hybrid softmax/linear attention, improving efficient LLM inference.
-
SAP: Syntactic Attention Pruning for Transformer-based Language Models - Score: 16 (R=9, N=7) - Date: 2025-12-23 - Comment: Matches Model Compression/Efficiency: prunes Transformer attention heads using syntax-informed criteria.
-
Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling - Score: 16 (R=9, N=7) - Date: 2025-12-23 - Comment: Model Architecture: first-principles discrete-time SSM construction via a lag operator; connects to HiPPO and offers modular design space for sequence models.
-
Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity - Score: 16 (R=9, N=7) - Date: 2025-12-16 - Comment: Matches Model Architecture: dynamic/growing neural network with local structural plasticity (neuron insertion/pruning).
-
Exploring the Design Space of Transition Matching - Score: 16 (R=9, N=7) - Date: 2025-12-16 - Comment: Model Architecture: systematic design/training/sampling study of Transition Matching with head–backbone architecture for generative models.
-
Optimized Architectures for Kolmogorov-Arnold Networks - Score: 16 (R=9, N=7) - Date: 2025-12-16 - Comment: Matches Model Compression and Efficiency: differentiable sparsification to learn compact architectures (also Model Architecture for KAN design).
-
A Simple Generalisation of the Implicit Dynamics of In-Context Learning - Score: 16 (R=9, N=7) - Date: 2025-12-15 - Comment: Model Architecture: theoretical extension of implicit weight-update dynamics in transformer blocks (ICL) across layers/positions with layer normalization.
-
Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems - Score: 16 (R=9, N=7) - Date: 2025-12-10 - Comment: Model architecture (MoE): context-aware gating with specialized experts for hybrid dynamics; adds latent orthogonalization to enforce expert diversity.
-
PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes - Score: 16 (R=9, N=7) - Date: 2025-12-09 - Comment: Direct match to Model Architecture and Efficiency: integrates Sparse Mixture-of-Experts (SparseMoE) with bidirectional Mamba for a lightweight foundation model.
-
HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies - Score: 16 (R=9, N=7) - Date: 2025-12-08 - Comment: Strong match to Model Architecture: Hierarchical Mixture-of-Experts (MoE) action module for heterogeneous VLA policies.
-
Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting - Score: 16 (R=9, N=7) - Date: 2025-12-04 - Comment: Strong match to Model Architecture: Mixture-of-Experts with adaptive expert count driven by frequency-domain cues.
-
Upcycled and Merged MoE Reward Model for Mitigating Reward Hacking - Score: 16 (R=9, N=7) - Date: 2025-12-02 - Comment: Model Architecture: Mixture-of-Experts reward model upcycled from dense and merged back for efficient inference; addresses robustness to reward hacking.
-
Experts are all you need: A Composable Framework for Large Language Model Inference - Score: 16 (R=9, N=7) - Date: 2025-12-01 - Comment: Model Architecture/Inference: composable expert framework with routing and parallel sub-query execution (MoE-like dispatch without joint pretraining).
-
Trust Region Masking for Long-Horizon LLM Reinforcement Learning - Score: 16 (R=8, N=8) - Date: 2025-12-31 - Comment: Training dynamics/theory for LLM RL: new trust-region error bounds scaling with sequence length and Trust Region Masking to ensure non-vacuous guarantees.
-
An Inverse Scattering Inspired Fourier Neural Operator for Time-Dependent PDE Learning - Score: 16 (R=8, N=8) - Date: 2025-12-23 - Comment: Matches Model Architecture: inverse-scattering-inspired Fourier Neural Operator with invertible lifting and exponential Fourier evolution improves long-horizon stability.
-
KOSS: Kalman-Optimal Selective State Spaces for Long-Term Sequence Modeling - Score: 16 (R=8, N=8) - Date: 2025-12-19 - Comment: Matches Model Architecture—selective SSM with Kalman-optimal selection and scalable computation mechanisms.
-
Geometric Laplace Neural Operator - Score: 16 (R=8, N=8) - Date: 2025-12-19 - Comment: Model Architecture: introduces a Laplace spectral neural operator on Riemannian manifolds via pole–residue decomposition and a grid-invariant implementation (GLNONet).
-
A Single Architecture for Representing Invariance Under Any Space Group - Score: 16 (R=8, N=8) - Date: 2025-12-17 - Comment: Model Architecture—single architecture/layer enforcing invariance under any space group via symmetry-adapted Fourier weight sharing.
-
Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs - Score: 16 (R=8, N=8) - Date: 2025-12-17 - Comment: Efficiency/Training dynamics: test-time training with targeted gradient updates to overcome static attention limitations in long-context LLMs.
-
The Native Spiking Microarchitecture: From Iontronic Primitives to Bit-Exact FP8 Arithmetic - Score: 16 (R=8, N=8) - Date: 2025-12-09 - Comment: HPC/Architecture: native spiking microarchitecture achieving bit-exact FP8 arithmetic and O(log N) linear layer latency.
-
LDLT $\mathcal{L}$-Lipschitz Network: Generalized Deep End-To-End Lipschitz Network Construction - Score: 16 (R=8, N=8) - Date: 2025-12-08 - Comment: Strong match to Model Architecture: general LMI/LDL^T-based parameterization for constructing L-Lipschitz deep networks with theoretical guarantees across architectures.
-
CFO: Learning Continuous-Time PDE Dynamics via Flow-Matched Neural Operators - Score: 16 (R=8, N=8) - Date: 2025-12-08 - Comment: Matches Model Architecture/Neural Operators: flow-matched continuous-time operator learning enabling time-resolution invariance and stable long rollouts.
-
Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves - Score: 16 (R=8, N=8) - Date: 2025-12-02 - Comment: Model Architecture: Polynomial Neural Sheaf Diffusion introduces stable spectral filtering on sheaf Laplacians with diagonal restriction maps, improving scalability and stability.
-
Exact Learning of Arithmetic with Differentiable Agents - Score: 16 (R=8, N=8) - Date: 2025-12-01 - Comment: Model Architecture: differentiable finite-state transducers enabling exact algorithmic learning with strong length generalization.
-
LLMBoost: Make Large Language Models Stronger with Boosting - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: Model Architecture: introduces cross-model attention to fuse intermediate hidden states across LLMs in a boosting chain; Efficiency/HPC: near-parallel layer-wise pipelined inference; Theory: monotonic improvement guarantee under bounded corrections.
-
GLUE: Gradient-free Learning to Unify Experts - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: Matches Model Architecture: gradient-free convex weight-space mixing (SPSA-learned mixture coefficients) to unify multiple pretrained experts for initialization.
-
InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation - Score: 15 (R=8, N=7) - Date: 2025-12-29 - Comment: Model architecture: Mixture of Low-rank Experts with instruction-guided global routing for coherent expert selection.
-
Scalable Deep Subspace Clustering Network - Score: 15 (R=8, N=7) - Date: 2025-12-29 - Comment: Architecture/Efficiency: scalable deep subspace clustering using autoencoders with landmark-based O(n) complexity.
-
Stabilizing Multimodal Autoencoders: A Theoretical and Empirical Analysis of Fusion Strategies - Score: 15 (R=8, N=7) - Date: 2025-12-26 - Comment: Autoencoder architecture analysis: derives Lipschitz bounds for multimodal fusion and proposes a regularized attention-based fusion improving stability and convergence.
-
From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers - Score: 15 (R=8, N=7) - Date: 2025-12-26 - Comment: Representation Learning/Training Dynamics: adversarial attention learning with dynamic masking and policy-gradient to redistribute Transformer attention without extra supervision.
-
Q-RUN: Quantum-Inspired Data Re-uploading Networks - Score: 15 (R=8, N=7) - Date: 2025-12-25 - Comment: Model Architecture: quantum-inspired data re-uploading network layer with strong Fourier expressivity as a drop-in alternative to fully connected layers, reducing parameters.
-
Field-Space Attention for Structure-Preserving Earth System Transformers - Score: 15 (R=8, N=7) - Date: 2025-12-24 - Comment: Model Architecture and Efficiency: field-space attention operating on continuous fields with fixed multiscale decomposition for compact, stable Earth system transformers.
-
The Best of Both Worlds: Hybridizing Neural Operators and Solvers for Stable Long-Horizon Inference - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Matches Model Architecture/HPC: hybrid neural-operator–solver with residual-based, data-free error control for stable long-horizon inference at reduced compute.
-
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Model Architecture/Efficiency: cross-attention via self-attention to enable local text–text interaction, improving VLM fusion efficiency at scale.
-
A Logical View of GNN-Style Computation and the Role of Activation Functions - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Matches Architecture/Expressivity Analysis: logical characterization of GNN-style computation and role of activation functions.
-
Cartesian-nj: Extending e3nn to Irreducible Cartesian Tensor Product and Contraction - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Model Architecture: extends equivariant networks to irreducible Cartesian tensor product/contraction with a library enabling Cartesian counterparts to spherical models.
-
STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Stacked autoregressive modules with high-capacity VQ for unified multimodal learning (Model Architecture).
-
Universal Reasoning Model - Score: 15 (R=8, N=7) - Date: 2025-12-17 - Comment: Model Architecture: recurrent inductive bias from a Universal Transformer (UT) enhanced by short convolution; Training Dynamics: truncated backpropagation for reasoning.
-
ParaFormer: A Generalized PageRank Graph Transformer for Graph Representation Learning - Score: 15 (R=8, N=7) - Date: 2025-12-17 - Comment: Model Architecture: PageRank-enhanced attention to mitigate over-smoothing in Graph Transformers with theoretical backing.
-
Massive Editing for Large Language Models Based on Dynamic Weight Generation - Score: 15 (R=8, N=7) - Date: 2025-12-17 - Comment: Model Architecture: conditional/dynamic neuron whose weights are generated per query for large-scale knowledge editing; Efficiency: low-cost editing versus retraining.
-
Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Representation Learning — analyzes U-shaped attention bias and introduces initial-saliency scaling to improve long-context processing.
-
Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Matches Model Architecture: introduces Queried-Convolutions to better learn high-frequency signals.
-
CogniSNN: Enabling Neuron-Expandability, Pathway-Reusability, and Dynamic-Configurability with Random Graph Architectures in Spiking Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-12-15 - Comment: Matches Model Architecture (dynamic/conditional networks): introduces random graph SNN architectures with dynamic growth and residual mechanisms.
-
Neuronal Attention Circuit (NAC) for Representation Learning - Score: 15 (R=8, N=7) - Date: 2025-12-12 - Comment: Model Architecture: introduces a continuous-time attention mechanism with sparse gates and Top-K interaction selection, with theoretical guarantees on stability and approximation.
-
Supervised learning pays attention - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Model Architecture/Representation Learning: introduces attention-weighted supervised learning to fit personalized local models with theoretical MSE benefits under mixture-of-models.
-
Banach neural operator for Navier-Stokes equations - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Matches Model Architecture: Banach neural operator integrating Koopman theory with deep networks for spatiotemporal dynamics.
-
RRAEDy: Adaptive Latent Linearization of Nonlinear Dynamical Systems - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Model architecture and low-rank/representation criterion: rank-reduction autoencoder that discovers latent dimensionality and learns linear DMD dynamics with pruning.
-
FRWKV: Frequency-Domain Linear Attention for Long-Term Time Series Forecasting - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Matches Model Architecture/Efficiency: frequency-domain linear attention with O(T) complexity for long sequences.
-
A new initialisation to Control Gradients in Sinusoidal Neural network - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Model architecture/training dynamics criterion: closed-form initialization for SIREN controlling gradient scaling and pre-activations, with NTK analysis to stabilize deep sinusoidal networks.
-
Continuous-Time Homeostatic Dynamics for Reentrant Inference Models - Score: 15 (R=8, N=7) - Date: 2025-12-08 - Comment: Matches Model Architecture: formulates a reentrant inference network as a continuous-time neural ODE with stability via population-level gain modulation.
-
TV2TV: A Unified Framework for Interleaved Language and Video Generation - Score: 15 (R=8, N=7) - Date: 2025-12-05 - Comment: Model Architecture: introduces a Mixture-of-Transformers with interleaved text/video generation (conditional alternation) for unified modeling.
-
Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models - Score: 15 (R=8, N=7) - Date: 2025-12-05 - Comment: Matches Model Architecture (MoE-style): modality-decoupled experts to mitigate gradient conflict; also provides analysis relevant to representation learning/training dynamics.
-
Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents - Score: 15 (R=8, N=7) - Date: 2025-12-03 - Comment: Model Architecture: a Graph VQ-VAE producing high-fidelity discrete latents and an autoregressive Transformer over them for efficient generation.
-
MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Matches Model Architecture and Efficiency: multi-scale attention (local point + global patch) with ball-tree partitioning to scale neural solvers efficiently.
-
Beyond Loss Guidance: Using PDE Residuals as Spectral Attention in Diffusion Neural Operators - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Model Architecture and Efficiency: integrates PDE residuals as spectral attention inside diffusion neural operators, eliminating test-time optimization and accelerating inference.
-
Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Model Architecture (MoE): systematic study of architectural factors for Diffusion MoE (expert design, widths, expert counts, positional encodings) yielding efficient recipes with fewer activated parameters.
-
Fiber Bundle Networks: A Geometric Machine Learning Paradigm - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Matches Model Architecture: proposes Fiber Bundle Networks with learnable Riemannian metrics and prototype optimization for interpretable decision regions.
-
Memory-Integrated Reconfigurable Adapters: A Unified Framework for Settings with Multiple Tasks - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Model Architecture: integrates Hopfield-style associative memory with adapters for dynamic per-sample task/domain routing and retention (conditional/dynamic network).
-
Preventing Model Collapse via Contraction-Conditioned Neural Filters - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Model Architecture/Training Stability: contraction-conditioned neural filters and losses guarantee convergence without increasing sample sizes, preventing model collapse.
-
SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Matches Efficiency/test-time compute: selective resource allocation for reasoning sub-problems (dynamic routing between fast/slow processing) to improve cost–accuracy trade-offs.
Model Compression and Efficiency (183)
-
FALCON: Few-step Accurate Likelihoods for Continuous Flows - Score: 20.0 (R=0, N=0) - Date: 2025-12-11 - Comment: Author match
-
PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation - Score: 19 (R=10, N=9) - Date: 2025-12-26 - Comment: Model Architecture + Efficiency: hierarchical autoregressive design with multi-resolution latent streams reduces KV-cache traffic, boosting long-context throughput.
-
Data-Free Pruning of Self-Attention Layers in LLMs - Score: 19 (R=10, N=9) - Date: 2025-12-26 - Comment: Strong match to Model Compression/Efficiency: data-free pruning of self-attention layers via a weight-only Gate-Norm criterion enabling faster inference with minimal accuracy loss.
-
CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks - Score: 19 (R=10, N=9) - Date: 2025-12-16 - Comment: Model Compression and Efficiency — end-to-end joint pruning and quantization by learning quantizer dead-zone widths (fully differentiable).
-
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models - Score: 19 (R=10, N=9) - Date: 2025-12-10 - Comment: Compression/Efficiency: hybrid discrete–continuous token compression for VLMs yielding 580-to-1 compression and a single fused token.
-
BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training - Score: 19 (R=10, N=9) - Date: 2025-12-05 - Comment: Model Compression and Efficiency: introduces Binary Error Propagation, a discrete analog of backprop enabling fully binary forward and backward passes (including RNNs) with only bitwise ops.
-
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs - Score: 19 (R=10, N=9) - Date: 2025-12-04 - Comment: Model Compression and Efficiency: learned token retention gates for KV-cache eviction under memory budgets, improving long-context inference.
-
Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$ - Score: 19 (R=10, N=9) - Date: 2025-12-04 - Comment: Matches Model Compression and Efficiency: universal conversion to widely-linear complex form with phase-aware 2-bit quantization and multiplication-free accumulation for efficient LLM inference.
-
Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models - Score: 19 (R=10, N=9) - Date: 2025-12-03 - Comment: Matches Model Compression and Efficiency: joint distillation for flow-based models yielding few-step sampling and tractable likelihood with orders-of-magnitude fewer NFEs.
-
Efficient Turing Machine Simulation with Transformers - Score: 19 (R=10, N=9) - Date: 2025-12-02 - Comment: Matches Model Architecture and Efficiency: theoretical construction for efficient TM simulation with constant-bit Transformers and sparse attention with fixed offsets.
-
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies - Score: 19 (R=10, N=9) - Date: 2025-12-01 - Comment: Model Compression/Efficiency & HPC: simple loss (TWEO) to eliminate extreme outliers enabling full-model FP8 training and hardware-friendly W8A8 quantization.
-
MoR: Mixture Of Representations For Mixed-Precision Training - Score: 18 (R=10, N=8) - Date: 2025-12-30 - Comment: Matches Compression/Efficiency: dynamic per-tensor and sub-tensor mixed-precision selection (FP8 vs BF16) based on tensor properties to robustly scale low-precision training.
-
Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks - Score: 18 (R=10, N=8) - Date: 2025-12-30 - Comment: Core compression: game-theoretic formulation where sparsity emerges at equilibrium; yields an interpretable pruning algorithm.
-
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-12-29 - Comment: Compression/Efficiency: post-training 1-bit quantization for LLMs with data-aware output alignment accounting for activation error accumulation.
-
On the Convergence Rate of LoRA Gradient Descent - Score: 18 (R=10, N=8) - Date: 2025-12-23 - Comment: Matches Compression/Efficiency Theory: non-asymptotic convergence analysis for LoRA (low-rank adaptation) gradient descent.
-
CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs - Score: 18 (R=10, N=8) - Date: 2025-12-23 - Comment: Matches Compression/Efficiency and HPC: efficient GEMM kernel for codebook quantized LLMs eliminating dequantization via precomputed partial sums.
-
KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction - Score: 18 (R=10, N=8) - Date: 2025-12-23 - Comment: Model Compression and Efficiency: reversible KV-cache compression via sketch-based token reconstruction enabling large context with small memory budget.
-
Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA - Score: 18 (R=10, N=8) - Date: 2025-12-23 - Comment: Matches HPC/Efficiency: cross-model KV-cache reuse and Activated LoRA enable efficient multi-adapter LLM serving.
-
Learning What to Write: Write-Gated KV for Efficient Long-Context Inference - Score: 18 (R=10, N=8) - Date: 2025-12-22 - Comment: Transformer efficiency: learned KV admission (write-gated KV) and compact global+local cache to reduce KV size and attention cost—cache/memory optimization for long-context inference.
-
Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation - Score: 18 (R=10, N=8) - Date: 2025-12-22 - Comment: Model Architecture (Mixture-of-Experts) + Model Compression/Efficiency — router-guided low-rank compensation with quantization/offloading to cut bandwidth while preserving accuracy.
-
Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: Matches Compression/Efficiency: probabilistic cross-tokenizer likelihood scoring enabling distillation with smaller vocabularies; exact/approx algorithms leveraging BPE structure.
-
OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: Model Compression: one-shot post-training pruning via batched quadratic programming layer reconstruction; accelerator-friendly.
-
Low-Rank Compression of Language Models via Differentiable Rank Selection - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: Compression: low-rank LLM compression with differentiable per-layer rank selection via learned singular value masks, fine-tuning-free.
-
EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: Matches HPC/Communication Efficiency: entropy-driven dynamic gradient compression with theoretical link between entropy and compression rate for distributed LLM training.
-
SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping - Score: 18 (R=10, N=8) - Date: 2025-12-16 - Comment: Low-rank compression for LLMs via shared projection and block skipping; directly fits Compression/Efficiency (low-rank methods).
-
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-12-16 - Comment: High Performance Computing — bottleneck-aware tensor parallelism and system optimizations for low-rank LLMs; also aligns with low-rank efficiency.
-
SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale - Score: 18 (R=10, N=8) - Date: 2025-12-12 - Comment: Model Compression and Efficiency: LLM pruning mask refinement via tractable optimal 1-swaps computed from Gram matrices; GPU-scalable.
-
Winning the Lottery by Preserving Network Training Dynamics with Concrete Ticket Search - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: Strong match to Sparsity/Pruning: pruning-at-initialization via Concrete relaxation preserving training dynamics; lottery ticket advances.
-
FOAM: Blocked State Folding for Memory-Efficient LLM Training - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: High Performance Computing and Efficiency: optimizer-state compression (block-wise moments with residual correction) for memory-efficient LLM training with convergence guarantees, cutting optimizer memory up to 90%.
-
Leveraging KV Similarity for Online Structured Pruning in LLMs - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: Model Compression and Efficiency: online structured pruning for LLM attention via key-value similarity with variance-aware fusion, reducing inference cost without calibration data.
-
Block Sparse Flash Attention - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: Model Efficiency/HPC: block-sparse FlashAttention with calibrated per-block pruning and CUDA kernel, preserving accuracy while skipping ~50% compute/memory transfers.
-
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: Strong match to Compression/Efficiency/HPC: vector LUT paradigm for ultra-low-bit LLM inference improving memory bandwidth and parallelism.
-
Theoretical Compression Bounds for Wide Multilayer Perceptrons - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: Strong match to Compression/Efficiency: theoretical compression bounds for pruning/quantization in wide networks, including structured pruning.
-
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity - Score: 18 (R=10, N=8) - Date: 2025-12-08 - Comment: Matches Compression/Efficiency: provable KV-cache compression via optimal low-rank approximation of the attention matrix (attention fidelity guarantees).
-
Sparse Attention Post-Training for Mechanistic Interpretability - Score: 18 (R=10, N=8) - Date: 2025-12-08 - Comment: Matches Compression/Efficiency and Representation Learning: post-training sparsity regularization makes transformer attention extremely sparse without loss, exposing interpretable connectivity.
-
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs - Score: 18 (R=10, N=8) - Date: 2025-12-05 - Comment: Matches Model Compression and Efficiency: unified quantization + low-rank compression with structured weight-sorting, quantization-aware SVD, and fused RoPE kernel supporting configurable pruning.
-
A note on the impossibility of conditional PAC-efficient reasoning in large language models - Score: 18 (R=10, N=8) - Date: 2025-12-05 - Comment: Theory of Conditional/Dynamic Networks: impossibility of conditional PAC-efficient deferral between fast and expert models in a distribution-free setting.
-
Understanding and Harnessing Sparsity in Unified Multimodal Models - Score: 18 (R=10, N=8) - Date: 2025-12-04 - Comment: Model Compression/Efficiency and MoE Architecture: training-free pruning probe of unified multimodal models and MoE adaptation enabling sparse activation in generation.
-
Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models - Score: 18 (R=10, N=8) - Date: 2025-12-03 - Comment: Structured pruning tailored for reasoning LLMs using self-generated traces and decode-only gradients—directly matches Compression/Efficiency (pruning).
-
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling - Score: 18 (R=10, N=8) - Date: 2025-12-02 - Comment: Model Compression and Efficiency: proposes an FP4 (NVFP4) quantization algorithm (4/6) with adaptive block-level scaling to reduce near-maximum value error, enabling stable FP4 training/inference on Blackwell GPUs.
-
Low-Rank Prehab: Preparing Neural Networks for SVD Compression - Score: 18 (R=10, N=8) - Date: 2025-12-02 - Comment: Matches Model Compression and Efficiency: pre-conditioning networks (Prehab) for superior SVD low-rank compression with improved post-finetuning accuracy.
-
LPCD: Unified Framework from Layer-Wise to Submodule Quantization - Score: 18 (R=10, N=8) - Date: 2025-12-02 - Comment: Model Compression and Efficiency: unified PTQ framework extending from layer-wise to arbitrary submodule quantization via layer-projected coordinate descent.
-
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding - Score: 18 (R=10, N=8) - Date: 2025-12-02 - Comment: Model Compression and Efficiency / HPC: sparse self-speculative decoding with PillarAttn, unified scheduler, delayed verification, and dynamic KV-cache management for faster long-CoT inference.
-
WUSH: Near-Optimal Adaptive Transforms for LLM Quantization - Score: 18 (R=10, N=8) - Date: 2025-12-02 - Comment: Model Compression and Efficiency: derives near-optimal adaptive linear transforms for joint weight–activation block quantization (RTN AbsMax), improving over Hadamard.
-
HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs - Score: 18 (R=10, N=8) - Date: 2025-12-02 - Comment: Matches Model Compression and Efficiency: proposes wavelet-enhanced 1-bit post-training quantization with structure-aware grouping and saliency-driven selection achieving SOTA fidelity at ~1.08 bits.
-
R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization - Score: 18 (R=10, N=8) - Date: 2025-12-01 - Comment: Model Compression/Efficiency: extreme low-bit (2-bit) LLM quantization via residual refinement (two sequential 1-bit sub-quantizations).
-
Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality - Score: 18 (R=9, N=9) - Date: 2025-12-12 - Comment: Representation Learning: identifiability and interpretable control via causal minimality with sparsity/compression constraints in generative models.
-
Provably Learning from Modern Language Models via Low Logit Rank - Score: 18 (R=9, N=9) - Date: 2025-12-11 - Comment: Representation/Low-Rank Theory: exploits empirically low logit rank to give provable, efficient learning algorithms under logit queries.
-
RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees - Score: 17 (R=10, N=7) - Date: 2025-12-18 - Comment: Matches Inference Efficiency/HPC: RL-based dynamic speculative decoding trees to adapt draft calls and accelerate LLM inference.
-
KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models - Score: 17 (R=10, N=7) - Date: 2025-12-09 - Comment: Strong match to Compression/Efficiency: KV cache compression via autoencoders and cross-layer KV reuse for LLM inference.
-
A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention - Score: 17 (R=10, N=7) - Date: 2025-12-04 - Comment: Model Compression/Efficiency: native Top-k sparse attention for both training and decoding, with analysis (entropy view) and approximation fidelity study.
-
Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing - Score: 17 (R=10, N=7) - Date: 2025-12-04 - Comment: Compression/efficiency: SVD-based LLM compression with globally optimized rank selection (FermiGrad) and lossless gauge fixing (PivGa).
-
G-KV: Decoding-Time KV Cache Eviction with Global Attention - Score: 17 (R=10, N=7) - Date: 2025-12-02 - Comment: Model Compression and Efficiency: decoding-time KV-cache eviction using a global attention-based scoring mechanism with post-training RL/distillation for compressed-cache settings.
-
Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Matches: Model Compression and Efficiency (low-rank subspace adapters) and Representation Learning (SAE-based interpretable feature subspace).
-
AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Compression/Efficiency: PEFT (LoRA) with annealed activation to gain non-linear expressivity while remaining mergeable for efficient adaptation.
-
GQ-VAE: A gated quantized VAE for learning variable length tokens - Score: 17 (R=9, N=8) - Date: 2025-12-29 - Comment: Model Architecture and Compression/Efficiency: a learned tokenizer (gated quantized VAE) producing variable-length discrete tokens; improves compression while remaining a drop-in replacement for BPE.
-
Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs - Score: 17 (R=9, N=8) - Date: 2025-12-26 - Comment: HPC/Efficiency: custom Triton kernels and memory-layout optimizations for block low-rank (Monarch/BLAST) operations mitigate memory-bound multi-token inference on GPUs.
-
BRIDGE: Budget-aware Reasoning via Intermediate Distillation with Guided Examples - Score: 17 (R=9, N=8) - Date: 2025-12-24 - Comment: Model Compression/Efficiency: budget-aware distillation via teacher assistant with theoretical generalization bounds.
-
Approximation and learning with compositional tensor trains - Score: 17 (R=9, N=8) - Date: 2025-12-23 - Comment: Matches Model Architecture and Compression/Efficiency: compositional tensor-train networks enable low-rank structured layers with tensor-algebra-based optimization and controllable layer-wise compression.
-
Bridging Training and Merging Through Momentum-Aware Optimization - Score: 17 (R=9, N=8) - Date: 2025-12-22 - Comment: Momentum/curvature-aware optimization that preserves factorized statistics for curvature-aware model merging—low-rank optimization and efficient model composition.
-
InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression - Score: 17 (R=9, N=8) - Date: 2025-12-22 - Comment: Model Compression and Efficiency: introduces an ELBO-based adaptive video tokenization framework that reduces token budget; Model Architecture: transformer-based adaptive compressor for variable-rate discrete tokens.
-
MEPIC: Memory Efficient Position Independent Caching for LLM Serving - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Matches High-Performance Serving Efficiency—position-independent KV caching with paged sharing and RoPE fusion.
-
Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Matches Model Compression and Efficiency via training-free sparse attention (Top-k selection/reuse across layers) with substantial prefill/decode speedups.
-
Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Compression/Efficiency: adaptive low-rank factorization of MHSA via RL and online matrix perturbation to balance fidelity and latency at inference.
-
Random matrix theory of sparse neuronal networks with heterogeneous timescales - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Representation Learning/Theory: random matrix analysis of sparse E/I networks’ Jacobians links sparsity, timescales, and gains to spectral edge and dynamics.
-
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving - Score: 17 (R=9, N=8) - Date: 2025-12-18 - Comment: Serving Efficiency/HPC: joint KV-cache compression and multi-tier eviction via a unified utility to minimize latency at fixed quality.
-
SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-18 - Comment: Model Compression/Efficiency: quantization-aware training that optimizes static activation quantization factors for LLMs.
-
Spherical Leech Quantization for Visual Tokenization and Generation - Score: 17 (R=9, N=8) - Date: 2025-12-17 - Comment: Compression/Quantization—non-parametric lattice-based codebook (Leech lattice) for autoencoder-style visual tokenization and compression.
-
Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets - Score: 17 (R=9, N=8) - Date: 2025-12-17 - Comment: Compression/Efficiency: PEFT via Ladder Side Tuning halves peak memory vs QLoRA; Model Architecture: xLadder increases effective depth via side-net cross-connections at fixed params.
-
MIDUS: Memory-Infused Depth Up-Scaling - Score: 17 (R=9, N=8) - Date: 2025-12-17 - Comment: Model Architecture: head-wise memory layer replacing FFNs in duplicated blocks; Compression/Efficiency: sparse memory access and per-head value factorization for parameter-efficient depth up-scaling.
-
SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision - Score: 17 (R=9, N=8) - Date: 2025-12-16 - Comment: Heterogeneous accelerator with hierarchical group quantization and SVD-guided mixed precision; strong fit to Compression/Efficiency and HPC inference.
-
Resting Neurons, Active Insights: Improving Input Sparsification for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-16 - Comment: Proposes input sparsification as dynamic structural pruning with trainable compensatory neurons; fits Model Compression/Efficiency (sparsity/pruning, conditional activation).
-
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval - Score: 17 (R=9, N=8) - Date: 2025-12-16 - Comment: Dynamic KV cache retrieval with software–hardware co-design for streaming video LLMs; directly matches Compression/Efficiency (cache optimization) and HPC inference acceleration.
-
TPV: Parameter Perturbations Through the Lens of Test Prediction Variance - Score: 17 (R=9, N=8) - Date: 2025-12-15 - Comment: Representation Learning and Compression/Efficiency: introduces label-free test prediction variance linking generalization to parameter perturbations and yields a pruning importance measure.
-
Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders - Score: 17 (R=9, N=8) - Date: 2025-12-12 - Comment: Model Efficiency + Representation Learning: decomposes KV caches via Top-K sparse autoencoders (sparsity without shrinkage bias) with a dual-budget scheme, informing cache compression/interpretability.
-
Supervised Learning of Random Neural Architectures Structured by Latent Random Fields on Compact Boundaryless Multiply-Connected Manifolds - Score: 17 (R=9, N=8) - Date: 2025-12-12 - Comment: Strong match to Model Architecture (and sparsity): a geometry-aware generative process where architectures/weights emerge from a latent random field on manifolds, yielding sparse connectivity and theoretical properties.
-
Tensor-Compressed and Fully-Quantized Training of Neural PDE Solvers - Score: 17 (R=9, N=8) - Date: 2025-12-11 - Comment: Compression/Efficiency + HPC: fully-quantized training, tensor-train compression, and a precision-scalable accelerator for efficient PINN/PDE solvers.
-
Towards Lossless Ultimate Vision Token Compression for VLMs - Score: 17 (R=9, N=8) - Date: 2025-12-11 - Comment: Model Compression and Efficiency: training-free iterative visual token merging and spectrum pruning compatible with FlashAttention for end-to-end VLM token compression.
-
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models - Score: 17 (R=9, N=8) - Date: 2025-12-10 - Comment: Matches Model Compression and Efficiency: training-free KV-cache compression via selective sentence-level eviction and generation control for CoT.
-
LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model - Score: 17 (R=9, N=8) - Date: 2025-12-10 - Comment: Matches Compression/Efficiency: dynamic sparsity prediction and log-domain computation with hardware co-design for Transformer acceleration.
-
ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-10 - Comment: Efficiency/HPC: adaptive parallel reasoning at inference with trie-based training–inference co-design avoiding KV cache/PE changes, plus RL for parallelization.
-
Neural expressiveness for beyond importance model compression - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Model Compression: introduces an expressiveness-based, data-agnostic pruning criterion complementary to importance-based pruning with large compression gains.
-
InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models - Score: 17 (R=9, N=8) - Date: 2025-12-08 - Comment: Strong match to Efficiency: training-free cross-timestep and cross-layer caching exploiting invariances to accelerate diffusion models 2–3x.
-
The Universal Weight Subspace Hypothesis - Score: 17 (R=9, N=8) - Date: 2025-12-05 - Comment: Representation Learning: identifies universal low-dimensional/sparse spectral subspaces in weight matrices across architectures; also connects to low-rank compression potential.
-
Arbitrage: Efficient Reasoning via Advantage-Aware Speculation - Score: 17 (R=9, N=8) - Date: 2025-12-05 - Comment: Model Efficiency: step-level speculative decoding with advantage-aware routing to reduce inference cost while maintaining reasoning quality.
-
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation - Score: 17 (R=9, N=8) - Date: 2025-12-05 - Comment: Model Compression and Efficiency: introduces a sparse attention variant with multi-level pooled KV representations and a hardware-friendly kernel to reduce quadratic attention cost while preserving information.
-
Convergence for Discrete Parameter Updates - Score: 17 (R=9, N=8) - Date: 2025-12-04 - Comment: Training efficiency: discrete update rules with convergence guarantees for low-precision training—core to compression/efficiency.
-
Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles - Score: 17 (R=9, N=8) - Date: 2025-12-04 - Comment: Efficiency/Training Dynamics: theoretical analysis of data curation via operator spectra; shows limits of static pruning and acceleration via time-dependent reweighting.
-
Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation - Score: 17 (R=9, N=8) - Date: 2025-12-03 - Comment: Matches Model Compression and Efficiency: low-rank, basis-oriented parameter-efficient transfer with orthogonal task-informed subspaces; strong PEFT contribution.
-
Efficiently Learning Branching Networks for Multitask Algorithmic Reasoning - Score: 17 (R=9, N=8) - Date: 2025-12-02 - Comment: Matches Model Architecture and Efficiency: branching neural networks for multitask reasoning with efficient convex-relaxed structure search (dynamic/conditional computation).
-
SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs - Score: 17 (R=9, N=8) - Date: 2025-12-02 - Comment: Matches High Performance Computing and Efficiency: algorithm–system co-design for long-context LLMs via speculative context sparsity, pruned retrieval heads, asynchronous prefetch, and adaptive memory management.
-
CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference - Score: 17 (R=9, N=8) - Date: 2025-12-01 - Comment: Model Efficiency/HPC: certifiable sub-vocabulary decoding with geometric bounds, sparse kernels, and multi-GPU sharding for output layer acceleration.
-
FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence - Score: 16 (R=9, N=7) - Date: 2025-12-31 - Comment: Model Compression/Efficiency: new PEFT method using hierarchical joint decomposition and sparse rotational perturbations to enable full-rank updates with ~1.7% trainable params.
-
Merge before Forget: A Single LoRA Continual Learning via Continual Merging - Score: 16 (R=9, N=7) - Date: 2025-12-30 - Comment: Compression/Efficiency: continual LoRA merging with orthogonal basis and time-aware scaling, maintaining constant memory and mitigating interference.
-
The Quest for Winning Tickets in Low-Rank Adapters - Score: 16 (R=9, N=7) - Date: 2025-12-30 - Comment: Matches Compression/Efficiency and Sparsity: extends Lottery Ticket Hypothesis to LoRA and introduces Partial-LoRA for sparse low-rank adapters with large parameter savings.
-
nncase: An End-to-End Compiler for Efficient LLM Deployment on Heterogeneous Storage Architectures - Score: 16 (R=9, N=7) - Date: 2025-12-29 - Comment: High Performance Computing/Efficiency: end-to-end compiler with e-graph rewriting, auto-parallel distribution, and cache-aware scheduling for LLM deployment.
-
Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation - Score: 16 (R=9, N=7) - Date: 2025-12-26 - Comment: Compression/Efficiency: reasoning distillation via sequence truncation and selective supervision (prioritizing early CoT tokens) halves training compute with minimal performance loss.
-
Towards Minimal Fine-Tuning of VLMs - Score: 16 (R=9, N=7) - Date: 2025-12-23 - Comment: Matches Compression/Efficiency: Image-LoRA restricts low-rank adaptation to visual-token spans and selects influential heads to minimize trainable parameters/FLOPs.
-
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning - Score: 16 (R=9, N=7) - Date: 2025-12-23 - Comment: Model Compression and Efficiency: query-aware mixed-precision KV cache quantization for long-context reasoning.
-
Mitigating Forgetting in Low Rank Adaptation - Score: 16 (R=9, N=7) - Date: 2025-12-22 - Comment: Parameter-efficient fine-tuning: Low-Rank Adaptation with Laplace-based weight-space regularization to mitigate forgetting—low-rank/PEFT and training dynamics.
-
Dion2: A Simple Method to Shrink Matrix in Muon - Score: 16 (R=9, N=7) - Date: 2025-12-22 - Comment: Optimizer efficiency: Shrinks Muon’s orthonormalization step via sampled rows/columns, inducing sparse updates to reduce compute/communication costs.
-
Batch Normalization-Free Fully Integer Quantized Neural Networks via Progressive Tandem Learning - Score: 16 (R=9, N=7) - Date: 2025-12-19 - Comment: Compression/Efficiency: trains BN-free fully integer quantized networks via progressive layer-wise distillation enabling integer-only inference.
-
CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity - Score: 16 (R=9, N=7) - Date: 2025-12-19 - Comment: Matches Model Compression/Efficiency via per-layer heterogeneous PTQ selection guided by CKA (quantization).
-
Arithmetic-Intensity-Aware Quantization - Score: 16 (R=9, N=7) - Date: 2025-12-18 - Comment: Compression/Efficiency: mixed-precision PTQ optimizing per-layer bit-widths for arithmetic intensity vs accuracy to boost throughput on memory-bound nets.
-
Efficient Vision-Language Reasoning via Adaptive Token Pruning - Score: 16 (R=9, N=7) - Date: 2025-12-16 - Comment: Model Compression and Efficiency — adaptive token pruning at the vision-language interface using attention/similarity-based importance.
-
CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving - Score: 16 (R=9, N=7) - Date: 2025-12-16 - Comment: High Performance Computing/Efficiency: disaggregated KV-cache over CXL with FPGA acceleration, speculative prefetch, and compression for LLM serving.
-
Multi-Granular Node Pruning for Circuit Discovery - Score: 16 (R=9, N=7) - Date: 2025-12-12 - Comment: Sparsity/Pruning for circuit discovery with multi-granular learnable masks; improves interpretability while maintaining performance.
-
LGAN: An Efficient High-Order Graph Neural Network via the Line Graph Aggregation - Score: 16 (R=9, N=7) - Date: 2025-12-12 - Comment: Strong match to Model Architecture/Efficiency: proposes a high-order GNN via line graph aggregation with provably greater expressivity than 2-WL and lower time complexity.
-
Uncertainty-Preserving QBNNs: Multi-Level Quantization of SVI-Based Bayesian Neural Networks for Image Classification - Score: 16 (R=9, N=7) - Date: 2025-12-12 - Comment: Direct match to Model Compression and Efficiency: proposes multi-level quantization (VPQ/SPQ/JQ) for SVI-based Bayesian NNs with logarithmic variance quantization and uncertainty-preserving activations.
-
HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression - Score: 16 (R=9, N=7) - Date: 2025-12-11 - Comment: Matches Model Compression and Efficiency: hierarchical progressive multi-teacher knowledge distillation with adaptive hyperparameters and parallelization.
-
SparsePixels: Efficient Convolution for Sparse Data on FPGAs - Score: 16 (R=9, N=7) - Date: 2025-12-09 - Comment: HPC/Systems Efficiency: sparse CNN formulation and FPGA HLS implementation exploiting spatial sparsity for microsecond-latency inference with quantization-aware training.
-
Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective - Score: 16 (R=9, N=7) - Date: 2025-12-05 - Comment: Compression/Efficiency: advances logit-based knowledge distillation with generalized decoupling and analysis of teacher predictive distributions.
-
GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers - Score: 16 (R=9, N=7) - Date: 2025-12-05 - Comment: Compression/Efficiency: PEFT via grouped activation shared parameterization; stochastic variant improves robustness under hardware noise.
-
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm - Score: 16 (R=9, N=7) - Date: 2025-12-03 - Comment: Model Compression/Efficiency: training-free token pruning with spatial sparsity buffering and redundancy-aware selection for VLMs.
-
ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity - Score: 16 (R=9, N=7) - Date: 2025-12-03 - Comment: Designs a sparse Transformer accelerator using local similarity and HLog quantization—strong match to Compression/Efficiency and systems-level HPC for Transformers.
-
SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification - Score: 16 (R=9, N=7) - Date: 2025-12-03 - Comment: Self-speculative decoding with partial KV verification to accelerate long-context generation—matches HPC/Efficiency (decoding acceleration, KV optimization).
-
Morphling: Fast, Fused, and Flexible GNN Training at Scale - Score: 16 (R=9, N=7) - Date: 2025-12-02 - Comment: High-Performance Computing: domain-specific code synthesis and sparsity-aware runtime for scalable, fused GNN training across CPU/GPU/MPI backends.
-
Less is More: Resource-Efficient Low-Rank Adaptation - Score: 16 (R=9, N=7) - Date: 2025-12-02 - Comment: Model Compression/Efficiency: EffiLoRA shares A across layers and selectively updates B at runtime to reduce PEFT cost while retaining performance.
-
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism - Score: 16 (R=9, N=7) - Date: 2025-12-01 - Comment: High Performance Computing: sparsity-aware dual-balanced sequence parallelism for block-sparse attention with dynamic runtime partitioning.
-
PerfMamba: Performance Analysis and Pruning of Selective State Space Models - Score: 16 (R=9, N=7) - Date: 2025-12-01 - Comment: Compression/Efficiency—prunes low-activity states in selective SSMs (Mamba) for speed/memory gains; HPC—systematic runtime/memory/I/O profiling and scaling analysis.
-
Enhancing Trustworthiness with Mixed Precision: Benchmarks, Opportunities, and Challenges - Score: 16 (R=9, N=7) - Date: 2025-12-01 - Comment: Matches Compression/Efficiency: mixed-precision/quantization analysis for LLMs with a precision-ensemble voting method targeting trustworthy deployment.
-
SingleQuant: Efficient Quantization of Large Language Models in a Single Pass - Score: 16 (R=9, N=7) - Date: 2025-12-01 - Comment: Model Compression and Efficiency: proposes a single-pass LLM quantization framework with structured Givens-rotation transforms to remove STE-induced non-smoothness and accelerate quantization.
-
Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs - Score: 16 (R=9, N=7) - Date: 2025-12-01 - Comment: Matches Compression/Efficiency: low-rank factorization of LLMs with comprehensive trustworthiness analysis; introduces precision-aware mitigation strategies.
-
SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference - Score: 16 (R=8, N=8) - Date: 2025-12-19 - Comment: Compression/Efficiency: gain–shape–bias vector quantization and hardware-aware compiler reduce KAN runtime memory by 88× while preserving accuracy.
-
Bias-Variance Trade-off for Clipped Stochastic First-Order Methods: From Bounded Variance to Infinite Mean - Score: 16 (R=8, N=8) - Date: 2025-12-18 - Comment: Unified complexity analysis for clipped stochastic first-order methods under heavy-tailed noise (Compression/Efficiency: optimization theory).
-
Spiking Manifesto - Score: 16 (R=8, N=8) - Date: 2025-12-16 - Comment: Matches Model Architecture/Efficiency: proposes a spiking-network reinterpretation of ANNs for potential thousandfold efficiency gains.
-
Efficient Low-Tubal-Rank Tensor Estimation via Alternating Preconditioned Gradient Descent - Score: 16 (R=8, N=8) - Date: 2025-12-09 - Comment: Direct match to Compression/Efficiency: proposes an APGD algorithm for low-tubal-rank tensor estimation with linear convergence under over-parameterization, improving optimization efficiency.
-
Pathway to $O(\sqrt{d})$ Complexity bound under Wasserstein metric of flow-based models - Score: 16 (R=8, N=8) - Date: 2025-12-09 - Comment: Theoretical Efficiency: establishes $O(\sqrt{d})$ sampling iteration complexity under Wasserstein metric for flow-based generative models with explicit assumptions.
-
Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws - Score: 16 (R=8, N=8) - Date: 2025-12-08 - Comment: Matches Compression/Efficiency: unified theory with scaling and configuration-coverage laws for dataset distillation (generalization-error framework).
-
One-Step Diffusion Samplers via Self-Distillation and Deterministic Flow - Score: 16 (R=8, N=8) - Date: 2025-12-08 - Comment: Strong match to Efficiency: one-step diffusion sampler via self-distillation and deterministic-flow ELBO with orders-of-magnitude fewer network evaluations.
-
On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference - Score: 16 (R=8, N=8) - Date: 2025-12-05 - Comment: Model Compression/Efficiency: theoretical and algorithmic advances in test-time compute via reward-filtered sequential inference with stronger guarantees than BoN.
-
SVRG and Beyond via Posterior Correction - Score: 16 (R=8, N=8) - Date: 2025-12-02 - Comment: Matches Training Efficiency/HPC: connects SVRG to Bayesian posterior correction and derives Hessian‑ and Adam‑like SVRG variants improving deep model training.
-
ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation - Score: 16 (R=8, N=8) - Date: 2025-12-02 - Comment: Matches Efficiency/Test‑time Scaling: zero‑overhead reward and cost prediction from unused logits enables adaptive inference and compute allocation.
-
Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: Model Efficiency/Training stability: dynamic vocabulary pruning to bound tail-induced training–inference mismatch in LLM RL with theoretical guarantees.
-
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: Model Compression/Distillation: mask-progressive teacher and offline RL-based rewards to stabilize and improve student VLM distillation.
-
A general framework for deep learning - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: Matches: Model Compression and Efficiency (sparse-penalized DNN) and Representation Learning theory (minimax-optimal excess risk under mixing).
-
Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: Matches Model Architecture and Efficiency: mask fine-tuning with learnable gating scores to reconfigure internal subnetworks in VLMs without weight updates.
-
Towards Long-window Anchoring in Vision-Language Model Distillation - Score: 15 (R=8, N=7) - Date: 2025-12-29 - Comment: Model Compression/Efficiency via knowledge distillation of long-range attention (distance-weighted attention matching and RoPE gain modulation) for long-context VLMs.
-
Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training - Score: 15 (R=8, N=7) - Date: 2025-12-29 - Comment: Training dynamics/data scaling: perplexity-aware scaling law for continual pre-training, guiding data selection.
-
HyDRA: Hierarchical and Dynamic Rank Adaptation for Mobile Vision Language Model - Score: 15 (R=8, N=7) - Date: 2025-12-26 - Comment: Matches Model Compression/Efficiency: dynamic and hierarchical low-rank (LoRA) rank scheduling for parameter-efficient fine-tuning of VLMs.
-
Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs - Score: 15 (R=8, N=7) - Date: 2025-12-24 - Comment: Inference Efficiency/HPC: speculative decoding with diffusion LLM drafters and adaptive speculation length for lossless AR acceleration.
-
ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge - Score: 15 (R=8, N=7) - Date: 2025-12-24 - Comment: High Performance Computing/Efficiency: cross-request pipelining and unified KV ring buffer to accelerate autoregressive VLA inference on edge.
-
PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models - Score: 15 (R=8, N=7) - Date: 2025-12-24 - Comment: Model Efficiency: accelerates discrete flow models to few-step sampling via closed-form inversion–based source–target coupling without a teacher.
-
Binary Kernel Logistic Regression: a sparsity-inducing formulation and a convergent decomposition training algorithm - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Model Compression/Efficiency: sparsity-inducing KLR formulation with a convergent SMO-type decomposition algorithm.
-
When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Model Compression and Efficiency: systematic study of INT8/INT4 quantization effects on LLM continual learning dynamics, highlighting regularization benefits.
-
IPCV: Information-Preserving Compression for MLLM Visual Encoders - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Model Compression and Efficiency: training-free token pruning inside ViT with information-preserving reconstruction and attention stabilization for MLLMs.
-
DeepShare: Sharing ReLU Across Channels and Layers for Efficient Private Inference - Score: 15 (R=8, N=7) - Date: 2025-12-22 - Comment: Model Compression and Efficiency: architectural sharing of DReLU across channels and layers to cut expensive non-linear operations in private inference, with expressivity analysis.
-
Tiny Recursive Control: Iterative Reasoning for Efficient Optimal Control - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Model Architecture/Efficiency: iterative weight sharing and hierarchical latent refinement trade iteration depth for parameter count, enabling compact controllers.
-
AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Efficiency: adaptive gradient-guided layer/block selection for fine-tuning SLMs reduces memory/compute while matching full fine-tuning accuracy.
-
How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Matches Model Compression and Efficiency via systematic study of Low-Rank Adaptation (LoRA) rank trade-offs and effects on representations/generalization.
-
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Matches Model Compression and Efficiency via conditional/dynamic compute (early-exit) while preserving cross-modal embedding compatibility through dual-path training.
-
Metanetworks as Regulatory Operators: Learning to Edit for Requirement Compliance - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Graph metanetwork edits NNs in one pass to enforce requirements and includes weight pruning (Model Architecture + Compression/Efficiency: pruning).
-
FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Matches Model Architecture/Efficiency: bidirectional invertible flows tied via a shared latent with single flow-matching objective for any-to-any generation.
-
Distillation-Guided Structural Transfer for Continual Learning Beyond Sparse Distributed Memory - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Selective subnetwork distillation atop sparse Top-K subnetworks for continual learning (Compression/Efficiency: sparsity; structural training).
-
Plug-and-Play Parameter-Efficient Tuning of Embeddings for Federated Recommendation - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Matches Compression/Efficiency and Distributed Training: PEFT-based embedding modules (LoRA, hashing, RQ-VAE) to cut communication in federated recommendation.
-
Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits - Score: 15 (R=8, N=7) - Date: 2025-12-17 - Comment: Representation Learning/Implicit Bias—emergent invariance in Hopfield networks and norm-efficient learning of graph orbits.
-
Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Efficiency/HPC: adaptive rejection sampling for speculative decoding using target-model uncertainty to increase throughput in autoregressive inference.
-
Alada: Alternating Adaptation of Momentum Method for Memory-Efficient Matrix Optimization - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Matches Model Compression and Efficiency/HPC: memory-efficient optimizer via rank-one factorized second-moment estimation.
-
Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Model Compression and Efficiency: LoRA-based continual learning with gate-free mixture of LoRA modules and gradient-regularized low-rank updates.
-
Branching Strategies Based on Subgraph GNNs: A Study on Theoretical Promise versus Practical Reality - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Model Architecture Theory/Efficiency: proves subgraph GNNs (below 3-WL) can approximate strong branching and studies expressivity–efficiency trade-offs.
-
Luxical: High-Speed Lexical-Dense Text Embeddings - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Model Compression and Efficiency: high-speed lexical-dense embeddings distilling transformer embeddings into sparse TF–IDF + small ReLU networks.
-
Resolving Conflicts in Lifelong Learning via Aligning Updates in Subspaces - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Model Compression and Efficiency: low-rank adaptation (LoRA) with subspace-aligned updates and adapter merging for continual learning.
-
PVeRA: Probabilistic Vector-Based Random Matrix Adaptation - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Compression/efficiency criterion: parameter-efficient finetuning via probabilistic low-rank adapters (VeRA-style) using shared frozen random matrices.
-
LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Model Architecture/Efficiency: augments token embeddings with linguistic metadata to improve pretraining efficiency and generation with minimal parameter overhead.
-
Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Model Compression: LLM-driven, training-free proxy discovery for mixed-precision quantization, reformulating MPQ design via prompt-optimized LLMs.
-
Recover-to-Forget: Gradient Reconstruction from LoRA for Efficient LLM Unlearning - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Model Compression/Efficiency: uses low-rank LoRA updates to reconstruct full-model gradients for scalable unlearning; leverages low-rank structure for efficient parameter updates.
-
Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Matches Compression/Efficiency: training-free dynamic token compression guided by internal attention for VLLMs.
-
Approximate Multiplier Induced Error Propagation in Deep Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Compression/efficiency criterion: analytic propagation of approximate multiplier error through GEMM to estimate and predict DNN accuracy impact.
-
LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning - Score: 15 (R=8, N=7) - Date: 2025-12-08 - Comment: Matches Model Compression and Efficiency: online early-exit mechanism using hidden-state probes and conformal guarantees to reduce inference compute.
-
Uncertainty Quantification for Scientific Machine Learning using Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN) - Score: 15 (R=8, N=7) - Date: 2025-12-08 - Comment: Matches Model Architecture and Efficiency: integrates sparse variational Gaussian processes with Kolmogorov–Arnold Networks for scalable Bayesian inference with quasi-linear complexity.
-
Optical Context Compression Is Just (Bad) Autoencoding - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Matches Compression/Efficiency: critical analysis of context compression representations for LMs with alternative encoders.
-
Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Sparsity/Implicit Regularization: tuning-free structured sparse recovery in MMV via overparameterized factorization and provable row-sparsity emergence.
-
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Matches Efficiency/Architecture: minimal Feature Token Modulation and low-rank updates (FLA) for robust VLA adaptation.
-
In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs - Score: 15 (R=8, N=7) - Date: 2025-12-03 - Comment: Efficiency: training-free in-context distillation with self-consistency cascades to reduce LLM agent inference cost while preserving accuracy.
-
Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Model Compression and Efficiency: SVD-based low-rank decomposition with novel thresholding/scaling for model merging, preserving task-specific information with ~1% storage.
-
Mode-Conditioning Unlocks Superior Test-Time Scaling - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Matches Efficiency/Test‑time Scaling: mode‑conditioning via specialists or mode‑specific prefixes allocates sampling budget across reasoning modes.
-
Scalable and Interpretable Scientific Discovery via Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN) - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Matches Model Architecture and Efficiency: probabilistic KAN with sparse variational GP inference reducing complexity from $O(N^3)$ to $O(NM^2)$.
-
Accelerated Execution of Bayesian Neural Networks using a Single Probabilistic Forward Pass and Code Generation - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Model Efficiency/HPC: single probabilistic forward pass for BNNs with TVM code generation and Gaussian-propagating ops for embedded deployment.
-
Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Model Compression and Efficiency: quantizes and compresses intermediate multimodal embeddings to low-bit integers for communication-efficient split learning.
-
AutoTailor: Automatic and Efficient Adaptive Model Deployment for Diverse Edge Devices - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Systems/Efficiency: automated SuperNet construction with learning-free latency/accuracy predictors for adaptive edge deployment.
-
Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Model Compression/Efficiency: adaptive prefill length prediction and dLLM-specific speculative decoding to reduce inference cost.
-
RoSA: Enhancing Parameter-Efficient Fine-Tuning via RoPE-aware Selective Adaptation in Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Model Compression/Efficiency: PEFT via RoPE-aware attention enhancement and dynamic layer selection for targeted adaptation.
-
Cacheback: Speculative Decoding With Nothing But Cache - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Model Compression/Efficiency: training-free, model-agnostic speculative decoding via cache-only draft generation to accelerate LLM inference.
High Performance Computing (52)
-
A Comedy of Estimators: On KL Regularization in RL Training of LLMs - Score: 20.0 (R=0, N=0) - Date: 2025-12-30 - Comment: Author match
-
Reversing Large Language Models for Efficient Training and Fine-Tuning - Score: 19 (R=10, N=9) - Date: 2025-12-03 - Comment: HPC/Memory Optimization + Model Architecture: reversible LLM layers enable backprop without storing activations; includes conversion of pretrained models into reversible form.
-
Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution - Score: 18 (R=10, N=8) - Date: 2025-12-30 - Comment: Matches High Performance Computing: O(1) memory exact differentiation for SSMs via tiled operator-space evolution (PGF), enabling long-sequence training/sensitivity analysis on limited hardware.
-
Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality - Score: 18 (R=10, N=8) - Date: 2025-12-26 - Comment: HPC/distributed algorithm: new communication-efficient distributed attention (tile-based mesh) with provably lower communication and improved scalability.
-
Flash Multi-Head Feed-Forward Network - Score: 18 (R=10, N=8) - Date: 2025-12-09 - Comment: Model Architecture + Systems Efficiency: introduces Multi-Head FFN with an I/O-aware fused kernel (Flash-style) and dynamic sub-networks for better perplexity/memory.
-
KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing - Score: 18 (R=10, N=8) - Date: 2025-12-05 - Comment: High Performance Computing: systems-level innovation for LLM inference by storing both weights and KV cache in compute-enabled 3D NAND, with head-group parallelism and page-level KV mapping to cut data movement.
-
Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Matches: High Performance Computing (compiler/runtime mega-kernelization, SM-level task graphs for end-to-end fusion).
-
On Harnessing Idle Compute at the Edge for Foundation Model Training - Score: 17 (R=9, N=8) - Date: 2025-12-30 - Comment: Distributed training innovation for edge devices via selective hybrid tensor parallelism and PS-centric design (HPC/distributed training).
-
Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-12-29 - Comment: High Performance Computing: algorithmic optimization of distributed LLM inference (block placement and request routing) with models and guarantees.
-
GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping - Score: 17 (R=9, N=8) - Date: 2025-12-22 - Comment: Systems/HPC contribution: SSD-offloaded LLM training with vertical micro-batch scheduling and optimizer-step overlap for memory/throughput optimization.
-
Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs - Score: 17 (R=9, N=8) - Date: 2025-12-22 - Comment: High Performance Computing — optimizer-level innovation (Generalized Primal Averaging) that accelerates LLM training with reduced memory and convergence guarantees.
-
AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: High Performance Computing/Systems: end-to-end compiler maps NN graphs onto AMD AIE-ML 2D arrays with on-chip memory placement and quantization support.
-
DEER: Draft with Diffusion, Verify with Autoregressive Models - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Efficiency/HPC: speculative decoding with diffusion drafter (parallel) and AR verifier significantly increases acceptance length and end-to-end speedup.
-
CurvaDion: Curvature-Adaptive Distributed Orthonormalization - Score: 17 (R=9, N=8) - Date: 2025-12-18 - Comment: Distributed Training/HPC: curvature-adaptive synchronization using Relative Maximum Momentum Change to cut communication while preserving convergence.
-
LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-12-10 - Comment: High Performance Computing: principled multistage pipelining with variable gradient delays and pipeline-aware EMA to reconstruct past weights and cut memory.
-
DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: HPC/Systems: predictive cache management (bypassing, dead-block prediction, thrash mitigation) for multi-core AI accelerators running LLMs; systems-level innovation for faster inference.
-
From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Model architecture/efficiency criterion: principled adaptation from autoregressive to block-wise diffusion with context-causal masks and gradual block growth to enable parallel generation.
-
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning - Score: 17 (R=9, N=8) - Date: 2025-12-04 - Comment: High Performance Computing: RL-driven kernel synthesis/optimization for HGEMM outperforming cuBLAS/cuBLASLt, enabling faster core operations for large-scale training/inference.
-
A Fully First-Order Layer for Differentiable Optimization - Score: 17 (R=9, N=8) - Date: 2025-12-03 - Comment: High Performance Computing/Efficiency: a fully first-order differentiable optimization layer avoiding Hessian solves, reducing compute and memory with non-asymptotic guarantees.
-
Model Recovery at the Edge under Resource Constraints for Physical AI - Score: 17 (R=9, N=8) - Date: 2025-12-03 - Comment: Matches Model Compression and Efficiency + HPC: replaces iterative NODE solvers with a parallelizable neural architecture on FPGA, reducing memory/energy and runtime.
-
Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems - Score: 17 (R=9, N=8) - Date: 2025-12-01 - Comment: High Performance Computing: systems-level framework for serving heterogeneous LoRA adapters with dynamic placement, routing, and GPU Direct RDMA to improve throughput and tail latency.
-
AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis - Score: 16 (R=9, N=7) - Date: 2025-12-31 - Comment: High Performance Computing: automated kernel generation/tuning across multiple DSLs and hardware backends for AI workloads, addressing portability and performance.
-
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta - Score: 16 (R=9, N=7) - Date: 2025-12-31 - Comment: High Performance Computing: agentic kernel coding framework automating kernel optimization across heterogeneous accelerators with graph-based search and RAG prompts.
-
GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs - Score: 16 (R=9, N=7) - Date: 2025-12-31 - Comment: High Performance Computing: LLM-driven GPU kernel autotuning using minimal executable programs to avoid full builds, with automated repair and pattern inheritance for performance.
-
Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback - Score: 16 (R=9, N=7) - Date: 2025-12-30 - Comment: Matches Distributed Training/Communication Compression: CAFe and CAFe-S enable biased compression with aggregate/server-guided feedback and convergence guarantees without client-side state.
-
Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling - Score: 16 (R=9, N=7) - Date: 2025-12-29 - Comment: High Performance Computing: systems-level analysis of SRAM size/frequency and memory-bandwidth bottlenecks for LLM inference.
-
A Mechanistic Analysis of Transformers for Dynamical Systems - Score: 16 (R=9, N=7) - Date: 2025-12-25 - Comment: Representation Learning/Mechanistic Interpretability: analyzes single-layer Transformers as history-dependent recurrences; identifies regimes and limitations for dynamical systems modeling.
-
Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap - Score: 16 (R=9, N=7) - Date: 2025-12-12 - Comment: High Performance Computing: finer-grain compute–communication overlap (FiCCO) with GPU DMA offload and scheduling heuristics for distributed ML training/inference.
-
RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs - Score: 16 (R=9, N=7) - Date: 2025-12-09 - Comment: High-performance training criterion: distributed RL framework on TPUs with parameter-server design and preemption-resilient large-scale rollout generation for LLM training.
-
Research Program: Theory of Learning in Dynamical Systems - Score: 16 (R=8, N=8) - Date: 2025-12-23 - Comment: Matches Representation Learning/Training Dynamics: research program and finite-sample learnability framework for dynamical systems via spectral filtering.
-
Stable spectral neural operator for learning stiff PDE systems from limited data - Score: 16 (R=8, N=8) - Date: 2025-12-15 - Comment: Model architecture criterion: Stable Spectral Neural Operator with integrating-factor time-stepping embeds spectral inductive biases for stiff PDEs under limited data.
-
Optimizing Optimizers for Fast Gradient-Based Learning - Score: 16 (R=8, N=8) - Date: 2025-12-09 - Comment: Optimization Theory: convex formulation for designing optimizers maximizing instantaneous loss decrease; yields closed-form optimizers and dynamic hyperparameters.
-
Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: HPC/Systems Efficiency: fine-grained edge-cloud partitioning of Transformer sub-blocks using Lyapunov-assisted DRL to minimize latency/energy under variable bandwidth.
-
Role-Based Fault Tolerance System for LLM RL Post-Training - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: High Performance Computing: role-based fault isolation and dynamic UCX P2P communication for resilient distributed LLM RL post-training, avoiding full restarts.
-
Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: High-Performance Serving/Efficiency: adaptive speculative decoding that dynamically selects speculative length based on load to optimize throughput/latency.
-
Beyond Centralization: Provable Communication Efficient Decentralized Multi-Task Learning - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: HPC/Distributed training and Representation Learning: communication-efficient decentralized multi-task learning with shared low-rank structure and provable time/communication/sample complexity.
-
Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing - Score: 15 (R=8, N=7) - Date: 2025-12-22 - Comment: Systems-level inference optimization: GPU-internal scheduling and resource sharing across multimodal preprocessing, vision encoding, and LLM inference to reduce latency and improve utilization.
-
Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Matches High Performance Computing with a scheduling mechanism (staggered batching and load-aware allocation) to co-optimize TTFT and throughput in DP+EP inference.
-
Dynamic Rebatching for Efficient Early-Exit Inference with DREX - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: HPC/Systems for LLM inference: dynamic rebatching for early-exit models with KV-state handling and SLA-aware scheduling.
-
Scalable Formal Verification via Autoencoder Latent Space Abstraction - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Autoencoder-based latent space abstraction with formal guarantees to scale verification; matches Representation Learning (autoencoder latent modeling) and systems scalability.
-
DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: HPC/Distributed Training and Communication Efficiency: DP stochastic gradient push with compressed communication over directed graphs with utility bounds.
-
Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Matches High Performance Computing/Efficiency: inference-side low-rank (LoRA) updates and systems optimizations for freshness with minimal overhead.
-
Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters - Score: 15 (R=8, N=7) - Date: 2025-12-12 - Comment: Matches High Performance Computing: proposes a hybrid RL+MILP scheduler (RLTune) for heterogeneous GPU clusters, improving utilization/JCT without per-job profiling.
-
TinyDéjàVu: Smaller Memory Footprint & Faster Inference on Sensor Data Streams with Always-On Microcontrollers - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Model Compression and Efficiency: systems-level memory optimization and redundant compute elimination for sliding-window inference on MCUs.
-
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: High Performance Computing (serving systems): predictive one-for-many GPU prewarming, evict-aware placement, and zero-overhead memory switching for multi-LLM serving.
-
gHAWK: Local and Global Structure Encoding for Scalable Training of Graph Neural Networks on Knowledge Graphs - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Matches High Performance Computing/Scalable Training: precomputed structural priors (Bloom filters, TransE) to reduce message passing and memory for large KGs.
-
MobileFineTuner: A Unified End-to-End Framework for Fine-Tuning LLMs on Mobile Phones - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: High Performance/Systems: enables on-device LLM fine-tuning with parameter sharding, gradient accumulation, and energy-aware scheduling.
-
MixLM: High-Throughput and Effective LLM Ranking via Text-Embedding Mix-Interaction - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Matches Model Compression and Efficiency: reduces LLM input context via mix-interaction of text and cached embedding tokens; systems-level serving optimizations.
-
Always Keep Your Promises: DynamicLRP, A Model-Agnostic Solution To Layer-Wise Relevance Propagation - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Matches Representation/Analysis: model-agnostic LRP at tensor-op graph level with a new Promise System preserving conservation properties.
-
From monoliths to modules: Decomposing transducers for efficient world modelling - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Systems-level innovation: decomposing transducers into sub-transducers for parallelizable and interpretable world modeling (distributed inference).
-
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Matches HPC/systems for inference: future-state-aware asynchronous inference to eliminate prediction–execution misalignment, achieving low-latency control without architectural changes.
-
Distributed Dynamic Associative Memory via Online Convex Optimization - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Matches Representation Learning (associative memory formalism) and High Performance Computing/Distributed Training (tree-based distributed online optimization with regret bounds).
Representation Learning (144)
-
SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation - Score: 20.0 (R=0, N=0) - Date: 2025-12-26 - Comment: Author match
-
World Models Can Leverage Human Videos for Dexterous Manipulation - Score: 20.0 (R=0, N=0) - Date: 2025-12-16 - Comment: Author match
-
JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention - Score: 20.0 (R=0, N=0) - Date: 2025-12-09 - Comment: Author match
-
Provably Extracting the Features from a General Superposition - Score: 19 (R=10, N=9) - Date: 2025-12-19 - Comment: Matches Representation Learning—provable recovery of features from general superposition with efficient query algorithm.
-
Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability - Score: 19 (R=10, N=9) - Date: 2025-12-16 - Comment: Representation Learning: proposes an information-theoretic metric for superposition via sparse autoencoders; connects feature capacity to robustness.
-
Block-Recurrent Dynamics in Vision Transformers - Score: 18 (R=10, N=8) - Date: 2025-12-24 - Comment: Model Architecture and Representation Learning: discovers block-recurrent depth structure in ViTs and trains recurrent surrogates (Raptor), with dynamical/low-rank analyses.
-
A Unified Representation of Neural Networks Architectures - Score: 18 (R=10, N=8) - Date: 2025-12-22 - Comment: Foundational architecture theory: unified continuum representation (DiPaNet) linking infinite width/depth, residual nets, and neural ODEs with approximation error analysis.
-
Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: Establishes a bijection between ARMs and EBMs with theoretical equivalence and distillation bounds (Model Architecture/Representation Learning theory).
-
Understanding NTK Variance in Implicit Neural Representations - Score: 18 (R=10, N=8) - Date: 2025-12-18 - Comment: Matches Representation Learning/Theory: closed-form analysis linking INR architectural components to NTK eigenvalue variance and spectral bias.
-
Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings - Score: 18 (R=10, N=8) - Date: 2025-12-16 - Comment: Model Architecture/Efficiency — zero-shot context extension by dropping positional embeddings post-training without long-context finetuning.
-
Enforcing Orderedness to Improve Feature Consistency - Score: 18 (R=10, N=8) - Date: 2025-12-04 - Comment: Representation Learning: sparse autoencoders with strict latent ordering resolve permutation non-identifiability in sparse dictionary learning, improving feature consistency.
-
Likelihood-Preserving Embeddings for Statistical Inference - Score: 18 (R=9, N=9) - Date: 2025-12-30 - Comment: Likelihood-preserving embeddings with explicit bounds (approximate sufficient statistics); strong theory for representation learning/compression for inference.
-
Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics - Score: 18 (R=9, N=9) - Date: 2025-12-25 - Comment: Representation Learning/Training Dynamics: derives Neural Feature Dynamics in the infinite-width/depth limit to explain scaling laws and proposes a depth-aware learning-rate correction.
-
Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling - Score: 18 (R=9, N=9) - Date: 2025-12-24 - Comment: Representation Learning/Training Dynamics: analytically tractable inference-time scaling with best-of-k theory for LLM-as-a-judge.
-
From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers - Score: 18 (R=9, N=9) - Date: 2025-12-23 - Comment: Representation Learning/Training Dynamics in Transformers: theoretical analysis of shortcut vs induction head selection driven by data diversity.
-
The Operator Origins of Neural Scaling Laws: A Generalized Spectral Transport Dynamics of Deep Learning - Score: 18 (R=9, N=9) - Date: 2025-12-12 - Comment: Foundational training dynamics: derives a spectral transport–dissipation PDE linking operator geometry to neural scaling laws and double descent, unifying NTK vs feature learning.
-
Generation is Required for Data-Efficient Perception - Score: 18 (R=9, N=9) - Date: 2025-12-10 - Comment: Matches Representation Learning Theory: formalizes inductive biases for compositional generalization and shows why generative inversion enables data-efficient perception.
-
Asymptotic analysis of shallow and deep forgetting in replay with Neural Collapse - Score: 18 (R=9, N=9) - Date: 2025-12-09 - Comment: Training dynamics/representation learning criterion: asymptotic analysis of shallow vs deep forgetting in replay via Neural Collapse, explaining separability vs classifier failure.
-
When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling - Score: 18 (R=9, N=9) - Date: 2025-12-04 - Comment: Representation Learning/Theory: shows failure of Gaussian equivalence for random features at quadratic scaling; introduces Conditional Gaussian Equivalent model with sharp asymptotics.
-
Understanding the Mechanisms of Fast Hyperparameter Transfer - Score: 17 (R=9, N=8) - Date: 2025-12-31 - Comment: Training Dynamics/Representation Learning: theoretical framework for scale-aware hyperparameter transfer with compute-optimal analysis and mechanisms of fast transfer.
-
Approximation Capabilities of Feedforward Neural Networks with GELU Activations - Score: 17 (R=9, N=8) - Date: 2025-12-29 - Comment: Representation Learning/Theory: approximation bounds for GELU networks covering functions and higher-order derivatives (expressivity/approximation theory).
-
An Equivariance Toolbox for Learning Dynamics - Score: 17 (R=9, N=8) - Date: 2025-12-29 - Comment: Representation Learning: theoretical framework linking symmetry/equivariance to first- and second-order learning dynamics (gradient–Hessian geometry).
-
On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension Reduction - Score: 17 (R=9, N=8) - Date: 2025-12-23 - Comment: Matches Representation Learning: generative sufficient dimension reduction with population/sample-level exhaustiveness guarantees for recovering central sigma-field.
-
The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation - Score: 17 (R=9, N=8) - Date: 2025-12-23 - Comment: Matches Representation Learning/Training Dynamics: discovers and explains an interaction-order bottleneck and provides modulation losses.
-
When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics - Score: 17 (R=9, N=8) - Date: 2025-12-23 - Comment: Matches Representation Learning/Training Dynamics: provides theoretical conditions for power-law spectral dynamics via GRSD renormalization.
-
Disentangled representations via score-based variational autoencoders - Score: 17 (R=9, N=8) - Date: 2025-12-22 - Comment: Representation Learning: unifies diffusion and VAE ELBOs in a score-based autoencoder to learn interpretable, disentangled latent representations.
-
In-Context Algebra - Score: 17 (R=9, N=8) - Date: 2025-12-19 - Comment: Representation Learning/Mechanistic interpretability: isolates learned mechanisms (copying, identity recognition, cancellation) in transformers trained on variable algebra.
-
Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants - Score: 17 (R=9, N=8) - Date: 2025-12-18 - Comment: Matches Model Architecture and Representation Learning: introduces a sparse concept-bottleneck encoder/decoder trained end-to-end to predict behavior from activations.
-
RePo: Language Models with Context Re-Positioning - Score: 17 (R=9, N=8) - Date: 2025-12-18 - Comment: Differentiable context re-positioning replacing fixed positional indices (Model Architecture; Representation Learning of contextual structure).
-
Universality of high-dimensional scaling limits of stochastic gradient descent - Score: 17 (R=9, N=8) - Date: 2025-12-16 - Comment: Theoretical universality of high-dimensional SGD scaling limits (ODE/SDE); strong fit to Representation Learning (training dynamics theory).
-
Phase transitions reveal hierarchical structure in deep neural networks - Score: 17 (R=9, N=8) - Date: 2025-12-16 - Comment: Representation Learning — theoretical link between saddle points, phase transitions, and mode connectivity; introduces a probe of loss geometry.
-
D-STEER - Preference Alignment Techniques Learn to Behave, not to Believe -- Beneath the Surface, DPO as Steering Vector Perturbation in Activation Space - Score: 17 (R=9, N=8) - Date: 2025-12-16 - Comment: Representation Learning/Training Dynamics: shows DPO acts as a low-rank steering perturbation (rank-1 dominance) in activation space.
-
Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration - Score: 17 (R=9, N=8) - Date: 2025-12-15 - Comment: Training dynamics criterion: theoretical analysis linking GD steps to generalized perceptron algorithms, explaining implicit acceleration in nonlinear networks.
-
Emergence of Nonequilibrium Latent Cycles in Unsupervised Generative Modeling - Score: 17 (R=9, N=8) - Date: 2025-12-15 - Comment: Matches Representation Learning/Generative modeling: proposes a nonequilibrium latent-variable architecture breaking detailed balance with theoretical insights into training dynamics.
-
Features Emerge as Discrete States: The First Application of SAEs to 3D Representations - Score: 17 (R=9, N=8) - Date: 2025-12-15 - Comment: Matches Representation Learning: applies Sparse Autoencoders (dictionary learning) to 3D activations, revealing discrete state features and training dynamics.
-
Learning by Analogy: A Causal Framework for Composition Generalization - Score: 17 (R=9, N=8) - Date: 2025-12-12 - Comment: Representation Learning: causal modularity framework with identifiable hierarchical latent structure enabling compositional generalization.
-
Self-Supervised Learning with Gaussian Processes - Score: 17 (R=9, N=8) - Date: 2025-12-11 - Comment: Representation Learning: GP priors on representations with connections to kernel PCA and VICReg, enabling uncertainty-aware SSL.
-
Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression - Score: 17 (R=9, N=8) - Date: 2025-12-11 - Comment: Representation Learning/Transformer analysis: clean and adversarial generalization bounds (Rademacher) quantifying the impact of positional encoding in in-context regression.
-
A Geometric Unification of Concept Learning with Concept Cones - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Strong match to Representation Learning: unifies CBMs and SAEs via concept cones with quantitative metrics linking sparsity/expansion to concept emergence.
-
GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering - Score: 17 (R=9, N=8) - Date: 2025-12-09 - Comment: Representation Learning/SAE: graph-regularized sparse autoencoders with Laplacian smoothness to recover distributed safety features and enable selective steering.
-
On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability - Score: 17 (R=9, N=8) - Date: 2025-12-08 - Comment: Matches Representation Learning: unified theoretical framework and optimization landscape for sparse dictionary learning (sparse autoencoders/transcoders), explaining phenomena like feature absorption and dead neurons.
-
When do spectral gradient updates help in deep learning? - Score: 17 (R=9, N=8) - Date: 2025-12-05 - Comment: Representation Learning/Training Dynamics: theoretical conditions for spectral gradient methods (e.g., Muon) to outperform Euclidean updates in deep nets/transformers.
-
The Initialization Determines Whether In-Context Learning Is Gradient Descent - Score: 17 (R=9, N=8) - Date: 2025-12-05 - Comment: Representation Learning/Training Dynamics: theoretical link between multi-head linear self-attention and gradient descent via initialization (yq-LSA).
-
Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics - Score: 17 (R=9, N=8) - Date: 2025-12-04 - Comment: Matches Representation Learning: provides a theoretical characterization of cross-entropy training dynamics and neural collapse; Hadamard initialization diagonalizes softmax to make dynamics tractable.
-
AlignSAE: Concept-Aligned Sparse Autoencoders - Score: 17 (R=9, N=8) - Date: 2025-12-02 - Comment: Representation Learning: introduces concept-aligned Sparse Autoencoders with supervised post-training to bind ontology concepts to sparse latent slots enabling causal interventions.
-
Implicitly Normalized Online PCA: A Regularized Algorithm with Exact High-Dimensional Dynamics - Score: 17 (R=9, N=8) - Date: 2025-12-02 - Comment: Matches Representation Learning/training dynamics: introduces Implicitly Normalized Online PCA with exact high-dimensional dynamics (PDE/ODE) and performance phase transitions.
-
Tuning Universality in Deep Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-12-02 - Comment: Training Dynamics/Representation Theory: stochastic deep information propagation linking activation design to universality classes and avalanche dynamics.
-
On the Effect of Regularization on Nonparametric Mean-Variance Regression - Score: 17 (R=9, N=8) - Date: 2025-12-01 - Comment: Representation Learning/Training Dynamics: analyzes phase transitions in mean-variance regression via statistical field theory, reducing regularization search dimensionality.
-
Uncovering Competency Gaps in Large Language Models and Their Benchmarks - Score: 16 (R=9, N=7) - Date: 2025-12-26 - Comment: Matches Representation Learning: uses sparse autoencoders to probe internal concepts and decompose benchmark scores, yielding representation-grounded evaluation insights.
-
LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer - Score: 16 (R=9, N=7) - Date: 2025-12-23 - Comment: Representation Learning: uses Sparse Autoencoders to learn disentangled stylistic concepts enabling interpretable, controllable steering.
-
From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts? - Score: 16 (R=9, N=7) - Date: 2025-12-19 - Comment: Matches Representation Learning—evaluates SAEs/sparse probes for disentanglement and independent steering of concepts.
-
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers - Score: 16 (R=9, N=7) - Date: 2025-12-18 - Comment: Matches Representation Learning/Interpretability: trains LLMs to answer natural-language queries about activations across tasks (LatentQA-style AOs).
-
ReflCtrl: Controlling LLM Reflection via Representation Engineering - Score: 16 (R=9, N=7) - Date: 2025-12-18 - Comment: Matches Representation Engineering/Efficiency: discovers a latent 'reflection' direction to control CoT self-reflection and cut inference tokens.
-
What matters for Representation Alignment: Global Information or Spatial Structure? - Score: 16 (R=9, N=7) - Date: 2025-12-12 - Comment: Strongly matches Representation Learning: analyzes which aspects (spatial structure vs. global semantics) drive representational alignment in generative training and proposes simple architectural/normalization tweaks (iREPA).
-
Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit - Score: 16 (R=9, N=7) - Date: 2025-12-12 - Comment: Representation Learning: uses sparse autoencoders to build interpretable, controllable embeddings (sparse concept features) for large-scale text data analysis.
-
Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power - Score: 16 (R=9, N=7) - Date: 2025-12-11 - Comment: Representation/Architecture Theory: analyzes expressivity-generalization tradeoffs in equivariant networks and compensation via model size.
-
Mathematical Foundations of Neural Tangents and Infinite-Width Networks - Score: 16 (R=9, N=7) - Date: 2025-12-10 - Comment: Matches Representation Learning and Model Architecture: NTK-ECRN enables rigorous analysis with bounds on NTK dynamics/eigenvalues linking to generalization/stability.
-
Complexity of One-Dimensional ReLU DNNs - Score: 16 (R=9, N=7) - Date: 2025-12-10 - Comment: Matches Representation Learning Theory: expressivity of 1D ReLU DNNs via linear region counts; introduces function-adaptive sparsity notion.
-
Softly Symbolifying Kolmogorov-Arnold Networks - Score: 16 (R=9, N=7) - Date: 2025-12-10 - Comment: Matches Model Architecture and Sparsity: integrates symbolic primitives with differentiable sparsifying gates (MDL-guided) in KANs for interpretable representations.
-
SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals - Score: 16 (R=9, N=7) - Date: 2025-12-05 - Comment: Representation Learning/Interpretability: identifies a robust tail-activation signal (SuperActivators) for concept detection, improving concept attribution across modalities and layers.
-
Training Dynamics of Learning 3D-Rotational Equivariance - Score: 16 (R=9, N=7) - Date: 2025-12-03 - Comment: Analyzes training dynamics of learning 3D-rotational equivariance with a principled equivariance error measure—matches Representation Learning and training dynamics.
-
Dynamical Implicit Neural Representations - Score: 16 (R=9, N=7) - Date: 2025-12-01 - Comment: Model Architecture/Representation Learning: Dynamical INRs (continuous-time feature evolution) mitigate spectral bias with supporting theory (NTK, Rademacher).
-
Random Controlled Differential Equations - Score: 16 (R=8, N=8) - Date: 2025-12-30 - Comment: Model Architecture and Efficiency: random-feature controlled differential equation reservoirs with only a linear readout; kernel-limit theory for efficient time-series representation learning.
-
Decomposing Task Vectors for Refined Model Editing - Score: 16 (R=8, N=8) - Date: 2025-12-30 - Comment: Matches Representation Learning: task vector decomposition into shared/unique subspaces, enabling controlled model editing.
-
Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model - Score: 16 (R=8, N=8) - Date: 2025-12-30 - Comment: Matches Representation Learning: preference alignment via semiparametric single-index modeling, with theoretical policy error bounds.
-
Dictionary-Transform Generative Adversarial Networks - Score: 16 (R=8, N=8) - Date: 2025-12-29 - Comment: Advances Representation Learning with sparse synthesis/analysis operators and a model-based adversarial framework; strong theoretical guarantees on identifiability and stability for sparse dictionary learning.
-
Model Merging via Multi-Teacher Knowledge Distillation - Score: 16 (R=8, N=8) - Date: 2025-12-26 - Comment: Representation Learning/Model Merging: derives a flatness-aware PAC-Bayes bound and frames merging as multi-teacher knowledge distillation with SAM (SAMerging).
-
Relu and softplus neural nets as zero-sum turn-based games - Score: 16 (R=8, N=8) - Date: 2025-12-24 - Comment: Representation Learning: game-theoretic and path-integral formulation of ReLU/Softplus networks enabling robustness certificates and training insights.
-
Symplectic Reservoir Representation of Legendre Dynamics - Score: 16 (R=8, N=8) - Date: 2025-12-23 - Comment: Matches Representation Learning and Model Architecture: symplectic reservoir computing preserves Legendre duality via Hamiltonian dynamics, imposing geometric structure on representations.
-
A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models - Score: 16 (R=8, N=8) - Date: 2025-12-23 - Comment: Representation Learning/Training Dynamics: provides a unified EBM-based theoretical analysis of RL-tuned LMs (instruction/RLVR).
-
On the Universal Representation Property of Spiking Neural Networks - Score: 16 (R=8, N=8) - Date: 2025-12-19 - Comment: Matches Representation Learning criterion via a rigorous, quantitative universal representation property for SNNs (sequence-to-sequence spike functions) and analysis of architectural depth/composition.
-
Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization - Score: 16 (R=8, N=8) - Date: 2025-12-17 - Comment: Model Architecture/Representation Learning: derivative-informed training of Fourier Neural Operators with universal approximation (incl. Fréchet derivatives) and efficient training schemes.
-
Modular connectivity in neural networks emerges from Poisson noise-motivated regularisation, and promotes robustness and compositional generalisation - Score: 16 (R=8, N=8) - Date: 2025-12-17 - Comment: Representation Learning/Regularization—Poisson-noise-motivated regularizer induces modular connectivity and compositional generalization.
-
State over Tokens: Characterizing the Role of Reasoning Tokens - Score: 16 (R=8, N=8) - Date: 2025-12-16 - Comment: Conceptual framework recasting reasoning tokens as externalized computational state; aligns with Representation Learning/training dynamics.
-
Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry - Score: 16 (R=8, N=8) - Date: 2025-12-16 - Comment: Representation Learning/Theory: quantifies complexity gap between exact vs approximate symmetry, guiding symmetry-inductive biases.
-
Mull-Tokens: Modality-Agnostic Latent Thinking - Score: 16 (R=8, N=8) - Date: 2025-12-12 - Comment: Matches Model Architecture and Representation Learning: introduces modality-agnostic latent tokens for internal multimodal reasoning, enabling dynamic latent thinking without tool calls.
-
Local-Curvature-Aware Knowledge Graph Embedding: An Extended Ricci Flow Approach - Score: 16 (R=8, N=8) - Date: 2025-12-09 - Comment: Representation Geometry: couples KGE optimization with local curvature via extended Ricci flow; proves curvature decay and convergence of distances.
-
Zero Generalization Error Theorem for Random Interpolators via Algebraic Geometry - Score: 16 (R=8, N=8) - Date: 2025-12-09 - Comment: Theory/Training Dynamics: proves zero generalization error for random interpolators beyond a sample threshold via algebraic geometry.
-
Interaction Tensor Shap - Score: 16 (R=8, N=8) - Date: 2025-12-08 - Comment: Matches Representation Learning/interpretability: exact high-order Shapley interactions via tensor-network contractions enabling polynomial-time computation under TT structure.
-
Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction - Score: 16 (R=8, N=8) - Date: 2025-12-04 - Comment: Representation Learning theory: probabilistic foundations for fuzzy simplicial sets (UMAP), linking to generative models and enabling new embedding methods.
-
Does Flatness imply Generalization for Logistic Loss in Univariate Two-Layer ReLU Network? - Score: 16 (R=8, N=8) - Date: 2025-12-02 - Comment: Matches Representation Learning/Training Dynamics theory: analyzes when flatness implies generalization for logistic loss in 2-layer ReLU nets.
-
An RKHS Perspective on Tree Ensembles - Score: 16 (R=8, N=8) - Date: 2025-12-02 - Comment: Representation Learning/Theory: RKHS framework for tree ensembles with variational interpretation and gradient flow on a data-dependent Hilbert manifold.
-
From Coefficients to Directions: Rethinking Model Merging with Directional Alignment - Score: 16 (R=8, N=8) - Date: 2025-12-02 - Comment: Matches Model Architecture and Representation Learning: introduces directional alignment across parameter and feature spaces (leveraging Neural Collapse) for principled model merging.
-
A Variational Manifold Embedding Framework for Nonlinear Dimensionality Reduction - Score: 16 (R=8, N=8) - Date: 2025-12-01 - Comment: Matches Representation Learning: variational framework for nonlinear manifold embedding with PDE characterization; foundational dimensionality reduction.
-
Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian - Score: 16 (R=8, N=8) - Date: 2025-12-01 - Comment: Training dynamics/representation learning theory: convergence behavior of over-parameterized score matching (foundational for diffusion/score models).
-
How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: Representation Learning/Theory: finite-sample uniform convergence and calibration bounds for VLM-induced classifiers under low-dimensional/spectrum-dependent structure, tying sample complexity to intrinsic dimension.
-
Towards Efficient Post-Training via Fourier-Driven Adapter Architectures - Score: 15 (R=8, N=7) - Date: 2025-12-31 - Comment: Model Architecture/Efficiency: adapter-based PEFT with random Fourier features for frequency-aware modulation of representations.
-
Multiple Token Divergence: Measuring and Steering In-Context Computation Density - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: Introduces a metric (Multiple Token Divergence) for in-context computation density and a decoding method (Divergence Steering), offering insight into and control over computational dynamics in LMs; matches the representation/efficiency criterion.
-
The Affine Divergence: Aligning Activation Updates Beyond Normalisation - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: Matches: Model Architecture (new normalization mechanisms including PatchNorm) and Representation Learning (training dynamics aligning activation updates).
-
Frequency Regularization: Unveiling the Spectral Inductive Bias of Deep Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: Provides representation learning and training-dynamics insights by quantifying spectral inductive bias of regularizers (SSR metric) and analyzing frequency behavior in CNNs.
-
Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought - Score: 15 (R=8, N=7) - Date: 2025-12-30 - Comment: Causal/adversarial analysis of latent tokens in Chain-of-Continuous-Thought, revealing shortcut behavior; a representation/behavioral analysis of LM reasoning.
-
Explainable Multimodal Regression via Information Decomposition - Score: 15 (R=8, N=7) - Date: 2025-12-29 - Comment: Representation Learning: principled multimodal fusion via Partial Information Decomposition with analytic estimators and independence regularization.
-
TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior - Score: 15 (R=8, N=7) - Date: 2025-12-25 - Comment: Matches Representation Learning/Architecture analysis: isolates tokenizer effects with matched models and a targeted benchmark to study tokenization’s impact.
-
Neuron-Guided Interpretation of Code LLMs: Where, Why, and How? - Score: 15 (R=8, N=7) - Date: 2025-12-25 - Comment: Representation Learning: neuron-level analysis of code LLMs revealing language-specific neurons and concept layers, with neuron-guided fine-tuning and embeddings.
-
How Much 3D Do Video Foundation Models Encode? - Score: 15 (R=8, N=7) - Date: 2025-12-25 - Comment: Representation Learning: model-agnostic probing of 3D properties encoded by video foundation models using shallow readouts, yielding insights into learned representations.
-
Learning to Reason in LLMs by Expectation Maximization - Score: 15 (R=8, N=7) - Date: 2025-12-24 - Comment: Training Objective/Representation Learning: formulates reasoning as a latent variable and applies an EM objective; studies sampling schemes for rationale learning.
-
High-Performance Self-Supervised Learning by Joint Training of Flow Matching - Score: 15 (R=8, N=7) - Date: 2025-12-24 - Comment: Representation Learning and Efficiency: joint training of an encoder with conditional flow matching generator for faster, stable SSL.
-
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Representation Learning / Training Dynamics: decomposes LLM policy into internal layer/modular policies and proposes bottom-up policy optimization.
-
KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Representation Learning: kernel-based JEPA regularizers via closed-form high-dimensional sliced MMD, generalizing prior Gaussian-prior regularization.
-
Phase-space entropy at acquisition reflects downstream learnability - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Representation Learning: modality-agnostic phase-space entropy metric at acquisition predicting downstream learnability and sampling policy quality.
-
On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Matches Representation Learning theory: Koopman-based generalization bounds for deep multi-task networks.
-
Large Language Models as Discounted Bayesian Filters - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Representation Learning/Training Dynamics: characterizes LLM online inference as discounted Bayesian filtering with prompt-based prior calibration.
-
Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability - Score: 15 (R=8, N=7) - Date: 2025-12-23 - Comment: Representation Learning/Interpretability: theoretical guarantees for neuron identification faithfulness and stability with bootstrap-based coverage.
-
In-Context Semi-Supervised Learning - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Representation Learning/ICL theory: shows Transformers exploit unlabeled context to learn context-dependent representations in semi-supervised in-context learning.
-
Soft Geometric Inductive Bias for Object Centric Dynamics - Score: 15 (R=8, N=7) - Date: 2025-12-19 - Comment: Model Architecture/Representation: geometric algebra neural networks provide soft equivariance as an inductive bias for object-centric dynamics modeling.
-
High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Representation Learning theory: high-dimensional spectral analysis of PLS-SVD for shared low-rank latent structure.
-
A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Analytical characterization of Hessian eigenspectrum and effective parameter count near optimum (Representation Learning/training dynamics).
-
Topological Metric for Unsupervised Embedding Quality Evaluation - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Representation Learning: unsupervised embedding quality metric using persistent homology (topology-aware).
-
ATLAS: Adaptive Topology-based Learning at Scale for Homophilic and Heterophilic Graphs - Score: 15 (R=8, N=7) - Date: 2025-12-18 - Comment: Model Architecture/Efficiency: scalable graph learning by augmenting MLPs with multi-resolution community features (avoids message passing).
-
Understanding the Gain from Data Filtering in Multimodal Contrastive Learning - Score: 15 (R=8, N=7) - Date: 2025-12-17 - Comment: Representation Learning: theoretical characterization of teacher-based data filtering benefits in multimodal contrastive learning.
-
CORE: Contrastive Masked Feature Reconstruction on Graphs - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Matches Representation Learning: theoretical link between masked feature reconstruction and contrastive objectives with a unified framework.
-
PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Theory for Representation Learning: PAC-Bayes bounds for multivariate linear regression and linear autoencoders, enabling principled generalization analysis.
-
Wait, Wait, Wait... Why Do Reasoning Models Loop? - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Analyzes training dynamics and inductive biases in Transformers that cause looping; matches Representation Learning criterion (training dynamics of neural networks).
-
High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees - Score: 15 (R=8, N=7) - Date: 2025-12-16 - Comment: Representation Learning: CP low-rank discriminant tensor model with global convergence and minimax guarantees; low-rank structure focus.
-
Fully Inductive Node Representation Learning via Graph View Transformation - Score: 15 (R=8, N=7) - Date: 2025-12-15 - Comment: Model Architecture: permutation-equivariant Graph View Transformation and recurrent design enabling fully inductive cross-dataset node representation learning.
-
Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning - Score: 15 (R=8, N=7) - Date: 2025-12-15 - Comment: Matches Representation Learning/interpretability: analyzes functional specialization of Transformer attention heads and their role in reasoning.
-
Is the Information Bottleneck Robust Enough? Towards Label-Noise Resistant Information Bottleneck Learning - Score: 15 (R=8, N=7) - Date: 2025-12-12 - Comment: Representation Learning: robust Information Bottleneck with mutual information bounds and disentanglement under label noise.
-
Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules - Score: 15 (R=8, N=7) - Date: 2025-12-12 - Comment: Representation Learning/Mechanistic Interpretability: identifies sparse functional attention heads mediating multimodal reasoning in VLMs via probing and interventions.
-
Circuits, Features, and Heuristics in Molecular Transformers - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Matches Representation Learning: mechanistic interpretability of Transformers with sparse autoencoders revealing learned feature dictionaries.
-
Spectral Embedding via Chebyshev Bases for Robust DeepONet Approximation - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Matches Model Architecture: DeepONet trunk replaced by Chebyshev spectral embedding for robust operator learning on bounded domains.
-
Understanding temperature tuning in energy-based models - Score: 15 (R=8, N=7) - Date: 2025-12-11 - Comment: Representation/Training Dynamics: interpretable framework for temperature tuning in energy-based models and its impact on generative performance.
-
GeoDM: Geometry-aware Distribution Matching for Dataset Distillation - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Matches Representation Learning and Compression: geometry-aware dataset distillation with learnable curvature in product manifolds and a generalization bound.
-
PR-CapsNet: Pseudo-Riemannian Capsule Network with Adaptive Curvature Routing for Graph Learning - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Model Architecture + Representation Learning: extends CapsNet with pseudo-Riemannian tangent-space routing and adaptive curvature fusion for graph embeddings.
-
Short-Context Dominance: How Much Local Context Natural Language Actually Needs? - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Matches Representation Learning/Training Dynamics: quantifies minimum context length and proposes DaMCL-based decoding to address short-context dominance.
-
Semi-Supervised Contrastive Learning with Orthonormal Prototypes - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Representation Learning: analyzes collapse threshold in contrastive learning and proposes an orthonormal prototype loss to prevent dimensional collapse.
-
Nonnegative Matrix Factorization through Cone Collapse - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Representation Learning: geometric Cone Collapse algorithm to recover data cone and cone-aware ONMF for parts-based dictionaries.
-
Self-Supervised Learning on Molecular Graphs: A Systematic Investigation of Masking Design - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Representation Learning: principled probabilistic framework studying masking design in SSL for molecular graphs; insights on targets vs encoders (Graph Transformers).
-
RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Parameter-Efficient Fine-Tuning: dual-branch multimodal adapter with per-layer reconstruction to balance task adaptation and generalization in VLMs.
-
Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Training Dynamics/Representation: identifies curvature-induced entropic barriers explaining connectivity vs. confinement in loss landscapes.
-
Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity - Score: 15 (R=8, N=7) - Date: 2025-12-05 - Comment: Representation Learning/Training Dynamics: scaling analysis predicting when feature learning emerges in deep networks, including attention heads.
-
Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Representation Learning Theory: information-theoretic account of domain feature collapse (I(x_d; z)=0) and simple preservation strategy for OOD.
-
AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Model Architecture and Representation Learning: aliasing-aware patch embedding for audio SSL that preserves high-frequency cues while mitigating downsampling aliasing.
-
Better World Models Can Lead to Better Post-Training Performance - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Representation Learning: explicit world-modeling (state prediction) objectives sharpen latent state representations in Transformers and improve post-training dynamics.
-
Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Representation Learning: mechanistic analysis shows delayed entity resolution hinders reuse of LLM factual recall circuits in VLMs; proposes fixes.
-
Learning Network Sheaves for AI-native Semantic Communication - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Representation learning with sparsity: learns network sheaves and sparse, structured dictionaries for semantic communication alignment.
-
Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks - Score: 15 (R=8, N=7) - Date: 2025-12-04 - Comment: Matches Representation Learning/Training Dynamics: analysis of depth generalization limits in Transformers with a decomposition pipeline.
-
Embedding networks with the random walk first return time distribution - Score: 15 (R=8, N=7) - Date: 2025-12-03 - Comment: Matches Representation Learning: proposes first-return-time-distribution-based graph embeddings with theoretical grounding and empirical benefits.
-
H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Representation Learning/Mechanistic Interpretability: identifies a sparse subset of hallucination-associated neurons (H-Neurons) with causal impact and pretraining origin analysis.
-
Fantastic Features and Where to Find Them: A Probing Method to combine Features from Multiple Foundation Models - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Representation Learning: scalable probing-based adapter (ComBo) that combines features across multiple foundation models/layers without backprop through backbones.
-
Upper Approximation Bounds for Neural Oscillators - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Representation Learning/Theory: derives approximation bounds for neural oscillator architectures and related state-space models, analyzing capacity and error scaling.
-
One Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces - Score: 15 (R=8, N=7) - Date: 2025-12-02 - Comment: Representation Learning: introduces Semantic Field Subspace and SAFARI to uncover hierarchical semantic structure in embedding spaces with scalable approximations.
-
Towards Understanding Generalization in DP-GD: A Case Study in Training Two-Layer CNNs - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Training dynamics/Representation Learning: theory showing DP-GD can generalize better than GD in two-layer CNNs under certain regimes.
-
From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Representation Learning: analyzes topology/geometry of embedding spaces and proposes Unified Topological Signatures to link embedding structure to model behavior.
-
Towards a Foundation Model for Partial Differential Equations Across Physics Domains - Score: 15 (R=8, N=7) - Date: 2025-12-01 - Comment: Model Architecture/Foundation Model: PDE-FM combines spatial–spectral tokenization, physics-aware conditioning, Mamba backbone, and operator-theoretic decoder for cross-physics generalization.
Other Foundational Research (8)
-
Improved Mean Flows: On the Challenges of Fastforward Generative Models - Score: 20.0 (R=0, N=0) - Date: 2025-12-02 - Comment: Author match
-
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration - Score: 17 (R=9, N=8) - Date: 2025-12-31 - Comment: Training dynamics/scaling: extends parameterization for hyperparameter transfer across width, depth, batch size, and training duration, including per-module transfer.
-
Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis - Score: 16 (R=8, N=8) - Date: 2025-12-30 - Comment: Optimization theory for SGD with gradient clipping under heavy-tailed noise; refined rates and lower bounds (training dynamics/optimization).
-
Muon is Provably Faster with Momentum Variance Reduction - Score: 16 (R=8, N=8) - Date: 2025-12-19 - Comment: HPC/Optimization: adds momentum variance reduction to LMO-based optimizers (Muon/Scion) within the Gluon framework, with provably faster rates.
-
Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences - Score: 16 (R=8, N=8) - Date: 2025-12-16 - Comment: Matches Optimization/Training (HPC relevance): anytime-valid stopping rules for SGD via confidence sequences for principled training control.
-
Generative Modeling with Continuous Flows: Sample Complexity of Flow Matching - Score: 16 (R=8, N=8) - Date: 2025-12-02 - Comment: Theory for Generative Modeling: first sample complexity bounds for flow matching by decomposing approximation/statistical/optimization errors to guarantee W2 convergence.
-
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training - Score: 15 (R=8, N=7) - Date: 2025-12-10 - Comment: Training dynamics/scaling laws: models downstream accuracy scaling with budget across token-to-parameter ratios and inference sampling.
-
Comparing BFGS and OGR for Second-Order Optimization - Score: 15 (R=8, N=7) - Date: 2025-12-09 - Comment: Optimization/Training: proposes Online Gradient Regression (OGR) for online Hessian estimation as an alternative to BFGS, accommodating non-positive-definite Hessians in non-convex settings.