Personalized Monthly Topic Summary 2025/10

Metric	Value
Total Papers	819
Model Architecture	212
Model Compression and Efficiency	281
High Performance Computing	65
Representation Learning	252
Other Foundational Research	9

Model Architecture (212)

Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin - Score: 20.0 (R=0, N=0) - Date: 2025-10-09 - Comment: Author match
Kimi Linear: An Expressive, Efficient Attention Architecture - Score: 19 (R=10, N=9) - Date: 2025-10-31 - Comment: Model Architecture/Efficiency: introduces Kimi Delta Attention (linear attention) and hybrid with MLA, cutting KV cache and boosting throughput while surpassing full attention.
Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations - Score: 19 (R=10, N=9) - Date: 2025-10-29 - Comment: Proves non-singularity of the GD map for realistic neural architectures (including attention/conv) with piecewise analytic activations—core training dynamics theory.
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Directly targets Model Architecture: Mixture-of-Experts (softmax-gated) with identifiability theory, finite-sample MLE rates, and consistent expert-number selection.
Chimera: State Space Models Beyond Sequences - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Model Architecture: extends state space models to arbitrary data topology; Efficiency: linear-time recurrence on DAGs and quadratic-time relaxation for general graphs.
Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Model Architecture theory: dimension-free minimax rates for learning pairwise interactions in attention-style models.
Transmuting prompts into weights - Score: 19 (R=10, N=9) - Date: 2025-10-13 - Comment: Model Architecture and Representation: theoretical mapping from prompts to implicit weight updates in deep Transformers; introduces token-independent thought vectors/matrices enabling principled weight-level steering.
Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Matches Model Architecture (MoE) and Representation Learning: provable joint training dynamics for soft-routed MoE; also includes post-training pruning with convergence guarantees (Model Compression/Efficiency).
Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Provides a rigorous random-matrix-theoretic analysis of self-attention spectra, advancing theoretical understanding of Transformer architecture and representation dynamics.
The Effect of Attention Head Count on Transformer Approximation - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Model Architecture theory: establishes upper and lower bounds on transformer approximation as a function of attention head count, including a first rigorous lower bound in a nonlinear practical setting.
SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Model Architecture and Efficiency: introduces a hybrid AR–diffusion decoding paradigm enabling blockwise parallel generation and reports scaling across dense and MoE models.
Critical attention scaling in long-context transformers - Score: 19 (R=10, N=9) - Date: 2025-10-08 - Comment: Strong match to Model Architecture and Representation Learning: rigorous theory of attention scaling in long-context Transformers, identifying critical β_n ≍ log n to prevent rank-collapse.
Replacing Softmax Similarity with a Sharpened Angular Similarity: Theory and Practice of Scaling To Billion-Context Attention - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Matches Model Compression/Efficiency and HPC: replaces Softmax with linear-time RACE attention via sharpened angular similarity, randomized projections, and soft LSH; enables million-token contexts with reduced memory/runtime.
Implicit Models: Expressive Power Scales with Test-Time Compute - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Model Architecture and Efficiency: theory for implicit (infinite-depth, weight-tied) models showing expressive power scales with test-time iterations and constant-memory training.
Pretraining with hierarchical memories: separating long-tail and common knowledge - Score: 19 (R=10, N=9) - Date: 2025-10-06 - Comment: Strongly matches Model Architecture and Efficiency: memory-augmented transformers with hierarchical parametric memory banks and context-dependent fetch, aligned with hardware for scalable pretraining/inference.
Support Basis: Fast Attention Beyond Bounded Entries - Score: 19 (R=10, N=9) - Date: 2025-10-03 - Comment: Efficient attention approximation with sub-quadratic runtime beyond bounded entries; rigorous guarantees and justification of polynomial attention.
Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space - Score: 19 (R=10, N=9) - Date: 2025-10-02 - Comment: Model Architecture: introduces adaptive parallel computation in transformers by forking/deleting residual streams learned during pretraining (dynamic networks).
ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference - Score: 18 (R=10, N=8) - Date: 2025-10-31 - Comment: MoE Efficiency/HPC: adaptive expert prefetching and cache-aware routing for memory-constrained MoE inference with runtime-driven scheduling.
Normalization in Attention Dynamics - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Model Architecture/Training dynamics: unified analysis of normalization schemes in transformers via interacting-particle dynamics; identifies Peri-LN as an effective choice.
Adaptive Graph Mixture of Residual Experts: Unsupervised Learning on Diverse Graphs with Heterogeneous Specialization - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Model Architecture: graph Mixture-of-Experts with structurally-aware gating and unsupervised specialization objective.
Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: Model Architecture: Mixture-of-Experts with a dynamic router to split thinking vs non-thinking branches for multimodal reasoning—directly matches MoE criterion.
HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission - Score: 18 (R=10, N=8) - Date: 2025-10-23 - Comment: Matches HPC and MoE scaling: HybridEP introduces modeling-guided hybrid expert/data transmission and topology/domain partitioning to scale Expert Parallelism across datacenters under bandwidth constraints.
Transformers are Inherently Succinct - Score: 18 (R=10, N=8) - Date: 2025-10-23 - Comment: Model Architecture Theory: proves that transformers are highly succinct compared with automata/LTL and that their verification is EXPSPACE-complete.
L-MoE: End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: Model Architecture: unifies MoE with low-rank LoRA adapters (L-MoE) and differentiable gating for end-to-end training and dynamic composition.
Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Model Architecture (MoE): probabilistic input-domain-aware routing decoupled from task optimization for expert specialization and balanced utilization.
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Model Architecture + Sparsity: proposes a sparse transformer grounded in regularized Wasserstein proximal operator with L1 prior; theoretical and architectural innovation.
Expert Merging in Sparse Mixture of Experts with Nash Bargaining - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Model Architecture (MoE): principled expert merging for sparse MoE via Nash bargaining with convergence guarantees; improves merging over ad-hoc averaging.
First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Architecture + HPC – redesigns Transformer wiring to remove per-block MHA–MLP communication, eliminating TP all-reduce and enabling parallel MHA/MLP execution.
MergeMoE: Efficient Compression of MoE Models via Expert Output Merging - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Compression and Efficiency (MoE): theoretical framing and optimized expert output merging for compressing MoE models.
Dr.LLM: Dynamic Layer Routing in LLMs - Score: 18 (R=10, N=8) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: adaptive-depth dynamic layer routing (skip/execute/repeat) with supervised routers for budget-aware inference.
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Systematic analysis and relaxation of attention design principles in Transformers — Model Architecture (attention mechanism).
Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Matches Model Architecture: proposes Translution unifying self-attention and convolution with a lightweight alpha-Translution variant for adaptive relative modeling.
Stability of Transformers under Layer Normalization - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Direct match to architectural analysis/training stability: principled theory on Transformer stability under different LayerNorm placements and residual scaling.
Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: Model Architecture: proposes Value-State Gated Attention for Transformers to mitigate attention sinks/value-state drains with theoretical grounding, improving stability and quantization fidelity.
Localist LLMs -- A Mathematical Framework for Dynamic Locality Control - Score: 18 (R=10, N=8) - Date: 2025-10-13 - Comment: Matches Model Architecture and Sparsity: introduces a tunable locality dial via group sparsity on attention with theoretical guarantees, enabling dynamic control between localist and distributed representations.
From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: High Performance Computing and Efficiency: introduces layered prefill scheduling that reduces MoE expert weight reloads, lowering memory bandwidth and latency for stall-free serving.
Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: Matches Model Architecture criterion (Mixture-of-Experts): orthogonal growth (depth/width) and checkpoint recycling for efficient pretraining.
MeSH: Memory-as-State-Highways for Recursive Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: Model Architecture: Memory-as-State-Highways adds explicit memory and lightweight routers to diversify computation across recursive iterations, strengthening recursive transformers.
From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics - Score: 18 (R=10, N=8) - Date: 2025-10-09 - Comment: Representation Learning/Training Dynamics: theoretical two-stage analysis of Transformer attention training (condensation then rank collapse) under gradient flow.
Exact Causal Attention with 10% Fewer Operations - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Compression/Efficiency/HPC: exact causal attention with ~10% fewer operations via new masked matmul identities and GPU-optimized kernels.
On Structured State-Space Duality - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture: formalizes and generalizes the SSM–masked-attention duality, providing necessary/sufficient conditions and training complexity bounds; expands efficient Transformer/SSM design space.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Matches Low-Precision Training and Efficiency: mechanistic analysis of flash attention failures under low precision and a minimal modification to mitigate biased rounding errors.
A Mathematical Explanation of Transformers for Large Language Models and GPTs - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Architecture—provides a continuous operator-theoretic formulation of Transformers (self-attention as integral operator, layer norm as projection), deepening theoretical foundations.
Allocation of Parameters in Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Strongly matches Model Architecture/Efficiency: theoretical allocation of attention heads and dimensions across Transformer layers with saturation analysis.
MemMamba: Rethinking Memory Patterns in State Space Model - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Strongly matches Model Architecture: theoretical analysis of Mamba’s memory decay and a new MemMamba architecture adding state summarization and cross-layer/token attention with linear complexity.
Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Architecture and Efficiency: structured cross-layer weight sharing via matrix dictionary learning for attention projections, yielding 66.7% parameter reduction with strong performance.
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Model Architecture: MoE with dynamic expert clustering and hierarchical routing; Compression/Efficiency: shared base + ultra low-rank residual adapters, mixed precision, reduced communication.
CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Model Architecture: conditional routing across geometry-specific attention branches (mixture-of-geometry/MoE-like) enabling curvature-adaptive Transformers.
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Model Architecture: proposes a new attention mechanism (Local Linear Attention) as an alternative to Softmax/linear attention in Transformers; High-Performance Computing/Efficiency: introduces memory-efficient primitives and a hardware-efficient blockwise algorithm (FlashLLA) with custom kernels to reduce O(n^2 d) and O(n d^2) costs.
Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: MoE: router regularization via Dirichlet-prior shaping to improve expert balance and specialization in upcycled sparse MoEs.
Cutting the Skip: Training Residual-Free Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: Model Architecture/Training Dynamics: enables stable training of residual-free transformers via principled initialization based on Jacobian conditioning analysis.
LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Model Architecture (MoE + PEFT): proposes learnable dynamic routing for Mixture of LoRA Experts with differentiable selection and analytical sparsity control.
On the Structure of Stationary Solutions to McKean-Vlasov Equations with Applications to Noisy Transformers - Score: 18 (R=9, N=9) - Date: 2025-10-24 - Comment: Representation Learning/Training Dynamics — mean-field analysis of Noisy Transformers via stationary McKean–Vlasov solutions, bifurcations, and phase transitions.
Who Said Neural Networks Aren't Linear? - Score: 18 (R=9, N=9) - Date: 2025-10-10 - Comment: Matches Model Architecture: introduces a Linearizer architecture (invertible NNs around a linear map) enabling linear-algebraic analysis and composition properties for nonlinear networks.
On residual network depth - Score: 18 (R=9, N=9) - Date: 2025-10-07 - Comment: Model Architecture: Residual Expansion Theorem giving first-principles analysis of depth in residual networks and principled scaling to control combinatorial path growth.
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time - Score: 18 (R=9, N=9) - Date: 2025-10-02 - Comment: Model Architecture/Representation Learning: theoretical scaling laws for deep linear self-attention (depth vs width vs context) and training dynamics.
Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training - Score: 17 (R=10, N=7) - Date: 2025-10-31 - Comment: Model Architecture: Mixture-of-Experts with router-gating and shared experts; Efficiency: sparse activation controls inference cost.
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation - Score: 17 (R=10, N=7) - Date: 2025-10-30 - Comment: Model Architecture: sparse Mixture-of-Experts (MoE) unified multimodal model with only 6.1B active parameters per token.
Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning - Score: 17 (R=10, N=7) - Date: 2025-10-17 - Comment: Model Architecture (MoE): action-specialized MoE for VLA with decoupled expert selection/weighting enabling collaborative expert usage.
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting - Score: 17 (R=10, N=7) - Date: 2025-10-08 - Comment: High Performance Computing: data-movement-centric profiling and forecasting for large-scale MoE serving; informs system design (e.g., wafer-scale GPUs).
Multilingual Routing in Mixture-of-Experts - Score: 17 (R=10, N=7) - Date: 2025-10-07 - Comment: Mixture-of-Experts: analysis of multilingual routing dynamics with inference-time router steering to enhance cross-lingual expert utilization.
From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing - Score: 17 (R=10, N=7) - Date: 2025-10-07 - Comment: Model Architecture + HPC/Efficiency: MoE inference-time routing that adapts to gate score distributions to balance expert load and reduce latency without retraining.
Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning - Score: 17 (R=10, N=7) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: MoE with adaptive shared experts and LoRA-based fine-grained low-rank experts for multi-task learning.
How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Strongly matches architecture/representation learning criteria with theoretical analysis of ICL in Transformers including nonlinear MLP heads and multi-source data mixing.
Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Strongly matches architecture/theory criteria by proving multi-head Transformers learn DAG structure via a kernel-guided mutual information objective.
The Neural Differential Manifold: An Architecture with Explicit Geometric Structure - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Model Architecture: proposes a neural architecture as a differentiable manifold with learned Riemannian metric and geometry-regularized optimization (natural-gradient aligned).
The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Architecture theory: tighter upper/lower bounds on parameter complexity for robust memorization in ReLU nets across the robustness ratio range.
Nested AutoRegressive Models - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Model Architecture + Efficiency: nested autoregressive multi-scale design reduces generation complexity from O(n) to O(log n).
Triangle Multiplication Is All You Need For Biomolecular Structure Representations - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Matches Model Architecture and Efficiency: replaces triangle attention with a streamlined module (Pairmixer) preserving higher-order reasoning while reducing compute/memory.
Transformers are almost optimal metalearners for linear classification - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Representation Learning/Architecture Theory: theoretical proof that (simplified) transformers are near-optimal metalearners for linear classification.
When Do Transformers Learn Heuristics for Graph Connectivity? - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Matches Model Architecture and Representation Learning: theoretical and empirical analysis of when Transformers learn correct algorithms vs heuristics on graph connectivity, tied to depth/diameter capacity and training dynamics.
Fast Inference via Hierarchical Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: High-Performance Inference: hierarchical speculative decoding with latency-optimal hierarchy selection.
MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Model Architecture and Systems Efficiency: MoE expert partitioning into fine-grained sub-experts plus QoS-aware scheduling for elastic inference.
Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Model Architecture/Optimization Theory: closed-form optimum and NP-hardness for one-layer LSA on Markovian functions; multilayer LSA interpreted as preconditioned GD.
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: HPC/Architecture: conditional scaling laws incorporating hidden size, MLP/attention parameter split, and GQA to optimize inference efficiency.
Localist LLMs with Recruitment Learning - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Architecture/Sparsity: introduces a tunable locality dial and information-theoretic recruitment with group sparsity on attention for adaptive interpretable-to-distributed encodings.
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Architecture/Sparsity: analyzes and improves hierarchical sparse attention for extreme length generalization with key design principles and theory for chunk encoding/residual bypass.
Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Architecture and Analysis: proves equivalence between Transformer attention and GCNs in time series, and introduces a streamlined graph-convolutional Transformer (Fighter).
Infinite Neural Operators: Gaussian processes on functions - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Model Architecture Theory: establishes GP limits for neural operators (incl. FNO), enabling kernel-based operator learning with computed covariances/posteriors.
On Universality of Deep Equivariant Networks - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Model Architecture: universality results for invariant/equivariant networks highlighting depth/readout as mechanisms.
ParaFormer: Shallow Parallel Transformers with Progressive Approximation - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Strongly matches Model Architecture and Efficiency/HPC: shallow parallel Transformer with progressive approximation enabling compression and multi-GPU speedups.
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Training dynamics/scaling laws: proposes a weight-decay scaling rule for AdamW extending μP beyond the near-init regime for width-robust hyperparameter transfer in Transformers.
Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: HPC/Inference Efficiency + Architecture: diffusion-forcing parallel sampler for recurrent-depth transformers enabling faster generation.
Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Training Objective/Architecture: auxiliary future summary prediction head to capture long-horizon dependencies beyond MTP.
Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: High Performance Computing: compiler/IR (asynchronous references) automating warp specialization for GPU kernels incl. LLM attention.
Context-Selective State Space Models: Feedback is All You Need - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Model Architecture: novel time-varying SSM with state-feedback selectivity (COFFEE) offering efficient long-range dependency modeling.
Axial Neural Networks for Dimension-Free Foundation Models - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Architecture: introduces a dimension-agnostic Axial Neural Network enabling foundation models to generalize across tensor dimensionalities efficiently.
HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: adaptive hierarchical routing over LoRA pools at rank-one component granularity with token-level activation; training-free selection with theoretical guarantees.
Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Model Architecture: replaces softmax attention with a Credal Attention Mechanism yielding credal sets for uncertainty-aware Transformers; integrates uncertainty directly into the attention mechanism.
Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Representation Learning: theoretical analysis of in-context learning dynamics in transformers with softmax attention (kernel gradient descent, context-adaptive rates).
What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably) - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Strongly matches Model Architecture by analyzing looped-attention Transformers vs single-pass Transformers via loss-landscape theory and proposing a staged training framework (SHIFT), touching training dynamics as well.
Decomposer Networks: Deep Component Analysis and Synthesis - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches Model Architecture and Representation Learning: semantic autoencoder with Gauss–Seidel-style unrolled competition among components for interpretable factorization.
Hierarchical LoRA MoE for Efficient CTR Model Scaling - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Model Architecture and Efficiency: hierarchical MoE with LoRA rank-1 experts and hierarchical routing enabling parallel layer execution; improved FLOPs/parameter efficiency.
Design Principles for Sequence Models via Coefficient Dynamics - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Matches Model Architecture: unified framework via coefficient dynamics that connects Transformers, SSMs, and RNNs, yielding design principles and stability/efficiency trade-offs.
The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: High Performance Computing/Efficiency criterion: evaluates full and layerwise Gauss-Newton preconditioning for transformer training, showing large iteration reductions and insights on Hessian structure.
Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Architecture/Training Dynamics: rigorous activation-function taxonomy with Lyapunov stability and kernel Hessian bounds guiding network design.
Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Model Architecture and Efficiency: introduces recursive iteration over selected reasoning-relevant layers and adaptive depth for test-time compute scaling without increasing parameters.
Grouped Differential Attention - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Model Architecture: introduces grouped differential attention with ratio-aware head allocation and selective expansion for more compute-efficient Transformers.
Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Proposes a new Transformer variant with Relational Attention over rows/columns/PK–FK links, a clear architecture innovation for relational data and representation learning.
On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Foundational generation paradigm analysis: formal study beyond autoregression/diffusion with rewrite/edit capabilities and associated learnability/hardness results.
Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Strong match to Model Architecture and Efficiency: analyzes hybrid linear-attention conversions and proposes methods (e.g., SSD, HedgeCATs) to ensure genuine linear attention usage post-conversion.
Fundamental Limits of Crystalline Equivariant Graph Neural Networks: A Circuit Complexity Perspective - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Model Architecture Theory: circuit-complexity characterization (TC^0) of crystalline equivariant GNNs, clarifying expressive/computational limits under symmetry constraints.
Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture and Efficiency: reveals implicit Mixture-of-Experts–like specialization in diffusion LLMs and proposes a training-free test-time ensembling method (HEX) across generation schedules.
Expand Neurons, Not Parameters - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Model Architecture and Efficiency—Fixed Parameter Expansion widens networks at constant non-zero parameters to reduce polysemanticity and improve accuracy.
Arithmetic-Mean $\mu$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Training Dynamics/Initialization and scaling laws: introduces AM-μP with provable learning-rate depth scaling (L^{-3/2}) for CNNs/ResNets enabling zero-shot LR transfer.
Platonic Transformers: A Solid Choice For Equivariance - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Model Architecture: introduces an equivariant Transformer via Platonic group–based attention and weight sharing; formally equivalent to dynamic group convolution and includes a linear-time convolutional variant (Efficiency).
Paris: A Decentralized Trained Open-Weight Diffusion Model - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture (MoE) and High Performance/Systems: decentralized training of independent experts with a router, eliminating synchronization.
PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Model Architecture: PDE-based continuous dynamical system analysis of Transformer components (attention, FFN, residuals, layer norm) as stabilizers.
Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture: theoretical modeling of densely connected networks (DenseNet-style) via nonlinear integral equations with convergence (Γ-convergence) results for training.
Rethinking the shape convention of an MLP - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture: rethinks MLP shape/skip placement with hourglass blocks and fixed random expansion; provides scaling insights applicable to residual networks/Transformers.
Flock: A Knowledge Graph Foundation Model via Learning on Random Walks - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture: introduces probabilistic node–relation equivariance and random-walk sequence modeling with universality guarantees for KG link functions.
Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Architecture: theoretical analysis of SSM learning dynamics and an initialization/weight-freezing optimization strategy.
Composer: A Search Framework for Hybrid Neural Architecture Design - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: framework for searching hybrid Attention/MLP architectures with scalable extrapolation strategies for LLMs.
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches Model Architecture: introduces a new attention-based state-space LLM with locally interacting neurons, sparse positive activations, and built-in interpretability.
Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning / Training Dynamics: circuit-level analysis showing emergent, specialized attention heads from post-training in reasoning models.
Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches Model Architecture/Efficiency: training-light expert model merging with unsupervised hidden/logit alignment and importance-guided layer chunking to replace multi-model serving.
Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Architecture: injects verifiable knowledge directly into pre-softmax attention scores (Transformer attention modification) to control generation and prevent hallucination.
A Formal Comparison Between Chain-of-Thought and Latent Thought - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Architecture/Training Dynamics: formal analysis contrasting looped latent-thought Transformers vs CoT, clarifying computational capabilities.
AMLA: MUL by ADD in FlashAttention Rescaling - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: High Performance Computing: novel FlashAttention-based kernel replacing MUL with integer ADD for rescaling plus preload pipeline/tiling to maximize FLOPS on NPUs.
Enhancing Linear Attention with Residual Learning - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Architecture/Efficiency: introduces Residual Linear Attention and Residual Delta Net to boost expressivity while retaining linear-time attention.
Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers - Score: 16 (R=9, N=7) - Date: 2025-10-30 - Comment: Strongly matches representation learning criterion via mechanistic interpretability of attention-only transformers and emergence of minimal circuits for IOI.
Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Extends Transformer long-context capacity by logarithmically compressing input tokens without altering architecture (Compression/Efficiency for context).
Head Pursuit: Probing Attention Specialization in Multimodal Transformers - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Interpretability: probes attention head specialization and enables controllable editing of concepts in uni/multimodal transformers.
Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Model Architecture: Mixture-of-Specialized-Experts (MoE) with LoRA experts and adaptive gating under a state-decomposable MDP.
Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction - Score: 16 (R=9, N=7) - Date: 2025-10-24 - Comment: Model architecture/efficiency: hybrid sparse attention with learnable token eviction retains critical KV pairs, preserving linear-time/space.
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Matches Model Architecture and Efficiency: proposes a hybrid linear+softmax attention architecture for long-context with FP8 operator support, reducing compute/I-O while maintaining reasoning performance.
Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: cross-architecture distillation from Transformers to SSMs via an attention bridge with token-level supervision and layer-wise alignment.
Accelerating Vision Transformers with Adaptive Patch Sizes - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Model Architecture/Efficiency: adaptive patch sizes to reduce ViT token count and accelerate inference/training.
ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Model Architecture and Efficiency: hybrid Decoder-MLP architecture with paired weight sharing; reduces KV cache and latency.
ICL-Router: In-Context Learned Model Representations for LLM Routing - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture (Dynamic Routing/MoE-style): learns in-context model representations to route queries across LLMs, enabling scalable routing without retraining.
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Model Architecture (MoE) and training stability: aligns training and inference routers via rollout routing replay to stabilize MoE RL, addressing core MoE routing behavior.
DND: Boosting Large Language Models with Dynamic Nested Depth - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Matches Model Architecture: introduces conditional/dynamic computation in transformers via token-level nested depth with a learned router, improving efficiency-control without full re-architecture.
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Matches Model Architecture: Mixture-of-Experts with neural gating for decomposing dynamics into sparse experts; conditional/dynamic modeling across regimes.
Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Strong match to Model Architecture: extends DPO with mixture models and MoE architectures using variational inference and ELBO optimization for expert specialization.
MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Model Architecture (MoE): probabilistic experts with uncertainty-based gating replacing input-based routers for regression/forecasting.
Native Hybrid Attention for Efficient Sequence Modeling - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Matches Model Architecture and Efficiency: proposes a hybrid linear+softmax attention layer with sliding-window control for long-context sequence modeling, reducing quadratic attention cost.
A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Matches Representation Learning: token- and layer-level analysis of representation propagation and oversmoothing in SSMs vs Transformers, revealing inductive biases and training dynamics.
A General Constructive Upper Bound on Shallow Neural Nets Complexity - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Model Architecture theory: provides a constructive upper bound on neurons needed in shallow networks to approximate continuous functions on compact sets.
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing - Score: 16 (R=9, N=7) - Date: 2025-10-08 - Comment: Model architecture: dynamic expert routing (MoE-style) with patchwise routing and curriculum top-K annealing; parameter-efficient fine-tuning of expert library.
Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Strongly matches Model Architecture: theoretical expressivity bounds and analysis of pooling mechanisms in Transformers, offering principled architectural guidance.
HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Strongly matches High-Performance Computing: heterogeneous CiD + on-chip analog CiM with phase-aware mapping and 2.5D integration targeted at low-batch, long-context LLM inference.
Transformers Discover Molecular Structure Without Graph Priors - Score: 16 (R=9, N=7) - Date: 2025-10-03 - Comment: Model Architecture / Representation Learning: shows pure Transformers (no graph priors) learn distance-aware structure for molecular modeling, with scaling and attention analysis.
Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Architecture/Theory: first theoretical analysis of one-layer Mamba’s ICL generalization with outliers, contrasting linear attention vs. nonlinear gating.
Indirect Attention: Turning Context Misalignment into a Feature - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture: introduces a modified attention mechanism (Indirect Attention) with analysis under key–value misalignment/noise, directly innovating the Transformer attention core.
Scaling Equilibrium Propagation to Deeper Neural Network Architectures - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture and Training Algorithm: introduces residual connections in Hopfield networks to scale equilibrium propagation to deeper networks.
Guiding Mixture-of-Experts with Temporal Multimodal Interactions - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture (MoE): introduces interaction-aware routing leveraging temporal multimodal dynamics to guide expert specialization.
Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism - Score: 16 (R=8, N=8) - Date: 2025-10-31 - Comment: Model Architecture: specialized memory mechanism with task-aware trigger/updater for linear-time SGM inference and dynamic adaptation.
A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory - Score: 16 (R=8, N=8) - Date: 2025-10-30 - Comment: Matches model architecture and efficiency theory criteria with new multi-operator neural operator architectures (MNO/MONet) and explicit approximation/scaling laws.
An efficient probabilistic hardware architecture for diffusion-like models - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: High Performance Computing/Efficiency: proposes an all-transistor probabilistic architecture implementing denoising models with orders-of-magnitude energy reduction.
A data free neural operator enabling fast inference of 2D and 3D Navier Stokes equations - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: Model Architecture/Efficiency: physics-grounded, data-free neural operator for Navier–Stokes enabling fast, robust inference (including 3D) without paired solution data.
Fisher meets Feynman: score-based variational inference with a product of experts - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Representation Learning/Inference: tractable product-of-experts variational family with Fisher-divergence optimization and Feynman/Dirichlet auxiliary variables.
Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds - Score: 16 (R=8, N=8) - Date: 2025-10-27 - Comment: Matches Model Architecture and Efficiency: few-step generative modeling generalized to Riemannian manifolds (self-distillation-based GFMs), reducing inference steps.
Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Matches Model Architecture/Regularization: introduces matricial free energy loss from free probability to Gaussianize autoencoder codes.
Asymptotically Stable Quaternion-valued Hopfield-structured Neural Network with Periodic Projection-based Supervised Learning Rules - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Model Architecture: quaternion-valued Hopfield-type network with projection-based learning and stability guarantees.
WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Model Architecture and Efficiency: multiplication-free probabilistic LUT networks with Walsh-assisted relaxation for fewer parameters and faster convergence.
Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches Model Architecture via a hybrid discrete diffusion planner with an autoregressive executor, including latent-space interfacing to reduce tokens and improve reasoning.
Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: High-Performance/Optimization: noise-adaptive layerwise learning rates atop geometry-aware optimizers to accelerate training, with convergence analysis and transformer experiments.
Y-shaped Generative Flows - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Model Architecture: introduces Y-shaped generative flows with a new velocity-powered objective in neural ODEs to encourage shared transport pathways—an architectural/optimization innovation in continuous-time generative models.
Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance - Score: 16 (R=8, N=8) - Date: 2025-10-15 - Comment: Model Architecture/Theory: constructs constant-depth ReLU generative networks of size O(n^3) to exactly enumerate tree-structured outputs by edit distance.
Heptapod: Language Modeling on Visual Signals - Score: 16 (R=8, N=8) - Date: 2025-10-09 - Comment: Model Architecture: introduces a causal Transformer with a novel “next 2D distribution prediction” objective and a reconstruction-focused visual tokenizer, unifying autoregressive modeling with masked autoencoding.
ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics - Score: 16 (R=8, N=8) - Date: 2025-10-08 - Comment: Model Architecture: introduces a transformer neural operator with quasi-equivariance and temporal attention enabling parallel multi-step decoding and cross-molecule operator pretraining.
Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis - Score: 16 (R=8, N=8) - Date: 2025-10-08 - Comment: Model Architecture Analysis: information-theoretic bounds on attention mechanisms (causal/bidirectional/sparse/kernelized/cross-attention) for rule encoding/compliance.
Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Matches Representation Learning/Model Architecture: amortized activation steering by training a minimal transformer submodule; effective for both dense and MoE models with strong compute efficiency.
Learning Inter-Atomic Potentials without Explicit Equivariance - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Model Architecture/Representation Learning: learns SO(3) equivariance in a non-equivariant Transformer for inter-atomic potentials, avoiding hard-wired symmetry constraints.
PDE Solvers Should Be Local: Fast, Stable Rollouts with Learned Local Stencils - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Matches Model Architecture: a finite-difference-inspired local operator network (learned stencils, explicit time stepping) with theoretical error/approximation guarantees and improved efficiency via strict locality.
Defeating the Training-Inference Mismatch via FP16 - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: HPC/Training optimization: shows FP16 precision mitigates training–inference mismatch in RL fine-tuning, improving stability and convergence.
The End of Manual Decoding: Towards Truly End-to-End Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Model Architecture: augments transformers with lightweight heads that learn token-level temperature and top-p, enabling end-to-end, dynamic decoding control.
Lipschitz-aware Linearity Grafting for Certified Robustness - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Model Architecture/Robustness: theoretical analysis and method for grafting linearity to tighten local Lipschitz bounds and improve certified robustness.
A Physics-informed Multi-resolution Neural Operator - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Model Architecture/Efficiency: extends RINO to a physics-informed, data-free operator with multi-resolution inputs and PDE-enforced training.
Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Architecture/Efficiency: spatially aware linear transformer variant that maintains linear attention and reduces complexity.
LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Architecture/Efficiency: topology-agnostic EEG foundation model using latent cross-attention to decouple compute from channel count (linear scaling).
Relieving the Over-Aggregating Effect in Graph Transformers - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Model Architecture: Wideformer modifies graph attention to mitigate over-aggregating via parallel partitioned aggregation and guided weighting.
PINN Balls: Scaling Second-Order Methods for PINNs with Domain Decomposition and Adaptive Sampling - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Model Architecture (MoE) + Efficiency: local Mixture-of-Experts with learnable domain decomposition to scale second-order training for PINNs.
Diffusion Autoencoders with Perceivers for Long, Irregular and Multimodal Astronomical Sequences - Score: 15 (R=8, N=7) - Date: 2025-10-24 - Comment: Model Architecture and Representation Learning: diffusion autoencoder with Perceiver encoder/decoder for long, irregular, multimodal sequences.
Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Architecture/Efficiency: transformer with learned summarization tokens for memory creation/retrieval enabling long-horizon efficiency.
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: High Performance Computing: hybrid rollout–training architecture leveraging preemptible GPUs with adaptive offload and token-level migration for RL on LLMs.
Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Model Compression/Efficiency and Architecture: ensembles via pruned attention heads merged into a compact grouped-MHA, yielding near single-model inference cost with UQ gains.
LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Model Architecture and Efficiency: decoupled link embeddings enabling precomputed attention weights and a linear attention mechanism (LIME-XOR) for O(N) inference-time scaling.
ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Architecture and Efficiency: a compact ViT variant removing positional embeddings and [CLS] token for permutation invariance and parameter efficiency.
NeurIPT: Foundation Model for Neural Interfaces - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Architecture: introduces a Progressive Mixture-of-Experts (PMoE) Transformer and amplitude-aware masked pretraining for EEG foundation modeling.
Protein Folding with Neural Ordinary Differential Equations - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Model Architecture and Efficiency: continuous-depth Evoformer via Neural ODEs with adjoint memory savings and adaptive solver trade-offs.
Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Model Architecture: embeds group-equivariant (rotation/scale) convolutions to improve adversarial robustness with theoretical gradient regularization and certified bounds.
Early-stopping for Transformer model training - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Training Dynamics/Representation: RMT-based spectral criteria for transformer early stopping; heavy-tailed dynamics monitoring without validation.
AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Model Architecture/Training with Semi-Discrete Optimal Transport to align noise and data in flow-based models, improving trajectory straightness and efficiency.
Purifying Task Vectors in Knowledge-Aware Subspace for Model Merging - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Model Architecture/Merging: knowledge-aware subspace (context SVD) to purify/prune task vectors and mitigate redundancy in model merging.
DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Model Architecture: DARTS-driven heterogeneous Graph Transformer design with quantifiable interpretability via causal ablation.
A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Model Architecture/Training: safety-sensitive subspace freezing and harmful-resistant null-space projection to preserve alignment during LoRA fine-tuning.
Deep Attention-guided Adaptive Subsampling - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Conditional/Dynamic Networks and Efficiency: input-adaptive attention-guided subsampling module learned end-to-end to reduce compute while maintaining performance—fits dynamic computation and efficiency criteria.
Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture/structure-aware optimization by applying targeted, layer-group–specific DPO (with LoRA) leveraging functional specialization of Transformer layers.
GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Model Architecture/Efficiency: enhances linear graph attention by increasing rank via a gated local branch and sharpening focus with a learnable entropy-reducing log-power function while preserving linear complexity.
Multi-View Graph Learning with Graph-Tuple - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Introduces a multi-view graph-tuple message-passing architecture with provable expressivity gains (model architecture).
Why Do Transformers Fail to Forecast Time Series In-Context? - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Model Architecture/Representation Learning: rigorous ICL analysis of transformer (Linear Self-Attention) limits and CoT collapse for forecasting; foundational insights into training dynamics.
DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Representation Learning/Architecture Analysis: token-to-head contribution tracing reveals bias heads; inference-time selective masking of attention heads.
Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Representation Learning/Diagnostics: inverse-free curvature mapping and activation commutators provide practical probes of invariance and order sensitivity in Transformers.
gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Model Architecture: novel GNN (gLSTM) inspired by associative memories/xLSTM to mitigate over-squashing by increasing storage capacity; addresses core architectural limitations.
Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: High Performance Computing/Efficiency: vectorized FlashAttention on RISC-V with low-cost exponential approximation and tiling to improve memory locality and throughput.
BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Introduces a frame-level autoregressive Transformer with space–time factorization and batched tokenization, improving architectural efficiency (notably faster inference).
Latent Speech-Text Transformer - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Model Architecture and Efficiency: dynamic aggregation of speech tokens into latent patches to reduce sequence length and improve modality alignment.
Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Architecture: Directional Sheaf Hypergraph Networks with a directed sheaf Laplacian for learning on directed/undirected hypergraphs.
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Model Architecture: an LLM-free, tuning-free graph foundational model enabling in-context learning via a novel token-based framework across node/edge/graph tasks.
Activation Steering with a Feedback Controller - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Model Architecture/Activation Steering foundations: frames activation steering as PID control with theoretical stability and a principled closed-loop mechanism.
Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Architecture: orthogonality-preserving adapter (LoRA) merging via Orthogonal Monte Carlo Dropout with analysis on compositionality/semantic interference.
Why Do We Need Warm-up? A Theoretical Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Training dynamics: theoretical justification for learning-rate warm-up under generalized smoothness with convergence complexity bounds.
xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Model Architecture and Efficiency: analysis of xLSTM scaling laws with linear-time complexity vs Transformers; insights on training/inference scaling with context length.
Equivariant Geometric Scattering Networks via Vector Diffusion Wavelets - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: SE(3)-equivariant geometric scattering transform integrated into GNNs, achieving comparable performance with fewer parameters.
GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: new MLP-replacement block that decouples structural vs quantitative knowledge to speed training while retaining expressivity.
Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture: analyzes Differential Attention’s robustness and training dynamics, revealing structural trade-offs in attention design.
Continual Learning with Query-Only Attention - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture/Training Dynamics: query-only attention variant with analysis of plasticity and catastrophic forgetting.
Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning: reverse-engineers transformer mechanisms for long-range dependencies (attention DAG caching, Minkowski-sum digit geometry) and training dynamics with an auxiliary inductive-bias loss.
Large Language Models Inference Engines based on Spiking Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture/Efficiency: spike-based self-attention and SNN conversion/fine-tuning for transformer inference, targeting energy-efficient deployment.
BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture: introduces Monte Carlo Attention and Binary Patch Encoding as architectural/tokenization innovations in a unified autoregressive scientific model.
MAESTRO : Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Model Architecture and Efficiency: sparse cross-modal attention with sparse Mixture-of-Experts routing and adaptive attention budgeting for long multimodal sequences.

Model Compression and Efficiency (281)

Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study - Score: 20.0 (R=0, N=0) - Date: 2025-10-03 - Comment: Author match
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization - Score: 19 (R=10, N=9) - Date: 2025-10-28 - Comment: Matches Compression/Efficiency: first convergence theory for Adam/Muon under floating-point quantization of gradients/weights/states; explains low-precision training.
Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples - Score: 19 (R=10, N=9) - Date: 2025-10-24 - Comment: Compression/Efficiency: layer-selective rank reduction and pruning of high-order components with low-rank factorization; rapid adaptation using a single gradient step on 100 samples.
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training - Score: 19 (R=10, N=9) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: curvature-aware gradient correction for quantization-aware training with theoretical convergence and strong W4A4 results.
Learning under Quantization for High-Dimensional Linear Regression - Score: 19 (R=10, N=9) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: first systematic theory of learning performance under low-bit quantization across parameters/activations/gradients/data/labels.
Unbiased Gradient Low-Rank Projection - Score: 19 (R=10, N=9) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: unbiased low-rank gradient projection (GUM) with convergence guarantees, preserving memory savings while matching/improving full-parameter training.
The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis - Score: 19 (R=10, N=9) - Date: 2025-10-21 - Comment: Matches Model Compression and Sparsity Theory: introduces a graphon-based infinite-width framework and Graphon NTK to analyze pruning and sparse network trainability.
Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads - Score: 19 (R=10, N=9) - Date: 2025-10-21 - Comment: Model Architecture + Efficiency: SkipV1Former reuses first-layer Value heads to cut V projections/KV cache (~25–50%) while improving perplexity; KV-cache reduction.
REAP the Experts: Why Pruning Prevails for One-Shot MoE compression - Score: 19 (R=10, N=9) - Date: 2025-10-17 - Comment: MoE + Compression: theoretical case against expert merging and a router-weighted expert pruning criterion for one-shot SMoE compression.
On efficiently computable functions, deep networks and sparse compositionality - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Model Architecture and Representation Learning: theory linking efficient Turing computability to compositionally sparse DAGs and corresponding deep neural approximants.
The Markovian Thinker - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: High-Performance/Algorithmic Efficiency: redesigns the reasoning environment to a Markovian, constant-state setup enabling linear compute and constant memory for very long thinking.
vAttention: Verified Sparse Attention - Score: 19 (R=10, N=9) - Date: 2025-10-08 - Comment: Sparse Attention with guarantees: unified top-k and sampling providing user-specified (epsilon, delta) accuracy with strong efficiency gains.
Boomerang Distillation Enables Zero-Shot Model Size Interpolation - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Strongly matches Model Compression/Efficiency and Model Architecture: zero-shot model size interpolation by re-incorporating teacher blocks after distillation (no extra training).
PolyKAN: A Polyhedral Analysis Framework for Provable and Minimal KAN Compression - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: provable, minimal KAN compression via polyhedral analysis and ε-equivalent compression with an optimal DP algorithm.
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration - Score: 19 (R=10, N=9) - Date: 2025-10-06 - Comment: Model Compression and Efficiency + HPC: establishes exponent concentration with theoretical entropy bounds; proposes lossless ECF8 FP format with entropy-aware encoding and GPU-optimized decoding.
The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM - Score: 19 (R=10, N=9) - Date: 2025-10-03 - Comment: Model Compression and Efficiency — extreme sparsity/pruning for LLMs via surrogate-free ADMM; includes quantized variant and convergence guarantees.
A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws - Score: 19 (R=10, N=9) - Date: 2025-10-02 - Comment: Compression/Efficiency Theory: proves polylogarithmic compression of models and datasets, establishing a dynamical lottery ticket hypothesis and boosted scaling laws.
An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning - Score: 18 (R=10, N=8) - Date: 2025-10-31 - Comment: HPC + Compression/Efficiency: All-Reduce–compatible Top-K gradient compressor with contraction guarantees; communication-efficient distributed training.
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats - Score: 18 (R=10, N=8) - Date: 2025-10-30 - Comment: Compression/Efficiency: comprehensive study of low-bit quantization formats (INT vs FP) at fine-grained levels with new training method for MXINT8.
SALS: Sparse Attention in Latent Space for KV cache Compression - Score: 18 (R=10, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: KV cache compression via latent-space sparse attention that bypasses RoPE-induced rank issues and avoids full reconstruction.
Efficient Low Rank Attention for Long-Context Inference in Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: low-rank query/key decomposition with mixed GPU-CPU KV cache to reduce memory and transfers while preserving exact attention.
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Model Architecture/HPC: high-sparsity MoE scaling to 1T with FP8 training and efficient heterogeneous pipelines guided by scaling laws.
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Compression/Efficiency: low-bit LLM post-training quantization via learnable grouped lattice vector quantizers with Babai rounding.
Sparser Block-Sparse Attention via Token Permutation - Score: 18 (R=10, N=8) - Date: 2025-10-27 - Comment: Matches Compression/Efficiency: block-sparse attention enhanced via token permutation and custom kernels, improving long-context LLM prefilling speed/accuracy.
Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: Compression/Efficiency — multi-bit quantization training via weight bias correction and bit-wise coreset sampling to reduce training cost across precisions.
ARC-Encoder: learning compressed text representations for large language models - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: Compression/Efficiency — external encoder that compresses context into continuous representations to replace tokens, reducing LLM inference cost without modifying decoders.
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: High Performance Computing and Efficiency: distributed dynamic sparse attention training (balanced/hierarchical sparse ring attention) enabling efficient ultra-long contexts.
Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: Binary Quadratic Quantization for matrix approximation/PTQ, extending beyond first-order schemes with strong 2-bit results.
From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: global structured pruning of LLM attention heads and MLP channels using loss-based importance with iterative schedule.
AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Matches Model Compression and Efficiency: introduces adaptive mantissa-bit sharing for sub-integer floating-point quantization with CUDA kernels, reducing memory access and latency.
TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: High Performance Computing / Compression: ternary (1.58-bit) LLM accelerator with table-lookup matmul, fused attention, and prefill/decoding optimizations on edge FPGAs.
Efficient Dynamic Structured Sparse Training with Learned Shuffles - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency: dynamic structured sparsity augmented with learned permutations to match unstructured DST accuracy while accelerating training/inference.
A Free Lunch in LLM Compression: Revisiting Retraining after Pruning - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Compression: shows reconstruction-based post-pruning retraining can beat full retraining; key design insights and efficient recovery after pruning.
What Layers When: Learning to Skip Compute in LLMs with Residual Gates - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Compression and Efficiency: token-wise layer skipping via residual-stream gates enabling dynamic computation with stable fine-tuning.
Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Compression/Efficiency: informed token-level routing using a lightweight feature forecaster for execute-or-approximate computation.
NOSA: Native and Offloadable Sparse Attention - Score: 18 (R=10, N=8) - Date: 2025-10-16 - Comment: Model compression and efficiency: trainable sparse attention with explicit locality enabling KV cache offloading and reduced transfers, improving decoding throughput and memory use.
MC#: Mixture Compressor for Mixture-of-Experts Large Models - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: MoE compression via mixed-precision quantization and dynamic expert pruning/routing (quantization + sparsity/pruning for MoE efficiency).
AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Direct hit on Model Compression and Efficiency: multi-precision quantization with bit-plane compute and hardware–algorithm co-design for LLMs.
PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Matches Model Compression and Efficiency: N:M sparsity with learnable channel permutation via differentiable Sinkhorn normalization and block-wise optimization.
SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: Compression: unified Bayesian pruning+quantization via spike-and-slab priors and GMM-based low-bit weights, with consistency guarantees.
LOTION: Smoothing the Optimization Landscape for Quantized Training - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: Model Compression and Efficiency: proposes a principled smoothing framework for quantized training (randomized rounding/Nesterov-style smoothing) with convergence guarantees and preservation of global minima.
FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference - Score: 18 (R=10, N=8) - Date: 2025-10-13 - Comment: Compression/Efficiency: fine-grained low-rank rank allocation per layer and progressive low-rank decoding for efficient LLM inference.
FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: Model Architecture/Efficiency: implicit rank-wise MoE within LoRA using sparse random projection as router for parameter-efficient fine-tuning and task decoupling.
Artificial Hippocampus Networks for Efficient Long-Context Modeling - Score: 18 (R=10, N=8) - Date: 2025-10-09 - Comment: Model Architecture and Efficiency: hybrid memory design combining Transformer KV cache with learnable RNN-like compressive long-term memory (AHN) to cut FLOPs and cache.
Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Strong match to Compression/Efficiency: activation-informed theoretical bounds and Pareto-guided low-rank rank selection (PGSVD) for zero-shot LLM/VLM compression.
ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: semi-structured 2:4 pruning via adaptive matrix factorization with block-diagonal wrappers.
KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Strong match to Compression/Efficiency: KV-cache quantization to very low precision with Hadamard rotation and linear correction plus a fast attention kernel for efficient long-context inference.
PatternKV: Flattening KV Representation Expands Quantization Headroom - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: proposes a pattern-aligned residual quantization scheme for KV-cache to flatten distributions and enable low-bit inference with less memory/bandwidth.
COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: training-free sparse dictionary factorization guided by calibration to compress LLMs; structured sparsity compatible with quantization and efficient sparse-dense ops.
Post-training quantization of vision encoders needs prefixing registers - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency—training-free post-training quantization for vision encoders via prefix registers (RegCache) to suppress activation outliers.
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: introduces compressed convolutional attention (CCA/CCGQA) reducing KV-cache and FLOPs with significant speedups; applicable to dense and MoE models.
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: establishes low-rank structure in time-series embeddings, proves compressibility of Q/K/V and attention, introduces flow-of-ranks; guides width/depth/head allocation and achieves large inference/memory reductions on a foundation TS model.
UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Compression/Efficiency: unified post-training pruning with mirror descent combining local saliency and global coordination; supports unstructured and N:M sparsity with one-shot mask generation.
SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: introduces Sigma-Delta 1-bit/1.58-bit quantization for LLMs with adjustable and fine-grained OSR allocation plus Hadamard-based weight smoothing.
PT$^2$-LLM: Post-Training Ternarization for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Directly matches Model Compression and Efficiency: post-training ternarization (quantization) for LLMs with asymmetric ternary quantizer, iterative fitting, and activation-aware refinement.
StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression/Efficiency—structured global pruning with O(sqrt(N)) memory via ADMM and derived layer-wise sparsity allocation.
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Matches Model Compression and Efficiency: training-free depth pruning by replacing Transformer blocks with a linear operator using small calibration data; no retraining needed.
The Curious Case of In-Training Compression of State Space Models - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Compression/Efficiency — in-training balanced truncation of State Space Models via Hankel singular values to reduce state dimension while preserving expressivity.
Randomized Gradient Subspaces for Efficient Large Language Model Training - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: High Performance Computing/Efficiency: randomized gradient subspace methods (GrassWalk/GrassJump) reduce optimizer memory for LLM pretraining by leveraging near-flat curvature.
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency/HPC: thought-adaptive KV-cache compression with hybrid quantization–eviction and a PagedAttention-extended kernel for memory reuse.
Budgeted Broadcast: An Activity-Dependent Pruning Rule for Neural Network Efficiency - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Matches Model Compression and Efficiency: introduces an activity-dependent pruning rule with constrained-entropy analysis to balance fan-in/fan-out (sparsity/pruning) for efficiency.
RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Matches Model Compression and Efficiency: low-bit vector quantization for LLMs using Fisher-information (Riemannian) sensitivity guidance and channel-wise bit allocation.
PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: Model Compression and Efficiency—structured pruning within low-rank adapters (LoRA) with theoretical robustness analysis and dynamic rank allocation.
CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: continuous and differentiable semi-structured (N:M) sparsity-aware training with a new sparsity-aware optimizer (AdamS), weight scaling, and self-distillation to preserve accuracy.
Layer-wise dynamic rank for compressing large language models - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Compression/Efficiency: layer-wise dynamic low-rank SVD with effective-rank metric and Lagrangian allocation for LLM compression.
Effective Model Pruning - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Matches Compression/Efficiency: introduces a universal, parameter-free adaptive pruning threshold (effective number via Inverse Simpson index) applicable to diverse pruning criteria.
AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: HPC + Compression/Efficiency: KV-cache storage hierarchy with adaptive lossy compression to optimize DRAM/SSD placement for LLM serving.
On the expressivity of sparse maxout networks - Score: 18 (R=9, N=9) - Date: 2025-10-17 - Comment: Representation/Architecture Theory: expressivity analysis and depth hierarchies for sparse maxout networks under fixed indegree (sparsity).
Drop-Muon: Update Less, Converge Faster - Score: 18 (R=9, N=9) - Date: 2025-10-03 - Comment: Training efficiency criterion: randomized progressive layer updates with non-Euclidean optimization and convergence theory, reducing update cost.
ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression - Score: 17 (R=10, N=7) - Date: 2025-10-23 - Comment: Matches Compression/Efficiency: Adaptive Rank Allocation for SVD-based LLM compression with a new mask design and loss to optimize per-layer ranks under global constraints.
BitNet Distillation - Score: 17 (R=10, N=7) - Date: 2025-10-17 - Comment: Model Compression and Efficiency: distillation to 1.58-bit (ternary) LLMs with SubLN and attention distillation; large memory/speed gains.
Training Dynamics Impact Post-Training Quantization Robustness - Score: 17 (R=10, N=7) - Date: 2025-10-08 - Comment: Compression/Efficiency: analysis of post-training quantization robustness tied to training dynamics and hyperparameters in LLMs.
Quantization Range Estimation for Convolutional Neural Networks - Score: 17 (R=10, N=7) - Date: 2025-10-07 - Comment: Strongly matches Model Compression/Efficiency: post-training quantization with provable local convexity and efficient range search.
STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Compression/Efficiency: low-precision activation quantization using sequence-dimension linear transforms and mixed-precision token retention; complements existing quantization.
Polybasic Speculative Decoding Through a Theoretical Perspective - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: HPC/Efficiency: theoretical framework for multi-model (polybasic) speculative decoding with optimal inference time characterization and practical speedups.
CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Model Architecture and Efficiency: introduces invertible linear layers via circulant–diagonal decomposition with FFT, reducing parameters and log-det/inversion cost for normalizing flows.
Sequences of Logits Reveal the Low Rank Structure of Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Representation Learning + Compression/Efficiency: demonstrates and exploits low-rank structure in LM logits with a model-agnostic abstraction and theory.
LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Data-aware LoRA initialization derived via asymptotic/Fisher analysis—matches Low-Rank Adaptation and Compression/Efficiency criteria.
FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: FP8 end-to-end LoRA fine-tuning by merging adapters into a quantized backbone and reducing quantization overhead.
ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Compression/Efficiency for fine-tuning: optimally scaled LoRA accumulates high-rank updates from low-rank increments with analytic scaling guarantees.
Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: introduces differentiable contiguous layer pruning with endpoint tuning for LLMs; compatible with quantization.
Batch Speculative Decoding Done Right - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches HPC/Efficiency: batch speculative decoding with correctness guarantees and synchronization strategy addressing ragged tensors.
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches Model Architecture and Efficiency with an encoder-decoder diffusion LM enabling faster training/inference.
Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency by enabling one-step sampling for AR image models via conditional score distillation.
$\alpha$-LoRA: Effective Fine-Tuning via Base Model Rescaling - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Low-rank/Compression: new reparameterization ($\alpha$-LoRA) via base model rescaling with random matrix theory (RMT) analysis to improve fine-tuning generalization.
ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Model Compression/Efficiency: LUT-aware hierarchical linear quantization (HLQ) and optimized CPU kernels for LLM edge deployment.
NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Model Compression/Efficiency: PEFT via bypass connections on selected parameters enabling ≤0.02% trainable weights.
StreamingTOM: Streaming Token Compression for Efficient Video Understanding - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Compression/Efficiency: training-free streaming token compression with causal temporal reduction and 4-bit online KV-cache memory.
Glyph: Scaling Context Windows via Visual-Text Compression - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: compresses long textual context via visual rendering to reduce tokens and compute, yielding faster prefilling/decoding and SFT.
Neuronal Group Communication for Efficient Neural representation - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Model Architecture and Compression/Efficiency: proposes low-rank, group-based neuronal communication with a stability metric, improving compactness and modularity.
One-Bit Quantization for Random Features Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: theory for one-bit quantization in Random Features models showing no generalization loss when quantizing all but last layer.
Compressing Many-Shots in In-Context Learning - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Efficiency: compresses many-shot in-context prompts via layer-wise soft-token summaries to cut memory/compute during inference.
AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: generalized assistant distribution and divergences for KD of LLMs improving stability/performance.
Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency/HPC: exploits fine-tuning-time sparsity with dynamic sparse operators and predictors to accelerate PEFT.
CTR-LoRA: Curvature-Aware and Trust-Region Guided Low-Rank Adaptation for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression/Efficiency: PEFT via curvature-aware trust-region LoRA with adaptive rank scheduling using second-order proxies; stability and throughput gains.
Continual Learning via Sparse Memory Finetuning - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Matches Model Compression and Efficiency (Sparsity) for continual learning via sparsely updated memory layers to reduce interference/forgetting.
Attention Is All You Need for KV Cache in Diffusion LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency: adaptive, layer-aware KV cache refresh (Elastic-Cache) for diffusion LLMs reduces redundant recomputation with negligible quality loss.
Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: HPC/Training Efficiency: principled batch-size scheduling equivalence to LR decay (with theory) to accelerate pretraining.
MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency – low-bit microscaling (BFP) quantization extension addressing outliers for efficient LLM serving with minimal overhead.
A Deep State-Space Model Compression Method using Upper Bound on Output Error - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency: provable output-error bounds and gradient-based model order reduction for Deep SSMs.
Towards Reversible Model Merging For Low-rank Weights - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency – low-rank (LoRA/SVD) weight merging with a reversible basis and closed-form solution for reconstruction-capable model space.
K-Merge: Online Continual Merging of Adapters for On-device Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Matches Model Compression and Efficiency: online continual merging of low-rank adapters (LoRAs) for on-device LLMs under storage/compute constraints.
KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Compression and Efficiency: training-free KV-cache reuse/alignment across agents for multi-agent LLM inference, delivering large speedups without quality loss.
DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: tightly couples an AR LM with masked diffusion over discrete RVQ codes enabling blockwise parallelism; offers controllable compute via RVQ layer pruning.
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Compression/Efficiency theory: extends MDL to singular models; LLC-based complexity predicts quantization/low-rank compressibility.
MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Compression/Efficiency: training-free structural pruning for diffusion models that aligns pruning policy with pretraining dynamics.
Direct Multi-Token Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Model Efficiency/HPC — Direct Multi-Token Decoding uses late layers to emit multiple tokens per step without auxiliary models, reducing repeated forward passes.
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Model Compression/Efficiency and HPC: NVFP4 quantization + LoRA to accelerate RL training of LLMs, with adaptive quantization noise for exploration.
Differentiable Fast Top-K Selection for Large-Scale Recommendation - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Designs a differentiable Top-K operator with O(n) complexity for end-to-end training (algorithmic efficiency breakthrough).
LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches Compression/Efficiency and HPC: semantic-aware KV retrieval and fine-grained decoupled management with custom kernels to accelerate long-sequence LLM decoding while preserving accuracy.
Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches HPC and Compression/Efficiency: automated mapping and scheduling for block-diagonal sparse LLMs on compute-in-memory accelerators to boost array utilization and reduce memory/compute.
ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Pure low-precision (BF16/Float8) training with Kahan summation, stochastic rounding, and memory optimizations (gradient fusion/chunking) — Model Compression/Efficiency for large output spaces.
Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Strongly matches Model Compression and Efficiency by leveraging structured sparsification, conformal prediction, and lattice quantization to compress token distributions for speculative decoding; systems-level bandwidth optimization aligns with efficiency goals.
CacheClip: Accelerating RAG with Effective KV Cache Reuse - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Compression/Efficiency: KV cache reuse with auxiliary-model-guided selective recomputation, shared-prefix sink removal, and grouping for faster RAG prefill without quality loss.
AdaPM: a Partial Momentum Algorithm for LLM Training - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: HPC/Efficiency: memory-efficient optimizer for LLM training via partial momentum with bias correction, reducing momentum state memory by >90%.
Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Matches Representation Learning and Sparsity: identifies and manipulates sparse, layer-consistent dimensions governing multilingual control without training.
Efficient numeracy in language models through single-token number embeddings - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Efficiency/Architecture: proposes single-token number embeddings (BitTokens) via IEEE 754 to reduce tokenization overhead and enable efficient arithmetic in LLMs.
From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Training Dynamics/Efficiency: establishes a scaling law for multi-stage (bootstrapped) pretraining, guiding efficient reuse of overtrained base models.
Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Inference Efficiency: training-aware speculative decoding (self-speculation) with online updates for lossless speedups.
HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: PEFT/Low-Rank: cross-head shared low-rank adapters generated by joint hypernetworks; theoretical sample-efficiency gains via a hierarchical MoE perspective.
Composite Optimization with Error Feedback: the Dual Averaging Approach - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Compression and Efficiency and High-Performance Computing: communication-efficient distributed training with compression via a new EF–Dual Averaging method and convergence analysis for composite objectives.
On The Expressive Power of GNN Derivatives - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Model Architecture — HOD-GNN augments MPNNs with high-order feature derivatives to boost expressivity up to WL hierarchy; efficient derivative message passing.
In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Matches Model Compression and Efficiency/HPC: enables in-memory training on low-precision analog devices via multi-tile residual learning with convergence guarantees.
SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Efficiency: streaming subset selection via Frequent Directions gradient sketches enabling constant-memory, GPU-friendly training.
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Efficiency/HPC: speculative decoding using a diffusion LM drafter with causal-consistency path search and adaptive draft length for speedups.
KaVa: Latent Reasoning via Compressed KV-Cache Distillation - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Compression and Efficiency: compressed KV-cache distillation to supervise latent reasoning, leveraging cache-aware signals for efficient inference and memory savings.
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Compression and Efficiency: advances LoRA via U S V^T factorization with Stiefel manifold constraints and Riemannian optimization for low-rank adapters.
HiSpec: Hierarchical Speculative Decoding for LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency/HPC: hierarchical speculative decoding using early-exit intermediate verification with KV-cache/hidden-state reuse for high-throughput inference.
Low Rank Gradients and Where to Find Them - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency: identifies approximate low-rank structure in gradients; Representation Learning/Training Dynamics: links data/activation/regularizers to gradient rank components.
On Predictability of Reinforcement Learning Dynamics for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Representation Learning/Training Dynamics: identifies low-rank (rank-1) structure in RL-induced parameter updates and exploits it to speed up training.
Randomized Matrix Sketching for Neural Network Training and Gradient Monitoring - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: adapts matrix sketching to layer activations for memory-efficient backprop and gradient monitoring, enabling reduced activation storage.
HilbertA: Hilbert Attention for Image Generation with Diffusion Models - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Sparse attention/HPC: 2D-aware GPU-efficient attention via Hilbert-curve token ordering and sliding schedule, implemented in Triton.
DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Compression/Efficiency: differentiable vector quantization via reparameterization (and space-filling variant) enabling end-to-end training and improved codebook usage.
Distillation of Large Language Models via Concrete Score Matching - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: new discrete score-matching KD objective aligning relative logits for LLM distillation, addressing softmax smoothing and shift invariance.
Flow Matching with Semidiscrete Couplings - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches Compression/Efficiency and Training Algorithms: semidiscrete OT-based flow matching eliminates quadratic batch-OT costs, enabling scalable generative training.
Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy - Score: 17 (R=8, N=9) - Date: 2025-10-30 - Comment: Compression/Efficiency: new spectral-norm perturbation bounds for low-rank approximation, improving theoretical guarantees (e.g., DP-PCA utility).
LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Model Compression and Efficiency: mixed-precision post-training quantization of LoRA via SVD reparameterization to ultra-low bits.
Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Efficiency: inference-cost-aware speculative decoding with dynamic tree construction accounting for GPU/batch effects.
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Efficiency/HPC: attention-map caching and similarity retrieval to accelerate prefill self-attention in LLMs with minimal accuracy loss.
zFLoRA: Zero-Latency Fused Low-Rank Adapters - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Compression/Efficiency: fused low-rank adapters that incur zero or negligible inference latency overhead.
Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning - Score: 16 (R=9, N=7) - Date: 2025-10-30 - Comment: Compression/Efficiency and Architecture: structured local learning on low-rank manifolds (SVD) with aligned feedback, reducing parameters and avoiding BP while maintaining accuracy.
Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Model Compression/Efficiency: sparse PEFT with kernelized low-rank updates and adaptive bi-level sparsity allocation, reducing memory while improving adaptation.
SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Speculative Knowledge Distillation applies token-level gating for distillation loss—directly matches Compression/Efficiency via improved KD for LLMs.
Improving the Straight-Through Estimator with Zeroth-Order Information - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Model Compression/Efficiency: quantization-aware training via FOGZO combining STE with zeroth-order information to reduce bias and compute.
Transformers from Compressed Representations - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Transformer efficiency via learning directly from compressed representations, reducing tokens/compute—matches the Compression/Efficiency criterion with an architectural tokenization strategy.
The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: introduces a novel pruning framework with differentiable concave gates to select contiguous layer segments and a localized fine-tuning strategy; method-centric compression (pruning) with synergy to quantization.
Mixed Precision Training of Neural ODEs - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency with a mixed-precision training framework for Neural ODEs addressing memory/runtime.
Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Optimizer-level memory reduction using low-rank Jacobian approximation with error-feedback to train with approximate gradients under tight memory (Compression/Efficiency).
When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency by analyzing how layer pruning impacts test-time scaling for reasoning in LLMs.
TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Model Compression and Efficiency: ternary quantization of both vision and text encoders with distillation for large VLMs.
PLAN: Proactive Low-Rank Allocation for Continual Learning - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Model Compression/Efficiency: Low-Rank Adaptation (LoRA) with proactive orthogonal allocation for continual learning.
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: High-Performance Inference: selective knowledge distillation tailored to maximize token acceptance in speculative decoding.
GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: sparse fine-tuning by selecting parameters with large gradients and low pre-trained magnitudes to preserve knowledge.
Latent Space Factorization in LoRA - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Matches Compression/Efficiency and Model Architecture: a LoRA variant (FVAE-LoRA) that factorizes task-salient vs residual latent spaces via a new ELBO for parameter-efficient finetuning with improved robustness.
Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: proposes dequantization-free mixed-precision quantization for LLMs via SNN-style spike encoding, reducing MAC energy.
CPSVD: Enhancing Large Language Model Compression via Column-Preserving Singular Value Decomposition - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: column-preserving SVD with adaptive per-module compression for LLMs (low-rank plus selective column retention).
Feature Space Adaptation for Robust Model Fine-Tuning - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression/Efficiency: PEFT in feature space (LoRFA/VeFA) with low-rank/vector transformations to preserve pretrained representations under shift.
ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Model Architecture/Efficiency: depth scaling of ViTs via layer-wise weight sharing plus lightweight parallel adapter parameters.
S2AP: Score-space Sharpness Minimization for Adversarial Pruning - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Compression/Efficiency: adversarial pruning with score-space sharpness minimization to stabilize mask selection and preserve robustness.
Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Matches High Performance Computing/Efficiency: zeroth-order LLM fine-tuning with projected gradient-aligned perturbations to cut estimator variance and iterations.
ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: zero-shot, prompt-aware visual token pruning for VLMs to reduce inference cost while preserving task-relevant content.
SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: High Performance Computing / Efficiency: hardware-software co-design of Softmax and LayerNorm (E2Softmax, AILayerNorm) with low-precision arithmetic and no retraining.
Bitwidth-Specific Logarithmic Arithmetic for Future Hardware-Accelerated Training - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: Matches Compression and Efficiency: bitwidth-specific logarithmic arithmetic with hardware-friendly piecewise-linear addition enabling low-precision training.
QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: unified low-rank SVD across Q/K/V with rank allocation and joint quantization to reduce KV cache and compute in VLMs.
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation - Score: 16 (R=9, N=7) - Date: 2025-10-20 - Comment: Matches Model Compression and Efficiency via structured pruning with concatenation-based layer merging and hierarchical distillation to retain capacity.
Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Compression/Efficiency + Hardware co-design: hardware-aware dynamic token and FFN pruning with optimized dataflow for low-power ViT acceleration.
Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Compression/Efficiency – Transformer pruning with unified Head Importance–Entropy Score combining gradients and attention entropy.
CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression - Score: 16 (R=9, N=7) - Date: 2025-10-16 - Comment: Model compression and efficiency: embedding-layer compression via group residual vector quantization with a corrective adaptor, reducing memory footprint and compatible with 4-bit hardware.
Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Compression and Efficiency: quantization- and rescale-aware training for integer-only inference; reduces rescaler bitwidth post-training with minimal retraining.
Neural Weight Compression for Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Model Compression and Efficiency — learned autoencoder codec for LM weight compression with importance-aware loss and inference-time error compensation.
Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Compression/Efficiency: scale-dependent guidelines for allocating memory between weights, KV cache, and generation length; compares KV eviction vs quantization for reasoning models.
CauchyNet: Compact and Data-Efficient Learning using Holomorphic Activation Functions - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Model Architecture: complex-valued holomorphic activation functions (Cauchy-inspired) enabling compact, data-efficient networks with theoretical guarantees.
ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture/Efficiency: selective layer expansion and unit-wise decoupled tuning for parameter-efficient continual pretraining of LLMs.
Vanishing Contributions: A Unified Approach to Smoothly Transition Neural Models into Compressed Form - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Compression and Efficiency: proposes a general training scheme (VCON) to smoothly transition models to compressed forms (pruning/quantization/low-rank) to mitigate accuracy loss.
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: High Performance Computing — systems-level KV-cache offloading and cross-engine sharing with pipelined data movement and a control API for enterprise-scale LLM inference.
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Model Compression and Efficiency: aggressive quantization (≈1.58-bit encoders), sliding-window attention, and episodic memory for edge-efficient multimodal transformers.
Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Model Compression and Efficiency/HPC: subspace-restricted training of ViTs (WASI) to cut memory and FLOPs for on-device learning.
StreamingVLM: Real-Time Understanding for Infinite Video Streams - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: Compression/Efficiency criterion: streaming KV-cache management (attention sinks, short/long windows) with training–inference alignment for real-time long-context VLMs.
dInfer: An Efficient Inference Framework for Diffusion Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: High Performance Computing and Model Efficiency: proposes an inference framework for diffusion LLMs with algorithmic and system-level optimizations (diffusion iteration manager, decoding, KV-cache manager) enabling large speedups.
Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: Matches Model Compression and Efficiency: CoT compression via an upfront thought-embedding compressor–executor framework to reduce token usage/latency while maintaining reasoning quality.
Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: Matches Model Compression and Efficiency: uses low-rank adaptation (LoRA) with synthetic data/logit distillation to recover accuracy after quantization/pruning/serialization-induced degradation.
AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Strong match to Compression/Efficiency: enhances LoRA via function-aware asymmetric low-rank initialization with analysis of distinct W^Q and W^V roles in self-attention.
Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Compression/Efficiency: selects structurally sparse subnetwork initializations via evolutionary search and uses distillation to accelerate pretraining, achieving 9.2x fewer tokens for comparable perplexity.
Sharpness-Aware Data Generation for Zero-shot Quantization - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Matches Model Compression/Efficiency: zero-shot quantization with sharpness-aware synthetic data generation and supporting theory for better generalization.
DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: stabilizes and enhances low-rank adaptation (DoRA) via noise injection and auxiliary networks that generate low-rank factors, improving PEFT.
Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Matches Model Compression and Efficiency: proposes ultra-low-bit (2-bit) post-training quantization tailored to diffusion LLMs with masked calibration simulation and adaptive blockwise mixed precision.
CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Model Compression and Efficiency: channel-wise mixed-precision quantization personalized via a hypernetwork; 2-bit per-channel strategy encoding enables resource-adaptive deployment without backprop.
HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Compression/Efficiency: dynamic low-rank adaptation (LoRA) accelerated via hypernetwork-generated SVD parameters with rank pruning for efficient PEFT.
Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: High Performance Computing/Efficiency: learning-based zeroth-order optimizer for LLM fine-tuning that reduces memory use with L2L-style perturbation strategies.
The Pitfalls of KV Cache Compression - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: critical analysis of KV cache compression with improved eviction policies for multi-instruction prompting in LLMs.
Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: robust pruning guided by an Unbiased and Smooth Neuron metric (USN) plus a Wasserstein loss to enhance certifiable robustness.
ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: training-free adaptive suppression of reasoning steps for LRLMs to reduce tokens/latency while preserving accuracy.
Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: analyzes RoPE interpolation under post-training quantization and proposes an interpolation-aware, per-band weight rescaling (Q-ROAR) guided by new diagnostics.
Equivariance by Local Canonicalization: A Matter of Representation - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture and Efficiency: transfers tensor field networks to local canonicalization to preserve equivariance with lower runtime (PyG integration).
Collaborative Compression for Large-Scale MoE Deployment on Edge - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: MoE-aware collaborative compression combining expert pruning, mixed-precision quantization, and activation optimization for ultra-large MoE deployment under strict memory limits.
Growing Winning Subnetworks, Not Pruning Them: A Paradigm for Density Discovery in Sparse Neural Networks - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: proposes growth-based sparse training (PWMPR) to discover winning subnetworks and operating density, complementing pruning/dynamic sparsity.
Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Compression/Efficiency: rethinks multi-LoRA parameter sharing (ALoRA, Fed-ALoRA) with asymmetric design and matrix decomposition for heterogeneous ranks.
FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Matches Model Compression and Efficiency: unified sparse attention kernel with flexible sparse symbols and optimized sparse GEMMs for DiT inference acceleration.
On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Compression/Efficiency: quantization-aware fine-tuning via configuration-aware low-rank (LoRA) adjustments that adapt to arbitrary per-layer bit-widths without re-finetuning.
Bridging Function Approximation and Device Physics via Negative Differential Resistance Networks - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: Model Architecture + Efficiency/Hardware: analog implementation of Kolmogorov–Arnold Networks using negative differential resistance devices for learnable nonlinearities.
HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Enforces block-diagonal Jacobians via non-backtracking GNNs so that likelihood evaluation needs only a constant number of backward passes (Algorithmic Efficiency for generative flows).
Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs - Score: 16 (R=8, N=8) - Date: 2025-10-24 - Comment: HPC/Efficiency: provably no-regret drafter selection for speculative decoding that evaluates all drafters without extra target queries, reducing inference cost.
Just-In-Time Piecewise-Linear Semantics for ReLU-type Networks - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Model Analysis/Verification: JIT piecewise-linear semantics for ReLU-type networks, enabling exact or approximate certificates (e.g., Lipschitz bounds, robustness); foundational network semantics.
Computational Budget Should Be Considered in Data Selection - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Matches Efficiency/Data Selection: compute-budget-aware bilevel data selection with Hessian-free gradient estimator and efficient inner-loop relaxation.
SHaRe-SSM: An Oscillatory Spiking Neural Network for Target Variable Modeling in Long Sequences - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Model Architecture/Efficiency: oscillatory spiking state-space model (multiplication-free, sparse events) with parallel scans for very long sequences.
Z0-Inf: Zeroth Order Approximation for Data Influence - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Algorithmic efficiency and training dynamics: introduces a zeroth-order, gradient-free influence estimation scalable to LLMs, enabling practical data influence analysis without Hessians/gradients.
Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Model Compression and Efficiency + Representation Learning: introduces latent-trajectory signals from internal representations to guide inference-time compute allocation and answer selection, reducing token usage.
Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Matches Model Compression and Efficiency: parameter-efficient transfer via gradient-sign masking to transport task vectors across pre-trained models with first-order descent guarantee.
Accelerating Inference for Multilayer Neural Networks with Quantum Computers - Score: 16 (R=8, N=8) - Date: 2025-10-09 - Comment: High Performance Computing/Efficiency: fully coherent quantum implementation of multilayer neural inference with provable speedups under quantum data access assumptions.
Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Matches Efficiency/Test-Time Scaling: introduces Best-of-Majority, a minimax-optimal Pass@k inference strategy with theoretical guarantees over majority voting/BoN.
Constrained Adaptive Rejection Sampling - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency: algorithmic innovation for constrained decoding via adaptive rejection sampling that preserves the exact distribution while improving sample efficiency.
CIMNAS: A Joint Framework for Compute-In-Memory-Aware Neural Architecture Search - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Model Compression and Efficiency/HPC: joint HW-aware NAS with quantization and CIM device/circuit/architecture co-optimization for EDAP-focused design.
The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning and Compression theory: argues compression efficiency drives causal representation discovery; testable predictions about rate–distortion and OOD generalization.
Are Language Models Efficient Reasoners? A Perspective from Logic Programming - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning/Training Dynamics: framework measuring reasoning efficiency and aligning natural-language proofs with minimal logic-program proofs.
BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Training Dynamics and Efficiency: exploits Hessian subspace dichotomy (Dom vs Bulk) with PCA-based projection and differential scaling to accelerate optimization.
Continual Low-Rank Adapters for LLM-based Generative Recommender Systems - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Compression/Efficiency: low-rank adapters (LoRA) with proximal regularization for continual adaptation.
What Really Matters in Matrix-Whitening Optimizers? - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Training Dynamics/Optimization: analysis of matrix-whitening vs spectral descent; identifies variance adaptation as key ingredient with low-rank estimators.
Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Compression/Efficiency + Hardware co-design: automatic compression, approximate Bayesian inference, and analog accelerators for embedded inference.
All in one timestep: Enhancing Sparsity and Energy efficiency in Multi-level Spiking Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Model Architecture and Efficiency: proposes multi-level spiking neurons and a Sparse-ResNet to enhance sparsity and reduce energy/latency in SNNs.
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency via mixed-precision quantization to speed up interpretability patching with reduced memory.
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Efficiency: self-speculative decoding with draft/verify for VLMs, accelerating autoregressive inference.
Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency via improved knowledge distillation using angularly diverse single-teacher augmentations.
Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Model Compression/Efficiency: proposes few-shot task-aware knowledge distillation using counterfactual explanations with theoretical guarantees.
Memory Constrained Dynamic Subnetwork Update for Transfer Learning - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Compression/Efficiency: memory-constrained dynamic subnetwork adaptation with principled layer ranking and dynamic channel sampling.
Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Architecture and Efficiency: factorized hypernetwork generates context-aware LoRA adapters for conditioned fine-tuning (parameter-efficient adapters).
Study of Training Dynamics for Memory-Constrained Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: dynamic stochastic channel selection yields high activation/gradient sparsity for memory-constrained fine-tuning.
Knowledge Distillation of Uncertainty using Deep Latent Factor Model - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: proposes distribution distillation (Gaussian distillation) using a deep latent factor model to compress deep ensembles while preserving uncertainty, reducing compute/memory.
MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: codebook-based weight sharing for KANs via meta-learner-induced clustering enables up to 80x parameter compression.
LightMem: Lightweight and Efficient Memory-Augmented Generation - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Model Architecture/Efficiency: lightweight memory-augmented generation with multi-stage memory and offline consolidation (cache-like), reducing token and runtime costs.
Graphical model for tensor factorization by sparse sampling - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Representation Learning and Sparsity: message-passing and replica-theory analysis for tensor factorization under sparse sampling on random graphs.
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: training-free, attention-guided recurrent token selection for streaming Video-LLMs, discarding up to ~95% of tokens with minimal loss.
All You Need is One: Capsule Prompt Tuning with a Single Vector - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Architecture/Efficiency (PEFT): Capsule Prompt-Tuning with a single vector acting as an instance-aware "attention anchor" for parameter-efficient adaptation.
Zeroth-Order Sharpness-Aware Learning with Exponential Tilting - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Training Dynamics/Efficiency: bridges zeroth-order optimization with sharpness-aware minimization via exponential tilting; gradient-free, memory-efficient SAM alternative.
Vector Quantization in the Brain: Grid-like Codes in World Models - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Representation Learning/Model Architecture: brain-inspired action-conditioned vector quantization via attractor dynamics for spatiotemporal world models.
SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Optimization for large-scale training: a Lookahead variant applying Nesterov momentum to pseudo-gradients (SNOO) for compute-efficient training with minimal overhead.
GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Systems-level inference efficiency: training-free monolithic forwarding with sequence-level sparsity for top-K reranking, reducing latency and peak memory on-device.
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Efficiency/HPC: universal speculative decoding via DTW-based alignment enabling draft–target mismatch and faster inference.
Revisiting Knowledge Distillation: The Hidden Role of Dataset Size - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Training dynamics/representation: identifies data-efficiency of knowledge distillation in low-data regimes and evaluates competing theories (label smoothing vs dark knowledge).
LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Compression/Efficiency: conditional computation via stage-wise layer skipping and confidence-based early exit tailored for multi-stage reasoning.
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Efficiency: context-aware dynamic vocabulary shortlisting for speculative decoding to reduce drafter compute while keeping exact verification.
SG-XDEAT: Sparsity-Guided Cross-Dimensional and Cross-Encoding Attention with Target-Aware Conditioning in Tabular Learning - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Model architecture and efficiency: introduces Adaptive Sparse Self-Attention (sparsity) plus cross-dimensional/cross-encoding attention with target-aware conditioning for tabular learning.
Rethinking Knowledge Distillation: A Data Dependent Regulariser With a Negative Asymmetric Payoff - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Model Compression/Efficiency: rigorous analysis reframing knowledge distillation as a data-dependent regularizer with quantified transfer dynamics.
SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Matches Compression/Efficiency: proposes an embedding compression framework (dimension pruning with adaptive selection and cross-batch memory) for retrieval.
Your VAR Model is Secretly an Efficient and Explainable Generative Classifier - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: proposes a VAR-based generative classifier with tractable likelihood enabling token-wise MI explanations and faster inference than diffusion-based counterparts.
Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture and Efficiency: latent interleaved vision-text reasoning design and progressive training reduce annotation and inference latency.
MoRA: On-the-fly Molecule-aware Low-Rank Adaptation Framework for LLM-based Multi-Modal Molecular Assistant - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Proposes instance-specific dynamic Low-Rank Adaptation (LoRA) weights injected on-the-fly (low-rank parameter-efficient adaptation/architecture).
EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Inference-time Efficiency: entropy-aware branching reallocates test-time compute adaptively to hard prompts, improving Pass@k at lower token budgets.
LightSAE: Parameter-Efficient and Heterogeneity-Aware Embedding for IoT Multivariate Time Series Forecasting - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Model Compression/Efficiency: parameter-efficient embedding via low-rank factorization and shared gated component pool for heterogeneous time-series channels.
LLM-Oriented Token-Adaptive Knowledge Distillation - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Matches Model Compression and Efficiency via Knowledge Distillation for LLMs with token-level adaptive focusing and temperature scaling.
Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Model Architecture/Efficiency: adaptive conditional computation (fast vs slow reasoning) with entropy-guided hybrid policy optimization to reduce reasoning cost.
Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Compression/Efficiency: Top-K logits replay with exact renormalized losses plus MoClip optimizer stabilizes updates for low-cost LLM post-training with minimal forgetting.
PAC Reasoning: Controlling the Performance Loss for Efficient Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Conditional/Dynamic Networks: PAC-based switching between thinking/nonthinking modes with distribution-free performance-loss guarantees for efficient inference.
Auto-scaling Continuous Memory for GUI Agent - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Compression/Efficiency criterion: fixed-length continuous memory embeddings replacing long textual histories to reduce context cost while preserving visual detail.
DeepPrune: Parallel Scaling without Inter-trace Redundancy - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Model Efficiency: dynamic pruning of parallel Chain-of-Thought traces via learned equivalence prediction and online clustering, reducing inference tokens.
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Model Compression/Efficiency criterion: studies pruning in VLA and introduces a training-free weight interpolation correction to recover sparsified models.
Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Model Architecture and Efficiency: proposes a single-layer, O(N) Co^4 architecture reportedly outperforming GPT-2/GPT-BERT on BabyLM.
First Try Matters: Revisiting the Role of Reflection in Reasoning Models - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Inference Efficiency/Training Dynamics: empirical analysis of reflection plus question-aware early stopping to cut reasoning tokens.
Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Efficiency: sequence-level entropy from token log-probs as a confidence signal for early stopping in reasoning models.
OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Model Compression and Efficiency: algorithmic speedups for long-context speculative decoding (LSTM drafter, [SPEC] verifier, hybrid tree/non-tree) to improve inference throughput.
GUIDE: Guided Initialization and Distillation of Embeddings - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Matches Model Compression and Efficiency: parameter-space guided initialization/distillation (GUIDE) improves teacher–student transfer with no training/inference overhead.
MixReasoning: Switching Modes to Think - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Conditional/Dynamic Networks and Efficiency: adaptively switches reasoning depth within a single response to reduce computation without accuracy loss.
AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: adaptive mixed-bit activation quantization with bit-regularized channel-wise/layer-wise allocation for split learning; also reduces communication in distributed training.
Scalable In-context Ranking with Generative Models - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Model Architecture/Efficiency: enforced block-sparse attention across documents with auxiliary contrastive objective, reducing attention complexity from quadratic to linear.
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: High-Performance/Systems Efficiency: hardware–software co-design with module-level offloading, low-bit kernels, and token-aware buffering for on-device LMM inference.
Compressed Concatenation of Small Embedding Models - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Compression/Efficiency: concatenation of small embedding models with a Matryoshka-trained decoder and quantization to achieve high compression while preserving retrieval performance.
Efficient Training of Spiking Neural Networks by Spike-aware Data Pruning - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: spike-aware data pruning that approximates gradient-norm sampling via an efficient upper bound, cutting SNN training time while maintaining accuracy.
Adaptively Sampling-Reusing-Mixing Decomposed Gradients to Speed Up Sharpness Aware Minimization - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Efficiency/optimization: accelerates SAM by decomposing and selectively reusing gradient components while preserving flat-minima generalization.
REG: A Regularization Optimizer for Robust Training Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Training Dynamics/Optimization for large models: introduces a structure-aware optimizer (RACS) replacing Muon’s matrix sign to stabilize and regularize updates.
Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: High Performance/Scaling: quality-aware scaling law extending Chinchilla to jointly model data quality, dataset size, and model size for compute-efficient pretraining.
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Compression/Efficiency: pruning-based circuit extraction; Representation Learning: mechanistic interpretability via sparse circuit discovery with a hybrid attribution+pruning framework.
QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Model Architecture and Efficiency: introduces quadratic transformations with low-rankness, weight sharing, and sparsification as a lightweight enhancer.
Light Differentiable Logic Gate Networks - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Architecture/Efficiency: reparametrization of differentiable logic gate neurons reduces parameter size and improves training stability.
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Matches Model Compression/Efficiency: proposes gradient-free layer selection using Betti-number activation topology with forward passes only, reducing retraining compute/memory on-device.
ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Matches Model Compression/Efficiency: pluggable QK/Chunk adapters with attention distillation for chunk-wise attention and KV cache reduction to accelerate LLM inference.
Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Model Compression/Efficiency: Shapley-value-based, shift-invariant pruning for Kolmogorov–Arnold Networks enabling reliable compression.
ACON: Optimizing Context Compression for Long-horizon LLM Agents - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: proposes an LLM-agent context compression framework with guideline optimization and distillation to smaller compressors, reducing memory/token usage.
Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exiting - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Efficiency: adaptive early exiting for reasoning LLMs using entropy trajectory after stop-thinking token to save tokens.
AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Matches Efficiency/Decoding: adaptive block-size semi-autoregressive scheduler using confidence dynamics for diffusion LLM inference.
RAE: A Neural Network Dimensionality Reduction Method for Nearest Neighbors Preservation in Vector Search - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Matches Representation Learning and Compression/Efficiency: proposes a regularized autoencoder with provable bounds to preserve k-NN under dimensionality reduction for vector search.
Adaptive Graph Coarsening for Efficient GNN Training - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Efficiency for GNNs: joint training with adaptive graph coarsening (K-means over learned embeddings) to reduce training data and computation.
Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Model Efficiency: theoretical conditions for layer skipping in VLMs using information-theoretic redundancy analysis.

High Performance Computing (65)

Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection - Score: 20.0 (R=0, N=0) - Date: 2025-10-28 - Comment: Author match
A Definition of AGI - Score: 20.0 (R=0, N=0) - Date: 2025-10-22 - Comment: Author match
Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models - Score: 20.0 (R=0, N=0) - Date: 2025-10-01 - Comment: Author match
ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models - Score: 19 (R=10, N=9) - Date: 2025-10-28 - Comment: Systems-level innovation enabling sequence-parallel training of nonlinear RNNs via Newton iterations and parallel reductions (High Performance Computing + Model Architecture).
Collective Communication for 100k+ GPUs - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: High Performance Computing: introduces a collective communication framework (NCCLX) enabling reliable high-throughput, low-latency scaling to 100k+ GPUs for LLM training/inference.
AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: High-performance training: asynchronous hierarchical ZeRO with adaptive resharding and multi-stream overlap for scalable LLM training.
Efficient Long-context Language Model Training by Core Attention Disaggregation - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: High Performance Computing: decouples core attention into dedicated servers (CAD/DistCA) to balance compute/memory and eliminate stragglers in distributed long-context training.
Accelerating Frontier MoE Training with 3D Integrated Optics - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: High Performance Computing: photonic 3D co-packaged optics to scale MoE training across racks; systems-level innovation enabling larger parallelism and faster training.
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: High Performance Computing: novel tensor-compiler fusion for dependency-heavy reductions (e.g., attention) using algebraic corrections to boost locality and parallelism on GPUs.
MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: HPC/Distributed Training: multi-timescale adaptive optimizers with local updates reduce communication, with convergence guarantees.
Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: High Performance Computing: system–hardware co-design (Mono3D DRAM + NMP) for MoE serving with tiered memory and expert-usage prediction.
TASP: Topology-aware Sequence Parallelism - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: High Performance Computing: topology-aware sequence parallelism that decomposes AlltoAll topology into orthogonal rings for communication-efficient attention.
Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Training Dynamics Theory: convergence rates and regimes for GD at edge of stability via manifold-based decomposition in overparameterized least squares.
MuonBP: Faster Muon via Block-Periodic Orthogonalization - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: High Performance Computing: distributed-friendly optimizer (block-periodic orthogonalization) reducing communication with theory and throughput gains.
FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: High Performance Computing: novel collective communication fabric aggregating NVLink, PCIe, and RDMA with adaptive load balancing; drop-in replacement for NCCL.
From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: High Performance Computing/Systems: MLIR-AIR compiler dialect orchestrates asynchronous, spatial scheduling for NPUs; efficient mapping of attention and matmul.
FairBatching: Fairness-Aware Batch Formation for LLM Inference - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: High Performance Computing/Systems: fairness-aware batching scheduler improves TTFT/TPOT and GPU utilization for LLM inference.
DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches High Performance Computing: dynamic context parallelism with fine-grained blockwise partitioning for long-context training, reducing communication and improving balance.
EA4LLM: A Gradient-Free Approach to Large Language Model Optimization via Evolutionary Algorithms - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: High Performance Computing/Training criterion: introduces a gradient-free evolutionary optimization method for training large LLMs, enabling non-differentiable components and reducing hardware constraints—an algorithmic innovation for large-scale training.
Task-Level Insights from Eigenvalues across Sequence Models - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation learning and training dynamics: dynamical-systems eigenvalue analysis across attention and SSMs to link spectra with memory/long-range dependency and architectural effects.
Efficient Autoregressive Inference for Transformer Probabilistic Models - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Systems/architecture innovation: causal autoregressive buffer with cached context enables efficient joint sampling—cache/memory optimization for Transformer probabilistic models.
Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Strong match to HPC: a universal algorithm for distributed matrix multiplication across arbitrary partitionings/replication, improving systems support for large-scale training/inference.
SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: High Performance Computing: systems-level prefill/decode disaggregation with specialized hardware to optimize compute/memory utilization for LLM inference.
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: HPC + Efficiency: introduces a parallelism-compatible FlashAttention-2 JVP kernel enabling 10B+ model sCM training and proposes score-regularized continuous-time consistency distillation for few-step generation.
Lossless Vocabulary Reduction for Auto-Regressive Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Model efficiency/tokenization: lossless vocabulary reduction enabling smaller vocabularies and cross-tokenizer cooperation for AR LMs; strong alignment with efficiency and systems-level interoperability.
GCPO: When Contrast Fails, Go Gold - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: High-Performance/Distributed Training: stability-based generalization and excess error bounds for multi-gossip decentralized training; algorithmic insights into communication/training efficiency.
Geodesics in the Deep Linear Network - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Training dynamics/geometry: derives geodesics and ODEs in deep linear network geometry, offering theoretical insight into network optimization landscapes.
Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Training Dynamics Theory: links chaotic dynamics and symmetry-induced invariant subspaces to riddled basins, revealing limits to predictability.
OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: High Performance Computing: optimized pipeline-parallel scheduling jointly accounting for memory capacity, activation reuse, and bubble minimization.
Learning without Global Backpropagation via Synergistic Information Distillation - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: High Performance Computing: training without global backprop via local synergistic distillation to remove update locking and reduce activation memory, enabling parallel module updates.
Cache-to-Cache: Direct Semantic Communication Between Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Systems/Efficiency: KV-cache projection/fusion and gating enable direct inter-LLM communication, improving accuracy and reducing latency.
LoRAFusion: Efficient LoRA Fine-Tuning for LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: High Performance Computing and Efficiency: fused kernels for LoRA and adaptive multi-job scheduling for concurrent fine-tuning; systems-level innovation for PEFT.
SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches High Performance Computing: fine-grained slice-level packing and asymmetric forward/backward partitioning for balanced distributed LLM training.
Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: High Performance Computing: memory-efficient backpropagation enabling on-device fine-tuning of LLMs (<1GB), a systems-level memory optimization for training.
Distributed Low-Communication Training with Decoupled Momentum Optimization - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: High Performance Computing: reduces communication via decoupled momentum optimization and DCT-based momentum compression with infrequent syncs for distributed training.
KVComm: Enabling Efficient LLM Communication through Selective KV Sharing - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Compression/Efficiency and Systems: selective KV sharing based on attention-importance with Gaussian prior reduces inter-LLM communication while retaining performance.
Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box Reservoirs - Score: 16 (R=8, N=8) - Date: 2025-10-30 - Comment: Matches architecture/systems criteria by enabling training with black-box modules via Bounded Numerical Differentiation, supporting hybrid analogue–digital compositions.
SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Matches Efficiency/HPC theory: exact SHAP for tensor networks with polylog-time parallel algorithm (TT); insights for BNNs linking width to SHAP hardness.
Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Training Dynamics Theory: non-asymptotic convergence of SGLD in the lazy training (NTK) regime with finite-time/width bounds.
AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM - Score: 16 (R=8, N=8) - Date: 2025-10-22 - Comment: Matches High Performance/Systems Efficiency: parametric integration of billion-scale KGs into LLMs with sub-linear time/memory via KG2KV and HiKVP.
A Split-Client Approach to Second-Order Optimization - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches High Performance Computing criterion: proposes an algorithmic/system-level asynchronous split-client scheme for second-order training with provable wall-clock speedups, enabling practical large-scale optimization.
Optimal Scaling Needs Optimal Norm - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Matches High Performance Computing/training dynamics: discovers an operator-norm invariance governing optimal LR/batch scaling for LLM training and reports distributed Scion implementation and large-scale scaling rules.
TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Matches High Performance Computing: systems-level preemptive scheduling and proactive KV-cache memory management for LLM serving to improve responsiveness and throughput.
Rethinking Thinking Tokens: LLMs as Improvement Operators - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Inference efficiency and dynamic refinement: Parallel-Distill-Refine orchestrates bounded workspace and parallelism to improve accuracy-latency trade-offs (HPC/algorithmic efficiency; conditional computation).
Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Model efficiency/HPC: lossless parallel decoding for diffusion LLMs via draft-and-verify without extra forward passes; substantial inference speedup.
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Efficiency/HPC: adapts speculative decoding to RL training with dynamic tuning and drafter distillation for faster rollouts.
TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Model Architecture/Efficiency: linear RNN (GatedDeltaProduct) pre-trained synthetically with fully parallelizable training/inference.
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches High Performance Computing/Systems via CPU-based dynamic analysis to estimate peak GPU memory for DL training.
Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Matches HPC: demand-aware optical-network framework that overlaps reconfiguration with collective communication to accelerate distributed ML collectives.
MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: High Performance Computing: systems-level training pipeline (Megatron-Core) with near-linear multi-node scaling and efficiency optimizations for large video generation models.
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: co-design of KV cache policies (eviction/recompute/refresh) with eDRAM for LLM serving; systems-level memory optimization for inference.
xLLM Technical Report - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: HPC/Systems: large-scale LLM inference framework with disaggregated prefill/decode, global KV cache management, and execution/memory pipeline optimizations.
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: HPC/distributed training for LLMs: asynchronous RL post-training system (fine-grained parallelism, rollout-train decoupling).
BioOSS: A Bio-Inspired Oscillatory State System with Spatio-Temporal Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture with a new bio-inspired oscillatory state system (BioOSS) capturing spatio-temporal propagation dynamics with trainable damping and speed parameters.
video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: HPC/Memory Optimization: fixed-budget streaming via test-time-training memory module (Hessian-free CG) and prompt-dependent memory retrieval for long-context audio-visual LLMs.
RepDL: Bit-level Reproducible Deep Learning Training and Inference - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: High Performance Computing/Systems: ensures deterministic, bitwise-reproducible training and inference via correct rounding and order-invariant floating-point computation across platforms.
Robust and Efficient Collaborative Learning - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches High Performance Computing criterion: decentralized, pull-based distributed training algorithm with O(n log n) communication.
Cocoon: A System Architecture for Differentially Private Training with Correlated Noises - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: High Performance Computing: hardware–software co-design (precomputed correlated DP noise, near-memory processing) to reduce training overheads for large models/embeddings.
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Inference Efficiency: training-free acceleration of parallel decoding in diffusion LLMs via Trace Credit accumulation and logit fusion.
From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: High Performance Computing: systems-level design for LLM serving on multi-core NPUs (tensor parallelism, core placement, memory management) to optimize inference throughput.
MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: High Performance Computing/Systems: colocated inference and fine-tuning with iteration-level scheduling and memory management to meet SLOs on edge GPUs.
Linear RNNs for autoregressive generation of long music samples - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Model Architecture: advances in linear RNN/state-space design plus context-parallelism enabling 1M-token training (systems-level efficiency).
DeMuon: A Decentralized Muon for Matrix Optimization over Graphs - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Decentralized optimization with orthogonalization (Newton–Schulz) and gradient tracking; systems-level advance for distributed training.
Generalized Parallel Scaling with Interdependent Generations - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: High Performance Computing/Systems-level inference—parallel scaling with interdependent generations via shared hidden-state tensors and small parameter overhead.
Exploring System 1 and 2 communication for latent reasoning in LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture: studies dual-model latent communication vs unified forward-pass, analyzing representation and compute tradeoffs.

Representation Learning (252)

Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density - Score: 20.0 (R=0, N=0) - Date: 2025-10-08 - Comment: Author match
Language Models are Injective and Hence Invertible - Score: 19 (R=10, N=9) - Date: 2025-10-20 - Comment: Representation Learning: proves injectivity/invertibility of transformer LMs and provides an exact input reconstruction algorithm.
Pretrain-Test Task Alignment Governs Generalization in In-Context Learning - Score: 19 (R=10, N=9) - Date: 2025-10-01 - Comment: Representation Learning/Theory: exact analysis of ICL generalization via pretrain-test task alignment; predictive measure validated on Transformers.
Superposition disentanglement of neural representations reveals hidden alignment - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Representation Learning: examines superposition and alignment; uses sparse autoencoders to disentangle features and improve representational alignment metrics.
Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Representation Learning: uses sparse autoencoders to identify and steer code-correctness directions in LLM representations (mechanistic interpretability).
Self-Supervised Representation Learning as Mutual Information Maximization - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Theoretical unification of self-supervised representation learning via MI; explains stop-gradient and predictor networks from first principles.
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: Representation Learning and Sparsity: derives SAEs from proximal gradient unrolling and introduces AbsTopK (|·|-TopK) to recover bidirectional features under an ℓ0-inspired sparsity constraint.
A Generalized Information Bottleneck Theory of Deep Learning - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Representation Learning Theory: introduces a Generalized Information Bottleneck using computable synergy/interaction information, explaining compression dynamics across CNNs/Transformers.
Deep sequence models tend to memorize geometrically; it is unclear why - Score: 18 (R=9, N=9) - Date: 2025-10-31 - Comment: Representation Learning: uncovers geometric memorization in deep sequence models with analysis linking to spectral bias; insights into training dynamics and embeddings.
Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation - Score: 18 (R=9, N=9) - Date: 2025-10-29 - Comment: Representation Learning / Training Dynamics: statistical physics analysis of multi-layer perceptron feature learning and phase transitions near interpolation.
A simple mean field model of feature learning - Score: 18 (R=9, N=9) - Date: 2025-10-20 - Comment: Representation Learning: mean-field theory of feature learning and phase transitions in finite-width networks.
LLMs Process Lists With General Filter Heads - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning: identifies causal, general-purpose ‘filter heads’ implementing a functional filtering operation across tasks.
Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning: token-level causal analysis of CLIP, identifying composition nonidentifiability and links to modality gaps.
Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning/Model Architecture: learns an adaptive manifold via metric tensor optimization (discrete differential geometry), a foundational framework beyond parameter tuning.
Contrastive Predictive Coding Done Right for Mutual Information Estimation - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning: proposes InfoNCE-anchor for principled MI estimation and unifies contrastive objectives via proper scoring rules, clarifying what contrastive losses learn.
Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Representation Learning: shows emergent object binding in ViT embeddings, identifies a low-dimensional subspace guiding attention, and validates via causal ablations.
Eigenfunction Extraction for Ordered Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Framework for extracting ordered, identifiable eigenfunctions tied to contrastive/non-contrastive objectives—strong Representation Learning theory contribution leveraging low-rank and Rayleigh quotient ideas.
Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Representation Learning: formally separates perception from decision, defines representation-invariant perceptual metrics, and proves orthogonality to Bayes task-risk gradients.
From Memorization to Reasoning in the Spectrum of Loss Curvature - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Representation Learning/Training Dynamics: disentangles memorization via loss-curvature-based weight decomposition and weight editing.
Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Representation Learning: augmentation-free SSL via orthonormal/overcomplete frame projections leveraging geometric biases.
Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches Representation Learning: learns identifiable equivariant embeddings from unlabeled group actions without inductive biases.
Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Theoretical analysis of Neural Collapse arising under gradient flow in two-layer ReLU networks (Representation learning/training dynamics).
Disentangled Representation Learning via Modular Compositional Bias - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Matches Representation Learning: modular compositional bias enabling disentanglement of attributes/objects without architecture/objective redesign.
From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Representation Learning: theoretical analysis of SGD dynamics showing learning-rate-induced phase transitions; introduces a two-timescale layer-wise training algorithm.
Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-24 - Comment: Representation Learning/Theory — derives a tight lower bound connecting JSD to KLD/MI, justifying discriminative MI objectives used in practice.
Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning - Score: 17 (R=9, N=8) - Date: 2025-10-24 - Comment: Representation Learning: diagnoses prototype collapse and proposes decoupled EM-updated prototypes to stabilize prototypical SSL training.
A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Representation Learning/Training Dynamics: derandomization lemma explaining structure discovery (low-rank) in neural networks under broad conditions.
Towards Identifiability of Hierarchical Temporal Causal Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Matches Representation Learning: identifiability of hierarchical temporal causal latents from conditionally independent observations with a variational generative model.
ActivationReasoning: Logical Reasoning in Latent Activation Spaces - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Matches Representation Learning: operationalizes logical reasoning and control in latent activation space using sparse autoencoder-derived concepts and rule application.
Extracting Rule-based Descriptions of Attention Features in Transformers - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Representation learning and transformer analysis: extracts rule-based descriptions of SAE attention features (skip-gram, absence, counting), providing mechanistic interpretability of transformer internals.
Generalization Below the Edge of Stability: The Role of Data Geometry - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Representation Learning/Training Dynamics: theoretical generalization below the edge of stability tied to data geometry for overparameterized ReLU nets.
Measure-Theoretic Anti-Causal Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Matches Representation Learning: measure-theoretic anti-causal representation framework (ACIA) with interventional kernels and OOD generalization guarantees.
Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Representation Learning/Identifiability: introduces Jacobian Volume Maximization to identify nonlinear latent components without auxiliary signals or sparsity assumptions.
Algorithmic Primitives and Compositional Geometry of Reasoning in Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Representation Learning/Mechanistic Interpretability: identifies and steers compositional activation primitives underlying LLM reasoning via function vectors.
On the Neural Feature Ansatz for Deep Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Matches Representation Learning: theoretical analysis of Neural Feature Ansatz and training dynamics across depth.
The Coverage Principle: How Pre-training Enables Post-Training - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Matches Representation Learning/Training Dynamics: theory of coverage from next-token pretraining predicting downstream/post-training success with provable interventions.
Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Model Architecture and Representation Learning: input-adaptive recurrence, discrete bottleneck, and error-correction for OOD algorithmic generalization in Transformers.
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Representation Learning/Interpretability: analyzes activation differences post narrow finetuning; strong evidence of training traces and steering via diffs.
Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Representation Learning/Training Dynamics: theoretical analysis of Mamba’s in-context learning via nonlinear gating and test-time feature learning with sample complexity results.
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Matches Representation Learning: theoretical analysis of training dynamics for high-dimensional SGD/ASGD with moment and concentration guarantees.
Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Representation Learning/Autoencoders: replaces KL with MMD to enforce programmable priors for disentanglement and proposes an unsupervised Latent Predictability Score—directly advancing controllable latent structure.
Adversarial Attacks Leverage Interference Between Features in Superposition - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Provides mechanistic representation-learning insight via superposition explaining adversarial vulnerability (representation learning/training dynamics).
Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Unified framework for amortized learning (ICL, learned optimizers) with iterative amortized inference — Representation Learning/training dynamics and adaptation.
In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Representation Learning/Training Dynamics: finite-sample generalization theory for ICL in Transformers with risk decomposition and non-asymptotic bounds.
On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Information-geometric theory of Barlow Twins showing optimal representation efficiency via isotropic FIM — Representation Learning theory.
Understanding Self-supervised Contrastive Learning through Supervised Objectives - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Strongly matches Representation Learning by providing a theoretical formulation linking self-supervised contrastive objectives to supervised ones, yielding insights into InfoNCE and balanced contrastive losses.
Rademacher Meets Colors: More Expressivity, but at What Cost ? - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches Representation Learning/Theory: links GNN expressivity (WL colorings) to Rademacher complexity, explaining generalization–expressivity trade-offs.
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Training Dynamics/Generalization: theoretical characterization of stochastic Adam’s generalization vs batch size and weight decay in overparameterized CNNs, aligning with representation learning theory.
Redundancy as a Structural Information Principle for Learning and Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Matches Representation Learning: introduces a theoretical redundancy framework unifying classical information measures and predicts generalization-optimal redundancy, validated with autoencoders.
The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation Learning/Architecture analysis: perturbation-based causal identification reveals ultra-sparse critical neurons and their layerwise localization governing language ability.
Geodesic Calculus on Latent Spaces - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Geometric representation learning: Riemannian calculus on autoencoder latent manifolds (implicit submanifolds), with learned projection and geodesic/exponential map computations.
PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation Learning: product of hyperbolic factors with an ℓ1-product metric to jointly capture hierarchy and compositionality in embeddings.
On the Alignment Between Supervised and Self-Supervised Contrastive Learning - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation Learning: proves representation-level alignment between self-supervised contrastive learning and negatives-only supervised contrastive learning with high-probability bounds (CKA/RSA).
On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Representation Learning criterion: introduces uniformly scaling flows linking Deep SVDD and normalizing flows, preventing collapse and tightening likelihood–latent norm alignment.
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Representation Learning: dictionary/SAE-based interpretability of DINOv2 and a new Minkowski Representation Hypothesis about concept geometry in ViTs.
Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Representation Learning/Training Dynamics: theoretical bounds for attention-only transformers and mechanisms (dropout, EMA) that improve length generalization.
Base Models Know How to Reason, Thinking Models Learn When - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Representation Learning/Training Dynamics: causal elicitation of latent reasoning mechanisms in base models and analysis of when vs how reasoning is deployed; foundational interpretability insight.
Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Matches Representation Learning/training dynamics: theoretical condition for near-independent outputs in wide nets via zero-mean activations, informing architectural design.
The Effect of Label Noise on the Information Content of Neural Representations - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Representation Learning: analyzes information content of hidden representations and training dynamics under label noise using an information-theoretic proxy.
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Training dynamics/representation insight: shows long-context length alone degrades LLM performance independent of retrieval; proposes a simple mitigation to reduce effective context.
Computing frustration and near-monotonicity in deep neural networks - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Representation Learning: analyzes trained DNNs via signed-graph frustration to reveal near-monotonic structure and implicit regularization.
Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Representation Learning: proves affine identifiability for nonlinear CCA under latent priors, with whitening necessity and finite-sample convergence guarantees.
On the Limitations and Capabilities of Position Embeddings for Length Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture (Transformers): theoretical analysis of position embeddings for length generalization (LRC/SRC) plus a learning-based PE framework and scale hints.
What Scales in Cross-Entropy Scaling Law? - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Representation Learning/Training Dynamics: theoretical decomposition of cross-entropy into error-entropy/self-alignment/confidence, identifying error-entropy as the true scaling component.
Understanding the Role of Training Data in Test-Time Scaling - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Representation Learning/Training Dynamics: theoretical analysis of test-time scaling for transformers, linking training data properties to benefits of long chain-of-thought.
Decrypt Modality Gap in Multimodal Contrastive Learning: From Convergent Representation to Pair Alignment - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Representation Learning: first theoretical framework explaining modality gap in multimodal contrastive learning via dimension collapse and alignment theory.
Topological Invariance and Breakdown in Learning - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Representation Learning/Training Dynamics — architecture-agnostic theory showing topology-preserving vs. simplifying phases in learning governed by the learning rate.
Unraveling Syntax: How Language Models Learn Context-Free Grammars - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Representation Learning/Training Dynamics: theoretical and empirical study of how transformers learn PCFGs, with recursive loss/KL formulae and subgrammar pretraining effects.
Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Representation learning/training dynamics: controlled study of arbitration between parametric and in-context knowledge in Transformers.
Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture/Representation Learning: implicit energy-based model learning an equilibrium gradient with optimization-driven sampling and adaptive compute—foundational alternative to diffusion/flow.
Precise Dynamics of Diagonal Linear Networks: A Unifying Analysis by Dynamical Mean-Field Theory - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Matches Representation Learning: theoretical analysis of gradient-flow dynamics in diagonal linear networks via Dynamical Mean-Field Theory.
Posterior Collapse as a Phase Transition in Variational Autoencoders - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Representation Learning: theoretical analysis of VAEs’ training dynamics, framing posterior collapse as a phase transition with a critical boundary.
Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Architecture and Representation Learning: mechanistic analysis of MLP activation distributions and an inference-time activation redistribution module (ARM) that improves reasoning.
Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space? - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Representation/Architecture Analysis: introduces spectral utilization diagnostics (hard/soft rank, concentration, SUI) revealing FFN latent-space scaling laws.
Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Representation Learning and Training Dynamics: provides a mathematical framework for loss of plasticity, identifying frozen units and cloned-unit manifolds and linking to low-rank/simplicity biases.
Estimating Dimensionality of Neural Representations from Finite Samples - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning: bias-corrected estimator of neural manifold dimensionality robust to finite samples and noise, applicable to networks and brain data.
Muon Outperforms Adam in Tail-End Associative Memory Learning - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Training dynamics/Representation Learning: theoretical and empirical analysis of optimizer behavior in LLMs via an associative memory lens, explaining isotropy and tail-class learning advantages.
A Unified Probabilistic Framework for Dictionary Learning with Parsimonious Activation - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning: dictionary learning with a parsimonious (row-sparse) activation prior, grounded in a Bayesian framework for sparsity.
How Does Preconditioning Guide Feature Learning in Deep Neural Networks? - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning: theory linking preconditioner-induced Gram metric to spectral bias and generalization.
Compositional Symmetry as Compression: Lie Pseudogroup Structure in Algorithmic Agents - Score: 17 (R=8, N=9) - Date: 2025-10-15 - Comment: Representation Learning: theoretical framework linking compositional symmetry/equivariance to manifold reductions and predictive coding, offering principles for compressive, hierarchical representations.
IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning - Score: 16 (R=9, N=7) - Date: 2025-10-30 - Comment: Model Architecture (Normalization) and Representation Learning: IB-inspired normalization controlling task-relevant information with theory on IB value and generalization.
A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Training dynamics: theoretical framework quantifying ICL benefits of pre-training and context length (transformer setting).
H-SPLID: HSIC-based Saliency Preserving Latent Information Decomposition - Score: 16 (R=9, N=7) - Date: 2025-10-24 - Comment: Representation learning: HSIC-based latent decomposition into salient/non-salient subspaces with theory linking robustness and compression.
How Do LLMs Use Their Depth? - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Representation Learning: layer-wise analysis revealing a 'guess-then-refine' computation pattern across depth in LLMs, informing efficient use of layers.
CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Representation Learning/Interpretability: probe-free spectral analysis (transformation matrix estimation, CKA) to characterize transformer layer functions.
A Function Centric Perspective On Flat and Sharp Minima - Score: 16 (R=9, N=7) - Date: 2025-10-16 - Comment: Training dynamics/Representation: function-centric analysis of sharpness vs generalization, showing sharper minima under regularization can generalize better—insight into loss landscape geometry.
Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Matches Representation Learning with Sparse Autoencoders: proposes Adaptive Temporal Masking to reduce feature absorption and stabilize SAE training.
Memory Retrieval and Consolidation in Large Language Models through Function Tokens - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Representation Learning: proposes the function token hypothesis with evidence on how function tokens retrieve features and drive memory consolidation in LLMs.
Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Representation Learning: introduces cumulant-expansion probes of softmax entropy to quantify higher-order feature-learning dynamics across layers and training.
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning and Sparse methods: analyzes Sparse Autoencoders’ interpretability vs. steering utility and proposes Delta Token Confidence for feature selection.
Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Representation learning: activation-space attribution with representation gradient tracing to link outputs to training data.
How Do Language Models Compose Functions? - Score: 16 (R=9, N=7) - Date: 2025-10-03 - Comment: Representation Learning: mechanistic analysis of compositionality in LLMs via logit-lens, identifying processing pathways and linking them to embedding space geometry.
Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Representation Learning: SMDS method to discover and analyze feature manifolds in LLM latent space.
Feature Identification via the Empirical NTK - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Representation Learning/Training dynamics: empirical NTK eigenanalysis surfaces learned features and tracks grokking phase changes.
Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability - Score: 16 (R=8, N=8) - Date: 2025-10-31 - Comment: Representation Learning: analyzes how Transformers learn PRNG structure; scaling laws, curriculum necessity, and interpretable embeddings.
From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning - Score: 16 (R=8, N=8) - Date: 2025-10-30 - Comment: Matches representation learning criterion with a theoretical analysis of feature learning and training dynamics (weak-to-strong generalization) in CNNs.
How do simple rotations affect the implicit bias of Adam? - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: Representation Learning / Training Dynamics: analyzes Adam’s implicit bias under rotations and uses an equivariant reparameterization to restore rotation invariance.
From Black-box to Causal-box: Towards Building More Interpretable Models - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Model Architecture/Representation Learning: framework for causally interpretable architectures enabling counterfactual queries with formal criteria.
How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime? - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Matches Representation Learning/Training Dynamics Theory: proves label-noise gradient descent suppresses noise memorization and improves generalization in low SNR.
Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Model Architecture/Representation Learning: local Riemannian approach addressing oversmoothing/oversquashing with theoretical guarantees for deep MPNNs.
On the Impossibility of Retrain Equivalence in Machine Unlearning - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: theoretical impossibility result for retrain equivalence in multi-stage training, highlighting path dependence of local unlearning.
Symmetry and Generalisation in Neural Approximations of Renormalisation Transformations - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: analyzes symmetry constraints and expressivity in MLPs/GNNs for learning RG transformations, with theoretical and empirical insights.
Sequence Modeling with Spectral Mean Flows - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches Model Architecture and Representation Learning: operator-theoretic sequence model with spectral tensor-network decomposition and flow matching.
Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Representation learning/training dynamics: uses cross-layer sparse autoencoders to extract latent rules and introduces SAL to quantify soundness-aware internal distributions predicting reasoning potential.
To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Model Architecture/Analysis – theoretical limits of SSMs and tool-augmented design enabling length generalization for reasoning tasks.
Programmatic Representation Learning with Language Models - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Representation Learning: programmatic feature synthesis with decision trees (LeaPR), offering interpretable, non-neural predictors learned via LLM-synthesized code.
When Flatness Does (Not) Guarantee Adversarial Robustness - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Representation Learning/Training dynamics – formal analysis linking flat minima to local adversarial robustness and geometry of loss landscapes.
Cautious Weight Decay - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Matches Representation Learning: optimization/training dynamics innovation (Cautious Weight Decay) as a drop-in modification to standard optimizers.
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Representation Learning: targeted editing of hidden representations with a learned value function for precise attribute intensity control.
Scaling Language-Centric Omnimodal Representation Learning - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Matches Representation Learning: analyzes emergent cross-modal alignment in MLLMs and proposes a language-centric embedding framework with a scaling law linking generative and representation quality.
Do LLMs "Feel"? Emotion Circuits Discovery and Control - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Representation Learning: identifies context-agnostic emotion directions and causal neuron/attention-head circuits that implement and control emotional expression in LLMs.
Verifying Chain-of-Thought Reasoning via Its Computational Graph - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Representation Learning/Training Dynamics: white-box verification via computational (attribution) graphs to diagnose and fix CoT reasoning, offering causal insights into latent circuits.
On the Implicit Adversariality of Catastrophic Forgetting in Deep Continual Learning - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Representation Learning: theoretical analysis of catastrophic forgetting via low-rank bias and gradient alignment; introduces backGP to mitigate alignment from backward propagation.
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Representation Learning: unpaired multimodal training with shared parameters; theory under linear assumptions showing unimodal gains from auxiliary modalities.
Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Matches Representation Learning/training dynamics: shows width expansion enables linear mode connectivity without permutations; introduces LEWC explanation.
Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Representation Learning/Training Dynamics: introduces Rényi-sharpness tied to Hessian spectra with generalization bounds and a new SAM-style regularizer (RSAM).
Beyond independent component analysis: identifiability and algorithms - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Representation Learning: identifiability theory beyond ICA (pairwise mean independence) with an algebraic recovery algorithm.
Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization - Score: 16 (R=8, N=8) - Date: 2025-10-09 - Comment: Representation Learning/Training Theory: Hellinger localization framework yields near instance-optimal MLE rates for multi-trajectory sequential models, including linear-attention sequence models.
Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models - Score: 16 (R=8, N=8) - Date: 2025-10-08 - Comment: Strong match to Representation Learning: proposes a framework to trace internal representations, identifies a commitment layer and dual-pathway mechanism underlying hallucinations in Transformers.
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Matches Representation Learning/training dynamics: a variance-optimized preference optimization method with theory for aligning large reasoning models.
How does the optimizer implicitly bias the model merging loss landscape? - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Representation Learning/Training Dynamics: shows how optimizer-induced effective noise shapes the global loss landscape and predicts model merging success.
Sharp Lower Bounds for Linearized ReLU^k Approximation on the Sphere - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Model Architecture / Representation Learning: theoretical saturation bounds for linearized shallow ReLU^k networks, analyzing approximation capacity of the architecture.
Decision Potential Surface: A Theoretical and Practical Approximation of LLM's Decision Boundary - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Representation Learning: introduces Decision Potential Surface to approximate LLM decision boundaries with provable error bounds via K-sampling.
Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Training Dynamics — optimal generalization rates for GD on deep ReLU networks via control of activation patterns and sharper Rademacher bounds.
Learning Multi-Index Models with Hyper-Kernel Ridge Regression - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Representation Learning/Theory: HKRR provides sample complexity guarantees for compositional multi-index models, bridging kernels and neural approaches to overcome curse of dimensionality.
Uniform-in-time convergence bounds for Persistent Contrastive Divergence Algorithms - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Representation Learning/Training Dynamics: theoretical uniform-in-time convergence bounds for PCD in EBMs with an efficient continuous-time SDE formulation and stable S-ROCK integrators.
Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Representation learning insight: analyzes how latent geometry vs shared data-space affects adversarial transfer with theory and experiments.
Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Generalization theory in overparameterized spiked regression, classifying benign vs catastrophic overfitting—training dynamics/representation theory.
To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Representation Learning: proposes a metric for distributional symmetry-breaking and theory showing when equivariant methods can underperform.
On the Benefits of Weight Normalization for Overparameterized Matrix Sensing - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Training dynamics/optimization analysis of weight normalization showing faster convergence in overparameterized matrix sensing (Representation Learning / Model Architecture analysis).
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Model Architecture and Representation Learning: introduces a deterministic Manifold-Probabilistic Projection Model unifying geometric manifold structure with kernel-based probabilistic modeling, reinterpreting diffusion as projection.
Nonparametric Identification of Latent Concepts - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Representation Learning: provides a nonparametric identifiability theory for latent concepts from multi-class observations, offering foundational guarantees on recovering representations.
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Representation Learning: analysis of emergent visual priors from language pretraining with scaling trends and data-centric pretraining recipe.
SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Representation Learning/Training Dynamics: reveals persistent initialization-dependent fingerprints in LLMs across training.
Test time training enhances in-context learning of nonlinear functions - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Training dynamics/Representation Learning: theory for test-time training combined with ICL in transformers, showing adaptation to task-specific link functions and features.
Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Representation Learning/Training Dynamics: theoretical analysis of gradient descent in matrix factorization, identifying critical step sizes and chaotic/fractal convergence behavior.
Clone Deterministic 3D Worlds with Geometrically-Regularized World Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: geometric regularization to shape latent manifold topology for robust world-model rollouts.
Unravelling the Mechanisms of Manipulating Numbers in Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: probes numerical information processing in LLMs, yielding universal probes and layer-wise mechanism insights.
Likely Interpolants of Generative Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: principled interpolation scheme for generative models via likely transition paths with Riemannian-geodesic interpretation, no retraining required.
Angular Steering: Behavior Control via Rotation in Activation Space - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: activation-space steering via geometric rotation (and adaptive variant) to control LLM behaviors.
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: uses sparse autoencoders to learn interpretable latent features of human preference data for analysis and curation.
ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: uses sparse autoencoders on foundation-model features to discover disentangled concepts and dataset bias.
Mechanistic Interpretability of RNNs emulating Hidden Markov Models - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning/Mechanistic Interpretability: reverse-engineers RNNs emulating HMMs, uncovering structured dynamics and connectivity enabling probabilistic computation.
Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning/Training Dynamics: theoretical bifurcation analysis of shallow networks with tunable leaky ReLU revealing symmetry-breaking and landscape structure.
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning: causal analysis of which CoT steps actually influence predictions; identifies and steers a latent 'TrueThinking' direction in LLM representation space.
Confidence is Not Competence - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning: geometric analysis of LLM internal states revealing separable assessment/execution manifolds.
Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Representation Learning: leverages internal correlation-matrix rank as a self-indicator to verify reasoning paths without external verifiers.
Debiasing Reward Models by Representation Learning with Guarantees - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Representation Learning: identifies non-spurious latent variables and trains reward models on them with identifiability guarantees to mitigate spurious correlations.
VIKING: Deep variational inference with stochastic projections - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Variational family reflecting network reparametrization for fully-correlated posteriors—foundational approximate Bayesian inference for deep nets (Representation Learning/Training Dynamics).
Monotone and Separable Set Functions: Characterizations and Neural Models - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Model Architecture/Representation Learning: characterizes monotone-and-separating set functions and proposes neural models preserving set-containment order with universality.
Manifold Approximation leads to Robust Kernel Alignment - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Representation Learning: manifold-aware kernel alignment (MKA) provides a more robust representation similarity metric than CKA.
Scaling Non-Parametric Sampling with Representation - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Representation Learning with a simple non-parametric generative model and mechanistic analysis of image structure.
Probing Neural Combinatorial Optimization Models - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Interpretability: probing (CS-Probing) to analyze internal representations and inductive biases in NCO models.
Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Representation Learning by characterizing a low-dimensional emotional manifold in LLM hidden states and controllable interventions.
Mechanistic Interpretability for Neural TSP Solvers - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Interpretability: activation-level analysis with sparse autoencoders reveals interpretable features in Transformer TSP solvers.
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Representation Learning via sparse autoencoders to interpret and enhance vision-language alignment at a concept level.
On Uncertainty Calibration for Equivariant Functions - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Representation/Architecture Analysis: theoretical bounds linking equivariance properties to uncertainty calibration (ECE/ENCE) in models.
Correlation Dimension of Auto-Regressive Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Representation Learning: introduces correlation-dimension metric to quantify long-range structural complexity and generative dynamics in autoregressive LLMs.
Model Merging with Functional Dual Anchors - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Matches Representation Learning/Training Dynamics: proposes a new model-merging framework in input-representation space (Functional Dual Anchors) for foundation models, improving post-hoc integration efficiency.
Neural Mutual Information Estimation with Vector Copulas - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Representation Learning: proposes a neural mutual information estimator using vector copulas to balance capacity and data efficiency.
Context-level Language Modeling by Learning Predictive Context Embeddings - Score: 15 (R=8, N=7) - Date: 2025-10-24 - Comment: Representation Learning: introduces a next-context prediction objective to learn predictive context embeddings and improve long-range modeling with minimal overhead.
IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks - Score: 15 (R=8, N=7) - Date: 2025-10-24 - Comment: Representation Learning — integrates Information Bottleneck into GANs with an intermediate stochastic bottleneck to induce disentangled factors.
Understanding the Implicit Biases of Design Choices for Time Series Foundation Models - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Representation Learning: analyzes implicit inductive biases/training dynamics of TSFMs (patching, embeddings, objectives) with theory and controlled evaluations.
Weight Decay may matter more than muP for Learning Rate Transfer in Practice - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Representation Learning/Training Dynamics: analyzes learning-rate transfer across widths, highlighting weight decay vs muP scaling.
Category learning in deep neural networks: Information content and geometry of internal representations - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Representation Learning: information-theoretic and Fisher-geometry analysis of category learning shaping internal representations.
SO(3)-invariant PCA with application to molecular data - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning: SO(3)-invariant PCA that accounts for all rotations efficiently via algebraic structure, reducing covariance complexity.
Approximation Rates of Shallow Neural Networks: Barron Spaces, Activation Functions and Optimality Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning Theory: approximation rates in Barron spaces and limits of ReLU^k shallow networks.
NTKMTL: Mitigating Task Imbalance in Multi-Task Learning from Neural Tangent Kernel Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning/Training Dynamics: NTK-based spectral balancing to mitigate task imbalance in multi-task learning.
Rethinking PCA Through Duality - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Representation Learning/Theory: new DC formulations and kernelizable dual PCA linked to self-attention; optimization perspective on PCA algorithms.
Gradient Variance Reveals Failure Modes in Flow-Based Generative Models - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning/Training Dynamics: theoretical and empirical analysis of rectified flows showing gradient-variance-driven memorization and failure modes.
Mapping Post-Training Forgetting in Language Models at Scale - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Training Dynamics/Representation Retention: sample-wise metrics mapping forgetting and backward transfer across post-training stages and scales.
Atlas-based Manifold Representations for Interpretable Riemannian Machine Learning - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning: learns a differentiable atlas for latent manifolds enabling Riemannian optimization and interpretable representations.
Local properties of neural networks through the lens of layer-wise Hessians - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Representation Learning/Training Dynamics: layer-wise Hessian spectral analysis links geometry to generalization and expressivity.
Model Metamers Reveal Invariances in Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning: introduces model metamers for GNNs to probe and quantify learned invariances, with theoretical analysis of metamer manifolds.
DFNN: A Deep Fréchet Neural Network Framework for Learning Metric-Space-Valued Responses - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Model Architecture/Representation Learning: proposes deep Fréchet neural networks with a universal approximation theorem for metric-space-valued outputs.
Memorizing Long-tail Data Can Help Generalization Through Composition - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning/Theory: shows how memorizing long-tail data can aid generalization via composition, with linear theory and neural experiments.
Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: controlled synthetic testbed analyzing how pretraining diversity and contextual structure affect OOD factual generalization; identifies optimization bottlenecks.
Breaking Memorization Barriers in LLM Code Fine-Tuning via Information Bottleneck for Improved Generalization - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: information bottleneck-regularized fine-tuning to reduce memorization and improve generalization in code LLMs.
Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Representation Learning criterion: introduces a theoretically grounded similarity (PMI in RKHS) for contrastive multi-modal models like CLIP, analyzing and improving the underlying representation/metric.
Particle Dynamics for Latent-Variable Energy-Based Models - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Representation Learning: latent-variable energy-based models with Wasserstein gradient flow training and convergence guarantees.
Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Representation Learning: analyzes feature geometry/normalization for OOD and introduces radially scaled l2 normalization.
From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Representation Learning/Architecture: geometry-aware initialization for sigmoidal MLPs via tropical perspective.
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: analyzes tokenizer–grammar misalignment and layer-wise embedding effects in code LLMs.
Circuit Insights: Towards Interpretability Beyond Activations - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning – mechanistic interpretability beyond activations (WeightLens/CircuitLens) to analyze features and circuits from weights and interactions.
Predicting Task Performance with Context-aware Scaling Laws - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation/Training Dynamics: proposes context-aware scaling laws linking downstream performance to compute and context length.
Provable Unlearning with Gradient Ascent on Two-Layer ReLU Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning/Training Dynamics: theoretical analysis of gradient-ascent unlearning in linear and two-layer ReLU nets with new success criterion.
Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning – unsupervised Hebbian-style learning with structural projection and orthogonality constraints for feature learning.
Semantic representations emerge in biologically inspired ensembles of cross-supervising neural networks - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: biologically inspired cross-supervising ensembles yield decodable semantic representations with sparse inter-network connectivity.
Readability ≠ Learnability: Rethinking the Role of Simplicity in Training Small Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: analyzes training dynamics and shows statistical simplicity (n-gram diversity) predicts SLM learnability/coherence.
Learning Latent Energy-Based Models via Interacting Particle Langevin Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Representation Learning: introduces an interacting particle Langevin dynamics algorithm with convergence guarantees for learning latent energy-based models (training dynamics).
Influence Dynamics and Stagewise Data Attribution - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Representation Learning: analyzes training dynamics via stagewise data attribution grounded in singular learning theory, linking influence shifts to semantic hierarchy development.
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Matches Representation Learning: analyzes robustness of internal truthfulness representations under semantically-preserving perturbations.
Discursive Circuits: How Do Language Models Understand Discourse Relations? - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Representation Learning: circuit discovery via activation patching identifies sparse transformer subgraphs responsible for discourse relations.
Test-Time Adaptation by Causal Trimming - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Representation Learning by identifying and trimming non-causal representation components via augmentation-induced variance and PCA at test time; efficient adaptation without label supervision.
Topological Alignment of Shared Vision-Language Embedding Space - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Representation Learning: topology-aware cross-modal alignment using persistent homology with theoretical error bounds via graph sparsification.
Multitask Learning with Learned Task Relationships - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Representation Learning/Architecture: learns task relationships via a Gaussian Markov Random Field precision matrix jointly with local models; includes theoretical analysis.
Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Representation Learning/Training Dynamics: explains positional bias (“lost in the middle”) via retrieval demands and attention dynamics in LLMs.
An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Representation Learning/Training Dynamics via a principled non-Euclidean gradient descent view of optimizers, introducing robust variants (MuonMax) and momentum integration (Momo).
The Geometry of Reasoning: Flowing Logics in Representation Space - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Geometric/representation-space analysis of LLM reasoning flows — Representation Learning (training dynamics and embedding geometry).
Scaling Laws and Symmetry, Evidence from Neural Force Fields - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture and Representation Learning: empirical scaling-law analysis showing equivariant architectures and higher-order representations yield better scaling exponents.
PIXEL: Adaptive Steering Via Position-wise Injection with eXact Estimated Levels under Subspace Calibration - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Representation Learning/Model Architecture: activation steering with learned property-aligned subspaces and position-wise injection with closed-form strength selection.
ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Model Compression and Efficiency + Representation Learning: bridges KG embeddings and LLMs via residual vector quantization to create learnable code tokens, enabling structured–contextual fusion.
QuIRK: Quantum-Inspired Re-uploading KAN - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Model Architecture: introduces a new KAN variant replacing B-splines with quantum-inspired single-qubit re-uploading units, reducing parameters while retaining interpretability.
On the Representations of Entities in Auto-regressive Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Representation Learning criterion: introduces Entity Lens to reconstruct multi-token entity mentions from internal hidden states (task vectors), probing how LLMs encode entities.
Sparse components distinguish visual pathways & their alignment to neural networks - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Representation Learning: introduces sparse component decomposition and Sparse Component Alignment to probe and compare latent axes of brain and DNN representations.
Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Matches Representation Learning: proposes and analyzes in-process structure-aware encoding for LLM embeddings (including parallel caching vs sequential concatenation) with insights into how structural relations are encoded.
Deep Multimodal Subspace Clustering Networks - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Model Architecture/Representation Learning: autoencoder with a self-expressive layer for unsupervised multimodal subspace clustering, comparing early/late/intermediate fusion and shared affinity.
To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Representation Learning: analyzes and exploits ViT attention-sink tokens to improve information flow from vision encoder to LLM.
On the Relationship Between the Choice of Representation and In-Context Learning - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Representation Learning: isolates effects of representation choice vs. in-context learning capacity; optimization to enumerate label representations with systematic analysis.
Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches the High Performance Computing criterion: theoretical analysis of decentralized distributed training (multi-gossip steps) via stability-based generalization bounds, detailing effects of topology, heterogeneity, and learning rate on training.
HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Representation Learning/Training Theory: proposes similarity-weighted fine-tuning bounds and manifold denoising guarantees for domain-adapted LLMs.
Vocabulary embeddings organize linguistic structure early in language model training - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Representation Learning: empirical analysis of how input/output embeddings organize semantic/syntactic structure early in LLM training (training dynamics insights).
Angular Constraint Embedding via SpherePair Loss for Constrained Clustering - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Representation Learning: proposes a geometric angular embedding (SpherePair loss) with theoretical guarantees, decoupling representation learning from clustering.
Chem-NMF: Multi-layer α-divergence Non-Negative Matrix Factorization for Cardiorespiratory Disease Clustering, with Improved Convergence Inspired by Chemical Catalysts and Rigorous Asymptotic Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Matches Representation Learning and Low-rank methods: multi-layer α-divergence NMF with a convergence-stabilizing scheme and rigorous asymptotic analysis.
Analyzing the Effect of Embedding Norms and Singular Values to Oversmoothing in Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Training Dynamics: introduces MASED metric with bounds and a regularization scheme (G-Reg) to mitigate oversmoothing in deep GNNs.
Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Learning Theory/Optimization: data-dependent generalization bounds for Gibbs and Langevin algorithms in the overparameterized interpolation regime.
Probing the Difficulty Perception Mechanism of Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Training Dynamics: probes internal representations to linearly decode difficulty and identifies specific attention heads responsible for difficulty perception.
Revisiting Long-context Modeling from Context Denoising Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Matches Representation Learning/Training Dynamics: context denoising training that uses Integrated Gradient (IG) scores to detect context noise and improve attention in long-context LLMs.
Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Architecture: unified probabilistic sidechannel model with a new Sidechannel Independence Score and SIS regularization to control the accuracy–interpretability trade-off.
On the Theory of Continual Learning with Gradient Descent for Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Training Dynamics: theoretical bounds on forgetting for continual learning in neural networks trained by gradient descent.
Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Matches: Model Architecture and Representation Learning — a self-supervised latent dynamics architecture jointly learning recognition and motion representations.
Approximate Gaussianity Beyond Initialisation in Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning: analyzes weight distributions during training via permutation-invariant Gaussian matrix models and tracks dynamics with Wasserstein distance.
Learning to Interpret Weight Differences in Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/interpretability: trains models to describe finetuning-induced weight diffs via adapters, enabling natural-language explanations of parameter changes.
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/Training Dynamics: proposes egalitarian gradient descent to equalize learning across principal directions, offering insights into grokking dynamics.
Learning Linear Regression with Low-Rank Tasks in-Context - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Representation Learning: theoretical analysis of in-context learning with a linear attention model on low-rank task distributions, characterizing prediction distributions, implicit regularization, and phase transitions in generalization.
GRACE: Generative Representation Learning via Contrastive Policy Optimization - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Representation Learning—treats contrastive signals as rewards over generated rationales to train embedding-capable LLMs.
Internal states before wait modulate reasoning patterns - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/mechanistic interpretability: identifies latent features that modulate ‘wait’ tokens and causally links them to reasoning patterns in transformers.
Why Cannot Neural Networks Master Extrapolation? Insights from Physical Laws - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning theory: formal analysis of extrapolation limits in neural networks with implications for designing models with better out-of-domain behavior.
From Moments to Models: Graphon Mixture-Aware Mixup and Contrastive Learning - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Representation Learning: model-aware contrastive learning and mixup via graphon mixture modeling with a theoretical bound linking cut distance to motif densities.
Decomposing Attention To Find Context-Sensitive Neurons - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/Interpretability: decomposes attention to uncover context-sensitive neurons from weights using a calibration text.
Hyperparameter Loss Surfaces Are Simple Near their Optima - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Training dynamics: theory and tools for hyperparameter loss surfaces near optima, deriving asymptotic laws for random search and effective dimensionality.
On the Role of Temperature Sampling in Test-Time Scaling - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Matches Test-Time Scaling: multi-temperature sampling/voting to expand reasoning coverage without additional training, offering analysis of sampling dynamics.
Mitigating Modal Imbalance in Multimodal Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Representation Learning: analyzes and mitigates cross-modal attention imbalance with a training strategy that explicitly combines modalities to improve joint reasoning.
Multimodal Function Vectors for Spatial Relations - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Representation Learning/Model Architecture: identifies and manipulates attention-head ‘function vectors’ in a large multimodal model (LMM) to control relational reasoning.
Robust Tangent Space Estimation via Laplacian Eigenvector Gradient Orthogonalization - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Manifold/representation learning via Laplacian eigenvector gradient orthogonalization with theoretical robustness to noise.
Flatness-Aware Stochastic Gradient Langevin Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Matches Representation Learning/training dynamics: proposes fSGLD to bias toward flat minima with theoretical guarantees (invariant measure, convergence, excess-risk).
PENEX: AdaBoost-Inspired Neural Network Regularization - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning/Training Dynamics — new penalized exponential loss (PENEX) with margin maximization behavior for neural network regularization.
Learning Model Representations Using Publicly Available Model Hubs - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning: learns weight-space representations from heterogeneous public model hubs with a new backbone for unstructured model populations.
Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning — Hierarchical Optimal Transport for global, soft alignment across layers/neurons, yielding interpretable representational correspondences.
Learning Time-Series Representations by Hierarchical Uniformity-Tolerance Latent Balancing - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation learning criterion: hierarchical losses and temperature scheduling to balance uniformity–tolerance in contrastive time-series embeddings.
Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation/Training Dynamics: shows SFT metrics can mispredict RL outcomes and proposes stronger proxies (generalization loss, Pass@large k) for post-training.
Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning/Training Dynamics and Data Efficiency: proves similarity of cross-modal attention trajectories implies gradient similarity, enabling principled data selection for LVLM fine-tuning.
Quantum-inspired Benchmark for Estimating Intrinsic Dimension - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning: benchmark for intrinsic dimension estimation (IDE) on complex manifolds; foundational evaluation of IDE methods.
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning/Interpretability: gradient-based ability impact with targeted ablation to mechanistically diagnose benchmarks and decompose model competence.
Geometric Properties of Neural Multivariate Regression - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning: analyzes intrinsic dimensionality and collapse in neural regression representations, yielding insights into training dynamics and generalization.
Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning/Training dynamics: evaluates probability-based objectives beyond NLL for SFT, with theory tied to model capability.
Learning Energy-based Variational Latent Prior for VAEs - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning/Model Architecture—energy-based variational latent prior for VAEs addressing prior holes with efficient sampling via variational treatment.
Bayesian Influence Functions for Hessian-Free Data Attribution - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation Learning: introduces Bayesian influence functions to quantify training data impact via SG-MCMC-based loss landscape statistics, scaling to billion-parameter models (training dynamics/attribution).
Reconcile Certified Robustness and Accuracy for DNN-based Smoothed Majority Vote Classifier - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation/Robustness Theory: PAC-Bayesian generalization bound with certified radius for smoothed majority vote and a spectral-norm-inspired regularizer.
Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation Learning/Training dynamics: introduces Training Re-evaluation Curves (TREC) and predicts them from AdamW EMA for proactive LLM data curriculum design.
Language Model Planning from an Information Theoretic Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation Learning: probes planning by compressing hidden states (via VQ-VAE) to measure mutual information and analyze transformer computation structure.
Knowledge distillation through geometry-aware representational alignment - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Compression/Efficiency and Representation Learning: geometry-aware feature distillation using Procrustes distance and Gram matrix alignment.

Other Foundational Research (9)

Surrogate-based quantification of policy uncertainty in generative flow networks - Score: 20.0 (R=0, N=0) - Date: 2025-10-28 - Comment: Author match
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise - Score: 20.0 (R=0, N=0) - Date: 2025-10-15 - Comment: Author match
Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Training dynamics/implicit bias: theoretical analysis of per-sample Adam vs full-batch, characterizing optimizer-induced max-margin geometry.
Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Conditional/Dynamic Networks: adaptive per-token compute with pause tokens and new CYB losses for dynamic inference.
Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds - Score: 16 (R=8, N=8) - Date: 2025-10-27 - Comment: Optimization/Training Theory: finite-time guarantees for nonsmooth nonconvex stochastic optimization on Riemannian manifolds, including a zeroth-order variant.
On Biologically Plausible Learning in Continuous Time - Score: 16 (R=8, N=8) - Date: 2025-10-22 - Comment: Training dynamics: continuous-time learning that unifies SGD/FA/DFA/KP and analyzes temporal credit assignment via eligibility traces and input–error overlap.
Learning to Answer from Correct Demonstrations - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches foundational Training Objective design for learning from correct demonstrations beyond MLE, with sample complexity guarantees under a low-cardinality reward class.
Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits - Score: 16 (R=8, N=8) - Date: 2025-10-15 - Comment: Optimization/Training Dynamics: robust second-order method with gradient/Hessian clipping under heavy-tailed noise and tight sample complexity bounds.
Improved High-probability Convergence Guarantees of Decentralized SGD - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: High Performance Computing: new high-probability convergence guarantees for decentralized SGD with linear speedup under light-tailed noise.