Personalized Monthly Topic Summary 2025/10
| Metric | Value |
|---|---|
| Total Papers | 819 |
| Model Architecture | 212 |
| Model Compression and Efficiency | 281 |
| High Performance Computing | 65 |
| Representation Learning | 252 |
| Other Foundational Research | 9 |
Model Architecture (212)
-
Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin - Score: 20.0 (R=0, N=0) - Date: 2025-10-09 - Comment: Author match
-
Kimi Linear: An Expressive, Efficient Attention Architecture - Score: 19 (R=10, N=9) - Date: 2025-10-31 - Comment: Model Architecture/Efficiency: introduces Kimi Delta Attention (linear attention) and hybrid with MLA, cutting KV cache and boosting throughput while surpassing full attention.
-
Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations - Score: 19 (R=10, N=9) - Date: 2025-10-29 - Comment: Proves non-singularity of the GD map for realistic neural architectures (including attention/conv) with piecewise analytic activations—core training dynamics theory.
-
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Directly targets Model Architecture: Mixture-of-Experts (softmax-gated) with identifiability theory, finite-sample MLE rates, and consistent expert-number selection.
-
Chimera: State Space Models Beyond Sequences - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Model Architecture: extends state space models to arbitrary data topology; Efficiency: linear-time recurrence on DAGs and quadratic-time relaxation for general graphs.
-
Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Model Architecture theory: dimension-free minimax rates for learning pairwise interactions in attention-style models.
-
Transmuting prompts into weights - Score: 19 (R=10, N=9) - Date: 2025-10-13 - Comment: Model Architecture and Representation: theoretical mapping from prompts to implicit weight updates in deep Transformers; introduces token-independent thought vectors/matrices enabling principled weight-level steering.
-
Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Matches Model Architecture (MoE) and Representation Learning: provable joint training dynamics for soft-routed MoE; also includes post-training pruning with convergence guarantees (Model Compression/Efficiency).
-
Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Provides a rigorous random-matrix-theoretic analysis of self-attention spectra, advancing theoretical understanding of Transformer architecture and representation dynamics.
-
The Effect of Attention Head Count on Transformer Approximation - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Model Architecture theory: establishes upper and lower bounds on transformer approximation as a function of attention head count, including a first rigorous lower bound in a nonlinear practical setting.
-
SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: Model Architecture and Efficiency: introduces a hybrid AR–diffusion decoding paradigm enabling blockwise parallel generation and reports scaling across dense and MoE models.
-
Critical attention scaling in long-context transformers - Score: 19 (R=10, N=9) - Date: 2025-10-08 - Comment: Strong match to Model Architecture and Representation Learning: rigorous theory of attention scaling in long-context Transformers, identifying critical β_n ≍ log n to prevent rank-collapse.
-
Replacing Softmax Similarity with a Sharpened Angular Similarity: Theory and Practice of Scaling To Billion-Context Attention - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Matches Model Compression/Efficiency and HPC: replaces Softmax with linear-time RACE attention via sharpened angular similarity, randomized projections, and soft LSH; enables million-token contexts with reduced memory/runtime.
-
Implicit Models: Expressive Power Scales with Test-Time Compute - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Model Architecture and Efficiency: theory for implicit (infinite-depth, weight-tied) models showing expressive power scales with test-time iterations and constant-memory training.
-
Pretraining with hierarchical memories: separating long-tail and common knowledge - Score: 19 (R=10, N=9) - Date: 2025-10-06 - Comment: Strongly matches Model Architecture and Efficiency: memory-augmented transformers with hierarchical parametric memory banks and context-dependent fetch, aligned with hardware for scalable pretraining/inference.
-
Support Basis: Fast Attention Beyond Bounded Entries - Score: 19 (R=10, N=9) - Date: 2025-10-03 - Comment: Efficient attention approximation with sub-quadratic runtime beyond bounded entries; rigorous guarantees and justification of polynomial attention.
-
Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space - Score: 19 (R=10, N=9) - Date: 2025-10-02 - Comment: Model Architecture: introduces adaptive parallel computation in transformers by forking/deleting residual streams learned during pretraining (dynamic networks).
-
ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference - Score: 18 (R=10, N=8) - Date: 2025-10-31 - Comment: MoE Efficiency/HPC: adaptive expert prefetching and cache-aware routing for memory-constrained MoE inference with runtime-driven scheduling.
-
Normalization in Attention Dynamics - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Model Architecture/Training dynamics: unified analysis of normalization schemes in transformers via interacting-particle dynamics; identifies Peri-LN as an effective scheme.
-
Adaptive Graph Mixture of Residual Experts: Unsupervised Learning on Diverse Graphs with Heterogeneous Specialization - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Model Architecture: graph Mixture-of-Experts with structurally-aware gating and unsupervised specialization objective.
-
Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: Model Architecture: Mixture-of-Experts with a dynamic router to split thinking vs non-thinking branches for multimodal reasoning—directly matches MoE criterion.
-
HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission - Score: 18 (R=10, N=8) - Date: 2025-10-23 - Comment: Matches HPC and MoE scaling: HybridEP introduces modeling-guided hybrid expert/data transmission and topology/domain partitioning to scale Expert Parallelism across datacenters under bandwidth constraints.
-
Transformers are Inherently Succinct - Score: 18 (R=10, N=8) - Date: 2025-10-23 - Comment: Model Architecture Theory: proves that transformers are far more succinct than automata/LTL and that verifying them is EXPSPACE-complete.
-
L-MoE: End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: Model Architecture: unifies MoE with LoRA adapters (L-MoE) and differentiable gating for end-to-end training and dynamic composition.
-
Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Model Architecture (MoE): probabilistic input-domain-aware routing decoupled from task optimization for expert specialization and balanced utilization.
-
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Model Architecture + Sparsity: proposes a sparse transformer grounded in regularized Wasserstein proximal operator with L1 prior; theoretical and architectural innovation.
-
Expert Merging in Sparse Mixture of Experts with Nash Bargaining - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Model Architecture (MoE): principled expert merging for sparse MoE via Nash bargaining with convergence guarantees; improves merging over ad-hoc averaging.
-
First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Architecture + HPC: redesigns Transformer wiring to remove per-block MHA–MLP communication, eliminating TP all-reduce and enabling parallel MHA/MLP execution.
-
MergeMoE: Efficient Compression of MoE Models via Expert Output Merging - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Compression and Efficiency (MoE): theoretical framing and optimized expert output merging for compressing MoE models.
-
Dr.LLM: Dynamic Layer Routing in LLMs - Score: 18 (R=10, N=8) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: adaptive-depth dynamic layer routing (skip/execute/repeat) with supervised routers for budget-aware inference.
-
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Model Architecture (attention mechanism): systematic analysis and relaxation of attention design principles in Transformers.
-
Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Matches Model Architecture: proposes Translution unifying self-attention and convolution with a lightweight alpha-Translution variant for adaptive relative modeling.
-
Stability of Transformers under Layer Normalization - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Direct match to architectural analysis/training stability: principled theory on Transformer stability under different LayerNorm placements and residual scaling.
-
Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: Model Architecture: proposes Value-State Gated Attention for Transformers to mitigate attention sinks/value-state drains with theoretical grounding, improving stability and quantization fidelity.
-
Localist LLMs -- A Mathematical Framework for Dynamic Locality Control - Score: 18 (R=10, N=8) - Date: 2025-10-13 - Comment: Matches Model Architecture and Sparsity: introduces a tunable locality dial via group sparsity on attention with theoretical guarantees, enabling dynamic control between localist and distributed representations.
-
From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: High Performance Computing and Efficiency: introduces layered prefill scheduling that reduces MoE expert weight reloads, lowering memory bandwidth and latency for stall-free serving.
-
Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: Matches Model Architecture criterion (Mixture-of-Experts): orthogonal growth (depth/width) and checkpoint recycling for efficient pretraining.
-
MeSH: Memory-as-State-Highways for Recursive Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: Model Architecture: Memory-as-State-Highways adds explicit memory and lightweight routers to diversify computation across recursive iterations, strengthening recursive transformers.
-
From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics - Score: 18 (R=10, N=8) - Date: 2025-10-09 - Comment: Representation Learning/Training Dynamics: theoretical two-stage analysis of Transformer attention training (condensation then rank collapse) under gradient flow.
-
Exact Causal Attention with 10% Fewer Operations - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Compression/Efficiency/HPC: exact causal attention with ~10% fewer operations via new masked matmul identities and GPU-optimized kernels.
-
On Structured State-Space Duality - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture: formalizes and generalizes the SSM–masked-attention duality, providing necessary/sufficient conditions and training complexity bounds; expands efficient Transformer/SSM design space.
-
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Matches Low-Precision Training and Efficiency: mechanistic analysis of flash attention failures under low precision and a minimal modification to mitigate biased rounding errors.
-
A Mathematical Explanation of Transformers for Large Language Models and GPTs - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Architecture: provides a continuous operator-theoretic formulation of Transformers (self-attention as integral operator, layer norm as projection), deepening theoretical foundations.
-
Allocation of Parameters in Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Strongly matches Model Architecture/Efficiency: theoretical allocation of attention heads and dimensions across Transformer layers with saturation analysis.
-
MemMamba: Rethinking Memory Patterns in State Space Model - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Strongly matches Model Architecture: theoretical analysis of Mamba’s memory decay and a new MemMamba architecture adding state summarization and cross-layer/token attention with linear complexity.
-
Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Architecture and Efficiency: structured cross-layer weight sharing via matrix dictionary learning for attention projections, yielding 66.7% parameter reduction with strong performance.
-
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Model Architecture: MoE with dynamic expert clustering and hierarchical routing; Compression/Efficiency: shared base + ultra low-rank residual adapters, mixed precision, reduced communication.
-
CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Model Architecture: conditional routing across geometry-specific attention branches (mixture-of-geometry/MoE-like) enabling curvature-adaptive Transformers.
-
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Model Architecture: proposes a new attention mechanism (Local Linear Attention) as an alternative to Softmax/linear attention in Transformers; High-Performance Computing/Efficiency: introduces memory-efficient primitives and a hardware-efficient blockwise algorithm (FlashLLA) with custom kernels to reduce O(n^2 d) and O(n d^2) costs.
-
Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: MoE: router regularization via Dirichlet-prior shaping to improve expert balance and specialization in upcycled sparse MoEs.
-
Cutting the Skip: Training Residual-Free Transformers - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: Model Architecture/Training Dynamics: enables stable training of residual-free transformers via principled initialization based on Jacobian conditioning analysis.
-
LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Model Architecture (MoE + PEFT): proposes learnable dynamic routing for Mixture of LoRA Experts with differentiable selection and analytical sparsity control.
-
On the Structure of Stationary Solutions to McKean-Vlasov Equations with Applications to Noisy Transformers - Score: 18 (R=9, N=9) - Date: 2025-10-24 - Comment: Representation Learning/Training Dynamics: mean-field analysis of Noisy Transformers via stationary McKean–Vlasov solutions, bifurcations, and phase transitions.
-
Who Said Neural Networks Aren't Linear? - Score: 18 (R=9, N=9) - Date: 2025-10-10 - Comment: Matches Model Architecture: introduces a Linearizer architecture (invertible NNs around a linear map) enabling linear-algebraic analysis and composition properties for nonlinear networks.
-
On residual network depth - Score: 18 (R=9, N=9) - Date: 2025-10-07 - Comment: Model Architecture: Residual Expansion Theorem giving first-principles analysis of depth in residual networks and principled scaling to control combinatorial path growth.
-
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time - Score: 18 (R=9, N=9) - Date: 2025-10-02 - Comment: Model Architecture/Representation Learning: theoretical scaling laws for deep linear self-attention (depth vs width vs context) and training dynamics.
-
Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training - Score: 17 (R=10, N=7) - Date: 2025-10-31 - Comment: Model Architecture: Mixture-of-Experts with router-gating and shared experts; Efficiency: sparse activation controls inference cost.
-
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation - Score: 17 (R=10, N=7) - Date: 2025-10-30 - Comment: Model Architecture: sparse Mixture-of-Experts (MoE) unified multimodal model with only 6.1B active parameters per token.
-
Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning - Score: 17 (R=10, N=7) - Date: 2025-10-17 - Comment: Model Architecture (MoE): action-specialized MoE for VLA with decoupled expert selection/weighting enabling collaborative expert usage.
-
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting - Score: 17 (R=10, N=7) - Date: 2025-10-08 - Comment: High Performance Computing: data-movement-centric profiling and forecasting for large-scale MoE serving; informs system design (e.g., wafer-scale GPUs).
-
Multilingual Routing in Mixture-of-Experts - Score: 17 (R=10, N=7) - Date: 2025-10-07 - Comment: Mixture-of-Experts: analysis of multilingual routing dynamics with inference-time router steering to enhance cross-lingual expert utilization.
-
From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing - Score: 17 (R=10, N=7) - Date: 2025-10-07 - Comment: Model Architecture + HPC/Efficiency: MoE inference-time routing that adapts to gate score distributions to balance expert load and reduce latency without retraining.
-
Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning - Score: 17 (R=10, N=7) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: MoE with adaptive shared experts and LoRA-based fine-grained low-rank experts for multi-task learning.
-
How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Strongly matches architecture/representation learning criteria with theoretical analysis of ICL in Transformers including nonlinear MLP heads and multi-source data mixing.
-
Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Strongly matches architecture/theory criteria by proving multi-head Transformers learn DAG structure via a kernel-guided mutual information objective.
-
The Neural Differential Manifold: An Architecture with Explicit Geometric Structure - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Model Architecture: proposes a neural architecture as a differentiable manifold with learned Riemannian metric and geometry-regularized optimization (natural-gradient aligned).
-
The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Architecture theory: tighter upper/lower bounds on parameter complexity for robust memorization in ReLU nets across the robustness ratio range.
-
Nested AutoRegressive Models - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Model Architecture + Efficiency: nested autoregressive multi-scale design reduces generation complexity from O(n) to O(log n).
-
Triangle Multiplication Is All You Need For Biomolecular Structure Representations - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Matches Model Architecture and Efficiency: replaces triangle attention with a streamlined module (Pairmixer) preserving higher-order reasoning while reducing compute/memory.
-
Transformers are almost optimal metalearners for linear classification - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Representation Learning/Architecture Theory: theoretical proof that (simplified) transformers are near-optimal metalearners for linear classification.
-
When Do Transformers Learn Heuristics for Graph Connectivity? - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Matches Model Architecture and Representation Learning: theoretical and empirical analysis of when Transformers learn correct algorithms vs heuristics on graph connectivity, tied to depth/diameter capacity and training dynamics.
-
Fast Inference via Hierarchical Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: High-Performance Inference: hierarchical speculative decoding with latency-optimal hierarchy selection.
-
MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Model Architecture and Systems Efficiency: MoE expert partitioning into fine-grained sub-experts plus QoS-aware scheduling for elastic inference.
-
Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Model Architecture/Optimization Theory: closed-form optimum and NP-hardness for one-layer LSA on Markovian functions; multilayer LSA interpreted as preconditioned GD.
-
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: HPC/Architecture: conditional scaling laws incorporating hidden size, MLP/attention parameter split, and GQA to optimize inference efficiency.
-
Localist LLMs with Recruitment Learning - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Architecture/Sparsity: introduces a tunable locality dial and information-theoretic recruitment with group sparsity on attention for adaptive interpretable-to-distributed encodings.
-
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Architecture/Sparsity: analyzes and improves hierarchical sparse attention for extreme length generalization with key design principles and theory for chunk encoding/residual bypass.
-
Fighter: Unveiling the Graph Convolutional Nature of Transformers in Time Series Modeling - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Architecture and Analysis: proves equivalence between Transformer attention and GCNs in time series, and introduces a streamlined graph-convolutional Transformer (Fighter).
-
Infinite Neural Operators: Gaussian processes on functions - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Model Architecture Theory: establishes GP limits for neural operators (incl. FNO), enabling kernel-based operator learning with computed covariances/posteriors.
-
On Universality of Deep Equivariant Networks - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Model Architecture: universality results for invariant/equivariant networks highlighting depth/readout as mechanisms.
-
ParaFormer: Shallow Parallel Transformers with Progressive Approximation - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Strongly matches Model Architecture and Efficiency/HPC: shallow parallel Transformer with progressive approximation enabling compression and multi-GPU speedups.
-
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Training dynamics/scaling laws: proposes a weight-decay scaling rule for AdamW extending μP beyond the near-init regime for width-robust hyperparameter transfer in Transformers.
-
Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: HPC/Inference Efficiency + Architecture: diffusion-forcing parallel sampler for recurrent-depth transformers enabling faster generation.
-
Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Training Objective/Architecture: auxiliary future summary prediction head to capture long-horizon dependencies beyond MTP.
-
Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: High Performance Computing: compiler/IR (asynchronous references) automating warp specialization for GPU kernels incl. LLM attention.
-
Context-Selective State Space Models: Feedback is All You Need - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Model Architecture: novel time-varying SSM with state-feedback selectivity (COFFEE) offering efficient long-range dependency modeling.
-
Axial Neural Networks for Dimension-Free Foundation Models - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Architecture: introduces a dimension-agnostic Axial Neural Network enabling foundation models to generalize across tensor dimensionalities efficiently.
-
HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: adaptive hierarchical routing over LoRA pools at rank-one component granularity with token-level activation; training-free selection with theoretical guarantees.
-
Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Model Architecture: replaces softmax attention with a Credal Attention Mechanism yielding credal sets for uncertainty-aware Transformers; integrates uncertainty directly into the attention mechanism.
-
Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Representation Learning: theoretical analysis of in-context learning dynamics in transformers with softmax attention (kernel gradient descent, context-adaptive rates).
-
What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably) - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Strongly matches Model Architecture by analyzing looped-attention Transformers vs single-pass Transformers via loss-landscape theory and proposing a staged training framework (SHIFT), touching training dynamics as well.
-
Decomposer Networks: Deep Component Analysis and Synthesis - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches Model Architecture and Representation Learning: semantic autoencoder with Gauss–Seidel-style unrolled competition among components for interpretable factorization.
-
Hierarchical LoRA MoE for Efficient CTR Model Scaling - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Model Architecture and Efficiency: hierarchical MoE with LoRA rank-1 experts and hierarchical routing enabling parallel layer execution; improved FLOPs/parameter efficiency.
-
Design Principles for Sequence Models via Coefficient Dynamics - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Matches Model Architecture: unified framework via coefficient dynamics that connects Transformers, SSMs, and RNNs, yielding design principles and stability/efficiency trade-offs.
-
The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: High Performance Computing/Efficiency criterion: evaluates full and layerwise Gauss-Newton preconditioning for transformer training, showing large iteration reductions and insights on Hessian structure.
-
Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Architecture/Training Dynamics: rigorous activation-function taxonomy with Lyapunov stability and kernel Hessian bounds guiding network design.
-
Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Model Architecture and Efficiency: introduces recursive iteration over selected reasoning-relevant layers and adaptive depth for test-time compute scaling without increasing parameters.
-
Grouped Differential Attention - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Model Architecture: introduces grouped differential attention with ratio-aware head allocation and selective expansion for more compute-efficient Transformers.
-
Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Proposes a new Transformer variant with Relational Attention over rows/columns/PK–FK links, a clear architecture innovation for relational data and representation learning.
-
On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Foundational generation paradigm analysis: formal study beyond autoregression/diffusion with rewrite/edit capabilities and associated learnability/hardness results.
-
Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Strong match to Model Architecture and Efficiency: analyzes hybrid linear-attention conversions and proposes methods (e.g., SSD, HedgeCATs) to ensure genuine linear attention usage post-conversion.
-
Fundamental Limits of Crystalline Equivariant Graph Neural Networks: A Circuit Complexity Perspective - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Model Architecture Theory: circuit-complexity characterization (TC^0) of crystalline equivariant GNNs, clarifying expressive/computational limits under symmetry constraints.
-
Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture and Efficiency: reveals implicit Mixture-of-Experts–like specialization in diffusion LLMs and proposes a training-free test-time ensembling method (HEX) across generation schedules.
-
Expand Neurons, Not Parameters - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Model Architecture and Efficiency: Fixed Parameter Expansion widens networks at constant non-zero parameters to reduce polysemanticity and improve accuracy.
-
Arithmetic-Mean $\mu$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Training Dynamics/Initialization and scaling laws: introduces AM-μP with provable learning-rate depth scaling (L^{-3/2}) for CNNs/ResNets enabling zero-shot LR transfer.
-
Platonic Transformers: A Solid Choice For Equivariance - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Model Architecture: introduces an equivariant Transformer via Platonic group–based attention and weight sharing; formally equivalent to dynamic group convolution and includes a linear-time convolutional variant (Efficiency).
-
Paris: A Decentralized Trained Open-Weight Diffusion Model - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture (MoE) and High Performance/Systems: decentralized training of independent experts with a router, eliminating synchronization.
-
PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Model Architecture: PDE-based continuous dynamical system analysis of Transformer components (attention, FFN, residuals, layer norm) as stabilizers.
-
Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture: theoretical modeling of densely connected networks (DenseNet-style) via nonlinear integral equations with convergence (Γ-convergence) results for training.
-
Rethinking the shape convention of an MLP - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture: rethinks MLP shape/skip placement with hourglass blocks and fixed random expansion; provides scaling insights applicable to residual networks/Transformers.
-
Flock: A Knowledge Graph Foundation Model via Learning on Random Walks - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture: introduces probabilistic node–relation equivariance and random-walk sequence modeling with universality guarantees for KG link functions.
-
Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Architecture: theoretical analysis of SSM learning dynamics and an initialization/weight-freezing optimization strategy.
-
Composer: A Search Framework for Hybrid Neural Architecture Design - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: framework for searching hybrid Attention/MLP architectures with scalable extrapolation strategies for LLMs.
-
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches Model Architecture: introduces a new attention-based state-space LLM with locally interacting neurons, sparse positive activations, and built-in interpretability.
-
Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning / Training Dynamics: circuit-level analysis showing emergent, specialized attention heads from post-training in reasoning models.
-
Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches Model Architecture/Efficiency: training-light expert model merging with unsupervised hidden/logit alignment and importance-guided layer chunking to replace multi-model serving.
-
Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Architecture: injects verifiable knowledge directly into pre-softmax attention scores (Transformer attention modification) to control generation and prevent hallucination.
-
A Formal Comparison Between Chain-of-Thought and Latent Thought - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Architecture/Training Dynamics: formal analysis contrasting looped latent-thought Transformers vs CoT, clarifying computational capabilities.
-
AMLA: MUL by ADD in FlashAttention Rescaling - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: High Performance Computing: novel FlashAttention-based kernel replacing MUL with integer ADD for rescaling plus preload pipeline/tiling to maximize FLOPS on NPUs.
-
Enhancing Linear Attention with Residual Learning - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Architecture/Efficiency: introduces Residual Linear Attention and Residual Delta Net to boost expressivity while retaining linear-time attention.
-
Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers - Score: 16 (R=9, N=7) - Date: 2025-10-30 - Comment: Strongly matches representation learning criterion via mechanistic interpretability of attention-only transformers and emergence of minimal circuits for IOI.
-
Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Extends Transformer long-context capacity by logarithmically compressing input tokens without altering architecture (Compression/Efficiency for context).
-
Head Pursuit: Probing Attention Specialization in Multimodal Transformers - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Interpretability: probes attention head specialization and enables controllable editing of concepts in uni/multimodal transformers.
-
Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Model Architecture: Mixture-of-Specialized-Experts (MoE) with LoRA experts and adaptive gating under a state-decomposable MDP.
-
Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction - Score: 16 (R=9, N=7) - Date: 2025-10-24 - Comment: Model architecture/efficiency: hybrid sparse attention with learnable token eviction retains critical KV pairs, preserving linear-time/space.
-
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Matches Model Architecture and Efficiency: proposes a hybrid linear+softmax attention architecture for long-context with FP8 operator support, reducing compute/I-O while maintaining reasoning performance.
-
Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: cross-architecture distillation from Transformers to SSMs via an attention bridge with token-level supervision and layer-wise alignment.
-
Accelerating Vision Transformers with Adaptive Patch Sizes - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Model Architecture/Efficiency: adaptive patch sizes to reduce ViT token count and accelerate inference/training.
-
ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Model Architecture and Efficiency: hybrid Decoder-MLP architecture with paired weight sharing; reduces KV cache and latency.
-
ICL-Router: In-Context Learned Model Representations for LLM Routing - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture (Dynamic Routing/MoE-style): learns in-context model representations to route queries across LLMs, enabling scalable routing without retraining.
-
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Model Architecture (MoE) and training stability: aligns training and inference routers via rollout routing replay to stabilize MoE RL, addressing core MoE routing behavior.
-
DND: Boosting Large Language Models with Dynamic Nested Depth - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Matches Model Architecture: introduces conditional/dynamic computation in transformers via token-level nested depth with a learned router, improving efficiency-control without full re-architecture.
-
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Matches Model Architecture: Mixture-of-Experts with neural gating for decomposing dynamics into sparse experts; conditional/dynamic modeling across regimes.
-
Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Strong match to Model Architecture: extends DPO with mixture models and MoE architectures using variational inference and ELBO optimization for expert specialization.
-
MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Model Architecture (MoE): probabilistic experts with uncertainty-based gating replacing input-based routers for regression/forecasting.
-
Native Hybrid Attention for Efficient Sequence Modeling - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Matches Model Architecture and Efficiency: proposes a hybrid linear+softmax attention layer with sliding-window control for long-context sequence modeling, reducing quadratic attention cost.
-
A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Matches Representation Learning: token- and layer-level analysis of representation propagation and oversmoothing in SSMs vs Transformers, revealing inductive biases and training dynamics.
-
A General Constructive Upper Bound on Shallow Neural Nets Complexity - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Model Architecture theory: provides a constructive upper bound on neurons needed in shallow networks to approximate continuous functions on compact sets.
-
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing - Score: 16 (R=9, N=7) - Date: 2025-10-08 - Comment: Model architecture: dynamic expert routing (MoE-style) with patchwise routing and curriculum top-K annealing; parameter-efficient fine-tuning of expert library.
-
Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Strongly matches Model Architecture: theoretical expressivity bounds and analysis of pooling mechanisms in Transformers, offering principled architectural guidance.
-
HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Strongly matches High-Performance Computing: heterogeneous CiD + on-chip analog CiM with phase-aware mapping and 2.5D integration targeted at low-batch, long-context LLM inference.
-
Transformers Discover Molecular Structure Without Graph Priors - Score: 16 (R=9, N=7) - Date: 2025-10-03 - Comment: Model Architecture / Representation Learning: shows pure Transformers (no graph priors) learn distance-aware structure for molecular modeling, with scaling and attention analysis.
-
Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Architecture/Theory: first theoretical analysis of one-layer Mamba’s ICL generalization with outliers, contrasting linear attention vs. nonlinear gating.
-
Indirect Attention: Turning Context Misalignment into a Feature - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture: introduces a modified attention mechanism (Indirect Attention) with analysis under key–value misalignment/noise, directly innovating the Transformer attention core.
-
Scaling Equilibrium Propagation to Deeper Neural Network Architectures - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture and Training Algorithm: introduces residual connections in Hopfield networks to scale equilibrium propagation to deeper networks.
-
Guiding Mixture-of-Experts with Temporal Multimodal Interactions - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture (MoE): introduces interaction-aware routing leveraging temporal multimodal dynamics to guide expert specialization.
-
Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism - Score: 16 (R=8, N=8) - Date: 2025-10-31 - Comment: Model Architecture: specialized memory mechanism with task-aware trigger/updater for linear-time SGM inference and dynamic adaptation.
-
A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory - Score: 16 (R=8, N=8) - Date: 2025-10-30 - Comment: Matches model architecture and efficiency theory criteria with new multi-operator neural operator architectures (MNO/MONet) and explicit approximation/scaling laws.
-
An efficient probabilistic hardware architecture for diffusion-like models - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: High Performance Computing/Efficiency: proposes an all-transistor probabilistic architecture implementing denoising models with orders-of-magnitude energy reduction.
-
A data free neural operator enabling fast inference of 2D and 3D Navier Stokes equations - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: Model Architecture/Efficiency: physics-grounded, data-free neural operator for Navier–Stokes enabling fast, robust inference (including 3D) without paired solution data.
-
Fisher meets Feynman: score-based variational inference with a product of experts - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Representation Learning/Inference: tractable product-of-experts variational family with Fisher-divergence optimization and Feynman/Dirichlet auxiliary variables.
-
Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds - Score: 16 (R=8, N=8) - Date: 2025-10-27 - Comment: Matches Model Architecture and Efficiency: few-step generative modeling generalized to Riemannian manifolds (self-distillation-based GFMs), reducing inference steps.
-
Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Matches Model Architecture/Regularization: introduces matricial free energy loss from free probability to Gaussianize autoencoder codes.
-
Asymptotically Stable Quaternion-valued Hopfield-structured Neural Network with Periodic Projection-based Supervised Learning Rules - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Model Architecture: quaternion-valued Hopfield-type network with projection-based learning and stability guarantees.
-
WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Model Architecture and Efficiency: multiplication-free probabilistic LUT networks with Walsh-assisted relaxation for fewer parameters and faster convergence.
-
Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches Model Architecture via a hybrid discrete diffusion planner with an autoregressive executor, including latent-space interfacing to reduce tokens and improve reasoning.
-
Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: High-Performance/Optimization: noise-adaptive layerwise learning rates atop geometry-aware optimizers to accelerate training, with convergence analysis and transformer experiments.
-
Y-shaped Generative Flows - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Model Architecture: introduces Y-shaped generative flows with a new velocity-powered objective in neural ODEs to encourage shared transport pathways—an architectural/optimization innovation in continuous-time generative models.
-
Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance - Score: 16 (R=8, N=8) - Date: 2025-10-15 - Comment: Model Architecture/Theory: constructs constant-depth ReLU generative networks (O(n^3)) to exactly enumerate tree-structured outputs by edit distance.
-
Heptapod: Language Modeling on Visual Signals - Score: 16 (R=8, N=8) - Date: 2025-10-09 - Comment: Model Architecture: introduces a causal Transformer with a novel “next 2D distribution prediction” objective and a reconstruction-focused visual tokenizer, unifying autoregressive modeling with masked autoencoding.
-
ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics - Score: 16 (R=8, N=8) - Date: 2025-10-08 - Comment: Model Architecture: introduces a transformer neural operator with quasi-equivariance and temporal attention enabling parallel multi-step decoding and cross-molecule operator pretraining.
-
Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis - Score: 16 (R=8, N=8) - Date: 2025-10-08 - Comment: Model Architecture Analysis: information-theoretic bounds on attention mechanisms (causal/bidirectional/sparse/kernelized/cross-attention) for rule encoding/compliance.
-
Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Matches Representation Learning/Model Architecture: amortized activation steering by training a minimal transformer submodule; effective for both dense and MoE models with strong compute efficiency.
-
Learning Inter-Atomic Potentials without Explicit Equivariance - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Model Architecture/Representation Learning: learns SO(3) equivariance in a non-equivariant Transformer for inter-atomic potentials, avoiding hard-wired symmetry constraints.
-
PDE Solvers Should Be Local: Fast, Stable Rollouts with Learned Local Stencils - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Matches Model Architecture: a finite-difference-inspired local operator network (learned stencils, explicit time stepping) with theoretical error/approximation guarantees and improved efficiency via strict locality.
-
Defeating the Training-Inference Mismatch via FP16 - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: HPC/Training optimization: shows FP16 precision mitigates training–inference mismatch in RL fine-tuning, improving stability and convergence.
-
The End of Manual Decoding: Towards Truly End-to-End Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Model Architecture: augments transformers with lightweight heads that learn token-level temperature and top‑p, enabling end-to-end, dynamic decoding control.
-
Lipschitz-aware Linearity Grafting for Certified Robustness - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Model Architecture/Robustness: theoretical analysis and method for grafting linearity to tighten local Lipschitz bounds and improve certified robustness.
-
A Physics-informed Multi-resolution Neural Operator - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Model Architecture/Efficiency: extends RINO to a physics-informed, data-free operator with multi-resolution inputs and PDE-enforced training.
-
Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Architecture/Efficiency: spatially aware linear transformer variant that maintains linear attention and reduces complexity.
-
LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Architecture/Efficiency: topology-agnostic EEG foundation model using latent cross-attention to decouple compute from channel count (linear scaling).
-
Relieving the Over-Aggregating Effect in Graph Transformers - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Model Architecture: Wideformer modifies graph attention to mitigate over-aggregating via parallel partitioned aggregation and guided weighting.
-
PINN Balls: Scaling Second-Order Methods for PINNs with Domain Decomposition and Adaptive Sampling - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Model Architecture (MoE) + Efficiency: local Mixture‑of‑Experts with learnable domain decomposition to scale second‑order training for PINNs.
-
Diffusion Autoencoders with Perceivers for Long, Irregular and Multimodal Astronomical Sequences - Score: 15 (R=8, N=7) - Date: 2025-10-24 - Comment: Model Architecture and Representation Learning: diffusion autoencoder with Perceiver encoder/decoder for long, irregular, multimodal sequences.
-
Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Architecture/Efficiency: transformer with learned summarization tokens for memory creation/retrieval enabling long-horizon efficiency.
-
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: High Performance Computing: hybrid rollout–training architecture leveraging preemptible GPUs with adaptive offload and token-level migration for RL on LLMs.
-
Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Model Compression/Efficiency and Architecture: ensembles via pruned attention heads merged into a compact grouped-MHA, yielding near single-model inference cost with UQ gains.
-
LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Model Architecture and Efficiency: decoupled link embeddings enabling precomputed attention weights and a linear attention mechanism (LIME-XOR) for O(N) inference-time scaling.
-
ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Architecture and Efficiency: a compact ViT variant removing positional embeddings and [CLS] token for permutation invariance and parameter efficiency.
-
NeurIPT: Foundation Model for Neural Interfaces - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Architecture: introduces a Progressive Mixture-of-Experts (PMoE) Transformer and amplitude-aware masked pretraining for EEG foundation modeling.
-
Protein Folding with Neural Ordinary Differential Equations - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Model Architecture and Efficiency: continuous-depth Evoformer via Neural ODEs with adjoint memory savings and adaptive solver trade-offs.
-
Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Model Architecture: embeds group-equivariant (rotation/scale) convolutions to improve adversarial robustness with theoretical gradient regularization and certified bounds.
-
Early-stopping for Transformer model training - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Training Dynamics/Representation: RMT-based spectral criteria for transformer early stopping; heavy-tailed dynamics monitoring without validation.
-
AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Model Architecture/Training with Semi-Discrete Optimal Transport to align noise and data in flow-based models, improving trajectory straightness and efficiency.
-
Purifying Task Vectors in Knowledge-Aware Subspace for Model Merging - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Model Architecture/Merging: knowledge-aware subspace (context SVD) to purify/prune task vectors and mitigate redundancy in model merging.
-
DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Model Architecture: DARTS-driven heterogeneous Graph Transformer design with quantifiable interpretability via causal ablation.
-
A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Model Architecture/Training: safety-sensitive subspace freezing and harmful-resistant null-space projection to preserve alignment during LoRA fine-tuning.
-
Deep Attention-guided Adaptive Subsampling - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Conditional/Dynamic Networks and Efficiency: input-adaptive attention-guided subsampling module learned end-to-end to reduce compute while maintaining performance—fits dynamic computation and efficiency criteria.
-
Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture/structure-aware optimization by applying targeted, layer-group–specific DPO (with LoRA) leveraging functional specialization of Transformer layers.
-
GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Model Architecture/Efficiency: enhances linear graph attention by increasing rank via a gated local branch and sharpening focus with a learnable entropy-reducing log-power function while preserving linear complexity.
-
Multi-View Graph Learning with Graph-Tuple - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Introduces a multi-view graph-tuple message-passing architecture with provable expressivity gains (model architecture).
-
Why Do Transformers Fail to Forecast Time Series In-Context? - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Model Architecture/Representation Learning: rigorous ICL analysis of transformer (Linear Self-Attention) limits and CoT collapse for forecasting; foundational insights into training dynamics.
-
DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Representation Learning/Architecture Analysis: token-to-head contribution tracing reveals bias heads; inference-time selective masking of attention heads.
-
Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Representation Learning/Diagnostics: inverse-free curvature mapping and activation commutators provide practical probes of invariance and order sensitivity in Transformers.
-
gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Model Architecture: novel GNN (gLSTM) inspired by associative memories/xLSTM to mitigate over-squashing by increasing storage capacity; addresses core architectural limitations.
-
Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: High Performance Computing/Efficiency: vectorized FlashAttention on RISC‑V with low-cost exponential approximation and tiling to improve memory locality and throughput.
-
BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Introduces a frame-level autoregressive Transformer with space–time factorization and batched tokenization, improving architectural efficiency (notably faster inference).
-
Latent Speech-Text Transformer - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Model Architecture and Efficiency: dynamic aggregation of speech tokens into latent patches to reduce sequence length and improve modality alignment.
-
Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Architecture: Directional Sheaf Hypergraph Networks with a directed sheaf Laplacian for learning on directed/undirected hypergraphs.
-
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Model Architecture: an LLM-free, tuning-free graph foundational model enabling in-context learning via a novel token-based framework across node/edge/graph tasks.
-
Activation Steering with a Feedback Controller - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Model Architecture/Activation Steering foundations: frames activation steering as PID control with theoretical stability and a principled closed-loop mechanism.
-
Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Architecture: orthogonality-preserving adapter (LoRA) merging via Orthogonal Monte Carlo Dropout with analysis on compositionality/semantic interference.
-
Why Do We Need Warm-up? A Theoretical Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Training dynamics: theoretical justification for learning-rate warm-up under generalized smoothness with convergence complexity bounds.
-
xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Model Architecture and Efficiency: analysis of xLSTM scaling laws with linear-time complexity vs Transformers; insights on training/inference scaling with context length.
-
Equivariant Geometric Scattering Networks via Vector Diffusion Wavelets - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: SE(3)-equivariant geometric scattering transform integrated into GNNs, achieving comparable performance with fewer parameters.
-
GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture and Efficiency: new MLP-replacement block that decouples structural vs quantitative knowledge to speed training while retaining expressivity.
-
Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture: analyzes Differential Attention’s robustness and training dynamics, revealing structural trade-offs in attention design.
-
Continual Learning with Query-Only Attention - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture/Training Dynamics: query-only attention variant with analysis of plasticity and catastrophic forgetting.
-
Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning: reverse-engineers transformer mechanisms for long-range dependencies (attention DAG caching, Minkowski-sum digit geometry) and training dynamics with an auxiliary inductive-bias loss.
-
Large Language Models Inference Engines based on Spiking Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture/Efficiency: spike-based self-attention and SNN conversion/fine-tuning for transformer inference, targeting energy-efficient deployment.
-
BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture: introduces Monte Carlo Attention and Binary Patch Encoding as architectural/tokenization innovations in a unified autoregressive scientific model.
-
MAESTRO : Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Model Architecture and Efficiency: sparse cross-modal attention with sparse Mixture-of-Experts routing and adaptive attention budgeting for long multimodal sequences.
Model Compression and Efficiency (281)
-
Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study - Score: 20.0 (R=0, N=0) - Date: 2025-10-03 - Comment: Author match
-
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization - Score: 19 (R=10, N=9) - Date: 2025-10-28 - Comment: Matches Compression/Efficiency: first convergence theory for Adam/Muon under floating-point quantization of gradients/weights/states; explains low-precision training.
-
Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples - Score: 19 (R=10, N=9) - Date: 2025-10-24 - Comment: Compression/Efficiency: layer-selective rank reduction and pruning of high-order components with low-rank factorization; rapid adaptation using a single gradient step on 100 samples.
-
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training - Score: 19 (R=10, N=9) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: curvature-aware gradient correction for quantization-aware training with theoretical convergence and strong W4A4 results.
-
Learning under Quantization for High-Dimensional Linear Regression - Score: 19 (R=10, N=9) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: first systematic theory of learning performance under low-bit quantization across parameters/activations/gradients/data/labels.
-
Unbiased Gradient Low-Rank Projection - Score: 19 (R=10, N=9) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: unbiased low-rank gradient projection (GUM) with convergence guarantees, preserving memory savings while matching/improving full-parameter training.
-
The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis - Score: 19 (R=10, N=9) - Date: 2025-10-21 - Comment: Matches Model Compression and Sparsity Theory: introduces a graphon-based infinite-width framework and Graphon NTK to analyze pruning and sparse network trainability.
-
Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads - Score: 19 (R=10, N=9) - Date: 2025-10-21 - Comment: Model Architecture + Efficiency: SkipV1Former reuses first-layer Value heads to cut V projections/KV cache (~25–50%) while improving perplexity; KV-cache reduction.
-
REAP the Experts: Why Pruning Prevails for One-Shot MoE compression - Score: 19 (R=10, N=9) - Date: 2025-10-17 - Comment: MoE + Compression: theoretical case against expert merging and a router-weighted expert pruning criterion for one-shot SMoE compression.
-
On efficiently computable functions, deep networks and sparse compositionality - Score: 19 (R=10, N=9) - Date: 2025-10-16 - Comment: Model Architecture and Representation Learning: theory linking efficient Turing computability to compositionally sparse DAGs and corresponding deep neural approximants.
-
The Markovian Thinker - Score: 19 (R=10, N=9) - Date: 2025-10-09 - Comment: High-Performance/Algorithmic Efficiency: redesigns the reasoning environment to a Markovian, constant-state setup enabling linear compute and constant memory for very long thinking.
-
vAttention: Verified Sparse Attention - Score: 19 (R=10, N=9) - Date: 2025-10-08 - Comment: Sparse Attention with guarantees: unified top-k and sampling providing user-specified (epsilon, delta) accuracy with strong efficiency gains.
-
Boomerang Distillation Enables Zero-Shot Model Size Interpolation - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Strongly matches Model Compression/Efficiency and Model Architecture: zero-shot model size interpolation by re-incorporating teacher blocks after distillation (no extra training).
-
PolyKAN: A Polyhedral Analysis Framework for Provable and Minimal KAN Compression - Score: 19 (R=10, N=9) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: provable, minimal KAN compression via polyhedral analysis and ε-equivalent compression with an optimal DP algorithm.
-
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration - Score: 19 (R=10, N=9) - Date: 2025-10-06 - Comment: Model Compression and Efficiency + HPC: establishes exponent concentration with theoretical entropy bounds; proposes lossless ECF8 FP format with entropy-aware encoding and GPU-optimized decoding.
-
The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM - Score: 19 (R=10, N=9) - Date: 2025-10-03 - Comment: Model Compression and Efficiency — extreme sparsity/pruning for LLMs via surrogate-free ADMM; includes quantized variant and convergence guarantees.
-
A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws - Score: 19 (R=10, N=9) - Date: 2025-10-02 - Comment: Compression/Efficiency Theory: proves polylogarithmic compression of models and datasets, establishing a dynamical lottery ticket hypothesis and boosted scaling laws.
-
An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning - Score: 18 (R=10, N=8) - Date: 2025-10-31 - Comment: HPC + Compression/Efficiency: All-Reduce–compatible Top-K gradient compressor with contraction guarantees; communication-efficient distributed training.
-
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats - Score: 18 (R=10, N=8) - Date: 2025-10-30 - Comment: Compression/Efficiency: comprehensive study of low-bit quantization formats (INT vs FP) at fine-grained levels with new training method for MXINT8.
-
SALS: Sparse Attention in Latent Space for KV cache Compression - Score: 18 (R=10, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: KV cache compression via latent-space sparse attention that bypasses RoPE-induced rank issues and avoids full reconstruction.
-
Efficient Low Rank Attention for Long-Context Inference in Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: low-rank query/key decomposition with mixed GPU-CPU KV cache to reduce memory and transfers while preserving exact attention.
-
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Model Architecture/HPC: high-sparsity MoE scaling to 1T with FP8 training and efficient heterogeneous pipelines guided by scaling laws.
-
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression - Score: 18 (R=10, N=8) - Date: 2025-10-28 - Comment: Compression/Efficiency: low-bit LLM post-training quantization via learnable grouped lattice vector quantizers with Babai rounding.
-
Sparser Block-Sparse Attention via Token Permutation - Score: 18 (R=10, N=8) - Date: 2025-10-27 - Comment: Matches Compression/Efficiency: block-sparse attention enhanced via token permutation and custom kernels, improving long-context LLM prefilling speed/accuracy.
-
Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: Compression/Efficiency — multi-bit quantization training via weight bias correction and bit-wise coreset sampling to reduce training cost across precisions.
-
ARC-Encoder: learning compressed text representations for large language models - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: Compression/Efficiency — external encoder that compresses context into continuous representations to replace tokens, reducing LLM inference cost without modifying decoders.
-
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: High Performance Computing and Efficiency: distributed dynamic sparse attention training (balanced/hierarchical sparse ring attention) enabling efficient ultra-long contexts.
-
Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: Binary Quadratic Quantization for matrix approximation/PTQ, extending beyond first-order schemes with strong 2-bit results.
-
From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: Matches Model Compression and Efficiency: global structured pruning of LLM attention heads and MLP channels using loss-based importance with iterative schedule.
-
AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: Matches Model Compression and Efficiency: introduces adaptive mantissa-bit sharing for sub-integer floating-point quantization with CUDA kernels, reducing memory access and latency.
-
TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: High Performance Computing / Compression: ternary (1.58-bit) LLM accelerator with table-lookup matmul, fused attention, and prefill/decoding optimizations on edge FPGAs.
-
Efficient Dynamic Structured Sparse Training with Learned Shuffles - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency: dynamic structured sparsity augmented with learned permutations to match unstructured DST accuracy while accelerating training/inference.
-
A Free Lunch in LLM Compression: Revisiting Retraining after Pruning - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Compression: shows reconstruction-based post-pruning retraining can beat full retraining; key design insights and efficient recovery after pruning.
-
What Layers When: Learning to Skip Compute in LLMs with Residual Gates - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Compression and Efficiency: token-wise layer skipping via residual-stream gates enabling dynamic computation with stable fine-tuning.
-
Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference - Score: 18 (R=10, N=8) - Date: 2025-10-17 - Comment: Model Compression/Efficiency: informed token-level routing using a lightweight feature forecaster for execute-or-approximate computation.
-
NOSA: Native and Offloadable Sparse Attention - Score: 18 (R=10, N=8) - Date: 2025-10-16 - Comment: Model compression and efficiency: trainable sparse attention with explicit locality enabling KV cache offloading and reduced transfers, improving decoding throughput and memory use.
-
MC#: Mixture Compressor for Mixture-of-Experts Large Models - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: MoE compression via mixed-precision quantization and dynamic expert pruning/routing (quantization + sparsity/pruning for MoE efficiency).
-
AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Direct hit on Model Compression and Efficiency: multi-precision quantization with bit-plane compute and hardware–algorithm co-design for LLMs.
-
PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-15 - Comment: Matches Model Compression and Efficiency: N:M sparsity with learnable channel permutation via differentiable Sinkhorn normalization and block-wise optimization.
-
SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: Compression: unified Bayesian pruning+quantization via spike-and-slab priors and GMM-based low-bit weights, with consistency guarantees.
-
LOTION: Smoothing the Optimization Landscape for Quantized Training - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: Model Compression and Efficiency: proposes a principled smoothing framework for quantized training (randomized rounding/Nesterov-style smoothing) with convergence guarantees and preservation of global minima.
-
FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference - Score: 18 (R=10, N=8) - Date: 2025-10-13 - Comment: Compression/Efficiency: fine-grained low-rank rank allocation per layer and progressive low-rank decoding for efficient LLM inference.
-
FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2025-10-10 - Comment: Model Architecture/Efficiency: implicit rank-wise MoE within LoRA using sparse random projection as router for parameter-efficient fine-tuning and task decoupling.
-
Artificial Hippocampus Networks for Efficient Long-Context Modeling - Score: 18 (R=10, N=8) - Date: 2025-10-09 - Comment: Model Architecture and Efficiency: hybrid memory design combining Transformer KV cache with learnable RNN-like compressive long-term memory (AHN) to cut FLOPs and cache.
-
Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Strong match to Compression/Efficiency: activation-informed theoretical bounds and Pareto-guided low-rank rank selection (PGSVD) for zero-shot LLM/VLM compression.
-
ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: semi-structured 2:4 pruning via adaptive matrix factorization with block-diagonal wrappers.
-
KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Strong match to Compression/Efficiency: KV-cache quantization to very low precision with Hadamard rotation and linear correction plus a fast attention kernel for efficient long-context inference.
-
PatternKV: Flattening KV Representation Expands Quantization Headroom - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: proposes a pattern-aligned residual quantization scheme for KV-cache to flatten distributions and enable low-bit inference with less memory/bandwidth.
-
COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: training-free sparse dictionary factorization guided by calibration to compress LLMs; structured sparsity compatible with quantization and efficient sparse-dense ops.
-
Post-training quantization of vision encoders needs prefixing registers - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency—training-free post-training quantization for vision encoders via prefix registers (RegCache) to suppress activation outliers.
-
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: introduces compressed convolutional attention (CCA/CCGQA) reducing KV-cache and FLOPs with significant speedups; applicable to dense and MoE models.
-
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: establishes low-rank structure in time-series embeddings, proves compressibility of Q/K/V and attention, introduces flow-of-ranks; guides width/depth/head allocation and achieves large inference/memory reductions on a foundation TS model.
-
UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Compression/Efficiency: unified post-training pruning with mirror descent combining local saliency and global coordination; supports unstructured and N:M sparsity with one-shot mask generation.
-
SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: introduces Sigma-Delta 1-bit/1.58-bit quantization for LLMs with adjustable and fine-grained OSR allocation plus Hadamard-based weight smoothing.
-
PT$^2$-LLM: Post-Training Ternarization for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Directly matches Model Compression and Efficiency: post-training ternarization (quantization) for LLMs with asymmetric ternary quantizer, iterative fitting, and activation-aware refinement.
-
StructPrune: Structured Global Pruning asymptotics with $\mathcal{O}(\sqrt{N})$ GPU Memory - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Model Compression/Efficiency—structured global pruning with O(sqrt(N)) memory via ADMM and derived layer-wise sparsity allocation.
-
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization - Score: 18 (R=10, N=8) - Date: 2025-10-07 - Comment: Matches Model Compression and Efficiency: training-free depth pruning by replacing Transformer blocks with a linear operator using small calibration data; no retraining needed.
-
The Curious Case of In-Training Compression of State Space Models - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Compression/Efficiency — in-training balanced truncation of State Space Models via Hankel singular values to reduce state dimension while preserving expressivity.
-
Randomized Gradient Subspaces for Efficient Large Language Model Training - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: High Performance Computing/Efficiency: randomized gradient subspace methods (GrassWalk/GrassJump) reduce optimizer memory for LLM pretraining by leveraging near-flat curvature.
-
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency/HPC: thought-adaptive KV-cache compression with hybrid quantization–eviction and a PagedAttention-extended kernel for memory reuse.
-
Budgeted Broadcast: An Activity-Dependent Pruning Rule for Neural Network Efficiency - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Matches Model Compression and Efficiency: introduces an activity-dependent pruning rule with constrained-entropy analysis to balance fan-in/fan-out (sparsity/pruning) for efficiency.
-
RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Matches Model Compression and Efficiency: low-bit vector quantization for LLMs using Fisher-information (Riemannian) sensitivity guidance and channel-wise bit allocation.
-
PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: Model Compression and Efficiency—structured pruning within low-rank adapters (LoRA) with theoretical robustness analysis and dynamic rank allocation.
-
CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: continuous and differentiable semi-structured (N:M) sparsity-aware training with a new sparsity-aware optimizer (AdamS), weight scaling, and self-distillation to preserve accuracy.
-
Layer-wise dynamic rank for compressing large language models - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Compression/Efficiency: layer-wise dynamic low-rank SVD with effective-rank metric and Lagrangian allocation for LLM compression.
-
Effective Model Pruning - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Matches Compression/Efficiency: introduces a universal, parameter-free adaptive pruning threshold (effective number via Inverse Simpson index) applicable to diverse pruning criteria.
-
AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: HPC + Compression/Efficiency: KV-cache storage hierarchy with adaptive lossy compression to optimize DRAM/SSD placement for LLM serving.
-
On the expressivity of sparse maxout networks - Score: 18 (R=9, N=9) - Date: 2025-10-17 - Comment: Representation/Architecture Theory: expressivity analysis and depth hierarchies for sparse maxout networks under fixed indegree (sparsity).
-
Drop-Muon: Update Less, Converge Faster - Score: 18 (R=9, N=9) - Date: 2025-10-03 - Comment: Training efficiency criterion: randomized progressive layer updates with non-Euclidean optimization and convergence theory, reducing update cost.
-
ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression - Score: 17 (R=10, N=7) - Date: 2025-10-23 - Comment: Matches Compression/Efficiency: Adaptive Rank Allocation for SVD-based LLM compression with a new mask design and loss to optimize per-layer ranks under global constraints.
-
BitNet Distillation - Score: 17 (R=10, N=7) - Date: 2025-10-17 - Comment: Model Compression and Efficiency: distillation to 1.58-bit (ternary) LLMs with SubLN and attention distillation; large memory/speed gains.
-
Training Dynamics Impact Post-Training Quantization Robustness - Score: 17 (R=10, N=7) - Date: 2025-10-08 - Comment: Compression/Efficiency: analysis of post-training quantization robustness tied to training dynamics and hyperparameters in LLMs.
-
Quantization Range Estimation for Convolutional Neural Networks - Score: 17 (R=10, N=7) - Date: 2025-10-07 - Comment: Strongly matches Model Compression/Efficiency: post-training quantization with provable local convexity and efficient range search.
-
STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Compression/Efficiency: low-precision activation quantization using sequence-dimension linear transforms and mixed-precision token retention; complements existing quantization.
-
Polybasic Speculative Decoding Through a Theoretical Perspective - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: HPC/Efficiency: theoretical framework for multi-model (polybasic) speculative decoding with optimal inference time characterization and practical speedups.
-
CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Model Architecture and Efficiency: introduces invertible linear layers via circulant–diagonal decomposition with FFT, reducing parameters and log-det/inversion cost for normalizing flows.
-
Sequences of Logits Reveal the Low Rank Structure of Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-30 - Comment: Representation Learning + Compression/Efficiency: demonstrates and exploits low-rank structure in LM logits with a model-agnostic abstraction and theory.
-
LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Data-aware LoRA initialization derived via asymptotic/Fisher analysis—matches Low-Rank Adaptation and Compression/Efficiency criteria.
-
FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: FP8 end-to-end LoRA fine-tuning by merging adapters into a quantized backbone and reducing quantization overhead.
-
ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Compression/Efficiency for fine-tuning: optimally scaled LoRA accumulates high-rank updates from low-rank increments with analytic scaling guarantees.
-
Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: introduces differentiable contiguous layer pruning with endpoint tuning for LLMs; compatible with quantization.
-
Batch Speculative Decoding Done Right - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches HPC/Efficiency: batch speculative decoding with correctness guarantees and synchronization strategy addressing ragged tensors.
-
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches Model Architecture and Efficiency with an encoder-decoder diffusion LM enabling faster training/inference.
-
Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency by enabling one-step sampling for AR image models via conditional score distillation.
-
$\alpha$-LoRA: Effective Fine-Tuning via Base Model Rescaling - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Low-rank/Compression: new reparameterization (α-LoRA) via base model rescaling with theory (RMT) to improve fine-tuning generalization.
-
ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Model Compression/Efficiency: LUT-aware hierarchical linear quantization (HLQ) and optimized CPU kernels for LLM edge deployment.
-
NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Model Compression/Efficiency: PEFT via bypass connections on selected parameters enabling ≤0.02% trainable weights.
-
StreamingTOM: Streaming Token Compression for Efficient Video Understanding - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Compression/Efficiency: training-free streaming token compression with causal temporal reduction and 4-bit online KV-cache memory.
-
Glyph: Scaling Context Windows via Visual-Text Compression - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: compresses long textual context via visual rendering to reduce tokens and compute, yielding faster prefilling/decoding and SFT.
-
Neuronal Group Communication for Efficient Neural representation - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Model Architecture and Compression/Efficiency: proposes low-rank, group-based neuronal communication with a stability metric, improving compactness and modularity.
-
One-Bit Quantization for Random Features Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: theory for one-bit quantization in Random Features models showing no generalization loss when quantizing all but last layer.
-
Compressing Many-Shots in In-Context Learning - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Efficiency: compresses many-shot in-context prompts via layer-wise soft-token summaries to cut memory/compute during inference.
-
AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: generalized assistant distribution and divergences for KD of LLMs improving stability/performance.
-
Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression and Efficiency/HPC: exploits fine-tuning-time sparsity with dynamic sparse operators and predictors to accelerate PEFT.
-
CTR-LoRA: Curvature-Aware and Trust-Region Guided Low-Rank Adaptation for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Model Compression/Efficiency: PEFT via curvature-aware trust-region LoRA with adaptive rank scheduling using second-order proxies; stability and throughput gains.
-
Continual Learning via Sparse Memory Finetuning - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Matches Model Compression and Efficiency (Sparsity) for continual learning via sparsely updated memory layers to reduce interference/forgetting.
-
Attention Is All You Need for KV Cache in Diffusion LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency: adaptive, layer-aware KV cache refresh (Elastic-Cache) for diffusion LLMs reduces redundant recomputation with negligible quality loss.
-
Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: HPC/Training Efficiency: principled batch-size scheduling equivalence to LR decay (with theory) to accelerate pretraining.
-
MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency – low-bit microscaling (BFP) quantization extension addressing outliers for efficient LLM serving with minimal overhead.
-
A Deep State-Space Model Compression Method using Upper Bound on Output Error - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency: provable output-error bounds and gradient-based model order reduction for Deep SSMs.
-
Towards Reversible Model Merging For Low-rank Weights - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Compression/Efficiency – low-rank (LoRA/SVD) weight merging with a reversible basis and closed-form solution for reconstruction-capable model space.
-
K-Merge: Online Continual Merging of Adapters for On-device Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Matches Model Compression and Efficiency: online continual merging of low-rank adapters (LoRAs) for on-device LLMs under storage/compute constraints.
-
KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Compression and Efficiency: training-free KV-cache reuse/alignment across agents for multi-agent LLM inference, delivering large speedups without quality loss.
-
DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: tightly couples an AR LM with masked diffusion over discrete RVQ codes enabling blockwise parallelism; offers controllable compute via RVQ layer pruning.
-
Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Compression/Efficiency theory: extends MDL to singular models; LLC-based complexity predicts quantization/low-rank compressibility.
-
MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Model Compression/Efficiency: training-free structural pruning for diffusion models that aligns pruning policy with pretraining dynamics.
-
Direct Multi-Token Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Model Efficiency/HPC — Direct Multi-Token Decoding uses late layers to emit multiple tokens per step without auxiliary models, reducing repeated forward passes.
-
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Model Compression/Efficiency and HPC: NVFP4 quantization + LoRA to accelerate RL training of LLMs, with adaptive quantization noise for exploration.
-
Differentiable Fast Top-K Selection for Large-Scale Recommendation - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Designs a differentiable Top-K operator with O(n) complexity for end-to-end training (algorithmic efficiency breakthrough).
-
LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches Compression/Efficiency and HPC: semantic-aware KV retrieval and fine-grained decoupled management with custom kernels to accelerate long-sequence LLM decoding while preserving accuracy.
-
Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches HPC and Compression/Efficiency: automated mapping and scheduling for block-diagonal sparse LLMs on compute-in-memory accelerators to boost array utilization and reduce memory/compute.
-
ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Pure low-precision (BF16/Float8) training with Kahan summation, stochastic rounding, and memory optimizations (gradient fusion/chunking) — Model Compression/Efficiency for large output spaces.
-
Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Strongly matches Model Compression and Efficiency by leveraging structured sparsification, conformal prediction, and lattice quantization to compress token distributions for speculative decoding; systems-level bandwidth optimization aligns with efficiency goals.
-
CacheClip: Accelerating RAG with Effective KV Cache Reuse - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Compression/Efficiency: KV cache reuse with auxiliary-model-guided selective recomputation, shared-prefix sink removal, and grouping for faster RAG prefill without quality loss.
-
AdaPM: a Partial Momentum Algorithm for LLM Training - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: HPC/Efficiency: memory-efficient optimizer for LLM training via partial momentum with bias correction, reducing momentum state memory by >90%.
-
Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Matches Representation Learning and Sparsity: identifies and manipulates sparse, layer-consistent dimensions governing multilingual control without training.
-
Efficient numeracy in language models through single-token number embeddings - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Efficiency/Architecture: proposes single-token number embeddings (BitTokens) via IEEE 754 to reduce tokenization overhead and enable efficient arithmetic in LLMs.
-
From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Training Dynamics/Efficiency: establishes a scaling law for multi-stage (bootstrapped) pretraining, guiding efficient reuse of overtrained base models.
-
Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Inference Efficiency: training-aware speculative decoding (self-speculation) with online updates for lossless speedups.
-
HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: PEFT/Low-Rank: cross-head shared low-rank adapters generated by joint hypernetworks; theoretical sample-efficiency gains via a hierarchical MoE perspective.
-
Composite Optimization with Error Feedback: the Dual Averaging Approach - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Compression and Efficiency and High-Performance Computing: communication-efficient distributed training with compression via a new EF–Dual Averaging method and convergence analysis for composite objectives.
-
On The Expressive Power of GNN Derivatives - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Model Architecture — HOD-GNN augments MPNNs with high-order feature derivatives to boost expressivity up to the WL hierarchy; efficient derivative message passing.
-
In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Matches Model Compression and Efficiency/HPC: enables in-memory training on low-precision analog devices via multi-tile residual learning with convergence guarantees.
-
SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Efficiency: streaming subset selection via Frequent Directions gradient sketches enabling constant-memory, GPU-friendly training.
-
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Efficiency/HPC: speculative decoding using a diffusion LM drafter with causal-consistency path search and adaptive draft length for speedups.
-
KaVa: Latent Reasoning via Compressed KV-Cache Distillation - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Compression and Efficiency: compressed KV-cache distillation to supervise latent reasoning, leveraging cache-aware signals for efficient inference and memory savings.
-
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Compression and Efficiency: advances LoRA via U S V^T factorization with Stiefel manifold constraints and Riemannian optimization for low-rank adapters.
-
HiSpec: Hierarchical Speculative Decoding for LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency/HPC: hierarchical speculative decoding using early-exit intermediate verification with KV-cache/hidden-state reuse for high-throughput inference.
-
Low Rank Gradients and Where to Find Them - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency: identifies approximate low-rank structure in gradients; Representation Learning/Training Dynamics: links data/activation/regularizers to gradient rank components.
-
On Predictability of Reinforcement Learning Dynamics for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Representation Learning/Training Dynamics: identifies low-rank (rank-1) structure in RL-induced parameter updates and exploits it for efficient training speedups.
-
Randomized Matrix Sketching for Neural Network Training and Gradient Monitoring - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: adapts matrix sketching to layer activations for memory-efficient backprop and gradient monitoring, enabling reduced activation storage.
-
HilbertA: Hilbert Attention for Image Generation with Diffusion Models - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Sparse attention/HPC: 2D-aware GPU-efficient attention via Hilbert-curve token ordering and sliding schedule, implemented in Triton.
-
DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Compression/Efficiency: differentiable vector quantization via reparameterization (and space-filling variant) enabling end-to-end training and improved codebook usage.
-
Distillation of Large Language Models via Concrete Score Matching - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: new discrete score-matching KD objective aligning relative logits for LLM distillation, addressing softmax smoothing and shift invariance.
-
Flow Matching with Semidiscrete Couplings - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches Compression/Efficiency and Training Algorithms: semidiscrete OT-based flow matching eliminates quadratic batch-OT costs, enabling scalable generative training.
-
Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy - Score: 17 (R=8, N=9) - Date: 2025-10-30 - Comment: Compression/Efficiency: new spectral-norm perturbation bounds for low-rank approximation, improving theoretical guarantees (e.g., DP-PCA utility).
-
LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Model Compression and Efficiency: mixed-precision post-training quantization of LoRA via SVD reparameterization to ultra-low bits.
-
Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Efficiency: inference-cost-aware speculative decoding with dynamic tree construction that accounts for GPU/batch effects.
-
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Efficiency/HPC: attention-map caching and similarity retrieval to accelerate prefill self-attention in LLMs with minimal accuracy loss.
-
zFLoRA: Zero-Latency Fused Low-Rank Adapters - Score: 16 (R=9, N=7) - Date: 2025-10-31 - Comment: Compression/Efficiency: fused low-rank adapters that incur zero or negligible inference latency overhead.
-
Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning - Score: 16 (R=9, N=7) - Date: 2025-10-30 - Comment: Compression/Efficiency and Architecture: structured local learning on low-rank manifolds (SVD) with aligned feedback, reducing parameters and avoiding BP while maintaining accuracy.
-
Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Model Compression/Efficiency: sparse PEFT with kernelized low-rank updates and adaptive bi-level sparsity allocation, reducing memory while improving adaptation.
-
SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Speculative Knowledge Distillation applies token-level gating for distillation loss—directly matches Compression/Efficiency via improved KD for LLMs.
-
Improving the Straight-Through Estimator with Zeroth-Order Information - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Model Compression/Efficiency: quantization-aware training via FOGZO combining STE with zeroth-order information to reduce bias and compute.
-
Transformers from Compressed Representations - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Transformer efficiency via learning directly from compressed representations, reducing tokens/compute—matches the Compression/Efficiency criterion with an architectural tokenization strategy.
-
The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-29 - Comment: Model Compression and Efficiency: introduces a novel pruning framework with differentiable concave gates to select contiguous layer segments and a localized fine-tuning strategy; method-centric compression (pruning) with synergy to quantization.
-
Mixed Precision Training of Neural ODEs - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency with a mixed-precision training framework for Neural ODEs addressing memory/runtime.
-
Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Optimizer-level memory reduction using low-rank Jacobian approximation with error-feedback to train with approximate gradients under tight memory (Compression/Efficiency).
-
When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency by analyzing how layer pruning impacts test-time scaling for reasoning in LLMs.
-
TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Model Compression and Efficiency: ternary quantization of both vision and text encoders with distillation for large VLMs.
-
PLAN: Proactive Low-Rank Allocation for Continual Learning - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Model Compression/Efficiency: Low-Rank Adaptation (LoRA) with proactive orthogonal allocation for continual learning.
-
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: High-Performance Inference: selective knowledge distillation tailored to maximize token acceptance in speculative decoding.
-
GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: sparse fine-tuning by selecting parameters with large gradients and low pre-trained magnitudes to preserve knowledge.
-
Latent Space Factorization in LoRA - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Matches Compression/Efficiency and Model Architecture: a LoRA variant (FVAE-LoRA) that factorizes task-salient vs residual latent spaces via a new ELBO for parameter-efficient finetuning with improved robustness.
-
Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: proposes dequantization-free mixed-precision quantization for LLMs via SNN-style spike encoding, reducing MAC energy.
-
CPSVD: Enhancing Large Language Model Compression via Column-Preserving Singular Value Decomposition - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: column-preserving SVD with adaptive per-module compression for LLMs (low-rank plus selective column retention).
-
Feature Space Adaptation for Robust Model Fine-Tuning - Score: 16 (R=9, N=7) - Date: 2025-10-23 - Comment: Model Compression/Efficiency: PEFT in feature space (LoRFA/VeFA) with low-rank/vector transformations to preserve pretrained representations under shift.
-
ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Model Architecture/Efficiency: depth scaling of ViTs via layer-wise weight sharing plus lightweight parallel adapter parameters.
-
S2AP: Score-space Sharpness Minimization for Adversarial Pruning - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Compression/Efficiency: adversarial pruning with score-space sharpness minimization to stabilize mask selection and preserve robustness.
-
Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Matches High Performance Computing/Efficiency: zeroth-order LLM fine-tuning with projected gradient-aligned perturbations to cut estimator variance and iterations.
-
ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: zero-shot, prompt-aware visual token pruning for VLMs to reduce inference cost while preserving task-relevant content.
-
SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: High Performance Computing / Efficiency: hardware-software co-design of Softmax and LayerNorm (E2Softmax, AILayerNorm) with low-precision arithmetic and no retraining.
-
Bitwidth-Specific Logarithmic Arithmetic for Future Hardware-Accelerated Training - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: Matches Compression and Efficiency: bitwidth-specific logarithmic arithmetic with hardware-friendly piecewise-linear addition enabling low-precision training.
-
QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: unified low-rank SVD across Q/K/V with rank allocation and joint quantization to reduce KV cache and compute in VLMs.
-
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation - Score: 16 (R=9, N=7) - Date: 2025-10-20 - Comment: Matches Model Compression and Efficiency via structured pruning with concatenation-based layer merging and hierarchical distillation to retain capacity.
-
Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Compression/Efficiency + Hardware co-design: hardware-aware dynamic token and FFN pruning with optimized dataflow for low-power ViT acceleration.
-
Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Compression/Efficiency – Transformer pruning with unified Head Importance–Entropy Score combining gradients and attention entropy.
-
CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression - Score: 16 (R=9, N=7) - Date: 2025-10-16 - Comment: Model compression and efficiency: embedding-layer compression via group residual vector quantization with a corrective adaptor, reducing memory footprint and compatible with 4-bit hardware.
-
Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Compression and Efficiency: quantization- and rescale-aware training for integer-only inference; reduces rescaler bitwidth post-training with minimal retraining.
-
Neural Weight Compression for Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Model Compression and Efficiency — learned autoencoder codec for LM weight compression with importance-aware loss and inference-time error compensation.
-
Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Compression/Efficiency: scale-dependent guidelines for allocating memory between weights, KV cache, and generation length; compares KV eviction vs quantization for reasoning models.
-
CauchyNet: Compact and Data-Efficient Learning using Holomorphic Activation Functions - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Model Architecture: complex-valued holomorphic activation functions (Cauchy-inspired) enabling compact, data-efficient networks with theoretical guarantees.
-
ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture/Efficiency: selective layer expansion and unit-wise decoupled tuning for parameter-efficient continual pretraining of LLMs.
-
Vanishing Contributions: A Unified Approach to Smoothly Transition Neural Models into Compressed Form - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: Matches Model Compression and Efficiency: proposes a general training scheme (VCON) to smoothly transition models to compressed forms (pruning/quantization/low-rank) to mitigate accuracy loss.
-
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference - Score: 16 (R=9, N=7) - Date: 2025-10-15 - Comment: High Performance Computing — systems-level KV-cache offloading and cross-engine sharing with pipelined data movement and a control API for enterprise-scale LLM inference.
-
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Model Compression and Efficiency: aggressive quantization (≈1.58-bit encoders), sliding-window attention, and episodic memory for edge-efficient multimodal transformers.
-
Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Model Compression and Efficiency/HPC: subspace-restricted training of ViTs (WASI) to cut memory and FLOPs for on-device learning.
-
StreamingVLM: Real-Time Understanding for Infinite Video Streams - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: Compression/Efficiency criterion: streaming KV-cache management (attention sinks, short/long windows) with training–inference alignment for real-time long-context VLMs.
-
dInfer: An Efficient Inference Framework for Diffusion Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: High Performance Computing and Model Efficiency: proposes an inference framework for diffusion LLMs with algorithmic and system-level optimizations (diffusion iteration manager, decoding, KV-cache manager) enabling large speedups.
-
Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: Matches Model Compression and Efficiency: CoT compression via an upfront thought-embedding compressor–executor framework to reduce token usage/latency while maintaining reasoning quality.
-
Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation - Score: 16 (R=9, N=7) - Date: 2025-10-13 - Comment: Matches Model Compression and Efficiency: uses low-rank adaptation (LoRA) with synthetic data/logit distillation to recover accuracy after quantization/pruning/serialization-induced degradation.
-
AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Strong match to Compression/Efficiency: enhances LoRA via function-aware asymmetric low-rank initialization with analysis of distinct W^Q and W^V roles in self-attention.
-
Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Compression/Efficiency: selects structurally sparse subnetwork initializations via evolutionary search and uses distillation to accelerate pretraining, achieving 9.2x fewer tokens for comparable perplexity.
-
Sharpness-Aware Data Generation for Zero-shot Quantization - Score: 16 (R=9, N=7) - Date: 2025-10-09 - Comment: Matches Model Compression/Efficiency: zero-shot quantization with sharpness-aware synthetic data generation and supporting theory for better generalization.
-
DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: stabilizes and enhances low-rank adaptation (DoRA) via noise injection and auxiliary networks that generate low-rank factors, improving PEFT.
-
Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Matches Model Compression and Efficiency: proposes ultra-low-bit (2-bit) post-training quantization tailored to diffusion LLMs with masked calibration simulation and adaptive blockwise mixed precision.
-
CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Model Compression and Efficiency: channel-wise mixed-precision quantization personalized via a hypernetwork; 2-bit per-channel strategy encoding enables resource-adaptive deployment without backprop.
-
HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Compression/Efficiency: dynamic low-rank adaptation (LoRA) accelerated via hypernetwork-generated SVD parameters with rank pruning for efficient PEFT.
-
Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: High Performance Computing/Efficiency: learning-based zeroth-order optimizer for LLM fine-tuning that reduces memory use via L2L-style perturbation strategies.
-
The Pitfalls of KV Cache Compression - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: critical analysis of KV cache compression with improved eviction policies for multi-instruction prompting in LLMs.
-
Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: robust pruning guided by an Unbiased and Smooth Neuron metric (USN) plus a Wasserstein loss to enhance certifiable robustness.
-
ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: training-free adaptive suppression of reasoning steps for LRLMs to reduce tokens/latency while preserving accuracy.
-
Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: analyzes RoPE interpolation under post-training quantization and proposes an interpolation-aware, per-band weight rescaling (Q-ROAR) guided by new diagnostics.
-
Equivariance by Local Canonicalization: A Matter of Representation - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Architecture and Efficiency: transfers tensor field networks to local canonicalization to preserve equivariance with lower runtime (PyG integration).
-
Collaborative Compression for Large-Scale MoE Deployment on Edge - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: MoE-aware collaborative compression combining expert pruning, mixed-precision quantization, and activation optimization for ultra-large MoE deployment under strict memory limits.
-
Growing Winning Subnetworks, Not Pruning Them: A Paradigm for Density Discovery in Sparse Neural Networks - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Compression and Efficiency: proposes growth-based sparse training (PWMPR) to discover winning subnetworks and operating density, complementing pruning/dynamic sparsity.
-
Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Model Compression/Efficiency: rethinks multi-LoRA parameter sharing (ALoRA, Fed-ALoRA) with asymmetric design and matrix decomposition for heterogeneous ranks.
-
FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Matches Model Compression and Efficiency: unified sparse attention kernel with flexible sparse symbols and optimized sparse GEMMs for DiT inference acceleration.
-
On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs - Score: 16 (R=9, N=7) - Date: 2025-10-01 - Comment: Compression/Efficiency: quantization-aware fine-tuning via configuration-aware low-rank (LoRA) adjustments that adapt to arbitrary per-layer bit-widths without re-finetuning.
-
Bridging Function Approximation and Device Physics via Negative Differential Resistance Networks - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: Model Architecture + Efficiency/Hardware: analog implementation of Kolmogorov–Arnold Networks using negative differential resistance devices for learnable nonlinearities.
-
HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Enforces block-diagonal Jacobians via non-backtracking GNNs to make likelihood evaluation scale with constant backward passes (Algorithmic Efficiency for generative flows).
-
Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs - Score: 16 (R=8, N=8) - Date: 2025-10-24 - Comment: HPC/Efficiency: provably no-regret drafter selection for speculative decoding that evaluates all drafters without extra target queries, reducing inference cost.
-
Just-In-Time Piecewise-Linear Semantics for ReLU-type Networks - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Model Analysis/Verification: JIT piecewise-linear semantics for ReLU networks enabling exact/approx certificates, Lipschitz, robustness—foundational network semantics.
-
Computational Budget Should Be Considered in Data Selection - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Matches Efficiency/Data Selection: compute-budget-aware bilevel data selection with Hessian-free gradient estimator and efficient inner-loop relaxation.
-
SHaRe-SSM: An Oscillatory Spiking Neural Network for Target Variable Modeling in Long Sequences - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Model Architecture/Efficiency: oscillatory spiking state-space model (multiplication-free, sparse events) with parallel scans for very long sequences.
-
Z0-Inf: Zeroth Order Approximation for Data Influence - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Algorithmic efficiency and training dynamics: introduces a zeroth-order, gradient-free influence estimation scalable to LLMs, enabling practical data influence analysis without Hessians/gradients.
-
Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Model Compression and Efficiency + Representation Learning: introduces latent-trajectory signals from internal representations to guide inference-time compute allocation and answer selection, reducing token usage.
-
Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Matches Model Compression and Efficiency: parameter-efficient transfer via gradient-sign masking to transport task vectors across pre-trained models with first-order descent guarantee.
-
Accelerating Inference for Multilayer Neural Networks with Quantum Computers - Score: 16 (R=8, N=8) - Date: 2025-10-09 - Comment: High Performance Computing/Efficiency: fully coherent quantum implementation of multilayer neural inference with provable speedups under quantum data access assumptions.
-
Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Matches Efficiency/Test-Time Scaling: introduces Best-of-Majority, a minimax-optimal Pass@k inference strategy with theoretical guarantees over majority voting/BoN.
-
Constrained Adaptive Rejection Sampling - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Compression/Efficiency: algorithmic innovation for constrained decoding via adaptive rejection sampling that preserves the exact distribution while improving sample efficiency.
-
CIMNAS: A Joint Framework for Compute-In-Memory-Aware Neural Architecture Search - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Model Compression and Efficiency/HPC: joint HW-aware NAS with quantization and CIM device/circuit/architecture co-optimization for EDAP-focused design.
-
The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning and Compression theory: argues compression efficiency drives causal representation discovery; testable predictions about rate–distortion and OOD generalization.
-
Are Language Models Efficient Reasoners? A Perspective from Logic Programming - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning/Training Dynamics: framework measuring reasoning efficiency and aligning natural-language proofs with minimal logic-program proofs.
-
BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Training Dynamics and Efficiency: exploits Hessian subspace dichotomy (Dom vs Bulk) with PCA-based projection and differential scaling to accelerate optimization.
-
Continual Low-Rank Adapters for LLM-based Generative Recommender Systems - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Compression/Efficiency: low-rank adapters (LoRA) with proximal regularization for continual adaptation.
-
What Really Matters in Matrix-Whitening Optimizers? - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Training Dynamics/Optimization: analysis of matrix-whitening vs spectral descent; identifies variance adaptation as key ingredient with low-rank estimators.
-
Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Compression/Efficiency + Hardware co-design: automatic compression, approximate Bayesian inference, and analog accelerators for embedded inference.
-
All in one timestep: Enhancing Sparsity and Energy efficiency in Multi-level Spiking Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Model Architecture and Efficiency: proposes multi-level spiking neurons and a Sparse-ResNet to enhance sparsity and reduce energy/latency in SNNs.
-
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency via mixed-precision quantization to speed up interpretability patching with reduced memory.
-
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Efficiency: self-speculative decoding with draft/verify for VLMs, accelerating autoregressive inference.
-
Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Model Compression and Efficiency via improved knowledge distillation using angularly diverse single-teacher augmentations.
-
Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Model Compression/Efficiency: proposes few-shot task-aware knowledge distillation using counterfactual explanations with theoretical guarantees.
-
Memory Constrained Dynamic Subnetwork Update for Transfer Learning - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Compression/Efficiency: memory-constrained dynamic subnetwork adaptation with principled layer ranking and dynamic channel sampling.
-
Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Architecture and Efficiency: factorized hypernetwork generates context-aware LoRA adapters for conditioned fine-tuning (parameter-efficient adapters).
-
Study of Training Dynamics for Memory-Constrained Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: dynamic stochastic channel selection yields high activation/gradient sparsity for memory-constrained fine-tuning.
-
Knowledge Distillation of Uncertainty using Deep Latent Factor Model - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: proposes distribution distillation (Gaussian distillation) using a deep latent factor model to compress deep ensembles while preserving uncertainty, reducing compute/memory.
-
MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Model Compression and Efficiency: codebook-based weight sharing for KANs via meta-learner-induced clustering enables up to 80x parameter compression.
-
LightMem: Lightweight and Efficient Memory-Augmented Generation - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Model Architecture/Efficiency: lightweight memory-augmented generation with multi-stage memory and offline consolidation (cache-like), reducing token and runtime costs.
-
Graphical model for tensor factorization by sparse sampling - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Representation Learning and Sparsity: message-passing and replica-theory analysis for tensor factorization under sparse sampling on random graphs.
-
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: training-free, attention-guided recurrent token selection for streaming Video-LLMs, discarding up to ~95% of tokens with minimal loss.
-
All You Need is One: Capsule Prompt Tuning with a Single Vector - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Architecture/Efficiency (PEFT): Capsule Prompt-Tuning with a single vector acting as an instance-aware "attention anchor" for parameter-efficient adaptation.
-
Zeroth-Order Sharpness-Aware Learning with Exponential Tilting - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Training Dynamics/Efficiency: bridges zeroth-order optimization with sharpness-aware minimization via exponential tilting; gradient-free, memory-efficient SAM alternative.
-
Vector Quantization in the Brain: Grid-like Codes in World Models - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Representation Learning/Model Architecture: brain-inspired action-conditioned vector quantization via attractor dynamics for spatiotemporal world models.
-
SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Optimization for large-scale training: a Lookahead variant applying Nesterov momentum to pseudo-gradients (SNOO) for compute-efficient training with minimal overhead.
-
GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Systems-level inference efficiency: training-free monolithic forwarding with sequence-level sparsity for top-K reranking, reducing latency and peak memory on-device.
-
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Efficiency/HPC: universal speculative decoding via DTW-based alignment that tolerates draft–target mismatch, enabling faster inference.
-
Revisiting Knowledge Distillation: The Hidden Role of Dataset Size - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Training dynamics/representation: identifies data-efficiency of knowledge distillation in low-data regimes and evaluates competing theories (label smoothing vs dark knowledge).
-
LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Compression/Efficiency: conditional computation via stage-wise layer skipping and confidence-based early exit tailored for multi-stage reasoning.
-
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Efficiency: context-aware dynamic vocabulary shortlisting for speculative decoding to reduce drafter compute while keeping exact verification.
-
SG-XDEAT: Sparsity-Guided Cross-Dimensional and Cross-Encoding Attention with Target-Aware Conditioning in Tabular Learning - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Model architecture and efficiency: introduces Adaptive Sparse Self-Attention (sparsity) plus cross-dimensional/cross-encoding attention with target-aware conditioning for tabular learning.
-
Rethinking Knowledge Distillation: A Data Dependent Regulariser With a Negative Asymmetric Payoff - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Model Compression/Efficiency: rigorous analysis reframing knowledge distillation as a data-dependent regularizer with quantified transfer dynamics.
-
SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Matches Compression/Efficiency: proposes an embedding compression framework (dimension pruning with adaptive selection and cross-batch memory) for retrieval.
-
Your VAR Model is Secretly an Efficient and Explainable Generative Classifier - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Model Architecture and Efficiency: proposes a VAR-based generative classifier with tractable likelihood enabling token-wise MI explanations and faster inference than diffusion-based counterparts.
-
Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture and Efficiency: latent interleaved vision-text reasoning design and progressive training reduce annotation and inference latency.
-
MoRA: On-the-fly Molecule-aware Low-Rank Adaptation Framework for LLM-based Multi-Modal Molecular Assistant - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Proposes instance-specific dynamic Low-Rank Adaptation (LoRA) weights injected on-the-fly (low-rank parameter-efficient adaptation/architecture).
-
EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Inference-time Efficiency: entropy-aware branching reallocates test-time compute adaptively to hard prompts, improving Pass@k at lower token budgets.
-
LightSAE: Parameter-Efficient and Heterogeneity-Aware Embedding for IoT Multivariate Time Series Forecasting - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Model Compression/Efficiency: parameter-efficient embedding via low-rank factorization and shared gated component pool for heterogeneous time-series channels.
-
LLM-Oriented Token-Adaptive Knowledge Distillation - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Matches Model Compression and Efficiency via Knowledge Distillation for LLMs with token-level adaptive focusing and temperature scaling.
-
Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Model Architecture/Efficiency: adaptive conditional computation (fast vs slow reasoning) with entropy-guided hybrid policy optimization to reduce reasoning cost.
-
Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Compression/Efficiency: Top-K logits replay with exact renormalized losses plus MoClip optimizer stabilizes updates for low-cost LLM post-training with minimal forgetting.
-
PAC Reasoning: Controlling the Performance Loss for Efficient Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Conditional/Dynamic Networks: PAC-based switching between thinking/nonthinking modes with distribution-free performance-loss guarantees for efficient inference.
-
Auto-scaling Continuous Memory for GUI Agent - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Compression/Efficiency criterion: fixed-length continuous memory embeddings replacing long textual histories to reduce context cost while preserving visual detail.
-
DeepPrune: Parallel Scaling without Inter-trace Redundancy - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Model Efficiency: dynamic pruning of parallel Chain-of-Thought traces via learned equivalence prediction and online clustering, reducing inference tokens.
-
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Model Compression/Efficiency criterion: studies pruning in VLA and introduces a training-free weight interpolation correction to recover sparsified models.
-
Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Model Architecture and Efficiency: proposes a single-layer, O(N) Co$^4$ architecture reportedly outperforming GPT-2/GPT-BERT on BabyLM.
-
First Try Matters: Revisiting the Role of Reflection in Reasoning Models - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Inference Efficiency/Training Dynamics: empirical analysis of reflection plus question-aware early stopping to cut reasoning tokens.
-
Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Efficiency: sequence-level entropy from token log-probs as a confidence signal for early stopping in reasoning models.
-
OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Model Compression and Efficiency: algorithmic speedups for long-context speculative decoding (LSTM drafter, [SPEC] verifier, hybrid tree/non-tree) to improve inference throughput.
-
GUIDE: Guided Initialization and Distillation of Embeddings - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Matches Model Compression and Efficiency: parameter-space guided initialization/distillation (GUIDE) improves teacher–student transfer with no training/inference overhead.
-
MixReasoning: Switching Modes to Think - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Conditional/Dynamic Networks and Efficiency: adaptively switches reasoning depth within a single response to reduce computation without accuracy loss.
-
AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Model Compression and Efficiency: adaptive mixed-bit activation quantization with bit-regularized channel-wise/layer-wise allocation for split learning; also reduces communication in distributed training.
-
Scalable In-context Ranking with Generative Models - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Model Architecture/Efficiency: enforced block-sparse attention across documents with auxiliary contrastive objective, reducing attention complexity from quadratic to linear.
-
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: High-Performance/Systems Efficiency: hardware–software co-design with module-level offloading, low-bit kernels, and token-aware buffering for on-device LMM inference.
-
Compressed Concatenation of Small Embedding Models - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Compression/Efficiency: concatenation of small embedding models with a Matryoshka-trained decoder and quantization to achieve high compression while preserving retrieval performance.
-
Efficient Training of Spiking Neural Networks by Spike-aware Data Pruning - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Compression and Efficiency: spike-aware data pruning that approximates gradient-norm sampling via an efficient upper bound, cutting SNN training time while maintaining accuracy.
-
Adaptively Sampling-Reusing-Mixing Decomposed Gradients to Speed Up Sharpness Aware Minimization - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Efficiency/optimization: accelerates SAM by decomposing and selectively reusing gradient components while preserving flat-minima generalization.
-
REG: A Regularization Optimizer for Robust Training Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Training Dynamics/Optimization for large models: introduces a structure-aware optimizer (RACS) replacing Muon’s matrix sign to stabilize and regularize updates.
-
Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: High Performance/Scaling: quality-aware scaling law extending Chinchilla to jointly model data quality, dataset size, and model size for compute-efficient pretraining.
-
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Compression/Efficiency: pruning-based circuit extraction; Representation Learning: mechanistic interpretability via sparse circuit discovery with a hybrid attribution+pruning framework.
-
QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Model Architecture and Efficiency: introduces quadratic transformations with low-rankness, weight sharing, and sparsification as a lightweight enhancer.
-
Light Differentiable Logic Gate Networks - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Model Architecture/Efficiency: reparametrization of differentiable logic gate neurons reduces parameter size and improves training stability.
-
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Matches Model Compression/Efficiency: proposes gradient-free layer selection using Betti-number activation topology with forward passes only, reducing retraining compute/memory on-device.
-
ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Matches Model Compression/Efficiency: pluggable QK/Chunk adapters with attention distillation for chunk-wise attention and KV cache reduction to accelerate LLM inference.
-
Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Model Compression/Efficiency: Shapley-value-based, shift-invariant pruning for Kolmogorov–Arnold Networks enabling reliable compression.
-
ACON: Optimizing Context Compression for Long-horizon LLM Agents - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Compression and Efficiency: proposes an LLM-agent context compression framework with guideline optimization and distillation to smaller compressors, reducing memory/token usage.
-
Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exiting - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Efficiency: adaptive early exiting for reasoning LLMs using entropy trajectory after stop-thinking token to save tokens.
-
AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Matches Efficiency/Decoding: adaptive block-size semi-autoregressive scheduler using confidence dynamics for diffusion LLM inference.
-
RAE: A Neural Network Dimensionality Reduction Method for Nearest Neighbors Preservation in Vector Search - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Matches Representation Learning and Compression/Efficiency: proposes a regularized autoencoder with provable bounds to preserve k-NN under dimensionality reduction for vector search.
-
Adaptive Graph Coarsening for Efficient GNN Training - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Efficiency for GNNs: joint training with adaptive graph coarsening (K-means over learned embeddings) to reduce training data and computation.
-
Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Model Efficiency: theoretical conditions for layer skipping in VLMs using information-theoretic redundancy analysis.
High Performance Computing (65)
-
Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection - Score: 20.0 (R=0, N=0) - Date: 2025-10-28 - Comment: Author match
-
A Definition of AGI - Score: 20.0 (R=0, N=0) - Date: 2025-10-22 - Comment: Author match
-
Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models - Score: 20.0 (R=0, N=0) - Date: 2025-10-01 - Comment: Author match
-
ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models - Score: 19 (R=10, N=9) - Date: 2025-10-28 - Comment: Systems-level innovation enabling sequence-parallel training of nonlinear RNNs via Newton iterations and parallel reductions (High Performance Computing + Model Architecture).
-
Collective Communication for 100k+ GPUs - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: High Performance Computing: introduces a collective communication framework (NCCLX) enabling reliable high-throughput, low-latency scaling to 100k+ GPUs for LLM training/inference.
-
AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training - Score: 18 (R=10, N=8) - Date: 2025-10-24 - Comment: High-performance training: asynchronous hierarchical ZeRO with adaptive resharding and multi-stream overlap for scalable LLM training.
-
Efficient Long-context Language Model Training by Core Attention Disaggregation - Score: 18 (R=10, N=8) - Date: 2025-10-22 - Comment: High Performance Computing: decouples core attention into dedicated servers (CAD/DistCA) to balance compute/memory and eliminate stragglers in distributed long-context training.
-
Accelerating Frontier MoE Training with 3D Integrated Optics - Score: 18 (R=10, N=8) - Date: 2025-10-21 - Comment: High Performance Computing: photonic 3D co-packaged optics to scale MoE training across racks; systems-level innovation enabling larger parallelism and faster training.
-
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs - Score: 18 (R=10, N=8) - Date: 2025-10-14 - Comment: High Performance Computing: novel tensor-compiler fusion for dependency-heavy reductions (e.g., attention) using algebraic corrections to boost locality and parallelism on GPUs.
-
MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: HPC/Distributed Training: multi-timescale adaptive optimizers with local updates reduce communication, with convergence guarantees.
-
Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving - Score: 18 (R=10, N=8) - Date: 2025-10-08 - Comment: High Performance Computing: system–hardware co-design (Mono3D DRAM + NMP) for MoE serving with tiered memory and expert-usage prediction.
-
TASP: Topology-aware Sequence Parallelism - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: High Performance Computing: topology-aware sequence parallelism that decomposes AlltoAll topology into orthogonal rings for communication-efficient attention.
-
Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Training Dynamics Theory: convergence rates and regimes for GD at edge of stability via manifold-based decomposition in overparameterized least squares.
-
MuonBP: Faster Muon via Block-Periodic Orthogonalization - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: High Performance Computing: distributed-friendly optimizer (block-periodic orthogonalization) reducing communication with theory and throughput gains.
-
FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: High Performance Computing: novel collective communication fabric aggregating NVLink, PCIe, and RDMA with adaptive load balancing; drop-in replacement for NCCL.
-
From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: High Performance Computing/Systems: MLIR-AIR compiler dialect orchestrates asynchronous, spatial scheduling for NPUs; efficient mapping of attention and matmul.
-
FairBatching: Fairness-Aware Batch Formation for LLM Inference - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: High Performance Computing/Systems: fairness-aware batching scheduler improves TTFT/TPOT and GPU utilization for LLM inference.
-
DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches High Performance Computing: dynamic context parallelism with fine-grained blockwise partitioning for long-context training, reducing communication and improving balance.
-
EA4LLM: A Gradient-Free Approach to Large Language Model Optimization via Evolutionary Algorithms - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: High Performance Computing/Training criterion: introduces a gradient-free evolutionary optimization method for training LLMs, enabling non-differentiable components and reducing hardware constraints; an algorithmic innovation for large-scale training.
-
Task-Level Insights from Eigenvalues across Sequence Models - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation learning and training dynamics: dynamical-systems eigenvalue analysis across attention and SSMs to link spectra with memory/long-range dependency and architectural effects.
-
Efficient Autoregressive Inference for Transformer Probabilistic Models - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Systems/architecture innovation: causal autoregressive buffer with cached context enables efficient joint sampling—cache/memory optimization for Transformer probabilistic models.
-
Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Strong match to HPC: a universal algorithm for distributed matrix multiplication across arbitrary partitionings/replication, improving systems support for large-scale training/inference.
-
SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: High Performance Computing: systems-level prefill/decode disaggregation with specialized hardware to optimize compute/memory utilization for LLM inference.
-
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: HPC + Efficiency: introduces a parallelism-compatible FlashAttention-2 JVP kernel enabling 10B+ model sCM training and proposes score-regularized continuous-time consistency distillation for few-step generation.
-
Lossless Vocabulary Reduction for Auto-Regressive Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Model efficiency/tokenization: lossless vocabulary reduction enabling smaller vocabularies and cross-tokenizer cooperation for AR LMs; strong alignment with efficiency and systems-level interoperability.
-
GCPO: When Contrast Fails, Go Gold - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: High-Performance/Distributed Training: stability-based generalization and excess error bounds for multi-gossip decentralized training; algorithmic insights into communication/training efficiency.
-
Geodesics in the Deep Linear Network - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Training dynamics/geometry: derives geodesics and ODEs in deep linear network geometry, offering theoretical insight into network optimization landscapes.
-
Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Training Dynamics Theory: links chaotic dynamics and symmetry-induced invariant subspaces to riddled basins, revealing limits to predictability.
-
OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: High Performance Computing: optimized pipeline-parallel scheduling jointly accounting for memory capacity, activation reuse, and bubble minimization.
-
Learning without Global Backpropagation via Synergistic Information Distillation - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: High Performance Computing: training without global backprop via local synergistic distillation to remove update locking and reduce activation memory, enabling parallel module updates.
-
Cache-to-Cache: Direct Semantic Communication Between Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Systems/Efficiency: KV-cache projection/fusion and gating enable direct inter-LLM communication, improving accuracy and reducing latency.
-
LoRAFusion: Efficient LoRA Fine-Tuning for LLMs - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: High Performance Computing and Efficiency: fused kernels for LoRA and adaptive multi-job scheduling for concurrent fine-tuning; systems-level innovation for PEFT.
-
SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Matches High Performance Computing: fine-grained slice-level packing and asymmetric forward/backward partitioning for balanced distributed LLM training.
-
Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: High Performance Computing: memory-efficient backpropagation enabling on-device fine-tuning of LLMs (<1GB), a systems-level memory optimization for training.
-
Distributed Low-Communication Training with Decoupled Momentum Optimization - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: High Performance Computing: reduces communication via decoupled momentum optimization and DCT-based momentum compression with infrequent syncs for distributed training.
-
KVComm: Enabling Efficient LLM Communication through Selective KV Sharing - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Compression/Efficiency and Systems: selective KV sharing based on attention-importance with Gaussian prior reduces inter-LLM communication while retaining performance.
-
Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box Reservoirs - Score: 16 (R=8, N=8) - Date: 2025-10-30 - Comment: Matches architecture/systems criteria by enabling training with black-box modules via Bounded Numerical Differentiation, supporting hybrid analogue–digital compositions.
-
SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Matches Efficiency/HPC theory: exact SHAP for tensor networks with polylog-time parallel algorithm (TT); insights for BNNs linking width to SHAP hardness.
-
Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Training Dynamics Theory: non-asymptotic convergence of SGLD in the lazy training (NTK) regime with finite-time/width bounds.
-
AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM - Score: 16 (R=8, N=8) - Date: 2025-10-22 - Comment: Matches High Performance/Systems Efficiency: parametric integration of billion-scale KGs into LLMs with sub-linear time/memory via KG2KV and HiKVP.
-
A Split-Client Approach to Second-Order Optimization - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches High Performance Computing criterion: proposes an algorithmic/system-level asynchronous split-client scheme for second-order training with provable wall-clock speedups, enabling practical large-scale optimization.
-
Optimal Scaling Needs Optimal Norm - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Matches High Performance Computing/training dynamics: discovers an operator-norm invariance governing optimal LR/batch scaling for LLM training and reports distributed Scion implementation and large-scale scaling rules.
-
TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Matches High Performance Computing: systems-level preemptive scheduling and proactive KV-cache memory management for LLM serving to improve responsiveness and throughput.
-
Rethinking Thinking Tokens: LLMs as Improvement Operators - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Inference efficiency and dynamic refinement: Parallel-Distill-Refine orchestrates bounded workspace and parallelism to improve accuracy-latency trade-offs (HPC/algorithmic efficiency; conditional computation).
-
Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Model efficiency/HPC: lossless parallel decoding for diffusion LLMs via draft-and-verify without extra forward passes; substantial inference speedup.
-
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Efficiency/HPC: adapts speculative decoding to RL training with dynamic tuning and drafter distillation for faster rollouts.
-
TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Model Architecture/Efficiency: linear RNN (GatedDeltaProduct) pre-trained synthetically with fully parallelizable training/inference.
-
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches High Performance Computing/Systems via CPU-based dynamic analysis to estimate peak GPU memory for DL training.
-
Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Matches HPC: demand-aware optical-network framework that overlaps reconfiguration with collective communication to accelerate distributed ML collectives.
-
MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: High Performance Computing: systems-level training pipeline (Megatron-Core) with near-linear multi-node scaling and efficiency optimizations for large video generation models.
-
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Model Compression and Efficiency: co-design of KV cache policies (eviction/recompute/refresh) with eDRAM for LLM serving; systems-level memory optimization for inference.
-
xLLM Technical Report - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: HPC/Systems: large-scale LLM inference framework with disaggregated prefill/decode, global KV cache management, and execution/memory pipeline optimizations.
-
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: HPC/distributed training for LLMs: asynchronous RL post-training system (fine-grained parallelism, rollout-train decoupling).
-
BioOSS: A Bio-Inspired Oscillatory State System with Spatio-Temporal Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture with a new bio-inspired oscillatory state system (BioOSS) capturing spatio-temporal propagation dynamics with trainable damping and speed parameters.
-
video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: HPC/Memory Optimization: fixed-budget streaming via test-time-training memory module (Hessian-free CG) and prompt-dependent memory retrieval for long-context audio-visual LLMs.
-
RepDL: Bit-level Reproducible Deep Learning Training and Inference - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: High Performance Computing/Systems: ensures deterministic, bitwise-reproducible training and inference via correct rounding and order-invariant floating-point computation across platforms.
-
Robust and Efficient Collaborative Learning - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches High Performance Computing criterion: decentralized, pull-based distributed training algorithm with O(n log n) communication.
-
Cocoon: A System Architecture for Differentially Private Training with Correlated Noises - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: High Performance Computing: hardware–software co-design (precomputed correlated DP noise, near-memory processing) to reduce training overheads for large models/embeddings.
-
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Inference Efficiency: training-free acceleration of parallel decoding in diffusion LLMs via Trace Credit accumulation and logit fusion.
-
From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: High Performance Computing: systems-level design for LLM serving on multi-core NPUs (tensor parallelism, core placement, memory management) to optimize inference throughput.
-
MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: High Performance Computing/Systems: colocated inference and fine-tuning with iteration-level scheduling and memory management to meet SLOs on edge GPUs.
-
Linear RNNs for autoregressive generation of long music samples - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Model Architecture: advances in linear RNN/state-space design plus context-parallelism enabling 1M-token training (systems-level efficiency).
-
DeMuon: A Decentralized Muon for Matrix Optimization over Graphs - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Decentralized optimization with orthogonalization (Newton–Schulz) and gradient tracking; systems-level advance for distributed training.
-
Generalized Parallel Scaling with Interdependent Generations - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: High Performance Computing/Systems-level inference—parallel scaling with interdependent generations via shared hidden-state tensors and small parameter overhead.
-
Exploring System 1 and 2 communication for latent reasoning in LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Model Architecture: studies dual-model latent communication vs unified forward-pass, analyzing representation and compute tradeoffs.
Representation Learning (252)
-
Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density - Score: 20.0 (R=0, N=0) - Date: 2025-10-08 - Comment: Author match
-
Language Models are Injective and Hence Invertible - Score: 19 (R=10, N=9) - Date: 2025-10-20 - Comment: Representation Learning: proves injectivity/invertibility of transformer LMs and provides an exact input reconstruction algorithm.
-
Pretrain-Test Task Alignment Governs Generalization in In-Context Learning - Score: 19 (R=10, N=9) - Date: 2025-10-01 - Comment: Representation Learning/Theory: exact analysis of ICL generalization via pretrain-test task alignment; predictive measure validated on Transformers.
-
Superposition disentanglement of neural representations reveals hidden alignment - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Representation Learning: examines superposition and alignment; uses sparse autoencoders to disentangle features and improve representational alignment metrics.
-
Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders - Score: 18 (R=10, N=8) - Date: 2025-10-06 - Comment: Representation Learning: uses sparse autoencoders to identify and steer code-correctness directions in LLM representations (mechanistic interpretability).
-
Self-Supervised Representation Learning as Mutual Information Maximization - Score: 18 (R=10, N=8) - Date: 2025-10-03 - Comment: Theoretical unification of self-supervised representation learning via MI; explains stop-gradient and predictor networks from first principles.
-
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features - Score: 18 (R=10, N=8) - Date: 2025-10-02 - Comment: Representation Learning and Sparsity: derives SAEs from proximal gradient unrolling and introduces AbsTopK (|·|-TopK) to recover bidirectional features under an ℓ0-inspired sparsity constraint.
-
A Generalized Information Bottleneck Theory of Deep Learning - Score: 18 (R=10, N=8) - Date: 2025-10-01 - Comment: Representation Learning Theory: introduces a Generalized Information Bottleneck using computable synergy/interaction information, explaining compression dynamics across CNNs/Transformers.
-
Deep sequence models tend to memorize geometrically; it is unclear why - Score: 18 (R=9, N=9) - Date: 2025-10-31 - Comment: Representation Learning: uncovers geometric memorization in deep sequence models with analysis linking to spectral bias; insights into training dynamics and embeddings.
-
Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation - Score: 18 (R=9, N=9) - Date: 2025-10-29 - Comment: Representation Learning / Training Dynamics: statistical physics analysis of multi-layer perceptron feature learning and phase transitions near interpolation.
-
A simple mean field model of feature learning - Score: 18 (R=9, N=9) - Date: 2025-10-20 - Comment: Representation Learning: mean-field theory of feature learning and phase transitions in finite-width networks.
-
LLMs Process Lists With General Filter Heads - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning: identifies causal, general-purpose ‘filter heads’ implementing a functional filtering operation across tasks.
-
Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning: token-level causal analysis of CLIP, identifying composition nonidentifiability and links to modality gaps.
-
Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning/Model Architecture: learns an adaptive manifold via metric tensor optimization (discrete differential geometry), a foundational framework beyond parameter tuning.
-
Contrastive Predictive Coding Done Right for Mutual Information Estimation - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Representation Learning: proposes InfoNCE-anchor for principled MI estimation and unifies contrastive objectives via proper scoring rules, clarifying what contrastive losses learn.
-
Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Representation Learning: shows emergent object binding in ViT embeddings, identifies a low-dimensional subspace guiding attention, and validates via causal ablations.
-
Eigenfunction Extraction for Ordered Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Framework for extracting ordered, identifiable eigenfunctions tied to contrastive/non-contrastive objectives—strong Representation Learning theory contribution leveraging low-rank and Rayleigh quotient ideas.
-
Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Representation Learning: formally separates perception from decision, defines representation-invariant perceptual metrics, and proves orthogonality to Bayes task-risk gradients.
-
From Memorization to Reasoning in the Spectrum of Loss Curvature - Score: 17 (R=9, N=8) - Date: 2025-10-29 - Comment: Representation Learning/Training Dynamics: disentangles memorization via loss-curvature-based weight decomposition and weight editing.
-
Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Representation Learning: augmentation-free SSL via orthonormal/overcomplete frame projections leveraging geometric biases.
-
Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Matches Representation Learning: learns identifiable equivariant embeddings from unlabeled group actions without inductive biases.
-
Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data - Score: 17 (R=9, N=8) - Date: 2025-10-28 - Comment: Theoretical analysis of Neural Collapse arising under gradient flow in two-layer ReLU networks (Representation learning/training dynamics).
-
Disentangled Representation Learning via Modular Compositional Bias - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Matches Representation Learning: modular compositional bias enabling disentanglement of attributes/objects without architecture/objective redesign.
-
From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD - Score: 17 (R=9, N=8) - Date: 2025-10-27 - Comment: Representation Learning: theoretical analysis of SGD dynamics showing learning-rate-induced phase transitions; introduces a two-timescale layer-wise training algorithm.
-
Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-24 - Comment: Representation Learning/Theory — derives a tight lower bound connecting JSD to KLD/MI, justifying discriminative MI objectives used in practice.
-
Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning - Score: 17 (R=9, N=8) - Date: 2025-10-24 - Comment: Representation Learning: diagnoses prototype collapse and proposes decoupled EM-updated prototypes to stabilize prototypical SSL training.
-
A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond - Score: 17 (R=9, N=8) - Date: 2025-10-23 - Comment: Representation Learning/Training Dynamics: derandomization lemma explaining structure discovery (low-rank) in neural networks under broad conditions.
-
Towards Identifiability of Hierarchical Temporal Causal Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Matches Representation Learning: identifiability of hierarchical temporal causal latents from conditionally independent observations with a variational generative model.
-
ActivationReasoning: Logical Reasoning in Latent Activation Spaces - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Matches Representation Learning: operationalizes logical reasoning and control in latent activation space using sparse autoencoder-derived concepts and rule application.
-
Extracting Rule-based Descriptions of Attention Features in Transformers - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Representation learning and transformer analysis: extracts rule-based descriptions of SAE attention features (skip-gram, absence, counting), providing mechanistic interpretability of transformer internals.
-
Generalization Below the Edge of Stability: The Role of Data Geometry - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Representation Learning/Training Dynamics: theoretical generalization analysis below the edge of stability, tied to data geometry, for overparameterized ReLU nets.
-
Measure-Theoretic Anti-Causal Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-22 - Comment: Matches Representation Learning: measure-theoretic anti-causal representation framework (ACIA) with interventional kernels and OOD generalization guarantees.
-
Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Representation Learning/Identifiability: introduces Jacobian Volume Maximization to identify nonlinear latent components without auxiliary signals or sparsity assumptions.
-
Algorithmic Primitives and Compositional Geometry of Reasoning in Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-21 - Comment: Matches Representation Learning/Mechanistic Interpretability: identifies and steers compositional activation primitives underlying LLM reasoning via function vectors.
-
On the Neural Feature Ansatz for Deep Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Matches Representation Learning: theoretical analysis of Neural Feature Ansatz and training dynamics across depth.
-
The Coverage Principle: How Pre-training Enables Post-Training - Score: 17 (R=9, N=8) - Date: 2025-10-20 - Comment: Matches Representation Learning/Training Dynamics: theory of coverage from next-token pretraining predicting downstream/post-training success with provable interventions.
-
Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Model Architecture and Representation Learning: input-adaptive recurrence, discrete bottleneck, and error-correction for OOD algorithmic generalization in Transformers.
-
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Representation Learning/Interpretability: analyzes activation differences post narrow finetuning; strong evidence of training traces and steering via diffs.
-
Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Representation Learning/Training Dynamics: theoretical analysis of Mamba’s in-context learning via nonlinear gating and test-time feature learning with sample complexity results.
-
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Matches Representation Learning: theoretical analysis of training dynamics for high-dimensional SGD/ASGD with moment and concentration guarantees.
-
Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors - Score: 17 (R=9, N=8) - Date: 2025-10-16 - Comment: Representation Learning/Autoencoders: replaces KL with MMD to enforce programmable priors for disentanglement and proposes an unsupervised Latent Predictability Score—directly advancing controllable latent structure.
-
Adversarial Attacks Leverage Interference Between Features in Superposition - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Provides mechanistic representation-learning insight via superposition explaining adversarial vulnerability (representation learning/training dynamics).
-
Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Unified framework for amortized learning (ICL, learned optimizers) with iterative amortized inference — Representation Learning/training dynamics and adaptation.
-
In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Representation Learning/Training Dynamics: finite-sample generalization theory for ICL in Transformers with risk decomposition and non-asymptotic bounds.
-
On the Optimal Representation Efficiency of Barlow Twins: An Information-Geometric Interpretation - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Information-geometric theory of Barlow Twins showing optimal representation efficiency via isotropic FIM — Representation Learning theory.
-
Understanding Self-supervised Contrastive Learning through Supervised Objectives - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Strongly matches Representation Learning by providing a theoretical formulation linking self-supervised contrastive objectives to supervised ones, yielding insights into InfoNCE and balanced contrastive losses.
-
Rademacher Meets Colors: More Expressivity, but at What Cost ? - Score: 17 (R=9, N=8) - Date: 2025-10-15 - Comment: Matches Representation Learning/Theory: links GNN expressivity (WL colorings) to Rademacher complexity, explaining generalization–expressivity trade-offs.
-
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Training Dynamics/Generalization: theoretical characterization of stochastic Adam’s generalization vs batch size and weight decay in overparameterized CNNs, aligning with representation learning theory.
-
Redundancy as a Structural Information Principle for Learning and Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Matches Representation Learning: introduces a theoretical redundancy framework unifying classical information measures and predicts generalization-optimal redundancy, validated with autoencoders.
-
The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation Learning/Architecture analysis: perturbation-based causal identification reveals ultra-sparse critical neurons and their layerwise localization governing language ability.
-
Geodesic Calculus on Latent Spaces - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Geometric representation learning: Riemannian calculus on autoencoder latent manifolds (implicit submanifolds), with learned projection and geodesic/exponential map computations.
-
PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation Learning: product of hyperbolic factors with an $\ell_1$-product metric to jointly capture hierarchy and compositionality in embeddings.
-
On the Alignment Between Supervised and Self-Supervised Contrastive Learning - Score: 17 (R=9, N=8) - Date: 2025-10-14 - Comment: Representation Learning: proves representation-level alignment between self-supervised contrastive learning and negatives-only supervised contrastive learning with high-probability bounds (CKA/RSA).
-
On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Representation Learning criterion: introduces uniformly scaling flows linking Deep SVDD and normalizing flows, preventing collapse and tightening likelihood–latent norm alignment.
-
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry - Score: 17 (R=9, N=8) - Date: 2025-10-13 - Comment: Representation Learning: dictionary/SAE-based interpretability of DINOv2 and a new Minkowski Representation Hypothesis about concept geometry in ViTs.
-
Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Representation Learning/Training Dynamics: theoretical bounds for attention-only transformers and mechanisms (dropout, EMA) that improve length generalization.
-
Base Models Know How to Reason, Thinking Models Learn When - Score: 17 (R=9, N=8) - Date: 2025-10-10 - Comment: Representation Learning/Training Dynamics: causal elicitation of latent reasoning mechanisms in base models and analysis of when vs how reasoning is deployed; foundational interpretability insight.
-
Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Matches Representation Learning/training dynamics: theoretical condition for near-independent outputs in wide nets via zero-mean activations, informing architectural design.
-
The Effect of Label Noise on the Information Content of Neural Representations - Score: 17 (R=9, N=8) - Date: 2025-10-09 - Comment: Representation Learning: analyzes information content of hidden representations and training dynamics under label noise using an information-theoretic proxy.
-
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Training dynamics/representation insight: shows that context length alone degrades LLM performance even with perfect retrieval; proposes a simple mitigation that reduces the effective context length.
-
Computing frustration and near-monotonicity in deep neural networks - Score: 17 (R=9, N=8) - Date: 2025-10-08 - Comment: Representation Learning: analyzes trained DNNs via signed-graph frustration to reveal near-monotonic structure and implicit regularization.
-
Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Representation Learning: proves affine identifiability for nonlinear CCA under latent priors, with whitening necessity and finite-sample convergence guarantees.
-
On the Limitations and Capabilities of Position Embeddings for Length Generalization - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Model Architecture (Transformers): theoretical analysis of position embeddings for length generalization (LRC/SRC) plus a learning-based PE framework and scale hints.
-
What Scales in Cross-Entropy Scaling Law? - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Representation Learning/Training Dynamics: theoretical decomposition of cross-entropy into error-entropy/self-alignment/confidence, identifying error-entropy as the true scaling component.
-
Understanding the Role of Training Data in Test-Time Scaling - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Representation Learning/Training Dynamics: theoretical analysis of test-time scaling for transformers, linking training data properties to benefits of long chain-of-thought.
-
Decrypt Modality Gap in Multimodal Contrastive Learning: From Convergent Representation to Pair Alignment - Score: 17 (R=9, N=8) - Date: 2025-10-07 - Comment: Matches Representation Learning: first theoretical framework explaining modality gap in multimodal contrastive learning via dimension collapse and alignment theory.
-
Topological Invariance and Breakdown in Learning - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Representation Learning/Training Dynamics — architecture-agnostic theory showing topology-preserving vs. simplifying phases in learning governed by the learning rate.
-
Unraveling Syntax: How Language Models Learn Context-Free Grammars - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Representation Learning/Training Dynamics: theoretical and empirical study of how transformers learn PCFGs, with recursive loss/KL formulae and subgrammar pretraining effects.
-
Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models - Score: 17 (R=9, N=8) - Date: 2025-10-06 - Comment: Representation learning/training dynamics: controlled study of arbitration between parametric and in-context knowledge in Transformers.
-
Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Model Architecture/Representation Learning: implicit energy-based model learning an equilibrium gradient with optimization-driven sampling and adaptive compute—foundational alternative to diffusion/flow.
-
Precise Dynamics of Diagonal Linear Networks: A Unifying Analysis by Dynamical Mean-Field Theory - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Matches Representation Learning: theoretical analysis of gradient-flow dynamics in diagonal linear networks via Dynamical Mean-Field Theory.
-
Posterior Collapse as a Phase Transition in Variational Autoencoders - Score: 17 (R=9, N=8) - Date: 2025-10-03 - Comment: Representation Learning: theoretical analysis of VAEs’ training dynamics, framing posterior collapse as a phase transition with a critical boundary.
-
Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Model Architecture and Representation Learning: mechanistic analysis of MLP activation distributions and an inference-time activation redistribution module (ARM) that improves reasoning.
-
Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space? - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Representation/Architecture Analysis: introduces spectral utilization diagnostics (hard/soft rank, concentration, SUI) revealing FFN latent-space scaling laws.
-
Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity - Score: 17 (R=9, N=8) - Date: 2025-10-02 - Comment: Representation Learning and Training Dynamics: provides a mathematical framework for loss of plasticity, identifying frozen units and cloned-unit manifolds and linking to low-rank/simplicity biases.
-
Estimating Dimensionality of Neural Representations from Finite Samples - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning: bias-corrected estimator of neural manifold dimensionality robust to finite samples and noise, applicable to networks and brain data.
-
Muon Outperforms Adam in Tail-End Associative Memory Learning - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Training dynamics/Representation Learning: theoretical and empirical analysis of optimizer behavior in LLMs via an associative memory lens, explaining isotropy and tail-class learning advantages.
-
A Unified Probabilistic Framework for Dictionary Learning with Parsimonious Activation - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning: dictionary learning with a parsimonious (row-sparse) activation prior, grounded in a Bayesian framework for sparsity.
-
How Does Preconditioning Guide Feature Learning in Deep Neural Networks? - Score: 17 (R=9, N=8) - Date: 2025-10-01 - Comment: Representation Learning: theory linking preconditioner-induced Gram metric to spectral bias and generalization.
-
Compositional Symmetry as Compression: Lie Pseudogroup Structure in Algorithmic Agents - Score: 17 (R=8, N=9) - Date: 2025-10-15 - Comment: Representation Learning: theoretical framework linking compositional symmetry/equivariance to manifold reductions and predictive coding, offering principles for compressive, hierarchical representations.
-
IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning - Score: 16 (R=9, N=7) - Date: 2025-10-30 - Comment: Model Architecture (Normalization) and Representation Learning: IB-inspired normalization controlling task-relevant information with theory on IB value and generalization.
-
A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning - Score: 16 (R=9, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Training dynamics: theoretical framework quantifying ICL benefits of pre-training and context length (transformer setting).
-
H-SPLID: HSIC-based Saliency Preserving Latent Information Decomposition - Score: 16 (R=9, N=7) - Date: 2025-10-24 - Comment: Representation learning: HSIC-based latent decomposition into salient/non-salient subspaces with theory linking robustness and compression.
-
How Do LLMs Use Their Depth? - Score: 16 (R=9, N=7) - Date: 2025-10-22 - Comment: Representation Learning: layer-wise analysis revealing a 'guess-then-refine' computation pattern across depth in LLMs, informing efficient use of layers.
-
CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions - Score: 16 (R=9, N=7) - Date: 2025-10-17 - Comment: Representation Learning/Interpretability: probe-free spectral analysis (transformation matrix estimation, CKA) to characterize transformer layer functions.
-
A Function Centric Perspective On Flat and Sharp Minima - Score: 16 (R=9, N=7) - Date: 2025-10-16 - Comment: Training dynamics/Representation: function-centric analysis of sharpness vs generalization, showing sharper minima under regularization can generalize better—insight into loss landscape geometry.
-
Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training - Score: 16 (R=9, N=7) - Date: 2025-10-14 - Comment: Matches Representation Learning with Sparse Autoencoders: proposes Adaptive Temporal Masking to reduce feature absorption and stabilize SAE training.
-
Memory Retrieval and Consolidation in Large Language Models through Function Tokens - Score: 16 (R=9, N=7) - Date: 2025-10-10 - Comment: Representation Learning: proposes the function token hypothesis with evidence on how function tokens retrieve features and drive memory consolidation in LLMs.
-
Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Representation Learning: introduces cumulant-expansion probes of softmax entropy to quantify higher-order feature-learning dynamics across layers and training.
-
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders - Score: 16 (R=9, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning and Sparse methods: analyzes Sparse Autoencoders’ interpretability vs. steering utility and proposes Delta Token Confidence for feature selection.
-
Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing - Score: 16 (R=9, N=7) - Date: 2025-10-06 - Comment: Representation learning: activation-space attribution with representation gradient tracing to link outputs to training data.
-
How Do Language Models Compose Functions? - Score: 16 (R=9, N=7) - Date: 2025-10-03 - Comment: Representation Learning: mechanistic analysis of compositionality in LLMs via logit-lens, identifying processing pathways and linking them to embedding space geometry.
-
Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Representation Learning: SMDS method to discover and analyze feature manifolds in LLM latent space.
-
Feature Identification via the Empirical NTK - Score: 16 (R=9, N=7) - Date: 2025-10-02 - Comment: Representation Learning/Training dynamics: empirical NTK eigenanalysis surfaces learned features and tracks grokking phase changes.
-
Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability - Score: 16 (R=8, N=8) - Date: 2025-10-31 - Comment: Representation Learning: analyzes how Transformers learn PRNG structure; scaling laws, curriculum necessity, and interpretable embeddings.
-
From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning - Score: 16 (R=8, N=8) - Date: 2025-10-30 - Comment: Matches representation learning criterion with a theoretical analysis of feature learning and training dynamics (weak-to-strong generalization) in CNNs.
-
How do simple rotations affect the implicit bias of Adam? - Score: 16 (R=8, N=8) - Date: 2025-10-29 - Comment: Representation Learning / Training Dynamics: analyzes Adam’s implicit bias under rotations and uses an equivariant reparameterization to restore rotation invariance.
-
From Black-box to Causal-box: Towards Building More Interpretable Models - Score: 16 (R=8, N=8) - Date: 2025-10-28 - Comment: Model Architecture/Representation Learning: framework for causally interpretable architectures enabling counterfactual queries with formal criteria.
-
How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime? - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Matches Representation Learning/Training Dynamics Theory: proves label-noise gradient descent suppresses noise memorization and improves generalization in low SNR.
-
Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Model Architecture/Representation Learning: local Riemannian approach addressing oversmoothing/oversquashing with theoretical guarantees for deep MPNNs.
-
On the Impossibility of Retrain Equivalence in Machine Unlearning - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: theoretical impossibility result for retrain equivalence in multi-stage training, highlighting path dependence of local unlearning.
-
Symmetry and Generalisation in Neural Approximations of Renormalisation Transformations - Score: 16 (R=8, N=8) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: analyzes symmetry constraints and expressivity in MLPs/GNNs for learning RG transformations, with theoretical and empirical insights.
-
Sequence Modeling with Spectral Mean Flows - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches Model Architecture and Representation Learning: operator-theoretic sequence model with spectral tensor-network decomposition and flow matching.
-
Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Representation learning/training dynamics: uses cross-layer sparse autoencoders to extract latent rules and introduces SAL to quantify soundness-aware internal distributions predicting reasoning potential.
-
To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Model Architecture/Analysis: theoretical limits of SSMs and tool-augmented design enabling length generalization for reasoning tasks.
-
Programmatic Representation Learning with Language Models - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Representation Learning: programmatic feature synthesis with decision trees (LeaPR), offering interpretable, non-neural predictors learned via LLM-synthesized code.
-
When Flatness Does (Not) Guarantee Adversarial Robustness - Score: 16 (R=8, N=8) - Date: 2025-10-17 - Comment: Representation Learning/Training Dynamics: formal analysis linking flat minima to local adversarial robustness and geometry of loss landscapes.
-
Cautious Weight Decay - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Matches Representation Learning: optimization/training dynamics innovation (Cautious Weight Decay) as a drop-in modification to standard optimizers.
-
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing - Score: 16 (R=8, N=8) - Date: 2025-10-16 - Comment: Representation Learning: targeted editing of hidden representations with a learned value function for precise attribute intensity control.
-
Scaling Language-Centric Omnimodal Representation Learning - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Matches Representation Learning: analyzes emergent cross-modal alignment in MLLMs and proposes a language-centric embedding framework with a scaling law linking generative and representation quality.
-
Do LLMs "Feel"? Emotion Circuits Discovery and Control - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Representation Learning: identifies context-agnostic emotion directions and causal neuron/attention-head circuits that implement and control emotional expression in LLMs.
-
Verifying Chain-of-Thought Reasoning via Its Computational Graph - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Representation Learning/Training Dynamics: white-box verification via computational (attribution) graphs to diagnose and fix CoT reasoning, offering causal insights into latent circuits.
-
On the Implicit Adversariality of Catastrophic Forgetting in Deep Continual Learning - Score: 16 (R=8, N=8) - Date: 2025-10-14 - Comment: Representation Learning: theoretical analysis of catastrophic forgetting via low-rank bias and gradient alignment; introduces backGP to mitigate alignment from backward propagation.
-
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Representation Learning: unpaired multimodal training with shared parameters; theory under linear assumptions showing unimodal gains from auxiliary modalities.
-
Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Matches Representation Learning/training dynamics: shows width expansion enables linear mode connectivity without permutations; introduces LEWC explanation.
-
Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Representation Learning/Training Dynamics: introduces Rényi-sharpness tied to Hessian spectra with generalization bounds and a new SAM-style regularizer (RSAM).
-
Beyond independent component analysis: identifiability and algorithms - Score: 16 (R=8, N=8) - Date: 2025-10-10 - Comment: Representation Learning: identifiability theory beyond ICA (pairwise mean independence) with an algebraic recovery algorithm.
-
Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization - Score: 16 (R=8, N=8) - Date: 2025-10-09 - Comment: Representation Learning/Training Theory: Hellinger localization framework yields near instance-optimal MLE rates for multi-trajectory sequential models, including linear-attention sequence models.
-
Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models - Score: 16 (R=8, N=8) - Date: 2025-10-08 - Comment: Strong match to Representation Learning: proposes a framework to trace internal representations, identifies a commitment layer and dual-pathway mechanism underlying hallucinations in Transformers.
-
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Matches Representation Learning/training dynamics: a variance-optimized preference optimization method with theory for aligning large reasoning models.
-
How does the optimizer implicitly bias the model merging loss landscape? - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Representation Learning/Training Dynamics: shows how optimizer-induced effective noise shapes the global loss landscape and predicts model merging success.
-
Sharp Lower Bounds for Linearized ReLU^k Approximation on the Sphere - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Model Architecture / Representation Learning: theoretical saturation bounds for linearized shallow ReLU^k networks, analyzing approximation capacity of the architecture.
-
Decision Potential Surface: A Theoretical and Practical Approximation of LLM's Decision Boundary - Score: 16 (R=8, N=8) - Date: 2025-10-07 - Comment: Representation Learning: introduces Decision Potential Surface to approximate LLM decision boundaries with provable error bounds via K-sampling.
-
Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Training Dynamics: optimal generalization rates for GD on deep ReLU networks via control of activation patterns and sharper Rademacher bounds.
-
Learning Multi-Index Models with Hyper-Kernel Ridge Regression - Score: 16 (R=8, N=8) - Date: 2025-10-06 - Comment: Representation Learning/Theory: HKRR provides sample complexity guarantees for compositional multi-index models, bridging kernels and neural approaches to overcome curse of dimensionality.
-
Uniform-in-time convergence bounds for Persistent Contrastive Divergence Algorithms - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Representation Learning/Training Dynamics: theoretical uniform-in-time convergence bounds for PCD in EBMs with an efficient continuous-time SDE formulation and stable S-ROCK integrators.
-
Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Representation learning insight: analyzes how latent geometry vs shared data-space affects adversarial transfer with theory and experiments.
-
Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Training Dynamics/Representation Theory: generalization theory in overparameterized spiked regression, classifying benign vs. catastrophic overfitting.
-
To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking - Score: 16 (R=8, N=8) - Date: 2025-10-03 - Comment: Representation Learning: proposes a metric for distributional symmetry-breaking and theory showing when equivariant methods can underperform.
-
On the Benefits of Weight Normalization for Overparameterized Matrix Sensing - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Training dynamics/optimization analysis of weight normalization showing faster convergence in overparameterized matrix sensing (Representation Learning / Model Architecture analysis).
-
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Model Architecture and Representation Learning: introduces a deterministic Manifold-Probabilistic Projection Model unifying geometric manifold structure with kernel-based probabilistic modeling, reinterpreting diffusion as projection.
-
Nonparametric Identification of Latent Concepts - Score: 16 (R=8, N=8) - Date: 2025-10-02 - Comment: Representation Learning: provides a nonparametric identifiability theory for latent concepts from multi-class observations, offering foundational guarantees on recovering representations.
-
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Representation Learning: analysis of emergent visual priors from language pretraining with scaling trends and data-centric pretraining recipe.
-
SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Representation Learning/Training Dynamics: reveals persistent initialization-dependent fingerprints in LLMs across training.
-
Test time training enhances in-context learning of nonlinear functions - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Training dynamics/Representation Learning: theory for test-time training combined with ICL in transformers, showing adaptation to task-specific link functions and features.
-
Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region - Score: 16 (R=8, N=8) - Date: 2025-10-01 - Comment: Representation Learning/Training Dynamics: theoretical analysis of gradient descent in matrix factorization, identifying critical step sizes and chaotic/fractal convergence behavior.
-
Clone Deterministic 3D Worlds with Geometrically-Regularized World Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: geometric regularization to shape latent manifold topology for robust world-model rollouts.
-
Unravelling the Mechanisms of Manipulating Numbers in Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: probes numerical information processing in LLMs, yielding universal probes and layer-wise mechanism insights.
-
Likely Interpolants of Generative Models - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: principled interpolation scheme for generative models via likely transition paths with Riemannian-geodesic interpretation, no retraining required.
-
Angular Steering: Behavior Control via Rotation in Activation Space - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: activation-space steering via geometric rotation (and adaptive variant) to control LLM behaviors.
-
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: uses sparse autoencoders to learn interpretable latent features of human preference data for analysis and curation.
-
ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts - Score: 15 (R=8, N=7) - Date: 2025-10-31 - Comment: Representation Learning: uses Sparse Autoencoders on foundation-model features to discover disentangled concepts and dataset bias.
-
Mechanistic Interpretability of RNNs emulating Hidden Markov Models - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning/Mechanistic Interpretability: reverse-engineers RNNs emulating HMMs, uncovering structured dynamics and connectivity enabling probabilistic computation.
-
Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning/Training Dynamics: theoretical bifurcation analysis of shallow networks with tunable leaky ReLU revealing symmetry-breaking and landscape structure.
-
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning: causal analysis of which CoT steps actually influence predictions; identifies and steers a latent 'TrueThinking' direction in LLM representation space.
-
Confidence is Not Competence - Score: 15 (R=8, N=7) - Date: 2025-10-30 - Comment: Representation Learning: geometric analysis of LLM internal states revealing separable assessment/execution manifolds.
-
Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Representation Learning: leverages internal correlation-matrix rank as a self-indicator to verify reasoning paths without external verifiers.
-
Debiasing Reward Models by Representation Learning with Guarantees - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Representation Learning: identifies non-spurious latent variables and trains reward models on them with identifiability guarantees to mitigate spurious correlations.
-
VIKING: Deep variational inference with stochastic projections - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Representation Learning/Training Dynamics: variational family reflecting network reparametrization for fully correlated posteriors; foundational approximate Bayesian inference for deep nets.
-
Monotone and Separable Set Functions: Characterizations and Neural Models - Score: 15 (R=8, N=7) - Date: 2025-10-29 - Comment: Model Architecture/Representation Learning: characterizes monotone-and-separating set functions and proposes neural models preserving set-containment order with universality.
-
Manifold Approximation leads to Robust Kernel Alignment - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Representation Learning: manifold-aware kernel alignment (MKA) provides a more robust representation similarity metric than CKA.
-
Scaling Non-Parametric Sampling with Representation - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Representation Learning with a simple non-parametric generative model and mechanistic analysis of image structure.
-
Probing Neural Combinatorial Optimization Models - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Interpretability: probing (CS-Probing) to analyze internal representations and inductive biases in NCO models.
-
Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Representation Learning by characterizing a low-dimensional emotional manifold in LLM hidden states and controllable interventions.
-
Mechanistic Interpretability for Neural TSP Solvers - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Representation Learning/Interpretability: activation-level analysis with sparse autoencoders reveals interpretable features in Transformer TSP solvers.
-
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set - Score: 15 (R=8, N=7) - Date: 2025-10-28 - Comment: Matches Representation Learning via sparse autoencoders to interpret and enhance vision-language alignment at a concept level.
-
On Uncertainty Calibration for Equivariant Functions - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Representation/Architecture Analysis: theoretical bounds linking equivariance properties to uncertainty calibration (ECE/ENCE) in models.
-
Correlation Dimension of Auto-Regressive Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Representation Learning: introduces correlation-dimension metric to quantify long-range structural complexity and generative dynamics in autoregressive LLMs.
-
Model Merging with Functional Dual Anchors - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Matches Representation Learning/Training Dynamics: proposes a new model-merging framework in input-representation space (Functional Dual Anchors) for foundation models, improving post-hoc integration efficiency.
-
Neural Mutual Information Estimation with Vector Copulas - Score: 15 (R=8, N=7) - Date: 2025-10-27 - Comment: Representation Learning: proposes a neural mutual information estimator using vector copulas to balance capacity and data efficiency.
-
Context-level Language Modeling by Learning Predictive Context Embeddings - Score: 15 (R=8, N=7) - Date: 2025-10-24 - Comment: Representation Learning: introduces a next-context prediction objective to learn predictive context embeddings and improve long-range modeling with minimal overhead.
-
IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks - Score: 15 (R=8, N=7) - Date: 2025-10-24 - Comment: Representation Learning: integrates Information Bottleneck into GANs with an intermediate stochastic bottleneck to induce disentangled factors.
-
Understanding the Implicit Biases of Design Choices for Time Series Foundation Models - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Representation Learning: analyzes implicit inductive biases/training dynamics of TSFMs (patching, embeddings, objectives) with theory and controlled evaluations.
-
Weight Decay may matter more than muP for Learning Rate Transfer in Practice - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Representation Learning/Training Dynamics: analyzes learning-rate transfer across widths, highlighting weight decay vs muP scaling.
-
Category learning in deep neural networks: Information content and geometry of internal representations - Score: 15 (R=8, N=7) - Date: 2025-10-23 - Comment: Representation Learning: information-theoretic and Fisher-geometry analysis of category learning shaping internal representations.
-
SO(3)-invariant PCA with application to molecular data - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning: SO(3)-invariant PCA that accounts for all rotations efficiently via algebraic structure, reducing covariance complexity.
-
Approximation Rates of Shallow Neural Networks: Barron Spaces, Activation Functions and Optimality Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning Theory: approximation rates in Barron spaces and limits of ReLU^k shallow networks.
-
NTKMTL: Mitigating Task Imbalance in Multi-Task Learning from Neural Tangent Kernel Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning/Training Dynamics: NTK-based spectral balancing to mitigate task imbalance in multi-task learning.
-
Rethinking PCA Through Duality - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Representation Learning/Theory: new DC formulations and kernelizable dual PCA linked to self-attention; optimization perspective on PCA algorithms.
-
Gradient Variance Reveals Failure Modes in Flow-Based Generative Models - Score: 15 (R=8, N=7) - Date: 2025-10-22 - Comment: Matches Representation Learning/Training Dynamics: theoretical and empirical analysis of rectified flows showing gradient-variance-driven memorization and failure modes.
-
Mapping Post-Training Forgetting in Language Models at Scale - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Training Dynamics/Representation Retention: sample-wise metrics mapping forgetting and backward transfer across post-training stages and scales.
-
Atlas-based Manifold Representations for Interpretable Riemannian Machine Learning - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning: learns a differentiable atlas for latent manifolds enabling Riemannian optimization and interpretable representations.
-
Local properties of neural networks through the lens of layer-wise Hessians - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Representation Learning/Training Dynamics: layer-wise Hessian spectral analysis links geometry to generalization and expressivity.
-
Model Metamers Reveal Invariances in Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning: introduces model metamers for GNNs to probe and quantify learned invariances, with theoretical analysis of metamer manifolds.
-
DFNN: A Deep Fréchet Neural Network Framework for Learning Metric-Space-Valued Responses - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Matches Model Architecture/Representation Learning: proposes deep Fréchet neural networks with a universal approximation theorem for metric-space-valued outputs.
-
Memorizing Long-tail Data Can Help Generalization Through Composition - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning/Theory: shows how memorizing long-tail data can aid generalization via composition, with linear theory and neural experiments.
-
Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: controlled synthetic testbed analyzing how pretraining diversity and contextual structure affect OOD factual generalization; identifies optimization bottlenecks.
-
Breaking Memorization Barriers in LLM Code Fine-Tuning via Information Bottleneck for Improved Generalization - Score: 15 (R=8, N=7) - Date: 2025-10-21 - Comment: Representation Learning/Training Dynamics: information bottleneck-regularized fine-tuning to reduce memorization and improve generalization in code LLMs.
-
Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Representation Learning criterion: introduces a theoretically grounded similarity (PMI in RKHS) for contrastive multi-modal models like CLIP, analyzing and improving the underlying representation/metric.
-
Particle Dynamics for Latent-Variable Energy-Based Models - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Representation Learning: latent-variable energy-based models with Wasserstein gradient flow training and convergence guarantees.
-
Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Representation Learning: analyzes feature geometry/normalization for OOD and introduces radially scaled l2 normalization.
-
From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons - Score: 15 (R=8, N=7) - Date: 2025-10-20 - Comment: Matches Representation Learning/Architecture: geometry-aware initialization for sigmoidal MLPs via tropical perspective.
-
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: analyzes tokenizer–grammar misalignment and layer-wise embedding effects in code LLMs.
-
Circuit Insights: Towards Interpretability Beyond Activations - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: mechanistic interpretability beyond activations (WeightLens/CircuitLens) to analyze features and circuits from weights and interactions.
-
Predicting Task Performance with Context-aware Scaling Laws - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation/Training Dynamics: proposes context-aware scaling laws linking downstream performance to compute and context length.
-
Provable Unlearning with Gradient Ascent on Two-Layer ReLU Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning/Training Dynamics: theoretical analysis of gradient-ascent unlearning in linear and two-layer ReLU nets with new success criterion.
-
Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: unsupervised Hebbian-style learning with structural projection and orthogonality constraints for feature learning.
-
Semantic representations emerge in biologically inspired ensembles of cross-supervising neural networks - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: biologically inspired cross-supervising ensembles yield decodable semantic representations with sparse inter-network connectivity.
-
Readability ≠ Learnability: Rethinking the Role of Simplicity in Training Small Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-17 - Comment: Representation Learning: analyzes training dynamics and shows statistical simplicity (n-gram diversity) predicts SLM learnability/coherence.
-
Learning Latent Energy-Based Models via Interacting Particle Langevin Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Representation Learning: introduces an interacting particle Langevin dynamics algorithm with convergence guarantees for learning latent energy-based models (training dynamics).
-
Influence Dynamics and Stagewise Data Attribution - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Representation Learning: analyzes training dynamics via stagewise data attribution grounded in singular learning theory, linking influence shifts to semantic hierarchy development.
-
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance - Score: 15 (R=8, N=7) - Date: 2025-10-16 - Comment: Matches Representation Learning: analyzes robustness of internal truthfulness representations under semantics-preserving perturbations.
-
Discursive Circuits: How Do Language Models Understand Discourse Relations? - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Representation Learning: circuit discovery via activation patching identifies sparse transformer subgraphs responsible for discourse relations.
-
Test-Time Adaptation by Causal Trimming - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Representation Learning by identifying and trimming non-causal representation components via augmentation-induced variance and PCA at test time; efficient adaptation without label supervision.
-
Topological Alignment of Shared Vision-Language Embedding Space - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Representation Learning: topology-aware cross-modal alignment using persistent homology with theoretical error bounds via graph sparsification.
-
Multitask Learning with Learned Task Relationships - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Representation Learning/Architecture: learns task relationships via a Gaussian Markov Random Field precision matrix jointly with local models; includes theoretical analysis.
-
Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Representation Learning/Training Dynamics: explains positional bias (“lost in the middle”) via retrieval demands and attention dynamics in LLMs.
-
An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Representation Learning/Training Dynamics via a principled non-Euclidean gradient descent view of optimizers, introducing robust variants (MuonMax) and momentum integration (Momo).
-
The Geometry of Reasoning: Flowing Logics in Representation Space - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Representation Learning: geometric, representation-space analysis of LLM reasoning flows (training dynamics and embedding geometry).
-
Scaling Laws and Symmetry, Evidence from Neural Force Fields - Score: 15 (R=8, N=7) - Date: 2025-10-15 - Comment: Matches Model Architecture and Representation Learning: empirical scaling-law analysis showing equivariant architectures and higher-order representations yield better scaling exponents.
-
PIXEL: Adaptive Steering Via Position-wise Injection with eXact Estimated Levels under Subspace Calibration - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Representation Learning/Model Architecture: activation steering with learned property-aligned subspaces and position-wise injection with closed-form strength selection.
-
ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Model Compression and Efficiency + Representation Learning: bridges KG embeddings and LLMs via residual vector quantization to create learnable code tokens, enabling structured–contextual fusion.
-
QuIRK: Quantum-Inspired Re-uploading KAN - Score: 15 (R=8, N=7) - Date: 2025-10-14 - Comment: Model Architecture: introduces a new KAN variant replacing B-splines with quantum-inspired single-qubit re-uploading units, reducing parameters while retaining interpretability.
-
On the Representations of Entities in Auto-regressive Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Representation Learning criterion: introduces Entity Lens to reconstruct multi-token entity mentions from internal hidden states (task vectors), probing how LLMs encode entities.
-
Sparse components distinguish visual pathways & their alignment to neural networks - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Representation Learning: introduces sparse component decomposition and Sparse Component Alignment to probe and compare latent axes of brain and DNN representations.
-
Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Matches Representation Learning: proposes and analyzes in-process structure-aware encoding for LLM embeddings (including parallel caching vs sequential concatenation) with insights into how structural relations are encoded.
-
Deep Multimodal Subspace Clustering Networks - Score: 15 (R=8, N=7) - Date: 2025-10-13 - Comment: Model Architecture/Representation Learning: autoencoder with a self-expressive layer for unsupervised multimodal subspace clustering, comparing early/late/intermediate fusion and shared affinity.
-
To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Representation Learning: analyzes and exploits ViT attention-sink tokens to improve information flow from vision encoder to LLM.
-
On the Relationship Between the Choice of Representation and In-Context Learning - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Representation Learning: isolates effects of representation choice vs. in-context learning capacity; optimization to enumerate label representations with systematic analysis.
-
Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches the High Performance Computing criterion: theoretical analysis of decentralized distributed training (multi-gossip steps) via stability-based generalization bounds, detailing effects of topology, heterogeneity, and learning rate on training.
-
HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Representation Learning/Training Theory: proposes similarity-weighted fine-tuning bounds and manifold denoising guarantees for domain-adapted LLMs.
-
Vocabulary embeddings organize linguistic structure early in language model training - Score: 15 (R=8, N=7) - Date: 2025-10-10 - Comment: Matches Representation Learning: empirical analysis of how input/output embeddings organize semantic/syntactic structure early in LLM training (training dynamics insights).
-
Angular Constraint Embedding via SpherePair Loss for Constrained Clustering - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Representation Learning: proposes a geometric angular embedding (SpherePair loss) with theoretical guarantees, decoupling representation learning from clustering.
-
Chem-NMF: Multi-layer $\alpha$-divergence Non-Negative Matrix Factorization for Cardiorespiratory Disease Clustering, with Improved Convergence Inspired by Chemical Catalysts and Rigorous Asymptotic Analysis - Score: 15 (R=8, N=7) - Date: 2025-10-09 - Comment: Matches Representation Learning and Low-rank methods: multi-layer α-divergence NMF with a convergence-stabilizing scheme and rigorous asymptotic analysis.
-
Analyzing the Effect of Embedding Norms and Singular Values to Oversmoothing in Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Training Dynamics: introduces MASED metric with bounds and a regularization scheme (G-Reg) to mitigate oversmoothing in deep GNNs.
-
Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Learning Theory/Optimization: data-dependent generalization bounds for Gibbs and Langevin algorithms in the overparameterized interpolation regime.
-
Probing the Difficulty Perception Mechanism of Large Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Training Dynamics: probes internal representations to linearly decode difficulty and identifies specific attention heads responsible for difficulty perception.
-
Revisiting Long-context Modeling from Context Denoising Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Matches Representation Learning/Training Dynamics: context denoising training that uses Integrated Gradients (IG)-based noise detection to improve attention in long-context LLMs.
-
Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Architecture: unified probabilistic sidechannel model with a new Sidechannel Independence Score and SIS regularization to control the accuracy–interpretability trade-off.
-
On the Theory of Continual Learning with Gradient Descent for Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning/Training Dynamics: theoretical bounds on forgetting for continual learning in neural networks trained by gradient descent.
-
Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Matches: Model Architecture and Representation Learning — a self-supervised latent dynamics architecture jointly learning recognition and motion representations.
-
Approximate Gaussianity Beyond Initialisation in Neural Networks - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: Representation Learning: analyzes weight distributions during training via permutation-invariant Gaussian matrix models and tracks dynamics with Wasserstein distance.
-
Learning to Interpret Weight Differences in Language Models - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/interpretability: trains models to describe finetuning-induced weight diffs via adapters, enabling natural-language explanations of parameter changes.
-
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/Training Dynamics: proposes egalitarian gradient descent to equalize learning across principal directions, offering insights into grokking dynamics.
-
Learning Linear Regression with Low-Rank Tasks in-Context - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Representation Learning: theoretical analysis of in-context learning with a linear attention model on low-rank task distributions, characterizing prediction distributions, implicit regularization, and phase transitions in generalization.
-
GRACE: Generative Representation Learning via Contrastive Policy Optimization - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Representation Learning—treats contrastive signals as rewards over generated rationales to train embedding-capable LLMs.
-
Internal states before wait modulate reasoning patterns - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/mechanistic interpretability: identifies latent features that modulate ‘wait’ tokens and causally links them to reasoning patterns in transformers.
-
Why Cannot Neural Networks Master Extrapolation? Insights from Physical Laws - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning theory: formal analysis of extrapolation limits in neural networks with implications for designing models with better out-of-domain behavior.
-
From Moments to Models: Graphon Mixture-Aware Mixup and Contrastive Learning - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Representation Learning: model-aware contrastive learning and mixup via graphon mixture modeling with a theoretical bound linking cut distance to motif densities.
-
Decomposing Attention To Find Context-Sensitive Neurons - Score: 15 (R=8, N=7) - Date: 2025-10-07 - Comment: Matches Representation Learning/Interpretability: decomposes attention to uncover context-sensitive neurons from weights using a calibration text.
-
Hyperparameter Loss Surfaces Are Simple Near their Optima - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Training dynamics: theory and tools for hyperparameter loss surfaces near optima, deriving asymptotic laws for random search and effective dimensionality.
-
On the Role of Temperature Sampling in Test-Time Scaling - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Matches Test-Time Scaling: multi-temperature sampling/voting to expand reasoning coverage without additional training, offering analysis of sampling dynamics.
-
Mitigating Modal Imbalance in Multimodal Reasoning - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Representation Learning: analyzes and mitigates cross-modal attention imbalance with a training strategy that explicitly combines modalities to improve joint reasoning.
-
Multimodal Function Vectors for Spatial Relations - Score: 15 (R=8, N=7) - Date: 2025-10-06 - Comment: Representation Learning/Model Architecture — identifies and manipulates attention-head ‘function vectors’ in an LMM to control relational reasoning.
-
Robust Tangent Space Estimation via Laplacian Eigenvector Gradient Orthogonalization - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Manifold/representation learning via Laplacian eigenvector gradient orthogonalization with theoretical robustness to noise.
-
Flatness-Aware Stochastic Gradient Langevin Dynamics - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Matches Representation Learning/training dynamics: proposes fSGLD to bias toward flat minima with theoretical guarantees (invariant measure, convergence, excess-risk).
-
PENEX: AdaBoost-Inspired Neural Network Regularization - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning/Training Dynamics — new penalized exponential loss (PENEX) with margin maximization behavior for neural network regularization.
-
Learning Model Representations Using Publicly Available Model Hubs - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning: learns weight-space representations from heterogeneous public model hubs with a new backbone for unstructured model populations.
-
Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning — Hierarchical Optimal Transport for global, soft alignment across layers/neurons, yielding interpretable representational correspondences.
-
Learning Time-Series Representations by Hierarchical Uniformity-Tolerance Latent Balancing - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning criterion: hierarchical losses and temperature scheduling to balance uniformity–tolerance in contrastive time-series embeddings.
-
Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation/Training Dynamics: shows SFT metrics can mispredict RL outcomes and proposes stronger proxies (generalization loss, Pass@large k) for post-training.
-
Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning/Training Dynamics and Data Efficiency: proves similarity of cross-modal attention trajectories implies gradient similarity, enabling principled data selection for LVLM fine-tuning.
-
Quantum-inspired Benchmark for Estimating Intrinsic Dimension - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning — intrinsic dimension estimation benchmark with complex manifolds; foundational evaluation of IDE methods.
-
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks - Score: 15 (R=8, N=7) - Date: 2025-10-03 - Comment: Representation Learning/Interpretability: gradient-based ability impact with targeted ablation to mechanistically diagnose benchmarks and decompose model competence.
-
Geometric Properties of Neural Multivariate Regression - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning: analyzes intrinsic dimensionality and collapse in neural regression representations, yielding insights into training dynamics and generalization.
-
Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning/Training dynamics: evaluates probability-based objectives beyond NLL for SFT, with theory tied to model capability.
-
Learning Energy-based Variational Latent Prior for VAEs - Score: 15 (R=8, N=7) - Date: 2025-10-02 - Comment: Representation Learning/Model Architecture—energy-based variational latent prior for VAEs addressing prior holes with efficient sampling via variational treatment.
-
Bayesian Influence Functions for Hessian-Free Data Attribution - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation Learning: introduces Bayesian influence functions to quantify training data impact via SG-MCMC-based loss landscape statistics, scaling to billion-parameter models (training dynamics/attribution).
-
Reconcile Certified Robustness and Accuracy for DNN-based Smoothed Majority Vote Classifier - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation/Robustness Theory: PAC-Bayesian generalization bound with certified radius for smoothed majority vote and a spectral-norm-inspired regularizer.
-
Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation Learning/Training dynamics: introduces Training Re-evaluation Curves (TREC) and predicts them from AdamW EMA for proactive LLM data curriculum design.
-
Language Model Planning from an Information Theoretic Perspective - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Representation Learning: probes planning by compressing hidden states (via VQ-VAE) to measure mutual information and analyze transformer computation structure.
-
Knowledge distillation through geometry-aware representational alignment - Score: 15 (R=8, N=7) - Date: 2025-10-01 - Comment: Compression/Efficiency and Representation Learning: geometry-aware feature distillation using Procrustes distance and Gram matrix alignment.
Other Foundational Research (9)
-
Surrogate-based quantification of policy uncertainty in generative flow networks - Score: 20.0 (R=0, N=0) - Date: 2025-10-28 - Comment: Author match
-
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise - Score: 20.0 (R=0, N=0) - Date: 2025-10-15 - Comment: Author match
-
Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime - Score: 17 (R=9, N=8) - Date: 2025-10-31 - Comment: Training dynamics/implicit bias: theoretical analysis of per-sample Adam vs full-batch, characterizing optimizer-induced max-margin geometry.
-
Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production - Score: 17 (R=9, N=8) - Date: 2025-10-17 - Comment: Conditional/Dynamic Networks: adaptive per-token compute with pause tokens and new CYB losses for dynamic inference.
-
Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds - Score: 16 (R=8, N=8) - Date: 2025-10-27 - Comment: Optimization/Training Theory: finite-time guarantees for nonsmooth nonconvex stochastic optimization on Riemannian manifolds, including a zeroth-order variant.
-
On Biologically Plausible Learning in Continuous Time - Score: 16 (R=8, N=8) - Date: 2025-10-22 - Comment: Training dynamics: continuous-time learning that unifies SGD/FA/DFA/KP and analyzes temporal credit assignment via eligibility traces and input–error overlap.
-
Learning to Answer from Correct Demonstrations - Score: 16 (R=8, N=8) - Date: 2025-10-20 - Comment: Matches foundational Training Objective design for learning from correct demonstrations beyond MLE, with sample complexity guarantees under a low-cardinality reward class.
-
Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits - Score: 16 (R=8, N=8) - Date: 2025-10-15 - Comment: Optimization/Training Dynamics: robust second-order method with gradient/Hessian clipping under heavy-tailed noise and tight sample complexity bounds.
-
Improved High-probability Convergence Guarantees of Decentralized SGD - Score: 15 (R=8, N=7) - Date: 2025-10-08 - Comment: High Performance Computing: new high-probability convergence guarantees for decentralized SGD with linear speedup under light-tailed noise.