Personalized Monthly Topic Summary 2026/01
| Metric | Value |
|---|---|
| Total Papers | 411 |
| Model Architecture | 122 |
| Model Compression and Efficiency | 129 |
| High Performance Computing | 42 |
| Representation Learning | 114 |
| Other Foundational Research | 4 |

Model Architecture (122)
-
L$^3$: Large Lookup Layers - Score: 19 (R=10, N=9) - Date: 2026-01-30 - Comment: Model Architecture & Sparsity: proposes Large Lookup Layers as a systems-friendly sparse alternative to MoE with static token-based routing and embedding allocation; enables CPU-offloaded inference.
-
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep - Score: 19 (R=10, N=9) - Date: 2026-01-28 - Comment: Strong match to Model Architecture and training stability: Post-LN Transformer with Highway-style connections enabling stable ultra-deep training and improved depth scaling.
-
Superlinear Multi-Step Attention - Score: 19 (R=10, N=9) - Date: 2026-01-27 - Comment: Model Architecture and Efficiency: multi-step attention achieving subquadratic complexity while preserving random context access; scalable design for long contexts.
-
LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts - Score: 19 (R=10, N=9) - Date: 2026-01-27 - Comment: MoE Architecture + Efficiency: hardware–software co-designed LatentMoE optimizing accuracy per FLOP/parameter, with empirical/theoretical backing.
-
LongCat-Flash-Thinking-2601 Technical Report - Score: 19 (R=10, N=9) - Date: 2026-01-27 - Comment: Matches Model Architecture (MoE) and HPC/Distributed Training: 560B MoE with domain-parallel expert training, large-scale asynchronous RL infrastructure, and test-time scaling.
-
On the Expressive Power of Floating-Point Transformers - Score: 19 (R=10, N=9) - Date: 2026-01-26 - Comment: Model Architecture/Representation Theory: expressive power of floating-point Transformers, permutation equivariance under finite precision, and positional encoding effects.
-
Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics - Score: 19 (R=10, N=9) - Date: 2026-01-09 - Comment: Model Architecture: continuous-token maturation with delayed discretization for autoregressive generation, enabling stable deterministic decoding.
-
Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts - Score: 19 (R=10, N=9) - Date: 2026-01-06 - Comment: Model Architecture + Representation Learning: diffusion models with MoLR-MoG latent leading to MoE-structured score; provides estimation and convergence guarantees.
-
A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory - Score: 19 (R=10, N=9) - Date: 2026-01-06 - Comment: Theoretical Architecture: depth hierarchy lower bounds for computing max with ReLUs via extremal graph theory.
-
Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Architecture: proposes depth-recurrent attention mixtures combining depth attention and sparse expert attention (MoE) to scale latent reasoning efficiently.
-
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Compression: MoE with adaptive token-to-concept compression for implicit compute allocation; reduces attention/KV cache and improves efficiency.
-
L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Matches Model Architecture: MoE routing improved via low-rank latent routing space and Lipschitz-controlled scoring geometry.
-
Scaling Embeddings Outperforms Scaling Experts in Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model architecture and efficiency: proposes scaling embeddings as an alternative to MoE sparsity scaling; includes system optimizations/speculative decoding; directly targets MoE/LLM scaling.
-
Hyperparameter Transfer with Mixture-of-Expert Layers - Score: 18 (R=10, N=8) - Date: 2026-01-29 - Comment: Model Architecture (MoE): DMFT-justified parameterization enabling hyperparameter transfer across width/depth/experts/expert-size in sparse MoE Transformers.
-
$\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: MoE Architecture: continuous expert parameterization (infinite experts) enabling flexible compute–accuracy trade-offs at inference.
-
FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: High Performance Computing/Efficiency for MoE: ML-based cache replacement for SSD-offloaded experts enabling on-device MoE inference and reducing I/O bottlenecks.
-
GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches Model Architecture (MoE): geometric router constraints (null-space projection) for algorithm-agnostic unlearning that preserves routing while erasing expert knowledge.
-
A Collision-Free Hot-Tier Extension for Engram-Style Conditional Memory: A Controlled Study of Training Dynamics - Score: 18 (R=10, N=8) - Date: 2026-01-26 - Comment: Model Architecture and Training Dynamics: conditional memory with a collision-free hot tier via MPHF; analysis reveals gating credit assignment limits and collision-induced regularization.
-
Demystifying the Slash Pattern in Attention: The Role of RoPE - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Representation/Architecture analysis: theoretical and empirical explanation of slash attention patterns via RoPE and training dynamics.
-
WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation - Score: 18 (R=10, N=8) - Date: 2026-01-14 - Comment: Model Architecture and Efficiency: replaces attention with a wave propagation operator (O(N log N)) via frequency-time decoupled formulation for global interactions.
-
MoE-DisCo: Low Economy Cost Training Mixture-of-Experts Models - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: MoE/HPC: staged training of Mixture-of-Experts via disentangled submodels and unsupervised clustering to reduce cost on low-end hardware.
-
Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Model Architecture and Efficiency: Transformer alternative with EMA/gated attention plus sliding chunk attention, timestep decay normalization, and adaptive working memory for million-token contexts without explicit context extension.
-
Monkey Jump: MoE-Style PEFT for Efficient Multi-Task Learning - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Strong match to Model Architecture (MoE-style specialization) and Compression/Efficiency (parameter-efficient routing without extra trainable experts/routers).
-
CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Model Architecture: demographic-aware Mixture of Adapters with routing to separate cultural modes and mitigate gradient interference.
-
The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Strongly matches Model Architecture (Mixture-of-Experts analysis uncovering a domain-invariant ‘Standing Committee’; direct MoE focus).
-
Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-01-07 - Comment: Model Architecture (MoE): kNN-augmented expert routing with retrieval-based mixing for robust token-to-expert assignment under shift.
-
Geometric and Dynamic Scaling in Deep Transformers - Score: 18 (R=10, N=8) - Date: 2026-01-07 - Comment: Model Architecture/Training Dynamics: proposes Manifold-Geometric Transformer with manifold-constrained hyper-connections and deep delta learning to prevent rank collapse in deep Transformers.
-
LinMU: Multimodal Understanding Made Linear - Score: 18 (R=10, N=8) - Date: 2026-01-06 - Comment: Efficiency/Architecture: replaces quadratic attention with dual-branch linear-complexity module (bidirectional SSM + local window attention) and a 3-stage distillation pipeline for VLMs.
-
Making MoE based LLM inference resilient with Tarragon - Score: 18 (R=10, N=8) - Date: 2026-01-06 - Comment: HPC/MoE Systems: resilient MoE inference via reconfigurable datapath, KV-cache checkpointing, and shadow experts for fault tolerance.
-
RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Directly targets MoE router behavior and expert-parallel load imbalance under adversarial prompts; strong match to Model Architecture (MoE) and systems-level inference effects.
-
Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication - Score: 18 (R=9, N=9) - Date: 2026-01-30 - Comment: Model Architecture: logic-derived Arrow Language Model interpreting next-token prediction as nested intuitionistic implication with low-rank realization.
-
FloydNet: A Learning Paradigm for Global Relational Reasoning - Score: 18 (R=9, N=9) - Date: 2026-01-28 - Comment: Model Architecture: replaces local message passing with a learned DP-style global refinement operator; proven expressivity (3-WL/2-FWL).
-
The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit - Score: 18 (R=9, N=9) - Date: 2026-01-16 - Comment: Architecture theory: shows self-attention’s tropical (max-plus) limit, linking transformers to dynamic programming/shortest-path.
-
Robust Reasoning as a Symmetry-Protected Topological Phase - Score: 18 (R=9, N=9) - Date: 2026-01-09 - Comment: Model Architecture: proposes a Holonomic Network with non-Abelian gauge symmetry, framing robust reasoning as a symmetry-protected topological phase.
-
Horseshoe Mixtures-of-Experts (HS-MoE) - Score: 17 (R=10, N=7) - Date: 2026-01-15 - Comment: Model Architecture: Mixture-of-Experts with Bayesian horseshoe priors for sparse expert selection and a particle learning algorithm for sequential inference.
-
Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints - Score: 17 (R=10, N=7) - Date: 2026-01-14 - Comment: Model Architecture (MoE): principled design under memory/inference constraints; highlights total parameters and expert sparsity as key factors.
-
Towards Specialized Generalists: A Multi-Task MoE-LoRA Framework for Domain-Specific LLM Adaptation - Score: 17 (R=10, N=7) - Date: 2026-01-14 - Comment: Model Architecture: combines Mixture-of-Experts with Low-Rank Adaptation (LoRA) for multi-task domain adaptation and interference mitigation.
-
Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Efficiency: distills Transformers into RNN-attention hybrids (HALO/HypeNet) with improved long-context efficiency and length generalization.
-
A Separable Architecture for Continuous Token Representation in Language Models - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Efficiency: replaces embedding tables with a continuous token generator (separable architecture) improving parametric efficiency.
-
Clustering in Deep Stochastic Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Matches Representation Learning/Theory: stochastic analysis of deep Transformer token dynamics; interacting-particle limit prevents collapse.
-
Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Training Theory: unified generalization framework via L2-stability for parameter-space model merging across heterogeneous experts, with actionable merging guidance.
-
Perceptrons and localization of attention's mean-field landscape - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture theory: mean-field analysis of Transformer attention/perceptron blocks showing atomic localization of critical points.
-
The Depth Delusion: Why Transformers Should Be Wider, Not Deeper - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Scaling Laws: architecture-conditioned scaling revealing critical depth and advocating width-over-depth tradeoffs.
-
SONIC: Spectral Oriented Neural Invariant Convolutions - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Strong match to Model Architecture: continuous, orientation-aware spectral parameterization of convolutional operators with global receptive fields and resolution adaptivity.
-
LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Model Architecture and Certified Robustness: constraint-free, convolution-free 1-Lipschitz architecture with manifold optimization and scalable training.
-
Power-based Partial Attention: Bridging Linear-Complexity and Full Attention - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Model Architecture/Efficiency: sub-quadratic attention ($O(L^{1+p})$) bridging linear and full attention to quantify necessary attention.
-
Finite-Time Analysis of Gradient Descent for Shallow Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Theoretical Training Dynamics: finite-time analysis of gradient descent for shallow Transformers with width scaling and sequence-length–independent optimization error.
-
Multigrade Neural Network Approximation - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Model Architecture/Training Paradigm: multigrade deep learning (grade-wise residual training) with operator-theoretic guarantees of vanishing approximation error.
-
Provably Learning Attention with Queries - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Matches Model Architecture (attention/Transformer) with theoretical learning/identifiability via query access.
-
Unit-Consistent (UC) Adjoint for GSD and Backprop in Deep Learning Applications - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Model Architecture/Optimization: introduces a unit-consistent adjoint for gauge-equivariant backprop/steepest descent in positively homogeneous networks.
-
MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Model architecture: Modality-Aware Mixture-of-Experts with modality-specific routing and shared experts (MoE).
-
TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Matches Conditional/Dynamic Networks and Efficiency: step-level routing (TRIM) that sends only critical reasoning steps to larger models using uncertainty and process rewards, improving cost-accuracy tradeoffs.
-
Unlabeled Data Can Provably Enhance In-Context Learning of Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Representation Learning/Training Dynamics in Transformers: theoretical analysis showing CoT-augmented prompts let transformers emulate EM using unlabeled data for improved ICL.
-
Layer-Parallel Training for Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: High Performance Computing: parallel-in-time, layer-parallel training of Transformers via neural ODE formulation with accuracy control.
-
Controlled LLM Training on Spectral Sphere - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: High-Performance Training/Optimization: Spectral Sphere Optimizer enforces module-wise spectral constraints, fully muP-aligned, improving stability (incl. MoE router balance) over AdamW/Muon.
-
Parallel Context-of-Experts Decoding for Retrieval Augmented Generation - Score: 17 (R=9, N=8) - Date: 2026-01-14 - Comment: Model Architecture/Efficiency: Parallel Context-of-Experts decoding treats retrieved docs as experts with contrastive aggregation, avoiding shared attention and prefill bottlenecks.
-
LDLT L-Lipschitz Network Weight Parameterization Initialization - Score: 17 (R=9, N=8) - Date: 2026-01-14 - Comment: Model Architecture/Training Dynamics: analytic initialization for LDLT L-Lipschitz layers with exact variance derivations; practical prescriptions for stable deep Lipschitz networks.
-
CliffordNet: All You Need is Geometric Algebra - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Proposes a new vision backbone grounded in geometric algebra with linear complexity, directly matching Model Architecture (Transformer/CNN alternatives) and Efficiency.
-
Bi-Orthogonal Factor Decomposition for Vision Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Strong match to Representation Learning/mechanistic analysis: bi-orthogonal factor decomposition to disentangle position vs content interactions in ViT attention.
-
Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer - Score: 17 (R=9, N=8) - Date: 2026-01-12 - Comment: Model Architecture: introduces a Discrete Transformer with enforced functional disentanglement (routing vs arithmetic) and annealed sampling to enable program extraction, boosting interpretability.
-
Token-Level LLM Collaboration via FusionRoute - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Matches Model Architecture/Efficiency: token-level routing with a trainable complementary generator; theoretical limits of expert-only routing (MoE-like).
-
Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Matches Training Dynamics/Architecture: learnable per-matrix/row/column multipliers that free the weight scale from the weight-decay/noise equilibrium, improving optimization.
-
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Model Architecture: unified autoregressive transformer with next-scale visual prediction enabling fast 1024×1024 generation; unified multimodal tokenization and training.
-
Attention Needs to Focus: A Unified Perspective on Attention Allocation - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Model Architecture and Efficiency: introduces Lazy Attention with positional discrimination and Elastic-Softmax to mitigate collapse/sink and induce attention sparsity.
-
Context-Free Recognition with Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Model Architecture Theory: shows looped transformers with O(log n) iterations and padding can recognize CFLs, advancing formal capacity understanding.
-
Deep Delta Learning - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Model Architecture: generalizes residual connections via a learnable rank‑1 Delta operator with spectral control and gated dynamics.
-
Constructing a Neuro-Symbolic Mathematician from First Principles - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Model Architecture: neuro-symbolic design using a Hypergraph Transformer and a differentiable symbolic reasoning kernel with energy-based training signals.
-
Modeling Language as a Sequence of Thoughts - Score: 17 (R=9, N=8) - Date: 2026-01-01 - Comment: Model Architecture: a recurrent Transformer with sentence-level “thought” memory and shared-parameter token/thought generation for sequence-of-thought modeling.
-
GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Model Architecture: Transformer normalization innovation (GeoNorm) unifying pre-/post-norm via geodesic updates with negligible overhead.
-
Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Model Architecture: MoE innovation with segment-wise routing for time-series Transformers, aligning conditional sparsity with temporal locality.
-
On the Expressiveness of State Space Models via Temporal Logics - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Strong match to Model Architecture theory: expressiveness analysis of State Space Models via temporal logic, including quantized vs unbounded precision and comparison to transformers.
-
TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Model Architecture/Representation Learning: provides a unified high-order attention-interaction tensor that linearly represents full Transformer computations (attention, FFN, norms, residuals).
-
Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Model Architecture (MoE): Mixture of Sparse Experts with shared/unique experts and unified gating for task-agnostic continual learning.
-
Sycophancy Hides Linearly in the Attention Heads - Score: 16 (R=9, N=7) - Date: 2026-01-26 - Comment: Representation Learning: linear separability of sycophancy in attention heads and targeted linear steering within Transformer attention activations.
-
Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis - Score: 16 (R=9, N=7) - Date: 2026-01-26 - Comment: Model Architecture: Mixture-of-Agents with inter-agent semantic attention and deep residual synthesis plus adaptive early stopping for collaborative LLM inference.
-
Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models - Score: 16 (R=9, N=7) - Date: 2026-01-15 - Comment: Model Architecture (MoE): attribution-based analysis of knowledge acquisition dynamics in MoE vs. dense models.
-
M$^2$FMoE: Multi-Resolution Multi-View Frequency Mixture-of-Experts for Extreme-Adaptive Time Series Forecasting - Score: 16 (R=9, N=7) - Date: 2026-01-14 - Comment: Model Architecture (MoE): multi-resolution, multi-view frequency Mixture-of-Experts with temporal gating for extreme-adaptive forecasting.
-
Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts - Score: 16 (R=9, N=7) - Date: 2026-01-13 - Comment: Strong match to Model Architecture (Mixture-of-Experts-style prediction head) with expert routing and orthogonalization.
-
Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Model Architecture + Efficiency: proposes a multiplication-free network replacing weights with channel-widths and sign-gated transmission to eliminate multiplications.
-
Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Conditional/Dynamic Networks: large-scale LLM routing and adaptive aggregation framework (mixture-of-models) with task-aware switching.
-
MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Model Architecture: hybrid MoE with token-level dynamic routing between Transformer and SSM (Mamba) experts plus utility-guided routing loss for efficiency/accuracy trade-offs.
-
mHC: Manifold-Constrained Hyper-Connections - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Model Architecture: proposes manifold-constrained Hyper-Connections to restore identity mapping and improve stability/scalability of widened residual streams with efficiency-aware optimizations.
-
Generalising E-prop to Deep Networks - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Extends E-prop to deep recurrent networks, enabling online credit assignment across time and depth; core training/architecture contribution.
-
Identifiable Equivariant Networks are Layerwise Equivariant - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Matches Model Architecture/Theory: identifiability-based proof linking end-to-end equivariance to layerwise equivariance.
-
TRACE: Trajectory Recovery for Continuous Mechanism Evolution in Causal Representation Learning - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Learning with MoE: identifiable continuous mechanism trajectories via MoE experts for causal representation learning.
-
The Effect of Architecture During Continual Learning - Score: 16 (R=8, N=8) - Date: 2026-01-28 - Comment: Model Architecture/Representation Learning: joint optimization of architecture and weights to mitigate forgetting; bilevel formulation with low-rank knowledge transfer.
-
Analytic Bijections for Smooth and Interpretable Normalizing Flows - Score: 16 (R=8, N=8) - Date: 2026-01-19 - Comment: Model Architecture: new analytic bijections and a radial flow architecture delivering smooth, interpretable and closed-form invertible transformations.
-
On the origin of neural scaling laws: from random graphs to natural language - Score: 16 (R=8, N=8) - Date: 2026-01-16 - Comment: Scaling laws theory: investigates origins of neural scaling exponents via simplified transformers and random-graph sequences.
-
Density Matrix RNN (DM-RNN): A Quantum Information Theoretic Framework for Modeling Musical Context and Polyphony - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Model Architecture: DM-RNN with density-matrix state and CPTP dynamics; rigorous parameterization and information-theoretic analysis of representations.
-
Discontinuous Galerkin finite element operator network for solving non-smooth PDEs - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Model Architecture: DG–FEONet, a hybrid DG-based neural operator trained via residual minimization; an operator-learning architecture with data-free training and robustness to discontinuities.
-
Physical Transformer - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Model Architecture: proposes a ‘physical transformer’ coupling attention/FFN with Hamiltonian dynamics and symplectic layers; Representation Learning: reasoning on a learned manifold with geometric invariants.
-
Effective LoRA Adapter Routing using Task Representations - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture/Efficiency: task-representation-based routing and composition of LoRA adapters (adapter MoE-style selection) scaling with tasks, not adapters.
-
Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Optimizer/training dynamics: explains Adam’s behavior via gradient scale invariance when β1=β2; guides optimizer design.
-
KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model architecture/efficiency: Kronecker-product parameterization of manifold-constrained hyper-connections to guarantee double stochasticity with reduced parameters.
-
Multi-Modal Time Series Prediction via Mixture of Modulated Experts - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture: Mixture-of-Experts with expert modulation (conditioning routing and computation) for multi-modal time series.
-
MAR: Efficient Large Language Models via Module-aware Architecture Refinement - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture and Efficiency: integrates SSMs and activation sparsification with spiking-aware components to reduce LLM inference energy.
-
Is Parameter Isolation Better for Prompt-Based Continual Learning? - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture: prompt-sharing with gated routing and history-aware modulation (sparse activation) for continual learning, i.e. conditional/dynamic prompts.
-
CCMamba: Selective State-Space Models for Higher-Order Graph Learning on Combinatorial Complexes - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Matches Model Architecture/Efficiency: replaces attention with selective state-space models for linear-time, long-range message passing on combinatorial complexes.
-
Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Representation Learning: mechanistic analysis of multimodal in-context learning circuits (induction-style) and RoPE effects in transformers.
-
TINNs: Time-Induced Neural Networks for Solving Time-Dependent PDEs - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Architecture: introduces a conditional/dynamic network by parameterizing weights as a learned function of time, addressing limitations of shared weights in PINNs.
-
Revisiting Incremental Stochastic Majorization-Minimization Algorithms with Applications to Mixture of Experts - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Mixture-of-Experts: incremental stochastic MM algorithm with convergence guarantees for softmax-gated MoE training on streaming data.
-
Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning and Model Architecture: component-level lesioning of MoE and dense Transformers to probe functional organization and interpretability of internal modules.
-
Residual Tokens Enhance Masked Autoencoders for Speech Modeling - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Model Architecture & Representation Learning: masked autoencoder augmented with residual trainable tokens to capture unlabeled factors in speech.
-
SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Model Architecture and Efficiency: transformer with Clustered Proximity Attention reducing attention complexity from O(n^2) to O(n) and edge-aware module for decision making.
-
A Constrained Optimization Perspective of Unrolled Transformers - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Model Architecture/Training Dynamics: constrained optimization with layerwise descent constraints via primal–dual training for Transformers.
-
NewPINNs: Physics-Informing Neural Networks Using Conventional Solvers for Partial Differential Equations - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Matches Model Architecture/Training Dynamics: solver-in-the-loop physics-informing (NewPINNs) replacing residual-based losses for stable training.
-
Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Matches 'Model Architecture: conditional/dynamic networks' by introducing Hierarchical Orthogonal Residual Spread to stabilize and localize large-scale LLM edits.
-
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: Model Architecture/Training: aligns latent visual attention trajectories (visual thoughts) with curriculum sensory gating to enhance multimodal reasoning and grounding.
-
From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Architecture: introduces Hawkes Attention, a time-modulated attention operator replacing Q/K/V projections with per-type kernels.
-
ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Matches Model Architecture/Conditional Computation: controllable multi-budget reasoning via on-policy RL and distillation enabling distinct compute modes.
-
Hyperbolic Heterogeneous Graph Transformer - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Model Architecture/Efficiency: hyperbolic heterogeneous graph Transformer with relation-specific hyperbolic attention that operates fully on the manifold and runs in linear time.
-
Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Model Architecture: KAN with adaptive RBFs and learned smoothness, with universality proof and faster training/inference.
-
CompNO: A Novel Foundation Model approach for solving Partial Differential Equations - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Model Architecture: compositional neural operators with reusable Foundation Blocks (parametric FNOs) and boundary-condition operator assembled via lightweight adapters for PDEs.
-
Hellinger Multimodal Variational Autoencoders - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Model Architecture/Representation: introduces Hellinger pooling for multimodal VAEs, improving joint inference without sub-sampling.
-
Circular Reasoning: Understanding Self-Reinforcing Loops in Large Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Training Dynamics/Representation: analyzes circular reasoning failure via attention dynamics and introduces a detection method (CUSUM).
-
AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Model Architecture and Efficiency: unified autoregressive framework for joint discrete–continuous sequences that uses diffusion for the continuous values to handle precision.
-
Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches Model Architecture and Efficiency: head-level diagnosis with conflict-aware sparse fine-tuning that selectively updates Transformer heads.
-
Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Architecture/scaling analysis showing MoE reasoning performance aligns with active parameters—core insight into MoE inference compute scaling.
-
Decentralized Autoregressive Generation - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Model Architecture: introduces a decentralized autoregressive training objective via linear combination of expert flows (conditional/dynamic networks).
-
Neural Networks on Symmetric Spaces of Noncompact Type - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Model Architecture: designs FC layers and attention mechanisms on symmetric spaces (noncompact Riemannian manifolds).
-
Three factor delay learning rules for spiking neural networks - Score: 15 (R=8, N=7) - Date: 2026-01-05 - Comment: Model Architecture/Training rules for SNNs: online three-factor learning of synaptic/axonal delays for temporal tasks, improving efficiency on neuromorphic hardware.
-
Flow Matching Neural Processes - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Model Architecture: introduces flow-matching neural processes enabling amortized conditional generation via ODE solvers.
Model Compression and Efficiency (129)
-
Discrete Feynman-Kac Correctors - Score: 20.0 (R=0, N=0) - Date: 2026-01-16 - Comment: Author match
-
Explicit Multi-head Attention for Inter-head Interaction in Large Language Models - Score: 19 (R=10, N=9) - Date: 2026-01-28 - Comment: Model Architecture & Efficiency: explicit multi-head attention with head-level linear composition and normalization; enables KV-cache compression via low-rank virtual heads.
-
Low-Rank Key Value Attention - Score: 19 (R=10, N=9) - Date: 2026-01-19 - Comment: Architecture/efficiency: low-rank KV attention reduces KV cache while preserving head diversity; improves pretraining compute efficiency.
-
STEM: Scaling Transformers with Embedding Modules - Score: 19 (R=10, N=9) - Date: 2026-01-16 - Comment: Model architecture and efficiency: static token-indexed sparsity replacing FFN up-projection; decouples capacity from per-token compute and enables CPU offload.
-
T3C: Test-Time Tensor Compression with Consistency Guarantees - Score: 19 (R=10, N=9) - Date: 2026-01-07 - Comment: Model Compression and Efficiency: train-once, test-time budget-conditioned low-rank plus mixed-precision with a controller and per-layer consistency certificates.
-
Fast-weight Product Key Memory - Score: 19 (R=10, N=9) - Date: 2026-01-05 - Comment: Introduces a dynamic fast-weight Product Key Memory—sparse episodic memory updated at train/inference time—for sequence models (Model Architecture; Efficiency via sparse memory).
-
Task-Driven Kernel Flows: Label Rank Compression and Laplacian Spectral Filtering - Score: 19 (R=10, N=9) - Date: 2026-01-05 - Comment: Representation Learning and Efficiency: theory showing supervised learning induces low-rank kernels (rank bounded by number of classes) via a kernel ODE and low-rank SGD noise.
-
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space - Score: 19 (R=10, N=9) - Date: 2026-01-01 - Comment: Strongly matches Model Architecture and Efficiency: introduces a dynamic hierarchical language model shifting compute to a compressed concept space, discovers variable-length units end-to-end, proposes a compression-aware scaling law and a decoupled μP parametrization.
-
ECO: Quantized Training without Full-Precision Master Weights - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Compression/Efficiency: quantized training without full-precision master weights via error-compensating optimizer; theory and SMoE applicability.
-
Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: KV-cache low-rank projection learned on the Stiefel manifold by minimizing decoder-layer output error with rank allocation profiles.
-
HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: low-bit PTQ via Hessian conditioning with learnable rotations to reduce curvature sensitivity.
-
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: HPC/Systems + MoE: lossless compression and cache-affinity scheduling for on-device MoE serving with provable performance, shifting serving from I/O-bound to compute-centric.
-
HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: introduces a Hessian-guided, differentiable QAT with temperature annealing for ultra-low-bit LLMs, improving optimization over STE-based methods.
-
LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation - Score: 18 (R=10, N=8) - Date: 2026-01-28 - Comment: Compression/Efficiency: fine-tuning-free post-training quantization with low-rank decomposition and permuted block-wise rotations (2–3 bit regime).
-
StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths - Score: 18 (R=10, N=8) - Date: 2026-01-28 - Comment: Strong match to Model Compression/Efficiency: a theoretically grounded surrogate for ultra-low-bit Quantization-Aware Training that generalizes STE and stabilizes training.
-
Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective - Score: 18 (R=10, N=8) - Date: 2026-01-28 - Comment: High Performance Computing & Efficiency: unified model for KV-cache eviction and query routing with randomized eviction and learning-based routing; theoretical guarantees and large speedups.
-
Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches Compression/Efficiency: unifies sparsity and low-rank fine-tuning with provable MSE bounds, fused GEMM, and bitmap encoding for true speedups.
-
Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches Cache/Efficiency: KV cache compression for CoT with answer-first principle, attention-based LRFU eviction, and adaptive budget allocation.
-
E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory - Score: 18 (R=10, N=8) - Date: 2026-01-26 - Comment: High-Performance Computing and Efficiency: algebraic sparsity (EAAS) and a fused on-the-fly equivariant attention kernel achieving large TFLOPS gains with linear activation memory.
-
Global Context Compression with Interleaved Vision-Text Transformation - Score: 18 (R=10, N=8) - Date: 2026-01-16 - Comment: Compression/Efficiency and Model Architecture: global context compression in Transformers via interleaved vision–text tokens, reducing memory/FLOPs and token count.
-
Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-16 - Comment: Model Architecture/Efficiency: Bounded Hyperbolic Tanh as a normalization-free alternative to Pre-LN with theoretical stability and faster training/inference for LLMs.
-
Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: hardware-aligned 1.25-bit ternary quantization via 3:4 fine-grained sparsity and an annealing residual synapse mechanism (Arenas) to avoid representational collapse.
-
KVzap: Fast, Adaptive, and Faithful KV Cache Pruning - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: fast, adaptive KV cache pruning for both prefilling and decoding; cache/pruning focus.
-
Hierarchical Sparse Plus Low Rank Compression of LLM - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: hierarchical sparse-plus-low-rank (HSS) factorization with sparsity for LLM layers.
-
ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Model Compression and Efficiency: unified NVFP4 4-bit PTQ via Augmented Residual Channels that preserves block isolation and hardware-uniform GEMM, with theoretical error bounds.
-
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Directly targets MoE training memory/throughput with co-designed kernels and activation checkpointing, squarely matching HPC and Compression/Efficiency criteria for MoE.
-
FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching - Score: 18 (R=10, N=8) - Date: 2026-01-12 - Comment: Matches Model Compression and Efficiency: flexible low-rank quantization with sketching and clipping-optimized approximation for LLMs.
-
ADEPT: Adaptive Dynamic Early-Exit Process for Transformers - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Model Efficiency: adaptive token-level early exit in both prefill and generation with KV-cache decoupling for transformers.
-
GRIT -- Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation - Score: 18 (R=10, N=8) - Date: 2026-01-05 - Comment: Model Compression and Efficiency: low-rank PEFT with K-FAC preconditioning, Fisher-guided reprojection, and dynamic rank adaptation.
-
More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Strongly matches Compression/Efficiency: proposes Multi-envelope Double Binary Factorization for extreme low-bit quantization with shared sign bases, rank-l envelope, closed-form init, and alternating refinement; preserves deployment-friendly binary inference primitives.
-
PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: LLM-aware lossy compression of the KV cache with co-designed algorithms/systems; strong fit to Compression/Efficiency (cache) for Transformer inference.
-
Efficient Context Scaling with LongCat ZigZag Attention - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Model Architecture and Efficiency: introduces sparse ZigZag attention (LoZA) for efficient long-context scaling (up to 1M tokens) with speedups in prefill/decode.
-
Trellis: Learning to Compress Key-Value Memory in Attention Models - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Model Compression and Efficiency: learns to compress the Transformer KV cache into a fixed-size dynamic memory via a recurrent two-pass update with online gradient descent.
-
Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining - Score: 18 (R=9, N=9) - Date: 2026-01-28 - Comment: Deep Learning Theory: provable hierarchical learning in deep conv nets on Random Hierarchy Models via layerwise training (shallow-to-deep chaining).
-
Diffusion Language Models are Provably Optimal Parallel Samplers - Score: 18 (R=9, N=9) - Date: 2026-01-01 - Comment: Model Architecture/Efficiency: proves diffusion language models with CoT and revision/remasking are optimal parallel samplers in sequential steps and space, giving a theoretical foundation for efficient inference.
-
Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models - Score: 17 (R=10, N=7) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: proposes a sliced-Wasserstein distribution alignment loss for ultra-low-bit post-training quantization of LLMs, improving calibration of activation/output distributions.
-
Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction - Score: 17 (R=10, N=7) - Date: 2026-01-06 - Comment: Compression/Efficiency: low-bit quantization for SO(3)-equivariant GNNs with magnitude-direction decoupling and branch-separated QAT.
-
LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: adaptive look-ahead mixed-precision inference selecting small subsets for high precision to control rounding error in Transformers.
-
Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Efficiency/Cache: repurposes KV cache as lightweight representation for chain-of-embedding and fast/slow reasoning switching, reducing tokens at inference.
-
Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Compression/Efficiency & Robustness: sharpness-aware training over pruning masks (structure perturbations) to co-optimize compactness and robustness.
-
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Model Compression and Efficiency/HPC: instance-aware token seeking/ditching to cut activation memory during fine-tuning with large savings.
-
Self-Supervised Weight Templates for Scalable Vision Model Initialization - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Model Compression/Efficiency & Architecture: Tucker-factorized shared weight template with size-specific scalers enables scalable initialization across depths/widths; includes width-wise stochastic scaling.
-
EPAS: Efficient Training with Progressive Activation Sharing - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Efficiency/HPC: progressive activation (QK/KV) sharing across Transformer layers to boost training and inference throughput with controllable sharing at inference.
-
FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Quantization and KV-cache efficiency: FP8 W8A8 rollout, FP8 KV-cache with per-step recalibration, and mismatch correction for LLM RL.
-
S$^3$-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Model Compression/Efficiency: replaces full KV cache with attention-aligned endogenous retrieval via sparse autoencoders and a CPU inverted index to bound GPU memory during long-context inference.
-
Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Model Compression and Efficiency: proposes gating-based KV cache eviction with forward-only gate training for memory/compute-efficient LLM inference.
-
AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Efficiency: activation-guided low-rank subspace ZO optimization enabling memory-efficient LLM fine-tuning with theoretical guarantees.
-
A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches High-Performance Training Dynamics: scalable critical sharpness measure (few forward passes) capturing curvature phenomena in LLM training up to 7B.
-
Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Model Compression and Efficiency: theoretical bounds on minimal weight perturbations and provable low-rank compression thresholds; insights into layer-wise sensitivity.
-
Mugi: Value Level Parallelism For Efficient LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Compression/Efficiency: value-level parallelism generalized to nonlinear ops, weight/KV-cache quantization, and a new VLP architecture (Mugi) for full LLM workloads.
-
$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: proposes dual Taylor expansion pruning with attention distribution awareness for precise LLM sparsification.
-
Beyond Variance: Knowledge-Aware LLM Compression via Fisher-Aligned Subspace Diagnostics - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Model Compression and Representation Learning: Fisher-aligned subspace selection for activation compression using the Fisher Information Matrix and a new dependence metric for knowledge-critical directions.
-
Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Efficiency/HPC: introduces provably lossless hierarchical speculative decoding that increases accepted tokens without fidelity loss.
-
mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations - Score: 17 (R=9, N=8) - Date: 2026-01-12 - Comment: Model Architecture/Efficiency: reparameterizes hyper-connections to exactly enforce doubly stochastic mixing (via Birkhoff–von Neumann), eliminating Sinkhorn iterations and improving stability/speed.
-
RelayLLM: Efficient Reasoning via Collaborative Decoding - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Model Compression/Efficiency: token-level collaborative decoding with dynamic routing to an LLM to cut compute cost.
-
Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Strong match to Model Compression and Efficiency (memory-efficient LLM fine-tuning via prior-informed ZO gradient estimation with theory).
-
TAP-ViTs: Task-Adaptive Pruning for On-Device Deployment of Vision Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Model Compression/Efficiency: task-adaptive pruning for ViTs using per-device GMM-derived proxy datasets and dual-granularity importance evaluation; privacy-preserving on-device deployment.
-
FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Efficiency: TTC-aware training and early stopping to trade training FLOPs for test-time compute with a theoretical break-even bound.
-
RMAAT: Astrocyte-Inspired Memory Compression and Replay for Efficient Long-Context Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Model Architecture and Efficiency: recurrent memory tokens with adaptive compression and memory-efficient backprop (AMRB) for long-context Transformers.
-
Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Matches Model Compression/Sparsity: adaptive pruning discovers routed, specialized subnetworks ('adaptive tickets') for heterogeneous data.
-
Soft Quantization: Model Compression Via Weight Coupling - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Compression/quantization: training-time weight coupling induces mixed-precision discretization; a novel route to quantization beyond standard PTQ.
-
Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding - Score: 16 (R=9, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: residual-experts vector quantization (dynamic expert routing, variable bitrate) for neural audio coding—sparse quantization with MoE-like routing.
-
GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Model Compression and Efficiency: gradient-guided layer pruning and merging for LLMs enabling efficient fine-tuning and inference.
-
Is Finer Better? The Limits of Microscaling Formats in Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Strong match to Model Compression/Efficiency: analyzes limits of microscaling quantization and proposes a hardware-friendly FP8 UE5M3 scale format for FP4 data types.
-
How Is Uncertainty Propagated in Knowledge Distillation? - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Model Compression and Efficiency: variance-aware knowledge distillation (multi-response averaging and inverse-variance weighting) with formal analysis of uncertainty propagation.
-
From LLMs to LRMs: Rethinking Pruning for Reasoning-Centric Models - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Matches Model Compression and Efficiency: controlled study of depth/width/static/dynamic pruning strategies for reasoning-centric LLMs.
-
Low-Rank Tensor Approximation of Weights in Large Language Models via Cosine Lanczos Bidiagonalization - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Compression/Efficiency: low-rank tensor approximation of LLM weight tensors via cosine Lanczos bidiagonalization in a transform domain (cproduct).
-
MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-19 - Comment: KV-cache efficiency via adapting MLA to VLMs with modality-decoupled low-rank KV compression and RoPE modification; parameter-efficient adaptation.
-
FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization - Score: 16 (R=9, N=7) - Date: 2026-01-19 - Comment: Matches 'Model Compression and Efficiency: Quantization' by regenerating family-aware calibration data to improve PTQ accuracy in LLMs.
-
Single-Stage Huffman Encoder for ML Compression - Score: 16 (R=9, N=7) - Date: 2026-01-16 - Comment: Matches Compression/Efficiency and HPC communication: proposes a single-stage Huffman encoder with fixed codebooks for on-the-fly tensor compression during distributed LLM training, removing codebook-gen/transmission overhead.
-
Enhancing LUT-based Deep Neural Networks Inference through Architecture and Connectivity Optimization - Score: 16 (R=9, N=7) - Date: 2026-01-16 - Comment: Compression/Efficiency: LUT-based DNN architectural aggregation plus non-greedy sparse connectivity pruning/regrowth for FPGA-efficient inference.
-
GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR - Score: 16 (R=9, N=7) - Date: 2026-01-15 - Comment: Model Compression/Efficiency: geometry-aware low-rank adapters (LoRA) initialized by SVD to stabilize RLVR updates while using dense operators.
-
Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference - Score: 16 (R=9, N=7) - Date: 2026-01-13 - Comment: Compression/Efficiency: training-free adaptive layer selection for layer-wise token pruning to reduce KV cache while preserving accuracy.
-
SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis - Score: 16 (R=9, N=7) - Date: 2026-01-13 - Comment: Matches Compression/Efficiency (low-rank parameter editing) and Representation Learning (capability as low-rank subspaces) for selective ablation.
-
Controllable LLM Reasoning via Sparse Autoencoder-Based Steering - Score: 16 (R=9, N=7) - Date: 2026-01-09 - Comment: Strongly matches Representation Learning and Sparsity (Sparse Autoencoders to disentangle and steer reasoning strategies).
-
Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Compression/Efficiency: analyzes sparse-attention decoding overheads (Less is Less) and proposes early-stopping to reduce token consumption in long-decode.
-
RPIQ: Residual-Projected Multi-Collaboration Closed-Loop and Single Instance Quantization for Visually Impaired Assistance - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Model Compression and Efficiency: proposes a new quantization framework with multi-collaborative closed-loop compensation and Gauss–Seidel iterative quantization addressing inter-block error accumulation (4-bit).
-
CRoPE: Efficient Parametrization of Rotary Positional Embedding - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Transformer architecture and compression/efficiency: efficient parametrization of Rotary Positional Embedding reducing attention block parameters with negligible performance loss.
-
Bayesian Subspace Gradient Estimation for Zeroth-Order Optimization of Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Compression/Efficiency & HPC: Bayesian zeroth-order optimizer that reduces memory and improves convergence for LLM fine-tuning.
-
Heterogeneous Low-Bandwidth Pre-Training of LLMs - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: HPC + Efficiency: heterogeneous distributed pre-training combining SparseLoCo with activation/activation-gradient compression and subspace pipeline parallelism.
-
SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Compression/Theory: SGD-based KD analysis with Bayesian teachers; shows variance reduction and guidance on BCP noise.
-
QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-05 - Comment: Model Compression and Efficiency: automated quantization with tiered (global/block/module) search optimizing a performance–memory trade-off for spike-driven LMs.
-
OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Model Compression and Efficiency: introduces data-free, fusible rotations (OptRot) to mitigate weight/activation outliers for post-training quantization, improving W4A8 and weight-only PTQ.
-
MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Model Architecture/Efficiency: introduces a multi-scale state-space model with input-dependent scale-mixing to capture long-range, hierarchical dependencies efficiently.
-
Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: token-budgeted LLM–SLM collaboration via hint prefixes and learned hint-length routing for cost-efficient inference.
-
Procedural Pretraining: Warming Up Language Models with Abstract Data - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Training Dynamics/Efficiency: procedural pretraining on abstract data to induce algorithmic structure and accelerate LLM pretraining with less data.
-
LoRA and Privacy: When Random Projections Help (and When They Don't) - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Low-Rank/Compression + Privacy theory: DP analysis of Wishart/projection mechanisms; shows LoRA randomness is not inherently private and when low-rank helps with DP.
-
Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: training-free distribution sharpening via scaled low-temperature token sampling to match RL post-training gains without MCMC.
-
Auto-Regressive Masked Diffusion Models - Score: 16 (R=8, N=8) - Date: 2026-01-26 - Comment: Matches Model Architecture (strictly causal, permutation-equivariant masked diffusion) and Efficiency (parallel autoregressive-style decoding/strided generation).
-
Training-Trajectory-Aware Token Selection - Score: 16 (R=8, N=8) - Date: 2026-01-16 - Comment: Matches Compression/Efficiency and training dynamics: token-level objective (T3S) for distillation that mitigates trajectory bottlenecks in strong students, improving reasoning efficiency.
-
Greedy Is Enough: Sparse Action Discovery in Agentic LLMs - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Compression/Efficiency/Theory: frames sparse action discovery as block-sparse recovery and proves a greedy OMP-style algorithm recovers the relevant action set with sample guarantees.
-
Sparsity Is Necessary: Polynomial-Time Stability for Agentic LLMs in Large Action Spaces - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Model Compression/Efficiency: theory for block-sparse policies with ℓ1,2 regularization yielding sample complexity and support recovery in large action spaces.
-
Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking - Score: 16 (R=8, N=8) - Date: 2026-01-13 - Comment: Compression/Efficiency via sparsity/pruning: concept-aware neuron masking for multi-concept unlearning in diffusion models (training-free).
-
Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Model Efficiency: training-time Dynamic Outlier Truncation to suppress redundant reasoning tokens and improve cost–accuracy trade-off.
-
SpikySpace: A Spiking State Space Model for Energy-Efficient Time Series Forecasting - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Model Architecture and Efficiency: introduces a spiking state space model with event-driven selective scanning and neuromorphic-friendly activations for energy-efficient sequence modeling.
-
Making Foundation Models Probabilistic via Singular Value Ensembles - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Matches Compression/Efficiency: parameter-efficient implicit ensembles by freezing singular vectors and learning per-member singular values.
-
Grounding and Enhancing Informativeness and Utility in Dataset Distillation - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: principled dataset distillation balancing informativeness and utility with theoretical underpinnings.
-
Flow Perturbation++: Multi-Step Unbiased Jacobian Estimation for High-Dimensional Boltzmann Sampling - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Efficiency for CNFs: unbiased multi-step Jacobian estimation (Flow Perturbation++) reduces variance for high-dimensional Boltzmann sampling.
-
MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: training-free caching for flow-matching inference via average-velocity JVP reuse and stability-aware scheduling to reduce compute without retraining.
-
Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: second-order (quasi-Newton) optimizer for learned image compression improves optimization efficiency and reduces activation/latent outliers, aiding post-training quantization.
-
Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed Noise - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Efficiency/Training Dynamics: randomized subspace normalized SGD with high-probability guarantees under heavy-tailed noise; reduced per-iteration cost and better oracle complexity than full-dimensional NSGD.
-
TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: test-time adaptive ensemble drafting for speculative decoding to speed LVLM inference.
-
Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: introduces windowed token pruning and KV caching to accelerate diffusion LM inference.
-
PiC-BNN: A 128-kbit 65 nm Processing-in-CAM-Based End-to-End Binary Neural Network Accelerator - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Compression/Efficiency: end-to-end binary neural network accelerator using processing-in-CAM, eliminating full-precision ops.
-
A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning/Efficient adaptation: isolates behavior-specific neurons via sparse autoencoders and updates only a small neuron subset (sparse, neuron-level fine-tuning).
-
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Efficiency: memory-bounded test-time search with chunk-wise KV cache resets and geometric regularization to improve long-context reasoning coverage.
-
Gradient Regularized Natural Gradients - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Optimization/Efficiency: Gradient-Regularized Natural Gradients with structured FIM approximations and a Kalman-based variant; convergence guarantees.
-
Sparse RBF Networks for PDEs and nonlocal equations: function space theory, operator calculus, and training algorithms - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Model Architecture and Sparsity: Sparse RBF networks with function-space theory (Besov characterization) and efficient operator calculus/training for PDEs.
-
Analyzing Neural Network Information Flow Using Differential Geometry - Score: 15 (R=8, N=7) - Date: 2026-01-26 - Comment: Model Compression/Efficiency and Representation Learning: curvature-based (Ollivier–Ricci) analysis of information flow to rank/prune edges in neural networks.
-
Differentially Private Subspace Fine-Tuning for Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Model Compression and Efficiency: subspace (low-rank) DP fine-tuning injects noise only along principal gradient directions, preserving DP while reducing perturbation.
-
Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: Matches Efficiency/HPC inference: introduces stable sequential test-time scaling (Min-Seek) with a custom KV-cache scheme enabling beyond-context reasoning at near-linear complexity.
-
Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Parameter-efficient training: orthogonal gradient projection tailored to LoRA subspace to mitigate task interference (Model Architecture/Training).
-
Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Compression/Efficiency: theoretically grounded relaxed speculative decoding with annealed resampling for faster AR generation.
-
Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Efficiency: hidden-state-based step scoring and KV-cache-aware pruning for test-time scaling.
-
Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Compression/Efficiency: updates only a sparse subset of attention heads (head-level gradient masking) based on mechanistic relevance, reducing parameters and forgetting.
-
Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: post-training quantization repurposed for safety realignment, decoupled from fine-tuning.
-
Artificial Entanglement in the Fine-Tuning of Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Quantum-information-inspired analysis of low-rank PEFT (LoRA) via “artificial entanglement” directly matches Compression/Efficiency (low-rank) and Representation Learning/training dynamics.
-
Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Model Compression/Efficiency: subtask-focused knowledge distillation that transfers only relevant subspaces/layer components from teacher to student.
-
Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Model Architecture and Sparsity: task-specific binary masks (sparse sub-networks) with gradient alignment/projection for continual learning.
-
DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Model Architecture and Efficiency: dual-path, delay-aware Mamba backbone with linear-time modules for sequence modeling.
-
Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Efficiency and Low-rank/Sparsity: sparse+low-rank precision decomposition with ADMM to constrain and accelerate differentiable causal discovery.
-
Not All Steps are Informative: On the Linearity of LLMs' RLVR Training - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches High-Performance/Training Efficiency (algorithmic extrapolation of weights/logits to reduce RLVR computation) and training dynamics analysis.
-
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches Compression/Efficiency: instruction-conditioned visual token selection with positional continuity (PosPad) for efficient VLM grounding.
-
Compressed code: the hidden effects of quantization and distillation on programming tokens - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Representation Learning and Compression: analyzes how quantization and distillation alter token-level representations for code and impact generation quality.
-
Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Compression/Efficiency/Training Dynamics: shows safety gradients are low-rank and introduces one-shot alignment correction leveraging this structure.
-
Sparse Bayesian Message Passing under Structural Uncertainty - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Sparsity/Bayesian Architecture: posterior over signed adjacency and sparse signed message passing for robust GNNs under heterophily.
-
Gradient-Free Approaches is a Key to an Efficient Interaction with Markovian Stochasticity - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Optimization/training algorithms: derivative-free method for Markovian noise with mixing-time–independent rates (algorithmic efficiency).
-
MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Model Architecture/Efficiency: integrates Mamba SSM with low-rank Neural ODEs and segmented selective scanning for long-range time series with reduced complexity.
-
Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Model Compression and Efficiency / HPC: FPGA-accelerated framework with sparsity-driven dropout and streaming parallelism for efficient model recovery at the edge.
High Performance Computing (42)
-
Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: MoE + Systems: proposes Least-Loaded Expert Parallelism to dynamically rebalance imbalanced MoE routing across devices for latency/memory efficiency.
-
Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches High Performance Computing: communication-avoiding GEMM via generalized space-filling curves with platform/shape-oblivious partitioning minimizing data movement.
-
A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Systems-level framework for efficient MoE inference on GPU–NDP with tensor parallelism, load balancing, and dataset-free prefetching—HPC/efficiency for MoE.
-
The Hessian of tall-skinny networks is easy to invert - Score: 18 (R=9, N=9) - Date: 2026-01-13 - Comment: HPC/Optimization: exact Hessian-inverse-vector products for deep nets with linear-in-layers time/memory, enabling scalable second-order methods.
-
Nested Learning: The Illusion of Deep Learning Architectures - Score: 18 (R=9, N=9) - Date: 2026-01-01 - Comment: Proposes a new learning paradigm (Nested Learning), expressive optimizers, self-modifying sequence model, and a continuum memory system; foundational architecture/training perspective.
-
PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Systems/efficiency: algorithmic framework (adaptive polynomial fitting + randomized sketching) to accelerate matrix functions used in optimizers (Shampoo/Muon), enabling faster large-model training.
-
DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: HPC/systems: deterministic attention scheduling (backward pass DAG scheduling) to regain throughput for reproducible LLM training.
-
High-dimensional learning dynamics of multi-pass Stochastic Gradient Descent in multi-index models - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Training dynamics: asymptotically exact mean-field characterization of multi-pass mini-batch SGD vs SME vs gradient flow in high dimensions.
-
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: High Performance Computing: SLO-aware rotary scheduling (RotaSched) and DuplexKV memory co-design on Superchips for responsive LLM serving.
-
Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: High Performance Computing/Systems: NPU architectural primitives and memory hierarchy tailored to diffusion LLM sampling (non-GEMM operations), delivering significant inference speedups.
-
Revisiting Parameter Server in LLM Post-Training - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Systems-level innovation for distributed LLM training: replaces collective ops with point-to-point in FSDP (On-Demand Communication) to handle workload imbalance—fits the HPC/distributed training criterion.
-
Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: High Performance Computing/Systems: unified layout abstraction and compiler DSL for distribution, tiling, and sharding across device meshes and memory hierarchies.
-
ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: HPC/Systems for LLM serving: fine-grained, adaptive KV cache placement with ILP and runtime feedback to meet SLOs.
-
Parallelizable memory recurrent units - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: Model Architecture and Efficiency: new recurrent units (MRU/BMRU) with parallel scan compatibility and persistent memory.
-
HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: High Performance Computing/Efficiency: holistic-aware parallel speculative decoding with semantic token preservation for video-LLMs.
-
Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64 - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: High-Performance Computing/Efficiency: systems-level memory layout and SIMD kernel design (virtual tensor core) to overcome the memory wall for LLM inference on ARM64.
-
Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Systems-level method for LLM serving emulation via CUDA virtualization and distributed time-warp coordination (High Performance Computing).
-
Accelerating Decentralized Optimization via Overlapping Local Steps - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: HPC/Distributed Training: overlaps computation and communication in decentralized SGD (OLDSGD) with convergence guarantees to reduce wall-clock time.
-
Reliable and Resilient Collective Communication Library for LLM Training and Serving - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: High Performance Computing: resilient collective communication for distributed LLM training/serving with connection migration and bandwidth-aware load redistribution.
-
LLM-42: Enabling Determinism in LLM Inference with Verified Speculation - Score: 16 (R=8, N=8) - Date: 2026-01-27 - Comment: High Performance Computing: scheduling-based deterministic inference via verify-rollback that preserves dynamic batching without changing kernels.
-
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning - Score: 16 (R=8, N=8) - Date: 2026-01-12 - Comment: Model Architecture + HPC/Test-time compute: introduces a conditional/message-passing architecture to massively parallelize reasoning and scale test-time compute beyond context limits.
-
Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Matches High-Performance/Distributed Training: improved algorithms and lower bounds for compressed communication in distributed online convex optimization.
-
RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: High Performance Computing/System efficiency: KV cache residency across pipeline stages, affinity-aware routing, and memory-aware caching to extend sequence length under strict latency SLOs.
-
Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Model Architecture: introduces group-equivariant world models via one-parameter Lie group flows (equivariance for memory and dynamics).
-
Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding - Score: 16 (R=8, N=8) - Date: 2026-01-01 - Comment: Co-designed speculative decoding with compiler-friendly execution and latency-aware drafting; systems-level inference optimization (HPC/efficiency) for LLMs.
-
FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Optimization for large-scale training: momentum-orthogonalized updates structured by Fisher geometry (trust-region with K-FAC metric), balancing isotropy and adaptivity.
-
Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: HPC/systems for LLM serving: analytical sizing of Attention/FFN ratios in disaggregated architecture to maximize throughput and minimize idle time.
-
Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: HPC/Distributed training: collaborative compressors for communication-efficient distributed mean estimation with error analyses beyond l2.
-
Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: High Performance Computing: joint optimization of kernel scheduling and frequency scaling to reduce training energy/time—systems-level training efficiency.
-
HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: High-Performance/Distributed Training: hybrid-order split learning that reduces client memory (no backprop activations) with convergence analysis.
-
Distributed Perceptron under Bounded Staleness, Partial Participation, and Noisy Communication - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: High Performance Computing/Distributed Training: semi-asynchronous perceptron with staleness-bucket aggregation under delays, partial participation, and noisy communication, with mistake bounds.
-
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: High Performance Computing/Training: distribution-aligned sequence distillation to better match teacher output distributions and mitigate exposure bias.
-
NOVAK: Unified adaptive optimizer for deep neural networks - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: HPC/Systems: unified adaptive optimizer with custom CUDA kernels and rectified adaptive rates; systems-level speedups for large-scale training.
-
Tight Analysis of Decentralized SGD: A Markov Chain Perspective - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: High Performance Computing/Distributed Training: Markov chain analysis of decentralized SGD with non-asymptotic bounds and linear speedup characterization under network topology.
-
AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: High Performance Computing: unified, framework-agnostic performance modeling and configuration search for LLM serving (covers tensor/pipeline/expert parallelism, KV-cache, and scheduling), enabling systems-level efficiency gains.
-
Latent Space Communication via K-V Cache Alignment - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Architecture/Systems: aligns K-V caches via shared latent space with adapters for high-bandwidth inter-model communication and skill transfer.
-
DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: High Performance Computing/System Efficiency: GPU-first tokenizer with LUT-based streaming and overlapped H2D/compute removes tokenization bottlenecks for foundation models.
-
MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Multi-queue pipelined GNN training with asynchronous updates, caching, and adaptive queue sizing—systems/HPC innovation for scalable training.
-
Accelerating Storage-Based Training for Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: High Performance Computing: systems-level storage I/O optimization (block-wise I/O and hyperbatching) to accelerate large-scale GNN training on NVMe.
-
Energy-Aware Routing to Large Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Efficiency/Systems: variance-aware, energy-aware routing among large reasoning models using compute scaling laws.
-
Toward Large-Scale Photonics-Empowered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration - Score: 15 (R=8, N=7) - Date: 2026-01-05 - Comment: Cross-layer systems/toolchain for photonic AI with dynamic tensor ops for Transformers and implementation-aware co-design (High Performance Computing).
-
Tensor Computing Interface: An Application-Oriented, Lightweight Interface for Portable High-Performance Tensor Network Applications - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: High Performance Computing: portable, lightweight tensor-network API enabling high performance across heterogeneous backends.
Representation Learning (114)
-
Value-guided action planning with JEPA world models - Score: 20.0 (R=0, N=0) - Date: 2026-01-07 - Comment: Author match
-
What Drives Success in Physical Planning with Joint-Embedding Predictive World Models? - Score: 20.0 (R=0, N=0) - Date: 2026-01-01 - Comment: Author match
-
Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-14 - Comment: Strongly matches Representation Learning: identifies and steers sparse latent features (via SAEs) causally tied to reasoning, enabling activation-level control.
-
Attribution-Guided Distillation of Matryoshka Sparse Autoencoders - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Representation Learning and Sparsity: distillation of a compact core of features in sparse autoencoders, improving transfer across sparsity levels.
-
Minimax Rates for Hyperbolic Hierarchical Learning - Score: 18 (R=9, N=9) - Date: 2026-01-29 - Comment: Representation Learning Theory: proves minimax-optimal sample complexity for hyperbolic representations on hierarchies and exponential separation vs Euclidean embeddings.
-
Implicit bias as a Gauge correction: Theory and Inverse Design - Score: 18 (R=9, N=9) - Date: 2026-01-13 - Comment: Representation Learning/Training Dynamics: geometric gauge-correction mechanism explaining implicit bias from symmetry–stochasticity interaction, with inverse-design of desired biases (e.g., sparsity).
-
When Models Manipulate Manifolds: The Geometry of a Counting Task - Score: 18 (R=9, N=9) - Date: 2026-01-09 - Comment: Representation Learning/Training Dynamics: mechanistic interpretability revealing low-dimensional counting manifolds and attention geometry in transformers.
-
From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence - Score: 18 (R=9, N=9) - Date: 2026-01-07 - Comment: Representation learning theory: introduces a new information measure (epiplexity) for computationally bounded observers, guiding data selection and learning.
-
Deep Networks Learn Deep Hierarchical Models - Score: 18 (R=9, N=9) - Date: 2026-01-05 - Comment: Representation Learning/Theory: proves layerwise SGD on ResNets efficiently learns deep hierarchical label models (polynomial depth), advancing learnability theory.
-
Value-Based Pre-Training with Downstream Feedback - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Representation/Training Dynamics: value-based continued pretraining steers SSL using downstream-gradient alignment to maximize gradient value per step.
-
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Representation Learning/Training Dynamics: influence-function-based mechanistic data attribution linking training samples to interpretable circuits and ICL heads.
-
Can Local Learning Match Self-Supervised Backpropagation? - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Representation learning/training dynamics: theoretical equivalence conditions between local SSL and global BP-SSL, plus practical local-SSL variants matching global SSL.
-
Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: Representation Learning: principled concept extraction via unsupervised linear unmixing of LLM activations (Concept Component Analysis) with sparsity priors, offering a theory-backed alternative to SAEs.
-
Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: Representation Learning: unified spectral framework explaining self-supervised objectives via spectral sufficiency, offering principled foundations and algorithmic guidance.
-
Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: Representation Learning: geometric/spectral analysis of Transformer hidden manifolds revealing phase transitions, effective dimensionality collapse, and renormalization-like flows.
-
The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Representation Learning Theory: measure-theoretic analysis of contrastive learning geometry beyond alignment–uniformity, including multimodal divergence effects.
-
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Representation Learning/Mechanistic Interpretability: closed-form early-training weight characterizations in Transformers via gradient leading terms.
-
Neural Network Approximation: A View from Polytope Decomposition - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Representation Learning Theory: universal approximation via polytope decomposition with explicit ReLU constructions and improved rates.
-
Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Representation Learning/Training Dynamics: introduces a process-tensor view of SGD with a measurable non-Markovian memory witness via back-flow of distinguishability.
-
Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Representation Learning/Training Dynamics: spectral analysis ties collapse to dominant singular directions; REVIVE preserves singular subspace during editing.
-
Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Representation Learning: uses Sparse Autoencoders to identify causal, task-specific features ("translation initiation") inside LLMs and validates via interventions.
-
Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Matches 'Representation Learning: training dynamics in neural networks' by theoretically linking SGD noise, effective potentials, and transient freezing to a preference for flat minima.
-
An analytic theory of convolutional neural network inverse problems solvers - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Matches Representation Learning/Theory: provides an analytic LE-MMSE framework capturing CNN inductive biases (equivariance, locality) for inverse problems with strong empirical alignment.
-
In-Context Operator Learning on the Space of Probability Measures - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Matches Representation Learning/Theory: proposes in-context operator learning on probability measures with scaling-law theory and explicit architectures for OT maps.
-
Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: Model Architecture: token-wise branch-and-merge (Multiplex Thinking) aggregates K sampled token embeddings into a single multiplex token for soft reasoning.
-
Towards A Unified PAC-Bayesian Framework for Norm-based Generalization Bounds - Score: 17 (R=9, N=8) - Date: 2026-01-14 - Comment: Representation Learning/Theory: unified PAC-Bayesian norm-based generalization bounds using anisotropic posteriors and an architecture-aware sensitivity matrix.
-
Transformer Is Inherently a Causal Learner - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Representation Learning: proves autoregressive transformers’ gradient sensitivities recover time-delayed causal graphs, offering theoretical insight into learned representations.
-
Do Sparse Autoencoders Identify Reasoning Features in Language Models? - Score: 17 (R=9, N=8) - Date: 2026-01-12 - Comment: Representation Learning: falsification-oriented analysis of Sparse Autoencoders, combining causal token injection and LLM-guided tests to assess whether SAE features encode genuine reasoning.
-
Excess Description Length of Learning Generalizable Predictors - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Matches Representation Learning/Training Dynamics: information-theoretic framework (Excess Description Length) quantifying capability acquisition and generalization.
-
On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Representation Learning/Optimization: studies preconditioned gradient descent to mitigate spectral bias and reduce grokking delays; theoretical and empirical insights into learning regimes.
-
Context Collapse: In-Context Learning and Model Collapse - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Representation Learning: theoretical analysis of in-context learning in a (linear) transformer via reduction to preconditioned gradient descent; links training dynamics to phase transitions and introduces context collapse.
-
Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Representation Learning Theory: information-theoretic generalization bounds leveraging flatness to tighten SGD generalization and improve rates.
-
Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Theoretical Representation Learning: new log-weighted Barron spaces and depth-sensitive ReLU approximation bounds (Sobolev metrics).
-
Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process - Score: 17 (R=9, N=8) - Date: 2026-01-01 - Comment: Representation Learning: uses sparse autoencoders to discover disentangled reasoning vectors enabling interpretable control of LLM reasoning behaviors via latent interventions.
-
Linear representations in language models can change dramatically over a conversation - Score: 16 (R=9, N=7) - Date: 2026-01-29 - Comment: Representation Learning: studies dynamics of linear concept directions in LMs across conversations, impacting interpretability/steering.
-
Decomposing multimodal embedding spaces with group-sparse autoencoders - Score: 16 (R=9, N=7) - Date: 2026-01-29 - Comment: Representation Learning + sparsity: group-sparse autoencoders with cross-modal masking to decompose multimodal embeddings.
-
Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Model Architecture & Representation Learning: proposes an autoencoder with non-uniform variance regularization and isometric constraint to recover ordered latent components (PCA generalization).
-
Jacobian Scopes: token-level causal attributions in LLMs - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Matches Representation Learning/Analysis: gradient-based token-level causal attributions (Jacobian Scopes) for interpreting LLM predictions.
-
YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation - Score: 16 (R=9, N=7) - Date: 2026-01-15 - Comment: Representation Learning: learns sparse, disentangled activation steering vectors in SAE latent space for controllability/alignment without a reference model.
-
Dynamics Reveals Structure: Challenging the Linear Propagation Assumption - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Matches Representation Learning: theoretical analysis of first-order update propagation and constraints (bilinearity vs negation) on feature maps.
-
CORDS: Continuous Representations of Discrete Structures - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Learning/Set Modeling: invertible continuous fields (density/feature) for variable-sized sets enabling exact decoding.
-
Bridging Functional and Representational Similarity via Usable Information - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Learning Theory: unifies functional and representational similarity via usable information linking stitching, CKA/RSA, and reconstruction.
-
Representation Unlearning: Forgetting through Information Compression - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Unlearning: imposes an information bottleneck in representation space to forget while retaining utility, with variational objectives.
-
Fast and Geometrically Grounded Lorentz Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Model architecture: new Lorentz linear layer with geometric guarantees plus efficient activations/caching for hyperbolic NNs, improving representation learning in non-Euclidean space.
-
$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation/Compression Theory: tight bounds on minimal embeddable dimension for top-k retrieval under common similarities, informing embedding design.
-
Order-Optimal Sample Complexity of Rectified Flows - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation learning/theory: proves order-optimal sample complexity for rectified flows in generative modeling.
-
To Grok Grokking: Provable Grokking in Ridge Regression - Score: 16 (R=8, N=8) - Date: 2026-01-28 - Comment: Representation Learning: theoretical training-dynamics analysis of grokking with provable bounds on generalization delay in ridge regression.
-
Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs - Score: 16 (R=8, N=8) - Date: 2026-01-19 - Comment: Representation Learning/Mechanistic Interpretability: identifies anchor–adapter circuits causing shortcut memorization under RLVR and demonstrates causal steering.
-
Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core - Score: 16 (R=8, N=8) - Date: 2026-01-19 - Comment: Representation learning/training dynamics: protocol to decouple logic from facts via gradient reversal, toward a modular neural logic core.
-
Universal Latent Homeomorphic Manifolds: Cross-Domain Representation Learning via Homeomorphism Verification - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Representation Learning: proposes a topology-based (homeomorphism) framework and verification algorithms to unify latent manifolds across modalities, offering theoretical insights into learned representations.
-
Dynamic Graph Structure Learning via Resistance Curvature Flow - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Representation Learning/Efficiency: Resistance Curvature Flow replaces OT-based curvature optimization with effective-resistance matrix ops for dynamic graph structure learning (>100x speedup).
-
Manifold limit for the training of shallow graph convolutional neural networks - Score: 16 (R=8, N=8) - Date: 2026-01-12 - Comment: Representation Learning/Training Theory: proves Γ-convergence for training shallow GCNNs under manifold assumptions, formalizing mesh/sample independence.
-
On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis - Score: 16 (R=8, N=8) - Date: 2026-01-12 - Comment: Representation Learning/Training dynamics theory: formalizes recursive self-training in LLMs and proves degenerative behaviors (entropy decay, variance amplification), arguing for neurosymbolic synthesis.
-
Bridging Distance and Spectral Positional Encodings via Anchor-Based Diffusion Geometry Approximation - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Representation Learning: connects spectral/diffusion positional encodings to anchor-based distance via low-rank/Nyström approximation with theoretical guarantees.
-
An Algebraic Representation Theorem for Linear GENEOs in Geometric Machine Learning - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Strongly matches Model Architecture theory (representation theorem for equivariant operators/GENEOs enabling efficient, interpretable architectures).
-
Credit Assignment via Neural Manifold Noise Correlation - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Representation Learning/Learning algorithms: proposes manifold-restricted noise correlation for credit assignment, improving sample efficiency and scalability with biological plausibility.
-
The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving - Score: 16 (R=8, N=8) - Date: 2026-01-05 - Comment: Proposes a unified training objective (DCR) to prevent diversity collapse in reasoning, addressing training dynamics and representation over solution traces (Representation Learning).
-
Implicit score matching meets denoising score matching: improved rates of convergence and log-density Hessian estimation - Score: 16 (R=8, N=8) - Date: 2026-01-01 - Comment: Representation Learning/Theory: establishes convergence rates and Hessian estimation for implicit and denoising score matching, with implications for diffusion model samplers.
-
From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning: contrastive latent regularizer to reduce forget–retain entanglement for LLM unlearning (explicit representation shaping).
-
Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation learning/Mechanistic interpretability: geometric, feature-centric framework explaining catastrophic forgetting; analysis on ViTs.
-
How Expressive Are Graph Neural Networks in the Presence of Node Identifiers? - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning: formal analysis of GNN expressive power with unique node identifiers (key-invariant expressivity), with links to logic classes.
-
Amortized Spectral Kernel Discovery via Prior-Data Fitted Network - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning/Architecture analysis: decoders mapping PFN latents to spectral densities and stationary kernels (Bochner) enabling amortized kernel discovery.
-
XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation learning: weakly-supervised disentanglement via contrastive supervision within a VAE/Information Bottleneck framework, enabling controllable factors.
-
Learning the Mechanism of Catastrophic Forgetting: A Perspective from Gradient Similarity - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning/Training dynamics: gradient-similarity theory identifies conflicting vs collaborative neurons; proposes selective freezing to prevent forgetting.
-
FlexCausal: Flexible Causal Disentanglement via Structural Flow Priors and Manifold-Aware Interventions - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning: causal disentanglement with block-diagonal VAE and flow-based priors plus manifold-aware interventions.
-
Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation learning theory: phase transition analysis for spectral PLS under missing data using spiked random matrix theory; insights into multimodal representation recovery.
-
LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them? - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning/Efficient Fine-tuning: layer-wise analysis localizes language control and selectively tunes final layers (few parameters) to fix multilingual consistency.
-
Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Representation Learning/Training Dynamics: frames supervised training as implicit hypothesis testing with KL divergence alignment toward Neyman–Pearson optimality, suggesting regularization strategies.
-
Loss Landscape Geometry and the Learning of Symmetries: Or, What Influence Functions Reveal About Robust Generalization - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Representation Learning: influence-function diagnostic measuring gradient coupling along symmetry orbits to assess robust generalization via loss landscape geometry.
-
Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Architecture/Representation Learning: orthogonal pooling constructs mutually orthogonal latent subspaces per task to resolve gradient conflicts in multi-task learning.
-
Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Matches Representation Learning/Training Dynamics: stability and generalization bounds for nonconvex optimization under heavy-tailed gradient noise across SGD variants.
-
Fixed Aggregation Features Can Rival GNNs - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Matches Representation Learning/Architecture: fixed (non-trainable) neighborhood aggregation features rival GNNs; theoretical links to Kolmogorov–Arnold representations challenge prevailing assumptions.
-
Smooth embeddings in contracting recurrent networks driven by regular dynamics: A synthesis for neural representation - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning: theoretical synthesis showing when contracting RNNs learn smooth, topology-preserving embeddings of regular dynamics; implications for state dimension and training dynamics.
-
ASEHybrid: When Geometry Matters Beyond Homophily in Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Matches Model Architecture and Representation Learning: geometry-aware GNN with theoretical characterization (label informativeness) and curvature-guided rewiring.
-
Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning: introduces a structural metric (Homomorphism Error) on Transformer hidden states and uses it as a training regularizer to improve compositional generalization.
-
Stability as a Liability: Systematic Breakdown of Linguistic Structure in LLMs - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Matches Representation Learning: analyzes training dynamics under MLE, showing stability leads to forward-KL minimization and low-entropy generations.
-
Nonlinear multi-study factor analysis - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Matches Representation Learning: sparse multi-study variational autoencoder for shared/specific nonlinear factors with identifiability guarantees.
-
Spelling Bee Embeddings for Language Modeling - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Model Architecture: modifies the embedding layer to inject spelling features, improving representation quality with compute/data savings.
-
No Validation, No Problem: Predicting Model Performance from a Single Gradient - Score: 15 (R=8, N=7) - Date: 2026-01-26 - Comment: Representation Learning/Training Dynamics: proposes a validation-free checkpointing signal from a single gradient; efficiency-oriented early stopping/selection without labels.
-
Relational Linearity is a Predictor of Hallucinations - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Representation learning/training dynamics: links relational linearity in embeddings to hallucination behavior, offering insight into how LLMs store facts.
-
Operator learning on domain boundary through combining fundamental solution-based artificial data and boundary integral techniques - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Representation Learning: boundary-only neural operator (MAD-BNO) learns Dirichlet–Neumann maps from mathematical artificial data; recovers interiors via boundary integrals.
-
Understanding and Preserving Safety in Fine-Tuned LLMs - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: Representation Learning/Training Dynamics: identifies a low-rank safety-gradient subspace and uses projection-based fine-tuning (SPF) to preserve safety while maintaining utility.
-
Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Representation Learning: introduces a circuit-level, mechanistic pre-unlearning difficulty metric (CUD) grounded in model circuits and interaction pathways.
-
Ability Transfer and Recovery via Modularized Parameters Localization - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Parameter modularization: activation-guided channel-wise ability transfer; insights into ability localization in LLM parameters (Representation Learning/Model Editing).
-
Supervised Spike Agreement Dependent Plasticity for Fast Local Learning in Spiking Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Representation Learning/Training dynamics: supervised spike agreement-dependent plasticity enabling local, backprop-free learning with linear-time complexity in SNNs.
-
Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation, Large Activation, and Benign Overfitting - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Representation Learning: empirical analysis of epoch-wise double descent, benign overfitting, and large activations in deep nets.
-
Representations of Text and Images Align From Layer One - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Representation Learning: constructive, layer-wise evidence of image–text alignment from early layers using synthesis-based probes.
-
Local EGOP for Continuous Index Learning - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Representation Learning: Local EGOP metric for adaptive kernels/subspace estimation achieving intrinsic-dimension rates.
-
Variational decomposition autoencoding improves disentanglement of latent representations - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Model Architecture/Representation Learning: decomposition-aware variational autoencoder for disentangled latent subspaces.
-
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Mechanistic interpretability of Diffusion Transformers’ circuits for spatial relations fits the Representation Learning/training dynamics criterion.
-
SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Representation Learning and training dynamics: SPINAL quantifies layerwise geometric changes from DPO via contraction/transport scores.
-
VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Leverages Variational Information Bottleneck to probe and intervene on attention heads, matching the Representation Learning criterion (internal mechanism analysis and causally-informed mitigation).
-
Tracing Moral Foundations in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Representation Learning/mechanistic interpretability: layer-wise concepts, sparse autoencoder features, and causal steering in LLMs.
-
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Representation Learning/Training Dynamics: analyzes structure of long CoT reasoning and proposes Mole-Syn to synthesize effective reasoning trajectories for stable learning.
-
Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Representation Learning and Architecture: quantifies dataset shape-texture balance and induces shape bias via max-pool dilation.
-
Poisson Hyperplane Processes with Rectified Linear Units - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Model Architecture/Theory: establishes a probabilistic PHP representation equivalent to two-layer ReLU networks, with scalable decomposition and Bayesian inference.
-
Aligned explanations in neural networks - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches Model Architecture (pseudo-linear PiNets enabling aligned, instance-wise linear predictions) and Representation Learning (linearly readable features).
-
Layer-wise Positional Bias in Short-Context Language Modeling - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Representation learning/training dynamics: layer-wise positional bias profiling via attribution, revealing recency/primacy patterns across depth.
-
Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Representation learning/training dynamics: layer-wise analysis of multi-hop reasoning with a probabilistic recall-and-extract framework explaining internal composition.
-
Hierarchical temporal receptive windows and zero-shot timescale generalization in biologically constrained scale-invariant deep networks - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Model Architecture: introduces a scale-invariant recurrent architecture (SITH-RNN) with hierarchical temporal receptive windows and zero-shot timescale generalization; Representation Learning: insights into temporal priors and training dynamics.
-
Output Embedding Centering for Stable LLM Pretraining - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Training dynamics/representation geometry: proposes output embedding centering (μ-centering/μ-loss) to stabilize LLM pretraining with theoretical guarantees.
-
ELLA: Efficient Lifelong Learning for Adapters in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation Learning/Efficiency: selective subspace de-correlation via anisotropic shrinkage regularization for continual adapters with constant compute/memory.
-
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Training dynamics/representation learning: entropy-gated fine-tuning to mitigate forgetting by suppressing confident-conflict gradients.
-
Towards a Principled Muon under $\mu\mathsf{P}$: Ensuring Spectral Conditions throughout Training - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Training Dynamics/Optimization: ensures μP spectral conditions throughout training for Muon (Muon++), aligning optimizer updates with μP scaling for large models.
-
Intention Collapse: Intention-Level Metrics for Reasoning in Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation Learning: proposes intention-level metrics (entropy, effective dimensionality, recoverability) to study inference-time computation and internal representations in LMs.
-
Deep Clustering with Associative Memories - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation learning: deep clustering objective using energy-based associative memories coupling representation and clustering.
-
Deep Deterministic Nonlinear ICA via Total Correlation Minimization with Matrix-Based Entropy Functional - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation Learning: deep deterministic nonlinear ICA minimizing total correlation via matrix-based entropy functional; avoids variational/adversarial schemes.
-
On the geometry and topology of representations: the manifolds of modular addition - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Analyzes learned representations for modular addition as manifolds, showing equivalence across attention architectures; core Representation Learning insight.
-
Generative Classifiers Avoid Shortcut Solutions - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Representation Learning/Architecture: shows generative classifiers reduce shortcut reliance and perform better under distribution shift, with theoretical and empirical analysis.
-
Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Representation Learning/Efficiency: identifies cognitive attention heads and applies test-time representation rotations (training-free) to steer reasoning, reducing tokens and improving accuracy.
-
Towards mechanistic understanding in a data-driven weather model: internal activations reveal interpretable physical features - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Representation Learning/interpretability: applies sparse autoencoders to internal activations of a weather model to discover and intervene on physical features.
-
Information-Theoretic Quality Metric of Low-Dimensional Embeddings - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Introduces an information-theoretic metric (ERPM) for embedding quality via entropy/stable rank; fits Representation Learning evaluation/analysis.
-
Deep learning methods for inverse problems using connections between proximal operators and Hamilton-Jacobi equations - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Model Architecture and Representation Learning: leverages connections between proximal operators and Hamilton–Jacobi PDEs to design architectures for learning priors in inverse problems.
-
Geometric Scaling of Bayesian Inference in LLMs - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Matches Representation Learning: analyzes internal geometry in Transformers/LLMs (entropy-aligned axis, low-dimensional value manifolds) and training dynamics via targeted interventions revealing how uncertainty is encoded.
Other Foundational Research (4)
-
In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior - Score: 20.0 (R=0, N=0) - Date: 2026-01-07 - Comment: Author match
-
Paradoxical noise preference in RNNs - Score: 16 (R=9, N=7) - Date: 2026-01-09 - Comment: Matches Training Dynamics: reveals noise-level-dependent fixed-point shifts in CTRNNs and shows that noise is integral to computation.
-
A New Convergence Analysis of Plug-and-Play Proximal Gradient Descent Under Prior Mismatch - Score: 16 (R=8, N=8) - Date: 2026-01-16 - Comment: Matches theoretical training analysis: first convergence proof for PnP-PGD under prior mismatch, relaxing restrictive assumptions.
-
Hebbian Learning with Global Direction - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Training dynamics: biologically plausible Hebbian framework augmented with global directional signals as an alternative to backprop.