Personalized Monthly Topic Summary 2026/01

Metric                              Value
Total Papers                        411
Model Architecture                  122
Model Compression and Efficiency    129
High Performance Computing          42
Representation Learning             114
Other Foundational Research         4

Model Architecture (122)

  1. L$^3$: Large Lookup Layers - Score: 19 (R=10, N=9) - Date: 2026-01-30 - Comment: Model Architecture & Sparsity: proposes Large Lookup Layers as a systems-friendly sparse alternative to MoE with static token-based routing and embedding allocation; enables CPU-offloaded inference.

  2. Post-LayerNorm Is Back: Stable, ExpressivE, and Deep - Score: 19 (R=10, N=9) - Date: 2026-01-28 - Comment: Strong match to Model Architecture and training stability: Post-LN Transformer with Highway-style connections enabling stable ultra-deep training and improved depth scaling.

  3. Superlinear Multi-Step Attention - Score: 19 (R=10, N=9) - Date: 2026-01-27 - Comment: Model Architecture and Efficiency: multi-step attention achieving subquadratic complexity while preserving random context access; scalable design for long contexts.

  4. LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts - Score: 19 (R=10, N=9) - Date: 2026-01-27 - Comment: MoE Architecture + Efficiency: hardware–software co-designed LatentMoE optimizing accuracy per FLOP/parameter, with empirical/theoretical backing.

  5. LongCat-Flash-Thinking-2601 Technical Report - Score: 19 (R=10, N=9) - Date: 2026-01-27 - Comment: Matches Model Architecture (MoE) and HPC/Distributed Training: 560B MoE with domain-parallel expert training, large-scale asynchronous RL infrastructure, and test-time scaling.

  6. On the Expressive Power of Floating-Point Transformers - Score: 19 (R=10, N=9) - Date: 2026-01-26 - Comment: Model Architecture/Representation Theory: expressive power of floating-point Transformers, permutation equivariance under finite precision, and positional encoding effects.

  7. Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics - Score: 19 (R=10, N=9) - Date: 2026-01-09 - Comment: Model Architecture: continuous-token maturation with delayed discretization for autoregressive generation, enabling stable deterministic decoding.

  8. Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts - Score: 19 (R=10, N=9) - Date: 2026-01-06 - Comment: Model Architecture + Representation Learning: diffusion models with MoLR-MoG latent leading to MoE-structured score; provides estimation and convergence guarantees.

  9. A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory - Score: 19 (R=10, N=9) - Date: 2026-01-06 - Comment: Theoretical Architecture: depth hierarchy lower bounds for computing max with ReLUs via extremal graph theory.

  10. Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Architecture: proposes depth-recurrent attention mixtures combining depth attention and sparse expert attention (MoE) to scale latent reasoning efficiently.

  11. ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Compression: MoE with adaptive token-to-concept compression for implicit compute allocation; reduces attention/KV cache and improves efficiency.

  12. L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Matches Model Architecture: MoE routing improved via low-rank latent routing space and Lipschitz-controlled scoring geometry.

  13. Scaling Embeddings Outperforms Scaling Experts in Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model architecture and efficiency: proposes scaling embeddings as an alternative to MoE sparsity scaling; includes system optimizations/speculative decoding; directly targets MoE/LLM scaling.

  14. Hyperparameter Transfer with Mixture-of-Expert Layers - Score: 18 (R=10, N=8) - Date: 2026-01-29 - Comment: Model Architecture (MoE): DMFT-justified parameterization enabling hyperparameter transfer across width/depth/experts/expert-size in sparse MoE Transformers.

  15. $\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: MoE Architecture: continuous expert parameterization (infinite experts) enabling flexible compute–accuracy trade-offs at inference.

  16. FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: High Performance Computing/Efficiency for MoE: ML-based cache replacement for SSD-offloaded experts enabling on-device MoE inference and reducing I/O bottlenecks.

  17. GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches Model Architecture (MoE): geometric router constraints (null-space projection) for algorithm-agnostic unlearning that preserves routing while erasing expert knowledge.

  18. A Collision-Free Hot-Tier Extension for Engram-Style Conditional Memory: A Controlled Study of Training Dynamics - Score: 18 (R=10, N=8) - Date: 2026-01-26 - Comment: Model Architecture and Training Dynamics: conditional memory with a collision-free hot tier via MPHF; analysis reveals gating credit assignment limits and collision-induced regularization.

  19. Demystifying the Slash Pattern in Attention: The Role of RoPE - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Representation/Architecture analysis: theoretical and empirical explanation of slash attention patterns via RoPE and training dynamics.

  20. WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation - Score: 18 (R=10, N=8) - Date: 2026-01-14 - Comment: Model Architecture and Efficiency: replaces attention with a wave propagation operator (O(N log N)) via frequency-time decoupled formulation for global interactions.

  21. MoE-DisCo: Low Economy Cost Training Mixture-of-Experts Models - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: MoE/HPC: staged training of Mixture-of-Experts via disentangled submodels and unsupervised clustering to reduce cost on low-end hardware.

  22. Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Model Architecture and Efficiency: Transformer alternative with EMA/gated attention plus sliding chunk attention, timestep decay normalization, and adaptive working memory for million-token contexts without explicit context extension.

  23. Monkey Jump: MoE-Style PEFT for Efficient Multi-Task Learning - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Strong match to Model Architecture (MoE-style specialization) and Compression/Efficiency (parameter-efficient routing without extra trainable experts/routers).

  24. CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Model Architecture: demographic-aware Mixture of Adapters with routing to separate cultural modes and mitigate gradient interference.

  25. The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Strongly matches Model Architecture (Mixture-of-Experts analysis uncovering a domain-invariant ‘Standing Committee’; direct MoE focus).

  26. Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-01-07 - Comment: Model Architecture (MoE): kNN-augmented expert routing with retrieval-based mixing for robust token-to-expert assignment under shift.

  27. Geometric and Dynamic Scaling in Deep Transformers - Score: 18 (R=10, N=8) - Date: 2026-01-07 - Comment: Model Architecture/Training Dynamics: proposes Manifold-Geometric Transformer with manifold-constrained hyper-connections and deep delta learning to prevent rank collapse in deep Transformers.

  28. LinMU: Multimodal Understanding Made Linear - Score: 18 (R=10, N=8) - Date: 2026-01-06 - Comment: Efficiency/Architecture: replaces quadratic attention with dual-branch linear-complexity module (bidirectional SSM + local window attention) and a 3-stage distillation pipeline for VLMs.

  29. Making MoE based LLM inference resilient with Tarragon - Score: 18 (R=10, N=8) - Date: 2026-01-06 - Comment: HPC/MoE Systems: resilient MoE inference via reconfigurable datapath, KV-cache checkpointing, and shadow experts for fault tolerance.

  30. RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Directly targets MoE router behavior and expert-parallel load imbalance under adversarial prompts; strong match to Model Architecture (MoE) and systems-level inference effects.

  31. Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication - Score: 18 (R=9, N=9) - Date: 2026-01-30 - Comment: Model Architecture: logic-derived Arrow Language Model interpreting next-token prediction as nested intuitionistic implication with low-rank realization.

  32. FloydNet: A Learning Paradigm for Global Relational Reasoning - Score: 18 (R=9, N=9) - Date: 2026-01-28 - Comment: Model Architecture: replaces local message passing with a learned DP-style global refinement operator; proven expressivity (3-WL/2-FWL).

  33. The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit - Score: 18 (R=9, N=9) - Date: 2026-01-16 - Comment: Architecture theory: shows self-attention’s tropical (max-plus) limit, linking transformers to dynamic programming/shortest-path.

  34. Robust Reasoning as a Symmetry-Protected Topological Phase - Score: 18 (R=9, N=9) - Date: 2026-01-09 - Comment: Model Architecture: proposes a Holonomic Network with non-Abelian gauge symmetry, framing robust reasoning as a symmetry-protected topological phase.

  35. Horseshoe Mixtures-of-Experts (HS-MoE) - Score: 17 (R=10, N=7) - Date: 2026-01-15 - Comment: Model Architecture: Mixture-of-Experts with Bayesian horseshoe priors for sparse expert selection and a particle learning algorithm for sequential inference.

  36. Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints - Score: 17 (R=10, N=7) - Date: 2026-01-14 - Comment: Model Architecture (MoE): principled design under memory/inference constraints; highlights total parameters and expert sparsity as key factors.

  37. Towards Specialized Generalists: A Multi-Task MoE-LoRA Framework for Domain-Specific LLM Adaptation - Score: 17 (R=10, N=7) - Date: 2026-01-14 - Comment: Model Architecture: combines Mixture-of-Experts with Low-Rank Adaptation (LoRA) for multi-task domain adaptation and interference mitigation.

  38. Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Efficiency: distills Transformers into RNN-attention hybrids (HALO/HypeNet) with improved long-context efficiency and length generalization.

  39. A Separable Architecture for Continuous Token Representation in Language Models - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Efficiency: replaces embedding tables with a continuous token generator (separable architecture) improving parametric efficiency.

  40. Clustering in Deep Stochastic Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Matches Representation Learning/Theory: stochastic analysis of deep Transformer token dynamics; interacting-particle limit prevents collapse.

  41. Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Training Theory: unified generalization framework via L2-stability for parameter-space model merging across heterogeneous experts, with actionable merging guidance.

  42. Perceptrons and localization of attention's mean-field landscape - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture theory: mean-field analysis of Transformer attention/perceptron blocks showing atomic localization of critical points.

  43. The Depth Delusion: Why Transformers Should Be Wider, Not Deeper - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Architecture/Scaling Laws: architecture-conditioned scaling revealing critical depth and advocating width-over-depth tradeoffs.

  44. SONIC: Spectral Oriented Neural Invariant Convolutions - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Strong match to Model Architecture: continuous, orientation-aware spectral parameterization of convolutional operators with global receptive fields and resolution adaptivity.

  45. LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Model Architecture and Certified Robustness: constraint-free, convolution-free 1-Lipschitz architecture with manifold optimization and scalable training.

  46. Power-based Partial Attention: Bridging Linear-Complexity and Full Attention - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Model Architecture/Efficiency: sub-quadratic attention (O(L^{1+p})) bridging linear and full attention to quantify necessary attention.

  47. Finite-Time Analysis of Gradient Descent for Shallow Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Theoretical Training Dynamics: finite-time analysis of gradient descent for shallow Transformers with width scaling and sequence-length–independent optimization error.

  48. Multigrade Neural Network Approximation - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Model Architecture/Training Paradigm: multigrade deep learning (grade-wise residual training) with operator-theoretic guarantees of vanishing approximation error.

  49. Provably Learning Attention with Queries - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Matches Model Architecture (attention/Transformer) with theoretical learning/identifiability via query access.

  50. Unit-Consistent (UC) Adjoint for GSD and Backprop in Deep Learning Applications - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Model Architecture/Optimization: introduces a unit-consistent adjoint for gauge-equivariant backprop/steepest descent in positively homogeneous networks.

  51. MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Model architecture: Modality-Aware Mixture-of-Experts with modality-specific routing and shared experts (MoE).

  52. TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Matches Conditional/Dynamic Networks and Efficiency: step-level routing (TRIM) that sends only critical reasoning steps to larger models using uncertainty and process rewards, improving cost-accuracy tradeoffs.

  53. Unlabeled Data Can Provably Enhance In-Context Learning of Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Representation Learning/Training Dynamics in Transformers: theoretical analysis showing CoT-augmented prompts let transformers emulate EM using unlabeled data for improved ICL.

  54. Layer-Parallel Training for Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: High Performance Computing: parallel-in-time, layer-parallel training of Transformers via neural ODE formulation with accuracy control.

  55. Controlled LLM Training on Spectral Sphere - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: High-Performance Training/Optimization: Spectral Sphere Optimizer enforces module-wise spectral constraints, fully muP-aligned, improving stability (incl. MoE router balance) over AdamW/Muon.

  56. Parallel Context-of-Experts Decoding for Retrieval Augmented Generation - Score: 17 (R=9, N=8) - Date: 2026-01-14 - Comment: Model Architecture/Efficiency: Parallel Context-of-Experts decoding treats retrieved docs as experts with contrastive aggregation, avoiding shared attention and prefill bottlenecks.

  57. LDLT L-Lipschitz Network Weight Parameterization Initialization - Score: 17 (R=9, N=8) - Date: 2026-01-14 - Comment: Model Architecture/Training Dynamics: analytic initialization for LDLT L-Lipschitz layers with exact variance derivations; practical prescriptions for stable deep Lipschitz networks.

  58. CliffordNet: All You Need is Geometric Algebra - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Proposes a new vision backbone grounded in geometric algebra with linear complexity, directly matching Model Architecture (Transformer/CNN alternatives) and Efficiency.

  59. Bi-Orthogonal Factor Decomposition for Vision Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Strong match to Representation Learning/mechanistic analysis: bi-orthogonal factor decomposition to disentangle position vs content interactions in ViT attention.

  60. Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer - Score: 17 (R=9, N=8) - Date: 2026-01-12 - Comment: Model Architecture: introduces a Discrete Transformer with enforced functional disentanglement (routing vs arithmetic) and annealed sampling to enable program extraction, boosting interpretability.

  61. Token-Level LLM Collaboration via FusionRoute - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Matches Model Architecture/Efficiency: token-level routing with a trainable complementary generator; theoretical limits of expert-only routing (MoE-like).

  62. Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Matches Training Dynamics/Architecture: learnable per-matrix/row/column multipliers to free WD-noise equilibrium scale, improving optimization.

  63. NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Model Architecture: unified autoregressive transformer with next-scale visual prediction enabling fast 1024×1024 generation; unified multimodal tokenization and training.

  64. Attention Needs to Focus: A Unified Perspective on Attention Allocation - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Model Architecture and Efficiency: introduces Lazy Attention with positional discrimination and Elastic-Softmax to mitigate collapse/sink and induce attention sparsity.

  65. Context-Free Recognition with Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Model Architecture Theory: shows looped transformers with O(log n) iterations and padding can recognize CFLs, advancing formal capacity understanding.

  66. Deep Delta Learning - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Model Architecture: generalizes residual connections via a learnable rank‑1 Delta operator with spectral control and gated dynamics.

  67. Constructing a Neuro-Symbolic Mathematician from First Principles - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Model Architecture: neuro-symbolic design using a Hypergraph Transformer and a differentiable symbolic reasoning kernel with energy-based training signals.

  68. Modeling Language as a Sequence of Thoughts - Score: 17 (R=9, N=8) - Date: 2026-01-01 - Comment: Model Architecture: a recurrent Transformer with sentence-level “thought” memory and shared-parameter token/thought generation for sequence-of-thought modeling.

  69. GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Model Architecture: Transformer normalization innovation (GeoNorm) unifying pre-/post-norm via geodesic updates with negligible overhead.

  70. Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Model Architecture: MoE innovation with segment-wise routing for time-series Transformers, aligning conditional sparsity with temporal locality.

  71. On the Expressiveness of State Space Models via Temporal Logics - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Strong match to Model Architecture theory: expressiveness analysis of State Space Models via temporal logic, including quantized vs unbounded precision and comparison to transformers.

  72. TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Model Architecture/Representation Learning: provides a unified high-order attention-interaction tensor that linearly represents full Transformer computations (attention, FFN, norms, residuals).

  73. Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Model Architecture (MoE): Mixture of Sparse Experts with shared/unique experts and unified gating for task-agnostic continual learning.

  74. Sycophancy Hides Linearly in the Attention Heads - Score: 16 (R=9, N=7) - Date: 2026-01-26 - Comment: Representation Learning: linear separability of sycophancy in attention heads and targeted linear steering within Transformer attention activations.

  75. Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis - Score: 16 (R=9, N=7) - Date: 2026-01-26 - Comment: Model Architecture: Mixture-of-Agents with inter-agent semantic attention and deep residual synthesis plus adaptive early stopping for collaborative LLM inference.

  76. Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models - Score: 16 (R=9, N=7) - Date: 2026-01-15 - Comment: Model Architecture (MoE): attribution-based analysis of knowledge acquisition dynamics in MoE vs. dense models.

  77. M$^2$FMoE: Multi-Resolution Multi-View Frequency Mixture-of-Experts for Extreme-Adaptive Time Series Forecasting - Score: 16 (R=9, N=7) - Date: 2026-01-14 - Comment: Model Architecture (MoE): multi-resolution, multi-view frequency Mixture-of-Experts with temporal gating for extreme-adaptive forecasting.

  78. Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts - Score: 16 (R=9, N=7) - Date: 2026-01-13 - Comment: Strong match to Model Architecture (Mixture-of-Experts-style prediction head) with expert routing and orthogonalization.

  79. Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Model Architecture + Efficiency: proposes a multiplication-free network replacing weights with channel-widths and sign-gated transmission to eliminate multiplications.

  80. Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Conditional/Dynamic Networks: large-scale LLM routing and adaptive aggregation framework (mixture-of-models) with task-aware switching.

  81. MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Model Architecture: hybrid MoE with token-level dynamic routing between Transformer and SSM (Mamba) experts plus utility-guided routing loss for efficiency/accuracy trade-offs.

  82. mHC: Manifold-Constrained Hyper-Connections - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Model Architecture: proposes manifold-constrained Hyper-Connections to restore identity mapping and improve stability/scalability of widened residual streams with efficiency-aware optimizations.

  83. Generalising E-prop to Deep Networks - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Extends E-prop to deep recurrent networks, enabling online credit assignment across time and depth; core training/architecture contribution.

  84. Identifiable Equivariant Networks are Layerwise Equivariant - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Matches Model Architecture/Theory: identifiability-based proof linking end-to-end equivariance to layerwise equivariance.

  85. TRACE: Trajectory Recovery for Continuous Mechanism Evolution in Causal Representation Learning - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Learning with MoE: identifiable continuous mechanism trajectories via MoE experts for causal representation learning.

  86. The Effect of Architecture During Continual Learning - Score: 16 (R=8, N=8) - Date: 2026-01-28 - Comment: Model Architecture/Representation Learning: joint optimization of architecture and weights to mitigate forgetting; bilevel formulation with low-rank knowledge transfer.

  87. Analytic Bijections for Smooth and Interpretable Normalizing Flows - Score: 16 (R=8, N=8) - Date: 2026-01-19 - Comment: Model Architecture: new analytic bijections and a radial flow architecture delivering smooth, interpretable and closed-form invertible transformations.

  88. On the origin of neural scaling laws: from random graphs to natural language - Score: 16 (R=8, N=8) - Date: 2026-01-16 - Comment: Scaling laws theory: investigates origins of neural scaling exponents via simplified transformers and random-graph sequences.

  89. Density Matrix RNN (DM-RNN): A Quantum Information Theoretic Framework for Modeling Musical Context and Polyphony - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Model Architecture: DM-RNN with density-matrix state and CPTP dynamics; rigorous parameterization and information-theoretic analysis of representations.

  90. Discontinuous Galerkin finite element operator network for solving non-smooth PDEs - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: DG–FEONet: hybrid DG-based neural operator trained via residual minimization—operator-learning architecture with data-free training and robustness to discontinuities.

  91. Physical Transformer - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Model Architecture: proposes a ‘physical transformer’ coupling attention/FFN with Hamiltonian dynamics and symplectic layers; Representation Learning: reasoning on a learned manifold with geometric invariants.

  92. Effective LoRA Adapter Routing using Task Representations - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture/Efficiency: task-representation-based routing and composition of LoRA adapters (adapter MoE-style selection) scaling with tasks, not adapters.

  93. Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Optimizer/training dynamics: explains Adam’s behavior via gradient scale invariance when β1=β2; guides optimizer design.

  94. KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model architecture/efficiency: Kronecker-product parameterization of manifold-constrained hyper-connections to guarantee double stochasticity with reduced parameters.

  95. Multi-Modal Time Series Prediction via Mixture of Modulated Experts - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture: Mixture-of-Experts with expert modulation (conditioning routing and computation) for multi-modal time series.

  96. MAR: Efficient Large Language Models via Module-aware Architecture Refinement - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture and Efficiency: integrates SSMs and activation sparsification with spiking-aware components to reduce LLM inference energy.

  97. Is Parameter Isolation Better for Prompt-Based Continual Learning? - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Architecture: prompt-sharing with gated routing and history-aware modulation (sparse activation) for continual learning—conditional/dynamic prompts.

  98. CCMamba: Selective State-Space Models for Higher-Order Graph Learning on Combinatorial Complexes - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Matches Model Architecture/Efficiency: replaces attention with selective state-space models for linear-time, long-range message passing on combinatorial complexes.

  99. Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Representation Learning: mechanistic analysis of multimodal in-context learning circuits (induction-style) and RoPE effects in transformers.

  100. TINNs: Time-Induced Neural Networks for Solving Time-Dependent PDEs - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Architecture: introduces a conditional/dynamic network by parameterizing weights as a learned function of time, addressing limitations of shared weights in PINNs.

  101. Revisiting Incremental Stochastic Majorization-Minimization Algorithms with Applications to Mixture of Experts - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Mixture-of-Experts: incremental stochastic MM algorithm with convergence guarantees for softmax-gated MoE training on streaming data.

  102. Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning and Model Architecture: component-level lesioning of MoE and dense Transformers to probe functional organization and interpretability of internal modules.

  103. Residual Tokens Enhance Masked Autoencoders for Speech Modeling - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Model Architecture & Representation Learning: masked autoencoder augmented with residual trainable tokens to capture unlabeled factors in speech.

  104. SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Model Architecture and Efficiency: transformer with Clustered Proximity Attention reducing attention complexity from O(n^2) to O(n) and edge-aware module for decision making.

  105. A Constrained Optimization Perspective of Unrolled Transformers - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Model Architecture/Training Dynamics: constrained optimization with layerwise descent constraints via primal–dual training for Transformers.

  106. NewPINNs: Physics-Informing Neural Networks Using Conventional Solvers for Partial Differential Equations - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Matches Model Architecture/Training Dynamics: solver-in-the-loop physics-informing (NewPINNs) replacing residual-based losses for stable training.

  107. Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Matches 'Model Architecture: conditional/dynamic networks' by introducing Hierarchical Orthogonal Residual Spread to stabilize and localize large-scale LLM edits.

  108. LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: Model Architecture/Training: aligns latent visual attention trajectories (visual thoughts) with curriculum sensory gating to enhance multimodal reasoning and grounding.

  109. From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Architecture: introduces Hawkes Attention—a time-modulated attention operator replacing Q/K/V projections with per-type kernels.

  110. ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Matches Model Architecture/Conditional Computation: controllable multi-budget reasoning via on-policy RL and distillation enabling distinct compute modes.

  111. Hyperbolic Heterogeneous Graph Transformer - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Model Architecture/Efficiency: hyperbolic heterogeneous graph Transformer with relation-specific hyperbolic attention that operates fully on the manifold and runs in linear time.

  112. Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Model Architecture: KAN with adaptive RBFs and learned smoothness, with universality proof and faster training/inference.

  113. CompNO: A Novel Foundation Model approach for solving Partial Differential Equations - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Model Architecture: compositional neural operators with reusable Foundation Blocks (parametric FNOs) and boundary-condition operator assembled via lightweight adapters for PDEs.

  114. Hellinger Multimodal Variational Autoencoders - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Model Architecture/Representation: introduces Hellinger pooling for multimodal VAEs, improving joint inference without sub-sampling.

  115. Circular Reasoning: Understanding Self-Reinforcing Loops in Large Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Training Dynamics/Representation: analyzes circular reasoning failure via attention dynamics and introduces a detection method (CUSUM).

  116. AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Model Architecture and Efficiency: unified autoregressive framework for joint discrete–continuous sequences that uses diffusion for the continuous values to handle precision.

  117. Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches Model Architecture and Efficiency: head-level diagnosis with conflict-aware sparse fine-tuning that selectively updates Transformer heads.

  118. Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Architecture/scaling analysis showing MoE reasoning performance aligns with active parameters—core insight into MoE inference compute scaling.

  119. Decentralized Autoregressive Generation - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Model Architecture: introduces a decentralized autoregressive training objective via linear combination of expert flows (conditional/dynamic networks).

  120. Neural Networks on Symmetric Spaces of Noncompact Type - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Model Architecture: designs FC layers and attention mechanisms on symmetric spaces (noncompact Riemannian manifolds).

  121. Three factor delay learning rules for spiking neural networks - Score: 15 (R=8, N=7) - Date: 2026-01-05 - Comment: Model Architecture/Training rules for SNNs: online three-factor learning of synaptic/axonal delays for temporal tasks, improving efficiency on neuromorphic hardware.

  122. Flow Matching Neural Processes - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Model Architecture: introduces flow-matching neural processes enabling amortized conditional generation via ODE solvers.

Model Compression and Efficiency (129)

  1. Discrete Feynman-Kac Correctors - Score: 20.0 (R=0, N=0) - Date: 2026-01-16 - Comment: Author match

  2. Explicit Multi-head Attention for Inter-head Interaction in Large Language Models - Score: 19 (R=10, N=9) - Date: 2026-01-28 - Comment: Model Architecture & Efficiency: explicit multi-head attention with head-level linear composition and normalization; enables KV-cache compression via low-rank virtual heads.

  3. Low-Rank Key Value Attention - Score: 19 (R=10, N=9) - Date: 2026-01-19 - Comment: Architecture/efficiency: low-rank KV attention reduces KV cache while preserving head diversity; improves pretraining compute efficiency.

  4. STEM: Scaling Transformers with Embedding Modules - Score: 19 (R=10, N=9) - Date: 2026-01-16 - Comment: Model architecture and efficiency: static token-indexed sparsity replacing FFN up-projection; decouples capacity from per-token compute and enables CPU offload.

  5. T3C: Test-Time Tensor Compression with Consistency Guarantees - Score: 19 (R=10, N=9) - Date: 2026-01-07 - Comment: Model Compression and Efficiency: train-once, test-time budget-conditioned low-rank plus mixed-precision with a controller and per-layer consistency certificates.

  6. Fast-weight Product Key Memory - Score: 19 (R=10, N=9) - Date: 2026-01-05 - Comment: Introduces a dynamic fast-weight Product Key Memory—sparse episodic memory updated at train/inference time—for sequence models (Model Architecture; Efficiency via sparse memory).

  7. Task-Driven Kernel Flows: Label Rank Compression and Laplacian Spectral Filtering - Score: 19 (R=10, N=9) - Date: 2026-01-05 - Comment: Representation Learning and Efficiency: theory showing supervised learning induces low-rank kernels (rank bounded by number of classes) via a kernel ODE and low-rank SGD noise.

  8. Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space - Score: 19 (R=10, N=9) - Date: 2026-01-01 - Comment: Strongly matches Model Architecture and Efficiency: introduces a dynamic hierarchical language model shifting compute to a compressed concept space, discovers variable-length units end-to-end, proposes a compression-aware scaling law and a decoupled μP parametrization.

  9. ECO: Quantized Training without Full-Precision Master Weights - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Compression/Efficiency: quantized training without full-precision master weights via error-compensating optimizer; theory and SMoE applicability.

  10. Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: KV-cache low-rank projection learned on the Stiefel manifold by minimizing decoder-layer output error with rank allocation profiles.

  11. HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: low-bit PTQ via Hessian conditioning with learnable rotations to reduce curvature sensitivity.

  12. ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: HPC/Systems + MoE: lossless compression and cache-affinity scheduling for on-device MoE serving with provable performance, shifting inference from I/O-bound to compute-centric.

  13. HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs - Score: 18 (R=10, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: introduces a Hessian-guided, differentiable QAT with temperature annealing for ultra-low-bit LLMs, improving optimization over STE-based methods.

  14. LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation - Score: 18 (R=10, N=8) - Date: 2026-01-28 - Comment: Compression/Efficiency: fine-tuning-free post-training quantization with low-rank decomposition and permuted block-wise rotations (2–3 bit regime).

  15. StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths - Score: 18 (R=10, N=8) - Date: 2026-01-28 - Comment: Strong match to Model Compression/Efficiency: a theoretically grounded surrogate for ultra-low-bit Quantization-Aware Training that generalizes STE and stabilizes training.

  16. Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective - Score: 18 (R=10, N=8) - Date: 2026-01-28 - Comment: High Performance Computing & Efficiency: unified model for KV-cache eviction and query routing with randomized eviction and learning-based routing; theoretical guarantees and large speedups.

  17. Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches Compression/Efficiency: unifies sparsity and low-rank fine-tuning with provable MSE bounds, fused GEMM, and bitmap encoding for true speedups.

  18. Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches Cache/Efficiency: KV cache compression for CoT with answer-first principle, attention-based LRFU eviction, and adaptive budget allocation.

  19. E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory - Score: 18 (R=10, N=8) - Date: 2026-01-26 - Comment: High-Performance Computing and Efficiency: algebraic sparsity (EAAS) and a fused on-the-fly equivariant attention kernel achieving large TFLOPS gains with linear activation memory.

  20. Global Context Compression with Interleaved Vision-Text Transformation - Score: 18 (R=10, N=8) - Date: 2026-01-16 - Comment: Compression/Efficiency and Model Architecture: global context compression in Transformers via interleaved vision–text tokens, reducing memory/FLOPs and token count.

  21. Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-16 - Comment: Model Architecture/Efficiency: Bounded Hyperbolic Tangent as a normalization-free alternative to Pre-LN with theoretical stability and faster training/inference for LLMs.

  22. Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: hardware-aligned 1.25-bit ternary quantization via 3:4 fine-grained sparsity and an annealing residual synapse mechanism (Arenas) to avoid representational collapse.

  23. KVzap: Fast, Adaptive, and Faithful KV Cache Pruning - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: fast, adaptive KV cache pruning for both prefilling and decoding; cache/pruning focus.

  24. Hierarchical Sparse Plus Low Rank Compression of LLM - Score: 18 (R=10, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: hierarchical sparse-plus-low-rank (HSS) factorization with sparsity for LLM layers.

  25. ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Model Compression and Efficiency: unified NVFP4 4-bit PTQ via Augmented Residual Channels that preserves block isolation and hardware-uniform GEMM, with theoretical error bounds.

  26. MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs - Score: 18 (R=10, N=8) - Date: 2026-01-13 - Comment: Directly targets MoE training memory/throughput with co-designed kernels and activation checkpointing, squarely matching HPC and Compression/Efficiency criteria for MoE.

  27. FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching - Score: 18 (R=10, N=8) - Date: 2026-01-12 - Comment: Matches Model Compression and Efficiency: flexible low-rank quantization with sketching and clipping-optimized approximation for LLMs.

  28. ADEPT: Adaptive Dynamic Early-Exit Process for Transformers - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Model Efficiency: adaptive token-level early exit in both prefill and generation with KV-cache decoupling for transformers.

  29. GRIT -- Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation - Score: 18 (R=10, N=8) - Date: 2026-01-05 - Comment: Model Compression and Efficiency: low-rank PEFT with K-FAC preconditioning, Fisher-guided reprojection, and dynamic rank adaptation.

  30. More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Strongly matches Compression/Efficiency: proposes Multi-envelope Double Binary Factorization for extreme low-bit quantization with shared sign bases, rank-l envelope, closed-form init, and alternating refinement; preserves deployment-friendly binary inference primitives.

  31. PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: LLM-aware lossy compression of the KV cache with co-designed algorithms/systems; strong fit to Compression/Efficiency (cache) for Transformer inference.

  32. Efficient Context Scaling with LongCat ZigZag Attention - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Model Architecture and Efficiency: introduces sparse ZigZag attention (LoZA) for efficient long-context scaling (up to 1M tokens) with speedups in prefill/decode.

  33. Trellis: Learning to Compress Key-Value Memory in Attention Models - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Model Compression and Efficiency: learns to compress the Transformer KV cache into a fixed-size dynamic memory via a recurrent two-pass update with online gradient descent.

  34. Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining - Score: 18 (R=9, N=9) - Date: 2026-01-28 - Comment: Deep Learning Theory: provable hierarchical learning in deep conv nets on Random Hierarchy Models via layerwise training (shallow-to-deep chaining).

  35. Diffusion Language Models are Provably Optimal Parallel Samplers - Score: 18 (R=9, N=9) - Date: 2026-01-01 - Comment: Model Architecture/Efficiency: proves diffusion language models with CoT and revision/remasking are optimal parallel samplers in sequential steps and space, giving a theoretical foundation for efficient inference.

  36. Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models - Score: 17 (R=10, N=7) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: proposes a sliced-Wasserstein distribution alignment loss for ultra-low-bit post-training quantization of LLMs, improving calibration of activation/output distributions.

  37. Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction - Score: 17 (R=10, N=7) - Date: 2026-01-06 - Comment: Compression/Efficiency: low-bit quantization for SO(3)-equivariant GNNs with magnitude-direction decoupling and branch-separated QAT.

  38. LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: adaptive look-ahead mixed-precision inference selecting small subsets for high precision to control rounding error in Transformers.

  39. Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Efficiency/Cache: repurposes KV cache as lightweight representation for chain-of-embedding and fast/slow reasoning switching, reducing tokens at inference.

  40. Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Compression/Efficiency & Robustness: sharpness-aware training over pruning masks (structure perturbations) to co-optimize compactness and robustness.

  41. TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Model Compression and Efficiency/HPC: instance-aware token seeking/ditching to cut activation memory during fine-tuning with large savings.

  42. Self-Supervised Weight Templates for Scalable Vision Model Initialization - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Model Compression/Efficiency & Architecture: Tucker-factorized shared weight template with size-specific scalers enables scalable initialization across depths/widths; includes width-wise stochastic scaling.

  43. EPAS: Efficient Training with Progressive Activation Sharing - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Efficiency/HPC: progressive activation (QK/KV) sharing across Transformer layers to boost training and inference throughput with controllable sharing at inference.

  44. FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Quantization and KV-cache efficiency: FP8 W8A8 rollout, FP8 KV-cache with per-step recalibration, and mismatch correction for LLM RL.

  45. S$^3$-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Model Compression/Efficiency: replaces full KV cache with attention-aligned endogenous retrieval via sparse autoencoders and a CPU inverted index to bound GPU memory during long-context inference.

  46. Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Model Compression and Efficiency: proposes gating-based KV cache eviction with forward-only gate training for memory/compute-efficient LLM inference.

  47. AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Efficiency: activation-guided low-rank subspace ZO optimization enabling memory-efficient LLM fine-tuning with theoretical guarantees.

  48. A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches High-Performance Training Dynamics: scalable critical sharpness measure (few forward passes) capturing curvature phenomena in LLM training up to 7B.

  49. Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Model Compression and Efficiency: theoretical bounds on minimal weight perturbations and provable low-rank compression thresholds; insights into layer-wise sensitivity.

  50. Mugi: Value Level Parallelism For Efficient LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Compression/Efficiency: value-level parallelism generalized to nonlinear ops, weight/KV-cache quantization, and a new VLP architecture (Mugi) for full LLM workloads.

  51. $D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: proposes dual Taylor expansion pruning with attention distribution awareness for precise LLM sparsification.

  52. Beyond Variance: Knowledge-Aware LLM Compression via Fisher-Aligned Subspace Diagnostics - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Model Compression and Representation Learning: Fisher-aligned subspace selection for activation compression using the Fisher Information Matrix and a new dependence metric for knowledge-critical directions.

  53. Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Efficiency/HPC: introduces provably lossless hierarchical speculative decoding that increases accepted tokens without fidelity loss.

  54. mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations - Score: 17 (R=9, N=8) - Date: 2026-01-12 - Comment: Model Architecture/Efficiency: reparameterizes hyper-connections to exactly enforce doubly stochastic mixing (via Birkhoff–von Neumann), eliminating Sinkhorn iterations and improving stability/speed.

  55. RelayLLM: Efficient Reasoning via Collaborative Decoding - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Model Compression/Efficiency: token-level collaborative decoding with dynamic routing to an LLM to cut compute cost.

  56. Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Strong match to Model Compression and Efficiency (memory-efficient LLM fine-tuning via prior-informed ZO gradient estimation with theory).

  57. TAP-ViTs: Task-Adaptive Pruning for On-Device Deployment of Vision Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Model Compression/Efficiency: task-adaptive pruning for ViTs using per-device GMM-derived proxy datasets and dual-granularity importance evaluation; privacy-preserving on-device deployment.

  58. FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Efficiency: TTC-aware training and early stopping to trade training FLOPs for test-time compute with a theoretical break-even bound.

  59. RMAAT: Astrocyte-Inspired Memory Compression and Replay for Efficient Long-Context Transformers - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Model Architecture and Efficiency: recurrent memory tokens with adaptive compression and memory-efficient backprop (AMRB) for long-context Transformers.

  60. Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Matches Model Compression/Sparsity: adaptive pruning discovers routed, specialized subnetworks ('adaptive tickets') for heterogeneous data.

  61. Soft Quantization: Model Compression Via Weight Coupling - Score: 16 (R=9, N=7) - Date: 2026-01-30 - Comment: Compression/quantization: training-time weight coupling induces mixed-precision discretization; a novel route to quantization beyond standard PTQ.

  62. Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding - Score: 16 (R=9, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: residual-experts vector quantization (dynamic expert routing, variable bitrate) for neural audio coding—sparse quantization with MoE-like routing.

  63. GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Model Compression and Efficiency: gradient-guided layer pruning and merging for LLMs enabling efficient fine-tuning and inference.

  64. Is Finer Better? The Limits of Microscaling Formats in Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Strong match to Model Compression/Efficiency: analyzes limits of microscaling quantization and proposes a hardware-friendly FP8 UE5M3 scale format for FP4 data types.

  65. How Is Uncertainty Propagated in Knowledge Distillation? - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Model Compression and Efficiency: variance-aware knowledge distillation (multi-response averaging and inverse-variance weighting) with formal analysis of uncertainty propagation.

  66. From LLMs to LRMs: Rethinking Pruning for Reasoning-Centric Models - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Matches Model Compression and Efficiency: controlled study of depth/width/static/dynamic pruning strategies for reasoning-centric LLMs.

  67. Low-Rank Tensor Approximation of Weights in Large Language Models via Cosine Lanczos Bidiagonalization - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Compression/Efficiency: low-rank tensor approximation of LLM weight tensors via cosine Lanczos bidiagonalization in a transform domain (cproduct).

  68. MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-19 - Comment: KV-cache efficiency via adapting MLA to VLMs with modality-decoupled low-rank KV compression and RoPE modification; parameter-efficient adaptation.

  69. FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization - Score: 16 (R=9, N=7) - Date: 2026-01-19 - Comment: Matches 'Model Compression and Efficiency: Quantization' by regenerating family-aware calibration data to improve PTQ accuracy in LLMs.

  70. Single-Stage Huffman Encoder for ML Compression - Score: 16 (R=9, N=7) - Date: 2026-01-16 - Comment: Matches Compression/Efficiency and HPC communication: proposes a single-stage Huffman encoder with fixed codebooks for on-the-fly tensor compression during distributed LLM training, removing codebook-gen/transmission overhead.

  71. Enhancing LUT-based Deep Neural Networks Inference through Architecture and Connectivity Optimization - Score: 16 (R=9, N=7) - Date: 2026-01-16 - Comment: Compression/Efficiency: LUT-based DNN architectural aggregation plus non-greedy sparse connectivity pruning/regrowth for FPGA-efficient inference.

  72. GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR - Score: 16 (R=9, N=7) - Date: 2026-01-15 - Comment: Model Compression/Efficiency: geometry-aware low-rank adapters (LoRA) initialized by SVD to stabilize RLVR updates while using dense operators.

  73. Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference - Score: 16 (R=9, N=7) - Date: 2026-01-13 - Comment: Compression/Efficiency: training-free adaptive layer selection for layer-wise token pruning to reduce KV cache while preserving accuracy.

  74. SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis - Score: 16 (R=9, N=7) - Date: 2026-01-13 - Comment: Matches Compression/Efficiency (low-rank parameter editing) and Representation Learning (capability as low-rank subspaces) for selective ablation.

  75. Controllable LLM Reasoning via Sparse Autoencoder-Based Steering - Score: 16 (R=9, N=7) - Date: 2026-01-09 - Comment: Strongly matches Representation Learning and Sparsity (Sparse Autoencoders to disentangle and steer reasoning strategies).

  76. Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Compression/Efficiency: analyzes sparse-attention decoding overheads (Less is Less) and proposes early-stopping to reduce token consumption in long-decode.

  77. RPIQ: Residual-Projected Multi-Collaboration Closed-Loop and Single Instance Quantization for Visually Impaired Assistance - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Model Compression and Efficiency: proposes a new quantization framework with multi-collaborative closed-loop compensation and Gauss–Seidel iterative quantization addressing inter-block error accumulation (4-bit).

  78. CRoPE: Efficient Parametrization of Rotary Positional Embedding - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Transformer architecture and compression/efficiency: efficient parametrization of Rotary Positional Embedding reducing attention block parameters with negligible performance loss.

  79. Bayesian Subspace Gradient Estimation for Zeroth-Order Optimization of Large Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-07 - Comment: Compression/Efficiency & HPC: Bayesian zeroth-order optimizer that reduces memory and improves convergence for LLM fine-tuning.

  80. Heterogeneous Low-Bandwidth Pre-Training of LLMs - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: HPC + Efficiency: heterogeneous distributed pre-training combining SparseLoCo with activation/activation-gradient compression and subspace pipeline parallelism.

  81. SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: Compression/Theory: SGD-based KD analysis with Bayesian teachers; shows variance reduction and guidance on BCP noise.

  82. QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models - Score: 16 (R=9, N=7) - Date: 2026-01-05 - Comment: Model Compression and Efficiency: automated quantization with tiered (global/block/module) search optimizing a performance–memory trade-off for spike-driven LMs.

  83. OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Model Compression and Efficiency: introduces data-free, fusible rotations (OptRot) to mitigate weight/activation outliers for post-training quantization, improving W4A8 and weight-only PTQ.

  84. MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: Model Architecture/Efficiency: introduces a multi-scale state-space model with input-dependent scale-mixing to capture long-range, hierarchical dependencies efficiently.

  85. Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: token-budgeted LLM–SLM collaboration via hint prefixes and learned hint-length routing for cost-efficient inference.

  86. Procedural Pretraining: Warming Up Language Models with Abstract Data - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Training Dynamics/Efficiency: procedural pretraining on abstract data to induce algorithmic structure and accelerate LLM pretraining with less data.

  87. LoRA and Privacy: When Random Projections Help (and When They Don't) - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Low-Rank/Compression + Privacy theory: DP analysis of Wishart/projection mechanisms; shows LoRA randomness is not inherently private and when low-rank helps with DP.

  88. Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: training-free distribution sharpening via scaled low-temperature token sampling to match RL post-training gains without MCMC.

  89. Auto-Regressive Masked Diffusion Models - Score: 16 (R=8, N=8) - Date: 2026-01-26 - Comment: Matches Model Architecture (strictly causal, permutation-equivariant masked diffusion) and Efficiency (parallel autoregressive-style decoding/strided generation).

  90. Training-Trajectory-Aware Token Selection - Score: 16 (R=8, N=8) - Date: 2026-01-16 - Comment: Matches Compression/Efficiency and training dynamics: token-level objective (T3S) for distillation that mitigates trajectory bottlenecks in strong students, improving reasoning efficiency.

  91. Greedy Is Enough: Sparse Action Discovery in Agentic LLMs - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Compression/Efficiency/Theory: frames sparse action discovery as block-sparse recovery and proves a greedy OMP-style algorithm recovers the relevant action set with sample guarantees.

  92. Sparsity Is Necessary: Polynomial-Time Stability for Agentic LLMs in Large Action Spaces - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Model Compression/Efficiency: theory for block-sparse policies with ℓ1,2 regularization yielding sample complexity and support recovery in large action spaces.

  93. Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking - Score: 16 (R=8, N=8) - Date: 2026-01-13 - Comment: Compression/Efficiency via sparsity/pruning: concept-aware neuron masking for multi-concept unlearning in diffusion models (training-free).

  94. Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Model Efficiency: training-time Dynamic Outlier Truncation to suppress redundant reasoning tokens and improve cost–accuracy trade-off.

  95. SpikySpace: A Spiking State Space Model for Energy-Efficient Time Series Forecasting - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Model Architecture and Efficiency: introduces a spiking state space model with event-driven selective scanning and neuromorphic-friendly activations for energy-efficient sequence modeling.

  96. Making Foundation Models Probabilistic via Singular Value Ensembles - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Matches Compression/Efficiency: parameter-efficient implicit ensembles by freezing singular vectors and learning per-member singular values.

  97. Grounding and Enhancing Informativeness and Utility in Dataset Distillation - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: principled dataset distillation balancing informativeness and utility with theoretical underpinnings.

  98. Flow Perturbation++: Multi-Step Unbiased Jacobian Estimation for High-Dimensional Boltzmann Sampling - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Efficiency for CNFs: unbiased multi-step Jacobian estimation (Flow Perturbation++) reduces variance for high-dimensional Boltzmann sampling.

  99. MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Model Compression and Efficiency: training-free caching for flow-matching inference via average-velocity JVP reuse and stability-aware scheduling to reduce compute without retraining.

  100. Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: second-order (quasi-Newton) optimizer for learned image compression improves optimization efficiency and reduces activation/latent outliers, aiding post-training quantization.

  101. Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed Noise - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Efficiency/Training Dynamics: randomized subspace normalized SGD with high-probability guarantees under heavy-tailed noise; reduced per-iteration cost and better oracle complexity than full-dimensional NSGD.

  102. TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: test-time adaptive ensemble drafting for speculative decoding to speed LVLM inference.

  103. Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Compression and Efficiency: introduces windowed token pruning and KV caching to accelerate diffusion LM inference.

  104. PiC-BNN: A 128-kbit 65 nm Processing-in-CAM-Based End-to-End Binary Neural Network Accelerator - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Compression/Efficiency: end-to-end binary neural network accelerator using processing-in-CAM, eliminating full-precision ops.

  105. A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning/Efficient adaptation: isolates behavior-specific neurons via sparse autoencoders and updates only a small neuron subset (sparse, neuron-level fine-tuning).

  106. The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Efficiency: memory-bounded test-time search with chunk-wise KV cache resets and geometric regularization to improve long-context reasoning coverage.

  107. Gradient Regularized Natural Gradients - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Optimization/Efficiency: Gradient-Regularized Natural Gradients with structured FIM approximations and a Kalman-based variant; convergence guarantees.

  108. Sparse RBF Networks for PDEs and nonlocal equations: function space theory, operator calculus, and training algorithms - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Model Architecture and Sparsity: Sparse RBF networks with function-space theory (Besov characterization) and efficient operator calculus/training for PDEs.

  109. Analyzing Neural Network Information Flow Using Differential Geometry - Score: 15 (R=8, N=7) - Date: 2026-01-26 - Comment: Model Compression/Efficiency and Representation Learning: curvature-based (Ollivier–Ricci) analysis of information flow to rank/prune edges in neural networks.

  110. Differentially Private Subspace Fine-Tuning for Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Model Compression and Efficiency: subspace (low-rank) DP fine-tuning injects noise only along principal gradient directions, preserving DP while reducing perturbation.

  111. Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: Matches Efficiency/HPC inference: introduces stable sequential test-time scaling (Min-Seek) with a custom KV-cache scheme enabling beyond-context reasoning at near-linear complexity.

  112. Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Parameter-efficient training: orthogonal gradient projection tailored to LoRA subspace to mitigate task interference (Model Architecture/Training).

  113. Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Compression/Efficiency: theoretically grounded relaxed speculative decoding with annealed resampling for faster AR generation.

  114. Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Efficiency: hidden-state-based step scoring and KV-cache-aware pruning for test-time scaling.

  115. Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Compression/Efficiency: updates only a sparse subset of attention heads (head-level gradient masking) based on mechanistic relevance, reducing parameters and forgetting.

  116. Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Model Compression and Efficiency: post-training quantization repurposed for safety realignment, decoupled from fine-tuning.

  117. Artificial Entanglement in the Fine-Tuning of Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Quantum-information-inspired analysis of low-rank PEFT (LoRA) via “artificial entanglement” directly matches Compression/Efficiency (low-rank) and Representation Learning/training dynamics.

  118. Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Model Compression/Efficiency: subtask-focused knowledge distillation that transfers only relevant subspaces/layer components from teacher to student.

  119. Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Model Architecture and Sparsity: task-specific binary masks (sparse sub-networks) with gradient alignment/projection for continual learning.

  120. DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Model Architecture and Efficiency: dual-path, delay-aware Mamba backbone with linear-time modules for sequence modeling.

  121. Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Efficiency and Low-rank/Sparsity: sparse+low-rank precision decomposition with ADMM to constrain and accelerate differentiable causal discovery.

  122. Not All Steps are Informative: On the Linearity of LLMs' RLVR Training - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches High-Performance/Training Efficiency (algorithmic extrapolation of weights/logits to reduce RLVR computation) and training dynamics analysis.

  123. FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches Compression/Efficiency: instruction-conditioned visual token selection with positional continuity (PosPad) for efficient VLM grounding.

  124. Compressed code: the hidden effects of quantization and distillation on programming tokens - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Representation Learning and Compression: analyzes how quantization and distillation alter token-level representations for code and impact generation quality.

  125. Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Compression/Efficiency/Training Dynamics: shows safety gradients are low-rank and introduces one-shot alignment correction leveraging this structure.

  126. Sparse Bayesian Message Passing under Structural Uncertainty - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Sparsity/Bayesian Architecture: posterior over signed adjacency and sparse signed message passing for robust GNNs under heterophily.

  127. Gradient-Free Approaches is a Key to an Efficient Interaction with Markovian Stochasticity - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Optimization/training algorithms: derivative-free method for Markovian noise with mixing-time–independent rates (algorithmic efficiency).

  128. MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Model Architecture/Efficiency: integrates Mamba SSM with low-rank Neural ODEs and segmented selective scanning for long-range time series with reduced complexity.

  129. Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Model Compression and Efficiency / HPC: FPGA-accelerated framework with sparsity-driven dropout and streaming parallelism for efficient model recovery at the edge.

High Performance Computing (42)

  1. Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: MoE + Systems: proposes Least-Loaded Expert Parallelism to dynamically rebalance imbalanced MoE routing across devices for latency/memory efficiency.

  2. Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple - Score: 18 (R=10, N=8) - Date: 2026-01-27 - Comment: Matches High Performance Computing: communication-avoiding GEMM via generalized space-filling curves with platform/shape-oblivious partitioning minimizing data movement.

  3. A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems - Score: 18 (R=10, N=8) - Date: 2026-01-09 - Comment: Systems-level framework for efficient MoE inference on GPU–NDP with tensor parallelism, load balancing, and dataset-free prefetching—HPC/efficiency for MoE.

  4. The Hessian of tall-skinny networks is easy to invert - Score: 18 (R=9, N=9) - Date: 2026-01-13 - Comment: HPC/Optimization: exact Hessian-inverse-vector products for deep nets with linear-in-layers time/memory, enabling scalable second-order methods.

  5. Nested Learning: The Illusion of Deep Learning Architectures - Score: 18 (R=9, N=9) - Date: 2026-01-01 - Comment: Proposes a new learning paradigm (Nested Learning), expressive optimizers, self-modifying sequence model, and a continuum memory system; foundational architecture/training perspective.

  6. PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Systems/efficiency: algorithmic framework (adaptive polynomial fitting + randomized sketching) to accelerate matrix functions used in optimizers (Shampoo/Muon), enabling faster large-model training.

  7. DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: HPC/systems: deterministic attention scheduling (backward pass DAG scheduling) to regain throughput for reproducible LLM training.

  8. High-dimensional learning dynamics of multi-pass Stochastic Gradient Descent in multi-index models - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Training dynamics: asymptotically exact mean-field characterization of multi-pass mini-batch SGD vs SME vs gradient flow in high dimensions.

  9. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: High Performance Computing: SLO-aware rotary scheduling (RotaSched) and DuplexKV memory co-design on Superchips for responsive LLM serving.

  10. Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: High Performance Computing/Systems: NPU architectural primitives and memory hierarchy tailored to diffusion LLM sampling (non-GEMM operations), delivering significant inference speedups.

  11. Revisiting Parameter Server in LLM Post-Training - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Systems-level innovation for distributed LLM training: replaces collective ops with point-to-point in FSDP (On-Demand Communication) to handle workload imbalance—fits the HPC/distributed training criterion.

  12. Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: High Performance Computing/Systems: unified layout abstraction and compiler DSL for distribution, tiling, and sharding across device meshes and memory hierarchies.

  13. ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: HPC/Systems for LLM serving: fine-grained, adaptive KV cache placement with ILP and runtime feedback to meet SLOs.

  14. Parallelizable memory recurrent units - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: Model Architecture and Efficiency: new recurrent units (MRU/BMRU) with parallel scan compatibility and persistent memory.

  15. HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: High Performance Computing/Efficiency: holistic-aware parallel speculative decoding with semantic token preservation for video-LLMs.

  16. Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64 - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: High-Performance Computing/Efficiency: systems-level memory layout and SIMD kernel design (virtual tensor core) to overcome memory wall for LLM inference on ARM64.

  17. Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving - Score: 17 (R=9, N=8) - Date: 2026-01-05 - Comment: Systems-level method for LLM serving emulation via CUDA virtualization and distributed time-warp coordination (High Performance Computing).

  18. Accelerating Decentralized Optimization via Overlapping Local Steps - Score: 16 (R=9, N=7) - Date: 2026-01-06 - Comment: HPC/Distributed Training: overlaps computation and communication in decentralized SGD (OLDSGD) with convergence guarantees to reduce wall-clock time.

  19. Reliable and Resilient Collective Communication Library for LLM Training and Serving - Score: 16 (R=9, N=7) - Date: 2026-01-01 - Comment: High Performance Computing: resilient collective communication for distributed LLM training/serving with connection migration and bandwidth-aware load redistribution.

  20. LLM-42: Enabling Determinism in LLM Inference with Verified Speculation - Score: 16 (R=8, N=8) - Date: 2026-01-27 - Comment: High Performance Computing: scheduling-based deterministic inference via verify-rollback that preserves dynamic batching without changing kernels.

  21. PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning - Score: 16 (R=8, N=8) - Date: 2026-01-12 - Comment: Model Architecture + HPC/Test-time compute: introduces a conditional/message-passing architecture to massively parallelize reasoning and scale test-time compute beyond context limits.

  22. Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Matches High-Performance/Distributed Training: improved algorithms and lower bounds for compressed communication in distributed online convex optimization.

  23. RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: High Performance Computing/System efficiency: KV cache residency across pipeline stages, affinity-aware routing, and memory-aware caching to extend sequence length under strict latency SLOs.

  24. Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Model Architecture: introduces group-equivariant world models via one-parameter Lie group flows (equivariance for memory and dynamics).

  25. Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding - Score: 16 (R=8, N=8) - Date: 2026-01-01 - Comment: Co-designed speculative decoding with compiler-friendly execution and latency-aware drafting; systems-level inference optimization (HPC/efficiency) for LLMs.

  26. FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Optimization for large-scale training: momentum-orthogonalized updates structured by Fisher geometry (trust-region with K-FAC metric), balancing isotropy and adaptivity.

  27. Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: HPC/systems for LLM serving: analytical sizing of Attention/FFN ratios in disaggregated architecture to maximize throughput and minimize idle time.

  28. Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: HPC/Distributed training: collaborative compressors for communication-efficient distributed mean estimation with error analyses beyond ℓ2.

  29. Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: High Performance Computing: joint optimization of kernel scheduling and frequency scaling to reduce training energy/time—systems-level training efficiency.

  30. HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: High-Performance/Distributed Training: hybrid-order split learning that reduces client memory (no backprop activations) with convergence analysis.

  31. Distributed Perceptron under Bounded Staleness, Partial Participation, and Noisy Communication - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: High Performance Computing/Distributed Training: semi-asynchronous perceptron with staleness-bucket aggregation under delays, partial participation, and noisy communication, with mistake bounds.

  32. Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: High Performance Computing/Training: distribution-aligned sequence distillation to better match teacher output distributions and mitigate exposure bias.

  33. NOVAK: Unified adaptive optimizer for deep neural networks - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: HPC/Systems: unified adaptive optimizer with custom CUDA kernels and rectified adaptive rates; systems-level speedups for large-scale training.

  34. Tight Analysis of Decentralized SGD: A Markov Chain Perspective - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: High Performance Computing/Distributed Training: Markov chain analysis of decentralized SGD with non-asymptotic bounds and linear speedup characterization under network topology.

  35. AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: High Performance Computing: unified, framework-agnostic performance modeling and configuration search for LLM serving (covers tensor/pipeline/expert parallelism, KV-cache, and scheduling) enabling algorithmic systems-level efficiency gains.

  36. Latent Space Communication via K-V Cache Alignment - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Architecture/Systems: aligns K-V caches via shared latent space with adapters for high-bandwidth inter-model communication and skill transfer.

  37. DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: High Performance Computing/System Efficiency: GPU-first tokenizer with LUT-based streaming and overlapped H2D/compute removes tokenization bottlenecks for foundation models.

  38. MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Multi-queue pipelined GNN training with asynchronous updates, caching, and adaptive queue sizing—systems/HPC innovation for scalable training.

  39. Accelerating Storage-Based Training for Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: High Performance Computing: systems-level storage I/O optimization (block-wise I/O and hyperbatching) to accelerate large-scale GNN training on NVMe.

  40. Energy-Aware Routing to Large Reasoning Models - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Efficiency/Systems: variance-aware, energy-aware routing among large reasoning models using compute scaling laws.

  41. Toward Large-Scale Photonics-Empowered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration - Score: 15 (R=8, N=7) - Date: 2026-01-05 - Comment: Cross-layer systems/toolchain for photonic AI with dynamic tensor ops for Transformers and implementation-aware co-design (High Performance Computing).

  42. Tensor Computing Interface: An Application-Oriented, Lightweight Interface for Portable High-Performance Tensor Network Applications - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: High Performance Computing: portable, lightweight tensor-network API enabling high performance across heterogeneous backends.

Representation Learning (114)

  1. Value-guided action planning with JEPA world models - Score: 20.0 (R=0, N=0) - Date: 2026-01-07 - Comment: Author match

  2. What Drives Success in Physical Planning with Joint-Embedding Predictive World Models? - Score: 20.0 (R=0, N=0) - Date: 2026-01-01 - Comment: Author match

  3. Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models - Score: 18 (R=10, N=8) - Date: 2026-01-14 - Comment: Strongly matches Representation Learning: identifies and steers sparse latent features (via SAEs) causally tied to reasoning, enabling activation-level control.

  4. Attribution-Guided Distillation of Matryoshka Sparse Autoencoders - Score: 18 (R=10, N=8) - Date: 2026-01-01 - Comment: Representation Learning and Sparsity: distillation of a compact core of features in sparse autoencoders, improving transfer across sparsity levels.

  5. Minimax Rates for Hyperbolic Hierarchical Learning - Score: 18 (R=9, N=9) - Date: 2026-01-29 - Comment: Representation Learning Theory: proves minimax-optimal sample complexity for hyperbolic representations on hierarchies and exponential separation vs Euclidean embeddings.

  6. Implicit bias as a Gauge correction: Theory and Inverse Design - Score: 18 (R=9, N=9) - Date: 2026-01-13 - Comment: Representation Learning/Training Dynamics: geometric gauge-correction mechanism explaining implicit bias from symmetry–stochasticity interaction, with inverse-design of desired biases (e.g., sparsity).

  7. When Models Manipulate Manifolds: The Geometry of a Counting Task - Score: 18 (R=9, N=9) - Date: 2026-01-09 - Comment: Representation Learning/Training Dynamics: mechanistic interpretability revealing low-dimensional counting manifolds and attention geometry in transformers.

  8. From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence - Score: 18 (R=9, N=9) - Date: 2026-01-07 - Comment: Representation learning theory: introduces a new information measure (epiplexity) for computationally bounded observers, guiding data selection and learning.

  9. Deep Networks Learn Deep Hierarchical Models - Score: 18 (R=9, N=9) - Date: 2026-01-05 - Comment: Representation Learning/Theory: proves layerwise SGD on ResNets efficiently learns deep hierarchical label models (polynomial depth), advancing learnability theory.

  10. Value-Based Pre-Training with Downstream Feedback - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Representation/Training Dynamics: value-based continued pretraining steers SSL using downstream-gradient alignment to maximize gradient value per step.

  11. Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Representation Learning/Training Dynamics: influence-function-based mechanistic data attribution linking training samples to interpretable circuits and ICL heads.

  12. Can Local Learning Match Self-Supervised Backpropagation? - Score: 17 (R=9, N=8) - Date: 2026-01-30 - Comment: Representation learning/training dynamics: theoretical equivalence conditions between local SSL and global BP-SSL and practical local-SSL variants matching global SSL.

  13. Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: Representation Learning: principled concept extraction via unsupervised linear unmixing of LLM activations (Concept Component Analysis) with sparsity priors, offering a theory-backed alternative to SAEs.

  14. Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: Representation Learning: unified spectral framework explaining self-supervised objectives via spectral sufficiency, offering principled foundations and algorithmic guidance.

  15. Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds - Score: 17 (R=9, N=8) - Date: 2026-01-29 - Comment: Representation Learning: geometric/spectral analysis of Transformer hidden manifolds revealing phase transitions, effective dimensionality collapse, and renormalization-like flows.

  16. The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Representation Learning Theory: measure-theoretic analysis of contrastive learning geometry beyond alignment–uniformity, including multimodal divergence effects.

  17. How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability - Score: 17 (R=9, N=8) - Date: 2026-01-28 - Comment: Representation Learning/Mechanistic Interpretability: closed-form early-training weight characterizations in Transformers via gradient leading terms.

  18. Neural Network Approximation: A View from Polytope Decomposition - Score: 17 (R=9, N=8) - Date: 2026-01-27 - Comment: Matches Representation Learning Theory: universal approximation via polytope decomposition with explicit ReLU constructions and improved rates.

  19. Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability - Score: 17 (R=9, N=8) - Date: 2026-01-26 - Comment: Representation Learning/Training Dynamics: introduces a process-tensor view of SGD with a measurable non-Markovian memory witness via back-flow of distinguishability.

  20. Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Representation Learning/Training Dynamics: spectral analysis ties collapse to dominant singular directions; REVIVE preserves singular subspace during editing.

  21. Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Representation Learning: uses Sparse Autoencoders to identify causal, task-specific features ("translation initiation") inside LLMs and validates via interventions.

  22. Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent - Score: 17 (R=9, N=8) - Date: 2026-01-19 - Comment: Matches 'Representation Learning: training dynamics in neural networks' by theoretically linking SGD noise, effective potentials, and transient freezing to preference for flat minima.

  23. An analytic theory of convolutional neural network inverse problems solvers - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Matches Representation Learning/Theory: provides an analytic LE-MMSE framework capturing CNN inductive biases (equivariance, locality) for inverse problems with strong empirical alignment.

  24. In-Context Operator Learning on the Space of Probability Measures - Score: 17 (R=9, N=8) - Date: 2026-01-16 - Comment: Matches Representation Learning/Theory: proposes in-context operator learning on probability measures with scaling-law theory and explicit architectures for OT maps.

  25. Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge - Score: 17 (R=9, N=8) - Date: 2026-01-15 - Comment: Model Architecture: token-wise branch-and-merge (Multiplex Thinking) aggregates K sampled token embeddings into a single multiplex token for soft reasoning.

  26. Towards A Unified PAC-Bayesian Framework for Norm-based Generalization Bounds - Score: 17 (R=9, N=8) - Date: 2026-01-14 - Comment: Representation Learning/Theory: unified PAC-Bayesian norm-based generalization bounds using anisotropic posteriors and an architecture-aware sensitivity matrix.

  27. Transformer Is Inherently a Causal Learner - Score: 17 (R=9, N=8) - Date: 2026-01-13 - Comment: Representation Learning: proves autoregressive transformers’ gradient sensitivities recover time-delayed causal graphs, offering theoretical insight into learned representations.

  28. Do Sparse Autoencoders Identify Reasoning Features in Language Models? - Score: 17 (R=9, N=8) - Date: 2026-01-12 - Comment: Representation Learning: falsification-oriented analysis of Sparse Autoencoders, combining causal token injection and LLM-guided tests to assess whether SAE features encode genuine reasoning.

  29. Excess Description Length of Learning Generalizable Predictors - Score: 17 (R=9, N=8) - Date: 2026-01-09 - Comment: Matches Representation Learning/Training Dynamics: information-theoretic framework (Excess Description Length) quantifying capability acquisition and generalization.

  30. On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Representation Learning/Optimization: studies preconditioned gradient descent to mitigate spectral bias and reduce grokking delays; theoretical and empirical insights into learning regimes.

  31. Context Collapse: In-Context Learning and Model Collapse - Score: 17 (R=9, N=8) - Date: 2026-01-07 - Comment: Representation Learning: theoretical analysis of in-context learning in a (linear) transformer via reduction to preconditioned gradient descent; links training dynamics to phase transitions and introduces context collapse.

  32. Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Representation Learning Theory: information-theoretic generalization bounds leveraging flatness to tighten SGD generalization and improve rates.

  33. Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space - Score: 17 (R=9, N=8) - Date: 2026-01-06 - Comment: Theoretical Representation Learning: new log-weighted Barron spaces and depth-sensitive ReLU approximation bounds (Sobolev metrics).

  34. Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process - Score: 17 (R=9, N=8) - Date: 2026-01-01 - Comment: Representation Learning: uses sparse autoencoders to discover disentangled reasoning vectors enabling interpretable control of LLM reasoning behaviors via latent interventions.

  35. Linear representations in language models can change dramatically over a conversation - Score: 16 (R=9, N=7) - Date: 2026-01-29 - Comment: Representation Learning: studies dynamics of linear concept directions in LMs across conversations, impacting interpretability/steering.

  36. Decomposing multimodal embedding spaces with group-sparse autoencoders - Score: 16 (R=9, N=7) - Date: 2026-01-29 - Comment: Representation Learning + sparsity: group-sparse autoencoders with cross-modal masking to decompose multimodal embeddings.

  37. Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder - Score: 16 (R=9, N=7) - Date: 2026-01-28 - Comment: Model Architecture & Representation Learning: proposes an autoencoder with non-uniform variance regularization and isometric constraint to recover ordered latent components (PCA generalization).

  38. Jacobian Scopes: token-level causal attributions in LLMs - Score: 16 (R=9, N=7) - Date: 2026-01-27 - Comment: Matches Representation Learning/Analysis: gradient-based token-level causal attributions (Jacobian Scopes) for interpreting LLM predictions.

  39. YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation - Score: 16 (R=9, N=7) - Date: 2026-01-15 - Comment: Representation Learning: learns sparse, disentangled activation steering vectors in SAE latent space for reference-free controllability/alignment (no reference model required).

  40. Dynamics Reveals Structure: Challenging the Linear Propagation Assumption - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Matches Representation Learning: theoretical analysis of first-order update propagation and constraints (bilinearity vs negation) on feature maps.

  41. CORDS: Continuous Representations of Discrete Structures - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Learning/Set Modeling: invertible continuous fields (density/feature) for variable-sized sets enabling exact decoding.

  42. Bridging Functional and Representational Similarity via Usable Information - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Learning Theory: unifies functional and representational similarity via usable information, linking stitching, CKA/RSA, and reconstruction.

  43. Representation Unlearning: Forgetting through Information Compression - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation Unlearning: imposes an information bottleneck in representation space to forget while retaining utility, with variational objectives.

  44. Fast and Geometrically Grounded Lorentz Neural Networks - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Model architecture: new Lorentz linear layer with geometric guarantees plus efficient activations/caching for hyperbolic NNs, improving representation learning in non-Euclidean space.

  45. $\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation/Compression Theory: tight bounds on minimal embeddable dimension for top-k retrieval under common similarities, informing embedding design.

  46. Order-Optimal Sample Complexity of Rectified Flows - Score: 16 (R=8, N=8) - Date: 2026-01-30 - Comment: Representation learning/theory: proves order-optimal sample complexity for rectified flows in generative modeling.

  47. To Grok Grokking: Provable Grokking in Ridge Regression - Score: 16 (R=8, N=8) - Date: 2026-01-28 - Comment: Representation Learning: theoretical training-dynamics analysis of grokking with provable bounds on generalization delay in ridge regression.

  48. Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs - Score: 16 (R=8, N=8) - Date: 2026-01-19 - Comment: Representation Learning/Mechanistic Interpretability: identifies anchor–adapter circuits causing shortcut memorization under RLVR and demonstrates causal steering.

  49. Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core - Score: 16 (R=8, N=8) - Date: 2026-01-19 - Comment: Representation learning/training dynamics: protocol to decouple logic from facts via gradient reversal, toward a modular neural logic core.

  50. Universal Latent Homeomorphic Manifolds: Cross-Domain Representation Learning via Homeomorphism Verification - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Representation Learning: proposes a topology-based (homeomorphism) framework and verification algorithms to unify latent manifolds across modalities, offering theoretical insights into learned representations.

  51. Dynamic Graph Structure Learning via Resistance Curvature Flow - Score: 16 (R=8, N=8) - Date: 2026-01-15 - Comment: Representation Learning/Efficiency: Resistance Curvature Flow replaces OT-based curvature optimization with effective-resistance matrix ops for dynamic graph structure learning (>100x speedup).

  52. Manifold limit for the training of shallow graph convolutional neural networks - Score: 16 (R=8, N=8) - Date: 2026-01-12 - Comment: Representation Learning/Training Theory: proves Γ-convergence for training shallow GCNNs under manifold assumptions, formalizing mesh/sample independence.

  53. On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis - Score: 16 (R=8, N=8) - Date: 2026-01-12 - Comment: Representation Learning/Training dynamics theory: formalizes recursive self-training in LLMs and proves degenerative behaviors (entropy decay, variance amplification), arguing for neurosymbolic synthesis.

  54. Bridging Distance and Spectral Positional Encodings via Anchor-Based Diffusion Geometry Approximation - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Representation Learning: connects spectral/diffusion positional encodings to anchor-based distance via low-rank/Nyström approximation with theoretical guarantees.

  55. An Algebraic Representation Theorem for Linear GENEOs in Geometric Machine Learning - Score: 16 (R=8, N=8) - Date: 2026-01-09 - Comment: Strongly matches Model Architecture theory (representation theorem for equivariant operators/GENEOs enabling efficient, interpretable architectures).

  56. Credit Assignment via Neural Manifold Noise Correlation - Score: 16 (R=8, N=8) - Date: 2026-01-07 - Comment: Representation Learning/Learning algorithms: proposes manifold-restricted noise correlation for credit assignment, improving sample efficiency and scalability with biological plausibility.

  57. The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving - Score: 16 (R=8, N=8) - Date: 2026-01-05 - Comment: Proposes a unified training objective (DCR) to prevent diversity collapse in reasoning, addressing training dynamics and representation over solution traces (Representation Learning).

  58. Implicit score matching meets denoising score matching: improved rates of convergence and log-density Hessian estimation - Score: 16 (R=8, N=8) - Date: 2026-01-01 - Comment: Representation Learning/Theory: establishes convergence rates and Hessian estimation for implicit and denoising score matching, with implications for diffusion model samplers.

  59. From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning: contrastive latent regularizer to reduce forget–retain entanglement for LLM unlearning (explicit representation shaping).

  60. Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation learning/Mechanistic interpretability: geometric, feature-centric framework explaining catastrophic forgetting; analysis on ViTs.

  61. How Expressive Are Graph Neural Networks in the Presence of Node Identifiers? - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning: formal analysis of GNN expressive power with unique node identifiers (key-invariant expressivity), with links to logic classes.

  62. Amortized Spectral Kernel Discovery via Prior-Data Fitted Network - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning/Architecture analysis: decoders mapping PFN latents to spectral densities and stationary kernels (Bochner) enabling amortized kernel discovery.

  63. XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation learning: weakly-supervised disentanglement via contrastive supervision within a VAE/Information Bottleneck framework, enabling controllable factors.

  64. Learning the Mechanism of Catastrophic Forgetting: A Perspective from Gradient Similarity - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning/Training dynamics: gradient-similarity theory identifies conflicting vs collaborative neurons; proposes selective freezing to prevent forgetting.

  65. FlexCausal: Flexible Causal Disentanglement via Structural Flow Priors and Manifold-Aware Interventions - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning: causal disentanglement with block-diagonal VAE and flow-based priors plus manifold-aware interventions.

  66. Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation learning theory: phase transition analysis for spectral PLS under missing data using spiked random matrix theory; insights into multimodal representation recovery.

  67. LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them? - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Representation Learning/Efficient Fine-tuning: layer-wise analysis localizes language control and selectively tunes final layers (few parameters) to fix multilingual consistency.

  68. Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Representation Learning/Training Dynamics: frames supervised training as implicit hypothesis testing with KL divergence alignment toward Neyman–Pearson optimality, suggesting regularization strategies.

  69. Loss Landscape Geometry and the Learning of Symmetries: Or, What Influence Functions Reveal About Robust Generalization - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Representation Learning: influence-function diagnostic measuring gradient coupling along symmetry orbits to assess robust generalization via loss landscape geometry.

  70. Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning - Score: 15 (R=8, N=7) - Date: 2026-01-29 - Comment: Model Architecture/Representation Learning: orthogonal pooling constructs mutually orthogonal latent subspaces per task to resolve gradient conflicts in multi-task learning.

  71. Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Matches Representation Learning/Training Dynamics: stability and generalization bounds for nonconvex optimization under heavy-tailed gradient noise across SGD variants.

  72. Fixed Aggregation Features Can Rival GNNs - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Matches Representation Learning/Architecture: fixed (non-trainable) neighborhood aggregation features rival GNNs; theoretical links to Kolmogorov–Arnold representations challenge prevailing assumptions.

  73. Smooth embeddings in contracting recurrent networks driven by regular dynamics: A synthesis for neural representation - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning: theoretical synthesis showing when contracting RNNs learn smooth, topology-preserving embeddings of regular dynamics; implications for state dimension and training dynamics.

  74. ASEHybrid: When Geometry Matters Beyond Homophily in Graph Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Matches Model Architecture and Representation Learning: geometry-aware GNN with theoretical characterization (label informativeness) and curvature-guided rewiring.

  75. Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model - Score: 15 (R=8, N=7) - Date: 2026-01-28 - Comment: Representation Learning: introduces a structural metric (Homomorphism Error) on Transformer hidden states and uses it as a training regularizer to improve compositional generalization.

  76. Stability as a Liability: Systematic Breakdown of Linguistic Structure in LLMs - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Matches Representation Learning: analyzes training dynamics under MLE, showing stability leads to forward-KL minimization and low-entropy generations.

  77. Nonlinear multi-study factor analysis - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Matches Representation Learning: sparse multi-study variational autoencoder for shared/specific nonlinear factors with identifiability guarantees.

  78. Spelling Bee Embeddings for Language Modeling - Score: 15 (R=8, N=7) - Date: 2026-01-27 - Comment: Model Architecture: modifies the embedding layer to inject spelling features, improving representation quality with compute/data savings.

  79. No Validation, No Problem: Predicting Model Performance from a Single Gradient - Score: 15 (R=8, N=7) - Date: 2026-01-26 - Comment: Representation Learning/Training Dynamics: proposes a validation-free checkpointing signal from a single gradient; efficiency-oriented early stopping/selection without labels.

  80. Relational Linearity is a Predictor of Hallucinations - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Representation learning/training dynamics: links relational linearity in embeddings to hallucination behavior, offering insight into how LLMs store facts.

  81. Operator learning on domain boundary through combining fundamental solution-based artificial data and boundary integral techniques - Score: 15 (R=8, N=7) - Date: 2026-01-19 - Comment: Representation Learning: boundary-only neural operator (MAD-BNO) learns Dirichlet–Neumann maps from mathematical artificial data; recovers interiors via boundary integrals.

  82. Understanding and Preserving Safety in Fine-Tuned LLMs - Score: 15 (R=8, N=7) - Date: 2026-01-16 - Comment: Representation Learning/Training Dynamics: identifies a low-rank safety-gradient subspace and uses projection-based fine-tuning (SPF) to preserve safety while maintaining utility.

  83. Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Representation Learning: introduces a circuit-level, mechanistic pre-unlearning difficulty metric (CUD) grounded in model circuits and interaction pathways.

  84. Ability Transfer and Recovery via Modularized Parameters Localization - Score: 15 (R=8, N=7) - Date: 2026-01-15 - Comment: Parameter modularization: activation-guided channel-wise ability transfer; insights into ability localization in LLM parameters (Representation Learning/Model Editing).

  85. Supervised Spike Agreement Dependent Plasticity for Fast Local Learning in Spiking Neural Networks - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Representation Learning/Training dynamics: supervised spike agreement-dependent plasticity enabling local, backprop-free learning with linear-time complexity in SNNs.

  86. Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation, Large Activation, and Benign Overfitting - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Representation Learning: empirical analysis of epoch-wise double descent, benign overfitting, and large activations in deep nets.

  87. Representations of Text and Images Align From Layer One - Score: 15 (R=8, N=7) - Date: 2026-01-14 - Comment: Representation Learning: constructive, layer-wise evidence of image–text alignment from early layers using synthesis-based probes.

  88. Local EGOP for Continuous Index Learning - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Representation Learning: Local EGOP metric for adaptive kernels/subspace estimation achieving intrinsic-dimension rates.

  89. Variational decomposition autoencoding improves disentanglement of latent representations - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Model Architecture/Representation Learning: decomposition-aware variational autoencoder for disentangled latent subspaces.

  90. Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Mechanistic interpretability of Diffusion Transformers’ circuits for spatial relations fits the Representation Learning/training dynamics criterion.

  91. SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Representation Learning and training dynamics: SPINAL quantifies layerwise geometric changes from DPO via contraction/transport scores.

  92. VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Leverages Variational Information Bottleneck to probe and intervene on attention heads, matching the Representation Learning criterion (internal mechanism analysis and causally-informed mitigation).

  93. Tracing Moral Foundations in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-13 - Comment: Matches Representation Learning/mechanistic interpretability: layer-wise concepts, sparse autoencoder features, and causal steering in LLMs.

  94. The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Representation Learning/Training Dynamics: analyzes structure of long CoT reasoning and proposes Mole-Syn to synthesize effective reasoning trajectories for stable learning.

  95. Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Matches Representation Learning and Architecture: quantifies dataset shape-texture balance and induces shape bias via max-pool dilation.

  96. Poisson Hyperplane Processes with Rectified Linear Units - Score: 15 (R=8, N=7) - Date: 2026-01-12 - Comment: Model Architecture/Theory: establishes a probabilistic PHP representation equivalent to two-layer ReLU networks, with scalable decomposition and Bayesian inference.

  97. Aligned explanations in neural networks - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Matches Model Architecture (pseudo-linear PiNets enabling aligned, instance-wise linear predictions) and Representation Learning (linearly readable features).

  98. Layer-wise Positional Bias in Short-Context Language Modeling - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Representation learning/training dynamics: layer-wise positional bias profiling via attribution, revealing recency/primacy patterns across depth.

  99. Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-09 - Comment: Representation learning/training dynamics: layer-wise analysis of multi-hop reasoning with a probabilistic recall-and-extract framework explaining internal composition.

  100. Hierarchical temporal receptive windows and zero-shot timescale generalization in biologically constrained scale-invariant deep networks - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Model Architecture: introduces a scale-invariant recurrent architecture (SITH-RNN) with hierarchical temporal receptive windows and zero-shot timescale generalization; Representation Learning: insights into temporal priors and training dynamics.

  101. Output Embedding Centering for Stable LLM Pretraining - Score: 15 (R=8, N=7) - Date: 2026-01-07 - Comment: Training dynamics/representation geometry: proposes output embedding centering (μ-centering/μ-loss) to stabilize LLM pretraining with theoretical guarantees.

  102. ELLA: Efficient Lifelong Learning for Adapters in Large Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation Learning/Efficiency: selective subspace de-correlation via anisotropic shrinkage regularization for continual adapters with constant compute/memory.

  103. Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Training dynamics/representation learning: entropy-gated fine-tuning to mitigate forgetting by suppressing confident-conflict gradients.

  104. Towards a Principled Muon under $\mu\mathsf{P}$: Ensuring Spectral Conditions throughout Training - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Training Dynamics/Optimization: ensures μP spectral conditions throughout training for Muon (Muon++), aligning optimizer updates with μP scaling for large models.

  105. Intention Collapse: Intention-Level Metrics for Reasoning in Language Models - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation Learning: proposes intention-level metrics (entropy, effective dimensionality, recoverability) to study inference-time computation and internal representations in LMs.

  106. Deep Clustering with Associative Memories - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation learning: deep clustering objective using energy-based associative memories coupling representation and clustering.

  107. Deep Deterministic Nonlinear ICA via Total Correlation Minimization with Matrix-Based Entropy Functional - Score: 15 (R=8, N=7) - Date: 2026-01-06 - Comment: Representation Learning: deep deterministic nonlinear ICA minimizing total correlation via matrix-based entropy functional; avoids variational/adversarial schemes.

  108. On the geometry and topology of representations: the manifolds of modular addition - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Analyzes learned representations for modular addition as manifolds, showing equivalence across attention architectures; core Representation Learning insight.

  109. Generative Classifiers Avoid Shortcut Solutions - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Representation Learning/Architecture: shows generative classifiers reduce shortcut reliance and perform better under distribution shift, with theoretical and empirical analysis.

  110. Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Representation Learning/Efficiency: identifies cognitive attention heads and applies test-time representation rotations (training-free) to steer reasoning, reducing tokens and improving accuracy.

  111. Towards mechanistic understanding in a data-driven weather model: internal activations reveal interpretable physical features - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Representation Learning/interpretability: applies sparse autoencoders to internal activations of a weather model to discover and intervene on physical features.

  112. Information-Theoretic Quality Metric of Low-Dimensional Embeddings - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Introduces an information-theoretic metric (ERPM) for embedding quality via entropy/stable rank; fits Representation Learning evaluation/analysis.

  113. Deep learning methods for inverse problems using connections between proximal operators and Hamilton-Jacobi equations - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Model Architecture and Representation Learning: leverages connections between proximal operators and Hamilton–Jacobi PDEs to design architectures for learning priors in inverse problems.

  114. Geometric Scaling of Bayesian Inference in LLMs - Score: 15 (R=8, N=7) - Date: 2026-01-01 - Comment: Matches Representation Learning: analyzes internal geometry in Transformers/LLMs (entropy-aligned axis, low-dimensional value manifolds) and training dynamics via targeted interventions revealing how uncertainty is encoded.

Other Foundational Research (4)

  1. In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior - Score: 20.0 (R=0, N=0) - Date: 2026-01-07 - Comment: Author match

  2. Paradoxical noise preference in RNNs - Score: 16 (R=9, N=7) - Date: 2026-01-09 - Comment: Matches Training Dynamics: reveals noise-level-dependent fixed-point shifts in CTRNNs and noise as integral to computation.

  3. A New Convergence Analysis of Plug-and-Play Proximal Gradient Descent Under Prior Mismatch - Score: 16 (R=8, N=8) - Date: 2026-01-16 - Comment: Matches theoretical training analysis: first convergence proof for PnP-PGD under prior mismatch, relaxing restrictive assumptions.

  4. Hebbian Learning with Global Direction - Score: 15 (R=8, N=7) - Date: 2026-01-30 - Comment: Training dynamics: biologically plausible Hebbian framework augmented with global directional signals as an alternative to backprop.