Personalized Monthly Topic Summary 2025/03

Metric | Value
Total Papers | 124
Architecture and Training Dynamics | 41
Efficiency, Compression, and Large-Scale Training | 39
Representation Learning Theory and Structure | 43
Memory Structures and Agent Memory Systems | 1
World Models, Exploration, and Open-Ended Reinforcement Learning | 0

Architecture and Training Dynamics (41)

  1. A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers - Score: 19 (R=10, N=9) - Date: 2025-03-07 - Comment: The paper provides theoretical insights into the expressive power of log-depth transformers, directly addressing foundational questions about model architecture and depth scaling.

  2. Convergence Rates for Softmax Gating Mixture of Experts - Score: 19 (R=10, N=9) - Date: 2025-03-06 - Comment: The paper provides a theoretical analysis of softmax gating in Mixture of Experts (MoE), directly addressing architectural insights and efficiency. The convergence analysis and sample efficiency insights are highly relevant.

  3. Mixture of Experts Made Intrinsically Interpretable - Score: 18 (R=10, N=8) - Date: 2025-03-12 - Comment: The paper introduces MoE-X, a Mixture-of-Experts model designed for intrinsic interpretability, which aligns closely with the MoE and interpretability criteria.

  4. MoFE: Mixture of Frozen Experts Architecture - Score: 18 (R=10, N=8) - Date: 2025-03-11 - Comment: The paper introduces the Mixture of Frozen Experts (MoFE) architecture, which is directly relevant to foundational research on Mixture-of-Experts and efficiency in model architectures.

  5. Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning - Score: 18 (R=10, N=8) - Date: 2025-03-10 - Comment: The paper introduces a symbolic Mixture-of-Experts framework, which directly aligns with the MoE topic under model architecture. The instance-level expert selection and efficiency improvements are notable contributions.

  6. Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts - Score: 18 (R=10, N=8) - Date: 2025-03-10 - Comment: The paper introduces Linear-MoE, combining linear sequence modeling with Mixture-of-Experts, which is highly relevant to architectural innovations and foundational research in MoE.

  7. Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts - Score: 18 (R=10, N=8) - Date: 2025-03-10 - Comment: The paper addresses the Straggler Effect in Mixture-of-Experts, which is directly relevant to model architecture and efficiency improvements. The proposed techniques are innovative.

  8. Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer - Score: 18 (R=10, N=8) - Date: 2025-03-05 - Comment: The paper proposes Union-of-Experts (UoE), which advances the Mixture-of-Experts paradigm with architectural innovations, aligning closely with model architecture research.

  9. Efficiently Editing Mixture-of-Experts Models with Compressed Experts - Score: 18 (R=10, N=8) - Date: 2025-03-04 - Comment: The paper introduces compressed experts for Mixture-of-Experts (MoE) models, reducing inference costs while maintaining performance. This directly aligns with the 'Model Architecture' and 'Model Compression' criteria.

  10. A Theory of Learning with Autoregressive Chain of Thought - Score: 18 (R=9, N=9) - Date: 2025-03-12 - Comment: The paper formalizes learning with autoregressive chain-of-thought, which aligns with foundational research in LLMs and introduces theoretical insights.

  11. L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling - Score: 18 (R=9, N=9) - Date: 2025-03-07 - Comment: The paper establishes a mutual information scaling law for long-context language modeling, which provides theoretical insights into LLM behavior and aligns with the LLM criterion.

  12. Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining - Score: 18 (R=9, N=9) - Date: 2025-03-07 - Comment: The paper establishes scaling laws for hyperparameters in LLM pretraining, providing theoretical insights into model optimization and aligning with foundational research in LLM behavior.

  13. Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought - Score: 18 (R=9, N=9) - Date: 2025-03-03 - Comment: The paper provides theoretical insights into how transformers implement multi-step gradient descent with Chain of Thought prompting, aligning with 'Large Language Models' and 'Representation Learning'.

  14. Accelerating MoE Model Inference with Expert Sharding - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper addresses efficiency in Mixture-of-Experts (MoE) inference through expert sharding, which directly aligns with the model architecture and compression criteria. The tensor sharding approach is a novel contribution to MoE inference.

  15. ProTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper introduces a framework for protein structure reasoning and editing using LLMs, which aligns with foundational AI for science and multimodal generative paradigms.

  16. ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper introduces a compression method for Mixture-of-Experts models, which aligns with model compression and efficiency improvements.

  17. eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper proposes a memory-efficient MoE inference system, directly aligning with the model architecture and efficiency criteria.

  18. InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: This paper introduces a novel paradigm for long-context reasoning in LLMs, addressing computational scaling and reasoning depth. It aligns with foundational research in LLMs by proposing a new iterative reasoning framework, which could have broader implications for model efficiency and architecture.

  19. MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper proposes a sparse Mixture-of-Experts framework for multi-source, multi-modal question answering, which aligns with foundational research on MoE and scalability.

  20. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper discusses scaling Mixture-of-Experts (MoE) models efficiently, which directly aligns with foundational research in model architecture and efficiency.

  21. Continual Pre-training of MoEs: How robust is your router? - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper investigates continual pre-training of MoE models, providing insights into routing algorithms and robustness, which is highly relevant to foundational research in MoE architectures.

  22. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper proposes HybridNorm, a novel normalization strategy for transformers, which directly aligns with the model architecture criterion. It provides insights into training stability and performance improvements.

  23. SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper introduces SOLAR, a framework for reasoning in LLMs with novel topological approaches, aligning with foundational research in model architecture and reasoning.

  24. Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper focuses on improving MoE inference efficiency with speculative parallelization, which directly aligns with foundational research in MoE architectures and efficiency.

  25. Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper meta-analyzes design decisions in language models, providing insights into architectural choices and their downstream impact, which aligns with foundational research in model architecture.

  26. Conformal Transformations for Symmetric Power Transformers - Score: 17 (R=9, N=8) - Date: 2025-03-06 - Comment: The paper introduces a novel architectural improvement to linear transformers by addressing capacity limitations in symmetric power transformers using conformal transformations. This aligns with the 'Model Architecture' criterion, focusing on architectural innovations.

  27. Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs - Score: 17 (R=9, N=8) - Date: 2025-03-06 - Comment: The paper investigates cognitive behaviors in language models that enable self-improvement, providing theoretical insights into reasoning behaviors and their impact on model performance. This aligns with the 'Large Language Models' criterion, focusing on theoretical insights into LLM behavior.

  28. Forgetting Transformer: Softmax Attention with a Forget Gate - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper introduces a Forgetting Transformer with a novel attention mechanism, which aligns with foundational research in model architecture and transformer innovations.

  29. Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper provides theoretical insights into the depth-width tradeoffs in transformers for graph tasks, which is highly relevant to understanding transformer architectures and their efficiency.

  30. Compositional Reasoning with Transformers, RNNs, and Chain of Thought - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper compares the expressive power of transformers, RNNs, and chain-of-thought methods for compositional reasoning, providing theoretical insights into model capabilities. This aligns with the interest in analyzing architectures.

  31. Liger: Linearizing Large Language Models to Gated Recurrent Structures - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces Liger, a method for linearizing LLMs into gated recurrent structures, which aligns with foundational research in model architecture and efficiency. The use of LoRA for lightweight fine-tuning and the introduction of Liger Attention are novel contributions.

  32. DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper proposes a method for enhancing parameter efficiency in Mixture-of-Experts models, which aligns with foundational research in model architecture and efficiency.

  33. Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces Neural ODE Transformers, offering insights into internal dynamics and adaptive fine-tuning. This aligns with foundational research in model architecture and interpretability.

  34. Transformer Meets Twicing: Harnessing Unattended Residual Information - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper proposes Twicing Attention, a novel attention mechanism addressing representational capacity decay in transformers. This aligns with foundational research in model architecture and offers theoretical guarantees.

  35. CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces a dual momentum Mixture-of-Experts framework for continual learning in multimodal tasks, which is highly relevant to MoE and architectural innovations.

  36. CoSMoEs: Compact Sparse Mixture of Experts - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: This paper introduces Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference, addressing quality, memory, and latency. It is highly relevant to the Mixture-of-Experts (MoE) criterion and provides insights into architectural innovations.

  37. FANformer: Improving Large Language Models Through Effective Periodicity Modeling - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: FANformer integrates Fourier Analysis Network into the attention mechanism, providing a novel architectural improvement for LLMs with potential foundational impact on periodicity modeling in transformers.

  38. Oscillation-Reduced MXFP4 Training for Vision Transformers - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper addresses FP4 training for Vision Transformers with novel methods to reduce weight oscillation, aligning with 'Model Compression' and efficiency breakthroughs.

  39. Triple Phase Transitions: Understanding the Learning Dynamics of Large Language Models from a Neuroscience Perspective - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper explores phase transitions in LLMs from a neuroscience perspective, providing theoretical insights into emergent behaviors in LLM training.

  40. Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper provides a theoretical analysis of two-stage training dynamics in transformers, contributing to the understanding of feature disentanglement and optimization processes.

  41. Revisiting Kernel Attention with Correlated Gaussian Process Representation - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper introduces a novel transformer architecture using Correlated Gaussian Processes (CGPs) to enhance representation capacity, aligning with the 'Model Architecture' criterion. It also includes a sparse approximation, which touches on 'Model Compression'.

Efficiency, Compression, and Large-Scale Training (39)

  1. Quantum-PEFT: Ultra parameter-efficient fine-tuning - Score: 18 (R=9, N=9) - Date: 2025-03-10 - Comment: The paper proposes Quantum-PEFT, a novel parameter-efficient fine-tuning method leveraging quantum computations, which aligns with model compression and efficiency breakthroughs.

  2. Neural Manifolds and Cognitive Consistency: A New Approach to Memory Consolidation in Artificial Systems - Score: 18 (R=9, N=9) - Date: 2025-03-05 - Comment: The paper introduces a novel framework for memory consolidation inspired by neuroscience, which aligns with foundational research in representation learning and emerging trends.

  3. ELECTRA: A Symmetry-breaking Cartesian Network for Charge Density Prediction with Floating Orbitals - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper introduces a symmetry-breaking equivariant model for predicting electronic charge densities, which is foundational in AI for science and introduces a novel generative paradigm.

  4. Accurate INT8 Training Through Dynamic Block-Level Fallback - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper proposes a dynamic fallback quantization method for INT8 training, which aligns with the model compression criterion by addressing efficiency and robustness in low-bit training.

  5. EFPC: Towards Efficient and Flexible Prompt Compression - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper proposes a novel prompt compression method for LLMs, which aligns with foundational research in model compression and efficiency.

  6. SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper introduces SplitQuantV2, a novel low-bit quantization method for LLMs, which aligns with the model compression criterion and demonstrates practical efficiency improvements.

  7. MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper focuses on a novel static quantization framework for LLMs, which aligns with the model compression criterion, particularly in sparsity and quantization.

  8. Understanding the Learning Dynamics of LoRA: A Gradient Flow Perspective on Low-Rank Adaptation in Matrix Factorization - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper provides theoretical insights into the learning dynamics of LoRA, which aligns with representation learning and low-rank adaptation in model compression.

  9. Task Vector Quantization for Memory-Efficient Model Merging - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper introduces a memory-efficient model merging method using task vector quantization, which aligns with model compression and efficiency breakthroughs.

  10. Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper introduces a novel data-free delta compression method inspired by JPEG compression, which aligns with model compression and efficiency breakthroughs.

  11. Towards Superior Quantization Accuracy: A Layer-sensitive Approach - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: This paper proposes a layer-sensitive approach to quantization, which directly aligns with the model compression criterion. The methods SensiBoost and KurtBoost provide novel insights into layer-specific quantization strategies, improving accuracy with minimal memory overhead.

  12. Seesaw: High-throughput LLM Inference via Model Re-sharding - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper introduces a dynamic re-sharding technique for LLM inference, which aligns with model compression and efficiency breakthroughs.

  13. Sample-aware Adaptive Structured Pruning for Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper proposes a structured pruning framework for LLMs, which aligns with the model compression criterion. The use of adaptive methods adds novelty to the pruning process.

  14. IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper proposes an integrated enlarge-and-prune pipeline for generative language model pretraining, which aligns with foundational research in model compression.

  15. Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper introduces Balcony, a framework for dynamic inference in LLMs, which aligns with the model compression and efficiency criterion through its innovative depth-based dynamic inference approach.

  16. Wanda++: Pruning Large Language Models via Regional Gradients - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper introduces Wanda++, a pruning framework for LLMs, which aligns with model compression and sparsity. The use of regional gradients is a novel approach.

  17. Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper proposes task-aware KV cache compression, which aligns with model compression and efficiency improvements in LLMs. The task-aware approach is a novel contribution.

  18. TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper introduces a novel Branch-Merge distillation approach for model compression, which aligns with the model compression criterion, particularly in the context of LLMs.

  19. Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper proposes a novel entropy-weighted quantization method for LLMs, which aligns with the model compression criterion. The findings on entropy and precision requirements are insightful and relevant.

  20. How can representation dimension dominate structurally pruned LLMs? - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper investigates the role of representation dimension in pruned LLMs, providing foundational insights into structured pruning and its impact on model performance.

  21. PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention - Score: 17 (R=9, N=8) - Date: 2025-03-06 - Comment: The paper introduces PowerAttention, a sparse attention mechanism for LLMs that improves efficiency and scalability. This aligns with the 'Model Compression' criterion, focusing on efficiency breakthroughs in attention mechanisms.

  22. Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper introduces Q-Filters, a novel KV Cache compression method leveraging QK geometry, which aligns with the model compression criterion. It provides theoretical insights and demonstrates compatibility with FlashAttention, making it highly relevant.

  23. An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper introduces a novel matrix decomposition framework with theoretical contributions to sparsity and low-rank methods, which aligns with model compression and representation learning.

  24. Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper uses Random Matrix Theory for pruning DNNs, aligning with the model compression criterion and providing both theoretical and empirical contributions.

  25. Identifying Sensitive Weights via Post-quantization Integral - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper proposes a novel sensitivity metric (PQI) for post-training quantization, which is highly relevant to model compression and efficiency.

  26. CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper introduces a sparsification framework (CABS) for model merging, which aligns with model compression and sparsity-related research.

  27. When Can You Get Away with Low Memory Adam? - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces SlimAdam, a memory-efficient variant of the Adam optimizer, which aligns with the model compression criterion by addressing memory efficiency through a novel SNR-based approach.

  28. RSQ: Learning from Important Tokens Leads to Better Quantized LLMs - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper proposes a novel quantization method (RSQ) for LLMs, focusing on token importance and efficiency, which aligns with model compression and efficiency breakthroughs.

  29. Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper provides theoretical insights into the limitations of Adam and SGD optimization in deep learning, which aligns with foundational research on training dynamics.

  30. EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: EliteKV proposes a novel KV cache compression method for RoPE-based models, which aligns with foundational research in model compression and efficiency.

  31. Revisiting Large Language Model Pruning using Neuron Semantic Attribution - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper revisits pruning in LLMs using neuron semantic attribution, which aligns with model compression and provides insights into pruning behavior.

  32. KurTail: Kurtosis-based LLM Quantization - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: This paper introduces a novel quantization method for LLMs, addressing outliers and optimizing memory efficiency. It aligns with the model compression criterion, particularly in quantization and efficiency breakthroughs.

  33. Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper proposes a parameter-efficient fine-tuning method (DCFT) for LLMs, which aligns with foundational research in model efficiency.

  34. Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces MorphKV, a novel inference-time technique for maintaining constant-sized KV caches in LLMs, addressing memory efficiency and accuracy trade-offs. This aligns with the 'Model Compression' criterion, particularly in the context of KV cache optimization.

  35. LoR2C: Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces a novel low-rank residual connection adaptation for parameter-efficient fine-tuning, which aligns with model compression and efficiency breakthroughs.

  36. Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces Progressive Sparse Attention (PSA) for efficient attention in LLMs, focusing on reducing KV cache usage and improving inference efficiency. This aligns with model compression and efficiency breakthroughs.

  37. KVCrush: Key value cache size-reduction using similarity in head-behaviour - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper proposes a KV cache compression method for LLMs, addressing memory efficiency with minimal accuracy loss. This aligns with the model compression criterion, particularly in KV cache optimization.

  38. Training LLMs with MXFP4 - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper focuses on low-precision training with MXFP4, which aligns with the model compression criterion, specifically addressing efficiency breakthroughs through stochastic rounding and variance reduction techniques.

  39. Stochastic Rounding for LLM Training: Theory and Practice - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper explores stochastic rounding for LLM training, providing theoretical insights into implicit regularization and convergence. This aligns with the 'Large Language Models' criterion, focusing on foundational efficiency improvements.

Representation Learning Theory and Structure (43)

  1. Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry - Score: 19 (R=10, N=9) - Date: 2025-03-04 - Comment: The paper provides a theoretical framework for sparse autoencoders, directly addressing representation learning and the biases in concept detection.

  2. Disentangling Task Interference within Neurons: Model Merging in Alignment with Neuronal Mechanisms - Score: 18 (R=9, N=9) - Date: 2025-03-10 - Comment: The paper introduces NeuroMerging, a novel framework for model merging that addresses task interference at the neuronal level, aligning with the representation learning and model architecture criteria.

  3. Deep Learning is Not So Mysterious or Different - Score: 18 (R=9, N=9) - Date: 2025-03-05 - Comment: The paper provides a theoretical perspective on generalization phenomena in deep learning, which aligns with foundational research in representation learning and training dynamics.

  4. Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning - Score: 18 (R=9, N=9) - Date: 2025-03-04 - Comment: The paper proposes a novel framework combining sparse mixing and distributional changes for disentangled representation learning, which directly aligns with foundational research in representation learning.

  5. Dataset Distillation with Neural Characteristic Function: A Minmax Perspective - Score: 18 (R=9, N=9) - Date: 2025-03-03 - Comment: The paper introduces Neural Characteristic Function Matching for dataset distillation, which is a novel approach to representation learning with significant theoretical contributions.

  6. How good is PAC-Bayes at explaining generalisation? - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper provides a theoretical analysis of PAC-Bayes bounds and their ability to explain generalization, which is highly relevant to foundational research in representation learning and generalization theory.

  7. A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper provides a theoretical framework to prevent class collapse in supervised contrastive learning, which is highly relevant to foundational research in representation learning.

  8. Route Sparse Autoencoder to Interpret Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper proposes a sparse autoencoder framework for LLM interpretability, which aligns with representation learning and interpretability of LLMs.

  9. CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement - Score: 17 (R=9, N=8) - Date: 2025-03-12 - Comment: The paper introduces CAD-VAE, a novel disentangled VAE framework addressing fairness in representation learning, which aligns with foundational research in representation learning.

  10. Learning Energy-Based Models by Self-normalising the Likelihood - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper proposes a novel self-normalized log-likelihood objective for energy-based models, which aligns with foundational research in representation learning.

  11. How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper uses sparse autoencoders to trace internal representations in LLMs, directly addressing representation learning and interpretability in LLMs.

  12. Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper critiques the MMI criterion and proposes a novel alternative for rationale extraction, which aligns with representation learning and interpretability.

  13. Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper provides theoretical insights into feature learning in ReLU networks, which aligns with foundational research in representation learning.

  14. Analyzing the Role of Permutation Invariance in Linear Mode Connectivity - Score: 17 (R=9, N=8) - Date: 2025-03-11 - Comment: The paper provides a theoretical analysis of linear mode connectivity and sparsity in neural networks, which aligns with representation learning and training dynamics.

  15. Strategy Coopetition Explains the Emergence and Transience of In-Context Learning - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper provides a mechanistic understanding of in-context learning dynamics, which aligns with foundational research in representation learning and training dynamics.

  16. Distilling Dataset into Neural Field - Score: 17 (R=9, N=8) - Date: 2025-03-10 - Comment: The paper introduces a novel parameterization framework for dataset distillation using neural fields, which is highly relevant to foundational research in representation learning and efficiency.

  17. Enough Coin Flips Can Make LLMs Act Bayesian - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper investigates whether LLMs perform Bayesian reasoning during in-context learning, providing theoretical insights into LLM behavior and interpretability. This aligns closely with the foundational research on LLMs and their emergent capabilities.

  18. Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper introduces Geometric Neural Operators (GNPs) for point cloud representations, which aligns with foundational research in representation learning and architecture-level innovations.

  19. Activation Space Interventions Can Be Transferred Between Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper explores activation space interventions and their transferability between LLMs, which aligns with representation learning and foundational insights into LLM behavior.

  20. Causally Reliable Concept Bottleneck Models - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper introduces a concept bottleneck model with causal reasoning capabilities, aligning with representation learning and emerging trends in explainable AI. It also provides a pipeline for learning causal structures.

  21. Learning Causal Response Representations through Direct Effect Analysis - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper focuses on causal representation learning, which aligns with the representation learning criterion. It introduces a novel optimization framework and provides theoretical guarantees, making it relevant to foundational research.

  22. Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability - Score: 17 (R=9, N=8) - Date: 2025-03-07 - Comment: The paper provides theoretical insights into generalizability based on expressiveness, directly addressing foundational questions in representation learning and over-parameterization.

  23. Process-based Self-Rewarding Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-06 - Comment: The paper explores a self-rewarding paradigm for LLMs with a focus on mathematical reasoning, which aligns with foundational research in LLM behavior and interpretability. The proposed process-based self-rewarding pipeline introduces novel theoretical insights.

  24. Towards Understanding Distilled Reasoning Models: A Representational Approach - Score: 17 (R=9, N=8) - Date: 2025-03-06 - Comment: The paper explores how model distillation impacts reasoning features in LLMs, aligning with representation learning and theoretical insights into LLM behavior. The focus on feature geometry and structured representations is highly relevant.

  25. Effective LLM Knowledge Learning via Model Generalization - Score: 17 (R=9, N=8) - Date: 2025-03-06 - Comment: The paper explores knowledge learning in LLMs and proposes methods to improve generalization during pretraining. This aligns with the 'Large Language Models' criterion, particularly in understanding and enhancing foundational knowledge acquisition.

  26. Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability - Score: 17 (R=9, N=8) - Date: 2025-03-06 - Comment: The paper provides theoretical insights into multi-round reasoning in LLMs, focusing on approximation, learnability, and generalization. This aligns with the 'Large Language Models' criterion, particularly in understanding foundational behavior and theoretical properties.

  27. Weak-to-Strong Generalization Even in Random Feature Networks, Provably - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper explores weak-to-strong generalization in random feature networks, providing theoretical insights into training dynamics and generalization, which aligns well with foundational research in representation learning.

  28. Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper focuses on unsupervised representation learning for dynamic networks, with a novel stability guarantee and theoretical contributions. This aligns with the representation learning criterion.

  29. (How) Do Language Models Track State? - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper investigates how language models track state and identifies two distinct mechanisms, providing theoretical insights into LLM behavior and interpretability.

  30. A Theory of Initialisation's Impact on Specialisation - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper provides theoretical insights into the impact of initialization on neuron specialization, which is relevant to representation learning and training dynamics in neural networks.

  31. A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff - Score: 17 (R=9, N=8) - Date: 2025-03-05 - Comment: The paper introduces a nonasymptotic generalization theory for multilayer neural networks, addressing foundational aspects of generalization and double descent, which is highly relevant to understanding training dynamics.

  32. From superposition to sparse codes: interpretable representations in neural networks - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper provides a theoretical framework for understanding neural representations using sparse coding, which aligns with foundational research in representation learning.

  33. On the Power of Context-Enhanced Learning in LLMs - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper formalizes context-enhanced learning for LLMs, providing theoretical insights into gradient-based learning with enhanced context. This aligns with foundational research in LLM behavior and interpretability.

  34. Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper proposes a sparse coding method for adaptive representation learning, which aligns with foundational research in representation learning and efficiency.

  35. Asymptotic Theory of Eigenvectors for Latent Embeddings with Generalized Laplacian Matrices - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper develops an asymptotic theory for eigenvectors in generalized Laplacian matrices, contributing to foundational research in representation learning and theoretical insights into latent embeddings.

  36. Projection Head is Secretly an Information Bottleneck - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper provides a theoretical understanding of the projection head in contrastive learning, aligning with foundational research in representation learning and offering novel insights into its role as an information bottleneck.

  37. Towards Understanding the Benefit of Multitask Representation Learning in Decision Process - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper provides theoretical insights into multitask representation learning, directly addressing foundational aspects of representation learning.

  38. Steering Large Language Model Activations in Sparse Spaces - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: The paper introduces Sparse Activation Steering (SAS) for guiding LLM behavior using sparse autoencoders. This aligns with foundational research in representation learning and interpretability, offering a novel approach to behavior modulation.

  39. BAnG: Bidirectional Anchored Generation for Conditional RNA Design - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper explores identifiability in mechanistic interpretability, which aligns with emerging trends and foundational research in understanding neural networks.

  40. Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper provides a theoretical analysis of training dynamics in large two-layer networks, uncovering phenomena like time-scale separation and feature unlearning. This aligns with the 'Representation Learning' criterion, focusing on training dynamics and generalization.

  41. Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking) - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: This position paper advocates for using layerwise linear models to understand neural dynamical phenomena like neural collapse and grokking, which directly aligns with foundational research in representation learning and training dynamics.

  42. Learning Dynamics of Deep Linear Networks Beyond the Edge of Stability - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper provides a theoretical analysis of learning dynamics in deep linear networks, contributing to foundational understanding of training dynamics in neural networks.

  43. Brain-Inspired Exploration of Functional Networks and Key Neurons in Large Language Models - Score: 17 (R=9, N=8) - Date: 2025-03-03 - Comment: The paper explores functional networks in LLMs inspired by cognitive neuroscience, providing insights into LLM behavior and interpretability, which aligns with the LLM criterion.

Memory Structures and Agent Memory Systems (1)

  1. CE-U: Cross Entropy Unlearning - Score: 17 (R=9, N=8) - Date: 2025-03-04 - Comment: CE-U proposes a novel loss function for unlearning in LLMs, which aligns with foundational research in LLM behavior and theoretical insights. The focus on gradient stability and theoretical analysis is a strong match.