Previous Day 2025-10-01
Monthly Overview 2025-10
Next Day 2025-10-03

Personalized Daily ArXiv Papers 2025-10-02

[gpt-5] Prompt Completion Total
Token 73396 66962 140358
Cost $0.09 $0.67 $0.76

Total arXiv papers: 759

Total scanned papers: 487

Total relevant papers: 43

Table of contents with paper titles:

  1. A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws Authors: Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin

  2. Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space Authors: Houjun Liu, Shikhar Murty, Christopher D. Manning, R\'obert Csord\'as

  3. Cutting the Skip: Training Residual-Free Transformers Authors: Yiping Ji, James Martens, Jianqiao Zheng, Ziqin Zhou, Peyman Moghadam, Xinyu Zhang, Hemanth Saratchandran, Simon Lucey

  4. PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning Authors: Xin Yu, Cong Xie, Ziyu Zhao, Tiantian Fan, Lingzhou Xue, Zhi Zhang

  5. AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features Authors: Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

  6. Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs Authors: Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul Whatmough

  7. Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time Authors: Blake Bordelon, Mary I. Letey, Cengiz Pehlevan

  8. Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning Authors: Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

  9. Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space? Authors: Nandan Kumar Jha, Brandon Reagen

  10. Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity Authors: Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

  11. LoRAFusion: Efficient LoRA Fine-Tuning for LLMs Authors: Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, Gennady Pekhimenko

  12. Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models Authors: JingChuan Guan, Tomoyuki Kubota, Yasuo Kuniyoshi, Kohei Nakajima

  13. On Predictability of Reinforcement Learning Dynamics for Large Language Models Authors: Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang

  14. Composer: A Search Framework for Hybrid Neural Architecture Design Authors: Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J. Yadwadkar, Carole-Jean Wu

  15. Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning Authors: Zeru Shi, Yingjia Wan, Zhenting Wang, Qifan Wang, Fan Yang, Elisa Kreiss, Ruixiang Tang

  16. Randomized Matrix Sketching for Neural Network Training and Gradient Monitoring Authors: Harbir Antil, Deepanshu Verma

  17. Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis Authors: Hongkang Li, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Meng Wang

  18. Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs Authors: Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, Huan Zhang

  19. Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling Authors: Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang

  20. Feature Identification via the Empirical NTK Authors: Jennifer Lin

  21. Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling Authors: Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych

  22. The Pitfalls of KV Cache Compression Authors: Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel

  23. Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks Authors: Hanjiang Hu, Bowei Li, Ziwei Wang, Tianhao Wei, Casidhe Hutchison, Eric Sample, Changliu Liu

  24. ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models Authors: Dongqi Zheng

  25. A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models Authors: Leah Bar, Liron Mor Yosef, Shai Zucker, Neta Shoham, Inbar Seroussi, Nir Sochen

  26. On the Benefits of Weight Normalization for Overparameterized Matrix Sensing Authors: Yudong Wei, Liang Zhang, Bingcong Li, Niao He

  27. Rethinking Thinking Tokens: LLMs as Improvement Operators Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal

  28. Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models Authors: Shutong Wu, Jiawei Zhang

  29. Learning Inter-Atomic Potentials without Explicit Equivariance Authors: Ahmed A. Elhag, Arun Raja, Alex Morehead, Samuel M. Blau, Garrett M. Morris, Michael M. Bronstein

  30. Nonparametric Identification of Latent Concepts Authors: Yujia Zheng, Shaoan Xie, Kun Zhang

  31. Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Vi\'egas, Martin Wattenberg, Andrew Lee

  32. GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling Authors: Jose I. Mestre, Alberto Fern\'andez-Hern\'andez, Cristian P\'erez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ort\'i

  33. Generalized Parallel Scaling with Interdependent Generations Authors: Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman

  34. Learning Energy-based Variational Latent Prior for VAEs Authors: Debottam Dutta, Chaitanya Amballa, Zhongweiyang Xu, Yu-Lin Wei, Romit Roy Choudhury

  35. Equivariant Geometric Scattering Networks via Vector Diffusion Wavelets Authors: David R. Johnson, Rishabh Anand, Smita Krishnaswamy, Michael Perlmutter

  36. Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum Authors: Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

  37. Large Language Models Inference Engines based on Spiking Neural Networks Authors: Adarsha Balaji, Sandeep Madireddy

  38. Continual Learning with Query-Only Attention Authors: Gautham Bekal, Ashish Pujari, Scott David Kelly

  39. Exploring System 1 and 2 communication for latent reasoning in LLMs Authors: Julian Coda-Forno, Zhuokai Zhao, Qiang Zhang, Dipesh Tamboli, Weiwei Li, Xiangjun Fan, Lizhu Zhang, Eric Schulz, Hsiao-Ping Tseng

  40. Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness Authors: Tsubasa Takahashi, Shojiro Yamabe, Futa Waseda, Kento Sasaki

  41. Geometric Properties of Neural Multivariate Regression Authors: George Andriopoulos, Zixuan Dong, Bimarsha Adhikari, Keith Ross

  42. BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner Authors: Hengkui Wu, Liujiang Liu, Jihua He, Qihao Wang, Keke Zhao, Shuyang Hu, Renle Fu, Dahao Liang, Lingyu Zeng, Bruce Liu, Yuan Liu, Jin Zhan, Jiaqiang Niu, Xinglong Jia, Yaqin Hu, Wenjun Ji, Panpan Chi, Ken Chen, Hengyuan Wu, Yingsi Xin, Yongfeng Zhu, Yuexin Wang, Manqi Ruan, Ningtao Bian, Xiaohua Wu, Weipeng Xu

  43. ACON: Optimizing Context Compression for Long-horizon LLM Agents Authors: Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan


1. A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws

ArXiv ID: 2510.00504

Authors: Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin

Abstract: When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the \textit{dynamical} lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.

Comment: Compression/Efficiency Theory: proves polylogarithmic compression of models and datasets, establishing a dynamical lottery ticket hypothesis and boosted scaling laws.

Relevance: 10 Novelty: 9


2. Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space

ArXiv ID: 2510.00219

Authors: Houjun Liu, Shikhar Murty, Christopher D. Manning, R\'obert Csord\'as

Abstract: Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a "bubble" of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.

Comment: Model Architecture: introduces adaptive parallel computation in transformers by forking/deleting residual streams learned during pretraining (dynamic networks).

Relevance: 10 Novelty: 9


3. Cutting the Skip: Training Residual-Free Transformers

ArXiv ID: 2510.00345

Authors: Yiping Ji, James Martens, Jianqiao Zheng, Ziqin Zhou, Peyman Moghadam, Xinyu Zhang, Hemanth Saratchandran, Simon Lucey

Abstract: Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without skip (residual) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why skips improve conditioning and revealing that their stabilization benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong baselines, that incorporate skip connections, on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.

Comment: Model Architecture/Training Dynamics: enables stable training of residual-free transformers via principled initialization based on Jacobian conditioning analysis.

Relevance: 10 Novelty: 8


4. PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning

ArXiv ID: 2510.00192

Authors: Xin Yu, Cong Xie, Ziyu Zhao, Tiantian Fan, Lingzhou Xue, Zhi Zhang

Abstract: Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models, yet its representational capacity often lags behind full fine-tuning. Within the context of LoRA, a key open question is how to obtain expressive low-rank adapters from over-parameterized spaces. We propose \textit{PrunedLoRA}, a new framework that leverages structured pruning to obtain highly representative low-rank adapters from an over-parameterized initialization. Unlike prior approaches that impose a fixed low-rank budget, PrunedLoRA dynamically prunes less important components during fine-tuning and prevents their reactivation, enabling flexible and adaptive rank allocation. For structured pruning, by minimizing the pruning error for overall loss, we provide fine-grained pruning and recovery updates in a gradient-based pruning strategy with grounded interpretation. We provide the first theoretical analysis of the robustness of structured pruning and provably show that under the impact of weight perturbation, gradient-based pruning is more robust than activation-based pruning with respect to overall loss. Empirically, PrunedLoRA consistently outperforms LoRA and its variants across supervised fine-tuning tasks in mathematical reasoning, code generation, and natural language understanding, and it also demonstrates advantages over existing structured pruning methods across diverse sparsity levels.

Comment: Model Compression and Efficiency—structured pruning within low-rank adapters (LoRA) with theoretical robustness analysis and dynamic rank allocation.

Relevance: 10 Novelty: 8


5. AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

ArXiv ID: 2510.00404

Authors: Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

Abstract: Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there remains no principled framework to derive SAEs from the original dictionary learning formulation. In this work, we introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation of existing SAEs: their sparsity-inducing regularizers enforce non-negativity, preventing a single feature from representing bidirectional concepts (e.g., male vs. female). This structural constraint fragments semantic axes into separate, redundant features, limiting representational completeness. To address this issue, we propose AbsTopK SAE, a new variant derived from the $\ell_0$ sparsity constraint that applies hard thresholding over the largest-magnitude activations. By preserving both positive and negative activations, AbsTopK uncovers richer, bidirectional conceptual representations. Comprehensive experiments across four LLMs and seven probing and steering tasks show that AbsTopK improves reconstruction fidelity, enhances interpretability, and enables single features to encode contrasting concepts. Remarkably, AbsTopK matches or even surpasses the Difference-in-Mean method, a supervised approach that requires labeled data for each concept and has been shown in prior work to outperform SAEs.

Comment: Representation Learning and Sparsity: derives SAEs from proximal gradient unrolling and introduces AbsTopK (|·|-TopK) to recover bidirectional features under an ℓ0-inspired sparsity constraint.

Relevance: 10 Novelty: 8


6. Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs

ArXiv ID: 2510.01185

Authors: Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul Whatmough

Abstract: Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention; notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, Llama3.2 LLM backbones) show DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.

Comment: MoE: router regularization via Dirichlet-prior shaping to improve expert balance and specialization in upcycled sparse MoEs.

Relevance: 10 Novelty: 8


7. Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time

ArXiv ID: 2510.01098

Authors: Blake Bordelon, Mary I. Letey, Cengiz Pehlevan

Abstract: We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on various computational and statistical resources (width, depth, number of training steps, batch size and data per context). In a joint limit where data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) where covariances are randomly rotated and structured (RRS). For ISO and FS settings, we find that depth only aids ICL performance if context length is limited. Alternatively, in the RRS setting where covariances change across contexts, increasing the depth leads to significant improvements in ICL, even at infinite context length. This provides a new solvable toy model of neural scaling laws which depends on both width and depth of a transformer and predicts an optimal transformer shape as a function of compute. This toy model enables computation of exact asymptotics for the risk as well as derivation of powerlaws under source/capacity conditions for the ICL tasks.

Comment: Model Architecture/Representation Learning: theoretical scaling laws for deep linear self-attention (depth vs width vs context) and training dynamics.

Relevance: 9 Novelty: 9


8. Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning

ArXiv ID: 2510.00570

Authors: Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

Abstract: Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL). However, existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing during the transition from single-task to multi-task learning (STL to MTL). To address these limitations, we propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts. This design facilitates STL to MTL transition, enhances expert specialization, and cooperation. Furthermore, we incorporate fine-grained experts by increasing the number of LoRA experts while proportionally reducing their rank, enabling more effective knowledge sharing under a comparable parameter budget. Extensive experiments on the PASCAL-Context benchmark, under unified training settings, demonstrate that ASE consistently improves performance across diverse configurations and validates the effectiveness of fine-grained designs for MTL.

Comment: Model Architecture and Efficiency: MoE with adaptive shared experts and LoRA-based fine-grained low-rank experts for multi-task learning.

Relevance: 10 Novelty: 7


9. Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

ArXiv ID: 2510.00537

Authors: Nandan Kumar Jha, Brandon Reagen

Abstract: As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI) -- we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.

Comment: Representation/Architecture Analysis: introduces spectral utilization diagnostics (hard/soft rank, concentration, SUI) revealing FFN latent-space scaling laws.

Relevance: 9 Novelty: 8


10. Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

ArXiv ID: 2510.00304

Authors: Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

Abstract: Deep learning models excel in stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.

Comment: Representation Learning and Training Dynamics: provides a mathematical framework for loss of plasticity, identifying frozen units and cloned-unit manifolds and linking to low-rank/simplicity biases.

Relevance: 9 Novelty: 8


11. LoRAFusion: Efficient LoRA Fine-Tuning for LLMs

ArXiv ID: 2510.00206

Authors: Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, Gennady Pekhimenko

Abstract: Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs), as it significantly reduces GPU memory usage while maintaining competitive fine-tuned model quality on downstream tasks. Despite these benefits, we identify two key inefficiencies in existing LoRA fine-tuning systems. First, they incur substantial runtime overhead due to redundant memory accesses on large activation tensors. Second, they miss the opportunity to concurrently fine-tune multiple independent LoRA adapters that share the same base model on the same set of GPUs. This leads to missed performance gains such as reduced pipeline bubbles, better communication overlap, and improved GPU load balance. To address these issues, we introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. At the kernel level, we propose a graph-splitting method that fuses memory-bound operations. This design eliminates unnecessary memory accesses and preserves the performance of compute-bound GEMMs without incurring the cost of recomputation or synchronization. At the scheduling level, LoRAFusion introduces an adaptive batching algorithm for multi-job fine-tuning. It first splits LoRA adapters into groups to intentionally stagger batch execution across jobs, and then solves a bin-packing problem within each group to generate balanced, dependency-aware microbatches. LoRAFusion achieves up to $1.96\times$ ($1.47\times$ on average) end-to-end speedup compared to Megatron-LM, and up to $1.46\times$ ($1.29\times$ on average) improvement over mLoRA, the state-of-the-art multi-LoRA fine-tuning system. Our fused kernel achieves up to $1.39\times$ ($1.27\times$ on average) kernel performance improvement and can directly serve as a plug-and-play replacement in existing LoRA systems. We open-source LoRAFusion at https://github.com/CentML/lorafusion.

Comment: High Performance Computing and Efficiency: fused kernels for LoRA and adaptive multi-job scheduling for concurrent fine-tuning; systems-level innovation for PEFT.

Relevance: 9 Novelty: 8


12. Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models

ArXiv ID: 2510.00563

Authors: JingChuan Guan, Tomoyuki Kubota, Yasuo Kuniyoshi, Kohei Nakajima

Abstract: State space models (SSMs) have gained attention by showing potential to outperform Transformers. However, previous studies have not sufficiently addressed the mechanisms underlying their high performance owing to a lack of theoretical explanation of SSMs' learning dynamics. In this study, we provide such an explanation and propose an improved training strategy. The memory capacity of SSMs can be evaluated by examining how input time series are stored in their current state. Such an examination reveals a tradeoff between memory accuracy and length, as well as the theoretical equivalence between the structured state space sequence model (S4) and a simplified S4 with diagonal recurrent weights. This theoretical foundation allows us to elucidate the learning dynamics, proving the importance of initial parameters. Our analytical results suggest that successful learning requires the initial memory structure to be the longest possible even if memory accuracy may deteriorate or the gradient lose the teacher information. Experiments on tasks requiring long memory confirmed that extending memory is difficult, emphasizing the importance of initialization. Furthermore, we found that fixing recurrent weights can be more advantageous than adapting them because it achieves comparable or even higher performance with faster convergence. Our results provide a new theoretical foundation for SSMs and potentially offer a novel optimization strategy.

Comment: Model Architecture: theoretical analysis of SSM learning dynamics and an initialization/weight-freezing optimization strategy.

Relevance: 9 Novelty: 8


13. On Predictability of Reinforcement Learning Dynamics for Large Language Models

ArXiv ID: 2510.00553

Authors: Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang

Abstract: Recent advances in reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99\% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5 speedup while retaining \textgreater 96\% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward principled, interpretable, and efficient training paradigm for LLMs.

Comment: Representation Learning/Training Dynamics: identifies low-rank (rank-1) structure in RL-induced parameter updates and exploits it for efficient training speedups.

Relevance: 9 Novelty: 8


14. Composer: A Search Framework for Hybrid Neural Architecture Design

ArXiv ID: 2510.00379

Authors: Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J. Yadwadkar, Carole-Jean Wu

Abstract: Hybrid model architectures that combine computational primitives (e.g., Attention, MLP) in different ratios have shown promising performance beyond Transformers. Some studies have shown that different interleavings of primitives can affect model quality as well. However, prior works explore the hybrid model architecture design space manually. Due to the large design space and training costs, discovering hybrid models that combine key computational primitives for pre-training is challenging. In this work, we take a principled approach in designing a modular hybrid model architecture search framework -- Composer. Composer explores model architectures at a small scale and extrapolates the top-performing model architectures to a larger scale using our proposed scaling strategies. Using Composer, we discover new hybrid LLM architectures that outperform Llama 3.2. Compared to Llama 3.2 and previous state-of-the-art baselines, the new model architectures consistently reduce validation loss at parameter scales of 350M-3B and improve evaluation accuracy on the downstream tasks by up to 2.8-8.3% (1.1-3.1% on average) while improving both training and inference efficiency.

Comment: Model Architecture and Efficiency: framework for searching hybrid Attention/MLP architectures with scalable extrapolation strategies for LLMs.

Relevance: 9 Novelty: 8


15. Meaningless Tokens, Meaningful Gains: How Activation Shifts Enhance LLM Reasoning

ArXiv ID: 2510.01032

Authors: Zeru Shi, Yingjia Wan, Zhenting Wang, Qifan Wang, Fan Yang, Elisa Kreiss, Ruixiang Tang

Abstract: Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance LLM reasoning performance, this work analyzes the underlying mechanism driving this phenomenon and based on these insights proposes a more principled method that allows for similar performance gains. First, we find that the improvements arise from a redistribution of activations in the LLM's MLP layers, where near zero activations become less frequent while large magnitude activations increase. This redistribution enhances the model's representational capacity by suppressing weak signals and promoting stronger, more informative ones. Building on this insight, we propose the Activation Redistribution Module (ARM), a lightweight inference-time technique that modifies activations directly without altering the input sequence. ARM adaptively identifies near-zero activations after the non-linear function and shifts them outward, implicitly reproducing the beneficial effects of meaningless tokens in a controlled manner. Extensive experiments across diverse benchmarks and model architectures clearly show that ARM consistently improves LLM performance on reasoning tasks while requiring only a few lines of simple code to implement. Our findings deliver both a clear mechanistic explanation for the unexpected benefits of meaningless tokens and a simple yet effective technique that harnesses activation redistribution to further improve LLM performance.

Comment: Model Architecture and Representation Learning: mechanistic analysis of MLP activation distributions and an inference-time activation redistribution module (ARM) that improves reasoning.

Relevance: 9 Novelty: 8


16. Randomized Matrix Sketching for Neural Network Training and Gradient Monitoring

ArXiv ID: 2510.00442

Authors: Harbir Antil, Deepanshu Verma

Abstract: Neural network training relies on gradient computation through backpropagation, yet memory requirements for storing layer activations present significant scalability challenges. We present the first adaptation of control-theoretic matrix sketching to neural network layer activations, enabling memory-efficient gradient reconstruction in backpropagation. This work builds on recent matrix sketching frameworks for dynamic optimization problems, where similar state trajectory storage challenges motivate sketching techniques. Our approach sketches layer activations using three complementary sketch matrices maintained through exponential moving averages (EMA) with adaptive rank adjustment, automatically balancing memory efficiency against approximation quality. Empirical evaluation on MNIST, CIFAR-10, and physics-informed neural networks demonstrates a controllable accuracy-memory tradeoff. We demonstrate a gradient monitoring application on MNIST showing how sketched activations enable real-time gradient norm tracking with minimal memory overhead. These results establish that sketched activation storage provides a viable path toward memory-efficient neural network training and analysis.

Comment: Model Compression and Efficiency: adapts matrix sketching to layer activations for memory-efficient backprop and gradient monitoring, enabling reduced activation storage.

Relevance: 9 Novelty: 8


17. Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis

ArXiv ID: 2510.00399

Authors: Hongkang Li, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Meng Wang

Abstract: The Mamba model has gained significant attention for its computational advantages over Transformer-based models, while achieving comparable performance across a wide range of language tasks. Like Transformers, Mamba exhibits in-context learning (ICL) capabilities, i.e., making predictions for new tasks based on a prompt containing input-label pairs and a query, without requiring fine-tuning. Despite its empirical success, the theoretical understanding of Mamba remains limited, largely due to the nonlinearity introduced by its gating mechanism. To the best of our knowledge, this paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model, which consists of a linear attention component followed by a nonlinear gating layer, and its ICL generalization on unseen binary classification tasks, even when the prompt includes additive outliers. Our analysis shows that Mamba leverages the linear attention layer to select informative context examples and uses the nonlinear gating layer to suppress the influence of outliers. By establishing and comparing to the analysis of linear Transformers under the same setting, we show that although Mamba may require more training iterations to converge, it maintains accurate predictions even when the proportion of outliers exceeds the threshold that a linear Transformer can tolerate. These theoretical findings are supported by empirical experiments.

Comment: Model Architecture/Theory: first theoretical analysis of one-layer Mamba’s ICL generalization with outliers, contrasting linear attention vs. nonlinear gating.

Relevance: 9 Novelty: 7


18. Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs

ArXiv ID: 2510.00419

Authors: Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, Huan Zhang

Abstract: Zeroth-order optimizers have recently emerged as a practical approach for fine-tuning large language models (LLMs), significantly reducing GPU memory consumption compared to traditional first-order methods. Yet, existing zeroth-order methods rely on hand-crafted, static sampling strategies that are not adaptable to model-specific structures. To address this, we propose ZO Fine-tuner, a learning-based zeroth-order optimizer for LLMs that automatically learns efficient perturbation strategies through a compact and memory-efficient design. Crucially, our approach is motivated by the observation that only a small number of foundation models and their derivatives are widely adopted in practice. Therefore, learning the optimizer once for a given LLM and reusing it across diverse downstream tasks is both feasible and highly desirable. Accordingly, ZO Fine-tuner is designed to scale learning to learn (L2L) to the foundation-model era by supporting one-time training per LLM with minimal overhead. Experiments on 4 LLMs and 7 datasets show that ZO Fine-tuner outperforms prior zeroth-order baselines in 82.1\% of task-model combinations, thereby demonstrating strong performance and scalability for efficient LLM fine-tuning. Our code is available at https://github.com/ASTRAL-Group/ZO_Fine_tuner.git.

Comment: High Performance/Efficiency—learning-based zeroth-order optimizer for LLM fine-tuning reducing memory with L2L-style perturbation strategies.

Relevance: 9 Novelty: 7


19. Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling

ArXiv ID: 2510.00028

Authors: Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang

Abstract: Extending the context window support of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, enable longer input length support without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of the PI+PTQ approach and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Following the analysis results, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for Key and Query weights (with an optional symmetric variant to preserve logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning to the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces the model's perplexity on long-context workloads by more than 14%, while preserving short-context performance, inference throughput, and compatibility with existing LLM system stacks.

Comment: Model Compression and Efficiency: analyzes RoPE interpolation under post-training quantization and proposes an interpolation-aware, per-band weight rescaling (Q-ROAR) guided by new diagnostics.

Relevance: 9 Novelty: 7


20. Feature Identification via the Empirical NTK

ArXiv ID: 2510.00468

Authors: Jennifer Lin

Abstract: We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. Across two standard toy models for mechanistic interpretability, Toy Models of Superposition (TMS) and a 1-layer MLP trained on modular addition, we find that the eNTK exhibits sharp spectral cliffs whose top eigenspaces align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers and that the evolution of the eNTK eigenspectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.

Comment: Representation Learning/Training dynamics: empirical NTK eigenanalysis surfaces learned features and tracks grokking phase changes.

Relevance: 9 Novelty: 7


21. Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling

ArXiv ID: 2510.01025

Authors: Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych

Abstract: The linear representation hypothesis states that language models (LMs) encode concepts as directions in their latent space, forming organized, multidimensional manifolds. Prior efforts focus on discovering specific geometries for specific features, and thus lack generalization. We introduce Supervised Multi-Dimensional Scaling (SMDS), a model-agnostic method to automatically discover feature manifolds. We apply SMDS to temporal reasoning as a case study, finding that different features form various geometric structures such as circles, lines, and clusters. SMDS reveals many insights on these structures: they consistently reflect the properties of the concepts they represent; are stable across model families and sizes; actively support reasoning in models; and dynamically reshape in response to context changes. Together, our findings shed light on the functional role of feature manifolds, supporting a model of entity-based reasoning in which LMs encode and transform structured representations.

Comment: Representation Learning: SMDS method to discover and analyze feature manifolds in LLM latent space.

Relevance: 9 Novelty: 7


22. The Pitfalls of KV Cache Compression

ArXiv ID: 2510.00231

Authors: Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel

Abstract: KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls practitioners should be aware of when deploying KV cache compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example of that, we highlight system prompt leakage as a case study, empirically showing the impact of compression on leakage and general instruction following. We show several factors that play a role in prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.

Comment: Model Compression and Efficiency: critical analysis of KV cache compression with improved eviction policies for multi-instruction prompting in LLMs.

Relevance: 9 Novelty: 7


23. Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks

ArXiv ID: 2510.00083

Authors: Hanjiang Hu, Bowei Li, Ziwei Wang, Tianhao Wei, Casidhe Hutchison, Eric Sample, Changliu Liu

Abstract: Deep neural networks have been widely adopted in many vision and robotics applications with visual inputs. It is essential to verify its robustness against semantic transformation perturbations, such as brightness and contrast. However, current certified training and robustness certification methods face the challenge of over-parameterization, which hinders the tightness and scalability due to the over-complicated neural networks. To this end, we first analyze stability and variance of layers and neurons against input perturbation, showing that certifiable robustness can be indicated by a fundamental Unbiased and Smooth Neuron metric (USN). Based on USN, we introduce a novel neural network pruning method that removes neurons with low USN and retains those with high USN, thereby preserving model expressiveness without over-parameterization. To further enhance this pruning process, we propose a new Wasserstein distance loss to ensure that pruned neurons are more concentrated across layers. We validate our approach through extensive experiments on the challenging robust keypoint detection task, which involves realistic brightness and contrast perturbations, demonstrating that our method achieves superior robustness certification performance and efficiency compared to baselines.

Comment: Model Compression and Efficiency: robust pruning guided by an Unbiased and Smooth Neuron metric (USN) plus a Wasserstein loss to enhance certifiable robustness.

Relevance: 9 Novelty: 7


24. ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models

ArXiv ID: 2510.00071

Authors: Dongqi Zheng

Abstract: Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose \textbf{Adaptive Reasoning Suppression (ARS)}, a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves up to 53%, 46.1%, and 57.9% in token, latency and energy reduction, while maintaining or improving accuracy.

Comment: Model Compression and Efficiency: training-free adaptive suppression of reasoning steps for LRLMs to reduce tokens/latency while preserving accuracy.

Relevance: 9 Novelty: 7


25. A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models

ArXiv ID: 2510.00666

Authors: Leah Bar, Liron Mor Yosef, Shai Zucker, Neta Shoham, Inbar Seroussi, Nir Sochen

Abstract: The foundational premise of generative AI for images is the assumption that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models, the low dimensional nature of the data manifest itself by the introduction of a lower dimensional latent space. Yet, the probability distribution in the latent or the manifold coordinate space is considered uninteresting and is predefined or considered uniform. This study unifies the geometric and probabilistic perspectives by providing a geometric framework and a kernel-based probabilistic method simultaneously. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of ``good images''. This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.

Comment: Model Architecture and Representation Learning: introduces a deterministic Manifold-Probabilistic Projection Model unifying geometric manifold structure with kernel-based probabilistic modeling, reinterpreting diffusion as projection.

Relevance: 8 Novelty: 8


26. On the Benefits of Weight Normalization for Overparameterized Matrix Sensing

ArXiv ID: 2510.01175

Authors: Yudong Wei, Liang Zhang, Bingcong Li, Niao He

Abstract: While normalization techniques are widely used in deep learning, their theoretical understanding remains relatively limited. In this work, we establish the benefits of (generalized) weight normalization (WN) applied to the overparameterized matrix sensing problem. We prove that WN with Riemannian optimization achieves linear convergence, yielding an exponential speedup over standard methods that do not use WN. Our analysis further demonstrates that both iteration and sample complexity improve polynomially as the level of overparameterization increases. To the best of our knowledge, this work provides the first characterization of how WN leverages overparameterization for faster convergence in matrix sensing.

Comment: Training dynamics/optimization analysis of weight normalization showing faster convergence in overparameterized matrix sensing (Representation Learning / Model Architecture analysis).

Relevance: 8 Novelty: 8


27. Rethinking Thinking Tokens: LLMs as Improvement Operators

ArXiv ID: 2510.01123

Authors: Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal

Abstract: Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR) (iteratively improve a single candidate answer) which provides performance superior to long CoT. Success of such model orchestrations raises the question whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).

Comment: Inference efficiency and dynamic refinement: Parallel-Distill-Refine orchestrates bounded workspace and parallelism to improve accuracy-latency trade-offs (HPC/algorithmic efficiency; conditional computation).

Relevance: 8 Novelty: 8


28. Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models

ArXiv ID: 2510.00294

Authors: Shutong Wu, Jiawei Zhang

Abstract: Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Thanks to their bidirectional attention mechanism, DLLMs are more capable of capturing the connection of context, and thus show unique advantages in challenges like the famous "reversal curse" or learning under data-constrained scenarios. However, this bidirectional nature also brings an obstacle that DLLMs are not inherently compatible with KV Cache, and consequently, the inference efficiency is not competitive compared with autoregressive models. Taking advantage of their inherent capability of multi-token prediction, existing parallel decoding algorithms can speed up the DLLM inference, but at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (Freedave), a novel fast sampling algorithm tailored for DLLMs that achieves lossless parallel decoding. Specifically, we propose a pipeline of parallel-decoded candidate generation and verification, which is guaranteed to reproduce the same sequence generated by static sampling, without introducing extra model forward calls. By applying Freedave, the throughput of DLLMs can be boosted up to $2.8\times$ without performance degradation on math reasoning tasks.

Comment: Model efficiency/HPC: lossless parallel decoding for diffusion LLMs via draft-and-verify without extra forward passes; substantial inference speedup.

Relevance: 8 Novelty: 8


29. Learning Inter-Atomic Potentials without Explicit Equivariance

ArXiv ID: 2510.00027

Authors: Ahmed A. Elhag, Arun Raja, Alex Morehead, Samuel M. Blau, Garrett M. Morris, Michael M. Bronstein

Abstract: Accurate and scalable machine-learned inter-atomic potentials (MLIPs) are essential for molecular simulations ranging from drug discovery to new material design. Current state-of-the-art models enforce roto-translational symmetries through equivariant neural network architectures, a hard-wired inductive bias that can often lead to reduced flexibility, computational efficiency, and scalability. In this work, we introduce TransIP: Transformer-based Inter-Atomic Potentials, a novel training paradigm for interatomic potentials achieving symmetry compliance without explicit architectural constraints. Our approach guides a generic non-equivariant Transformer-based model to learn SO(3)-equivariance by optimizing its representations in the embedding space. Trained on the recent Open Molecules (OMol25) collection, a large and diverse molecular dataset built specifically for MLIPs and covering different types of molecules (including small organics, biomolecular fragments, and electrolyte-like species), TransIP attains comparable performance in machine-learning force fields versus state-of-the-art equivariant baselines. Further, compared to a data augmentation baseline, TransIP achieves 40% to 60% improvement in performance across varying OMol25 dataset sizes. More broadly, our work shows that learned equivariance can be a powerful and efficient alternative to equivariant or augmentation-based MLIP models.

Comment: Model Architecture/Representation Learning: learns SO(3) equivariance in a non-equivariant Transformer for inter-atomic potentials, avoiding hard-wired symmetry constraints.

Relevance: 8 Novelty: 8


30. Nonparametric Identification of Latent Concepts

ArXiv ID: 2510.00136

Authors: Yujia Zheng, Shaoan Xie, Kun Zhang

Abstract: We are born with the ability to learn concepts by comparing diverse observations. This helps us to understand the new world in a compositional manner and facilitates extrapolation, as objects naturally consist of multiple concepts. In this work, we argue that the cognitive mechanism of comparison, fundamental to human learning, is also vital for machines to recover true concepts underlying the data. This offers correctness guarantees for the field of concept learning, which, despite its impressive empirical successes, still lacks general theoretical support. Specifically, we aim to develop a theoretical framework for the identifiability of concepts with multiple classes of observations. We show that with sufficient diversity across classes, hidden concepts can be identified without assuming specific concept types, functional relations, or parametric generative models. Interestingly, even when conditions are not globally satisfied, we can still provide alternative guarantees for as many concepts as possible based on local comparisons, thereby extending the applicability of our theory to more flexible scenarios. Moreover, the hidden structure between classes and concepts can also be identified nonparametrically. We validate our theoretical results in both synthetic and real-world settings.

Comment: Representation Learning: provides a nonparametric identifiability theory for latent concepts from multi-class observations, offering foundational guarantees on recovering representations.

Relevance: 8 Novelty: 8


31. Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

ArXiv ID: 2510.00184

Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Vi\'egas, Martin Wattenberg, Andrew Lee

Abstract: Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via \emph{implicit chain-of-thought}, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to cache'' andretrieve'' pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the ``running sum'' via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.

Comment: Representation Learning: reverse-engineers transformer mechanisms for long-range dependencies (attention DAG caching, Minkowski-sum digit geometry) and training dynamics with an auxiliary inductive-bias loss.

Relevance: 8 Novelty: 7


32. GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling

ArXiv ID: 2510.00883

Authors: Jose I. Mestre, Alberto Fern\'andez-Hern\'andez, Cristian P\'erez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ort\'i

Abstract: In this work we introduce GreenLightningAI (GLAI), a new architectural block designed as an alternative to conventional MLPs. The central idea is to separate two types of knowledge that are usually entangled during training: (i) structural knowledge, encoded by the stable activation patterns induced by ReLU activations; and (ii) quantitative knowledge, carried by the numerical weights and biases. By fixing the structure once stabilized, GLAI reformulates the MLP as a combination of paths, where only the quantitative component is optimized. This reformulation retains the universal approximation capabilities of MLPs, yet achieves a more efficient training process, reducing training time by ~40% on average across the cases examined in this study. Crucially, GLAI is not just another classifier, but a generic block that can replace MLPs wherever they are used, from supervised heads with frozen backbones to projection layers in self-supervised learning or few-shot classifiers. Across diverse experimental setups, GLAI consistently matches or exceeds the accuracy of MLPs with an equivalent number of parameters, while converging faster. Overall, GLAI establishes a new design principle that opens a direction for future integration into large-scale architectures such as Transformers, where MLP blocks dominate the computational footprint.

Comment: Model Architecture and Efficiency—new MLP-replacement block that decouples structural vs quantitative knowledge to speed training while retaining expressivity.

Relevance: 8 Novelty: 7


33. Generalized Parallel Scaling with Interdependent Generations

ArXiv ID: 2510.01143

Authors: Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman

Abstract: Parallel LLM inference scaling involves sampling a set of $N>1$ responses for a single input prompt. However, these $N$ parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 50% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.

Comment: High Performance Computing/Systems-level inference—parallel scaling with interdependent generations via shared hidden-state tensors and small parameter overhead.

Relevance: 8 Novelty: 7


34. Learning Energy-based Variational Latent Prior for VAEs

ArXiv ID: 2510.00260

Authors: Debottam Dutta, Chaitanya Amballa, Zhongweiyang Xu, Yu-Lin Wei, Romit Roy Choudhury

Abstract: Variational Auto-Encoders (VAEs) are known to generate blurry and inconsistent samples. One reason for this is the "prior hole" problem. A prior hole refers to regions that have high probability under the VAE's prior but low probability under the VAE's posterior. This means that during data generation, high probability samples from the prior could have low probability under the posterior, resulting in poor quality data. Ideally, a prior needs to be flexible enough to match the posterior while retaining the ability to generate samples fast. Generative models continue to address this tradeoff. This paper proposes to model the prior as an energy-based model (EBM). While EBMs are known to offer the flexibility to match posteriors (and also improving the ELBO), they are traditionally slow in sample generation due to their dependency on MCMC methods. Our key idea is to bring a variational approach to tackle the normalization constant in EBMs, thus bypassing the expensive MCMC approaches. The variational form can be approximated with a sampler network, and we show that such an approach to training priors can be formulated as an alternating optimization problem. Moreover, the same sampler reduces to an implicit variational prior during generation, providing efficient and fast sampling. We compare our Energy-based Variational Latent Prior (EVaLP) method to multiple SOTA baselines and show improvements in image generation quality, reduced prior holes, and better sampling efficiency.

Comment: Representation Learning/Model Architecture—energy-based variational latent prior for VAEs addressing prior holes with efficient sampling via variational treatment.

Relevance: 8 Novelty: 7


35. Equivariant Geometric Scattering Networks via Vector Diffusion Wavelets

ArXiv ID: 2510.01022

Authors: David R. Johnson, Rishabh Anand, Smita Krishnaswamy, Michael Perlmutter

Abstract: We introduce a novel version of the geometric scattering transform for geometric graphs containing scalar and vector node features. This new scattering transform has desirable symmetries with respect to rigid-body roto-translations (i.e., $SE(3)$-equivariance) and may be incorporated into a geometric GNN framework. We empirically show that our equivariant scattering-based GNN achieves comparable performance to other equivariant message-passing-based GNNs at a fraction of the parameter count.

Comment: Model Architecture and Efficiency: SE(3)-equivariant geometric scattering transform integrated into GNNs, achieving comparable performance with fewer parameters.

Relevance: 8 Novelty: 7


36. Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

ArXiv ID: 2510.00526

Authors: Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.

Comment: Representation Learning/Training dynamics: evaluates probability-based objectives beyond NLL for SFT, with theory tied to model capability.

Relevance: 8 Novelty: 7


37. Large Language Models Inference Engines based on Spiking Neural Networks

ArXiv ID: 2510.00133

Authors: Adarsha Balaji, Sandeep Madireddy

Abstract: Foundational models based on the transformer architecture are currently the state-of-the-art in general language modeling, as well as in scientific areas such as material science and climate. However, training and deploying these models is computationally challenging as the time and space complexity has a quadratic relation to the input sequence length. Several efforts exploring efficient computational paradigms and model architectures to address these limitations have been made. In this work, we explore spiking neural networks (SNNs) to design transformer models. A challenge in training large-scale SNNs, using existing surrogate learning methods is inefficient and time-consuming. On the other hand, techniques to convert existing transformer-based models to their SNN equivalent are not scalable, as achieving optimal performance comes at the cost of a large number of spike time-steps, i.e. increased latency. To address this, we propose NeurTransformer, a methodology for designing transformer-based SNN for inference using a supervised fine-tuning approach with existing conversion methods. The proposed methodology works by: (1) replacing the self-attention mechanism with a spike-based self-attention (SSA), (2) converting the feed-forward block of the trained transformer model to its equivalent SNN, and (3) fine-tuning the SSA block using SNN-based surrogate learning algorithms. We benchmark the proposed methodology and demonstrate its accuracy and scalability using three variants of the GPT-2 model of increasing model size. We observe that the converted GPT-2 small models demonstrate a 5-12% loss in cosine similarity and a 9.7% reduction in perplexity. Finally, we demonstrate the energy efficiency of the SSA block compared to the ASA block and show between 64.71% and 85.28% reductions in estimated energy consumption when implementing the self-attention mechanism on a digital hardware.

Comment: Model Architecture/Efficiency: spike-based self-attention and SNN conversion/fine-tuning for transformer inference, targeting energy-efficient deployment.

Relevance: 8 Novelty: 7


38. Continual Learning with Query-Only Attention

ArXiv ID: 2510.00365

Authors: Gautham Bekal, Ashish Pujari, Scott David Kelly

Abstract: Continual learning involves learning from a stream of data without repetition of data points, a scenario that is inherently complex due to distributional shift across tasks. We propose a query-only attention mechanism that discards keys and values, yet preserves the core inductive bias of transformer architectures. In continual learning scenarios, this simplified mechanism significantly mitigates both loss of plasticity and catastrophic forgetting, outperforming baselines such as selective re-initialization. We establish a conceptual link between query-only attention, full transformer attention, and model agnostic meta-learning, framing them as instances of meta-learning. We further provide intuition for why query-based models and attention networks help preserve plasticity in continual settings. Finally, through preliminary Hessian spectrum analysis, we observe that models maintaining higher curvature rank across tasks tend to retain plasticity. Our findings suggest that full attention may not be essential for capturing the benefits of meta-learning in continual learning.

Comment: Model Architecture/Training Dynamics: query-only attention variant with analysis of plasticity and catastrophic forgetting.

Relevance: 8 Novelty: 7


39. Exploring System 1 and 2 communication for latent reasoning in LLMs

ArXiv ID: 2510.00494

Authors: Julian Coda-Forno, Zhuokai Zhao, Qiang Zhang, Dipesh Tamboli, Weiwei Li, Xiangjun Fan, Lizhu Zhang, Eric Schulz, Hsiao-Ping Tseng

Abstract: Should LLM reasoning live in a separate module, or within a single model's forward pass and representational space? We study dual-architecture latent reasoning, where a fluent Base exchanges latent messages with a Coprocessor, and test two hypotheses aimed at improving latent communication over Liu et al. (2024): (H1) increase channel capacity; (H2) learn communication via joint finetuning. Under matched latent-token budgets on GPT-2 and Qwen-3, H2 is consistently strongest while H1 yields modest gains. A unified soft-embedding baseline, a single model with the same forward pass and shared representations, using the same latent-token budget, nearly matches H2 and surpasses H1, suggesting current dual designs mostly add compute rather than qualitatively improving reasoning. Across GSM8K, ProsQA, and a Countdown stress test with increasing branching factor, scaling the latent-token budget beyond small values fails to improve robustness. Latent analyses show overlapping subspaces with limited specialization, consistent with weak reasoning gains. We conclude dual-model latent reasoning remains promising in principle, but likely requires objectives and communication mechanisms that explicitly shape latent spaces for algorithmic planning.

Comment: Model Architecture: studies dual-model latent communication vs unified forward-pass, analyzing representation and compute tradeoffs.

Relevance: 8 Novelty: 7


40. Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness

ArXiv ID: 2510.00517

Authors: Tsubasa Takahashi, Shojiro Yamabe, Futa Waseda, Kento Sasaki

Abstract: Differential Attention (DA) has been proposed as a refinement to standard attention, suppressing redundant or noisy context through a subtractive structure and thereby reducing contextual hallucination. While this design sharpens task-relevant focus, we show that it also introduces a structural fragility under adversarial perturbations. Our theoretical analysis identifies negative gradient alignment-a configuration encouraged by DA's subtraction-as the key driver of sensitivity amplification, leading to increased gradient norms and elevated local Lipschitz constants. We empirically validate this Fragile Principle through systematic experiments on ViT/DiffViT and evaluations of pretrained CLIP/DiffCLIP, spanning five datasets in total. These results demonstrate higher attack success rates, frequent gradient opposition, and stronger local sensitivity compared to standard attention. Furthermore, depth-dependent experiments reveal a robustness crossover: stacking DA layers attenuates small perturbations via depth-dependent noise cancellation, though this protection fades under larger attack budgets. Overall, our findings uncover a fundamental trade-off: DA improves discriminative focus on clean inputs but increases adversarial vulnerability, underscoring the need to jointly design for selectivity and robustness in future attention mechanisms.

Comment: Model Architecture: analyzes Differential Attention’s robustness and training dynamics, revealing structural trade-offs in attention design.

Relevance: 8 Novelty: 7


41. Geometric Properties of Neural Multivariate Regression

ArXiv ID: 2510.01105

Authors: George Andriopoulos, Zixuan Dong, Bimarsha Adhikari, Keith Ross

Abstract: Neural multivariate regression underpins a wide range of domains such as control, robotics, and finance, yet the geometry of its learned representations remains poorly characterized. While neural collapse has been shown to benefit generalization in classification, we find that analogous collapse in regression consistently degrades performance. To explain this contrast, we analyze models through the lens of intrinsic dimension. Across control tasks and synthetic datasets, we estimate the intrinsic dimension of last-layer features (ID_H) and compare it with that of the regression targets (ID_Y). Collapsed models exhibit ID_H ID_Y. For the non-collapsed models, performance with respect to ID_H depends on the data quantity and noise levels. From these observations, we identify two regimes (over-compressed and under-compressed) that determine when expanding or reducing feature dimensionality improves performance. Our results provide new geometric insights into neural regression and suggest practical strategies for enhancing generalization.

Comment: Representation Learning: analyzes intrinsic dimensionality and collapse in neural regression representations, yielding insights into training dynamics and generalization.

Relevance: 8 Novelty: 7


42. BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner

ArXiv ID: 2510.00129

Authors: Hengkui Wu, Liujiang Liu, Jihua He, Qihao Wang, Keke Zhao, Shuyang Hu, Renle Fu, Dahao Liang, Lingyu Zeng, Bruce Liu, Yuan Liu, Jin Zhan, Jiaqiang Niu, Xinglong Jia, Yaqin Hu, Wenjun Ji, Panpan Chi, Ken Chen, Hengyuan Wu, Yingsi Xin, Yongfeng Zhu, Yuexin Wang, Manqi Ruan, Ningtao Bian, Xiaohua Wu, Weipeng Xu

Abstract: We introduce BigBang-Proton, a unified sequence-based architecture for auto-regressive language modeling pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scientific multi-task learner. BigBang-Proton incorporates three fundamental innovations compared to mainstream general-purpose LLMs: Theory-Experiment Learning paradigm aligns large-scale numerical experimental data with theoretical text corpora; Binary Patch Encoding replaces byte pair encoding(BPE) tokenization; Monte Carlo Attention substitutes traditional transformer architectures. Through next-word-prediction pretraining on cross-discipline scientific datasets of real-world problems mixed with general textual corpus, followed by fine-tuning and inference on downstream tasks, BigBang-Proton demonstrates 100\% accuracy in up to 50-digit arithmetic addition operations, performance on par with leading specialized models in particle physics jet tagging, matching MAE of specialized models in inter-atomic potential simulation, performance comparable to traditional spatiotemporal models in water quality prediction, and benchmark-exceeding performance in genome modeling. These results prove that language-guided scientific computing can match or exceed the performance of task-specific scientific models while maintaining multitask learning capabilities. We further hypothesize to scale the pretraining to the universe scale as a fundamental step toward developing material world foundational model.

Comment: Model Architecture: introduces Monte Carlo Attention and Binary Patch Encoding as architectural/tokenization innovations in a unified autoregressive scientific model.

Relevance: 8 Novelty: 7


43. ACON: Optimizing Context Compression for Long-horizon LLM Agents

ArXiv ID: 2510.00615

Authors: Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan

Abstract: Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.

Comment: Model Compression and Efficiency: proposes an LLM-agent context compression framework with guideline optimization and distillation to smaller compressors, reducing memory/token usage.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.