Personalized Daily ArXiv Papers 2025-09-29
| [gpt-5] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 89447 | 81726 | 171173 |
| Cost | $0.11 | $0.82 | $0.93 |
Total arXiv papers: 939
Total scanned papers: 569
Total relevant papers: 60
Table of contents with paper titles:
-
Active Attacks: Red-teaming LLMs via Adaptive Environments Authors: Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
-
Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms Authors: Rohan Deb, Qiaobo Li, Mayank Shrivastava, Arindam Banerjee
-
$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization Authors: Yuandong Tian
-
Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova
-
HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space Authors: Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang
-
Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models Authors: Aleksandar Terzi\'c, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi
-
OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule Authors: Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen
-
Statistical Advantage of Softmax Attention: Insights from Single-Location Regression Authors: O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborov\'a
-
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts Authors: Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
-
Partial Parameter Updates for Efficient Distributed Training Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
-
A Law of Data Reconstruction for Random Features (and Beyond) Authors: Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli
-
Enhancing Low-Rank Adaptation with Structured Nonlinear Transformations Authors: Guanzhi Deng, Mingyang Liu, Dapeng Wu, Yinqiao Li, Linqi Song
-
Wavelet-Induced Rotary Encodings: RoPE Meets Graphs Authors: Isaac Reid, Arijit Sehanobish, Cedrik H\"ofs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veli\v{c}kovi\'c
-
Concept-SAE: Active Causal Probing of Visual Model Behavior Authors: Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu
-
Mechanistic Independence: A Principle for Identifiable Disentangled Representations Authors: Stefan Matthes, Zhiwei Han, Hao Shen
-
Linear Causal Representation Learning by Topological Ordering, Pruning, and Disentanglement Authors: Hao Chen, Lin Liu, Yu Guang Wang
-
REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model Authors: Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qinghua Zhao, Linqi Song, Lijie Wen
-
Neural Feature Geometry Evolves as Discrete Ricci Flow Authors: Moritz Hehl, Max von Renesse, Melanie Weber
-
Scale-Wise VAR is Secretly Discrete Diffusion Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
-
Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding Authors: Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou
-
Bilinear relational structure fixes reversal curse and enables consistent model editing Authors: Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha
-
StateX: Enhancing RNN Recall via Post-training State Expansion Authors: Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
-
IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu
-
SlimDiff: Training-Free, Activation-Guided Hands-free Slimming of Diffusion Models Authors: Arani Roy, Shristi Das Biswas, Kaushik Roy
-
LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning Authors: Marco Paul E. Apolinario, Kaushik Roy
-
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
-
PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters Authors: Krishu K Thapa, Reet Barik, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
-
Blockwise Hadamard high-Rank Adaptation for Parameter-Efficient LLM Fine-Tuning Authors: Feng Yu, Jia Hu, Geyong Min
-
InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models Authors: Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Yuan Xie, Hongxia Yang
-
Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs Authors: Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
-
CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones Authors: Wenyi Gong, Mieszko Lis
-
Stochastic activations Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazar\'e, Herv\'e J\'egou
-
IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method Authors: Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu
-
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time Authors: Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang
-
General Pruning Criteria for Fast SBL Authors: Jakob M\"oderl, Erik Leitinger, Bernard Henri Fleury
-
Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs Authors: Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda
-
Global Convergence in Neural ODEs: Impact of Activation Functions Authors: Tianxiang Gao, Siyuan Sun, Hailiang Liu, Hongyang Gao
-
Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning Authors: Naicheng He, Kaicheng Guo, Arjun Prakash, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris
-
Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
-
Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting Authors: Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe
-
Toward a Physics of Deep Learning and Brains Authors: Arsham Ghavasieh, Meritxell Vila-Minana, Akanksha Khurd, John Beggs, Gerardo Ortiz, Santo Fortunato
-
TRACE: Learning to Compute on Graphs Authors: Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu
-
Differentiable Structure Learning for General Binary Data Authors: Chang Deng, Bryon Aragam
-
Overclocking Electrostatic Generative Models Authors: Daniil Shlenskii, Alexander Korotin
-
A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems Authors: Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Christopher R\'e, Scott W. Linderman
-
Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics Authors: Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang
-
R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning Authors: Hongyu Shan, Mingyang Song, Chang Dai, Di Liang, Han Chen
-
From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement Authors: Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan, Yang Xiang, Buzhou Tang
-
ChaosNexus: A Foundation Model for Universal Chaotic System Forecasting with Multi-scale Representations Authors: Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li
-
Effective continuous equations for adaptive SGD: a stochastic analysis view Authors: Luca Callisti, Marco Romito, Francesco Triggiano
-
Prophecy: Inferring Formal Properties from Neuron Activations Authors: Divya Gopinath, Corina S. Pasareanu, Muhammad Usman
-
A Data-driven Typology of Vision Models from Integrated Representational Metrics Authors: Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla
-
SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders Authors: Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto
-
Kernel Regression of Multi-Way Data via Tensor Trains with Hadamard Overparametrization: The Dynamic Graph Flow Case Authors: Duc Thien Nguyen, Konstantinos Slavakis, Eleftherios Kofidis, Dimitris Pados
-
Sharpness-Aware Minimization Can Hallucinate Minimizers Authors: Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang
-
A circuit for predicting hierarchical structure in-context in Large Language Models Authors: Tankred Saanum, Can Demircan, Samuel J. Gershman, Eric Schulz
-
IndiSeek learns information-guided disentangled representations Authors: Yu Gui, Cong Ma, Zongming Ma
-
Null-Space Filtering for Data-Free Continual Model Merging: Preserving Transparency, Promoting Fidelity Authors: Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
-
Transformers Can Learn Connectivity in Some Graphs but Not Others Authors: Amit Roy, Abulhair Saparov
-
Understanding and Enhancing Mask-Based Pretraining towards Universal Representations Authors: Mingze Dong, Leda Wang, Yuval Kluger
1. Active Attacks: Red-teaming LLMs via Adaptive Environments
ArXiv ID: 2509.21947
Authors: Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
Abstract: We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce \textit{Active Attacks}, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods -- including GFlowNets, PPO, and REINFORCE -- by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than $400\ \times$) with only a 6% increase in computation. Our code is publicly available \href{https://github.com/dbsxodud-11/active_attacks}{here}.
Comment: Author match
2. Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms
ArXiv ID: 2509.21847
Authors: Rohan Deb, Qiaobo Li, Mayank Shrivastava, Arindam Banerjee
Abstract: Uniform bounds on sketched inner products of vectors or matrices underpin several important computational and statistical results in machine learning and randomized algorithms, including the Johnson-Lindenstrauss (J-L) lemma, the Restricted Isometry Property (RIP), randomized sketching, and approximate linear algebra. However, many modern analyses involve sketched bilinear forms, for which existing uniform bounds either do not apply or are not sharp on general sets. In this work, we develop a general framework to analyze such sketched bilinear forms and derive uniform bounds in terms of geometric complexities of the associated sets. Our approach relies on generic chaining and introduces new techniques for handling suprema over pairs of sets. We further extend these results to the setting where the bilinear form involves a sum of $T$ independent sketching matrices and show that the deviation scales as $\sqrt{T}$. This unified analysis recovers known results such as the J-L lemma as special cases, while extending RIP-type guarantees. Additionally, we obtain improved convergence bounds for sketched Federated Learning algorithms where such cross terms arise naturally due to sketched gradient compression, and design sketched variants of bandit algorithms with sharper regret bounds that depend on the geometric complexity of the action and parameter sets, rather than the ambient dimension.
Comment: Compression/Efficiency/HPC Theory: Uniform bounds for sketched bilinear forms via generic chaining; extends JL/RIP and improves guarantees for gradient compression and bandits.
Relevance: 10 Novelty: 9
3. $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization
ArXiv ID: 2509.21519
Authors: Yuandong Tian
Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework to characterize what kind of features emerge, how and in which conditions it happens from training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) Lazy learning, (II) independent feature learning and (III) interactive feature learning, characterized by the structure of backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and top layer overfits to random hidden representation. In (II), the gradient of each node (column of $G_F$) only depends on its own activation, and thus each hidden node learns their representation independently from $G_F$, which now carries information about target labels, thanks to weight decay. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change on sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that need to be learned. Our study sheds lights on roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying cause why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.
Comment: Representation Learning: provides a mathematical framework for grokking and feature emergence with provable dynamics and scaling laws in neural networks.
Relevance: 10 Novelty: 9
4. Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
ArXiv ID: 2509.22445
Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova
Abstract: The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
Comment: Model Compression and Generalization Theory: proposes asymptotically optimal MDL objectives for Transformers grounded in Kolmogorov complexity; constructs a tractable variational objective.
Relevance: 10 Novelty: 9
5. HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
ArXiv ID: 2509.22299
Authors: Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang
Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where d is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including DeepSeek MoE and Qwen MoE family, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20% ~ 25% in most models, while also reducing FLOPs nearly by 20%. The code can be found at \href{https://github.com/LLIKKE/HEAPr}{https://github.com/LLIKKE/HEAPr}.
Comment: Model Architecture (MoE) + Compression/Efficiency: second-order, Hessian-based atomic expert pruning with reduced complexity enables fine-grained MoE compression.
Relevance: 10 Novelty: 9
6. Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models
ArXiv ID: 2509.22284
Authors: Aleksandar Terzi\'c, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi
Abstract: Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model's expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model
Comment: Model Architecture with structured sparsity: PD-SSM factorizes transition matrices (one-hot P times diagonal D) enabling FSA state tracking at diagonal-SSM cost with strong expressivity guarantees.
Relevance: 10 Novelty: 9
7. OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule
ArXiv ID: 2509.21623
Authors: Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen
Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
Comment: Model Compression and Efficiency: context-aware online low-rank KV cache compression with Oja’s rule and a hybrid storage policy; practical long-context memory optimization compatible with FlashAttention.
Relevance: 10 Novelty: 8
8. Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
ArXiv ID: 2509.21936
Authors: O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborov\'a
Abstract: Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.
Comment: Representation Learning/Architecture Analysis: high-dimensional theory shows softmax attention attains Bayes risk and outperforms linear attention.
Relevance: 10 Novelty: 8
9. Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
ArXiv ID: 2509.21892
Authors: Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Abstract: Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. Intuitively, activating more experts at inference $k'$ (where $k'> k$) means engaging a larger set of model parameters for the computation and thus is expected to improve performance. However, contrary to this intuition, we find the scaling range to be so narrow that performance begins to degrade rapidly after only a slight increase in the number of experts. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead. By simultaneously training experts to collaborate in diverse combinations and encouraging the router for high-quality selections, EMoE ensures robust performance across computational budgets at inference. We conduct extensive experiments on various MoE settings. Our results show that EMoE significantly expands the effective performance-scaling range, extending it to as much as 2-3$\times$ the training-time $k$, while also pushing the model's peak performance to a higher level.
Comment: Model Architecture (MoE): training framework that enables scaling the number of activated experts at inference by fostering expert collaboration and robust routing.
Relevance: 10 Novelty: 8
10. Partial Parameter Updates for Efficient Distributed Training
ArXiv ID: 2509.22418
Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
Abstract: We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a $1.3$B-parameter language model trained across $32$ nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.
Comment: Matches High Performance Computing and Efficiency: introduces partial parameter updates for low-communication distributed training, reducing memory/FLOPs and avoiding activation exchange while maintaining perplexity.
Relevance: 10 Novelty: 8
11. A Law of Data Reconstruction for Random Features (and Beyond)
ArXiv ID: 2509.22214
Authors: Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli
Abstract: Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.
Comment: Representation Learning/Theory: shows a law for full data reconstruction (p ≳ d·n) in random features and beyond, with an accompanying reconstruction method.
Relevance: 9 Novelty: 9
12. Enhancing Low-Rank Adaptation with Structured Nonlinear Transformations
ArXiv ID: 2509.21870
Authors: Guanzhi Deng, Mingyang Liu, Dapeng Wu, Yinqiao Li, Linqi Song
Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models. However, its linear nature limits expressiveness. We propose LoRAN, a non-linear extension of LoRA that applies lightweight transformations to the low-rank updates. We further introduce Sinter, a sine-based activation that adds structured perturbations without increasing parameter count. Experiments across summarization and classification tasks show that LoRAN consistently improves over QLoRA. Ablation studies reveal that Sinter outperforms standard activations such as Sigmoid, ReLU, and Tanh, highlighting the importance of activation design in lowrank tuning.
Comment: Model Compression and Efficiency: non-linear low-rank adaptation (LoRAN) with sine-based activation for parameter-efficient fine-tuning.
Relevance: 10 Novelty: 7
13. Wavelet-Induced Rotary Encodings: RoPE Meets Graphs
ArXiv ID: 2509.22259
Authors: Isaac Reid, Arijit Sehanobish, Cedrik H\"ofs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veli\v{c}kovi\'c
Abstract: We introduce WIRE: Wavelet-Induced Rotary Encodings. WIRE extends Rotary Position Encodings (RoPE), a popular algorithm in LLMs and ViTs, to graph-structured data. We demonstrate that WIRE is more general than RoPE, recovering the latter in the special case of grid graphs. WIRE also enjoys a host of desirable theoretical properties, including equivariance under node ordering permutation, compatibility with linear attention, and (under select assumptions) asymptotic dependence on graph resistive distance. We test WIRE on a range of synthetic and real-world tasks, including identifying monochromatic subgraphs, semantic segmentation of point clouds, and more standard graph benchmarks. We find it to be effective in settings where the underlying graph structure is important.
Comment: Model Architecture/Representation Learning: Generalizes RoPE to graphs (WIRE) with theoretical guarantees (permutation equivariance, linear-attention compatibility).
Relevance: 9 Novelty: 8
14. Concept-SAE: Active Causal Probing of Visual Model Behavior
ArXiv ID: 2509.22015
Authors: Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu
Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
Comment: Representation Learning: Concept-grounded sparse autoencoders with dual supervision enabling causal probing via interventions.
Relevance: 9 Novelty: 8
15. Mechanistic Independence: A Principle for Identifiable Disentangled Representations
ArXiv ID: 2509.22196
Authors: Stefan Matthes, Zhiwei Han, Hao Shen
Abstract: Disentangled representations seek to recover latent factors of variation underlying observed data, yet their identifiability is still not fully understood. We introduce a unified framework in which disentanglement is achieved through mechanistic independence, which characterizes latent factors by how they act on observed variables rather than by their latent distribution. This perspective is invariant to changes of the latent density, even when such changes induce statistical dependencies among factors. Within this framework, we propose several related independence criteria -- ranging from support-based and sparsity-based to higher-order conditions -- and show that each yields identifiability of latent subspaces, even under nonlinear, non-invertible mixing. We further establish a hierarchy among these criteria and provide a graph-theoretic characterization of latent subspaces as connected components. Together, these results clarify the conditions under which disentangled representations can be identified without relying on statistical assumptions.
Comment: Representation Learning: principled identifiability of disentangled latent subspaces under nonlinear, non-invertible mixing via mechanistic independence.
Relevance: 9 Novelty: 8
16. Linear Causal Representation Learning by Topological Ordering, Pruning, and Disentanglement
ArXiv ID: 2509.22553
Authors: Hao Chen, Lin Liu, Yu Guang Wang
Abstract: Causal representation learning (CRL) has garnered increasing interests from the causal inference and artificial intelligence community, due to its capability of disentangling potentially complex data-generating mechanism into causally interpretable latent features, by leveraging the heterogeneity of modern datasets. In this paper, we further contribute to the CRL literature, by focusing on the stylized linear structural causal model over the latent features and assuming a linear mixing function that maps latent features to the observed data or measurements. Existing linear CRL methods often rely on stringent assumptions, such as accessibility to single-node interventional data or restrictive distributional constraints on latent features and exogenous measurement noise. However, these prerequisites can be challenging to satisfy in certain scenarios. In this work, we propose a novel linear CRL algorithm that, unlike most existing linear CRL methods, operates under weaker assumptions about environment heterogeneity and data-generating distributions while still recovering latent causal features up to an equivalence class. We further validate our new algorithm via synthetic experiments and an interpretability analysis of large language models (LLMs), demonstrating both its superiority over competing methods in finite samples and its potential in integrating causality into AI.
Comment: Representation Learning: linear causal representation learning with topological ordering, pruning, and disentanglement to recover latent causal features under weaker assumptions.
Relevance: 9 Novelty: 8
17. REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model
ArXiv ID: 2509.22518
Authors: Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qinghua Zhao, Linqi Song, Lijie Wen
Abstract: Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model's layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.
Comment: Representation Learning: proposes a low-dimensional reasoning manifold and geometric deviation metric to localize failure points in LLM reasoning.
Relevance: 9 Novelty: 8
18. Neural Feature Geometry Evolves as Discrete Ricci Flow
ArXiv ID: 2509.22362
Authors: Moritz Hehl, Max von Renesse, Melanie Weber
Abstract: Deep neural networks learn feature representations via complex geometric transformations of the input data manifold. Despite the models' empirical success across domains, our understanding of neural feature representations is still incomplete. In this work we investigate neural feature geometry through the lens of discrete geometry. Since the input data manifold is typically unobserved, we approximate it using geometric graphs that encode local similarity structure. We provide theoretical results on the evolution of these graphs during training, showing that nonlinear activations play a crucial role in shaping feature geometry in feedforward neural networks. Moreover, we discover that the geometric transformations resemble a discrete Ricci flow on these graphs, suggesting that neural feature geometry evolves analogous to Ricci flow. This connection is supported by experiments on over 20,000 feedforward neural networks trained on binary classification tasks across both synthetic and real-world datasets. We observe that the emergence of class separability corresponds to the emergence of community structure in the associated graph representations, which is known to relate to discrete Ricci flow dynamics. Building on these insights, we introduce a novel framework for locally evaluating geometric transformations through comparison with discrete Ricci flow dynamics. Our results suggest practical design principles, including a geometry-informed early-stopping heuristic and a criterion for selecting network depth.
Comment: Representation learning: theoretical link between neural feature geometry and discrete Ricci flow, explaining training dynamics and informing design (depth, early stopping).
Relevance: 9 Novelty: 8
19. Scale-Wise VAR is Secretly Discrete Diffusion
ArXiv ID: 2509.22636
Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency and unified architecture with language and vision. Among them, next scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion. We term this reinterpretation as Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion such as iterative refinement and reduce architectural inefficiencies into VAR, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion based perspective of VAR leads to consistent gains in efficiency and generation.
Comment: Model architecture: proves VAR with Markovian attention is equivalent to discrete diffusion, enabling diffusion-style iterative refinement for AR transformers and improving efficiency.
Relevance: 9 Novelty: 8
20. Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
ArXiv ID: 2509.22134
Authors: Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou
Abstract: Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward provably improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference.
Comment: Efficiency for LLM inference: Group Tree Optimization aligns training with speculative decoding’s tree policy, with a provable reward tied to acceptance length and speedup.
Relevance: 9 Novelty: 8
21. Bilinear relational structure fixes reversal curse and enables consistent model editing
ArXiv ID: 2509.21993
Authors: Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha
Abstract: The reversal curse -- a language model's (LM) inability to infer an unseen fact B is A'' from a learned factA is B'' -- is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. By training LMs from scratch on a synthetic dataset of relational knowledge graphs, we demonstrate that bilinear relational structure emerges in their hidden representations. This structure substantially alleviates the reversal curse, enabling LMs to infer unseen reverse facts. Crucially, we also find that this bilinear structure plays a key role in consistent model editing. When a fact is updated in a LM with this structure, the edit correctly propagates to its reverse and other logically dependent facts. In contrast, models lacking this representation not only suffer from the reversal curse but also fail to generalize edits, further introducing logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn enable LMs to behave in a logically consistent manner after editing. This implies that the success of model editing depends critically not just on editing algorithms but on the underlying representational geometry of the knowledge being modified.
Comment: Representation Learning: identifies bilinear relational structure in LM representations that fixes reversal curse and enables consistent model editing—linking internal geometry to logical generalization.
Relevance: 9 Novelty: 8
22. StateX: Enhancing RNN Recall via Post-training State Expansion
ArXiv ID: 2509.22630
Authors: Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract: While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
Comment: Model Architecture and Efficiency: post-training recurrent state expansion to boost recall for linear-attention/SSM RNNs with minimal parameter growth.
Relevance: 9 Novelty: 8
23. IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
ArXiv ID: 2509.22621
Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu
Abstract: Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.
Comment: Matches Representation Learning/Training Dynamics: self-distills SFT to align internal activations with ICL, transferring in-context computation mechanisms to improve accuracy and calibration.
Relevance: 9 Novelty: 8
24. SlimDiff: Training-Free, Activation-Guided Hands-free Slimming of Diffusion Models
ArXiv ID: 2509.21498
Authors: Arani Roy, Shristi Das Biswas, Kaushik Roy
Abstract: Diffusion models (DMs), lauded for their generative performance, are computationally prohibitive due to their billion-scale parameters and iterative denoising dynamics. Existing efficiency techniques, such as quantization, timestep reduction, or pruning, offer savings in compute, memory, or runtime but are strictly bottlenecked by reliance on fine-tuning or retraining to recover performance. In this work, we introduce SlimDiff, an automated activation-informed structural compression framework that reduces both attention and feedforward dimensionalities in DMs, while being entirely gradient-free. SlimDiff reframes DM compression as a spectral approximation task, where activation covariances across denoising timesteps define low-rank subspaces that guide dynamic pruning under a fixed compression budget. This activation-aware formulation mitigates error accumulation across timesteps by applying module-wise decompositions over functional weight groups: query--key interactions, value--output couplings, and feedforward projections, rather than isolated matrix factorizations, while adaptively allocating sparsity across modules to respect the non-uniform geometry of diffusion trajectories. SlimDiff achieves up to 35\% acceleration and $\sim$100M parameter reduction over baselines, with generation quality on par with uncompressed models without any backpropagation. Crucially, our approach requires only about 500 calibration samples, over 70$\times$ fewer than prior methods. To our knowledge, this is the first closed-form, activation-guided structural compression of DMs that is entirely training-free, providing both theoretical clarity and practical efficiency.
Comment: Compression/Efficiency: training-free, activation-guided structural slimming (low-rank/sparsity) of diffusion models for speed and parameter reduction.
Relevance: 9 Novelty: 8
25. LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning
ArXiv ID: 2509.21617
Authors: Marco Paul E. Apolinario, Kaushik Roy
Abstract: On-device learning is essential for personalization, privacy, and long-term adaptation in resource-constrained environments. Achieving this requires efficient learning, both fine-tuning existing models and continually acquiring new tasks without catastrophic forgetting. Yet both settings are constrained by high memory cost of storing activations during backpropagation. Existing activation compression methods reduce this cost but relying on repeated low-rank decompositions, introducing computational overhead. Also, such methods have not been explored for continual learning. We propose LANCE (Low-rank Activation Compression), a framework that performs one-shot higher-order Singular Value Decompsoition (SVD) to obtain a reusable low-rank subspace for activation projection. This eliminates repeated decompositions, reducing both memory and computation. Moreover, fixed low-rank subspaces further enable on-device continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices. Experiments show that LANCE reduces activation storage up to 250$\times$ while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), it achieves performance competitive with orthogonal gradient projection methods at a fraction of the memory cost. These results position LANCE as a practical and scalable solution for efficient fine-tuning and continual learning on edge devices.
Comment: Compression/Efficiency: Low-rank activation compression via one-shot HOSVD for memory-optimized backprop; reusable subspaces enable efficient on-device continual learning.
Relevance: 9 Novelty: 7
26. OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
ArXiv ID: 2509.22033
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
Abstract: Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.
Comment: Representation Learning: Sparse autoencoders with orthogonality regularization to mitigate feature absorption/composition while scaling linearly.
Relevance: 9 Novelty: 7
27. PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters
ArXiv ID: 2509.21619
Authors: Krishu K Thapa, Reet Barik, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
Abstract: Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (higher change in weights) takes place in the earlier stage of the training loop. These changes stabilize as training continues, enabling them to be captured by matrices of a low intrinsic rank. Therefore, we propose an approach to identify such states of partial convergence and dynamically switch from full parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible approach that leverages user-defined hyperparameters to determine the switching point and assign a rank specific to each module layer based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of its original size, resulting in a 3x improvement in throughput, and a 1.5x reduction in average training time per epoch while also reducing GPU memory consumption by 20%
Comment: Model Compression/Efficiency: hybrid pretraining that switches to Low-Rank Adapters with per-layer rank selection to cut trainable parameters and memory.
Relevance: 9 Novelty: 7
28. Blockwise Hadamard high-Rank Adaptation for Parameter-Efficient LLM Fine-Tuning
ArXiv ID: 2509.21637
Authors: Feng Yu, Jia Hu, Geyong Min
Abstract: Parameter-efficient fine-tuning (PEFT) methods must be resource-efficient yet handle heterogeneous reasoning transformations, and classical low-rank adaptation (LoRA) is constrained by the nominal rank $r$. Hadamard-style extensions like HiRA raise the nominal rank but couple every update to the global energy pattern of the frozen weight matrix, while ABBA trades this inductive bias for fully learned dense intermediates. To address the limitation of global modulation, we propose Block Hadamard high-Rank Adaptation (BHRA), which partitions each weight matrix and applies HiRA-style multiplicative modulation independently within every block, preserving the PEFT parameter footprint while unlocking localized rank amplification. Our empirical analyses reveal that this blockwise design maintains rich spectra across rank budgets, mitigating the collapse induced by global modulation. Across eight commonsense reasoning tasks and two arithmetic benchmarks with Llama-3.2 1B/3B, Mistral-7B, and Gemma-2 9B, BHRA consistently surpasses strong PEFT baselines under matched parameter budgets.
Comment: Model Compression/Efficiency (PEFT): blockwise Hadamard high-rank adaptation increases effective rank under fixed parameter budget, improving fine-tuning efficiency.
Relevance: 9 Novelty: 7
29. InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
ArXiv ID: 2509.22536
Authors: Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Yuan Xie, Hongxia Yang
Abstract: The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continue pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
Comment: Model Compression and Efficiency: comprehensive FP8 quantized training recipe; High Performance Computing: hybrid-granularity quantization improves throughput, memory, and training time for LLMs.
Relevance: 9 Novelty: 7
30. Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs
ArXiv ID: 2509.22166
Authors: Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
Abstract: The demand for efficient large language model (LLM) inference has intensified the focus on sparsification techniques. While semi-structured (N:M) pruning is well-established for weights, its application to activation pruning remains underexplored despite its potential for dynamic, input-adaptive compression and reductions in I/O overhead. This work presents a comprehensive analysis of methods for post-training N:M activation pruning in LLMs. Across multiple LLMs, we demonstrate that pruning activations enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. We evaluate lightweight, plug-and-play error mitigation techniques and pruning criteria, establishing strong hardware-friendly baselines that require minimal calibration. Furthermore, we explore sparsity patterns beyond NVIDIA's standard 2:4, showing that the 16:32 pattern achieves performance nearly on par with unstructured sparsity. However, considering the trade-off between flexibility and hardware implementation complexity, we focus on the 8:16 pattern as a superior candidate. Our findings provide both effective practical methods for activation pruning and a motivation for future hardware to support more flexible sparsity patterns. Our code is available https://anonymous.4open.science/r/Structured-Sparse-Activations-Inference-EC3C/README.md .
Comment: Strong match to Model Compression and Efficiency: post-training N:M activation sparsity for LLMs with lightweight error mitigation and analysis of hardware-friendly patterns (e.g., 8:16, 16:32).
Relevance: 9 Novelty: 7
31. CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones
ArXiv ID: 2509.21764
Authors: Wenyi Gong, Mieszko Lis
Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i)exploiting the uneven information distribution across the spatial layout while (ii)preserving the spatial structure post-merging. Our approach employs (i)a 2D reduction strategy to enforce structured token layouts, (ii)a spatial-aware merging algorithm that maintains relative token positions, and (iii)a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
Comment: Compression/Efficiency: spatial-preserving token merging tailored to ViT backbones with window/relative position designs, improving speed with minimal accuracy loss.
Relevance: 9 Novelty: 7
32. Stochastic activations
ArXiv ID: 2509.22358
Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazar\'e, Herv\'e J\'egou
Abstract: We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
Comment: Model Architecture and Efficiency: introduces stochastic activations to enable ReLU at inference for sparse latent vectors and reduced FLOPs; addresses optimization/training dynamics of activation functions.
Relevance: 9 Novelty: 7
33. IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method
ArXiv ID: 2509.22463
Authors: Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu
Abstract: High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the \textbf{I}terative \textbf{I}mplicit \textbf{E}uler \textbf{T}ransformer \textbf{(IIET)}, which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce \textbf{I}teration \textbf{I}nfluence-\textbf{A}ware \textbf{D}istillation \textbf{(IIAD)}. Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65\% over vanilla Transformers and 0.8\% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55\% while retaining 99.4\% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6\% over vanilla Transformer with comparable speed.
Comment: Model Architecture + Efficiency: ODE-based Transformer using iterative implicit Euler with an influence-aware distillation scheme for improved performance-efficiency and compressibility.
Relevance: 9 Novelty: 7
34. Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time
ArXiv ID: 2509.22572
Authors: Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang
Abstract: Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.
Comment: Model Architecture (MoE) + Test-time scaling: controls number of activated experts during inference to induce diverse reasoning paths without extra cost.
Relevance: 9 Novelty: 7
35. General Pruning Criteria for Fast SBL
ArXiv ID: 2509.21572
Authors: Jakob M\"oderl, Erik Leitinger, Bernard Henri Fleury
Abstract: Sparse Bayesian learning (SBL) associates to each weight in the underlying linear model a hyperparameter by assuming that each weight is Gaussian distributed with zero mean and precision (inverse variance) equal to its associated hyperparameter. The method estimates the hyperparameters by marginalizing out the weights and performing (marginalized) maximum likelihood (ML) estimation. SBL returns many hyperparameter estimates to diverge to infinity, effectively setting the estimates of the corresponding weights to zero (i.e., pruning the corresponding weights from the model) and thereby yielding a sparse estimate of the weight vector. In this letter, we analyze the marginal likelihood as function of a single hyperparameter while keeping the others fixed, when the Gaussian assumptions on the noise samples and the weight distribution that underlies the derivation of SBL are weakened. We derive sufficient conditions that lead, on the one hand, to finite hyperparameter estimates and, on the other, to infinite ones. Finally, we show that in the Gaussian case, the two conditions are complementary and coincide with the pruning condition of fast SBL (F-SBL), thereby providing additional insights into this algorithm.
Comment: Compression/Efficiency: theoretical pruning criteria in Sparse Bayesian Learning clarifying when weights are pruned (sparsity).
Relevance: 9 Novelty: 7
36. Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs
ArXiv ID: 2509.21895
Authors: Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda
Abstract: We derive a new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and reproducing kernel Hilbert spaces (RKHSs). The proposed bound describes why the models with high-rank weight matrices generalize well. Although there are existing bounds that attempt to describe this phenomenon, these existing bounds can be applied to limited types of models. We introduce an algebraic representation of neural networks and a kernel function to construct an RKHS to derive a bound for a wider range of realistic models. This work paves the way for the Koopman-based theory for Rademacher complexity bounds to be valid for more practical situations.
Comment: Representation Learning/Training Dynamics: new Rademacher complexity bounds via RKHS/Koopman explain why high-rank weight matrices generalize.
Relevance: 8 Novelty: 8
37. Global Convergence in Neural ODEs: Impact of Activation Functions
ArXiv ID: 2509.22436
Authors: Tianxiang Gao, Siyuan Sun, Hailiang Liu, Hongyang Gao
Abstract: Neural Ordinary Differential Equations (ODEs) have been successful in various applications due to their continuous nature and parameter-sharing efficiency. However, these unique characteristics also introduce challenges in training, particularly with respect to gradient computation accuracy and convergence analysis. In this paper, we address these challenges by investigating the impact of activation functions. We demonstrate that the properties of activation functions, specifically smoothness and nonlinearity, are critical to the training dynamics. Smooth activation functions guarantee globally unique solutions for both forward and backward ODEs, while sufficient nonlinearity is essential for maintaining the spectral properties of the Neural Tangent Kernel (NTK) during training. Together, these properties enable us to establish the global convergence of Neural ODEs under gradient descent in overparameterized regimes. Our theoretical findings are validated by numerical experiments, which not only support our analysis but also provide practical guidelines for scaling Neural ODEs, potentially leading to faster training and improved performance in real-world applications.
Comment: Training Dynamics/Architecture: establishes global convergence of Neural ODEs under smooth, sufficiently nonlinear activations via NTK properties.
Relevance: 8 Novelty: 8
38. Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning
ArXiv ID: 2509.22335
Authors: Naicheng He, Kaicheng Guo, Arjun Prakash, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris
Abstract: We investigate why deep neural networks suffer from \emph{loss of plasticity} in deep continual learning, failing to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. To characterize the necessary condition for successful training, we introduce the notion of $\tau$-trainability and show that current plasticity preserving algorithms can be unified under this framework. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying $L2$ penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.
Comment: Representation Learning/Training Dynamics: links Hessian spectral collapse to loss of plasticity and proposes regularizers (feature-rank, L2) to preserve trainability.
Relevance: 8 Novelty: 8
39. Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
ArXiv ID: 2509.21761
Authors: Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
Abstract: Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe that proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), efficiently pinpointing the specific attention heads responsible for processing these features. Our primary experiments reveals these heads are relatively sparse; ablating a minimal \textbf{$\sim$ 3%} of total heads is sufficient to reduce the Attack Success Rate (ASR) by \textbf{over 90%}. More importantly, we further employ these findings to construct the Backdoor Vector derived from these attributed heads as a master controller for the backdoor. Through only \textbf{1-point} intervention on \textbf{single} representation, the vector can either boost ASR up to \textbf{$\sim$ 100% ($\uparrow$)} on clean inputs, or completely neutralize backdoor, suppressing ASR down to \textbf{$\sim$ 0% ($\downarrow$)} on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.
Comment: Representation Learning: mechanistic attribution of backdoor features and attention heads; Sparsity: identifies a sparse set of heads enabling backdoor control via a single vector intervention.
Relevance: 8 Novelty: 8
40. Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
ArXiv ID: 2509.22615
Authors: Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe
Abstract: Modern vision language pipelines are driven by RGB vision encoders trained on massive image text corpora. While these pipelines have enabled impressive zero shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge cloud learning.
Comment: Matches Model Compression and Efficiency: 2D Gaussian Splatting compresses visual inputs and reduces tokenization cost; High Performance Computing: batched CUDA kernels with ~90x faster fitting and high GPU utilization; efficient adapter over a frozen Transformer.
Relevance: 8 Novelty: 8
41. Toward a Physics of Deep Learning and Brains
ArXiv ID: 2509.22649
Authors: Arsham Ghavasieh, Meritxell Vila-Minana, Akanksha Khurd, John Beggs, Gerardo Ortiz, Santo Fortunato
Abstract: Deep neural networks and brains both learn and share superficial similarities: processing nodes are likened to neurons and adjustable weights are likened to modifiable synapses. But can a unified theoretical framework be found to underlie them both? Here we show that the equations used to describe neuronal avalanches in living brains can also be applied to cascades of activity in deep neural networks. These equations are derived from non-equilibrium statistical physics and show that deep neural networks learn best when poised between absorbing and active phases. Because these networks are strongly driven by inputs, however, they do not operate at a true critical point but within a quasi-critical regime -- one that still approximately satisfies crackling noise scaling relations. By training networks with different initializations, we show that maximal susceptibility is a more reliable predictor of learning than proximity to the critical point itself. This provides a blueprint for engineering improved network performance. Finally, using finite-size scaling we identify distinct universality classes, including Barkhausen noise and directed percolation. This theoretical framework demonstrates that universal features are shared by both biological and artificial neural networks.
Comment: Representation learning/training dynamics: non-equilibrium statistical physics framework (quasi-criticality, susceptibility, universality classes) explaining when networks learn best.
Relevance: 8 Novelty: 8
42. TRACE: Learning to Compute on Graphs
ArXiv ID: 2509.21886
Authors: Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu
Abstract: Learning to compute, the ability to model the functional behavior of a computational graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. This flawed assumption, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce \textbf{TRACE}, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce \textbf{function shift learning}, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the \textit{function shift}, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on electronic circuits, one of the most complex and economically critical classes of computational graphs. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning to compute on graphs.
Comment: Model Architecture: Hierarchical Transformer backbone aligned with stepwise computation and a function-shift learning objective for graph computation.
Relevance: 8 Novelty: 8
43. Differentiable Structure Learning for General Binary Data
ArXiv ID: 2509.21658
Authors: Chang Deng, Bryon Aragam
Abstract: Existing methods for differentiable structure learning in discrete data typically assume that the data are generated from specific structural equation models. However, these assumptions may not align with the true data-generating process, which limits the general applicability of such methods. Furthermore, current approaches often ignore the complex dependence structure inherent in discrete data and consider only linear effects. We propose a differentiable structure learning framework that is capable of capturing arbitrary dependencies among discrete variables. We show that although general discrete models are unidentifiable from purely observational data, it is possible to characterize the complete set of compatible parameters and structures. Additionally, we establish identifiability up to Markov equivalence under mild assumptions. We formulate the learning problem as a single differentiable optimization task in the most general form, thereby avoiding the unrealistic simplifications adopted by previous methods. Empirical results demonstrate that our approach effectively captures complex relationships in discrete data.
Comment: Matches Representation Learning: general differentiable structure learning for binary variables with identifiability analysis and a single differentiable optimization, capturing arbitrary dependencies.
Relevance: 8 Novelty: 8
44. Overclocking Electrostatic Generative Models
ArXiv ID: 2509.22454
Authors: Daniil Shlenskii, Alexander Korotin
Abstract: Electrostatic generative models such as PFGM++ have recently emerged as a powerful framework, achieving state-of-the-art performance in image synthesis. PFGM++ operates in an extended data space with auxiliary dimensionality $D$, recovering the diffusion model framework as $D\to\infty$, while yielding superior empirical results for finite $D$. Like diffusion models, PFGM++ relies on expensive ODE simulations to generate samples, making it computationally costly. To address this, we propose Inverse Poisson Flow Matching (IPFM), a novel distillation framework that accelerates electrostatic generative models across all values of $D$. Our IPFM reformulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher. We derive a tractable training objective for this problem and show that, as $D \to \infty$, our IPFM closely recovers Score Identity Distillation (SiD), a recent method for distilling diffusion models. Empirically, our IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Moreover, we observe that distillation converges faster for finite $D$ than in the $D \to \infty$ (diffusion) limit, which is consistent with prior findings that finite-$D$ PFGM++ models exhibit more favorable optimization and sampling properties.
Comment: Matches Model Compression and Efficiency: introduces IPFM, a distillation framework to accelerate electrostatic generative models (PFGM++) across auxiliary dimensions, reducing function evaluations.
Relevance: 8 Novelty: 8
45. A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems
ArXiv ID: 2509.21716
Authors: Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Christopher R\'e, Scott W. Linderman
Abstract: Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
Comment: High Performance Computing: unifies fixed-point parallelization of sequential models via an LDS framework for efficient, scalable evaluation.
Relevance: 8 Novelty: 8
46. Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics
ArXiv ID: 2509.22279
Authors: Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang
Abstract: Time Series Analysis is widely used in various real-world applications such as weather forecasting, financial fraud detection, imputation for missing data in IoT systems, and classification for action recognization. Mixture-of-Experts (MoE), as a powerful architecture, though demonstrating effectiveness in NLP, still falls short in adapting to versatile tasks in time series analytics due to its task-agnostic router and the lack of capability in modeling channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE to support the intricate ``knowledge'' utilization for distinct tasks, thus task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose a Recurrent Noisy Gating to utilize the hierarchical information in routing, thus obtaining task-sepcific capability. And the routing strategy is operated on time series tokens in both temporal and channel dimensions, and encouraged by a meticulously designed Temporal \& Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.
Comment: Model Architecture: Mixture-of-Experts with task-aware recurrent noisy gating and temporal/channel token routing with load-balancing loss.
Relevance: 8 Novelty: 7
47. R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning
ArXiv ID: 2509.22131
Authors: Hongyu Shan, Mingyang Song, Chang Dai, Di Liang, Han Chen
Abstract: Chain-of-Thought (CoT) prompting helps Large Language Models (LLMs) tackle complex reasoning by eliciting explicit step-by-step rationales. However, CoT's verbosity increases latency and memory usage and may propagate early errors across long chains. We propose the Reasoning Capsule (R-Capsule), a framework that aims to combine the efficiency of latent reasoning with the transparency of explicit CoT. The core idea is to compress the high-level plan into a small set of learned latent tokens (a Reasoning Capsule) while keeping execution steps lightweight or explicit. This hybrid approach is inspired by the Information Bottleneck (IB) principle, where we encourage the capsule to be approximately minimal yet sufficient for the task. Minimality is encouraged via a low-capacity bottleneck, which helps improve efficiency. Sufficiency is encouraged via a dual objective: a primary task loss for answer accuracy and an auxiliary plan-reconstruction loss that encourages the capsule to faithfully represent the original textual plan. The reconstruction objective helps ground the latent space, thereby improving interpretability and reducing the use of uninformative shortcuts. Our framework strikes a balance between efficiency, accuracy, and interpretability, thereby reducing the visible token footprint of reasoning while maintaining or improving accuracy on complex benchmarks. Our codes are available at: https://anonymous.4open.science/r/Reasoning-Capsule-7BE0
Comment: Model Architecture/Efficiency: Compresses reasoning via learned latent tokens (information bottleneck) to reduce CoT token footprint while retaining accuracy.
Relevance: 8 Novelty: 7
48. From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
ArXiv ID: 2509.22144
Authors: Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan, Yang Xiang, Buzhou Tang
Abstract: Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon--where overly small token budgets can paradoxically increase output length--to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance--accuracy and token length--can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released in https://github.com/Leon221220/MACC.
Comment: Compression/Efficiency: adaptive multi-round Chain-of-Thought compression reducing token budget and latency while preserving accuracy.
Relevance: 8 Novelty: 7
49. ChaosNexus: A Foundation Model for Universal Chaotic System Forecasting with Multi-scale Representations
ArXiv ID: 2509.21802
Authors: Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li
Abstract: Accurately forecasting chaotic systems, prevalent in domains such as weather prediction and fluid dynamics, remains a significant scientific challenge. The inherent sensitivity of these systems to initial conditions, coupled with a scarcity of observational data, severely constrains traditional modeling approaches. Since these models are typically trained for a specific system, they lack the generalization capacity necessary for real-world applications, which demand robust zero-shot or few-shot forecasting on novel or data-limited scenarios. To overcome this generalization barrier, we propose ChaosNexus, a foundation model pre-trained on a diverse corpus of chaotic dynamics. ChaosNexus employs a novel multi-scale architecture named ScaleFormer augmented with Mixture-of-Experts layers, to capture both universal patterns and system-specific behaviors. The model demonstrates state-of-the-art zero-shot generalization across both synthetic and real-world benchmarks. On a large-scale testbed comprising over 9,000 synthetic chaotic systems, it improves the fidelity of long-term attractor statistics by more than 40% compared to the leading baseline. This robust performance extends to real-world applications with exceptional data efficiency. For instance, in 5-day global weather forecasting, ChaosNexus achieves a competitive zero-shot mean error below 1 degree, a result that further improves with few-shot fine-tuning. Moreover, experiments on the scaling behavior of ChaosNexus provide a guiding principle for scientific foundation models: cross-system generalization stems from the diversity of training systems, rather than sheer data volume.
Comment: Model Architecture: foundation model with a multi-scale transformer (ScaleFormer) augmented by Mixture-of-Experts layers—explicitly matching MoE criterion.
Relevance: 8 Novelty: 7
50. Effective continuous equations for adaptive SGD: a stochastic analysis view
ArXiv ID: 2509.21614
Authors: Luca Callisti, Marco Romito, Francesco Triggiano
Abstract: We present a theoretical analysis of some popular adaptive Stochastic Gradient Descent (SGD) methods in the small learning rate regime. Using the stochastic modified equations framework introduced by Li et al., we derive effective continuous stochastic dynamics for these methods. Our key contribution is that sampling-induced noise in SGD manifests in the limit as independent Brownian motions driving the parameter and gradient second momentum evolutions. Furthermore, extending the approach of Malladi et al., we investigate scaling rules between the learning rate and key hyperparameters in adaptive methods, characterising all non-trivial limiting dynamics.
Comment: Training Dynamics: derives effective continuous-time stochastic dynamics for adaptive SGD with scaling rules, clarifying noise structure in optimization.
Relevance: 8 Novelty: 7
51. Prophecy: Inferring Formal Properties from Neuron Activations
ArXiv ID: 2509.21677
Authors: Divya Gopinath, Corina S. Pasareanu, Muhammad Usman
Abstract: We present Prophecy, a tool for automatically inferring formal properties of feed-forward neural networks. Prophecy is based on the observation that a significant part of the logic of feed-forward networks is captured in the activation status of the neurons at inner layers. Prophecy works by extracting rules based on neuron activations (values or on/off statuses) as preconditions that imply certain desirable output property, e.g., the prediction being a certain class. These rules represent network properties captured in the hidden layers that imply the desired output behavior. We present the architecture of the tool, highlight its features and demonstrate its usage on different types of models and output properties. We present an overview of its applications, such as inferring and proving formal explanations of neural networks, compositional verification, run-time monitoring, repair, and others. We also show novel results highlighting its potential in the era of large vision-language models.
Comment: Representation Learning/Interpretability: infers formal rules from neuron activation patterns for verification, monitoring, and explanation of neural behavior.
Relevance: 8 Novelty: 7
52. A Data-driven Typology of Vision Models from Integrated Representational Metrics
ArXiv ID: 2509.21628
Authors: Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla
Abstract: Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet-geometry, unit tuning, or linear decodability-and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies-shaped jointly by architecture and training objective-define representational structure beyond surface design categories.
Comment: Matches Representation Learning: integrated representational similarity metrics and Similarity Network Fusion to typologize model representations (geometry, tuning, decodability).
Relevance: 8 Novelty: 7
53. SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders
ArXiv ID: 2509.21379
Authors: Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto
Abstract: Effective concept unlearning in text-to-image diffusion models requires precise localization of concept representations within the model's latent space. While sparse autoencoders successfully reduce neuron polysemanticity (i.e., multiple concepts per neuron) compared to the original network, individual concept representations can still be distributed across multiple latent features, requiring extensive search procedures for concept unlearning. We introduce SAEmnesia, a supervised sparse autoencoder training method that promotes one-to-one concept-neuron mappings through systematic concept labeling, mitigating feature splitting and promoting feature centralization. Our approach learns specialized neurons with significantly stronger concept associations compared to unsupervised baselines. The only computational overhead introduced by SAEmnesia is limited to cross-entropy computation during training. At inference time, this interpretable representation reduces hyperparameter search by 96.67% with respect to current approaches. On the UnlearnCanvas benchmark, SAEmnesia achieves a 9.22% improvement over the state-of-the-art. In sequential unlearning tasks, we demonstrate superior scalability with a 28.4% improvement in unlearning accuracy for 9-object removal.
Comment: Matches Representation Learning and Sparsity: supervised sparse autoencoders to achieve one-to-one concept–neuron mappings enabling targeted unlearning in diffusion models.
Relevance: 8 Novelty: 7
54. Kernel Regression of Multi-Way Data via Tensor Trains with Hadamard Overparametrization: The Dynamic Graph Flow Case
ArXiv ID: 2509.22197
Authors: Duc Thien Nguyen, Konstantinos Slavakis, Eleftherios Kofidis, Dimitris Pados
Abstract: A regression-based framework for interpretable multi-way data imputation, termed Kernel Regression via Tensor Trains with Hadamard overparametrization (KReTTaH), is introduced. KReTTaH adopts a nonparametric formulation by casting imputation as regression via reproducing kernel Hilbert spaces. Parameter efficiency is achieved through tensors of fixed tensor-train (TT) rank, which reside on low-dimensional Riemannian manifolds, and is further enhanced via Hadamard overparametrization, which promotes sparsity within the TT parameter space. Learning is accomplished by solving a smooth inverse problem posed on the Riemannian manifold of fixed TT-rank tensors. As a representative application, the estimation of dynamic graph flows is considered. In this setting, KReTTaH exhibits flexibility by seamlessly incorporating graph-based (topological) priors via its inverse problem formulation. Numerical tests on real-world graph datasets demonstrate that KReTTaH consistently outperforms state-of-the-art alternatives-including a nonparametric tensor- and a neural-network-based methods-for imputing missing, time-varying edge flows.
Comment: Compression/Efficiency: low-rank tensor-train parameterization with sparsity via Hadamard overparametrization and Riemannian optimization for multi-way data regression.
Relevance: 8 Novelty: 7
55. Sharpness-Aware Minimization Can Hallucinate Minimizers
ArXiv ID: 2509.21818
Authors: Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang
Abstract: Sharpness-Aware Minimization (SAM) is a widely used method that steers training toward flatter minimizers, which typically generalize better. In this work, however, we show that SAM can converge to hallucinated minimizers -- points that are not minimizers of the original objective. We theoretically prove the existence of such hallucinated minimizers and establish conditions for local convergence to them. We further provide empirical evidence demonstrating that SAM can indeed converge to these points in practice. Finally, we propose a simple yet effective remedy for avoiding hallucinated minimizers.
Comment: Representation Learning/Training Dynamics: theoretical analysis of SAM shows convergence to non-minimizers and proposes a remedy.
Relevance: 8 Novelty: 7
56. A circuit for predicting hierarchical structure in-context in Large Language Models
ArXiv ID: 2509.21534
Authors: Tankred Saanum, Can Demircan, Samuel J. Gershman, Eric Schulz
Abstract: Large Language Models (LLMs) excel at in-context learning, the ability to use information provided as context to improve prediction of future tokens. Induction heads have been argued to play a crucial role for in-context learning in Transformer Language Models. These attention heads make a token attend to successors of past occurrences of the same token in the input. This basic mechanism supports LLMs' ability to copy and predict repeating patterns. However, it is unclear if this same mechanism can support in-context learning of more complex repetitive patterns with hierarchical structure. Natural language is teeming with such cases: The article "the" in English usually prefaces multiple nouns in a text. When predicting which token succeeds a particular instance of "the", we need to integrate further contextual cues from the text to predict the correct noun. If induction heads naively attend to all past instances of successor tokens of "the" in a context-independent manner, they cannot support this level of contextual information integration. In this study, we design a synthetic in-context learning task, where tokens are repeated with hierarchical dependencies. Here, attending uniformly to all successor tokens is not sufficient to accurately predict future tokens. Evaluating a range of LLMs on these token sequences and natural language analogues, we find adaptive induction heads that support prediction by learning what to attend to in-context. Next, we investigate how induction heads themselves learn in-context. We find evidence that learning is supported by attention heads that uncover a set of latent contexts, determining the different token transition relationships. Overall, we not only show that LLMs have induction heads that learn, but offer a complete mechanistic account of how LLMs learn to predict higher-order repetitive patterns in-context.
Comment: Representation Learning: mechanistic analysis of Transformer induction heads revealing adaptive circuits for hierarchical in-context learning and latent-context routing.
Relevance: 8 Novelty: 7
57. IndiSeek learns information-guided disentangled representations
ArXiv ID: 2509.21584
Authors: Yu Gui, Cong Ma, Zongming Ma
Abstract: Learning disentangled representations is a fundamental task in multi-modal learning. In modern applications such as single-cell multi-omics, both shared and modality-specific features are critical for characterizing cell states and supporting downstream analyses. Ideally, modality-specific features should be independent of shared ones while also capturing all complementary information within each modality. This tradeoff is naturally expressed through information-theoretic criteria, but mutual-information-based objectives are difficult to estimate reliably, and their variational surrogates often underperform in practice. In this paper, we introduce IndiSeek, a novel disentangled representation learning approach that addresses this challenge by combining an independence-enforcing objective with a computationally efficient reconstruction loss that bounds conditional mutual information. This formulation explicitly balances independence and completeness, enabling principled extraction of modality-specific features. We demonstrate the effectiveness of IndiSeek on synthetic simulations, a CITE-seq dataset and multiple real-world multi-modal benchmarks.
Comment: Representation Learning: disentanglement via independence-enforcing objective plus reconstruction loss that bounds conditional mutual information.
Relevance: 8 Novelty: 7
58. Null-Space Filtering for Data-Free Continual Model Merging: Preserving Transparency, Promoting Fidelity
ArXiv ID: 2509.21413
Authors: Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Abstract: Data-free continual model merging (DFCMM) aims to fuse independently fine-tuned models into a single backbone that evolves with incoming tasks without accessing task data. This paper formulate two fundamental desiderata for DFCMM: transparency, avoiding interference with earlier tasks, and fidelity, adapting faithfully to each new task. This poses a challenge that existing approaches fail to address: how to bridge data-level desiderata with parameter-space optimization to ensure transparency and fidelity in the absence of task data. To this end, we propose NUFILT (NUll-space FILTering), a data-free framework that directly links these desiderata to optimization. Our key observation is that task vectors approximately align with representation subspaces, providing structural surrogates for enforcing transparency and fidelity. Accordingly, we design a null-space projector that preserves prior responses by filtering out overlapping components of new task vectors, thereby ensuring transparency, and a lightweight LoRA adapter that injects complementary task-specific signals, enabling fidelity in adapting to new tasks. The adapter is trained with a projection-based surrogate loss to retain consistency with previous knowledge while introducing novel directions. This joint filtering-adaptation process allows the backbone to absorb new knowledge while retaining existing behaviors, and the updates are finally fused back in a layer-wise linear fashion without extra parameters or inference cost. Theoretically, we establish approximate subspace alignment guarantees that justify null-space filtering. Empirically, NUFILT achieves state-of-the-art performance with minimal forgetting on both vision and NLP benchmarks, improving average accuracy by 4-7% over OPCM and WUDI-Merging, while narrowing the gap to fine-tuning and reducing computation overhead.
Comment: Model Compression/Efficiency: data-free continual model merging via null-space filtering and lightweight LoRA adapters aligned with representation subspaces.
Relevance: 8 Novelty: 7
59. Transformers Can Learn Connectivity in Some Graphs but Not Others
ArXiv ID: 2509.22343
Authors: Amit Roy, Abulhair Saparov
Abstract: Reasoning capability is essential to ensure the factual correctness of the responses of transformer-based Large Language Models (LLMs), and robust reasoning about transitive relations is instrumental in many settings, such as causal inference. Hence, it is essential to investigate the capability of transformers in the task of inferring transitive relations (e.g., knowing A causes B and B causes C, then A causes C). The task of inferring transitive relations is equivalent to the task of connectivity in directed graphs (e.g., knowing there is a path from A to B, and there is a path from B to C, then there is a path from A to C). Past research focused on whether transformers can learn to infer transitivity from in-context examples provided in the input prompt. However, transformers' capability to infer transitive relations from training examples and how scaling affects the ability is unexplored. In this study, we seek to answer this question by generating directed graphs to train transformer models of varying sizes and evaluate their ability to infer transitive relations for various graph sizes. Our findings suggest that transformers are capable of learning connectivity on "grid-like'' directed graphs where each node can be embedded in a low-dimensional subspace, and connectivity is easily inferable from the embeddings of the nodes. We find that the dimensionality of the underlying grid graph is a strong predictor of transformers' ability to learn the connectivity task, where higher-dimensional grid graphs pose a greater challenge than low-dimensional grid graphs. In addition, we observe that increasing the model scale leads to increasingly better generalization to infer connectivity over grid graphs. However, if the graph is not a grid graph and contains many disconnected components, transformers struggle to learn the connectivity task, especially when the number of components is large.
Comment: Matches Representation Learning: analyzes transformers’ ability to learn transitive relations/connectivity across graph families and scales, yielding insights into what structure transformers can encode.
Relevance: 8 Novelty: 7
60. Understanding and Enhancing Mask-Based Pretraining towards Universal Representations
ArXiv ID: 2509.21650
Authors: Mingze Dong, Leda Wang, Yuval Kluger
Abstract: Mask-based pretraining has become a cornerstone of modern large-scale models across language, vision, and recently biology. Despite its empirical success, its role and limits in learning data representations have been unclear. In this work, we show that the behavior of mask-based pretraining can be directly characterized by test risk in high-dimensional minimum-norm ("ridge-less") linear regression, without relying on further model specifications. Further analysis of linear models uncovers several novel aspects of mask-based pretraining. The theoretical framework and its implications have been validated across diverse neural architectures (including MLPs, CNNs, and Transformers) applied to both vision and language tasks. Guided by our theory, we propose an embarrassingly simple yet overlooked pretraining scheme named Randomly Random Mask AutoEncoding (R$^2$MAE), which enforces capturing multi-scale features from data and is able to outperform optimal fixed mask ratio settings in our linear model framework. We implement R$^2$MAE in vision, language, DNA sequence, and single-cell models, where it consistently outperforms standard and more complicated masking schemes, leading to improvements for state-of-the-art models. Our code is available at: https://github.com/MingzeDong/r2mae
Comment: Representation Learning: theoretical characterization of mask-based pretraining and a simple masking scheme improving learned representations.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.