Personalized Daily ArXiv Papers 2026-02-13

[gpt-5]	Prompt	Completion	Total
Token	59225	54067	113292
Cost	$0.07	$0.54	$0.61

Total arXiv papers: 684

Total scanned papers: 441

Total relevant papers: 36

Table of contents with paper titles:

Causal-JEPA: Learning World Models through Object-Level Latent Interventions Authors: Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero
RAM-Net: Expressive Linear Attention with Selectively Addressable Memory Authors: Kaicheng Xiao, Haotian Li, Liran Dong, Guoliang Xing
Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration Authors: Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwari, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
Krause Synchronization Transformers Authors: Jingkun Liu, Yisong Yue, Max Welling, Yue Song
LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training Authors: Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, Bin Cui
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models Authors: Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang
Retrieval-Aware Distillation for Transformer-SSM Hybrids Authors: Aviv Bick, Eric P. Xing, Albert Gu
MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models Authors: Arian Raje, Anupam Nayak, Gauri Joshi
Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders Authors: Yifan Luo, Yang Zhan, Jiedong Jiang, Tianyang Liu, Mingrui Wu, Zhennan Zhou, Bin Dong
Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy Authors: Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Li Shang
GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection & Truncation Authors: Michael Menezes, Anastasios Kyrillidis
Improved state mixing in higher-order and block diagonal linear recurrent networks Authors: Igor Dubinin, Antonio Orvieto, Felix Effenberger
Sparse Semantic Dimension as a Generalization Certificate for LLMs Authors: Dibyanayan Bandyopadhyay, Asif Ekbal
HiFloat4 Format for Language Model Inference Authors: Yuanyong Luo, Jing Huang, Yu Cheng, Ziwei Yu, Kaihua Zhang, Kehong Hong, Xinda Ma, Xin Wang, Anping Tong, Guipeng Hu, Yun Xu, Mehran Taghian, Peng Wu, Guanglin Li, Yunke Peng, Tianchi Hu, Minqi Chen, Michael Bi Mi, Hu Liu, Xiping Zhou, Junsong Wang, Qiang Lin, Heng Liao
SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion Authors: Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, You Wu, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, Bo Zheng
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling Authors: MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun
PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving Authors: Sunghyeon Woo, Hoseung Kim, Sunghwan Shim, Minjung Jo, Hyunjoon Jeong, Jeongtae Lee, Joonghoon Kim, Sungjae Lee, Baeseong Park, Se Jung Kwon, Dongsoo Lee
Towards Compressive and Scalable Recurrent Memory Authors: Yunchong Song, Jushi Kai, Liming Lu, Kaixi Qiu, Zhouhan Lin
Prototype Transformer: Towards Language Model Architectures Interpretable by Design Authors: Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz
HLA: Hadamard Linear Attention Authors: Hanno Ackermann, Hong Cai, Mohsen Ghafoorian, Amirhossein Habibian
Predicting LLM Output Length via Entropy-Guided Representations Authors: Huanyi Xie, Yubin Chen, Liangyu Wang, Lijie Hu, Di Wang
The Implicit Bias of Logit Regularization Authors: Alon Beck, Yohai Bar Sinai, Noam Levi
Protein Circuit Tracing via Cross-layer Transcoders Authors: Darin Tsui, Kunal Talreja, Daniel Saeedi, Amirali Aghazadeh
MonarchRT: Efficient Attention for Real-Time Video Generation Authors: Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, Beidi Chen
Efficient Analysis of the Distilled Neural Tangent Kernel Authors: Jamie Mahowald, Brian Bell, Alex Ho, Michael Geyer
ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces Authors: Xin Xu, Tong Yu, Xiang Chen, Haoliang Wang, Julian McAuley, Saayan Mitra
Manifold-Aware Temporal Domain Generalization for Large Language Models Authors: Yiheng Yao, Zekun Cai, Xinyuan Song, Hiroki Hill Kobayashi, Xuan Song, Ryosuke Shibasaki, Liang Zhao
Beyond Parameter Arithmetic: Sparse Complementary Fusion for Distribution-Aware Model Merging Authors: Weihong Lin, Lin Sun, Qilong Shi, Aomufei Yuan, Yuxuan Tian, Zhengyang Wang, Guangxiang Zhao, Xiangzheng Zhang, Tong Yang
RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis Authors: Zhen Bi, Xueshu Chen, Luoyang Sun, Yuhang Yao, Qing Shen, Jungang Lou, Cheng Deng
Enforcing Reciprocity in Operator Learning for Seismic Wave Propagation Authors: Caifeng Zou, Yaozhong Shi, Zachary E. Ross, Robert W. Clayton, Kamyar Azizzadenesheli
Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt Authors: Yujie Gu, Richeng Jin, Xiaoyu Ji, Yier Jin, Wenyuan Xu
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification Authors: Nghia Nguyen, Tianjiao Ding, René Vidal
In-Context Function Learning in Large Language Models Authors: Elif Akata, Konstantinos Voudouris, Vincent Fortuin, Eric Schulz
The Observer Effect in World Models: Invasive Adaptation Corrupts Latent Physics Authors: Christian Internò, Jumpei Yamaguchi, Loren Amdahl-Culleton, Markus Olhofer, David Klindt, Barbara Hammer
TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex Authors: Balázs Meszéna, Keith T. Murray, Julien Corbo, O. Batuhan Erkat, Márton A. Hajnal, Pierre-Olivier Polack, Gergő Orbán

1. Causal-JEPA: Learning World Models through Object-Level Latent Interventions

ArXiv ID: 2602.11389

Authors: Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero

Abstract: World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.

Comment: Author match

2. RAM-Net: Expressive Linear Attention with Selectively Addressable Memory

ArXiv ID: 2602.11958

Authors: Kaicheng Xiao, Haotian Li, Liran Dong, Guoliang Xing

Abstract: While linear attention architectures offer efficient inference, compressing unbounded history into a fixed-size memory inherently limits expressivity and causes information loss. To address this limitation, we introduce Random Access Memory Network (RAM-Net), a novel architecture designed to bridge the gap between the representational capacity of full attention and the memory efficiency of linear models. The core of RAM-Net maps inputs to high-dimensional sparse vectors serving as explicit addresses, allowing the model to selectively access a massive memory state. This design enables exponential state size scaling without additional parameters, which significantly mitigates signal interference and enhances retrieval fidelity. Moreover, the inherent sparsity ensures exceptional computational efficiency, as state updates are confined to minimal entries. Extensive experiments demonstrate that RAM-Net consistently surpasses state-of-the-art baselines in fine-grained long-range retrieval tasks and achieves competitive performance in standard language modeling and zero-shot commonsense reasoning benchmarks, validating its superior capability to capture complex dependencies with significantly reduced computational overhead.

Comment: Matches Model Architecture and Efficiency: RAM-Net introduces selectively addressable sparse memory enabling expressive linear attention with random access.

Relevance: 10 Novelty: 9

3. Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

ArXiv ID: 2602.11937

Authors: Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwari, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

Abstract: Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8XH100 node we achieve 1.63X and 1.22X throughput speedups in long-context and short-context settings, respectively. gpt-oss-puzzle-88B also delivers throughput speedups of 2.82X on a single NVIDIA H100 GPU. However, because token counts can change with reasoning effort and model variants, per-token throughput (tok/s) and latency (ms/token) do not necessarily lead to end-to-end speedups: a 2X throughput gain is erased if traces grow 2X. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by tokens generated and trace an accuracy-speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29X higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds the parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high) to 108.2% (low), showing that post-training architecture search can substantially reduce inference costs without sacrificing quality.

Comment: Model Compression and Efficiency + MoE: heterogeneous MoE expert pruning, windowed attention replacement, and FP8 KV-cache quantization via post-training NAS for inference acceleration.

Relevance: 10 Novelty: 8
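
The abstract's point that per-token speedups can vanish at the request level comes down to simple arithmetic. The sketch below is only an illustration under an assumption: it reads "request-level efficiency" as requests completed per second, i.e. throughput in tok/s divided by tokens generated per request. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def requests_per_second(throughput_tok_s: float, tokens_per_request: float) -> float:
    """Requests completed per second at a given decode throughput (assumed metric)."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return throughput_tok_s / tokens_per_request


def request_level_speedup(base_tok_s: float, base_tokens: float,
                          new_tok_s: float, new_tokens: float) -> float:
    """Ratio of request-level efficiency between a new model and a baseline."""
    base = requests_per_second(base_tok_s, base_tokens)
    new = requests_per_second(new_tok_s, new_tokens)
    return new / base


# Hypothetical case 1: throughput doubles, but reasoning traces also double in length.
erased = request_level_speedup(1000.0, 4000.0, 2000.0, 8000.0)

# Hypothetical case 2: throughput doubles and trace length stays the same.
kept = request_level_speedup(1000.0, 4000.0, 2000.0, 4000.0)

print(erased, kept)
```

In the first case the request-level speedup is 1.0 even though tok/s doubled; in the second the full 2.0 is retained.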

4. Krause Synchronization Transformers

ArXiv ID: 2602.11534

Authors: Jingkun Liu, Yisong Yue, Max Welling, Yue Song

Abstract: Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.

Comment: C1+C2: Model architecture and efficiency—localized, selectively sparse attention (Krause Attention) with linear time complexity.

Relevance: 10 Novelty: 8
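
To make "distance-based, localized, and selectively sparse" concrete, here is a minimal pure-Python sketch of a bounded-confidence attention step. It is not the authors' Krause Attention: the Gaussian weight exp(-d^2), the hard `radius` cutoff, the positional `window`, and the fallback to the token's own value are all assumptions chosen for illustration.

```python
import math


def bounded_confidence_attention(queries, keys, values, window, radius):
    """Token i aggregates values only from tokens j with |i - j| <= window
    and ||q_i - k_j|| <= radius, weighted by exp(-squared distance)."""
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        lo = max(0, i - window)
        hi = min(len(keys), i + window + 1)
        weights, neighbors = [], []
        for j in range(lo, hi):
            d2 = sum((a - b) ** 2 for a, b in zip(q, keys[j]))
            if d2 <= radius ** 2:  # bounded confidence: ignore distant tokens
                weights.append(math.exp(-d2))
                neighbors.append(j)
        if not neighbors:
            # Assumed fallback: a token with no neighbors keeps its own value.
            outputs.append(list(values[i]))
            continue
        total = sum(weights)
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, neighbors)) / total
            for c in range(dim)
        ])
    return outputs


# Hypothetical 1-D tokens forming two well-separated clusters.
tokens = [[0.0], [0.1], [5.0], [5.1]]
out = bounded_confidence_attention(tokens, tokens, tokens, window=3, radius=1.0)
print(out)
```

Each token averages only within its own cluster, so the two groups do not collapse toward a single mode. With a fixed `window`, the inner loop touches at most 2 * window + 1 keys per query, which is where the linear cost in sequence length comes from in this sketch.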

5. LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training

ArXiv ID: 2602.11686

Authors: Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, Bin Cui

Abstract: Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models, enabling different devices to host distinct experts, with each device processing different input data. However, during expert parallel training, dynamic routing results in significant load imbalance among experts: a handful of overloaded experts hinder overall iteration, emerging as a training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices and restores partial experts at expert granularity through All-to-All communication during training. This allows for flexible re-layout of expert parameters during training to enhance load balancing. In particular, we perform fine-grained scheduling of communication operations to minimize communication overhead. Additionally, we develop a load balancing planner to formulate re-layout strategies of experts and routing schemes for tokens during training. We perform experiments on an A100 cluster, and the results indicate that our system achieves up to 1.69x acceleration compared to the current state-of-the-art training systems. Source code available at https://github.com/PKU-DAIR/Hetu-Galvatron/tree/laer-moe.

Comment: Matches High Performance Computing and MoE Architecture: introduces Fully Sharded Expert Parallelism and adaptive expert re-layout for load-balanced MoE training.

Relevance: 10 Novelty: 8

6. KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

ArXiv ID: 2602.11184

Authors: Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang

Abstract: Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose major challenges for deployment in resource-constrained environments. Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords. Yet, directly applying VQ to MoEs often leads to substantial performance degradation due to two critical obstacles: (1) redundant representations among experts cause VQ to repeatedly quantize similar representations for each expert, resulting in inefficient use of limited codebook capacity; and (2) cumulative output bias is amplified by expert aggregation in MoE layers, leading to distributional shifts in the quantized outputs. To address these issues, we propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs. KBVQ-MoE integrates two techniques: (1) input-driven redundancy elimination, where a Karhunen-Loeve Transform (KLT) guided singular value decomposition (SVD) extracts dominant weight components and shares them across experts; and (2) bias-corrected output stabilization, where vector quantization is applied only to expert-specific (non-redundant) representations and the quantized outputs are corrected via channel-wise affine compensation. Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods. For example, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99, nearly identical to the FP16 baseline of 68.07, underscoring KBVQ-MoE's potential for efficient deployment on edge devices and other resource-constrained platforms.

Comment: Matches Compression/Efficiency and MoE: KLT-guided SVD plus bias-corrected vector quantization for ultra-low-bit MoE LLMs.

Relevance: 10 Novelty: 8
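
The basic vector-quantization step the abstract builds on, mapping each weight vector to its most similar codeword, can be sketched as follows. This covers plain VQ only, not the KLT-guided SVD or the bias correction. Squared Euclidean distance as the similarity measure, the toy codebook, and the function names are assumptions for illustration.

```python
import math


def nearest_codeword(vector, codebook):
    """Index of the codeword closest to `vector` in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(weights, codebook):
    """Replace every weight vector by its nearest codeword."""
    indices = [nearest_codeword(w, codebook) for w in weights]
    reconstructed = [codebook[i] for i in indices]
    return indices, reconstructed


# Hypothetical codebook: 4 codewords of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.2], [1.1, -0.9]]

indices, reconstructed = quantize(weights, codebook)

# Storage cost: log2(#codewords) bits per index, shared by all entries of a vector.
bits_per_weight = math.log2(len(codebook)) / len(codebook[0])
print(indices, bits_per_weight)
```

Only the indices and the shared codebook need to be stored, which is why redundant experts that consume codewords for near-identical vectors waste the limited codebook capacity.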

7. Retrieval-Aware Distillation for Transformer-SSM Hybrids

ArXiv ID: 2602.11374

Authors: Aviv Bick, Eric P. Xing, Albert Gu

Abstract: State-space models (SSMs) offer efficient sequence modeling but lag behind Transformers on benchmarks that require in-context retrieval. Prior work links this gap to a small set of attention heads, termed Gather-and-Aggregate (G&A), which SSMs struggle to reproduce. We propose retrieval-aware distillation, which converts a pretrained Transformer into a hybrid student by preserving only these retrieval-critical heads and distilling the rest into recurrent heads. We identify the essential heads via ablation on a synthetic retrieval task, producing a hybrid with sparse, non-uniform attention placement. We show that preserving just 2% of attention heads recovers over 95% of teacher performance on retrieval-heavy tasks (10 heads in a 1B model), requiring far fewer heads than hybrids that retain at least 25%. We further find that large recurrent states often compensate for missing retrieval: once retrieval is handled by these heads, the SSM backbone can be simplified with limited loss, even with an 8x reduction in state dimension. By reducing both the attention cache and the SSM state, the resulting hybrid is 5-6x more memory-efficient than comparable hybrids, closing the Transformer-SSM gap at a fraction of the memory cost.

Comment: Model Architecture/Efficiency: retrieval-aware distillation to build Transformer–SSM hybrids by preserving only retrieval-critical heads; 5–6x memory savings.

Relevance: 10 Novelty: 8

8. MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models

ArXiv ID: 2602.11192

Authors: Arian Raje, Anupam Nayak, Gauri Joshi

Abstract: Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token, enabling computationally efficient training and inference. However, their large overall parameter counts and model sizes have precluded their widespread usage in resource-constrained settings as all of the parameters must still be loaded into GPU memory. Prior works aim to address this memory bottleneck by offloading certain experts into CPU memory and porting them to GPU memory only when they are activated. In practice, these methods suffer from the significant I/O latency incurred by expert transfer. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence. Caching these preferred experts in GPU memory reduces expert churn and CPU-GPU transfer overhead. MELINOE increases throughput by 1.2-3x over efficient baselines and up to 14.7x over transfer-heavy baselines while retaining or even improving the performance of the model on a downstream task, making it a reliable method for improving MoE inference efficiency.

Comment: Mixture-of-Experts Efficiency: fine-tuning to reduce experts-per-sequence and cache preferred experts, cutting CPU–GPU transfers and boosting throughput up to 14.7x.

Relevance: 10 Novelty: 8
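
Why a stronger per-sequence expert preference reduces CPU-GPU traffic can be seen in a toy cache simulation. The sketch below assumes an LRU policy for GPU-resident experts and one routed expert per token. Neither detail is stated in the abstract, and the routing traces are hypothetical.

```python
from collections import OrderedDict


def count_expert_transfers(expert_ids, cache_capacity):
    """Count CPU-to-GPU expert loads under an (assumed) LRU cache of experts."""
    cache = OrderedDict()
    transfers = 0
    for expert in expert_ids:
        if expert in cache:
            cache.move_to_end(expert)      # hit: mark as most recently used
        else:
            transfers += 1                 # miss: expert is copied from CPU memory
            cache[expert] = True
            if len(cache) > cache_capacity:
                cache.popitem(last=False)  # evict the least recently used expert
    return transfers


# Hypothetical routing traces for one sequence (one expert id per token).
spread_out = [0, 1, 2, 3, 4, 5, 6, 7] * 4    # cycles through 8 experts
concentrated = [0, 1, 2, 0, 1, 2, 0, 1] * 4  # prefers 3 experts

spread_transfers = count_expert_transfers(spread_out, cache_capacity=4)
concentrated_transfers = count_expert_transfers(concentrated, cache_capacity=4)
print(spread_transfers, concentrated_transfers)
```

With room for four experts on the GPU, the spread-out trace misses on every token, while the concentrated trace pays only for the first load of each preferred expert.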

9. Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction

ArXiv ID: 2602.12204

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Abstract: Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits decreasing attention utilization over training, achieving a 37.8x reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is impossible without consolidation: any static routing scheme requires $\Omega(f \cdot n)$ attention for tasks with recurring patterns of frequency $f$. On our proposed SRCD benchmark, CRAM achieves 100% retrieval accuracy at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with 48-52% attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology ($\gamma = 0.43$ vs. $\gamma_{\text{human}} \approx 0.4$-$0.5$). Code and benchmarks are available at [anonymized].

Comment: Efficiency/Conditional Networks: consolidation-based routing that provably reduces attention compute over training with adaptive memory consolidation.

Relevance: 9 Novelty: 9

10. From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders

ArXiv ID: 2602.11881

Authors: Yifan Luo, Yang Zhan, Jiedong Jiang, Tianyang Liu, Mingrui Wu, Zhennan Zhou, Bin Dong

Abstract: Sparse autoencoders (SAEs) have proven effective for extracting monosemantic features from large language models (LLMs), yet these features are typically identified in isolation. However, broad evidence suggests that LLMs capture the intrinsic structure of natural language, where the phenomenon of "feature splitting" in particular indicates that such structure is hierarchical. To capture this, we propose the Hierarchical Sparse Autoencoder (HSAE), which jointly learns a series of SAEs and the parent-child relationships between their features. HSAE strengthens the alignment between parent and child features through two novel mechanisms: a structural constraint loss and a random feature perturbation mechanism. Extensive experiments across various LLMs and layers demonstrate that HSAE consistently recovers semantically meaningful hierarchies, supported by both qualitative case studies and rigorous quantitative metrics. At the same time, HSAE preserves the reconstruction fidelity and interpretability of standard SAEs across different dictionary sizes. Our work provides a powerful, scalable tool for discovering and analyzing the multi-scale conceptual structures embedded in LLM representations.

Comment: Representation Learning: introduces hierarchical sparse autoencoders to discover multi-scale, monosemantic feature hierarchies in LLMs.