Personalized Daily ArXiv Papers 2026-01-29

[gpt-5]	Prompt	Completion	Total
Token	37149	39349	76498
Cost	$0.05	$0.39	$0.44

Total arXiv papers: 570

Total scanned papers: 320

Total relevant papers: 26

Table of contents with paper titles:

Hyperparameter Transfer with Mixture-of-Expert Layers Authors: Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin
HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs Authors: Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang
Minimax Rates for Hyperbolic Hierarchical Learning Authors: Divit Rawal, Sriram Vishwanath
Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs Authors: Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi
Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning Authors: Bo Dai, Na Li, Dale Schuurmans
Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling Authors: Binglei Lou, Haoran Wu, Yao Lai, Jiayi Nie, Can Xiao, Xuan Guo, Rika Antonova, Robert Mullins, Aaron Zhao
Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication Authors: Paul Tarau
Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds Authors: Faruk Alpay, Bugra Kilictas
Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization Authors: Jialuo He, Huangxun Chen
Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding Authors: Xiangbo Wang, Wenbin Jiang, Jin Wang, Yubo You, Sheng Fang, Fei Wen
Decomposing multimodal embedding spaces with group-sparse autoencoders Authors: Chiraag Kaushik, Davis Barch, Andrea Fanelli
Linear representations in language models can change dramatically over a conversation Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan
$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval Authors: Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See
Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning Authors: Chi-Yao Huang, Khoa Vo, Aayush Atul Verma, Duo Lu, Yezhou Yang
Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed Noise Authors: Gaku Omiya, Pierre-Louis Poirion, Akiko Takeda
Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations Authors: Kadircan Aksoy, Peter Jung, Protim Bhattacharjee
Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence Authors: Yichi Zhang, Fengqing Zhu
Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching Authors: Fengrui Zuo, Zhiwei Ke, Yiming Liu, Wenqi Lou, Chao Wang, Xvehai Zhou
TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs Authors: Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee
PiC-BNN: A 128-kbit 65 nm Processing-in-CAM-Based End-to-End Binary Neural Network Accelerator Authors: Yuval Harary, Almog Sharoni, Esteban Garzón, Marco Lanuzza, Adam Teman, Leonid Yavits
Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning Authors: Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them? Authors: J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang, Oleg Poliannikov
Loss Landscape Geometry and the Learning of Symmetries: Or, What Influence Functions Reveal About Robust Generalization Authors: James Amarel, Robyn Miller, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde
Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers Authors: Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata
TINNs: Time-Induced Neural Networks for Solving Time-Dependent PDEs Authors: Chen-Yang Dai, Che-Chia Chang, Te-Sheng Lin, Ming-Chih Lai, Chieh-Hsin Lai
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips Authors: Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang

1. Hyperparameter Transfer with Mixture-of-Expert Layers

ArXiv ID: 2601.20205

Authors: Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin

Abstract: Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.

Comment: Model Architecture (MoE): DMFT-justified parameterization enabling hyperparameter transfer across width/depth/experts/expert-size in sparse MoE Transformers.

Relevance: 10 Novelty: 8
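To make the "total versus activated parameters" distinction in entry 1 concrete, here is a small arithmetic sketch. It assumes a common MoE layout that the abstract does not spell out: each expert is a two-matrix feed-forward block, a linear router scores all experts, and each token is sent to its top-k experts. The function name and the example sizes are illustrative, not taken from the paper.

```python
def moe_layer_param_counts(d_model, d_expert, n_experts, top_k):
    """Return (total, active-per-token) weight counts for one MoE layer.

    Assumed layout (hypothetical, for illustration only):
      router : d_model x n_experts
      expert : d_model x d_expert (up) + d_expert x d_model (down)
    Biases are ignored.
    """
    router = d_model * n_experts
    per_expert = 2 * d_model * d_expert
    total = router + n_experts * per_expert
    active = router + top_k * per_expert  # only top_k experts run per token
    return total, active


total, active = moe_layer_param_counts(d_model=512, d_expert=2048,
                                       n_experts=8, top_k=2)
```

Under these assumptions, growing `n_experts` raises `total` while `active` changes only through the router term, which is the decoupling the abstract refers to.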

2. HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

ArXiv ID: 2601.20745

Authors: Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang

Abstract: As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.

Comment: Model Compression and Efficiency: Hessian-guided differentiable quantization-aware training with temperature-annealed soft rounding for 1.58-bit LLMs.

Relevance: 10 Novelty: 8
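The temperature-controlled relaxation described in entry 2 can be illustrated with a minimal sketch. It assumes a ternary grid {-1, 0, +1} and a softmax over negative squared distances to each level; the exact relaxation and the Hessian-trace-driven annealing schedule used by Hestia are not given in the abstract, so the schedule is left out and the function names are hypothetical.

```python
import math

LEVELS = (-1.0, 0.0, 1.0)  # ternary grid, about 1.58 bits per weight


def soft_quantize(w, temperature):
    """Softmax-weighted average of the quantization levels.

    High temperature: weights are spread over all levels (smooth, with
    useful gradients). Low temperature: the result approaches hard rounding.
    """
    logits = [-(w - q) ** 2 / temperature for q in LEVELS]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return sum(q * e / z for q, e in zip(LEVELS, exps))


def hard_quantize(w):
    """Nearest-level rounding, the limit of soft_quantize as temperature -> 0."""
    return min(LEVELS, key=lambda q: abs(w - q))
```

For example, `soft_quantize(0.8, 100.0)` stays close to 0 (an almost uniform average of the levels), while `soft_quantize(0.8, 1e-3)` is numerically indistinguishable from `hard_quantize(0.8)`.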

3. Minimax Rates for Hyperbolic Hierarchical Learning

ArXiv ID: 2601.20047

Authors: Divit Rawal, Sriram Vishwanath

Abstract: We prove an exponential separation in sample complexity between Euclidean and hyperbolic representations for learning on hierarchical data under standard Lipschitz regularization. For depth-$R$ hierarchies with branching factor $m$, we first establish a geometric obstruction for Euclidean space: any bounded-radius embedding forces volumetric collapse, mapping exponentially many tree-distant points to nearby locations. This necessitates Lipschitz constants scaling as $\exp(\Omega(R))$ to realize even simple hierarchical targets, yielding exponential sample complexity under capacity control. We then show this obstruction vanishes in hyperbolic space: constant-distortion hyperbolic embeddings admit $O(1)$-Lipschitz realizability, enabling learning with $n = O(mR \log m)$ samples. A matching $\Omega(mR \log m)$ lower bound via Fano's inequality establishes that hyperbolic representations achieve the information-theoretic optimum. We also show a geometry-independent bottleneck: any rank-$k$ prediction space captures only $O(k)$ canonical hierarchical contrasts.

Comment: Representation Learning Theory: proves minimax-optimal sample complexity for hyperbolic representations on hierarchies and exponential separation vs Euclidean embeddings.

Relevance: 9 Novelty: 9

4. Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs

ArXiv ID: 2601.20420

Authors: Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi

Abstract: Developing human-understandable interpretations of large language models (LLMs) is becoming increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to address this need by extracting human-interpretable processes and concepts from LLMs' activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing the LLM internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: the well-defined correspondence between LLM representations and human-interpretable concepts remains unclear. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM representations can be approximated as a linear mixture of the log-posteriors over concepts given the input context, through the lens of a latent variable model where concepts are treated as latent variables. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an unsupervised linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.

Comment: Representation Learning: principled concept extraction via unsupervised linear unmixing of LLM activations (Concept Component Analysis) with sparsity priors, offering a theory-backed alternative to SAEs.

Relevance: 9 Novelty: 8
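The central claim of entry 4 can be written compactly. The notation below is an assumption made for illustration: the abstract states only that representations are approximately a linear mixture of concept log-posteriors, and the symbols $h$, $A$, $W$, $z$, and $c_i$ are not taken from the paper.

```latex
% Mixing model: representation h(x) in R^d, n latent concepts c_1, ..., c_n
h(x) \approx A\, z(x), \qquad z_i(x) = \log p(c_i \mid x), \quad i = 1, \dots, n

% Unsupervised linear unmixing: estimate the log-posteriors from h(x)
\hat{z}(x) = W\, h(x)
```

Recovering $W$ from representations alone is ill-posed without further constraints, which is the role the abstract assigns to the sparsity prior in sparse ConCA.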

5. Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning

ArXiv ID: 2601.20154

Authors: Bo Dai, Na Li, Dale Schuurmans

Abstract: Self-supervised learning (SSL) has improved empirical performance by unleashing the power of unlabeled data for practical applications. Specifically, SSL extracts representations from massive unlabeled data, which are then transferred to many downstream tasks with limited data. The significant improvement on diverse applications of representation learning has attracted increasing attention, resulting in a variety of dramatically different self-supervised learning objectives for representation extraction, with an assortment of learning procedures, but without a clear and unified understanding. This absence hampers the ongoing development of representation learning, leaving theoretical understanding missing, principles for efficient algorithm design unclear, and the use of representation learning methods in practice unjustified. The urgency for a unified framework is further motivated by the rapid growth in representation learning methods. In this paper, we are therefore compelled to develop a principled foundation of representation learning. We first theoretically investigate the sufficiency of the representation from a spectral representation view, which reveals the spectral essence of the existing successful SSL algorithms and paves the path to a unified framework for understanding and analysis. Such a framework also inspires the development of more efficient and easy-to-use representation learning algorithms, designed in a principled way, for real-world applications.

Comment: Representation Learning: unified spectral framework explaining self-supervised objectives via spectral sufficiency, offering principled foundations and algorithmic guidance.

Relevance: 9 Novelty: 8

6. Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

ArXiv ID: 2601.20706

Authors: Binglei Lou, Haoran Wu, Yao Lai, Jiayi Nie, Can Xiao, Xuan Guo, Rika Antonova, Robert Mullins, Aaron Zhao

Abstract: Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent nm technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.

Comment: High Performance Computing/Systems: NPU architectural primitives and memory hierarchy tailored to diffusion LLM sampling (non-GEMM operations), delivering significant inference speedups.

Relevance: 9 Novelty: 8

7. Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

ArXiv ID: 2601.19915

Authors: Paul Tarau

Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry-Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, including relations between commutative vs. non-commutative sequencing and single-token vs. multi-token prediction choices. We show that a neural architecture equivalent to multiplicative RNNs arises naturally from a proof-theoretic interpretation of next-token prediction as nested intuitionistic implication. We also present a practical low-rank neural realization and position the model relative to Transformers and state-space models. Keywords: logic-based derivation of neural architectures, intuitionistic implicational logic, token-as-operator neural models, state-space models, alternatives to transformer-based foundational models.

Comment: Model Architecture: logic-derived Arrow LM interpreting next-token prediction as left-nested intuitionistic implication; presents a low-rank neural realization and positions vs Transformers/SSMs.

Relevance: 9 Novelty: 8

8. Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds

ArXiv ID: 2601.19942

Authors: Faruk Alpay, Bugra Kilictas

Abstract: We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. Treating the hidden-state trajectory as a flow on an implicit Riemannian manifold, we analyze the layerwise covariance spectrum of activations, where $C^{(\ell)}=\mathbb{E}[h^{(\ell)}h^{(\ell)\top}]$, and track deviations from a random-matrix bulk. Across model scales (1.5B--30B), we observe a sharp reduction in effective dimensionality consistent with a phase transition: an order parameter based on sparsity/localization, $\Omega(h)=1-|h|_1/(\sqrt{d}|h|_2)$, exhibits a discontinuity near a critical normalized depth $\gamma_c\approx 0.42$ in sufficiently large models. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space, which we call Transient Class Objects (TCOs). We provide theoretical conditions connecting logical separability to spectral decay and validate the predicted signatures with layerwise probes on multiple open-weight model families.

Comment: Representation Learning: geometric/spectral analysis of Transformer hidden manifolds revealing phase transitions, effective dimensionality collapse, and renormalization-like flows.

Relevance: 9 Novelty: 8
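The order parameter quoted in entry 8, $\Omega(h)=1-|h|_1/(\sqrt{d}|h|_2)$, is easy to evaluate directly. The sketch below reads $|h|_1$ and $|h|_2$ as the L1 and L2 norms of a single hidden-state vector of dimension $d$, which is an interpretation of the abstract's notation; the function name is hypothetical.

```python
import math


def localization_order_parameter(h):
    """Omega(h) = 1 - ||h||_1 / (sqrt(d) * ||h||_2) for a non-zero vector h.

    Equals 0 for a vector whose entries all have the same magnitude
    (fully spread out) and 1 - 1/sqrt(d) for a one-hot vector
    (fully localized).
    """
    d = len(h)
    l1 = sum(abs(x) for x in h)
    l2 = math.sqrt(sum(x * x for x in h))
    if l2 == 0.0:
        raise ValueError("Omega is undefined for the zero vector")
    return 1.0 - l1 / (math.sqrt(d) * l2)


dense = localization_order_parameter([1.0, 1.0, 1.0, 1.0])    # 0.0
one_hot = localization_order_parameter([0.0, 3.0, 0.0, 0.0])  # 0.5
```

The two example values bracket the range for d = 4, so a jump in this quantity across layers corresponds to activations concentrating on fewer coordinates.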

9. Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization

ArXiv ID: 2601.20301

Authors: Jialuo He, Huangxun Chen

Abstract: Sharpness-Aware Minimization (SAM) has recently emerged as an effective technique for improving DNN robustness to input variations. However, its interplay with the compactness requirements of on-device DNN deployments remains less explored. Simply pruning a SAM-trained model can undermine robustness, since flatness in the continuous parameter space does not necessarily translate to robustness under the discrete structural changes induced by pruning. Conversely, applying SAM after pruning may be fundamentally constrained by architectural limitations imposed by an early, robustness-agnostic pruning pattern. To address this gap, we propose Compression-aware ShArpness Minimization (C-SAM), a framework that shifts sharpness-aware learning from parameter perturbations to mask perturbations. By explicitly perturbing pruning masks during training, C-SAM promotes a flatter loss landscape with respect to model structure, enabling the discovery of pruning patterns that simultaneously optimize model compactness and robustness to input variations. Extensive experiments on CelebA-HQ, Flowers-102, and CIFAR-10-C across ResNet-18, GoogLeNet, and MobileNet-V2 show that C-SAM consistently achieves higher certified robustness than strong baselines, with improvements of up to 42%, while maintaining task accuracy comparable to the corresponding unpruned models.

Comment: Model Compression and Efficiency: pruning-aware sharpness minimization via mask perturbations to co-optimize compactness and robustness under structural sparsity.

Relevance: 9 Novelty: 8

10. Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding

ArXiv ID: 2601.20362

Authors: Xiangbo Wang, Wenbin Jiang, Jin Wang, Yubo You, Sheng Fang, Fei Wen

Abstract: Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.

Comment: Model Compression and Efficiency: residual-experts vector quantization (dynamic expert routing, variable bitrate) for neural audio coding—sparse quantization with MoE-like routing.

Relevance: 9 Novelty: 7
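As a rough illustration of the structure described in entry 10 (one shared quantizer followed by a variable number of routed expert quantizers acting on the residual), here is a toy scalar sketch. Everything concrete in it is an assumption: real codecs quantize vectors with learned codebooks and a learned router, whereas this sketch uses hand-picked scalar codebooks and takes the router scores as an input.

```python
def nearest(codebook, x):
    """Return the codebook entry closest to x."""
    return min(codebook, key=lambda c: abs(c - x))


def revq_encode(x, shared, experts, router_scores, n_active):
    """Quantize scalar x with a shared stage plus n_active routed expert stages.

    router_scores[i] is a hypothetical per-input score for expert i; the
    n_active highest-scoring experts quantize the remaining residual in turn.
    Returns (reconstruction, indices of the experts that were used).
    """
    q = nearest(shared, x)
    recon, residual = q, x - q
    ranked = sorted(range(len(experts)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = sorted(ranked[:n_active])
    for i in chosen:
        q = nearest(experts[i], residual)
        recon += q
        residual -= q
    return recon, chosen


shared = [-1.0, 0.0, 1.0]
experts = [[-0.2, -0.1, 0.0, 0.1, 0.2],       # coarse residual codebook
           [-0.04, -0.02, 0.0, 0.02, 0.04]]   # fine residual codebook
scores = [0.9, 0.5]
errors = [abs(0.86 - revq_encode(0.86, shared, experts, scores, n)[0])
          for n in (0, 1, 2)]
```

In this toy setting `errors` decreases as more experts are activated, mirroring the variable-bitrate idea: the number of active expert stages can change at inference without touching the codebooks.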

11. Decomposing multimodal embedding spaces with group-sparse autoencoders

ArXiv ID: 2601.20028

Authors: Chiraag Kaushik, Davis Barch, Andrea Fanelli

Abstract: The Linear Representation Hypothesis asserts that the embeddings learned by neural networks can be understood as linear combinations of features corresponding to high-level concepts. Based on this ansatz, sparse autoencoders (SAEs) have recently become a popular method for decomposing embeddings into a sparse combination of linear directions, which have been shown empirically to often correspond to human-interpretable semantics. However, recent attempts to apply SAEs to multimodal embedding spaces (such as the popular CLIP embeddings for image/text data) have found that SAEs often learn "split dictionaries", where most of the learned sparse features are essentially unimodal, active only for data of a single modality. In this work, we study how to effectively adapt SAEs for the setting of multimodal embeddings while ensuring multimodal alignment. We first argue that the existence of a split dictionary decomposition on an aligned embedding space implies the existence of a non-split dictionary with improved modality alignment. Then, we propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization. We apply our method to popular embeddings for image/text (CLIP) and audio/text (CLAP) data and show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity. We finally demonstrate how this improvement in alignment of concepts between modalities can enable improvements in the interpretability and control of cross-modal tasks.

Comment: Representation Learning + sparsity: group-sparse autoencoders with cross-modal masking to decompose multimodal embeddings.

Relevance: 9 Novelty: 7

12. Linear representations in language models can change dramatically over a conversation

ArXiv ID: 2601.20834

Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan

Abstract: Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.

Comment: Representation Learning: studies dynamics of linear concept directions in LMs across conversations, impacting interpretability/steering.

Relevance: 9 Novelty: 7

13. $\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

ArXiv ID: 2601.20844

Authors: Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

Abstract: This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of "distances" or "similarities," including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the ${m\choose k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.

Comment: Representation Learning: establishes tight bounds on minimal embeddable dimension for top-k retrieval across L2/inner product/cosine, isolating geometric limits of embedding-based retrieval.

Relevance: 8 Novelty: 8
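The "more achievable setting" in entry 13, where a subset is embedded as the centroid of its elements, can be checked with a few lines of code. This sketch uses inner-product scoring and one plausible reading of "embeddable": the subset's members must be exactly the top-scoring elements for the centroid query. The helper names are hypothetical, and the one-hot example uses dimension m, so it only illustrates the check and not the paper's $2k$-dimensional bound.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def retrieves_subset(element_embs, subset):
    """True if the centroid query ranks exactly the subset's members on top."""
    query = centroid([element_embs[i] for i in subset])
    ranked = sorted(range(len(element_embs)),
                    key=lambda i: dot(query, element_embs[i]), reverse=True)
    return set(ranked[:len(subset)]) == set(subset)


# One-hot embeddings (dimension m = 5): every subset is retrievable.
one_hot = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
ok = retrieves_subset(one_hot, [0, 3])

# A 1-D embedding fails: larger values always win under inner product.
line = [[1.0], [2.0], [3.0]]
bad = retrieves_subset(line, [0, 1])
```

Here `ok` is True and `bad` is False, which shows the kind of geometric failure that a too-small dimension can cause.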

14. Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning

ArXiv ID: 2601.20069

Authors: Chi-Yao Huang, Khoa Vo, Aayush Atul Verma, Duo Lu, Yezhou Yang

Abstract: Training a single network with multiple objectives often leads to conflicting gradients that degrade shared representations, forcing them into a compromised state that is suboptimal for any single task--a problem we term latent representation collapse. We introduce Domain Expansion, a framework that prevents these conflicts by restructuring the latent space itself. Our framework uses a novel orthogonal pooling mechanism to construct a latent space where each objective is assigned to a mutually orthogonal subspace. We validate our approach across diverse benchmarks--including ShapeNet, MPIIGaze, and Rotated MNIST--on challenging multi-objective problems combining classification with pose and gaze estimation. Our experiments demonstrate that this structure not only prevents collapse but also yields an explicit, interpretable, and compositional latent space where concepts can be directly manipulated.

Comment: Model Architecture/Representation Learning: orthogonal pooling constructs mutually orthogonal latent subspaces per task to resolve gradient conflicts in multi-task learning.
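The idea in entry 14 of assigning each objective to a mutually orthogonal subspace can be illustrated with the simplest such construction: disjoint coordinate blocks of one latent vector. This is an assumption made for illustration only; the abstract does not describe how the paper's orthogonal pooling mechanism builds its subspaces, and the function names are hypothetical.

```python
def place_in_subspace(task_vec, offset, latent_dim):
    """Embed task_vec into a zero latent vector at coordinates
    [offset, offset + len(task_vec))."""
    if offset < 0 or offset + len(task_vec) > latent_dim:
        raise ValueError("task block does not fit in the latent space")
    z = [0.0] * latent_dim
    z[offset:offset + len(task_vec)] = task_vec
    return z


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Two objectives share a 5-dimensional latent space without overlapping.
cls_part = place_in_subspace([1.0, 2.0], offset=0, latent_dim=5)
pose_part = place_in_subspace([3.0, 4.0, 5.0], offset=2, latent_dim=5)
latent = [a + b for a, b in zip(cls_part, pose_part)]
```

Because the blocks do not overlap, `dot(cls_part, pose_part)` is zero and each task's component can be read back from `latent` (or edited) without affecting the other, which is the compositional property the abstract highlights.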