Personalized Daily ArXiv Papers 2026-03-31

Model	Metric	Usage: Prompt	Usage: Completion	Usage: Total	Papers: Total arXiv	Papers: Scanned	Papers: Relevant
`gpt-5.4`	Tokens	251654	8825	260479	985	617	68
`gpt-5.4`	Cost	$0.63	$0.13	$0.76	985	617	68

Table of contents with paper titles:

The Geometric Cost of Normalization: Affine Bounds on the Bayesian Complexity of Neural Networks Authors: Sungbae Chun
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration Authors: Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, Ganzhao Yuan
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention Authors: Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Jiexi Wu, Zhixin Pan, Zhaohui Wang, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di yin, Xing Sun, Muhan Zhang
High dimensional theory of two-phase optimizers Authors: Atish Agarwala
Rethinking Language Model Scaling under Transferable Hypersphere Optimization Authors: Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen
KVSculpt: KV Cache Compression as Distillation Authors: Bo Jiang, Sian Jin
Temporal Credit Is Free Authors: Aur Shalev Merin
On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry Authors: Mohammad Tinati, Stephen Tu
Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling Authors: Songchen Ma, Hongyi Li, Weihao Zhang, Yonghao Tan, Pingcheng Dong, Yu Liu, Lan Liu, Yuzhong Jiao, Xuejiao Liu, Luhong Liang, Kwang-Ting Cheng
TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization Authors: Dipkumar Patel
IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression Authors: Zhongping Ji
GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models Authors: Lipeng Wan, Junjie Ma, Jianhui Gu, Zeyang Liu, Xuyang Lu, Xuguang Lan
GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph Authors: Yuebo Luo, Shiyang Li, Yifei Feng, Vishal Kancharla, Shaoyi Huang, Caiwen Ding
Gaussian Joint Embeddings For Self-Supervised Representation Learning Authors: Yongchao Huang
Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs Authors: Seine A. Shintani
On the Loss Landscape Geometry of Regularized Deep Matrix Factorization: Uniqueness and Sharpness Authors: Anil Kamber, Rahul Parhi
Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation Authors: Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, David Klindt
Lipschitz verification of neural networks through training Authors: Simon Kuang, Yuezhu Xu, S. Sivaranjani, Xinfan Lin
ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference Authors: Qiuyang Zhang, Kai Zhou, Ding Tang, Kai Lu, Cheng Li, Zhenyu Yang, Peng Xu, Jiguang Wan
Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling Authors: Minh-Tuan Tran, Xuan-May Le, Quan Hung Tran, Mehrtash Harandi, Dinh Phung, Trung Le
MolmoPoint: Better Pointing for VLMs with Grounding Tokens Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna
A Step Toward Federated Pretraining of Multimodal Large Language Models Authors: Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu
Universal Approximation Constraints of Narrow ResNets: The Tunnel Effect Authors: Christian Kuehn, Sara-Viola Kuntz, Tobias Wöhrer
Preconditioned Attention: Enhancing Efficiency in Transformers Authors: Hemanth Saratchandran
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models Authors: Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
Sparse-by-Design Cross-Modality Prediction: L0-Gated Representations for Reliable and Efficient Learning Authors: Filippo Cenacchi
Spectral Higher-Order Neural Networks Authors: Gianluca Peri, Timoteo Carletti, Duccio Fanelli, Diego Febbe
OptINC: Optical In-Network-Computing for Scalable Distributed Learning Authors: Sijie Fei, Grace Li Zhang, Bing Li, Ulf Schlichtmann
The Price of Meaning: Why Every Semantic Memory System Forgets Authors: Sambartha Ray Barman, Andrey Starenky, Sofia Bodnar, Nikhil Narasimhan, Ashwin Gopinath
Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds Authors: N Alex Cayco Gajic, Arthur Pellegrino
Next-Token Prediction and Regret Minimization Authors: Mehryar Mohri, Clayton Sanford, Jon Schneider, Kiran Vodrahalli, Yifan Wu
A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management Authors: Ashwin Ganesan
Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks Authors: Meitong Liu, Christopher Jung, Rui Li, Xue Feng, Han Zhao
Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals Authors: Nathaniel Oh, Paul Attie
The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams Authors: Isaac Llorente-Saguer
Kernel Dynamics under Path Entropy Maximization Authors: Jnaneshwar Das
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology Authors: Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama
Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment Authors: Tongfei Liu, Yufan Liu, Bing Li, Weiming Hu
Gradient Manipulation in Distributed Stochastic Gradient Descent with Strategic Agents: Truthful Incentives with Convergence Guarantees Authors: Ziqin Chen, Yongqiang Wang
Interpretable Physics Extraction from Data for Linear Dynamical Systems using Lie Generator Networks Authors: Shafayeth Jamil, Rehan Kapadia
Variational Neurons in Transformers for Language Modeling Authors: Yves Ruffenach
Heddle: A Distributed Orchestration System for Agentic RL Rollout Authors: Zili Zhang, Yinmin Zhong, Chengxu Yang, Chao Jin, Bingyang Wu, Xinming Wei, Yuliang Liu, Xin Jin
Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries Authors: Jon-Paul Cacioli
Spectral Signatures of Data Quality: Eigenvalue Tail Index as a Diagnostic for Label Noise in Neural Networks Authors: Matthew Loftus
Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data Authors: Yuanqiao Zhang, Tiantian He, Yuan Gao, Yixin Wang, Yew-Soon Ong, Maoguo Gong, A. K. Qin, Hui Li
RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication Authors: Mohsen Dehghankar, Abolfazl Asudeh
ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing Authors: Edward J. Yoon
Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention Authors: Seunghun Oh, Unsang Park
Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs Authors: Ehsan Zeraatkar, Rodion Podorozhny, Jelena Tešić
LACE: Loss-Adaptive Capacity Expansion for Continual Learning Authors: Shivnath Tathe
daVinci-LLM: Towards the Science of Pretraining Authors: Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang, Weiye Si, Pengrui Lu, Siyuan Feng, Xia Wu, Liming Liu, Ye Luo, Jinlong Hou, Qipeng Guo, Yu Qiao, Pengfei Liu
Stepwise Credit Assignment for GRPO on Flow-Matching Models Authors: Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh
LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models Authors: Shaik Aman
Can We Change the Stroke Size for Easier Diffusion? Authors: Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen
ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models Authors: Song Yu, Li Li
Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching Authors: Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints Authors: Jelena Markovic-Voronov, Kayhan Behdin, Yuanda Xu, Zhengze Zhou, Zhipeng Wang, Rahul Mazumder
ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control Authors: Christopher Cruz
AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings Authors: Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann
Semantic Interaction Information mediates compositional generalization in latent space Authors: John Schwarcz
FEMBA on the Edge: Physiologically-Aware Pre-Training, Quantization, and Deployment of a Bidirectional Mamba EEG Foundation Model on an Ultra-low Power Microcontroller Authors: Anna Tegon, Nicholas Lehmann, Yawei Li, Andrea Cossettini, Luca Benini, Thorir Mar Ingolfsson
Steering Sparse Autoencoder Latents to Control Dynamic Head Pruning in Vision Transformers (Student Abstract) Authors: Yousung Lee, Dongsoo Har
RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation Authors: Chanseul Cho, Seokju Yun, Jeaseong Jeon, Seungjae Moon, Youngmin Ro
EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation Authors: Sravanth Kodavanti, Manjunath Arveti, Sowmya Vajrala, Srinivas Miriyala, Vikram N R
DSO: Dual-Scale Neural Operators for Stable Long-term Fluid Dynamics Forecasting Authors: Huanshuo Dong, Hao Wu, Hong Wang, Qin-Yi Zhang, Zhezheng Hao
Scalable Maximum Entropy Population Synthesis via Persistent Contrastive Divergence Authors: Mirko Degli Esposti
Diffusion Maps is not Dimensionality Reduction Authors: Julio Candanedo, Alejandro Patiño

1. The Geometric Cost of Normalization: Affine Bounds on the Bayesian Complexity of Neural Networks

ArXiv ID: 2603.27432

Authors: Sungbae Chun

Abstract: LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs - and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm's mean-centering step, by confining data to a linear hyperplane (through the origin), reduces the Local Learning Coefficient (LLC) of the subsequent weight matrix by exactly $m/2$ (where $m$ is its output dimension); RMSNorm's projection onto a sphere preserves the LLC entirely. This reduction is structurally guaranteed before any training begins, determined by data manifold geometry alone. The underlying condition is a geometric threshold: for the codimension-one manifolds we study, the LLC drop is binary -- any non-zero curvature, regardless of sign or magnitude, is sufficient to preserve the LLC, while only affinely flat manifolds cause the drop. At finite sample sizes this threshold acquires a smooth crossover whose width depends on how much of the data distribution actually experiences the curvature, not merely on whether curvature exists somewhere. We verify both predictions experimentally with controlled single-layer scaling experiments using the wrLLC framework. We further show that Softmax simplex data introduces a "smuggled bias" that activates the same $m/2$ LLC drop when paired with an explicit downstream bias, proved via the affine symmetry extension of the main theorem and confirmed empirically.

Comment: Theoretical analysis of LayerNorm vs RMSNorm ties normalization design to Bayesian complexity via exact geometric bounds.

Relevance: 10 Novelty: 9
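
The geometric contrast in the abstract above can be checked numerically. The sketch below is only an illustration of the two constraints (it uses no learned gain or bias and does not compute the LLC): LayerNorm output sums to zero, so it lies in a hyperplane through the origin, while RMSNorm output only satisfies the sphere constraint. The function names and the sample vector are made up for this example.

```python
import math

def layernorm(x):
    """Mean-center, then divide by the standard deviation (no gain or bias)."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / len(x)
    return [c / math.sqrt(var) for c in centered]

def rmsnorm(x):
    """Divide by the root-mean-square (no gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

x = [0.5, -1.2, 3.0, 0.7]          # arbitrary sample activation
ln, rn = layernorm(x), rmsnorm(x)

ln_sum = sum(ln)                   # zero: output lies in the hyperplane sum(y) = 0
rn_sum = sum(rn)                   # generally non-zero: no hyperplane constraint
ln_sq = sum(v * v for v in ln)     # equals len(x): sphere of radius sqrt(d)
rn_sq = sum(v * v for v in rn)     # equals len(x): sphere of radius sqrt(d)
```

Both outputs land on the same sphere; only LayerNorm adds the flat (affine) constraint that the abstract identifies as the cause of the LLC drop.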

2. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

ArXiv ID: 2603.28254

Authors: Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, Ganzhao Yuan

Abstract: Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton--Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.

Comment: Architecture mechanisms and training dynamics: lightweight pre-orthogonalization equilibration for Muon with spectral analysis of finite-step orthogonalization and LLaMA pretraining gains.

Relevance: 10 Novelty: 8
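
To make the "balance before orthogonalizing" idea concrete, here is a minimal sketch under stated assumptions. It uses the classical cubic Newton-Schulz iteration rather than Muon's tuned polynomial, and it takes row normalization to mean dividing each row by its Euclidean norm, which may differ from the paper's exact R variant. The 2x2 matrix is a toy stand-in for a badly scaled momentum matrix.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def row_normalize(m, eps=1e-12):
    """Divide each row by its Euclidean norm (one statistic per row)."""
    return [[v / math.sqrt(sum(u * u for u in row) + eps) for v in row] for row in m]

def newton_schulz(m, steps=5):
    """Classical cubic iteration X <- 1.5 X - 0.5 X X^T X after Frobenius scaling."""
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, cubic)]
    return x

def orth_error(x):
    """Largest entrywise deviation of X X^T from the identity."""
    g = matmul(x, transpose(x))
    n = len(g)
    return max(abs(g[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))

momentum = [[100.0, 1.0], [1.0, 2.0]]                      # rows differ in scale by ~50x
err_plain = orth_error(newton_schulz(momentum))            # far from orthogonal after 5 steps
err_balanced = orth_error(newton_schulz(row_normalize(momentum)))  # close to orthogonal
```

With the same five steps, the row-balanced input ends much closer to an orthogonal matrix because its singular values start closer together, which is the conditioning effect the abstract attributes to equilibration.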

3. HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

ArXiv ID: 2603.28458

Authors: Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Jiexi Wu, Zhixin Pan, Zhaohui Wang, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di yin, Xing Sun, Muhan Zhang

Abstract: Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an $O(L^2)$ per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2$\times$ speedup at 32K context length and 4$\times$ at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come with virtually no impact on selection fidelity.

Comment: Hierarchical indexing for token-level sparse attention directly targets efficient inference and long-context attention mechanisms.

Relevance: 10 Novelty: 8
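
The two-stage search described in the abstract above can be sketched as follows. This is a toy illustration, not the paper's kernel: it assumes dot-product scoring and mean pooling for the block representatives, and `block_size` and `keep_blocks` are hypothetical parameters.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def flat_topk(query, keys, k):
    """Baseline indexer: score every token and keep the k best indices."""
    order = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(order[:k])

def hierarchical_topk(query, keys, k, block_size, keep_blocks):
    """Stage 1: rank mean-pooled block representatives and keep the best blocks.
    Stage 2: score tokens only inside surviving blocks and keep the k best."""
    blocks = [range(s, min(s + block_size, len(keys))) for s in range(0, len(keys), block_size)]
    dim = len(keys[0])

    def pooled(block):
        return [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]

    ranked = sorted(range(len(blocks)), key=lambda b: dot(query, pooled(blocks[b])), reverse=True)
    candidates = [i for b in ranked[:keep_blocks] for i in blocks[b]]
    order = sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(order[:k])

query = [1.0, 0.0]
keys = [[0.1, 0.0], [0.0, 1.0], [0.9, 0.0], [0.8, 0.1],
        [-1.0, 0.0], [-0.5, 0.0], [0.7, 0.0], [0.2, 0.0]]

flat = flat_topk(query, keys, k=3)
hier = hierarchical_topk(query, keys, k=3, block_size=2, keep_blocks=2)
```

On this toy input both searches select the same tokens while the hierarchical one scores only four tokens in stage 2. In general the coarse filter can prune a block that holds a top-k token, so agreement is not guaranteed.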

4. High dimensional theory of two-phase optimizers

ArXiv ID: 2603.26954

Authors: Atish Agarwala

Abstract: The trend towards larger training setups has brought a renewed interest in partially asynchronous two-phase optimizers which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. We show that the one-worker variant, LA, provides a different tradeoff between signal and noise than SGD, which is beneficial in many scenarios. We also show that the multi-worker version generates more noise than the single worker version, but that this additional noise generation can be ameliorated by appropriate choice of hyperparameters. We conclude with an analysis of SLA -- LA with momentum -- and show that stacking two momentum operators gives an opportunity for acceleration via a non-linear transformation of the "effective" Hessian spectrum, which is maximized for Nesterov momentum. Altogether our results show that two-phase optimizers represent a fruitful new paradigm for understanding and improving training algorithms.

Comment: High-dimensional theory of two-phase optimizers gives mechanistic insight into training dynamics and distributed optimization behavior.

Relevance: 10 Novelty: 8
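
For readers unfamiliar with the "optimize locally, then synchronize" pattern in the abstract above, a minimal scalar sketch follows. It assumes plain SGD as the inner optimizer and a simple interpolation toward the worker average as the outer step. Whether this matches the paper's exact LA-DiLoCo definition is an assumption, and all hyperparameter values are illustrative.

```python
import random

def two_phase_sgd(grad, theta0, workers=2, inner_steps=4, rounds=25,
                  inner_lr=0.1, outer_lr=0.7, noise=0.0, seed=0):
    """Each round, every worker runs `inner_steps` SGD steps from the shared point;
    the shared point then moves a fraction `outer_lr` toward the worker average."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(rounds):
        finals = []
        for _ in range(workers):
            local = theta
            for _ in range(inner_steps):
                local -= inner_lr * (grad(local) + noise * rng.gauss(0.0, 1.0))
            finals.append(local)
        average = sum(finals) / workers
        theta = theta + outer_lr * (average - theta)   # outer synchronization step
    return theta

# Loss 0.5 * (theta - 3)^2 has gradient theta - 3 and minimizer 3.
result = two_phase_sgd(lambda t: t - 3.0, theta0=0.0)
```

Setting `workers=1` gives the single-worker variant, and a non-zero `noise` lets you compare how the worker count changes the spread of the final iterate.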

5. Rethinking Language Model Scaling under Transferable Hypersphere Optimization

ArXiv ID: 2603.28743

Authors: Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen

Abstract: Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$\mu$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.

Comment: Architecture mechanisms and training dynamics: hypersphere parameterization gives transferable scaling laws and stability across width, depth, tokens, and MoE granularity, including the new SqrtGate routing design.

Relevance: 10 Novelty: 8
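
One way to see the abstract's claim that weight decay is a first-order no-op on the sphere: the decay term is parallel to the weights, so after re-projection it only rescales the effective step size. The sketch below checks this numerically on a toy vector with a plain gradient step followed by renormalization. It is not the paper's Muon-based update, and the vectors and constants are arbitrary.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def step(w, g, lr, wd):
    """Gradient step with weight decay, then projection back onto the unit sphere."""
    moved = [wi - lr * (gi + wd * wi) for wi, gi in zip(w, g)]
    return normalize(moved)

def gap(lr, wd=0.1):
    """Distance between the projected updates with and without weight decay."""
    w = normalize([1.0, 2.0, -1.0, 0.5])
    g = [0.3, -0.7, 0.2, 1.1]
    with_wd = step(w, g, lr, wd)
    without = step(w, g, lr, 0.0)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(with_wd, without)))

gap_large = gap(1e-2)
gap_small = gap(1e-3)
ratio = gap_large / gap_small   # about 100: the gap shrinks like lr^2, not lr
```

Shrinking the learning rate by 10x shrinks the gap by roughly 100x, so the effect of weight decay vanishes at first order in the learning rate.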

6. KVSculpt: KV Cache Compression as Distillation

ArXiv ID: 2603.27819

Authors: Bo Jiang, Sian Jin

Abstract: KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit -- attention-score eviction with least-squares value fitting -- across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x -- demonstrating that fine-grained budget allocation is essential.

Comment: Compression and efficient inference: replaces KV eviction/merging with direct optimization of a smaller unconstrained KV set, plus adaptive per-layer/head budget allocation for long-context inference.

Relevance: 10 Novelty: 8

7. Temporal Credit Is Free

ArXiv ID: 2603.28750

Authors: Aur Shalev Merin

Abstract: Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: $\beta_2$ is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.

Comment: Architecture/training-dynamics result claiming immediate derivatives can replace temporal Jacobian propagation in recurrent online learning, with a normalization rule predicting when this works.

Relevance: 9 Novelty: 8

8. On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry

ArXiv ID: 2603.27631

Authors: Mohammad Tinati, Stephen Tu

Abstract: Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.

Comment: Representation-learning theory: asymptotic two-stage M-estimation for self-supervised pretraining with group-symmetry identifiability and downstream risk characterization.

Relevance: 9 Novelty: 8

9. Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling

ArXiv ID: 2603.27624

Authors: Songchen Ma, Hongyi Li, Weihao Zhang, Yonghao Tan, Pingcheng Dong, Yu Liu, Lan Liu, Yuzhong Jiao, Xuejiao Liu, Luhong Liang, Kwang-Ting Cheng

Abstract: Mixture-of-Experts is a promising approach for edge AI with low-batch inference. Yet, on-device deployments often face limited on-chip memory and severe workload imbalance; the prevalent use of offloading further incurs off-chip memory access bottlenecks. Moreover, MoE sparsity and dynamic gating shift distributed strategies toward much finer granularity and introduce runtime scheduling considerations. Recently, high die-to-die bandwidth chiplet interconnects have created new opportunities for multi-chiplet systems to address workload imbalance and offloading bottlenecks with fine-grained scheduling. In this paper, we propose Fully Sharded Expert Data Parallelism, a parallelization paradigm specifically architected for low-batch MoE inference on multi-chiplet accelerators. FSE-DP attains adaptive computation-communication overlap and balanced load by orchestrating fine-grained, complementary expert streams along dynamic trajectories across high-bandwidth D2D links. The attendant dataflow complexity is tamed by a minimal, hardware-amenable set of virtualization rules and a lightweight scheduling algorithm. Our approach achieves 1.22 to 2.00 times speedup over state-of-the-art baselines and saves up to 78.8 percent on-chip memory.

Comment: MoE inference systems: introduces a multi-chiplet sharding/scheduling paradigm for low-batch expert routing with memory and load-balance gains.

Relevance: 9 Novelty: 8

10. TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization

ArXiv ID: 2603.27467

Authors: Dipkumar Patel

Abstract: We compress KV cache entries by quantizing angles in the Fast Walsh-Hadamard domain, where a random diagonal rotation makes consecutive element pairs approximately uniformly distributed on the unit circle. We extend this angular quantizer with per-layer early-boost, which independently configures K and V codebook sizes at each layer, allocating higher precision to a model-specific subset of critical layers. Across seven models (1B to 7B parameters), per-layer early-boost achieves lossless compression on four models and near-lossless quality on six of seven, at 3.28 to 3.67 angle bits per element. Asymmetric norm quantization (8-bit for keys, 4-bit log-space for values) yields 6.56 total bits per element on Mistral-7B with perplexity degradation of +0.0014 and no calibration data. A layer-group sensitivity analysis reveals model-specific bottleneck patterns, including K-dominated versus V-dominated layers and negative-transfer layers where increased precision degrades quality.

Comment: Compression/efficient inference: proposes a new KV-cache quantization scheme in the Walsh-Hadamard angle domain with per-layer precision allocation.

Relevance: 9 Novelty: 8
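
The core operation in the abstract above, uniform quantization of the angle of each consecutive element pair, can be sketched in a few lines. This illustration deliberately omits the Walsh-Hadamard transform, the random diagonal rotation, and norm quantization (radii are kept in full precision). The function names and the bin count are assumptions for the example.

```python
import math

def quantize_pairs(vec, bins):
    """Encode consecutive (x, y) pairs as (radius, angle-bin index)."""
    step = 2.0 * math.pi / bins
    codes = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        radius = math.hypot(x, y)
        angle = math.atan2(y, x) % (2.0 * math.pi)      # map to [0, 2*pi)
        codes.append((radius, int(round(angle / step)) % bins))
    return codes

def dequantize_pairs(codes, bins):
    """Rebuild each pair from its radius and the center angle of its bin."""
    step = 2.0 * math.pi / bins
    out = []
    for radius, idx in codes:
        out.extend([radius * math.cos(idx * step), radius * math.sin(idx * step)])
    return out

bins = 16                                                # 4 angle bits per pair
vec = [0.3, -1.1, 2.0, 0.4, -0.7, -0.2, 0.0, 1.5]
recon = dequantize_pairs(quantize_pairs(vec, bins), bins)
```

Because the angle error is at most half a bin, each reconstructed pair lies within a chord of length 2 r sin(pi / (2 * bins)) of the original, where r is the pair's radius.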

11. IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

ArXiv ID: 2603.28430

Authors: Zhongping Ji

Abstract: Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise 3D Clifford rotors, yet the resulting 3D partition is poorly aligned with modern hardware and offers limited local mixing. We propose IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each 4D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: IsoQuant-Full, which realizes the full $SO(4)$ rotation, and IsoQuant-Fast, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight 2D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in \{128,256,512\}$, bit widths $\{2,3,4\}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.

Comment: Compression and efficient inference: hardware-aligned SO(4) isoclinic rotations for KV-cache quantization with clear kernel-level efficiency gains.

Relevance: 9 Novelty: 8
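
The transform $T(v)=q_L v \overline{q_R}$ from the abstract above is easy to write out for a single 4D block. The sketch below uses the standard Hamilton product with quaternions stored as (w, x, y, z). The helper names and sample values are made up, and no quantization step is included.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def qunit(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def rotate_block(v, q_left, q_right):
    """T(v) = q_L * v * conj(q_R) for a 4D block v and unit quaternions q_L, q_R."""
    return qmul(qmul(q_left, v), qconj(q_right))

def unrotate_block(t, q_left, q_right):
    """Inverse transform: v = conj(q_L) * T(v) * q_R."""
    return qmul(qmul(qconj(q_left), t), q_right)

q_left = qunit((1.0, 2.0, 3.0, 4.0))
q_right = qunit((0.5, -1.0, 0.25, 2.0))
v = (0.3, -1.2, 0.8, 2.0)

rotated = rotate_block(v, q_left, q_right)
restored = unrotate_block(rotated, q_left, q_right)
norm_in = math.sqrt(sum(c * c for c in v))
norm_out = math.sqrt(sum(c * c for c in rotated))
```

The transform preserves the block's norm and is undone exactly by the conjugate quaternions. Passing the identity quaternion (1, 0, 0, 0) as `q_right` leaves a single left multiplication, which is one plausible reading of the "one isoclinic factor" variant.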

12. GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models

ArXiv ID: 2603.26675

Authors: Lipeng Wan, Junjie Ma, Jianhui Gu, Zeyang Liu, Xuyang Lu, Xuguang Lan

Abstract: Block diffusion enables efficient parallel refinement in diffusion language models, but its decoding behavior depends critically on block size. Existing block-sizing strategies rely on fixed rules or heuristic signals and do not account for the dependency geometry that determines which tokens can be safely refined together. This motivates a geometry view of diffusion decoding: regions with strong causal ordering require sequential updates, whereas semantically cohesive regions admit parallel refinement. We introduce GeoBlock, a geometry-aware block inference framework that determines block granularity directly from attention-derived dependency geometry. Instead of relying on predefined schedules or local confidence heuristics, GeoBlock analyzes cross-token dependency patterns to identify geometrically stable refinement regions and dynamically determines appropriate block boundaries during decoding. By adapting block granularity to the dependency geometry, GeoBlock preserves the parallel efficiency of block diffusion while enforcing dependency-consistent refinement that exhibits autoregressive reliability. GeoBlock requires no additional training and integrates seamlessly into existing block diffusion architectures. Extensive experiments across multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves the accuracy of block diffusion with only a small additional computational budget.

Comment: Geometry-aware block sizing for diffusion language models studies decoding-time dependency structure for parallel refinement.

Relevance: 9 Novelty: 8

13. GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph

ArXiv ID: 2603.27156

Authors: Yuebo Luo, Shiyang Li, Yifei Feng, Vishal Kancharla, Shaoyi Huang, Caiwen Ding

Abstract: Graph Neural Networks (GNNs) show strong promise for circuit analysis, but scaling to modern large-scale circuit graphs is limited by GPU memory and training cost, especially for deep models. We revisit deep GNNs for circuit graphs and show that, when trainable, they significantly outperform shallow architectures, motivating an efficient, domain-specific training framework. We propose Grouped-Sparse-Reversible GNN (GSR-GNN), which enables training GNNs with up to hundreds of layers while reducing both compute and memory overhead. GSR-GNN integrates reversible residual modules with a group-wise sparse nonlinear operator that compresses node embeddings without sacrificing task-relevant information, and employs an optimized execution pipeline to eliminate fragmented activation storage and reduce data movement. On sampled circuit graphs, GSR-GNN achieves up to 87.2% peak memory reduction and over 30$\times$ training speedup with negligible degradation in correlation-based quality metrics, making deep GNNs practical for large-scale EDA workloads.

Comment: Reversible residual modules plus grouped sparse operators reduce deep GNN training memory and compute substantially.

Relevance: 9 Novelty: 8
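
The memory saving from reversible residual modules comes from being able to recompute a layer's inputs from its outputs instead of storing activations. The sketch below shows the standard additive-coupling form on scalars. Whether GSR-GNN uses exactly this coupling is an assumption, and `f` and `g` are arbitrary stand-ins for the sub-layers.

```python
import math

def forward(x1, x2, f, g):
    """Additive coupling: each half is updated using only the other half."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def inverse(y1, y2, f, g):
    """Recover the inputs from the outputs; no stored activations needed."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

f = math.tanh                      # stand-in for one sub-layer
g = lambda v: 0.5 * v * v          # stand-in for the other sub-layer

x1, x2 = 0.4, -1.3
y1, y2 = forward(x1, x2, f, g)
r1, r2 = inverse(y1, y2, f, g)
```

Neither `f` nor `g` needs to be invertible; the coupling structure alone makes the block invertible, which is what lets very deep stacks train with roughly constant activation memory.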

14. Gaussian Joint Embeddings For Self-Supervised Representation Learning

ArXiv ID: 2603.26799

Authors: Yongchao Huang

Abstract: Self-supervised representation learning often relies on deterministic predictive architectures to align context and target views in latent space. While effective in many settings, such methods are limited in genuinely multi-modal inverse problems, where squared-loss prediction collapses towards conditional averages, and they frequently depend on architectural asymmetries to prevent representation collapse. In this work, we propose a probabilistic alternative based on generative joint modeling. We introduce Gaussian Joint Embeddings (GJE) and its multi-modal extension, Gaussian Mixture Joint Embeddings (GMJE), which model the joint density of context and target representations and replace black-box prediction with closed-form conditional inference under an explicit probabilistic model. This yields principled uncertainty estimates and a covariance-aware objective for controlling latent geometry. We further identify a failure mode of naive empirical batch optimization, which we term the Mahalanobis Trace Trap, and develop several remedies spanning parametric, adaptive, and non-parametric settings, including prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), topology-adaptive Growing Neural Gas (GMJE-GNG), and a Sequential Monte Carlo (SMC) memory bank. In addition, we show that standard contrastive learning can be interpreted as a degenerate non-parametric limiting case of the GMJE framework. Experiments on synthetic multi-modal alignment tasks and vision benchmarks show that GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities that are better suited to unconditional sampling than deterministic or unimodal baselines.

Comment: Probabilistic joint-embedding framework analyzes self-supervised representation structure and links contrastive learning as a limiting case.

Relevance: 9 Novelty: 8

15. Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs

ArXiv ID: 2603.26828

Authors: Seine A. Shintani

Abstract: Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second, after layout repair, the hundreds position behaves like a carry flag rather than a semantic hundreds digit; targeted carry probes reverse the relevant logit margin, whereas a matched extra-data control does not. Third, after carry repair, the main remaining bottleneck is conditional recomposition: high-conditioned tail data outperforms a matched control, high-only data, and tail-only data on all true-3-digit suites, and the same ordering reappears in a larger 2-layer bridge experiment. The residual errors after recomposition are then overwhelmingly tens-only, and a separate 10-seed late-stage study shows that a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. We therefore provide an experimentally testable decomposition of arithmetic OOD failure into layout, carry-semantics, recomposition, and late tens-residual stages.

Comment: Controlled minimal-GPT study decomposes arithmetic OOD failure into staged representational and training-dynamics bottlenecks.

Relevance: 9 Novelty: 8

16. On the Loss Landscape Geometry of Regularized Deep Matrix Factorization: Uniqueness and Sharpness

ArXiv ID: 2603.27072

Authors: Anil Kamber, Rahul Parhi

Abstract: Weight decay is ubiquitous in training deep neural network architectures. Its empirical success is often attributed to capacity control; nonetheless, our theoretical understanding of its effect on the loss landscape and the set of minimizers remains limited. In this paper, we show that $\ell^2$-regularized deep matrix factorization/deep linear network training problems with squared-error loss admit a unique end-to-end minimizer for all target matrices subject to factorization, except for a set of Lebesgue measure zero formed by the depth and the regularization parameter. This observation reveals fundamental properties of the loss landscape of regularized deep matrix factorization problems: the Hessian spectrum is constant across all minimizers of the regularized deep scalar factorization problem with squared-error loss. Moreover, we show that, in regularized deep matrix factorization problems with squared-error loss, if the target matrix does not belong to the Lebesgue measure-zero set, then the Frobenius norm of each layer is constant across all minimizers. This, in turn, yields a global lower bound on the trace of the Hessian evaluated at any minimizer of the regularized deep matrix factorization problem. Furthermore, we establish a critical threshold for the regularization parameter above which the unique end-to-end minimizer collapses to zero.

Comment: Representation learning theory and training dynamics: analyzes how $\ell^2$ regularization changes deep matrix factorization geometry, proving near-generic uniqueness of end-to-end minimizers and Hessian sharpness structure.

Relevance: 9 Novelty: 8
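
To make the collapse threshold in entry 16 concrete, consider the simplest instance: a depth-2 scalar factorization of a target $m > 0$. This is an illustrative special case with an assumed normalization of the regularizer; the paper's general statement and constants may differ.

```latex
\min_{a,b \in \mathbb{R}} \; (ab - m)^2 + \lambda\,(a^2 + b^2),
\qquad m > 0,\ \lambda > 0.

% By AM-GM, a^2 + b^2 \ge 2|ab| with equality iff |a| = |b|,
% so in terms of the end-to-end product p = ab the problem reduces to
\min_{p \in \mathbb{R}} \; (p - m)^2 + 2\lambda\,|p|
\quad\Longrightarrow\quad
p^\star = \max(m - \lambda,\, 0).
```

In this toy case the end-to-end minimizer $p^\star$ is unique, every minimizing factorization satisfies $a^2 = b^2 = p^\star$ (equal layer norms), and the solution collapses to zero once $\lambda \ge m$.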

17. Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

ArXiv ID: 2603.28744

Authors: Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, David Klindt

Abstract: The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning -- not the inference procedure -- as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.

Comment: Representation learning theory and structure: shows SAEs fail under compositional shifts because dictionary learning, not sparse inference amortization, is the core bottleneck under superposition.

Relevance: 9 Novelty: 8

18. Lipschitz verification of neural networks through training

ArXiv ID: 2603.28113

Authors: Simon Kuang, Yuezhu Xu, S. Sivaranjani, Xinfan Lin

Abstract: The global Lipschitz constant of a neural network governs both adversarial robustness and generalization. Conventional approaches to "certified training" typically follow a train-then-verify paradigm: they train a network and then attempt to bound its Lipschitz constant. Because the efficient "trivial bound" (the product of the layerwise Lipschitz constants) is exponentially loose for arbitrary networks, these approaches must rely on computationally expensive techniques such as semidefinite programming, mixed-integer programming, or branch-and-bound. We propose a different paradigm: rather than designing complex verifiers for arbitrary networks, we design networks to be verifiable by the fast trivial bound. We show that directly penalizing the trivial bound during training forces it to become tight, thereby effectively regularizing the true Lipschitz constant. To achieve this, we identify three structural obstructions to a tight trivial bound (dead neurons, bias terms, and ill-conditioned weights) and introduce architectural mitigations, including a novel notion of norm-saturating polyactivations and bias-free sinusoidal layers. Our approach avoids the runtime complexity of advanced verification while achieving strong results: we train robust networks on MNIST with Lipschitz bounds that are small (orders of magnitude lower than comparable works) and tight (within 10% of the ground truth). The experimental results validate the theoretical guarantees, support the proposed mechanisms, and extend empirically to diverse activations and non-Euclidean norms.

Comment: Architecture mechanisms and training stability: trains networks so the trivial Lipschitz bound becomes tight, introducing structural fixes like norm-saturating polyactivations and bias-free sinusoidal layers.

Relevance: 9 Novelty: 8
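
The "trivial bound" in entry 18 is the product of per-layer Lipschitz constants. A minimal sketch, assuming a bias-free feedforward network with 1-Lipschitz activations and the $\ell_\infty$ norm, for which a linear layer's Lipschitz constant is its maximum absolute row sum. The weights and helper names below are hypothetical and not taken from the paper:

```python
def linf_operator_norm(W):
    """Induced l-infinity norm of a matrix: maximum absolute row sum."""
    return max(sum(abs(w) for w in row) for row in W)

def trivial_lipschitz_bound(layers):
    """Product of layerwise Lipschitz constants.

    This is a valid upper bound on the network's Lipschitz constant when
    every activation between layers is 1-Lipschitz (e.g. ReLU).
    """
    bound = 1.0
    for W in layers:
        bound *= linf_operator_norm(W)
    return bound

# Hypothetical two-layer network: R^2 -> R^2 -> R^1.
W1 = [[1.0, -2.0],
      [0.5,  0.5]]
W2 = [[1.0, 1.0]]

bound = trivial_lipschitz_bound([W1, W2])  # 3.0 * 2.0 = 6.0
```

A training penalty in the spirit of the abstract would add some multiple of this quantity to the loss; the exact form of the penalty the authors use is not stated in the abstract.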

19. ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference

ArXiv ID: 2603.27138

Authors: Qiuyang Zhang, Kai Zhou, Ding Tang, Kai Lu, Cheng Li, Zhenyu Yang, Peng Xu, Jiguang Wan

Abstract: Large language models encounter critical GPU memory capacity constraints during long-context inference, where KV cache memory consumption severely limits decode batch sizes. While existing research has explored offloading KV cache to DRAM, these approaches either demand frequent GPU-CPU data transfers or impose extensive CPU computation requirements, resulting in poor GPU utilization as the system waits for I/O operations or CPU processing to complete. We propose ScoutAttention, a novel KV cache offloading framework that accelerates LLM inference through collaborative GPU-CPU attention computation. To prevent CPU computation from bottlenecking the system, ScoutAttention introduces GPU-CPU collaborative block-wise sparse attention that significantly reduces CPU load. Unlike conventional parallel computing approaches, our framework features a novel layer-ahead CPU pre-computation algorithm, enabling the CPU to initiate attention computation one layer in advance, complemented by asynchronous periodic recall mechanisms to maintain minimal CPU compute load. Experimental results demonstrate that ScoutAttention maintains accuracy within 2.4% of baseline while achieving 2.1x speedup compared to existing offloading methods.

Comment: Compression and efficient inference: proposes KV-cache offloading with layer-ahead CPU pre-computation and collaborative sparse attention to change long-context inference memory/throughput tradeoffs.

Relevance: 9 Novelty: 8

20. Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling

ArXiv ID: 2603.27665

Authors: Minh-Tuan Tran, Xuan-May Le, Quan Hung Tran, Mehrtash Harandi, Dinh Phung, Trung Le

Abstract: Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model's weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization.

Comment: Introduces test-time instance-specific weight composition, a dynamic computation mechanism that adapts model parameters per input without retraining.

Relevance: 9 Novelty: 8

21. MolmoPoint: Better Pointing for VLMs with Grounding Tokens

ArXiv ID: 2603.28069

Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.

Comment: Introduces a new grounding mechanism for VLMs that points by selecting visual tokens and subpatches instead of emitting coordinate text, a core architectural design change.

Relevance: 9 Novelty: 8

22. A Step Toward Federated Pretraining of Multimodal Large Language Models

ArXiv ID: 2603.26786

Authors: Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu

Abstract: The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.

Comment: Federated MLLM pretraining introduces canonical-space projector aggregation and orthogonality-preserved momentum to reduce parameter interference and gradient oscillation in distributed training.

Relevance: 9 Novelty: 8

23. Universal Approximation Constraints of Narrow ResNets: The Tunnel Effect

ArXiv ID: 2603.28591

Authors: Christian Kuehn, Sara-Viola Kuntz, Tobias Wöhrer

Abstract: We analyze the universal approximation constraints of narrow Residual Neural Networks (ResNets) both theoretically and numerically. For deep neural networks without input space augmentation, a central constraint is the inability to represent critical points of the input-output map. We prove that this has global consequences for target function approximations and show that the manifestation of this defect is typically a shift of the critical point to infinity, which we call the ``tunnel effect'' in the context of classification tasks. While ResNets offer greater expressivity than standard multilayer perceptrons (MLPs), their capability strongly depends on the signal ratio between the skip and residual channels. We establish quantitative approximation bounds for both the residual-dominant (close to MLP) and skip-dominant (close to neural ODE) regimes. These estimates depend explicitly on the channel ratio and uniform network weight bounds. Low-dimensional examples further provide a detailed analysis of the different ResNet regimes and how architecture-target incompatibility influences the approximation error.

Comment: Architecture mechanism analysis of narrow ResNets, giving explicit approximation constraints and skip-vs-residual regime bounds via the tunnel effect.

Relevance: 9 Novelty: 7

24. Preconditioned Attention: Enhancing Efficiency in Transformers

ArXiv ID: 2603.27153

Authors: Hemanth Saratchandran

Abstract: Central to the success of Transformers is the attention block, which effectively models global dependencies among the input tokens of a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers often produce ill-conditioned matrices with large condition numbers. This ill-conditioning is a well-known obstacle for gradient-based optimizers, leading to inefficient training. To address this issue, we introduce preconditioned attention, a novel approach that incorporates a conditioning matrix into each attention head. Our theoretical analysis shows that this method significantly reduces the condition number of attention matrices, resulting in better-conditioned matrices that improve optimization. Preconditioned attention serves as a simple drop-in replacement for a wide variety of attention mechanisms in the literature. We validate the effectiveness of preconditioned attention across a diverse set of transformer applications, including image classification, object detection, instance segmentation, long sequence modeling and language modeling.

Comment: Architecture mechanisms and training dynamics: introduces preconditioned attention as a drop-in attention variant motivated by conditioning analysis of attention matrices.

Relevance: 9 Novelty: 7

25. On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

ArXiv ID: 2603.27481

Authors: Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong

Abstract: Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.

Comment: Architecture mechanisms and training dynamics: token-level analysis of MoE routing drift in continual LVLMs with a new drift-aware assignment mechanism.

Relevance: 9 Novelty: 7

26. Sparse-by-Design Cross-Modality Prediction: L0-Gated Representations for Reliable and Efficient Learning

ArXiv ID: 2603.26801

Authors: Filippo Cenacchi

Abstract: Predictive systems increasingly span heterogeneous modalities such as graphs, language, and tabular records, but sparsity and efficiency remain modality-specific (graph edge or neighborhood sparsification, Transformer head or layer pruning, and separate tabular feature-selection pipelines). This fragmentation makes results hard to compare, complicates deployment, and weakens reliability analysis across end-to-end KDD pipelines. A unified sparsification primitive would make accuracy-efficiency trade-offs comparable across modalities and enable controlled reliability analysis under representation compression. We ask whether a single representation-level mechanism can yield comparable accuracy-efficiency trade-offs across modalities while preserving or improving probability calibration. We propose L0-Gated Cross-Modality Learning (L0GM), a modality-agnostic, feature-wise hard-concrete gating framework that enforces L0-style sparsity directly on learned representations. L0GM attaches hard-concrete stochastic gates to each modality's classifier-facing interface: node embeddings (GNNs), pooled sequence embeddings such as CLS (Transformers), and learned tabular embedding vectors (tabular models). This yields end-to-end trainable sparsification with an explicit control knob for the active feature fraction. To stabilize optimization and make trade-offs interpretable, we introduce an L0-annealing schedule that induces clear accuracy-sparsity Pareto frontiers. Across three public benchmarks (ogbn-products, Adult, IMDB), L0GM achieves competitive predictive performance while activating fewer representation dimensions, and it reduces Expected Calibration Error (ECE) in our evaluation. Overall, L0GM establishes a modality-agnostic, reproducible sparsification primitive that supports comparable accuracy, efficiency, and calibration trade-off analysis across heterogeneous modalities.

Comment: Compression and sparsity: modality-agnostic L0 hard-concrete gating on learned representations as a unified sparsification primitive across GNNs, transformers, and tabular models.

Relevance: 9 Novelty: 7
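
The hard-concrete gates named in entry 26 can be sketched as follows. This follows the standard hard-concrete construction from the L0-regularization literature, with the commonly used stretch interval $(\gamma, \zeta) = (-0.1, 1.1)$ and temperature $\beta = 2/3$ as assumed defaults. L0GM's actual hyperparameters and annealing schedule are not given in the abstract, and the feature vector and gate parameters below are hypothetical:

```python
import math
import random

GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0  # assumed defaults, not from L0GM

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_gate(log_alpha, rng):
    """Draw one stochastic hard-concrete gate value in [0, 1]."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    s_stretched = s * (ZETA - GAMMA) + GAMMA      # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, s_stretched))        # hard clip to [0, 1]

def prob_active(log_alpha):
    """Probability that the gate is non-zero; summing this over features
    gives the differentiable expected-L0 penalty."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

rng = random.Random(0)
log_alphas = [-4.0, 0.0, 4.0]   # hypothetical per-feature gate parameters
features = [1.5, -0.7, 2.0]     # hypothetical representation vector
gated = [x * sample_gate(a, rng) for x, a in zip(features, log_alphas)]
expected_l0 = sum(prob_active(a) for a in log_alphas)
```

Because the clip sends part of the probability mass to exactly 0, a gate can switch a representation dimension off entirely, while `expected_l0` remains differentiable in `log_alpha` and can serve as the sparsity penalty.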

27. Spectral Higher-Order Neural Networks

ArXiv ID: 2603.28420

Authors: Gianluca Peri, Timoteo Carletti, Duccio Fanelli, Diego Febbe

Abstract: Neural networks are fundamental tools of modern machine learning. The standard paradigm assumes binary interactions (across feedforward linear passes) between intertangled units, organized in sequential layers. Generalized architectures have also been designed that move beyond pairwise interactions, so as to account for higher-order couplings among computing neurons. Higher-order networks are, however, usually deployed as augmented graph neural networks (GNNs) and, as such, prove advantageous solely in contexts where the input exhibits an explicit hypergraph structure. Here, we present Spectral Higher-Order Neural Networks (SHONNs), a new algorithmic strategy to incorporate higher-order interactions in general-purpose, feedforward network structures. SHONNs leverage a reformulation of the model in terms of spectral attributes. This makes it possible to mitigate the common stability and parameter scaling problems that come with weighted, higher-order forward propagations.

Comment: Architecture mechanism: general-purpose higher-order feedforward network design using a spectral formulation to control stability and parameter scaling.

Relevance: 8 Novelty: 8

28. OptINC: Optical In-Network-Computing for Scalable Distributed Learning

ArXiv ID: 2603.28290

Authors: Sijie Fei, Grace Li Zhang, Bing Li, Ulf Schlichtmann

Abstract: Distributed learning is widely used for training large models on large datasets by distributing parts of the model or dataset across multiple devices and aggregating the computed results for subsequent computations or parameter updates. Existing communication algorithms for distributed learning such as ring all-reduce result in heavy communication overhead between servers. Since communication in large-scale systems uses optical fibers, we propose an Optical In-Network-Computing (OptINC) architecture to offload the computation in servers onto the optical interconnects. To execute gradient averaging and quantization in the optical domain, we incorporate optical devices such as Mach-Zehnder-Interferometers (MZIs) into the interconnects. Such a de facto optical neural network (ONN) can effectively reduce the communication overhead in existing distributed training solutions. To reduce dataset complexity for training this neural network, a preprocessing algorithm implemented in the optical domain is also proposed. Hardware cost is lowered by approximating the weight matrices of the optical neural network with unitary and diagonal matrices, while the accuracy is maintained by a proposed hardware-aware training algorithm. The proposed solution was evaluated on real distributed learning tasks, including ResNet50 on CIFAR-100, and a LLaMA-based network on Wikipedia-1B. In both cases, the proposed framework can achieve comparable training accuracy to the ring all-reduce baseline, while eliminating communication overhead.

Comment: Large-scale training systems: offloads gradient averaging and quantization into optical interconnects via in-network computing rather than standard server-side reduction.

Relevance: 8 Novelty: 8

29. The Price of Meaning: Why Every Semantic Memory System Forgets

ArXiv ID: 2603.27116

Authors: Sambartha Ray Barman, Andrey Starenky, Sofia Bodnar, Nikhil Narasimhan, Ashwin Gopinath

Abstract: Every major AI memory system in production today organises information by meaning. That organisation enables generalisation, analogy, and conceptual retrieval -- but it comes at a price. We prove that the same geometric structure enabling semantic generalisation makes interference, forgetting, and false recall inescapable. We formalise this tradeoff for \textit{semantically continuous kernel-threshold memories}: systems whose retrieval score is a monotone function of an inner product in a semantic feature space with finite local intrinsic dimension. Within this class we derive four results: (1) semantically useful representations have finite effective rank; (2) finite local dimension implies positive competitor mass in retrieval neighbourhoods; (3) under growing memory, retention decays to zero, yielding power-law forgetting curves under power-law arrival statistics; (4) for associative lures satisfying a $\delta$-convexity condition, false recall cannot be eliminated by threshold tuning. We test these predictions across five architectures: vector retrieval, graph memory, attention-based context, BM25 filesystem retrieval, and parametric memory. Pure semantic systems express the vulnerability directly as forgetting and false recall. Reasoning-augmented systems partially override these symptoms but convert graceful degradation into catastrophic failure. Systems that escape interference entirely do so by sacrificing semantic generalisation. The price of meaning is interference, and no architecture we tested avoids paying it.

Comment: Representation learning theory and structure: proves an interference-forgetting tradeoff for semantic memory systems via finite local intrinsic dimension and kernel-threshold retrieval.

Relevance: 8 Novelty: 8

30. Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds

ArXiv ID: 2603.28764

Authors: N Alex Cayco Gajic, Arthur Pellegrino

Abstract: Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.

Comment: Representation learning structure: introduces intrinsic-geometry-based similarity metrics for neural representations using Riemannian tools rather than extrinsic state-space comparisons.

Relevance: 8 Novelty: 8

31. Next-Token Prediction and Regret Minimization

ArXiv ID: 2603.28499

Authors: Mehryar Mohri, Clayton Sanford, Jon Schneider, Kiran Vodrahalli, Yifan Wu

Abstract: We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution $\mathcal{D}$ over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model's predictions) has low adversarial regret (i.e., when is $\mathcal{D}$ a \emph{low-regret distribution})? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution $\mathcal{D}$ is a low-regret distribution, every distribution $\mathcal{D}$ is exponentially close (in TV distance) to one low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast to this, for bounded context windows (where the prediction made by the model can depend only on the past $w$ actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions $\mathcal{D}$ of opponent play that are $\Theta(1)$-far from any low-regret distribution $\mathcal{D'}$ (even when $w = \Omega(T)$ and such distributions exist). Finally, we complement these results by showing that the unbounded context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.

Comment: Architecture/training theory: analyzes when next-token prediction induces low-regret online decision-making and studies the bounded-context limitation relevant to transformer architectures.

Relevance: 8 Novelty: 8
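
To make the objects in this abstract concrete, here is a minimal sketch of a decision rule that best responds to a bounded-context predictor, together with its regret against the best fixed action in hindsight. Everything specific is an assumption for illustration and is not taken from the paper: the two-action matching game in `utility`, the frequency-counting `predict` function standing in for a trained next-token model, and the use of external regret as the regret notion.

```python
from collections import Counter

ACTIONS = (0, 1)

def utility(learner_action, opponent_action):
    # Hypothetical game: the learner earns 1 when it matches the opponent.
    return 1.0 if learner_action == opponent_action else 0.0

def predict(history, w):
    """Stand-in predictor: empirical frequencies over the last w opponent actions."""
    window = history[-w:] if w > 0 else []
    if not window:
        return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
    counts = Counter(window)
    return {a: counts[a] / len(window) for a in ACTIONS}

def best_response(prediction):
    # Action with the highest expected utility under the predicted distribution
    # (ties go to the first action in ACTIONS).
    return max(ACTIONS, key=lambda a: sum(p * utility(a, o) for o, p in prediction.items()))

def regret(opponent_actions, w):
    """Utility of the best fixed action in hindsight minus the utility actually earned."""
    earned = 0.0
    for t, actual in enumerate(opponent_actions):
        action = best_response(predict(opponent_actions[:t], w))
        earned += utility(action, actual)
    best_fixed = max(sum(utility(a, o) for o in opponent_actions) for a in ACTIONS)
    return best_fixed - earned

alternating = regret([0, 1] * 10, w=1)  # opponent exploits the one-step window
constant = regret([0] * 20, w=1)        # opponent is perfectly predictable
```

In this toy setting the alternating opponent makes the window-1 learner wrong on every round after the first, so regret grows linearly with the horizon, while the constant opponent yields zero regret.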

32. A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management

ArXiv ID: 2603.27154

Authors: Ashwin Ganesan

Abstract: Entity resolution -- identifying database records that refer to the same real-world entity -- is naturally modelled on bipartite graphs connecting entity nodes to their attribute values. Applying a message-passing neural network (MPNN) with all available extensions (reverse message passing, port numbering, ego IDs) incurs unnecessary overhead, since different entity resolution tasks have fundamentally different complexity. For a given matching criterion, what is the cheapest MPNN architecture that provably works? We answer this with a four-theorem separation theory on typed entity-attribute graphs. We introduce co-reference predicates $\mathrm{Dup}_r$ (two same-type entities share at least $r$ attribute values) and the $\ell$-cycle predicate $\mathrm{Cyc}_\ell$ for settings with entity-entity edges. For each predicate we prove tight bounds -- constructing graph pairs provably indistinguishable by every MPNN lacking the required adaptation, and exhibiting explicit minimal-depth MPNNs that compute the predicate on all inputs. The central finding is a sharp complexity gap between detecting any shared attribute and detecting multiple shared attributes. The former is purely local, requiring only reverse message passing in two layers. The latter demands cross-attribute identity correlation -- verifying that the same entity appears at several attributes of the target -- a fundamentally non-local requirement needing ego IDs and four layers, even on acyclic bipartite graphs. A similar necessity holds for cycle detection. Together, these results yield a minimal-architecture principle: practitioners can select the cheapest sufficient adaptation set, with a guarantee that no simpler architecture works. Computational validation confirms every prediction.

Comment: Tight expressivity hierarchy identifies minimal message-passing mechanisms needed for specific graph reasoning predicates.

Relevance: 8 Novelty: 8
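
As a reading aid, the co-reference predicate described in the abstract (two same-type entities sharing at least r attribute values) can be checked directly on a small table. The sketch below only illustrates what the predicate asks for; it is not the paper's MPNN construction. The record layout, the `dup_r` function name, and the sample data are all hypothetical.

```python
from itertools import combinations

def dup_r(records, r):
    """Return pairs of same-type entities that share at least r attribute values.

    `records` maps an entity id to (entity_type, set_of_attribute_values).
    """
    pairs = []
    for (a, (type_a, attrs_a)), (b, (type_b, attrs_b)) in combinations(records.items(), 2):
        if type_a == type_b and len(attrs_a & attrs_b) >= r:
            pairs.append((a, b))
    return pairs

# Hypothetical records: attribute values are tagged with their attribute type.
records = {
    "p1": ("person", {"email:a@x.org", "phone:555-0100", "addr:1 Main St"}),
    "p2": ("person", {"email:a@x.org", "phone:555-0100"}),
    "p3": ("person", {"email:a@x.org"}),
    "c1": ("company", {"phone:555-0100", "addr:1 Main St"}),
}

any_shared = dup_r(records, 1)    # pairs with at least one shared value
multi_shared = dup_r(records, 2)  # pairs with at least two shared values
```

With r = 1 every pair of the three person records qualifies, whereas r = 2 keeps only `("p1", "p2")`. The company record is excluded by the type check even though it shares two values with `p1`.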

33. Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks

ArXiv ID: 2603.28739

Authors: Meitong Liu, Christopher Jung, Rui Li, Xue Feng, Han Zhao

Abstract: In transfer learning, the learner leverages auxiliary data to improve generalization on a main task. However, the precise theoretical understanding of when and how auxiliary data help remains incomplete. We provide new insights on this issue in two canonical linear settings: ordinary least squares regression and under-parameterized linear neural networks. For linear regression, we derive exact closed-form expressions for the expected generalization error with bias-variance decomposition, yielding necessary and sufficient conditions for auxiliary tasks to improve generalization on the main task. We also derive globally optimal task weights as outputs of solvable optimization programs, with consistency guarantees for empirical estimates. For linear neural networks with shared representations of width $q \leq K$, where $K$ is the number of auxiliary tasks, we derive a non-asymptotic expectation bound on the generalization error, yielding the first non-vacuous sufficient condition for beneficial auxiliary learning in this setting, as well as principled directions for task weight curation. We achieve this by proving a new column-wise low-rank perturbation bound for random matrices, which improves upon existing bounds by preserving fine-grained column structures. Our results are verified on synthetic data simulated with controlled parameters.

Comment: Provides exact and non-vacuous generalization conditions for transfer learning in linear models and linear networks, advancing representation-learning theory.

Relevance: 8 Novelty: 8

34. Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals

ArXiv ID: 2603.26829

Authors: Nathaniel Oh, Paul Attie

Abstract: Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&R), an activation-patching architecture with two components: a fixed detector body (layers 24-31, the localized safety evaluation circuit) and a swappable detector core (an activation vector controlling perception direction). A safety core shifts the model from compliance toward detection; an absorb core reverses it. We evaluate on OLMo-2 7B using the Order-Gap Benchmark - 500 chains across 500 domains, all manually graded. Key findings: cascade collapse is near-total (99.8% compliance at O5); the detector body is binary and localized (layers 24-31 shift 93.6%, layers 0-23 contribute zero, p<10^-189); a synthetically engineered core releases 76.6% of collapsed chains; detection is the more stable attractor (83% restore vs 58% suppress); and epistemic specificity is confirmed (false-premise core releases 45.4%, true-premise core releases 0.0%). The contribution is the framework - body/core architecture, benchmark, and core engineering methodology - which is model-agnostic by design.

Comment: Localizes a safety-related activation circuit and decomposes it into detector body/core components, offering mechanistic insight into hidden hallucination behavior.

Relevance: 8 Novelty: 8

35. The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

ArXiv ID: 2603.27412

Authors: Isaac Llorente-Saguer

Abstract: We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and \emph{abliterated} (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq$0.937 for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($\sigma_\theta \approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($\sigma_\theta \approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.

Comment: Residual-stream angular anomaly detection gives a mechanistic representation-space analysis of harmful intent that survives refusal ablation.

Relevance: 8 Novelty: 8
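
The scoring rule described in this abstract can be sketched in a few lines of NumPy. The sketch rests on assumptions the abstract does not settle: the principal component is taken from mean-centered activations, the angle is measured on the raw activation vectors, and random synthetic vectors stand in for real residual-stream activations. The function names are hypothetical.

```python
import numpy as np

def deviation_angle(acts, direction):
    """Angle in radians between each activation vector and the reference direction."""
    cos = acts @ direction / (np.linalg.norm(acts, axis=1) * np.linalg.norm(direction))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def fit_reference(normative_acts):
    """Leading principal direction of the normative set plus a Gaussian fit of its angles."""
    centered = normative_acts - normative_acts.mean(axis=0)
    # The first right singular vector of the centered data is the leading principal component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    theta = deviation_angle(normative_acts, direction)
    return direction, theta.mean(), theta.std()

def anomaly_score(acts, direction, mu, sigma):
    """Gaussian negative log-likelihood of the angle; symmetric around the normative mean."""
    theta = deviation_angle(acts, direction)
    return 0.5 * ((theta - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2.0 * np.pi))

# Synthetic stand-in for 200 normative activations (not real model data).
rng = np.random.default_rng(0)
scale = np.array([0.1, 1.0, 0.1, 0.1])
normative = np.array([5.0, 0.0, 0.0, 0.0]) + rng.normal(size=(200, 4)) * scale
direction, mu, sigma = fit_reference(normative)

queries = np.array([[5.0, 0.0, 0.0, 0.0],   # typical of the normative set
                    [0.0, 5.0, 0.0, 0.0]])  # far from the normative angle
scores = anomaly_score(queries, direction, mu, sigma)
```

Because the score depends only on the squared distance of the angle from the normative mean, it flags deviations in either direction, which matches the direction-agnostic behaviour the abstract motivates.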

36. Kernel Dynamics under Path Entropy Maximization

ArXiv ID: 2603.27880

Authors: Jnaneshwar Das

Abstract: We propose a variational framework in which the kernel function $k : X \times X \to \mathbb{R}$, interpreted as the foundational object encoding what distinctions an agent can represent, is treated as a dynamical variable subject to path entropy maximization (Maximum Caliber, MaxCal). Each kernel defines a representational structure over which an information geometry on probability space may be analyzed; a trajectory through kernel space therefore corresponds to a trajectory through a family of effective geometries, making the optimization landscape endogenous to its own traversal. We formulate fixed-point conditions for self-consistent kernels, propose renormalization group (RG) flow as a structured special case, and suggest neural tangent kernel (NTK) evolution during deep network training as a candidate empirical instantiation. Under explicit information-thermodynamic assumptions, the work required for kernel change is bounded below by $\delta W \geq k_B T \, \delta I_k$, where $\delta I_k$ is the mutual information newly unlocked by the updated kernel. In this view, stable fixed points of MaxCal over kernels correspond to self-reinforcing distinction structures, with biological niches, scientific paradigms, and craft mastery offered as conjectural interpretations. We situate the framework relative to assembly theory and the MaxCal literature, separate formal results from structured correspondences and conjectural bridges, and pose six open questions that make the program empirically and mathematically testable.

Comment: A variational framework for kernel evolution under path entropy maximization is foundational representation-learning theory with explicit links to NTK dynamics.

Relevance: 8 Novelty: 8

37. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

ArXiv ID: 2603.27977

Authors: Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama

Abstract: Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization towards the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gain of 9.1% under PPO and 11.6% under GRPO on math tasks and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond good performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory training and generalized reasoning ability.

Comment: Proposes label-free RL that rewards the topology of reasoning trajectories rather than outcome correctness, directly targeting training dynamics for reasoning models.

Relevance: 8 Novelty: 8

38. Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment

ArXiv ID: 2603.27987

Authors: Tongfei Liu, Yufan Liu, Bing Li, Weiming Hu

Abstract: The high cost and accessibility problems associated with large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. The existing state-of-the-art diffusion-based dataset distillation methods face three issues: lack of theoretical justification, poor efficiency in scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that justifies the use of diffusion models by proving the equivalence between dataset distillation and distribution matching, and reveals an inherent efficiency limit in the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via "Doping", which mixes selected samples from the original dataset with the synthetic samples to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieving SOTA performance for low data volumes, and it extends well to high data volumes, where it reduces the dataset size by nearly half with no performance degradation.

Comment: Gives a theoretical account of diffusion-based dataset distillation as distribution matching and proposes a new concentration framework to overcome distillation efficiency limits.

Relevance: 8 Novelty: 8

39. Gradient Manipulation in Distributed Stochastic Gradient Descent with Strategic Agents: Truthful Incentives with Convergence Guarantees

ArXiv ID: 2603.27962

Authors: Ziqin Chen, Yongqiang Wang

Abstract: Distributed learning has gained significant attention due to its advantages in scalability, privacy, and fault tolerance. In this paradigm, multiple agents collaboratively train a global model by exchanging parameters only with their neighbors. However, a key vulnerability of existing distributed learning approaches is their implicit assumption that all agents behave honestly during gradient updates. In real-world scenarios, this assumption often breaks down, as selfish or strategic agents may be incentivized to manipulate gradients for personal gain, ultimately compromising the final learning outcome. In this work, we propose a fully distributed payment mechanism that, for the first time, guarantees both truthful behaviors and accurate convergence in distributed stochastic gradient descent. This represents a significant advancement, as it overcomes two major limitations of existing truthfulness mechanisms for collaborative learning: (1) reliance on a centralized server for payment collection, and (2) sacrificing convergence accuracy to guarantee truthfulness. In addition to characterizing the convergence rate under general convex and strongly convex conditions, we also prove that our approach guarantees the cumulative gain that an agent can obtain through strategic behavior remains finite, even as the number of iterations approaches infinity--a property unattainable by most existing truthfulness mechanisms. Our experimental results on standard machine learning tasks, evaluated on benchmark datasets, confirm the effectiveness of the proposed approach.

Comment: Proposes a truthful payment mechanism for distributed SGD with strategic agents that still preserves convergence guarantees, directly targeting large-scale training dynamics.

Relevance: 8 Novelty: 8

40. Interpretable Physics Extraction from Data for Linear Dynamical Systems using Lie Generator Networks

ArXiv ID: 2603.27442

Authors: Shafayeth Jamil, Rehan Kapadia

Abstract: When the system is linear, why should learning be nonlinear? Linear dynamical systems, the analytical backbone of control theory, signal processing and circuit analysis, have exact closed-form solutions via the state transition matrix. Yet when system parameters must be inferred from data, recent neural approaches offer flexibility at the cost of physical guarantees: Neural ODEs provide flexible trajectory approximation but may violate physical invariants, while energy preserving architectures do not natively represent dissipation essential to real-world systems. We introduce Lie Generator Networks (LGN), which learn a structured generator A and compute trajectories directly via matrix exponentiation. This shift from integration to exponentiation preserves structure by construction. By parameterizing A = S - D (skew-symmetric minus positive diagonal), stability and dissipation emerge from the underlying architecture and are not introduced during training via the loss function. LGN provides a unified framework for linear conservative, dissipative, and time-varying systems. On a 100-dimensional stable RLC ladder, standard derivative-based least-squares system identification can yield unstable eigenvalues. The unconstrained LGN yields stable but physically incorrect spectra, whereas LGN-SD recovers all 100 eigenvalues with over two orders of magnitude lower mean eigenvalue error than unconstrained alternatives. Critically, these eigenvalues reveal poles, natural frequencies, and damping ratios which are interpretable physics that black-box networks do not provide.

Comment: Structured generator parameterization (A = S - D) gives a mechanistic architectural constraint that enforces stability and dissipation by construction while exposing interpretable spectra.

Relevance: 8 Novelty: 8
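
The A = S - D parameterization from this abstract can be illustrated with a small NumPy sketch. The specifics are assumptions rather than the authors' implementation: the diagonal is kept positive by exponentiating free parameters, the parameters are random instead of learned, and `expm` is a simple scaling-and-squaring Taylor approximation written here to keep the example self-contained.

```python
import numpy as np

def generator(skew_params, log_damping):
    """Build A = S - D with S skew-symmetric and D a positive diagonal matrix."""
    n = log_damping.shape[0]
    upper = np.zeros((n, n))
    upper[np.triu_indices(n, k=1)] = skew_params
    S = upper - upper.T               # S^T = -S by construction
    D = np.diag(np.exp(log_damping))  # exp keeps every diagonal entry positive
    return S - D

def expm(M, terms=20):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(M, ord=1)
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    scaled = M / (2 ** squarings)
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

# Random, untrained parameters for a 4-dimensional system.
rng = np.random.default_rng(0)
n = 4
A = generator(rng.normal(size=n * (n - 1) // 2), rng.normal(size=n))

# Trajectory x(t) = exp(A t) x0 computed by exponentiation rather than numerical integration.
x0 = np.ones(n)
times = [0.0, 0.5, 1.0, 1.5, 2.0]
norms = [float(np.linalg.norm(expm(A * t) @ x0)) for t in times]
eigenvalues = np.linalg.eigvals(A)
```

For any parameter values, x^T A x = -x^T D x < 0 for nonzero x, so the state norm decays along every trajectory and all eigenvalues of A have negative real part. That is the sense in which stability holds by construction instead of being encouraged through the loss.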

41. Variational Neurons in Transformers for Language Modeling

ArXiv ID: 2603.28219

Authors: Yves Ruffenach

Abstract: Transformers for language modeling usually rely on deterministic internal computation, with uncertainty expressed mainly at the output layer. We introduce variational neurons into Transformer feed-forward computation so that uncertainty becomes part of the internal computation itself. Concretely, we replace deterministic feed-forward units with local variational units based on EVE while preserving the overall Transformer backbone. We evaluate this design in compact next-token language-modeling settings. We compare deterministic and variational variants with both predictive and probabilistic criteria. Alongside negative log-likelihood, perplexity and accuracy, we analyze calibration, conditional variance, mutual information and latent-usage statistics. The resulting picture is clear. Variational neurons integrate stably into Transformers, preserve strong predictive performance and produce informative uncertainty signals. The experiments also show that task quality, useful depth and internal stability are distinct properties. These results establish variational Transformers as a practical form of uncertainty-aware language modeling. They show that Transformers can predict with an explicit internal structure of uncertainty, which supports stronger probabilistic evaluation and a more informative analysis of model behavior.

Comment: Architectural mechanism: injects local variational neurons into Transformer feed-forward blocks to make uncertainty part of internal computation.