Personalized Daily ArXiv Papers 2026-02-16

[gpt-5]	Prompt	Completion	Total
Token	39388	42265	81653
Cost	$0.05	$0.42	$0.47

Total arXiv papers: 483

Total scanned papers: 279

Total relevant papers: 22

Table of contents with paper titles:

Stabilizing Native Low-Rank LLM Pretraining Authors: Paul Janson, Edouard Oyallon, Eugene Belilovsky
LCSB: Layer-Cyclic Selective Backpropagation for Memory-Efficient On-Device LLM Fine-Tuning Authors: Juneyoung Park, Eunbeen Yoon, Seongwan Kim, Jaeho Lee
HyperMLP: An Integrated Perspective for Sequence Modeling Authors: Jiecheng Lu, Shihao Yang
SLA2: Sparse-Linear Attention with Learnable Routing and QAT Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers Authors: Anrui Chen, Ruijun Huang, Xin Zhang, Fang Dong, Hengjie Cao, Zhendong Huang, Yifeng Yang, Mengyi Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Li Shang
Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment Authors: Nathanaël Haas, François Gatine, Augustin M Cosse, Zied Bouraoui
SD-MoE: Spectral Decomposition for Effective Expert Specialization Authors: Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang
QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching Authors: Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang
Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models Authors: Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan
Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning Authors: Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee
Rational Neural Networks have Expressivity Advantages Authors: Maosen Tang, Alex Townsend
TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design Authors: Jonghun Lee, Junghoon Lee, Hyeonjin Kim, Seoho Jeon, Jisup Yoon, Hyunbin Park, Meejeong Park, Heonjae Ha
A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models Authors: Mugunthan Shandirasegaran, Hongkang Li, Songyang Zhang, Meng Wang, Shuai Zhang
Transporting Task Vectors across Different Architectures without Training Authors: Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara
Which Algorithms Can Graph Neural Networks Learn? Authors: Solveig Wittig, Antonis Vasileiou, Robert R. Nerem, Timo Stoll, Floris Geerts, Yusu Wang, Christopher Morris
FlashSchNet: Fast and Accurate Coarse-Grained Neural Network Molecular Dynamics Authors: Pingzhi Li, Hongxuan Li, Zirui Liu, Xingcheng Lin, Tianlong Chen
Quantization-Robust LLM Unlearning via Low-Rank Adaptation Authors: João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü
Synaptic Activation and Dual Liquid Dynamics for Interpretable Bio-Inspired Models Authors: Mónika Farsang, Radu Grosu
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation Authors: Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah
The Appeal and Reality of Recycling LoRAs with Adaptive Merging Authors: Haokun Liu, Gyung Hyun Je, Marco Ciccone, Zhenlin Xu, Prasanth YSS, Colin Raffel
SWING: Unlocking Implicit Graph Representations for Graph Random Features Authors: Alessandro Manenti, Avinava Dubey, Arijit Sehanobish, Cesare Alippi, Krzysztof Choromanski

1. Stabilizing Native Low-Rank LLM Pretraining

ArXiv ID: 2602.12429

Authors: Paul Janson, Edouard Oyallon, Eugene Belilovsky

Abstract: Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices without auxiliary "full-rank" guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant factor. To address this, we introduce Spectron: Spectral renormalization with orthogonalization, which dynamically bounds the resultant weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models.

Comment: Low‑rank Architecture/Training: native low‑rank transformer pretraining stabilized by spectral renormalization with orthogonalization (Spectron) and compute‑optimal scaling laws.

Relevance: 10 Novelty: 9
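
Sketch (illustrative, not the paper's Spectron algorithm): the abstract ties instability to the spectral norm (largest singular value) of the update to a factorized weight and bounds it using the spectral norms of the factors. The minimal example below only shows the two ingredients that statement relies on: estimating a spectral norm by power iteration, and the inequality ||AB||_2 <= ||A||_2 ||B||_2. The matrices A and B and all function names are hypothetical, and the all-ones start vector is assumed not to be orthogonal to the top singular vector.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def spectral_norm(M, iters=200):
    """Estimate the largest singular value of M by power iteration on M^T M."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])  # assumed not orthogonal to the top right-singular vector
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    return math.sqrt(sum(x * x for x in Mv))

# Hypothetical rank-2 factorization W = A @ B of a 3x3 weight matrix.
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
B = [[2.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
W = matmul(A, B)

norm_W = spectral_norm(W)
bound = spectral_norm(A) * spectral_norm(B)
```

Here the product of the factor norms is an upper bound on the norm of W, which is why controlling the factors is enough to control the product.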

2. LCSB: Layer-Cyclic Selective Backpropagation for Memory-Efficient On-Device LLM Fine-Tuning

ArXiv ID: 2602.13073

Authors: Juneyoung Park, Eunbeen Yoon, Seongwan Kim, Jaeho Lee

Abstract: Memory-efficient backpropagation (MeBP) has enabled first-order fine-tuning of large language models (LLMs) on mobile devices with less than 1GB memory. However, MeBP requires backward computation through all transformer layers at every step, where weight decompression alone accounts for 32–42% of backward time. We propose Layer-Cyclic Selective Backpropagation (LCSB), which computes gradients for only a subset of layers per step. Our key insight is that residual connections guarantee gradient flow through identity paths, while AdamW momentum provides implicit updates for non-selected layers. We interpret LCSB as Block Coordinate Descent on the LoRA parameter space, providing theoretical justification for convergence. LCSB achieves up to 1.40x speedup with less than 2% quality degradation across five models and three tasks. Surprisingly, in 4-bit quantized settings, LCSB exhibits superior stability: a 3B model that completely diverges under full backpropagation converges smoothly with LCSB, suggesting an implicit regularization effect from selective gradient computation.

Comment: Memory Optimization/Efficiency: selective per-step layer backpropagation for LoRA fine-tuning on-device, with BCD interpretation and improved stability.

Relevance: 10 Novelty: 8
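
Sketch (illustrative): the abstract says LCSB computes gradients for only a subset of layers per step but does not give the selection rule. The example below assumes a plain round-robin window over layer indices, which is one way to read "layer-cyclic"; the function name and parameters are hypothetical.

```python
def cyclic_layer_subset(step, num_layers, layers_per_step):
    """Return the layer indices that receive gradients at this step (assumed round-robin)."""
    start = (step * layers_per_step) % num_layers
    return [(start + i) % num_layers for i in range(layers_per_step)]

# Six layers, two selected per step: the window wraps around after three steps.
schedule = [cyclic_layer_subset(s, num_layers=6, layers_per_step=2) for s in range(4)]

# Every layer is visited once per full cycle of num_layers / layers_per_step steps.
covered = sorted(i for subset in schedule[:3] for i in subset)
```

Under this assumed schedule, each layer is updated by a fresh gradient once per cycle, which matches the Block Coordinate Descent reading in the abstract.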

3. HyperMLP: An Integrated Perspective for Sequence Modeling

ArXiv ID: 2602.12601

Authors: Jiecheng Lu, Shihao Yang

Abstract: Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.

Comment: Matches Model Architecture: reinterprets attention as a dynamic MLP and proposes HyperMLP/HyperGLU with theory and empirical gains over softmax attention.

Relevance: 10 Novelty: 8
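
Sketch (illustrative, not HyperMLP itself): the abstract's starting point is that an autoregressive attention head is a two-layer MLP whose weights come from the context. The example below shows only that reading for a single query: the keys act as first-layer weights, the values as second-layer weights, and the activation is pluggable. The 1/sqrt(d) scaling is omitted for brevity, and all names and numbers are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu(scores):
    return [max(0.0, s) for s in scores]

def dynamic_mlp_head(q, keys, values, activation):
    # First "layer": weights are the keys seen so far, so the hidden width
    # equals the context length and grows with every new token.
    hidden = activation([sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
    # Second "layer": weights are the values, mixed by the hidden activations.
    d_v = len(values[0])
    return [sum(h * v[j] for h, v in zip(hidden, values)) for j in range(d_v)]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

attn_out = dynamic_mlp_head(q, keys, values, softmax)  # standard softmax attention
relu_out = dynamic_mlp_head(q, keys, values, relu)     # un-normalized selection
```

With softmax the hidden layer is a probability distribution over context positions; with ReLU it simply gates which positions contribute, which is the "selection over a memory pool" view the abstract describes.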

4. SLA2: Sparse-Linear Attention with Learnable Routing and QAT

ArXiv ID: 2602.12675

Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez

Abstract: Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.

Comment: Matches Model Architecture and Efficiency: improved sparse-linear attention with learnable routing and quantization-aware training for major speedups while preserving quality.

Relevance: 10 Novelty: 8

5. Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

ArXiv ID: 2602.12587

Authors: Anrui Chen, Ruijun Huang, Xin Zhang, Fang Dong, Hengjie Cao, Zhendong Huang, Yifeng Yang, Mengyi Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Li Shang

Abstract: Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition number $N_{eff}$ and find that higher $N_{eff}$ is associated with larger old-task loss increases after continual training. Motivated by these findings, we propose MH-MoE, which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. On TRACE with Qwen3-0.6B/8B, MH-MoE effectively mitigates forgetting, reducing BWT on Qwen3-0.6B from 11.2% (LoRAMoE) to 4.5%.

Comment: Model Architecture (MoE): identifies a pre-routing bottleneck from multi-head attention causing route collisions and proposes head-wise routing (MH-MoE) to mitigate catastrophic forgetting.

Relevance: 10 Novelty: 8

6. Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment

ArXiv ID: 2602.12384

Authors: Nathanaël Haas, François Gatine, Augustin M Cosse, Zied Bouraoui

Abstract: Understanding why gradient-based training in deep networks exhibits strong implicit bias remains challenging, in part because tractable singular-value dynamics are typically available only for balanced deep linear models. We propose an alternative route based on two theoretically grounded and empirically testable signatures of deep Jacobians: depth-induced exponential scaling of ordered singular values and strong spectral separation. Adopting a fixed-gates view of piecewise-linear networks, where Jacobians reduce to products of masked linear maps within a single activation region, we prove the existence of Lyapunov exponents governing the top singular values at initialization, give closed-form expressions in a tractable masked model, and quantify finite-depth corrections. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Together, these results motivate an approximation regime in which singular-value dynamics become effectively decoupled, mirroring classical balanced deep-linear analyses without requiring balancing. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics, supporting a mechanistic account of emergent low-rank Jacobian structure as a driver of implicit bias.

Comment: Matches Representation Learning: theoretical analysis of deep Jacobian spectra (scaling, separation, and singular-vector alignment) explaining implicit bias and low-rank behavior.

Relevance: 9 Novelty: 9

7. SD-MoE: Spectral Decomposition for Effective Expert Specialization

ArXiv ID: 2602.12556

Authors: Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang

Abstract: Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting effective capacity and model performance. In this work, we analyze parameter and gradient spaces from a spectral perspective and find that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by the ubiquitous low-rank structure of human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurs minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.

Comment: Model Architecture (MoE): spectral decomposition of parameters/gradients to decouple dominant subspaces and improve expert specialization with minimal overhead.

Relevance: 10 Novelty: 7

8. QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching

ArXiv ID: 2602.12609

Authors: Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang

Abstract: Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, owing to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves performance comparable to or better than existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT

Comment: Compression/Efficiency + Low‑rank + Quantization: one‑shot post‑training elastic multi‑bit switching with cascaded low‑rank adapters (MB‑CLoRA) and multi‑bit token merging; supports mixed precision and KV‑cache efficiency.

Relevance: 9 Novelty: 8

9. Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

ArXiv ID: 2602.12618

Authors: Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan

Abstract: Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.

Comment: Compression/Efficiency: attention‑driven self‑compression that progressively reduces vision tokens within the LLM, FlashAttention‑compatible, cutting FLOPs and KV‑cache.

Relevance: 9 Novelty: 8
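
Sketch (illustrative): the abstract describes uniform token downsampling at selected layers, with no importance scores. The example below assumes the vision tokens occupy one contiguous span of the sequence and that "uniform" means keeping every stride-th token; the function name, the span indices, and the stride are hypothetical and not taken from the paper.

```python
def downsample_vision_tokens(tokens, vision_start, vision_end, stride):
    """Keep every `stride`-th token in tokens[vision_start:vision_end]; keep all others."""
    vision = tokens[vision_start:vision_end]
    return tokens[:vision_start] + vision[::stride] + tokens[vision_end:]

# One system token, eight vision tokens, two text tokens.
sequence = ["sys"] + [f"v{i}" for i in range(8)] + ["q0", "q1"]

# First bottleneck layer: 8 vision tokens -> 4.
after_first = downsample_vision_tokens(sequence, 1, 9, stride=2)
# Second bottleneck layer: the vision span is now positions 1..4, so 4 -> 2.
after_second = downsample_vision_tokens(after_first, 1, 5, stride=2)
```

Because the selection depends only on position, nothing here needs access to attention scores, which is consistent with the abstract's claim of FlashAttention compatibility.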

10. Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

ArXiv ID: 2602.13069

Authors: Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee

Abstract: On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6–12GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B–3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with true gradients (cosine similarity ≈ 0.001), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.

Comment: Matches Model Compression and Efficiency/HPC: structured backprop exploiting LoRA low-rank to cut memory with exact gradients for on-device fine-tuning.

Relevance: 9 Novelty: 8
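
Sketch (illustrative, not the paper's implementation): the abstract's key point is that the LoRA projection h = xA need not be stored, because it is cheap to recompute in the backward pass when the rank is small. The example below writes out the standard chain-rule gradients for the LoRA branch y = (xA)B by hand and recomputes h inside the backward function. The frozen base weight and the LoRA scaling factor are left out for brevity, and all names and numbers are hypothetical.

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def lora_forward(x, A, B):
    # Only x needs to be kept for backward; h = xA is discarded after use.
    return matmul(matmul(x, A), B)

def lora_backward(x, A, B, grad_y):
    h = matmul(x, A)                       # recomputed instead of stored
    grad_B = matmul(transpose(h), grad_y)  # dL/dB = h^T g
    grad_h = matmul(grad_y, transpose(B))  # dL/dh = g B^T
    grad_A = matmul(transpose(x), grad_h)  # dL/dA = x^T (g B^T)
    return grad_A, grad_B

x = [[1.0, 2.0]]          # one token, d_in = 2
A = [[1.0], [1.0]]        # d_in x r, with r = 1
B = [[2.0, 3.0]]          # r x d_out, with d_out = 2
grad_y = [[1.0, 1.0]]     # gradient of L = sum(y)

y = lora_forward(x, A, B)
grad_A, grad_B = lora_backward(x, A, B, grad_y)
```

The recomputation costs one extra multiplication of x by A per backward pass, and the gradients are the same as if h had been cached.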

11. Rational Neural Networks have Expressivity Advantages

ArXiv ID: 2602.12390

Authors: Maosen Tang, Alex Townsend

Abstract: We study neural networks with trainable low-degree rational activation functions and show that they are more expressive and parameter-efficient than modern piecewise-linear and smooth activations such as ELU, LeakyReLU, LogSigmoid, PReLU, ReLU, SELU, CELU, Sigmoid, SiLU, Mish, Softplus, Tanh, Softmin, Softmax, and LogSoftmax. For an error target of $\varepsilon>0$, we establish approximation-theoretic separations: Any network built from standard fixed activations can be uniformly approximated on compact domains by a rational-activation network with only $\mathrm{poly}(\log\log(1/\varepsilon))$ overhead in size, while the converse provably requires $\Omega(\log(1/\varepsilon))$ parameters in the worst case. This exponential gap persists at the level of full networks and extends to gated activations and transformer-style nonlinearities. In practice, rational activations integrate seamlessly into standard architectures and training pipelines, allowing rationals to match or outperform fixed activations under identical architectures and optimizers.

Comment: Model Architecture: introduces trainable low-degree rational activation functions with provable expressivity/parameter-efficiency advantages, extending to transformer-style nonlinearities.

Relevance: 9 Novelty: 8

12. TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

ArXiv ID: 2602.12962

Authors: Jonghun Lee, Junghoon Lee, Hyeonjin Kim, Seoho Jeon, Jisup Yoon, Hyunbin Park, Meejeong Park, Heonjae Ha

Abstract: Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant, with rapidly increasing model sizes but a low degree of parameter reuse compared to conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. First, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise from employing such precision. Second, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized hardware for essential nonlinear operations by using a fast and accurate LUT, thereby maximizing performance gains and reducing hardware cost in on-device environments. Finally, taking practical hardware constraints into account, TriGen employs scheduling techniques to maximize computational utilization even under limited on-chip memory capacity. We evaluate the performance of TriGen on various LLMs and show that TriGen achieves an average 2.73x performance speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.

Comment: Compression/Efficiency & HPC: microscaling low-precision compute, LUT-based nonlinear ops, and memory-aware scheduling via SW–HW co-design for LLM inference.

Relevance: 9 Novelty: 7

13. A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models

ArXiv ID: 2602.12499

Authors: Mugunthan Shandirasegaran, Hongkang Li, Songyang Zhang, Meng Wang, Shuai Zhang

Abstract: The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data justify our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.

Comment: Model Architecture: theoretical analysis of selective SSMs (Mamba), showing input-dependent gating performs feature selection and establishing generalization/convergence bounds.

Relevance: 9 Novelty: 7

14. Transporting Task Vectors across Different Architectures without Training

ArXiv ID: 2602.12952

Authors: Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

Abstract: Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains largely unexplored. In this work, we introduce Theseus, a training-free method for transporting task-specific updates across heterogeneous models. Rather than matching parameters directly, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over strong baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically.

Comment: Matches Representation Learning: training-free transport of task vectors across heterogeneous architectures via functional alignment of intermediate representations.

Relevance: 8 Novelty: 8

15. Which Algorithms Can Graph Neural Networks Learn?

ArXiv ID: 2602.13106

Authors: Solveig Wittig, Antonis Vasileiou, Robert R. Nerem, Timo Stoll, Floris Geerts, Yusu Wang, Christopher Morris

Abstract: In recent years, there has been growing interest in understanding neural architectures' ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (message-passing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, existing work is either largely empirical and lacks formal guarantees or it focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes the sufficient conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the 0-1 knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNNs cannot learn them, and we derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman-Ford algorithm, yielding a substantially smaller required training set and significantly extending the recent work of Nerem et al. [2025] by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.

Comment: Representation/Architecture theory: provides conditions for MPNNs to learn algorithms and generalize to arbitrary sizes; includes impossibility results and more expressive MPNN-like variants.

Relevance: 8 Novelty: 8

16. FlashSchNet: Fast and Accurate Coarse-Grained Neural Network Molecular Dynamics

ArXiv ID: 2602.13140

Authors: Pingzhi Li, Hongxuan Li, Zirui Liu, Xingcheng Lin, Tianlong Chen

Abstract: Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulation by learning many-body interactions, but remain slower than classical force fields due to fragmented kernels and memory-bound pipelines that underutilize GPUs. We show that a missing principle is making GNN-MD IO-aware, carefully accounting for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. We present FlashSchNet, an efficient and accurate IO-aware SchNet-style GNN-MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter-add via CSR segment reduce, reducing atomic writes by a factor of feature dimension and enabling contention-free accumulation in both forward and backward passes; (4) channel-wise 16-bit quantization that exploits the low per-channel dynamic range in SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, FlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas on coarse-grained (CG) protein containing 269 beads (6.5x faster than CGSchNet baseline with 80% reduction of peak memory), surpassing classical force fields (e.g. MARTINI) while retaining SchNet-level accuracy and transferability.

Comment: Compression/Efficiency + HPC: IO-aware fused kernels (flash radial basis/message passing/aggregation) and channel‑wise 16‑bit quantization to cut HBM traffic and atomics for GNNs.