Personalized Daily ArXiv Papers 2026-02-23

[gpt-5] usage
Prompt: 38546 tokens, $0.05
Completion: 37428 tokens, $0.37
Total: 75974 tokens, $0.42

Total arXiv papers: 436

Total scanned papers: 260

Total relevant papers: 25

Table of contents with paper titles:

  1. Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

  2. SeedFlood: A Step Toward Scalable Decentralized Training of LLMs Authors: Jihun Kim, Namhoon Lee

  3. ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs Authors: Xinlin Li, Timothy Chou, Josh Fromm, Zichang Liu, Yunjie Pan, Christina Fragouli

  4. Cut Less, Fold More: Model Compression through the Lens of Projection Geometry Authors: Olga Saukh, Dong Wang, Haris Šikić, Yun Cheng, Lothar Thiele

  5. Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers Authors: Mohan Tang, Sidi Lu

  6. Unifying approach to uniform expressivity of graph neural networks Authors: Huan Luo, Jonni Virtema

  7. RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference Authors: Xiuying Wei, Caglar Gulcehre

  8. Bayesian Optimality of In-Context Learning with Selective State Spaces Authors: Di Zhang, Jiaqi Xing

  9. JPmHC Dynamical Isometry via Orthogonal Hyper-Connections Authors: Biswa Sengupta, Jinhua Wang, Leo Brunswic

  10. Calibrated Adaptation: Bayesian Stiefel Manifold Priors for Reliable Parameter-Efficient Fine-Tuning Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

  11. Neural-HSS: Hierarchical Semi-Separable Neural PDE Solver Authors: Pietro Sittoni, Emanuele Zangrando, Angelo A. Casulli, Nicola Guglielmi, Francesco Tudisco

  12. GeneZip: Region-Aware Compression for Long Context DNA Modeling Authors: Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang

  13. Topological Exploration of High-Dimensional Empirical Risk Landscapes: general approach, and applications to phase retrieval Authors: Antoine Maillard, Tony Bonnaire, Giulio Biroli

  14. UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems Authors: Lin Huang, Arthur Jiang, XiaoLi Liu, Zion Wang, Jason Zhao, Chu Wang, HaoCheng Lu, ChengXiang Huang, JiaJun Cheng, YiYue Du, Jia Zhang

  15. Asynchronous Heavy-Tailed Optimization Authors: Junfei Sun, Dixi Yao, Xuchen Gong, Tahseen Rabbani, Manzil Zaheer, Tian Li

  16. Dual Length Codes for Lossless Compression of BFloat16 Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer

  17. Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors Authors: Jingquan Yan, Yuwei Miao, Peiran Yu, Junzhou Huang

  18. A Geometric Probe of the Accuracy-Robustness Trade-off: Sharp Boundaries in Symmetry-Breaking Dimensional Expansion Authors: Yu Bai, Zhe Wang, Jiarui Zhang, Dong-Xiao Zhang, Yinjun Gao, Jun-Jie Zhang

  19. Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning Authors: Sirui Chen, Yunzhe Qi, Mengting Ai, Yifan Sun, Ruizhong Qiu, Jiaru Zou, Jingrui He

  20. Provable Adversarial Robustness in In-Context Learning Authors: Di Zhang

  21. Advection-Diffusion on Graphs: A Bakry-Emery Laplacian for Spectral Graph Neural Networks Authors: Pierre-Gabriel Berlureau, Ali Hariri, Victor Kawasaki-Borruat, Mia Zosso, Pierre Vandergheynst

  22. Learning Long-Range Dependencies with Temporal Predictive Coding Authors: Tom Potter, Oliver Rhodes

  23. Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs Authors: Zachary Coalson, Bo Fang, Sanghyun Hong

  24. PHAST: Port-Hamiltonian Architecture for Structured Temporal Dynamics Forecasting Authors: Shubham Bhardwaj, Chandrajit Bajaj

  25. ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization Authors: João N. Cardoso, Arlindo L. Oliveira, Bruno Martins


1. Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds

ArXiv ID: 2602.17798

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Abstract: Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the tradeoff between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob -- the concentration matrix $\Lambda$ -- that continuously controls routing entropy, replacing discrete top-$k$ selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We formally prove tight bounds relating the Bingham concentration spectrum to routing entropy, expected top-$k$ mass, and an exponential bound on expert collapse, establishing the first formal theory of concentration-controlled sparsity. On synthetic routing tasks, a 350M-parameter MoE language model with 8 experts, a 1.3B-parameter model with 16 experts, and a 2.7B-parameter model with 32 experts, GrMoE achieves 0\% routing collapse across all seeds, comparable or better perplexity with 15--30\% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, providing interpretable routing behavior.

Comment: Model Architecture (MoE): Grassmannian routing with Matrix Bingham distributions enabling concentration-controlled sparsity and provable resistance to expert collapse.

Relevance: 10 Novelty: 9


2. SeedFlood: A Step Toward Scalable Decentralized Training of LLMs

ArXiv ID: 2602.18181

Authors: Jihun Kim, Namhoon Lee

Abstract: This work presents a new approach to decentralized training -- SeedFlood -- designed to scale for large models across complex network topologies and achieve global consensus with minimal communication overhead. Traditional gossip-based methods suffer from message communication costs that grow with model size, while information decay over network hops renders global consensus inefficient. SeedFlood departs from these practices by exploiting the seed-reconstructible structure of zeroth-order updates and effectively making the messages near-zero in size, allowing them to be flooded to every client in the network. This mechanism makes communication overhead negligible and independent of model size, removing the primary scalability bottleneck in decentralized training. Consequently, SeedFlood enables training in regimes previously considered impractical, such as billion-parameter models distributed across hundreds of clients. Our experiments on decentralized LLM fine-tuning demonstrate that SeedFlood consistently outperforms gossip-based baselines in both generalization performance and communication efficiency, and even achieves results comparable to first-order methods in large-scale settings.

Comment: HPC/Distributed Training: seed-reconstructible zeroth-order updates enable near-zero-size messages and model-size-independent communication.

Relevance: 10 Novelty: 9
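
The "seed-reconstructible" idea in the abstract above can be illustrated with a minimal sketch. It assumes a generic two-point zeroth-order estimator on a toy quadratic loss; the function names (`perturbation`, `make_message`, `apply_message`), the loss, and the step sizes are hypothetical and are not taken from the paper. The point shown is only that a message can consist of a seed and one scalar, whatever the parameter count.

```python
import random

def perturbation(seed, dim):
    """Regenerate the same Gaussian direction from a seed on any client."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def loss(params, target):
    # Toy quadratic loss standing in for a model's training loss (assumption).
    return sum((p - t) ** 2 for p, t in zip(params, target))

def make_message(params, target, seed, eps=1e-3):
    """Two-point zeroth-order estimate; the message is just (seed, scalar)."""
    z = perturbation(seed, len(params))
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    g = (loss(plus, target) - loss(minus, target)) / (2 * eps)
    return seed, g

def apply_message(params, message, lr=0.01):
    """A receiver rebuilds the direction from the seed and applies the update."""
    seed, g = message
    z = perturbation(seed, len(params))
    return [p - lr * g * zi for p, zi in zip(params, z)]

# Two clients start from the same parameters and apply the same flooded messages.
target = [1.0, -2.0, 0.5, 3.0]
client_a = [0.0] * 4
client_b = [0.0] * 4
start_loss = loss(client_a, target)
for step in range(200):
    msg = make_message(client_a, target, seed=step)
    client_a = apply_message(client_a, msg)
    client_b = apply_message(client_b, msg)
end_loss = loss(client_a, target)
```

Because both clients replay identical `(seed, scalar)` pairs, their parameters stay identical without any parameter-sized tensor being exchanged. How SeedFlood itself schedules, floods, or aggregates such messages is not specified here.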


3. ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs

ArXiv ID: 2602.17698

Authors: Xinlin Li, Timothy Chou, Josh Fromm, Zichang Liu, Yunjie Pan, Christina Fragouli

Abstract: Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions use irregular fine-grained mixed-precision with high runtime overhead or rely on heuristics or highly constrained precision allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme, powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in the ultra-low-bit regime, without adding runtime overhead.

Comment: Compression/Efficiency: hardware-aligned mixed-precision quantization with block-wise partitioning and global bitwidth allocation under memory budget.

Relevance: 10 Novelty: 8
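
To make "bitwidth allocation under a memory budget" concrete, here is a minimal sketch of a plain greedy allocator. It is a textbook greedy heuristic, not the paper's scalable approximation; the block names, the per-bitwidth error table, and the block size are hypothetical values chosen for illustration.

```python
def greedy_bit_allocation(errors, block_size, budget_bits):
    """Greedy allocation: errors[b][k] is a hypothetical error of block b at k bits."""
    widths = sorted(next(iter(errors.values())))   # candidate bitwidths, e.g. [2, 3, 4]
    alloc = {b: widths[0] for b in errors}         # start every block at the lowest width
    used = sum(bits * block_size for bits in alloc.values())
    while True:
        best, best_gain = None, 0.0
        for b, bits in alloc.items():
            idx = widths.index(bits)
            if idx + 1 == len(widths):
                continue                            # already at the highest width
            nxt = widths[idx + 1]
            cost = (nxt - bits) * block_size
            if used + cost > budget_bits:
                continue                            # upgrade would exceed the budget
            gain = (errors[b][bits] - errors[b][nxt]) / cost  # error drop per extra bit
            if gain > best_gain:
                best, best_gain = (b, nxt, cost), gain
        if best is None:
            return alloc
        b, nxt, cost = best
        alloc[b] = nxt
        used += cost

# Hypothetical sensitivity table: error of each block at 2, 3, and 4 bits.
errors = {
    "attn.q": {2: 9.0, 3: 4.0, 4: 1.0},
    "attn.k": {2: 3.0, 3: 2.5, 4: 2.4},
    "mlp.up": {2: 6.5, 3: 1.0, 4: 0.5},
}
block_size = 128
budget_bits = 3 * len(errors) * block_size         # an average of 3 bits per weight
alloc = greedy_bit_allocation(errors, block_size, budget_bits)
```

In this toy table the insensitive block stays at 2 bits while the most sensitive one is raised to 4, so the average remains at 3 bits. Each greedy step rescans every block, which is the kind of cost a scalable approximation would need to avoid on real models.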


4. Cut Less, Fold More: Model Compression through the Lens of Projection Geometry

ArXiv ID: 2602.18116

Authors: Olga Saukh, Dong Wang, Haris Šikić, Yun Cheng, Lothar Thiele

Abstract: Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate >1000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-high compression. The gap narrows and occasionally reverses at specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.

Comment: Model Compression and Efficiency: geometry-aware, calibration-free compression; formalizes pruning vs low-rank folding as orthogonal projections with theoretical and large-scale empirical support.

Relevance: 10 Novelty: 8
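
The projection view in the abstract above can be sketched on a toy weight matrix. The sketch assumes that pruning zeroes the dropped rows (a diagonal 0/1 projection) and that folding replaces each row by its cluster mean (a block-averaging projection); the matrix, the kept rows, and the cluster labels are hypothetical, and the paper's exact operators may differ.

```python
def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def prune_projection(n, keep):
    """Axis-aligned projection: identity on kept rows, zero elsewhere."""
    keep = set(keep)
    return [[1.0 if i == j and i in keep else 0.0 for j in range(n)]
            for i in range(n)]

def fold_projection(labels):
    """Cluster-averaging projection: each row is replaced by its cluster mean."""
    n = len(labels)
    sizes = {c: labels.count(c) for c in set(labels)}
    return [[1.0 / sizes[labels[i]] if labels[i] == labels[j] else 0.0
             for j in range(n)] for i in range(n)]

def frob_error(w, p):
    """Squared Frobenius norm of W - P W (parameter reconstruction error)."""
    pw = matmul(p, w)
    return sum((w[i][j] - pw[i][j]) ** 2
               for i in range(len(w)) for j in range(len(w[0])))

# Toy layer: 4 neurons (rows) x 2 inputs, with two pairs of similar neurons.
W = [[1.0, 2.0], [1.5, 2.5], [-3.0, 0.5], [-2.5, 0.0]]
P_prune = prune_projection(4, keep=[0, 2])   # rank 2: keep one neuron per pair
P_fold = fold_projection([0, 0, 1, 1])       # rank 2: merge each pair
err_prune = frob_error(W, P_prune)           # 14.75 on this toy matrix
err_fold = frob_error(W, P_fold)             # 0.5 on this toy matrix
```

Both matrices are symmetric and idempotent, so both are orthogonal projections of the same rank. The error gap here reflects how the toy rows were chosen and is not evidence for the paper's general bound.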


5. Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers

ArXiv ID: 2602.17993

Authors: Mohan Tang, Sidi Lu

Abstract: Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token $t$ to the lower layers of token $t+1$. Fine-tuning pre-trained LLMs with our method not only yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic, but also demonstrates that the density of these backward connections is critical; our dense interaction significantly outperforms "sparse" alternatives that only pass a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, all without the necessity to retrain the full model from scratch or sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, also offering a new mechanism to enhance LLMs without significantly affecting generation latency.

Comment: Model Architecture: TurboConn adds dense backward cross-token residuals to increase effective computational depth in Transformers.

Relevance: 10 Novelty: 8


6. Unifying approach to uniform expressivity of graph neural networks

ArXiv ID: 2602.18409

Authors: Huan Luo, Jonni Virtema

Abstract: The expressive power of Graph Neural Networks (GNNs) is often analysed via correspondence to the Weisfeiler-Leman (WL) algorithm and fragments of first-order logic. Standard GNNs are limited to performing aggregation over immediate neighbourhoods or over global read-outs. To increase their expressivity, recent attempts have been made to incorporate substructural information (e.g. cycle counts and subgraph properties). In this paper, we formalize this architectural trend by introducing Template GNNs (T-GNNs), a generalized framework where node features are updated by aggregating over valid template embeddings from a specified set of graph templates. We propose a corresponding logic, Graded template modal logic (GML(T)), and generalized notions of template-based bisimulation and WL algorithm. We establish an equivalence between the expressive power of T-GNNs and GML(T), and provide a unifying approach for analysing GNN expressivity: we show how standard AC-GNNs and their recent variants can be interpreted as instantiations of T-GNNs.

Comment: Model Architecture/Expressivity: introduces Template GNNs with matching logic and equivalence to analyze and unify GNN expressivity.

Relevance: 10 Novelty: 8


7. RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

ArXiv ID: 2602.18196

Authors: Xiuying Wei, Caglar Gulcehre

Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of the attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. However, we find a persistent failure mode -- sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D=16 and drops by about 2-3 points at D=64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsifying to the top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.

Comment: Efficiency: train-dense, infer-sparse attention via recurrence-augmented attention; reduces FLOPs and KV cache with minimal accuracy loss.

Relevance: 9 Novelty: 8


8. Bayesian Optimality of In-Context Learning with Selective State Spaces

ArXiv ID: 2602.17744

Authors: Di Zhang, Jiaqi Xing

Abstract: We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent, constructing tasks with temporally correlated noise where the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be seen as performing implicit ERM, this demonstrates selective SSMs achieve lower asymptotic risk due to superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers. This reframes ICL from "implicit optimization" to "optimal inference," explaining the efficiency of selective SSMs and offering a principled basis for architecture design.

Comment: Model Architecture and Representation Learning: theoretical framing of ICL as Bayes-optimal inference with selective SSMs, separating from ERM/implicit GD and demonstrating statistical efficiency.

Relevance: 9 Novelty: 8


9. JPmHC Dynamical Isometry via Orthogonal Hyper-Connections

ArXiv ID: 2602.18308

Authors: Biswa Sengupta, Jinhua Wang, Leo Brunswic

Abstract: Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M on operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC introduces three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution.

Comment: Model Architecture/Stability: orthogonality-constrained hyper-connections preserving Jacobian spectrum; manifold-constrained mixers with memory-efficient implicit differentiation.

Relevance: 9 Novelty: 8


10. Calibrated Adaptation: Bayesian Stiefel Manifold Priors for Reliable Parameter-Efficient Fine-Tuning

ArXiv ID: 2602.17809

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Abstract: Parameter-efficient fine-tuning methods such as LoRA enable practical adaptation of large language models but provide no principled uncertainty estimates, leading to poorly calibrated predictions and unreliable behavior under domain shift. We introduce Stiefel-Bayes Adapters (SBA), a Bayesian PEFT framework that places a Matrix Langevin prior over orthonormal adapter factors on the Stiefel manifold $\mathrm{St}$ and performs approximate posterior inference via tangent space Laplace approximation with geodesic retraction. Unlike Gaussian priors in flat space projected onto orthogonality constraints, our prior on the manifold naturally encodes the inductive bias that adapter subspaces should be well conditioned and orthogonal, while the posterior provides calibrated predictive uncertainty without recalibration. We prove formally that the tangent space approximation strictly avoids the structural variance inflation inherent in projecting from ambient space, establishing a rigorous theoretical advantage for intrinsic manifold inference. Across GLUE and SuperGLUE benchmarks on RoBERTa-large, LLaMA-2-7B, LLaMA-2-13B, Mistral-7B, and Qwen2.5-7B, domain shift evaluations, selective prediction protocols, and an abstractive summarization task, SBA achieves task performance comparable to LoRA and DoRA while reducing Expected Calibration Error by 18 to 34\% over deterministic baselines, improving selective prediction AUROC by 12 to 25\% under domain shift, and outperforming deep ensembles of five LoRA models on OOD detection at a fraction of the parameter cost. Our results demonstrate that where you place uncertainty, on the right geometric structure, matters more than simply adding any Bayesian treatment to adapters.

Comment: Parameter-Efficient Fine-Tuning: Bayesian adapters with Matrix Langevin priors on the Stiefel manifold for calibrated low-rank adaptation and uncertainty, with intrinsic manifold inference.

Relevance: 9 Novelty: 8


11. Neural-HSS: Hierarchical Semi-Separable Neural PDE Solver

ArXiv ID: 2602.18248

Authors: Pietro Sittoni, Emanuele Zangrando, Angelo A. Casulli, Nicola Guglielmi, Francesco Tudisco

Abstract: Deep learning-based methods have shown remarkable effectiveness in solving PDEs, largely due to their ability to enable fast simulations once trained. However, despite the availability of high-performance computing infrastructure, many critical applications remain constrained by the substantial computational costs associated with generating large-scale, high-quality datasets and training models. In this work, inspired by studies on the structure of Green's functions for elliptic PDEs, we introduce Neural-HSS, a parameter-efficient architecture built upon the Hierarchical Semi-Separable (HSS) matrix structure that is provably data-efficient for a broad class of PDEs. We theoretically analyze the proposed architecture, proving that it satisfies exactness properties even in very low-data regimes. We also investigate its connections with other architectural primitives, such as the Fourier neural operator layer and convolutional layers. We experimentally validate the data efficiency of Neural-HSS on the three-dimensional Poisson equation over a grid of two million points, demonstrating its superior ability to learn from data generated by elliptic PDEs in the low-data regime while outperforming baseline methods. Finally, we demonstrate its capability to learn from data arising from a broad class of PDEs in diverse domains, including electromagnetism, fluid dynamics, and biology.

Comment: Compression/Efficiency + Model Architecture: leverages HSS low-rank structure for parameter/data efficiency, with theoretical guarantees and links to FNO/convolutions.

Relevance: 9 Novelty: 8


12. GeneZip: Region-Aware Compression for Long Context DNA Modeling

ArXiv ID: 2602.17739

Authors: Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang

Abstract: Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.

Comment: Compression/Efficiency: region-aware DNA compression with dynamic routing enabling long-context training with major compute savings.

Relevance: 9 Novelty: 8


13. Topological Exploration of High-Dimensional Empirical Risk Landscapes: general approach, and applications to phase retrieval

ArXiv ID: 2602.17779

Authors: Antoine Maillard, Tony Bonnaire, Giulio Biroli

Abstract: We consider the landscape of empirical risk minimization for high-dimensional Gaussian single-index models (generalized linear models). The objective is to recover an unknown signal $\boldsymbol{\theta}^\star \in \mathbb{R}^d$ (where $d \gg 1$) from a loss function $\hat{R}(\boldsymbol{\theta})$ that depends on pairs of labels $(\mathbf{x}_i \cdot \boldsymbol{\theta}, \mathbf{x}_i \cdot \boldsymbol{\theta}^\star)_{i=1}^n$, with $\mathbf{x}_i \sim \mathcal{N}(0, I_d)$, in the proportional asymptotic regime $n \asymp d$. Using the Kac-Rice formula, we analyze different complexities of the landscape -- defined as the expected number of critical points -- corresponding to various types of critical points, including local minima. We first show that some variational formulas previously established in the literature for these complexities can be drastically simplified, reducing to explicit variational problems over a finite number of scalar parameters that we can efficiently solve numerically. Our framework also provides detailed predictions for properties of the critical points, including the spectral properties of the Hessian and the joint distribution of labels. We apply our analysis to the real phase retrieval problem for which we derive complete topological phase diagrams of the loss landscape, characterizing notably BBP-type transitions where the Hessian at local minima (as predicted by the Kac-Rice formula) becomes unstable in the direction of the signal. We test the predictive power of our analysis to characterize gradient flow dynamics, finding excellent agreement with finite-size simulations of local optimization algorithms, and capturing fine-grained details such as the empirical distribution of labels. Overall, our results open new avenues for the asymptotic study of loss landscapes and topological trivialization phenomena in high-dimensional statistical models.

Comment: Training dynamics/representation theory: Kac–Rice analysis of high-dimensional loss landscapes and Hessian spectra (foundational landscape insights).

Relevance: 9 Novelty: 8


14. UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems

ArXiv ID: 2602.17709

Authors: Lin Huang, Arthur Jiang, XiaoLi Liu, Zion Wang, Jason Zhao, Chu Wang, HaoCheng Lu, ChengXiang Huang, JiaJun Cheng, YiYue Du, Jia Zhang

Abstract: All-atom molecular simulation serves as a quintessential "computational microscope" for understanding the machinery of life, yet it remains fundamentally limited by the trade-off between quantum-mechanical (QM) accuracy and biological scale. We present UBio-MolFM, a universal foundation model framework specifically engineered to bridge this gap. UBio-MolFM introduces three synergistic innovations: (1) UBio-Mol26, a large bio-specific dataset constructed via a multi-fidelity "Two-Pronged Strategy" that combines systematic bottom-up enumeration with top-down sampling of native protein environments (up to 1,200 atoms); (2) E2Former-V2, a linear-scaling equivariant transformer that integrates Equivariant Axis-Aligned Sparsification (EAAS) and Long-Short Range (LSR) modeling to capture non-local physics with up to ~4x higher inference throughput in our large-system benchmarks; and (3) a Three-Stage Curriculum Learning protocol that transitions from energy initialization to energy-force consistency, with force-focused supervision to mitigate energy offsets. Rigorous benchmarking across microscopic forces and macroscopic observables -- including liquid water structure, ionic solvation, and peptide folding -- demonstrates that UBio-MolFM achieves ab initio-level fidelity on large, out-of-distribution biomolecular systems (up to ~1,500 atoms) and realistic MD observables. By reconciling scalability with quantum precision, UBio-MolFM provides a robust, ready-to-use tool for the next generation of computational biology.

Comment: Architecture/Efficiency/HPC: linear-scaling equivariant transformer (E2Former-V2) with sparsification and long–short range modeling for higher throughput.

Relevance: 9 Novelty: 8


15. Asynchronous Heavy-Tailed Optimization

ArXiv ID: 2602.18002

Authors: Junfei Sun, Dixi Yao, Xuchen Gong, Tahseen Rabbani, Manzil Zaheer, Tian Li

Abstract: Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in the centralized or distributed, synchronous setting, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in terms of accuracy/runtime trade-offs and are more robust to hyperparameters in both image and language tasks.

Comment: High-Performance/Distributed Training: asynchronous optimization under heavy-tailed gradient noise with delay-aware scheduling and compensation, with convergence guarantees.

Relevance: 9 Novelty: 7


16. Dual Length Codes for Lossless Compression of BFloat16

ArXiv ID: 2602.17849

Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer

Abstract: Training and serving Large Language Models (LLMs) relies heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression using, e.g., Huffman codes can alleviate the issue; however, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes, e.g., Exponential-Golomb codes, are faster to decode but do not exploit the symbol frequency distributions. To address these limitations, this paper introduces Dual Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. Analyzing BFloat16 tensors from the Gemma model, we observed that the top 8 most frequent symbols account for approximately 50% of the cumulative probability. These 8 symbols are assigned a short 4-bit code. The remaining 248 symbols are assigned a longer 9-bit code. The coding scheme uses a single prefix bit to distinguish between the two code lengths. The scheme uses a small lookup table with only 8 entries for encoding and decoding. The scheme achieves a compressibility of 18.6% in comparison to 21.3% achieved by Huffman codes, but it significantly speeds up decoding and reduces hardware complexity.

Comment: High Performance Computing/Compression: lossless coding scheme for BFloat16 tensors to reduce communication bandwidth with fast decoding and simple hardware.

Relevance: 9 Novelty: 7
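
Illustration (not from the paper): the dual-length scheme described in the abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes 8-bit symbols (8 + 248 = 256), a prefix bit of 0 for the short code and 1 for the long code, a short code made of the prefix plus a 3-bit table index, and a long code made of the prefix plus the raw 8-bit symbol. The function names and the sample "top 8" symbols are hypothetical.

```python
def build_tables(top8):
    """Build the 8-entry lookup tables from the 8 most frequent symbols.

    top8 is a hypothetical input; the paper derives it from tensor statistics.
    """
    assert len(top8) == 8 and len(set(top8)) == 8
    enc = {sym: idx for idx, sym in enumerate(top8)}  # symbol -> 3-bit index
    dec = list(top8)                                  # 3-bit index -> symbol
    return enc, dec


def encode(symbols, enc):
    """Encode 8-bit symbols into a bit string (assumed prefix: 0=short, 1=long)."""
    parts = []
    for sym in symbols:
        if sym in enc:
            parts.append("0" + format(enc[sym], "03b"))  # 4 bits total
        else:
            parts.append("1" + format(sym, "08b"))       # 9 bits total
    return "".join(parts)


def decode(bits, dec):
    """Decode by reading one prefix bit, then a fixed-length payload."""
    out, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            out.append(dec[int(bits[i + 1:i + 4], 2)])
            i += 4
        else:
            out.append(int(bits[i + 1:i + 9], 2))
            i += 9
    return out


def expected_bits(p_short):
    """Average code length when a fraction p_short of symbols is in the top 8."""
    return 4 * p_short + 9 * (1 - p_short)
```

With the roughly 50% coverage quoted in the abstract, `expected_bits(0.5)` gives 6.5 bits per 8-bit symbol, a saving of 1 - 6.5/8 = 18.75%. That is close to the reported 18.6%, assuming compressibility is measured against the 8-bit symbol; the abstract does not state the baseline explicitly.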


17. Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors

ArXiv ID: 2602.17898

Authors: Jingquan Yan, Yuwei Miao, Peiran Yu, Junzhou Huang

Abstract: Attention-based regression models are often trained by jointly optimizing Mean Squared Error (MSE) loss and Pearson correlation coefficient (PCC) loss, emphasizing the magnitude of errors and the order or shape of targets, respectively. A common but poorly understood phenomenon during training is the PCC plateau: PCC stops improving early in training, even as MSE continues to decrease. We provide the first rigorous theoretical analysis of this behavior, revealing fundamental limitations in both optimization dynamics and model capacity. First, in regard to the flattened PCC curve, we uncover a critical conflict where lowering MSE (magnitude matching) can paradoxically suppress the PCC gradient (shape matching). This issue is exacerbated by the softmax attention mechanism, particularly when the data to be aggregated is highly homogeneous. Second, we identify a limitation in the model capacity: we derive a PCC improvement limit for any convex aggregator (including the softmax attention), showing that the convex hull of the inputs strictly bounds the achievable PCC gain. We demonstrate that data homogeneity intensifies both limitations. Motivated by these insights, we propose the Extrapolative Correlation Attention (ECA), which incorporates novel, theoretically-motivated mechanisms to improve the PCC optimization and extrapolate beyond the convex hull. Across diverse benchmarks, including a challenging homogeneous data setting, ECA consistently breaks the PCC plateau, achieving significant improvements in correlation without compromising MSE performance.

Comment: Representation Learning/Training Dynamics: theoretical analysis of attention-based regressors (PCC plateau) and architecture fix (Extrapolative Correlation Attention) addressing softmax/convex-hull limits.

Relevance: 8 Novelty: 8
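
Illustration (not from the paper): the joint MSE + PCC objective and the convex-aggregator constraint mentioned in the abstract can be made concrete with a small sketch. The loss form `MSE + lam * (1 - PCC)`, the weight `lam`, and all function names are assumptions for illustration; the paper's exact objective and the ECA mechanism are not reproduced here.

```python
import math


def mse(pred, target):
    """Mean squared error: sensitive to the magnitude of errors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pcc(pred, target):
    """Pearson correlation: sensitive to shape, invariant to scale and shift."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def joint_loss(pred, target, lam=1.0):
    """Assumed joint objective: MSE plus a weighted (1 - PCC) term."""
    return mse(pred, target) + lam * (1.0 - pcc(pred, target))


def softmax_aggregate(scores, values):
    """Softmax-weighted average: a convex combination of the input values."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total
```

Two properties are visible here. A prediction that is a scaled copy of the target has PCC equal to 1 but nonzero MSE, so the two terms measure different things. And because the softmax weights are non-negative and sum to 1, `softmax_aggregate` can never return a value outside the range of `values`, which is the one-dimensional version of the convex-hull limit the abstract refers to.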


18. A Geometric Probe of the Accuracy-Robustness Trade-off: Sharp Boundaries in Symmetry-Breaking Dimensional Expansion

ArXiv ID: 2602.17948

Authors: Yu Bai, Zhe Wang, Jiarui Zhang, Dong-Xiao Zhang, Yinjun Gao, Jun-Jie Zhang

Abstract: The trade-off between clean accuracy and adversarial robustness is a pervasive phenomenon in deep learning, yet its geometric origin remains elusive. In this work, we utilize Symmetry-Breaking Dimensional Expansion (SBDE) as a controlled probe to investigate the mechanism underlying this trade-off. SBDE expands input images by inserting constant-valued pixels, which breaks translational symmetry and consistently improves clean accuracy (e.g., from 90.47% to 95.63% on CIFAR-10 with ResNet-18) by reducing parameter degeneracy. However, this accuracy gain comes at the cost of reduced robustness against iterative white-box attacks. By employing a test-time mask projection that resets the inserted auxiliary pixels to their training values, we demonstrate that the vulnerability stems almost entirely from the inserted dimensions. The projection effectively neutralizes the attacks and restores robustness, revealing that the model achieves high accuracy by creating sharp boundaries (steep loss gradients) specifically along the auxiliary axes. Our findings provide a concrete geometric explanation for the accuracy-robustness paradox: the optimization landscape deepens the basin of attraction to improve accuracy but inevitably erects steep walls along the auxiliary degrees of freedom, creating a fragile sensitivity to off-manifold perturbations.

Comment: Representation Learning/Training Dynamics: geometric explanation of accuracy–robustness trade-off via symmetry-breaking dimensional expansion and mask projection.

Relevance: 8 Novelty: 7
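
Illustration (not from the paper): the two operations named in the abstract, inserting constant-valued pixels and resetting them at test time, can be sketched as below. The insertion layout (one constant pixel after every original pixel in each row), the fill value, and the function names are all assumptions; the abstract does not specify where the auxiliary pixels go.

```python
def sbde_expand(image, fill=0.0):
    """Insert one constant-valued pixel after every original pixel in each row.

    image is a list of rows (lists of floats). The interleaved layout is an
    assumed example of dimensional expansion, not the paper's exact layout.
    """
    return [[v for px in row for v in (px, fill)] for row in image]


def is_auxiliary(col):
    """True for columns that hold inserted pixels under the assumed layout."""
    return col % 2 == 1


def mask_projection(expanded, fill=0.0):
    """Reset every inserted pixel to its training-time constant.

    Original pixels are left untouched, so any perturbation on them survives;
    only the perturbation along the auxiliary dimensions is removed.
    """
    return [
        [fill if is_auxiliary(col) else v for col, v in enumerate(row)]
        for row in expanded
    ]
```

In use, an attacker perturbs the whole expanded input, and `mask_projection` is applied before the forward pass. If robustness is restored after projection, as the abstract reports, the attack must have been acting mainly through the inserted dimensions.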


19. Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning

ArXiv ID: 2602.17835

Authors: Sirui Chen, Yunzhe Qi, Mengting Ai, Yifan Sun, Ruizhong Qiu, Jiaru Zou, Jingrui He

Abstract: Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits a model's downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce Iprox, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model's influence. Experimental results across diverse LLM families and evaluation tasks show that Iprox consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with Iprox achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, Iprox achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that Iprox provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.

Comment: Compression/Efficiency: low-rank compression plus gradient/logit alignment to build influence-preserving proxies for scalable LLM data selection.

Relevance: 8 Novelty: 7


20. Provable Adversarial Robustness in In-Context Learning

ArXiv ID: 2602.17743

Authors: Di Zhang

Abstract: Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($\rho$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($\rho_{\text{max}} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_\rho - N_0 \propto \rho^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.

Comment: Theory of In-Context Learning: provable adversarial robustness bounds for linear self-attention Transformers under Wasserstein shifts (capacity/sample complexity).

Relevance: 8 Novelty: 7


21. Advection-Diffusion on Graphs: A Bakry-Emery Laplacian for Spectral Graph Neural Networks

ArXiv ID: 2602.18141

Authors: Pierre-Gabriel Berlureau, Ali Hariri, Victor Kawasaki-Borruat, Mia Zosso, Pierre Vandergheynst

Abstract: Graph Neural Networks (GNNs) often struggle to propagate information across long distances due to oversmoothing and oversquashing. Existing remedies such as graph transformers or rewiring typically incur high computational cost or require altering the graph structure. We introduce a Bakry-Emery graph Laplacian that integrates diffusion and advection through a learnable node-wise potential, inducing task-dependent propagation dynamics without modifying topology. This operator has a well-behaved spectral decomposition and acts as a drop-in replacement for standard Laplacians in spectral GNNs. Building on this insight, we develop mu-ChebNet, a spectral architecture that jointly learns the potential and Chebyshev filters, effectively bridging message-passing adaptivity and spectral efficiency. Our theoretical analysis shows how the potential modulates the spectrum, enabling control of key graph properties. Empirically, mu-ChebNet delivers consistent gains on synthetic long-range reasoning tasks, as well as real-world benchmarks, while offering an interpretable routing field that reveals how information flows through the graph. This establishes the Bakry-Emery Laplacian as a principled and efficient foundation for adaptive spectral graph learning.

Comment: Model Architecture (GNNs): Bakry-Emery Laplacian with learnable potential yielding adaptive advection–diffusion in spectral GNNs (mu-ChebNet).

Relevance: 8 Novelty: 7


22. Learning Long-Range Dependencies with Temporal Predictive Coding

ArXiv ID: 2602.18131

Authors: Tom Potter, Oliver Rhodes

Abstract: Predictive Coding (PC) is a biologically-inspired learning framework characterised by local, parallelisable operations, properties that enable energy-efficient implementation on neuromorphic hardware. Despite this, extending PC effectively to recurrent neural networks (RNNs) has been challenging, particularly for tasks involving long-range temporal dependencies. Backpropagation Through Time (BPTT) remains the dominant method for training RNNs, but its non-local computation, lack of spatial parallelism, and requirement to store extensive activation histories result in significant energy consumption. This work introduces a novel method combining Temporal Predictive Coding (tPC) with approximate Real-Time Recurrent Learning (RTRL), enabling effective spatio-temporal credit assignment. Results indicate that the proposed method can closely match the performance of BPTT on both synthetic benchmarks and real-world tasks. On a challenging machine translation task, with a 15-million parameter model, the proposed method achieves a test perplexity of 7.62 (vs. 7.49 for BPTT), marking one of the first applications of tPC to tasks of this scale. These findings demonstrate the potential of this method to learn complex temporal dependencies whilst retaining the local, parallelisable, and flexible properties of the original PC framework, paving the way for more energy-efficient learning systems.

Comment: Training/Efficiency: combines Temporal Predictive Coding with approximate RTRL for local, parallelizable spatio-temporal credit assignment as an alternative to BPTT.

Relevance: 8 Novelty: 7


23. Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

ArXiv ID: 2602.17778

Authors: Zachary Coalson, Bo Fang, Sanghyun Hong

Abstract: Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior (commonly encouraged in multi-turn conversation settings) to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.

Comment: Representation Learning/Mechanistic Interpretability: identifies a universal activation subspace driving clarification-seeking and turn amplification across prompts and models.

Relevance: 8 Novelty: 7


24. PHAST: Port-Hamiltonian Architecture for Structured Temporal Dynamics Forecasting

ArXiv ID: 2602.17998

Authors: Shubham Bhardwaj, Chandrajit Bajaj

Abstract: Real physical systems are dissipative -- a pendulum slows, a circuit loses charge to heat -- and forecasting their dynamics from partial observations is a central challenge in scientific machine learning. We address the position-only (q-only) problem: given only generalized positions $q_t$ at discrete times (momenta $p_t$ latent), learn a structured model that (a) produces stable long-horizon forecasts and (b) recovers physically meaningful parameters when sufficient structure is provided. The port-Hamiltonian framework makes the conservative-dissipative split explicit via $\dot{x}=(J-R)\nabla H(x)$, guaranteeing $dH/dt\le 0$ when $R\succeq 0$. We introduce PHAST (Port-Hamiltonian Architecture for Structured Temporal dynamics), which decomposes the Hamiltonian into potential $V(q)$, mass $M(q)$, and damping $D(q)$ across three knowledge regimes (KNOWN, PARTIAL, UNKNOWN), uses efficient low-rank PSD/SPD parameterizations, and advances dynamics with Strang splitting. Across thirteen q-only benchmarks spanning mechanical, electrical, molecular, thermal, gravitational, and ecological systems, PHAST achieves the best long-horizon forecasting among competitive baselines and enables physically meaningful parameter recovery when the regime provides sufficient anchors. We show that identification is fundamentally ill-posed without such anchors (gauge freedom), motivating a two-axis evaluation that separates forecasting stability from identifiability.

Comment: Model Architecture: introduces a port-Hamiltonian neural architecture with low-rank PSD/SPD parameterizations and stable integrators for long-horizon dynamics.

Relevance: 8 Novelty: 7
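
Illustration (not from the paper): the dissipation guarantee in the abstract follows from the chain rule. With $\dot{x}=(J-R)\nabla H$ and $J$ skew-symmetric, $dH/dt = \nabla H^\top (J-R)\nabla H = -\nabla H^\top R \nabla H \le 0$ whenever $R\succeq 0$. The sketch below checks this numerically on a damped harmonic oscillator, the simplest port-Hamiltonian system. It is a hand-written example with fixed, known parameters, not the PHAST architecture. The constants `m`, `k`, `d`, the function names, and the particular way the step is split (exact damping half-steps around a leapfrog step) are assumptions about how a Strang splitting could be arranged.

```python
import math


def energy(q, p, m=1.0, k=1.0):
    """Hamiltonian H(q, p) = p^2 / (2m) + k q^2 / 2."""
    return p * p / (2 * m) + k * q * q / 2


def strang_step(q, p, dt, m=1.0, k=1.0, d=0.5):
    """One Strang-split step for dq/dt = p/m, dp/dt = -k q - (d/m) p.

    Dissipative part (the R term): dp/dt = -(d/m) p, solved exactly.
    Conservative part (the J term): advanced with a leapfrog step.
    """
    p *= math.exp(-d * dt / (2 * m))  # half step of damping
    p -= 0.5 * dt * k * q             # leapfrog: half kick
    q += dt * p / m                   # leapfrog: drift
    p -= 0.5 * dt * k * q             # leapfrog: half kick
    p *= math.exp(-d * dt / (2 * m))  # half step of damping
    return q, p


def simulate(q0=1.0, p0=0.0, dt=0.01, steps=1000):
    """Return the energy trajectory of the damped oscillator."""
    q, p = q0, p0
    energies = [energy(q, p)]
    for _ in range(steps):
        q, p = strang_step(q, p, dt)
        energies.append(energy(q, p))
    return energies
```

Calling `simulate()` returns an energy trajectory that ends far below its starting value. The leapfrog substep conserves energy only approximately, so the discrete trajectory tracks the continuous-time guarantee without reproducing it exactly at every step.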


25. ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization

ArXiv ID: 2602.17867

Authors: João N. Cardoso, Arlindo L. Oliveira, Bruno Martins

Abstract: Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, which is highly prone to local minima. To overcome these limitations, we introduce ADAPT, a hybrid method combining beam search initialization with adaptive gradient-guided mutation, designed around these failure modes. We evaluate on Sparse Autoencoder latents from Gemma 2 2B, proposing metrics grounded in dataset activation statistics to enable rigorous comparison, and show that ADAPT consistently outperforms prior methods across layers and latent types. Our results establish that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.

Comment: Representation Learning: feature visualization for LLM directions via hybrid prompt optimization tailored to discrete text.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.
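
Illustration (not part of the prompt): a response in this format can be parsed and checked line by line. This is a minimal sketch; the function name, the validation rules, and the error messages are assumptions, not the code the digest actually runs. The sample record reuses the ArXiv ID, comment, and scores shown for paper 16 above.

```python
import json

REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def parse_response(text):
    """Parse a JSONL response: one JSON object per non-empty line."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        for key in ("RELEVANCE", "NOVELTY"):
            if not 1 <= int(record[key]) <= 10:
                raise ValueError(f"line {line_no}: {key} outside 1-10")
        records.append(record)
    return records


sample = json.dumps({
    "ARXIVID": "2602.17849",
    "COMMENT": "High Performance Computing/Compression: lossless coding "
               "scheme for BFloat16 tensors to reduce communication "
               "bandwidth with fast decoding and simple hardware.",
    "RELEVANCE": 9,
    "NOVELTY": 7,
})
papers = parse_response(sample)
```

Validating each line separately means one malformed record is reported with its line number instead of silently corrupting the whole batch.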

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
      • Focus: Fully aligned with core topics with no deviation; score the highest if the paper contains relevant keywords.
      • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)
      • Focus: Retains a solid link to the main research area, though it may touch on peripheral elements.
      • Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.

  • Relevance 5-6 (Borderline)
      • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
      • Examples: Work that references MoE but is centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)
      • Focus: Largely outside our interests with no association to our topics.
      • Examples: Application-focused papers, such as using MoE to solve a real-world problem.

  • Relevance 1-2 (Ignore)
      • Focus: Purely unrelated to our topics. Completely a different domain.
      • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (This is usually a very rare concept that belongs to fundamental research.)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
      • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
      • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)
      • Definition: Substantial insights/enhancements, though not a full paradigm shift.
      • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)
      • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
      • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)
      • Definition: Minor or domain-specific improvements with limited broader impact.
      • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)
      • Definition: Minimal originality, applying standard approaches without real innovation.
      • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture
      • Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures.
      • Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.

  2. Model Compression and Efficiency
      • Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs.
      • Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing
      • Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization.
      • Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning
      • Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
      • Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.