Personalized Daily ArXiv Papers 2026-02-05

[gpt-5]	Prompt	Completion	Total
Token	53923	46726	100649
Cost	$0.07	$0.47	$0.53

Total arXiv papers: 705

Total scanned papers: 391

Total relevant papers: 37

Table of contents with paper titles:

Synthesizable Molecular Generation via Soft-constrained GFlowNets with Rich Chemical Priors Authors: Hyeonah Kim, Minsu Kim, Celine Roget, Dionessa Biton, Louis Vaillancourt, Yves V. Brun, Yoshua Bengio, Alex Hernandez-Garcia
GeoIB: Geometry-Aware Information Bottleneck via Statistical-Manifold Compression Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models Authors: Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong
LoRDO: Distributed Low-Rank Optimization with Infrequent Communication Authors: Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane
Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration Authors: Sudipto Ghosh, Sujoy Nath, Sunny Manchanda, Tanmoy Chakraborty
Online Vector Quantized Attention Authors: Nick Alonso, Tomas Figliolia, Beren Millidge
From Dead Neurons to Deep Approximators: Deep Bernstein Networks as a Provable Alternative to Residual Layers Authors: Ibrahim Albool, Malak Gamal El-Din, Salma Elmalaki, Yasser Shoukry
Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism Authors: Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner
Semantic Rate Distortion and Posterior Design: Compute Constraints, Multimodality, and Strategic Inference Authors: Emrah Akyol
SpecMD: A Comprehensive Study On Speculative Expert Prefetching Authors: Duc Hoang, Ajay Jaiswal, Mohammad Samragh, Minsik Cho
Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models Authors: Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma
Rethinking Weight Tying: Pseudo-Inverse Tying for Stable LM Training and Updates Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Proxy Compression for Language Modeling Authors: Lin Zheng, Xinyu Li, Qian Liu, Xiachong Feng, Lingpeng Kong
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration Authors: Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
Provable Target Sample Complexity Improvements as Pre-Trained Models Scale Authors: Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui
The Key to State Reduction in Linear Attention: A Rank-based Perspective Authors: Philipp Nazari, T. Konstantin Rusch
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning Authors: Nicholas Barnfield, Subhabrata Sen, Pragya Sur
Decomposing Query-Key Feature Interactions Using Contrastive Covariances Authors: Andrew Lee, Yonatan Belinkov, Fernanda Viégas, Martin Wattenberg
Billion-Scale Graph Foundation Models Authors: Maya Bechler-Speicher, Yoel Gottlieb, Andrey Isakov, David Abensur, Ami Tavory, Daniel Haimovich, Ido Guy, Udi Weinsberg
Gradient Flow Through Diagram Expansions: Learning Regimes and Explicit Solutions Authors: Dmitry Yarotsky, Eugene Golikov, Yaroslav Gusev
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models Authors: Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
Rational ANOVA Networks Authors: Jusheng Zhang, Ningyuan Liu, Qinhan Lyu, Jing Yang, Keze Wang
Greedy-Gnorm: A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning Authors: Yuxi Guo, Paul Sheridan
LORE: Jointly Learning the Intrinsic Dimensionality and Relative Similarity Structure From Ordinal Data Authors: Vivek Anand, Alec Helbling, Mark Davenport, Gordon Berman, Sankar Alagapan, Christopher Rozell
Topology-Aware Revival for Efficient Sparse Training Authors: Meiling Jin, Fei Wang, Xiaoyun Yuan, Chen Qian, Yuan Cheng
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model Authors: Blake Bordelon, Francesco Mori
Learning to Reason in 13 Parameters Authors: John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, Saeed Mahloujifar
Supervised Learning as Lossy Compression: Characterizing Generalization and Sample Complexity via Finite Blocklength Analysis Authors: Kosuke Sugiyama, Masato Uchida
Subliminal Effects in Your Data: A General Mechanism via Log-Linearity Authors: Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, Nika Haghtalab
Continual Learning through Control Minimization Authors: Sander de Haan, Yassine Taoudi-Benchekroun, Pau Vilimelis Aceituno, Benjamin F. Grewe
Towards Understanding and Avoiding Limitations of Convolutions on Graphs Authors: Andreas Roth
Fluid Representations in Reasoning Models Authors: Dmitrii Kharlapenko, Alessandro Stolfo, Arthur Conmy, Mrinmaya Sachan, Zhijing Jin
SEIS: Subspace-based Equivariance and Invariance Scores for Neural Representations Authors: Huahua Lin, Katayoun Farrahi, Xiaohao Cai
Finding Structure in Continual Learning Authors: Pourya Shamsolmoali, Masoumeh Zareapoor
MirrorLA: Reflecting Feature Map for Vision Linear Attention Authors: Weikang Meng, Liangyu Huo, Yadan Luo, Yaowei Wang, Yingjian Li, Zheng Zhang
A Hitchhiker's Guide to Poisson Gradient Estimation Authors: Michael Ibrahim, Hanqi Zhao, Eli Sennesh, Zhi Li, Anqi Wu, Jacob L. Yates, Chengrui Li, Hadi Vafaii

1. Synthesizable Molecular Generation via Soft-constrained GFlowNets with Rich Chemical Priors

ArXiv ID: 2602.04119

Authors: Hyeonah Kim, Minsu Kim, Celine Roget, Dionessa Biton, Louis Vaillancourt, Yves V. Brun, Yoshua Bengio, Alex Hernandez-Garcia

Abstract: The application of generative models for experimental drug discovery campaigns is severely limited by the difficulty of designing molecules de novo that can be synthesized in practice. Previous works have leveraged Generative Flow Networks (GFlowNets) to impose hard synthesizability constraints through the design of state and action spaces based on predefined reaction templates and building blocks. Despite the promising prospects of this approach, it currently lacks flexibility and scalability. As an alternative, we propose S3-GFN, which generates synthesizable SMILES molecules via simple soft regularization of a sequence-based GFlowNet. Our approach leverages rich molecular priors learned from large-scale SMILES corpora to steer molecular generation towards high-reward, synthesizable chemical spaces. The model induces constraints through off-policy replay training with a contrastive learning signal based on separate buffers of synthesizable and unsynthesizable samples. Our experiments show that S3-GFN learns to generate synthesizable molecules ($\geq 95\%$) with higher rewards in diverse tasks.

Comment: Author match

2. GeoIB: Geometry-Aware Information Bottleneck via Statistical-Manifold Compression

ArXiv ID: 2602.03906

Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu

Abstract: Information Bottleneck (IB) is widely used, but in deep learning, it is usually implemented through tractable surrogates, such as variational bounds or neural mutual information (MI) estimators, rather than directly controlling the MI I(X;Z) itself. The looseness and estimator-dependent bias can make IB "compression" only indirectly controlled and optimization fragile. We revisit the IB problem through the lens of information geometry and propose a Geometric Information Bottleneck (GeoIB) that dispenses with mutual information (MI) estimation. We show that I(X;Z) and I(Z;Y) admit exact projection forms as minimal Kullback-Leibler (KL) distances from the joint distributions to their respective independence manifolds. Guided by this view, GeoIB controls information compression with two complementary terms: (i) a distribution-level Fisher-Rao (FR) discrepancy, which matches KL to second order and is reparameterization-invariant; and (ii) a geometry-level Jacobian-Frobenius (JF) term that provides a local capacity-type upper bound on I(Z;X) by penalizing pullback volume expansion of the encoder. We further derive a natural-gradient optimizer consistent with the FR metric and prove that the standard additive natural-gradient step is first-order equivalent to the geodesic update. We conducted extensive experiments and observed that GeoIB achieves a better trade-off between prediction accuracy and compression ratio in the information plane than the mainstream IB baselines on popular datasets. GeoIB improves invariance and optimization stability by unifying distributional and geometric regularization under a single bottleneck multiplier. The source code of GeoIB is released at "https://anonymous.4open.science/r/G-IB-0569".

Comment: Representation Learning/Compression — Geometry-aware Information Bottleneck replacing MI estimation with Fisher–Rao and Jacobian-based controls plus natural-gradient updates.

Relevance: 10 Novelty: 9
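
The projection form mentioned in the abstract rests on a textbook identity: I(X;Z) equals the KL divergence from the joint p(x,z) to the product of its own marginals, and no other product distribution is closer in KL. The sketch below checks this on a hypothetical 2x2 joint table. The table values and function names are illustrative assumptions, and nothing here reproduces GeoIB's Fisher-Rao or Jacobian terms.

```python
import math

def marginals(joint):
    """Row and column marginals of a discrete joint table p(x, z)."""
    px = [sum(row) for row in joint]
    pz = [sum(col) for col in zip(*joint)]
    return px, pz

def kl_to_product(joint, qx, qz):
    """KL(p(x,z) || qx(x) * qz(z)) in nats; zero-probability cells add 0."""
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                total += p * math.log(p / (qx[i] * qz[j]))
    return total

def mutual_information(joint):
    """I(X;Z) as the KL from the joint to the product of its marginals."""
    px, pz = marginals(joint)
    return kl_to_product(joint, px, pz)

# Hypothetical correlated joint distribution (illustrative values only).
joint = [[0.4, 0.1], [0.1, 0.4]]
mi = mutual_information(joint)

# Any other product distribution is farther from the joint in KL.
kl_off_marginals = kl_to_product(joint, [0.6, 0.4], [0.5, 0.5])
```

Here `kl_off_marginals` exceeds `mi` because the KL to a product q(x)r(z) decomposes as I(X;Z) + KL(p(x) || q(x)) + KL(p(z) || r(z)).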

3. BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

ArXiv ID: 2602.04163

Authors: Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong

Abstract: Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.

Comment: Model Compression and Efficiency: 2-bit LLM quantization via variable bit-plane grids with second-order refinement and theory.

Relevance: 10 Novelty: 9
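
For context, the "shape-invariant" grid that the abstract contrasts against can be sketched as plain group-wise uniform quantization: each group gets four equally spaced levels between its minimum and maximum. This is a minimal sketch of that baseline only, not of BPDQ; the function names and sample weights are assumptions made for illustration.

```python
def quantize_uniform(group, bits=2):
    """Map each weight to the nearest of 2**bits equally spaced levels."""
    levels = (1 << bits) - 1          # highest code, e.g. 3 for 2-bit
    lo, hi = min(group), max(group)
    # A constant group has no range; any positive scale works then.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in group]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct weights from integer codes on the uniform grid."""
    return [lo + c * scale for c in codes]

# Hypothetical weight group (illustrative values only).
weights = [-0.9, -0.2, 0.05, 0.1, 0.8, 1.5]
codes, lo, scale = quantize_uniform(weights)
recon = dequantize(codes, lo, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recon))
```

Because the spacing is fixed once `lo` and `scale` are chosen, the only freedom per group is an affine shift and stretch. That restriction is what the abstract refers to when it says a fixed grid limits the feasible set for error minimization.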

4. LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

ArXiv ID: 2602.04396

Authors: Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

Abstract: Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.

Comment: High Performance Computing — unifies low-rank optimization with infrequent synchronization for distributed training, reducing optimizer-state communication and restoring subspace exploration.

Relevance: 10 Novelty: 8

5. Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration

ArXiv ID: 2602.04291

Authors: Sudipto Ghosh, Sujoy Nath, Sunny Manchanda, Tanmoy Chakraborty

Abstract: Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and causal attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen-3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution: frequently selected experts often act as interaction hubs with limited causal influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes causal and structural dependencies beyond accuracy metrics alone.

Comment: Model Architecture: analysis of multi-expert (MoE) orchestration and routing with causal attribution disentanglement.

Relevance: 10 Novelty: 8

6. Online Vector Quantized Attention

ArXiv ID: 2602.03922

Authors: Nick Alonso, Tomas Figliolia, Beren Millidge

Abstract: Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, by which OVQ-attention was inspired. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.

Comment: Model Architecture + Efficiency: online vector-quantized attention with linear compute/constant memory and sparse memory updates for long-context tasks.

Relevance: 10 Novelty: 8
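
The "linear compute and constant memory" property of the linear attention baseline can be seen in its recurrent form: the layer keeps a fixed-size state and folds every new key-value pair into it. The sketch below shows that baseline under assumptions of my own (an elu(x)+1 feature map, pure-Python lists, a hypothetical `LinearAttentionState` class). It does not implement OVQ-attention or its sparse memory update.

```python
import math

def feature_map(x):
    """elu(x) + 1, a common positive feature map (an assumption here)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

class LinearAttentionState:
    """Fixed-size recurrent state for causal linear attention."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # sum of phi(k) v^T
        self.z = [0.0] * d_k                        # sum of phi(k)

    def step(self, q, k, v):
        """Fold one (k, v) pair into the state, then read out with q."""
        fk = feature_map(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            for j, vj in enumerate(v):
                self.S[i][j] += fki * vj
        fq = feature_map(q)
        denom = sum(a * b for a, b in zip(fq, self.z))  # positive by construction
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(len(v))
        ]

state = LinearAttentionState(d_k=2, d_v=3)
first = state.step(q=[0.3, -0.1], k=[0.5, 0.2], v=[1.0, 2.0, 3.0])
second = state.step(q=[0.1, 0.4], k=[-0.7, 0.9], v=[0.0, 1.0, 0.0])
```

The state stays at `d_k * d_v + d_k` numbers however many tokens are processed, which is the memory budget the abstract says OVQ-attention enlarges through sparse updates.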

7. From Dead Neurons to Deep Approximators: Deep Bernstein Networks as a Provable Alternative to Residual Layers

ArXiv ID: 2602.04264

Authors: Ibrahim Albool, Malak Gamal El-Din, Salma Elmalaki, Yasser Shoukry

Abstract: Residual connections are the de facto standard for mitigating vanishing gradients, yet they impose structural constraints and fail to address the inherent inefficiencies of piecewise linear activations. We show that Deep Bernstein Networks (which use Bernstein polynomials as activation functions) can act as a residual-free architecture while simultaneously optimizing trainability and representation power. We provide a two-fold theoretical foundation for our approach. First, we derive a theoretical lower bound on the local derivative, proving it remains strictly bounded away from zero. This directly addresses the root cause of gradient stagnation; empirically, our architecture reduces "dead" neurons from 90% in standard deep networks to less than 5%, outperforming ReLU, Leaky ReLU, SeLU, and GeLU. Second, we establish that the approximation error for Bernstein-based networks decays exponentially with depth, a significant improvement over the polynomial rates of ReLU-based architectures. By unifying these results, we demonstrate that Bernstein activations provide a superior mechanism for function approximation and signal flow. Our experiments on HIGGS and MNIST confirm that Deep Bernstein Networks achieve high-performance training without skip-connections, offering a principled path toward deep, residual-free architectures with enhanced expressive capacity.

Comment: Model architecture: Bernstein activation-based deep networks as residual-free alternatives with provable trainability and exponential approximation rates.

Relevance: 10 Novelty: 8
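
As background for the abstract, a Bernstein polynomial of degree n on [0, 1] is a weighted sum of the basis functions B_{k,n}(x) = C(n,k) x^k (1-x)^(n-k). The sketch below evaluates that textbook form with hypothetical coefficients. How the paper parameterizes, scales, or trains its activations is not stated in the abstract, so none of that is modeled here.

```python
import math

def bernstein_basis(n, x):
    """All n+1 Bernstein basis values B_{k,n}(x) for x in [0, 1]."""
    return [math.comb(n, k) * x**k * (1.0 - x) ** (n - k) for k in range(n + 1)]

def bernstein_poly(coeffs, x):
    """Evaluate sum_k coeffs[k] * B_{k,n}(x) with n = len(coeffs) - 1."""
    n = len(coeffs) - 1
    return sum(c * b for c, b in zip(coeffs, bernstein_basis(n, x)))

# Hypothetical coefficients (illustrative values only).
coeffs = [0.0, 0.2, 0.9, 1.0]
samples = [bernstein_poly(coeffs, t / 10) for t in range(11)]
```

Two standard properties are easy to check with this code: the basis values sum to 1 at every x, and the polynomial interpolates its first and last coefficients at x = 0 and x = 1. With non-decreasing coefficients, as chosen here, the polynomial is also non-decreasing.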

8. Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

ArXiv ID: 2602.04870

Authors: Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner

Abstract: Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being $1.11\times$ faster. Our method makes multi-billion-parameter foundation model research more accessible.

Comment: Direct hit on Model Architecture (Mixture-of-Experts) and High Performance Computing: proposes a new MoE variant with deterministic O(1) communication Head Parallelism for distributed training.

Relevance: 10 Novelty: 8

9. Semantic Rate Distortion and Posterior Design: Compute Constraints, Multimodality, and Strategic Inference

ArXiv ID: 2602.03949

Authors: Emrah Akyol

Abstract: We study strategic Gaussian semantic compression under rate and compute constraints, where an encoder and decoder optimize distinct quadratic objectives. A latent Gaussian state generates a task dependent semantic variable, and the decoder best responds via MMSE estimation, reducing the encoder's problem to posterior covariance design under an information rate constraint. We characterize the strategic rate distortion function in direct, remote, and full information regimes, derive semantic waterfilling and rate constrained Gaussian persuasion solutions, and establish Gaussian optimality under misaligned objectives. We further show that architectural compute limits act as implicit rate constraints, yielding exponential improvements in semantic accuracy with model depth and inference time compute, while multimodal observation eliminates the geometric mean penalty inherent to remote encoding. These results provide information theoretic foundations for data and energy efficient AI and offer a principled interpretation of modern multimodal language models as posterior design mechanisms under resource constraints.

Comment: Compression/Efficiency Theory: strategic semantic compression and posterior design under rate and compute constraints with formal characterizations.

Relevance: 9 Novelty: 9

10. SpecMD: A Comprehensive Study On Speculative Expert Prefetching

ArXiv ID: 2602.03921

Authors: Duc Hoang, Ajay Jaiswal, Mohammad Samragh, Minsik Cho

Abstract: Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these various caching policies interact with each other and with different hardware specifications remains poorly understood. To address this gap, we develop SpecMD, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access is not consistent with temporal locality assumptions (e.g., LRU, LFU). Motivated by this observation, we propose Least-Stale, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to $85\times$ over LRU. With such gains, we achieve over $88\%$ hit rates with up to $34.7\%$ Time-to-first-token (TTFT) reduction on OLMoE at only $5\%$ or $0.6GB$ of VRAM cache capacity.

Comment: MoE Efficiency/HPC: standardized expert-caching benchmark and a novel Least-Stale eviction policy tailored to MoE access patterns.

Relevance: 10 Novelty: 7
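
The LRU baseline named in the abstract can be sketched in a few lines: keep a bounded set of expert IDs and evict the one touched longest ago. This is a minimal sketch of generic LRU only. The `LRUExpertCache` class, its counters, and the access trace are hypothetical, and the paper's Least-Stale policy is not reproduced because the abstract does not specify it.

```python
from collections import OrderedDict

class LRUExpertCache:
    """Bounded cache of expert IDs with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # expert_id -> placeholder for weights
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Record one expert lookup; return True on a cache hit."""
        if expert_id in self.entries:
            self.entries.move_to_end(expert_id)  # mark as most recent
            self.hits += 1
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)     # evict least recent
        self.entries[expert_id] = None
        return False

# Hypothetical access trace over three experts with room for two.
cache = LRUExpertCache(capacity=2)
trace = [0, 1, 0, 2, 1]
results = [cache.access(e) for e in trace]
```

In this trace expert 1 is evicted just before it is needed again, which is the kind of miss a policy that anticipates upcoming expert use would try to avoid.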

11. Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models

ArXiv ID: 2602.04019

Authors: Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma

Abstract: As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.

Comment: Model Compression/Efficiency — principled layer selection for PEFT via projected residual view; introduces Layer Card to optimize LoRA placement under compute/performance trade-offs.