Personalized Daily ArXiv Papers 2025-06-25

[gpt-4o]	Prompt	Completion	Total
Token	28271	3381	31652
Cost	$0.07	$0.03	$0.10

Total arXiv papers: 433

Total scanned papers: 278

Total relevant papers: 18

Table of contents with paper titles:

1. Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models. Authors: Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
2. RCStat: A Statistical Framework for using Relative Contextualization in Transformers. Authors: Debabrata Mahapatra, Shubham Agarwal, Apoorv Saxena, Subrata Mitra
3. In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly. Authors: Puneesh Deora, Bhavya Vasudeva, Tina Behnia, Christos Thrampoulidis
4. Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs. Authors: Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn
5. Riemannian generative decoder. Authors: Andreas Bjerregaard, Søren Hauberg, Anders Krogh
6. First-Order Sparse Convex Optimization: Better Rates with Sparse Updates. Authors: Dan Garber
7. Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders. Authors: Matyas Bohacek, Thomas Fel, Maneesh Agrawala, Ekdeep Singh Lubana
8. Who Does What in Deep Learning? Multidimensional Game-Theoretic Attribution of Function of Neural Units. Authors: Shrey Dixit, Kayson Fakhar, Fatemeh Hadaeghi, Patrick Mineault, Konrad P. Kording, Claus C. Hilgetag
9. From memories to maps: Mechanisms of in context reinforcement learning in transformers. Authors: Ching Fang, Kanaka Rajan
10. Finding Clustering Algorithms in the Transformer Architecture. Authors: Kenneth L. Clarkson, Lior Horesh, Takuya Ito, Charlotte Park, Parikshit Ram
11. ProxelGen: Generating Proteins as 3D Densities. Authors: Felix Faltings, Hannes Stark, Regina Barzilay, Tommi Jaakkola
12. The Effect of Depth on the Expressivity of Deep Linear State-Space Models. Authors: Zeyu Bao, Penghao Yu, Haotian Jiang, Qianxiao Li
13. Thought Anchors: Which LLM Reasoning Steps Matter? Authors: Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy
14. Discrepancy-Aware Graph Mask Auto-Encoder. Authors: Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Weigang Lu
15. On the algorithmic construction of deep ReLU networks. Authors: Daan Huybrechs
16. Inference-Time Reward Hacking in Large Language Models. Authors: Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon
17. Tensor-Parallelism with Partially Synchronized Activations. Authors: Itay Lamprecht, Asaf Karnieli, Yair Hanani, Niv Giladi, Daniel Soudry
18. Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models. Authors: Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang

1. Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

ArXiv ID: 2506.18945

Authors: Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu

Abstract: We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
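Illustrative sketch (not the authors' code; their implementation is at the repository linked above): the abstract describes tokens passing through several routing iterations inside one layer, with a dedicated router per iteration and a residual connection between iterations. The NumPy toy below shows that control flow only. The function names, the single-matrix tanh experts, and the softmax-over-top-k gating are assumptions made for illustration.

```python
import numpy as np

def top_k_route(h, router_w, k):
    """Pick the k highest-scoring experts per token and softmax their scores."""
    logits = h @ router_w                                # (tokens, n_experts)
    idx = np.argsort(-logits, axis=-1)[:, :k]            # (tokens, k)
    top = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(top - top.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return idx, gates

def moe_step(h, router_w, expert_ws, k):
    """One standard sparse MoE pass: gated sum of the selected experts' outputs."""
    idx, gates = top_k_route(h, router_w, k)
    out = np.zeros_like(h)
    for t in range(h.shape[0]):
        for j in range(k):
            e = idx[t, j]
            # Hypothetical expert: a single linear map followed by tanh.
            out[t] += gates[t, j] * np.tanh(h[t] @ expert_ws[e])
    return out, idx

def coe_layer(x, router_ws, expert_ws, k):
    """Chain the same expert pool over several iterations, one router each."""
    h, chosen = x, []
    for router_w in router_ws:        # dedicated router per iteration
        out, idx = moe_step(h, router_w, expert_ws, k)
        h = h + out                   # residual between iterations
        chosen.append(idx)            # tokens may pick different experts each time
    return h, chosen

# Toy usage: 5 tokens, width 8, 4 experts, 2 iterations, top-2 routing.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
router_ws = [rng.normal(size=(8, 4)) for _ in range(2)]
expert_ws = rng.normal(size=(4, 8, 8)) * 0.1
h, chosen = coe_layer(x, router_ws, expert_ws, k=2)
```

With a single router in `router_ws`, the sketch reduces to one residual MoE pass, which is the baseline the abstract compares against.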

Comment: The paper introduces a new Mixture-of-Experts architecture, Chain-of-Experts, which is highly relevant to model architecture innovations.

Relevance: 10 Novelty: 9

2. RCStat: A Statistical Framework for using Relative Contextualization in Transformers

ArXiv ID: 2506.19549

Authors: Debabrata Mahapatra, Shubham Agarwal, Apoorv Saxena, Subrata Mitra

Abstract: Prior work on input-token importance in auto-regressive transformers has relied on Softmax-normalized attention weights, which obscure the richer structure of pre-Softmax query-key logits. We introduce RCStat, a statistical framework that harnesses raw attention logits via Relative Contextualization (RC), a random variable measuring contextual alignment between token segments, and derive an efficient upper bound for RC. We demonstrate two applications: (i) Key-Value compression, where RC-based thresholds drive adaptive key-value eviction for substantial cache reduction with minimal quality loss; and (ii) Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level explanations than post-Softmax methods. Across question answering, summarization, and attribution benchmarks, RCStat achieves significant empirical gains, delivering state-of-the-art compression and attribution performance without any model retraining.
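Illustrative sketch (not from the paper, and not the RC statistic itself, which the abstract does not define): one concrete way Softmax normalization can obscure logit structure is that it is invariant to adding a constant to every logit, so the absolute strength of query-key alignment is lost. The logit values below are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of attention logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical pre-Softmax query-key logits for one query over three keys.
weak = np.array([0.5, 0.0, -0.5])   # query barely aligns with any key
strong = weak + 6.0                 # same pattern, much stronger alignment

# Both rows produce identical attention weights, so a method that only
# sees post-Softmax weights cannot tell the two situations apart.
same_weights = np.allclose(softmax(weak), softmax(strong))
```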

Comment: The paper introduces RCStat, a statistical framework built on relative contextualization in transformers. It is relevant to model architecture and, through its key-value cache eviction application, to model compression.