Personalized Daily Arxiv Papers 02/13/2025

	Prompt	Completion	Total
Token	119253	9699	128952
Cost	$0.3	$0.1	$0.4

Total scanned papers: 376

Total relevant papers: 39

Table of contents with paper titles:

Monte Carlo Tree Diffusion for System 2 Planning Authors: Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, Sungjin Ahn
Mixture of Decoupled Message Passing Experts with Entropy Constraint for General Node Classification Authors: Xuanze Chen, Jiajun Zhou, Jinsong Chen, Shanqing Yu, Qi Xuan
Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach Authors: Xu Zhang, Kaidi Xu, Ziqing Hu, Ren Wang
Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline Authors: Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng
Online Scheduling for LLM Inference with KV Cache Constraints Authors: Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving Authors: Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, Chi Jin
Unsupervised categorization of similarity measures Authors: Yoshiyuki Ohmura, Wataru Shimaya, Yasuo Kuniyoshi
LUNAR: LLM Unlearning via Neural Activation Redirection Authors: William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane
LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits Authors: Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
RomanLens: Latent Romanization and its role in Multilinguality in LLMs Authors: Alan Saji (Nilekani Centre at AI4Bharat), Jaavid Aktar Husain (Singapore University of Technology and Design), Thanmay Jayakumar (Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India), Raj Dabre (Nilekani Centre at AI4Bharat, Indian Institute of Technology Bombay, India), Anoop Kunchukuttan (Nilekani Centre at AI4Bharat, Microsoft, India), Mitesh M. Khapra (Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India), Ratish Puduppully (IT University of Copenhagen)
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning Authors: Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, Shaul Druckmann
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Training-Free Restoration of Pruned Neural Networks Authors: Keonho Lee, Minsoo Kim, Dong-Wan Choi
Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning Authors: Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He
Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty Authors: Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun
Scalable Thermodynamic Second-order Optimization Authors: Kaelan Donatella, Samuel Duffield, Denis Melanson, Maxwell Aifer, Phoebe Klett, Rajath Salegame, Zach Belateche, Gavin Crooks, Antonio J. Martinez, Patrick J. Coles
Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization Authors: Yuqiao Wen, Yanshuai Cao, Lili Mou
LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-Tuning Authors: Zhekai Du, Yinjie Min, Jingjing Li, Ke Lu, Changliang Zou, Liuhua Peng, Tingjin Chu, Mingming Gong
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs Authors: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution Authors: Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A. B. Siddique
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension Authors: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
On Mechanistic Circuits for Extractive Question-Answering Authors: Samyadeep Basu, Vlad Morariu, Zichao Wang, Ryan Rossi, Cherry Zhao, Soheil Feizi, Varun Manjunatha
Can Large Language Models Understand Intermediate Representations? Authors: Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
Numerical Schemes for Signature Kernels Authors: Thomas Cass, Francesco Piatti, Jeffrey Pei
Enabling Autoregressive Models to Fill In Masked Tokens Authors: Daniel Israel, Aditya Grover, Guy Van den Broeck
What is a Sketch-and-Precondition Derivation for Low-Rank Approximation? Inverse Power Error or Inverse Power Estimation? Authors: Ruihan Xu, Yiping Lu
Harnessing Language's Fractal Geometry with Recursive Inference Scaling Authors: Ibrahim Alabdulmohsin, Xiaohua Zhai
The Observational Partial Order of Causal Structures with Latent Variables Authors: Marina Maciel Ansanelli, Elie Wolfe, Robert W. Spekkens
ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval Authors: Shubham Gupta, Zichao Li, Tianyi Chen, Cem Subakan, Siva Reddy, Perouz Taslakian, Valentina Zantedeschi
Loss Landscape Analysis for Reliable Quantized ML Models for Scientific Sensing Authors: Tommaso Baldi, Javier Campos, Olivia Weng, Caleb Geniesse, Nhan Tran, Ryan Kastner, Alessandro Biondi
Gradient Based Method for the Fusion of Lattice Quantizers Authors: Liyuan Zhang, Hanzhong Cao, Jiaheng Li, Minyang Yu
Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators Authors: Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim, Jong Hwan Ko
No Data, No Optimization: A Lightweight Method To Disrupt Neural Networks With Sign-Flips Authors: Ido Galil, Moshe Kimhi, Ran El-Yaniv
Automated Consistency Analysis of LLMs Authors: Aditya Patwardhan, Vivek Vaidya, Ashish Kundu
When More is Less: Understanding Chain-of-Thought Length in LLMs Authors: Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, Yisen Wang
EdgeEar: Efficient and Accurate Ear Recognition for Edge Devices Authors: Camile Lendering, Bernardo Perrone Ribeiro, Žiga Emeršič, Peter Peer
Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task Authors: Junjie Wu, Mo Yu, Lemao Liu, Dit-Yan Yeung, Jie Zhou

1. Monte Carlo Tree Diffusion for System 2 Planning

ArXiv ID: 2502.07202

Authors: Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, Sungjin Ahn

Abstract: Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS), whose performance naturally improves with additional test-time computation (TTC), standard diffusion-based planners offer only limited avenues for TTC scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS, such as controlling exploration-exploitation trade-offs, within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as TTC increases.

Comment: Author match

2. Mixture of Decoupled Message Passing Experts with Entropy Constraint for General Node Classification

ArXiv ID: 2502.08083

Authors: Xuanze Chen, Jiajun Zhou, Jinsong Chen, Shanqing Yu, Qi Xuan

Abstract: The varying degrees of homophily and heterophily in real-world graphs persistently constrain the universality of graph neural networks (GNNs) for node classification. Adopting a data-centric perspective, this work reveals an inherent preference of different graphs towards distinct message encoding schemes: homophilous graphs favor local propagation, while heterophilous graphs exhibit preference for flexible combinations of propagation and transformation. To address this, we propose GNNMoE, a universal node classification framework based on the Mixture-of-Experts (MoE) mechanism. The framework first constructs diverse message-passing experts through recombination of fine-grained encoding operators, then designs soft and hard gating layers to allocate the most suitable expert networks for each node's representation learning, thereby enhancing both model expressiveness and adaptability to diverse graphs. Furthermore, considering that soft gating might introduce encoding noise in homophilous scenarios, we introduce an entropy constraint to guide sharpening of soft gates, achieving organic integration of weighted combination and Top-K selection. Extensive experiments demonstrate that GNNMoE significantly outperforms mainstream GNNs, heterophilous GNNs, and graph transformers in both node classification performance and universality across diverse graph datasets.

Comment: The paper proposes a Mixture-of-Experts (MoE) framework for node classification, which is highly relevant to model architecture and MoE research. The entropy constraint adds a novel perspective to MoE design.

Relevance: 10 Novelty: 8
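
Illustrative sketch (not from the paper): the abstract describes sharpening soft gates with an entropy constraint but does not give its exact form. The snippet below assumes the simplest reading, a Shannon-entropy penalty on softmax gate weights added to the task loss. All names (softmax, gate_entropy, soft_gate_mix, regularized_loss, lam) are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_entropy(weights):
    """Shannon entropy (in nats) of the gate weights; 0 * log(0) is treated as 0."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def soft_gate_mix(expert_outputs, gate_logits):
    """Weighted combination of expert outputs for one node (soft gating)."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return mixed, weights


def regularized_loss(task_loss, weights, lam):
    """ASSUMPTION: the constraint is modeled as an additive entropy penalty.

    Minimizing it pushes the gate toward low-entropy (sharper) weights,
    which moves the weighted combination closer to a Top-K style selection.
    """
    return task_loss + lam * gate_entropy(weights)


if __name__ == "__main__":
    uniform = softmax([0.0, 0.0, 0.0, 0.0])  # maximally uncertain gate
    sharp = softmax([8.0, 0.0, 0.0, 0.0])    # gate dominated by one expert
    print("entropy (uniform gate):", gate_entropy(uniform))
    print("entropy (sharp gate):  ", gate_entropy(sharp))
```

Under this assumed form, a uniform gate over four experts has entropy log(4), and the penalty shrinks as one expert dominates, which is the sharpening effect the abstract refers to.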

3. Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach

ArXiv ID: 2502.06832

Authors: Xu Zhang, Kaidi Xu, Ziqing Hu, Ren Wang

Abstract: Mixture of Experts (MoE) models have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy JTDMoE for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods.

Comment: The paper addresses robustness in Mixture of Experts (MoE) models, which directly aligns with the model architecture criterion. The dual-model approach and robustness bounds are novel contributions.

Relevance: 10 Novelty: 8
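
Illustrative sketch (not from the paper): the abstract says the dual-model linearly combines a standard MoE with a robustified MoE using a smoothing parameter. The snippet below assumes a convex combination of the two models' class-probability vectors, with alpha weighting the robust model. Whether the paper mixes logits or probabilities, and which model alpha multiplies, is not stated in the abstract, so both choices here are assumptions, as is the function name.

```python
def dual_model_output(standard_probs, robust_probs, alpha):
    """Convex combination of two models' class probabilities.

    ASSUMPTION: alpha in [0, 1] weights the robust model, so
    alpha = 0 recovers the standard MoE and alpha = 1 the robust MoE.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(standard_probs) != len(robust_probs):
        raise ValueError("both models must score the same classes")
    return [
        (1.0 - alpha) * s + alpha * r
        for s, r in zip(standard_probs, robust_probs)
    ]


if __name__ == "__main__":
    # Hypothetical two-class predictions from each model.
    standard = [0.9, 0.1]
    robust = [0.6, 0.4]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, dual_model_output(standard, robust, alpha))
```

Sweeping alpha between 0 and 1 is one way to picture the robustness-accuracy trade-off the abstract mentions: small values lean on the standard model, large values on the robust one.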

4. Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline

ArXiv ID: 2502.06888

Authors: Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

Abstract: Mixture of Experts (MoE), with its distinctive sparse structure, enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. However, the substantial parameter size presents a challenge for inference, as the expansion in GPU memory cannot keep pace with the growth in parameters. Although offloading techniques utilise memory from the CPU and disk and parallelise the I/O and computation for efficiency, the computation for each expert in MoE models is often less than the I/O, resulting in numerous bubbles in the pipeline. Therefore, we propose Klotski, an efficient MoE inference engine that significantly reduces pipeline bubbles through a novel expert-aware multi-batch pipeline paradigm. The proposed paradigm uses batch processing to extend the computation time of the current layer to overlap with the loading time of the next layer. Although this idea has been effectively applied to dense models, more batches may activate more experts in the MoE, leading to longer loading times and more bubbles. Thus, unlike traditional approaches, we balance computation and I/O time and minimise bubbles by orchestrating their inference orders based on their heterogeneous computation and I/O requirements and activation patterns under different batch numbers. Moreover, to adapt to different hardware environments and models, we design a constraint-sensitive I/O-compute planner and a correlation-aware expert prefetcher for a schedule that minimises pipeline bubbles. Experimental results demonstrate that Klotski achieves a superior throughput-latency trade-off compared to state-of-the-art techniques, with throughput improvements of up to 85.12x.

Comment: The paper proposes Klotski, an efficient MoE inference engine, which directly aligns with the core topic of Mixture-of-Experts and efficiency improvements.

Relevance: 10 Novelty: 8
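
Illustrative sketch (not from the paper): the abstract explains that batching extends the current layer's computation so it overlaps with loading the next layer, and that in MoE models more batches may activate more experts and lengthen loading. The toy timing model below works through that arithmetic. It is not Klotski's planner. The functions, the worst-case rule that each batch activates distinct experts, and all numbers are hypothetical.

```python
def bubble_time(load_time, compute_per_batch, n_batches):
    """Idle time when next-layer loading outlasts current-layer compute."""
    return max(0.0, load_time - compute_per_batch * n_batches)


def moe_load_time(n_batches, experts_per_batch, total_experts, load_per_expert):
    """ASSUMPTION (worst case): every batch activates distinct experts,

    so the number of experts to load grows with the batch count until
    all experts of the layer are needed.
    """
    activated = min(n_batches * experts_per_batch, total_experts)
    return activated * load_per_expert


if __name__ == "__main__":
    # Dense layer: load time is fixed, so more batches hide more I/O.
    for n in (1, 5):
        print("dense, batches =", n, "bubble =", bubble_time(10.0, 2.0, n))

    # MoE layer: load time grows with the number of activated experts,
    # so adding batches does not necessarily shrink the bubble.
    for n in (1, 4):
        load = moe_load_time(n, experts_per_batch=2, total_experts=8,
                             load_per_expert=4.0)
        print("moe, batches =", n, "bubble =", bubble_time(load, 2.0, n))
```

In this toy model the dense bubble vanishes once enough batches are stacked, while the MoE bubble can grow with the batch count, which is the imbalance the abstract says an expert-aware schedule is meant to address.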

5. Online Scheduling for LLM Inference with KV Cache Constraints

ArXiv ID: 2502.07115

Authors: Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou

Abstract: Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose novel batching and scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory. We analyze both semi-online and fully online scheduling models, and our results are threefold. First, we provide a polynomial-time algorithm that achieves exact optimality in terms of average latency in the semi-online prompt arrival model. Second, in the fully online case with a stochastic prompt arrival, we introduce an efficient online scheduling algorithm with constant regret. Third, we prove that no algorithm (deterministic or randomized) can achieve a constant competitive ratio in fully online adversarial settings. Our empirical evaluations on a public LLM inference dataset, using the Llama-70B model on A100 GPUs, show that our approach significantly outperforms benchmark algorithms used currently in practice, achieving lower latency while reducing energy consumption. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.

Comment: The paper addresses KV cache constraints in LLM inference, which is directly relevant to model compression and efficiency. The theoretical scheduling algorithms and empirical results add novelty.