Personalized Daily Arxiv Papers 02/26/2025

[gpt-4o]	Prompt	Completion	Total
Token	58989	8331	67320
Cost	$0.15	$0.08	$0.23

Total ArXiv papers: 625

Total scanned papers: 384

Total relevant papers: 37

Table of contents with paper titles:

Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions Authors: Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
Mechanistic PDE Networks for Discovery of Governing Equations Authors: Adeel Pervez, Efstratios Gavves, Francesco Locatello
How Do Large Language Monkeys Get Their Power (Laws)? Authors: Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo
AMPO: Active Multi-Preference Optimization Authors: Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan
Global law of conjugate kernel random matrices with heavy-tailed weights Authors: Alice Guionnet, Vanessa Piccolo
Graded Neural Networks Authors: Tony Shaska
PICASO: Permutation-Invariant Context Composition with State Space Models Authors: Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Unveiling and Causalizing CoT: A Causal Pespective Authors: Jiarun Fu, Lizhong Ding, Hao Li, Pengqi Li, Qiuning Wei, Xu Chen
The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? Authors: Zhenheng Tang, Xiang Liu, Qian Wang, Peijie Dong, Bingsheng He, Xiaowen Chu, Bo Li
Optimal Brain Apoptosis Authors: Mingyuan Sun, Zheng Fang, Jiaxu Wang, Junjie Jiang, Delei Kong, Chenming Hu, Yuetong Fang, Renjing Xu
CoKV: Optimizing KV Cache Allocation via Cooperative Game Authors: Qiheng Sun, Hongwei Zhang, Haocheng Xia, Jiayao Zhang, Jinfei Liu, Kui Ren
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Authors: Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
C-LoRA: Continual Low-Rank Adaptation for Pre-trained Models Authors: Xin Zhang, Liang Bai, Xian Yang, Jiye Liang
Near-Optimal Approximations for Bayesian Inference in Function Space Authors: Veit Wild, James Wu, Dino Sejdinovic, Jeremias Knoblauch
The Gradient of Algebraic Model Counting Authors: Jaron Maene, Luc De Raedt
Effective Field Neural Network Authors: Xi Liu, Yujun Zhao, Chun Yu Wan, Yang Zhang, Junwei Liu
Scalable Equilibrium Sampling with Sequential Boltzmann Generators Authors: Charlie B. Tan, Avishek Joey Bose, Chen Lin, Leon Klein, Michael M. Bronstein, Alexander Tong
Golden Ratio Mixing of Real and Synthetic Data for Stabilizing Generative Model Training Authors: Hengzhi He, Shirong Xu, Guang Cheng
Aligning Compound AI Systems via System-level DPO Authors: Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Sanmi Koyejo
C-3DPO: Constrained Controlled Classification for Direct Preference Optimization Authors: Kavosh Asadi, Julien Han, Xingzi Xu, Dominique Perrault-Joncas, Shoham Sabach, Karim Bouyarmane, Mohammad Ghavamzadeh
A Fokker-Planck-Based Loss Function that Bridges Dynamics with Density Estimation Authors: Zhixin Lu, Łukasz Kuśmierz, Stefan Mihalas
Knowledge Distillation with Training Wheels Authors: Guanlin Liu, Anand Ramachandran, Tanmay Gangwani, Yan Fu, Abhinav Sethy
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning Authors: Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei
LeanKAN: A Parameter-Lean Kolmogorov-Arnold Network Layer with Improved Memory Efficiency and Convergence Behavior Authors: Benjamin C. Koenig, Suyong Kim, Sili Deng
An Overview of Large Language Models for Statisticians Authors: Wenlong Ji, Weizhe Yuan, Emily Getzen, Kyunghyun Cho, Michael I. Jordan, Song Mei, Jason E Weston, Weijie J. Su, Jing Xu, Linjun Zhang
A General Framework to Enhance Fine-tuning-based LLM Unlearning Authors: Jie Ren, Zhenwei Dai, Xianfeng Tang, Hui Liu, Jingying Zeng, Zhen Li, Rahul Goutam, Suhang Wang, Yue Xing, Qi He, Hui Liu
Synthetic Text Generation for Training Large Language Models via Gradient Matching Authors: Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, Baharan Mirzasoleiman
A Priori Generalizability Estimate for a CNN Authors: Cito Balsells, Beatrice Riviere, David Fuentes
Hallucination Detection in LLMs Using Spectral Features of Attention Maps Authors: Jakub Binkowski, Denis Janiak, Albert Sawczyn, Bogdan Gabrys, Tomasz Kajdanowicz
Representation Engineering for Large-Language Models: Survey and Research Challenges Authors: Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple
Model-Free Adversarial Purification via Coarse-To-Fine Tensor Network Representation Authors: Guang Lin, Duc Thien Nguyen, Zerui Tao, Konstantinos Slavakis, Toshihisa Tanaka, Qibin Zhao
Learning Backbones: Sparsifying Graphs through Zero Forcing for Effective Graph-Based Learning Authors: Obaid Ullah Ahmad, Anwar Said, Mudassir Shabbir, Xenofon Koutsoukos, Waseem Abbas
An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses Authors: Hao Liang, Wanrong Zhang, Xinlei He, Kaishun He, Hong Xing
Scaling LLM Pre-training with Vocabulary Curriculum Authors: Fangyuan Yu
Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm Authors: Andrzej Cichocki
A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models Authors: Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A. Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, Ying Shen, Barry Menglong Yao, Zhiyang Xu, Qin Liu, Yuxiang Zhang, Yan Sun, Shilong Liu, Li Shen, Hongxuan Li, Soheil Feizi, Lifu Huang

1. Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions

ArXiv ID: 2502.18435

Authors: Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly

Abstract: Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability and directional conditional entropy. We ablate the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities and provides theoretical insights into optimal factorization towards approximating human language distribution, and when each reasoning order might be more advantageous.
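To make "alternative factorizations of the text distribution" concrete, here is a minimal sketch that is not taken from the paper. It builds a toy joint distribution over three binary tokens (the weights are hypothetical) and checks that the left-to-right and right-to-left chain-rule products recover the same joint probability. For an exact distribution the two orders agree; any gap between L2R and R2L models would have to come from how well each set of conditionals is learned.

```python
import itertools

# Toy joint distribution over length-3 binary sequences (hypothetical weights).
seqs = list(itertools.product([0, 1], repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {s: w / total for s, w in zip(seqs, weights)}


def marginal(positions, values):
    """Probability that the tokens at `positions` take the given `values`."""
    return sum(
        p for s, p in joint.items()
        if all(s[i] == v for i, v in zip(positions, values))
    )


def chain_prob(seq, order):
    """Chain-rule product of conditionals, visiting positions in `order`."""
    prob = 1.0
    seen = []
    for pos in order:
        seen_values = [seq[i] for i in seen]
        numerator = marginal(seen + [pos], seen_values + [seq[pos]])
        denominator = marginal(seen, seen_values)  # equals 1.0 when nothing is seen yet
        prob *= numerator / denominator
        seen.append(pos)
    return prob


example = (1, 0, 1)
l2r = chain_prob(example, [0, 1, 2])  # p(x1) p(x2|x1) p(x3|x1,x2)
r2l = chain_prob(example, [2, 1, 0])  # p(x3) p(x2|x3) p(x1|x2,x3)
print(f"joint={joint[example]:.6f}  L2R={l2r:.6f}  R2L={r2l:.6f}")
```

Both products match the joint entry for every sequence, so the choice of order is an inductive bias rather than a change to the distribution being modeled.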

Comment: Author match

2. Mechanistic PDE Networks for Discovery of Governing Equations

ArXiv ID: 2502.18377

Authors: Adeel Pervez, Efstratios Gavves, Francesco Locatello

Abstract: We present Mechanistic PDE Networks -- a model for discovery of governing partial differential equations from data. Mechanistic PDE Networks represent spatiotemporal data as space-time dependent linear partial differential equations in neural network hidden representations. The represented PDEs are then solved and decoded for specific tasks. The learned PDE representations naturally express the spatiotemporal dynamics in data in neural network hidden space, enabling increased power for dynamical modeling. Solving the PDE representations in a compute and memory-efficient way, however, is a significant challenge. We develop a native, GPU-capable, parallel, sparse, and differentiable multigrid solver specialized for linear partial differential equations that acts as a module in Mechanistic PDE Networks. Leveraging the PDE solver, we propose a discovery architecture that can discover nonlinear PDEs in complex settings while also being robust to noise. We validate PDE discovery on a number of PDEs, including reaction-diffusion and Navier-Stokes equations.

Comment: The paper proposes Mechanistic PDE Networks for discovering governing equations, which aligns with foundational research in AI for Science and introduces a novel architecture.

Relevance: 9 Novelty: 9

3. How Do Large Language Monkeys Get Their Power (Laws)?

ArXiv ID: 2502.17578

Authors: Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo

Abstract: Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy-tailed such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law -- even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2-4$ orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
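The puzzle in the abstract can be reproduced with a short calculation. The sketch below is an illustration under assumptions, not the authors' code. It assumes single-attempt success probabilities p follow a Beta(a, b) distribution with small a, so that much of the mass sits near zero; the Beta family and the values a = 0.3, b = 3 are hypothetical choices. Each problem's failure rate (1 - p)^k decays exponentially in the number of attempts k, while the mean failure rate E[(1 - p)^k] = B(a, b + k) / B(a, b) decays roughly like k^(-a), which makes the negative log of the aggregate success rate look like a power law.

```python
import math


def per_problem_failure(p, k):
    """Failure rate of one problem after k independent attempts: exponential in k."""
    return (1.0 - p) ** k


def mean_failure_rate(k, a, b):
    """E[(1 - p)^k] for p ~ Beta(a, b), computed as B(a, b + k) / B(a, b)."""
    log_ratio = (
        math.lgamma(b + k) + math.lgamma(a + b)
        - math.lgamma(a + b + k) - math.lgamma(b)
    )
    return math.exp(log_ratio)


def neg_log_pass_at_k(k, a, b):
    """Negative log of the aggregate success rate after k attempts."""
    return -math.log1p(-mean_failure_rate(k, a, b))


def loglog_slope(k_lo, k_hi, a, b):
    """Slope of log(-log pass@k) against log k between two attempt counts."""
    y_lo = neg_log_pass_at_k(k_lo, a, b)
    y_hi = neg_log_pass_at_k(k_hi, a, b)
    return (math.log(y_hi) - math.log(y_lo)) / (math.log(k_hi) - math.log(k_lo))


A, B = 0.3, 3.0  # hypothetical Beta shape parameters
slope = loglog_slope(10_000, 1_000_000, A, B)
print(f"aggregate log-log slope: {slope:.3f} (expected to be near -{A})")
```

Under this assumed distribution, the log-log slope lands close to -a, so the power law exponent is governed by how much probability mass lies near p = 0 rather than by any single problem's behavior.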

Comment: The paper provides theoretical insights into power law scaling in large language models, which aligns with foundational research in LLM behavior and interpretability.