Personalized Daily ArXiv Papers 2025-05-28

[gpt-4o]	Prompt	Completion	Total
Token	46091	6165	52256
Cost	$0.12	$0.06	$0.18

Total arXiv papers: 838

Total scanned papers: 478

Total relevant papers: 42

Table of contents with paper titles:

Bridging Arbitrary and Tree Metrics via Differentiable Gromov Hyperbolicity Authors: Pierre Houedry, Nicolas Courty, Florestan Martin-Baillon, Laetitia Chapel, Titouan Vayer
Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework Authors: Mustafa Hajij, Lennart Bastian, Sarah Osentoski, Hardik Kabaria, John L. Davenport, Sheik Dawood, Balaji Cherukuri, Joseph G. Kocheemoolayil, Nastaran Shahmansouri, Adrian Lew, Theodore Papamarkou, Tolga Birdal
When Shift Happens - Confounding Is to Blame Authors: Abbavaram Gowtham Reddy, Celia Rubio-Madrigal, Rebekka Burkholz, Krikamol Muandet
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders Authors: James Oldfield, Shawn Im, Yixuan Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos
Algorithms and SQ Lower Bounds for Robustly Learning Real-valued Multi-index Models Authors: Ilias Diakonikolas, Giannis Iakovidis, Daniel M. Kane, Lisheng Ren
Kernel Quantile Embeddings and Associated Probability Metrics Authors: Masha Naslidnyk, Siu Lun Chau, François-Xavier Briol, Krikamol Muandet
Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers Authors: Yukun Zhang, Xueqing Zhou
How Do Transformers Learn Variable Binding in Symbolic Programs? Authors: Yiwei Wu, Atticus Geiger, Raphaël Millière
Efficient Large Language Model Inference with Neural Block Linearization Authors: Mete Erdogan, Francesco Tonin, Volkan Cevher
Multi-objective Large Language Model Alignment with Hierarchical Experts Authors: Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, Wenya Wang, Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, Jing Li
Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling Authors: Hovhannes Tamoyan, Subhabrata Dutta, Iryna Gurevych
Leaner Transformers: More Heads, Less Depth Authors: Hemanth Saratchandran, Damien Teney, Simon Lucey
Why Do More Experts Fail? A Theoretical Analysis of Model Merging Authors: Zijing Wang, Xingle Xu, Yongkang Liu, Yiqun Zhang, Peiqin Lin, Shi Feng, Xiaocui Yang, Daling Wang, Hinrich Schütze
Sparsified State-Space Models are Efficient Highway Networks Authors: Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin
Pretraining Language Models to Ponder in Continuous Space Authors: Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Test-Time Learning for Large Language Models Authors: Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning Authors: Nurbek Tastan, Stefanos Laskaridis, Martin Takac, Karthik Nandakumar, Samuel Horvath
Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers Authors: Charles London, Varun Kanade
Who Reasons in the Large Language Models? Authors: Jie Shao, Jianxin Wu
Towards Fully FP8 GEMM LLM Training at Scale Authors: Alejandro Hernández-Cano, Dhia Garbaya, Imanol Schlag, Martin Jaggi
Can Large Reasoning Models Self-Train? Authors: Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette
Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction Authors: Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu
Pretrained LLMs Learn Multiple Types of Uncertainty Authors: Roi Cohen, Omri Fahn, Gerard de Melo
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query Authors: Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
Holes in Latent Space: Topological Signatures Under Adversarial Influence Authors: Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod
Evaluating Training in Binarized Neural Networks Through the Lens of Algorithmic Information Theory Authors: Eduardo Y. Sakabe, Felipe S. Abrahão, Alexandre Simões, Esther Colombini, Paula Costa, Ricardo Gudwin, Hector Zenil
Stochastic Preconditioning for Neural Field Optimization Authors: Selena Ling, Merlin Nimier-David, Alec Jacobson, Nicholas Sharp
Beyond Demonstrations: Dynamic Vector Construction from Latent Representations Authors: Wang Cai, Hsiu-Yuan Huang, Zhixiang Wang, Yunfang Wu
Rotary Masked Autoencoders are Versatile Learners Authors: Uros Zivanovic, Serafina Di Gioia, Andre Scaffidi, Martín de los Rios, Gabriella Contardo, Roberto Trotta
Efficient Identity and Position Graph Embedding via Spectral-Based Random Feature Aggregation Authors: Meng Qin, Jiahong Liu, Irwin King
EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models Authors: Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms Authors: Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang
Simple yet Effective Graph Distillation via Clustering Authors: Yurui Lai, Taiyan Zhang, Renchi Yang
Convergence of Clipped-SGD for Convex $(L_0,L_1)$-Smooth Optimization with Heavy-Tailed Noise Authors: Savelii Chezhegov, Aleksandr Beznosikov, Samuel Horváth, Eduard Gorbunov
HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling Authors: Hexiong Yang, Mingrui Chen, Huaibo Huang, Junxian Duan, Jie Cao, Zhen Zhou, Ran He
Input Convex Kolmogorov Arnold Networks Authors: Thomas Deschatre, Xavier Warin
DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models Authors: Nastaran Saadati, Zhanhong Jiang, Joshua R. Waite, Shreyan Ganguly, Aditya Balu, Chinmay Hegde, Soumik Sarkar
Hardware-Efficient Attention for Fast Decoding Authors: Ted Zadouri, Hubert Strauss, Tri Dao
One-Time Soft Alignment Enables Resilient Learning without Weight Transport Authors: Jeonghwan Cheon, Jaehyuk Bae, Se-Bum Paik
SageAttention2++: A More Efficient Implementation of SageAttention2 Authors: Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen
A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction Authors: Bogdan Bogachov, Yaoyao Fiona Zhao
Position: Adopt Constraints Over Penalties in Deep Learning Authors: Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien

1. Bridging Arbitrary and Tree Metrics via Differentiable Gromov Hyperbolicity

ArXiv ID: 2505.21073

Authors: Pierre Houedry, Nicolas Courty, Florestan Martin-Baillon, Laetitia Chapel, Titouan Vayer

Abstract: Trees and the associated shortest-path tree metrics provide a powerful framework for representing hierarchical and combinatorial structures in data. Given an arbitrary metric space, its deviation from a tree metric can be quantified by Gromov's $\delta$-hyperbolicity. Nonetheless, designing algorithms that bridge an arbitrary metric to its closest tree metric is still an active subject of interest, as most common approaches are either heuristic and lack guarantees, or perform moderately well. In this work, we introduce a novel differentiable optimization framework, coined DeltaZero, that solves this problem. Our method leverages a smooth surrogate for Gromov's $\delta$-hyperbolicity which enables a gradient-based optimization, with a tractable complexity. The corresponding optimization procedure is derived from a problem with better worst-case guarantees than existing bounds, and is justified statistically. Experiments on synthetic and real-world datasets demonstrate that our method consistently achieves state-of-the-art distortion.

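Comment: The paper presents DeltaZero, a differentiable optimization framework that maps an arbitrary metric to a nearby tree metric through a smooth surrogate for Gromov's $\delta$-hyperbolicity. It is theoretical work whose optimization procedure is derived from a problem with better worst-case guarantees than existing bounds.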

Relevance: 9 Novelty: 9
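
To make the quantity in paper 1 concrete, the sketch below computes Gromov's $\delta$-hyperbolicity of a finite metric with the four-point condition (via Gromov products), plus one possible smooth surrogate. The surrogate here replaces max and min with scaled log-sum-exp, which is an assumption chosen for illustration and is not confirmed by the abstract to be the form DeltaZero uses. The function names and the parameter `lam` are hypothetical. The brute-force loop is O(n^4), so it only suits small examples.

```python
import itertools
import math


def gromov_product(d, x, y, w):
    """Gromov product (x|y)_w for a distance matrix d (list of lists)."""
    return 0.5 * (d[x][w] + d[y][w] - d[x][y])


def delta_hyperbolicity(d):
    """Exact four-point delta: max over w, x, y, z of
    min((x|y)_w, (y|z)_w) - (x|z)_w. It is 0 for tree metrics."""
    n = len(d)
    best = 0.0
    for w, x, y, z in itertools.product(range(n), repeat=4):
        gap = min(gromov_product(d, x, y, w), gromov_product(d, y, z, w))
        gap -= gromov_product(d, x, z, w)
        best = max(best, gap)
    return best


def smooth_max(values, lam):
    """Numerically stable (1/lam) * log(sum(exp(lam * v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(lam * (v - m)) for v in values)) / lam


def smooth_min(values, lam):
    return -smooth_max([-v for v in values], lam)


def smooth_delta(d, lam):
    """Assumed surrogate: log-sum-exp in place of max and min.
    It approaches the exact delta as lam grows."""
    n = len(d)
    gaps = []
    for w, x, y, z in itertools.product(range(n), repeat=4):
        inner = smooth_min(
            [gromov_product(d, x, y, w), gromov_product(d, y, z, w)], lam
        )
        gaps.append(inner - gromov_product(d, x, z, w))
    return smooth_max(gaps, lam)
```

On a path graph (a tree) the exact value is 0, while the 4-cycle with unit edges gives 1. Because the surrogate is built from smooth operations, it can be differentiated with respect to the entries of `d`, which is what makes gradient-based fitting possible in principle.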

2. Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework

ArXiv ID: 2505.21251

Authors: Mustafa Hajij, Lennart Bastian, Sarah Osentoski, Hardik Kabaria, John L. Davenport, Sheik Dawood, Balaji Cherukuri, Joseph G. Kocheemoolayil, Nastaran Shahmansouri, Adrian Lew, Theodore Papamarkou, Tolga Birdal

Abstract: We introduce copresheaf topological neural networks (CTNNs), a powerful and unifying framework that encapsulates a wide spectrum of deep learning architectures, designed to operate on structured data: including images, point clouds, graphs, meshes, and topological manifolds. While deep learning has profoundly impacted domains ranging from digital assistants to autonomous systems, the principled design of neural architectures tailored to specific tasks and data types remains one of the field's most persistent open challenges. CTNNs address this gap by grounding model design in the language of copresheaves, a concept from algebraic topology that generalizes and subsumes most practical deep learning models in use today. This abstract yet constructive formulation yields a rich design space from which theoretically sound and practically effective solutions can be derived to tackle core challenges in representation learning: long-range dependencies, oversmoothing, heterophily, and non-Euclidean domains. Our empirical results on structured data benchmarks demonstrate that CTNNs consistently outperform conventional baselines, particularly in tasks requiring hierarchical or localized sensitivity. These results underscore CTNNs as a principled, multi-scale foundation for the next generation of deep learning architectures.

Comment: The paper introduces copresheaf topological neural networks (CTNNs), a framework grounded in copresheaves that generalizes many existing deep learning architectures for structured data, which makes it relevant to architectural innovation.

Relevance: 9 Novelty: 9
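
As a rough illustration of the copresheaf idea in paper 2: a copresheaf on a directed graph assigns a vector space to each node and a linear map to each edge, carrying features from the source space to the target space. The sketch below applies one round of sum-aggregation message passing with such edge maps. The aggregation rule, the function names, and the data layout are assumptions made for illustration, not the CTNN architecture itself.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def copresheaf_message_passing(features, edge_maps):
    """One round of message passing on a directed graph.

    features:  dict node -> feature vector (dimensions may differ per node)
    edge_maps: dict (source, target) -> matrix mapping the source node's
               space into the target node's space
    Each node keeps its own feature and adds the transported features
    of its in-neighbors.
    """
    updated = {node: list(vec) for node, vec in features.items()}
    for (source, target), rho in edge_maps.items():
        message = mat_vec(rho, features[source])
        updated[target] = [h + m for h, m in zip(updated[target], message)]
    return updated
```

In this sketch, setting every edge map to the identity recovers plain sum aggregation over neighbors; letting the maps vary per edge is what allows node spaces of different dimensions and direction-dependent transport.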

3. When Shift Happens - Confounding Is to Blame

ArXiv ID: 2505.21422

Authors: Abbavaram Gowtham Reddy, Celia Rubio-Madrigal, Rebekka Burkholz, Krikamol Muandet

Abstract: Distribution shifts introduce uncertainty that undermines the robustness and generalization capabilities of machine learning models. While conventional wisdom suggests that learning causal-invariant representations enhances robustness to such shifts, recent empirical studies present a counterintuitive finding: (i) empirical risk minimization (ERM) can rival or even outperform state-of-the-art out-of-distribution (OOD) generalization methods, and (ii) its OOD generalization performance improves when all available covariates, not just causal ones, are utilized. Drawing on both empirical and theoretical evidence, we attribute this phenomenon to hidden confounding. Shifts in hidden confounding induce changes in data distributions that violate assumptions commonly made by existing OOD generalization approaches. Under such conditions, we prove that effective generalization requires learning environment-specific relationships, rather than relying solely on invariant ones. Furthermore, we show that models augmented with proxies for hidden confounders can mitigate the challenges posed by hidden confounding shifts. These findings offer new theoretical insights and practical guidance for designing robust OOD generalization algorithms and principled covariate selection strategies.

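Comment: The paper attributes the strong OOD performance of ERM to shifts in hidden confounding and proves that generalization under such shifts requires learning environment-specific relationships, which makes it relevant to foundational research on distribution shift.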

Relevance: 9 Novelty: 9
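
The claim in paper 3 that proxies for hidden confounders help under confounding shift can be illustrated with a toy simulation. Everything below is an assumption made for illustration: a linear structural model in which a hidden variable U drives both X and Y, a noisy proxy W of U, and a shift in the mean of U between the training and test environments. None of the coefficients or names come from the paper.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, targets):
    """Least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    gram = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    moment = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear(gram, moment)


def mse(weights, rows, targets):
    """Mean squared error of the fitted linear model on (rows, targets)."""
    total = 0.0
    for r, y in zip(rows, targets):
        pred = weights[0] + sum(w * v for w, v in zip(weights[1:], r))
        total += (pred - y) ** 2
    return total / len(targets)


def sample_environment(rng, confounder_mean, n):
    """Toy model: hidden U drives X and Y; W is a noisy proxy of U."""
    data = []
    for _ in range(n):
        u = rng.gauss(confounder_mean, 1.0)
        x = u + rng.gauss(0.0, 1.0)
        w = u + rng.gauss(0.0, 0.3)
        y = x + 2.0 * u + rng.gauss(0.0, 0.1)
        data.append((x, w, y))
    return data


def compare_models(seed=0, n=2000):
    """Return test MSE of (X only, X plus proxy W) under a confounder shift."""
    rng = random.Random(seed)
    train = sample_environment(rng, confounder_mean=0.0, n=n)
    test = sample_environment(rng, confounder_mean=3.0, n=n)
    y_train = [y for _, _, y in train]
    y_test = [y for _, _, y in test]
    plain = fit_ols([(x,) for x, _, _ in train], y_train)
    proxy = fit_ols([(x, w) for x, w, _ in train], y_train)
    return (
        mse(plain, [(x,) for x, _, _ in test], y_test),
        mse(proxy, [(x, w) for x, w, _ in test], y_test),
    )
```

Calling `compare_models()` returns the two test errors. In this setup the model that also sees the proxy should have a much lower error in the shifted environment, because the X-only model absorbs the effect of U into its coefficient on X and that absorbed relationship breaks when the mean of U moves. The proxy reduces the error but does not remove it, since W measures U with noise.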

4. Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

ArXiv ID: 2505.21364

Authors: James Oldfield, Shawn Im, Yixuan Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos

Abstract: Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
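
The abstract above does not spell out the tensor factorization, so the following is only a sketch of one parameterization that gives every sublayer full-rank weights at low cost. It assumes each expert's weight matrix has the form diag(c_n) D, with a shared matrix D and a per-expert scaling vector c_n; this form, the top-k gate, and all names are assumptions for illustration and may differ from the released MxD code. Under that assumption, the sparse mixture of expert outputs equals an elementwise product, so no per-expert matrix needs to be formed.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_k_gate(scores, k):
    """Layer-level sparsity: keep the k largest gate scores, zero the rest."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


def mixture_explicit(z, gate, shared, scales):
    """Materialize W_n = diag(c_n) @ D for each active expert and sum outputs."""
    out = [0.0] * len(shared)
    for a_n, c_n in zip(gate, scales):
        if a_n == 0.0:
            continue  # inactive expert contributes nothing
        w_n = [[c * d for d in row] for c, row in zip(c_n, shared)]
        y_n = mat_vec(w_n, z)
        out = [o + a_n * y for o, y in zip(out, y_n)]
    return out


def mixture_factored(z, gate, shared, scales):
    """Same output without forming any W_n: (sum_n a_n * c_n) * (D @ z)."""
    mixed = [
        sum(a_n * c_n[j] for a_n, c_n in zip(gate, scales))
        for j in range(len(shared))
    ]
    return [s * v for s, v in zip(mixed, mat_vec(shared, z))]
```

In this parameterization each W_n is full-rank whenever D is full-rank and c_n has no zero entries, since diag(c_n) is then invertible. The factored form costs one shared matrix-vector product plus a weighted sum of scaling vectors, regardless of how many experts exist.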

Comment: The paper introduces Mixture of Decoders (MxDs) for interpretable dense layer decomposition, relevant to model architecture and sparsity.