Personalized Daily Arxiv Papers 03/05/2025

[gpt-4o]	Prompt	Completion	Total
Token	49494	7131	56625
Cost	$0.12	$0.07	$0.19

Total ArXiv papers: 643

Total scanned papers: 375

Total relevant papers: 35

Table of contents with paper titles:

Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer Authors: Yujiao Yang, Jing Lian, Linhui Li
Deep Learning is Not So Mysterious or Different Authors: Andrew Gordon Wilson
Neural Manifolds and Cognitive Consistency: A New Approach to Memory Consolidation in Artificial Systems Authors: Phuong-Nam Nguyen
A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff Authors: Hao Yu, Xiangyang Ji
Identifying Sensitive Weights via Post-quantization Integral Authors: Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen
Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees Authors: Emma Ceccherini, Ian Gallagher, Andrew Jones, Daniel Lawson
Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization Authors: Leonid Berlyand, Theo Bourdais, Houman Owhadi, Yitzchak Shmalo
CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging Authors: Zongzhen Yang, Binhang Qi, Hailong Sun, Wenrui Long, Ruobing Zhao, Xiang Gao
Weak-to-Strong Generalization Even in Random Feature Networks, Provably Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro
A Theory of Initialisation's Impact on Specialisation Authors: Devon Jarvis, Sebastian Lee, Clémentine Carla Juliette Dominé, Andrew M Saxe, Stefano Sarao Mannelli
An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition Authors: Qingsong Wang, Yunfei Qu, Chunfeng Cui, Deren Han
Forgetting Transformer: Softmax Attention with a Forget Gate Authors: Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression Authors: Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric de la Clergerie, Benoît Sagot
(How) Do Language Models Track State? Authors: Belinda Z. Li, Zifan Carl Guo, Jacob Andreas
CrystalFramer: Rethinking the Role of Frames for SE(3)-Invariant Crystal Structure Modeling Authors: Yusei Ito, Tatsunori Taniai, Ryo Igarashi, Yoshitaka Ushiku, Kanta Ono
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs Authors: Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi
PaCA: Partial Connection Adaptation for Efficient Fine-Tuning Authors: Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon
The Distributionally Robust Optimization Model of Sparse Principal Component Analysis Authors: Lei Wang, Xin Liu, Xiaojun Chen
Weight transport through spike timing for robust local gradients Authors: Timo Gierlich, Andreas Baumbach, Akos F. Kungl, Kevin Max, Mihai A. Petrovici
A Minimalist Example of Edge-of-Stability and Progressive Sharpening Authors: Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao
Spike-and-Slab Posterior Sampling in High Dimensions Authors: Syamantak Kumar, Purnamrita Sarkar, Kevin Tian, Yusong Zhu
Elliptic Loss Regularization Authors: Ali Hasan, Haoming Yang, Yuting Ng, Vahid Tarokh
Mathematical Foundation of Interpretable Equivariant Surrogate Models Authors: Jacopo Joy Colombini, Filippo Bonchi, Francesco Giannini, Fosca Giannotti, Roberto Pellungrini, Patrizio Frosini
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models Authors: Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis Authors: Long Cheng, Qichen Liao, Fan Wu, Junlin Mu, Tengfei Han, Zhe Qiu, Lianqiang Li, Tianyi Liu, Fangzheng Miao, Keming Gao, Liang Wang, Zhen Zhang, Qiande Yin
Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach Authors: Zhihua Duan, Jialin Wang
Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification Authors: Xiaohan Zhu, Nathan Srebro
On the Relationship Between Double Descent of CNNs and Shape/Texture Bias Under Learning Process Authors: Shun Iwase, Shuya Takahashi, Nakamasa Inoue, Rio Yokota, Ryo Nakamura, Hirokatsu Kataoka
Linear Representations of Political Perspective Emerge in Large Language Models Authors: Junsol Kim, James Evans, Aaron Schein
MindBridge: Scalable and Cross-Model Knowledge Editing via Memory-Augmented Modality Authors: Shuaike Li, Kai Zhang, Qi Liu, Enhong Chen
Sharpness-Aware Minimization: General Analysis and Improved Rates Authors: Dimitris Oikonomou, Nicolas Loizou
VAEs and GANs: Implicitly Approximating Complex Distributions with Simple Base Distributions and Deep Neural Networks -- Principles, Necessity, and Limitations Authors: Yuan-Hao Wei
Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks Authors: Chia-Wei Hsu, Nien-Ti Tsou, Yu-Cheng Chen, Yang Jeong Park, Ju Li
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale Authors: Jerome Ku, Eric Nguyen, David W. Romero, Garyk Brixi, Brandon Yang, Anton Vorontsov, Ali Taghibakhshi, Amy X. Lu, Dave P. Burke, Greg Brockman, Stefano Massaroli, Christopher Ré, Patrick D. Hsu, Brian L. Hie, Stefano Ermon, Michael Poli
Unnatural Languages Are Not Bugs but Features for LLMs Authors: Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J. Zico Kolter, Michael Qizhe Shieh

1. Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

ArXiv ID: 2503.02495

Authors: Yujiao Yang, Jing Lian, Linhui Li

Abstract: Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, each expert in existing MoE paradigms works as an individual, so these paradigms lack high-quality expert interactions. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes a transformer into an equitant group of experts and then implements dynamic routing on input data and experts. Our approach advances MoE design with four key innovations: (1) We conduct equitant expert decomposition on both MLP blocks and attention blocks based on matrix partition in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing across different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. The experiments demonstrate that models employing UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers in several tasks across image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.

Comment: The paper proposes Union-of-Experts (UoE), which advances the Mixture-of-Experts paradigm with architectural innovations, aligning closely with model architecture research.

Relevance: 10 Novelty: 8
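
The "matrix partition in tensor parallelism" idea mentioned in the abstract above can be illustrated with a minimal NumPy sketch. It assumes a bias-free two-layer ReLU MLP and shows only that slicing the hidden dimension yields experts whose summed outputs equal the dense MLP. The function names (`mlp`, `split_into_experts`, `union_of_experts`) are hypothetical, routing is reduced to a hand-picked index list, and none of this is taken from the authors' implementation.

```python
import numpy as np

def mlp(x, w1, w2):
    """Dense two-layer MLP with ReLU and no biases: relu(x @ w1) @ w2."""
    return np.maximum(x @ w1, 0.0) @ w2

def split_into_experts(w1, w2, num_experts):
    """Partition the hidden dimension into equal slices, one per expert."""
    w1_parts = np.split(w1, num_experts, axis=1)  # column blocks of w1
    w2_parts = np.split(w2, num_experts, axis=0)  # matching row blocks of w2
    return list(zip(w1_parts, w2_parts))

def union_of_experts(x, experts, active=None):
    """Sum the outputs of the selected experts (all of them by default)."""
    if active is None:
        active = range(len(experts))
    return sum(mlp(x, *experts[i]) for i in active)

# Illustrative sizes only (assumed, not from the paper).
rng = np.random.default_rng(0)
d_model, d_hidden, num_experts = 8, 32, 4
x = rng.normal(size=(5, d_model))
w1 = rng.normal(size=(d_model, d_hidden))
w2 = rng.normal(size=(d_hidden, d_model))

experts = split_into_experts(w1, w2, num_experts)
full = mlp(x, w1, w2)
decomposed = union_of_experts(x, experts)

# Largest elementwise gap between the dense MLP and the sum of all experts.
max_abs_diff = float(np.max(np.abs(full - decomposed)))
print(max_abs_diff)
```

Because ReLU acts elementwise, each hidden unit contributes independently to the output, so using every expert reproduces the dense result up to floating-point error. Passing a subset such as `active=[0, 2]` gives the sparse approximation that a router would select.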

2. Deep Learning is Not So Mysterious or Different

ArXiv ID: 2503.02113

Authors: Andrew Gordon Wilson

Abstract: Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality.

Comment: The paper provides a theoretical perspective on generalization phenomena in deep learning, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 9
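
For readers unfamiliar with the countable hypothesis bounds mentioned in the abstract above, a standard textbook form is sketched here. It assumes a loss bounded in [0, 1], n i.i.d. training samples, and a prior P over a countable hypothesis class fixed before seeing the data. The symbols R (expected risk), \hat{R} (empirical risk), P, and \delta are notation chosen for this note, and the paper's exact statement may differ.

```latex
\[
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\log\frac{1}{P(h)} + \log\frac{1}{\delta}}{2n}}
\quad \text{for all } h \in \mathcal{H}, \text{ with probability at least } 1-\delta .
\]
```

The bound follows from Hoeffding's inequality applied to each hypothesis with failure probability \delta P(h), combined with a union bound. Hypotheses with higher prior mass get a tighter bound, which is one way a soft preference for simpler solutions can be expressed without restricting the hypothesis space.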

3. Neural Manifolds and Cognitive Consistency: A New Approach to Memory Consolidation in Artificial Systems

ArXiv ID: 2503.01867

Authors: Phuong-Nam Nguyen

Abstract: We introduce a novel mathematical framework that unifies neural population dynamics, hippocampal sharp wave-ripple (SpWR) generation, and cognitive consistency constraints inspired by Heider's theory. Our model leverages low-dimensional manifold representations to capture structured neural drift and incorporates a balance energy function to enforce coherent synaptic interactions, effectively simulating the memory consolidation processes observed in biological systems. Simulation results demonstrate that our approach not only reproduces key features of SpWR events but also enhances network interpretability. This work paves the way for scalable neuromorphic architectures that bridge neuroscience and artificial intelligence, offering more robust and adaptive learning mechanisms for future intelligent systems.

Comment: The paper introduces a novel framework for memory consolidation inspired by neuroscience, which aligns with foundational research in representation learning and emerging trends.

Relevance: 9 Novelty: 9

4. A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff

ArXiv ID: 2503.02129

Authors: Hao Yu, Xiangyang Ji

Abstract: We propose a first near complete (in a sense made explicit in the main text) nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions (under some very mild conditions). In particular, it doesn't require boundedness of the loss function, as is commonly assumed in the literature. Our theory goes beyond the bias-variance tradeoff, in line with phenomena typically encountered in deep learning. It is therefore sharply different from other existing nonasymptotic generalization error bounds for neural networks. More explicitly, we propose an explicit generalization error upper bound for multilayer neural networks with arbitrary Lipschitz activations $\sigma$ with $\sigma(0)=0$ and broad enough Lipschitz loss functions. The bound does not require the width, depth, or other hyperparameters of the neural network to approach infinity, a specific neural network architecture (e.g. sparsity, boundedness of some norms), a particular activation function, a particular optimization algorithm, or boundedness of the loss function, and it takes the approximation error into consideration. General Lipschitz activations can also be accommodated in our framework. Furthermore, we show the near minimax optimality of our theory for multilayer ReLU networks for regression problems. Notably, our upper bound exhibits the famous double descent phenomenon for such networks, which is its most distinctive characteristic compared with other existing results. This work emphasizes the view that many classical results should be improved to embrace the unintuitive characteristics of deep learning in order to understand it better.

Comment: The paper introduces a nonasymptotic generalization theory for multilayer neural networks, addressing foundational aspects of generalization and double descent, which is highly relevant to understanding training dynamics.