Personalized Daily ArXiv Papers 2025-05-06

[gpt-4o]	Prompt	Completion	Total
Token	43013	6051	49064
Cost	$0.11	$0.06	$0.17

Total arXiv papers: 637

Total scanned papers: 402

Total relevant papers: 23

Table of contents with paper titles:

Contextures: Representations from Contexts Authors: Runtian Zhai, Kai Yang, Che-Ping Tsai, Burak Varici, Zico Kolter, Pradeep Ravikumar
MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer
Always Skip Attention Authors: Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
Sharpness-Aware Minimization with Z-Score Gradient Filtering for Neural Networks Authors: Juyoung Yun
Don't be lazy: CompleteP enables compute-efficient deep transformers Authors: Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness
Secrets of GFlowNets' Learning Behavior: A Theoretical Study Authors: Tianshu Yu
What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction Authors: Eitan Wagner, Omri Abend
Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients Authors: Yezhen Wang, Zhouhao Yang, Brian K Chen, Fanyi Pu, Bo Li, Tianyu Gao, Kenji Kawaguchi
Intra-Layer Recurrence in Transformers for Language Modeling Authors: Anthony Nguyen, Wenjun Lin
Towards Quantifying the Hessian Structure of Neural Networks Authors: Zhaorui Dong, Yushun Zhang, Zhi-Quan Luo, Jianfeng Yao, Ruoyu Sun
Practical Efficiency of Muon for Pretraining Authors: Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani
Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression Authors: Samuel J. Kaufman, René Just, Rastislav Bodik
Low-Loss Space in Neural Networks is Continuous and Fully Connected Authors: Yongding Tian, Zaid Al-Ars, Maksim Kitsak, Peter Hofstee
A dynamic view of the double descent Authors: Vivek Shripad Borkar
Advancing Constrained Monotonic Neural Networks: Achieving Universal Approximation Beyond Bounded Activations Authors: Davide Sartor, Alberto Sinigaglia, Gian Antonio Susto
Quantitative Analysis of Performance Drop in DeepSeek Model Quantization Authors: Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian
Surrogate to Poincaré inequalities on manifolds for dimension reduction in nonlinear feature spaces Authors: Anthony Nouy, Alexandre Pasco
Adaptively Point-weighting Curriculum Learning Authors: Wensheng Li, Hao Wang, Ruifeng Zhou, Hanting Guan, Chao Zhang, Dacheng Tao
Learning Local Causal World Models with State Space Models and Attention Authors: Francesco Petri, Luigi Asprino, Aldo Gangemi
BiGSCoder: State Space Model for Code Understanding Authors: Shweta Verma, Abhinav Anand, Mira Mezini
A probabilistic view on Riemannian machine learning models for SPD matrices Authors: Thibault de Surrel, Florian Yger, Fabien Lotte, Sylvain Chevallier
Large Language Model Partitioning for Low-Latency Inference at the Edge Authors: Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos
Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data Authors: Zhong Guan, Likang Wu, Hongke Zhao, Ming He, Jianpin Fan

1. Contextures: Representations from Contexts

ArXiv ID: 2505.01557

Authors: Runtian Zhai, Kai Yang, Che-Ping Tsai, Burak Varici, Zico Kolter, Pradeep Ravikumar

Abstract: Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. In this paper, we establish the contexture theory. It shows that a large class of representation learning methods can be characterized as learning from the association between the input and a context variable. Specifically, we show that many popular methods aim to approximate the top-d singular functions of the expectation operator induced by the context, in which case we say that the representation learns the contexture. We demonstrate the generality of the contexture theory by proving that representation learning within various learning paradigms -- supervised, self-supervised, and manifold learning -- can all be studied from such a perspective. We also prove that the representations that learn the contexture are optimal on those tasks that are compatible with the context. One important implication of the contexture theory is that once the model is large enough to approximate the top singular functions, further scaling up the model size yields diminishing returns. Therefore, scaling is not all we need, and further improvement requires better contexts. To this end, we study how to evaluate the usefulness of a context without knowing the downstream tasks. We propose a metric and show by experiments that it correlates well with the actual performance of the encoder on many real datasets.

Comment: The paper introduces the contexture theory for representation learning, providing a theoretical framework that aligns with the representation learning criterion.

Relevance: 10 Novelty: 9
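
To make "top-d singular functions of the expectation operator induced by the context" concrete, here is a minimal sketch for the finite, discrete case. It is not the paper's method: the 2x2 joint distribution `P`, the helper names, and the use of power iteration are all assumptions chosen for illustration. For discrete X and A, the expectation operator g -> E[g(A) | X = x] has the same singular values as the matrix Q[x][a] = P(x, a) / sqrt(P(x) P(a)), and its singular functions are the left singular vectors of Q divided by sqrt(P(x)).

```python
import math

def normalized_joint(P):
    """Return Q[x][a] = P(x, a) / sqrt(P(x) * P(a)) and the marginal P(x)."""
    px = [sum(row) for row in P]
    pa = [sum(row[j] for row in P) for j in range(len(P[0]))]
    Q = [[P[i][j] / math.sqrt(px[i] * pa[j]) for j in range(len(pa))]
         for i in range(len(px))]
    return Q, px

def top_singular(Q, d, iters=200):
    """Top-d (singular value, left vector) pairs of Q by power iteration on Q Q^T."""
    n, m = len(Q), len(Q[0])
    found = []
    for k in range(d):
        # Arbitrary fixed start vector (hypothetical choice, not from the paper).
        u = [1.0 + 0.37 * (i + 1) * (k + 1) for i in range(n)]
        for _ in range(iters):
            # Deflation: stay orthogonal to the vectors already found.
            for _, prev in found:
                dot = sum(a * b for a, b in zip(u, prev))
                u = [a - dot * b for a, b in zip(u, prev)]
            v = [sum(Q[i][j] * u[i] for i in range(n)) for j in range(m)]  # Q^T u
            u = [sum(Q[i][j] * v[j] for j in range(m)) for i in range(n)]  # Q v
            norm = math.sqrt(sum(a * a for a in u))
            u = [a / norm for a in u]
        v = [sum(Q[i][j] * u[i] for i in range(n)) for j in range(m)]
        found.append((math.sqrt(sum(b * b for b in v)), u))
    return found

# Toy joint distribution over (X, A): an assumption, not data from the paper.
P = [[0.4, 0.1], [0.1, 0.4]]
Q, px = normalized_joint(P)
pairs = top_singular(Q, d=2)
singular_values = [s for s, _ in pairs]
# Map left singular vectors back to functions of x in L2(P_X).
functions = [[u_i / math.sqrt(p) for u_i, p in zip(u, px)] for _, u in pairs]
```

For this toy distribution the singular values are 1 and 0.6. The top singular function is constant, which holds for any joint distribution, and the second one separates the two values of X, up to sign.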

2. MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

ArXiv ID: 2505.01459

Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer

Abstract: This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.

Comment: The paper introduces MoxE, a novel MoE-based architecture with entropy-aware routing, which aligns with foundational research in model architecture and efficiency.

Relevance: 10 Novelty: 8
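
The abstract does not say how MoxE uses entropy inside its router, so the following is only a generic sketch of sparse top-k expert routing with the Shannon entropy of the gate distribution computed alongside it. The function names, the example logits, and the choice of k are hypothetical, and nothing here reproduces the paper's routing rule or its auxiliary losses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_top_k(logits, k):
    """Keep the k most probable experts and renormalize their gate weights.

    Returns (weights, gate_entropy), where weights maps expert index to weight
    and gate_entropy is the entropy of the full gate distribution.
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}, entropy(probs)

# Hypothetical gate logits for one token over four experts.
weights, gate_entropy = route_top_k([2.0, 1.0, 0.0, -1.0], k=2)
uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
```

A confident gate gives low entropy and a uniform gate gives the maximum, log(4) for four experts. An entropy-aware router could use that quantity as a signal when assigning tokens, though the specific rule MoxE applies is not given in this digest.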

3. Always Skip Attention

ArXiv ID: 2505.01996

Authors: Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey

Abstract: We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying -- a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.

Comment: The paper provides theoretical insights into the critical role of skip connections in Vision Transformers, which is highly relevant to model architecture analysis.