Personalized Daily ArXiv Papers 2025-05-15

[gpt-4o]	Prompt	Completion	Total
Token	31545	4177	35722
Cost	$0.08	$0.04	$0.12

Total arXiv papers: 359

Total scanned papers: 231

Total relevant papers: 17

Table of contents with paper titles:

The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures Authors: Andres Anabalon, Hugo Garces, Julio Oliva, Jose Cifuentes
SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen
Equilibrium Propagation for Learning in Lagrangian Dynamical Systems Authors: Serge Massar
Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks Authors: Vincent Abbott, Kotaro Kamiya, Gerard Glowacki, Yu Atsumi, Gioele Zardini, Yoshihiro Maruyama
Variational Rank Reduction Autoencoder Authors: Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari
An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits Authors: Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock
Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses Authors: Yuzhou Cao, Han Bao, Lei Feng, Bo An
Independent Component Analysis by Robust Distance Correlation Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck
SaFARi: State-Space Models for Frame-Agnostic Representation Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk
Layered Unlearning for Adversarial Relearning Authors: Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
Block-Biased Mamba for Long-Range Sequence Processing Authors: Annan Yu, N. Benjamin Erichson
Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists? Authors: Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino
A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model Authors: George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, Keith Ross
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei

1. The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures

ArXiv ID: 2505.08795

Authors: Andres Anabalon, Hugo Garces, Julio Oliva, Jose Cifuentes

Abstract: We show that there is a fast algorithm that embeds hierarchical structures in three-dimensional Minkowski spacetime. The correlation of data ends up purely encoded in the causal structure. Our model relies solely on oriented token pairs -- local hierarchical signals -- with no access to global symbolic structure. We apply our method to the corpus of WordNet. We provide a perfect embedding of the mammal sub-tree including ambiguities (more than one hierarchy per node) in such a way that the hierarchical structures get completely codified in the geometry and exactly reproduce the ground-truth. We extend this to a perfect embedding of the maximal unambiguous subset of the WordNet with 82,115 noun tokens and a single hierarchy per token. We introduce a novel retrieval mechanism in which causality, not distance, governs hierarchical access. Our results seem to indicate that all discrete data has a perfect geometrical representation that is three-dimensional. The resulting embeddings are nearly conformally invariant, indicating deep connections with general relativity and field theory. These results suggest that concepts, categories, and their interrelations, namely hierarchical meaning itself, is geometric.

Comment: The paper proposes a novel geometric representation of hierarchical structures in 3D Minkowski spacetime, with connections to general relativity and field theory. This introduces a potentially transformative paradigm for representation learning.

Relevance: 9 Novelty: 9
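
Illustration: the abstract says that causality, not distance, governs hierarchical access. A minimal sketch of what such a lookup could look like, assuming 2+1 Minkowski coordinates with c = 1 and assuming that a descendant sits in the causal future of its ancestor (the paper's convention may be the opposite). The `Event` type, the function name, and the coordinates are hypothetical and chosen by hand; they are not taken from the paper.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """A point in (2+1)-dimensional Minkowski spacetime: one time, two space coordinates."""
    t: float
    x: float
    y: float


def in_causal_future(a: Event, b: Event) -> bool:
    """Return True if b lies in or on the future light cone of a (units with c = 1)."""
    dt = b.t - a.t
    dx = b.x - a.x
    dy = b.y - a.y
    return dt > 0 and dt * dt >= dx * dx + dy * dy


# Hypothetical, hand-picked coordinates (not the paper's embedding).
mammal = Event(0.0, 0.0, 0.0)
dog = Event(2.0, 1.0, 0.5)
cat = Event(2.0, -1.0, 0.5)

print(in_causal_future(mammal, dog))  # ancestor/descendant pair
print(in_causal_future(dog, cat))     # siblings share a time coordinate here
```

In this toy layout, ancestry is read off the light cone alone: two siblings can be close in Euclidean distance and still be causally unrelated.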

2. SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures

ArXiv ID: 2505.09572

Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen

Abstract: We study gradient flows for loss landscapes of fully connected feed forward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to real-world scenarios, where we observe an analogous behavior.

Comment: The paper provides theoretical insights into gradient flows in neural networks, leveraging o-minimal structures, which is highly relevant to training dynamics and foundational representation learning.

Relevance: 9 Novelty: 9
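
Illustration: the abstract describes flows whose parameters diverge while the loss converges to an optimal value that is only reached asymptotically. A one-parameter toy shows the same qualitative behavior, under loud assumptions: the "network" is a single logistic unit, the target is the constant 1, and the flow is discretized with explicit Euler steps. None of this is the paper's setting (polynomial targets, large architectures); it only makes the phenomenon concrete.

```python
import math


def sigmoid(w: float) -> float:
    return 1.0 / (1.0 + math.exp(-w))


def loss(w: float) -> float:
    """Squared error of a one-parameter logistic unit against the constant target 1."""
    return (sigmoid(w) - 1.0) ** 2


def grad(w: float) -> float:
    """Derivative of loss with respect to w; strictly negative for every finite w."""
    s = sigmoid(w)
    return 2.0 * (s - 1.0) * s * (1.0 - s)


def euler_gradient_flow(w0: float, step: float, n_steps: int) -> list[float]:
    """Explicit Euler discretization of dw/dt = -loss'(w)."""
    ws = [w0]
    for _ in range(n_steps):
        ws.append(ws[-1] - step * grad(ws[-1]))
    return ws


trajectory = euler_gradient_flow(w0=0.0, step=1.0, n_steps=10_000)
for k in (0, 100, 10_000):
    print(k, trajectory[k], loss(trajectory[k]))
```

The infimum of this loss is 0, but no finite `w` attains it, so the parameter keeps growing while the loss shrinks toward 0.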

3. Equilibrium Propagation for Learning in Lagrangian Dynamical Systems

ArXiv ID: 2505.07363

Authors: Serge Massar

Abstract: We propose a method for training dynamical systems governed by Lagrangian mechanics using Equilibrium Propagation. Our approach extends Equilibrium Propagation -- initially developed for energy-based models -- to dynamical trajectories by leveraging the principle of action extremization. Training is achieved by gently nudging trajectories toward desired targets and measuring how the variables conjugate to the parameters to be trained respond. This method is particularly suited to systems with periodic boundary conditions or fixed initial and final states, enabling efficient parameter updates without requiring explicit backpropagation through time. In the case of periodic boundary conditions, this approach yields the semiclassical limit of Quantum Equilibrium Propagation. Applications to systems with dissipation are also discussed.

Comment: The paper extends Equilibrium Propagation to Lagrangian dynamical systems, introducing a novel training paradigm that challenges traditional backpropagation, aligning with emerging trends in foundational research.

Relevance: 9 Novelty: 9

4. Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks

ArXiv ID: 2505.09326

Authors: Vincent Abbott, Kotaro Kamiya, Gerard Glowacki, Yu Atsumi, Gioele Zardini, Yoshihiro Maruyama

Abstract: How do we enable artificial intelligence models to improve themselves? This is central to exponentially improving generalized artificial intelligence models, which can improve their own architecture to handle new problem domains in an efficient manner that leverages the latest hardware. However, current automated compilation methods are poor, and efficient algorithms require years of human development. In this paper, we use neural circuit diagrams, based in category theory, to prove a general theorem related to deep learning algorithms, guide the development of a novel attention algorithm catered to the domain of gene regulatory networks, and produce a corresponding efficient kernel. The algorithm we propose, spherical attention, shows that neural circuit diagrams enable a principled and systematic method for reasoning about deep learning architectures and providing high-performance code. By replacing SoftMax with an $L^2$ norm as suggested by diagrams, it overcomes the special function unit bottleneck of standard attention while retaining the streaming property essential to high-performance. Our diagrammatically derived FlashSign kernel achieves comparable performance to the state-of-the-art, fine-tuned FlashAttention algorithm on an A100, and $3.6\times$ the performance of PyTorch. Overall, this investigation shows neural circuit diagrams' suitability as a high-level framework for the automated development of efficient, novel artificial intelligence architectures.

Comment: The paper introduces a novel attention mechanism (spherical attention) derived using category theory and neural circuit diagrams, which aligns with 'Emerging Trends' and 'Model Architecture' due to its theoretical innovation and architectural insights.

Relevance: 9 Novelty: 9

5. Variational Rank Reduction Autoencoder

ArXiv ID: 2505.09458

Authors: Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta

Abstract: Deterministic Rank Reduction Autoencoders (RRAEs) enforce by construction a regularization on the latent space by applying a truncated SVD. While this regularization makes Autoencoders more powerful, using them for generative purposes is counter-intuitive due to their deterministic nature. On the other hand, Variational Autoencoders (VAEs) are well known for their generative abilities by learning a probabilistic latent space. In this paper, we present Variational Rank Reduction Autoencoders (VRRAEs), a model that leverages the advantages of both RRAEs and VAEs. Our claims and results show that when carefully sampling the latent space of RRAEs and further regularizing with the Kullback-Leibler (KL) divergence (similarly to VAEs), VRRAEs outperform RRAEs and VAEs. Additionally, we show that the regularization induced by the SVD not only makes VRRAEs better generators than VAEs, but also reduces the possibility of posterior collapse. Our results include a synthetic dataset of a small size that showcases the robustness of VRRAEs against collapse, and three real-world datasets; the MNIST, CelebA, and CIFAR-10, over which VRRAEs are shown to outperform both VAEs and RRAEs on many random generation and interpolation tasks based on the FID score.

Comment: The paper introduces Variational Rank Reduction Autoencoders (VRRAEs), combining SVD-based regularization with VAEs, which is relevant to representation learning and autoencoder innovations.

Relevance: 9 Novelty: 8

6. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

ArXiv ID: 2505.09568

Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu

Abstract: Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.

Comment: The paper introduces a unified multimodal model architecture and training strategy, which aligns with the 'Model Architecture' criterion. The focus on foundational design choices and training recipes for multimodal models adds significant relevance.

Relevance: 9 Novelty: 8

7. An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models

ArXiv ID: 2505.08915

Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari

Abstract: Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.

Comment: This paper provides an analytical characterization of training dynamics in neural networks, focusing on low-dimensional manifolds and their geometry. It aligns closely with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 8

8. An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

ArXiv ID: 2505.08823

Authors: Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock

Abstract: Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.

Comment: The paper focuses on quantization for LLMs, specifically fine-tuning full-precision checkpoints to ternary (1.58-bit) weights by inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule. This aligns with the 'Model Compression' criterion, particularly low-bit efficiency.

Relevance: 9 Novelty: 8
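
Illustration: a minimal sketch of an RMS-normalized ternary linear layer, to make "1.58 bits" concrete (a weight restricted to {-1, 0, 1} carries log2(3), about 1.58, bits). The absmean scaling rule below is an assumption borrowed from the general 1.58-bit recipe and may not match the paper's exact quantizer. The gradual layer-wise schedule and the straight-through estimator used in training are not shown, and all function names are hypothetical.

```python
import math


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Scale x to unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def ternarize(weights: list[list[float]], eps: float = 1e-8):
    """Assumed absmean quantizer: entries in {-1, 0, 1} plus one per-matrix scale."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    q = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return q, scale


def ternary_linear(x: list[float], q: list[list[int]], scale: float) -> list[float]:
    """RMS-normalize the input, then apply the rescaled ternary weights."""
    h = rms_norm(x)
    return [scale * sum(qi * hi for qi, hi in zip(row, h)) for row in q]


W = [[0.9, -0.05, 0.4], [-1.2, 0.3, 0.02]]
q, scale = ternarize(W)
print(q, scale)
print(ternary_linear([1.0, 2.0, 3.0], q, scale))
print(math.log2(3))  # bits of information per ternary weight
```

Normalizing the input right before the projection keeps activations at a fixed scale regardless of how coarse the weights are, which is the stabilizing effect the abstract attributes to the extra RMSNorm.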

9. Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses

ArXiv ID: 2505.09432

Authors: Yuzhou Cao, Han Bao, Lei Feng, Bo An

Abstract: Surrogate regret bounds bridge the gap between the convergence rates of surrogate and target losses, with linear bounds favorable for their lossless regret transfer. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the existence of a trade-off between the smoothness and linear regret bound has been believed in the community. That being said, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel-Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.
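
Background: for readers unfamiliar with the two building blocks named in the abstract, their standard textbook definitions are sketched below. Here $\Omega$ is a generic convex regularizer and $f$, $g$ are generic functions; the paper's specific choice (a generalized negentropy combined with the target Bayes risk) and its sign conventions are not reproduced and should be checked against the paper.

```latex
% Fenchel-Young loss generated by a convex function \Omega with convex conjugate \Omega^*
L_{\Omega}(\theta; y) = \Omega^{*}(\theta) + \Omega(y) - \langle \theta, y \rangle \ge 0,
\qquad
\Omega^{*}(\theta) = \sup_{p} \bigl( \langle \theta, p \rangle - \Omega(p) \bigr)

% Infimal convolution of two functions f and g
(f \mathbin{\square} g)(x) = \inf_{u} \bigl( f(u) + g(x - u) \bigr)
```

Non-negativity of $L_{\Omega}$ is the Fenchel-Young inequality, which is where the loss family gets its name.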

Comment: This paper introduces a novel convex smooth surrogate loss with linear regret bounds, which aligns with foundational research in optimization and theoretical efficiency improvements.