
Personalized Daily ArXiv Papers 2025-05-15

[gpt-4o]  Prompt  Completion  Total
Token     31545   4177        35722
Cost      $0.08   $0.04       $0.12
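
The cost row can be reproduced from the token counts once per-token prices are fixed. The sketch below is only illustrative: the two prices (USD per million tokens) are assumptions that are not stated anywhere in this digest, chosen because they are consistent with the rounded figures shown. The function name `usage_cost` is likewise hypothetical.

```python
# Token counts copied from the usage table.
PROMPT_TOKENS = 31545
COMPLETION_TOKENS = 4177

# ASSUMPTION: prices in USD per one million tokens; not given in the digest.
PROMPT_PRICE_PER_M = 2.50
COMPLETION_PRICE_PER_M = 10.00


def usage_cost(prompt_tokens, completion_tokens,
               prompt_price=PROMPT_PRICE_PER_M,
               completion_price=COMPLETION_PRICE_PER_M):
    """Return (prompt_cost, completion_cost, total_cost) in USD."""
    prompt_cost = prompt_tokens * prompt_price / 1_000_000
    completion_cost = completion_tokens * completion_price / 1_000_000
    return prompt_cost, completion_cost, prompt_cost + completion_cost


prompt_cost, completion_cost, total_cost = usage_cost(PROMPT_TOKENS, COMPLETION_TOKENS)
total_tokens = PROMPT_TOKENS + COMPLETION_TOKENS
print(f"tokens={total_tokens} cost=${total_cost:.2f}")
```

If the actual prices differ, only the two constants need to change.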

Total arXiv papers: 359

Total scanned papers: 231

Total relevant papers: 17

Table of contents with paper titles:

  1. The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures Authors: Andres Anabalon, Hugo Garces, Julio Oliva, Jose Cifuentes

  2. SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen

  3. Equilibrium Propagation for Learning in Lagrangian Dynamical Systems Authors: Serge Massar

  4. Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks Authors: Vincent Abbott, Kotaro Kamiya, Gerard Glowacki, Yu Atsumi, Gioele Zardini, Yoshihiro Maruyama

  5. Variational Rank Reduction Autoencoder Authors: Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta

  6. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu

  7. An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari

  8. An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits Authors: Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock

  9. Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses Authors: Yuzhou Cao, Han Bao, Lei Feng, Bo An

  10. Independent Component Analysis by Robust Distance Correlation Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck

  11. SaFARi: State-Space Models for Frame-Agnostic Representation Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk

  12. Layered Unlearning for Adversarial Relearning Authors: Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell

  13. Block-Biased Mamba for Long-Range Sequence Processing Authors: Annan Yu, N. Benjamin Erichson

  14. Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists? Authors: Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino

  15. A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani

  16. Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model Authors: George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, Keith Ross

  17. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei


1. The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures

ArXiv ID: 2505.08795

Authors: Andres Anabalon, Hugo Garces, Julio Oliva, Jose Cifuentes

Abstract: We show that there is a fast algorithm that embeds hierarchical structures in three-dimensional Minkowski spacetime. The correlation of data ends up purely encoded in the causal structure. Our model relies solely on oriented token pairs -- local hierarchical signals -- with no access to global symbolic structure. We apply our method to the corpus of WordNet. We provide a perfect embedding of the mammal sub-tree including ambiguities (more than one hierarchy per node) in such a way that the hierarchical structures get completely codified in the geometry and exactly reproduce the ground-truth. We extend this to a perfect embedding of the maximal unambiguous subset of the WordNet with 82,115 noun tokens and a single hierarchy per token. We introduce a novel retrieval mechanism in which causality, not distance, governs hierarchical access. Our results seem to indicate that all discrete data has a perfect geometrical representation that is three-dimensional. The resulting embeddings are nearly conformally invariant, indicating deep connections with general relativity and field theory. These results suggest that concepts, categories, and their interrelations, namely hierarchical meaning itself, is geometric.

Comment: The paper proposes a novel geometric representation of hierarchical structures in 3D Minkowski spacetime, with connections to general relativity and field theory. This introduces a potentially transformative paradigm for representation learning.

Relevance: 9 Novelty: 9
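
To make "causality, not distance, governs hierarchical access" concrete, here is a minimal sketch of a light-cone membership test in (2+1)-dimensional Minkowski spacetime with c = 1. It is not the authors' embedding algorithm. The coordinates are placed by hand, and the convention that descendants sit in the future light cone of their ancestors is an assumption made for illustration.

```python
def in_future_light_cone(a, b):
    """True if event b lies inside or on the future light cone of event a.

    Events are (t, x, y) tuples in units where c = 1.
    """
    dt = b[0] - a[0]
    dx = b[1] - a[1]
    dy = b[2] - a[2]
    return dt > 0 and dt * dt >= dx * dx + dy * dy


# Hand-placed toy coordinates (hypothetical, not taken from the paper).
EVENTS = {
    "mammal": (0.0, 0.0, 0.0),
    "canine": (1.0, 0.5, 0.0),
    "feline": (1.0, -0.5, 0.0),
    "dog": (2.0, 0.8, 0.0),
    "cat": (2.0, -0.8, 0.0),
}


def descendants(root, events=EVENTS):
    """Retrieve every token whose event is causally reachable from root."""
    origin = events[root]
    return sorted(name for name, ev in events.items()
                  if in_future_light_cone(origin, ev))


print(descendants("mammal"))
print(descendants("canine"))
```

In this toy layout "cat" is spatially close to "canine" in time but outside its light cone, so it is not retrieved as a descendant.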


2. SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures

ArXiv ID: 2505.09572

Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen

Abstract: We study gradient flows for loss landscapes of fully connected feed forward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to real-world scenarios, where we observe an analogous behavior.

Comment: The paper provides theoretical insights into gradient flows in neural networks, leveraging o-minimal structures, which is highly relevant to training dynamics and foundational representation learning.

Relevance: 9 Novelty: 9


3. Equilibrium Propagation for Learning in Lagrangian Dynamical Systems

ArXiv ID: 2505.07363

Authors: Serge Massar

Abstract: We propose a method for training dynamical systems governed by Lagrangian mechanics using Equilibrium Propagation. Our approach extends Equilibrium Propagation -- initially developed for energy-based models -- to dynamical trajectories by leveraging the principle of action extremization. Training is achieved by gently nudging trajectories toward desired targets and measuring how the variables conjugate to the parameters to be trained respond. This method is particularly suited to systems with periodic boundary conditions or fixed initial and final states, enabling efficient parameter updates without requiring explicit backpropagation through time. In the case of periodic boundary conditions, this approach yields the semiclassical limit of Quantum Equilibrium Propagation. Applications to systems with dissipation are also discussed.

Comment: The paper extends Equilibrium Propagation to Lagrangian dynamical systems, introducing a novel training paradigm that challenges traditional backpropagation, aligning with emerging trends in foundational research.

Relevance: 9 Novelty: 9


4. Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks

ArXiv ID: 2505.09326

Authors: Vincent Abbott, Kotaro Kamiya, Gerard Glowacki, Yu Atsumi, Gioele Zardini, Yoshihiro Maruyama

Abstract: How do we enable artificial intelligence models to improve themselves? This is central to exponentially improving generalized artificial intelligence models, which can improve their own architecture to handle new problem domains in an efficient manner that leverages the latest hardware. However, current automated compilation methods are poor, and efficient algorithms require years of human development. In this paper, we use neural circuit diagrams, based in category theory, to prove a general theorem related to deep learning algorithms, guide the development of a novel attention algorithm catered to the domain of gene regulatory networks, and produce a corresponding efficient kernel. The algorithm we propose, spherical attention, shows that neural circuit diagrams enable a principled and systematic method for reasoning about deep learning architectures and providing high-performance code. By replacing SoftMax with an $L^2$ norm as suggested by diagrams, it overcomes the special function unit bottleneck of standard attention while retaining the streaming property essential to high-performance. Our diagrammatically derived FlashSign kernel achieves comparable performance to the state-of-the-art, fine-tuned FlashAttention algorithm on an A100, and $3.6\times$ the performance of PyTorch. Overall, this investigation shows neural circuit diagrams' suitability as a high-level framework for the automated development of efficient, novel artificial intelligence architectures.

Comment: The paper introduces a novel attention mechanism (spherical attention) derived using category theory and neural circuit diagrams, which aligns with 'Emerging Trends' and 'Model Architecture' due to its theoretical innovation and architectural insights.

Relevance: 9 Novelty: 9


5. Variational Rank Reduction Autoencoder

ArXiv ID: 2505.09458

Authors: Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta

Abstract: Deterministic Rank Reduction Autoencoders (RRAEs) enforce by construction a regularization on the latent space by applying a truncated SVD. While this regularization makes Autoencoders more powerful, using them for generative purposes is counter-intuitive due to their deterministic nature. On the other hand, Variational Autoencoders (VAEs) are well known for their generative abilities by learning a probabilistic latent space. In this paper, we present Variational Rank Reduction Autoencoders (VRRAEs), a model that leverages the advantages of both RRAEs and VAEs. Our claims and results show that when carefully sampling the latent space of RRAEs and further regularizing with the Kullback-Leibler (KL) divergence (similarly to VAEs), VRRAEs outperform RRAEs and VAEs. Additionally, we show that the regularization induced by the SVD not only makes VRRAEs better generators than VAEs, but also reduces the possibility of posterior collapse. Our results include a synthetic dataset of a small size that showcases the robustness of VRRAEs against collapse, and three real-world datasets; the MNIST, CelebA, and CIFAR-10, over which VRRAEs are shown to outperform both VAEs and RRAEs on many random generation and interpolation tasks based on the FID score.

Comment: The paper introduces Variational Rank Reduction Autoencoders (VRRAEs), combining SVD-based regularization with VAEs, which is relevant to representation learning and autoencoder innovations.

Relevance: 9 Novelty: 8


6. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

ArXiv ID: 2505.09568

Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu

Abstract: Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models -- first training on image understanding and subsequently on image generation -- offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.

Comment: The paper introduces a unified multimodal model architecture and training strategy, which aligns with the 'Model Architecture' criterion. The focus on foundational design choices and training recipes for multimodal models adds significant relevance.

Relevance: 9 Novelty: 8


7. An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models

ArXiv ID: 2505.08915

Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari

Abstract: Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.

Comment: This paper provides an analytical characterization of training dynamics in neural networks, focusing on low-dimensional manifolds and their geometry. It aligns closely with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 8


8. An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

ArXiv ID: 2505.08823

Authors: Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock

Abstract: Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.

Comment: The paper focuses on quantization for LLMs, fine-tuning full-precision checkpoints into ternary (1.58-bit) models by inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule. This aligns with the 'Model Compression' criterion, particularly quantization and low-bit efficiency.

Relevance: 9 Novelty: 8
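
As a rough illustration of "RMS normalization before a ternary linear projection", the sketch below normalizes the input and multiplies it by weights quantized to {-1, 0, +1}. The abstract does not specify the quantizer, so the absmean rule used here (scale by the mean absolute weight, round, clip) is an assumption. The straight-through estimator and the gradual layer-wise schedule are not shown, and all function names are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def ternarize(weights, eps=1e-8):
    """ASSUMED absmean quantizer: divide by mean |w|, round, clip to {-1, 0, 1}."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    q = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return q, scale


def ternary_linear(x, weights):
    """Apply RMSNorm to x, then a linear map with ternary weights and one scale."""
    q, scale = ternarize(weights)
    h = rms_norm(x)
    return [scale * sum(qi * hi for qi, hi in zip(row, h)) for row in q]


W = [[0.9, -0.05, -1.2],
     [0.02, 0.4, -0.6]]
q, scale = ternarize(W)
y = ternary_linear([3.0, 4.0, 0.0], W)
# Three states per weight carry log2(3) bits, hence the "1.58 bits" in the title.
bits_per_weight = math.log2(3)
```

Only the ternary matrix and a single scale per layer need to be stored at inference time in this scheme.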


9. Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses

ArXiv ID: 2505.09432

Authors: Yuzhou Cao, Han Bao, Lei Feng, Bo An

Abstract: Surrogate regret bounds bridge the gap between the convergence rates of surrogate and target losses, with linear bounds favorable for their lossless regret transfer. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the existence of a trade-off between the smoothness and linear regret bound has been believed in the community. That being said, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel-Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.

Comment: This paper introduces a novel convex smooth surrogate loss with linear regret bounds, which aligns with foundational research in optimization and theoretical efficiency improvements.

Relevance: 8 Novelty: 8


10. Independent Component Analysis by Robust Distance Correlation

ArXiv ID: 2505.09425

Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck

Abstract: Independent component analysis (ICA) is a powerful tool for decomposing a multivariate signal or distribution into fully independent sources, not just uncorrelated ones. Unfortunately, most approaches to ICA are not robust against outliers. Here we propose a robust ICA method called RICA, which estimates the components by minimizing a robust measure of dependence between multivariate random variables. The dependence measure used is the distance correlation (dCor). In order to make it more robust we first apply a new transformation called the bowl transform, which is bounded, one-to-one, continuous, and maps far outliers to points close to the origin. This preserves the crucial property that a zero dCor implies independence. RICA estimates the independent sources sequentially, by looking for the component that has the smallest dCor with the remainder. RICA is strongly consistent and has the usual parametric rate of convergence. Its robustness is investigated by a simulation study, in which it generally outperforms its competitors. The method is illustrated on three applications, including the well-known cocktail party problem.

Comment: The paper proposes a robust ICA method (RICA) using distance correlation, which aligns with foundational research in representation learning and robust statistical methods.

Relevance: 8 Novelty: 8
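
For readers unfamiliar with distance correlation, the sketch below computes the plain sample dCor between two univariate samples using double-centered distance matrices. It is the standard, non-robust estimator: the bowl transform is not defined in the abstract and is therefore omitted, and the sequential source extraction of RICA is not shown either.

```python
import math


def _centered_distances(xs):
    """Pairwise |xi - xj| matrix, double-centered by row, column and grand means."""
    n = len(xs)
    d = [[abs(a - b) for b in xs] for a in xs]
    row = [sum(r) / n for r in d]  # equals the column means, since d is symmetric
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(xs, ys):
    """Sample distance correlation in [0, 1] for two equal-length samples."""
    n = len(xs)
    a = _centered_distances(xs)
    b = _centered_distances(ys)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)
    dvar_x = mean_product(a, a)
    dvar_y = mean_product(b, b)
    if dvar_x * dvar_y == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * v + 1 for v in x]   # perfectly dependent, linear
squared = [v * v for v in x]      # dependent, but with zero Pearson correlation
print(distance_correlation(x, linear), distance_correlation(x, squared))
```

The quadratic case shows why dCor is useful for ICA: it detects dependence that an uncorrelatedness criterion would miss.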


11. SaFARi: State-Space Models for Frame-Agnostic Representation

ArXiv ID: 2505.08977

Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk

Abstract: State-Space Models (SSMs) have re-emerged as a powerful tool for online function approximation, and as the backbone of machine learning models for long-range dependent data. However, to date, only a few polynomial bases have been explored for this purpose, and the state-of-the-art implementations were built upon the best of a few limited options. In this paper, we present a generalized method for building an SSM with any frame or basis, rather than being restricted to polynomials. This framework encompasses the approach known as HiPPO, but also permits an infinite diversity of other possible "species" within the SSM architecture. We dub this approach SaFARi: SSMs for Frame-Agnostic Representation.

Comment: The paper introduces a generalized framework for State-Space Models (SSMs) and extends the HiPPO approach, which aligns with representation learning and architectural innovations.

Relevance: 8 Novelty: 8


12. Layered Unlearning for Adversarial Relearning

ArXiv ID: 2505.09500

Authors: Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell

Abstract: Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications that makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow context-dependent "circuits" that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first $i$ folds while retaining the remaining $k - i$ at the $i$th of $k$ stages, LU limits the ability of relearning on a subset of data to recover the full dataset. We evaluate LU through a combination of synthetic and large language model (LLM) experiments. We find that LU improves robustness to adversarial relearning for several different unlearning methods. Our results contribute to the state-of-the-art of machine unlearning and provide insight into the effect of post-training updates.

Comment: The paper investigates post-training methods and introduces a novel unlearning algorithm, which aligns with 'Representation Learning' by exploring how model behavior and representations are modified. The focus on adversarial relearning adds theoretical depth.

Relevance: 8 Novelty: 8
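
The staging described in the abstract (unlearn the first i folds and retain the remaining k - i at stage i of k) can be written down as a simple schedule. The sketch below only enumerates which folds fall on each side at each stage. The fold names and the function name are hypothetical, and the actual unlearning and retention objectives are left out.

```python
def layered_unlearning_schedule(folds):
    """Return one entry per stage: stage i unlearns folds[:i] and retains folds[i:]."""
    k = len(folds)
    return [
        {"stage": i, "unlearn": folds[:i], "retain": folds[i:]}
        for i in range(1, k + 1)
    ]


schedule = layered_unlearning_schedule(["A", "B", "C"])
for step in schedule:
    print(step["stage"], "unlearn:", step["unlearn"], "retain:", step["retain"])
```

At the final stage every fold is in the unlearn set and nothing is retained, which follows directly from setting i = k.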


13. Block-Biased Mamba for Long-Range Sequence Processing

ArXiv ID: 2505.09022

Authors: Annan Yu, N. Benjamin Erichson

Abstract: Mamba extends earlier state space models (SSMs) by introducing input-dependent dynamics, and has demonstrated strong empirical performance across a range of domains, including language modeling, computer vision, and foundation models. However, a surprising weakness remains: despite being built on architectures designed for long-range dependencies, Mamba performs poorly on long-range sequential tasks. Understanding and addressing this gap is important for improving Mamba's universality and versatility. In this work, we analyze Mamba's limitations through three perspectives: expressiveness, inductive bias, and training stability. Our theoretical results show how Mamba falls short in each of these aspects compared to earlier SSMs such as S4D. To address these issues, we propose $\text{B}_2\text{S}_6$, a simple extension of Mamba's S6 unit that combines block-wise selective dynamics with a channel-specific bias. We prove that these changes equip the model with a better-suited inductive bias and improve its expressiveness and stability. Empirically, $\text{B}_2\text{S}_6$ outperforms S4 and S4D on Long-Range Arena (LRA) tasks while maintaining Mamba's performance on language modeling benchmarks.

Comment: The paper proposes improvements to state space models (SSMs) for long-range sequence processing, which aligns with architectural innovations and addresses theoretical limitations of prior models.

Relevance: 8 Novelty: 8


14. Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

ArXiv ID: 2505.09614

Authors: Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino

Abstract: Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established "Blicket Test" paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not children-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.

Comment: The paper examines causal reasoning biases in LLMs and proposes a test-time sampling method to address them, which provides insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 7


15. A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning

ArXiv ID: 2505.09160

Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani

Abstract: Current applications of self-supervised learning to wireless channel representation often borrow paradigms developed for text and image processing, without fully addressing the unique characteristics and constraints of wireless communications. Aiming to fill this gap, we first propose WiMAE (Wireless Masked Autoencoder), a transformer-based encoder-decoder foundation model pretrained on a realistic open-source multi-antenna wireless channel dataset. Building upon this foundation, we develop ContraWiMAE, which enhances WiMAE by incorporating a contrastive learning objective alongside the reconstruction task in a unified multi-task framework. By warm-starting from pretrained WiMAE weights and generating positive pairs via noise injection, the contrastive component enables the model to capture both structural and discriminative features, enhancing representation quality beyond what reconstruction alone can achieve. Through extensive evaluation on unseen scenarios, we demonstrate the effectiveness of both approaches across multiple downstream tasks, with ContraWiMAE showing further improvements in linear separability and adaptability in diverse wireless environments. Comparative evaluations against a state-of-the-art wireless channel foundation model confirm the superior performance and data efficiency of our models, highlighting their potential as powerful baselines for future research in self-supervised wireless channel representation learning.

Comment: The paper proposes a transformer-based autoencoder for wireless channel representation, which aligns with 'Representation Learning' and 'Model Architecture' criteria. The use of contrastive and masked autoencoder learning adds methodological novelty.

Relevance: 8 Novelty: 7


16. Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model

ArXiv ID: 2505.09308

Authors: George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, Keith Ross

Abstract: The Unconstrained Feature Model (UFM) is a mathematical framework that enables closed-form approximations for minimal training loss and related performance measures in deep neural networks (DNNs). This paper leverages the UFM to provide qualitative insights into neural multivariate regression, a critical task in imitation learning, robotics, and reinforcement learning. Specifically, we address two key questions: (1) How do multi-task models compare to multiple single-task models in terms of training performance? (2) Can whitening and normalizing regression targets improve training performance? The UFM theory predicts that multi-task models achieve strictly smaller training MSE than multiple single-task models when the same or stronger regularization is applied to the latter, and our empirical results confirm these findings. Regarding whitening and normalizing regression targets, the UFM theory predicts that they reduce training MSE when the average variance across the target dimensions is less than one, and our empirical results once again confirm these findings. These findings highlight the UFM as a powerful framework for deriving actionable insights into DNN design and data pre-processing strategies.

Comment: The paper provides theoretical insights into neural multivariate regression using the Unconstrained Feature Model (UFM), which aligns with representation learning and training dynamics in neural networks.

Relevance: 8 Novelty: 7


17. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

ArXiv ID: 2505.09343

Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei

Abstract: The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.

Comment: The paper discusses hardware-aware co-design for scaling LLMs, including innovations like Mixture of Experts (MoE) and FP8 mixed-precision training. This aligns with the 'Model Architecture' criterion due to its focus on MoE and architectural efficiency.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
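
A response in this format can be consumed with a few lines of Python. The sketch below is an assumption about how such output might be parsed, not the digest's actual pipeline: it treats the scores as integers and sorts by relevance, then novelty, which is inferred from the ordering of the entries above. The two sample lines reuse IDs and scores from this digest with shortened comments.

```python
import json

REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def parse_selection(jsonl_text):
    """Parse one JSON object per line and return records sorted by score."""
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record is missing keys: {missing}")
        records.append(record)
    # ASSUMPTION: higher relevance first, ties broken by higher novelty.
    return sorted(records, key=lambda r: (-r["RELEVANCE"], -r["NOVELTY"]))


SAMPLE = """
{"ARXIVID": "2505.09343", "COMMENT": "Hardware-aware co-design.", "RELEVANCE": 8, "NOVELTY": 7}
{"ARXIVID": "2505.08795", "COMMENT": "Hierarchies in Minkowski spacetime.", "RELEVANCE": 9, "NOVELTY": 9}
"""

selected = parse_selection(SAMPLE)
print([r["ARXIVID"] for r in selected])
```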

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structures themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: