Personalized Daily ArXiv Papers 2025-05-15
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 31545 | 4177 | 35722 |
| Cost | $0.08 | $0.04 | $0.12 |
Total arXiv papers: 359
Total scanned papers: 231
Total relevant papers: 17
Table of contents with paper titles:
-
The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures Authors: Andres Anabalon, Hugo Garces, Julio Oliva, Jose Cifuentes
-
SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen
-
Equilibrium Propagation for Learning in Lagrangian Dynamical Systems Authors: Serge Massar
-
Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks Authors: Vincent Abbott, Kotaro Kamiya, Gerard Glowacki, Yu Atsumi, Gioele Zardini, Yoshihiro Maruyama
-
Variational Rank Reduction Autoencoder Authors: Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
-
An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari
-
An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits Authors: Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock
-
Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenche-Young Losses Authors: Yuzhou Cao, Han Bao, Lei Feng, Bo An
-
Independent Component Analysis by Robust Distance Correlation Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck
-
SaFARi: State-Space Models for Frame-Agnostic Representation Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk
-
Layered Unlearning for Adversarial Relearning Authors: Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
-
Block-Biased Mamba for Long-Range Sequence Processing Authors: Annan Yu, N. Benjamin Erichson
-
Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists? Authors: Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino
-
A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
-
Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model Authors: George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, Keith Ross
-
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei
1. The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures
ArXiv ID: 2505.08795
Authors: Andres Anabalon, Hugo Garces, Julio Oliva, Jose Cifuentes
Abstract: We show that there is a fast algorithm that embeds hierarchical structures in three-dimensional Minkowski spacetime. The correlation of data ends up purely encoded in the causal structure. Our model relies solely on oriented token pairs -- local hierarchical signals -- with no access to global symbolic structure. We apply our method to the corpus of \textit{WordNet}. We provide a perfect embedding of the mammal sub-tree including ambiguities (more than one hierarchy per node) in such a way that the hierarchical structures get completely codified in the geometry and exactly reproduce the ground-truth. We extend this to a perfect embedding of the maximal unambiguous subset of the \textit{WordNet} with 82{,}115 noun tokens and a single hierarchy per token. We introduce a novel retrieval mechanism in which causality, not distance, governs hierarchical access. Our results seem to indicate that all discrete data has a perfect geometrical representation that is three-dimensional. The resulting embeddings are nearly conformally invariant, indicating deep connections with general relativity and field theory. These results suggest that concepts, categories, and their interrelations, namely hierarchical meaning itself, is geometric.
Comment: The paper proposes a novel geometric representation of hierarchical structures in 3D Minkowski spacetime, with connections to general relativity and field theory. This introduces a potentially transformative paradigm for representation learning.
Relevance: 9 Novelty: 9
2. SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures
ArXiv ID: 2505.09572
Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen
Abstract: We study gradient flows for loss landscapes of fully connected feed forward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to real-world scenarios, where we observe an analogous behavior.
Comment: The paper provides theoretical insights into gradient flows in neural networks, leveraging o-minimal structures, which is highly relevant to training dynamics and foundational representation learning.
Relevance: 9 Novelty: 9
3. Equilibrium Propagation for Learning in Lagrangian Dynamical Systems
ArXiv ID: 2505.07363
Authors: Serge Massar
Abstract: We propose a method for training dynamical systems governed by Lagrangian mechanics using Equilibrium Propagation. Our approach extends Equilibrium Propagation -- initially developed for energy-based models -- to dynamical trajectories by leveraging the principle of action extremization. Training is achieved by gently nudging trajectories toward desired targets and measuring how the variables conjugate to the parameters to be trained respond. This method is particularly suited to systems with periodic boundary conditions or fixed initial and final states, enabling efficient parameter updates without requiring explicit backpropagation through time. In the case of periodic boundary conditions, this approach yields the semiclassical limit of Quantum Equilibrium Propagation. Applications to systems with dissipation are also discussed.
Comment: The paper extends Equilibrium Propagation to Lagrangian dynamical systems, introducing a novel training paradigm that challenges traditional backpropagation, aligning with emerging trends in foundational research.
Relevance: 9 Novelty: 9
4. Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks
ArXiv ID: 2505.09326
Authors: Vincent Abbott, Kotaro Kamiya, Gerard Glowacki, Yu Atsumi, Gioele Zardini, Yoshihiro Maruyama
Abstract: How do we enable artificial intelligence models to improve themselves? This is central to exponentially improving generalized artificial intelligence models, which can improve their own architecture to handle new problem domains in an efficient manner that leverages the latest hardware. However, current automated compilation methods are poor, and efficient algorithms require years of human development. In this paper, we use neural circuit diagrams, based in category theory, to prove a general theorem related to deep learning algorithms, guide the development of a novel attention algorithm catered to the domain of gene regulatory networks, and produce a corresponding efficient kernel. The algorithm we propose, spherical attention, shows that neural circuit diagrams enable a principled and systematic method for reasoning about deep learning architectures and providing high-performance code. By replacing SoftMax with an $L^2$ norm as suggested by diagrams, it overcomes the special function unit bottleneck of standard attention while retaining the streaming property essential to high-performance. Our diagrammatically derived \textit{FlashSign} kernel achieves comparable performance to the state-of-the-art, fine-tuned FlashAttention algorithm on an A100, and $3.6\times$ the performance of PyTorch. Overall, this investigation shows neural circuit diagrams' suitability as a high-level framework for the automated development of efficient, novel artificial intelligence architectures.
Comment: The paper introduces a novel attention mechanism (spherical attention) derived using category theory and neural circuit diagrams, which aligns with 'Emerging Trends' and 'Model Architecture' due to its theoretical innovation and architectural insights.
Relevance: 9 Novelty: 9
5. Variational Rank Reduction Autoencoder
ArXiv ID: 2505.09458
Authors: Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta
Abstract: Deterministic Rank Reduction Autoencoders (RRAEs) enforce by construction a regularization on the latent space by applying a truncated SVD. While this regularization makes Autoencoders more powerful, using them for generative purposes is counter-intuitive due to their deterministic nature. On the other hand, Variational Autoencoders (VAEs) are well known for their generative abilities by learning a probabilistic latent space. In this paper, we present Variational Rank Reduction Autoencoders (VRRAEs), a model that leverages the advantages of both RRAEs and VAEs. Our claims and results show that when carefully sampling the latent space of RRAEs and further regularizing with the Kullback-Leibler (KL) divergence (similarly to VAEs), VRRAEs outperform RRAEs and VAEs. Additionally, we show that the regularization induced by the SVD not only makes VRRAEs better generators than VAEs, but also reduces the possibility of posterior collapse. Our results include a synthetic dataset of a small size that showcases the robustness of VRRAEs against collapse, and three real-world datasets; the MNIST, CelebA, and CIFAR-10, over which VRRAEs are shown to outperform both VAEs and RRAEs on many random generation and interpolation tasks based on the FID score.
Comment: The paper introduces Variational Rank Reduction Autoencoders (VRRAEs), combining SVD-based regularization with VAEs, which is relevant to representation learning and autoencoder innovations.
Relevance: 9 Novelty: 8
6. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
ArXiv ID: 2505.09568
Authors: Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
Abstract: Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models-first training on image understanding and subsequently on image generation-offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.
Comment: The paper introduces a unified multimodal model architecture and training strategy, which aligns with the 'Model Architecture' criterion. The focus on foundational design choices and training recipes for multimodal models adds significant relevance.
Relevance: 9 Novelty: 8
7. An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models
ArXiv ID: 2505.08915
Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari
Abstract: Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.
Comment: This paper provides an analytical characterization of training dynamics in neural networks, focusing on low-dimensional manifolds and their geometry. It aligns closely with foundational research in representation learning and training dynamics.
Relevance: 9 Novelty: 8
8. An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits
ArXiv ID: 2505.08823
Authors: Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock
Abstract: Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.
Comment: The paper focuses on quantization for LLMs, specifically achieving ternary (2-bit) precision using RMS normalization and a gradual quantization schedule. This aligns with the 'Model Compression' criterion, particularly in sparsity and low-bit efficiency breakthroughs.
Relevance: 9 Novelty: 8
9. Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenche-Young Losses
ArXiv ID: 2505.09432
Authors: Yuzhou Cao, Han Bao, Lei Feng, Bo An
Abstract: Surrogate regret bounds bridge the gap between the convergence rates of surrogate and target losses, with linear bounds favorable for their lossless regret transfer. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the existence of a trade-off between the smoothness and linear regret bound has been believed in the community. That being said, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel-Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.
Comment: This paper introduces a novel convex smooth surrogate loss with linear regret bounds, which aligns with foundational research in optimization and theoretical efficiency improvements.
Relevance: 8 Novelty: 8
10. Independent Component Analysis by Robust Distance Correlation
ArXiv ID: 2505.09425
Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck
Abstract: Independent component analysis (ICA) is a powerful tool for decomposing a multivariate signal or distribution into fully independent sources, not just uncorrelated ones. Unfortunately, most approaches to ICA are not robust against outliers. Here we propose a robust ICA method called RICA, which estimates the components by minimizing a robust measure of dependence between multivariate random variables. The dependence measure used is the distance correlation (dCor). In order to make it more robust we first apply a new transformation called the bowl transform, which is bounded, one-to-one, continuous, and maps far outliers to points close to the origin. This preserves the crucial property that a zero dCor implies independence. RICA estimates the independent sources sequentially, by looking for the component that has the smallest dCor with the remainder. RICA is strongly consistent and has the usual parametric rate of convergence. Its robustness is investigated by a simulation study, in which it generally outperforms its competitors. The method is illustrated on three applications, including the well-known cocktail party problem.
Comment: The paper proposes a robust ICA method (RICA) using distance correlation, which aligns with foundational research in representation learning and robust statistical methods.
Relevance: 8 Novelty: 8
11. SaFARi: State-Space Models for Frame-Agnostic Representation
ArXiv ID: 2505.08977
Authors: Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk
Abstract: State-Space Models (SSMs) have re-emerged as a powerful tool for online function approximation, and as the backbone of machine learning models for long-range dependent data. However, to date, only a few polynomial bases have been explored for this purpose, and the state-of-the-art implementations were built upon the best of a few limited options. In this paper, we present a generalized method for building an SSM with any frame or basis, rather than being restricted to polynomials. This framework encompasses the approach known as HiPPO, but also permits an infinite diversity of other possible "species" within the SSM architecture. We dub this approach SaFARi: SSMs for Frame-Agnostic Representation.
Comment: The paper introduces a generalized framework for State-Space Models (SSMs) and extends the HiPPO approach, which aligns with representation learning and architectural innovations.
Relevance: 8 Novelty: 8
12. Layered Unlearning for Adversarial Relearning
ArXiv ID: 2505.09500
Authors: Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
Abstract: Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications that makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow context-dependent ``circuits'' that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first $i$ folds while retaining the remaining $k - i$ at the $i$th of $k$ stages, LU limits the ability of relearning on a subset of data to recover the full dataset. We evaluate LU through a combination of synthetic and large language model (LLM) experiments. We find that LU improves robustness to adversarial relearning for several different unlearning methods. Our results contribute to the state-of-the-art of machine unlearning and provide insight into the effect of post-training updates.
Comment: The paper investigates post-training methods and introduces a novel unlearning algorithm, which aligns with 'Representation Learning' by exploring how model behavior and representations are modified. The focus on adversarial relearning adds theoretical depth.
Relevance: 8 Novelty: 8
13. Block-Biased Mamba for Long-Range Sequence Processing
ArXiv ID: 2505.09022
Authors: Annan Yu, N. Benjamin Erichson
Abstract: Mamba extends earlier state space models (SSMs) by introducing input-dependent dynamics, and has demonstrated strong empirical performance across a range of domains, including language modeling, computer vision, and foundation models. However, a surprising weakness remains: despite being built on architectures designed for long-range dependencies, Mamba performs poorly on long-range sequential tasks. Understanding and addressing this gap is important for improving Mamba's universality and versatility. In this work, we analyze Mamba's limitations through three perspectives: expressiveness, inductive bias, and training stability. Our theoretical results show how Mamba falls short in each of these aspects compared to earlier SSMs such as S4D. To address these issues, we propose $\text{B}_2\text{S}_6$, a simple extension of Mamba's S6 unit that combines block-wise selective dynamics with a channel-specific bias. We prove that these changes equip the model with a better-suited inductive bias and improve its expressiveness and stability. Empirically, $\text{B}_2\text{S}_6$ outperforms S4 and S4D on Long-Range Arena (LRA) tasks while maintaining Mamba's performance on language modeling benchmarks.
Comment: The paper proposes improvements to state space models (SSMs) for long-range sequence processing, which aligns with architectural innovations and addresses theoretical limitations of prior models.
Relevance: 8 Novelty: 8
14. Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
ArXiv ID: 2505.09614
Authors: Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino
Abstract: Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established "Blicket Test" paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not children-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.
Comment: The paper examines causal reasoning biases in LLMs and proposes a test-time sampling method to address them, which provides insights into LLM behavior and interpretability.
Relevance: 8 Novelty: 7
15. A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning
ArXiv ID: 2505.09160
Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
Abstract: Current applications of self-supervised learning to wireless channel representation often borrow paradigms developed for text and image processing, without fully addressing the unique characteristics and constraints of wireless communications. Aiming to fill this gap, we first propose WiMAE (Wireless Masked Autoencoder), a transformer-based encoder-decoder foundation model pretrained on a realistic open-source multi-antenna wireless channel dataset. Building upon this foundation, we develop ContraWiMAE, which enhances WiMAE by incorporating a contrastive learning objective alongside the reconstruction task in a unified multi-task framework. By warm-starting from pretrained WiMAE weights and generating positive pairs via noise injection, the contrastive component enables the model to capture both structural and discriminative features, enhancing representation quality beyond what reconstruction alone can achieve. Through extensive evaluation on unseen scenarios, we demonstrate the effectiveness of both approaches across multiple downstream tasks, with ContraWiMAE showing further improvements in linear separability and adaptability in diverse wireless environments. Comparative evaluations against a state-of-the-art wireless channel foundation model confirm the superior performance and data efficiency of our models, highlighting their potential as powerful baselines for future research in self-supervised wireless channel representation learning.
Comment: The paper proposes a transformer-based autoencoder for wireless channel representation, which aligns with 'Representation Learning' and 'Model Architecture' criteria. The use of contrastive and masked autoencoder learning adds methodological novelty.
Relevance: 8 Novelty: 7
16. Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model
ArXiv ID: 2505.09308
Authors: George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, Keith Ross
Abstract: The Unconstrained Feature Model (UFM) is a mathematical framework that enables closed-form approximations for minimal training loss and related performance measures in deep neural networks (DNNs). This paper leverages the UFM to provide qualitative insights into neural multivariate regression, a critical task in imitation learning, robotics, and reinforcement learning. Specifically, we address two key questions: (1) How do multi-task models compare to multiple single-task models in terms of training performance? (2) Can whitening and normalizing regression targets improve training performance? The UFM theory predicts that multi-task models achieve strictly smaller training MSE than multiple single-task models when the same or stronger regularization is applied to the latter, and our empirical results confirm these findings. Regarding whitening and normalizing regression targets, the UFM theory predicts that they reduce training MSE when the average variance across the target dimensions is less than one, and our empirical results once again confirm these findings. These findings highlight the UFM as a powerful framework for deriving actionable insights into DNN design and data pre-processing strategies.
Comment: The paper provides theoretical insights into neural multivariate regression using the Unconstrained Feature Model (UFM), which aligns with representation learning and training dynamics in neural networks.
Relevance: 8 Novelty: 7
17. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
ArXiv ID: 2505.09343
Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei
Abstract: The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
Comment: The paper discusses hardware-aware co-design for scaling LLMs, including innovations like Mixture of Experts (MoE) and FP8 mixed-precision training. This aligns with the 'Model Architecture' criterion due to its focus on MoE and architectural efficiency.
Relevance: 8 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.