Previous Day 2025-05-09
Monthly Overview 2025-05
Next Day 2025-05-13

Personalized Daily ArXiv Papers 2025-05-12

[gpt-4o] Prompt Completion Total
Token 35399 4453 39852
Cost $0.09 $0.04 $0.13

Total arXiv papers: 328

Total scanned papers: 212

Total relevant papers: 18

Table of contents with paper titles:

  1. FloE: On-the-Fly MoE Inference Authors: Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, Lidan Shou

  2. Continuous Thought Machines Authors: Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones

  3. Deep-ICE: The first globally optimal algorithm for empirical risk minimization of two-layer maxout and ReLU networks Authors: Xi He, Yi Miao, Max A. Little

  4. On the Depth of Monotone ReLU Neural Networks and ICNNs Authors: Egor Bakaev, Florestan Brunck, Christoph Hertrich, Daniel Reichman, Amir Yehudayoff

  5. Towards a Unified Representation Evaluation Framework Beyond Downstream Tasks Authors: Christos Plachouras, Julien Guinot, George Fazekas, Elio Quinton, Emmanouil Benetos, Johan Pauwels

  6. Rethinking Graph Contrastive Learning through Relative Similarity Preservation Authors: Zhiyuan Ning, Pengfei Wang, Ziyue Qiao, Pengyang Wang, Yuanchun Zhou

  7. How to Train Your Metamorphic Deep Neural Network Authors: Thomas Sommariva, Simone Calderara, Angelo Porrello

  8. MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design Authors: Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

  9. Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM Authors: Zehao Fan, Garrett Gagnon, Zhenyu Liu, Liu Liu

  10. New Statistical and Computational Results for Learning Junta Distributions Authors: Lorenzo Beretta

  11. Hypergraph Neural Sheaf Diffusion: A Symmetric Simplicial Set Framework for Higher-Order Learning Authors: Seongjin Choi, Gahee Kim, Yong-Geun Oh

  12. Neuro-Symbolic Concepts Authors: Jiayuan Mao, Joshua B. Tenenbaum, Jiajun Wu

  13. Generative Discovery of Partial Differential Equations by Learning from Math Handbooks Authors: Hao Xu, Yuntian Chen, Rui Cao, Tianning Tang, Mengge Du, Jian Li, Adrian H. Callaghan, Dongxiao Zhang

  14. Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions Authors: Dhruvesh Patel, Aishwarya Sahoo, Avinash Amballa, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum

  15. Griffin: Towards a Graph-Centric Relational Database Foundation Model Authors: Yanbo Wang, Xiyuan Wang, Quan Gan, Minjie Wang, Qibin Yang, David Wipf, Muhan Zhang

  16. Autoencoder-Based Hybrid Replay for Class-Incremental Learning Authors: Milad Khademi Nori, Il-Min Kim, Guanghui Wang

  17. UniSymNet: A Unified Symbolic Network Guided by Transformer Authors: Xinxin Li, Juan Zhang, Da Li, Xingyu Liu, Jin Xu, Junping Yin

  18. Register and CLS tokens yield a decoupling of local and global features in large ViTs Authors: Alexander Lappe, Martin A. Giese


1. FloE: On-the-Fly MoE Inference

ArXiv ID: 2505.05950

Authors: Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, Lidan Shou

Abstract: With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system on memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the expert's internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE achieves a 9.3x compression of parameters per expert in Mixtral-8x7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x; and delivers a 48.7x inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090.

Comment: The paper proposes FloE, an on-the-fly MoE inference system, which directly aligns with the Mixture-of-Experts (MoE) criterion under model architecture and compression. The compression techniques and inference acceleration are novel and impactful.

Relevance: 10 Novelty: 8


2. Continuous Thought Machines

ArXiv ID: 2505.05522

Authors: Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones

Abstract: Biological brains demonstrate complex neural activity, where the timing and interplay between neurons is critical to how brains process information. Most deep learning architectures simplify neural activity by abstracting away temporal dynamics. In this paper we challenge that paradigm. By incorporating neuron-level processing and synchronization, we can effectively reintroduce neural timing as a foundational element. We present the Continuous Thought Machine (CTM), a model designed to leverage neural dynamics as its core representation. The CTM has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals; and (2) neural synchronization employed as a latent representation. The CTM aims to strike a balance between oversimplified neuron abstractions that improve computational efficiency, and biological realism. It operates at a level of abstraction that effectively captures essential temporal dynamics while remaining computationally tractable for deep learning. We demonstrate the CTM's strong performance and versatility across a range of challenging tasks, including ImageNet-1K classification, solving 2D mazes, sorting, parity computation, question-answering, and RL tasks. Beyond displaying rich internal representations and offering a natural avenue for interpretation owing to its internal process, the CTM is able to perform tasks that require complex sequential reasoning. The CTM can also leverage adaptive compute, where it can stop earlier for simpler tasks, or keep computing when faced with more challenging instances. The goal of this work is to share the CTM and its associated innovations, rather than pushing for new state-of-the-art results. To that end, we believe the CTM represents a significant step toward developing more biologically plausible and powerful artificial intelligence systems.

Comment: The paper introduces the Continuous Thought Machine, which challenges established paradigms by incorporating neuron-level temporal dynamics, aligning with emerging trends and architectural innovations.

Relevance: 9 Novelty: 9


3. Deep-ICE: The first globally optimal algorithm for empirical risk minimization of two-layer maxout and ReLU networks

ArXiv ID: 2505.05740

Authors: Xi He, Yi Miao, Max A. Little

Abstract: This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance, with a 20-30\% reduction in misclassifications for both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models).

Comment: The paper introduces a globally optimal algorithm for empirical risk minimization in two-layer maxout and ReLU networks, which is highly relevant to foundational research in representation learning and training dynamics. The coreset selection method for scaling the algorithm adds significant novelty.

Relevance: 9 Novelty: 8


4. On the Depth of Monotone ReLU Neural Networks and ICNNs

ArXiv ID: 2505.06169

Authors: Egor Bakaev, Florestan Brunck, Christoph Hertrich, Daniel Reichman, Amir Yehudayoff

Abstract: We study two models of ReLU neural networks: monotone networks (ReLU$^+$) and input convex neural networks (ICNN). Our focus is on expressivity, mostly in terms of depth, and we prove the following lower bounds. For the maximum function MAX$_n$ computing the maximum of $n$ real numbers, we show that ReLU$^+$ networks cannot compute MAX$_n$, or even approximate it. We prove a sharp $n$ lower bound on the ICNN depth complexity of MAX$_n$. We also prove depth separations between ReLU networks and ICNNs; for every $k$, there is a depth-2 ReLU network of size $O(k^2)$ that cannot be simulated by a depth-$k$ ICNN. The proofs are based on deep connections between neural networks and polyhedral geometry, and also use isoperimetric properties of triangulations.

Comment: The paper provides theoretical insights into the depth complexity of monotone ReLU networks and ICNNs, which is highly relevant to understanding neural network architectures. The depth separations and expressivity analysis are significant contributions.

Relevance: 9 Novelty: 8


5. Towards a Unified Representation Evaluation Framework Beyond Downstream Tasks

ArXiv ID: 2505.06224

Authors: Christos Plachouras, Julien Guinot, George Fazekas, Elio Quinton, Emmanouil Benetos, Johan Pauwels

Abstract: Downstream probing has been the dominant method for evaluating model representations, an important process given the increasing prominence of self-supervised learning and foundation models. However, downstream probing primarily assesses the availability of task-relevant information in the model's latent space, overlooking attributes such as equivariance, invariance, and disentanglement, which contribute to the interpretability, adaptability, and utility of representations in real-world applications. While some attempts have been made to measure these qualities in representations, no unified evaluation framework with modular, generalizable, and interpretable metrics exists. In this paper, we argue for the importance of representation evaluation beyond downstream probing. We introduce a standardized protocol to quantify informativeness, equivariance, invariance, and disentanglement of factors of variation in model representations. We use it to evaluate representations from a variety of models in the image and speech domains using different architectures and pretraining approaches on identified controllable factors of variation. We find that representations from models with similar downstream performance can behave substantially differently with regard to these attributes. This hints that the respective mechanisms underlying their downstream performance are functionally different, prompting new research directions to understand and improve representations.

Comment: The paper introduces a unified framework for evaluating representations beyond downstream tasks, focusing on attributes like equivariance, invariance, and disentanglement. This is highly relevant to representation learning and offers a novel evaluation perspective.

Relevance: 9 Novelty: 8


6. Rethinking Graph Contrastive Learning through Relative Similarity Preservation

ArXiv ID: 2505.05533

Authors: Zhiyuan Ning, Pengfei Wang, Ziyue Qiao, Pengyang Wang, Yuanchun Zhou

Abstract: Graph contrastive learning (GCL) has achieved remarkable success by following the computer vision paradigm of preserving absolute similarity between augmented views. However, this approach faces fundamental challenges in graphs due to their discrete, non-Euclidean nature -- view generation often breaks semantic validity and similarity verification becomes unreliable. Through analyzing 11 real-world graphs, we discover a universal pattern transcending the homophily-heterophily dichotomy: label consistency systematically diminishes as structural distance increases, manifesting as smooth decay in homophily graphs and oscillatory decay in heterophily graphs. We establish theoretical guarantees for this pattern through random walk theory, proving label distribution convergence and characterizing the mechanisms behind different decay behaviors. This discovery reveals that graphs naturally encode relative similarity patterns, where structurally closer nodes exhibit collectively stronger semantic relationships. Leveraging this insight, we propose RELGCL, a novel GCL framework with complementary pairwise and listwise implementations that preserve these inherent patterns through collective similarity objectives. Extensive experiments demonstrate that our method consistently outperforms 20 existing approaches across both homophily and heterophily graphs, validating the effectiveness of leveraging natural relative similarity over artificial absolute similarity.

Comment: The paper proposes RELGCL, a novel graph contrastive learning framework that leverages relative similarity patterns. This aligns with representation learning and introduces a significant theoretical insight into graph learning.

Relevance: 9 Novelty: 8


7. How to Train Your Metamorphic Deep Neural Network

ArXiv ID: 2505.05510

Authors: Thomas Sommariva, Simone Calderara, Angelo Porrello

Abstract: Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the undelying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: https://github.com/TSommariva/HTTY_NeuMeta.

Comment: The paper proposes a training algorithm for neural metamorphosis, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


8. MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design

ArXiv ID: 2505.05799

Authors: Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

Abstract: Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.

Comment: The paper proposes MxMoE, a mixed-precision quantization framework for MoE models, addressing both algorithmic and system-level challenges. This aligns closely with the model compression and MoE criteria, offering novel insights into quantization sensitivity and expert activation dynamics.

Relevance: 9 Novelty: 8


9. Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

ArXiv ID: 2505.05772

Authors: Zehao Fan, Garrett Gagnon, Zhenyu Liu, Liu Liu

Abstract: Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure on memory systems due to frequent memory accesses and growing key-value (KV) caches. This creates a bottleneck in memory bandwidth, especially as context lengths increase. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques. Consequently, they suffer from workload imbalance, reducing throughput and resource utilization. In this work, we propose STARC, a novel sparsity-optimized data mapping scheme tailored specifically for efficient LLM decoding on PIM architectures. STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids, enabling selective attention and parallel processing without frequent reclustering or data movement overhead. Experiments on the HBM-PIM system show that, compared to common token-wise sparsity methods, STARC reduces attention-layer latency by 19%--31% and energy consumption by 19%--27%. Under a KV cache budget of 1024, it achieves up to 54%--74% latency reduction and 45%--67% energy reduction compared to full KV cache retrieval. Meanwhile, STARC maintains model accuracy comparable to state-of-the-art sparse attention methods, demonstrating its effectiveness in enabling efficient and hardware-friendly long-context LLM inference on PIM architectures.

Comment: The paper focuses on sparsity-optimized data mapping for efficient LLM decoding, which aligns with the 'Model Compression' criterion, particularly in sparsity and KV cache optimization. It also touches on 'Large Language Models' by addressing efficiency in LLM inference.

Relevance: 9 Novelty: 8


10. New Statistical and Computational Results for Learning Junta Distributions

ArXiv ID: 2505.05819

Authors: Lorenzo Beretta

Abstract: We study the problem of learning junta distributions on ${0, 1}^n$, where a distribution is a $k$-junta if its probability mass function depends on a subset of at most $k$ variables. We make two main contributions: - We show that learning $k$-junta distributions is \emph{computationally} equivalent to learning $k$-parity functions with noise (LPN), a landmark problem in computational learning theory. - We design an algorithm for learning junta distributions whose statistical complexity is optimal, up to polylogarithmic factors. Computationally, our algorithm matches the complexity of previous (non-sample-optimal) algorithms. Combined, our two contributions imply that our algorithm cannot be significantly improved, statistically or computationally, barring a breakthrough for LPN.

Comment: The paper studies learning junta distributions and connects it to the LPN problem, which aligns with 'Emerging Trends' by addressing a computationally significant theoretical problem.

Relevance: 8 Novelty: 9


11. Hypergraph Neural Sheaf Diffusion: A Symmetric Simplicial Set Framework for Higher-Order Learning

ArXiv ID: 2505.05702

Authors: Seongjin Choi, Gahee Kim, Yong-Geun Oh

Abstract: The absence of intrinsic adjacency relations and orientation systems in hypergraphs creates fundamental challenges for constructing sheaf Laplacians of arbitrary degrees. We resolve these limitations through symmetric simplicial sets derived directly from hypergraphs, which encode all possible oriented subrelations within each hyperedge as ordered tuples. This construction canonically defines adjacency via facet maps while inherently preserving hyperedge provenance. We establish that the normalized degree zero sheaf Laplacian on our induced symmetric simplicial set reduces exactly to the traditional graph normalized sheaf Laplacian when restricted to graphs, validating its mathematical consistency with prior graph-based sheaf theory. Furthermore, the induced structure preserves all structural information from the original hypergraph, ensuring that every multi-way relational detail is faithfully retained. Leveraging this framework, we introduce Hypergraph Neural Sheaf Diffusion (HNSD), the first principled extension of Neural Sheaf Diffusion (NSD) to hypergraphs. HNSD operates via normalized degree zero sheaf Laplacians over symmetric simplicial sets, resolving orientation ambiguity and adjacency sparsity inherent to hypergraph learning. Experimental evaluations demonstrate HNSD's competitive performance across established benchmarks.

Comment: The paper introduces a novel framework for hypergraph learning using Neural Sheaf Diffusion, which aligns with representation learning and architectural innovations. The use of symmetric simplicial sets for higher-order learning is a significant theoretical contribution.

Relevance: 8 Novelty: 8


12. Neuro-Symbolic Concepts

ArXiv ID: 2505.06191

Authors: Jiayuan Mao, Joshua B. Tenenbaum, Jiajun Wu

Abstract: This article presents a concept-centric paradigm for building agents that can learn continually and reason flexibly. The concept-centric agent utilizes a vocabulary of neuro-symbolic concepts. These concepts, such as object, relation, and action concepts, are grounded on sensory inputs and actuation outputs. They are also compositional, allowing for the creation of novel concepts through their structural combination. To facilitate learning and reasoning, the concepts are typed and represented using a combination of symbolic programs and neural network representations. Leveraging such neuro-symbolic concepts, the agent can efficiently learn and recombine them to solve various tasks across different domains, ranging from 2D images, videos, 3D scenes, and robotic manipulation tasks. This concept-centric framework offers several advantages, including data efficiency, compositional generalization, continual learning, and zero-shot transfer.

Comment: The paper presents a concept-centric neuro-symbolic framework for continual learning and reasoning, which is relevant to representation learning due to its focus on compositional and efficient representations. The neuro-symbolic approach is novel and impactful.

Relevance: 8 Novelty: 8


13. Generative Discovery of Partial Differential Equations by Learning from Math Handbooks

ArXiv ID: 2505.05869

Authors: Hao Xu, Yuntian Chen, Rui Cao, Tianning Tang, Mengge Du, Jian Li, Adrian H. Callaghan, Dongxiao Zhang

Abstract: Data driven discovery of partial differential equations (PDEs) is a promising approach for uncovering the underlying laws governing complex systems. However, purely data driven techniques face the dilemma of balancing search space with optimization efficiency. This study introduces a knowledge guided approach that incorporates existing PDEs documented in a mathematical handbook to facilitate the discovery process. These PDEs are encoded as sentence like structures composed of operators and basic terms, and used to train a generative model, called EqGPT, which enables the generation of free form PDEs. A loop of generation evaluation optimization is constructed to autonomously identify the most suitable PDE. Experimental results demonstrate that this framework can recover a variety of PDE forms with high accuracy and computational efficiency, particularly in cases involving complex temporal derivatives or intricate spatial terms, which are often beyond the reach of conventional methods. The approach also exhibits generalizability to irregular spatial domains and higher dimensional settings. Notably, it succeeds in discovering a previously unreported PDE governing strongly nonlinear surface gravity waves propagating toward breaking, based on real world experimental data, highlighting its applicability to practical scenarios and its potential to support scientific discovery.

Comment: The paper introduces a generative model for discovering PDEs, which aligns with AI for Science by addressing foundational research in scientific discovery. The use of generative models for PDE discovery is novel and impactful.

Relevance: 8 Novelty: 8


14. Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions

ArXiv ID: 2505.05755

Authors: Dhruvesh Patel, Aishwarya Sahoo, Avinash Amballa, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum

Abstract: Autoregressive models (ARMs), which predict subsequent tokens one-by-one ``from left to right,'' have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence -- that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences where token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, we show that ILMs outperform MDMs and perform on par with ARMs in an unconditional text generation task while offering greater flexibility than MDMs in arbitrary-length text infilling.

Comment: The paper introduces Insertion Language Models (ILMs), which offer a novel approach to sequence generation by inserting tokens at arbitrary positions. This aligns with foundational research in model architecture and sequence generation paradigms.

Relevance: 8 Novelty: 8


15. Griffin: Towards a Graph-Centric Relational Database Foundation Model

ArXiv ID: 2505.05568

Authors: Yanbo Wang, Xiyuan Wang, Quan Gan, Minjie Wang, Qibin Yang, David Wipf, Muhan Zhang

Abstract: We introduce Griffin, the first foundation model attemptation designed specifically for Relational Databases (RDBs). Unlike previous smaller models focused on single RDB tasks, Griffin unifies the data encoder and task decoder to handle diverse tasks. Additionally, we enhance the architecture by incorporating a cross-attention module and a novel aggregator. Griffin utilizes pretraining on both single-table and RDB datasets, employing advanced encoders for categorical, numerical, and metadata features, along with innovative components such as cross-attention modules and enhanced message-passing neural networks (MPNNs) to capture the complexities of relational data. Evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains (spanning over 150 million nodes), Griffin demonstrates superior or comparable performance to individually trained models, excels in low-data scenarios, and shows strong transferability with similarity and diversity in pretraining across new datasets and tasks, highlighting its potential as a universally applicable foundation model for RDBs. Code available at https://github.com/yanxwb/Griffin.

Comment: The paper introduces Griffin, a foundation model for relational databases, which aligns with the 'Emerging Trends' criterion by proposing a novel architecture for a new domain of foundation models.

Relevance: 8 Novelty: 8


16. Autoencoder-Based Hybrid Replay for Class-Incremental Learning

ArXiv ID: 2505.05926

Authors: Milad Khademi Nori, Il-Min Kim, Guanghui Wang

Abstract: In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks $t$ increases. Current exemplar replay strategies impose $\mathcal{O}(t)$ memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) to function as a compressor to alleviate the requirement for large memory, achieving $\mathcal{O}(0.1 t)$ at the worst case with the computing complexity of $\mathcal{O}(t)$ while accomplishing state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged particle system energy minimization equations and repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets. The source code is included in the supplementary material and will be open-sourced upon publication.

Comment: The paper introduces an autoencoder-based hybrid replay strategy for class-incremental learning, which aligns with foundational research in autoencoders and representation learning. The use of hybrid autoencoders for both discriminative and generative modeling is a novel contribution.

Relevance: 8 Novelty: 7


17. UniSymNet: A Unified Symbolic Network Guided by Transformer

ArXiv ID: 2505.06091

Authors: Xinxin Li, Juan Zhang, Da Li, Xingyu Liu, Jin Xu, Junping Yin

Abstract: Symbolic Regression (SR) is a powerful technique for automatically discovering mathematical expressions from input data. Mainstream SR algorithms search for the optimal symbolic tree in a vast function space, but the increasing complexity of the tree structure limits their performance. Inspired by neural networks, symbolic networks have emerged as a promising new paradigm. However, most existing symbolic networks still face certain challenges: binary nonlinear operators ${\times, \div}$ cannot be naturally extended to multivariate operators, and training with fixed architecture often leads to higher complexity and overfitting. In this work, we propose a Unified Symbolic Network that unifies nonlinear binary operators into nested unary operators and define the conditions under which UniSymNet can reduce complexity. Moreover, we pre-train a Transformer model with a novel label encoding method to guide structural selection, and adopt objective-specific optimization strategies to learn the parameters of the symbolic network. UniSymNet shows high fitting accuracy, excellent symbolic solution rate, and relatively low expression complexity, achieving competitive performance on low-dimensional Standard Benchmarks and high-dimensional SRBench.

Comment: The paper introduces a novel symbolic network guided by Transformers, which aligns with architectural innovations and representation learning.

Relevance: 8 Novelty: 7


18. Register and CLS tokens yield a decoupling of local and global features in large ViTs

ArXiv ID: 2505.05892

Authors: Alexander Lappe, Martin A. Giese

Abstract: Recent work has shown that the attention maps of the widely popular DINOv2 model exhibit artifacts, which hurt both model interpretability and performance on dense image tasks. These artifacts emerge due to the model repurposing patch tokens with redundant local information for the storage of global image information. To address this problem, additional register tokens have been incorporated in which the model can store such information instead. We carefully examine the influence of these register tokens on the relationship between global and local image features, showing that while register tokens yield cleaner attention maps, these maps do not accurately reflect the integration of local image information in large models. Instead, global information is dominated by information extracted from register tokens, leading to a disconnect between local and global features. Inspired by these findings, we show that the CLS token itself, which can be interpreted as a register, leads to a very similar phenomenon in models without explicit register tokens. Our work shows that care must be taken when interpreting attention maps of large ViTs. Further, by clearly attributing the faulty behaviour to register and CLS tokens, we show a path towards more interpretable vision models.

Comment: The paper analyzes the decoupling of local and global features in ViTs, providing insights into architectural behavior and interpretability.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: