Previous Day 2025-06-17
Monthly Overview 2025-06
Next Day 2025-06-23

Personalized Daily ArXiv Papers 2025-06-18

[gpt-4o] Prompt Completion Total
Token 45460 5677 51137
Cost $0.11 $0.06 $0.17

Total arXiv papers: 554

Total scanned papers: 333

Total relevant papers: 37

Table of contents with paper titles:

  1. Load Balancing Mixture of Experts with Similarity Preserving Routers Authors: Nabil Omi, Siddhartha Sen, Ali Farhadi

  2. Structured and Informed Probabilistic Modeling with the Thermodynamic Kolmogorov-Arnold Model Authors: Prithvi Raj

  3. Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks Authors: Ziyuan Tang, Jie Chen

  4. Transformers Learn Faster with Semantic Focus Authors: Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray

  5. 'Memory States' from Almost Nothing: Representing and Computing in a Non-associative Algebra Authors: Stefan Reimann

  6. Evolutionary chemical learning in dimerization networks Authors: Alexei V. Tkachenko, Bortolo Matteo Mognetti, Sergei Maslov

  7. Scientifically-Interpretable Reasoning Network (ScIReN): Uncovering the Black-Box of Nature Authors: Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes

  8. Less is More: Undertraining Experts Improves Model Upcycling Authors: Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

  9. Mirror Descent Using the Tempesta Generalized Multi-parametric Logarithms Authors: Andrzej Cichocki

  10. MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models Authors: Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen

  11. MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation Authors: Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu

  12. Machine Mirages: Defining the Undefined Authors: Hamidou Tembine

  13. Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs Authors: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen

  14. Logical Expressiveness of Graph Neural Networks with Hierarchical Node Individualization Authors: Arie Soeteman, Balder ten Cate

  15. Single-Example Learning in a Mixture of GPDMs with Latent Geometries Authors: Jesse St. Amand, Leonardo Gizzi, Martin A. Giese

  16. Exploring Speaker Diarization with Mixture of Experts Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Hang Chen, Jun Du

  17. Sharp Generalization Bounds for Foundation Models with Asymmetric Randomized Low-Rank Adapters Authors: Anastasis Kratsios, Tin Sum Cheng, Aurelien Lucchi, Haitz S\'aez de Oc\'ariz Borde

  18. Sampling from Your Language Model One Byte at a Time Authors: Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh

  19. Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations Authors: Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das

  20. Can Large Language Models Improve Spectral Graph Neural Networks? Authors: Kangkang Lu, Yanhua Yu, Zhiyong Huang, Tat-Seng Chua

  21. MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs Authors: Zhenyan Lu, Daliang Xu, Dongqi Cai, Zexi Li, Wei Liu, Fangming Liu, Shangguang Wang, Mengwei Xu

  22. Object-Centric Neuro-Argumentative Learning Authors: Abdul Rahman Jacob, Avinash Kori, Emanuele De Angelis, Ben Glocker, Maurizio Proietti, Francesca Toni

  23. A Hybrid Neural Network -- Polynomial Series Scheme for Learning Invariant Manifolds of Discrete Dynamical Systems Authors: Dimitrios G. Patsatzis, Nikolaos Kazantzis, Ioannis G. Kevrekidis, Constantinos Siettos

  24. A Variational Information Theoretic Approach to Out-of-Distribution Detection Authors: Sudeepta Mondal, Zhuolin Jiang, Ganesh Sundaramoorthi

  25. Knowledge Compression via Question Generation: Enhancing Multihop Document Retrieval without Fine-tuning Authors: Anvi Alex Eponon, Moein Shahiki-Tash, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov, Alexander Gelbukh

  26. GenerationPrograms: Fine-grained Attribution with Executable Programs Authors: David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, Mohit Bansal

  27. Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences Authors: Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He

  28. AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs Authors: Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

  29. Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection Authors: Zhijing Wan, Zhixiang Wang, Zheng Wang, Xin Xu, Shin'ichi Satoh

  30. Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models Authors: Ben Finkelshtein, .Ismail .Ilkan Ceylan, Michael Bronstein, Ron Levie

  31. From Bytes to Ideas: Language Modeling with Autoregressive U-Nets Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

  32. ResNets Are Deeper Than You Think Authors: Christian H. X. Ali Mehmeti-G\"opel, Michael Wand

  33. Quantifying Structure in CLIP Embeddings: A Statistical Framework for Concept Interpretation Authors: Jitian Zhao, Chenghui Li, Frederic Sala, Karl Rohe

  34. Expressive Score-Based Priors for Distribution Matching with Geometry-Preserving Regularization Authors: Ziyu Gong, Jim Lim, David I. Inouye

  35. Optimizing Length Compression in Large Reasoning Models Authors: Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou

  36. Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning Authors: William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane

  37. S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models Authors: Tao He, Guang Huang, Yu Yang, Tianshi Xu, Sicheng Zhao, Guiguang Ding, Pengyang Wang, Feng Tian


1. Load Balancing Mixture of Experts with Similarity Preserving Routers

ArXiv ID: 2506.14038

Authors: Nabil Omi, Siddhartha Sen, Ali Farhadi

Abstract: Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters ("experts") for each input. A learned router computes a distribution over these experts, and assigns input tokens to a small subset. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage a distribution over experts that resembles a roughly uniform distribution of experts per token. During training, this can result in inconsistent routing behavior, resulting in the model spending its capacity to learn redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experimental results show that applying our loss to the router results in 36% faster convergence and lower redundancy compared to a popular load balancing loss.

Comment: The paper addresses load balancing in Sparse Mixture of Experts (MoE) models, which is directly relevant to model architecture and efficiency.

Relevance: 10 Novelty: 8


2. Structured and Informed Probabilistic Modeling with the Thermodynamic Kolmogorov-Arnold Model

ArXiv ID: 2506.14167

Authors: Prithvi Raj

Abstract: We adapt the Kolmogorov-Arnold Representation Theorem to generative modeling by reinterpreting its inner functions as a Markov Kernel between probability spaces via inverse transform sampling. We present a generative model that is interpretable, easy to design, and efficient. Our approach couples a Kolmogorov-Arnold Network generator with independent energy-based priors, trained via Maximum Likelihood. Inverse sampling enables fast inference, while prior knowledge can be incorporated before training to better align priors with posteriors, thereby improving learning efficiency and sample quality. The learned prior is also recoverable and visualizable post-training, offering an empirical Bayes perspective. To address inflexibility and mitigate prior-posterior mismatch, we introduce scalable extensions based on mixture distributions and Langevin Monte Carlo methods, admitting a trade-off between flexibility and training efficiency. Our contributions connect classical representation theorems with modern probabilistic modeling, while balancing training stability, inference speed, and the quality and diversity of generations.

Comment: The paper introduces a novel probabilistic model inspired by classical representation theorems, relevant to emerging trends in generative modeling.

Relevance: 9 Novelty: 9


3. Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks

ArXiv ID: 2506.14098

Authors: Ziyuan Tang, Jie Chen

Abstract: A foundation model like GPT elicits many emergent abilities, owing to the pre-training with broad inclusion of data and the use of the powerful Transformer architecture. While foundation models in natural languages are prevalent, can we build similar models for graphs? This paper describes an approach toward a graph foundation model that is pre-trained with diverse graph datasets by adapting the Transformer backbone. A central challenge toward this end is how a sequence model encodes graphs of varying sizes and from different domains. We propose representing a node as multiple random walks, such that the Transformer can extract node representations from sequences, which in turn form edge and graph representations. We develop a novel context prediction loss for these random walks and theoretically analyze their expressive power in distinguishing neighborhoods and graphs. We also demonstrate the pre-training of our model and its adaptation to downstream tasks, showcasing its potential as a foundation for processing and reasoning with graph-structured data.

Comment: The paper proposes a graph foundation model using Transformers, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8


4. Transformers Learn Faster with Semantic Focus

ArXiv ID: 2506.14095

Authors: Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray

Abstract: Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of learnability and generalization. Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits -- a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model's "semantic focus" with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that explain this behavior. We establish a connection between the stability of the standard softmax and the loss function's Lipschitz properties, then show how sparsity affects the stability of the softmax and the subsequent convergence and generalization guarantees resulting from the attention mechanism. This allows us to theoretically establish that input-agnostic sparse attention does not provide any benefits. We also characterize conditions when semantic focus (input-dependent sparse attention) can provide improved guarantees, and we validate that these conditions are in fact met in our empirical evaluations.

Comment: The paper studies sparse attention in transformers, focusing on learnability and generalization, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8


5. 'Memory States' from Almost Nothing: Representing and Computing in a Non-associative Algebra

ArXiv ID: 2506.13768

Authors: Stefan Reimann

Abstract: This note presents a non-associative algebraic framework for the representation and computation of information items in high-dimensional space. This framework is consistent with the principles of spatial computing and with the empirical findings in cognitive science about memory. Computations are performed through a process of multiplication-like binding and non-associative interference-like bundling. Models that rely on associative bundling typically lose order information, which necessitates the use of auxiliary order structures, such as position markers, to represent sequential information that is important for cognitive tasks. In contrast, the non-associative bundling proposed allows the construction of sparse representations of arbitrarily long sequences that maintain their temporal structure across arbitrary lengths. In this operation, noise is a constituent element of the representation of order information, rather than a means of obscuring it. The non-associative nature of the proposed framework results in the representation of a single sequence by two distinct states. The L-state, generated through left-associative bundling, continuously updates and emphasises a recency effect, while the R-state, formed through right-associative bundling, encodes finite sequences or chunks, capturing a primacy effect. The construction of these states may be associated with activity in the prefrontal cortex in relation to short-term memory and hippocampal encoding in long-term memory, respectively. The accuracy of retrieval is contingent upon a decision-making process that is based on the mutual information between the memory states and the cue. The model is able to replicate the Serial Position Curve, which reflects the empirical recency and primacy effects observed in cognitive experiments.

Comment: The paper presents a non-associative algebraic framework for representation and computation, which aligns with representation learning.

Relevance: 9 Novelty: 8


6. Evolutionary chemical learning in dimerization networks

ArXiv ID: 2506.14006

Authors: Alexei V. Tkachenko, Bortolo Matteo Mognetti, Sergei Maslov

Abstract: We present a novel framework for chemical learning based on Competitive Dimerization Networks (CDNs) - systems in which multiple molecular species, e.g. proteins or DNA/RNA oligomers, reversibly bind to form dimers. We show that these networks can be trained in vitro through directed evolution, enabling the implementation of complex learning tasks such as multiclass classification without digital hardware or explicit parameter tuning. Each molecular species functions analogously to a neuron, with binding affinities acting as tunable synaptic weights. A training protocol involving mutation, selection, and amplification of DNA-based components allows CDNs to robustly discriminate among noisy input patterns. The resulting classifiers exhibit strong output contrast and high mutual information between input and output, especially when guided by a contrast-enhancing loss function. Comparative analysis with in silico gradient descent training reveals closely correlated performance. These results establish CDNs as a promising platform for analog physical computation, bridging synthetic biology and machine learning, and advancing the development of adaptive, energy-efficient molecular computing systems.

Comment: The paper introduces a novel framework for chemical learning using dimerization networks, which is relevant to AI for Science with a focus on foundational research.

Relevance: 9 Novelty: 8


7. Scientifically-Interpretable Reasoning Network (ScIReN): Uncovering the Black-Box of Nature

ArXiv ID: 2506.14054

Authors: Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes

Abstract: Neural networks are a powerful tool for learning patterns from data. However, they do not respect known scientific laws, nor can they reveal novel scientific insights due to their black-box nature. In contrast, scientific reasoning distills biological or physical principles from observations and controlled experiments, and quantitatively interprets them with process-based models made of mathematical equations. Yet, process-based models rely on numerous free parameters that must be set in an ad-hoc manner, and thus often fit observations poorly in cross-scale predictions. While prior work has embedded process-based models in conventional neural networks, discovering interpretable relationships between parameters in process-based models and input features is still a grand challenge for scientific discovery. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge, further enhancing its interpretability. While the embedded process-based model enforces established scientific knowledge, the encoder reveals new scientific mechanisms and relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability -- it can infer latent scientific mechanisms and their relationships with input features.

Comment: The paper proposes a scientifically-interpretable reasoning network, which is relevant to AI for Science with a focus on foundational research.

Relevance: 9 Novelty: 8


8. Less is More: Undertraining Experts Improves Model Upcycling

ArXiv ID: 2506.14126

Authors: Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

Abstract: Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance, both for fully fine-tuned and LoRA-adapted models, and to worse downstream results when LoRA adapters are upcycled into MoE layers. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps and are subsequently forgotten during merging. Finally, we demonstrate that a task-dependent aggressive early stopping strategy can significantly improve upcycling performance.

Comment: The paper challenges assumptions in model upcycling and discusses MoE layers, aligning with the model architecture criterion.

Relevance: 9 Novelty: 8


9. Mirror Descent Using the Tempesta Generalized Multi-parametric Logarithms

ArXiv ID: 2506.13984

Authors: Andrzej Cichocki

Abstract: In this paper, we develop a wide class Mirror Descent (MD) algorithms, which play a key role in machine learning. For this purpose we formulated the constrained optimization problem, in which we exploits the Bregman divergence with the Tempesta multi-parametric deformation logarithm as a link function. This link function called also mirror function defines the mapping between the primal and dual spaces and is associated with a very-wide (in fact, theoretically infinite) class of generalized trace-form entropies. In order to derive novel MD updates, we estimate generalized exponential function, which closely approximates the inverse of the multi-parametric Tempesta generalized logarithm. The shape and properties of the Tempesta logarithm and its inverse-deformed exponential functions can be tuned by several hyperparameters. By learning these hyperparameters, we can adapt to distribution or geometry of training data, and we can adjust them to achieve desired properties of MD algorithms. The concept of applying multi-parametric logarithms allow us to generate a new wide and flexible family of MD and mirror-less MD updates.

Comment: The paper develops a new class of Mirror Descent algorithms, which aligns with the emerging trends criterion.

Relevance: 9 Novelty: 8


10. MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models

ArXiv ID: 2506.14435

Authors: Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen

Abstract: Large multimodal Mixture-of-Experts (MoEs) effectively scale the model size to boost performance while maintaining fixed active parameters. However, previous works primarily utilized full-precision experts during sparse up-cycling. Despite they show superior performance on end tasks, the large amount of experts introduces higher memory footprint, which poses significant challenges for the deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach has promising scaling trend along model size. MoTE achieves comparable performance to full-precision baseline MoE-LLaVA while offering lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods and the advantage further amplifies when memory-constraint goes lower. Given the same amount of expert memory footprint of 3.4GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by a gain of 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.

Comment: The paper introduces MoTE, a memory-efficient approach for Mixture-of-Experts models, relevant to model compression and architecture.

Relevance: 9 Novelty: 8


11. MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation

ArXiv ID: 2506.14436

Authors: Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu

Abstract: Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel ''model MoE-ization'' strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.

Comment: The paper proposes MoORE, a novel model MoE-ization strategy for multi-task adaptation, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


12. Machine Mirages: Defining the Undefined

ArXiv ID: 2506.13990

Authors: Hamidou Tembine

Abstract: As multimodal machine intelligence systems started achieving average animal-level and average human-level fluency in many measurable tasks in processing images, language, and sound, they began to exhibit a new class of cognitive aberrations: machine mirages. These include delusion, illusion, confabulation, hallucination, misattribution error, semantic drift, semantic compression, exaggeration, causal inference failure, uncanny valley of perception, bluffing-patter-bullshitting, cognitive stereotypy, pragmatic misunderstanding, hypersignification, semantic reheating-warming, simulated authority effect, fallacious abductive leap, contextual drift, referential hallucination, semiotic Frankenstein effect, calibration failure, spurious correlation, bias amplification, concept drift sensitivity, misclassification under uncertainty, adversarial vulnerability, overfitting, prosodic misclassification, accent bias, turn boundary failure, semantic boundary confusion, noise overfitting, latency-induced decision drift, ambiguity collapse and other forms of error that mimic but do not replicate human or animal fallibility. This article presents some of the errors and argues that these failures must be explicitly defined and systematically assessed. Understanding machine mirages is essential not only for improving machine intelligence reliability but also for constructing a multiscale ethical, co-evolving intelligence ecosystem that respects the diverse forms of life, cognition, and expression it will inevitably touch.

Comment: The paper discusses 'machine mirages', a new class of cognitive aberrations in multimodal machine intelligence systems, which could be considered an emerging trend challenging established assumptions.

Relevance: 9 Novelty: 8


13. Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

ArXiv ID: 2506.14731

Authors: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen

Abstract: We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization(C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed dataset. We will release the model, dataset, and code.

Comment: The paper presents Ring-lite, a Mixture-of-Experts (MoE)-based LLM optimized via reinforcement learning, which aligns with model architecture by focusing on MoE and efficiency.

Relevance: 9 Novelty: 8


14. Logical Expressiveness of Graph Neural Networks with Hierarchical Node Individualization

ArXiv ID: 2506.13911

Authors: Arie Soeteman, Balder ten Cate

Abstract: We propose and study Hierarchical Ego Graph Neural Networks (HEGNNs), an expressive extension of graph neural networks (GNNs) with hierarchical node individualization, inspired by the Individualization-Refinement paradigm for graph isomorphism testing. HEGNNs generalize subgraph-GNNs and form a hierarchy of increasingly expressive models that, in the limit, can distinguish graphs up to isomorphism. We provide a logical characterization of HEGNN node classifiers, with and without subgraph restrictions, using graded hybrid logic. This characterization enables us to relate the separating power of HEGNNs to that of higher-order GNNs, GNNs enriched with local homomorphism count features, and color refinement algorithms based on Individualization-Refinement. Our experimental results confirm the practical feasibility of HEGNNs and show benefits in comparison with traditional GNN architectures, both with and without local homomorphism count features.

Comment: The paper proposes Hierarchical Ego Graph Neural Networks (HEGNNs) with a focus on logical expressiveness and graph isomorphism, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8


15. Single-Example Learning in a Mixture of GPDMs with Latent Geometries

ArXiv ID: 2506.14563

Authors: Jesse St. Amand, Leonardo Gizzi, Martin A. Giese

Abstract: We present the Gaussian process dynamical mixture model (GPDMM) and show its utility in single-example learning of human motion data. The Gaussian process dynamical model (GPDM) is a form of the Gaussian process latent variable model (GPLVM), but optimized with a hidden Markov model dynamical prior. The GPDMM combines multiple GPDMs in a probabilistic mixture-of-experts framework, utilizing embedded geometric features to allow for diverse sequences to be encoded in a single latent space, enabling the categorization and generation of each sequence class. GPDMs and our mixture model are particularly advantageous in addressing the challenges of modeling human movement in scenarios where data is limited and model interpretability is vital, such as in patient-specific medical applications like prosthesis control. We score the GPDMM on classification accuracy and generative ability in single-example learning, showcase model variations, and benchmark it against LSTMs, VAEs, and transformers.

Comment: The paper presents a mixture-of-experts framework using Gaussian process dynamical models, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 7


16. Exploring Speaker Diarization with Mixture of Experts

ArXiv ID: 2506.14750

Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Hang Chen, Jun Du

Abstract: In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.

Comment: The paper explores speaker diarization with a Mixture of Experts (MoE) approach, which aligns with model architecture by introducing MoE in speaker diarization.

Relevance: 9 Novelty: 7


17. Sharp Generalization Bounds for Foundation Models with Asymmetric Randomized Low-Rank Adapters

ArXiv ID: 2506.14530

Authors: Anastasis Kratsios, Tin Sum Cheng, Aurelien Lucchi, Haitz S\'aez de Oc\'ariz Borde

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning (PEFT) technique for foundation models. Recent work has highlighted an inherent asymmetry in the initialization of LoRA's low-rank factors, which has been present since its inception and was presumably derived experimentally. This paper focuses on providing a comprehensive theoretical characterization of asymmetric LoRA with frozen random factors. First, while existing research provides upper-bound generalization guarantees based on averages over multiple experiments, the behaviour of a single fine-tuning run with specific random factors remains an open question. We address this by investigating the concentration of the typical LoRA generalization gap around its mean. Our main upper bound reveals a sample complexity of $\tilde{\mathcal{O}}\left(\frac{\sqrt{r}}{\sqrt{N}}\right)$ with high probability for rank $r$ LoRAs trained on $N$ samples. Additionally, we also determine the fundamental limits in terms of sample efficiency, establishing a matching lower bound of $\mathcal{O}\left(\frac{1}{\sqrt{N}}\right)$. By more closely reflecting the practical scenario of a single fine-tuning run, our findings offer crucial insights into the reliability and practicality of asymmetric LoRA.

Comment: The paper provides theoretical insights into LoRA with asymmetric randomized low-rank adapters, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 8


18. Sampling from Your Language Model One Byte at a Time

ArXiv ID: 2506.14123

Authors: Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh

Abstract: Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations. For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. This Prompt Boundary Problem (PBP) also arises in languages such as Chinese and in code generation, where tokens often do not line up with syntactic boundaries. Additionally mismatching tokenizers often hinder model composition and interoperability. For example, it is not possible to directly ensemble models with different tokenizers due to their mismatching vocabularies. To address these issues, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM, without changing its generative distribution at the text level. Our method efficient solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time as well as transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.

Comment: The paper addresses tokenization issues in language models, which is relevant to large language models and their interpretability.

Relevance: 8 Novelty: 7


19. Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

ArXiv ID: 2506.13901

Authors: Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das

Abstract: Alignment is no longer a luxury, it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.

Comment: The paper introduces Hierarchical Ego Graph Neural Networks, which is relevant to model architecture and representation learning.

Relevance: 8 Novelty: 7


20. Can Large Language Models Improve Spectral Graph Neural Networks?

ArXiv ID: 2506.14220

Authors: Kangkang Lu, Yanhua Yu, Zhiyong Huang, Tat-Seng Chua

Abstract: Spectral Graph Neural Networks (SGNNs) have attracted significant attention due to their ability to approximate arbitrary filters. They typically rely on supervision from downstream tasks to adaptively learn appropriate filters. However, under label-scarce conditions, SGNNs may learn suboptimal filters, leading to degraded performance. Meanwhile, the remarkable success of Large Language Models (LLMs) has inspired growing interest in exploring their potential within the GNN domain. This naturally raises an important question: \textit{Can LLMs help overcome the limitations of SGNNs and enhance their performance?} In this paper, we propose a novel approach that leverages LLMs to estimate the homophily of a given graph. The estimated homophily is then used to adaptively guide the design of polynomial spectral filters, thereby improving the expressiveness and adaptability of SGNNs across diverse graph structures. Specifically, we introduce a lightweight pipeline in which the LLM generates homophily-aware priors, which are injected into the filter coefficients to better align with the underlying graph topology. Extensive experiments on benchmark datasets demonstrate that our LLM-driven SGNN framework consistently outperforms existing baselines under both homophilic and heterophilic settings, with minimal computational and monetary overhead.

Comment: The paper explores the use of LLMs to improve Spectral Graph Neural Networks, relevant to large language models and representation learning.

Relevance: 8 Novelty: 7


21. MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs

ArXiv ID: 2506.13772

Authors: Zhenyan Lu, Daliang Xu, Dongqi Cai, Zexi Li, Wei Liu, Fangming Liu, Shangguang Wang, Mengwei Xu

Abstract: Large language models (LLMs) are deployed on mobile devices to power killer applications such as intelligent assistants. LLMs pre-trained on general corpora often hallucinate when handling personalized or unseen queries, leading to incorrect or outdated responses. Knowledge editing addresses this by identifying and adjusting a small crucial portion of model weights, without compromising the general knowledge. However, prior knowledge editing methods are impractical to run on local devices due to the resource-heavy backpropagation (BP) needed for updates. We present MobiEdit, the first mobile knowledge editing framework that enables efficient LLM personalization on commercial off-the-shelf (COTS) mobile devices. MobiEdit replaces full-precision BP with quantized forward-only gradient estimation, thus compatible with the energy-efficient mobile neural processing units (NPUs). MobiEdit replaces full-precision backpropagation with quantized forward-only gradient estimation, making it compatible with energy-efficient mobile NPUs. To further improve gradient estimation efficiency, we introduce two optimizations: an early stoping mechanism that adaptively terminates editing upon success and a prefix cache that reuses computation across steps. Our approach enables real-time editing of a 3B-parameter model (Qwen2.5-3B-Instruct) on COTS mobile devices with 7.6$\times$ less memory, 14.7 $\times$ less energy and 3.6$\times$ less latency compared to previous knowledge editing methods.

Comment: The paper focuses on resource-efficient knowledge editing for LLMs on mobile devices, which relates to model compression and efficiency.

Relevance: 8 Novelty: 7


22. Object-Centric Neuro-Argumentative Learning

ArXiv ID: 2506.14577

Authors: Abdul Rahman Jacob, Avinash Kori, Emanuele De Angelis, Ben Glocker, Maurizio Proietti, Francesca Toni

Abstract: Over the last decade, as we rely more on deep learning technologies to make critical decisions, concerns regarding their safety, reliability and interpretability have emerged. We introduce a novel Neural Argumentative Learning (NAL) architecture that integrates Assumption-Based Argumentation (ABA) with deep learning for image analysis. Our architecture consists of neural and symbolic components. The former segments and encodes images into facts using object-centric learning, while the latter applies ABA learning to develop ABA frameworks enabling predictions with images. Experiments on synthetic data show that the NAL architecture can be competitive with a state-of-the-art alternative.

Comment: The paper introduces a novel architecture combining neural and symbolic components, which aligns with model architecture innovations.

Relevance: 8 Novelty: 7


23. A Hybrid Neural Network -- Polynomial Series Scheme for Learning Invariant Manifolds of Discrete Dynamical Systems

ArXiv ID: 2506.13950

Authors: Dimitrios G. Patsatzis, Nikolaos Kazantzis, Ioannis G. Kevrekidis, Constantinos Siettos

Abstract: We propose a hybrid machine learning scheme to learn -- in physics-informed and numerical analysis-informed fashion -- invariant manifolds (IM) of discrete maps for constructing reduced-order models (ROMs) for dynamical systems. The proposed scheme combines polynomial series with shallow neural networks, exploiting the complementary strengths of both approaches. Polynomials enable an efficient and accurate modeling of ROMs with guaranteed local exponential convergence rate around the fixed point, where, under certain assumptions, the IM is demonstrated to be analytic. Neural networks provide approximations to more complex structures beyond the reach of the polynomials' convergence. We evaluate the efficiency of the proposed scheme using three benchmark examples, examining convergence behavior, numerical approximation accuracy, and computational training cost. Additionally, we compare the IM approximations obtained solely with neural networks and with polynomial expansions. We demonstrate that the proposed hybrid scheme outperforms both pure polynomial approximations (power series, Legendre and Chebyshev polynomials) and standalone shallow neural network approximations in terms of numerical approximation accuracy.

Comment: The paper proposes a hybrid neural network and polynomial series scheme for learning invariant manifolds, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


24. A Variational Information Theoretic Approach to Out-of-Distribution Detection

ArXiv ID: 2506.14194

Authors: Sudeepta Mondal, Zhuolin Jiang, Ganesh Sundaramoorthi

Abstract: We present a theory for the construction of out-of-distribution (OOD) detection features for neural networks. We introduce random features for OOD through a novel information-theoretic loss functional consisting of two terms, the first based on the KL divergence separates resulting in-distribution (ID) and OOD feature distributions and the second term is the Information Bottleneck, which favors compressed features that retain the OOD information. We formulate a variational procedure to optimize the loss and obtain OOD features. Based on assumptions on OOD distributions, one can recover properties of existing OOD features, i.e., shaping functions. Furthermore, we show that our theory can predict a new shaping function that out-performs existing ones on OOD benchmarks. Our theory provides a general framework for constructing a variety of new features with clear explainability.

Comment: The paper presents a variational information-theoretic approach to out-of-distribution detection, which is relevant to representation learning.

Relevance: 8 Novelty: 7


25. Knowledge Compression via Question Generation: Enhancing Multihop Document Retrieval without Fine-tuning

ArXiv ID: 2506.13778

Authors: Anvi Alex Eponon, Moein Shahiki-Tash, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov, Alexander Gelbukh

Abstract: This study presents a question-based knowledge encoding approach that improves retrieval-augmented generation (RAG) systems without requiring fine-tuning or traditional chunking. We encode textual content using generated questions that span the lexical and semantic space, creating targeted retrieval cues combined with a custom syntactic reranking method. In single-hop retrieval over 109 scientific papers, our approach achieves a Recall@3 of 0.84, outperforming traditional chunking methods by 60 percent. We also introduce "paper-cards", concise paper summaries under 300 characters, which enhance BM25 retrieval, increasing MRR@3 from 0.56 to 0.85 on simplified technical queries. For multihop tasks, our reranking method reaches an F1 score of 0.52 with LLaMA2-Chat-7B on the LongBench 2WikiMultihopQA dataset, surpassing chunking and fine-tuned baselines which score 0.328 and 0.412 respectively. This method eliminates fine-tuning requirements, reduces retrieval latency, enables intuitive question-driven knowledge access, and decreases vector storage demands by 80%, positioning it as a scalable and efficient RAG alternative.

Comment: The paper focuses on a novel method for knowledge compression, which aligns with the model compression criterion.

Relevance: 8 Novelty: 7


26. GenerationPrograms: Fine-grained Attribution with Executable Programs

ArXiv ID: 2506.14580

Authors: David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, Mohit Bansal

Abstract: Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable "code agent" architectures. Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program's specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.

Comment: The paper introduces a new framework for improving attribution in LLMs, which aligns with the large language models criterion.

Relevance: 8 Novelty: 7


27. Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences

ArXiv ID: 2506.13996

Authors: Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He

Abstract: Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support max sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community with limited system support in the open-source space. Out-of-box, even on a modern NVIDIA H100 80GB GPU cluster, training Llama 8B model with sequence over 32K runs out of memory on a basic Hugging Face (HF) model due to two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU memory, ii) existing solutions for leveraging multiple GPU memory are not easily available to HF models, making long sequence training inaccessible. We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single GPU and multi-GPU memory optimizations, that enables it to support out-of-box training of multi-million sequence length for a wide variety of HF models. ALST supports training Meta's Llama 8B model with 500K sequence length on a single H100 GPU, 3.7M on a single 8xH100 GPU node, and over 15M on a 4 node cluster, an increase of over 400x compared to the 32K baseline for the latter. ALST is fully compatible with HF models and open-sourced via Deepspeed https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-pallellism/ and Arctic Training https://github.com/snowflakedb/ArcticTraining/blob/main/projects/sequence-parallelism/README.md.

Comment: The paper introduces a method for training long sequences in LLMs, which aligns with the large language models criterion.

Relevance: 8 Novelty: 7


28. AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

ArXiv ID: 2506.14562

Authors: Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

Abstract: Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.

Comment: The paper introduces a novel weight decay method for LLMs, aligning with the large language models criterion.

Relevance: 8 Novelty: 7


29. Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection

ArXiv ID: 2506.14473

Authors: Zhijing Wan, Zhixiang Wang, Zheng Wang, Xin Xu, Shin'ichi Satoh

Abstract: One-shot subset selection serves as an effective tool to reduce deep learning training costs by identifying an informative data subset based on the information extracted by an information extractor (IE). Traditional IEs, typically pre-trained on the target dataset, are inherently dataset-dependent. Foundation models (FMs) offer a promising alternative, potentially mitigating this limitation. This work investigates two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well as IEs for subset selection? Extensive experiments uncovered surprising insights: FMs consistently outperform traditional IEs on fine-grained datasets, whereas their advantage diminishes on coarse-grained datasets with noisy labels. Motivated by these finding, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method tailored for fine-grained image datasets. RAM-APL leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.

Comment: The paper investigates foundation models for subset selection, which aligns with the foundation model criterion.

Relevance: 8 Novelty: 7


30. Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models

ArXiv ID: 2506.14291

Authors: Ben Finkelshtein, .Ismail .Ilkan Ceylan, Michael Bronstein, Ron Levie

Abstract: Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.

Comment: The paper presents a recipe for graph foundation models, aligning with the foundation model criterion.

Relevance: 8 Novelty: 7


31. From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

ArXiv ID: 2506.14761

Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

Abstract: Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.

Comment: The paper introduces an autoregressive U-Net for language modeling, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


32. ResNets Are Deeper Than You Think

ArXiv ID: 2506.14386

Authors: Christian H. X. Ali Mehmeti-G\"opel, Michael Wand

Abstract: Residual connections remain ubiquitous in modern neural network architectures nearly a decade after their introduction. Their widespread adoption is often credited to their dramatically improved trainability: residual networks train faster, more stably, and achieve higher accuracy than their feedforward counterparts. While numerous techniques, ranging from improved initialization to advanced learning rate schedules, have been proposed to close the performance gap between residual and feedforward networks, this gap has persisted. In this work, we propose an alternative explanation: residual networks do not merely reparameterize feedforward networks, but instead inhabit a different function space. We design a controlled post-training comparison to isolate generalization performance from trainability; we find that variable-depth architectures, similar to ResNets, consistently outperform fixed-depth networks, even when optimization is unlikely to make a difference. These results suggest that residual connections confer performance advantages beyond optimization, pointing instead to a deeper inductive bias aligned with the structure of natural data.

Comment: The paper provides insights into the function space of residual networks, which is relevant to understanding model architecture.

Relevance: 8 Novelty: 7


33. Quantifying Structure in CLIP Embeddings: A Statistical Framework for Concept Interpretation

ArXiv ID: 2506.13831

Authors: Jitian Zhao, Chenghui Li, Frederic Sala, Karl Rohe

Abstract: Concept-based approaches, which aim to identify human-understandable concepts within a model's internal representations, are a promising method for interpreting embeddings from deep neural network models, such as CLIP. While these approaches help explain model behavior, current methods lack statistical rigor, making it challenging to validate identified concepts and compare different techniques. To address this challenge, we introduce a hypothesis testing framework that quantifies rotation-sensitive structures within the CLIP embedding space. Once such structures are identified, we propose a post-hoc concept decomposition method. Unlike existing approaches, it offers theoretical guarantees that discovered concepts represent robust, reproducible patterns (rather than method-specific artifacts) and outperforms other techniques in terms of reconstruction error. Empirically, we demonstrate that our concept-based decomposition algorithm effectively balances reconstruction accuracy with concept interpretability and helps mitigate spurious cues in data. Applied to a popular spurious correlation dataset, our method yields a 22.6% increase in worst-group accuracy after removing spurious background concepts.

Comment: The paper introduces a statistical framework for interpreting CLIP embeddings, relevant to representation learning.

Relevance: 8 Novelty: 7


34. Expressive Score-Based Priors for Distribution Matching with Geometry-Preserving Regularization

ArXiv ID: 2506.14607

Authors: Ziyu Gong, Jim Lim, David I. Inouye

Abstract: Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation. Non-parametric DM methods struggle with scalability and adversarial DM approaches suffer from instability and mode collapse. While likelihood-based methods are a promising alternative, they often impose unnecessary biases through fixed priors or require explicit density models (e.g., flows) that can be challenging to train. We address this limitation by introducing a novel approach to training likelihood-based DM using expressive score-based prior distributions. Our key insight is that gradient-based DM training only requires the prior's score function -- not its density -- allowing us to train the prior via denoising score matching. This approach eliminates biases from fixed priors (e.g., in VAEs), enabling more effective use of geometry-preserving regularization, while avoiding the challenge of learning an explicit prior density model (e.g., a flow-based prior). Our method also demonstrates better stability and computational efficiency compared to other diffusion-based priors (e.g., LSGM). Furthermore, experiments demonstrate superior performance across multiple tasks, establishing our score-based method as a stable and effective approach to distribution matching. Source code available at https://github.com/inouye-lab/SAUB.

Comment: The paper introduces a novel approach to distribution matching using score-based priors, relevant to representation learning.

Relevance: 8 Novelty: 7


35. Optimizing Length Compression in Large Reasoning Models

ArXiv ID: 2506.14755

Authors: Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou

Abstract: Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as "invalid thinking" -- models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.

Comment: The paper introduces a method for optimizing length compression in large reasoning models, which relates to model compression through efficiency improvements.

Relevance: 8 Novelty: 7


36. Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning

ArXiv ID: 2506.14387

Authors: William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane

Abstract: Existing work on mitigating catastrophic forgetting in large language model (LLM) fine-tuning has primarily focused on preserving specific data or tasks, while critically overlooking the degradation of essential capabilities instilled through safety alignment, particularly the model's ability to faithfully express ignorance. In this work, we show that this capability is significantly degraded during conventional fine-tuning, leading to undesired behaviors such as hallucinations. To address this novel but highly practical problem, we propose SEAT, a simple and effective fine-tuning approach that preserves both fine-tuning performance and the model's inherent ability to acknowledge its ignorance. SEAT integrates two key components: (1) sparse training that constrains activation drift, and (2) a novel entity perturbation method with KL-divergence regularization, designed to counter knowledge entanglement. Experimental results demonstrate that SEAT significantly outperforms baselines in preserving ignorance awareness while retaining fine-tuning performance, offering a more robust solution for LLM fine-tuning.

Comment: The paper proposes SEAT, a fine-tuning approach for LLMs that preserves ignorance awareness, which aligns with large language models by addressing theoretical insights into LLM behavior.

Relevance: 8 Novelty: 7


37. S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models

ArXiv ID: 2506.14158

Authors: Tao He, Guang Huang, Yu Yang, Tianshi Xu, Sicheng Zhao, Guiguang Ding, Pengyang Wang, Feng Tian

Abstract: Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.

Comment: The paper proposes a speculative sampling framework for efficient inference in LLMs, which aligns with large language models by addressing efficiency improvements.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.