Personalized Daily ArXiv Papers 2025-06-18

[gpt-4o]	Prompt	Completion	Total
Token	45460	5677	51137
Cost	$0.11	$0.06	$0.17

Total arXiv papers: 554

Total scanned papers: 333

Total relevant papers: 37

Table of contents with paper titles:

Load Balancing Mixture of Experts with Similarity Preserving Routers Authors: Nabil Omi, Siddhartha Sen, Ali Farhadi
Structured and Informed Probabilistic Modeling with the Thermodynamic Kolmogorov-Arnold Model Authors: Prithvi Raj
Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks Authors: Ziyuan Tang, Jie Chen
Transformers Learn Faster with Semantic Focus Authors: Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray
'Memory States' from Almost Nothing: Representing and Computing in a Non-associative Algebra Authors: Stefan Reimann
Evolutionary chemical learning in dimerization networks Authors: Alexei V. Tkachenko, Bortolo Matteo Mognetti, Sergei Maslov
Scientifically-Interpretable Reasoning Network (ScIReN): Uncovering the Black-Box of Nature Authors: Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes
Less is More: Undertraining Experts Improves Model Upcycling Authors: Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite
Mirror Descent Using the Tempesta Generalized Multi-parametric Logarithms Authors: Andrzej Cichocki
MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models Authors: Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen
MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation Authors: Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu
Machine Mirages: Defining the Undefined Authors: Hamidou Tembine
Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs Authors: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
Logical Expressiveness of Graph Neural Networks with Hierarchical Node Individualization Authors: Arie Soeteman, Balder ten Cate
Single-Example Learning in a Mixture of GPDMs with Latent Geometries Authors: Jesse St. Amand, Leonardo Gizzi, Martin A. Giese
Exploring Speaker Diarization with Mixture of Experts Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Hang Chen, Jun Du
Sharp Generalization Bounds for Foundation Models with Asymmetric Randomized Low-Rank Adapters Authors: Anastasis Kratsios, Tin Sum Cheng, Aurelien Lucchi, Haitz Sáez de Ocáriz Borde
Sampling from Your Language Model One Byte at a Time Authors: Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh
Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations Authors: Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
Can Large Language Models Improve Spectral Graph Neural Networks? Authors: Kangkang Lu, Yanhua Yu, Zhiyong Huang, Tat-Seng Chua
MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs Authors: Zhenyan Lu, Daliang Xu, Dongqi Cai, Zexi Li, Wei Liu, Fangming Liu, Shangguang Wang, Mengwei Xu
Object-Centric Neuro-Argumentative Learning Authors: Abdul Rahman Jacob, Avinash Kori, Emanuele De Angelis, Ben Glocker, Maurizio Proietti, Francesca Toni
A Hybrid Neural Network -- Polynomial Series Scheme for Learning Invariant Manifolds of Discrete Dynamical Systems Authors: Dimitrios G. Patsatzis, Nikolaos Kazantzis, Ioannis G. Kevrekidis, Constantinos Siettos
A Variational Information Theoretic Approach to Out-of-Distribution Detection Authors: Sudeepta Mondal, Zhuolin Jiang, Ganesh Sundaramoorthi
Knowledge Compression via Question Generation: Enhancing Multihop Document Retrieval without Fine-tuning Authors: Anvi Alex Eponon, Moein Shahiki-Tash, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov, Alexander Gelbukh
GenerationPrograms: Fine-grained Attribution with Executable Programs Authors: David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, Mohit Bansal
Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences Authors: Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He
AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs Authors: Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection Authors: Zhijing Wan, Zhixiang Wang, Zheng Wang, Xin Xu, Shin'ichi Satoh
Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models Authors: Ben Finkelshtein, İsmail İlkan Ceylan, Michael Bronstein, Ron Levie
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
ResNets Are Deeper Than You Think Authors: Christian H. X. Ali Mehmeti-Göpel, Michael Wand
Quantifying Structure in CLIP Embeddings: A Statistical Framework for Concept Interpretation Authors: Jitian Zhao, Chenghui Li, Frederic Sala, Karl Rohe
Expressive Score-Based Priors for Distribution Matching with Geometry-Preserving Regularization Authors: Ziyu Gong, Jim Lim, David I. Inouye
Optimizing Length Compression in Large Reasoning Models Authors: Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning Authors: William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models Authors: Tao He, Guang Huang, Yu Yang, Tianshi Xu, Sicheng Zhao, Guiguang Ding, Pengyang Wang, Feng Tian

1. Load Balancing Mixture of Experts with Similarity Preserving Routers

ArXiv ID: 2506.14038

Authors: Nabil Omi, Siddhartha Sen, Ali Farhadi

Abstract: Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters ("experts") for each input. A learned router computes a distribution over these experts, and assigns input tokens to a small subset. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage a distribution over experts that resembles a roughly uniform distribution of experts per token. During training, this can result in inconsistent routing behavior, resulting in the model spending its capacity to learn redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experimental results show that applying our loss to the router results in 36% faster convergence and lower redundancy compared to a popular load balancing loss.

Comment: The paper addresses load balancing in Sparse Mixture of Experts (MoE) models, which is directly relevant to model architecture and efficiency.

Relevance: 10 Novelty: 8
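
Illustrative sketch (not from the paper): the abstract above contrasts its method with "a popular load balancing loss" without naming it. The code below assumes a Switch-Transformer-style auxiliary loss, N * sum_i f_i * P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. The function names and toy logits are hypothetical, and the paper's own similarity-preserving loss is not reproduced here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, top_k=1):
    """Auxiliary loss N * sum_i f_i * P_i over a batch of router logits."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    counts = [0] * num_experts
    for p in probs:
        # Hard top-k assignment of this token to experts.
        top = sorted(range(num_experts), key=lambda i: p[i], reverse=True)[:top_k]
        for i in top:
            counts[i] += 1
    f = [c / (num_tokens * top_k) for c in counts]   # routed fraction per expert
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Four tokens, four experts: one balanced batch and one collapsed batch.
balanced = [[10.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]
loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

Under this loss the balanced batch scores about 1 (the minimum for uniform routing), while the collapsed batch scores close to the number of experts.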

2. Structured and Informed Probabilistic Modeling with the Thermodynamic Kolmogorov-Arnold Model

ArXiv ID: 2506.14167

Authors: Prithvi Raj

Abstract: We adapt the Kolmogorov-Arnold Representation Theorem to generative modeling by reinterpreting its inner functions as a Markov Kernel between probability spaces via inverse transform sampling. We present a generative model that is interpretable, easy to design, and efficient. Our approach couples a Kolmogorov-Arnold Network generator with independent energy-based priors, trained via Maximum Likelihood. Inverse sampling enables fast inference, while prior knowledge can be incorporated before training to better align priors with posteriors, thereby improving learning efficiency and sample quality. The learned prior is also recoverable and visualizable post-training, offering an empirical Bayes perspective. To address inflexibility and mitigate prior-posterior mismatch, we introduce scalable extensions based on mixture distributions and Langevin Monte Carlo methods, admitting a trade-off between flexibility and training efficiency. Our contributions connect classical representation theorems with modern probabilistic modeling, while balancing training stability, inference speed, and the quality and diversity of generations.

Comment: The paper introduces a novel probabilistic model inspired by classical representation theorems, relevant to emerging trends in generative modeling.

Relevance: 9 Novelty: 9
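
Illustrative sketch (not from the paper): the abstract above relies on inverse transform sampling. The generic technique is shown here on an exponential distribution, whose inverse CDF has a closed form. The distribution, rate, and function name are assumptions chosen for illustration and do not reflect the paper's Kolmogorov-Arnold generator or its energy-based priors.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw x = F^{-1}(u) for F(x) = 1 - exp(-rate * x) and u ~ Uniform[0, 1)."""
    u = rng.random()
    return -math.log(1.0 - u) / rate

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)   # should be close to 1 / rate = 0.5
```

Any distribution whose inverse CDF can be evaluated can be sampled this way from uniform noise in a single pass, which is why the technique is associated with fast inference.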

3. Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks

ArXiv ID: 2506.14098

Authors: Ziyuan Tang, Jie Chen

Abstract: A foundation model like GPT elicits many emergent abilities, owing to the pre-training with broad inclusion of data and the use of the powerful Transformer architecture. While foundation models in natural languages are prevalent, can we build similar models for graphs? This paper describes an approach toward a graph foundation model that is pre-trained with diverse graph datasets by adapting the Transformer backbone. A central challenge toward this end is how a sequence model encodes graphs of varying sizes and from different domains. We propose representing a node as multiple random walks, such that the Transformer can extract node representations from sequences, which in turn form edge and graph representations. We develop a novel context prediction loss for these random walks and theoretically analyze their expressive power in distinguishing neighborhoods and graphs. We also demonstrate the pre-training of our model and its adaptation to downstream tasks, showcasing its potential as a foundation for processing and reasoning with graph-structured data.

Comment: The paper proposes a graph foundation model using Transformers, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8
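
Illustrative sketch (not from the paper): the abstract above represents a node as multiple random walks that a sequence model can consume. The code below samples uniform random walks from an adjacency list. The toy graph, the uniform transition rule, and the function name are assumptions, and the paper's actual walk design and context prediction loss are not shown.

```python
import random

def random_walks(adjacency, start, num_walks, walk_length, seed=0):
    """Sample `num_walks` uniform random walks of `walk_length` steps from `start`."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        walk = [start]
        for _ in range(walk_length):
            neighbors = adjacency[walk[-1]]
            if not neighbors:        # dead end: stop this walk early
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks

# Hypothetical toy graph as an adjacency list.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(graph, start=0, num_walks=3, walk_length=4)
```

Each walk is a sequence of node ids, so graphs of any size reduce to fixed-length token sequences.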

4. Transformers Learn Faster with Semantic Focus

ArXiv ID: 2506.14095

Authors: Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray

Abstract: Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of learnability and generalization. Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits -- a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model's "semantic focus" with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that explain this behavior. We establish a connection between the stability of the standard softmax and the loss function's Lipschitz properties, then show how sparsity affects the stability of the softmax and the subsequent convergence and generalization guarantees resulting from the attention mechanism. This allows us to theoretically establish that input-agnostic sparse attention does not provide any benefits. We also characterize conditions when semantic focus (input-dependent sparse attention) can provide improved guarantees, and we validate that these conditions are in fact met in our empirical evaluations.

Comment: The paper studies sparse attention in transformers, focusing on learnability and generalization, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8
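
Illustrative sketch (not from the paper): the abstract above distinguishes input-dependent from input-agnostic sparse attention. The code below contrasts one example of each for a single query: top-k masking, where the kept keys depend on the scores, and a fixed local window, where they depend only on position. Both are assumed stand-ins; the specific mechanisms studied in the paper may differ.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(scores, keep):
    """Softmax over the indices in `keep`; every other position gets weight 0."""
    kept = softmax([scores[i] for i in keep])
    weights = [0.0] * len(scores)
    for i, w in zip(keep, kept):
        weights[i] = w
    return weights

def topk_attention(scores, k):
    # Input-dependent: the kept set is chosen from the scores themselves.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return masked_softmax(scores, keep)

def window_attention(scores, query_pos, width):
    # Input-agnostic: the kept set depends only on position.
    keep = [i for i in range(len(scores)) if abs(i - query_pos) <= width]
    return masked_softmax(scores, keep)

scores = [0.1, 2.0, -1.0, 0.3, 3.0]   # query-key scores for one query
topk = topk_attention(scores, k=2)
window = window_attention(scores, query_pos=0, width=1)
```

With these scores the top-k variant attends to positions 1 and 4, the two highest-scoring keys, while the window variant attends to positions 0 and 1 regardless of the scores.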

5. 'Memory States' from Almost Nothing: Representing and Computing in a Non-associative Algebra

ArXiv ID: 2506.13768

Authors: Stefan Reimann

Abstract: This note presents a non-associative algebraic framework for the representation and computation of information items in high-dimensional space. This framework is consistent with the principles of spatial computing and with the empirical findings in cognitive science about memory. Computations are performed through a process of multiplication-like binding and non-associative interference-like bundling. Models that rely on associative bundling typically lose order information, which necessitates the use of auxiliary order structures, such as position markers, to represent sequential information that is important for cognitive tasks. In contrast, the non-associative bundling proposed allows the construction of sparse representations of arbitrarily long sequences that maintain their temporal structure across arbitrary lengths. In this operation, noise is a constituent element of the representation of order information, rather than a means of obscuring it. The non-associative nature of the proposed framework results in the representation of a single sequence by two distinct states. The L-state, generated through left-associative bundling, continuously updates and emphasises a recency effect, while the R-state, formed through right-associative bundling, encodes finite sequences or chunks, capturing a primacy effect. The construction of these states may be associated with activity in the prefrontal cortex in relation to short-term memory and hippocampal encoding in long-term memory, respectively. The accuracy of retrieval is contingent upon a decision-making process that is based on the mutual information between the memory states and the cue. The model is able to replicate the Serial Position Curve, which reflects the empirical recency and primacy effects observed in cognitive experiments.

Comment: The paper presents a non-associative algebraic framework for representation and computation, which aligns with representation learning.

Relevance: 9 Novelty: 8

6. Evolutionary chemical learning in dimerization networks

ArXiv ID: 2506.14006

Authors: Alexei V. Tkachenko, Bortolo Matteo Mognetti, Sergei Maslov

Abstract: We present a novel framework for chemical learning based on Competitive Dimerization Networks (CDNs) - systems in which multiple molecular species, e.g. proteins or DNA/RNA oligomers, reversibly bind to form dimers. We show that these networks can be trained in vitro through directed evolution, enabling the implementation of complex learning tasks such as multiclass classification without digital hardware or explicit parameter tuning. Each molecular species functions analogously to a neuron, with binding affinities acting as tunable synaptic weights. A training protocol involving mutation, selection, and amplification of DNA-based components allows CDNs to robustly discriminate among noisy input patterns. The resulting classifiers exhibit strong output contrast and high mutual information between input and output, especially when guided by a contrast-enhancing loss function. Comparative analysis with in silico gradient descent training reveals closely correlated performance. These results establish CDNs as a promising platform for analog physical computation, bridging synthetic biology and machine learning, and advancing the development of adaptive, energy-efficient molecular computing systems.

Comment: The paper introduces a novel framework for chemical learning using dimerization networks, which is relevant to AI for Science with a focus on foundational research.

Relevance: 9 Novelty: 8

7. Scientifically-Interpretable Reasoning Network (ScIReN): Uncovering the Black-Box of Nature

ArXiv ID: 2506.14054

Authors: Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes

Abstract: Neural networks are a powerful tool for learning patterns from data. However, they do not respect known scientific laws, nor can they reveal novel scientific insights due to their black-box nature. In contrast, scientific reasoning distills biological or physical principles from observations and controlled experiments, and quantitatively interprets them with process-based models made of mathematical equations. Yet, process-based models rely on numerous free parameters that must be set in an ad-hoc manner, and thus often fit observations poorly in cross-scale predictions. While prior work has embedded process-based models in conventional neural networks, discovering interpretable relationships between parameters in process-based models and input features is still a grand challenge for scientific discovery. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge, further enhancing its interpretability. While the embedded process-based model enforces established scientific knowledge, the encoder reveals new scientific mechanisms and relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability -- it can infer latent scientific mechanisms and their relationships with input features.

Comment: The paper proposes a scientifically-interpretable reasoning network, which is relevant to AI for Science with a focus on foundational research.

Relevance: 9 Novelty: 8
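
Illustrative sketch (not from the paper): the abstract above mentions a hard-sigmoid constraint layer that restricts latent parameters to ranges set by prior knowledge. One way such a layer could work is to squash an unconstrained value with a piecewise-linear sigmoid and rescale it into the allowed interval. The slope of 1/6, the bounds, and the function names are assumptions; the paper's exact layer may differ.

```python
def hard_sigmoid(x):
    """Piecewise-linear sigmoid: 0 below -3, 1 above +3, linear in between."""
    return min(1.0, max(0.0, x / 6.0 + 0.5))

def constrain(raw, lower, upper):
    """Map an unconstrained value into the closed interval [lower, upper]."""
    return lower + (upper - lower) * hard_sigmoid(raw)

# Hypothetical latent parameter known to lie in [0.0, 2.0].
low_end = constrain(-100.0, 0.0, 2.0)
midpoint = constrain(0.0, 0.0, 2.0)
high_end = constrain(100.0, 0.0, 2.0)
```

Unlike a smooth sigmoid, the hard variant can reach the interval's endpoints exactly.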

8. Less is More: Undertraining Experts Improves Model Upcycling

ArXiv ID: 2506.14126

Authors: Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

Abstract: Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance, both for fully fine-tuned and LoRA-adapted models, and to worse downstream results when LoRA adapters are upcycled into MoE layers. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps and are subsequently forgotten during merging. Finally, we demonstrate that a task-dependent aggressive early stopping strategy can significantly improve upcycling performance.

Comment: The paper challenges assumptions in model upcycling and discusses MoE layers, aligning with the model architecture criterion.

Relevance: 9 Novelty: 8

9. Mirror Descent Using the Tempesta Generalized Multi-parametric Logarithms

ArXiv ID: 2506.13984

Authors: Andrzej Cichocki

Abstract: In this paper, we develop a wide class of Mirror Descent (MD) algorithms, which play a key role in machine learning. For this purpose, we formulate a constrained optimization problem in which we exploit the Bregman divergence with the Tempesta multi-parametric deformation logarithm as a link function. This link function, also called the mirror function, defines the mapping between the primal and dual spaces and is associated with a very wide (in fact, theoretically infinite) class of generalized trace-form entropies. In order to derive novel MD updates, we estimate a generalized exponential function that closely approximates the inverse of the multi-parametric Tempesta generalized logarithm. The shape and properties of the Tempesta logarithm and of its inverse, the deformed exponential function, can be tuned by several hyperparameters. By learning these hyperparameters, we can adapt to the distribution or geometry of the training data, and we can adjust them to achieve desired properties of MD algorithms. Applying multi-parametric logarithms allows us to generate a new, wide, and flexible family of MD and mirror-less MD updates.

Comment: The paper develops a new class of Mirror Descent algorithms, which aligns with the emerging trends criterion.

Relevance: 9 Novelty: 8
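
Illustrative sketch (not from the paper): the abstract above generalizes mirror descent by swapping in a deformed logarithm as the link function. For orientation, the code below shows the classical special case with the ordinary logarithm on the probability simplex, where the update is the exponentiated gradient rule. The Tempesta logarithm itself is not implemented; the loss vector and step size are hypothetical.

```python
import math

def eg_step(w, grad, lr):
    """One mirror descent step on the simplex with the ordinary log link."""
    # Dual-space step: log w - lr * grad. Map back with exp, then renormalise.
    unnorm = [wi * math.exp(-lr * gi) for wi, gi in zip(w, grad)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Minimise the linear loss <c, w> over the probability simplex.
c = [0.9, 0.1, 0.5]
w = [1 / 3] * 3
for _ in range(200):
    w = eg_step(w, c, lr=0.5)
```

The weights stay positive and sum to one at every step, and the mass concentrates on the coordinate with the smallest cost. Replacing log and exp with a deformed pair changes this geometry.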

10. MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models

ArXiv ID: 2506.14435

Authors: Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen

Abstract: Large multimodal Mixture-of-Experts (MoEs) effectively scale the model size to boost performance while maintaining fixed active parameters. However, previous works primarily utilized full-precision experts during sparse up-cycling. Although they show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach has a promising scaling trend with model size. MoTE achieves comparable performance to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and its advantage grows as the memory constraint tightens. Given the same expert memory footprint of 3.4GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% in average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.

Comment: The paper introduces MoTE, a memory-efficient approach for Mixture-of-Experts models, relevant to model compression and architecture.

Relevance: 9 Novelty: 8
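
Illustrative sketch (not from the paper): the abstract above says routed experts have parameters in {-1, 0, 1} but does not state how weights are ternarized. The code below assumes an absmean scheme: divide by the mean absolute weight, round, and clip. The function name and toy weights are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarization: scale by mean |w|, round, clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

weights = [0.42, -0.05, -0.91, 0.10, 0.77, -0.33]
ternary, scale = ternarize(weights)
approx = [t * scale for t in ternary]   # dequantized approximation of the weights
```

Each ternary weight needs fewer than two bits plus one shared scale, which is where the memory saving over full-precision experts comes from.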

11. MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation

ArXiv ID: 2506.14436

Authors: Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu

Abstract: Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel "model MoE-ization" strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.

Comment: The paper proposes MoORE, a novel model MoE-ization strategy for multi-task adaptation, relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8
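
Illustrative note (not from the paper): the decomposition described in the abstract above can be written out as follows. The symbols $g_i(x)$ for the router output and the additive form of the singular-value adjustment are assumptions; the abstract only says the router "adjusts" the singular values. The learnable orthogonal transform on the right singular vectors is omitted.

```latex
% SVD of the pre-trained weight: a sum of rank-one terms ("experts").
W = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}

% Assumed router-adjusted weight for input x (additive form is hypothetical).
W(x) = \sum_{i=1}^{r} \bigl(\sigma_i + g_i(x)\bigr) \, u_i v_i^{\top}

% Experts are orthogonal under the Frobenius inner product.
\langle u_i v_i^{\top},\, u_j v_j^{\top} \rangle_F
  = (u_i^{\top} u_j)(v_i^{\top} v_j) = \delta_{ij}

% The column space never leaves that of W.
\operatorname{range}\bigl(W(x)\bigr) \subseteq \operatorname{span}\{u_1, \dots, u_r\}
  = \operatorname{range}(W)
```

The last two lines correspond to the two properties the abstract credits for conflict resistance and oblivion resistance, respectively.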

12. Machine Mirages: Defining the Undefined

ArXiv ID: 2506.13990

Authors: Hamidou Tembine

Abstract: As multimodal machine intelligence systems started achieving average animal-level and average human-level fluency in many measurable tasks in processing images, language, and sound, they began to exhibit a new class of cognitive aberrations: machine mirages. These include delusion, illusion, confabulation, hallucination, misattribution error, semantic drift, semantic compression, exaggeration, causal inference failure, uncanny valley of perception, bluffing-patter-bullshitting, cognitive stereotypy, pragmatic misunderstanding, hypersignification, semantic reheating-warming, simulated authority effect, fallacious abductive leap, contextual drift, referential hallucination, semiotic Frankenstein effect, calibration failure, spurious correlation, bias amplification, concept drift sensitivity, misclassification under uncertainty, adversarial vulnerability, overfitting, prosodic misclassification, accent bias, turn boundary failure, semantic boundary confusion, noise overfitting, latency-induced decision drift, ambiguity collapse and other forms of error that mimic but do not replicate human or animal fallibility. This article presents some of the errors and argues that these failures must be explicitly defined and systematically assessed. Understanding machine mirages is essential not only for improving machine intelligence reliability but also for constructing a multiscale ethical, co-evolving intelligence ecosystem that respects the diverse forms of life, cognition, and expression it will inevitably touch.

Comment: The paper discusses 'machine mirages', a new class of cognitive aberrations in multimodal machine intelligence systems, which could be considered an emerging trend challenging established assumptions.

Relevance: 9 Novelty: 8

13. Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

ArXiv ID: 2506.14731

Authors: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen

Abstract: We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via an algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed datasets. We will release the model, dataset, and code.

Comment: The paper presents Ring-lite, a Mixture-of-Experts (MoE)-based LLM optimized via reinforcement learning, which aligns with model architecture by focusing on MoE and efficiency.

Relevance: 9 Novelty: 8

14. Logical Expressiveness of Graph Neural Networks with Hierarchical Node Individualization

ArXiv ID: 2506.13911

Authors: Arie Soeteman, Balder ten Cate

Abstract: We propose and study Hierarchical Ego Graph Neural Networks (HEGNNs), an expressive extension of graph neural networks (GNNs) with hierarchical node individualization, inspired by the Individualization-Refinement paradigm for graph isomorphism testing. HEGNNs generalize subgraph-GNNs and form a hierarchy of increasingly expressive models that, in the limit, can distinguish graphs up to isomorphism. We provide a logical characterization of HEGNN node classifiers, with and without subgraph restrictions, using graded hybrid logic. This characterization enables us to relate the separating power of HEGNNs to that of higher-order GNNs, GNNs enriched with local homomorphism count features, and color refinement algorithms based on Individualization-Refinement. Our experimental results confirm the practical feasibility of HEGNNs and show benefits in comparison with traditional GNN architectures, both with and without local homomorphism count features.

Comment: The paper proposes Hierarchical Ego Graph Neural Networks (HEGNNs) with a focus on logical expressiveness and graph isomorphism, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8
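
Illustrative sketch (not from the paper): the abstract above builds on the Individualization-Refinement paradigm. The code below runs plain color refinement (1-WL) with and without one individualized node on two graphs that plain refinement cannot tell apart: a 6-cycle and two disjoint triangles. It does not implement HEGNNs, and the function names are hypothetical.

```python
def color_refinement(adjacency, initial=None):
    """Refine node colors until the partition into color classes is stable."""
    colors = dict(initial) if initial else {v: 0 for v in adjacency}
    for _ in range(len(adjacency)):
        # New signature: own color plus the sorted multiset of neighbor colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adjacency}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors

def individualize(adjacency, node):
    """Give `node` a unique starting color, then refine."""
    initial = {v: (1 if v == node else 0) for v in adjacency}
    return color_refinement(adjacency, initial)

def num_classes(colors):
    return len(set(colors.values()))

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

plain = (num_classes(color_refinement(cycle6)),
         num_classes(color_refinement(triangles)))
marked = (num_classes(individualize(cycle6, 0)),
          num_classes(individualize(triangles, 0)))
```

Without individualization both graphs collapse to a single color class. After marking one node, the 6-cycle splits into four classes and the two triangles into three, so the graphs become distinguishable.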

15. Single-Example Learning in a Mixture of GPDMs with Latent Geometries

ArXiv ID: 2506.14563

Authors: Jesse St. Amand, Leonardo Gizzi, Martin A. Giese

Abstract: We present the Gaussian process dynamical mixture model (GPDMM) and show its utility in single-example learning of human motion data. The Gaussian process dynamical model (GPDM) is a form of the Gaussian process latent variable model (GPLVM), but optimized with a hidden Markov model dynamical prior. The GPDMM combines multiple GPDMs in a probabilistic mixture-of-experts framework, utilizing embedded geometric features to allow for diverse sequences to be encoded in a single latent space, enabling the categorization and generation of each sequence class. GPDMs and our mixture model are particularly advantageous in addressing the challenges of modeling human movement in scenarios where data is limited and model interpretability is vital, such as in patient-specific medical applications like prosthesis control. We score the GPDMM on classification accuracy and generative ability in single-example learning, showcase model variations, and benchmark it against LSTMs, VAEs, and transformers.

Comment: The paper presents a mixture-of-experts framework using Gaussian process dynamical models, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 7

16. Exploring Speaker Diarization with Mixture of Experts

ArXiv ID: 2506.14750

Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Hang Chen, Jun Du

Abstract: In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.

Comment: The paper introduces a Shared and Soft Mixture of Experts (SS-MoE) module for neural speaker diarization, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 7

17. Sharp Generalization Bounds for Foundation Models with Asymmetric Randomized Low-Rank Adapters

ArXiv ID: 2506.14530

Authors: Anastasis Kratsios, Tin Sum Cheng, Aurelien Lucchi, Haitz Sáez de Ocáriz Borde

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning (PEFT) technique for foundation models. Recent work has highlighted an inherent asymmetry in the initialization of LoRA's low-rank factors, which has been present since its inception and was presumably derived experimentally. This paper focuses on providing a comprehensive theoretical characterization of asymmetric LoRA with frozen random factors. First, while existing research provides upper-bound generalization guarantees based on averages over multiple experiments, the behaviour of a single fine-tuning run with specific random factors remains an open question. We address this by investigating the concentration of the typical LoRA generalization gap around its mean. Our main upper bound reveals a sample complexity of $\tilde{\mathcal{O}}\left(\frac{\sqrt{r}}{\sqrt{N}}\right)$ with high probability for rank $r$ LoRAs trained on $N$ samples. Additionally, we also determine the fundamental limits in terms of sample efficiency, establishing a matching lower bound of $\mathcal{O}\left(\frac{1}{\sqrt{N}}\right)$. By more closely reflecting the practical scenario of a single fine-tuning run, our findings offer crucial insights into the reliability and practicality of asymmetric LoRA.

Comment: The paper derives high-probability generalization bounds for asymmetric LoRA with frozen random factors, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 8
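
Illustration (not from the paper): a minimal sketch of what "asymmetric LoRA with frozen random factors" can look like, using only the Python standard library. It assumes the common convention in which the adapted weight is W + BA, the down-projection A is drawn at random and never updated, and only the zero-initialized B is trained. The abstract does not say which factor is frozen, so that choice, the Gaussian 1/sqrt(rank) scaling, and all function names here are assumptions.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def make_adapter(d_out, d_in, rank, seed=0):
    """Build the two low-rank factors of a hypothetical asymmetric adapter."""
    rng = random.Random(seed)
    # A (rank x d_in): random and frozen. The Gaussian entries and the
    # 1/sqrt(rank) scaling are assumptions made for this sketch.
    A = [[rng.gauss(0.0, 1.0) / rank ** 0.5 for _ in range(d_in)] for _ in range(rank)]
    # B (d_out x rank): the only trainable factor, zero-initialized so the
    # adapted model starts out identical to the pretrained one.
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B


def adapted_weight(W, B, A):
    """Return W + B @ A, the effective weight after adaptation."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def trainable_parameter_count(B):
    """Only the entries of B are trained; A and W stay fixed."""
    return sum(len(row) for row in B)
```

With d_out = 4, d_in = 6 and rank = 2, only the 8 entries of B are trainable, compared with 24 entries in the full weight matrix. The rank r that appears in the abstract's bound is the inner dimension shared by B and A.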

18. Sampling from Your Language Model One Byte at a Time

ArXiv ID: 2506.14123

Authors: Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh

Abstract: Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations. For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. This Prompt Boundary Problem (PBP) also arises in languages such as Chinese and in code generation, where tokens often do not line up with syntactic boundaries. Additionally, mismatching tokenizers often hinder model composition and interoperability. For example, it is not possible to directly ensemble models with different tokenizers due to their mismatching vocabularies. To address these issues, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM, without changing its generative distribution at the text level. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time as well as transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.

Comment: The paper addresses tokenization issues in language models, which is relevant to large language models and their interpretability.