Previous Day 2025-06-30
Monthly Overview 2025-07
Next Day 2025-07-02

Personalized Daily ArXiv Papers 2025-07-01

[gpt-4o] Prompt Completion Total
Token 55552 7090 62642
Cost $0.14 $0.07 $0.21

Total arXiv papers: 843

Total scanned papers: 490

Total relevant papers: 37

Table of contents with paper titles:

  1. A unified framework on the universal approximation of transformer-type architectures Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen

  2. Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging Authors: Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo

  3. AI's Euclid's Elements Moment: From Language Models to Computable Thought Authors: Xinmin Fang, Lingfeng Tao, Zhengxiong Li

  4. Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence Authors: Andr\'as Gy\"orgy, Tor Lattimore, Nevena Lazi\'c, Csaba Szepesv\'ari

  5. On Universality of Non-Separable Approximate Message Passing Algorithms Authors: Max Lovig, Tianhao Wang, Zhou Fan

  6. The Hidden Link Between RLHF and Contrastive Learning Authors: Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen

  7. Towards Efficient and Accurate Spiking Neural Networks via Adaptive Bit Allocation Authors: Xingting Yao, Qinghao Hu, Fei Zhou, Tielong Liu, Gang Li, Peisong Wang, Jian Cheng

  8. Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model Authors: Mu-Chi Chen, Po-Hsuan Huang, Xiangrui Ke, Chia-Heng Tu, Chun Jason Xue, Shih-Hao Hung

  9. Residual Matrix Transformers: Scaling the Size of the Residual Stream Authors: Brian Mak, Jeffrey Flanigan

  10. On the Predictive Power of Representation Dispersion in Language Models Authors: Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou

  11. Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models Authors: Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, Irina Rish

  12. Masked Gated Linear Unit Authors: Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota

  13. Generalized Linear Mode Connectivity for Transformers Authors: Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, Valentina Boeva

  14. Transition Matching: Scalable and Flexible Generative Modeling Authors: Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman

  15. Unified Multimodal Understanding via Byte-Pair Visual Encoding Authors: Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

  16. Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models Authors: Boyuan Zheng, Zerui Fang, Zhe Xu, Rui Wang, Yiwen Chen, Cunshi Wang, Mengwei Qu, Lei Lei, Zhen Feng, Yan Liu, Yuyang Li, Mingzhou Tan, Jiaji Wu, Jianwei Shuai, Jia Li, Fangfu Ye

  17. Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training Authors: Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Keith Ross

  18. Semantic-guided Diverse Decoding for Large Language Model Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

  19. Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts Authors: Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, Nikhil Garg

  20. Not All Explanations for Deep Learning Phenomena Are Equally Valuable Authors: Alan Jeffares, Mihaela van der Schaar

  21. A New Perspective On AI Safety Through Control Theory Methodologies Authors: Lars Ullrich, Walter Zimmer, Ross Greer, Knut Graichen, Alois C. Knoll, Mohan Trivedi

  22. Learning Stochastic Multiscale Models Authors: Andrew F. Ilersich, Prasanth B. Nair

  23. GViT: Representing Images as Gaussians for Visual Recognition Authors: Jefferson Hernandez, Ruozhen He, Guha Balakrishnan, Alexander C. Berg, Vicente Ordonez

  24. Riemannian-Geometric Fingerprints of Generative Models Authors: Hae Jin Song, Laurent Itti

  25. Neural Langevin Machine: a local asymmetric learning rule can be creative Authors: Zhendong Yu, Weizhong Huang, Haiping Huang

  26. Tensor Train Quantum State Tomography using Compressed Sensing Authors: Shakir Showkat Sofi, Charlotte Vermeylen, Lieven De Lathauwer

  27. BWLer: Barycentric Weight Layer Elucidates a Precision-Conditioning Tradeoff for PINNs Authors: Jerry Liu, Yasa Baig, Denise Hui Jean Lee, Rajat Vadiraj Dwaraknath, Atri Rudra, Chris R\'e

  28. Emergent musical properties of a transformer under contrastive self-supervised learning Authors: Yuexuan Kong, Gabriel Meseguer-Brocal, Vincent Lostanlen, Mathieu Lagrange, Romain Hennequin

  29. The Trilemma of Truth in Large Language Models Authors: Germans Savcisens, Tina Eliassi-Rad

  30. A Systematic Study of Compositional Syntactic Transformer Language Models Authors: Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu

  31. Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding Authors: Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Mei Lin, Peiyi Shen, Liang Zhang

  32. AICO: Feature Significance Tests for Supervised Learning Authors: Kay Giesecke, Enguerrand Horel, Chartsiri Jirachotkulthorn

  33. Token Activation Map to Visually Explain Multimodal LLMs Authors: Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, Xiaomeng Li

  34. Towards the Training of Deeper Predictive Coding Neural Networks Authors: Chang Qi, Matteo Forasassi, Thomas Lukasiewicz, Tommaso Salvatori

  35. Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model Authors: Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming, Tao Lin

  36. Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning Authors: Zihao Zhao, Xinlong Zhai, Jinyu Yang, Chuan Shi

  37. Efficient Algorithms for Learning and Compressing Monophonic Halfspaces in Graphs Authors: Marco Bressan, Victor Chepoi, Emmanuel Esposito, Maximilian Thiessen


1. A unified framework on the universal approximation of transformer-type architectures

ArXiv ID: 2506.23551

Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen

Abstract: We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.

Comment: The paper provides a unified theoretical framework for the universal approximation property of transformer-type architectures, which is a significant contribution to model architecture analysis.

Relevance: 10 Novelty: 9


2. Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging

ArXiv ID: 2506.23266

Authors: Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo

Abstract: Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods promise greater efficiency by consolidating multiple experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared $U$-matrices while enabling effective merging of the expert-specific $V$ components. Specifically, Sub-MoE consists of two innovative phases: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first enforces Experts Union Decomposition to derive the shared $U$-matrix across experts in the same group, then pursues frequency-based merging for individual $V$-matrices, and finalizes expert reconstruction using the merged $V$-matrix. In this way, we align and fuse experts in a shared subspace, and can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5|3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96\%|86\% of original performance with 25\%|50\% expert reduction on Mixtral-8x7B in zero-shot benchmarks. Code will be released at https://github.com/lliai/MoERazor.

Comment: The paper presents a novel MoE compression framework, which is highly relevant to model compression and MoE.

Relevance: 10 Novelty: 9


3. AI's Euclid's Elements Moment: From Language Models to Computable Thought

ArXiv ID: 2506.23080

Authors: Xinmin Fang, Lingfeng Tao, Zhengxiong Li

Abstract: This paper presents a comprehensive five-stage evolutionary framework for understanding the development of artificial intelligence, arguing that its trajectory mirrors the historical progression of human cognitive technologies. We posit that AI is advancing through distinct epochs, each defined by a revolutionary shift in its capacity for representation and reasoning, analogous to the inventions of cuneiform, the alphabet, grammar and logic, mathematical calculus, and formal logical systems. This "Geometry of Cognition" framework moves beyond mere metaphor to provide a systematic, cross-disciplinary model that not only explains AI's past architectural shifts-from expert systems to Transformers-but also charts a concrete and prescriptive path forward. Crucially, we demonstrate that this evolution is not merely linear but reflexive: as AI advances through these stages, the tools and insights it develops create a feedback loop that fundamentally reshapes its own underlying architecture. We are currently transitioning into a "Metalinguistic Moment," characterized by the emergence of self-reflective capabilities like Chain-of-Thought prompting and Constitutional AI. The subsequent stages, the "Mathematical Symbolism Moment" and the "Formal Logic System Moment," will be defined by the development of a computable calculus of thought, likely through neuro-symbolic architectures and program synthesis, culminating in provably aligned and reliable AI that reconstructs its own foundational representations. This work serves as the methodological capstone to our trilogy, which previously explored the economic drivers ("why") and cognitive nature ("what") of AI. Here, we address the "how," providing a theoretical foundation for future research and offering concrete, actionable strategies for startups and developers aiming to build the next generation of intelligent systems.

Comment: The paper presents a theoretical framework for understanding AI development, which aligns with emerging trends and foundational research.

Relevance: 9 Novelty: 9


4. Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence

ArXiv ID: 2506.23908

Authors: Andr\'as Gy\"orgy, Tor Lattimore, Nevena Lazi\'c, Csaba Szepesv\'ari

Abstract: Sound deductive reasoning -- the ability to derive new knowledge from existing facts and rules -- is an indisputably desirable aspect of general intelligence. Despite the major advances of AI systems in areas such as math and science, especially since the introduction of transformer architectures, it is well-documented that even the most advanced frontier systems regularly and consistently falter on easily-solvable deductive reasoning tasks. Hence, these systems are unfit to fulfill the dream of achieving artificial general intelligence capable of sound deductive reasoning. We argue that their unsound behavior is a consequence of the statistical learning approach powering their development. To overcome this, we contend that to achieve reliable deductive reasoning in learning-based AI systems, researchers must fundamentally shift from optimizing for statistical performance against distributions on reasoning problems and algorithmic tasks to embracing the more ambitious exact learning paradigm, which demands correctness on all inputs. We argue that exact learning is both essential and possible, and that this ambitious objective should guide algorithm design.

Comment: The paper argues for a shift from statistical learning to exact learning for general intelligence, which is relevant to emerging trends.

Relevance: 9 Novelty: 9


5. On Universality of Non-Separable Approximate Message Passing Algorithms

ArXiv ID: 2506.23010

Authors: Max Lovig, Tianhao Wang, Zhou Fan

Abstract: Mean-field characterizations of first-order iterative algorithms -- including Approximate Message Passing (AMP), stochastic and proximal gradient descent, and Langevin diffusions -- have enabled a precise understanding of learning dynamics in many statistical applications. For algorithms whose non-linearities have a coordinate-separable form, it is known that such characterizations enjoy a degree of universality with respect to the underlying data distribution. However, mean-field characterizations of non-separable algorithm dynamics have largely remained restricted to i.i.d. Gaussian or rotationally-invariant data. In this work, we initiate a study of universality for non-separable AMP algorithms. We identify a general condition for AMP with polynomial non-linearities, in terms of a Bounded Composition Property (BCP) for their representing tensors, to admit a state evolution that holds universally for matrices with non-Gaussian entries. We then formalize a condition of BCP-approximability for Lipschitz AMP algorithms to enjoy a similar universal guarantee. We demonstrate that many common classes of non-separable non-linearities are BCP-approximable, including local denoisers, spectral denoisers for generic signals, and compositions of separable functions with generic linear maps, implying the universality of state evolution for AMP algorithms employing these non-linearities.

Comment: The paper provides theoretical insights into Approximate Message Passing algorithms, focusing on universality for non-separable algorithms, which aligns with representation learning and training dynamics.

Relevance: 9 Novelty: 8


ArXiv ID: 2506.22578

Authors: Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen

Abstract: Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning based on the positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). This paradigm further explains why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks. We will release the model and code upon acceptance.

Comment: The paper explores the connection between RLHF and contrastive learning, providing insights into representation learning through mutual information maximization.

Relevance: 9 Novelty: 8


7. Towards Efficient and Accurate Spiking Neural Networks via Adaptive Bit Allocation

ArXiv ID: 2506.23717

Authors: Xingting Yao, Qinghao Hu, Fei Zhou, Tielong Liu, Gang Li, Peisong Wang, Jian Cheng

Abstract: Multi-bit spiking neural networks (SNNs) have recently become a heated research spot, pursuing energy-efficient and high-accurate AI. However, with more bits involved, the associated memory and computation demands escalate to the point where the performance improvements become disproportionate. Based on the insight that different layers demonstrate different importance and extra bits could be wasted and interfering, this paper presents an adaptive bit allocation strategy for direct-trained SNNs, achieving fine-grained layer-wise allocation of memory and computation resources. Thus, SNN's efficiency and accuracy can be improved. Specifically, we parametrize the temporal lengths and the bit widths of weights and spikes, and make them learnable and controllable through gradients. To address the challenges caused by changeable bit widths and temporal lengths, we propose the refined spiking neuron, which can handle different temporal lengths, enable the derivation of gradients for temporal lengths, and suit spike quantization better. In addition, we theoretically formulate the step-size mismatch problem of learnable bit widths, which may incur severe quantization errors to SNN, and accordingly propose the step-size renewal mechanism to alleviate this issue. Experiments on various datasets, including the static CIFAR and ImageNet and the dynamic CIFAR-DVS and DVS-GESTURE, demonstrate that our methods can reduce the overall memory and computation cost while achieving higher accuracy. Particularly, our SEWResNet-34 can achieve a 2.69\% accuracy gain and 4.16$\times$ lower bit budgets over the advanced baseline work on ImageNet. This work will be fully open-sourced.

Comment: The paper presents an adaptive bit allocation strategy for spiking neural networks, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8


8. Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model

ArXiv ID: 2506.23635

Authors: Mu-Chi Chen, Po-Hsuan Huang, Xiangrui Ke, Chia-Heng Tu, Chun Jason Xue, Shih-Hao Hung

Abstract: Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small group services, as aimed by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveal that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to Apple software stack's memory management logic. Based on these findings, we develop optimization schemes to eliminate the memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than the state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations, and the model provides valuable insights for designing private LLM systems.

Comment: The paper explores multi-node expert parallelism for Mixture-of-Experts LLMs, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8


9. Residual Matrix Transformers: Scaling the Size of the Residual Stream

ArXiv ID: 2506.22696

Authors: Brian Mak, Jeffrey Flanigan

Abstract: The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972, Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.

Comment: The introduction of Residual Matrix Transformers (RMT) offers a novel architectural modification to transformers, focusing on efficiency and scaling, which is relevant to model architecture and compression.

Relevance: 9 Novelty: 8


10. On the Predictive Power of Representation Dispersion in Language Models

ArXiv ID: 2506.24106

Authors: Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou

Abstract: We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.

Comment: The paper explores the link between representation dispersion and language model performance, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8


11. Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models

ArXiv ID: 2506.23025

Authors: Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, Irina Rish

Abstract: Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. As the computational power of modern GPU architectures continuously improves, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which demonstrate accelerated inference across various CPU architectures. Also, building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.

Comment: The paper introduces ternary language models and novel quantization methods, aligning with the model compression criterion.

Relevance: 9 Novelty: 8


12. Masked Gated Linear Unit

ArXiv ID: 2506.23225

Authors: Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota

Abstract: Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads compared to feed-forward layers without gating, due to the use of separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contribution of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture that learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7 $\times$ inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster than standard GLUs despite added architectural complexity on an RTX5090 GPU. In LLM experiments, the Swish-activated variant SwiMGLU preserves its memory advantages while matching - or even surpassing - the downstream accuracy of the SwiGLU baseline.

Comment: The paper introduces Masked Gated Linear Units, a novel architectural innovation, aligning with the model architecture criterion.

Relevance: 9 Novelty: 8


13. Generalized Linear Mode Connectivity for Transformers

ArXiv ID: 2506.22712

Authors: Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, Valentina Boeva

Abstract: Understanding the geometry of neural network loss landscapes is a central question in deep learning, with implications for generalization and optimization. A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths, despite appearing to lie in separate loss basins. However, this is often obscured by symmetries in parameter space -- such as neuron permutations -- which make functionally equivalent models appear dissimilar. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope and fail to capture the richer symmetries exhibited by modern architectures such as Transformers. In this work, we introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, orthogonal transformations, and general invertible maps -- broadening the set of valid reparameterizations and subsuming many previous approaches as special cases. Crucially, this generalization enables, for the first time, the discovery of low- and zero-barrier linear interpolation paths between independently trained Vision Transformers and GPT-2 models. These results reveal deeper structure in the loss landscape and underscore the importance of symmetry-aware analysis for understanding model space geometry.

Comment: The paper presents a generalized framework for understanding linear mode connectivity in Transformers, addressing symmetries in parameter space. This aligns with the core topic of model architecture, providing theoretical insights into the geometry of neural network loss landscapes.

Relevance: 9 Novelty: 8


14. Transition Matching: Scalable and Flexible Generative Modeling

ArXiv ID: 2506.23589

Authors: Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman

Abstract: Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.

Comment: The paper introduces Transition Matching, a novel generative paradigm that unifies diffusion/flow models and continuous AR generation, which aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 8


15. Unified Multimodal Understanding via Byte-Pair Visual Encoding

ArXiv ID: 2506.23639

Authors: Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

Abstract: Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.

Comment: The paper proposes a novel generative modeling paradigm, Transition Matching, which advances both diffusion/flow models and continuous AR generation, contributing to emerging trends in generative modeling.

Relevance: 8 Novelty: 9


16. Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models

ArXiv ID: 2506.23692

Authors: Boyuan Zheng, Zerui Fang, Zhe Xu, Rui Wang, Yiwen Chen, Cunshi Wang, Mengwei Qu, Lei Lei, Zhen Feng, Yan Liu, Yuyang Li, Mingzhou Tan, Jiaji Wu, Jianwei Shuai, Jia Li, Fangfu Ye

Abstract: While AI for Science (AI4S) serves as an analytical tool in the current research paradigm, it doesn't solve its core inefficiency. We propose "Agent for Science" (Agent4S)-the use of LLM-driven agents to automate the entire research workflow-as the true Fifth Scientific Paradigm. This paper introduces a five-level classification for Agent4S, outlining a clear roadmap from simple task automation to fully autonomous, collaborative "AI Scientists." This framework defines the next revolutionary step in scientific discovery.

Comment: The paper proposes a new paradigm, Agent4S, for automating research workflows using LLMs, which is an emerging trend in AI for Science.

Relevance: 8 Novelty: 9


17. Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training

ArXiv ID: 2506.22638

Authors: Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Keith Ross

Abstract: Large language models can exhibit improved mathematical reasoning capabilities following post-training with instruction tuning, reinforcement learning, or knowledge distillation. However, it remains unclear whether these improvements are driven by major changes in transformer layers or from minor adjustments that leave the relative layer importance structures of the base model largely unchanged. We investigate this question through systematic layer-wise ablation experiments, examining base, instruction-tuned, knowledge-distilled, and reinforcement learning variants on mathematical reasoning benchmarks. Our findings show that mathematical reasoning gives rise to a specific layer importance structure, and this structure persists across all post-training paradigms. Removal of such layers causes accuracy drops of up to 80%. In contrast, non-mathematical tasks like factual recall exhibit no critical layers. This distinction suggests that mathematical reasoning requires specialized layers that emerge during pre-training, while other non-reasoning tasks do not. From an information-theoretic perspective, we also observe that these critical layers are the same layers where major representational transformation occurs.

Comment: The paper investigates layer importance in LLMs for mathematical reasoning, providing insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 7


18. Semantic-guided Diverse Decoding for Large Language Model

ArXiv ID: 2506.23601

Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), operating directly in embedding space that balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.

Comment: The paper introduces a novel method for diverse decoding in LLMs, focusing on semantic diversity, which aligns with foundational research in LLM behavior.

Relevance: 9 Novelty: 7


19. Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

ArXiv ID: 2506.23845

Authors: Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, Nikhil Garg

Abstract: While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.

Comment: The paper provides insights into the use of sparse autoencoders for discovering unknown concepts, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 7


20. Not All Explanations for Deep Learning Phenomena Are Equally Valuable

ArXiv ID: 2506.23286

Authors: Alan Jeffares, Mihaela van der Schaar

Abstract: Developing a better understanding of surprising or counterintuitive phenomena has constituted a significant portion of deep learning research in recent years. These include double descent, grokking, and the lottery ticket hypothesis -- among many others. Works in this area often develop ad hoc hypotheses attempting to explain these observed phenomena on an isolated, case-by-case basis. This position paper asserts that, in many prominent cases, there is little evidence to suggest that these phenomena appear in real-world applications and these efforts may be inefficient in driving progress in the broader field. Consequently, we argue against viewing them as isolated puzzles that require bespoke resolutions or explanations. However, despite this, we suggest that deep learning phenomena do still offer research value by providing unique settings in which we can refine our broad explanatory theories of more general deep learning principles. This position is reinforced by analyzing the research outcomes of several prominent examples of these phenomena from the recent literature. We revisit the current norms in the research community in approaching these problems and propose practical recommendations for future research, aiming to ensure that progress on deep learning phenomena is well aligned with the ultimate pragmatic goal of progress in the broader field of deep learning.

Comment: The paper discusses the value of understanding deep learning phenomena like double descent and the lottery ticket hypothesis, which aligns with representation learning insights.

Relevance: 9 Novelty: 7


21. A New Perspective On AI Safety Through Control Theory Methodologies

ArXiv ID: 2506.23703

Authors: Lars Ullrich, Walter Zimmer, Ross Greer, Knut Graichen, Alois C. Knoll, Mohan Trivedi

Abstract: While artificial intelligence (AI) is advancing rapidly and mastering increasingly complex problems with astonishing performance, the safety assurance of such systems is a major concern. Particularly in the context of safety-critical, real-world cyber-physical systems, AI promises to achieve a new level of autonomy but is hampered by a lack of safety assurance. While data-driven control takes up recent developments in AI to improve control systems, control theory in general could be leveraged to improve AI safety. Therefore, this article outlines a new perspective on AI safety based on an interdisciplinary interpretation of the underlying data-generation process and the respective abstraction by AI systems in a system theory-inspired and system analysis-driven manner. In this context, the new perspective, also referred to as data control, aims to stimulate AI engineering to take advantage of existing safety analysis and assurance in an interdisciplinary way to drive the paradigm of data control. Following a top-down approach, a generic foundation for safety analysis and assurance is outlined at an abstract level that can be refined for specific AI systems and applications and is prepared for future innovation.

Comment: The paper discusses AI safety through control theory, which is an emerging trend challenging established assumptions in AI safety.

Relevance: 8 Novelty: 8


22. Learning Stochastic Multiscale Models

ArXiv ID: 2506.22655

Authors: Andrew F. Ilersich, Prasanth B. Nair

Abstract: The physical sciences are replete with dynamical systems that require the resolution of a wide range of length and time scales. This presents significant computational challenges since direct numerical simulation requires discretization at the finest relevant scales, leading to a high-dimensional state space. In this work, we propose an approach to learn stochastic multiscale models in the form of stochastic differential equations directly from observational data. Our method resolves the state on a coarse mesh while introducing an auxiliary state to capture the effects of unresolved scales. We learn the parameters of the multiscale model using a modern forward-solver-free amortized variational inference method. Our approach draws inspiration from physics-based multiscale modeling approaches, such as large-eddy simulation in fluid dynamics, while learning directly from data. We present numerical studies to demonstrate that our learned multiscale models achieve superior predictive accuracy compared to direct numerical simulation and closure-type models at equivalent resolution.

Comment: The paper proposes a method for learning stochastic multiscale models, which is relevant to AI for Science and emerging trends in modeling complex systems.

Relevance: 8 Novelty: 8


23. GViT: Representing Images as Gaussians for Visual Recognition

ArXiv ID: 2506.23532

Authors: Jefferson Hernandez, Ruozhen He, Guha Balakrishnan, Alexander C. Berg, Vicente Ordonez

Abstract: We introduce GVIT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. Each image is encoded as a few hundred Gaussians whose positions, scales, orientations, colors, and opacities are optimized jointly with a ViT classifier trained on top of these representations. We reuse the classifier gradients as constructive guidance, steering the Gaussians toward class-salient regions while a differentiable renderer optimizes an image reconstruction loss. We demonstrate that by 2D Gaussian input representations coupled with our GVIT guidance, using a relatively standard ViT architecture, closely matches the performance of a traditional patch-based ViT, reaching a 76.9% top-1 accuracy on Imagenet-1k using a ViT-B architecture.

Comment: The paper introduces a novel image representation method using Gaussians with a ViT classifier, which relates to representation learning and model architecture.

Relevance: 8 Novelty: 8


24. Riemannian-Geometric Fingerprints of Generative Models

ArXiv ID: 2506.22802

Authors: Hae Jin Song, Laurent Itti

Abstract: Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.

Comment: The paper proposes a new geometric approach to understanding generative models' fingerprints, which is relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 8


25. Neural Langevin Machine: a local asymmetric learning rule can be creative

ArXiv ID: 2506.23546

Authors: Zhendong Yu, Weizhong Huang, Haiping Huang

Abstract: Fixed points of recurrent neural networks can be leveraged to store and generate information. These fixed points can be captured by the Boltzmann-Gibbs measure, which leads to neural Langevin dynamics that can be used for sampling and learning a real dataset. We call this type of generative model neural Langevin machine, which is interpretable due to its analytic form of distribution and is simple to train. Moreover, the learning process is derived as a local asymmetric plasticity rule, bearing biological relevance. Therefore, one can realize a continuous sampling of creative dynamics in a neural network, mimicking an imagination process in brain circuits. This neural Langevin machine may be another promising generative model, at least in its strength in circuit-based sampling and biologically plausible learning rule.

Comment: The paper presents a new generative model, the neural Langevin machine, which is relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 8


26. Tensor Train Quantum State Tomography using Compressed Sensing

ArXiv ID: 2506.23560

Authors: Shakir Showkat Sofi, Charlotte Vermeylen, Lieven De Lathauwer

Abstract: Quantum state tomography (QST) is a fundamental technique for estimating the state of a quantum system from measured data and plays a crucial role in evaluating the performance of quantum devices. However, standard estimation methods become impractical due to the exponential growth of parameters in the state representation. In this work, we address this challenge by parameterizing the state using a low-rank block tensor train decomposition and demonstrate that our approach is both memory- and computationally efficient. This framework applies to a broad class of quantum states that can be well approximated by low-rank decompositions, including pure states, nearly pure states, and ground states of Hamiltonians.

Comment: The paper introduces a low-rank approach to quantum state tomography, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 8


27. BWLer: Barycentric Weight Layer Elucidates a Precision-Conditioning Tradeoff for PINNs

ArXiv ID: 2506.23024

Authors: Jerry Liu, Yasa Baig, Denise Hui Jean Lee, Rajat Vadiraj Dwaraknath, Atri Rudra, Chris R\'e

Abstract: Physics-informed neural networks (PINNs) offer a flexible way to solve partial differential equations (PDEs) with machine learning, yet they still fall well short of the machine-precision accuracy many scientific tasks demand. In this work, we investigate whether the precision ceiling comes from the ill-conditioning of the PDEs or from the typical multi-layer perceptron (MLP) architecture. We introduce the Barycentric Weight Layer (BWLer), which models the PDE solution through barycentric polynomial interpolation. A BWLer can be added on top of an existing MLP (a BWLer-hat) or replace it completely (explicit BWLer), cleanly separating how we represent the solution from how we take derivatives for the PDE loss. Using BWLer, we identify fundamental precision limitations within the MLP: on a simple 1-D interpolation task, even MLPs with O(1e5) parameters stall around 1e-8 RMSE -- about eight orders above float64 machine precision -- before any PDE terms are added. In PDE learning, adding a BWLer lifts this ceiling and exposes a tradeoff between achievable accuracy and the conditioning of the PDE loss. For linear PDEs we fully characterize this tradeoff with an explicit error decomposition and navigate it during training with spectral derivatives and preconditioning. Across five benchmark PDEs, adding a BWLer on top of an MLP improves RMSE by up to 30x for convection, 10x for reaction, and 1800x for wave equations while remaining compatible with first-order optimizers. Replacing the MLP entirely lets an explicit BWLer reach near-machine-precision on convection, reaction, and wave problems (up to 10 billion times better than prior results) and match the performance of standard PINNs on stiff Burgers' and irregular-geometry Poisson problems. Together, these findings point to a practical path for combining the flexibility of PINNs with the precision of classical spectral solvers.

Comment: The paper introduces a novel layer for improving precision in physics-informed neural networks, which aligns with model architecture innovations.

Relevance: 8 Novelty: 7


28. Emergent musical properties of a transformer under contrastive self-supervised learning

ArXiv ID: 2506.23873

Authors: Yuexuan Kong, Gabriel Meseguer-Brocal, Vincent Lostanlen, Mathieu Lagrange, Romain Hennequin

Abstract: In music information retrieval (MIR), contrastive self-supervised learning for general-purpose representation models is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that contrastively trained general-purpose self-supervised models are inadequate and that more sophisticated SSL is necessary; e.g., masked modeling. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a lightweight vision transformer with one-dimensional patches in the time--frequency domain (ViT-1D) and train it with simple contrastive SSL through normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only over the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in ViT-1D's sequence tokens. On global tasks, the temporal average of class and sequence tokens offers a performance increase compared to the class token alone, showing useful properties in the sequence tokens. On local tasks, sequence tokens perform unexpectedly well, despite not being specifically trained for. Furthermore, high-level musical features such as onsets emerge from layer-wise attention maps and self-similarity matrices show different layers capture different musical dimensions. Our paper does not focus on improving performance but advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.

Comment: The paper explores emergent properties of transformers under contrastive self-supervised learning, which aligns with representation learning and model architecture.

Relevance: 8 Novelty: 7


29. The Trilemma of Truth in Large Language Models

ArXiv ID: 2506.23921

Authors: Germans Savcisens, Tina Eliassi-Rad

Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.

Comment: The paper introduces a method for probing the veracity of LLMs, which aligns with theoretical insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 7


30. A Systematic Study of Compositional Syntactic Transformer Language Models

ArXiv ID: 2506.22978

Authors: Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu

Abstract: Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at https://github.com/zhaoyd1/compositional_SLMs.

Comment: The paper focuses on compositional syntactic transformer language models, which aligns with the interest in model architecture, particularly transformers. It provides a unified framework and empirical evaluation, contributing to the understanding of transformer-based models.

Relevance: 8 Novelty: 7


31. Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

ArXiv ID: 2506.22803

Authors: Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Mei Lin, Peiyi Shen, Liang Zhang

Abstract: Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at sample-level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy across 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.

Comment: The paper introduces a Concept Bottleneck Model to enhance interpretability in neural networks, which aligns with representation learning and model architecture analysis.

Relevance: 8 Novelty: 7


32. AICO: Feature Significance Tests for Supervised Learning

ArXiv ID: 2506.23396

Authors: Kay Giesecke, Enguerrand Horel, Chartsiri Jirachotkulthorn

Abstract: The opacity of many supervised learning algorithms remains a key challenge, hindering scientific discovery and limiting broader deployment -- particularly in high-stakes domains. This paper develops model- and distribution-agnostic significance tests to assess the influence of input features in any regression or classification algorithm. Our method evaluates a feature's incremental contribution to model performance by masking its values across samples. Under the null hypothesis, the distribution of performance differences across a test set has a non-positive median. We construct a uniformly most powerful, randomized sign test for this median, yielding exact p-values for assessing feature significance and confidence intervals with exact coverage for estimating population-level feature importance. The approach requires minimal assumptions, avoids model retraining or auxiliary models, and remains computationally efficient even for large-scale, high-dimensional settings. Experiments on synthetic tasks validate its statistical and computational advantages, and applications to real-world data illustrate its practical utility.

Comment: The paper develops model-agnostic significance tests for feature importance, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


33. Token Activation Map to Visually Explain Multimodal LLMs

ArXiv ID: 2506.23270

Authors: Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, Xiaomeng Li

Abstract: Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. TAM also indicates that it excels at explaining multiple tokens of MLLM, which is different from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual comparison, and model understanding (e.g., color, shape, action, location, visual reasoning, multi-turn conversation, etc). The code is available atgithub.com/xmed-lab/TAM.

Comment: The paper proposes a method for explaining multimodal LLMs, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 8 Novelty: 7


34. Towards the Training of Deeper Predictive Coding Neural Networks

ArXiv ID: 2506.23800

Authors: Chang Qi, Matteo Forasassi, Thomas Lukasiewicz, Tommaso Salvatori

Abstract: Predictive coding networks trained with equilibrium propagation are neural models that perform inference through an iterative energy minimization process. Previous studies have demonstrated their effectiveness in shallow architectures, but show significant performance degradation when depth exceeds five to seven layers. In this work, we show that the reason behind this degradation is due to exponentially imbalanced errors between layers during weight updates, and predictions from the previous layer not being effective in guiding updates in deeper layers. We address the first issue by introducing two novel methods to optimize the latent variables that use precision-weighting to re-balance the distribution of energy among layers during the `relaxation phase', and the second issue by proposing a novel weight update mechanism that reduces error accumulation in deeper layers. Empirically, we test our methods on a large number of image classification tasks, resulting in large improvements in test accuracy across networks with more than seven layers, with performances comparable to those of backprop on similar models. These findings suggest that a better understanding of the relaxation phase is important to train models using equilibrium propagation at scale, and open new possibilities for their application in complex tasks.

Comment: The paper addresses training challenges in predictive coding networks, which aligns with the representation learning criterion.

Relevance: 8 Novelty: 7


35. Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model

ArXiv ID: 2506.23840

Authors: Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming, Tao Lin

Abstract: Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., wait, however). These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) A rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) A fine-grained advantage control technique to dynamically regulate the prediction of target tokens; (3) A policy shaping method ensuring stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on the popular LRM, which significantly improves their token efficiency during reasoning, while achieving superior performance of the base model.

Comment: The paper proposes a novel algorithm to improve token efficiency in large reasoning models, which aligns with the large language models criterion.

Relevance: 8 Novelty: 7


36. Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning

ArXiv ID: 2506.22510

Authors: Zihao Zhao, Xinlong Zhai, Jinyu Yang, Chuan Shi

Abstract: Foundation models have achieved great success in natural language processing (NLP) and computer vision (CV). Their success largely stems from the ability to integrate multi-domain knowledge in pre-training and transfer it to target domains. Considering graph data, especially graphs without textual features, is ubiquitous in real-world applications such as social networks and recommendation systems, some researchers have attempted to extend this paradigm to the graph field, aiming to construct graph foundation models. However, unlike CV and NLP, there are huge gaps among the semantics and properties of graphs in different domains, while current works still adopt traditional contrastive pre-training strategies designed in the single-domain scenario, which regard contrastive samples from different domains as equivalent. From experimental investigations, we discovered that inherent domain-specific differences prevent these strategies from effectively absorbing knowledge from different domains to generate informative representations. In this paper, we propose a novel multi-domain pre-training and cross-domain transfer framework, namely MDGCL.In the pre-training stage, we design a contrastive learning strategy to substantially recognize and capture domain differences, and introduce domain tokens to encode domain-level global information. In the downstream stage, we introduce a domain attention mechanism to enable fine-grained domain knowledge transfer. Extensive experiments on five benchmark datasets have demonstrated that our method outperforms state-of-the-art significantly, with the maximum improvement of 19.33\% on accuracy and 19.13\% on Macro-F1 score.

Comment: The paper discusses a novel multi-domain graph contrastive learning framework, which is relevant to representation learning.

Relevance: 8 Novelty: 7


37. Efficient Algorithms for Learning and Compressing Monophonic Halfspaces in Graphs

ArXiv ID: 2506.23186

Authors: Marco Bressan, Victor Chepoi, Emmanuel Esposito, Maximilian Thiessen

Abstract: Abstract notions of convexity over the vertices of a graph, and corresponding notions of halfspaces, have recently gained attention from the machine learning community. In this work we study monophonic halfspaces, a notion of graph halfspaces defined through closure under induced paths. Our main result is a $2$-satisfiability based decomposition theorem, which allows one to represent monophonic halfspaces as a disjoint union of certain vertex subsets. Using this decomposition, we achieve efficient and (nearly) optimal algorithms for various learning problems, such as teaching, active, and online learning. Most notably, we obtain a polynomial-time algorithm for empirical risk minimization. Independently of the decomposition theorem, we obtain an efficient, stable, and proper sample compression scheme. This makes monophonic halfspaces efficiently learnable with proper learners and linear error rate $1/\varepsilon$ in the realizable PAC setting. Our results answer open questions from the literature, and show a stark contrast with geodesic halfspaces, for which most of the said learning problems are NP-hard.

Comment: The paper presents efficient algorithms for learning and compressing monophonic halfspaces in graphs, contributing to model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.