Personalized Daily ArXiv Papers 2025-07-01

[gpt-4o]	Prompt	Completion	Total
Token	55552	7090	62642
Cost	$0.14	$0.07	$0.21

Total arXiv papers: 843

Total scanned papers: 490

Total relevant papers: 37

Table of contents with paper titles:

A unified framework on the universal approximation of transformer-type architectures Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen
Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging Authors: Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo
AI's Euclid's Elements Moment: From Language Models to Computable Thought Authors: Xinmin Fang, Lingfeng Tao, Zhengxiong Li
Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence Authors: András György, Tor Lattimore, Nevena Lazić, Csaba Szepesvári
On Universality of Non-Separable Approximate Message Passing Algorithms Authors: Max Lovig, Tianhao Wang, Zhou Fan
The Hidden Link Between RLHF and Contrastive Learning Authors: Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen
Towards Efficient and Accurate Spiking Neural Networks via Adaptive Bit Allocation Authors: Xingting Yao, Qinghao Hu, Fei Zhou, Tielong Liu, Gang Li, Peisong Wang, Jian Cheng
Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model Authors: Mu-Chi Chen, Po-Hsuan Huang, Xiangrui Ke, Chia-Heng Tu, Chun Jason Xue, Shih-Hao Hung
Residual Matrix Transformers: Scaling the Size of the Residual Stream Authors: Brian Mak, Jeffrey Flanigan
On the Predictive Power of Representation Dispersion in Language Models Authors: Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou
Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models Authors: Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, Irina Rish
Masked Gated Linear Unit Authors: Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota
Generalized Linear Mode Connectivity for Transformers Authors: Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, Valentina Boeva
Transition Matching: Scalable and Flexible Generative Modeling Authors: Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman
Unified Multimodal Understanding via Byte-Pair Visual Encoding Authors: Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu
Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models Authors: Boyuan Zheng, Zerui Fang, Zhe Xu, Rui Wang, Yiwen Chen, Cunshi Wang, Mengwei Qu, Lei Lei, Zhen Feng, Yan Liu, Yuyang Li, Mingzhou Tan, Jiaji Wu, Jianwei Shuai, Jia Li, Fangfu Ye
Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training Authors: Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Keith Ross
Semantic-guided Diverse Decoding for Large Language Model Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou
Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts Authors: Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, Nikhil Garg
Not All Explanations for Deep Learning Phenomena Are Equally Valuable Authors: Alan Jeffares, Mihaela van der Schaar
A New Perspective On AI Safety Through Control Theory Methodologies Authors: Lars Ullrich, Walter Zimmer, Ross Greer, Knut Graichen, Alois C. Knoll, Mohan Trivedi
Learning Stochastic Multiscale Models Authors: Andrew F. Ilersich, Prasanth B. Nair
GViT: Representing Images as Gaussians for Visual Recognition Authors: Jefferson Hernandez, Ruozhen He, Guha Balakrishnan, Alexander C. Berg, Vicente Ordonez
Riemannian-Geometric Fingerprints of Generative Models Authors: Hae Jin Song, Laurent Itti
Neural Langevin Machine: a local asymmetric learning rule can be creative Authors: Zhendong Yu, Weizhong Huang, Haiping Huang
Tensor Train Quantum State Tomography using Compressed Sensing Authors: Shakir Showkat Sofi, Charlotte Vermeylen, Lieven De Lathauwer
BWLer: Barycentric Weight Layer Elucidates a Precision-Conditioning Tradeoff for PINNs Authors: Jerry Liu, Yasa Baig, Denise Hui Jean Lee, Rajat Vadiraj Dwaraknath, Atri Rudra, Chris Ré
Emergent musical properties of a transformer under contrastive self-supervised learning Authors: Yuexuan Kong, Gabriel Meseguer-Brocal, Vincent Lostanlen, Mathieu Lagrange, Romain Hennequin
The Trilemma of Truth in Large Language Models Authors: Germans Savcisens, Tina Eliassi-Rad
A Systematic Study of Compositional Syntactic Transformer Language Models Authors: Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu
Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding Authors: Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Mei Lin, Peiyi Shen, Liang Zhang
AICO: Feature Significance Tests for Supervised Learning Authors: Kay Giesecke, Enguerrand Horel, Chartsiri Jirachotkulthorn
Token Activation Map to Visually Explain Multimodal LLMs Authors: Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, Xiaomeng Li
Towards the Training of Deeper Predictive Coding Neural Networks Authors: Chang Qi, Matteo Forasassi, Thomas Lukasiewicz, Tommaso Salvatori
Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model Authors: Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming, Tao Lin
Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning Authors: Zihao Zhao, Xinlong Zhai, Jinyu Yang, Chuan Shi
Efficient Algorithms for Learning and Compressing Monophonic Halfspaces in Graphs Authors: Marco Bressan, Victor Chepoi, Emmanuel Esposito, Maximilian Thiessen

1. A unified framework on the universal approximation of transformer-type architectures

ArXiv ID: 2506.23551

Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen

Abstract: We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.

Comment: The paper provides a unified theoretical framework for the universal approximation property of transformer-type architectures, which is a significant contribution to model architecture analysis.

Relevance: 10 Novelty: 9

2. Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging

ArXiv ID: 2506.23266

Authors: Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo

Abstract: Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods promise greater efficiency by consolidating multiple experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared $U$-matrices while enabling effective merging of the expert-specific $V$ components. Specifically, Sub-MoE consists of two innovative phases: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first enforces Experts Union Decomposition to derive the shared $U$-matrix across experts in the same group, then pursues frequency-based merging for individual $V$-matrices, and finalizes expert reconstruction using the merged $V$-matrix. In this way, we align and fuse experts in a shared subspace, and the framework can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5|3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%|86% of original performance with 25%|50% expert reduction on Mixtral-8x7B in zero-shot benchmarks. Code will be released at https://github.com/lliai/MoERazor.

Comment: The paper presents a novel MoE compression framework, which is highly relevant to model compression and MoE.

Relevance: 10 Novelty: 9

3. AI's Euclid's Elements Moment: From Language Models to Computable Thought

ArXiv ID: 2506.23080

Authors: Xinmin Fang, Lingfeng Tao, Zhengxiong Li

Abstract: This paper presents a comprehensive five-stage evolutionary framework for understanding the development of artificial intelligence, arguing that its trajectory mirrors the historical progression of human cognitive technologies. We posit that AI is advancing through distinct epochs, each defined by a revolutionary shift in its capacity for representation and reasoning, analogous to the inventions of cuneiform, the alphabet, grammar and logic, mathematical calculus, and formal logical systems. This "Geometry of Cognition" framework moves beyond mere metaphor to provide a systematic, cross-disciplinary model that not only explains AI's past architectural shifts, from expert systems to Transformers, but also charts a concrete and prescriptive path forward. Crucially, we demonstrate that this evolution is not merely linear but reflexive: as AI advances through these stages, the tools and insights it develops create a feedback loop that fundamentally reshapes its own underlying architecture. We are currently transitioning into a "Metalinguistic Moment," characterized by the emergence of self-reflective capabilities like Chain-of-Thought prompting and Constitutional AI. The subsequent stages, the "Mathematical Symbolism Moment" and the "Formal Logic System Moment," will be defined by the development of a computable calculus of thought, likely through neuro-symbolic architectures and program synthesis, culminating in provably aligned and reliable AI that reconstructs its own foundational representations. This work serves as the methodological capstone to our trilogy, which previously explored the economic drivers ("why") and cognitive nature ("what") of AI. Here, we address the "how," providing a theoretical foundation for future research and offering concrete, actionable strategies for startups and developers aiming to build the next generation of intelligent systems.

Comment: The paper presents a theoretical framework for understanding AI development, which aligns with emerging trends and foundational research.

Relevance: 9 Novelty: 9

4. Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence

ArXiv ID: 2506.23908

Authors: András György, Tor Lattimore, Nevena Lazić, Csaba Szepesvári

Abstract: Sound deductive reasoning -- the ability to derive new knowledge from existing facts and rules -- is an indisputably desirable aspect of general intelligence. Despite the major advances of AI systems in areas such as math and science, especially since the introduction of transformer architectures, it is well-documented that even the most advanced frontier systems regularly and consistently falter on easily-solvable deductive reasoning tasks. Hence, these systems are unfit to fulfill the dream of achieving artificial general intelligence capable of sound deductive reasoning. We argue that their unsound behavior is a consequence of the statistical learning approach powering their development. To overcome this, we contend that to achieve reliable deductive reasoning in learning-based AI systems, researchers must fundamentally shift from optimizing for statistical performance against distributions on reasoning problems and algorithmic tasks to embracing the more ambitious exact learning paradigm, which demands correctness on all inputs. We argue that exact learning is both essential and possible, and that this ambitious objective should guide algorithm design.

Comment: The paper argues for a shift from statistical learning to exact learning for general intelligence, which is relevant to emerging trends.

Relevance: 9 Novelty: 9

5. On Universality of Non-Separable Approximate Message Passing Algorithms

ArXiv ID: 2506.23010

Authors: Max Lovig, Tianhao Wang, Zhou Fan

Abstract: Mean-field characterizations of first-order iterative algorithms -- including Approximate Message Passing (AMP), stochastic and proximal gradient descent, and Langevin diffusions -- have enabled a precise understanding of learning dynamics in many statistical applications. For algorithms whose non-linearities have a coordinate-separable form, it is known that such characterizations enjoy a degree of universality with respect to the underlying data distribution. However, mean-field characterizations of non-separable algorithm dynamics have largely remained restricted to i.i.d. Gaussian or rotationally-invariant data. In this work, we initiate a study of universality for non-separable AMP algorithms. We identify a general condition for AMP with polynomial non-linearities, in terms of a Bounded Composition Property (BCP) for their representing tensors, to admit a state evolution that holds universally for matrices with non-Gaussian entries. We then formalize a condition of BCP-approximability for Lipschitz AMP algorithms to enjoy a similar universal guarantee. We demonstrate that many common classes of non-separable non-linearities are BCP-approximable, including local denoisers, spectral denoisers for generic signals, and compositions of separable functions with generic linear maps, implying the universality of state evolution for AMP algorithms employing these non-linearities.

Comment: The paper provides theoretical insights into Approximate Message Passing algorithms, focusing on universality for non-separable algorithms, which aligns with representation learning and training dynamics.

Relevance: 9 Novelty: 8

6. The Hidden Link Between RLHF and Contrastive Learning

ArXiv ID: 2506.22578

Authors: Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen

Abstract: Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning based on the positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). This paradigm further explains why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks. We will release the model and code upon acceptance.

Comment: The paper explores the connection between RLHF and contrastive learning, providing insights into representation learning through mutual information maximization.

Relevance: 9 Novelty: 8
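
Illustration (not from the paper): for readers unfamiliar with the two estimators named in the abstract, their standard textbook forms are written out below. The critic $T_\theta$ and the softplus shorthand $\mathrm{sp}$ are notation assumed here; how the paper instantiates the critic from chosen and rejected responses is not stated in the abstract and is not shown.

```latex
% Donsker-Varadhan (DV) lower bound on mutual information, as used by MINE:
I(X;Y) \;=\; D_{\mathrm{KL}}\!\big(p(x,y)\,\|\,p(x)\,p(y)\big)
\;\ge\; \mathbb{E}_{p(x,y)}\big[T_\theta(x,y)\big]
\;-\; \log \mathbb{E}_{p(x)\,p(y)}\big[e^{T_\theta(x,y)}\big]

% Jensen-Shannon-based MI estimator, with softplus sp(z) = \log(1 + e^{z}):
\hat{I}^{\mathrm{JSD}}_\theta(X;Y) \;=\;
\mathbb{E}_{p(x,y)}\big[-\mathrm{sp}\big(-T_\theta(x,y)\big)\big]
\;-\; \mathbb{E}_{p(x)\,p(y)}\big[\mathrm{sp}\big(T_\theta(x,y)\big)\big]
```

In both expressions the first expectation is over positive (jointly sampled) pairs and the second over negative pairs drawn from the product of marginals, which is the contrastive structure the abstract refers to.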

7. Towards Efficient and Accurate Spiking Neural Networks via Adaptive Bit Allocation

ArXiv ID: 2506.23717

Authors: Xingting Yao, Qinghao Hu, Fei Zhou, Tielong Liu, Gang Li, Peisong Wang, Jian Cheng

Abstract: Multi-bit spiking neural networks (SNNs) have recently become a hot research topic, pursuing energy-efficient and highly accurate AI. However, with more bits involved, the associated memory and computation demands escalate to the point where the performance improvements become disproportionate. Based on the insight that different layers demonstrate different importance and extra bits could be wasted or even interfere, this paper presents an adaptive bit allocation strategy for direct-trained SNNs, achieving fine-grained layer-wise allocation of memory and computation resources. Thus, the SNN's efficiency and accuracy can be improved. Specifically, we parametrize the temporal lengths and the bit widths of weights and spikes, and make them learnable and controllable through gradients. To address the challenges caused by changeable bit widths and temporal lengths, we propose the refined spiking neuron, which can handle different temporal lengths, enable the derivation of gradients for temporal lengths, and suit spike quantization better. In addition, we theoretically formulate the step-size mismatch problem of learnable bit widths, which may incur severe quantization errors in SNNs, and accordingly propose the step-size renewal mechanism to alleviate this issue. Experiments on various datasets, including the static CIFAR and ImageNet and the dynamic CIFAR-DVS and DVS-GESTURE, demonstrate that our methods can reduce the overall memory and computation cost while achieving higher accuracy. Particularly, our SEWResNet-34 can achieve a 2.69% accuracy gain and 4.16$\times$ lower bit budgets over the advanced baseline work on ImageNet. This work will be fully open-sourced.

Comment: The paper presents an adaptive bit allocation strategy for spiking neural networks, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8

8. Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model

ArXiv ID: 2506.23635

Authors: Mu-Chi Chen, Po-Hsuan Huang, Xiangrui Ke, Chia-Heng Tu, Chun Jason Xue, Shih-Hao Hung

Abstract: Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small group services, as targeted by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the Apple software stack's memory management logic. Based on these findings, we develop optimization schemes to eliminate the memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than the state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations, and the model provides valuable insights for designing private LLM systems.

Comment: The paper explores multi-node expert parallelism for Mixture-of-Experts LLMs, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8

9. Residual Matrix Transformers: Scaling the Size of the Residual Stream

ArXiv ID: 2506.22696

Authors: Brian Mak, Jeffrey Flanigan

Abstract: The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.

Comment: The introduction of Residual Matrix Transformers (RMT) offers a novel architectural modification to transformers, focusing on efficiency and scaling, which is relevant to model architecture and compression.

Relevance: 9 Novelty: 8
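
Illustration (not from the paper): the abstract refers to the classic outer product memory matrix. The sketch below shows only that classic idea, storing key/value pairs as a sum of outer products and reading a value back by multiplying with its key. The function names `outer`, `store`, and `retrieve` are hypothetical, and how the RMT wires such a memory into transformer layers is not described in the abstract and is not modeled here.

```python
def outer(u, v):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[a * b for b in v] for a in u]


def store(memory, key, value):
    """Return a new memory with outer(key, value) added to it."""
    update = outer(key, value)
    return [[m + u for m, u in zip(m_row, u_row)]
            for m_row, u_row in zip(memory, update)]


def retrieve(memory, key):
    """Read out key^T @ memory (a vector with one entry per memory column)."""
    return [sum(key[i] * memory[i][j] for i in range(len(key)))
            for j in range(len(memory[0]))]


# Two orthonormal keys, so each stored value can be read back without interference.
memory = [[0.0] * 3 for _ in range(2)]
key_a, key_b = [1.0, 0.0], [0.0, 1.0]
memory = store(memory, key_a, [1.0, 2.0, 3.0])
memory = store(memory, key_b, [-1.0, 0.5, 4.0])
recalled_a = retrieve(memory, key_a)
recalled_b = retrieve(memory, key_b)
```

With orthonormal keys the readout is exact; with correlated keys the recalled vector would mix in parts of the other stored values.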

10. On the Predictive Power of Representation Dispersion in Language Models

ArXiv ID: 2506.24106

Authors: Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou

Abstract: We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.

Comment: The paper explores the link between representation dispersion and language model performance, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8
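
Illustration (not from the paper): the abstract defines representation dispersion as the average pairwise cosine distance among hidden vectors. A minimal sketch of that definition follows; the function names and the toy vectors are assumptions, and which layer or tokens the authors sample is not specified in the abstract.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representation_dispersion(hidden_vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(hidden_vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy data: one tightly clustered set and one widely spread set.
tight = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tight_score = representation_dispersion(tight)
spread_score = representation_dispersion(spread)
```

The spread set scores higher than the tight set, which is the quantity the abstract reports as negatively correlated with perplexity.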

11. Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models

ArXiv ID: 2506.23025

Authors: Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, Irina Rish

Abstract: Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. As the computational power of modern GPU architectures continuously improves, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which demonstrate accelerated inference across various CPU architectures. Also, building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.

Comment: The paper introduces ternary language models and novel quantization methods, aligning with the model compression criterion.

Relevance: 9 Novelty: 8
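
Illustration (not from the paper): the abstract mentions 2-bit and 1.6-bit packing schemes for ternary weights but does not give their bit layouts. The sketch below is one straightforward reading, assumed for illustration only: 2 bits per weight puts four weights in a byte, and a base-3 encoding puts five weights in a byte (8/5 = 1.6 bits each, since 3^5 = 243 fits in 256). All function names are hypothetical.

```python
def pack_2bit(weights):
    """Pack ternary weights (-1, 0, 1) four per byte, 2 bits each."""
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map -1/0/1 to 0/1/2
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, count):
    """Inverse of pack_2bit; count drops the padding in the last byte."""
    out = []
    for byte in packed:
        for j in range(4):
            out.append(((byte >> (2 * j)) & 0b11) - 1)
    return out[:count]


def pack_base3(weights):
    """Pack ternary weights five per byte as a base-3 number."""
    packed = bytearray()
    for i in range(0, len(weights), 5):
        byte = 0
        for w in reversed(weights[i:i + 5]):
            byte = byte * 3 + (w + 1)  # first weight ends up as the lowest digit
        packed.append(byte)
    return bytes(packed)


def unpack_base3(packed, count):
    """Inverse of pack_base3; count drops the padding in the last byte."""
    out = []
    for byte in packed:
        for _ in range(5):
            out.append(byte % 3 - 1)
            byte //= 3
    return out[:count]
```

For 20 weights this layout needs 5 bytes at 2 bits and 4 bytes at 1.6 bits; the denser form trades shifts and masks for divisions when decoding.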

12. Masked Gated Linear Unit

ArXiv ID: 2506.23225

Authors: Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota

Abstract: Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads compared to feed-forward layers without gating, due to the use of separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contributions of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture that learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix, resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7$\times$ inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster than standard GLUs despite added architectural complexity on an RTX5090 GPU. In LLM experiments, the Swish-activated variant SwiMGLU preserves its memory advantages while matching - or even surpassing - the downstream accuracy of the SwiGLU baseline.

Comment: The paper introduces Masked Gated Linear Units, a novel architectural innovation, aligning with the model architecture criterion.

Relevance: 9 Novelty: 8
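
Illustration (not from the paper): the abstract describes binary masks that assign each element of a single shared weight matrix to either the gate or the value stream. The sketch below assumes the simplest reading with one fixed mask: masked entries feed a Swish-activated gate, the remaining entries feed the value, and the two are multiplied element-wise. The function names, the single-mask simplification, and the exact formula are assumptions; the paper's MoEG architecture uses multiple learned masks and a fused kernel, neither of which is shown.

```python
import math


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, W):
    """Compute x @ W for a vector x (d_in) and a matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def masked_glu(x, W, mask):
    """Gated unit where one matrix W serves both streams, split by a 0/1 mask."""
    gate_W = [[w * m for w, m in zip(w_row, m_row)]
              for w_row, m_row in zip(W, mask)]
    value_W = [[w * (1 - m) for w, m in zip(w_row, m_row)]
               for w_row, m_row in zip(W, mask)]
    gate = matvec(x, gate_W)
    value = matvec(x, value_W)
    return [swish(g) * v for g, v in zip(gate, value)]


x = [1.0, 2.0]
W = [[0.5, -1.0], [0.25, 2.0]]
mask = [[1, 0], [0, 1]]  # 1 = element used by the gate, 0 = used by the value
out = masked_glu(x, W, mask)
```

This pure-Python version materializes two masked copies of `W` for clarity; the memory saving the abstract claims comes from reading the shared matrix only once, which requires a fused kernel.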

13. Generalized Linear Mode Connectivity for Transformers

ArXiv ID: 2506.22712

Authors: Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, Valentina Boeva

Abstract: Understanding the geometry of neural network loss landscapes is a central question in deep learning, with implications for generalization and optimization. A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths, despite appearing to lie in separate loss basins. However, this is often obscured by symmetries in parameter space -- such as neuron permutations -- which make functionally equivalent models appear dissimilar. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope and fail to capture the richer symmetries exhibited by modern architectures such as Transformers. In this work, we introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, orthogonal transformations, and general invertible maps -- broadening the set of valid reparameterizations and subsuming many previous approaches as special cases. Crucially, this generalization enables, for the first time, the discovery of low- and zero-barrier linear interpolation paths between independently trained Vision Transformers and GPT-2 models. These results reveal deeper structure in the loss landscape and underscore the importance of symmetry-aware analysis for understanding model space geometry.

Comment: The paper presents a generalized framework for understanding linear mode connectivity in Transformers, addressing symmetries in parameter space. This aligns with the core topic of model architecture, providing theoretical insights into the geometry of neural network loss landscapes.

Relevance: 9 Novelty: 8
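
Illustration (not from the paper): the abstract notes that parameter-space symmetries such as neuron permutations make functionally equivalent models look dissimilar. The toy sketch below shows only the permutation case on a tiny ReLU network: swapping hidden units leaves the function unchanged, naive weight averaging between the two copies breaks it, and averaging after undoing the permutation does not. All names are hypothetical, and the paper's other symmetry classes (semi-permutations, orthogonal transformations, general invertible maps) are not covered.

```python
def mlp(x, W1, w2):
    """Tiny one-hidden-layer ReLU network with scalar input and output."""
    hidden = [max(0.0, w * x) for w in W1]
    return sum(h * v for h, v in zip(hidden, w2))


def permute(params, order):
    """Reorder hidden units; input and output weights move together."""
    W1, w2 = params
    return [W1[i] for i in order], [w2[i] for i in order]


def interpolate(a, b, t):
    """Linear interpolation (1 - t) * a + t * b, applied per parameter group."""
    return tuple([(1 - t) * p + t * q for p, q in zip(pa, pb)]
                 for pa, pb in zip(a, b))


model_a = ([1.0, -2.0], [3.0, 0.5])
model_b = permute(model_a, [1, 0])  # same function, hidden units swapped

same_function = mlp(1.0, *model_a) == mlp(1.0, *model_b)
naive_mid = mlp(1.0, *interpolate(model_a, model_b, 0.5))
aligned_mid = mlp(1.0, *interpolate(model_a, permute(model_b, [1, 0]), 0.5))
```

Here the naive midpoint outputs 0.0 while both endpoints and the aligned midpoint output 3.0, a small-scale picture of why alignment under symmetries matters before interpolating.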

14. Transition Matching: Scalable and Flexible Generative Modeling

ArXiv ID: 2506.23589

Authors: Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman

Abstract: Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.

Comment: The paper introduces Transition Matching, a novel generative paradigm that unifies diffusion/flow models and continuous AR generation, which aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 8

15. Unified Multimodal Understanding via Byte-Pair Visual Encoding

ArXiv ID: 2506.23639

Authors: Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

Abstract: Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.

Comment: The paper applies byte-pair encoding to visual tokens, using a priority-guided encoding scheme and curriculum-driven multi-stage training to unify multimodal understanding, which relates to representation learning and multimodal foundation model architecture.

Relevance: 8 Novelty: 9
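
As a rough illustration of the byte-pair idea in the abstract above, the sketch below runs plain frequency-based pair merging over a 1-D sequence of visual token IDs. It is not the authors' method: the paper's priority-guided scheme also weighs spatial consistency, which is omitted here, and the function names, the flattened 1-D layout, and the ID offset `first_new_id` are assumptions made for illustration only.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair, or None if there are no pairs."""
    counts = Counter(zip(tokens, tokens[1:]))
    if not counts:
        return None
    # max() keeps the first maximal key; Counter preserves insertion order,
    # so ties go to the pair that appears first in the sequence.
    return max(counts, key=counts.get)


def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def bpe_merges(tokens, num_merges, first_new_id):
    """Greedily apply up to `num_merges` merges; return (tokens, merge table)."""
    tokens = list(tokens)
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        new_id = first_new_id + step  # hypothetical ID scheme for merged tokens
        merges[pair] = new_id
        tokens = merge_pair(tokens, pair, new_id)
    return tokens, merges


# Toy sequence of visual token IDs (made-up values).
encoded, table = bpe_merges([3, 7, 3, 7, 5, 3, 7], num_merges=1, first_new_id=100)
print(encoded, table)
```

On this toy input the pair (3, 7) occurs three times, so a single merge step replaces it with the new ID 100.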

16. Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models

ArXiv ID: 2506.23692

Authors: Boyuan Zheng, Zerui Fang, Zhe Xu, Rui Wang, Yiwen Chen, Cunshi Wang, Mengwei Qu, Lei Lei, Zhen Feng, Yan Liu, Yuyang Li, Mingzhou Tan, Jiaji Wu, Jianwei Shuai, Jia Li, Fangfu Ye

Abstract: While AI for Science (AI4S) serves as an analytical tool in the current research paradigm, it doesn't solve its core inefficiency. We propose "Agent for Science" (Agent4S), the use of LLM-driven agents to automate the entire research workflow, as the true Fifth Scientific Paradigm. This paper introduces a five-level classification for Agent4S, outlining a clear roadmap from simple task automation to fully autonomous, collaborative "AI Scientists." This framework defines the next revolutionary step in scientific discovery.

Comment: The paper proposes a new paradigm, Agent4S, for automating research workflows using LLMs, which is an emerging trend in AI for Science.

Relevance: 8 Novelty: 9

17. Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training

ArXiv ID: 2506.22638

Authors: Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Keith Ross

Abstract: Large language models can exhibit improved mathematical reasoning capabilities following post-training with instruction tuning, reinforcement learning, or knowledge distillation. However, it remains unclear whether these improvements are driven by major changes in transformer layers or from minor adjustments that leave the relative layer importance structures of the base model largely unchanged. We investigate this question through systematic layer-wise ablation experiments, examining base, instruction-tuned, knowledge-distilled, and reinforcement learning variants on mathematical reasoning benchmarks. Our findings show that mathematical reasoning gives rise to a specific layer importance structure, and this structure persists across all post-training paradigms. Removal of such layers causes accuracy drops of up to 80%. In contrast, non-mathematical tasks like factual recall exhibit no critical layers. This distinction suggests that mathematical reasoning requires specialized layers that emerge during pre-training, while other non-reasoning tasks do not. From an information-theoretic perspective, we also observe that these critical layers are the same layers where major representational transformation occurs.

Comment: The paper investigates layer importance in LLMs for mathematical reasoning, providing insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 7
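
The layer-wise ablation described in the abstract above can be sketched on a toy residual stack: skip one layer at a time and record how much accuracy drops. This is only a minimal sketch under stated assumptions. The scalar "layers", the tolerance-based accuracy, and the names `run_model` and `layer_importance` are all hypothetical stand-ins, not the authors' experimental setup.

```python
def run_model(layers, x, skip=None):
    """Apply residual layers in order; the layer at index `skip` is removed."""
    for i, layer in enumerate(layers):
        if i == skip:
            continue  # ablated layer: the residual stream passes through unchanged
        x = x + layer(x)
    return x


def layer_importance(layers, inputs, targets, tol=1e-6):
    """Return the accuracy drop caused by removing each layer in turn."""

    def accuracy(skip=None):
        hits = sum(
            abs(run_model(layers, x, skip) - y) <= tol
            for x, y in zip(inputs, targets)
        )
        return hits / len(inputs)

    baseline = accuracy()
    return [baseline - accuracy(skip=i) for i in range(len(layers))]


# Toy stack: only the middle layer changes the output (it doubles x).
layers = [lambda x: 0.0, lambda x: x, lambda x: 0.0]
inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]
drops = layer_importance(layers, inputs, targets)
print(drops)
```

In this toy setup only the middle layer is "critical": removing it drops accuracy from 1.0 to 0.0, while removing either of the other layers changes nothing.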

18. Semantic-guided Diverse Decoding for Large Language Model

ArXiv ID: 2506.23601

Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), operating directly in embedding space that balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.

Comment: The paper introduces a novel method for diverse decoding in LLMs, focusing on semantic diversity, which aligns with foundational research in LLM behavior.

Relevance: 9 Novelty: 7

19. Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

ArXiv ID: 2506.23845

Authors: Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, Nikhil Garg

Abstract: While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.

Comment: The paper provides insights into the use of sparse autoencoders for discovering unknown concepts, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 7

20. Not All Explanations for Deep Learning Phenomena Are Equally Valuable

ArXiv ID: 2506.23286

Authors: Alan Jeffares, Mihaela van der Schaar

Abstract: Developing a better understanding of surprising or counterintuitive phenomena has constituted a significant portion of deep learning research in recent years. These include double descent, grokking, and the lottery ticket hypothesis -- among many others. Works in this area often develop ad hoc hypotheses attempting to explain these observed phenomena on an isolated, case-by-case basis. This position paper asserts that, in many prominent cases, there is little evidence to suggest that these phenomena appear in real-world applications and these efforts may be inefficient in driving progress in the broader field. Consequently, we argue against viewing them as isolated puzzles that require bespoke resolutions or explanations. However, despite this, we suggest that deep learning phenomena do still offer research value by providing unique settings in which we can refine our broad explanatory theories of more general deep learning principles. This position is reinforced by analyzing the research outcomes of several prominent examples of these phenomena from the recent literature. We revisit the current norms in the research community in approaching these problems and propose practical recommendations for future research, aiming to ensure that progress on deep learning phenomena is well aligned with the ultimate pragmatic goal of progress in the broader field of deep learning.

Comment: The position paper argues that phenomena such as double descent, grokking, and the lottery ticket hypothesis are most valuable as settings for refining broad explanatory theories rather than as isolated puzzles, which aligns with representation learning insights.

Relevance: 9 Novelty: 7

21. A New Perspective On AI Safety Through Control Theory Methodologies

ArXiv ID: 2506.23703

Authors: Lars Ullrich, Walter Zimmer, Ross Greer, Knut Graichen, Alois C. Knoll, Mohan Trivedi

Abstract: While artificial intelligence (AI) is advancing rapidly and mastering increasingly complex problems with astonishing performance, the safety assurance of such systems is a major concern. Particularly in the context of safety-critical, real-world cyber-physical systems, AI promises to achieve a new level of autonomy but is hampered by a lack of safety assurance. While data-driven control takes up recent developments in AI to improve control systems, control theory in general could be leveraged to improve AI safety. Therefore, this article outlines a new perspective on AI safety based on an interdisciplinary interpretation of the underlying data-generation process and the respective abstraction by AI systems in a system theory-inspired and system analysis-driven manner. In this context, the new perspective, also referred to as data control, aims to stimulate AI engineering to take advantage of existing safety analysis and assurance in an interdisciplinary way to drive the paradigm of data control. Following a top-down approach, a generic foundation for safety analysis and assurance is outlined at an abstract level that can be refined for specific AI systems and applications and is prepared for future innovation.

Comment: The paper discusses AI safety through control theory, which is an emerging trend challenging established assumptions in AI safety.

Relevance: 8 Novelty: 8

22. Learning Stochastic Multiscale Models

ArXiv ID: 2506.22655

Authors: Andrew F. Ilersich, Prasanth B. Nair

Abstract: The physical sciences are replete with dynamical systems that require the resolution of a wide range of length and time scales. This presents significant computational challenges since direct numerical simulation requires discretization at the finest relevant scales, leading to a high-dimensional state space. In this work, we propose an approach to learn stochastic multiscale models in the form of stochastic differential equations directly from observational data. Our method resolves the state on a coarse mesh while introducing an auxiliary state to capture the effects of unresolved scales. We learn the parameters of the multiscale model using a modern forward-solver-free amortized variational inference method. Our approach draws inspiration from physics-based multiscale modeling approaches, such as large-eddy simulation in fluid dynamics, while learning directly from data. We present numerical studies to demonstrate that our learned multiscale models achieve superior predictive accuracy compared to direct numerical simulation and closure-type models at equivalent resolution.

Comment: The paper proposes a method for learning stochastic multiscale models, which is relevant to AI for Science and emerging trends in modeling complex systems.

Relevance: 8 Novelty: 8

23. GViT: Representing Images as Gaussians for Visual Recognition

ArXiv ID: 2506.23532

Authors: Jefferson Hernandez, Ruozhen He, Guha Balakrishnan, Alexander C. Berg, Vicente Ordonez

Abstract: We introduce GViT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. Each image is encoded as a few hundred Gaussians whose positions, scales, orientations, colors, and opacities are optimized jointly with a ViT classifier trained on top of these representations. We reuse the classifier gradients as constructive guidance, steering the Gaussians toward class-salient regions while a differentiable renderer optimizes an image reconstruction loss. We demonstrate that 2D Gaussian input representations coupled with our GViT guidance, using a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT, reaching 76.9% top-1 accuracy on ImageNet-1k using a ViT-B architecture.

Comment: The paper introduces a novel image representation method using Gaussians with a ViT classifier, which relates to representation learning and model architecture.

Relevance: 8 Novelty: 8

24. Riemannian-Geometric Fingerprints of Generative Models

ArXiv ID: 2506.22802

Authors: Hae Jin Song, Laurent Itti

Abstract: Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.

Comment: The paper proposes a new geometric approach to understanding generative models' fingerprints, which is relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 8

25. Neural Langevin Machine: a local asymmetric learning rule can be creative

ArXiv ID: 2506.23546

Authors: Zhendong Yu, Weizhong Huang, Haiping Huang

Abstract: Fixed points of recurrent neural networks can be leveraged to store and generate information. These fixed points can be captured by the Boltzmann-Gibbs measure, which leads to neural Langevin dynamics that can be used for sampling and learning a real dataset. We call this type of generative model neural Langevin machine, which is interpretable due to its analytic form of distribution and is simple to train. Moreover, the learning process is derived as a local asymmetric plasticity rule, bearing biological relevance. Therefore, one can realize a continuous sampling of creative dynamics in a neural network, mimicking an imagination process in brain circuits. This neural Langevin machine may be another promising generative model, at least in its strength in circuit-based sampling and biologically plausible learning rule.

Comment: The paper presents a new generative model, the neural Langevin machine, which is relevant to representation learning and emerging trends.

Relevance: 8 Novelty: 8
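
To make the sampling idea in the abstract above concrete, here is a minimal sketch of generic (unadjusted) Langevin dynamics drawing from a Boltzmann-Gibbs measure exp(-E(x)). It is not the paper's neural Langevin machine: the 1-D quadratic energy, the step size, the burn-in length, and the function name `langevin_samples` are all assumptions chosen so that the target is simply a Gaussian with mean 2 and unit variance.

```python
import math
import random


def langevin_samples(grad_energy, x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x - step * dE/dx + sqrt(2 * step) * noise."""
    x, samples = x0, []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_energy(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


# Assumed toy energy E(x) = (x - 2)^2 / 2, so exp(-E) is proportional to N(2, 1).
rng = random.Random(0)
samples = langevin_samples(lambda x: x - 2.0, x0=0.0, step=0.05, n_steps=20000, rng=rng)

kept = samples[2000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(round(mean, 2), round(var, 2))
```

The empirical mean and variance should land near 2 and 1, with a small upward bias in the variance from the finite step size.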

26. Tensor Train Quantum State Tomography using Compressed Sensing

ArXiv ID: 2506.23560

Authors: Shakir Showkat Sofi, Charlotte Vermeylen, Lieven De Lathauwer

Abstract: Quantum state tomography (QST) is a fundamental technique for estimating the state of a quantum system from measured data and plays a crucial role in evaluating the performance of quantum devices. However, standard estimation methods become impractical due to the exponential growth of parameters in the state representation. In this work, we address this challenge by parameterizing the state using a low-rank block tensor train decomposition and demonstrate that our approach is both memory- and computationally efficient. This framework applies to a broad class of quantum states that can be well approximated by low-rank decompositions, including pure states, nearly pure states, and ground states of Hamiltonians.

Comment: The paper introduces a low-rank approach to quantum state tomography, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 8

27. BWLer: Barycentric Weight Layer Elucidates a Precision-Conditioning Tradeoff for PINNs

ArXiv ID: 2506.23024

Authors: Jerry Liu, Yasa Baig, Denise Hui Jean Lee, Rajat Vadiraj Dwaraknath, Atri Rudra, Chris Ré

Abstract: Physics-informed neural networks (PINNs) offer a flexible way to solve partial differential equations (PDEs) with machine learning, yet they still fall well short of the machine-precision accuracy many scientific tasks demand. In this work, we investigate whether the precision ceiling comes from the ill-conditioning of the PDEs or from the typical multi-layer perceptron (MLP) architecture. We introduce the Barycentric Weight Layer (BWLer), which models the PDE solution through barycentric polynomial interpolation. A BWLer can be added on top of an existing MLP (a BWLer-hat) or replace it completely (explicit BWLer), cleanly separating how we represent the solution from how we take derivatives for the PDE loss. Using BWLer, we identify fundamental precision limitations within the MLP: on a simple 1-D interpolation task, even MLPs with O(1e5) parameters stall around 1e-8 RMSE -- about eight orders above float64 machine precision -- before any PDE terms are added. In PDE learning, adding a BWLer lifts this ceiling and exposes a tradeoff between achievable accuracy and the conditioning of the PDE loss. For linear PDEs we fully characterize this tradeoff with an explicit error decomposition and navigate it during training with spectral derivatives and preconditioning. Across five benchmark PDEs, adding a BWLer on top of an MLP improves RMSE by up to 30x for convection, 10x for reaction, and 1800x for wave equations while remaining compatible with first-order optimizers. Replacing the MLP entirely lets an explicit BWLer reach near-machine-precision on convection, reaction, and wave problems (up to 10 billion times better than prior results) and match the performance of standard PINNs on stiff Burgers' and irregular-geometry Poisson problems. Together, these findings point to a practical path for combining the flexibility of PINNs with the precision of classical spectral solvers.

Comment: The paper introduces a novel layer for improving precision in physics-informed neural networks, which aligns with model architecture innovations.