Personalized Daily ArXiv Papers 2025-08-19

[gpt-4o]	Prompt	Completion	Total
Token	47140	5854	52994
Cost	$0.12	$0.06	$0.18

Total arXiv papers: 775

Total scanned papers: 466

Total relevant papers: 38

Table of contents with paper titles:

Maximum Score Routing For Mixture-of-Experts Authors: Bowen Dong, Yilong Fan, Yutao Sun, Zhenyu Li, Tengyu Pan, Xun Zhou, Jianyong Wang
Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models Authors: Hanyu Li, Dongchen Li, Xiaotie Deng
Wavy Transformer Authors: Satoshi Noguchi, Yoshinobu Kawahara
Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks Authors: Lorenzo Livi
The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution Authors: Elon Ezra, Ariel Weizman, Amos Azaria
Contrastive Representations for Temporal Reasoning Authors: Alicja Ziarko, Michal Bortkiewicz, Michal Zawalski, Benjamin Eysenbach, Piotr Milos
Uncovering Emergent Physics Representations Learned In-Context by Large Language Models Authors: Yeongwoo Song, Jaeyong Bae, Dong-Kyum Kim, Hawoong Jeong
Reduced-order modeling of Hamiltonian dynamics based on symplectic neural networks Authors: Yongsheng Chen, Wei Guo, Qi Tang, Xinghui Zhong
Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming Authors: Vasileios Saketos, Sebastian Kaltenbach, Sergey Litvinov, Petros Koumoutsakos
A Perfectly Truthful Calibration Measure Authors: Jason Hartline, Lunjia Hu, Yifan Wu
FLARE: Fast Low-rank Attention Routing Engine Authors: Vedant Puri, Aditya Joglekar, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara
Word Meanings in Transformer Language Models Authors: Jumbly Grindrod, Peter Grindrod
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong
Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping Authors: Xuhui Zhan, Tyler Derr
Causally-Guided Pairwise Transformer -- Towards Foundational Digital Twins in Process Industry Authors: Michael Mayr, Georgios C. Chasparis
AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition Authors: Ying Huang, Yuanbin Man, Wenqi Jia, Zhengzhong Tu, Junzhou Huang, Miao Yin
Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems Authors: Quercus Hernandez, Max Win, Thomas C. O'Connor, Paulo E. Arratia, Nathaniel Trask
EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization Authors: Chinmay Maheshwari, Chinmay Pimpalkhare, Debasish Chatterjee
Universal Learning of Nonlinear Dynamics Authors: Evan Dogariu, Anand Brahmbhatt, Elad Hazan
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance Authors: Jie Su, Weiwei Wang, Zhaotian Gu, Dahui Wang, Tianyi Qian
DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy Authors: Frederik L. Dennig, Daniel A. Keim
SEDEG: Sequential Enhancement of Decoder and Encoder's Generality for Class Incremental Learning with Small Memory Authors: Hongyang Chen, Shaoling Pu, Lingyu Zheng, Zhongwu Sun
SparseMap: A Sparse Tensor Accelerator Framework Based on Evolution Strategy Authors: Boran Zhao, Haiming Zhai, Zihang Yuan, Hetian Liu, Tian Xia, Wenzhe Zhao, Pengju Ren
Predicting the Performance of Graph Convolutional Networks with Spectral Properties of the Graph Laplacian Authors: Shalima Binta Manir, Tim Oates
A Self-Ensemble Inspired Approach for Effective Training of Binary-Weight Spiking Neural Networks Authors: Qingyan Meng, Mingqing Xiao, Zhengyu Ma, Huihui Zhou, Yonghong Tian, Zhouchen Lin
L-SR1: Learned Symmetric-Rank-One Preconditioning Authors: Gal Lifshitz, Shahar Zuler, Ori Fouks, Dan Raviv
Distribution Matching via Generalized Consistency Models Authors: Sagar Shrestha, Rajesh Shrestha, Tri Nguyen, Subash Timilsina
ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction Compression Authors: Chuanliu Fan, Zicheng Ma, Jun Gao, Nan Yu, Jun Zhang, Ziqiang Cao, Yi Qin Gao, Guohong Fu
DynamixSFT: Dynamic Mixture Optimization of Instruction Tuning Collections Authors: Haebin Shin, Lei Ji, Xiao Liu, Zhiwei Yu, Qi Chen, Yeyun Gong
Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair Authors: Stavros C. Kassinos
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications Authors: Haolin Jin, Huaming Chen
SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression Authors: Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu
Rigorous Feature Importance Scores based on Shapley Value and Banzhaf Index Authors: Xuanxiang Huang, Olivier Létoffé, Joao Marques-Silva
Assessing Representation Stability for Transformer Models Authors: Bryan E. Tuck, Rakesh M. Verma
Cost-Aware Contrastive Routing for LLMs Authors: Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang
Constructing Invariant and Equivariant Operations by Symmetric Tensor Network Authors: Meng Zhang, Chao Wang, Hao Zhang, Shaojun Dong, Lixin He
Separating Knowledge and Perception with Procedural Data Authors: Adrián Rodríguez-Muñoz, Manel Baradad, Phillip Isola, Antonio Torralba
Toward Practical Equilibrium Propagation: Brain-inspired Recurrent Neural Network with Feedback Regulation and Residual Connections Authors: Zhuo Liu, Tao Chen

1. Maximum Score Routing For Mixture-of-Experts

ArXiv ID: 2508.12801

Authors: Bowen Dong, Yilong Fan, Yutao Sun, Zhenyu Li, Tengyu Pan, Xun Zhou, Jianyong Wang

Abstract: Routing networks in sparsely activated mixture-of-experts (MoE) dynamically allocate input tokens to top-k experts through differentiable sparse transformations, enabling scalable model capacity while preserving computational efficiency. Traditional MoE networks impose an expert capacity constraint to ensure GPU-friendly computation. However, this leads to token dropping when capacity is saturated and results in low hardware efficiency due to padding in underutilized experts. Removing the capacity constraint, in turn, compromises load balancing and computational efficiency. To address these issues, we propose Maximum Score Routing (MaxScore), a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator. MaxScore resolves the fundamental limitations of iterative rerouting and optimal transport formulations, achieving lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines. Implementation details and experimental configurations can be obtained from https://github.com/dongbw18/MaxScore.git.

Comment: The paper proposes a novel MoE routing paradigm, which is directly relevant to model architecture innovations.

Relevance: 10 Novelty: 8
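
Illustration (not from the paper): the token dropping and padding that the abstract attributes to capacity-constrained top-k routing can be reproduced in a few lines. This is a minimal sketch of the baseline problem only, not of MaxScore itself; the function name, the score matrix, and the capacity value are all hypothetical.

```python
def route_topk_with_capacity(scores, k, capacity):
    """Greedy top-k routing with a hard per-expert capacity (baseline sketch).

    scores[t][e] is a hypothetical router score of token t for expert e.
    Returns per-expert token lists and the (token, expert) pairs that were dropped.
    """
    num_experts = len(scores[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = []
    for t, row in enumerate(scores):
        # Rank experts by descending score and keep the best k for this token.
        top = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:k]
        for e in top:
            if len(assignments[e]) < capacity:
                assignments[e].append(t)
            else:
                dropped.append((t, e))  # expert is full: this assignment is lost
    return assignments, dropped


# Four tokens that all prefer expert 0 (made-up scores).
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]
capacity = 2
assignments, dropped = route_topk_with_capacity(scores, k=1, capacity=capacity)
padding = [capacity - len(tokens) for tokens in assignments]  # unused slots per expert
```

Here expert 0 saturates after two tokens, the remaining two tokens are dropped, and experts 1 and 2 are pure padding, which is the inefficiency the abstract describes.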

2. Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models

ArXiv ID: 2508.11874

Authors: Hanyu Li, Dongchen Li, Xiaotie Deng

Abstract: Algorithm design and analysis is a cornerstone of computer science, but it confronts a major challenge. Proving an algorithm's performance guarantee across all inputs has traditionally required extensive and often error-prone human effort. While AI has shown great success in finding solutions to specific problem instances, automating the discovery of general algorithms with such provable guarantees has remained a significant barrier. This challenge stems from the difficulty of integrating the creative process of algorithm design with the rigorous process of formal analysis. To address this gap, we propose LegoNE, a framework that tightly fuses these two processes for the fundamental and notoriously difficult problem of computing approximate Nash equilibria. LegoNE automatically translates any algorithm written in a simple Python-like language into a constrained optimization problem. Solving this problem derives and proves the algorithm's approximation bound. Using LegoNE, a state-of-the-art large language model rediscovered the state-of-the-art algorithm for two-player games within hours, a feat that had taken human researchers 15 years to achieve. For three-player games, the model discovered a novel algorithm surpassing all existing human-designed ones. This work demonstrates a new human-machine collaborative paradigm for theoretical science: humans reason at a higher level of abstraction, using symbols to compress the search space, and AI explores within it, achieving what neither could alone.

Comment: The paper presents a framework using LLMs for discovering Nash equilibrium algorithms, which is a significant contribution to foundational research in AI for Science.

Relevance: 9 Novelty: 9

3. Wavy Transformer

ArXiv ID: 2508.12787

Authors: Satoshi Noguchi, Yoshinobu Kawahara

Abstract: Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.

Comment: The paper introduces the Wavy Transformer, addressing over-smoothing in transformers, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8
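
Illustration (not from the paper): the contrast between dissipative diffusion and second-order wave dynamics can be seen in a toy setting. The sketch below assumes scalar tokens on a complete graph with uniform attention weights, so the attention output is simply the token mean; the step functions, step size, and token values are hypothetical and are not the paper's layer.

```python
def spread(xs):
    """Distance between the largest and smallest token value."""
    return max(xs) - min(xs)


def diffusion_step(xs, dt):
    """First-order update: each token moves toward the (uniform-attention) mean."""
    mean = sum(xs) / len(xs)
    return [x + dt * (mean - x) for x in xs]


def wave_step(xs, vs, dt):
    """Second-order update: the same pull toward the mean acts on a velocity state."""
    mean = sum(xs) / len(xs)
    vs = [v + dt * (mean - x) for x, v in zip(xs, vs)]
    xs = [x + dt * v for x, v in zip(xs, vs)]
    return xs, vs


tokens = [0.0, 1.0, 2.0, 4.0]  # made-up scalar token representations
dt, steps = 0.5, 50

diffused = list(tokens)
for _ in range(steps):
    diffused = diffusion_step(diffused, dt)

waved, velocity = list(tokens), [0.0] * len(tokens)
wave_spreads = []
for _ in range(steps):
    waved, velocity = wave_step(waved, velocity, dt)
    wave_spreads.append(spread(waved))

print("diffusion spread:", spread(diffused))
print("largest recent wave spread:", max(wave_spreads[-10:]))
```

Under these assumptions the diffusion tokens collapse onto their mean (over-smoothing), while the wave tokens keep oscillating around it instead of converging.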

4. Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks

ArXiv ID: 2508.12121

Authors: Lorenzo Livi

Abstract: We study how gating mechanisms in recurrent neural networks (RNNs) implicitly induce adaptive learning-rate behavior, even when training is carried out with a fixed, global learning rate. This effect arises from the coupling between state-space time scales--parametrized by the gates--and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs, we obtain a first-order expansion that makes explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates not only control memory retention in the hidden states, but also act as data-driven preconditioners that adapt optimization trajectories in parameter space. We further draw formal analogies with learning-rate schedules, momentum, and adaptive methods such as Adam, showing that these optimization behaviors emerge naturally from gating. Numerical experiments confirm the validity of our perturbative analysis, supporting the view that gate-induced corrections remain small while exerting systematic effects on training dynamics. Overall, this work provides a unified dynamical-systems perspective on how gating couples state evolution with parameter updates, explaining why gated architectures achieve robust trainability and stability in practice.

Comment: The paper studies how gating mechanisms in RNNs induce adaptive learning-rate behavior, providing insights into training dynamics in neural networks.

Relevance: 9 Novelty: 8
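
Illustration (not from the paper): how a gate rescales gradients can be checked on a scalar leaky-integrator unit. The sketch assumes the common form h_t = (1 - alpha) h_{t-1} + alpha tanh(w h_{t-1} + u x_t) with a constant gate alpha; the weights, inputs, and function names are hypothetical, and the paper's exact parametrization may differ.

```python
import math


def step(h, x, alpha, w=0.5, u=1.0):
    """One leaky-integrator update with a constant gate alpha (assumed form)."""
    return (1.0 - alpha) * h + alpha * math.tanh(w * h + u * x)


def step_jacobian(h, x, alpha, w=0.5, u=1.0):
    """Exact derivative of step() with respect to the previous state h."""
    t = math.tanh(w * h + u * x)
    return (1.0 - alpha) + alpha * w * (1.0 - t * t)


def final_state(inputs, alpha, h0=0.0):
    h = h0
    for x in inputs:
        h = step(h, x, alpha)
    return h


def gradient_through_time(inputs, alpha, h0=0.0):
    """d h_T / d h_0 as the product of per-step Jacobians (chain rule)."""
    h, grad = h0, 1.0
    for x in inputs:
        grad *= step_jacobian(h, x, alpha)  # Jacobian evaluated at the previous state
        h = step(h, x, alpha)
    return grad


inputs = [0.3, -0.2, 0.5, 0.1] * 5  # 20 made-up input values
grad_slow = gradient_through_time(inputs, alpha=0.1)  # long state time scale
grad_fast = gradient_through_time(inputs, alpha=0.9)  # short state time scale
print(grad_slow, grad_fast)
```

With these values the small-gate unit passes a much larger gradient back through 20 steps than the large-gate unit, so the same global learning rate produces different effective step sizes.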

5. The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution

ArXiv ID: 2508.12277

Authors: Elon Ezra, Ariel Weizman, Amos Azaria

Abstract: Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.

Comment: The paper introduces a new benchmark to evaluate LLMs' ability to predict aspects of their own responses, which provides theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8

6. Contrastive Representations for Temporal Reasoning

ArXiv ID: 2508.13113

Authors: Alicja Ziarko, Michal Bortkiewicz, Michal Zawalski, Benjamin Eysenbach, Piotr Milos

Abstract: In classical AI, perception relies on learning state-based representations, while planning, which can be thought of as temporal reasoning over action sequences, is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Combinatorial Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik's Cube. In particular, for the Rubik's Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS, though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.

Comment: The paper introduces a method for temporal reasoning using contrastive representations, which is relevant to representation learning.

Relevance: 9 Novelty: 8

7. Uncovering Emergent Physics Representations Learned In-Context by Large Language Models

ArXiv ID: 2508.12448

Authors: Yeongwoo Song, Jaeyong Bae, Dong-Kyum Kim, Hawoong Jeong

Abstract: Large language models (LLMs) exhibit impressive in-context learning (ICL) abilities, enabling them to solve a wide range of tasks via textual prompts alone. As these capabilities advance, the range of applicable domains continues to expand significantly. However, identifying the precise mechanisms or internal structures within LLMs that allow successful ICL across diverse, distinct classes of tasks remains elusive. Physics-based tasks offer a promising testbed for probing this challenge. Unlike synthetic sequences such as basic arithmetic or symbolic equations, physical systems provide experimentally controllable, real-world data based on structured dynamics grounded in fundamental principles. This makes them particularly suitable for studying the emergent reasoning behaviors of LLMs in a realistic yet tractable setting. Here, we mechanistically investigate the ICL ability of LLMs, especially focusing on their ability to reason about physics. Using a dynamics forecasting task in physical systems as a proxy, we evaluate whether LLMs can learn physics in context. We first show that the performance of dynamics forecasting in context improves with longer input contexts. To uncover how such capability emerges in LLMs, we analyze the model's residual stream activations using sparse autoencoders (SAEs). Our experiments reveal that the features captured by SAEs correlate with key physical variables, such as energy. These findings demonstrate that meaningful physical concepts are encoded within LLMs during in-context learning. In sum, our work provides a novel case study that broadens our understanding of how LLMs learn in context.

Comment: The paper investigates the in-context learning ability of LLMs using physics-based tasks, providing insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8

8. Reduced-order modeling of Hamiltonian dynamics based on symplectic neural networks

ArXiv ID: 2508.11911

Authors: Yongsheng Chen, Wei Guo, Qi Tang, Xinghui Zhong

Abstract: We introduce a novel data-driven symplectic reduced-order modeling (ROM) framework for high-dimensional Hamiltonian systems that unifies latent-space discovery and dynamics learning within a single, end-to-end neural architecture. The encoder-decoder is built from Henon neural networks (HenonNets) and may be augmented with linear SGS-reflector layers. This yields an exact symplectic map between full and latent phase spaces. Latent dynamics are advanced by a symplectic flow map implemented as a HenonNet. This unified neural architecture ensures exact preservation of the underlying symplectic structure at the reduced-order level, significantly enhancing the fidelity and long-term stability of the resulting ROM. We validate our method through comprehensive numerical experiments on canonical Hamiltonian systems. The results demonstrate the method's capability for accurate trajectory reconstruction, robust predictive performance beyond the training horizon, and accurate Hamiltonian preservation. These promising outcomes underscore the effectiveness and potential applicability of our symplectic ROM framework for complex dynamical systems across a broad range of scientific and engineering disciplines.

Comment: The paper introduces a symplectic neural network framework for Hamiltonian dynamics, relevant to model architecture and AI for Science.

Relevance: 9 Novelty: 8

9. Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming

ArXiv ID: 2508.11703

Authors: Vasileios Saketos, Sebastian Kaltenbach, Sergey Litvinov, Petros Koumoutsakos

Abstract: Algorithmic discovery has traditionally relied on human ingenuity and extensive experimentation. Here we investigate whether a prominent scientific computing algorithm, the Kalman Filter, can be discovered through an automated, data-driven, evolutionary process that relies on Cartesian Genetic Programming (CGP) and Large Language Models (LLM). We evaluate the contributions of both modalities (CGP and LLM) in discovering the Kalman filter under varying conditions. Our results demonstrate that our framework of CGP and LLM-assisted evolution converges to near-optimal solutions when Kalman optimality assumptions hold. When these assumptions are violated, our framework evolves interpretable alternatives that outperform the Kalman filter. These results demonstrate that combining evolutionary algorithms and generative models for interpretable, data-driven synthesis of simple computational modules is a potent approach for algorithmic discovery in scientific computing.

Comment: The paper explores a novel approach combining genetic programming and large language models for algorithmic discovery, which aligns with foundational research in AI for Science.

Relevance: 9 Novelty: 8
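
Illustration (not from the paper): for reference, the textbook scalar Kalman filter that the evolutionary search is meant to rediscover looks roughly like this. It is a minimal sketch of the standard predict/update recursion, not the paper's CGP or LLM pipeline; the function name, default parameters, and measurements are hypothetical.

```python
def kalman_1d(measurements, q, r, x0=0.0, p0=1.0, a=1.0):
    """Scalar Kalman filter for x_t = a * x_{t-1} + noise(q), z_t = x_t + noise(r).

    q and r are the process and measurement noise variances.
    Returns the filtered estimates and the final error variance.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: propagate the state estimate and its variance.
        x = a * x
        p = a * p * a + q
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + r)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates, p


# Three identical made-up measurements of a static quantity (q = 0).
estimates, final_var = kalman_1d([5.0, 5.0, 5.0], q=0.0, r=1.0)
print(estimates, final_var)
```

With a static state and unit variances the gain shrinks as 1/2, 1/3, 1/4, so the estimate moves toward the measurements while the error variance decreases.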

10. A Perfectly Truthful Calibration Measure

ArXiv ID: 2508.13100

Authors: Jason Hartline, Lunjia Hu, Yifan Wu

Abstract: Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity in our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.

Comment: The paper introduces a perfectly truthful calibration measure, which is a theoretical advancement in the field of prediction calibration.

Relevance: 9 Novelty: 8
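
Illustration (not from the paper): the finite-sample issue can be seen with the familiar binned expected calibration error. The sketch below is the standard binned ECE, not ATB, whose definition the abstract does not give; the sample of four outcomes and the assumed true probability of 0.5 are made up.

```python
def binned_ece(predictions, outcomes, num_bins=10):
    """Standard binned expected calibration error on a finite sample."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(predictions, outcomes):
        index = min(int(p * num_bins), num_bins - 1)  # p = 1.0 goes in the last bin
        bins[index].append((p, y))
    n = len(predictions)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_p - mean_y)
    return ece


# Assume a fair coin (true probability 0.5) that happened to land 1, 1, 0, 1.
outcomes = [1, 1, 0, 1]
truthful = binned_ece([0.5] * 4, outcomes)   # predicting the ground truth
shaded = binned_ece([0.75] * 4, outcomes)    # predicting the sample frequency
print(truthful, shaded)
```

On this particular sample the truthful predictor is scored as miscalibrated while the untruthful one scores zero. Truthfulness as defined in the abstract concerns the expectation over samples, so this single draw only illustrates the finite-sample gap, not a proof.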

11. FLARE: Fast Low-rank Attention Routing Engine

ArXiv ID: 2508.12594

Authors: Vedant Puri, Aditya Joglekar, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara

Abstract: The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among $N$ tokens by projecting the input sequence onto a fixed length latent sequence of $M \ll N$ tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at $O(NM)$ cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.

Comment: The paper presents a low-rank attention mechanism, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
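
Illustration (not from the paper): routing attention through M latent tokens can be sketched as two softmax attention passes, each costing O(NM). This is one plausible single-head reading of the abstract in plain Python; the function names and toy inputs are hypothetical, the usual 1/sqrt(d) scaling and learned projections are omitted, and the released code may differ.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def latent_routed_attention(latent_q, keys, values):
    """Two-pass attention through M latent queries (assumed structure).

    latent_q: M x d learnable queries, keys: N x d, values: N x d.
    """
    # Encode: each latent query attends over all N tokens -> M x d summary.
    encode = [softmax(row) for row in matmul(latent_q, transpose(keys))]  # M x N
    latents = matmul(encode, values)                                      # M x d
    # Decode: each token attends over the M latents -> N x d output.
    decode = [softmax(row) for row in matmul(keys, transpose(latent_q))]  # N x M
    return matmul(decode, latents)


latent_q = [[0.5, -0.5], [-0.5, 0.5]]         # M = 2 latent queries, d = 2
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 tokens
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = latent_routed_attention(latent_q, keys, values)
print(out)
```

The implied N x N mixing matrix is the product of an N x M and an M x N matrix, so its rank is at most M, and the full N x N attention matrix is never formed.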

12. Word Meanings in Transformer Language Models

ArXiv ID: 2508.12863

Authors: Jumbly Grindrod, Peter Grindrod

Abstract: We investigate how word meanings are represented in transformer language models. Specifically, we focus on whether transformer models employ something analogous to a lexical store - where each word has an entry that contains semantic information. To do this, we extracted the token embedding space of RoBERTa-base and k-means clustered it into 200 clusters. In our first study, we then manually inspected the resultant clusters to consider whether they are sensitive to semantic information. In our second study, we tested whether the clusters are sensitive to five psycholinguistic measures: valence, concreteness, iconicity, taboo, and age of acquisition. Overall, our findings were very positive - there is a wide variety of semantic information encoded within the token embedding space. This serves to rule out certain "meaning eliminativist" hypotheses about how transformer LLMs process semantic information.

Comment: The paper investigates how word meanings are represented in transformer language models, which aligns with the interest in understanding how deep networks encode information.

Relevance: 9 Novelty: 7

13. RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns

ArXiv ID: 2508.13152

Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong

Abstract: Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinct activation feature that can better identify LGT. We can classify the text by calculating the projection score of the text representations along this feature direction and comparing with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average 94.92% AUROC on both in-distribution (ID) and OOD scenarios, while also demonstrating robust resilience to various text sizes and mainstream attacks. Data and code are publicly available at: https://github.com/NLP2CT/RepreGuard

Comment: The paper focuses on detecting LLM-generated text by analyzing internal representations, which aligns with representation learning and insights into LLM behavior.

Relevance: 9 Novelty: 7
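
Illustration (not from the paper): the projection-and-threshold step described in the abstract can be sketched as follows. The abstract does not say how the feature direction or threshold is computed, so this sketch assumes a difference of class means and a midpoint threshold as stand-ins; the two-dimensional "representations" and all function names are hypothetical.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_direction(lgt_reps, hwt_reps):
    """Assumed stand-in: direction = mean(LGT) - mean(HWT), threshold = midpoint score."""
    mu_lgt = mean_vector(lgt_reps)
    mu_hwt = mean_vector(hwt_reps)
    direction = [l - h for l, h in zip(mu_lgt, mu_hwt)]
    threshold = (dot(mu_lgt, direction) + dot(mu_hwt, direction)) / 2.0
    return direction, threshold


def is_llm_generated(rep, direction, threshold):
    """Classify by comparing the projection score with the precomputed threshold."""
    return dot(rep, direction) > threshold


# Made-up surrogate-model representations for each class.
lgt_reps = [[1.0, 0.2], [0.8, 0.1]]
hwt_reps = [[0.1, 0.3], [0.0, 0.0]]
direction, threshold = fit_direction(lgt_reps, hwt_reps)

print(is_llm_generated([0.9, 0.0], direction, threshold))
print(is_llm_generated([0.1, 0.5], direction, threshold))
```

Once the direction and threshold are fixed, scoring a new text reduces to one dot product, which is consistent with the abstract's description of the method as an efficient statistics-based detector.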

14. Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

ArXiv ID: 2508.12466

Authors: Xuhui Zhan, Tyler Derr

Abstract: Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at https://inverse-llava.github.io.

Comment: The paper proposes a novel approach to multimodal learning without alignment pre-training, which challenges conventional paradigms in model architecture.

Relevance: 8 Novelty: 8

15. Causally-Guided Pairwise Transformer -- Towards Foundational Digital Twins in Process Industry

ArXiv ID: 2508.13111

Authors: Michael Mayr, Georgios C. Chasparis

Abstract: Foundational modelling of multi-dimensional time-series data in industrial systems presents a central trade-off: channel-dependent (CD) models capture specific cross-variable dynamics but lack robustness and adaptability as model layers are commonly bound to the data dimensionality of the tackled use-case, while channel-independent (CI) models offer generality at the cost of modelling the explicit interactions crucial for system-level predictive regression tasks. To resolve this, we propose the Causally-Guided Pairwise Transformer (CGPT), a novel architecture that integrates a known causal graph as an inductive bias. The core of CGPT is built around a pairwise modeling paradigm, tackling the CD/CI conflict by decomposing the multidimensional data into pairs. The model uses channel-agnostic learnable layers where all parameter dimensions are independent of the number of variables. CGPT enforces a CD information flow at the pair-level and CI-like generalization across pairs. This approach disentangles complex system dynamics and results in a highly flexible architecture that ensures scalability and any-variate adaptability. We validate CGPT on a suite of synthetic and real-world industrial datasets on long-term and one-step forecasting tasks designed to simulate common industrial complexities. Results demonstrate that CGPT significantly outperforms both CI and CD baselines in predictive accuracy and shows competitive performance with end-to-end trained CD models while remaining agnostic to the problem dimensionality.

Comment: The paper introduces a novel architecture, the Causally-Guided Pairwise Transformer, which integrates a causal graph as an inductive bias, aligning with interests in architectural innovations.

Relevance: 8 Novelty: 8

16. AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition

ArXiv ID: 2508.11870

Authors: Ying Huang, Yuanbin Man, Wenqi Jia, Zhengzhong Tu, Junzhou Huang, Miao Yin

Abstract: Adapter-based fine-tuning has gained remarkable attention in adapting large pre-trained vision language models (VLMs) for a wide range of downstream tasks efficiently. In this paradigm, only the inserted adapters are fine-tuned, without the need for training the original VLM backbone. Existing works scale adapters by integrating them into every layer of VLMs to increase the capacity of adapters. However, these methods face two primary limitations: 1) limited compression rate due to ignoring cross-layer redundancy, and 2) limited representational capacity across homogeneous adapters. In this paper, we propose a novel vision-language fine-tuning framework based on cross-layer tensor ring decomposition (TRD) with the integration and collaboration of diverse adapters, called AdaRing, achieving ultra-light parameter-efficient adaptation of VLMs on various tasks. To remove the high redundancy that exists among adapters across layers, we exploit the tensor-level low-rankness to formulate adapters as layer-shared tensor cores and layer-specific slices. Moreover, guided by generalization-aware fine-tuning, diverse rank-driven adapters cooperate to handle tasks that require different representations. Our experiments show that the proposed AdaRing achieves the state-of-the-art performance while reducing average training parameters by 90%.

Comment: The paper proposes AdaRing, a novel vision-language fine-tuning framework using cross-layer tensor ring decomposition, which involves model compression techniques.

Relevance: 8 Novelty: 8

17. Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems

ArXiv ID: 2508.12569

Authors: Quercus Hernandez, Max Win, Thomas C. O'Connor, Paulo E. Arratia, Nathaniel Trask

Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.

Comment: The paper presents a framework for machine learning coarse-grained dynamics, which is relevant to AI for Science with foundational research in modeling.

Relevance: 8 Novelty: 8

18. EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization

ArXiv ID: 2508.12479

Authors: Chinmay Maheshwari, Chinmay Pimpalkhare, Debasish Chatterjee

Abstract: Min-max optimization arises in many domains such as game theory, adversarial machine learning, etc., with gradient-based methods as a typical computational tool. Beyond convex-concave min-max optimization, the solutions found by gradient-based methods may be arbitrarily far from global optima. In this work, we present an algorithmic apparatus for computing globally optimal solutions in convex-non-concave and non-convex-concave min-max optimization. For the former, we employ a reformulation that transforms it into a non-concave-convex max-min optimization problem with suitably defined feasible sets and objective function. The new form can be viewed as a generalization of Sion's minimax theorem. Next, we introduce EXOTIC, an Exact, Optimistic, Tree-based algorithm for solving the reformulated max-min problem. EXOTIC employs an iterative convex optimization solver to (approximately) solve the inner minimization and a hierarchical tree search for the outer maximization to optimistically select promising regions to search based on the approximate solution returned by the convex optimization solver. We establish an upper bound on its optimality gap as a function of the number of calls to the inner solver, the solver's convergence rate, and additional problem-dependent parameters. Both our algorithmic apparatus and its accompanying theoretical analysis can also be applied to non-convex-concave min-max optimization. In addition, we propose a class of benchmark convex-non-concave min-max problems along with their analytical global solutions, providing a testbed for evaluating algorithms for min-max optimization. Empirically, EXOTIC outperforms gradient-based methods on this benchmark as well as on existing numerical benchmark problems from the literature. Finally, we demonstrate the utility of EXOTIC by computing security strategies in multi-player games with three or more players.

Comment: The paper presents a novel algorithm for min-max optimization, which is relevant to emerging trends in optimization theory.

Relevance: 8 Novelty: 8

19. Universal Learning of Nonlinear Dynamics

ArXiv ID: 2508.11990

Authors: Evan Dogariu, Anand Brahmbhatt, Elad Hazan

Abstract: We study the fundamental problem of learning a marginally stable unknown nonlinear dynamical system. We describe an algorithm for this problem, based on the technique of spectral filtering, which learns a mapping from past observations to the next based on a spectral representation of the system. Using techniques from online convex optimization, we prove vanishing prediction error for any nonlinear dynamical system that has finitely many marginally stable modes, with rates governed by a novel quantitative control-theoretic notion of learnability. The main technical component of our method is a new spectral filtering algorithm for linear dynamical systems, which incorporates past observations and applies to general noisy and marginally stable systems. This significantly generalizes the original spectral filtering algorithm to asymmetric dynamics and incorporates noise correction, and is of independent interest.
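
For readers unfamiliar with the "original spectral filtering algorithm" the abstract refers to, the following is a rough sketch of how its fixed filters are usually built. It is not the new algorithm from this paper. The Hankel matrix entries `2 / ((i + j)^3 - (i + j))` and the `sigma^(1/4)` scaling are recalled from the earlier spectral filtering literature on linear dynamical systems, so treat the exact form as an assumption. The function names, window length, and input signal are hypothetical.

```python
import numpy as np

def spectral_filters(window, num_filters):
    """Top eigenpairs of the fixed Hankel matrix Z[i, j] = 2 / ((i + j)^3 - (i + j))."""
    idx = np.arange(1, window + 1)
    s = idx[:, None] + idx[None, :]
    hankel = 2.0 / (s**3 - s)
    eigvals, eigvecs = np.linalg.eigh(hankel)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:num_filters]
    return eigvals[top], eigvecs[:, top]

def filter_features(past_inputs, sigma, filters):
    """Project the input history onto each filter, scaled by sigma^(1/4)."""
    # Row 0 of `filters` is applied to the most recent input (lag 1).
    recent_first = np.asarray(past_inputs, dtype=float)[::-1]
    return sigma**0.25 * (filters.T @ recent_first)

sigma, filters = spectral_filters(window=32, num_filters=4)
history = np.sin(0.3 * np.arange(32))  # hypothetical scalar input sequence
features = filter_features(history, sigma, filters)
```

In the original method, as far as this sketch assumes, a linear predictor is then learned online on top of such features; the filters themselves are fixed and do not depend on the data.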

Comment: The paper presents a novel algorithm for learning nonlinear dynamical systems, which is relevant to representation learning.

Relevance: 8 Novelty: 8

20. A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance

ArXiv ID: 2508.12702

Authors: Jie Su, Weiwei Wang, Zhaotian Gu, Dahui Wang, Tianyi Qian

Abstract: Robust information representation and its persistent maintenance are fundamental for higher cognitive functions. Existing models employ distinct neural mechanisms to separately address noise-resistant processing or information maintenance, yet a unified framework integrating both operations remains elusive -- a critical gap in understanding cortical computation. Here, we introduce a recurrent neural circuit that combines divisive normalization with self-excitation to achieve both robust encoding and stable retention of normalized inputs. Mathematical analysis shows that, for suitable parameter regimes, the system forms a continuous attractor with two key properties: (1) input-proportional stabilization during stimulus presentation; and (2) self-sustained memory states persisting after stimulus offset. We demonstrate the model's versatility in two canonical tasks: (a) noise-robust encoding in a random-dot kinematogram (RDK) paradigm; and (b) approximate Bayesian belief updating in a probabilistic Wisconsin Card Sorting Test (pWCST). This work establishes a unified mathematical framework that bridges noise suppression, working memory, and approximate Bayesian inference within a single cortical microcircuit, offering fresh insights into the brain's canonical computation and guiding the design of biologically plausible artificial neural architectures.
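
The two properties named in the abstract, input-proportional stabilization and self-sustained memory after stimulus offset, can be seen in a very small toy model. The abstract does not give the authors' equations, so the dynamics below are an assumed rate model chosen only for illustration: `tau * dr/dt = -r + (I + w*r) / (sigma + sum(I + w*r))`, with hypothetical parameters `w = 2`, `sigma = 1`, and simple Euler integration.

```python
import numpy as np

def simulate(inputs, r0, steps, dt=0.01, tau=1.0, w=2.0, sigma=1.0):
    """Euler-integrate a toy circuit with self-excitation and divisive normalization."""
    r = np.array(r0, dtype=float)  # copy so the caller's state is untouched
    inputs = np.asarray(inputs, dtype=float)
    for _ in range(steps):
        drive = inputs + w * r                     # external input plus self-excitation
        normalized = drive / (sigma + drive.sum())  # divisive normalization by pooled drive
        r += (dt / tau) * (-r + normalized)
    return r

stimulus = np.array([3.0, 2.0, 1.0])                  # hypothetical input pattern
r_stim = simulate(stimulus, np.zeros(3), steps=2000)   # stimulus on
r_delay = simulate(np.zeros(3), r_stim, steps=4000)    # stimulus off

pattern_stim = r_stim / r_stim.sum()
pattern_delay = r_delay / r_delay.sum()
```

In this toy model the rates settle proportionally to the inputs while the stimulus is on. After offset, every unit is scaled by the same factor, so the normalized pattern is retained while total activity relaxes to `(w - sigma) / w`. That limit is nonzero only when `w > sigma`, which is one way to read the abstract's "suitable parameter regimes".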

Comment: The paper presents a unified cortical circuit model for robust representation and memory maintenance, which aligns with foundational research in representation learning.