Personalized Daily ArXiv Papers 2025-08-19
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 47140 | 5854 | 52994 |
| Cost | $0.12 | $0.06 | $0.18 |
Total arXiv papers: 775
Total scanned papers: 466
Total relevant papers: 38
Table of contents with paper titles:
-
Maximum Score Routing For Mixture-of-Experts Authors: Bowen Dong, Yilong Fan, Yutao Sun, Zhenyu Li, Tengyu Pan, Xun Zhou, Jianyong Wang
-
Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models Authors: Hanyu Li, Dongchen Li, Xiaotie Deng
-
Wavy Transformer Authors: Satoshi Noguchi, Yoshinobu Kawahara
-
Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks Authors: Lorenzo Livi
-
The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution Authors: Elon Ezra, Ariel Weizman, Amos Azaria
-
Contrastive Representations for Temporal Reasoning Authors: Alicja Ziarko, Michal Bortkiewicz, Michal Zawalski, Benjamin Eysenbach, Piotr Milos
-
Uncovering Emergent Physics Representations Learned In-Context by Large Language Models Authors: Yeongwoo Song, Jaeyong Bae, Dong-Kyum Kim, Hawoong Jeong
-
Reduced-order modeling of Hamiltonian dynamics based on symplectic neural networks Authors: Yongsheng Chen, Wei Guo, Qi Tang, Xinghui Zhong
-
Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming Authors: Vasileios Saketos, Sebastian Kaltenbach, Sergey Litvinov, Petros Koumoutsakos
-
A Perfectly Truthful Calibration Measure Authors: Jason Hartline, Lunjia Hu, Yifan Wu
-
FLARE: Fast Low-rank Attention Routing Engine Authors: Vedant Puri, Aditya Joglekar, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara
-
Word Meanings in Transformer Language Models Authors: Jumbly Grindrod, Peter Grindrod
-
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong
-
Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping Authors: Xuhui Zhan, Tyler Derr
-
Causally-Guided Pairwise Transformer -- Towards Foundational Digital Twins in Process Industry Authors: Michael Mayr, Georgios C. Chasparis
-
AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition Authors: Ying Huang, Yuanbin Man, Wenqi Jia, Zhengzhong Tu, Junzhou Huang, Miao Yin
-
Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems Authors: Quercus Hernandez, Max Win, Thomas C. O'Connor, Paulo E. Arratia, Nathaniel Trask
-
EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization Authors: Chinmay Maheshwari, Chinmay Pimpalkhare, Debasish Chatterjee
-
Universal Learning of Nonlinear Dynamics Authors: Evan Dogariu, Anand Brahmbhatt, Elad Hazan
-
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance Authors: Jie Su, Weiwei Wang, Zhaotian Gu, Dahui Wang, Tianyi Qian
-
DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy Authors: Frederik L. Dennig, Daniel A. Keim
-
SEDEG:Sequential Enhancement of Decoder and Encoder's Generality for Class Incremental Learning with Small Memory Authors: Hongyang Chen, Shaoling Pu, Lingyu Zheng, Zhongwu Sun
-
SparseMap: A Sparse Tensor Accelerator Framework Based on Evolution Strategy Authors: Boran Zhao, Haiming Zhai, Zihang Yuan, Hetian Liu, Tian Xia, Wenzhe Zhao, Pengju Ren
-
Predicting the Performance of Graph Convolutional Networks with Spectral Properties of the Graph Laplacian Authors: Shalima Binta Manir, Tim Oates
-
A Self-Ensemble Inspired Approach for Effective Training of Binary-Weight Spiking Neural Networks Authors: Qingyan Meng, Mingqing Xiao, Zhengyu Ma, Huihui Zhou, Yonghong Tian, Zhouchen Lin
-
L-SR1: Learned Symmetric-Rank-One Preconditioning Authors: Gal Lifshitz, Shahar Zuler, Ori Fouks, Dan Raviv
-
Distribution Matching via Generalized Consistency Models Authors: Sagar Shrestha, Rajesh Shrestha, Tri Nguyen, Subash Timilsina
-
ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction Compression Authors: Chuanliu Fan, Zicheng Ma, Jun Gao, Nan Yu, Jun Zhang, Ziqiang Cao, Yi Qin Gao, Guohong Fu
-
DynamixSFT: Dynamic Mixture Optimization of Instruction Tuning Collections Authors: Haebin Shin, Lei Ji, Xiao Liu, Zhiwei Yu, Qi Chen, Yeyun Gong
-
Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair Authors: Stavros C. Kassinos
-
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications Authors: Haolin Jin, Huaming Chen
-
SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression Authors: Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu
-
Rigorous Feature Importance Scores based on Shapley Value and Banzhaf Index Authors: Xuanxiang Huang, Olivier L\'etoff\'e, Joao Marques-Silva
-
Assessing Representation Stability for Transformer Models Authors: Bryan E. Tuck, Rakesh M. Verma
-
Cost-Aware Contrastive Routing for LLMs Authors: Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang
-
Constructing Invariant and Equivariant Operations by Symmetric Tensor Network Authors: Meng Zhang, Chao Wang, Hao Zhang, Shaojun Dong, Lixin He
-
Separating Knowledge and Perception with Procedural Data Authors: Adri\'an Rodr\'iguez-Mu\~noz, Manel Baradad, Phillip Isola, Antonio Torralba
-
Toward Practical Equilibrium Propagation: Brain-inspired Recurrent Neural Network with Feedback Regulation and Residual Connections Authors: Zhuo Liu, Tao Chen
1. Maximum Score Routing For Mixture-of-Experts
ArXiv ID: 2508.12801
Authors: Bowen Dong, Yilong Fan, Yutao Sun, Zhenyu Li, Tengyu Pan, Xun Zhou, Jianyong Wang
Abstract: Routing networks in sparsely activated mixture-of-experts (MoE) dynamically allocate input tokens to top-k experts through differentiable sparse transformations, enabling scalable model capacity while preserving computational efficiency. Traditional MoE networks impose an expert capacity constraint to ensure GPU-friendly computation. However, this leads to token dropping when capacity is saturated and results in low hardware efficiency due to padding in underutilized experts. Removing the capacity constraint, in turn, compromises load balancing and computational efficiency. To address these issues, we propose Maximum Score Routing ($\mathbf{MaxScore}$), a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator. MaxScore resolves the fundamental limitations of iterative rerouting and optimal transport formulations, achieving lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines. Implementation details and experimental configurations can be obtained from $\href{https://github.com/dongbw18/MaxScore.git}{MaxScore}$.
Comment: The paper proposes a novel MoE routing paradigm, which is directly relevant to model architecture innovations.
Relevance: 10 Novelty: 8
2. Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models
ArXiv ID: 2508.11874
Authors: Hanyu Li, Dongchen Li, Xiaotie Deng
Abstract: Algorithm design and analysis is a cornerstone of computer science, but it confronts a major challenge. Proving an algorithm's performance guarantee across all inputs has traditionally required extensive and often error-prone human effort. While AI has shown great success in finding solutions to specific problem instances, automating the discovery of general algorithms with such provable guarantees has remained a significant barrier. This challenge stems from the difficulty of integrating the creative process of algorithm design with the rigorous process of formal analysis. To address this gap, we propose LegoNE, a framework that tightly fuses these two processes for the fundamental and notoriously difficult problem of computing approximate Nash equilibria. LegoNE automatically translates any algorithm written by a simple Python-like language into a constrained optimization problem. Solving this problem derives and proves the algorithm's approximation bound. Using LegoNE, a state-of-the-art large language model rediscovered the state-of-the-art algorithm for two-player games within hours, a feat that had taken human researchers 15 years to achieve. For three-player games, the model discovered a novel algorithm surpassing all existing human-designed ones. This work demonstrates a new human-machine collaborative paradigm for theoretical science: humans reason at a higher-abstract level, using symbols to compress the search space, and AI explores within it, achieving what neither could alone.
Comment: The paper presents a framework using LLMs for discovering Nash equilibrium algorithms, which is a significant contribution to foundational research in AI for Science.
Relevance: 9 Novelty: 9
3. Wavy Transformer
ArXiv ID: 2508.12787
Authors: Satoshi Noguchi, Yoshinobu Kawahara
Abstract: Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
Comment: The paper introduces the Wavy Transformer, addressing over-smoothing in transformers, which is relevant to model architecture innovations.
Relevance: 9 Novelty: 8
4. Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
ArXiv ID: 2508.12121
Authors: Lorenzo Livi
Abstract: We study how gating mechanisms in recurrent neural networks (RNNs) implicitly induce adaptive learning-rate behavior, even when training is carried out with a fixed, global learning rate. This effect arises from the coupling between state-space time scales--parametrized by the gates--and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs, we obtain a first-order expansion that makes explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates not only control memory retention in the hidden states, but also act as data-driven preconditioners that adapt optimization trajectories in parameter space. We further draw formal analogies with learning-rate schedules, momentum, and adaptive methods such as Adam, showing that these optimization behaviors emerge naturally from gating. Numerical experiments confirm the validity of our perturbative analysis, supporting the view that gate-induced corrections remain small while exerting systematic effects on training dynamics. Overall, this work provides a unified dynamical-systems perspective on how gating couples state evolution with parameter updates, explaining why gated architectures achieve robust trainability and stability in practice.
Comment: The paper studies how gating mechanisms in RNNs induce adaptive learning-rate behavior, providing insights into training dynamics in neural networks.
Relevance: 9 Novelty: 8
5. The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution
ArXiv ID: 2508.12277
Authors: Elon Ezra, Ariel Weizman, Amos Azaria
Abstract: Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.
Comment: The paper introduces a new benchmark to evaluate LLMs' ability to predict aspects of their own responses, which provides theoretical insights into LLM behavior.
Relevance: 9 Novelty: 8
6. Contrastive Representations for Temporal Reasoning
ArXiv ID: 2508.13113
Authors: Alicja Ziarko, Michal Bortkiewicz, Michal Zawalski, Benjamin Eysenbach, Piotr Milos
Abstract: In classical AI, perception relies on learning state-based representations, while planning, which can be thought of as temporal reasoning over action sequences, is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Combinatorial Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik's Cube. In particular, for the Rubik's Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS, though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.
Comment: The paper introduces a method for temporal reasoning using contrastive representations, which is relevant to representation learning.
Relevance: 9 Novelty: 8
7. Uncovering Emergent Physics Representations Learned In-Context by Large Language Models
ArXiv ID: 2508.12448
Authors: Yeongwoo Song, Jaeyong Bae, Dong-Kyum Kim, Hawoong Jeong
Abstract: Large language models (LLMs) exhibit impressive in-context learning (ICL) abilities, enabling them to solve wide range of tasks via textual prompts alone. As these capabilities advance, the range of applicable domains continues to expand significantly. However, identifying the precise mechanisms or internal structures within LLMs that allow successful ICL across diverse, distinct classes of tasks remains elusive. Physics-based tasks offer a promising testbed for probing this challenge. Unlike synthetic sequences such as basic arithmetic or symbolic equations, physical systems provide experimentally controllable, real-world data based on structured dynamics grounded in fundamental principles. This makes them particularly suitable for studying the emergent reasoning behaviors of LLMs in a realistic yet tractable setting. Here, we mechanistically investigate the ICL ability of LLMs, especially focusing on their ability to reason about physics. Using a dynamics forecasting task in physical systems as a proxy, we evaluate whether LLMs can learn physics in context. We first show that the performance of dynamics forecasting in context improves with longer input contexts. To uncover how such capability emerges in LLMs, we analyze the model's residual stream activations using sparse autoencoders (SAEs). Our experiments reveal that the features captured by SAEs correlate with key physical variables, such as energy. These findings demonstrate that meaningful physical concepts are encoded within LLMs during in-context learning. In sum, our work provides a novel case study that broadens our understanding of how LLMs learn in context.
Comment: The paper investigates the in-context learning ability of LLMs using physics-based tasks, providing insights into LLM behavior and interpretability.
Relevance: 9 Novelty: 8
8. Reduced-order modeling of Hamiltonian dynamics based on symplectic neural networks
ArXiv ID: 2508.11911
Authors: Yongsheng Chen, Wei Guo, Qi Tang, Xinghui Zhong
Abstract: We introduce a novel data-driven symplectic induced-order modeling (ROM) framework for high-dimensional Hamiltonian systems that unifies latent-space discovery and dynamics learning within a single, end-to-end neural architecture. The encoder-decoder is built from Henon neural networks (HenonNets) and may be augmented with linear SGS-reflector layers. This yields an exact symplectic map between full and latent phase spaces. Latent dynamics are advanced by a symplectic flow map implemented as a HenonNet. This unified neural architecture ensures exact preservation of the underlying symplectic structure at the reduced-order level, significantly enhancing the fidelity and long-term stability of the resulting ROM. We validate our method through comprehensive numerical experiments on canonical Hamiltonian systems. The results demonstrate the method's capability for accurate trajectory reconstruction, robust predictive performance beyond the training horizon, and accurate Hamiltonian preservation. These promising outcomes underscore the effectiveness and potential applicability of our symplectic ROM framework for complex dynamical systems across a broad range of scientific and engineering disciplines.
Comment: The paper introduces a symplectic neural network framework for Hamiltonian dynamics, relevant to model architecture and AI for Science.
Relevance: 9 Novelty: 8
9. Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming
ArXiv ID: 2508.11703
Authors: Vasileios Saketos, Sebastian Kaltenbach, Sergey Litvinov, Petros Koumoutsakos
Abstract: Algorithmic discovery has traditionally relied on human ingenuity and extensive experimentation. Here we investigate whether a prominent scientific computing algorithm, the Kalman Filter, can be discovered through an automated, data-driven, evolutionary process that relies on Cartesian Genetic Programming (CGP) and Large Language Models (LLM). We evaluate the contributions of both modalities (CGP and LLM) in discovering the Kalman filter under varying conditions. Our results demonstrate that our framework of CGP and LLM-assisted evolution converges to near-optimal solutions when Kalman optimality assumptions hold. When these assumptions are violated, our framework evolves interpretable alternatives that outperform the Kalman filter. These results demonstrate that combining evolutionary algorithms and generative models for interpretable, data-driven synthesis of simple computational modules is a potent approach for algorithmic discovery in scientific computing.
Comment: The paper explores a novel approach combining genetic programming and large language models for algorithmic discovery, which aligns with foundational research in AI for Science.
Relevance: 9 Novelty: 8
10. A Perfectly Truthful Calibration Measure
ArXiv ID: 2508.13100
Authors: Jason Hartline, Lunjia Hu, Yifan Wu
Abstract: Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity in our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.
Comment: The paper introduces a perfectly truthful calibration measure, which is a theoretical advancement in the field of prediction calibration.
Relevance: 9 Novelty: 8
11. FLARE: Fast Low-rank Attention Routing Engine
ArXiv ID: 2508.12594
Authors: Vedant Puri, Aditya Joglekar, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara
Abstract: The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among $N$ tokens by projecting the input sequence onto a fixed length latent sequence of $M \ll N$ tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at $O(NM)$ cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
Comment: The paper presents a low-rank attention mechanism, which is relevant to model compression and efficiency.
Relevance: 9 Novelty: 8
12. Word Meanings in Transformer Language Models
ArXiv ID: 2508.12863
Authors: Jumbly Grindrod, Peter Grindrod
Abstract: We investigate how word meanings are represented in the transformer language models. Specifically, we focus on whether transformer models employ something analogous to a lexical store - where each word has an entry that contains semantic information. To do this, we extracted the token embedding space of RoBERTa-base and k-means clustered it into 200 clusters. In our first study, we then manually inspected the resultant clusters to consider whether they are sensitive to semantic information. In our second study, we tested whether the clusters are sensitive to five psycholinguistic measures: valence, concreteness, iconicity, taboo, and age of acquisition. Overall, our findings were very positive - there is a wide variety of semantic information encoded within the token embedding space. This serves to rule out certain "meaning eliminativist" hypotheses about how transformer LLMs process semantic information.
Comment: The paper investigates how word meanings are represented in transformer language models, which aligns with the interest in understanding how deep networks encode information.
Relevance: 9 Novelty: 7
13. RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns
ArXiv ID: 2508.13152
Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong
Abstract: Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representation of LGT and HWT, and extract the distinct activation feature that can better identify LGT. We can classify the text by calculating the projection score of the text representations along this feature direction and comparing with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with average 94.92% AUROC on both in-distribution (ID) and OOD scenarios, while also demonstrating robust resilience to various text sizes and mainstream attacks. Data and code are publicly available at: https://github.com/NLP2CT/RepreGuard
Comment: The paper focuses on detecting LLM-generated text by analyzing internal representations, which aligns with representation learning and insights into LLM behavior.
Relevance: 9 Novelty: 7
14. Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
ArXiv ID: 2508.12466
Authors: Xuhui Zhan, Tyler Derr
Abstract: Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at https://inverse-llava.github.io.
Comment: The paper proposes a novel approach to multimodal learning without alignment pre-training, which challenges conventional paradigms in model architecture.
Relevance: 8 Novelty: 8
15. Causally-Guided Pairwise Transformer -- Towards Foundational Digital Twins in Process Industry
ArXiv ID: 2508.13111
Authors: Michael Mayr, Georgios C. Chasparis
Abstract: Foundational modelling of multi-dimensional time-series data in industrial systems presents a central trade-off: channel-dependent (CD) models capture specific cross-variable dynamics but lack robustness and adaptability as model layers are commonly bound to the data dimensionality of the tackled use-case, while channel-independent (CI) models offer generality at the cost of modelling the explicit interactions crucial for system-level predictive regression tasks. To resolve this, we propose the Causally-Guided Pairwise Transformer (CGPT), a novel architecture that integrates a known causal graph as an inductive bias. The core of CGPT is built around a pairwise modeling paradigm, tackling the CD/CI conflict by decomposing the multidimensional data into pairs. The model uses channel-agnostic learnable layers where all parameter dimensions are independent of the number of variables. CGPT enforces a CD information flow at the pair-level and CI-like generalization across pairs. This approach disentangles complex system dynamics and results in a highly flexible architecture that ensures scalability and any-variate adaptability. We validate CGPT on a suite of synthetic and real-world industrial datasets on long-term and one-step forecasting tasks designed to simulate common industrial complexities. Results demonstrate that CGPT significantly outperforms both CI and CD baselines in predictive accuracy and shows competitive performance with end-to-end trained CD models while remaining agnostic to the problem dimensionality.
Comment: The paper introduces a novel architecture, the Causally-Guided Pairwise Transformer, which integrates a causal graph as an inductive bias, aligning with interests in architectural innovations.
Relevance: 8 Novelty: 8
16. AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition
ArXiv ID: 2508.11870
Authors: Ying Huang, Yuanbin Man, Wenqi Jia, Zhengzhong Tu, Junzhou Huang, Miao Yin
Abstract: Adapter-based fine-tuning has gained remarkable attention in adapting large pre-trained vision language models (VLMs) for a wide range of downstream tasks efficiently. In this paradigm, only the inserted adapters are fine-tuned, without the need for training the original VLM backbone. Existing works scale adapters by integrating them into every layer of VLMs to increase the capacity of adapters. However, these methods face two primary limitations: 1) limited compression rate due to ignoring cross-layer redundancy, and 2) limited representational capacity across homogeneous adapters. In this paper, we propose a novel vision-language fine-tuning framework based on cross-layer tensor ring decomposition (TRD) with the integration and collaboration of diverse adapters, called AdaRing, achieving ultra-light parameter-efficient adaptation of VLMs on various tasks. To remove the high redundancy that exists among adapters across layers, we exploit the tensor-level low-rankness to formulate adapters as layer-shared tensor cores and layer-specific slices. Moreover, guided by generalization-aware fine-tuning, diverse rank-driven adapters cooperate to handle tasks that require different representations. Our experiments show that the proposed AdaRing achieves the state-of-the-art performance while reducing average training parameters by 90%.
Comment: The paper proposes AdaRing, a novel vision-language fine-tuning framework using cross-layer tensor ring decomposition, which involves model compression techniques.
Relevance: 8 Novelty: 8
17. Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems
ArXiv ID: 2508.12569
Authors: Quercus Hernandez, Max Win, Thomas C. O'Connor, Paulo E. Arratia, Nathaniel Trask
Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.
Comment: The paper presents a framework for machine learning coarse-grained dynamics, which is relevant to AI for Science with foundational research in modeling.
Relevance: 8 Novelty: 8
18. EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization
ArXiv ID: 2508.12479
Authors: Chinmay Maheshwari, Chinmay Pimpalkhare, Debasish Chatterjee
Abstract: Min-max optimization arises in many domains such as game theory, adversarial machine learning, etc., with gradient-based methods as a typical computational tool. Beyond convex-concave min-max optimization, the solutions found by gradient-based methods may be arbitrarily far from global optima. In this work, we present an algorithmic apparatus for computing globally optimal solutions in convex-non-concave and non-convex-concave min-max optimization. For former, we employ a reformulation that transforms it into a non-concave-convex max-min optimization problem with suitably defined feasible sets and objective function. The new form can be viewed as a generalization of Sion's minimax theorem. Next, we introduce EXOTIC-an Exact, Optimistic, Tree-based algorithm for solving the reformulated max-min problem. EXOTIC employs an iterative convex optimization solver to (approximately) solve the inner minimization and a hierarchical tree search for the outer maximization to optimistically select promising regions to search based on the approximate solution returned by convex optimization solver. We establish an upper bound on its optimality gap as a function of the number of calls to the inner solver, the solver's convergence rate, and additional problem-dependent parameters. Both our algorithmic apparatus along with its accompanying theoretical analysis can also be applied for non-convex-concave min-max optimization. In addition, we propose a class of benchmark convex-non-concave min-max problems along with their analytical global solutions, providing a testbed for evaluating algorithms for min-max optimization. Empirically, EXOTIC outperforms gradient-based methods on this benchmark as well as on existing numerical benchmark problems from the literature. Finally, we demonstrate the utility of EXOTIC by computing security strategies in multi-player games with three or more players.
Comment: The paper presents a novel algorithm for min-max optimization, which is relevant to emerging trends in optimization theory.
Relevance: 8 Novelty: 8
19. Universal Learning of Nonlinear Dynamics
ArXiv ID: 2508.11990
Authors: Evan Dogariu, Anand Brahmbhatt, Elad Hazan
Abstract: We study the fundamental problem of learning a marginally stable unknown nonlinear dynamical system. We describe an algorithm for this problem, based on the technique of spectral filtering, which learns a mapping from past observations to the next based on a spectral representation of the system. Using techniques from online convex optimization, we prove vanishing prediction error for any nonlinear dynamical system that has finitely many marginally stable modes, with rates governed by a novel quantitative control-theoretic notion of learnability. The main technical component of our method is a new spectral filtering algorithm for linear dynamical systems, which incorporates past observations and applies to general noisy and marginally stable systems. This significantly generalizes the original spectral filtering algorithm to both asymmetric dynamics as well as incorporating noise correction, and is of independent interest.
Comment: The paper presents a novel algorithm for learning nonlinear dynamical systems, which is relevant to representation learning.
Relevance: 8 Novelty: 8
20. A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance
ArXiv ID: 2508.12702
Authors: Jie Su, Weiwei Wang, Zhaotian Gu, Dahui Wang, Tianyi Qian
Abstract: Robust information representation and its persistent maintenance are fundamental for higher cognitive functions. Existing models employ distinct neural mechanisms to separately address noise-resistant processing or information maintenance, yet a unified framework integrating both operations remains elusive -- a critical gap in understanding cortical computation. Here, we introduce a recurrent neural circuit that combines divisive normalization with self-excitation to achieve both robust encoding and stable retention of normalized inputs. Mathematical analysis shows that, for suitable parameter regimes, the system forms a continuous attractor with two key properties: (1) input-proportional stabilization during stimulus presentation; and (2) self-sustained memory states persisting after stimulus offset. We demonstrate the model's versatility in two canonical tasks: (a) noise-robust encoding in a random-dot kinematogram (RDK) paradigm; and (b) approximate Bayesian belief updating in a probabilistic Wisconsin Card Sorting Test (pWCST). This work establishes a unified mathematical framework that bridges noise suppression, working memory, and approximate Bayesian inference within a single cortical microcircuit, offering fresh insights into the brain's canonical computation and guiding the design of biologically plausible artificial neural architectures.
Comment: The paper presents a unified cortical circuit model for robust representation and memory maintenance, which aligns with foundational research in representation learning.
Relevance: 8 Novelty: 7
21. DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy
ArXiv ID: 2508.12145
Authors: Frederik L. Dennig, Daniel A. Keim
Abstract: Recently, autoencoders (AEs) have gained interest for creating parametric and invertible projections of multidimensional data. Parametric projections make it possible to embed new, unseen samples without recalculating the entire projection, while invertible projections allow the synthesis of new data instances. However, existing methods perform poorly when dealing with out-of-distribution samples in either the data or embedding space. Thus, we propose DE-VAE, an uncertainty-aware variational AE using differential entropy (DE) to improve the learned parametric and invertible projections. Given a fixed projection, we train DE-VAE to learn a mapping into 2D space and an inverse mapping back to the original space. We conduct quantitative and qualitative evaluations on four well-known datasets, using UMAP and t-SNE as baseline projection methods. Our findings show that DE-VAE can create parametric and inverse projections with comparable accuracy to other current AE-based approaches while enabling the analysis of embedding uncertainty.
Comment: The paper proposes DE-VAE, an uncertainty-aware variational autoencoder, which aligns with interests in autoencoders and representation learning.
Relevance: 8 Novelty: 7
22. SEDEG:Sequential Enhancement of Decoder and Encoder's Generality for Class Incremental Learning with Small Memory
ArXiv ID: 2508.12932
Authors: Hongyang Chen, Shaoling Pu, Lingyu Zheng, Zhongwu Sun
Abstract: In incremental learning, enhancing the generality of knowledge is crucial for adapting to dynamic data inputs. It can develop generalized representations or more balanced decision boundaries, preventing the degradation of long-term knowledge over time and thus mitigating catastrophic forgetting. Some emerging incremental learning methods adopt an encoder-decoder architecture and have achieved promising results. In the encoder-decoder achitecture, improving the generalization capabilities of both the encoder and decoder is critical, as it helps preserve previously learned knowledge while ensuring adaptability and robustness to new, diverse data inputs. However, many existing continual methods focus solely on enhancing one of the two components, which limits their effectiveness in mitigating catastrophic forgetting. And these methods perform even worse in small-memory scenarios, where only a limited number of historical samples can be stored. To mitigate this limitation, we introduces SEDEG, a two-stage training framework for vision transformers (ViT), focusing on sequentially improving the generality of both Decoder and Encoder. Initially, SEDEG trains an ensembled encoder through feature boosting to learn generalized representations, which subsequently enhance the decoder's generality and balance the classifier. The next stage involves using knowledge distillation (KD) strategies to compress the ensembled encoder and develop a new, more generalized encoder. This involves using a balanced KD approach and feature KD for effective knowledge transfer. Extensive experiments on three benchmark datasets show SEDEG's superior performance, and ablation studies confirm the efficacy of its components. The code is available at https://github.com/ShaolingPu/CIL.
Comment: The paper discusses enhancing encoder-decoder architectures for incremental learning, which aligns with model architecture analysis.
Relevance: 8 Novelty: 7
23. SparseMap: A Sparse Tensor Accelerator Framework Based on Evolution Strategy
ArXiv ID: 2508.12906
Authors: Boran Zhao, Haiming Zhai, Zihang Yuan, Hetian Liu, Tian Xia, Wenzhe Zhao, Pengju Ren
Abstract: The growing demand for sparse tensor algebra (SpTA) in machine learning and big data has driven the development of various sparse tensor accelerators. However, most existing manually designed accelerators are limited to specific scenarios, and it's time-consuming and challenging to adjust a large number of design factors when scenarios change. Therefore, automating the design of SpTA accelerators is crucial. Nevertheless, previous works focus solely on either mapping (i.e., tiling communication and computation in space and time) or sparse strategy (i.e., bypassing zero elements for efficiency), leading to suboptimal designs due to the lack of comprehensive consideration of both. A unified framework that jointly optimizes both is urgently needed. However, integrating mapping and sparse strategies leads to a combinatorial explosion in the design space(e.g., as large as $O(10^{41})$ for the workload $P_{32 \times 64} \times Q_{64 \times 48} = Z_{32 \times 48}$). This vast search space renders most conventional optimization methods (e.g., particle swarm optimization, reinforcement learning and Monte Carlo tree search) inefficient. To address this challenge, we propose an evolution strategy-based sparse tensor accelerator optimization framework, called SparseMap. SparseMap constructing a more comprehensive design space with the consideration of both mapping and sparse strategy. We introduce a series of enhancements to genetic encoding and evolutionary operators, enabling SparseMap to efficiently explore the vast and diverse design space. We quantitatively compare SparseMap with prior works and classical optimization methods, demonstrating that SparseMap consistently finds superior solutions.
Comment: The paper presents a framework for optimizing sparse tensor accelerators, which is relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
24. Predicting the Performance of Graph Convolutional Networks with Spectral Properties of the Graph Laplacian
ArXiv ID: 2508.12993
Authors: Shalima Binta Manir, Tim Oates
Abstract: A common observation in the Graph Convolutional Network (GCN) literature is that stacking GCN layers may or may not result in better performance on tasks like node classification and edge prediction. We have found empirically that a graph's algebraic connectivity, which is known as the Fiedler value, is a good predictor of GCN performance. Intuitively, graphs with similar Fiedler values have analogous structural properties, suggesting that the same filters and hyperparameters may yield similar results when used with GCNs, and that transfer learning may be more effective between graphs with similar algebraic connectivity. We explore this theoretically and empirically with experiments on synthetic and real graph data, including the Cora, CiteSeer and Polblogs datasets. We explore multiple ways of aggregating the Fiedler value for connected components in the graphs to arrive at a value for the entire graph, and show that it can be used to predict GCN performance. We also present theoretical arguments as to why the Fiedler value is a good predictor.
Comment: The paper explores the use of spectral properties of the graph Laplacian to predict GCN performance, which provides theoretical insights into model architecture.
Relevance: 8 Novelty: 7
25. A Self-Ensemble Inspired Approach for Effective Training of Binary-Weight Spiking Neural Networks
ArXiv ID: 2508.12609
Authors: Qingyan Meng, Mingqing Xiao, Zhengyu Ma, Huihui Zhou, Yonghong Tian, Zhouchen Lin
Abstract: Spiking Neural Networks (SNNs) are a promising approach to low-power applications on neuromorphic hardware due to their energy efficiency. However, training SNNs is challenging because of the non-differentiable spike generation function. To address this issue, the commonly used approach is to adopt the backpropagation through time framework, while assigning the gradient of the non-differentiable function with some surrogates. Similarly, Binary Neural Networks (BNNs) also face the non-differentiability problem and rely on approximating gradients. However, the deep relationship between these two fields and how their training techniques can benefit each other has not been systematically researched. Furthermore, training binary-weight SNNs is even more difficult. In this work, we present a novel perspective on the dynamics of SNNs and their close connection to BNNs through an analysis of the backpropagation process. We demonstrate that training a feedforward SNN can be viewed as training a self-ensemble of a binary-activation neural network with noise injection. Drawing from this new understanding of SNN dynamics, we introduce the Self-Ensemble Inspired training method for (Binary-Weight) SNNs (SEI-BWSNN), which achieves high-performance results with low latency even for the case of the 1-bit weights. Specifically, we leverage a structure of multiple shortcuts and a knowledge distillation-based training technique to improve the training of (binary-weight) SNNs. Notably, by binarizing FFN layers in a Transformer architecture, our approach achieves 82.52% accuracy on ImageNet with only 2 time steps, indicating the effectiveness of our methodology and the potential of binary-weight SNNs.
Comment: The paper presents a novel approach for training binary-weight spiking neural networks, which is relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
26. L-SR1: Learned Symmetric-Rank-One Preconditioning
ArXiv ID: 2508.12270
Authors: Gal Lifshitz, Shahar Zuler, Ori Fouks, Dan Raviv
Abstract: End-to-end deep learning has achieved impressive results but remains limited by its reliance on large labeled datasets, poor generalization to unseen scenarios, and growing computational demands. In contrast, classical optimization methods are data-efficient and lightweight but often suffer from slow convergence. While learned optimizers offer a promising fusion of both worlds, most focus on first-order methods, leaving learned second-order approaches largely unexplored. We propose a novel learned second-order optimizer that introduces a trainable preconditioning unit to enhance the classical Symmetric-Rank-One (SR1) algorithm. This unit generates data-driven vectors used to construct positive semi-definite rank-one matrices, aligned with the secant constraint via a learned projection. Our method is evaluated through analytic experiments and on the real-world task of Monocular Human Mesh Recovery (HMR), where it outperforms existing learned optimization-based approaches. Featuring a lightweight model and requiring no annotated data or fine-tuning, our approach offers strong generalization and is well-suited for integration into broader optimization-based frameworks.
Comment: The paper introduces a learned second-order optimizer, which is relevant to model architecture and optimization methods.
Relevance: 8 Novelty: 7
27. Distribution Matching via Generalized Consistency Models
ArXiv ID: 2508.12222
Authors: Sagar Shrestha, Rajesh Shrestha, Tri Nguyen, Subash Timilsina
Abstract: Recent advancement in generative models have demonstrated remarkable performance across various data modalities. Beyond their typical use in data synthesis, these models play a crucial role in distribution matching tasks such as latent variable modeling, domain translation, and domain adaptation. Generative Adversarial Networks (GANs) have emerged as the preferred method of distribution matching due to their efficacy in handling high-dimensional data and their flexibility in accommodating various constraints. However, GANs often encounter challenge in training due to their bi-level min-max optimization objective and susceptibility to mode collapse. In this work, we propose a novel approach for distribution matching inspired by the consistency models employed in Continuous Normalizing Flow (CNF). Our model inherits the advantages of CNF models, such as having a straight forward norm minimization objective, while remaining adaptable to different constraints similar to GANs. We provide theoretical validation of our proposed objective and demonstrate its performance through experiments on synthetic and real-world datasets.
Comment: The paper proposes a novel approach for distribution matching inspired by consistency models, which is relevant to representation learning.
Relevance: 8 Novelty: 7
28. ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction Compression
ArXiv ID: 2508.12212
Authors: Chuanliu Fan, Zicheng Ma, Jun Gao, Nan Yu, Jun Zhang, Ziqiang Cao, Yi Qin Gao, Guohong Fu
Abstract: Recent advances in protein large language models, such as ProtTeX, represent both side-chain amino acids and backbone structure as discrete token sequences of residue length. While this design enables unified modeling of multimodal protein information, it suffers from two major limitations: (1) The concatenation of sequence and structure tokens approximately doubles the protein length and breaks the intrinsic residue-level alignment between modalities. (2) Constrained by the training corpus and limited context window, ProtTeX is typically trained on single-protein inputs, rendering it incompatible with in-context learning (ICL) and thus limiting its generalization capability. To address these issues, we propose ProtTeX-CC, a lightweight two-stage compression framework designed to enhance ProtTeX under few-shot settings. We first design a joint embedding compression mechanism that fuses sequence and structure representations at the residue level, effectively reducing the protein input length by half without sacrificing performance. Then we propose a self-compression module that aggregates each full demonstration into the latent space of the last few linguistic tokens, reducing the average demonstration length from 751 tokens to less than 16 tokens. Compared to the original ProtTeX, our self-compression approach achieves a compression ratio of approximately 93.68% in the total prompt length under the 16-shot setting. Without modifying the backbone model, ProtTeX-CC introduces only a small number of additional parameters through PEFT-based tuning in the joint embedding compression stage and a single trainable projection layer in the self-compression stage. Extensive experiments on protein function prediction show that ProtTeX-CC improves performance on the in-domain benchmark by 2%, and generalizes well to the out-of-domain dataset with a performance gain of 11%.
Comment: The paper introduces a two-stage compression framework for protein LLMs, relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
29. DynamixSFT: Dynamic Mixture Optimization of Instruction Tuning Collections
ArXiv ID: 2508.12116
Authors: Haebin Shin, Lei Ji, Xiao Liu, Zhiwei Yu, Qi Chen, Yeyun Gong
Abstract: As numerous instruction-tuning datasets continue to emerge during the post-training stage, dynamically balancing and optimizing their mixtures has become a critical challenge. To address this, we propose DynamixSFT, a dynamic and automated method for instruction-tuning dataset mixture optimization. We formulate the problem as a multi-armed bandit setup and introduce a Prior-scaled Boltzmann Exploration that softly anchors the updated sampling distribution to the original dataset proportions, thereby preserving the inherent diversity and coverage of the collection. Sampling probabilities are updated using a lightweight 1-Step Look-ahead Reward, reflecting how much the dataset contributes to improving the model's performance at its current state. When applied to the Tulu-v2-mixture collection comprising 16 instruction-tuning datasets, DynamixSFT achieves up to a 2.2% performance improvement across 10 benchmarks. Furthermore, we provide a comprehensive analysis and visualizations to offer deeper insights into the adaptive dynamics of our method.
Comment: The paper proposes a dynamic mixture optimization method for instruction-tuning datasets, relevant to model architecture and efficiency.
Relevance: 8 Novelty: 7
30. Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair
ArXiv ID: 2508.12996
Authors: Stavros C. Kassinos
Abstract: Transformer neural networks are increasingly used for physics-based problems. In data-driven PDE surrogates, training samples from varying boundary and initial conditions can cause erratic losses and spiky gradients; in physics-informed neural networks (PINNs), stiff composite losses amplify this effect. We introduce Kourkoutas-Beta, an Adam-style optimizer where the fixed second-moment discount beta2 is replaced by a layer-wise dynamic value driven by a bounded sunspike'' ratio: the current pooled gradient norm divided by an exponential moving average (EMA) of past norms, squashed to the interval [0,1). Spikes lower beta2 toward beta2_min; calm phases keep it near beta2_max. Options include leaky-AMSGrad (decay), trust-region clipping (max_ratio), adaptive tiny terms, and several bias-correction modesnone'', beta2max'',exact'). With all features off and bias_correction=``none'', the method is exactly Adam. We test on four settings: (i) a Transformer PDE surrogate (Heat2D), (ii) a 3D PINN for heat conduction (Heat3D), (iii) a lightweight MLX synthetic task with jitter and rare-trigger bursts, and (iv) a character-level Transformer on 30 MB of enwik8 (small-enwik8). Kourkoutas-Beta improves stability and final loss versus fixed-beta2 Adam. On small-enwik8 it lowers bits-per-character by about 38% vs Adam-0.95 and about 58% vs Adam-0.999 over 10 seeds, with smaller variance. The method remains drop-in, with runtime overhead comparable to Adam in testbeds A-C and within single-digit percent in testbed D. It preserves Adam-style convergence guarantees while improving robustness under spiky gradients.
Comment: The paper introduces Kourkoutas-Beta, an Adam-style optimizer with dynamic adjustments, which is relevant to model architecture and training dynamics.
Relevance: 8 Novelty: 7
31. Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
ArXiv ID: 2508.12358
Authors: Haolin Jin, Huaming Chen
Abstract: Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks. Software engineers often rely on LLMs to assess whether system code implementation satisfy task requirements, thereby enhancing code robustness and accuracy. However, it remains unclear whether LLMs can reliably determine whether the code complies fully with the given task descriptions, which is usually natural language specifications. In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements. Specifically, with widely used benchmarks, we employ unified prompts to judge code correctness. Our results reveal that LLMs frequently misclassify correct code implementations as either ``not satisfying requirements'' or containing potential defects. Surprisingly, more complex prompting, especially when leveraging prompt engineering techniques involving explanations and proposed corrections, leads to higher misjudgment rate, which highlights the critical reliability issues in using LLMs as code review assistants. We further analyze the root causes of these misjudgments, and propose two improved prompting strategies for mitigation. For the first time, our findings reveals unrecognized limitations in LLMs to match code with requirements. We also offer novel insights and practical guidance for effective use of LLMs in automated code review and task-oriented agent scenarios.
Comment: The paper uncovers systematic failures of LLMs in code verification, which provides insights into LLM behavior and interpretability.
Relevance: 8 Novelty: 7
32. SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
ArXiv ID: 2508.12604
Authors: Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu
Abstract: Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that the incorrect answers partially stem from verbose reasoning processes lacking correct self-fix, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the generated reasoning sequences from SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
Comment: The paper proposes SSPO, a framework for reasoning compression in LLMs, which aligns with foundational research in LLM behavior and interpretability.
Relevance: 8 Novelty: 7
33. Rigorous Feature Importance Scores based on Shapley Value and Banzhaf Index
ArXiv ID: 2508.11959
Authors: Xuanxiang Huang, Olivier L\'etoff\'e, Joao Marques-Silva
Abstract: Feature attribution methods based on game theory are ubiquitous in the field of eXplainable Artificial Intelligence (XAI). Recent works proposed rigorous feature attribution using logic-based explanations, specifically targeting high-stakes uses of machine learning (ML) models. Typically, such works exploit weak abductive explanation (WAXp) as the characteristic function to assign importance to features. However, one possible downside is that the contribution of non-WAXp sets is neglected. In fact, non-WAXp sets can also convey important information, because of the relationship between formal explanations (XPs) and adversarial examples (AExs). Accordingly, this paper leverages Shapley value and Banzhaf index to devise two novel feature importance scores. We take into account non-WAXp sets when computing feature contribution, and the novel scores quantify how effective each feature is at excluding AExs. Furthermore, the paper identifies properties and studies the computational complexity of the proposed scores.
Comment: The paper introduces novel feature importance scores based on Shapley value and Banzhaf index, which is relevant to representation learning and interpretability.
Relevance: 8 Novelty: 7
34. Assessing Representation Stability for Transformer Models
ArXiv ID: 2508.11667
Authors: Bryan E. Tuck, Rakesh M. Verma
Abstract: Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining. We introduce Representation Stability (RS), a model-agnostic detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. RS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, RS achieves over 88% detection accuracy and demonstrates competitive performance compared to existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we reveal that gradient-based ranking outperforms attention and random selection approaches, with identification quality correlating with detection performance for word-level attacks. RS also generalizes well to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.
Comment: The paper introduces a model-agnostic framework for detecting adversarial examples in transformer models, which relates to representation learning.
Relevance: 8 Novelty: 7
35. Cost-Aware Contrastive Routing for LLMs
ArXiv ID: 2508.12491
Authors: Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang
Abstract: We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.
Comment: The paper introduces a cost-aware routing framework for LLMs, which is relevant to model architecture and efficiency.
Relevance: 8 Novelty: 7
36. Constructing Invariant and Equivariant Operations by Symmetric Tensor Network
ArXiv ID: 2508.12596
Authors: Meng Zhang, Chao Wang, Hao Zhang, Shaojun Dong, Lixin He
Abstract: Design of neural networks that incorporate symmetry is crucial for geometric deep learning. Central to this effort is the development of invariant and equivariant operations. This works presents a systematic method for constructing valid invariant and equivariant operations. It can handle inputs and outputs in the form of Cartesian tensors with different rank, as well as spherical tensors with different types. In addition, our method features a graphical representation utilizing the symmetric tensor network, which simplifies both the proofs and constructions related to invariant and equivariant functions. We also apply this approach to design the equivariant interaction message for the geometry graph neural network, and equivariant machine learning model to learn the constitutive law of materials.
Comment: The paper introduces a method for constructing invariant and equivariant operations, which is relevant to model architecture.
Relevance: 8 Novelty: 7
37. Separating Knowledge and Perception with Procedural Data
ArXiv ID: 2508.11697
Authors: Adri\'an Rodr\'iguez-Mu\~noz, Manel Baradad, Phillip Isola, Antonio Torralba
Abstract: We train representation models with procedural data only, and apply them on visual similarity, classification, and semantic segmentation tasks without further training by using visual memory -- an explicit database of reference image embeddings. Unlike prior work on visual memory, our approach achieves full compartmentalization with respect to all real-world images while retaining strong performance. Compared to a model trained on Places, our procedural model performs within $1\%$ on NIGHTS visual similarity, outperforms by $8\%$ and $15\%$ on CUB200 and Flowers102 fine-grained classification, and is within $10\%$ on ImageNet-1K classification. It also demonstrates strong zero-shot segmentation, achieving an $R^2$ on COCO within $10\%$ of the models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap.
Comment: The paper explores representation learning using procedural data, which is relevant to understanding how models encode information.
Relevance: 8 Novelty: 7
38. Toward Practical Equilibrium Propagation: Brain-inspired Recurrent Neural Network with Feedback Regulation and Residual Connections
ArXiv ID: 2508.11659
Authors: Zhuo Liu, Tao Chen
Abstract: Brain-like intelligent systems need brain-like learning methods. Equilibrium Propagation (EP) is a biologically plausible learning framework with strong potential for brain-inspired computing hardware. However, existing im-plementations of EP suffer from instability and prohibi-tively high computational costs. Inspired by the structure and dynamics of the brain, we propose a biologically plau-sible Feedback-regulated REsidual recurrent neural network (FRE-RNN) and study its learning performance in EP framework. Feedback regulation enables rapid convergence by reducing the spectral radius. The improvement in con-vergence property reduces the computational cost and train-ing time of EP by orders of magnitude, delivering perfor-mance on par with backpropagation (BP) in benchmark tasks. Meanwhile, residual connections with brain-inspired topologies help alleviate the vanishing gradient problem that arises when feedback pathways are weak in deep RNNs. Our approach substantially enhances the applicabil-ity and practicality of EP in large-scale networks that un-derpin artificial intelligence. The techniques developed here also offer guidance to implementing in-situ learning in physical neural networks.
Comment: The paper proposes a biologically plausible Feedback-regulated Residual recurrent neural network (FRE-RNN) to enhance Equilibrium Propagation, which aligns with foundational research in model architecture and representation learning.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.