Personalized Daily ArXiv Papers 2025-08-26

[gpt-4o]	Prompt	Completion	Total
Token	50024	6797	56821
Cost	$0.13	$0.07	$0.19

Total arXiv papers: 904

Total scanned papers: 547

Total relevant papers: 37

Table of contents with paper titles:

Consciousness as a Functor Authors: Sridhar Mahadevan
Amortized Sampling with Transferable Normalizing Flows Authors: Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov
Programmable k-local Ising Machines and all-optical Kolmogorov-Arnold Networks on Photonic Platforms Authors: Nikita Stroev, Natalia G. Berloff
Native Logical and Hierarchical Representations with Subspace Embeddings Authors: Gabriel Moreira, Zita Marinho, Manuel Marques, Jo\~ao Paulo Costeira, Chenyan Xiong
Curvature Learning for Generalization of Hyperbolic Neural Networks Authors: Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training Authors: Yifan Wang, Binbin Liu, Fengze Liu, Yuanfan Guo, Jiyao Deng, Xuecheng Wu, Weidong Zhou, Xiaohuan Zhou, Taifeng Wang
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models Authors: Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang
Adversarial Examples Are Not Bugs, They Are Superposition Authors: Liv Gorton, Owen Lewis
Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems Authors: Yinsong Wang, Xiao Liu, Quan Zeng, Yu Ding
GateTS: Versatile and Efficient Forecasting via Attention-Inspired routed Mixture-of-Experts Authors: Kyrylo Yemets, Mykola Lukashchuk, Ivan Izonin
Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective Authors: Tianyao Shi, Yi Ding
Interpreting the Effects of Quantization on LLMs Authors: Manpreet Singh, Hassan Sajjad
Limitations of Normalization in Attention Mechanism Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
Limits of message passing for node classification: How class-bottlenecks restrict signal-to-noise ratio Authors: Jonathan Rubin, Sahil Loomba, Nick S. Jones
Spacer: Towards Engineered Scientific Inspiration Authors: Minhyeong Lee, Suyoung Hwang, Seunghyun Moon, Geonho Nah, Donghyun Koh, Youngjun Cho, Johyun Park, Hojin Yoo, Jiho Park, Haneul Choi, Sungbin Moon, Taehoon Hwang, Seungwon Kim, Jaeyeong Kim, Seongjun Kim, Juneau Jung
HypER: Hyperbolic Echo State Networks for Capturing Stretch-and-Fold Dynamics in Chaotic Flows Authors: Pradeep Singh, Sutirtha Ghosh, Ashutosh Kumar, Hrishit B P, Balasubramanian Raman
Quantum Graph Attention Network: A Novel Quantum Multi-Head Attention Mechanism for Graph Learning Authors: An Ning, Tai Yue Li, Nan Yow Chen
Riemannian Optimization for LoRA on the Stiefel Manifold Authors: Juneyoung Park, Minjae Kang, Seongbae Lee, Haegang Lee, Seongwan Kim, Jaeho Lee
Dynamic Sparse Attention on Mobile SoCs Authors: Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu
Module-Aware Parameter-Efficient Machine Unlearning on Transformers Authors: Wenjie Bao, Jian Lou, Yuke Hu, Xiaochen Li, Zhihao Liu, Jiaqi Liu, Zhan Qin, Kui Ren
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Pingwei Sun, Feiye Huo, Jiayu Qin, Yerui Sun, Yuchen Xie, Xunliang Cai, Xiangyu Zhang, Maoxin He, Guangming Tan, Weile Jia, Tong Zhao
Optimizing Neural Networks with Learnable Non-Linear Activation Functions via Lookup-Based FPGA Acceleration Authors: Mengyuan Yin, Benjamin Chen Ming Choong, Chuping Qu, Rick Siow Mong Goh, Weng-Fai Wong, Tao Luo
Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry Authors: Haoyu Yun, Hamid Krim
Deep Learning for Markov Chains: Lyapunov Functions, Poisson's Equation, and Stationary Distributions Authors: Yanlin Qu, Jose Blanchet, Peter Glynn
Learn to Memorize: Optimizing LLM-based Agents with Adaptive Memory Framework Authors: Zeyu Zhang, Quanyu Dai, Rui Li, Xiaohe Bo, Xu Chen, Zhenhua Dong
CP4SBI: Local Conformal Calibration of Credible Sets in Simulation-Based Inference Authors: Luben M. C. Cabezas, Vagner S. Santos, Thiago R. Ramos, Pedro L. C. Rodrigues, Rafael Izbicki
CrystalDiT: A Diffusion Transformer for Crystal Generation Authors: Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao
TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving Authors: Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin
Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks Authors: Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto
Multimodal Representation Learning Conditioned on Semantic Relations Authors: Yang Qiao, Yuntong Hu, Liang Zhao
Confidence-Modulated Speculative Decoding for Large Language Models Authors: Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela
Steering When Necessary: Flexible Steering Large Language Models with Backtracking Authors: Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu
How Quantization Shapes Bias in Large Language Models Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics Authors: Pradeep Singh, Mehak Sharma, Anupriya Dey, Balasubramanian Raman
Disentangling the Factors of Convergence between Brains and Computer Vision Models Authors: Jos\'ephine Raugel, Marc Szafraniec, Huy V. Vo, Camille Couprie, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean-R\'emi King
Bridging Foundation Models and Efficient Architectures: A Modular Brain Imaging Framework with Local Masking and Pretrained Representation Learning Authors: Yanwen Wang, Xinglin Zhao, Yijin Song, Xiaobo Liu, Yanrong Hao, Rui Cao, Xin Wen
Training Transformers for Mesh-Based Simulations Authors: Paul Garnier, Vincent Lannelongue, Jonathan Viquerat, Elie Hachem

1. Consciousness as a Functor

ArXiv ID: 2508.17561

Authors: Sridhar Mahadevan

Abstract: We propose a novel theory of consciousness as a functor (CF) that receives and transmits contents from unconscious memory into conscious memory. Our CF framework can be seen as a categorial formulation of the Global Workspace Theory proposed by Baars. CF models the ensemble of unconscious processes as a topos category of coalgebras. The internal language of thought in CF is defined as a Multi-modal Universal Mitchell-Benabou Language Embedding (MUMBLE). We model the transmission of information from conscious short-term working memory to long-term unconscious memory using our recently proposed Universal Reinforcement Learning (URL) framework. To model the transmission of information from unconscious long-term memory into resource-constrained short-term memory, we propose a network economic model.

Comment: The paper proposes a novel theory of consciousness as a functor, which is an emerging trend with potential foundational impact.

Relevance: 9 Novelty: 9

2. Amortized Sampling with Transferable Normalizing Flows

ArXiv ID: 2508.18175

Authors: Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov

Abstract: Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in-full for each system of interest. The widespread success of generative models has inspired interest into overcoming this limitation through learning sampling algorithms. Despite performing on par with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We prove that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 280 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve superior performance to established methods such as sequential Monte Carlo on unseen tetrapeptides. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.

Comment: The paper introduces a transferable normalizing flow for molecular conformations, which is relevant to AI for Science and foundational research in molecular modeling.

Relevance: 9 Novelty: 8

3. Programmable k-local Ising Machines and all-optical Kolmogorov-Arnold Networks on Photonic Platforms

ArXiv ID: 2508.17440

Authors: Nikita Stroev, Natalia G. Berloff

Abstract: We unify k-local Ising optimization and optical KAN function learning on a single photonic platform, establishing a critical convergence point in optical computing that enables interleaved discrete-continuous workflows. We introduce a single spacial light modulator (SLM)-centric primitive that realizes, in one stroke, all-optical k-local Ising interactions and fully optical Kolmogorov-Arnold network (KAN) layers. The central idea is to convert structural nonlinearity of a nominally linear photonic scatterer into a per-window computational resource by adding one relay pass through the same spatial light modulator. A folded 4f relay reimages the first Fourier plane onto the SLM so that each chosen spin clique or ridge channel occupies a disjoint window with its own second-pass phase patch. Propagation remains linear in the optical field, yet the measured intensity in each window becomes a freely programmable polynomial of the clique sum or projection amplitude. This yields native, per-clique k-local couplings without nonlinear media and, in parallel, the many independent univariate nonlinearities required by KAN layers, all with in-situ physical gradients for training using two-frame (forward and adjoint) physical gradients. We outline implementation on spatial photonic Ising machines, injection-locked VCSEL arrays, and the Microsoft analog optical computers. In all cases the hardware change is one extra lens and a fold (or an on-chip 4f loop), enabling a minimal overhead, massively parallel route to high-order optical Ising optimization and trainable, all-optical KAN processing.

Comment: The paper introduces a novel photonic platform for k-local Ising optimization and optical KAN function learning, which could be considered a foundational research in AI for Science due to its innovative approach to optical computing.

Relevance: 9 Novelty: 8

4. Native Logical and Hierarchical Representations with Subspace Embeddings

ArXiv ID: 2508.16687

Authors: Gabriel Moreira, Zita Marinho, Manuel Marques, Jo\~ao Paulo Costeira, Chenyan Xiong

Abstract: Traditional neural embeddings represent concepts as points, excelling at similarity but struggling with higher-level reasoning and asymmetric relationships. We introduce a novel paradigm: embedding concepts as linear subspaces. This framework inherently models generality via subspace dimensionality and hierarchy through subspace inclusion. It naturally supports set-theoretic operations like intersection (conjunction), linear sum (disjunction) and orthogonal complements (negations), aligning with classical formal semantics. To enable differentiable learning, we propose a smooth relaxation of orthogonal projection operators, allowing for the learning of both subspace orientation and dimension. Our method achieves state-of-the-art results in reconstruction and link prediction on WordNet. Furthermore, on natural language inference benchmarks, our subspace embeddings surpass bi-encoder baselines, offering an interpretable formulation of entailment that is both geometrically grounded and amenable to logical operations.

Comment: The paper introduces a novel paradigm of embedding concepts as linear subspaces, which is a significant contribution to representation learning and model architecture.

Relevance: 9 Novelty: 8

5. Curvature Learning for Generalization of Hyperbolic Neural Networks

ArXiv ID: 2508.17232

Authors: Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia

Abstract: Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures via exploiting the geometric properties of hyperbolic spaces characterized by negative curvatures. Curvature plays a crucial role in optimizing HNNs. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, the theoretical foundation of the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound of HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. Then, we introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present the approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded, and the proposed method can converge by bounding gradients of HNNs. Experiments on four settings: classification, learning from long-tailed data, learning from noisy data, and few-shot learning show that our method can improve the performance of HNNs.

Comment: The paper introduces a method for curvature learning in hyperbolic neural networks, providing theoretical insights into the role of curvatures, which aligns with representation learning and emerging trends.

Relevance: 9 Novelty: 8

6. TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

ArXiv ID: 2508.17677

Authors: Yifan Wang, Binbin Liu, Fengze Liu, Yuanfan Guo, Jiyao Deng, Xuecheng Wu, Weidong Zhou, Xiaohuan Zhou, Taifeng Wang

Abstract: The data mixture used in the pre-training of a language model is a cornerstone of its final performance. However, a static mixing strategy is suboptimal, as the model's learning preferences for various data domains shift dynamically throughout training. Crucially, observing these evolving preferences in a computationally efficient manner remains a significant challenge. To address this, we propose TiKMiX, a method that dynamically adjusts the data mixture according to the model's evolving preferences. TiKMiX introduces Group Influence, an efficient metric for evaluating the impact of data domains on the model. This metric enables the formulation of the data mixing problem as a search for an optimal, influence-maximizing distribution. We solve this via two approaches: TiKMiX-D for direct optimization, and TiKMiX-M, which uses a regression model to predict a superior mixture. We trained models with different numbers of parameters, on up to 1 trillion tokens. TiKMiX-D exceeds the performance of state-of-the-art methods like REGMIX while using just 20% of the computational resources. TiKMiX-M leads to an average performance gain of 2% across 9 downstream benchmarks. Our experiments reveal that a model's data preferences evolve with training progress and scale, and we demonstrate that dynamically adjusting the data mixture based on Group Influence, a direct measure of these preferences, significantly improves performance by mitigating the underdigestion of data seen with static ratios.

Comment: The paper introduces TiKMiX, a method for dynamic data mixture adjustment in language model pre-training, which aligns with foundational research in LLM pretraining and efficiency.

Relevance: 9 Novelty: 8

7. CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

ArXiv ID: 2508.17243

Authors: Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang

Abstract: Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.

Comment: The paper introduces a novel method for pruning visual tokens in large vision-language models, which aligns with model compression through pruning and efficiency improvements.

Relevance: 9 Novelty: 8

8. Adversarial Examples Are Not Bugs, They Are Superposition

ArXiv ID: 2508.17456

Authors: Liv Gorton, Owen Lewis

Abstract: Adversarial examples -- inputs with imperceptible perturbations that fool neural networks -- remain one of deep learning's most perplexing phenomena despite nearly a decade of research. While numerous defenses and explanations have been proposed, there is no consensus on the fundamental mechanism. One underexplored hypothesis is that superposition, a concept from mechanistic interpretability, may be a major contributing factor, or even the primary cause. We present four lines of evidence in support of this hypothesis, greatly extending prior arguments by Elhage et al. (2022): (1) superposition can theoretically explain a range of adversarial phenomena, (2) in toy models, intervening on superposition controls robustness, (3) in toy models, intervening on robustness (via adversarial training) controls superposition, and (4) in ResNet18, intervening on robustness (via adversarial training) controls superposition.

Comment: The paper explores the hypothesis that adversarial examples are due to superposition, providing insights into representation learning.

Relevance: 9 Novelty: 8

9. Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems

ArXiv ID: 2508.17403

Authors: Yinsong Wang, Xiao Liu, Quan Zeng, Yu Ding

Abstract: Recent breakthroughs in autonomous experimentation have demonstrated remarkable physical capabilities, yet their cognitive control remains limited--often relying on static heuristics or classical optimization. A core limitation is the absence of a principled mechanism to detect and adapt to the unexpectedness. While traditional surprise measures--such as Shannon or Bayesian Surprise--offer momentary detection of deviation, they fail to capture whether a system is truly learning and adapting. In this work, we introduce Mutual Information Surprise (MIS), a new framework that redefines surprise not as anomaly detection, but as a signal of epistemic growth. MIS quantifies the impact of new observations on mutual information, enabling autonomous systems to reflect on their learning progression. We develop a statistical test sequence to detect meaningful shifts in estimated mutual information and propose a mutual information surprise reaction policy (MISRP) that dynamically governs system behavior through sampling adjustment and process forking. Empirical evaluations--on both synthetic domains and a dynamic pollution map estimation task--show that MISRP-governed strategies significantly outperform classical surprise-based approaches in stability, responsiveness, and predictive accuracy. By shifting surprise from reactive to reflective, MIS offers a path toward more self-aware and adaptive autonomous systems.

Comment: The introduction of Mutual Information Surprise as a new framework for autonomous systems is a novel theoretical contribution, aligning with emerging trends in AI.

Relevance: 9 Novelty: 8

10. GateTS: Versatile and Efficient Forecasting via Attention-Inspired routed Mixture-of-Experts

ArXiv ID: 2508.17515

Authors: Kyrylo Yemets, Mykola Lukashchuk, Ivan Izonin

Abstract: Accurate univariate forecasting remains a pressing need in real-world systems, such as energy markets, hydrology, retail demand, and IoT monitoring, where signals are often intermittent and horizons span both short- and long-term. While transformers and Mixture-of-Experts (MoE) architectures are increasingly favored for time-series forecasting, a key gap persists: MoE models typically require complicated training with both the main forecasting loss and auxiliary load-balancing losses, along with careful routing/temperature tuning, which hinders practical adoption. In this paper, we propose a model architecture that simplifies the training process for univariate time series forecasting and effectively addresses both long- and short-term horizons, including intermittent patterns. Our approach combines sparse MoE computation with a novel attention-inspired gating mechanism that replaces the traditional one-layer softmax router. Through extensive empirical evaluation, we demonstrate that our gating design naturally promotes balanced expert utilization and achieves superior predictive accuracy without requiring the auxiliary load-balancing losses typically used in classical MoE implementations. The model achieves better performance while utilizing only a fraction of the parameters required by state-of-the-art transformer models, such as PatchTST. Furthermore, experiments across diverse datasets confirm that our MoE architecture with the proposed gating mechanism is more computationally efficient than LSTM for both long- and short-term forecasting, enabling cost-effective inference. These results highlight the potential of our approach for practical time-series forecasting applications where both accuracy and computational efficiency are critical.

Comment: The paper introduces a Mixture-of-Experts model with a novel attention-inspired gating mechanism, aligning with model architecture and efficiency topics.

Relevance: 9 Novelty: 8

11. Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective

ArXiv ID: 2508.16712

Authors: Tianyao Shi, Yi Ding

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization-reducing precision to lower-bit formats-critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs in realistic serving conditions remains a gap. In this work, we first develop a fully automated online characterization framework qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across 4 model sizes (7B-70B) and two GPU architectures (A100, H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with parallelism and GPU architecture. We further present three optimization case studies illustrating deployment challenges in capacity planning, energy-efficient scheduling, and multi-objective tuning. To the best of our knowledge, this is one of the first comprehensive application-, system-, and hardware-level characterization of LLM quantization from a joint performance, energy, and quality perspective.

Comment: The paper provides a comprehensive characterization of LLM quantization methods, focusing on performance, energy, and quality tradeoffs, which aligns with the model compression criterion.

Relevance: 9 Novelty: 7

12. Interpreting the Effects of Quantization on LLMs

ArXiv ID: 2508.16785

Authors: Manpreet Singh, Hassan Sajjad

Abstract: Quantization offers a practical solution to deploy LLMs in resource-constraint environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that effect of quantization may vary by model and tasks, however, we did not observe any drastic change which may discourage the use of quantization as a reliable model compression technique.

Comment: The paper investigates the effects of quantization on LLMs, providing insights into model compression and interpretability.

Relevance: 9 Novelty: 7

13. Limitations of Normalization in Attention Mechanism

ArXiv ID: 2508.17821

Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

Abstract: This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanism and motivate the need for more robust normalization and selection strategies in future attention architectures.

Comment: The paper investigates the limitations of normalization in attention mechanisms, providing insights into model architecture.

Relevance: 9 Novelty: 7

14. Limits of message passing for node classification: How class-bottlenecks restrict signal-to-noise ratio

ArXiv ID: 2508.17822

Authors: Jonathan Rubin, Sahil Loomba, Nick S. Jones

Abstract: Message passing neural networks (MPNNs) are powerful models for node classification but suffer from performance limitations under heterophily (low same-class connectivity) and structural bottlenecks in the graph. We provide a unifying statistical framework exposing the relationship between heterophily and bottlenecks through the signal-to-noise ratio (SNR) of MPNN representations. The SNR decomposes model performance into feature-dependent parameters and feature-independent sensitivities. We prove that the sensitivity to class-wise signals is bounded by higher-order homophily -- a generalisation of classical homophily to multi-hop neighbourhoods -- and show that low higher-order homophily manifests locally as the interaction between structural bottlenecks and class labels (class-bottlenecks). Through analysis of graph ensembles, we provide a further quantitative decomposition of bottlenecking into underreaching (lack of depth implying signals cannot arrive) and oversquashing (lack of breadth implying signals arriving on fewer paths) with closed-form expressions. We prove that optimal graph structures for maximising higher-order homophily are disjoint unions of single-class and two-class-bipartite clusters. This yields BRIDGE, a graph ensemble-based rewiring algorithm that achieves near-perfect classification accuracy across all homophily regimes on synthetic benchmarks and significant improvements on real-world benchmarks, by eliminating the ``mid-homophily pitfall'' where MPNNs typically struggle, surpassing current standard rewiring techniques from the literature. Our framework, whose code we make available for public use, provides both diagnostic tools for assessing MPNN performance, and simple yet effective methods for enhancing performance through principled graph modification.

Comment: The paper provides a theoretical framework for understanding limitations in message passing neural networks, relevant to representation learning.

Relevance: 8 Novelty: 8

15. Spacer: Towards Engineered Scientific Inspiration

ArXiv ID: 2508.17661

Authors: Minhyeong Lee, Suyoung Hwang, Seunghyun Moon, Geonho Nah, Donghyun Koh, Youngjun Cho, Johyun Park, Hojin Yoo, Jiho Park, Haneul Choi, Sungbin Moon, Taehoon Hwang, Seungwon Kim, Jaeyeong Kim, Seongjun Kim, Juneau Jung

Abstract: Recent advances in LLMs have made automated scientific research the next frontline in the path to artificial superintelligence. However, these systems are bound either to tasks of narrow scope or the limited creative capabilities of LLMs. We propose Spacer, a scientific discovery system that develops creative and factually grounded concepts without external intervention. Spacer attempts to achieve this via 'deliberate decontextualization,' an approach that disassembles information into atomic units - keywords - and draws creativity from unexplored connections between them. Spacer consists of (i) Nuri, an inspiration engine that builds keyword sets, and (ii) the Manifesting Pipeline that refines these sets into elaborate scientific statements. Nuri extracts novel, high-potential keyword sets from a keyword graph built with 180,000 academic publications in biological fields. The Manifesting Pipeline finds links between keywords, analyzes their logical structure, validates their plausibility, and ultimately drafts original scientific concepts. According to our experiments, the evaluation metric of Nuri accurately classifies high-impact publications with an AUROC score of 0.737. Our Manifesting Pipeline also successfully reconstructs core concepts from the latest top-journal articles solely from their keyword sets. An LLM-based scoring system estimates that this reconstruction was sound for over 85% of the cases. Finally, our embedding space analysis shows that outputs from Spacer are significantly more similar to leading publications compared with those from SOTA LLMs.

Comment: Spacer proposes a novel system for scientific discovery using LLMs, which aligns with emerging trends in AI for Science.

Relevance: 8 Novelty: 8

16. HypER: Hyperbolic Echo State Networks for Capturing Stretch-and-Fold Dynamics in Chaotic Flows

ArXiv ID: 2508.18196

Authors: Pradeep Singh, Sutirtha Ghosh, Ashutosh Kumar, Hrishit B P, Balasubramanian Raman

Abstract: Forecasting chaotic dynamics beyond a few Lyapunov times is difficult because infinitesimal errors grow exponentially. Existing Echo State Networks (ESNs) mitigate this growth but employ reservoirs whose Euclidean geometry is mismatched to the stretch-and-fold structure of chaos. We introduce the Hyperbolic Embedding Reservoir (HypER), an ESN whose neurons are sampled in the Poincare ball and whose connections decay exponentially with hyperbolic distance. This negative-curvature construction embeds an exponential metric directly into the latent space, aligning the reservoir's local expansion-contraction spectrum with the system's Lyapunov directions while preserving standard ESN features such as sparsity, leaky integration, and spectral-radius control. Training is limited to a Tikhonov-regularized readout. On the chaotic Lorenz-63 and Roessler systems, and the hyperchaotic Chen-Ueta attractor, HypER consistently lengthens the mean valid-prediction horizon beyond Euclidean and graph-structured ESN baselines, with statistically significant gains confirmed over 30 independent runs; parallel results on real-world benchmarks, including heart-rate variability from the Santa Fe and MIT-BIH datasets and international sunspot numbers, corroborate its advantage. We further establish a lower bound on the rate of state divergence for HypER, mirroring Lyapunov growth.

Comment: The paper introduces a novel ESN architecture with hyperbolic geometry, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

17. Quantum Graph Attention Network: A Novel Quantum Multi-Head Attention Mechanism for Graph Learning

ArXiv ID: 2508.17630

Authors: An Ning, Tai Yue Li, Nan Yow Chen

Abstract: We propose the Quantum Graph Attention Network (QGAT), a hybrid graph neural network that integrates variational quantum circuits into the attention mechanism. At its core, QGAT employs strongly entangling quantum circuits with amplitude-encoded node features to enable expressive nonlinear interactions. Distinct from classical multi-head attention that separately computes each head, QGAT leverages a single quantum circuit to simultaneously generate multiple attention coefficients. This quantum parallelism facilitates parameter sharing across heads, substantially reducing computational overhead and model complexity. Classical projection weights and quantum circuit parameters are optimized jointly in an end-to-end manner, ensuring flexible adaptation to learning tasks. Empirical results demonstrate QGAT's effectiveness in capturing complex structural dependencies and improved generalization in inductive scenarios, highlighting its potential for scalable quantum-enhanced learning across domains such as chemistry, biology, and network analysis. Furthermore, experiments confirm that quantum embedding enhances robustness against feature and structural noise, suggesting advantages in handling real-world noisy data. The modularity of QGAT also ensures straightforward integration into existing architectures, allowing it to easily augment classical attention-based models.

Comment: The Quantum Graph Attention Network introduces a novel quantum multi-head attention mechanism, aligning with model architecture innovations.

Relevance: 8 Novelty: 8

18. Riemannian Optimization for LoRA on the Stiefel Manifold

ArXiv ID: 2508.17901

Authors: Juneyoung Park, Minjae Kang, Seongbae Lee, Haegang Lee, Seongwan Kim, Jaeho Lee

Abstract: While powerful, large language models (LLMs) present significant fine-tuning challenges due to their size. Parameter-efficient fine-tuning (PEFT) methods like LoRA provide solutions, yet suffer from critical optimizer inefficiencies; notably basis redundancy in LoRA's $B$ matrix when using AdamW, which fundamentally limits performance. We address this by optimizing the $B$ matrix on the Stiefel manifold, imposing explicit orthogonality constraints that achieve near-perfect orthogonality and full effective rank. This geometric approach dramatically enhances parameter efficiency and representational capacity. Our Stiefel optimizer consistently outperforms AdamW across benchmarks with both LoRA and DoRA, demonstrating that geometric constraints are the key to unlocking LoRA's full potential for effective LLM fine-tuning.

Comment: The paper discusses a novel optimization method for LoRA, which is relevant to model compression and efficiency improvements.