Previous Day 2025-08-25
Monthly Overview 2025-08
Next Day 2025-08-27

Personalized Daily ArXiv Papers 2025-08-26

[gpt-4o] Prompt Completion Total
Token 50024 6797 56821
Cost $0.13 $0.07 $0.19

Total arXiv papers: 904

Total scanned papers: 547

Total relevant papers: 37

Table of contents with paper titles:

  1. Consciousness as a Functor Authors: Sridhar Mahadevan

  2. Amortized Sampling with Transferable Normalizing Flows Authors: Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov

  3. Programmable k-local Ising Machines and all-optical Kolmogorov-Arnold Networks on Photonic Platforms Authors: Nikita Stroev, Natalia G. Berloff

  4. Native Logical and Hierarchical Representations with Subspace Embeddings Authors: Gabriel Moreira, Zita Marinho, Manuel Marques, Jo\~ao Paulo Costeira, Chenyan Xiong

  5. Curvature Learning for Generalization of Hyperbolic Neural Networks Authors: Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia

  6. TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training Authors: Yifan Wang, Binbin Liu, Fengze Liu, Yuanfan Guo, Jiyao Deng, Xuecheng Wu, Weidong Zhou, Xiaohuan Zhou, Taifeng Wang

  7. CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models Authors: Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang

  8. Adversarial Examples Are Not Bugs, They Are Superposition Authors: Liv Gorton, Owen Lewis

  9. Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems Authors: Yinsong Wang, Xiao Liu, Quan Zeng, Yu Ding

  10. GateTS: Versatile and Efficient Forecasting via Attention-Inspired routed Mixture-of-Experts Authors: Kyrylo Yemets, Mykola Lukashchuk, Ivan Izonin

  11. Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective Authors: Tianyao Shi, Yi Ding

  12. Interpreting the Effects of Quantization on LLMs Authors: Manpreet Singh, Hassan Sajjad

  13. Limitations of Normalization in Attention Mechanism Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

  14. Limits of message passing for node classification: How class-bottlenecks restrict signal-to-noise ratio Authors: Jonathan Rubin, Sahil Loomba, Nick S. Jones

  15. Spacer: Towards Engineered Scientific Inspiration Authors: Minhyeong Lee, Suyoung Hwang, Seunghyun Moon, Geonho Nah, Donghyun Koh, Youngjun Cho, Johyun Park, Hojin Yoo, Jiho Park, Haneul Choi, Sungbin Moon, Taehoon Hwang, Seungwon Kim, Jaeyeong Kim, Seongjun Kim, Juneau Jung

  16. HypER: Hyperbolic Echo State Networks for Capturing Stretch-and-Fold Dynamics in Chaotic Flows Authors: Pradeep Singh, Sutirtha Ghosh, Ashutosh Kumar, Hrishit B P, Balasubramanian Raman

  17. Quantum Graph Attention Network: A Novel Quantum Multi-Head Attention Mechanism for Graph Learning Authors: An Ning, Tai Yue Li, Nan Yow Chen

  18. Riemannian Optimization for LoRA on the Stiefel Manifold Authors: Juneyoung Park, Minjae Kang, Seongbae Lee, Haegang Lee, Seongwan Kim, Jaeho Lee

  19. Dynamic Sparse Attention on Mobile SoCs Authors: Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu

  20. Module-Aware Parameter-Efficient Machine Unlearning on Transformers Authors: Wenjie Bao, Jian Lou, Yuke Hu, Xiaochen Li, Zhihao Liu, Jiaqi Liu, Zhan Qin, Kui Ren

  21. WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Pingwei Sun, Feiye Huo, Jiayu Qin, Yerui Sun, Yuchen Xie, Xunliang Cai, Xiangyu Zhang, Maoxin He, Guangming Tan, Weile Jia, Tong Zhao

  22. Optimizing Neural Networks with Learnable Non-Linear Activation Functions via Lookup-Based FPGA Acceleration Authors: Mengyuan Yin, Benjamin Chen Ming Choong, Chuping Qu, Rick Siow Mong Goh, Weng-Fai Wong, Tao Luo

  23. Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry Authors: Haoyu Yun, Hamid Krim

  24. Deep Learning for Markov Chains: Lyapunov Functions, Poisson's Equation, and Stationary Distributions Authors: Yanlin Qu, Jose Blanchet, Peter Glynn

  25. Learn to Memorize: Optimizing LLM-based Agents with Adaptive Memory Framework Authors: Zeyu Zhang, Quanyu Dai, Rui Li, Xiaohe Bo, Xu Chen, Zhenhua Dong

  26. CP4SBI: Local Conformal Calibration of Credible Sets in Simulation-Based Inference Authors: Luben M. C. Cabezas, Vagner S. Santos, Thiago R. Ramos, Pedro L. C. Rodrigues, Rafael Izbicki

  27. CrystalDiT: A Diffusion Transformer for Crystal Generation Authors: Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao

  28. TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving Authors: Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin

  29. Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks Authors: Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto

  30. Multimodal Representation Learning Conditioned on Semantic Relations Authors: Yang Qiao, Yuntong Hu, Liang Zhao

  31. Confidence-Modulated Speculative Decoding for Large Language Models Authors: Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

  32. Steering When Necessary: Flexible Steering Large Language Models with Backtracking Authors: Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu

  33. How Quantization Shapes Bias in Large Language Models Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych

  34. Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics Authors: Pradeep Singh, Mehak Sharma, Anupriya Dey, Balasubramanian Raman

  35. Disentangling the Factors of Convergence between Brains and Computer Vision Models Authors: Jos\'ephine Raugel, Marc Szafraniec, Huy V. Vo, Camille Couprie, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean-R\'emi King

  36. Bridging Foundation Models and Efficient Architectures: A Modular Brain Imaging Framework with Local Masking and Pretrained Representation Learning Authors: Yanwen Wang, Xinglin Zhao, Yijin Song, Xiaobo Liu, Yanrong Hao, Rui Cao, Xin Wen

  37. Training Transformers for Mesh-Based Simulations Authors: Paul Garnier, Vincent Lannelongue, Jonathan Viquerat, Elie Hachem


1. Consciousness as a Functor

ArXiv ID: 2508.17561

Authors: Sridhar Mahadevan

Abstract: We propose a novel theory of consciousness as a functor (CF) that receives and transmits contents from unconscious memory into conscious memory. Our CF framework can be seen as a categorial formulation of the Global Workspace Theory proposed by Baars. CF models the ensemble of unconscious processes as a topos category of coalgebras. The internal language of thought in CF is defined as a Multi-modal Universal Mitchell-Benabou Language Embedding (MUMBLE). We model the transmission of information from conscious short-term working memory to long-term unconscious memory using our recently proposed Universal Reinforcement Learning (URL) framework. To model the transmission of information from unconscious long-term memory into resource-constrained short-term memory, we propose a network economic model.

Comment: The paper proposes a novel theory of consciousness as a functor, which is an emerging trend with potential foundational impact.

Relevance: 9 Novelty: 9


2. Amortized Sampling with Transferable Normalizing Flows

ArXiv ID: 2508.18175

Authors: Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov

Abstract: Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in-full for each system of interest. The widespread success of generative models has inspired interest into overcoming this limitation through learning sampling algorithms. Despite performing on par with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We prove that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 280 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve superior performance to established methods such as sequential Monte Carlo on unseen tetrapeptides. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.

Comment: The paper introduces a transferable normalizing flow for molecular conformations, which is relevant to AI for Science and foundational research in molecular modeling.

Relevance: 9 Novelty: 8


3. Programmable k-local Ising Machines and all-optical Kolmogorov-Arnold Networks on Photonic Platforms

ArXiv ID: 2508.17440

Authors: Nikita Stroev, Natalia G. Berloff

Abstract: We unify k-local Ising optimization and optical KAN function learning on a single photonic platform, establishing a critical convergence point in optical computing that enables interleaved discrete-continuous workflows. We introduce a single spacial light modulator (SLM)-centric primitive that realizes, in one stroke, all-optical k-local Ising interactions and fully optical Kolmogorov-Arnold network (KAN) layers. The central idea is to convert structural nonlinearity of a nominally linear photonic scatterer into a per-window computational resource by adding one relay pass through the same spatial light modulator. A folded 4f relay reimages the first Fourier plane onto the SLM so that each chosen spin clique or ridge channel occupies a disjoint window with its own second-pass phase patch. Propagation remains linear in the optical field, yet the measured intensity in each window becomes a freely programmable polynomial of the clique sum or projection amplitude. This yields native, per-clique k-local couplings without nonlinear media and, in parallel, the many independent univariate nonlinearities required by KAN layers, all with in-situ physical gradients for training using two-frame (forward and adjoint) physical gradients. We outline implementation on spatial photonic Ising machines, injection-locked VCSEL arrays, and the Microsoft analog optical computers. In all cases the hardware change is one extra lens and a fold (or an on-chip 4f loop), enabling a minimal overhead, massively parallel route to high-order optical Ising optimization and trainable, all-optical KAN processing.

Comment: The paper introduces a novel photonic platform for k-local Ising optimization and optical KAN function learning, which could be considered a foundational research in AI for Science due to its innovative approach to optical computing.

Relevance: 9 Novelty: 8


4. Native Logical and Hierarchical Representations with Subspace Embeddings

ArXiv ID: 2508.16687

Authors: Gabriel Moreira, Zita Marinho, Manuel Marques, Jo\~ao Paulo Costeira, Chenyan Xiong

Abstract: Traditional neural embeddings represent concepts as points, excelling at similarity but struggling with higher-level reasoning and asymmetric relationships. We introduce a novel paradigm: embedding concepts as linear subspaces. This framework inherently models generality via subspace dimensionality and hierarchy through subspace inclusion. It naturally supports set-theoretic operations like intersection (conjunction), linear sum (disjunction) and orthogonal complements (negations), aligning with classical formal semantics. To enable differentiable learning, we propose a smooth relaxation of orthogonal projection operators, allowing for the learning of both subspace orientation and dimension. Our method achieves state-of-the-art results in reconstruction and link prediction on WordNet. Furthermore, on natural language inference benchmarks, our subspace embeddings surpass bi-encoder baselines, offering an interpretable formulation of entailment that is both geometrically grounded and amenable to logical operations.

Comment: The paper introduces a novel paradigm of embedding concepts as linear subspaces, which is a significant contribution to representation learning and model architecture.

Relevance: 9 Novelty: 8


5. Curvature Learning for Generalization of Hyperbolic Neural Networks

ArXiv ID: 2508.17232

Authors: Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia

Abstract: Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures via exploiting the geometric properties of hyperbolic spaces characterized by negative curvatures. Curvature plays a crucial role in optimizing HNNs. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, the theoretical foundation of the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound of HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. Then, we introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present the approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded, and the proposed method can converge by bounding gradients of HNNs. Experiments on four settings: classification, learning from long-tailed data, learning from noisy data, and few-shot learning show that our method can improve the performance of HNNs.

Comment: The paper introduces a method for curvature learning in hyperbolic neural networks, providing theoretical insights into the role of curvatures, which aligns with representation learning and emerging trends.

Relevance: 9 Novelty: 8


6. TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

ArXiv ID: 2508.17677

Authors: Yifan Wang, Binbin Liu, Fengze Liu, Yuanfan Guo, Jiyao Deng, Xuecheng Wu, Weidong Zhou, Xiaohuan Zhou, Taifeng Wang

Abstract: The data mixture used in the pre-training of a language model is a cornerstone of its final performance. However, a static mixing strategy is suboptimal, as the model's learning preferences for various data domains shift dynamically throughout training. Crucially, observing these evolving preferences in a computationally efficient manner remains a significant challenge. To address this, we propose TiKMiX, a method that dynamically adjusts the data mixture according to the model's evolving preferences. TiKMiX introduces Group Influence, an efficient metric for evaluating the impact of data domains on the model. This metric enables the formulation of the data mixing problem as a search for an optimal, influence-maximizing distribution. We solve this via two approaches: TiKMiX-D for direct optimization, and TiKMiX-M, which uses a regression model to predict a superior mixture. We trained models with different numbers of parameters, on up to 1 trillion tokens. TiKMiX-D exceeds the performance of state-of-the-art methods like REGMIX while using just 20% of the computational resources. TiKMiX-M leads to an average performance gain of 2% across 9 downstream benchmarks. Our experiments reveal that a model's data preferences evolve with training progress and scale, and we demonstrate that dynamically adjusting the data mixture based on Group Influence, a direct measure of these preferences, significantly improves performance by mitigating the underdigestion of data seen with static ratios.

Comment: The paper introduces TiKMiX, a method for dynamic data mixture adjustment in language model pre-training, which aligns with foundational research in LLM pretraining and efficiency.

Relevance: 9 Novelty: 8


7. CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

ArXiv ID: 2508.17243

Authors: Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang

Abstract: Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.

Comment: The paper introduces a novel method for pruning visual tokens in large vision-language models, which aligns with model compression through pruning and efficiency improvements.

Relevance: 9 Novelty: 8


8. Adversarial Examples Are Not Bugs, They Are Superposition

ArXiv ID: 2508.17456

Authors: Liv Gorton, Owen Lewis

Abstract: Adversarial examples -- inputs with imperceptible perturbations that fool neural networks -- remain one of deep learning's most perplexing phenomena despite nearly a decade of research. While numerous defenses and explanations have been proposed, there is no consensus on the fundamental mechanism. One underexplored hypothesis is that superposition, a concept from mechanistic interpretability, may be a major contributing factor, or even the primary cause. We present four lines of evidence in support of this hypothesis, greatly extending prior arguments by Elhage et al. (2022): (1) superposition can theoretically explain a range of adversarial phenomena, (2) in toy models, intervening on superposition controls robustness, (3) in toy models, intervening on robustness (via adversarial training) controls superposition, and (4) in ResNet18, intervening on robustness (via adversarial training) controls superposition.

Comment: The paper explores the hypothesis that adversarial examples are due to superposition, providing insights into representation learning.

Relevance: 9 Novelty: 8


9. Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems

ArXiv ID: 2508.17403

Authors: Yinsong Wang, Xiao Liu, Quan Zeng, Yu Ding

Abstract: Recent breakthroughs in autonomous experimentation have demonstrated remarkable physical capabilities, yet their cognitive control remains limited--often relying on static heuristics or classical optimization. A core limitation is the absence of a principled mechanism to detect and adapt to the unexpectedness. While traditional surprise measures--such as Shannon or Bayesian Surprise--offer momentary detection of deviation, they fail to capture whether a system is truly learning and adapting. In this work, we introduce Mutual Information Surprise (MIS), a new framework that redefines surprise not as anomaly detection, but as a signal of epistemic growth. MIS quantifies the impact of new observations on mutual information, enabling autonomous systems to reflect on their learning progression. We develop a statistical test sequence to detect meaningful shifts in estimated mutual information and propose a mutual information surprise reaction policy (MISRP) that dynamically governs system behavior through sampling adjustment and process forking. Empirical evaluations--on both synthetic domains and a dynamic pollution map estimation task--show that MISRP-governed strategies significantly outperform classical surprise-based approaches in stability, responsiveness, and predictive accuracy. By shifting surprise from reactive to reflective, MIS offers a path toward more self-aware and adaptive autonomous systems.

Comment: The introduction of Mutual Information Surprise as a new framework for autonomous systems is a novel theoretical contribution, aligning with emerging trends in AI.

Relevance: 9 Novelty: 8


10. GateTS: Versatile and Efficient Forecasting via Attention-Inspired routed Mixture-of-Experts

ArXiv ID: 2508.17515

Authors: Kyrylo Yemets, Mykola Lukashchuk, Ivan Izonin

Abstract: Accurate univariate forecasting remains a pressing need in real-world systems, such as energy markets, hydrology, retail demand, and IoT monitoring, where signals are often intermittent and horizons span both short- and long-term. While transformers and Mixture-of-Experts (MoE) architectures are increasingly favored for time-series forecasting, a key gap persists: MoE models typically require complicated training with both the main forecasting loss and auxiliary load-balancing losses, along with careful routing/temperature tuning, which hinders practical adoption. In this paper, we propose a model architecture that simplifies the training process for univariate time series forecasting and effectively addresses both long- and short-term horizons, including intermittent patterns. Our approach combines sparse MoE computation with a novel attention-inspired gating mechanism that replaces the traditional one-layer softmax router. Through extensive empirical evaluation, we demonstrate that our gating design naturally promotes balanced expert utilization and achieves superior predictive accuracy without requiring the auxiliary load-balancing losses typically used in classical MoE implementations. The model achieves better performance while utilizing only a fraction of the parameters required by state-of-the-art transformer models, such as PatchTST. Furthermore, experiments across diverse datasets confirm that our MoE architecture with the proposed gating mechanism is more computationally efficient than LSTM for both long- and short-term forecasting, enabling cost-effective inference. These results highlight the potential of our approach for practical time-series forecasting applications where both accuracy and computational efficiency are critical.

Comment: The paper introduces a Mixture-of-Experts model with a novel attention-inspired gating mechanism, aligning with model architecture and efficiency topics.

Relevance: 9 Novelty: 8


11. Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective

ArXiv ID: 2508.16712

Authors: Tianyao Shi, Yi Ding

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization-reducing precision to lower-bit formats-critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs in realistic serving conditions remains a gap. In this work, we first develop a fully automated online characterization framework qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across 4 model sizes (7B-70B) and two GPU architectures (A100, H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with parallelism and GPU architecture. We further present three optimization case studies illustrating deployment challenges in capacity planning, energy-efficient scheduling, and multi-objective tuning. To the best of our knowledge, this is one of the first comprehensive application-, system-, and hardware-level characterization of LLM quantization from a joint performance, energy, and quality perspective.

Comment: The paper provides a comprehensive characterization of LLM quantization methods, focusing on performance, energy, and quality tradeoffs, which aligns with the model compression criterion.

Relevance: 9 Novelty: 7


12. Interpreting the Effects of Quantization on LLMs

ArXiv ID: 2508.16785

Authors: Manpreet Singh, Hassan Sajjad

Abstract: Quantization offers a practical solution to deploy LLMs in resource-constraint environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that effect of quantization may vary by model and tasks, however, we did not observe any drastic change which may discourage the use of quantization as a reliable model compression technique.

Comment: The paper investigates the effects of quantization on LLMs, providing insights into model compression and interpretability.

Relevance: 9 Novelty: 7


13. Limitations of Normalization in Attention Mechanism

ArXiv ID: 2508.17821

Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

Abstract: This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanism and motivate the need for more robust normalization and selection strategies in future attention architectures.

Comment: The paper investigates the limitations of normalization in attention mechanisms, providing insights into model architecture.

Relevance: 9 Novelty: 7


14. Limits of message passing for node classification: How class-bottlenecks restrict signal-to-noise ratio

ArXiv ID: 2508.17822

Authors: Jonathan Rubin, Sahil Loomba, Nick S. Jones

Abstract: Message passing neural networks (MPNNs) are powerful models for node classification but suffer from performance limitations under heterophily (low same-class connectivity) and structural bottlenecks in the graph. We provide a unifying statistical framework exposing the relationship between heterophily and bottlenecks through the signal-to-noise ratio (SNR) of MPNN representations. The SNR decomposes model performance into feature-dependent parameters and feature-independent sensitivities. We prove that the sensitivity to class-wise signals is bounded by higher-order homophily -- a generalisation of classical homophily to multi-hop neighbourhoods -- and show that low higher-order homophily manifests locally as the interaction between structural bottlenecks and class labels (class-bottlenecks). Through analysis of graph ensembles, we provide a further quantitative decomposition of bottlenecking into underreaching (lack of depth implying signals cannot arrive) and oversquashing (lack of breadth implying signals arriving on fewer paths) with closed-form expressions. We prove that optimal graph structures for maximising higher-order homophily are disjoint unions of single-class and two-class-bipartite clusters. This yields BRIDGE, a graph ensemble-based rewiring algorithm that achieves near-perfect classification accuracy across all homophily regimes on synthetic benchmarks and significant improvements on real-world benchmarks, by eliminating the ``mid-homophily pitfall'' where MPNNs typically struggle, surpassing current standard rewiring techniques from the literature. Our framework, whose code we make available for public use, provides both diagnostic tools for assessing MPNN performance, and simple yet effective methods for enhancing performance through principled graph modification.

Comment: The paper provides a theoretical framework for understanding limitations in message passing neural networks, relevant to representation learning.

Relevance: 8 Novelty: 8


15. Spacer: Towards Engineered Scientific Inspiration

ArXiv ID: 2508.17661

Authors: Minhyeong Lee, Suyoung Hwang, Seunghyun Moon, Geonho Nah, Donghyun Koh, Youngjun Cho, Johyun Park, Hojin Yoo, Jiho Park, Haneul Choi, Sungbin Moon, Taehoon Hwang, Seungwon Kim, Jaeyeong Kim, Seongjun Kim, Juneau Jung

Abstract: Recent advances in LLMs have made automated scientific research the next frontline in the path to artificial superintelligence. However, these systems are bound either to tasks of narrow scope or the limited creative capabilities of LLMs. We propose Spacer, a scientific discovery system that develops creative and factually grounded concepts without external intervention. Spacer attempts to achieve this via 'deliberate decontextualization,' an approach that disassembles information into atomic units - keywords - and draws creativity from unexplored connections between them. Spacer consists of (i) Nuri, an inspiration engine that builds keyword sets, and (ii) the Manifesting Pipeline that refines these sets into elaborate scientific statements. Nuri extracts novel, high-potential keyword sets from a keyword graph built with 180,000 academic publications in biological fields. The Manifesting Pipeline finds links between keywords, analyzes their logical structure, validates their plausibility, and ultimately drafts original scientific concepts. According to our experiments, the evaluation metric of Nuri accurately classifies high-impact publications with an AUROC score of 0.737. Our Manifesting Pipeline also successfully reconstructs core concepts from the latest top-journal articles solely from their keyword sets. An LLM-based scoring system estimates that this reconstruction was sound for over 85% of the cases. Finally, our embedding space analysis shows that outputs from Spacer are significantly more similar to leading publications compared with those from SOTA LLMs.

Comment: Spacer proposes a novel system for scientific discovery using LLMs, which aligns with emerging trends in AI for Science.

Relevance: 8 Novelty: 8


16. HypER: Hyperbolic Echo State Networks for Capturing Stretch-and-Fold Dynamics in Chaotic Flows

ArXiv ID: 2508.18196

Authors: Pradeep Singh, Sutirtha Ghosh, Ashutosh Kumar, Hrishit B P, Balasubramanian Raman

Abstract: Forecasting chaotic dynamics beyond a few Lyapunov times is difficult because infinitesimal errors grow exponentially. Existing Echo State Networks (ESNs) mitigate this growth but employ reservoirs whose Euclidean geometry is mismatched to the stretch-and-fold structure of chaos. We introduce the Hyperbolic Embedding Reservoir (HypER), an ESN whose neurons are sampled in the Poincare ball and whose connections decay exponentially with hyperbolic distance. This negative-curvature construction embeds an exponential metric directly into the latent space, aligning the reservoir's local expansion-contraction spectrum with the system's Lyapunov directions while preserving standard ESN features such as sparsity, leaky integration, and spectral-radius control. Training is limited to a Tikhonov-regularized readout. On the chaotic Lorenz-63 and Roessler systems, and the hyperchaotic Chen-Ueta attractor, HypER consistently lengthens the mean valid-prediction horizon beyond Euclidean and graph-structured ESN baselines, with statistically significant gains confirmed over 30 independent runs; parallel results on real-world benchmarks, including heart-rate variability from the Santa Fe and MIT-BIH datasets and international sunspot numbers, corroborate its advantage. We further establish a lower bound on the rate of state divergence for HypER, mirroring Lyapunov growth.

Comment: The paper introduces a novel ESN architecture with hyperbolic geometry, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8


17. Quantum Graph Attention Network: A Novel Quantum Multi-Head Attention Mechanism for Graph Learning

ArXiv ID: 2508.17630

Authors: An Ning, Tai Yue Li, Nan Yow Chen

Abstract: We propose the Quantum Graph Attention Network (QGAT), a hybrid graph neural network that integrates variational quantum circuits into the attention mechanism. At its core, QGAT employs strongly entangling quantum circuits with amplitude-encoded node features to enable expressive nonlinear interactions. Distinct from classical multi-head attention that separately computes each head, QGAT leverages a single quantum circuit to simultaneously generate multiple attention coefficients. This quantum parallelism facilitates parameter sharing across heads, substantially reducing computational overhead and model complexity. Classical projection weights and quantum circuit parameters are optimized jointly in an end-to-end manner, ensuring flexible adaptation to learning tasks. Empirical results demonstrate QGAT's effectiveness in capturing complex structural dependencies and improved generalization in inductive scenarios, highlighting its potential for scalable quantum-enhanced learning across domains such as chemistry, biology, and network analysis. Furthermore, experiments confirm that quantum embedding enhances robustness against feature and structural noise, suggesting advantages in handling real-world noisy data. The modularity of QGAT also ensures straightforward integration into existing architectures, allowing it to easily augment classical attention-based models.

Comment: The Quantum Graph Attention Network introduces a novel quantum multi-head attention mechanism, aligning with model architecture innovations.

Relevance: 8 Novelty: 8


18. Riemannian Optimization for LoRA on the Stiefel Manifold

ArXiv ID: 2508.17901

Authors: Juneyoung Park, Minjae Kang, Seongbae Lee, Haegang Lee, Seongwan Kim, Jaeho Lee

Abstract: While powerful, large language models (LLMs) present significant fine-tuning challenges due to their size. Parameter-efficient fine-tuning (PEFT) methods like LoRA provide solutions, yet suffer from critical optimizer inefficiencies; notably basis redundancy in LoRA's $B$ matrix when using AdamW, which fundamentally limits performance. We address this by optimizing the $B$ matrix on the Stiefel manifold, imposing explicit orthogonality constraints that achieve near-perfect orthogonality and full effective rank. This geometric approach dramatically enhances parameter efficiency and representational capacity. Our Stiefel optimizer consistently outperforms AdamW across benchmarks with both LoRA and DoRA, demonstrating that geometric constraints are the key to unlocking LoRA's full potential for effective LLM fine-tuning.

Comment: The paper discusses a novel optimization method for LoRA, which is relevant to model compression and efficiency improvements.

Relevance: 8 Novelty: 7


19. Dynamic Sparse Attention on Mobile SoCs

ArXiv ID: 2508.16703

Authors: Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu

Abstract: On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity in state-of-the-art frameworks. This fallback results in a degraded user experience and increased complexity in system scheduling. To this end, this paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU by only sparsely calculating the attention on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute. Further, shadowAttn proposes insightful techniques such as NPU compute graph bucketing, head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resource; it requires much less CPU/GPU resource to deliver on-par performance of SoTA frameworks.

Comment: The paper presents a system-algorithm co-designed sparse attention module, relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


20. Module-Aware Parameter-Efficient Machine Unlearning on Transformers

ArXiv ID: 2508.17233

Authors: Wenjie Bao, Jian Lou, Yuke Hu, Xiaochen Li, Zhihao Liu, Jiaqi Liu, Zhan Qin, Kui Ren

Abstract: Transformer has become fundamental to a vast series of pre-trained large models that have achieved remarkable success across diverse applications. Machine unlearning, which focuses on efficiently removing specific data influences to comply with privacy regulations, shows promise in restricting updates to influence-critical parameters. However, existing parameter-efficient unlearning methods are largely devised in a module-oblivious manner, which tends to inaccurately identify these parameters and leads to inferior unlearning performance for Transformers. In this paper, we propose {\tt MAPE-Unlearn}, a module-aware parameter-efficient machine unlearning approach that uses a learnable pair of masks to pinpoint influence-critical parameters in the heads and filters of Transformers. The learning objective of these masks is derived by desiderata of unlearning and optimized through an efficient algorithm featured by a greedy search with a warm start. Extensive experiments on various Transformer models and datasets demonstrate the effectiveness and robustness of {\tt MAPE-Unlearn} for unlearning.

Comment: The paper introduces a module-aware parameter-efficient machine unlearning approach for Transformers, which is relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


21. WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

ArXiv ID: 2508.16676

Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Pingwei Sun, Feiye Huo, Jiayu Qin, Yerui Sun, Yuchen Xie, Xunliang Cai, Xiangyu Zhang, Maoxin He, Guangming Tan, Weile Jia, Tong Zhao

Abstract: Transformer architecture gradually dominates the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) primarily focus on architectural modifications or optimizer adjustments. However, these approaches lack systematic optimization of weight patterns during training. Weight pattern refers to the distribution and relative magnitudes of weight parameters in a neural network. To address this issue, we propose a Weight Scaling method called WISCA to enhance training efficiency and model quality by strategically improving neural network weight patterns without changing network structures. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped Query Attention (GQA) architectures and LoRA fine-tuning tasks. Empirical results show 5.6% average improvement on zero-shot validation tasks and 2.12% average reduction in training perplexity across multiple architectures.

Comment: The paper proposes a weight scaling method to enhance training efficiency and model quality in LLMs, relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


22. Optimizing Neural Networks with Learnable Non-Linear Activation Functions via Lookup-Based FPGA Acceleration

ArXiv ID: 2508.17069

Authors: Mengyuan Yin, Benjamin Chen Ming Choong, Chuping Qu, Rick Siow Mong Goh, Weng-Fai Wong, Tao Luo

Abstract: Learned activation functions in models like Kolmogorov-Arnold Networks (KANs) outperform fixed-activation architectures in terms of accuracy and interpretability; however, their computational complexity poses critical challenges for energy-constrained edge AI deployments. Conventional CPUs/GPUs incur prohibitive latency and power costs when evaluating higher order activations, limiting deployability under ultra-tight energy budgets. We address this via a reconfigurable lookup architecture with edge FPGAs. By coupling fine-grained quantization with adaptive lookup tables, our design minimizes energy-intensive arithmetic operations while preserving activation fidelity. FPGA reconfigurability enables dynamic hardware specialization for learned functions, a key advantage for edge systems that require post-deployment adaptability. Evaluations using KANs - where unique activation functions play a critical role - demonstrate that our FPGA-based design achieves superior computational speed and over $10^4$ times higher energy efficiency compared to edge CPUs and GPUs, while maintaining matching accuracy and minimal footprint overhead. This breakthrough positions our approach as a practical enabler for energy-critical edge AI, where computational intensity and power constraints traditionally preclude the use of adaptive activation networks.

Comment: The paper discusses optimizing neural networks with learnable activation functions via FPGA acceleration, which is relevant to model architecture and efficiency.

Relevance: 8 Novelty: 7


23. Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry

ArXiv ID: 2508.17081

Authors: Haoyu Yun, Hamid Krim

Abstract: The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT's optimization remains confined to modeling local relationships within individual images, limiting its ability to capture the global geometric relationships between data points. To address this limitation, this paper proposes a novel framework that integrates ViT with the proximal tools, enabling a unified geometric optimization approach to enhance feature representation and classification performance. In this framework, ViT constructs the tangent bundle of the manifold through its self-attention mechanism, where each attention head corresponds to a tangent space, offering geometric representations from diverse local perspectives. Proximal iterations are then introduced to define sections within the tangent bundle and project data from tangent spaces onto the base space, achieving global feature alignment and optimization. Experimental results confirm that the proposed method outperforms traditional ViT in terms of classification accuracy and data distribution.

Comment: The paper proposes a novel framework integrating ViT with proximal tools for enhanced feature representation, relevant to model architecture.

Relevance: 8 Novelty: 7


24. Deep Learning for Markov Chains: Lyapunov Functions, Poisson's Equation, and Stationary Distributions

ArXiv ID: 2508.16737

Authors: Yanlin Qu, Jose Blanchet, Peter Glynn

Abstract: Lyapunov functions are fundamental to establishing the stability of Markovian models, yet their construction typically demands substantial creativity and analytical effort. In this paper, we show that deep learning can automate this process by training neural networks to satisfy integral equations derived from first-transition analysis. Beyond stability analysis, our approach can be adapted to solve Poisson's equation and estimate stationary distributions. While neural networks are inherently function approximators on compact domains, it turns out that our approach remains effective when applied to Markov chains on non-compact state spaces. We demonstrate the effectiveness of this methodology through several examples from queueing theory and beyond.

Comment: The paper explores the use of deep learning to automate the construction of Lyapunov functions for Markov chains, which is a foundational contribution to representation learning and theoretical insights into neural networks.

Relevance: 8 Novelty: 7


25. Learn to Memorize: Optimizing LLM-based Agents with Adaptive Memory Framework

ArXiv ID: 2508.16629

Authors: Zeyu Zhang, Quanyu Dai, Rui Li, Xiaohe Bo, Xu Chen, Zhenhua Dong

Abstract: LLM-based agents have been extensively applied across various domains, where memory stands out as one of their most essential capabilities. Previous memory mechanisms of LLM-based agents are manually predefined by human experts, leading to higher labor costs and suboptimal performance. In addition, these methods overlook the memory cycle effect in interactive scenarios, which is critical to optimizing LLM-based agents for specific environments. To address these challenges, in this paper, we propose to optimize LLM-based agents with an adaptive and data-driven memory framework by modeling memory cycles. Specifically, we design an MoE gate function to facilitate memory retrieval, propose a learnable aggregation process to improve memory utilization, and develop task-specific reflection to adapt memory storage. Our memory framework empowers LLM-based agents to learn how to memorize information effectively in specific environments, with both off-policy and on-policy optimization. In order to evaluate the effectiveness of our proposed methods, we conduct comprehensive experiments across multiple aspects. To benefit the research community in this area, we release our project at https://github.com/nuster1128/learn_to_memorize.

Comment: The paper proposes an adaptive memory framework for LLM-based agents using MoE, which is relevant to model architecture and representation learning.

Relevance: 8 Novelty: 7


26. CP4SBI: Local Conformal Calibration of Credible Sets in Simulation-Based Inference

ArXiv ID: 2508.17077

Authors: Luben M. C. Cabezas, Vagner S. Santos, Thiago R. Ramos, Pedro L. C. Rodrigues, Rafael Izbicki

Abstract: Current experimental scientists have been increasingly relying on simulation-based inference (SBI) to invert complex non-linear models with intractable likelihoods. However, posterior approximations obtained with SBI are often miscalibrated, causing credible regions to undercover true parameters. We develop $\texttt{CP4SBI}$, a model-agnostic conformal calibration framework that constructs credible sets with local Bayesian coverage. Our two proposed variants, namely local calibration via regression trees and CDF-based calibration, enable finite-sample local coverage guarantees for any scoring function, including HPD, symmetric, and quantile-based regions. Experiments on widely used SBI benchmarks demonstrate that our approach improves the quality of uncertainty quantification for neural posterior estimators using both normalizing flows and score-diffusion modeling.

Comment: The paper introduces a model-agnostic conformal calibration framework for simulation-based inference, which is a foundational contribution to AI for Science.

Relevance: 8 Novelty: 7


27. CrystalDiT: A Diffusion Transformer for Crystal Generation

ArXiv ID: 2508.16614

Authors: Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao

Abstract: We present CrystalDiT, a diffusion transformer for crystal structure generation that achieves state-of-the-art performance by challenging the trend of architectural complexity. Instead of intricate, multi-stream designs, CrystalDiT employs a unified transformer that imposes a powerful inductive bias: treating lattice and atomic properties as a single, interdependent system. Combined with a periodic table-based atomic representation and a balanced training strategy, our approach achieves 9.62% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming recent methods including FlowMM (4.38%) and MatterGen (3.42%). Notably, CrystalDiT generates 63.28% unique and novel structures while maintaining comparable stability rates, demonstrating that architectural simplicity can be more effective than complexity for materials discovery. Our results suggest that in data-limited scientific domains, carefully designed simple architectures outperform sophisticated alternatives that are prone to overfitting.

Comment: The paper presents CrystalDiT, a diffusion transformer for crystal generation, focusing on architectural simplicity over complexity, which aligns with model architecture innovations.

Relevance: 8 Novelty: 7


28. TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving

ArXiv ID: 2508.17219

Authors: Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin

Abstract: Prefix caching is crucial to accelerate multi-turn interactions and requests with shared prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling to optimize cache efficiency and computation performance together, leading to load imbalance, data redundancy, and memory fragmentation of caching systems across instances. To address these issues, memory pooling is promising to shield the scheduler from the underlying cache management so that it can focus on the computation optimization. However, because existing prefix caching systems only transfer increasingly longer prefix caches between instances, they cannot achieve low-latency memory pooling. To address these problems, we propose a unified segment-level prefix cache pool, TokenLake. It uses a declarative cache interface to expose requests' query tensors, prefix caches, and cache-aware operations to TokenLake for efficient pooling. Powered by this abstraction, TokenLake can manage prefix cache at the segment level with a heavy-hitter-aware load balancing algorithm to achieve better cache load balance, deduplication, and defragmentation. TokenLake also transparently minimizes the communication volume of query tensors and new caches. Based on TokenLake, the scheduler can schedule requests elastically by using existing techniques without considering prefix cache management. Evaluations on real-world workloads show that TokenLake can improve throughput by up to 2.6$\times$ and 2.0$\times$ and boost hit rate by 2.0$\times$ and 2.1$\times$, compared to state-of-the-art cache-aware routing and cache-centric PD-disaggregation solutions, respectively.

Comment: TokenLake introduces a novel segment-level prefix cache pool for LLMs, which is relevant to model compression and efficiency improvements.

Relevance: 8 Novelty: 7


29. Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks

ArXiv ID: 2508.17744

Authors: Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto

Abstract: In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. We consistently observe across 6 state-of-the-art text encoders and 26 downstream tasks, that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance, less than 10%, in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.

Comment: The paper studies the impact of truncating text embeddings, providing insights into representation learning and model efficiency.

Relevance: 8 Novelty: 7


30. Multimodal Representation Learning Conditioned on Semantic Relations

ArXiv ID: 2508.17497

Authors: Yang Qiao, Yuntong Hu, Liang Zhao

Abstract: Multimodal representation learning has advanced rapidly with contrastive models such as CLIP, which align image-text pairs in a shared embedding space. However, these models face limitations: (1) they typically focus on image-text pairs, underutilizing the semantic relations across different pairs. (2) they directly match global embeddings without contextualization, overlooking the need for semantic alignment along specific subspaces or relational dimensions; and (3) they emphasize cross-modal contrast, with limited support for intra-modal consistency. To address these issues, we propose Relation-Conditioned Multimodal Learning RCML, a framework that learns multimodal representations under natural-language relation descriptions to guide both feature extraction and alignment. Our approach constructs many-to-many training pairs linked by semantic relations and introduces a relation-guided cross-attention mechanism that modulates multimodal representations under each relation context. The training objective combines inter-modal and intra-modal contrastive losses, encouraging consistency across both modalities and semantically related samples. Experiments on different datasets show that RCML consistently outperforms strong baselines on both retrieval and classification tasks, highlighting the effectiveness of leveraging semantic relations to guide multimodal representation learning.

Comment: The paper introduces a new framework for multimodal representation learning, focusing on semantic relations and cross-attention mechanisms, which aligns with representation learning insights.

Relevance: 8 Novelty: 7


31. Confidence-Modulated Speculative Decoding for Large Language Models

ArXiv ID: 2508.15371

Authors: Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

Abstract: Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.

Comment: The paper presents a novel speculative decoding method for LLMs, focusing on efficiency improvements, which aligns with model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 7


32. Steering When Necessary: Flexible Steering Large Language Models with Backtracking

ArXiv ID: 2508.17621

Authors: Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu

Abstract: Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically indiscriminately intervene to all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLMs during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose the backtracking mechanism to correct the deviated tokens and steer the LLMs toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.

Comment: The paper proposes a framework for flexible activation steering in LLMs, focusing on aligning model behavior, which is relevant to model architecture and efficiency improvements.

Relevance: 8 Novelty: 7


33. How Quantization Shapes Bias in Large Language Models

ArXiv ID: 2508.18088

Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych

Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.

Comment: The paper evaluates how quantization affects bias in large language models, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


34. Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics

ArXiv ID: 2508.18130

Authors: Pradeep Singh, Mehak Sharma, Anupriya Dey, Balasubramanian Raman

Abstract: Transformers are the de-facto choice for sequence modelling, yet their quadratic self-attention and weak temporal bias can make long-range forecasting both expensive and brittle. We introduce FreezeTST, a lightweight hybrid that interleaves frozen random-feature (reservoir) blocks with standard trainable Transformer layers. The frozen blocks endow the network with rich nonlinear memory at no optimisation cost; the trainable layers learn to query this memory through self-attention. The design cuts trainable parameters and also lowers wall-clock training time, while leaving inference complexity unchanged. On seven standard long-term forecasting benchmarks, FreezeTST consistently matches or surpasses specialised variants such as Informer, Autoformer, and PatchTST; with substantially lower compute. Our results show that embedding reservoir principles within Transformers offers a simple, principled route to efficient long-term time-series prediction.

Comment: The paper introduces a hybrid transformer architecture for time series, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


35. Disentangling the Factors of Convergence between Brains and Computer Vision Models

ArXiv ID: 2508.18226

Authors: Jos\'ephine Raugel, Marc Szafraniec, Huy V. Vo, Camille Couprie, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean-R\'emi King

Abstract: Many AI models trained on natural images develop representations that resemble those of the human brain. However, the factors that drive this brain-model similarity remain poorly understood. To disentangle how the model, training and data independently lead a neural network to develop brain-like representations, we trained a family of self-supervised vision transformers (DINOv3) that systematically varied these different factors. We compare their representations of images to those of the human brain recorded with both fMRI and MEG, providing high resolution in spatial and temporal analyses. We assess the brain-model similarity with three complementary metrics focusing on overall representational similarity, topographical organization, and temporal dynamics. We show that all three factors - model size, training amount, and image type - independently and interactively impact each of these brain similarity metrics. In particular, the largest DINOv3 models trained with the most human-centric images reach the highest brain-similarity. This emergence of brain-like representations in AI models follows a specific chronology during training: models first align with the early representations of the sensory cortices, and only align with the late and prefrontal representations of the brain with considerably more training. Finally, this developmental trajectory is indexed by both structural and functional properties of the human cortex: the representations that are acquired last by the models specifically align with the cortical areas with the largest developmental expansion, thickness, least myelination, and slowest timescales. Overall, these findings disentangle the interplay between architecture and experience in shaping how artificial neural networks come to see the world as humans do, thus offering a promising framework to understand how the human brain comes to represent its visual world.

Comment: The paper explores the convergence between brain and computer vision models, providing insights into representation learning and model architecture.

Relevance: 8 Novelty: 7


36. Bridging Foundation Models and Efficient Architectures: A Modular Brain Imaging Framework with Local Masking and Pretrained Representation Learning

ArXiv ID: 2508.16597

Authors: Yanwen Wang, Xinglin Zhao, Yijin Song, Xiaobo Liu, Yanrong Hao, Rui Cao, Xin Wen

Abstract: Functional connectivity (FC) derived from resting-state fMRI plays a critical role in personalized predictions such as age and cognitive performance. However, applying foundation models(FM) to fMRI data remains challenging due to its high dimensionality, computational complexity, and the difficulty in capturing complex spatiotemporal dynamics and indirect region-of-interest (ROI) interactions. To address these limitations, we propose a modular neuroimaging framework that integrates principles from FM with efficient, domain-specific architectures. Our approach begins with a Local Masked Autoencoder (LMAE) for pretraining, which reduces the influence of hemodynamic response function (HRF) dynamics and suppresses noise. This is followed by a Random Walk Mixture of Experts (RWMOE) module that clusters features across spatial and temporal dimensions, effectively capturing intricate brain interactions. Finally, a state-space model (SSM)-based predictor performs downstream task inference. Evaluated on the Cambridge Centre for Ageing and Neuroscience (Cam-CAN) dataset, our framework achieved mean absolute errors (MAEs) of 5.343 for age prediction and 2.940 for fluid intelligence, with Pearson correlation coefficients (PCCs) of 0.928 and 0.887, respectively-outperforming existing state-of-the-art methods. Visualization of expert distribution weights further enhances interpretability by identifying key brain regions. This work provides a robust, interpretable alternative to LLM-based approaches for fMRI analysis, offering novel insights into brain aging and cognitive function.

Comment: The paper presents a modular framework integrating foundation models with efficient architectures, including a Random Walk Mixture of Experts module, which aligns with model architecture and efficiency topics.

Relevance: 8 Novelty: 7


37. Training Transformers for Mesh-Based Simulations

ArXiv ID: 2508.18051

Authors: Paul Garnier, Vincent Lannelongue, Jonathan Viquerat, Elie Hachem

Abstract: Simulating physics using Graph Neural Networks (GNNs) is predominantly driven by message-passing architectures, which face challenges in scaling and efficiency, particularly in handling large, complex meshes. These architectures have inspired numerous enhancements, including multigrid approaches and $K$-hop aggregation (using neighbours of distance $K$), yet they often introduce significant complexity and suffer from limited in-depth investigations. In response to these challenges, we propose a novel Graph Transformer architecture that leverages the adjacency matrix as an attention mask. The proposed approach incorporates innovative augmentations, including Dilated Sliding Windows and Global Attention, to extend receptive fields without sacrificing computational efficiency. Through extensive experimentation, we evaluate model size, adjacency matrix augmentations, positional encoding and $K$-hop configurations using challenging 3D computational fluid dynamics (CFD) datasets. We also train over 60 models to find a scaling law between training FLOPs and parameters. The introduced models demonstrate remarkable scalability, performing on meshes with up to 300k nodes and 3 million edges. Notably, the smallest model achieves parity with MeshGraphNet while being $7\times$ faster and $6\times$ smaller. The largest model surpasses the previous state-of-the-art by $38.8$\% on average and outperforms MeshGraphNet by $52$\% on the all-rollout RMSE, while having a similar training speed. Code and datasets are available at https://github.com/DonsetPG/graph-physics.

Comment: The paper proposes a novel Graph Transformer architecture for mesh-based simulations, focusing on architectural innovations and efficiency, which aligns with model architecture topics.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.