Personalized Daily ArXiv Papers 2025-08-15

[gpt-4o]	Prompt	Completion	Total
Token	41896	4561	46457
Cost	$0.1	$0.05	$0.15

Total arXiv papers: 495

Total scanned papers: 291

Total relevant papers: 23

Table of contents with paper titles:

A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models Authors: Sridhar Mahadevan
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization Authors: Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Memorisation and forgetting in a learning Hopfield neural network: bifurcation mechanisms, attractors and basins Authors: Adam E. Essex (Loughborough University, England), Natalia B. Janson (Loughborough University, England), Rachel A. Norris (Loughborough University, England), Alexander G. Balanov (Loughborough University, England)
Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts Authors: Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho
Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning Authors: Li Wang, Changhao Zhang, Zengqi Xiu, Kai Lu, Xin Yu, Kui Zhang, Wenjun Wu
Constrained Decoding of Diffusion LLMs with Context-Free Grammars Authors: Niels Mündler, Jasper Dekoninck, Martin Vechev
DINOv3 Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
Why Cannot Large Language Models Ever Make True Correct Reasoning? Authors: Jingde Cheng
Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets Authors: Nicolas Lapautre, Maria Marchenko, Carlos Miguel Patiño, Xin Zhou
On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations Authors: Amir Mehrpanah, Matteo Gamba, Kevin Smith, Hossein Azizpour
IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data Authors: Dong Xu, Zhangfan Yang, Jenna Xinyi Yao, Shuangbao Song, Zexuan Zhu, Junkai Ji
PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks Authors: Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang
Unpacking the Implicit Norm Dynamics of Sharpness-Aware Minimization in Tensorized Models Authors: Tianxiao Cao, Kyohei Atarashi, Hisashi Kashima
X-Node: Self-Explanation is All We Need Authors: Prajit Sengupta, Islem Rekik
SABER: Switchable and Balanced Training for Efficient LLM Reasoning Authors: Kai Zhao, Yanjun Zhao, Jiaming Song, Shien He, Lusheng Zhang, Qiang Zhang, Tianjiao Li
When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models Authors: Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang
Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization Authors: Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, Mengdi Zhang
RTTC: Reward-Guided Collaborative Test-Time Compute Authors: J. Pablo Muñoz, Jinjie Yuan
Graph Learning via Logic-Based Weisfeiler-Leman Variants and Tabularization Authors: Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Magdalena Ortiz, Matias Selin, Mantas Šimkus
Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models Authors: Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou
xRFM: Accurate, scalable, and interpretable feature learning models for tabular data Authors: Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin
SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning Authors: Weijian Mai, Jiamin Wu, Yu Zhu, Zhouheng Yao, Dongzhan Zhou, Andrew F. Luo, Qihao Zheng, Wanli Ouyang, Chunfeng Song
Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction Authors: Luyao Tang, Kunze Huang, Chaoqi Chen, Yuxuan Yuan, Chenxin Li, Xiaotong Tu, Xinghao Ding, Yue Huang

1. A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models

ArXiv ID: 2508.10018

Authors: Sridhar Mahadevan

Abstract: Natural language is replete with superficially different statements, such as "Charles Darwin wrote" and "Charles Darwin is the author of", which carry the same meaning. Large language models (LLMs) should generate the same next-token probabilities in such cases, but usually do not. Empirical workarounds have been explored, such as using k-NN estimates of sentence similarity to produce smoothed estimates. In this paper, we tackle this problem more abstractly, introducing a categorical homotopy framework for LLMs. We introduce an LLM Markov category to represent probability distributions in language generated by an LLM, where the probability of a sentence, such as "Charles Darwin wrote", is defined by an arrow in a Markov category. However, this approach runs into difficulties as language is full of equivalent rephrases, and each generates a non-isomorphic arrow in the LLM Markov category. To address this fundamental problem, we use categorical homotopy techniques to capture "weak equivalences" in an LLM Markov category. We present a detailed overview of the application of categorical homotopy to LLMs, from higher algebraic K-theory to model categories, building on powerful theoretical results developed over the past half a century.

Comment: The paper introduces a categorical homotopy framework for LLMs, offering a novel theoretical perspective on language model behavior.

Relevance: 9 Novelty: 9

2. XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

ArXiv ID: 2508.10395

Authors: Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Abstract: Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2$\times$ memory savings compared to KV caching. By applying XQuant, we achieve up to $\sim 7.7\times$ memory savings with $<0.1$ perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10$\times$ memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5$\times$ memory savings with only $0.1$ perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.
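The caching idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the per-row uniform quantizer, the 8-bit width, the toy dimensions, and the helper names (`quantize_row`, `rematerialize`) are all assumptions made for illustration, and the 2x element count assumes the Key and Value projections have the same width as X.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def quantize_row(row, bits):
    """Uniform per-row quantization (an assumed scheme): integer codes, offset, scale."""
    lo, hi = min(row), max(row)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid a zero scale for constant rows
    return [round((v - lo) / scale) for v in row], lo, scale

def dequantize_row(codes, lo, scale):
    return [lo + c * scale for c in codes]

random.seed(0)
d_model, n_tokens, bits = 4, 6, 8  # toy sizes, chosen only for the example
W_k = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_model)]
W_v = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_model)]
X = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_tokens)]

# Standard KV caching stores both K and V for every token.
K_ref, V_ref = matmul(X, W_k), matmul(X, W_v)
kv_cache_elems = sum(len(r) for r in K_ref) + sum(len(r) for r in V_ref)

# X caching stores only the quantized layer input.
x_cache = [quantize_row(row, bits) for row in X]
x_cache_elems = sum(len(codes) for codes, _, _ in x_cache)

def rematerialize(cache, w_k, w_v):
    """Recompute Keys and Values on the fly from the cached, dequantized X."""
    x_hat = [dequantize_row(*entry) for entry in cache]
    return matmul(x_hat, w_k), matmul(x_hat, w_v)

K_hat, V_hat = rematerialize(x_cache, W_k, W_v)
max_err = max(abs(a - b) for ref, hat in ((K_ref, K_hat), (V_ref, V_hat))
              for r_row, h_row in zip(ref, hat) for a, b in zip(r_row, h_row))
```

The sketch trades two extra matrix multiplications per step for storing half as many cached elements, before any savings from the lower bit width.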

Comment: The paper introduces a novel approach to reduce memory consumption in LLM inference using quantization and rematerialization, relevant to model compression.

Relevance: 9 Novelty: 8

3. Memorisation and forgetting in a learning Hopfield neural network: bifurcation mechanisms, attractors and basins

ArXiv ID: 2508.10765

Authors: Adam E. Essex (Loughborough University, England), Natalia B. Janson (Loughborough University, England), Rachel A. Norris (Loughborough University, England), Alexander G. Balanov (Loughborough University, England)

Abstract: Despite explosive expansion of artificial intelligence based on artificial neural networks (ANNs), these are employed as "black boxes", as it is unclear how, during learning, they form memories or develop unwanted features, including spurious memories and catastrophic forgetting. Much research is available on isolated aspects of learning ANNs, but due to their high dimensionality and non-linearity, their comprehensive analysis remains a challenge. In ANNs, knowledge is thought to reside in connection weights or in attractor basins, but these two paradigms are not linked explicitly. Here we comprehensively analyse mechanisms of memory formation in an 81-neuron Hopfield network undergoing Hebbian learning by revealing bifurcations leading to formation and destruction of attractors and their basin boundaries. We show that, by affecting evolution of connection weights, the applied stimuli induce a pitchfork and then a cascade of saddle-node bifurcations creating new attractors with their basins that can code true or spurious memories, and an abrupt disappearance of old memories (catastrophic forgetting). With successful learning, new categories are represented by the basins of newly born point attractors, and their boundaries by the stable manifolds of new saddles. With this, memorisation and forgetting represent two manifestations of the same mechanism. Our strategy to analyse high-dimensional learning ANNs is universal and applicable to recurrent ANNs of any form. The demonstrated mechanisms of memory formation and of catastrophic forgetting shed light on the operation of a wider class of recurrent ANNs and could aid the development of approaches to mitigate their flaws.
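For readers unfamiliar with attractor memories, the sketch below shows the textbook discrete Hopfield network with a one-shot Hebbian outer-product rule. This is a deliberately simpler stand-in, not the model analysed in the paper, which studies weights evolving during learning and the resulting bifurcations. Only the network size of 81 is taken from the abstract; the two stripe patterns and the function names are assumptions.

```python
N = 81  # network size from the abstract; the rest of the setup is simplified

def sign(h):
    return 1 if h >= 0 else -1

# Two hypothetical stored patterns on a 9x9 grid: a checkerboard and row stripes.
pattern_a = [1 if i % 2 == 0 else -1 for i in range(N)]
pattern_b = [1 if (i // 9) % 2 == 0 else -1 for i in range(N)]

def hebbian_weights(patterns):
    """Outer-product Hebbian rule with zero self-connections."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, sweeps=5):
    """Asynchronous updates: each neuron aligns with its local field in turn."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(w[i][j] * state[j] for j in range(len(state)))
            state[i] = sign(field)
    return state

weights = hebbian_weights([pattern_a, pattern_b])
probe = [-s for s in pattern_a[:8]] + pattern_a[8:]  # corrupt the first 8 neurons
recalled = recall(weights, probe)
```

Here the corrupted probe lies inside the basin of the attractor that encodes `pattern_a`, so the dynamics restore the stored pattern.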

Comment: The paper provides a comprehensive analysis of memory formation in Hopfield networks, relevant to foundational research in neural network behavior.

Relevance: 9 Novelty: 8

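4. Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

ArXiv ID: 2508.10009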

Authors: Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho

Abstract: Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
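The routing idea in this abstract, where a guiding token selects a designated expert instead of a trained gate, can be sketched as follows. The token names `<asr>` and `<st>`, the two-dimensional toy weights, and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical guiding tokens mapped to expert indices; no gating network is trained.
TASK_TO_EXPERT = {"<asr>": 0, "<st>": 1}

# Each expert is a separate feedforward network, stored as (w_in, w_out) toy weights.
EXPERTS = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),  # expert for ASR
    ([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]),  # expert for speech translation
]

def ffn(x, w_in, w_out):
    """Two-layer feedforward block with a ReLU in between."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def smoe_layer(x, task_token):
    """Route the input to the expert designated by its guiding token."""
    w_in, w_out = EXPERTS[TASK_TO_EXPERT[task_token]]
    return ffn(x, w_in, w_out)

asr_out = smoe_layer([1.0, -2.0], "<asr>")
st_out = smoe_layer([1.0, -2.0], "<st>")
```

Because the route is fixed by the token, the two tasks never update or read each other's feedforward weights, which is the source of the reduced task interference described above.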

Comment: The paper proposes a Supervised Mixture of Experts (S-MoE) model, which aligns with the model architecture criterion by introducing a novel approach to MoE.

Relevance: 9 Novelty: 8

5. Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning

ArXiv ID: 2508.10019

Authors: Li Wang, Changhao Zhang, Zengqi Xiu, Kai Lu, Xin Yu, Kui Zhang, Wenjun Wu

Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., $\leq$ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.

Comment: The paper introduces a framework for improving reasoning in small language models by decoupling understanding from reasoning, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 8

6. Constrained Decoding of Diffusion LLMs with Context-Free Grammars

ArXiv ID: 2508.10111

Authors: Niels Mündler, Jasper Dekoninck, Martin Vechev

Abstract: Large language models (LLMs) have shown promising performance across diverse domains. Many practical applications of LLMs, such as code completion and structured data extraction, require adherence to syntactic constraints specified by a formal language. Yet, due to their probabilistic nature, LLM output is not guaranteed to adhere to such formal languages. Prior work has proposed constrained decoding as a means to restrict LLM generation to particular formal languages. However, existing works are not applicable to the emerging paradigm of diffusion LLMs, when used in practical scenarios such as the generation of formally correct C++ or JSON output. In this paper we address this challenge and present the first constrained decoding method for diffusion models, one that can handle formal languages captured by context-free grammars. We begin by reducing constrained decoding to the more general additive infilling problem, which asks whether a partial output can be completed to a valid word in the target language. This problem also naturally subsumes the previously unaddressed multi-region infilling constrained decoding. We then reduce this problem to the task of deciding whether the intersection of the target language and a regular language is empty and present an efficient algorithm to solve it for context-free languages. Empirical results on various applications, such as C++ code infilling and structured data extraction in JSON, demonstrate that our method achieves near-perfect syntactic correctness while consistently preserving or improving functional correctness. Importantly, our efficiency optimizations ensure that the computational overhead remains practical.
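The reduction described in this abstract, from "can this partial output be completed?" to "is the intersection of a context-free language and a regular language non-empty?", can be illustrated with a textbook fixpoint over (nonterminal, state, state) triples. This is a generic sketch of that classical check and not the paper's optimized algorithm. The toy balanced-parentheses grammar, the `HOLE` marker, and the function names are assumptions.

```python
HOLE = None  # marks a gap that may be filled with any string of symbols

# Toy grammar in Chomsky normal form for non-empty balanced parentheses.
TERMINAL_RULES = [("L", "("), ("R", ")")]
BINARY_RULES = [("S", "S", "S"), ("S", "L", "R"), ("S", "L", "T"), ("T", "S", "R")]
ALPHABET = ["(", ")"]

def template_automaton(template):
    """Turn a partial output into NFA transitions (state, symbol, state)."""
    transitions, state = set(), 0
    for item in template:
        if item is HOLE:
            # A gap accepts any string: self-loops on every symbol.
            transitions.update((state, a, state) for a in ALPHABET)
        else:
            transitions.add((state, item, state + 1))
            state += 1
    return transitions, state  # the last state reached is the accepting one

def completable(template, start="S"):
    """True if some filling of the gaps yields a word of the grammar."""
    transitions, final = template_automaton(template)
    # (A, p, q): nonterminal A derives a string that moves the NFA from p to q.
    facts = {(a, p, q) for a, sym in TERMINAL_RULES
             for p, s, q in transitions if s == sym}
    changed = True
    while changed:
        changed = False
        for head, left, right in BINARY_RULES:
            new = {(head, p, q)
                   for (b, p, r) in facts if b == left
                   for (c, r2, q) in facts if c == right and r2 == r}
            if not new <= facts:
                facts |= new
                changed = True
    return (start, 0, final) in facts
```

For example, `completable(["(", HOLE, ")"])` holds because the gap can stay empty, while `completable([")", HOLE])` fails because no balanced string starts with a closing parenthesis. A decoder could use such a check to reject token choices that make the partial output impossible to complete.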

Comment: The paper presents a constrained decoding method for diffusion LLMs using context-free grammars, which is relevant to foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8

7. DINOv3

ArXiv ID: 2508.10104

Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

Comment: The paper introduces DINOv3, a self-supervised learning model that enhances visual representation learning, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8

8. Why Cannot Large Language Models Ever Make True Correct Reasoning?

ArXiv ID: 2508.10265

Authors: Jingde Cheng

Abstract: Recently, with the application progress of AIGC tools based on large language models (LLMs), led by ChatGPT, many AI experts and more non-professionals are trumpeting the "understanding ability" and "reasoning ability" of the LLMs. The present author considers that the so-called "understanding ability" and "reasoning ability" of LLMs are just illusions held by people with vague concepts. In fact, the LLMs can never have the true understanding ability and true reasoning ability. This paper intends to explain that, because of the essential limitations of their working principle, the LLMs can never have the ability of true correct reasoning.

Comment: The paper critiques the reasoning abilities of LLMs, providing theoretical insights into their limitations, aligning with the large language models criterion.

Relevance: 9 Novelty: 8

9. Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets

ArXiv ID: 2508.10758

Authors: Nicolas Lapautre, Maria Marchenko, Carlos Miguel Patiño, Xin Zhou

Abstract: Unlocking the potential of transformers on datasets of large physical systems depends on overcoming the quadratic scaling of the attention mechanism. This work explores combining the Erwin architecture with the Native Sparse Attention (NSA) mechanism to improve the efficiency and receptive field of transformer models for large-scale physical systems, addressing the challenge of quadratic attention complexity. We adapt the NSA mechanism for non-sequential data, implement the Erwin NSA model, and evaluate it on three datasets from the physical sciences -- cosmology simulations, molecular dynamics, and air pressure modeling -- achieving performance that matches or exceeds that of the original Erwin model. Additionally, we reproduce the experimental results from the Erwin paper to validate their implementation.

Comment: The paper explores sparse attention mechanisms in transformers, which is relevant to model architecture and efficiency improvements.

Relevance: 9 Novelty: 7

10. On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations

ArXiv ID: 2508.10490

Authors: Amir Mehrpanah, Matteo Gamba, Kevin Smith, Hossein Azizpour

Abstract: ReLU networks, while prevalent for visual data, have sharp transitions, sometimes relying on individual pixels for predictions, making vanilla gradient-based explanations noisy and difficult to interpret. Existing methods, such as GradCAM, smooth these explanations by producing surrogate models at the cost of faithfulness. We introduce a unifying spectral framework to systematically analyze and quantify smoothness, faithfulness, and their trade-off in explanations. Using this framework, we quantify and regularize the contribution of ReLU networks to high-frequency information, providing a principled approach to identifying this trade-off. Our analysis characterizes how surrogate-based smoothing distorts explanations, leading to an "explanation gap" that we formally define and measure for different post-hoc methods. Finally, we validate our theoretical findings across different design choices, datasets, and ablations.

Comment: The paper introduces a spectral framework to analyze gradient-based explanations in ReLU networks, which is relevant to foundational research in model interpretability.

Relevance: 8 Novelty: 8

11. IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data

ArXiv ID: 2508.10775

Authors: Dong Xu, Zhangfan Yang, Jenna Xinyi Yao, Shuangbao Song, Zexuan Zhu, Junkai Ji

Abstract: Three-dimensional generative models increasingly drive structure-based drug discovery, yet it remains constrained by the scarce publicly available protein-ligand complexes. Under such data scarcity, almost all existing pipelines struggle to learn transferable geometric priors and consequently overfit to training-set biases. As such, we present IBEX, an Information-Bottleneck-EXplored coarse-to-fine pipeline to tackle the chronic shortage of protein-ligand complex data in structure-based drug design. Specifically, we use PAC-Bayesian information-bottleneck theory to quantify the information density of each sample. This analysis reveals how different masking strategies affect generalization and indicates that, compared with conventional de novo generation, the constrained Scaffold Hopping task endows the model with greater effective capacity and improved transfer performance. IBEX retains the original TargetDiff architecture and hyperparameters for training to generate molecules compatible with the binding pocket; it then applies an L-BFGS optimization step to finely refine each conformation by optimizing five physics-based terms and adjusting six translational and rotational degrees of freedom in under one second. With only these modifications, IBEX raises the zero-shot docking success rate on CBGBench CrossDocked2020-based from 53% to 64%, improves the mean Vina score from $-7.41 kcal mol^{-1}$ to $-8.07 kcal mol^{-1}$, and achieves the best median Vina energy in 57 of 100 pockets versus 3 for the original TargetDiff. IBEX also increases the QED by 25%, achieves state-of-the-art validity and diversity, and markedly reduces extrapolation error.

Comment: The paper presents IBEX, a method for molecular generation using information bottleneck theory, aligning with AI for Science and representation learning in molecular modeling.

Relevance: 8 Novelty: 8

12. PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks

ArXiv ID: 2508.10557

Authors: Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang

Abstract: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model's quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction while fine-tuning fewer weights.
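The layer-selection rule in this abstract can be sketched in a few lines. The mean-squared-error discrepancy measure, the 50% split, the layer names, and the numbers below are assumptions for illustration; the abstract does not specify how the discrepancy is computed.

```python
def mse(reference, quantized):
    """Assumed discrepancy measure: mean squared error between layer outputs."""
    return sum((a - b) ** 2 for a, b in zip(reference, quantized)) / len(reference)

def split_layers(discrepancy, qat_fraction=0.5):
    """Pick the layers with the SMALLEST discrepancy for QAT; the rest stay frozen under PTQ."""
    ranked = sorted(discrepancy, key=discrepancy.get)  # ascending discrepancy
    n_qat = max(1, round(len(ranked) * qat_fraction))
    return ranked[:n_qat], ranked[n_qat:]

# Hypothetical per-layer discrepancies between full-precision and quantized outputs.
layer_discrepancy = {"conv1": 0.30, "conv2": 0.05, "attn": 0.12, "head": 0.01}
qat_layers, ptq_layers = split_layers(layer_discrepancy)
```

With these made-up numbers, `head` and `conv2` are fine-tuned with QAT while `attn` and `conv1` are only post-training quantized, matching the counterintuitive choice described in the abstract.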

Comment: The paper presents a novel hybrid quantization algorithm, which is relevant to model compression through quantization.