Personalized Daily ArXiv Papers 2025-08-04

[gpt-4o]	Prompt	Completion	Total
Token	28725	3404	32129
Cost	$0.07	$0.03	$0.11

Total arXiv papers: 381

Total scanned papers: 237

Total relevant papers: 17

Table of contents with paper titles:

1. MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations
   Authors: Qiyao Xue, Yuchen Dou, Ryan Shi, Xiang Lorraine Li, Wei Gao
2. Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management
   Authors: Ping Chen, Zhuohong Deng, Ping Li, Shuibing He, Hongzi Zhu, Yi Zheng, Zhefeng Wang, Baoxing Huai, Minyi Guo
3. Learning to optimize with guarantees: a complete characterization of linearly convergent algorithms
   Authors: Andrea Martin, Ian R. Manchester, Luca Furieri
4. Embryology of a Language Model
   Authors: George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet
5. Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
   Authors: Ziqian Zhong, Aditi Raghunathan
6. Systematic Evaluation of Optimization Techniques for Long-Context Language Models
   Authors: Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar
7. Thinking Machines: Mathematical Reasoning in the Age of LLMs
   Authors: Andrea Asperti, Alberto Naibo, Claudio Sacerdoti Coen
8. FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems
   Authors: Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun
9. Separated-Variable Spectral Neural Networks: A Physics-Informed Learning Approach for High-Frequency PDEs
   Authors: Xiong Xiong, Zhuo Zhang, Rongchun Hu, Chen Gao, Zichen Deng
10. Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri-Rao Product
   Authors: Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Anton van den Hengel, Ehsan Abbasnejad
11. Graph Lineages and Skeletal Graph Products
   Authors: Eric Mjolsness, Cory B. Scott
12. Sheaf Graph Neural Networks via PAC-Bayes Spectral Optimization
   Authors: Yoonhyuk Choi, Jiho Choi, Chong-Kwon Kim
13. Reinitializing weights vs units for maintaining plasticity in neural networks
   Authors: J. Fernando Hernandez-Garcia, Shibhansh Dohare, Jun Luo, Rich S. Sutton
14. Invariant Graph Transformer for Out-of-Distribution Generalization
   Authors: Tianyin Liao, Ziwei Zhang, Yufei Sun, Chunyu Hu, Jianxin Li
15. Stress-Aware Resilient Neural Training
   Authors: Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicole, Stefano Ghidoni, Nassir Navab
16. Improved Robustness and Functional Localization in Topographic CNNs Through Weight Similarity
   Authors: Nhut Truong, Uri Hasson
17. Sinusoidal Approximation Theorem for Kolmogorov-Arnold Networks
   Authors: Sergei Gleyzer, Hanh Nguyen, Dinesh P. Ramakrishnan, Eric A. F. Reinhardt

1. MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations

ArXiv ID: 2508.00760

Authors: Qiyao Xue, Yuchen Dou, Ryan Shi, Xiang Lorraine Li, Wei Gao

Abstract: Hate speech detection on Chinese social networks presents distinct challenges, particularly due to the widespread use of cloaking techniques designed to evade conventional text-based detection systems. Although large language models (LLMs) have recently improved hate speech detection capabilities, the majority of existing work has concentrated on English datasets, with limited attention given to multimodal strategies in the Chinese context. In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. To address the instability associated with directly integrating MoE into BERT-based models, we develop a progressive three-stage training paradigm. MMBERT incorporates modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy to enhance robustness against adversarial perturbations. Empirical results on several Chinese hate speech datasets show that MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs utilizing in-context learning approaches.

Comment: The paper introduces MMBERT, a novel multimodal framework using a Mixture-of-Experts architecture, which aligns with model architecture criteria.

Relevance: 9 Novelty: 8

2. Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management

ArXiv ID: 2508.00806

Authors: Ping Chen, Zhuohong Deng, Ping Li, Shuibing He, Hongzi Zhu, Yi Zheng, Zhefeng Wang, Baoxing Huai, Minyi Guo

Abstract: Training large language models often employs recomputation to alleviate memory pressure, which can introduce up to 30% overhead in real-world scenarios. In this paper, we propose Adacc, a novel memory management framework that combines adaptive compression and activation checkpointing to reduce the GPU memory footprint. It comprises three modules: (1) We design layer-specific compression algorithms that account for outliers in LLM tensors, instead of directly quantizing floats from FP16 to INT4, to ensure model accuracy. (2) We propose an optimal scheduling policy that employs MILP to determine the best memory optimization for each tensor. (3) To accommodate changes in training tensors, we introduce an adaptive policy evolution mechanism that adjusts the policy during training to enhance throughput. Experimental results show that Adacc can accelerate LLM training by 1.01x to 1.37x compared to state-of-the-art frameworks, while maintaining model accuracy comparable to the baseline.

Comment: The paper introduces a novel memory management framework combining adaptive compression and activation checkpointing, relevant to model compression.

Relevance: 9 Novelty: 8
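
Module (1) of the abstract contrasts outlier-aware compression with direct FP16-to-INT4 quantization. The sketch below is a generic illustration of why outliers matter, not Adacc's actual algorithm: it keeps the largest-magnitude values in full precision and quantizes the rest to 4 bits. The function names and the `outlier_frac` parameter are hypothetical.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric 4-bit quantization: map x to integers in [-7, 7]."""
    max_abs = float(np.max(np.abs(x))) if x.size else 0.0
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def compress_with_outliers(x, outlier_frac=0.01):
    """Assumed scheme: store the top outlier_frac values exactly, quantize the rest."""
    flat = x.astype(np.float32).ravel()
    k = max(1, int(round(outlier_frac * flat.size)))
    outlier_idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |values|
    outlier_val = flat[outlier_idx].copy()
    inliers = flat.copy()
    inliers[outlier_idx] = 0.0  # outliers no longer stretch the quantization scale
    q, scale = quantize_int4(inliers)
    return q, scale, outlier_idx, outlier_val

def decompress(q, scale, outlier_idx, outlier_val, shape):
    flat = dequantize(q, scale)
    flat[outlier_idx] = outlier_val  # restore the exact outlier values
    return flat.reshape(shape)

# Toy tensor: Gaussian values plus one large outlier.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 64)).astype(np.float32)
x[0, 0] = 80.0

q_direct, s_direct = quantize_int4(x)
err_direct = float(np.abs(dequantize(q_direct, s_direct) - x).mean())

packed = compress_with_outliers(x, outlier_frac=0.01)
err_outlier = float(np.abs(decompress(*packed, x.shape) - x).mean())

print(f"direct INT4 error: {err_direct:.4f}, outlier-aware error: {err_outlier:.4f}")
```

In this toy case the single outlier sets the quantization scale for direct INT4, so most ordinary values round to zero; removing the outliers first leaves a much finer grid for the remaining values.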

3. Learning to optimize with guarantees: a complete characterization of linearly convergent algorithms

ArXiv ID: 2508.00775

Authors: Andrea Martin, Ian R. Manchester, Luca Furieri

Abstract: In high-stakes engineering applications, optimization algorithms must come with provable worst-case guarantees over a mathematically defined class of problems. Designing for the worst case, however, inevitably sacrifices performance on the specific problem instances that often occur in practice. We address the problem of augmenting a given linearly convergent algorithm to improve its average-case performance on a restricted set of target problems - for example, tailoring an off-the-shelf solver for model predictive control (MPC) for an application to a specific dynamical system - while preserving its worst-case guarantees across the entire problem class. Toward this goal, we characterize the class of algorithms that achieve linear convergence for classes of nonsmooth composite optimization problems. In particular, starting from a baseline linearly convergent algorithm, we derive all - and only - the modifications to its update rule that maintain its convergence properties. Our results apply to augmenting legacy algorithms such as gradient descent for nonconvex, gradient-dominated functions; Nesterov's accelerated method for strongly convex functions; and projected methods for optimization over polyhedral feasibility sets. We showcase the effectiveness of the approach on solving optimization problems with tight iteration budgets in application to ill-conditioned systems of linear equations and MPC for linear systems.

Comment: The paper provides a theoretical characterization of linearly convergent algorithms, which aligns with the emerging trends criterion.

Relevance: 9 Novelty: 8

4. Embryology of a Language Model

ArXiv ID: 2508.00331

Authors: George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet

Abstract: Understanding how language models develop their internal computational structure is a central problem in the science of deep learning. While susceptibilities, drawn from statistical physics, offer a promising analytical tool, their full potential for visualizing network organization remains untapped. In this work, we introduce an embryological approach, applying UMAP to the susceptibility matrix to visualize the model's structural development over training. Our visualizations reveal the emergence of a clear "body plan," charting the formation of known features like the induction circuit and discovering previously unknown structures, such as a "spacing fin" dedicated to counting space tokens. This work demonstrates that susceptibility analysis can move beyond validation to uncover novel mechanisms, providing a powerful, holistic lens for studying the developmental principles of complex neural networks.

Comment: The paper provides insights into the internal structure development of language models, which is relevant to representation learning and understanding LLM behavior.

Relevance: 9 Novelty: 8

5. Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

ArXiv ID: 2508.00161

Authors: Ziqian Zhong, Aditi Raghunathan

Abstract: The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.

Comment: The paper presents a novel method for monitoring and controlling fine-tuned LLMs by interpreting weights, which is relevant to understanding LLM behavior and interpretability.

Relevance: 9 Novelty: 8
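
The core recipe in the abstract (take the top singular vectors of the fine-tuned-minus-base weight difference, then monitor the cosine similarity of activations along them) can be illustrated on a single toy weight matrix. This is a minimal sketch under stated assumptions, not the authors' implementation: the choice of left singular vectors (the layer's output space), the rank-1 toy update, and the helper names `top_directions` and `max_cosine` are all hypothetical.

```python
import numpy as np

def top_directions(w_base, w_ft, k=1):
    """Top-k left singular vectors of the weight difference, one per row."""
    delta = w_ft - w_base
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k].T  # shape (k, d_out), each row has unit norm

def max_cosine(activation, directions):
    """Largest absolute cosine similarity between an activation and the directions."""
    a = activation / (np.linalg.norm(activation) + 1e-12)
    return float(np.max(np.abs(directions @ a)))

rng = np.random.default_rng(0)
d_out, d_in = 32, 16
w_base = rng.normal(size=(d_out, d_in))

# Toy "fine-tuning": a rank-1 update that writes along one new output direction.
new_dir = rng.normal(size=d_out)
new_dir /= np.linalg.norm(new_dir)
trigger = rng.normal(size=d_in)
w_ft = w_base + 3.0 * np.outer(new_dir, trigger)

dirs = top_directions(w_base, w_ft, k=1)

typical = rng.normal(size=d_out)      # activation unrelated to the update
shifted = typical + 10.0 * new_dir    # activation pushed along the new direction
score_typical = max_cosine(typical, dirs)
score_shifted = max_cosine(shifted, dirs)
print(f"typical: {score_typical:.3f}, shifted: {score_shifted:.3f}")
```

A monitor built this way would flag inputs whose score exceeds a threshold; how that threshold is calibrated is left open here.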

6. Systematic Evaluation of Optimization Techniques for Long-Context Language Models

ArXiv ID: 2508.00305

Authors: Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar

Abstract: Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. We subsequently study the scalability of individual optimization methods on a larger, 70-billion-parameter variant. Our novel insights reveal that naively combining inference optimization algorithms can adversely affect larger models due to compounded approximation errors, as compared to their smaller counterparts. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.

Comment: The paper systematically evaluates optimization techniques like pruning and quantization for LLMs, aligning with the model compression criterion.

Relevance: 9 Novelty: 7
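
The abstract's point that F1 alone can hide precision-recall trade-offs is easy to see with a small worked example. The sketch below assumes a bag-of-tokens overlap score of the kind commonly used for question answering; the paper's exact metric implementation is not given here, and the example answers are invented for illustration.

```python
from collections import Counter

def precision_recall_f1(predicted, gold):
    """Token-overlap precision, recall, and F1 between two token lists."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = "eiffel tower in paris".split()

# A verbose answer covers every gold token but pads it with extra words.
verbose = "the eiffel tower is located in paris france".split()
# A terse answer is fully correct but covers only half of the gold tokens.
terse = "eiffel tower".split()

p_verbose, r_verbose, f1_verbose = precision_recall_f1(verbose, gold)
p_terse, r_terse, f1_terse = precision_recall_f1(terse, gold)

print(f"verbose: P={p_verbose:.2f} R={r_verbose:.2f} F1={f1_verbose:.3f}")
print(f"terse:   P={p_terse:.2f} R={r_terse:.2f} F1={f1_terse:.3f}")
```

Both answers receive the same F1 even though one over-generates and the other under-generates, so reporting precision and recall separately is what distinguishes the two failure modes.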

7. Thinking Machines: Mathematical Reasoning in the Age of LLMs

ArXiv ID: 2508.00459

Authors: Andrea Asperti, Alberto Naibo, Claudio Sacerdoti Coen

Abstract: Large Language Models (LLMs) have shown remarkable abilities in structured reasoning and symbolic tasks, with coding emerging as a particular area of strength. This success has sparked growing interest in applying LLMs to mathematics, both in informal problem-solving and formal theorem proving. However, progress in formal mathematics has proven to be significantly more difficult, despite surface-level similarities between programming and proof construction. This discrepancy raises important questions about how LLMs "reason", how they are supervised, and whether they internally track a notion of computational or deductive state. In this article, we address the state-of-the-art of the discipline, focusing on recent models and benchmarks, and explore three central issues at the intersection of machine learning and mathematical cognition: (i) the trade-offs between formal and informal mathematics as training domains; (ii) the deeper reasons why proof generation remains more brittle than code synthesis; and (iii) the question of whether LLMs represent, or merely mimic, a notion of evolving logical state. Our goal is not to draw hard boundaries, but to identify where the current limits lie, and how they might be extended.

Comment: The paper explores theoretical insights into LLM behavior, particularly in mathematical reasoning, aligning with foundational research in LLMs.

Relevance: 9 Novelty: 7

8. FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems

ArXiv ID: 2508.00721

Authors: Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun

Abstract: We present FMPlug, a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. Unlike traditional approaches that rely on domain-specific or untrained priors, FMPlug smartly leverages two simple but powerful insights: the similarity between observed and desired objects and the Gaussianity of generative flows. By introducing a time-adaptive warm-up strategy and sharp Gaussianity regularization, FMPlug unlocks the true potential of domain-agnostic foundation models. Our method beats state-of-the-art methods that use foundation FM priors by significant margins, on image super-resolution and Gaussian deblurring.

Comment: The paper introduces FMPlug, a novel framework leveraging foundation flow-matching priors for inverse problems, which aligns with foundational research in AI for Science.

Relevance: 8 Novelty: 8

9. Separated-Variable Spectral Neural Networks: A Physics-Informed Learning Approach for High-Frequency PDEs

ArXiv ID: 2508.00628

Authors: Xiong Xiong, Zhuo Zhang, Rongchun Hu, Chen Gao, Zichen Deng

Abstract: Solving high-frequency oscillatory partial differential equations (PDEs) is a critical challenge in scientific computing, with applications in fluid mechanics, quantum mechanics, and electromagnetic wave propagation. Traditional physics-informed neural networks (PINNs) suffer from spectral bias, limiting their ability to capture high-frequency solution components. We introduce Separated-Variable Spectral Neural Networks (SV-SNN), a novel framework that addresses these limitations by integrating separation of variables with adaptive spectral methods. Our approach features three key innovations: (1) decomposition of multivariate functions into univariate function products, enabling independent spatial and temporal networks; (2) adaptive Fourier spectral features with learnable frequency parameters for high-frequency capture; and (3) a theoretical framework based on singular value decomposition to quantify spectral bias. Comprehensive evaluation on benchmark problems including the heat equation, Helmholtz equation, Poisson equations and Navier-Stokes equations demonstrates that SV-SNN achieves 1-3 orders of magnitude improvement in accuracy while reducing parameter count by over 90% and training time by 60%. These results establish SV-SNN as an effective solution to the spectral bias problem in neural PDE solving. The implementation will be made publicly available upon acceptance at https://github.com/xgxgnpu/SV-SNN.

Comment: The paper introduces a novel framework for solving high-frequency PDEs using neural networks, addressing spectral bias with a theoretical framework based on singular value decomposition. This aligns with representation learning and model architecture innovations.

Relevance: 8 Novelty: 8
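
Innovations (1) and (2) in the abstract describe a solution built from products of univariate functions with Fourier features whose frequencies are parameters. The sketch below shows only the forward evaluation of such a separated form, u(x, t) = sum_k X_k(x) T_k(t), on the 1D heat equation; it is an assumption-laden illustration, not SV-SNN itself. The exponential temporal modes stand in for a learned temporal network, the parameters are set by hand rather than trained, and all function names are hypothetical.

```python
import numpy as np

def spatial_modes(x, freqs, amp_sin, amp_cos):
    """X_k(x) = a_k sin(w_k x) + b_k cos(w_k x); returns shape (len(x), K)."""
    phase = np.outer(x, freqs)
    return amp_sin * np.sin(phase) + amp_cos * np.cos(phase)

def temporal_modes(t, rates):
    """T_k(t) = exp(-r_k t); returns shape (len(t), K). Stand-in for a temporal network."""
    return np.exp(-np.outer(t, rates))

def separated_solution(x, t, freqs, amp_sin, amp_cos, rates):
    """u(x_i, t_j) = sum_k X_k(x_i) * T_k(t_j); returns shape (len(x), len(t))."""
    return spatial_modes(x, freqs, amp_sin, amp_cos) @ temporal_modes(t, rates).T

# Heat equation u_t = alpha * u_xx on [0, 1] with a low and a high frequency mode.
alpha = 0.01
freqs = np.array([np.pi, 20.0 * np.pi])   # in a trained model these would be learnable
amp_sin = np.array([1.0, 0.5])
amp_cos = np.array([0.0, 0.0])
rates = alpha * freqs**2                  # decay rates that satisfy the heat equation

x = np.linspace(0.0, 1.0, 201)
t = np.linspace(0.0, 1.0, 11)
u = separated_solution(x, t, freqs, amp_sin, amp_cos, rates)
print(u.shape)
```

Because the frequencies enter as explicit parameters, a high-frequency component such as the 20*pi mode is represented by one extra pair of univariate functions rather than having to emerge from a generic multivariate network.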

10. Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri-Rao Product

ArXiv ID: 2508.00230

Authors: Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Anton van den Hengel, Ehsan Abbasnejad

Abstract: Parameter-efficient fine-tuning (PEFT) has become a standard approach for adapting large pre-trained models. Amongst PEFT methods, low-rank adaptation (LoRA) has achieved notable success. However, recent studies have highlighted its limitations compared against full-rank alternatives, particularly when applied to multimodal and large language models. In this work, we present a quantitative comparison amongst full-rank and low-rank PEFT methods using a synthetic matrix approximation benchmark with controlled spectral properties. Our results confirm that LoRA struggles to approximate matrices with relatively flat spectra or high-frequency components, both signs of high effective rank. To this end, we introduce KRAdapter, a novel PEFT algorithm that leverages the Khatri-Rao product to produce weight updates, which, by construction, tends to produce a matrix product with a high effective rank. We demonstrate performance gains with KRAdapter on vision-language models up to 1B parameters and on large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks. In addition, KRAdapter maintains the memory and compute efficiency of LoRA, making it a practical and robust alternative to fine-tune billion-scale parameter models.

Comment: The paper introduces KRAdapter, a novel PEFT algorithm using the Khatri-Rao product for higher effective rank, relevant to model compression and efficiency.

Relevance: 8 Novelty: 8
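
The abstract's claim that a Khatri-Rao product tends to yield a high effective rank can be checked numerically on random factors. The sketch below compares a LoRA-style low-rank update with a column-wise Khatri-Rao update at the same parameter budget. It is an illustration only: the factor shapes, the entropy-based definition of effective rank, and the helper names are assumptions, and KRAdapter's actual parameterization may differ.

```python
import numpy as np

def khatri_rao(u, v):
    """Column-wise Kronecker product: u (a, n) and v (b, n) give shape (a*b, n)."""
    a, n = u.shape
    b, n2 = v.shape
    assert n == n2, "factors must have the same number of columns"
    return np.einsum("in,jn->ijn", u, v).reshape(a * b, n)

def effective_rank(m):
    """exp of the Shannon entropy of the normalized singular values (assumed definition)."""
    s = np.linalg.svd(m, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d_out, d_in = 64, 64

# LoRA-style update B @ A with rank 8: 64*8 + 8*64 = 1024 parameters.
r = 8
lora_update = rng.normal(size=(d_out, r)) @ rng.normal(size=(r, d_in))

# Khatri-Rao update from two 8 x 64 factors: also 1024 parameters, shape 64 x 64.
u = rng.normal(size=(8, d_in))
v = rng.normal(size=(8, d_in))
kr_update = khatri_rao(u, v)

erank_lora = effective_rank(lora_update)
erank_kr = effective_rank(kr_update)
print(f"effective rank, LoRA: {erank_lora:.1f}, Khatri-Rao: {erank_kr:.1f}")
```

The LoRA update cannot exceed rank 8 by construction, whereas each column of the Khatri-Rao update is an independent Kronecker product, so the resulting matrix is generically full rank for the same number of trainable parameters.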

11. Graph Lineages and Skeletal Graph Products

ArXiv ID: 2508.00197

Authors: Eric Mjolsness, Cory B. Scott

Abstract: Graphs, and sequences of growing graphs, can be used to specify the architecture of mathematical models in many fields including machine learning and computational science. Here we define structured graph "lineages" (ordered by level number) that grow in a hierarchical fashion, so that: (1) the number of graph vertices and edges increases exponentially in level number; (2) bipartite graphs connect successive levels within a graph lineage and, as in multigrid methods, can constrain matrices relating successive levels; (3) using prolongation maps within a graph lineage, process-derived distance measures between graphs at successive levels can be defined; (4) a category of "graded graphs" can be defined, and using it low-cost "skeletal" variants of standard algebraic graph operations and type constructors (cross product, box product, disjoint sum, and function types) can be derived for graded graphs and hence hierarchical graph lineages; (5) these skeletal binary operators have similar but not identical algebraic and category-theoretic properties to their standard counterparts; (6) graph lineages and their skeletal product constructors can approach continuum limit objects. Additional space-efficient unary operators on graded graphs are also derived: thickening, which creates a graph lineage of multiscale graphs, and escalation to a graph lineage of search frontiers (useful as a generalization of adaptive grids and in defining "skeletal" functions). The result is an algebraic type theory for graded graphs and (hierarchical) graph lineages. The approach is expected to be well suited to defining hierarchical model architectures - "hierarchitectures" - and local sampling, search, or optimization algorithms on them. We demonstrate such application to deep neural networks (including visual and feature scale spaces) and to multigrid numerical methods.

Comment: The paper introduces a new algebraic type theory for graded graphs and hierarchical graph lineages, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

12. Sheaf Graph Neural Networks via PAC-Bayes Spectral Optimization

ArXiv ID: 2508.00357

Authors: Yoonhyuk Choi, Jiho Choi, Chong-Kwon Kim

Abstract: Over-smoothing in Graph Neural Networks (GNNs) causes collapse in distinct node features, particularly on heterophilic graphs where adjacent nodes often have dissimilar labels. Although sheaf neural networks partially mitigate this problem, they typically rely on static or heavily parameterized sheaf structures that hinder generalization and scalability. Existing sheaf-based models either predefine restriction maps or introduce excessive complexity, yet fail to provide rigorous stability guarantees. In this paper, we introduce a novel scheme called SGPC (Sheaf GNNs with PAC-Bayes Calibration), a unified architecture that combines cellular-sheaf message passing with several mechanisms, including optimal transport-based lifting, variance-reduced diffusion, and PAC-Bayes spectral regularization for robust semi-supervised node classification. We establish performance bounds theoretically and demonstrate that the resulting bound-aware objective can be achieved via end-to-end training in linear computational complexity. Experiments on nine homophilic and heterophilic benchmarks show that SGPC outperforms state-of-the-art spectral and sheaf-based GNNs while providing certified confidence intervals on unseen nodes.

Comment: The paper introduces a novel architecture for Graph Neural Networks using PAC-Bayes spectral optimization, which aligns with the model architecture criterion.