Personalized Daily ArXiv Papers 2025-08-25

[gpt-4o]	Prompt	Completion	Total
Token	31443	3336	34779
Cost	$0.08	$0.03	$0.11

Total arXiv papers: 433

Total scanned papers: 273

Total relevant papers: 20

Table of contents with paper titles:

From Confidence to Collapse in LLM Factual Robustness Authors: Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci
TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference Authors: Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing Authors: Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search Authors: Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai
Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining Authors: Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed
On Task Vectors and Gradients Authors: Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D'Inverno, Fabrizio Silvestri, Emanuele Rodolà
Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs Authors: Jiaqi Lin, Malyaban Bal, Abhronil Sengupta
Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation Authors: Guangyu Sun, Jingtao Li, Weiming Zhuang, Chen Chen, Chen Chen, Lingjuan Lyu
SCOPE: A Generative Approach for LLM Prompt Compression Authors: Tinghui Zhang, Yifan Wang, Daisy Zhe Wang
GEM: A Scale-Aware and Distribution-Sensitive Sparse Fine-Tuning Framework for Effective Downstream Adaptation Authors: Sungmin Kang, Jisoo Kim, Salman Avestimehr, Sunwoo Lee
Tessellation Groups, Harmonic Analysis on Non-compact Symmetric Spaces and the Heat Kernel in view of Cartan Convolutional Neural Networks Authors: Pietro Fré, Federico Milanesio, Marcelo Oyarzo, Matteo Santoro, Mario Trigiante
Transforming Causality: Transformer-Based Temporal Causal Discovery with Prior Knowledge Integration Authors: Jihua Huang, Yi Yao, Ajay Divakaran
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Low-dimensional embeddings of high-dimensional data Authors: Cyril de Bodt, Alex Diaz-Papkovich, Michael Bleher, Kerstin Bunte, Corinna Coupette, Sebastian Damrich, Enrique Fita Sanmartin, Fred A. Hamprecht, Emőke-Ágnes Horvát, Dhruv Kohli, Smita Krishnaswamy, John A. Lee, Boudewijn P. F. Lelieveldt, Leland McInnes, Ian T. Nabney, Maximilian Noichl, Pavlin G. Poličar, Bastian Rieck, Guy Wolf, Gal Mishne, Dmitry Kobak
Representation Learning with Adaptive Superpixel Coding Authors: Mahmoud Khalil, Ahmad Khalil, Alioune Ngom
Interpretable Kernels Authors: Patrick J. F. Groenen, Michael Greenacre
Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs Authors: Terry Jingchen Zhang, Wenyuan Jiang, Rongchuan Liu, Yisong Wang, Junran Yang, Ning Wang, Nicole Ni, Yinya Huang, Mrinmaya Sachan
Generative Foundation Model for Structured and Unstructured Electronic Health Records Authors: Sonish Sivarajkumar, Hang Zhang, Yuelyu Ji, Maneesh Bilalpur, Xizhi Wu, Chenyu Li, Min Gu Kwak, Shyam Visweswaran, Yanshan Wang
SDEC: Semantic Deep Embedded Clustering Authors: Mohammad Wali Ur Rahman, Ric Nevarez, Lamia Tasnim Mim, Salim Hariri
HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling Authors: Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan

1. From Confidence to Collapse in LLM Factual Robustness

ArXiv ID: 2508.16267

Authors: Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci

Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of 0.76, larger ones 0.93 -- with accuracy degrading by ~60% under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.

Comment: The paper introduces a novel metric for evaluating factual robustness in LLMs, which aligns with theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8
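
Illustration (not from the paper): the abstract above combines token distribution entropy with temperature scaling sensitivity. The sketch below only shows those two ingredients on a toy example, namely how the entropy of a next-token distribution rises as the decoding temperature increases. It does not compute the FRS itself, whose formula is not given in the abstract. The logits and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; subtracts the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits: one confident candidate, three weak ones.
logits = [4.0, 1.0, 0.5, 0.0]

# Entropy of the same distribution at three decoding temperatures.
entropies = {
    t: entropy(softmax_with_temperature(logits, t))
    for t in (0.5, 1.0, 2.0)
}

for t, h in entropies.items():
    print(f"temperature={t}: entropy={h:.3f} nats")
```

A fact whose answer token stays on top as the temperature grows would count as more stable under this view; how the paper turns that into a single score is left to the paper.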

2. TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference

ArXiv ID: 2508.15881

Authors: Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang

Abstract: Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, eroding the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking TP efficiency. Unlike Grouped Latent Attention (GLA), every head in TPLA still leverages the full latent representation, maintaining stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms -- e.g., the Hadamard transform or PCA -- before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.

Comment: The paper introduces Tensor-Parallel Latent Attention, which reduces the per-device KV cache under tensor parallelism and speeds up decoding, aligning with the model compression criterion.

Relevance: 9 Novelty: 8

3. CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing

ArXiv ID: 2508.16134

Authors: Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che

Abstract: Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length. As a crucial technique, existing cross-layer KV cache sharing methods either necessitate modified model architectures with subsequent pre-training or incur significant performance degradation at high compression rates. To mitigate these challenges, we propose CommonKV, a training-free method for cross-layer KV cache compression through adjacent parameters sharing. Inspired by the high similarity observed in cross-layer hidden states, we utilize Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters, resulting in a more easily mergeable latent KV cache. Furthermore, we also introduce an adaptive budget allocation strategy. It dynamically assigns compression budgets based on cosine similarity, ensuring that dissimilar caches are not over-compressed. Experiments across multiple backbone models and benchmarks including LongBench and Ruler demonstrate that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios. Moreover, we find that the benefits of CommonKV are orthogonal to other quantization and eviction methods. By integrating these approaches, we can ultimately achieve a 98% compression ratio without significant performance loss.

Comment: The paper introduces CommonKV, a novel method for compressing KV cache in LLMs using cross-layer parameter sharing and SVD, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8

4. Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

ArXiv ID: 2508.15884

Authors: Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.

Comment: Jet-Nemotron presents a new hybrid-architecture language model developed using a novel neural architecture exploration pipeline, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 8

5. Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining

ArXiv ID: 2508.15828

Authors: Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed

Abstract: Large language models (LLMs) have rapidly advanced in recent years, achieving remarkable performance across a wide range of natural language processing tasks. However, this progress has come at the cost of increasingly large model sizes, which pose significant challenges for deployment, scalability, and energy efficiency. To address these limitations, post-training pruning has emerged as a promising approach for reducing model size and inference latency without the need for retraining. Despite these advantages, many existing pruning methods result in substantial performance degradation or require computationally expensive fine-tuning. In this work, we introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained LLMs without any retraining. Unlike conventional approaches, Z-Pruner leverages both weight update magnitudes and activation patterns to identify and eliminate redundant parameters more effectively. Our method is model-agnostic, efficient, and easy to implement. We evaluate Z-Pruner using multiple widely-used LLM architectures, including LLaMA-2, LLaMA-3, and OPT, across a diverse set of standard language benchmarks. Experimental results demonstrate that Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates. Specifically, Z-Pruner achieves the lowest perplexity scores and the highest overall average score for zero-shot accuracy. We have made the corresponding codes publicly available at https://github.com/sazzadadib/Z-Pruner.

Comment: Z-Pruner introduces a novel post-training pruning method for LLMs, focusing on inducing sparsity without retraining, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8

6. On Task Vectors and Gradients

ArXiv ID: 2508.16082

Authors: Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D'Inverno, Fabrizio Silvestri, Emanuele Rodolà

Abstract: Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.

Comment: The paper provides a theoretical foundation for task arithmetic, which is relevant to representation learning and offers substantial insights into training dynamics.

Relevance: 9 Novelty: 8
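
Illustration (not from the paper): the abstract above states that, under standard gradient descent, a one-epoch task vector equals the negative gradient of the loss scaled by the learning rate. The toy check below assumes full-batch gradient descent, so that one epoch is a single update, and uses a hypothetical two-parameter linear model with made-up data. All names are illustrative.

```python
def loss_grad(theta, xs, ys):
    """Gradient of mean squared error for the 1-D linear model y = w*x + b."""
    w, b = theta
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [grad_w, grad_b]

def one_epoch_full_batch(theta, xs, ys, lr):
    """One epoch of full-batch gradient descent, i.e. a single update step."""
    grad = loss_grad(theta, xs, ys)
    return [p - lr * g for p, g in zip(theta, grad)]

# Hypothetical "pre-trained" weights and a tiny fine-tuning dataset.
theta_pre = [0.5, -0.2]
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
lr = 0.05

theta_ft = one_epoch_full_batch(theta_pre, xs, ys, lr)

# Task vector: fine-tuned weights minus pre-trained weights.
task_vector = [ft - pre for ft, pre in zip(theta_ft, theta_pre)]

# Negative gradient at the pre-trained weights, scaled by the learning rate.
scaled_neg_grad = [-lr * g for g in loss_grad(theta_pre, xs, ys)]

print(task_vector)
print(scaled_neg_grad)
```

The two printed vectors agree up to floating-point rounding. With mini-batches or several epochs, later gradients are evaluated at updated weights, which is where the approximation error discussed in the abstract comes from.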

7. Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

ArXiv ID: 2508.15989

Authors: Jiaqi Lin, Malyaban Bal, Abhronil Sengupta

Abstract: Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To address the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates intermediate error signals to enhance information flow and convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, paving the way for its application in real-world systems.

Comment: The paper presents a novel framework for Equilibrium Propagation in deep networks, addressing the vanishing gradient problem and enhancing scalability, which aligns with representation learning and training dynamics in neural networks.

Relevance: 9 Novelty: 8

8. Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation

ArXiv ID: 2508.16568

Authors: Guangyu Sun, Jingtao Li, Weiming Zhuang, Chen Chen, Chen Chen, Lingjuan Lyu

Abstract: Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks, particularly in privacy-sensitive applications. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data, limiting their adaptation. Federated learning (FL) provides a privacy-aware alternative, but existing FL approaches overlook the constraints imposed by edge devices -- namely, limited computational resources and the scarcity of labeled data. To address these challenges, we introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data, while the server has limited labeled, high-resolution data. In this setting, we propose the Federated Mixture of Experts (FedMox), a novel framework that enhances FM adaptation in FL. FedMox tackles computational and resolution mismatch challenges via a sparse Mixture-of-Experts architecture, employing a spatial router to align features across resolutions and a Soft-Mixture strategy to stabilize semi-supervised learning. We take object detection as a case study, and experiments on real-world autonomous driving datasets demonstrate that FedMox effectively adapts FMs under PSSFL, significantly improving performance with constrained memory costs on edge devices. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.

Comment: The paper introduces a novel framework, FedMox, which uses a sparse Mixture-of-Experts architecture for federated learning, aligning with the Model Architecture and Model Compression criteria.

Relevance: 9 Novelty: 8

9. SCOPE: A Generative Approach for LLM Prompt Compression

ArXiv ID: 2508.15813

Authors: Tinghui Zhang, Yifan Wang, Daisy Zhe Wang

Abstract: Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize cost by reducing the length of the input context. The goal of prompt compression is to shorten the LLM prompt while maintaining high generation quality. However, existing solutions, mainly based on token removal, face challenges such as information loss and structural incoherence, like missing grammatical elements in a sentence or incomplete phrases after token removal. Such challenges limit the final generation quality of the LLM. To overcome these limitations, we present a novel generative prompt compression method. Unlike existing token removal methods, our method centers on a chunking-and-summarization mechanism. Specifically, our method splits the prompt into semantically coherent chunks and rewrites the chunks to be more concise. The chunks are finally reconstructed into a meaningful prompt. We design several optimization techniques for the mechanism, including optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintaining. These techniques effectively improve the identification and preservation of critical information and coherence among texts, and provide finer-grained control of the compression ratio. We conduct extensive evaluation on question-answering and summarization tasks, with datasets covering multiple domains. The evaluation shows that our method achieves significantly better compression quality and higher stability than state-of-the-art methods, especially under high compression ratios, which demonstrates the effectiveness and practicality of our method.

Comment: The paper introduces a generative approach for prompt compression in LLMs, focusing on efficiency and coherence, which aligns with model compression and LLM efficiency.

Relevance: 9 Novelty: 8

10. GEM: A Scale-Aware and Distribution-Sensitive Sparse Fine-Tuning Framework for Effective Downstream Adaptation

ArXiv ID: 2508.16191

Authors: Sungmin Kang, Jisoo Kim, Salman Avestimehr, Sunwoo Lee

Abstract: Parameter-efficient fine-tuning (PEFT) has become a popular way to adapt large pre-trained models to new tasks. Most PEFT methods update only a small subset of parameters while freezing the rest, avoiding redundant computation. As they maximize the absolute size of the updates without regard to the parameters' original scale, the resulting changes in model behavior can be minimal. In contrast, we maximize updates relative to each parameter's scale, yielding more meaningful downstream adaptation. We propose Gradient-to-Weight Ratio and Entropy-guided Masking (GEM), a parameter scale-aware, distribution-sensitive sparse fine-tuning framework. GEM prioritizes parameters whose updates are significant in proportion to their initial pre-trained values. It also adaptively determines how many parameters to tune at each layer based on the entropy of parameter values, thereby making the most effective use of the computational budget in PEFT. Our empirical study demonstrates the efficacy of GEM on both general-domain tasks (GLUE and SuperGLUE) and domain-specific tasks (GSM8k and MBPP), achieving up to a 1.6% improvement in fine-tuning accuracy over full fine-tuning while updating only 0.1% of model parameters.

Comment: The paper proposes a sparse fine-tuning framework that is scale-aware and distribution-sensitive, aligning with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 7
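
Illustration (not from the paper): the abstract above describes selecting parameters whose updates are large relative to their pre-trained values. The sketch below assumes the score is |gradient| / (|weight| + eps) and that the top-k scores are kept. The exact scoring rule, the eps term, and the function name are assumptions, and the entropy-guided per-layer budget is not covered.

```python
def gradient_to_weight_mask(weights, grads, k, eps=1e-8):
    """Return a boolean mask selecting the k parameters with the largest
    |grad| / (|weight| + eps) ratio. eps is an assumed guard against
    division by zero for weights that are exactly zero."""
    scores = [abs(g) / (abs(w) + eps) for w, g in zip(weights, grads)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]

# Hypothetical pre-trained weights and their gradients on a downstream task.
weights = [2.0, 0.01, -0.5, 1.0]
grads = [0.4, 0.01, 0.05, -0.3]

# Scale-aware selection: ratios are roughly 0.2, 1.0, 0.1, 0.3.
ratio_mask = gradient_to_weight_mask(weights, grads, k=2)

# Baseline for contrast: pick the k largest absolute gradients.
by_magnitude = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)[:2]
magnitude_mask = [i in by_magnitude for i in range(len(grads))]

print(ratio_mask)      # selects indices 1 and 3
print(magnitude_mask)  # selects indices 0 and 3
```

The small weight at index 1 is picked by the ratio rule even though its gradient is the smallest in absolute terms, which is the contrast with magnitude-based selection that the abstract draws.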

11. Tessellation Groups, Harmonic Analysis on Non-compact Symmetric Spaces and the Heat Kernel in view of Cartan Convolutional Neural Networks

ArXiv ID: 2508.16015

Authors: Pietro Fré, Federico Milanesio, Marcelo Oyarzo, Matteo Santoro, Mario Trigiante

Abstract: In this paper, we continue the development of the Cartan neural networks programme, launched with three previous publications, by focusing on some mathematical foundational aspects that we deem necessary for our next steps forward. The mathematical and conceptual results are diverse and span various mathematical fields, but the inspiring motivation is unified. The aim is to introduce layers that are mathematically modeled as non-compact symmetric spaces, each mapped onto the next one by solvable group homomorphisms. In particular, in the spirit of Convolutional neural networks, we have introduced the notion of Tits Satake (TS) vector bundles where the TS submanifold is the base space. Within this framework, the tiling of the base manifold, the representation of bundle sections using harmonics, and the need for a general theory of separator walls motivated a series of mathematical investigations that produced both definite and partial results. Specifically, we present the group theoretical construction of the separators for all non-compact symmetric spaces $\mathrm{U/H}$, as well as of the $\Delta_{8,3,2}$ tiling group and its normal Fuchsian subgroups, respectively yielding the uniformization of the genus $g=3$ Fermat Quartic and of the genus $g=2$ Bolza surface. The quotient automorphic groups are studied. Furthermore, we found a new representation of the Laplacian Green function and the Heat Kernel on Hyperbolic Spaces $\mathbb{H}^{n}$, and a setup for the construction of the harmonic functions in terms of the spinor representation of pseudo-orthogonal groups. Finally, to obtain an explicit construction of the Laplacian eigenfunctions on the Bolza Riemann surface, we propose and conjecture a new strategy relying on the Abel-Jacobi map of the Riemann surface to its Jacobian variety and the Siegel Theta function.

Comment: The paper discusses mathematical foundations for Cartan Convolutional Neural Networks, which is a novel architectural concept, aligning with model architecture innovations.

Relevance: 8 Novelty: 8

12. Transforming Causality: Transformer-Based Temporal Causal Discovery with Prior Knowledge Integration

ArXiv ID: 2508.15928

Authors: Jihua Huang, Yi Yao, Ajay Divakaran

Abstract: We introduce a novel framework for temporal causal discovery and inference that addresses two key challenges: complex nonlinear dependencies and spurious correlations. Our approach employs a multi-layer Transformer-based time-series forecaster to capture long-range, nonlinear temporal relationships among variables. After training, we extract the underlying causal structure and associated time lags from the forecaster using gradient-based analysis, enabling the construction of a causal graph. To mitigate the impact of spurious causal relationships, we introduce a prior knowledge integration mechanism based on attention masking, which consistently enforces user-excluded causal links across multiple Transformer layers. Extensive experiments show that our method significantly outperforms other state-of-the-art approaches, achieving a 12.8% improvement in F1-score for causal discovery and 98.9% accuracy in estimating causal lags.

Comment: The paper presents a Transformer-based framework for temporal causal discovery, focusing on architecture-level innovations.