Personalized Daily ArXiv Papers 2025-10-06

[gpt-5]	Prompt	Completion	Total
Token	51773	49673	101446
Cost	$0.06	$0.5	$0.56

Total arXiv papers: 520

Total scanned papers: 315

Total relevant papers: 31

Table of contents with paper titles:

To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration Authors: Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava
Pretraining with hierarchical memories: separating long-tail and common knowledge Authors: Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel
Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders Authors: Kriz Tahimic, Charibeth Cheng
Superposition disentanglement of neural representations reveals hidden alignment Authors: Andr\'e Longon, David Klindt, Meenakshi Khosla
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression Authors: Peijun Zhu, Ning Yang, Jiayu Wei, Jinghang Wu, Haijun Zhang
The Curious Case of In-Training Compression of State Space Models Authors: Makram Chahine, Philipp Nazari, Daniela Rus, T. Konstantin Rusch
Unraveling Syntax: How Language Models Learn Context-Free Grammars Authors: Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio
In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning Authors: Jindan Li, Zhaoxian Wu, Gaowen Liu, Tayfun Gokmen, Tianyi Chen
SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection Authors: Ashish Jha, Salman Ahmadi-Asl
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding Authors: Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang
Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models Authors: Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, Meeyoung Cha
Topological Invariance and Breakdown in Learning Authors: Yongyi Yang, Tomaso Poggio, Isaac Chuang, Liu Ziyin
Cache-to-Cache: Direct Semantic Communication Between Large Language Models Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
On The Expressive Power of GNN Derivatives Authors: Yam Eitan, Moshe Eliasof, Yoav Gelberg, Fabrizio Frasca, Guy Bar-Shalom, Haggai Maron
Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing Authors: Zhe Li, Wei Zhao, Yige Li, Jun Sun
CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration Authors: Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu
HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference Authors: Shubham Negi, Kaushik Roy
HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance Authors: Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Bo Huang, Yuhang Wu, Tianyang Wang, Hao Xu
TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling Authors: Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen
Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning Authors: Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Oliver Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano
Learning Multi-Index Models with Hyper-Kernel Ridge Regression Authors: Shuo Huang, Hippolyte Labarri`ere, Ernesto De Vito, Tomaso Poggio, Lorenzo Rosasco
Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling Authors: Qiwei Di, Kaixuan Ji, Xuheng Li, Heyang Zhao, Quanquan Gu
Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification Authors: Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying
Mitigating Modal Imbalance in Multimodal Reasoning Authors: Chen Henry Wu, Neil Kale, Aditi Raghunathan
Linear RNNs for autoregressive generation of long music samples Authors: Konrad Szewczyk, Daniel Gallo Fern\'andez, James Townsend
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks Authors: Irene Tenison, Soumyajit Chatterjee, Fahim Kawsar, Mohammad Malekzadeh
On the Role of Temperature Sampling in Test-Time Scaling Authors: Yuheng Wu, Azalia Mirhoseini, Thierry Tambe
ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference Authors: Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng
Hyperparameter Loss Surfaces Are Simple Near their Optima Authors: Nicholas Lourie, He He, Kyunghyun Cho
Why Do We Need Warm-up? A Theoretical Perspective Authors: Foivos Alimisis, Rustem Islamov, Aurelien Lucchi
Multimodal Function Vectors for Spatial Relations Authors: Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu

1. To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

ArXiv ID: 2510.02676

Authors: Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

Abstract: The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.

Comment: Model Compression and Efficiency + HPC: establishes exponent concentration with theoretical entropy bounds; proposes lossless ECF8 FP format with entropy-aware encoding and GPU-optimized decoding.

Relevance: 10 Novelty: 9

2. Pretraining with hierarchical memories: separating long-tail and common knowledge

ArXiv ID: 2510.02375

Authors: Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.

Comment: Strongly matches Model Architecture and Efficiency: memory-augmented transformers with hierarchical parametric memory banks and context-dependent fetch, aligned with hardware for scalable pretraining/inference.

Relevance: 10 Novelty: 9

3. Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders

ArXiv ID: 2510.02917

Authors: Kriz Tahimic, Charibeth Cheng

Abstract: As Large Language Models become integral to software development, with substantial portions of AI-suggested code entering production, understanding their internal correctness mechanisms becomes critical for safe deployment. We apply sparse autoencoders to decompose LLM representations, identifying directions that correspond to code correctness. We select predictor directions using t-statistics and steering directions through separation scores from base model representations, then analyze their mechanistic properties through steering, attention analysis, and weight orthogonalization. We find that code correctness directions in LLMs reliably predict incorrect code, while correction capabilities, though statistically significant, involve tradeoffs between fixing errors and preserving correct code. Mechanistically, successful code generation depends on attending to test cases rather than problem descriptions. Moreover, directions identified in base models retain their effectiveness after instruction-tuning, suggesting code correctness mechanisms learned during pre-training are repurposed during fine-tuning. Our mechanistic insights suggest three practical applications: prompting strategies should prioritize test examples over elaborate problem descriptions, predictor directions can serve as error alarms for developer review, and these same predictors can guide selective steering, intervening only when errors are anticipated to prevent the code corruption from constant steering.

Comment: Representation Learning: uses sparse autoencoders to identify and steer code-correctness directions in LLM representations (mechanistic interpretability).

Relevance: 10 Novelty: 8

4. Superposition disentanglement of neural representations reveals hidden alignment

ArXiv ID: 2510.03186

Authors: Andr\'e Longon, David Klindt, Meenakshi Khosla

Abstract: The superposition hypothesis states that a single neuron within a population may participate in the representation of multiple features in order for the population to represent more features than the number of neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: \textit{does superposition interact with alignment metrics in any undesirable way?} We hypothesize that models which represent the same features in \textit{different superposition arrangements}, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We first develop a theory for how the strict permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model's base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN(\rightarrow)DNN and DNN(\rightarrow)brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural codes.

Comment: Representation Learning: examines superposition and alignment; uses sparse autoencoders to disentangle features and improve representational alignment metrics.

Relevance: 10 Novelty: 8

5. Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

ArXiv ID: 2510.02345

Authors: Peijun Zhu, Ning Yang, Jiayu Wei, Jinghang Wu, Haijun Zhang

Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model's architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.

Comment: Model Architecture: MoE with dynamic expert clustering and hierarchical routing; Compression/Efficiency: shared base + ultra low-rank residual adapters, mixed precision, reduced communication.

Relevance: 10 Novelty: 8

6. The Curious Case of In-Training Compression of State Space Models

ArXiv ID: 2510.02823

Authors: Makram Chahine, Philipp Nazari, Daniela Rus, T. Konstantin Rusch

Abstract: State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for the measure of energy for each state, as well as the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs during training, where only dimensions of high influence are identified and preserved. Our approach applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance.

Comment: Compression/Efficiency — in-training balanced truncation of State Space Models via Hankel singular values to reduce state dimension while preserving expressivity.

Relevance: 10 Novelty: 8

7. Unraveling Syntax: How Language Models Learn Context-Free Grammars

ArXiv ID: 2510.02524

Authors: Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

Abstract: We introduce a new framework for understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts with the observation that most domains of interest, such as natural language syntax, coding languages, arithmetic problems, are captured by probabilistic context-free grammars (PCFGs). We study the learning dynamics of small models trained on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. We prove several general, recursive formulae for the training loss and Kullback-Leibler divergence over the subgrammar structure of a PCFG. Empirically, we find that unlike children, who first master simple substructures before progressing to more complex constructions, transformers reduce loss across all subgrammars in parallel. We further show that subgrammar pretraining can improve the final loss for smaller models, and that pretrained models develop internal representations more aligned with the grammar's substructure. Finally, we demonstrate that models struggle with deeper recursive structures (a limitation even of large language models), revealing fundamental challenges in how neural networks represent hierarchical syntax. Overall, our work initiates the study of the learning dynamics of transformers on PCFGs as a versatile testbed for probing learning in language models, opening a research direction with many open questions.

Comment: Representation Learning/Training Dynamics: theoretical and empirical study of how transformers learn PCFGs, with recursive loss/KL formulae and subgrammar pretraining effects.

Relevance: 9 Novelty: 8

8. In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning

ArXiv ID: 2510.02516

Authors: Jindan Li, Zhaoxian Wu, Gaowen Liu, Tayfun Gokmen, Tianyi Chen

Abstract: Analog in-memory computing (AIMC) accelerators enable efficient deep neural network computation directly within memory using resistive crossbar arrays, where model parameters are represented by the conductance states of memristive devices. However, effective in-memory training typically requires at least 8-bit conductance states to match digital baselines. Realizing such fine-grained states is costly and often requires complex noise mitigation techniques that increase circuit complexity and energy consumption. In practice, many promising memristive devices such as ReRAM offer only about 4-bit resolution due to fabrication constraints, and this limited update precision substantially degrades training accuracy. To enable on-chip training with these limited-state devices, this paper proposes a \emph{residual learning} framework that sequentially learns on multiple crossbar tiles to compensate the residual errors from low-precision weight updates. Our theoretical analysis shows that the optimality gap shrinks with the number of tiles and achieves a linear convergence rate. Experiments on standard image classification benchmarks demonstrate that our method consistently outperforms state-of-the-art in-memory analog training strategies under limited-state settings, while incurring only moderate hardware overhead as confirmed by our cost analysis.

Comment: Matches Model Compression and Efficiency/HPC: enables in-memory training on low-precision analog devices via multi-tile residual learning with convergence guarantees.

Relevance: 9 Novelty: 8

9. SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection

ArXiv ID: 2510.02470

Authors: Ashish Jha, Salman Ahmadi-Asl

Abstract: Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.

Comment: Efficiency: streaming subset selection via Frequent Directions gradient sketches enabling constant-memory, GPU-friendly training.

Relevance: 9 Novelty: 8

10. DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

ArXiv ID: 2510.02358

Authors: Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

Abstract: As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter to propose multi-token drafts, which are then verified in parallel by the target model. However, many deployments still rely on AR drafters, where sequential passes limit wall-clock gains. We revisit the drafting stage and present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass, while remaining compatible with standard AR verifiers. Because DLM drafts are generated under bidirectional conditioning, parallel per-position candidates form a token lattice in which the locally highest-probability token at each position need not form a causal left-to-right path. Moreover, DLM drafting requires pre-specifying a draft length, inducing a speed-quality trade-off. To address these challenges, we introduce two practical components: (i) a causal-consistency path search (CPS) over this lattice that extracts a left-to-right path aligned with AR verification; and (ii) an adaptive draft-length (ADL) controller that adjusts next proposal size based on recent acceptance feedback and realized generated length. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.

Comment: Efficiency/HPC: speculative decoding using a diffusion LM drafter with causal-consistency path search and adaptive draft length for speedups.

Relevance: 9 Novelty: 8

11. Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models

ArXiv ID: 2510.02370

Authors: Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, Meeyoung Cha

Abstract: Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. Models that accept external knowledge uncritically are vulnerable to misinformation, whereas models that adhere rigidly to parametric knowledge fail to benefit from retrieval. Despite the widespread adoption of retrieval-augmented generation, we still lack a systematic understanding of what shapes knowledge-arbitration strategies during training. This gap risks producing pretrained models with undesirable arbitration behaviors and, consequently, wasting substantial computational resources after the pretraining budget has already been spent. To address this problem, we present the first controlled study of how training conditions influence models' use of in-context and parametric knowledge, and how they arbitrate between them. We train transformer-based language models on a synthetic biographies corpus while systematically controlling various conditions. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities. Moreover, training on a corpus that contains inconsistent information or distributional skew encourages models to develop robust strategies for leveraging parametric and in-context knowledge. Rather than viewing these non-ideal properties as artifacts to remove, our results indicate that they are important for learning robust arbitration. These insights offer concrete, empirical guidance for pretraining models that harmoniously integrate parametric and in-context knowledge.

Comment: Representation learning/training dynamics: controlled study of arbitration between parametric and in-context knowledge in Transformers.

Relevance: 9 Novelty: 8

12. Topological Invariance and Breakdown in Learning

ArXiv ID: 2510.02670

Authors: Yongyi Yang, Tomaso Poggio, Isaac Chuang, Liu Ziyin

Abstract: We prove that for a broad class of permutation-equivariant learning rules (including SGD, Adam, and others), the training process induces a bi-Lipschitz mapping between neurons and strongly constrains the topology of the neuron distribution during training. This result reveals a qualitative difference between small and large learning rates $\eta$. With a learning rate below a topological critical point $\eta^$, the training is constrained to preserve all topological structure of the neurons. In contrast, above $\eta^$, the learning process allows for topological simplification, making the neuron manifold progressively coarser and thereby reducing the model's expressivity. Viewed in combination with the recent discovery of the edge of stability phenomenon, the learning dynamics of neuron networks under gradient descent can be divided into two phases: first they undergo smooth optimization under topological constraints, and then enter a second phase where they learn through drastic topological simplifications. A key feature of our theory is that it is independent of specific architectures or loss functions, enabling the universal application of topological methods to the study of deep learning.

Comment: Representation Learning/Training Dynamics — architecture-agnostic theory showing topology-preserving vs. simplifying phases in learning governed by the learning rate.

Relevance: 9 Novelty: 8

13. Cache-to-Cache: Direct Semantic Communication Between Large Language Models

ArXiv ID: 2510.03215

Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang

Abstract: Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.

Comment: Systems/Efficiency — KV-cache projection/fusion and gating enable direct inter-LLM communication, improving accuracy and reducing latency.

Relevance: 9 Novelty: 8

14. On The Expressive Power of GNN Derivatives

ArXiv ID: 2510.02565

Authors: Yam Eitan, Moshe Eliasof, Yoav Gelberg, Fabrizio Frasca, Guy Bar-Shalom, Haggai Maron

Abstract: Despite significant advances in Graph Neural Networks (GNNs), their limited expressivity remains a fundamental challenge. Research on GNN expressivity has produced many expressive architectures, leading to architecture hierarchies with models of increasing expressive power. Separately, derivatives of GNNs with respect to node features have been widely studied in the context of the oversquashing and over-smoothing phenomena, GNN explainability, and more. To date, these derivatives remain unexplored as a means to enhance GNN expressivity. In this paper, we show that these derivatives provide a natural way to enhance the expressivity of GNNs. We introduce High-Order Derivative GNN (HOD-GNN), a novel method that enhances the expressivity of Message Passing Neural Networks (MPNNs) by leveraging high-order node derivatives of the base model. These derivatives generate expressive structure-aware node embeddings processed by a second GNN in an end-to-end trainable architecture. Theoretically, we show that the resulting architecture family's expressive power aligns with the WL hierarchy. We also draw deep connections between HOD-GNN, Subgraph GNNs, and popular structural encoding schemes. For computational efficiency, we develop a message-passing algorithm for computing high-order derivatives of MPNNs that exploits graph sparsity and parallelism. Evaluations on popular graph learning benchmarks demonstrate HOD-GNN's strong performance on popular graph learning tasks.

Comment: Model Architecture — HOD-GNN augments MPNNs with high-order feature derivatives to boost expressivity up to WL hierarchy; efficient derivative message passing.

Relevance: 9 Novelty: 8

15. Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing

ArXiv ID: 2510.02334

Authors: Zhe Li, Wei Zhao, Yige Li, Jun Sun

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to prohibitive noisy signals and computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representation and its gradients, which operates directly in the model's activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method for tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs. The code is available at https://github.com/plumprc/RepT.

Comment: Representation learning: activation-space attribution with representation gradient tracing to link outputs to training data.

Relevance: 9 Novelty: 7

16. CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration

ArXiv ID: 2510.03038

Authors: Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu

Abstract: With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for \underline{\textbf{C}}ustomizing \underline{\textbf{H}}ybrid-precision \underline{\textbf{O}}n-device model for sequential \underline{\textbf{R}}ecommendation with \underline{\textbf{D}}evice-cloud collaboration (\textbf{CHORD}), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.

Comment: Model Compression and Efficiency: channel-wise mixed-precision quantization personalized via a hypernetwork; 2-bit per-channel strategy encoding enables resource-adaptive deployment without backprop.

Relevance: 9 Novelty: 7

17. HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference

ArXiv ID: 2510.02675

Authors: Shubham Negi, Kaushik Roy

Abstract: The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases exhibit highly diverse compute and memory requirements, which makes accelerator design particularly challenging. Prior works have primarily been optimized for high-batch inference or evaluated only short input context lengths, leaving the low-batch and long context regime, which is critical for interactive applications, largely underexplored. We propose HALO, a heterogeneous memory centric accelerator designed for these unique challenges of prefill and decode phases in low-batch LLM inference. HALO integrates HBM based Compute-in-DRAM (CiD) with an on-chip analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further improve the hardware utilization, we introduce a phase-aware mapping strategy that adapts to the distinct demands of the prefill and decode phases. Compute bound operations in the prefill phase are mapped to CiM to exploit its high throughput matrix multiplication capability, while memory-bound operations in the decode phase are executed on CiD to benefit from reduced data movement within DRAM. Additionally, we present an analysis of the performance tradeoffs of LLMs under two architectural extremes: a fully CiD and a fully on-chip analog CiM design to highlight the need for a heterogeneous design. We evaluate HALO on LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping and 2.5x over CENT, a fully CiD based mapping.

Comment: Strongly matches High-Performance Computing: heterogeneous CiD + on-chip analog CiM with phase-aware mapping and 2.5D integration targeted at low-batch, long-context LLM inference.

Relevance: 9 Novelty: 7

18. HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance

ArXiv ID: 2510.02630

Authors: Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Bo Huang, Yuhang Wu, Tianyang Wang, Hao Xu

Abstract: Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), has emerged as a promising approach to fine-tuning large language models(LLMs) while reducing computational and memory overhead. However, LoRA assumes a uniform rank \textit{r} for each incremental matrix, not accounting for the varying significance of weight matrices across different modules and layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize updates and employs pruning of singular values to introduce dynamic rank allocation, thereby enhancing adaptability. However, during the training process, it often encounters issues of slow convergence speed and high computational overhead. To address this issue, we propose HyperAdaLoRA, a novel framework that accelerates the convergence of AdaLoRA by leveraging a hypernetwork. Instead of directly optimizing the components of Singular Value Decomposition $(P, \Lambda, Q)$, HyperAdaLoRA employs a hypernetwork based on attention mechanisms to dynamically generate these parameters. By pruning the outputs of the hypernetwork that generates the singular values, dynamic rank allocation is achieved. Comprehensive experiments on various datasets and models demonstrate that our method achieves faster convergence without sacrificing performance. Additionally, further extension experiments on other LoRA-based approaches validate the broad applicability of our method.

Comment: Compression/Efficiency: dynamic low-rank adaptation (LoRA) accelerated via hypernetwork-generated SVD parameters with rank pruning for efficient PEFT.

Relevance: 9 Novelty: 7

19. TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

ArXiv ID: 2510.02758

Authors: Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen

Abstract: Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation (i.e.,required time-between-tokens). Standard LLM serving systems suffer from the inflexibility caused by non-preemptive request scheduling and reactive memory management, leading to poor resource utilization and low request processing parallelism under request bursts. Therefore, we present TokenFlow, a novel LLM serving system with enhanced text streaming performance via preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.

Comment: Matches High Performance Computing: systems-level preemptive scheduling and proactive KV-cache memory management for LLM serving to improve responsiveness and throughput.

Relevance: 8 Novelty: 8

20. Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

ArXiv ID: 2510.02324

Authors: Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Oliver Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano

Abstract: Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's light-weight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired method for practical deployment in production systems.

Comment: Matches Representation Learning/Model Architecture: amortized activation steering by training a minimal transformer submodule; effective for both dense and MoE models with strong compute efficiency.

Relevance: 8 Novelty: 8

21. Learning Multi-Index Models with Hyper-Kernel Ridge Regression

ArXiv ID: 2510.02532

Authors: Shuo Huang, Hippolyte Labarri`ere, Ernesto De Vito, Tomaso Poggio, Lorenzo Rosasco

Abstract: Deep neural networks excel in high-dimensional problems, outperforming models such as kernel methods, which suffer from the curse of dimensionality. However, the theoretical foundations of this success remain poorly understood. We follow the idea that the compositional structure of the learning task is the key factor determining when deep networks outperform other approaches. Taking a step towards formalizing this idea, we consider a simple compositional model, namely the multi-index model (MIM). In this context, we introduce and study hyper-kernel ridge regression (HKRR), an approach blending neural networks and kernel methods. Our main contribution is a sample complexity result demonstrating that HKRR can adaptively learn MIM, overcoming the curse of dimensionality. Further, we exploit the kernel nature of the estimator to develop ad hoc optimization approaches. Indeed, we contrast alternating minimization and alternating gradient methods both theoretically and numerically. These numerical results complement and reinforce our theoretical findings.

Comment: Representation Learning/Theory: HKRR provides sample complexity guarantees for compositional multi-index models, bridging kernels and neural approaches to overcome curse of dimensionality.

Relevance: 8 Novelty: 8

22. Best-of-Majority: Minimax-Optimal Strategy for Pass@$k$ Inference Scaling

ArXiv ID: 2510.03199

Authors: Qiwei Di, Kaixuan Ji, Xuheng Li, Heyang Zhao, Quanquan Gu

Abstract: LLM inference often generates a batch of candidates for a prompt and selects one via strategies like majority voting or Best-of- N (BoN). For difficult tasks, this single-shot selection often underperforms. Consequently, evaluations commonly report Pass@$k$: the agent may submit up to $k$ responses, and only the best of them is used when computing regret. Motivated by this, we study inference scaling in the more general Pass@$k$ inference setting, and prove that neither majority voting nor BoN exhibits the desirable scaling with $k$ and the sampling budget $N$. Combining the advantages of majority voting and BoN, we propose a new inference strategy called Best-of-Majority (BoM), with a pivotal step that restricts the candidates to the responses with high frequency in the $N$ samples before selecting the top-$k$ rewards. We prove that when the sampling budget is $N=\tilde\Omega(C^)$, the regret of BoM is $O(\epsilon_{\mathrm{opt}}+\sqrt{\epsilon_{\mathrm{RM}}^2C^/k})$, where $C^*$ is the coverage coefficient, $\epsilon_{\mathrm{RM}}$ is the estimation error of the reward model, and $\epsilon_{\mathrm{opt}}$ is the estimation error of reward at the optimal response. We further establish a matching lower bound, certifying that our algorithm is minimax optimal. Beyond optimality, BoM has a key advantage: unlike majority voting and BoN, its performance does not degrade when increasing $N$. Experimental results of inference on math problems show BoM outperforming both majority voting and BoN.

Comment: Matches Efficiency/Test-Time Scaling: introduces Best-of-Majority, a minimax-optimal Pass@k inference strategy with theoretical guarantees over majority voting/BoN.

Relevance: 8 Novelty: 8

23. Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification

ArXiv ID: 2510.02779

Authors: Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying

Abstract: Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable from the margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^4 (1 + \gamma L^2) / (n \gamma^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.

Comment: Training Dynamics — optimal generalization rates for GD on deep ReLU networks via control of activation patterns and sharper Rademacher bounds.

Relevance: 8 Novelty: 8

ArXiv ID: 2510.02608

Authors: Chen Henry Wu, Neil Kale, Aditi Raghunathan

Abstract: Foundation models (FMs) deployed in real-world tasks such as computer-use agents must integrate diverse modalities. How good are FMs at performing joint reasoning, simultaneously reasoning over multiple modalities, especially when the modalities interact and relate to each other to form cross-modal context? To better understand this problem, we study FMs on cross-modal conflicts: scenarios where conflicting evidence is presented across modalities. This allows us to examine whether FMs prioritize one modality over another or reason jointly to reconcile the conflict. Our experiments reveal that FMs can recognize conflicts in unimodal contexts, composed of a single modality, 90% of the time, but the ratio falls as low as 3% when evidence is split across modalities -- similar observations hold in cross-lingual contexts, composed of multiple languages. We trace this failure to cross-modal attention imbalance, showing that FMs exhibit extreme asymmetry in attention scores, disproportionately prioritizing certain modalities. We show that cross-modal attention imbalance does not go away by simply scaling up multimodal or multilingual datasets blindly, since they lack training examples that explicitly require cross-modal reasoning. We demonstrate that even a simple and scalable method of explicitly combining multiple modalities within each training instance significantly reduces attention imbalance. Reduced attention imbalance directly translates to improved downstream performance on several vision-language benchmarks. Our findings underscore the importance of systematically addressing cross-modal contexts to build reliable foundation models.

Comment: Representation Learning: analyzes and mitigates cross-modal attention imbalance with a training strategy that explicitly combines modalities to improve joint reasoning.