Personalized Daily ArXiv Papers 2025-12-05

[gpt-5]	Prompt	Completion	Total
Token	36714	33874	70588
Cost	$0.05	$0.34	$0.38

Total arXiv papers: 436

Total scanned papers: 256

Total relevant papers: 19

Table of contents with paper titles:

1. BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training
   Authors: Luca Colombo, Fabrizio Pittorino, Daniele Zambon, Carlo Baldassi, Manuel Roveri, Cesare Alippi
2. Network of Theseus (like the ship)
   Authors: Vighnesh Subramaniam, Colin Conwell, Boris Katz, Andrei Barbu, Brian Cheung
3. Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
   Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
4. KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
   Authors: Lishuo Deng, Shaojie Xu, Jinwu Chen, Changwei Yan, Jiajie Wang, Zhe Jiang, Weiwei Shan
5. Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
   Authors: Zehao Fan, Zhenyu Liu, Yunzhen Liu, Yayue Hou, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu
6. A note on the impossibility of conditional PAC-efficient reasoning in large language models
   Authors: Hao Zeng
7. UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
   Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
8. When do spectral gradient updates help in deep learning?
   Authors: Damek Davis, Dmitriy Drusvyatskiy
9. Arbitrage: Efficient Reasoning via Advantage-Aware Speculation
   Authors: Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
10. PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
    Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
11. The Universal Weight Subspace Hypothesis
    Authors: Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille
12. The Initialization Determines Whether In-Context Learning Is Gradient Descent
    Authors: Shifeng Xie, Rui Yuan, Simone Rossi, Thomas Hannagan
13. Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective
    Authors: Bowen Zheng, Ran Cheng
14. SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals
    Authors: Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong
15. GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers
    Authors: Malyaban Bal, Abhronil Sengupta
16. On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference
    Authors: Yue Yu, Qiwei Di, Quanquan Gu, Dongruo Zhou
17. TV2TV: A Unified Framework for Interleaved Language and Video Generation
    Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
18. Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
    Authors: Noa Rubin, Orit Davidovich, Zohar Ringel
19. Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
    Authors: Xiwen Wei, Mustafa Munir, Radu Marculescu

1. BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training

ArXiv ID: 2512.04189

Authors: Luca Colombo, Fabrizio Pittorino, Daniele Zambon, Carlo Baldassi, Manuel Roveri, Cesare Alippi

Abstract: Binary Neural Networks (BNNs), which constrain both weights and activations to binary values, offer substantial reductions in computational complexity, memory footprint, and energy consumption. These advantages make them particularly well suited for deployment on resource-constrained devices. However, training BNNs via gradient-based optimization remains challenging due to the discrete nature of their variables. The dominant approach, quantization-aware training, circumvents this issue by employing surrogate gradients. Yet, this method requires maintaining latent full-precision parameters and performing the backward pass with floating-point arithmetic, thereby forfeiting the efficiency of binary operations during training. While alternative approaches based on local learning rules exist, they are unsuitable for global credit assignment and for back-propagating errors in multi-layer architectures. This paper introduces Binary Error Propagation (BEP), the first learning algorithm to establish a principled, discrete analog of the backpropagation chain rule. This mechanism enables error signals, represented as binary vectors, to be propagated backward through multiple layers of a neural network. BEP operates entirely on binary variables, with all forward and backward computations performed using only bitwise operations. Crucially, this makes BEP the first solution to enable end-to-end binary training for recurrent neural network architectures. We validate the effectiveness of BEP on both multi-layer perceptrons and recurrent neural networks, demonstrating gains of up to +6.89% and +10.57% in test accuracy, respectively. The proposed algorithm is released as an open-source repository.
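Illustration: the abstract does not spell out BEP's backward rule, so the sketch below does not attempt it. It only shows the widely used XNOR-popcount trick that lets a binarized layer run its forward pass with bitwise operations alone. The helper names (`pack`, `binary_dot`, `binary_layer`) and the tie-breaking rule are assumptions made for this example, not part of the paper.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int bitmask (bit i set means +1)."""
    mask = 0
    for i, v in enumerate(values):
        if v == 1:
            mask |= 1 << i
    return mask


def binary_dot(a_mask, b_mask, n):
    """Dot product of two +/-1 vectors of length n via XNOR and popcount."""
    # XNOR marks the positions where both vectors agree; keep only n bits.
    agree = ~(a_mask ^ b_mask) & ((1 << n) - 1)
    matches = bin(agree).count("1")
    # Each agreement contributes +1 and each disagreement contributes -1.
    return 2 * matches - n


def binary_layer(x_mask, weight_masks, n):
    """Sign activation per neuron; a zero pre-activation maps to +1 (assumed)."""
    return [1 if binary_dot(x_mask, w, n) >= 0 else -1 for w in weight_masks]
```

For example, `binary_dot(pack([1, -1, 1, 1]), pack([1, 1, -1, 1]), 4)` counts two agreements and two disagreements. Packing many weights into one machine word is where the memory and compute savings mentioned in the abstract would come from.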

Comment: Model Compression and Efficiency: introduces Binary Error Propagation, a discrete analog of backprop enabling fully binary forward and backward passes (including RNNs) with only bitwise ops.

Relevance: 10 Novelty: 9

2. Network of Theseus (like the ship)

ArXiv ID: 2512.04198

Authors: Vighnesh Subramaniam, Colin Conwell, Boris Katz, Andrei Barbu, Brian Cheung

Abstract: A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide network even under substantial architectural changes, for example converting a convolutional network into a multilayer perceptron, or GPT-2 into a recurrent neural network. By decoupling optimization from deployment, NoT expands the space of viable inference-time architectures, opening opportunities for better accuracy-efficiency tradeoffs and enabling more directed exploration of the architectural design space.
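Illustration: the abstract says replacement modules are aligned "via representational similarity metrics" without naming one. Linear CKA is one common such metric, so the sketch below uses it as a stand-in; whether NoT uses CKA is an assumption. Inputs are activation matrices with one row per example, and the function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_frob_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(p * q for p, q in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with matching row counts.

    Undefined (division by zero) if either matrix is constant across rows.
    """
    x, y = center(x), center(y)
    return cross_frob_sq(x, y) / math.sqrt(cross_frob_sq(x, x) * cross_frob_sq(y, y))
```

A score near 1 means the candidate module's activations match the guide's up to rotation and uniform scaling, so `1 - linear_cka(guide_acts, target_acts)` could serve as an alignment loss in a setup like the one described.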

Comment: Matches Model Architecture: progressive architecture conversion using representational similarity alignment to decouple optimization from deployment, enabling new accuracy–efficiency tradeoffs.

Relevance: 10 Novelty: 9

3. Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

ArXiv ID: 2512.03324

Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying

Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
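Illustration: a minimal sketch of score-based eviction under a fixed memory budget, for a single layer and head. The abstract states only that each token gets a scalar retention score that decays over time; the exponential form `gate ** age`, the `BoundedCache` class, and the hard-coded gate values are assumptions for this example. In TRIM-KV the gate is learned, which this sketch does not model.

```python
def retention(gate, created, now):
    """Assumed decay: a gate value in (0, 1] raised to the token's age."""
    return gate ** (now - created)


class BoundedCache:
    """Keeps at most `budget` token positions, evicting the lowest score."""

    def __init__(self, budget):
        self.budget = budget
        self.gates = {}  # token position -> gate value fixed at creation time

    def add(self, position, gate):
        """Insert a token, then evict until the cache fits the budget."""
        self.gates[position] = gate
        while len(self.gates) > self.budget:
            victim = min(
                self.gates,
                key=lambda p: retention(self.gates[p], p, position),
            )
            del self.gates[victim]

    def positions(self):
        """Positions currently retained, in order."""
        return sorted(self.gates)
```

Under this assumed decay, a token with a gate close to 1 behaves like a sink token that survives indefinitely, while low-gate tokens drop out after a few steps, much like a sliding window. This matches the heuristics the abstract says the learned scores recover.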

Comment: Model Efficiency: memory-bounded KV cache via learned per-token retention gates (layer/head-specific) for eviction; aligns with pruning/selection for inference efficiency.