Personalized Daily ArXiv Papers 2025-09-18

[gpt-5]	Prompt	Completion	Total
Token	34772	38522	73294
Cost	$0.04	$0.39	$0.43

Total arXiv papers: 390

Total scanned papers: 237

Total relevant papers: 22

Table of contents with paper titles:

Circuit realization and hardware linearization of monotone operator equilibrium networks Authors: Thomas Chaffey
NIRVANA: Structured pruning reimagined for large language models compression Authors: Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He
Dense Video Understanding with Gated Residual Tokenization Authors: Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu
A Compositional Kernel Model for Feature Learning Authors: Feng Ruan, Keli Liu, Michael Jordan
Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency Authors: Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov
Asterisk Operator Authors: Zixi Li
Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs Authors: Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu
Deep Lookup Network Authors: Yulan Guo, Longguang Wang, Wendong Mao, Xiaoyu Dong, Yingqian Wang, Li Liu, Wei An
Language models' activations linearly encode training-order recency Authors: Dmitrii Krasheninnikov, Richard E. Turner, David Krueger
Beyond Correlation: Causal Multi-View Unsupervised Feature Selection Learning Authors: Zongxin Shen, Yanyong Huang, Bin Wang, Jinyuan Chang, Shiyu Liu, Tianrui Li
Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension Authors: Charlotte Beylier, Parvaneh Joharinad, Jürgen Jost, Nahid Torbati
BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching Authors: Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
Semantic Fusion with Fuzzy-Membership Features for Controllable Language Modelling Authors: Yongchao Huang, Hassan Raza
State Space Models over Directed Graphs Authors: Junzhi She, Xunkai Li, Rong-Hua Li, Guoren Wang
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ashwinee Panda, Ryan Lagasse, Vasu Sharma
MapAnything: Universal Feed-Forward Metric 3D Reconstruction Authors: Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, Peter Kontschieder
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision Authors: Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten
Learning quantum many-body data locally: A provably scalable framework Authors: Koki Chinzei, Quoc Hoan Tran, Norifumi Matsumoto, Yasuhiro Endo, Hirotaka Oshima
Quantum Variational Activation Functions Empower Kolmogorov-Arnold Networks Authors: Jiun-Cheng Jiang, Morris Yu-Chao Huang, Tianlong Chen, Hsi-Sheng Goan
A Variational Framework for Residual-Based Adaptivity in Neural PDE Solvers and Operator Learning Authors: Juan Diego Toscano, Daniel T. Chen, Vivek Oommen, George Em Karniadakis
A reduced-order derivative-informed neural operator for subsurface fluid-flow Authors: Jeongjin (Jayjay) Park, Grant Bruer, Huseyin Tuna Erdinc, Abhinav Prakash Gahlot, Felix J. Herrmann
Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning Authors: Zhaoyang Chu, Yao Wan, Zhikun Zhang, Di Wang, Zhou Yang, Hongyu Zhang, Pan Zhou, Xuanhua Shi, Hai Jin, David Lo

1. Circuit realization and hardware linearization of monotone operator equilibrium networks

ArXiv ID: 2509.13793

Authors: Thomas Chaffey

Abstract: It is shown that the port behavior of a resistor-diode network corresponds to the solution of a ReLU monotone operator equilibrium network (a neural network in the limit of infinite depth), giving a parsimonious construction of a neural network in analog hardware. We furthermore show that the gradient of such a circuit can be computed directly in hardware, using a procedure we call hardware linearization. This allows the network to be trained in hardware, which we demonstrate with a device-level circuit simulation. We extend the results to cascades of resistor-diode networks, which can be used to implement feedforward and other asymmetric networks. We finally show that different nonlinear elements give rise to different activation functions, and introduce the novel diode ReLU which is induced by a non-ideal diode model.

Comment: Model Architecture and HPC: analog circuit realization of monotone operator equilibrium networks with in-hardware gradient computation (hardware linearization), enabling trainable analog implementations.

Relevance: 10 Novelty: 9
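
Illustrative sketch for entry 1 (not taken from the paper): a ReLU equilibrium network outputs the fixed point z* of z = ReLU(W z + u), where u stands for the input-dependent term. The toy code below finds that fixed point by plain forward iteration. It assumes W is a contraction (each row's absolute sum is below 1), which is a simpler and stricter condition than the monotonicity used in monotone operator equilibrium networks. The matrix W, the vector u, and the function names are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, z):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def equilibrium(W, u, tol=1e-10, max_iter=1000):
    """Iterate z <- ReLU(W z + u) until successive iterates agree within tol.

    Assumption: W is a contraction, so the iteration converges.
    """
    z = [0.0] * len(u)
    for _ in range(max_iter):
        z_next = relu([a + b for a, b in zip(matvec(W, z), u)])
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical 2x2 example; each row of W has absolute sum 0.5.
W = [[0.2, -0.3], [0.1, 0.4]]
u = [1.0, -0.5]
z_star = equilibrium(W, u)
```

Here the second unit stays clamped at zero, so the first unit settles at 1 / (1 - 0.2) = 1.25. In the paper's setting the same equilibrium is reached physically by the resistor-diode network rather than by iteration.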

2. NIRVANA: Structured pruning reimagined for large language models compression

ArXiv ID: 2509.14230

Authors: Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He

Abstract: Structured pruning of large language models (LLMs) offers substantial efficiency improvements by removing entire hidden units, yet current approaches often suffer from significant performance degradation, particularly in zero-shot settings, and necessitate costly recovery techniques such as supervised fine-tuning (SFT) or adapter insertion. To address these critical shortcomings, we introduce NIRVANA, a novel pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. Leveraging a first-order saliency criterion derived from the Neural Tangent Kernel under Adam optimization dynamics, NIRVANA provides a theoretically grounded pruning strategy that respects essential model training behaviors. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules (attention vs. MLP), which adjusts pruning intensity between modules in a globally balanced manner. Additionally, to mitigate the high sensitivity of pruning decisions to calibration data quality, we propose a simple yet effective KL divergence-based calibration data selection strategy, ensuring more reliable and task-agnostic pruning outcomes. Comprehensive experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints, providing a theoretically sound and practical approach to LLM compression. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/NIRVANA.

Comment: Matches Model Compression and Efficiency: structured pruning (sparsity) for LLMs with NTK-based saliency and adaptive layer/module sparsity allocation.

Relevance: 10 Novelty: 8

3. Dense Video Understanding with Gated Residual Tokenization

ArXiv ID: 2509.14199

Authors: Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu

Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.

Comment: Directly targets compression/efficiency via conditional tokenization (motion-compensated gating and token merging) to achieve sub-linear token growth in VLLMs.

Relevance: 9 Novelty: 8

4. A Compositional Kernel Model for Feature Learning

ArXiv ID: 2509.14158

Authors: Feng Ruan, Keli Liu, Michael Jordan

Abstract: We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.

Comment: Representation learning theory: compositional kernel model with guarantees on variable selection and recovery of nonlinear features.

Relevance: 9 Novelty: 8
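
Illustrative sketch for entry 4 (not taken from the paper): the abstract describes a kernel predictor applied to a coordinate-wise reweighting of the inputs. The snippet below evaluates a Laplace kernel on reweighted inputs and shows that a zero weight removes a coordinate from the kernel entirely, which is the sense in which the weights can perform variable selection. The unit bandwidth, the weight vector `beta`, and the function name are assumptions made for illustration.

```python
import math


def reweighted_laplace_kernel(x, y, beta):
    """Laplace kernel exp(-||beta * x - beta * y||_1) with coordinate weights beta."""
    return math.exp(-sum(abs(b * (xi - yi)) for b, xi, yi in zip(beta, x, y)))


x = [0.5, 3.0]
y = [1.0, -7.0]

# All coordinates active: both differences contribute to the distance.
k_full = reweighted_laplace_kernel(x, y, beta=[1.0, 1.0])

# Second coordinate switched off: only the first difference matters.
k_selected = reweighted_laplace_kernel(x, y, beta=[1.0, 0.0])

# Changing the switched-off coordinate leaves the kernel value unchanged.
k_moved = reweighted_laplace_kernel(x, [1.0, 42.0], beta=[1.0, 0.0])
```

With `beta = [1.0, 0.0]` the kernel depends only on the first coordinate, so `k_selected` and `k_moved` coincide.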

5. Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency

ArXiv ID: 2509.13990

Authors: Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov

Abstract: Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.

Comment: Model Compression and Efficiency: test-time scaling efficiency via step-wise pruning of self-consistency reasoning chains using inter-chain similarity, cutting KV cache and latency with theoretical backing.

Relevance: 9 Novelty: 8
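
Illustrative sketch for entry 5 (not the Slim-SC algorithm itself): Self-Consistency samples several reasoning chains and takes a majority vote over their final answers. The toy code below adds one pruning pass that drops a chain when its text is too similar to a chain already kept. The paper prunes step by step during generation at the thought level; this sketch prunes once over finished chains, and the Jaccard word-overlap measure, the 0.8 threshold, and all names are hypothetical choices.

```python
from collections import Counter


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def prune_redundant(chains, threshold=0.8):
    """Keep a chain only if it is below the threshold against every kept chain."""
    kept = []
    for reasoning, answer in chains:
        if all(jaccard(reasoning, k[0]) < threshold for k in kept):
            kept.append((reasoning, answer))
    return kept


def majority_vote(chains):
    """Return the most frequent final answer among the chains."""
    return Counter(answer for _, answer in chains).most_common(1)[0][0]


# Hypothetical (reasoning, answer) pairs.
chains = [
    ("add 2 and 3 to get 5", "5"),
    ("add 2 and 3 to get 5 total", "5"),  # near-duplicate of the first chain
    ("2 plus 3 equals 5", "5"),
    ("2 times 3 equals 6", "6"),
]
kept = prune_redundant(chains)
final_answer = majority_vote(kept)
```

Pruning changes the vote counts as well as the cost, so a real system would need to check that accuracy is preserved, as the abstract reports for Slim-SC.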

6. Asterisk Operator

ArXiv ID: 2509.13364

Authors: Zixi Li

Abstract: We propose the Asterisk Operator ($\ast$-operator), a novel unified framework for abstract reasoning based on Adjacency-Structured Parallel Propagation (ASPP). The operator formalizes structured reasoning tasks as local, parallel state evolution processes guided by implicit relational graphs. We prove that the $\ast$-operator maintains local computational constraints while achieving global reasoning capabilities, providing an efficient and convergent computational paradigm for abstract reasoning problems. Through rigorous mathematical analysis and comprehensive experiments on ARC2 challenges and Conway's Game of Life, we demonstrate the operator's universality, convergence properties, and superior performance. Our innovative Embedding-Asterisk distillation method achieves 100% accuracy on ARC2 validation with only 6M parameters, representing a significant breakthrough in neural-symbolic reasoning. Keywords: Abstract Reasoning, Adjacency Structure, Parallel Propagation, Asterisk Operator, Convergence, Universal Approximation

Comment: Model Architecture: introduces a new reasoning operator (Asterisk Operator) with analysis of convergence/universality and a compact distilled model.

Relevance: 9 Novelty: 8

7. Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs

ArXiv ID: 2509.13664

Authors: Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu

Abstract: Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model's pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model's processing pipeline. Finally, we show that through manipulating AENs, we can control LLM's behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.

Comment: Representation Learning: identifies sparse neurons encoding question ambiguity and demonstrates controllable behavior via neuron-level manipulation.

Relevance: 9 Novelty: 8

8. Deep Lookup Network

ArXiv ID: 2509.13662

Authors: Yulan Guo, Longguang Wang, Wendong Mao, Xiaoyu Dong, Yingqian Wang, Li Liu, Wei An

Abstract: Convolutional neural networks are constructed with massive operations with different types and are highly computationally intensive. Among these operations, multiplication operation is higher in computational complexity and usually requires more energy consumption with longer inference time than other operations, which hinders the deployment of convolutional neural networks on mobile devices. In many resource-limited edge devices, complicated operations can be calculated via lookup tables to reduce computational cost. Motivated by this, in this paper, we introduce a generic and efficient lookup operation which can be used as a basic operation for the construction of neural networks. Instead of calculating the multiplication of weights and activation values, simple yet efficient lookup operations are adopted to compute their responses. To enable end-to-end optimization of the lookup operation, we construct the lookup tables in a differentiable manner and propose several training strategies to promote their convergence. By replacing computationally expensive multiplication operations with our lookup operations, we develop lookup networks for the image classification, image super-resolution, and point cloud classification tasks. It is demonstrated that our lookup networks can benefit from the lookup operations to achieve higher efficiency in terms of energy consumption and inference speed while maintaining competitive performance to vanilla convolutional networks. Extensive experiments show that our lookup networks produce state-of-the-art performance on different tasks (both classification and regression tasks) and different data types (both images and point clouds).

Comment: Matches Compression/Efficiency: replaces multiplications with differentiable lookup operations and provides training strategies for LUT-based networks, yielding energy and speed gains.

Relevance: 9 Novelty: 7
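
Illustrative sketch for entry 8 (not the paper's method): the core idea of replacing a multiplication with a table lookup can be shown with a fixed, precomputed product table over quantized weights and activations. The paper learns its lookup tables in a differentiable way; the table here is simply filled with products, and the 16-level quantization over [-1, 1] and all names are assumptions.

```python
LEVELS = 16  # hypothetical 4-bit quantization grid


def quantize(x, lo=-1.0, hi=1.0):
    """Map a float in [lo, hi] to the nearest integer level in 0..LEVELS-1."""
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * (LEVELS - 1))


def dequantize(q, lo=-1.0, hi=1.0):
    """Map an integer level back to its float value on the grid."""
    return lo + q * (hi - lo) / (LEVELS - 1)


# Precompute every weight-level by activation-level product once.
TABLE = [[dequantize(i) * dequantize(j) for j in range(LEVELS)]
         for i in range(LEVELS)]


def lookup_dot(weights, activations):
    """Approximate a dot product using table lookups instead of multiplications."""
    return sum(TABLE[quantize(w)][quantize(a)]
               for w, a in zip(weights, activations))


weights = [0.5, -0.25, 0.8]
activations = [0.1, 0.9, -0.6]
approx = lookup_dot(weights, activations)
exact = sum(w * a for w, a in zip(weights, activations))
```

The gap between `approx` and `exact` comes only from quantization, which is the accuracy cost a trained lookup network has to recover.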

9. Language models' activations linearly encode training-order recency

ArXiv ID: 2509.14223

Authors: Dmitrii Krasheninnikov, Richard E. Turner, David Krueger

Abstract: We show that language models' activations linearly encode when information was learned during training. Our setup involves creating a model with a known training order by sequentially fine-tuning Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples for the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can accurately (~90%) distinguish "early" vs. "late" entities, generalizing to entities unseen during the probes' own training. The model can also be fine-tuned to explicitly report an unseen entity's training stage (~80% accuracy). Interestingly, this temporal signal does not seem attributable to simple differences in activation magnitudes, losses, or model confidence. Our paper demonstrates that models are capable of differentiating information by its acquisition time, and carries significant implications for how they might manage conflicting data and respond to knowledge modifications.

Comment: Representation Learning: reveals that activations linearly encode training-order recency, with successful linear probes—an insight into internal representations and training dynamics.

Relevance: 9 Novelty: 7
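
Illustrative sketch for entry 9 (not the paper's setup): a linear probe is a linear classifier applied to activation vectors. The toy code below builds the simplest such probe, a difference-of-means direction with a midpoint threshold, and applies it to made-up 2D "activations" labelled early or late. The data, the probe construction, and all names are hypothetical; the paper probes real Llama-3.2-1B activations.

```python
def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(early, late):
    """Return (w, bias) for a difference-of-means linear probe."""
    mu_early, mu_late = mean(early), mean(late)
    w = [l - e for e, l in zip(mu_early, mu_late)]
    midpoint = [(l + e) / 2 for e, l in zip(mu_early, mu_late)]
    return w, -dot(w, midpoint)


def predict(probe, x):
    """Classify an activation vector as 'late' or 'early'."""
    w, bias = probe
    return "late" if dot(w, x) + bias > 0 else "early"


# Synthetic toy activations, invented for this example.
early = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.1]]
late = [[1.0, 0.9], [1.2, 1.0], [0.9, 1.2]]
probe = fit_probe(early, late)

label_a = predict(probe, [1.1, 1.0])   # held-out point near the late group
label_b = predict(probe, [0.05, 0.9])  # held-out point near the early group
```

The abstract's stronger claim is that such a probe generalizes to entities unseen during probe training, which only held-out evaluation on real activations can establish.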

10. Beyond Correlation: Causal Multi-View Unsupervised Feature Selection Learning

ArXiv ID: 2509.13763

Authors: Zongxin Shen, Yanyong Huang, Bin Wang, Jinyuan Chang, Shiyu Liu, Tianrui Li

Abstract: Multi-view unsupervised feature selection (MUFS) has recently received increasing attention for its promising ability in dimensionality reduction on multi-view unlabeled data. Existing MUFS methods typically select discriminative features by capturing correlations between features and clustering labels. However, an important yet underexplored question remains: Are such correlations sufficiently reliable to guide feature selection? In this paper, we analyze MUFS from a causal perspective by introducing a novel structural causal model, which reveals that existing methods may select irrelevant features because they overlook spurious correlations caused by confounders. Building on this causal perspective, we propose a novel MUFS method called CAusal multi-view Unsupervised feature Selection leArning (CAUSA). Specifically, we first employ a generalized unsupervised spectral regression model that identifies informative features by capturing dependencies between features and consensus clustering labels. We then introduce a causal regularization module that can adaptively separate confounders from multi-view data and simultaneously learn view-shared sample weights to balance confounder distributions, thereby mitigating spurious correlations. Thereafter, integrating both into a unified learning framework enables CAUSA to select causally informative features. Comprehensive experiments demonstrate that CAUSA outperforms several state-of-the-art methods. To our knowledge, this is the first in-depth study of causal multi-view feature selection in the unsupervised setting.

Comment: Representation Learning: causal regularization for unsupervised multi-view feature selection to mitigate confounding and select informative features.

Relevance: 8 Novelty: 8

11. Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension

ArXiv ID: 2509.13385

Authors: Charlotte Beylier, Parvaneh Joharinad, Jürgen Jost, Nahid Torbati

Abstract: Utilizing recently developed abstract notions of sectional curvature, we introduce a method for constructing a curvature-based geometric profile of discrete metric spaces. The curvature concept that we use here captures the metric relations between triples of points and other points. More significantly, based on this curvature profile, we introduce a quantitative measure to evaluate the effectiveness of data representations, such as those produced by dimensionality reduction techniques. Furthermore, our experiments demonstrate that this curvature-based analysis can be employed to estimate the intrinsic dimensionality of datasets. We use this to explore the large-scale geometry of empirical networks and to evaluate the effectiveness of dimensionality reduction techniques.

Comment: Representation learning evaluation: curvature-based metric to assess dimensionality reduction quality and estimate intrinsic dimension.