Personalized Daily ArXiv Papers 2025-09-18
| [gpt-5] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 34772 | 38522 | 73294 |
| Cost | $0.04 | $0.39 | $0.43 |
Total arXiv papers: 390
Total scanned papers: 237
Total relevant papers: 22
Table of contents with paper titles:
-
Circuit realization and hardware linearization of monotone operator equilibrium networks Authors: Thomas Chaffey
-
NIRVANA: Structured pruning reimagined for large language models compression Authors: Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He
-
Dense Video Understanding with Gated Residual Tokenization Authors: Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu
-
A Compositional Kernel Model for Feature Learning Authors: Feng Ruan, Keli Liu, Michael Jordan
-
Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency Authors: Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov
-
Asterisk Operator Authors: Zixi Li
-
Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs Authors: Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu
-
Deep Lookup Network Authors: Yulan Guo, Longguang Wang, Wendong Mao, Xiaoyu Dong, Yingqian Wang, Li Liu, Wei An
-
Language models' activations linearly encode training-order recency Authors: Dmitrii Krasheninnikov, Richard E. Turner, David Krueger
-
Beyond Correlation: Causal Multi-View Unsupervised Feature Selection Learning Authors: Zongxin Shen, Yanyong Huang, Bin Wang, Jinyuan Chang, Shiyu Liu, Tianrui Li
-
Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension Authors: Charlotte Beylier, Parvaneh Joharinad, J\"urgen Jost, Nahid Torbati
-
BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching Authors: Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
-
Semantic Fusion with Fuzzy-Membership Features for Controllable Language Modelling Authors: Yongchao Huang, Hassan Raza
-
State Space Models over Directed Graphs Authors: Junzhi She, Xunkai Li, Rong-Hua Li, Guoren Wang
-
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ashwinee Panda, Ryan Lagasse, Vasu Sharma
-
MapAnything: Universal Feed-Forward Metric 3D Reconstruction Authors: Nikhil Keetha, Norman M\"uller, Johannes Sch\"onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, Peter Kontschieder
-
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision Authors: Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten
-
Learning quantum many-body data locally: A provably scalable framework Authors: Koki Chinzei, Quoc Hoan Tran, Norifumi Matsumoto, Yasuhiro Endo, Hirotaka Oshima
-
Quantum Variational Activation Functions Empower Kolmogorov-Arnold Networks Authors: Jiun-Cheng Jiang, Morris Yu-Chao Huang, Tianlong Chen, Hsi-Sheng Goan
-
A Variational Framework for Residual-Based Adaptivity in Neural PDE Solvers and Operator Learning Authors: Juan Diego Toscano, Daniel T. Chen, Vivek Oommen, George Em Karniadakis
-
A reduced-order derivative-informed neural operator for subsurface fluid-flow Authors: Jeongjin (Jayjay), Park, Grant Bruer, Huseyin Tuna Erdinc, Abhinav Prakash Gahlot, Felix J. Herrmann
-
Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning Authors: Zhaoyang Chu, Yao Wan, Zhikun Zhang, Di Wang, Zhou Yang, Hongyu Zhang, Pan Zhou, Xuanhua Shi, Hai Jin, David Lo
1. Circuit realization and hardware linearization of monotone operator equilibrium networks
ArXiv ID: 2509.13793
Authors: Thomas Chaffey
Abstract: It is shown that the port behavior of a resistor-diode network corresponds to the solution of a ReLU monotone operator equilibrium network (a neural network in the limit of infinite depth), giving a parsimonious construction of a neural network in analog hardware. We furthermore show that the gradient of such a circuit can be computed directly in hardware, using a procedure we call hardware linearization. This allows the network to be trained in hardware, which we demonstrate with a device-level circuit simulation. We extend the results to cascades of resistor-diode networks, which can be used to implement feedforward and other asymmetric networks. We finally show that different nonlinear elements give rise to different activation functions, and introduce the novel diode ReLU which is induced by a non-ideal diode model.
Comment: Model Architecture and HPC: analog circuit realization of monotone operator equilibrium networks with in-hardware gradient computation (hardware linearization), enabling trainable analog implementations.
Relevance: 10 Novelty: 9
2. NIRVANA: Structured pruning reimagined for large language models compression
ArXiv ID: 2509.14230
Authors: Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He
Abstract: Structured pruning of large language models (LLMs) offers substantial efficiency improvements by removing entire hidden units, yet current approaches often suffer from significant performance degradation, particularly in zero-shot settings, and necessitate costly recovery techniques such as supervised fine-tuning (SFT) or adapter insertion. To address these critical shortcomings, we introduce NIRVANA, a novel pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. Leveraging a first-order saliency criterion derived from the Neural Tangent Kernel under Adam optimization dynamics, NIRVANA provides a theoretically grounded pruning strategy that respects essential model training behaviors. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules (attention vs. MLP), which adjusts pruning intensity between modules in a globally balanced manner. Additionally, to mitigate the high sensitivity of pruning decisions to calibration data quality, we propose a simple yet effective KL divergence-based calibration data selection strategy, ensuring more reliable and task-agnostic pruning outcomes. Comprehensive experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints, providing a theoretically sound and practical approach to LLM compression. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/NIRVANA.
Comment: Matches Model Compression and Efficiency: structured pruning (sparsity) for LLMs with NTK-based saliency and adaptive layer/module sparsity allocation.
Relevance: 10 Novelty: 8
3. Dense Video Understanding with Gated Residual Tokenization
ArXiv ID: 2509.14199
Authors: Haichao Zhang, Wenhao Chai, Shwai He, Ang Li, Yun Fu
Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
Comment: Directly targets compression/efficiency via conditional tokenization (motion-compensated gating and token merging) to achieve sub-linear token growth in VLLMs.
Relevance: 9 Novelty: 8
4. A Compositional Kernel Model for Feature Learning
ArXiv ID: 2509.14158
Authors: Feng Ruan, Keli Liu, Michael Jordan
Abstract: We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.
Comment: Representation learning theory: compositional kernel model with guarantees on variable selection and recovery of nonlinear features.
Relevance: 9 Novelty: 8
5. Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency
ArXiv ID: 2509.13990
Authors: Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov
Abstract: Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.
Comment: Model Compression and Efficiency: test-time scaling efficiency via step-wise pruning of self-consistency reasoning chains using inter-chain similarity, cutting KV cache and latency with theoretical backing.
Relevance: 9 Novelty: 8
6. Asterisk Operator
ArXiv ID: 2509.13364
Authors: Zixi Li
Abstract: We propose the \textbf{Asterisk Operator} ($\ast$-operator), a novel unified framework for abstract reasoning based on Adjacency-Structured Parallel Propagation (ASPP). The operator formalizes structured reasoning tasks as local, parallel state evolution processes guided by implicit relational graphs. We prove that the $\ast$-operator maintains local computational constraints while achieving global reasoning capabilities, providing an efficient and convergent computational paradigm for abstract reasoning problems. Through rigorous mathematical analysis and comprehensive experiments on ARC2 challenges and Conway's Game of Life, we demonstrate the operator's universality, convergence properties, and superior performance. Our innovative Embedding-Asterisk distillation method achieves 100\% accuracy on ARC2 validation with only 6M parameters, representing a significant breakthrough in neural-symbolic reasoning. \textbf{Keywords:} Abstract Reasoning, Adjacency Structure, Parallel Propagation, Asterisk Operator, Convergence, Universal Approximation
Comment: Model Architecture: introduces a new reasoning operator (Asterisk Operator) with analysis of convergence/universality and a compact distilled model.
Relevance: 9 Novelty: 8
7. Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs
ArXiv ID: 2509.13664
Authors: Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu
Abstract: Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model's pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model's processing pipeline. Finally, we show that through manipulating AENs, we can control LLM's behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.
Comment: Representation Learning: identifies sparse neurons encoding question ambiguity and demonstrates controllable behavior via neuron-level manipulation.
Relevance: 9 Novelty: 8
8. Deep Lookup Network
ArXiv ID: 2509.13662
Authors: Yulan Guo, Longguang Wang, Wendong Mao, Xiaoyu Dong, Yingqian Wang, Li Liu, Wei An
Abstract: Convolutional neural networks are constructed with massive operations with different types and are highly computationally intensive. Among these operations, multiplication operation is higher in computational complexity and usually requires {more} energy consumption with longer inference time than other operations, which hinders the deployment of convolutional neural networks on mobile devices. In many resource-limited edge devices, complicated operations can be calculated via lookup tables to reduce computational cost. Motivated by this, in this paper, we introduce a generic and efficient lookup operation which can be used as a basic operation for the construction of neural networks. Instead of calculating the multiplication of weights and activation values, simple yet efficient lookup operations are adopted to compute their responses. To enable end-to-end optimization of the lookup operation, we construct the lookup tables in a differentiable manner and propose several training strategies to promote their convergence. By replacing computationally expensive multiplication operations with our lookup operations, we develop lookup networks for the image classification, image super-resolution, and point cloud classification tasks. It is demonstrated that our lookup networks can benefit from the lookup operations to achieve higher efficiency in terms of energy consumption and inference speed while maintaining competitive performance to vanilla convolutional networks. Extensive experiments show that our lookup networks produce state-of-the-art performance on different tasks (both classification and regression tasks) and different data types (both images and point clouds).
Comment: Matches Compression/Efficiency: replaces multiplications with differentiable lookup operations and provides training strategies for LUT-based networks, yielding energy and speed gains.
Relevance: 9 Novelty: 7
9. Language models' activations linearly encode training-order recency
ArXiv ID: 2509.14223
Authors: Dmitrii Krasheninnikov, Richard E. Turner, David Krueger
Abstract: We show that language models' activations linearly encode when information was learned during training. Our setup involves creating a model with a known training order by sequentially fine-tuning Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples for the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can accurately (~90%) distinguish "early" vs. "late" entities, generalizing to entities unseen during the probes' own training. The model can also be fine-tuned to explicitly report an unseen entity's training stage (~80% accuracy). Interestingly, this temporal signal does not seem attributable to simple differences in activation magnitudes, losses, or model confidence. Our paper demonstrates that models are capable of differentiating information by its acquisition time, and carries significant implications for how they might manage conflicting data and respond to knowledge modifications.
Comment: Representation Learning: reveals that activations linearly encode training-order recency, with successful linear probes—an insight into internal representations and training dynamics.
Relevance: 9 Novelty: 7
10. Beyond Correlation: Causal Multi-View Unsupervised Feature Selection Learning
ArXiv ID: 2509.13763
Authors: Zongxin Shen, Yanyong Huang, Bin Wang, Jinyuan Chang, Shiyu Liu, Tianrui Li
Abstract: Multi-view unsupervised feature selection (MUFS) has recently received increasing attention for its promising ability in dimensionality reduction on multi-view unlabeled data. Existing MUFS methods typically select discriminative features by capturing correlations between features and clustering labels. However, an important yet underexplored question remains: \textit{Are such correlations sufficiently reliable to guide feature selection?} In this paper, we analyze MUFS from a causal perspective by introducing a novel structural causal model, which reveals that existing methods may select irrelevant features because they overlook spurious correlations caused by confounders. Building on this causal perspective, we propose a novel MUFS method called CAusal multi-view Unsupervised feature Selection leArning (CAUSA). Specifically, we first employ a generalized unsupervised spectral regression model that identifies informative features by capturing dependencies between features and consensus clustering labels. We then introduce a causal regularization module that can adaptively separate confounders from multi-view data and simultaneously learn view-shared sample weights to balance confounder distributions, thereby mitigating spurious correlations. Thereafter, integrating both into a unified learning framework enables CAUSA to select causally informative features. Comprehensive experiments demonstrate that CAUSA outperforms several state-of-the-art methods. To our knowledge, this is the first in-depth study of causal multi-view feature selection in the unsupervised setting.
Comment: Representation Learning: causal regularization for unsupervised multi-view feature selection to mitigate confounding and select informative features.
Relevance: 8 Novelty: 8
11. Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension
ArXiv ID: 2509.13385
Authors: Charlotte Beylier, Parvaneh Joharinad, J\"urgen Jost, Nahid Torbati
Abstract: Utilizing recently developed abstract notions of sectional curvature, we introduce a method for constructing a curvature-based geometric profile of discrete metric spaces. The curvature concept that we use here captures the metric relations between triples of points and other points. More significantly, based on this curvature profile, we introduce a quantitative measure to evaluate the effectiveness of data representations, such as those produced by dimensionality reduction techniques. Furthermore, Our experiments demonstrate that this curvature-based analysis can be employed to estimate the intrinsic dimensionality of datasets. We use this to explore the large-scale geometry of empirical networks and to evaluate the effectiveness of dimensionality reduction techniques.
Comment: Representation learning evaluation: curvature-based metric to assess dimensionality reduction quality and estimate intrinsic dimension.
Relevance: 8 Novelty: 7
12. BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
ArXiv ID: 2509.13789
Authors: Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
Abstract: Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
Comment: Matches Model Compression and Efficiency: training-free block-wise feature caching in Diffusion Transformers to reduce inference compute.
Relevance: 8 Novelty: 7
13. Semantic Fusion with Fuzzy-Membership Features for Controllable Language Modelling
ArXiv ID: 2509.13357
Authors: Yongchao Huang, Hassan Raza
Abstract: We propose semantic fusion, a lightweight scheme that augments a Transformer language model (LM) with a parallel, fuzzy-membership feature channel that encodes token-level semantics. Each token is represented by a vector of interpretable features (e.g. part-of-speech cues, shallow roles, boundary flags, sentiment polarity and strength) whose values are graded degrees from differentiable membership functions (e.g. power kernels). These per-token vectors form a sentence-level semantic matrix fused via a gated adapter into the LM. Training uses standard next-token prediction, an auxiliary loss that reconstructs the semantic features from hidden states, and a lightweight uniformizer that regularizes adjective-class distributions. On a synthetic two-clause corpus with held-out adjectives for out-of-distribution (OOD) control, semantic fusion improves perplexity and enables precise, user-controllable generation of polarity and punctuation while maintaining model simplicity. This approach adds only small overhead, remains fully compatible with tied input-output embeddings, and provides an interpretable pathway for conditioned natural language generation.
Comment: Matches Model Architecture: augments a Transformer LM with a gated semantic feature channel and auxiliary semantic reconstruction for controllable generation with small overhead.
Relevance: 8 Novelty: 7
14. State Space Models over Directed Graphs
ArXiv ID: 2509.13735
Authors: Junzhi She, Xunkai Li, Rong-Hua Li, Guoren Wang
Abstract: Directed graphs are ubiquitous across numerous domains, where the directionality of edges encodes critical causal dependencies. However, existing GNNs and graph Transformers tailored for directed graphs face two major challenges: (1) effectively capturing long-range causal dependencies derived from directed edges; (2) balancing accuracy and training efficiency when processing large-scale graph datasets. In recent years, state space models (SSMs) have achieved substantial progress in causal sequence tasks, and their variants designed for graphs have demonstrated state-of-the-art accuracy while maintaining high efficiency across various graph learning benchmarks. However, existing graph state space models are exclusively designed for undirected graphs, which limits their performance in directed graph learning. To this end, we propose an innovative approach DirEgo2Token which sequentializes directed graphs via k-hop ego graphs. This marks the first systematic extension of state space models to the field of directed graph learning. Building upon this, we develop DirGraphSSM, a novel directed graph neural network architecture that implements state space models on directed graphs via the message-passing mechanism. Experimental results demonstrate that DirGraphSSM achieves state-of-the-art performance on three representative directed graph learning tasks while attaining competitive performance on two additional tasks with 1.5$\times $ to 2$\times $ training speed improvements compared to existing state-of-the-art models.
Comment: Matches Model Architecture: introduces a directed-graph SSM architecture (DirGraphSSM) with a novel k-hop ego sequentialization for causal dependencies in directed graphs.
Relevance: 8 Novelty: 7
15. Evaluation Awareness Scales Predictably in Open-Weights Large Language Models
ArXiv ID: 2509.13333
Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ashwinee Panda, Ryan Lagasse, Vasu Sharma
Abstract: Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as \emph{evaluation awareness}. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single $70$B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across $15$ models scaling from $0.27$B to $70$B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
Comment: Matches Representation Learning: analyzes internal activations with linear probes to reveal a scaling law for evaluation awareness in LLMs, offering insight into encoded context.
Relevance: 8 Novelty: 7
16. MapAnything: Universal Feed-Forward Metric 3D Reconstruction
ArXiv ID: 2509.13414
Authors: Nikhil Keetha, Norman M\"uller, Johannes Sch\"onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, Peter Kontschieder
Abstract: We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
Comment: Model Architecture: proposes a unified transformer-based feed-forward 3D reconstruction backbone with a factored multi-view scene representation.
Relevance: 8 Novelty: 7
17. Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
ArXiv ID: 2509.14234
Authors: Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten
Abstract: Where do learning signals come from when there is no ground truth in post-training? We propose turning exploration into supervision through Compute as Teacher (CaT), which converts the model's own exploration at inference-time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts and then optimizing toward it. Concretely, the current policy produces a group of rollouts; a frozen anchor (the initial policy) reconciles omissions and contradictions to estimate a reference, turning extra inference-time compute into a teacher signal. We turn this into rewards in two regimes: (i) verifiable tasks use programmatic equivalence on final answers; (ii) non-verifiable tasks use self-proposed rubrics-binary, auditable criteria scored by an independent LLM judge, with reward given by the fraction satisfied. Unlike selection methods (best-of-N, majority, perplexity, or judge scores), synthesis may disagree with the majority and be correct even when all rollouts are wrong; performance scales with the number of rollouts. As a test-time procedure, CaT improves Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B (up to +27% on MATH-500; +12% on HealthBench). With reinforcement learning (CaT-RL), we obtain further gains (up to +33% and +30%), with the trained policy surpassing the initial teacher signal.
Comment: Foundational training method: converts inference-time compute into reference-free supervision (CaT/CaT-RL), an algorithmic post-training innovation.
Relevance: 7 Novelty: 8
18. Learning quantum many-body data locally: A provably scalable framework
ArXiv ID: 2509.13705
Authors: Koki Chinzei, Quoc Hoan Tran, Norifumi Matsumoto, Yasuhiro Endo, Hirotaka Oshima
Abstract: Machine learning (ML) holds great promise for extracting insights from complex quantum many-body data obtained in quantum experiments. This approach can efficiently solve certain quantum problems that are classically intractable, suggesting potential advantages of harnessing quantum data. However, addressing large-scale problems still requires significant amounts of data beyond the limited computational resources of near-term quantum devices. We propose a scalable ML framework called Geometrically Local Quantum Kernel (GLQK), designed to efficiently learn quantum many-body experimental data by leveraging the exponential decay of correlations, a phenomenon prevalent in noncritical systems. In the task of learning an unknown polynomial of quantum expectation values, we rigorously prove that GLQK substantially improves polynomial sample complexity in the number of qubits $n$, compared to the existing shadow kernel, by constructing a feature space from local quantum information at the correlation length scale. This improvement is particularly notable when each term of the target polynomial involves few local subsystems. Remarkably, for translationally symmetric data, GLQK achieves constant sample complexity, independent of $n$. We numerically demonstrate its high scalability in two learning tasks on quantum many-body phenomena. These results establish new avenues for utilizing experimental data to advance the understanding of quantum many-body physics.
Comment: Algorithmic efficiency and representation: local quantum kernel exploiting decay of correlations to improve sample complexity.
Relevance: 7 Novelty: 7
19. Quantum Variational Activation Functions Empower Kolmogorov-Arnold Networks
ArXiv ID: 2509.14026
Authors: Jiun-Cheng Jiang, Morris Yu-Chao Huang, Tianlong Chen, Hsi-Sheng Goan
Abstract: Variational quantum circuits (VQCs) are central to quantum machine learning, while recent progress in Kolmogorov-Arnold networks (KANs) highlights the power of learnable activation functions. We unify these directions by introducing quantum variational activation functions (QVAFs), realized through single-qubit data re-uploading circuits called DatA Re-Uploading ActivatioNs (DARUANs). We show that DARUAN with trainable weights in data pre-processing possesses an exponentially growing frequency spectrum with data repetitions, enabling an exponential reduction in parameter size compared with Fourier-based activations without loss of expressivity. Embedding DARUAN into KANs yields quantum-inspired KANs (QKANs), which retain the interpretability of KANs while improving their parameter efficiency, expressivity, and generalization. We further introduce two novel techniques to enhance scalability, feasibility and computational efficiency, such as layer extension and hybrid QKANs (HQKANs) as drop-in replacements of multi-layer perceptrons (MLPs) for feed-forward networks in large-scale models. We provide theoretical analysis and extensive experiments on function regression, image classification, and autoregressive generative language modeling, demonstrating the efficiency and scalability of QKANs. DARUANs and QKANs offer a promising direction for advancing quantum machine learning on both noisy intermediate-scale quantum (NISQ) hardware and classical quantum simulators.
Comment: Matches Model Architecture: introduces variational activation functions (DARUAN) embedded into KANs/HQKANs as MLP replacements for parameter-efficient expressivity.
Relevance: 7 Novelty: 7
20. A Variational Framework for Residual-Based Adaptivity in Neural PDE Solvers and Operator Learning
ArXiv ID: 2509.14198
Authors: Juan Diego Toscano, Daniel T. Chen, Vivek Oommen, George Em Karniadakis
Abstract: Residual-based adaptive strategies are widely used in scientific machine learning but remain largely heuristic. We introduce a unifying variational framework that formalizes these methods by integrating convex transformations of the residual. Different transformations correspond to distinct objective functionals: exponential weights target the minimization of uniform error, while linear weights recover the minimization of quadratic error. Within this perspective, adaptive weighting is equivalent to selecting sampling distributions that optimize the primal objective, thereby linking discretization choices directly to error metrics. This principled approach yields three benefits: (1) it enables systematic design of adaptive schemes across norms, (2) reduces discretization error through variance reduction of the loss estimator, and (3) enhances learning dynamics by improving the gradient signal-to-noise ratio. Extending the framework to operator learning, we demonstrate substantial performance gains across optimizers and architectures. Our results provide a theoretical justification of residual-based adaptivity and establish a foundation for principled discretization and training strategies.
Comment: Matches Representation Learning: formalizes residual-based adaptivity via a variational objective, improving training dynamics (variance reduction, gradient SNR) and principled sampling/discretization.
Relevance: 7 Novelty: 7
21. A reduced-order derivative-informed neural operator for subsurface fluid-flow
ArXiv ID: 2509.13620
Authors: Jeongjin (Jayjay), Park, Grant Bruer, Huseyin Tuna Erdinc, Abhinav Prakash Gahlot, Felix J. Herrmann
Abstract: Neural operators have emerged as cost-effective surrogates for expensive fluid-flow simulators, particularly in computationally intensive tasks such as permeability inversion from time-lapse seismic data, and uncertainty quantification. In these applications, the fidelity of the surrogate's gradients with respect to system parameters is crucial, as the accuracy of downstream tasks, such as optimization and Bayesian inference, relies directly on the quality of the derivative information. Recent advances in physics-informed methods have leveraged derivative information to improve surrogate accuracy. However, incorporating explicit Jacobians can become computationally prohibitive, as the complexity typically scales quadratically with the number of input parameters. To address this limitation, we propose DeFINO (Derivative-based Fisher-score Informed Neural Operator), a reduced-order, derivative-informed training framework. DeFINO integrates Fourier neural operators (FNOs) with a novel derivative-based training strategy guided by the Fisher Information Matrix (FIM). By projecting Jacobians onto dominant eigen-directions identified by the FIM, DeFINO captures critical sensitivity information directly informed by observational data, significantly reducing computational expense. We validate DeFINO through synthetic experiments in the context of subsurface multi-phase fluid-flow, demonstrating improvements in gradient accuracy while maintaining robust forward predictions of underlying fluid dynamics. These results highlight DeFINO's potential to offer practical, scalable solutions for inversion problems in complex real-world scenarios, all at substantially reduced computational cost.
Comment: Matches Efficiency/Training for operators: derivative-informed training with FIM-projected Jacobians integrated into FNOs reduces derivative complexity while preserving gradient fidelity.
Relevance: 7 Novelty: 7
22. Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning
ArXiv ID: 2509.13755
Authors: Zhaoyang Chu, Yao Wan, Zhikun Zhang, Di Wang, Zhou Yang, Hongyu Zhang, Pan Zhou, Xuanhua Shi, Hai Jin, David Lo
Abstract: While Code Language Models (CLMs) have demonstrated superior performance in software engineering tasks such as code generation and summarization, recent empirical studies reveal a critical privacy vulnerability: these models exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted. To address this issue, several approaches, including training data de-duplication and differential privacy augmentation, have been proposed. However, these methods require full-model retraining for deployed CLMs, which incurs substantial computational costs. In this paper, we aim to answer the following research question: Can sensitive information memorized by CLMs be erased effectively and efficiently? We conduct a pioneering investigation into erasing sensitive memorization in CLMs through machine unlearning - a post-hoc modification method that removes specific information from trained models without requiring full retraining. Specifically, we first quantify the memorization risks of sensitive data within CLM training datasets and curate a high-risk dataset of 50,000 sensitive memorized samples as unlearning targets. We study two widely used gradient ascent-based unlearning approaches: the vanilla and constraint-based methods, and introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code. Extensive experiments on three families of CLMs, i.e., CodeParrot, CodeGen-Mono, and Qwen2.5-Coder, validate the effectiveness and efficiency of CodeEraser in erasing targeted sensitive memorization while maintaining model utility.
Comment: Model Editing/Efficiency: machine unlearning (post-hoc) to erase memorized sensitive code segments without full retraining, improving safety while preserving utility.
Relevance: 7 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.