Personalized Daily ArXiv Papers 2025-12-10

[gpt-5]	Prompt	Completion	Total
Token	38012	34218	72230
Cost	$0.05	$0.34	$0.39

Total arXiv papers: 481

Total scanned papers: 283

Total relevant papers: 23

Table of contents with paper titles:

LUNA: Linear Universal Neural Attention with Generalization Guarantees Authors: Ashkan Shahbazi, Ping He, Ali Abbasi, Yikun Bai, Xinran Liu, Elaheh Akbari, Darian Salehi, Navid NaderiAlizadeh, Soheil Kolouri
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models Authors: Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan, Wenhao Chai, Jian Wang, Keze Wang
Generation is Required for Data-Efficient Perception Authors: Jack Brady, Bernhard Schölkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel
LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model Authors: Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin
Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis Authors: Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer
LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks Authors: Nanda K. Unnikrishnan, Keshab K. Parhi
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models Authors: Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu
GSPN-2: Efficient Parallel Sequence Modeling Authors: Hongjun Wang, Yitong Jiang, Collin McCarthy, David Wehr, Hanrong Ye, Xinhao Li, Ka Chun Cheung, Wonmin Byeon, Jinwei Gu, Ke Chen, Kai Han, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu
ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models Authors: Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
Mathematical Foundations of Neural Tangents and Infinite-Width Networks Authors: Rachana Mysore, Preksha Girish, Kavitha Jayaram, Shrey Kumar, Preksha Girish, Shravan Sanjeev Bagal, Kavitha Jayaram, Shreya Aravind Shastry
Softly Symbolifying Kolmogorov-Arnold Networks Authors: James Bagrow, Josh Bongard
Complexity of One-Dimensional ReLU DNNs Authors: Jonathan Kogan, Hayden Jananthan, Jeremy Kepner
Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems Authors: Mingwei Li, Xiaoyuan Zhang, Chengwei Yang, Zilong Zheng, Yaodong Yang
Short-Context Dominance: How Much Local Context Natural Language Actually Needs? Authors: Vala Vakilian, Zimeng Wang, Ankit Singh Rawat, Christos Thrampoulidis
gHAWK: Local and Global Structure Encoding for Scalable Training of Graph Neural Networks on Knowledge Graphs Authors: Humera Sabir, Fatima Farooq, Ashraf Aboulnaga
PR-CapsNet: Pseudo-Riemannian Capsule Network with Adaptive Curvature Routing for Graph Learning Authors: Ye Qin, Jingchao Wang, Yang Shi, Haiying Huang, Junxu Li, Weijian Liu, Tinghui Chen, Jinghui Qin
MobileFineTuner: A Unified End-to-End Framework for Fine-Tuning LLMs on Mobile Phones Authors: Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, Bing Luo
Semi-Supervised Contrastive Learning with Orthonormal Prototypes Authors: Huanran Li (Department of Electrical Engineering), Manh Nguyen (Statistics), Daniel Pimentel-Alarcón (Biostatistics, Wisconsin Institute of Discovery, University of Wisconsin-Madison)
Nonnegative Matrix Factorization through Cone Collapse Authors: Manh Nguyen (Department of Statistics), Daniel Pimentel-Alarcón (Department of Biostatistics and Medical Informatics, Wisconsin Institute of Discovery, University of Wisconsin-Madison)
MixLM: High-Throughput and Effective LLM Ranking via Text-Embedding Mix-Interaction Authors: Guoyao Li, Ran He, Shusen Jing, Kayhan Behdin, Yubo Wang, Sundara Raman Ramachandran, Chanh Nguyen, Jian Sheng, Xiaojing Ma, Chuanrui Zhu, Sriram Vasudevan, Muchen Wu, Sayan Ghosh, Lin Su, Qingquan Song, Xiaoqing Wang, Zhipeng Wang, Qing Lan, Yanning Chen, Jingwei Wu, Luke Simon, Wenjing Zhang, Qi Guo, Fedor Borisyuk
GeoDM: Geometry-aware Distribution Matching for Dataset Distillation Authors: Xuhui Li, Zhengquan Luo, Zihui Cui, Zhiqiang Xu
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training Authors: Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram

1. LUNA: Linear Universal Neural Attention with Generalization Guarantees

ArXiv ID: 2512.08061

Authors: Ashkan Shahbazi, Ping He, Ali Abbasi, Yikun Bai, Xinran Liu, Elaheh Akbari, Darian Salehi, Navid NaderiAlizadeh, Soheil Kolouri

Abstract: Scaling attention faces a critical bottleneck: the $\mathcal{O}(n^2)$ quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to $\mathcal{O}(n)$, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce LUNA, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. LUNA is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, LUNA learns a feature basis tailored to the specific data and task, overcoming the expressive limitations of fixed-feature methods. LUNA implements this with a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in the sequence length. Empirical evaluations validate our approach across diverse settings. On the Long Range Arena (LRA), LUNA achieves state-of-the-art average accuracy among efficient Transformers under compute parity, using the same parameter count, training steps, and approximate FLOPs. LUNA also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.
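
Sketch: the "streaming form" mentioned in the abstract can be illustrated with generic kernelized linear attention. The code below is a minimal illustration, not the paper's implementation: `feature_map` is a fixed stand-in (elu(x) + 1) where LUNA would use a learned map, and all function names are hypothetical. It computes causal attention once with running sums (linear in sequence length) and once with the explicit quadratic kernel, so the two can be compared.

```python
import math

def feature_map(x):
    # Stand-in positive feature map, elu(x) + 1. LUNA learns its map instead;
    # this fixed choice is an assumption made only for illustration.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def causal_linear_attention(queries, keys, values):
    """Streaming form: one pass, state size independent of sequence length."""
    d_feat, d_val = len(feature_map(keys[0])), len(values[0])
    S = [[0.0] * d_val for _ in range(d_feat)]  # running sum of phi(k) v^T
    z = [0.0] * d_feat                          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = feature_map(k)
        for i in range(d_feat):
            z[i] += fk[i]
            for j in range(d_val):
                S[i][j] += fk[i] * v[j]
        fq = feature_map(q)
        norm = dot(fq, z)
        outputs.append([sum(fq[i] * S[i][j] for i in range(d_feat)) / norm
                        for j in range(d_val)])
    return outputs

def causal_quadratic_attention(queries, keys, values):
    """Reference O(n^2) form with the same kernel phi(q) . phi(k)."""
    outputs = []
    for t, q in enumerate(queries):
        fq = feature_map(q)
        weights = [dot(fq, feature_map(keys[s])) for s in range(t + 1)]
        total = sum(weights)
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights)) / total
                        for j in range(len(values[0]))])
    return outputs
```

Both functions return the same outputs up to floating-point error; only the streaming version avoids revisiting earlier tokens.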

Comment: Matches Model Architecture and Efficiency: linear attention with learned positive-definite kernel feature maps and streaming computation; retains linear time/memory.

Relevance: 10 Novelty: 9

2. HybridToken-VLM: Hybrid Token Compression for Vision-Language Models

ArXiv ID: 2512.08240

Authors: Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan, Wenhao Chai, Jian Wang, Keze Wang

Abstract: Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.

Comment: Compression/Efficiency: hybrid discrete–continuous token compression for VLMs yielding 580-to-1 compression and a single fused token.

Relevance: 10 Novelty: 9

3. Generation is Required for Data-Efficient Perception

ArXiv ID: 2512.08854

Authors: Jack Brady, Bernhard Schölkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel

Abstract: It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today's most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.

Comment: Matches Representation Learning Theory: formalizes inductive biases for compositional generalization and shows why generative inversion enables data-efficient perception.

Relevance: 9 Novelty: 9

4. LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model

ArXiv ID: 2512.07855

Authors: Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin

Abstract: Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence varies, the computational bottlenecks in Transformer models exhibit dynamic behavior across stages, which calls for a cross-stage sparse acceleration strategy. Unfortunately, most existing sparse Transformer approaches are single-stage based, and their sparsity prediction mechanisms lead to significant power overhead when applied across multiple stages. To this end, this paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA. First, an asymmetric leading one computing (ALOC) scheme is designed to eliminate expensive multiplications. Next, a mixed-precision multi-round shifting accumulation (MRSA) mechanism is further proposed to mitigate the accumulation overhead. A data-feature dependent filter (DDF) strategy is designed to work in concert with the MRSA process. Finally, an elaborate accelerator is designed to translate the theoretical enhancement into practical hardware improvement. Experimental results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works Spatten, Sanger and FACT, respectively.
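
Sketch: the abstract does not spell out the ALOC scheme, so the code below shows only the generic idea behind leading-one (log-domain) approximation, under the assumption of positive integer operands. One operand is rounded down to its leading power of two, which turns the multiplication into a shift. The function names are hypothetical and this is not the paper's algorithm.

```python
def leading_one(x: int) -> int:
    """Bit position of the most significant 1 in a positive integer."""
    return x.bit_length() - 1

def shift_multiply(a: int, b: int) -> int:
    """Approximate a * b by rounding b down to a power of two, then shifting."""
    return a << leading_one(b)

def log_domain_score(q, k):
    """Multiplier-free estimate of a dot product over positive integers."""
    return sum(shift_multiply(qi, ki) for qi, ki in zip(q, k))
```

For positive operands the estimate never exceeds the exact product and is always more than half of it, which is the kind of coarse accuracy a sparsity predictor can tolerate when it only needs to rank attention scores.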

Comment: Matches Compression/Efficiency: dynamic sparsity prediction and log-domain computation with hardware co-design for Transformer acceleration.

Relevance: 9 Novelty: 8

5. Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

ArXiv ID: 2512.08819

Authors: Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer

Abstract: Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.

Comment: Training dynamics/Architecture growth: analyzes and improves gradual depth stacking to counter the Transformer curse of depth with mechanistic insights.

Relevance: 9 Novelty: 8

6. LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks

ArXiv ID: 2512.08160

Authors: Nanda K. Unnikrishnan, Keshab K. Parhi

Abstract: In our prior work, LayerPipe, we introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve a desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure, where inner layers require fewer delays and outer layers require longer delays. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages. When layers are pipelined in groups, all layers in the group share the same assignment of delays. These insights not only explain previously observed scheduling patterns but also expose an often overlooked challenge: pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication-computation tradeoffs.

Comment: High Performance Computing: principled multistage pipelining with variable gradient delays and pipeline-aware EMA to reconstruct past weights and cut memory.

Relevance: 9 Novelty: 8

7. SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

ArXiv ID: 2512.07993

Authors: Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu

Abstract: Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks, limiting their efficient deployment. To reduce KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates at coarse-grained, sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector to update the hidden activation states during inference, forcing the LRM to generate concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which achieves up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6x shorter generation length while improving throughput by up to 1.7x.
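
Sketch: to make the sentence-level eviction idea concrete, the code below drops any sentence that is too similar to one already kept. The abstract does not define SkipKV's sentence-scoring metric, so everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real sentence representation, cosine similarity is the score, and the 0.8 threshold is arbitrary.

```python
import math
import re
from collections import Counter

def embed(sentence):
    # Hypothetical stand-in for a sentence embedding: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def keep_indices(sentences, threshold=0.8):
    """Indices of sentences to keep; near-duplicates of kept sentences are dropped."""
    vecs = [embed(s) for s in sentences]
    kept = []
    for i, v in enumerate(vecs):
        if all(cosine(v, vecs[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

In a KV-cache setting, the cache entries belonging to the dropped sentences would be the ones evicted; removing whole sentences rather than individual tokens is what keeps the remaining context semantically coherent.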

Comment: Matches Model Compression and Efficiency: training-free KV-cache compression via selective sentence-level eviction and generation control for CoT.

Relevance: 9 Novelty: 8

8. GSPN-2: Efficient Parallel Sequence Modeling

ArXiv ID: 2512.07884

Authors: Hongjun Wang, Yitong Jiang, Collin McCarthy, David Wehr, Hanrong Ye, Xinhao Li, Ka Chun Cheung, Wonmin Byeon, Jinwei Gu, Ke Chen, Kai Han, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

Abstract: Vision transformer efficiency remains a bottleneck for high-resolution image and long-video real-world applications. Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme, bringing the cost close to linear in the number of rows or columns, while retaining accuracy. Despite this advancement, the existing GSPN implementation still suffers from (i) heavy overhead due to repeatedly launching GPU kernels, (ii) excessive data transfers from global GPU memory, and (iii) redundant computations caused by maintaining separate propagation weights for each channel. We introduce GSPN-2, a joint algorithm-system redesign. In particular, we consolidate the thousands of micro-launches of the previous implementation into a single 2D kernel, explicitly pin one warp to each channel slice, and stage the previous column's activations in shared memory. On the model side, we introduce a compact channel propagation strategy that replaces per-channel matrices, trims parameters, and aligns naturally with the affinity map used in transformer attention. Experiments demonstrate GSPN-2's effectiveness across image classification and text-to-image synthesis tasks, matching transformer-level accuracy with significantly lower computational cost. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications through its unique combination of structured matrix transformations and GPU-optimized implementation. Project page: https://whj363636.github.io/GSPN2/

Comment: Matches HPC and Model Architecture: algorithm–system co-design for efficient global context modeling (GSPN-2) with fused kernels and compact channel propagation as an alternative to self-attention.

Relevance: 9 Novelty: 8

9. ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

ArXiv ID: 2512.07843

Authors: Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin

Abstract: Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.

Comment: Efficiency/HPC: adaptive parallel reasoning at inference with trie-based training–inference co-design avoiding KV cache/PE changes, plus RL for parallelization.

Relevance: 9 Novelty: 8

10. InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

ArXiv ID: 2512.08829

Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang

Abstract: Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6x inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
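
Sketch: the sliding-window half of the design can be illustrated with a fixed-size rolling cache. The code below is a generic illustration of sliding window attention bookkeeping, not InfiniteVL's code; the class and function names are hypothetical, and the Gated DeltaNet component is not modeled.

```python
from collections import deque

class SlidingWindowCache:
    """KV cache that keeps only the most recent `window` entries."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        # When full, deque(maxlen=...) silently drops the oldest entry.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

def visible_positions(t, window):
    """Causal positions that token t may attend to under the window."""
    return list(range(max(0, t - window + 1), t + 1))
```

The cache size stays constant however long the stream runs, which gives the constant memory footprint. Anything older than the window is invisible to this path, which is the degradation the abstract attributes to window-only models and the reason a recurrent linear-attention state is paired with it.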

Comment: Architecture + efficiency: combines sliding-window attention with linear attention (Gated DeltaNet) for linear complexity, constant KV cache, and faster inference.

Relevance: 9 Novelty: 8

11. Mathematical Foundations of Neural Tangents and Infinite-Width Networks

ArXiv ID: 2512.08264

Authors: Rachana Mysore, Preksha Girish, Kavitha Jayaram, Shrey Kumar, Preksha Girish, Shravan Sanjeev Bagal, Kavitha Jayaram, Shreya Aravind Shastry

Abstract: We investigate the mathematical foundations of neural networks in the infinite-width regime through the Neural Tangent Kernel (NTK). We propose the NTK-Eigenvalue-Controlled Residual Network (NTK-ECRN), an architecture integrating Fourier feature embeddings, residual connections with layerwise scaling, and stochastic depth to enable rigorous analysis of kernel evolution during training. Our theoretical contributions include deriving bounds on NTK dynamics, characterizing eigenvalue evolution, and linking spectral properties to generalization and optimization stability. Empirical results on synthetic and benchmark datasets validate the predicted kernel behavior and demonstrate improved training stability and generalization. This work provides a comprehensive framework bridging infinite-width theory and practical deep-learning architectures.

Comment: Matches Representation Learning and Model Architecture: NTK-ECRN enables rigorous analysis with bounds on NTK dynamics/eigenvalues linking to generalization/stability.

Relevance: 9 Novelty: 7

12. Softly Symbolifying Kolmogorov-Arnold Networks

ArXiv ID: 2512.07875

Authors: James Bagrow, Josh Bongard

Abstract: Kolmogorov-Arnold Networks (KANs) offer a promising path toward interpretable machine learning: their learnable activations can be studied individually, while collectively fitting complex data accurately. In practice, however, trained activations often lack symbolic fidelity, learning pathological decompositions with no meaningful correspondence to interpretable forms. We propose Softly Symbolified Kolmogorov-Arnold Networks (S2KAN), which integrate symbolic primitives directly into training. Each activation draws from a dictionary of symbolic and dense terms, with learnable gates that sparsify the representation. Crucially, this sparsification is differentiable, enabling end-to-end optimization, and is guided by a principled Minimum Description Length objective. When symbolic terms suffice, S2KAN discovers interpretable forms; when they do not, it gracefully degrades to dense splines. We demonstrate competitive or superior accuracy with substantially smaller models across symbolic benchmarks, dynamical systems forecasting, and real-world prediction tasks, and observe evidence of emergent self-sparsification even without regularization pressure.
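
Sketch: a gated dictionary activation can be written in a few lines. The code below is an assumption-laden illustration rather than S2KAN itself: the primitive set, the sigmoid gate parameterization, and all names are hypothetical, and the dense spline fallback and the MDL objective are omitted.

```python
import math

# Hypothetical dictionary of symbolic primitives.
PRIMITIVES = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sin": math.sin,
    "exp": math.exp,
}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_activation(x, coeffs, gate_logits):
    """phi(x) = sum_k sigmoid(g_k) * c_k * f_k(x); smooth in the gates g_k."""
    return sum(sigmoid(gate_logits[name]) * coeffs[name] * f(x)
               for name, f in PRIMITIVES.items())

def active_terms(gate_logits, tol=0.5):
    """Primitives whose gate is open; these form the readable symbolic expression."""
    return [name for name in PRIMITIVES if sigmoid(gate_logits[name]) > tol]
```

Because the gates enter through a smooth function, they can be trained by gradient descent alongside the coefficients; once most gates close, the surviving terms can be read off as a formula.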

Comment: Matches Model Architecture and Sparsity: integrates symbolic primitives with differentiable sparsifying gates (MDL-guided) in KANs for interpretable representations.

Relevance: 9 Novelty: 7

13. Complexity of One-Dimensional ReLU DNNs

ArXiv ID: 2512.08091

Authors: Jonathan Kogan, Hayden Jananthan, Jeremy Kepner

Abstract: We study the expressivity of one-dimensional (1D) ReLU deep neural networks through the lens of their linear regions. For randomly initialized, fully connected 1D ReLU networks (He scaling with nonzero bias) in the infinite-width limit, we prove that the expected number of linear regions grows as $\sum_{\ell=1}^{L} n_\ell + o\left(\sum_{\ell=1}^{L} n_\ell\right) + 1$, where $n_\ell$ denotes the number of neurons in the $\ell$-th hidden layer. We also propose a function-adaptive notion of sparsity that compares the expected regions used by the network to the minimal number needed to approximate a target within a fixed tolerance.
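
Sketch: the region count can be probed numerically by tracking when the hidden-unit on/off pattern changes along the input axis. The code below is a simple grid-based proxy, not the paper's method: it counts activation-pattern changes, so regions narrower than the grid spacing or outside the scanned interval are missed. Function names and the interval are assumptions.

```python
def activation_pattern(x, layers):
    """Tuple of on/off states of every hidden neuron for scalar input x.

    `layers` is a list of (W, b) pairs for the hidden layers, where W is a
    list of weight rows and b is the list of biases.
    """
    h = [x]
    pattern = []
    for W, b in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + bias
               for row, bias in zip(W, b)]
        pattern.extend(p > 0 for p in pre)
        h = [max(p, 0.0) for p in pre]
    return tuple(pattern)

def count_pattern_regions(layers, lo=-5.0, hi=5.0, samples=10001):
    """Number of maximal grid intervals sharing one activation pattern."""
    step = (hi - lo) / (samples - 1)
    regions = 1
    prev = activation_pattern(lo, layers)
    for i in range(1, samples):
        cur = activation_pattern(lo + i * step, layers)
        if cur != prev:
            regions += 1
            prev = cur
    return regions
```

For a single hidden layer with three neurons whose kinks sit at -1, 0, and 1, the count is 4, which matches the leading term n_1 + 1 in this simplest case.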

Comment: Matches Representation Learning Theory: expressivity of 1D ReLU DNNs via linear region counts; introduces function-adaptive sparsity notion.

Relevance: 9 Novelty: 7

14. Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems

ArXiv ID: 2512.08411

Authors: Mingwei Li, Xiaoyuan Zhang, Chengwei Yang, Zilong Zheng, Yaodong Yang

Abstract: Model-based planning in robotic domains is fundamentally challenged by the hybrid nature of physical dynamics, where continuous motion is punctuated by discrete events such as contacts and impacts. Conventional latent world models typically employ monolithic neural networks that enforce global continuity, inevitably over-smoothing the distinct dynamic modes (e.g., sticking vs. sliding, flight vs. stance). For a planner, this smoothing results in catastrophic compounding errors during long-horizon lookaheads, rendering the search process unreliable at physical boundaries. To address this, we introduce the Prismatic World Model (PRISM-WM), a structured architecture designed to decompose complex hybrid dynamics into composable primitives. PRISM-WM leverages a context-aware Mixture-of-Experts (MoE) framework where a gating mechanism implicitly identifies the current physical mode, and specialized experts predict the associated transition dynamics. We further introduce a latent orthogonalization objective to ensure expert diversity, effectively preventing mode collapse. By accurately modeling the sharp mode transitions in system dynamics, PRISM-WM significantly reduces rollout drift. Extensive experiments on challenging continuous control benchmarks, including high-dimensional humanoids and diverse multi-task settings, demonstrate that PRISM-WM provides a superior high-fidelity substrate for trajectory optimization algorithms (e.g., TD-MPC), proving its potential as a powerful foundational model for next-generation model-based agents.
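
Sketch: a toy version of the gating idea is a two-mode system where one expert handles smooth flight and the other a discontinuous contact reset. Everything in the code below is a hypothetical illustration: the physics, the constants, and the hand-written gate (a learned gating network in the paper's setting) are assumptions, and the latent orthogonalization objective is not modeled.

```python
import math

DT, G = 0.01, 9.81  # hypothetical time step (s) and gravity (m/s^2)

def flight_expert(pos, vel):
    """Ballistic motion: continuous dynamics while airborne."""
    return pos + vel * DT, vel - G * DT

def contact_expert(pos, vel):
    """Inelastic ground contact: position and velocity are reset."""
    return 0.0, 0.0

EXPERTS = [flight_expert, contact_expert]

def gate(pos, vel, sharpness=200.0):
    """Softmax over mode logits; a hand-written stand-in for a learned gate."""
    logits = [sharpness * pos, -sharpness * pos]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(pos, vel):
    """Gate-weighted mixture of the experts' next-state predictions."""
    weights = gate(pos, vel)
    preds = [expert(pos, vel) for expert in EXPERTS]
    return tuple(sum(w * p[i] for w, p in zip(weights, preds)) for i in range(2))
```

With a sharp gate, each prediction comes almost entirely from one expert, so the jump at contact is preserved instead of being averaged into a smooth curve by a single network.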

Comment: Model architecture (MoE): context-aware gating with specialized experts for hybrid dynamics; adds latent orthogonalization to enforce expert diversity.

Relevance: 9 Novelty: 7

15. Short-Context Dominance: How Much Local Context Natural Language Actually Needs?

ArXiv ID: 2512.08082

Authors: Vala Vakilian, Zimeng Wang, Ankit Singh Rawat, Christos Thrampoulidis

Abstract: We investigate the short-context dominance hypothesis: that for most sequences, a small local prefix suffices to predict their next tokens. Using large language models as statistical oracles, we measure the minimum context length (MCL) needed to reproduce accurate full-context predictions across datasets with sequences of varying lengths. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. Given the dominance of short-context tokens, we then ask whether it is possible to detect challenging long-context sequences for which a short local prefix does not suffice for prediction. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next-token and is compatible with sampling strategies beyond greedy decoding. Our experiments validate that simple thresholding of the metric defining DaMCL achieves high performance in detecting long vs. short context sequences. Finally, to counter the bias that short-context dominance induces in LLM output distributions, we develop an intuitive decoding algorithm that leverages our detector to identify and boost tokens that are long-range-relevant. Across Q&A tasks and model architectures, we confirm that mitigating the bias improves performance.
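
Sketch: under a simplified reading of MCL, the measurement is a search for the shortest suffix that reproduces the full-context prediction. The code below is illustrative only: `toy_oracle` is a hypothetical stand-in for an LLM, the equality test replaces whatever accuracy criterion the paper uses, and DaMCL is not modeled.

```python
def minimum_context_length(tokens, predict):
    """Smallest k such that the last k tokens yield the full-context prediction."""
    target = predict(tokens)
    for k in range(1, len(tokens) + 1):
        if predict(tokens[-k:]) == target:
            return k
    return len(tokens)

def toy_oracle(context):
    """Hypothetical stand-in for an LLM: closes an unmatched '(' if one is open."""
    depth = 0
    for tok in context:
        if tok == "(":
            depth += 1
        elif tok == ")" and depth > 0:
            depth -= 1
    return ")" if depth > 0 else "."
```

Sequences with no open bracket need only one token of context, while a sequence that opens a bracket at the start needs its full length. That is the short-context versus long-context split the detector in the abstract aims to identify.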

Comment: Matches Representation Learning/Training Dynamics: quantifies minimum context length and proposes DaMCL-based decoding to address short-context dominance.