Personalized Daily ArXiv Papers 2025-11-12

[gpt-5]	Prompt	Completion	Total
Token	51588	48776	100364
Cost	$0.06	$0.49	$0.55

Total arXiv papers: 622

Total scanned papers: 385

Total relevant papers: 22

Table of contents with paper titles:

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics Authors: Randall Balestriero, Yann LeCun
Parallel Sampling via Autospeculation Authors: Nima Anari, Carlo Baronio, CJ Chen, Alireza Haqi, Frederic Koehler, Anqi Li, Thuy-Duong Vuong
Extreme Model Compression with Structured Sparsity at Low Precision Authors: Dan Liu, Nikita Dvornik, Xue Liu
A Circular Argument : Does RoPE need to be Equivariant for Vision? Authors: Chase van de Geijn, Timo L\"uddecke, Polina Turishcheva, Alexander S. Ecker
DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones Authors: Tuowei Wang, Minxing Huang, Fengzu Li, Ligeng Chen, Jinrui Zhang, Ju Ren
Streaming Tensor Program: A streaming abstraction for dynamic parallelism Authors: Gina Sohn, Genghan Zhang, Konstantin Hossfeld, Jungwoo Kim, Nathan Sobotka, Nathan Zhang, Olivia Hsu, Kunle Olukotun
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models Authors: Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang
NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization Authors: Xiyuan Wei, Chih-Jen Lin, Tianbao Yang
HipKittens: Fast and Furious AMD Kernels Authors: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher R\'e, Simran Arora
A Generalized Spectral Framework to Expain Neural Scaling and Compression Dynamics Authors: Yizhou Zhang
Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits Authors: Dev Patel, Gabrielle Gervacio, Diekola Raimi, Kevin Zhu, Ryan Lagasse, Gabriel Grand, Ashwinee Panda, Maheep Chaudhary
Alignment-Aware Quantization for LLM Safety Authors: Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning Authors: Jialong Qin, Xin Zou, Di Lu, Yibo Yan, Xuming Hu
A General Method for Proving Networks Universal Approximation Property Authors: Wei Wang
Motif 2 12.7B technical report Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs Authors: Sean P. Fillingham, Andrew Gordon, Peter Lai, Xavier Poncini, David Quarel, Stefan Heimersheim
A Unified Geometric Field Theory Framework for Transformers: From Manifold Embeddings to Kernel Modulation Authors: Xianshuai Shi, Jianfeng Zhu, Leibo Liu
ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation Authors: Yue Min, Shaobo Wang, Jiaze Li, Tianle Niu, Junxin Fan, Yongliang Miao, Lijin Yang, Linfeng Zhang
Coherence Mechanisms for Provable Self-Improvement Authors: Mehryar Mohri, Jon Schneider, Yifan Wu
Training Language Models to Explain Their Own Computations Authors: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas
Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning Authors: Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning Authors: Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, Jianfei Cai

1. LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

ArXiv ID: 2511.08544

Authors: Randall Balestriero, Yann LeCun

Abstract: Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in {\bf LeJEPA}, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective--{\bf Sketched Isotropic Gaussian Regularization} (SIGReg)--to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only $\approx$50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using imagenet-1k for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79\% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (\href{git@github.com:rbalestr-lab/lejepa.git}{GitHub repo}).

Comment: Author match

2. Parallel Sampling via Autospeculation

ArXiv ID: 2511.07869

Authors: Nima Anari, Carlo Baronio, CJ Chen, Alireza Haqi, Frederic Koehler, Anqi Li, Thuy-Duong Vuong

Abstract: We present parallel algorithms to accelerate sampling via counting in two settings: any-order autoregressive models and denoising diffusion models. An any-order autoregressive model accesses a target distribution $\mu$ on $[q]^n$ through an oracle that provides conditional marginals, while a denoising diffusion model accesses a target distribution $\mu$ on $\mathbb{R}^n$ through an oracle that provides conditional means under Gaussian noise. Standard sequential sampling algorithms require $\widetilde{O}(n)$ time to produce a sample from $\mu$ in either setting. We show that, by issuing oracle calls in parallel, the expected sampling time can be reduced to $\widetilde{O}(n^{1/2})$. This improves the previous $\widetilde{O}(n^{2/3})$ bound for any-order autoregressive models and yields the first parallel speedup for diffusion models in the high-accuracy regime, under the relatively mild assumption that the support of $\mu$ is bounded. We introduce a novel technique to obtain our results: speculative rejection sampling. This technique leverages an auxiliary speculative'' distribution~$\nu$ that approximates~$\mu$ to accelerate sampling. Our technique is inspired by the well-studiedspeculative decoding'' techniques popular in large language models, but differs in key ways. Firstly, we use autospeculation,'' namely we build the speculation $\nu$ out of the same oracle that defines~$\mu$. In contrast, speculative decoding typically requires a separate, faster, but potentially less accuratedraft'' model $\nu$. Secondly, the key differentiating factor in our technique is that we make and accept speculations at a ``sequence'' level rather than at the level of single (or a few) steps. This last fact is key to unlocking our parallel runtime of $\widetilde{O}(n^{1/2})$.

Comment: High Performance Computing: speculative rejection sampling (autospeculation) achieving parallel sampling in O~(n^{1/2}) time for autoregressive and diffusion models.

Relevance: 10 Novelty: 9

3. Extreme Model Compression with Structured Sparsity at Low Precision

ArXiv ID: 2511.08360

Authors: Dan Liu, Nikita Dvornik, Xue Liu

Abstract: Deep neural networks (DNNs) are used in many applications, but their large size and high computational cost make them hard to run on devices with limited resources. Two widely used techniques to address this challenge are weight quantization, which lowers the precision of all weights, and structured sparsity, which removes unimportant weights while retaining the important ones at full precision. Although both are effective individually, they are typically studied in isolation due to their compounded negative impact on model accuracy when combined. In this work, we introduce SLOPE Structured Sparsity at Low Precision), a unified framework, to effectively combine structured sparsity and low-bit quantization in a principled way. We show that naively combining sparsity and quantization severely harms performance due to the compounded impact of both techniques. To address this, we propose a training-time regularization strategy that minimizes the discrepancy between full-precision weights and their sparse, quantized counterparts by promoting angular alignment rather than direct matching. On ResNet-18, SLOPE achieves $\sim20\times$ model size reduction while retaining $\sim$99% of the original accuracy. It consistently outperforms state-of-the-art quantization and structured sparsity methods across classification, detection, and segmentation tasks on models such as ResNet-18, ViT-Small, and Mask R-CNN.

Comment: Direct hit on 'Model Compression and Efficiency': combines structured sparsity with low-bit quantization using a training-time angular-alignment regularizer for extreme compression.

Relevance: 10 Novelty: 8

4. A Circular Argument : Does RoPE need to be Equivariant for Vision?

ArXiv ID: 2511.08368

Authors: Chase van de Geijn, Timo L\"uddecke, Polina Turishcheva, Alexander S. Ecker

Abstract: Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing spurring recent progress towards generalizing RoPE to higher-dimensional data such as images and videos. The success of RoPE has been thought to be due to its positional equivariance, i.e. its status as a relative positional encoding. In this paper, we mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. Moreover, we show Mixed RoPE to be the analogously general solution for M-dimensional data, if we require commutative generators -- a property necessary for RoPE's equivariance. However, we question whether strict equivariance plays a large role in RoPE's performance. We propose Spherical RoPE, a method analogous to Mixed RoPE, but assumes non-commutative generators. Empirically, we find Spherical RoPE to have the equivalent or better learning behavior compared to its equivariant analogues. This suggests that relative positional embeddings are not as important as is commonly believed, at least within computer vision. We expect this discovery to facilitate future work in positional encodings for vision that can be faster and generalize better by removing the preconception that they must be relative.

Comment: Model Architecture: formal analysis of RoPE equivariance and a new positional encoding (Spherical RoPE) for M-dimensional data, challenging necessity of strict equivariance.

Relevance: 10 Novelty: 8

5. DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones

ArXiv ID: 2511.07427

Authors: Tuowei Wang, Minxing Huang, Fengzu Li, Ligeng Chen, Jinrui Zhang, Ju Ren

Abstract: As the demand for human-like reasoning, multi-turn dialogues, and long-form responses grows, large language models (LLMs) are increasingly expected to support efficient and effective long-sequence decoding. However, due to limited DRAM capacity, long-seuqence LLM decoding on smartphones is constrained by the key-value cache (KVCache), whose memory footprint increases linearly with sequence length. Retrieval-based methods mitigate DRAM pressure by offloading KVCache to flash and retrieving query-relevant entries through cluster-based indexing. Unfortunately, as decoding progresses, KVCache distribution shifts render static or local cluster updates progressively misaligned, excluding essential entries or fetching redundant ones. These issues are further exacerbated by smartphone-specific limitations in bandwidth, IOPS, and memory capacity. We propose DynaKV, the first adaptive KVCache management approach that jointly addresses accuracy and efficiency for long-sequence decoding on smartphones. DynaKV integrates three key techniques: (1) Migration-Free Cluster Adaptation, which adaptively splits clusters during retrieval without incurring additional transfers; (2) Continuity-Centric Flash Management, which co-locates correlated entries and clusters and employs a dual-head layout for efficient updates; and (3) Memory-Efficient Cache Design, which virtualizes cache space across DRAM and flash and extends replacement policies to align with cluster-level access patterns. Evaluations demonstrate that DynaKV improves retrieval accuracy and reduces end-to-end latency compared to state-of-the-art solutions, achieving average gains of $1.38\times$ in accuracy and $1.47\times$ speedups. Furthermore, the insights of DynaKV naturally extend to other long-context workloads and multi-tier memory hierarchies, underscoring its broader applicability.

Comment: Matches HPC/Efficiency with KV-cache memory optimization for long-sequence LLM decoding on smartphones: adaptive clustering, flash-aware layout, and cache virtualization.

Relevance: 9 Novelty: 8

6. Streaming Tensor Program: A streaming abstraction for dynamic parallelism

ArXiv ID: 2511.07776

Authors: Gina Sohn, Genghan Zhang, Konstantin Hossfeld, Jungwoo Kim, Nathan Sobotka, Nathan Zhang, Olivia Hsu, Kunle Olukotun

Abstract: Dynamic behaviors are becoming prevalent in many tensor applications. In machine learning, for example, the input tensors are dynamically shaped or ragged, and data-dependent control flow is widely used in many models. However, the limited expressiveness of prior programming abstractions for spatial dataflow accelerators forces the dynamic behaviors to be implemented statically or lacks the visibility for performance-critical decisions. To address these challenges, we present the Streaming Tensor Program (STeP), a new streaming abstraction that enables dynamic tensor workloads to run efficiently on spatial dataflow accelerators. STeP introduces flexible routing operators, an explicit memory hierarchy, and symbolic shape semantics that expose dynamic data rates and tensor dimensions. These capabilities unlock new optimizations-dynamic tiling, dynamic parallelization, and configuration time-multiplexing-that adapt to dynamic behaviors while preserving dataflow efficiency. Using a cycle-approximate simulator on representative LLM layers with real-world traces, dynamic tiling reduces on-chip memory requirement by 2.18x, dynamic parallelization improves latency by 1.5x, and configuration time-multiplexing improves compute utilization by 2.57x over implementations available in prior abstractions.

Comment: High Performance Computing: new streaming abstraction (STeP) with dynamic tiling/parallelization and explicit memory hierarchy for dynamic tensor workloads on accelerators.

Relevance: 9 Novelty: 8

7. Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

ArXiv ID: 2511.08577

Authors: Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

Abstract: Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.

Comment: Model Architecture / Conditional computation: dynamic latent iterations only on hard tokens with duo-causal attention across iteration depth and LoRA-based refinement for efficiency.

Relevance: 9 Novelty: 8

8. NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization

ArXiv ID: 2511.08417

Authors: Xiyuan Wei, Chih-Jen Lin, Tianbao Yang

Abstract: Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, which are updated at each epoch in a blockwise coordinate manner to keep track of updated encoders. However, this scheme incurs optimization error that scales with the ratio of dataset size to batch size, limiting effectiveness for large datasets or small batches. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) $\textbf{reformulating}$ the contrastive loss for each sample $\textbf{via convex analysis}$ into a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) $\textbf{transforming}$ the resulting minimization over $n$ auxiliary variables (where $n$ is the dataset size) via $\textbf{variational analysis}$ into the minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation, leading to improved performance compared with previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms previous methods.

Comment: HPC/Efficiency: reformulates CLIP contrastive normalizer via convex/variational analysis and learns a neural normalizer, enabling efficient large-scale training with small batches.

Relevance: 9 Novelty: 8

9. HipKittens: Fast and Furious AMD Kernels

ArXiv ID: 2511.08083

Authors: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher R\'e, Simran Arora

Abstract: AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.

Comment: HPC: tile-based programming framework for AMD GPUs enabling high-performance AI kernels across vendors (systems-level innovation).

Relevance: 9 Novelty: 8

10. A Generalized Spectral Framework to Expain Neural Scaling and Compression Dynamics

ArXiv ID: 2511.07892

Authors: Yizhou Zhang

Abstract: Empirical scaling laws describe how test loss and other performance metrics depend on model size, dataset size, and compute. While such laws are consistent within specific regimes, apparently distinct scaling behaviors have been reported for related settings such as model compression. Motivated by recent progress in spectral analyses of neural representations, this paper develops a \emph{generalized spectral framework} that unifies learning dynamics and compression phenomena under a common functional ansatz. We generalize the spectral evolution function from the linear kernel form $g(\lambda t)=\lambda t$ to an asymptotically polynomial function $g(\lambda,t;\beta)$, characterized by an effective spectral--temporal elasticity $\rho(\beta)$. This framework recovers existing lazy and feature-learning theories as special cases and yields an invariant relation between learning and compression

Comment: Develops a unified spectral framework connecting learning dynamics and compression—fits 'Representation Learning' and 'Compression/Efficiency' theory.