Personalized Daily ArXiv Papers 2025-11-21

[gpt-5]	Prompt	Completion	Total
Token	36183	37347	73530
Cost	$0.05	$0.37	$0.42

Total arXiv papers: 439

Total scanned papers: 252

Total relevant papers: 19

Table of contents with paper titles:

Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit Authors: Bohan Zhang, Zihao Wang, Hengyu Fu, Jason D. Lee
Change-of-Basis Pruning via Rotational Invariance Authors: Alex Ning, Vainateya Rangaraju
Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States Authors: Mikael von Strauss
Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference Authors: Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu, Wei Zhang
Sparse Autoencoders are Topic Models Authors: Leander Girrbach, Zeynep Akata
Evolution Strategies at the Hyperscale Authors: Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Dylan Cope, Jarek Liesen, Lukas Seier, Theo Wolf, Uljad Berdica, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, Jakob Nicolaus Foerster
Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone Authors: Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Torsten Scholak
PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants Authors: Mingkun Yu, Heming Zhong, Dan Huang, Yutong Lu, Jiazhi Jiang
Almost Sure Convergence Analysis of Differentially Private Stochastic Gradient Methods Authors: Amartya Mukherjee, Jun Liu
Walrus: A Cross-Domain Foundation Model for Continuum Dynamics Authors: Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze W. K. Wong, Hadi Sotoudeh, Alberto Bietti, Irina Espejo, Rio Fear, Siavash Golkar, Tom Hehir, Keiya Hirashima, Geraud Krawezik, Francois Lanusse, Rudy Morel, Ruben Ohana, Liam Parker, Mariel Pettee, Jeff Shen, Kyunghyun Cho, Miles Cranmer, Shirley Ho
Descend or Rewind? Stochastic Gradient Descent Unlearning Authors: Siqiao Mu, Diego Klabjan
Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter Authors: Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation Authors: Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story Authors: Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
iLTM: Integrated Large Tabular Model Authors: David Bonet, Marçal Comajoan Cara, Alvaro Calafell, Daniel Mas Montserrat, Alexander G. Ioannidis
GLOBE: Accurate and Generalizable PDE Surrogates using Domain-Inspired Architectures and Equivariances Authors: Peter Sharpe
Anatomy of an Idiom: Tracing Non-Compositionality in Language Models Authors: Andrew Gomes
Gauge-Equivariant Graph Networks via Self-Interference Cancellation Authors: Yoonhyuk Choi, Chong-Kwon Kim
Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation Authors: Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari

1. Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

ArXiv ID: 2511.15120

Authors: Bohan Zhang, Zihao Wang, Hengyu Fu, Jason D. Lee

Abstract: In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing that the inner weights can perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise and recovers this span. It surprisingly indicates that optimal results can only be achieved if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to effectively learn hierarchical functions with respect to both sample and time efficiency.

Comment: Representation Learning: provable sample/time-optimal learning of multi-index models by two-layer nets via gradient descent.

Relevance: 10 Novelty: 9
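
Illustrative sketch (not the paper's method): the abstract says the first-layer weights implicitly run a power iteration that mimics a spectral start for the span of $\boldsymbol{U}$. The snippet below shows a textbook version of that idea only. It samples a Gaussian multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$, forms the moment matrix $\frac{1}{n}\sum_i y_i(\boldsymbol{x}_i\boldsymbol{x}_i^\top - I)$, and runs orthogonal (block power) iteration to recover the hidden subspace. The quadratic link, the sizes, and all function names are assumptions chosen for the demo; no neural network or layer-wise gradient descent is involved.

```python
import numpy as np

def make_multi_index_data(n, d, rng):
    """Sample x ~ N(0, I_d) and y = g(Ux) with a hidden rank-2 subspace U."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
    U = Q.T                                   # 2 x d, orthonormal rows
    X = rng.standard_normal((n, d))
    Z = X @ U.T                               # projections onto the hidden subspace
    y = Z[:, 0] ** 2 + 2.0 * Z[:, 1] ** 2     # assumed link function g (demo only)
    return X, y, U

def estimate_subspace(X, y, r, rng, n_iter=50):
    """Orthogonal (block power) iteration on the empirical moment matrix."""
    n, d = X.shape
    # Empirical version of E[y (x x^T - I)], which equals U^T E[g(z)(z z^T - I)] U.
    M = (X * y[:, None]).T @ X / n - y.mean() * np.eye(d)
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    for _ in range(n_iter):
        V, _ = np.linalg.qr(M @ V)            # multiply, then re-orthonormalize
    return V                                  # d x r, orthonormal columns

def alignment(U, V):
    """Fraction of the true subspace captured by the estimate, in [0, 1]."""
    return float(np.linalg.norm(U @ V) ** 2 / U.shape[0])

rng = np.random.default_rng(0)
X, y, U = make_multi_index_data(n=50_000, d=20, rng=rng)
V = estimate_subspace(X, y, r=2, rng=rng)
score = alignment(U, V)
print(f"subspace alignment: {score:.3f}")
```

An alignment near 1 means the estimated basis spans almost the same subspace as the rows of U. The sample size here is far larger than needed for d = 20 and says nothing about the paper's $\widetilde{\mathcal{O}}(d)$ rate.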

2. Change-of-Basis Pruning via Rotational Invariance

ArXiv ID: 2511.16061

Authors: Alex Ning, Vainateya Rangaraju

Abstract: Structured pruning removes entire neurons or channels, but its effectiveness depends on how importance is distributed across the representation space. Change-of-basis (CoB) pruning addresses this challenge by applying orthogonal linear transformations that concentrate importance within certain dimensions. However, many standard deep learning architectures are not inherently invariant to such transformations. To enable compatibility, we introduce two-subspace radial activations (TSRAs): an activation family that is invariant to orthogonal linear transformations applied independently within its two activation subspaces. This invariance allows CoB transformations to be merged into surrounding weights without incurring extra parameters. We position this work as a proof-of-concept that a rotationally invariant design may offer a principled approach towards change-of-basis pruning. We do not provide an analysis of multiple TSRA candidates nor do we explore weight initialization for any TSRAs. These limitations, combined with other necessary modifications we make to permit rotational invariance, result in a slight accuracy drop of $4.52\%$ compared to a ReLU-based control. However, using activation-magnitude importance, VGG-16 implementing our CoB+TSRA framework shows encouraging results on CIFAR-10. Under fixed-ratio structured pruning, CoB improves accuracy over a TSRA baseline at all pruning ratios and extends the reliable pruning frontier from roughly $30\%$ to $70\%$ of parameters without post-prune fine-tuning. Under threshold-based pruning strategies, CoB prunes $90$--$96\%$ of parameters while maintaining a $1$--$6\%$ accuracy drop after fine-tuning. Together, these results indicate that rotationally invariant architectures may offer a promising path towards CoB pruning.

Comment: Strongly matches Model Compression/Efficiency criterion: change-of-basis structured pruning enabled by rotationally invariant activations (TSRAs) to concentrate importance and prune effectively.

Relevance: 10 Novelty: 8
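
Illustrative sketch (the exact TSRA definition is not given in the abstract): the snippet below uses a hypothetical radial activation that rescales each of two subspaces by a function of its norm, and checks the property the abstract relies on. A block-orthogonal change of basis Q, applied independently within the two subspaces, commutes with the activation, so it can be folded into the neighbouring weight matrices without changing the network output. The tanh-based radial function, the layer sizes, and all names are assumptions for the demo.

```python
import numpy as np

def radial(v, eps=1e-12):
    """Rescale v by a function of its norm only (hypothetical radial form)."""
    norm = np.linalg.norm(v)
    return v * (np.tanh(norm) / (norm + eps))

def two_subspace_radial(h, split):
    """Apply the radial map separately to h[:split] and h[split:]."""
    return np.concatenate([radial(h[:split]), radial(h[split:])])

def random_orthogonal(k, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, split = 5, 8, 3, 4
W1 = rng.standard_normal((d_hidden, d_in))
W2 = rng.standard_normal((d_out, d_hidden))
x = rng.standard_normal(d_in)

# Block-diagonal orthogonal matrix: one rotation per activation subspace.
Q = np.zeros((d_hidden, d_hidden))
Q[:split, :split] = random_orthogonal(split, rng)
Q[split:, split:] = random_orthogonal(d_hidden - split, rng)

# Original network versus the same network with Q merged into both weight matrices.
y_original = W2 @ two_subspace_radial(W1 @ x, split)
y_rotated = (W2 @ Q.T) @ two_subspace_radial((Q @ W1) @ x, split)
max_diff = float(np.max(np.abs(y_original - y_rotated)))
print(f"max output difference after merging Q: {max_diff:.2e}")
```

The same check fails for an elementwise activation such as ReLU, which is why a change of basis cannot be merged into the weights of a standard network.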

3. Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States

ArXiv ID: 2511.14808

Authors: Mikael von Strauss

Abstract: Under real-analytic assumptions on decoder-only Transformers, recent work shows that the map from discrete prompts to last-token hidden states is generically injective on finite prompt sets. We refine this picture: for each layer $\ell$ we define a collision discriminant $\Delta^\ell \subset \Theta$ and injective stratum $U^\ell = \Theta \setminus \Delta^\ell$, and prove a dichotomy -- either the model is nowhere injective on the set, or $U^\ell$ is open and dense and every $F^\ell_\theta$ is injective. Under mild non-singularity assumptions on the optimizer and an absolutely continuous initialization, generic injectivity persists along smooth training trajectories over any fixed horizon. We also treat symmetry groups $G$, showing that discriminants and injective strata descend to the quotient $\Theta/G$, so injectivity is naturally a property of functional equivalence classes. We complement these results with an empirical study of layerwise geometric diagnostics. We define a separation margin and a co-Lipschitz (lower Lipschitz) constant between prompt space and last-token representation space, estimated via nearest-neighbor statistics on large prompt sets. Applying these diagnostics to pretrained LLaMA-3 and Qwen models, we study behavior across layers, sequence lengths, model scales, and 8- and 4-bit activation quantization. On our sampled prompts we see no collisions in full precision or at 8 bits, while 4-bit quantization induces a small number of collisions and markedly shrinks co-Lipschitz estimates. For a small GPT-2 trained from scratch, normalized metrics remain stable over training. Overall, the results suggest that Transformer representations are generically and persistently injective in the continuous-parameter idealization, while their practical invertibility can be probed using simple geometric diagnostics.

Comment: Model Architecture / Representation Learning: theoretical analysis of Transformer injectivity and bi-Lipschitz properties with layerwise geometric diagnostics (including quantization effects).

Relevance: 10 Novelty: 8

4. Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

ArXiv ID: 2511.15015

Authors: Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu, Wei Zhang

Abstract: Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.

Comment: Model Compression and Efficiency: adaptive, runtime expert quantization for MoE inference under strict HBM budgets.

Relevance: 10 Novelty: 8

5. Sparse Autoencoders are Topic Models

ArXiv ID: 2511.16309

Authors: Leander Girrbach, Zeynep Akata

Abstract: Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.

Comment: Representation Learning/Sparsity: theoretical reinterpretation of sparse autoencoders as topic models and a new SAE-TM framework for thematic analysis.

Relevance: 9 Novelty: 8

6. Evolution Strategies at the Hyperscale

ArXiv ID: 2511.16652

Authors: Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Dylan Cope, Jarek Liesen, Lukas Seier, Theo Wolf, Uljad Berdica, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, Jakob Nicolaus Foerster

Abstract: We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Naïve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that is used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.

Comment: High Performance Computing + Efficiency: low-rank evolution strategies enabling scalable, backprop-free training for large networks.

Relevance: 9 Novelty: 8
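
Illustrative sketch (a toy under stated assumptions, not the EGGROLL implementation): the snippet below shows the two mechanics described in the abstract for a single linear layer. First, the perturbed forward pass $(W + \sigma A B^\top)x$ can be computed as $Wx + \sigma A(B^\top x)$ without ever forming the $m\times n$ perturbation. Second, averaging fitness-weighted rank-$r$ perturbations over a population of $N$ members gives an update whose rank exceeds $r$. The quadratic fitness, the plain ES normalisation by $N\sigma$, and the sizes are assumptions; the sketch materialises $A B^\top$ only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, N, sigma = 16, 12, 2, 64, 0.1   # demo sizes (assumed)
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def fitness(out):
    """Toy objective (assumed): prefer small layer outputs."""
    return -float(np.sum(out ** 2))

update = np.zeros((m, n))
max_forward_gap = 0.0
for _ in range(N):
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    # Low-rank forward pass: the perturbation costs O(r(m + n)) extra work.
    out_lowrank = W @ x + sigma * (A @ (B.T @ x))
    # Dense reference: forms the full m x n perturbation (for checking only).
    out_dense = (W + sigma * (A @ B.T)) @ x
    max_forward_gap = max(max_forward_gap,
                          float(np.max(np.abs(out_lowrank - out_dense))))
    # Standard ES estimator: weight each perturbation by its fitness.
    update += fitness(out_lowrank) * (A @ B.T)
update /= N * sigma

update_rank = int(np.linalg.matrix_rank(update))
print(f"forward gap: {max_forward_gap:.2e}, update rank: {update_rank} (r = {r})")
```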

7. Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone

ArXiv ID: 2511.15927

Authors: Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Torsten Scholak

Abstract: Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.

Comment: Efficiency/Architecture: diffusion LM with bidirectional Mamba backbone for linear-time, high-throughput generation.

Relevance: 9 Novelty: 8

8. PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants

ArXiv ID: 2511.14852

Authors: Mingkun Yu, Heming Zhong, Dan Huang, Yutong Lu, Jiazhi Jiang

Abstract: Kolmogorov-Arnold Networks (KANs) promise higher expressive capability and stronger interpretability than Multi-Layer Perceptrons, particularly in the domain of AI for Science. However, practical adoption has been hindered by low GPU utilization of existing parallel implementations. To address this challenge, we present a GPU-accelerated operator library, named PolyKAN, which is the first general open-source implementation of KAN and its variants. PolyKAN fuses the forward and backward passes of polynomial KAN layers into a concise set of optimized CUDA kernels. Four orthogonal techniques underpin the design: (i) lookup-table with linear interpolation that replaces runtime expensive math-library functions; (ii) 2D tiling to expose thread-level parallelism while preserving memory locality; (iii) a two-stage reduction scheme converting scattered atomic updates into a single controllable merge step; and (iv) coefficient-layout reordering yielding unit-stride reads under the tiled schedule. Using a KAN variant, Chebyshev KAN, as a case study, PolyKAN delivers $1.2$--$10\times$ faster inference and $1.4$--$12\times$ faster training than a Triton + cuBLAS baseline, with identical accuracy on speech, audio-enhancement, and tabular-regression workloads on both high-end and consumer-grade GPUs.

Comment: High Performance Computing: fused CUDA operators and kernel-level optimizations for KAN variants to boost GPU utilization.

Relevance: 9 Novelty: 7
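
Illustrative sketch (CPU NumPy, not the PolyKAN kernels): technique (i) in the abstract replaces expensive math-library calls with a lookup table and linear interpolation. For Chebyshev polynomials $T_k(x)=\cos(k\arccos x)$ this can be mimicked by tabulating $T_k$ on a uniform grid over $[-1, 1]$ once and interpolating at run time. The table size, maximum degree, and function names are assumptions; the real library does this inside fused CUDA kernels, which the sketch does not attempt to reproduce.

```python
import numpy as np

def build_chebyshev_table(max_degree, table_size):
    """Tabulate T_0..T_max_degree on a uniform grid over [-1, 1]."""
    grid = np.linspace(-1.0, 1.0, table_size)
    theta = np.arccos(grid)
    degrees = np.arange(max_degree + 1)[:, None]
    table = np.cos(degrees * theta[None, :])   # shape: (max_degree + 1, table_size)
    return grid, table

def chebyshev_from_table(x, grid, table):
    """Evaluate every tabulated degree at x by linear interpolation."""
    x = np.clip(x, -1.0, 1.0)
    return np.stack([np.interp(x, grid, row) for row in table])

grid, table = build_chebyshev_table(max_degree=4, table_size=4096)
x = np.random.default_rng(0).uniform(-1.0, 1.0, size=1000)

approx = chebyshev_from_table(x, grid, table)
exact = np.cos(np.arange(5)[:, None] * np.arccos(x)[None, :])
max_abs_error = float(np.max(np.abs(approx - exact)))
print(f"max interpolation error: {max_abs_error:.2e}")
```

Linear interpolation error scales with the square of the grid spacing and with the second derivative of $T_k$, so higher degrees need a finer table for the same accuracy.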

9. Almost Sure Convergence Analysis of Differentially Private Stochastic Gradient Methods

ArXiv ID: 2511.16587

Authors: Amartya Mukherjee, Jun Liu

Abstract: Differentially private stochastic gradient descent (DP-SGD) has become the standard algorithm for training machine learning models with rigorous privacy guarantees. Despite its widespread use, the theoretical understanding of its long-run behavior remains limited: existing analyses typically establish convergence in expectation or with high probability, but do not address the almost sure convergence of single trajectories. In this work, we prove that DP-SGD converges almost surely under standard smoothness assumptions, both in nonconvex and strongly convex settings, provided the step sizes satisfy some standard decaying conditions. Our analysis extends to momentum variants such as the stochastic heavy ball (DP-SHB) and Nesterov's accelerated gradient (DP-NAG), where we show that careful energy constructions yield similar guarantees. These results provide stronger theoretical foundations for differentially private optimization and suggest that, despite privacy-induced distortions, the algorithm remains pathwise stable in both convex and nonconvex regimes.

Comment: Matches Training Dynamics/Optimization theory criterion: establishes almost sure convergence of DP-SGD and momentum variants under standard assumptions.

Relevance: 8 Novelty: 8
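
Illustrative sketch (standard DP-SGD on a toy problem, no privacy accounting): the abstract concerns single-trajectory convergence of DP-SGD under decaying step sizes. The snippet below runs the usual DP-SGD update (clip each per-sample gradient, add Gaussian noise, average) on a strongly convex least-squares problem with step size $\eta_t = 0.3/(t+1)^{0.6}$, which satisfies $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. The data, clipping norm, noise multiplier, and step-size constants are assumptions for the demo, and one run illustrates the behaviour without proving anything.

```python
import numpy as np

def dp_sgd(X, y, steps, batch_size, clip_norm, noise_multiplier, rng):
    """DP-SGD for least squares: per-sample clipping plus Gaussian noise."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        grads = residual[:, None] * X[idx]               # per-sample gradients
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip_norm)   # clip to clip_norm
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
        g = (grads.sum(axis=0) + noise) / batch_size     # noisy average gradient
        step_size = 0.3 / (t + 1) ** 0.6                 # decaying schedule (assumed)
        w = w - step_size * g
    return w

rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true                                           # noiseless labels (demo only)

w_hat = dp_sgd(X, y, steps=3000, batch_size=32,
               clip_norm=1.0, noise_multiplier=1.0, rng=rng)
initial_error = float(np.linalg.norm(w_true))            # error of the zero start
final_error = float(np.linalg.norm(w_hat - w_true))
print(f"initial error: {initial_error:.3f}, final error: {final_error:.4f}")
```

Noiseless labels are used so that the minimizer is also a fixed point of the clipped dynamics; with noisy labels, clipping generally biases the limit point.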

10. Walrus: A Cross-Domain Foundation Model for Continuum Dynamics

ArXiv ID: 2511.15684

Authors: Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze W. K. Wong, Hadi Sotoudeh, Alberto Bietti, Irina Espejo, Rio Fear, Siavash Golkar, Tom Hehir, Keiya Hirashima, Geraud Krawezik, Francois Lanusse, Rudy Morel, Ruben Ohana, Liam Parker, Mariel Pettee, Jeff Shen, Kyunghyun Cho, Miles Cranmer, Shirley Ho

Abstract: Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-term dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis-based stabilization method, load-balanced distributed 2D and 3D training strategies, and compute-adaptive tokenization. Using these tools, we develop Walrus, a transformer-based foundation model developed primarily for fluid-like continuum dynamics. Walrus is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that Walrus outperforms prior foundation models on both short and long term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches. Code and weights are released for community use.

Comment: High Performance Computing + Model Architecture: transformer-based foundation model for continuum dynamics with harmonic-analysis stabilization, load-balanced distributed 2D/3D training, and compute-adaptive tokenization.

Relevance: 8 Novelty: 8

11. Descend or Rewind? Stochastic Gradient Descent Unlearning

ArXiv ID: 2511.15983

Authors: Siqiao Mu, Diego Klabjan

Abstract: Machine unlearning algorithms aim to remove the impact of selected training data from a model without the computational expenses of retraining from scratch. Two such algorithms are "Descent-to-Delete" (D2D) and "Rewind-to-Delete" (R2D), full-batch gradient descent algorithms that are easy to implement and satisfy provable unlearning guarantees. In particular, the stochastic version of D2D is widely implemented as the "finetuning" unlearning baseline, despite lacking theoretical backing on nonconvex functions. In this work, we prove $(\epsilon, \delta)$ certified unlearning guarantees for stochastic R2D and D2D for strongly convex, convex, and nonconvex loss functions, by analyzing unlearning through the lens of disturbed or biased gradient systems, which may be contracting, semi-contracting, or expansive respectively. Our argument relies on optimally coupling the random behavior of the unlearning and retraining trajectories, resulting in a probabilistic sensitivity bound that can be combined with a novel relaxed Gaussian mechanism to achieve $(\epsilon, \delta)$ unlearning. We determine that D2D can yield tighter guarantees for strongly convex functions compared to R2D by relying on contraction to a unique global minimum. However, unlike D2D, R2D can achieve unlearning in the convex and nonconvex setting because it draws the unlearned model closer to the retrained model by reversing the accumulated disturbances.

Comment: Training dynamics/optimization: certified SGD unlearning guarantees for stochastic D2D/R2D across convex and nonconvex losses.

Relevance: 8 Novelty: 8

12. Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

ArXiv ID: 2511.16665

Authors: Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han

Abstract: The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.

Comment: Matches High Performance Computing criterion: systems-level acceleration of RL training via adaptive speculative decoding and a memory-efficient CUDA Graphs rollout engine.