Personalized Daily ArXiv Papers 2026-02-06

[gpt-5]	Prompt	Completion	Total
Token	81235	65391	146626
Cost	$0.1	$0.65	$0.76

Total arXiv papers: 716

Total scanned papers: 421

Total relevant papers: 63

Table of contents with paper titles:

SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel Authors: Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski
ZeroS: Zero-Sum Linear Attention for Efficient Transformers Authors: Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang
LoRDO: Distributed Low-Rank Optimization with Infrequent Communication Authors: Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane
CoSA: Compressed Sensing-Based Adaptation of Large Language Models Authors: Songtao Wei, Yi Li, Bohan Zhang, Zhichun Guo, Ying Huang, Yuede Ji, Miao Yin, Guanpeng Li, Bingzhe Li
Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs Authors: Wentao Ni, Kangqi Zhang, Zhongming Yu, Oren Nelson, Mingu Lee, Hong Cai, Fatih Porikli, Jongryool Kim, Zhijian Liu, Jishen Zhao
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers Authors: Boxiang Zhang, Baijian Yang
Semantic Rate Distortion and Posterior Design: Compute Constraints, Multimodality, and Strategic Inference Authors: Emrah Akyol
SpecMD: A Comprehensive Study On Speculative Expert Prefetching Authors: Duc Hoang, Ajay Jaiswal, Mohammad Samragh, Minsik Cho
Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization Authors: Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed
Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models Authors: Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma
From Dead Neurons to Deep Approximators: Deep Bernstein Networks as a Provable Alternative to Residual Layers Authors: Ibrahim Albool, Malak Gamal El-Din, Salma Elmalaki, Yasser Shoukry
On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature Authors: Yikuan Zhang, Ning Yang, Yuhai Tu
Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model Authors: Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin
GeoIB: Geometry-Aware Information Bottleneck via Statistical-Manifold Compression Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu
Optimal scaling laws in learning hierarchical multi-index models Authors: Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard
CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs Authors: Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang
CSRv2: Unlocking Ultra-Sparse Embeddings Authors: Lixuan Guo, Yifei Wang, Tiansheng Wen, Yifan Wang, Aosong Feng, Bo Chen, Stefanie Jegelka, Chenyu You
Fluid Representations in Reasoning Models Authors: Dmitrii Kharlapenko, Alessandro Stolfo, Arthur Conmy, Mrinmaya Sachan, Zhijing Jin
Path-Guided Flow Matching for Dataset Distillation Authors: Xuhui Li, Zhengquan Luo, Xiwei Liu, Yongqiang Yu, Zhiqiang Xu
Orthogonal Model Merging Authors: Sihan Yang, Kexuan Shi, Weiyang Liu
Pseudo-Invertible Neural Networks Authors: Yamit Ehrlich, Nimrod Berman, Assaf Shocher
Orthogonal Self-Attention Authors: Leo Zhang, James Martens
Inverse Depth Scaling From Most Layers Being Similar Authors: Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
Price of universality in vector quantization is at most 0.11 bit Authors: Alina Harbuzova, Or Ordentlich, Yury Polyanskiy
A logical re-conception of neural networks: Hamiltonian bitwise part-whole architecture Authors: E Bowen, R Granger, A Rodriguez
Learning Compact Boolean Networks Authors: Shengpu Wang, Yuhao Mao, Yani Zhang, Martin Vechev
Regularized Calibration with Successive Rounding for Post-Training Quantization Authors: Seohyeon Cha, Huancheng Chen, Dongjun Kim, Haoran Zhang, Kevin Chan, Gustavo de Veciana, Haris Vikalo
Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning Authors: Nicholas Barnfield, Subhabrata Sen, Pragya Sur
Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences Authors: Siquan Li, Yao Tong, Haonan Wang, Tianyang Hu
TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation Authors: Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
Topology-Aware Revival for Efficient Sparse Training Authors: Meiling Jin, Fei Wang, Xiaoyun Yuan, Chen Qian, Yuan Cheng
Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog Authors: Yiran Zhao, Shengyang Zhou, Zijian Wu, Tongyan Hu, Yuhui Xu, Rengan Dou, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Michael Qizhe Shieh
Mechanisms of AI Protein Folding in ESMFold Authors: Kevin Lu, Jannik Brinkmann, Stefan Huber, Aaron Mueller, Yonatan Belinkov, David Bau, Chris Wendler
Multi-Token Prediction via Self-Distillation Authors: John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein
Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models Authors: Shahar Haim, Daniel C McNamee
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models Authors: Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
Breaking Symmetry Bottlenecks in GNN Readouts Authors: Mouad Talhi, Arne Wolf, Anthea Monod
Logarithmic-time Schedules for Scaling Language Models with Momentum Authors: Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette
Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance Authors: Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders Authors: Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou
Subliminal Effects in Your Data: A General Mechanism via Log-Linearity Authors: Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, Nika Haghtalab
Limitations of SGD for Multi-Index Models Beyond Statistical Queries Authors: Daniel Barzilai, Ohad Shamir
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration Authors: Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps Authors: Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz
Improving Set Function Approximation with Quasi-Arithmetic Neural Networks Authors: Tomas Tokar, Scott Sanner
Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers Authors: Artem Riabinin, Andrey Veprikov, Arman Bolatov, Martin Takáč, Aleksandr Beznosikov
When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging Authors: Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi
TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference Authors: Jiyoung Park, Hankyu Jang, Changseok Song, Wookeun Jung
Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention Authors: Sagie Dekel, Moshe Tennenholtz, Oren Kurland
Joint Embedding Variational Bayes Authors: Amin Oji, Paul Fieguth
Billion-Scale Graph Foundation Models Authors: Maya Bechler-Speicher, Yoel Gottlieb, Andrey Isakov, David Abensur, Ami Tavory, Daniel Haimovich, Ido Guy, Udi Weinsberg
Shared LoRA Subspaces for almost Strict Continual Learning Authors: Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Rama Chellappa, Alan Yuille
Smoothness Errors in Dynamics Models and How to Avoid Them Authors: Edward Berman, Luisa Li, Jung Yeon Park, Robin Walters
Rational ANOVA Networks Authors: Jusheng Zhang, Ningyuan Liu, Qinhan Lyu, Jing Yang, Keze Wang
Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers Authors: Jingkai Huang, Will Ma, Zhengyuan Zhou
Phaedra: Learning High-Fidelity Discrete Tokenization for the Physical Science Authors: Levi Lingsch, Georgios Kissas, Johannes Jakubik, Siddhartha Mishra
Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better Authors: Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie
Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias Authors: Ojasva Nema, Kaustubh Sharma, Aditya Chauhan, Parikshit Pareek
Refine and Purify: Orthogonal Basis Optimization with Null-Space Denoising for Conditional Representation Learning Authors: Jiaquan Wang, Yan Lyu, Chen Li, Yuheng Jia
Disentangled Representation Learning via Flow Matching Authors: Jinjin Chi, Taoping Liu, Mengtao Yin, Ximing Li, Yongcheng Jing, Dacheng Tao
How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs Authors: Emily Dent, Jared Tanner
Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration Authors: Sudipto Ghosh, Sujoy Nath, Sunny Manchanda, Tanmoy Chakraborty

1. SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

ArXiv ID: 2602.04915

Authors: Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski

Abstract: We propose a new class of linear-time attention mechanisms based on a relaxed and computationally efficient formulation of the recently introduced E-Product, often referred to as the Yat-kernel (Bouhsine, 2025). The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics. Our method, Spherical Linearized Attention with Yat Kernels (SLAY), constrains queries and keys to the unit sphere so that attention depends only on angular alignment. Using Bernstein's theorem, we express the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels and derive a strictly positive random-feature approximation enabling linear-time O(L) attention. We establish positive definiteness and boundedness on the sphere and show that the estimator yields well-defined, nonnegative attention scores. Empirically, SLAY achieves performance that is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and consistently outperforms prior linear-time attention mechanisms such as Performers and Cosformers. To the best of our knowledge, SLAY represents the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the typical performance trade-offs of attention linearization.

Comment: Model Architecture/Efficiency: linear-time spherical attention (SLAY) with positive random features approximating softmax closely.

Relevance: 10 Novelty: 9
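
To make the "linear-time via positive random features" idea concrete, here is a minimal sketch of generic kernelized attention. It is not the SLAY estimator: the feature map below is the familiar Performer-style positive feature for the softmax kernel, swapped in purely for illustration, and every name in it (`positive_features`, `omegas`, the toy sizes) is an assumption. Queries and keys are normalized to the unit sphere only to mirror the constraint described in the abstract. The point it shows is that the same positive feature map gives identical outputs whether scores are formed pairwise in O(L^2) or accumulated once in O(L).

```python
import math
import random


def positive_features(x, omegas):
    """Strictly positive random features: exp(w . x - |x|^2 / 2) / sqrt(m)."""
    sq_norm = sum(v * v for v in x)
    m = len(omegas)
    return [
        math.exp(sum(w * v for w, v in zip(omega, x)) - sq_norm / 2) / math.sqrt(m)
        for omega in omegas
    ]


def quadratic_attention(Q, K, V, omegas):
    """Reference O(L^2) version: build every score phi(q) . phi(k) explicitly."""
    fq = [positive_features(q, omegas) for q in Q]
    fk = [positive_features(k, omegas) for k in K]
    dv = len(V[0])
    out = []
    for a in fq:
        scores = [sum(x * y for x, y in zip(a, b)) for b in fk]
        z = sum(scores)  # positive because every feature is positive
        out.append([sum(s * v[d] for s, v in zip(scores, V)) / z for d in range(dv)])
    return out


def linear_attention(Q, K, V, omegas):
    """O(L) version: accumulate S = sum_k phi(k) v_k^T and z = sum_k phi(k) once."""
    m, dv = len(omegas), len(V[0])
    S = [[0.0] * dv for _ in range(m)]
    z = [0.0] * m
    for k, v in zip(K, V):
        fk = positive_features(k, omegas)
        for j in range(m):
            z[j] += fk[j]
            for d in range(dv):
                S[j][d] += fk[j] * v[d]
    out = []
    for q in Q:
        fq = positive_features(q, omegas)
        denom = sum(a * b for a, b in zip(fq, z))
        out.append([sum(fq[j] * S[j][d] for j in range(m)) / denom for d in range(dv)])
    return out


def unit(x):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


rng = random.Random(0)
L, d, m, dv = 6, 4, 8, 3  # toy sizes, chosen arbitrarily
Q = [unit([rng.gauss(0, 1) for _ in range(d)]) for _ in range(L)]
K = [unit([rng.gauss(0, 1) for _ in range(d)]) for _ in range(L)]
V = [[rng.gauss(0, 1) for _ in range(dv)] for _ in range(L)]
omegas = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]

slow = quadratic_attention(Q, K, V, omegas)
fast = linear_attention(Q, K, V, omegas)
max_gap = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print(max_gap)
```

The two functions differ only in the order of summation, which is why the linear version never materializes the L-by-L score matrix.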

2. ZeroS: Zero-Sum Linear Attention for Efficient Transformers

ArXiv ID: 2602.05230

Authors: Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang

Abstract: Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.

Comment: Introduces Zero-Sum Linear Attention achieving O(N) complexity with contrastive capabilities via zero-sum residuals; core Transformer efficiency/attention innovation.

Relevance: 10 Novelty: 9
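
The zero-sum property mentioned in the abstract can be checked on a toy example. The sketch below only subtracts the uniform term 1/t from ordinary softmax weights; the reweighting of the residuals and the O(N) formulation are not specified in the abstract and are left out, so this should be read as an illustration of the starting point, not as ZeroS itself. The function names and the sample scores are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_sum_residuals(scores):
    """Softmax weights with the constant 1/t term removed (t = context length)."""
    t = len(scores)
    return [w - 1.0 / t for w in softmax(scores)]


scores = [2.0, 0.5, -1.0, 0.0]  # arbitrary example scores
weights = softmax(scores)
residuals = zero_sum_residuals(scores)

print(sum(weights))    # convex combination: weights sum to 1
print(sum(residuals))  # residuals sum to (numerically) 0
print(residuals)       # mixed signs: some tokens are subtracted, not added
```

Because the residuals carry both signs, a weighted sum of values built from them can express differences between tokens, which a convex combination cannot.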

3. LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

ArXiv ID: 2602.04396

Authors: Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

Abstract: Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.

Comment: High-Performance/Distributed Training: unifies low-rank optimization with infrequent communication, restoring subspace exploration and cutting communication in DDP for foundation models.

Relevance: 10 Novelty: 8

4. CoSA: Compressed Sensing-Based Adaptation of Large Language Models

ArXiv ID: 2602.05148

Authors: Songtao Wei, Yi Li, Bohan Zhang, Zhichun Guo, Ying Huang, Yuede Ji, Miao Yin, Guanpeng Li, Bingzhe Li

Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a practical paradigm for adapting large language models (LLMs) without updating all parameters. Most existing approaches, such as LoRA and PiSSA, rely on low-rank decompositions of weight updates. However, the low-rank assumption may restrict expressivity, particularly in task-specific adaptation scenarios where singular values are distributed relatively uniformly. To address this limitation, we propose CoSA (Compressed Sensing-Based Adaptation), a new PEFT method extended from compressed sensing theory. Instead of constraining weight updates to a low-rank subspace, CoSA expresses them through fixed random projection matrices and a compact learnable core. We provide a formal theoretical analysis of CoSA as a synthesis process, proving that weight updates can be compactly encoded into a low-dimensional space and mapped back through random projections. Extensive experimental results show that CoSA provides a principled perspective for efficient and expressive multi-scale model adaptation. Specifically, we evaluate CoSA on 10 diverse tasks, including natural language understanding and generation, employing 5 models of different scales from RoBERTa, Llama, and Qwen families. Across these settings, CoSA consistently matches or outperforms state-of-the-art PEFT methods.

Comment: PEFT/Compression: compressed sensing-based adaptation replaces low-rank updates with random projections plus compact core, improving expressivity under parameter efficiency constraints.

Relevance: 10 Novelty: 8
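
One way to picture "fixed random projection matrices and a compact learnable core" is the factorization sketched below, where the weight update is A * core * B with A and B frozen. The exact shapes, the Gaussian sampling, and the zero initialization of the core are assumptions made for illustration; the abstract does not specify them, and the names `A`, `B`, `core`, and `delta_w` are hypothetical.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def random_matrix(rows, cols, rng):
    """Gaussian random matrix (assumed distribution, for illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_out, d_in, a, b = 8, 6, 3, 2  # hypothetical layer and core sizes

A = random_matrix(d_out, a, rng)       # fixed random projection, never trained
B = random_matrix(b, d_in, rng)        # fixed random projection, never trained
core = [[0.0] * b for _ in range(a)]   # the only trainable parameters (assumed zero init)


def delta_w(core):
    """Map the compact core back to a full-size weight update."""
    return matmul(matmul(A, core), B)


trainable = a * b
full = d_out * d_in
print(trainable, full)  # trainable parameters vs. size of the dense update
```

In this sketch only the a-by-b core would receive gradients, while the update it induces still has the full d_out-by-d_in shape.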

5. Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs

ArXiv ID: 2602.05191

Authors: Wentao Ni, Kangqi Zhang, Zhongming Yu, Oren Nelson, Mingu Lee, Hong Cai, Fatih Porikli, Jongryool Kim, Zhijian Liu, Jishen Zhao

Abstract: As long-context inference becomes central to large language models (LLMs), attention over growing key-value caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves near-zero accuracy drop while reducing attention computation overhead by up to 1.8x and delivering up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.

Comment: Compression/Efficiency: hierarchical top-p sparse attention optimizing selection cost and attention compute for long-context LLMs.

Relevance: 10 Novelty: 8
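
The contrast between fixed-budget top-k and mass-preserving top-p selection can be seen in a few lines. This is a sketch of plain single-level top-p selection over already-normalized attention weights, not the hierarchical cluster-then-token scheme of Double-P; the function name and the two example distributions are assumptions.

```python
def top_p_indices(weights, p):
    """Smallest set of positions (by descending weight) whose total mass reaches p."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return kept


peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # one dominant token
flat = [0.2, 0.2, 0.2, 0.2, 0.2]         # attention spread evenly

print(top_p_indices(peaked, 0.85))  # a single token already covers the mass
print(top_p_indices(flat, 0.85))    # all five tokens are needed
```

A fixed k would either waste compute on the peaked head or drop mass on the flat one, which is the adaptivity argument the abstract makes for top-p.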

6. CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers

ArXiv ID: 2602.05243

Authors: Boxiang Zhang, Baijian Yang

Abstract: Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose CORP, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.

Comment: Closed-form one-shot structured pruning for ViTs with representation-preserving compensation using unlabeled calibration; directly targets structured pruning/efficiency.

Relevance: 10 Novelty: 8

7. Semantic Rate Distortion and Posterior Design: Compute Constraints, Multimodality, and Strategic Inference

ArXiv ID: 2602.03949

Authors: Emrah Akyol

Abstract: We study strategic Gaussian semantic compression under rate and compute constraints, where an encoder and decoder optimize distinct quadratic objectives. A latent Gaussian state generates a task dependent semantic variable, and the decoder best responds via MMSE estimation, reducing the encoder's problem to posterior covariance design under an information rate constraint. We characterize the strategic rate distortion function in direct, remote, and full information regimes, derive semantic waterfilling and rate constrained Gaussian persuasion solutions, and establish Gaussian optimality under misaligned objectives. We further show that architectural compute limits act as implicit rate constraints, yielding exponential improvements in semantic accuracy with model depth and inference time compute, while multimodal observation eliminates the geometric mean penalty inherent to remote encoding. These results provide information theoretic foundations for data and energy efficient AI and offer a principled interpretation of modern multimodal language models as posterior design mechanisms under resource constraints.

Comment: Compression/Efficiency and Representation Learning: semantic rate–compute tradeoffs with strategic posterior design, showing compute as implicit rate and multimodal benefits.

Relevance: 9 Novelty: 9

8. SpecMD: A Comprehensive Study On Speculative Expert Prefetching

ArXiv ID: 2602.03921

Authors: Duc Hoang, Ajay Jaiswal, Mohammad Samragh, Minsik Cho

Abstract: Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these various caching policies interact with each other and with different hardware specifications remains poorly understood. To address this gap, we develop SpecMD, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access is not consistent with temporal locality assumptions (e.g., LRU, LFU). Motivated by this observation, we propose Least-Stale, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to $85\times$ over LRU. With such gains, we achieve over $88\%$ hit rates with up to $34.7\%$ Time-to-first-token (TTFT) reduction on OLMoE at only $5\%$ or $0.6GB$ of VRAM cache capacity.

Comment: Model Architecture and Efficiency (MoE): standardized benchmarking for MoE expert caching and a novel eviction policy tailored to expert access patterns, improving TTFT and hit rates.

Relevance: 10 Novelty: 7
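
For readers unfamiliar with the baseline, the sketch below simulates an LRU expert cache and reports its hit rate. It covers only the LRU policy that the abstract compares against; Least-Stale is not described in enough detail here to reproduce, so it is not attempted. The cyclic access trace is synthetic and chosen to show how a predictable, non-local pattern can defeat LRU, and the function name is an assumption.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of expert accesses served from an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for expert in accesses:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = True
    return hits / len(accesses)


cyclic = [0, 1, 2] * 4  # synthetic trace: three experts visited in a fixed cycle

print(lru_hit_rate(cyclic, capacity=2))  # LRU always evicts the expert needed next
print(lru_hit_rate(cyclic, capacity=3))  # everything fits after the first pass
```

With a cache one slot too small, LRU misses on every access of this trace even though the pattern is fully predictable, which is the kind of behavior that motivates a policy built around access patterns instead of recency.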

9. Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

ArXiv ID: 2602.05657

Authors: Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed

Abstract: The study of tail behaviour of SGD-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{\beta_p}/\log(t)}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight, up to poly-logarithmic factors. Notably, our results demonstrate an order of magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
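
As a reminder of what the clipped update looks like, here is a minimal sketch of one clipped-SGD step using the standard norm-clipping operator, which rescales a gradient whose norm exceeds a threshold. The abstract does not state which clipping variant or step-size schedule the analysis assumes, so the operator, the names `clip` and `clipped_sgd_step`, and the numbers below are assumptions.

```python
import math


def clip(grad, tau):
    """Rescale grad so its Euclidean norm is at most tau; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [scale * g for g in grad]


def clipped_sgd_step(x, grad, lr, tau):
    """One update x <- x - lr * clip(grad, tau)."""
    return [xi - lr * gi for xi, gi in zip(x, clip(grad, tau))]


heavy = [3.0, 4.0]  # norm 5: an outlier gradient, as under heavy-tailed noise
small = [0.3, 0.4]  # norm 0.5: already inside the threshold

print(clip(heavy, tau=1.0))  # shrunk to unit norm
print(clip(small, tau=1.0))  # left as is
print(clipped_sgd_step([0.0, 0.0], heavy, lr=0.5, tau=1.0))
```

Clipping bounds the size of every step regardless of how large a single stochastic gradient is, which is what allows guarantees under noise with only a bounded moment of order p.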

Comment: Optimization/Training Dynamics: tight long-term tail decay analysis for (clipped) SGD in non-convex settings via large deviations, offering stronger run-level guarantees.