Personalized Daily ArXiv Papers 2025-07-18

[gpt-4o]	Prompt	Completion	Total
Token	31853	3755	35608
Cost	$0.08	$0.04	$0.12

Total arXiv papers: 400

Total scanned papers: 239

Total relevant papers: 16

Table of contents with paper titles:

Compact Vision Transformer by Reduction of Kernel Complexity Authors: Yancheng Wang, Yingzhen Yang
Probabilistic Soundness Guarantees in LLM Reasoning Chains Authors: Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong
Training Transformers with Enforced Lipschitz Constants Authors: Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola
DASViT: Differentiable Architecture Search for Vision Transformer Authors: Pengjin Wu, Ferrante Neri, Zhenhua Feng
Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability Authors: Kaiqi Jiang, Jeremy Cohen, Yuanzhi Li
Reasoning-Finetuning Repurposes Latent Representations in Base Models Authors: Jake Ward, Chuqiao Lin, Constantin Venhoff, Neel Nanda
Finite-Dimensional Gaussian Approximation for Deep Neural Networks: Universality in Random Weights Authors: Krishnakumar Balasubramanian, Nathan Ross
Merge Kernel for Bayesian Optimization on Permutation Space Authors: Zikai Xie, Linjiang Chen
Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding Authors: Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel
Making Language Model a Hierarchical Classifier and Generator Authors: Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji
FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming Authors: Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, Amnon Shashua
Insights into a radiology-specialised multimodal large language model with sparse autoencoders Authors: Kenza Bouzid, Shruthi Bannur, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland
DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization Authors: Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy Authors: Yiting Yang, Hao Luo, Yuan Sun, Qingsen Yan, Haokui Zhang, Wei Dong, Guoqing Wang, Peng Wang, Yang Yang, Hengtao Shen
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training Authors: Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang
Unsupervised Ground Metric Learning Authors: Janis Auffenberg, Jonas Bresch, Oleh Melnyk, Gabriele Steidl

1. Compact Vision Transformer by Reduction of Kernel Complexity

ArXiv ID: 2507.12780

Authors: Yancheng Wang, Yingzhen Yang

Abstract: Self-attention and transformer architectures have become foundational components in modern deep learning. Recent efforts have integrated transformer blocks into compact neural architectures for computer vision, giving rise to various efficient vision transformers. In this work, we introduce Transformer with Kernel Complexity Reduction, or KCR-Transformer, a compact transformer block equipped with differentiable channel selection, guided by a novel and sharp theoretical generalization bound. KCR-Transformer performs input/output channel selection in the MLP layers of transformer blocks to reduce the computational cost. Furthermore, we provide a rigorous theoretical analysis establishing a tight generalization bound for networks equipped with KCR-Transformer blocks. Leveraging such strong theoretical results, the channel pruning by KCR-Transformer is conducted in a generalization-aware manner, ensuring that the resulting network retains a provably small generalization error. Our KCR-Transformer is compatible with many popular and compact transformer networks, such as ViT and Swin, and it reduces the FLOPs of the vision transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in the vision transformers with KCR-Transformer blocks, leading to KCR-Transformer networks with different backbones. The resulting KCR-Transformers achieve superior performance on various computer vision tasks, outperforming the original models with fewer FLOPs and parameters.

Comment: The paper introduces a compact vision transformer with kernel complexity reduction, which is relevant to model architecture and compression.

Relevance: 9 Novelty: 8

2. Probabilistic Soundness Guarantees in LLM Reasoning Chains

ArXiv ID: 2507.12948

Authors: Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong

Abstract: In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).

Comment: The paper introduces a novel probabilistic framework for error detection in LLM reasoning chains, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8
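
The inductive idea in the abstract above, scoring each claim using only premises already judged sound, can be illustrated with a small Monte Carlo sketch. This is not the authors' implementation: `soundness_scores`, `toy_entails`, and the sampling scheme are assumptions made for illustration, and the toy judge stands in for an LLM-based entailment model.

```python
import random

def soundness_scores(base_probs, claims, entails, n_samples=2000, seed=0):
    """Estimate a soundness probability for each derived claim, in order.

    base_probs: probability that each base premise is sound.
    claims: derived claims (opaque objects passed to `entails`).
    entails(kept, claim) -> probability in [0, 1] that the claim follows
        from the kept statements (hypothetical judge).
    """
    rng = random.Random(seed)
    n_base = len(base_probs)
    totals = [0.0] * len(claims)
    for _ in range(n_samples):
        # Sample which base premises count as sound in this draw.
        kept = {i for i, p in enumerate(base_probs) if rng.random() < p}
        for j, claim in enumerate(claims):
            # Judge the claim using only the statements kept so far.
            p = entails(kept, claim)
            totals[j] += p
            # The claim joins the premise pool only if sampled as sound.
            if rng.random() < p:
                kept.add(n_base + j)
    return [t / n_samples for t in totals]

def toy_entails(kept, claim):
    """Toy judge: a claim holds only if every statement it needs was kept."""
    return 1.0 if claim["needs"] <= kept else 0.0

# Two base premises; claim 0 needs both, claim 1 needs claim 0 (index 2).
claims = [{"needs": {0, 1}}, {"needs": {2}}]
scores = soundness_scores([0.9, 0.5], claims, toy_entails)
```

In this toy chain the second claim can never score higher than the first, because an unsound first claim is dropped from the premise pool before the second is judged.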

3. Training Transformers with Enforced Lipschitz Constants

ArXiv ID: 2507.13338

Authors: Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola

Abstract: Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods -- weight decay and spectral normalization -- allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.

Comment: The paper explores training Transformers with enforced Lipschitz constants, providing insights into model architecture and training dynamics.

Relevance: 9 Novelty: 8
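
For readers unfamiliar with spectral normalization, one of the standard methods named in the abstract above, the following is a minimal NumPy sketch. It is not the constraint method the authors co-design with Muon; the function names, layer shapes, and the `max_norm` parameter are hypothetical. It caps each weight matrix's spectral norm and multiplies the per-layer norms into a Lipschitz upper bound, which is valid for an MLP with 1-Lipschitz activations.

```python
import numpy as np

def spectral_norm(W, n_iters=100, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(W @ v))

def cap_spectral_norm(W, max_norm=1.0):
    """Rescale W so its estimated spectral norm is at most max_norm."""
    sigma = spectral_norm(W)
    return W if sigma <= max_norm else W * (max_norm / sigma)

def lipschitz_upper_bound(weights):
    """Product of spectral norms: an upper bound for an MLP whose
    activations are 1-Lipschitz (e.g. ReLU)."""
    return float(np.prod([spectral_norm(W) for W in weights]))

rng = np.random.default_rng(1)
layers = [rng.standard_normal((64, 32)), rng.standard_normal((16, 64))]
capped = [cap_spectral_norm(W, max_norm=1.0) for W in layers]
bound = lipschitz_upper_bound(capped)
```

Power iteration approaches the true spectral norm from below, so the cap here is approximate; a certified bound would need an exact or provably conservative norm computation.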

4. DASViT: Differentiable Architecture Search for Vision Transformer

ArXiv ID: 2507.13079

Authors: Pengjin Wu, Ferrante Neri, Zhenhua Feng

Abstract: Designing effective neural networks is a cornerstone of deep learning, and Neural Architecture Search (NAS) has emerged as a powerful tool for automating this process. Among the existing NAS approaches, Differentiable Architecture Search (DARTS) has gained prominence for its efficiency and ease of use, inspiring numerous advancements. Since the rise of Vision Transformers (ViT), researchers have applied NAS to explore ViT architectures, often focusing on macro-level search spaces and relying on discrete methods like evolutionary algorithms. While these methods ensure reliability, they face challenges in discovering innovative architectural designs, demand extensive computational resources, and are time-intensive. To address these limitations, we introduce Differentiable Architecture Search for Vision Transformer (DASViT), which bridges the gap in differentiable search for ViTs and uncovers novel designs. Experiments show that DASViT delivers architectures that break traditional Transformer encoder designs, outperform ViT-B/16 on multiple datasets, and achieve superior efficiency with fewer parameters and FLOPs.

Comment: The paper introduces DASViT, a differentiable architecture search for Vision Transformers, which aligns with model architecture innovations by exploring novel designs for ViTs.

Relevance: 9 Novelty: 8
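
The abstract above builds on DARTS, whose core idea is to relax a discrete choice among candidate operations into a softmax-weighted mixture that gradients can flow through. The sketch below shows that relaxation only; the candidate operations and names are hypothetical and say nothing about DASViT's actual search space.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mixed_op(x, alphas, ops):
    """DARTS-style relaxation: softmax-weighted sum of candidate operations."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

def discretize(alphas, names):
    """After search, keep the candidate with the largest architecture weight."""
    return names[int(np.argmax(alphas))]

# Hypothetical candidate set for one edge of a search space.
names = ["identity", "zero", "scale2"]
ops = [lambda x: x, lambda x: np.zeros_like(x), lambda x: 2.0 * x]

x = np.array([1.0, -2.0])
alphas = np.array([0.0, 0.0, 0.0])  # uniform weights before any search step
y = mixed_op(x, alphas, ops)         # (x + 0 + 2x) / 3 = x
chosen = discretize(np.array([0.1, -1.0, 2.0]), names)
```

In a full search the architecture weights `alphas` are trained by gradient descent alongside the network weights; here they are fixed by hand to keep the example short.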

5. Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability

ArXiv ID: 2507.12837

Authors: Kaiqi Jiang, Jeremy Cohen, Yuanzhi Li

Abstract: The study of Neural Tangent Kernels (NTKs) in deep learning has drawn increasing attention in recent years. NTKs typically actively change during training and are related to feature learning. In parallel, recent work on Gradient Descent (GD) has found a phenomenon called Edge of Stability (EoS), in which the largest eigenvalue of the NTK oscillates around a value inversely proportional to the step size. However, although follow-up works have explored the underlying mechanism of such eigenvalue behavior in depth, the understanding of the behavior of the NTK eigenvectors during EoS is still missing. This paper examines the dynamics of NTK eigenvectors during EoS in detail. Across different architectures, we observe that larger learning rates cause the leading eigenvectors of the final NTK, as well as the full NTK matrix, to have greater alignment with the training target. We then study the underlying mechanism of this phenomenon and provide a theoretical analysis for a two-layer linear network. Our study enhances the understanding of GD training dynamics in deep learning.

Comment: The paper provides insights into the training dynamics of neural networks by examining the behavior of Neural Tangent Kernels (NTKs) during the Edge of Stability (EoS). This aligns with the representation learning criterion as it enhances understanding of how deep networks encode information.

Relevance: 9 Novelty: 8
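
To make the quantities in the abstract above concrete, here is a sketch that builds the empirical NTK of a two-layer linear network f(x) = a^T W x and measures its alignment with a target vector. Kernel-target alignment is one common alignment measure and is an assumption here; the paper's exact metric, data, and initialization may differ.

```python
import numpy as np

def ntk_two_layer_linear(X, W, a):
    """Empirical NTK of f(x) = a^T W x on the rows of X.

    df/da = W x and df/dW = a x^T, so
    K(x, x') = (W x).(W x') + ||a||^2 (x . x').
    """
    H = X @ W.T  # hidden features, shape (n, m)
    return H @ H.T + float(a @ a) * (X @ X.T)

def kernel_target_alignment(K, y):
    """y^T K y / (||K||_F ||y||^2); equals 1 when K is proportional to y y^T."""
    return float(y @ K @ y) / (np.linalg.norm(K, "fro") * float(y @ y))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
W = rng.standard_normal((16, 5)) / np.sqrt(5)
a = rng.standard_normal(16) / np.sqrt(16)
y = rng.standard_normal(8)

K = ntk_two_layer_linear(X, W, a)
top_eigenvalue = float(np.linalg.eigvalsh(K)[-1])
alignment = kernel_target_alignment(K, y)
```

Tracking `top_eigenvalue` and `alignment` across training steps at different learning rates would be one way to reproduce the kind of comparison the abstract describes.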

6. Reasoning-Finetuning Repurposes Latent Representations in Base Models

ArXiv ID: 2507.12638

Authors: Jake Ward, Chuqiao Lin, Constantin Venhoff, Neel Nanda

Abstract: Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. Prior work has succeeded in manipulating this behavior via steering vectors, but the underlying mechanism remains poorly understood. In this work, we show that the emergence of backtracking in DeepSeek-R1-Distill-Llama-8B is in part driven by a repurposed direction already present in base model activations. Specifically, we identify a direction in base Llama-3.1-8B's residual stream which systematically induces backtracking when used to steer the distilled reasoning model, and find that the effects of steering with this direction cannot be trivially explained by token-level attributes. We further find that this direction does not induce backtracking in the base model, suggesting that the reasoning finetuning process repurposes pre-existing representations to form new behavioral circuits. Additionally, we hypothesize that this direction is one of several which may work together to mediate backtracking. Our findings offer a compelling picture that reasoning-finetuned models repurpose pre-existing base model representations, rather than learn new capabilities from scratch.

Comment: The paper explores how reasoning-finetuning repurposes latent representations in base models, which aligns with representation learning by providing insights into how deep networks encode information.

Relevance: 9 Novelty: 7
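
Steering with a direction, as described in the abstract above, amounts to adding a scaled vector to residual-stream activations. The sketch below uses random toy arrays; the shapes, the `alpha` value, and the random `direction` are assumptions, and it does not show how the authors find the direction or which layer they intervene on.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-norm direction to every position's activation."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def projection(hidden, direction):
    """Component of each activation along the direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 16))  # (positions, d_model), toy sizes
direction = rng.standard_normal(16)    # stand-in for a direction from the base model

steered = steer(hidden, direction, alpha=3.0)
shift = projection(steered, direction) - projection(hidden, direction)
```

Every position's projection onto the direction moves by exactly `alpha`, while components orthogonal to the direction are left unchanged.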

7. Finite-Dimensional Gaussian Approximation for Deep Neural Networks: Universality in Random Weights

ArXiv ID: 2507.12686

Authors: Krishnakumar Balasubramanian, Nathan Ross

Abstract: We study the Finite-Dimensional Distributions (FDDs) of deep neural networks with randomly initialized weights that have finite-order moments. Specifically, we establish Gaussian approximation bounds in the Wasserstein-$1$ norm between the FDDs and their Gaussian limit assuming a Lipschitz activation function and allowing the layer widths to grow to infinity at arbitrary relative rates. In the special case where all widths are proportional to a common scale parameter $n$ and there are $L-1$ hidden layers, we obtain convergence rates of order $n^{-({1}/{6})^{L-1} + \epsilon}$, for any $\epsilon > 0$.

Comment: The paper studies Gaussian approximation for deep neural networks with random weights, contributing to the understanding of training dynamics and representation learning.

Relevance: 9 Novelty: 7
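
As a worked reading of the rate quoted in the abstract above, substituting small depths into $n^{-({1}/{6})^{L-1} + \epsilon}$ shows how quickly the exponent shrinks as hidden layers are added:

```latex
\begin{align*}
L = 2 \ (\text{one hidden layer}):\quad & n^{-(1/6)^{1} + \epsilon} = n^{-1/6 + \epsilon}, \\
L = 3 \ (\text{two hidden layers}):\quad & n^{-(1/6)^{2} + \epsilon} = n^{-1/36 + \epsilon}, \\
L = 4 \ (\text{three hidden layers}):\quad & n^{-(1/6)^{3} + \epsilon} = n^{-1/216 + \epsilon}.
\end{align*}
```

Each additional hidden layer multiplies the exponent by $1/6$. Ignoring $\epsilon$ and constants, halving the bound requires widening by a factor of $2^{6} = 64$ when $L = 2$, and by $2^{36}$ when $L = 3$.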

8. Merge Kernel for Bayesian Optimization on Permutation Space

ArXiv ID: 2507.13263

Authors: Zikai Xie, Linjiang Chen

Abstract: Bayesian Optimization (BO) is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an $\Omega(n^2)$ representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation space based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the Merge Kernel, constructed from merge sort, which replaces the quadratic complexity with $\Theta(n\log n)$ to achieve the lowest possible complexity. The resulting feature vector is significantly shorter and can be computed in linearithmic time, yet still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. Results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation space.

Comment: The paper introduces a novel Merge Kernel for Bayesian Optimization, which is a foundational contribution to optimization methods.

Relevance: 8 Novelty: 8
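
The quadratic baseline mentioned in the abstract above can be made concrete with a short sketch of the Mallows kernel, written here as the exponential of the negative count of discordant pairs. The parameter name `lam` and the example permutations are hypothetical. The merge-sort-based feature map is not reproduced, because the abstract does not specify its construction.

```python
import math
from itertools import combinations

def pairwise_features(perm):
    """One +1/-1 entry per pair (i, j), i < j: n(n-1)/2 comparisons in total."""
    return [1 if perm[i] < perm[j] else -1
            for i, j in combinations(range(len(perm)), 2)]

def kendall_tau_distance(p, q):
    """Number of pairs ordered differently by p and q."""
    return sum(a != b for a, b in zip(pairwise_features(p), pairwise_features(q)))

def mallows_kernel(p, q, lam=0.5):
    """Mallows kernel: exp(-lam * number of discordant pairs)."""
    return math.exp(-lam * kendall_tau_distance(p, q))

p = [0, 1, 2, 3]
q = [1, 0, 2, 3]                        # one adjacent swap away from p
n_features = len(pairwise_features(p))  # 4 * 3 / 2 = 6
d = kendall_tau_distance(p, q)          # 1
k = mallows_kernel(p, q)                # exp(-0.5)
```

The feature vector has n(n-1)/2 entries, which is the quadratic cost that the abstract says the Merge Kernel reduces to $\Theta(n\log n)$.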

9. Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding

ArXiv ID: 2507.12482

Authors: Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel

Abstract: Large Language Models (LLMs) have advanced code generation and software automation, but are fundamentally constrained by limited inference-time context and lack of explicit code structure reasoning. We introduce Kodezi Chronos, a next-generation architecture for autonomous code understanding, debugging, and maintenance, designed to operate across ultra-long contexts comprising entire codebases, histories, and documentation, all without fixed window limits. Kodezi Chronos leverages a multi-level embedding memory engine, combining vector and graph-based indexing with continuous code-aware retrieval. This enables efficient and accurate reasoning over millions of lines of code, supporting repository-scale comprehension, multi-file refactoring, and real-time self-healing actions. Our evaluation introduces a novel Multi Random Retrieval benchmark, specifically tailored to the software engineering domain. Unlike classical retrieval benchmarks, this method requires the model to resolve arbitrarily distant and obfuscated associations across code artifacts, simulating realistic tasks such as variable tracing, dependency migration, and semantic bug localization. Chronos outperforms prior LLMs and code models, demonstrating a 23% improvement in real-world bug detection and reducing debugging cycles by up to 40% compared to traditional sequence-based approaches. By natively interfacing with IDEs and CI/CD workflows, Chronos enables seamless, autonomous software maintenance, elevating code reliability and productivity while reducing manual effort. These results mark a critical advance toward self-sustaining, continuously optimized software ecosystems.

Comment: The paper introduces a new architecture for code understanding and debugging, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

10. Making Language Model a Hierarchical Classifier and Generator

ArXiv ID: 2507.12930

Authors: Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji

Abstract: Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans' hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. Through thorough experiments, we validate that these selected intermediate layers can be adapted to produce meaningful and reasonable content, and that this hierarchical decoder paradigm can obtain state-of-the-art performance on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner, pretrained from scratch.

Comment: The paper proposes a hierarchical decoder architecture for language models, which aligns with the core topic of model architecture, specifically focusing on architectural innovations.