Personalized Daily ArXiv Papers 2025-06-05

[gpt-4o]	Prompt	Completion	Total
Token	39730	4673	44403
Cost	$0.1	$0.05	$0.15

Total arXiv papers: 638

Total scanned papers: 340

Total relevant papers: 27

Table of contents with paper titles:

Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study Authors: Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen
Attention-Only Transformers via Unrolled Subspace Denoising Authors: Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma
CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor Authors: Han Ji, Yuqi Feng, Jiahao Fan, Yanan Sun
Models of Heavy-Tailed Mechanistic Universality Authors: Liam Hodgkinson, Zhichao Wang, Michael W. Mahoney
Efficient Knowledge Editing via Minimal Precomputation Authors: Akshat Gupta, Maochuan Lu, Thomas Hartvigsen, Gopala Anumanchipalli
Adaptive Task Vectors for Large Language Models Authors: Joonseong Kang, Soojeong Lee, Subeen Park, Sumin Park, Taero Kim, Jihee Kim, Ryunyi Lee, Kyungwoo Song
Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem Authors: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing Authors: Masaya Kawamura, Takuya Hasumi, Yuma Shirahata, Ryuichi Yamamoto
RhoDARTS: Differentiable Quantum Architecture Search with Density Matrix Simulations Authors: Swagat Kumar, Jan-Nico Zaech, Colin Michael Wilmott, Luc Van Gool
A Foundation Model for Spatial Proteomics Authors: Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, Cristina Almagro-Perez, Sophia J. Wagner, Sharifa Sahai, Ming Y. Lu, Richard J. Chen, Andrew Zhang, Mark Edward M. Gonzales, Ahmad Makky, Jia-Ying Joey Lee, Hao Cheng, Nourhan El Ahmar, Sayed Matar, Maximilian Haist, Darci Phillips, Yuqi Tan, Garry P. Nolan, W. Richard Burack, Jacob D. Estes, Jonathan T. C. Liu, Toni K Choueiri, Neeraj Agarwal, Marc Barry, Scott J. Rodig, Long Phi Le, Georg Gerber, Christian M. Schürch, Fabian J. Theis, Youn H Kim, Joe Yeong, Sabina Signoretti, Brooke E. Howitt, Lit-Hsin Loo, Qin Ma, Sizun Jiang, Faisal Mahmood
Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond Authors: Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, Kun Chen
Reason from Future: Reverse Thought Chain Enhances LLM Reasoning Authors: Yinlong Xu, Yanzhao Zheng, Shuoshuo Sun, Shuaihan Huang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Hongxia Xu, Jian Wu
Temporal horizons in forecasting: a performance-learnability trade-off Authors: Pau Vilimelis Aceituno, Jack William Miller, Noah Marti, Youssef Farag, Victor Boussange
Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner Authors: Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi
ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices Authors: Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, Huaisong Zhang, Chun Yuan
Learning equivariant models by discovering symmetries with learnable augmentations Authors: Eduardo Santos Escriche, Stefanie Jegelka
Bridging Neural ODE and ResNet: A Formal Error Bound for Safety Verification Authors: Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel
RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao
Multi-Exit Kolmogorov-Arnold Networks: enhancing accuracy and parsimony Authors: James Bagrow, Josh Bongard
The Future of Continual Learning in the Era of Foundation Models: Three Key Directions Authors: Jack Bell, Luigi Quarantiello, Eric Nuertey Coleman, Lanpei Li, Malio Li, Mauro Madeddu, Elia Piccoli, Vincenzo Lomonaco
Out-of-Vocabulary Sampling Boosts Speculative Decoding Authors: Nadav Timor, Jonathan Mamou, Oren Pereg, Hongyang Zhang, David Harel
Out-of-Distribution Graph Models Merging Authors: Yidi Wang, Jiawei Gu, pei Xiaobing, Xubin Zheng, Xiao Luo, Pengyang Wang, Ziyue Qiao
EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding Authors: Mingxu Tao, Jie Hu, Mingchuan Yang, Yunhuai Liu, Dongyan Zhao, Yansong Feng
Revisiting Unbiased Implicit Variational Inference Authors: Tobias Pielok, Bernd Bischl, David Rügamer
Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective Authors: Aojun Lu, Hangjie Yuan, Tao Feng, Yanan Sun
Guided Speculative Inference for Efficient Test-Time Alignment of LLMs Authors: Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis
Adapting Rule Representation With Four-Parameter Beta Distribution for Learning Classifier Systems Authors: Hiroki Shiraishi, Yohei Hayamizu, Tomonori Hashiyama, Keiki Takadama, Hisao Ishibuchi, Masaya Nakata

1. Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study

ArXiv ID: 2506.03931

Authors: Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen

Abstract: Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation)--a common testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
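
The Guess & Check procedure described in the abstract can be sketched on a toy matrix-completion problem. This is only a minimal illustration under assumptions that are not taken from the paper: a two-factor linear factorization, standard Gaussian weights, a 1/sqrt(width) scaling, and a fit tolerance `tol`. The names `random_product` and `guess_and_check` are hypothetical.

```python
import random

def random_product(rng, n_rows, n_cols, width):
    """Sample Gaussian factors A (n_rows x width) and B (width x n_cols); return A @ B / sqrt(width)."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_rows)]
    b = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(width)]
    scale = width ** -0.5  # assumed scaling: keeps product entries at unit variance
    return [[scale * sum(a[i][k] * b[k][j] for k in range(width))
             for j in range(n_cols)] for i in range(n_rows)]

def guess_and_check(observed, shape, width, tol=0.3, max_tries=200_000, seed=0):
    """Redraw the weights until every observed entry is fit to within `tol`."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        product = random_product(rng, *shape, width)
        if all(abs(product[i][j] - v) <= tol for (i, j), v in observed.items()):
            return product, attempt
    return None, max_tries  # no fit found within the draw budget

# Toy data (assumed): three observed entries of a 2x2 all-ones matrix.
observed = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0}
product, attempts = guess_and_check(observed, shape=(2, 2), width=2)
if product is not None:
    print(f"fit found after {attempts} draws; held-out entry = {product[1][1]:.3f}")
```

Comparing the held-out entry across different `width` values gives a rough feel for the question the paper studies, although a single draw on a 2x2 problem says nothing about its theorems.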

Comment: The paper provides theoretical insights into the generalization abilities of neural networks, which is relevant to representation learning.

Relevance: 9 Novelty: 8

2. Attention-Only Transformers via Unrolled Subspace Denoising

ArXiv ID: 2506.03790

Authors: Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma

Abstract: Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.

Comment: The paper proposes a fully interpretable transformer architecture using only self-attention operators, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8

3. CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor

ArXiv ID: 2506.04001

Authors: Han Ji, Yuqi Feng, Jiahao Fan, Yanan Sun

Abstract: Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). These predictors estimate the performance of unseen architectures by learning from the correlation between a small set of trained architectures and their performance. However, most existing predictors ignore the inherent distribution shift between limited training samples and diverse test samples. Hence, they tend to learn spurious correlations as shortcuts to predictions, leading to poor generalization. To address this, we propose a Causality-guided Architecture Representation Learning (CARL) method aiming to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Specifically, we employ a substructure extractor to split the input architecture into critical and redundant substructures in the latent space. Then, we generate multiple interventional samples by pairing critical representations with diverse redundant representations to prioritize critical features. Extensive experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL. For instance, CARL achieves 97.67% top-1 accuracy on CIFAR-10 using DARTS.

Comment: The paper proposes a causality-guided architecture representation learning method, which is relevant to representation learning and model architecture analysis.

Relevance: 9 Novelty: 8

4. Models of Heavy-Tailed Mechanistic Universality

ArXiv ID: 2506.03470

Authors: Liam Hodgkinson, Zhichao Wang, Michael W. Mahoney

Abstract: Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in Jacobians, Hessians, and weight matrices has led to the introduction of the concept of heavy-tailed mechanistic universality (HT-MU). Multiple lines of empirical evidence suggest a robust correlation between heavy-tailed metrics and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random matrix models -- the high-temperature Marchenko-Pastur (HTMP) ensemble -- to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power laws on (upper and lower) tails arise through a combination of three independent factors (complex correlation structures in the data; reduced temperatures during training; and reduced eigenvector entropy), appearing as an implicit bias in the model structure, and they can be controlled with an "eigenvalue repulsion" parameter. Implications of our model on other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed.

Comment: The paper introduces a model to explore heavy-tailed behavior in neural networks, which aligns with representation learning by providing insights into training dynamics and model behavior.

Relevance: 9 Novelty: 8

5. Efficient Knowledge Editing via Minimal Precomputation

ArXiv ID: 2506.04226

Authors: Akshat Gupta, Maochuan Lu, Thomas Hartvigsen, Gopala Anumanchipalli

Abstract: Knowledge editing methods like MEMIT are able to make data and compute efficient updates of factual knowledge by using a single sentence to update facts and their consequences. However, what is often overlooked is a "precomputation step", which requires a one-time but significant computational cost. The authors of MEMIT originally precompute approximately 44 million hidden vectors per edited layer, which requires a forward pass over 44 million tokens. For GPT-J (6B), this precomputation step takes 36 hours on a single GPU, while it takes approximately 40 hours for Llama2-7B. Additionally, this precomputation time grows with model size. In this paper, we show that this excessive computational cost is unnecessary. Knowledge editing using MEMIT and related methods, such as ROME and EMMET, can be performed by pre-computing a very small portion of the 44 million hidden vectors. We first present the theoretical minimum number of hidden vector precomputation required for solutions of these editing methods to exist. We then empirically show that knowledge editing using these methods can be done by pre-computing significantly fewer hidden vectors. Specifically, we show that the precomputation step can be done with less than 0.3% of the originally stipulated number of hidden vectors. This saves a significant amount of precomputation time and allows users to begin editing new models within a few minutes.

Comment: The paper provides theoretical insights into reducing precomputation in knowledge editing for LLMs, aligning with the LLMs criterion.

Relevance: 9 Novelty: 8

6. Adaptive Task Vectors for Large Language Models

ArXiv ID: 2506.03426

Authors: Joonseong Kang, Soojeong Lee, Subeen Park, Sumin Park, Taero Kim, Jihee Kim, Ryunyi Lee, Kyungwoo Song

Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks without parameter updates by conditioning on a few demonstrations provided in the prompt. Despite its success, ICL suffers from several limitations, including sensitivity to demonstration order, context length constraints, and computational inefficiency. To address these challenges, task vector-based approaches compress task information into a single vector. However, these methods typically construct task vectors from fixed sets of demonstrations and reuse them across input queries, without conditioning on the specific input. This limitation can lead models to struggle with effective adaptation when the input query is not well aligned with the underlying demonstrations, consequently degrading their generalization performance on unseen tasks. To overcome this limitation, we propose Adaptive Task Vectors (ATV), a simple and effective framework that dynamically generates task vectors conditioned on each input query. ATV employs a small language model to generate task vectors, which are then transformed to match the target LLM's architecture and applied to guide its output generation. In contrast to ICL and previous vector-based approaches, which rely on fixed demonstration sets and their corresponding vectors, ATV dynamically generates task vectors tailored to each specific input query and task. Consequently, ATV demonstrates strong performance and generalization capabilities, even for unseen tasks. Furthermore, we provide a theoretical analysis indicating that ATV is expressively equivalent to LoRA under equal rank budgets and more expressive than Prefix-Tuning, thereby offering formal support for its representational advantage.

Comment: The paper proposes Adaptive Task Vectors for LLMs, which is relevant to the Large Language Models criterion: it provides a theoretical analysis of expressiveness relative to LoRA and Prefix-Tuning and improves generalization to unseen tasks.

Relevance: 9 Novelty: 7

7. Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

ArXiv ID: 2506.03295

Authors: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen

Abstract: We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable. Even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to or even surpass the results from RL with 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.

Comment: The paper discusses critique fine-tuning to enhance reasoning in LLMs, which is relevant to understanding and improving LLM behavior, aligning with foundational research in LLMs.

Relevance: 9 Novelty: 7

8. BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing

ArXiv ID: 2506.03515

Authors: Masaya Kawamura, Takuya Hasumi, Yuma Shirahata, Ryuichi Yamamoto

Abstract: This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58-bit. In this case, most of the 32-bit model parameters are quantized to ternary values {-1, 0, 1}. Second, we propose a method named weight indexing. In this method, we save a group of 1.58-bit weights as a single int8 index. This allows for efficient storage of model parameters, even on hardware that treats values in units of 8-bit. Experimental results demonstrate that the proposed method achieved an 83% reduction in model size, while outperforming the baseline of similar model size without quantization in synthesis quality.
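
One way to picture the weight-indexing idea in the abstract is to treat a group of ternary weights as the digits of a base-3 number and store that number in one byte. The sketch below is an assumption-laden illustration, not the paper's scheme: the group size of 5, the digit order, the zero padding, and the use of an unsigned byte are all choices made here, and `pack_ternary` and `unpack_ternary` are hypothetical names.

```python
GROUP = 5  # assumed group size: 3**5 = 243 distinct patterns fit in one byte

def pack_ternary(weights):
    """Encode ternary weights {-1, 0, 1} as one base-3 index per group of GROUP."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be ternary")
    # Pad with zeros so the length is a multiple of GROUP.
    padded = list(weights) + [0] * (-len(weights) % GROUP)
    indices = bytearray()
    for start in range(0, len(padded), GROUP):
        index = 0
        # Reverse so the first weight of the group becomes the lowest digit.
        for w in reversed(padded[start:start + GROUP]):
            index = index * 3 + (w + 1)  # map -1, 0, 1 to digits 0, 1, 2
        indices.append(index)
    return bytes(indices)

def unpack_ternary(indices, count):
    """Decode the byte indices back into the first `count` ternary weights."""
    weights = []
    for index in indices:
        for _ in range(GROUP):
            index, digit = divmod(index, 3)
            weights.append(digit - 1)
    return weights[:count]

weights = [1, -1, 0, 0, 1, -1, 1]
packed = pack_ternary(weights)
restored = unpack_ternary(packed, len(weights))
print(len(packed), restored == weights)
```

With five weights per byte, storage comes to 8 / 5 = 1.6 bits per weight, close to the log2(3) ≈ 1.585 bits that a ternary value carries.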

Comment: The paper focuses on model compression through quantization and weight indexing, which aligns with the model compression criterion.

Relevance: 9 Novelty: 7

9. RhoDARTS: Differentiable Quantum Architecture Search with Density Matrix Simulations

ArXiv ID: 2506.03697

Authors: Swagat Kumar, Jan-Nico Zaech, Colin Michael Wilmott, Luc Van Gool

Abstract: Variational Quantum Algorithms (VQAs) are a promising approach for leveraging powerful Noisy Intermediate-Scale Quantum (NISQ) computers. When applied to machine learning tasks, VQAs give rise to NISQ-compatible Quantum Neural Networks (QNNs), which have been shown to outperform classical neural networks with a similar number of trainable parameters. While the quantum circuit structures of VQAs for physics simulations are determined by the physical properties of the systems, identifying effective QNN architectures for general machine learning tasks is a difficult challenge due to the lack of domain-specific priors. Indeed, existing Quantum Architecture Search (QAS) algorithms, adaptations of classical neural architecture search techniques, often overlook the inherent quantum nature of the circuits they produce. By approaching QAS from the ground up and from a quantum perspective, we resolve this limitation by proposing ρDARTS, a differentiable QAS algorithm that models the search process as the evolution of a quantum mixed state, emerging from the search space of quantum architectures. We validate our method by finding circuits for state initialization, Hamiltonian optimization, and image classification. Further, we demonstrate better convergence against existing QAS techniques and show improved robustness levels to noise.

Comment: The paper introduces a novel differentiable Quantum Architecture Search algorithm, which aligns with the Model Architecture criterion by proposing a new method for identifying effective quantum neural network architectures.

Relevance: 8 Novelty: 8

10. A Foundation Model for Spatial Proteomics

ArXiv ID: 2506.03373

Authors: Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, Cristina Almagro-Perez, Sophia J. Wagner, Sharifa Sahai, Ming Y. Lu, Richard J. Chen, Andrew Zhang, Mark Edward M. Gonzales, Ahmad Makky, Jia-Ying Joey Lee, Hao Cheng, Nourhan El Ahmar, Sayed Matar, Maximilian Haist, Darci Phillips, Yuqi Tan, Garry P. Nolan, W. Richard Burack, Jacob D. Estes, Jonathan T. C. Liu, Toni K Choueiri, Neeraj Agarwal, Marc Barry, Scott J. Rodig, Long Phi Le, Georg Gerber, Christian M. Schürch, Fabian J. Theis, Youn H Kim, Joe Yeong, Sabina Signoretti, Brooke E. Howitt, Lit-Hsin Loo, Qin Ma, Sizun Jiang, Faisal Mahmood

Abstract: Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.

Comment: The paper presents KRONOS, a self-supervised foundation model for spatial proteomics, which aligns with the AI for Science criterion by introducing segmentation-free, patch-level processing for spatial proteomics analysis.

Relevance: 8 Novelty: 8

11. Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond

ArXiv ID: 2506.03703

Authors: Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, Kun Chen

Abstract: Fundamental physics often confronts complex symbolic problems with few guiding exemplars or established principles. While artificial intelligence (AI) offers promise, its typical need for vast datasets to learn from hinders its use in these information-scarce frontiers. We introduce learning at criticality (LaC), a reinforcement learning (RL) scheme that tunes Large Language Models (LLMs) to a sharp learning transition, addressing this information scarcity. At this transition, LLMs achieve peak generalization from minimal data, exemplified by 7-digit base-7 addition -- a test of nontrivial arithmetic reasoning. To elucidate this peak, we analyze a minimal concept-network model (CoNet) designed to capture the essence of how LLMs might link tokens. Trained on a single exemplar, this model also undergoes a sharp learning transition. This transition exhibits hallmarks of a second-order phase transition, notably power-law distributed solution path lengths. At this critical point, the system maximizes a "critical thinking pattern" crucial for generalization, enabled by the underlying scale-free exploration. This suggests LLMs reach peak performance by operating at criticality, where such explorative dynamics enable the extraction of underlying operational rules. We demonstrate LaC in quantum field theory: an 8B-parameter LLM, tuned to its critical point by LaC using a few exemplars of symbolic Matsubara sums, solves unseen, higher-order problems, significantly outperforming far larger models. LaC thus leverages critical phenomena, a physical principle, to empower AI for complex, data-sparse challenges in fundamental physics.

Comment: The paper introduces a novel reinforcement learning scheme for LLMs in quantum field theory, which is relevant to foundational research in AI for science.

Relevance: 8 Novelty: 8

12. Reason from Future: Reverse Thought Chain Enhances LLM Reasoning

ArXiv ID: 2506.03673

Authors: Yinlong Xu, Yanzhao Zheng, Shuoshuo Sun, Shuaihan Huang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Hongxia Xu, Jian Wu

Abstract: It has been demonstrated that carefully designed reasoning paradigms, like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), can enhance the reasoning capabilities of small language models through detailed thinking and extensive thought searching. However, unbounded branching factors in the search space create prohibitive reasoning consumption. Moreover, these methods fall into the trap of local optimum reasoning, which means the model lacks a global perspective while solving problems. We propose a novel reasoning paradigm called Reason from Future (RFF), which generates reasoning paths by bidirectional reasoning that combines top-down planning with bottom-up reasoning accumulation. The essence of RFF lies in its reverse reasoning mechanism, which prioritizes core logical relationships and imposes goal-oriented constraints on intermediate steps, thereby reducing the search space and mitigating error accumulation inherent in sequential forward reasoning. Empirical evaluations across diverse experiments demonstrate that RFF outperforms conventional paradigms, achieving higher accuracy with a smaller search space on complex tasks.

Comment: The paper proposes a novel reasoning paradigm called Reason from Future, which enhances LLM reasoning and aligns with foundational research in LLM behavior and interpretability.

Relevance: 8 Novelty: 8

13. Temporal horizons in forecasting: a performance-learnability trade-off

ArXiv ID: 2506.03889

Authors: Pau Vilimelis Aceituno, Jack William Miller, Noah Marti, Youssef Farag, Victor Boussange

Abstract: When training autoregressive models for dynamical systems, a critical question arises: how far into the future should the model be trained to predict? Too short a horizon may miss long-term trends, while too long a horizon can impede convergence due to accumulating prediction errors. In this work, we formalize this trade-off by analyzing how the geometry of the loss landscape depends on the training horizon. We prove that for chaotic systems, the loss landscape's roughness grows exponentially with the training horizon, while for limit cycles, it grows linearly, making long-horizon training inherently challenging. However, we also show that models trained on long horizons generalize well to short-term forecasts, whereas those trained on short horizons suffer exponentially (resp. linearly) worse long-term predictions in chaotic (resp. periodic) systems. We validate our theory through numerical experiments and discuss practical implications for selecting training horizons. Our results provide a principled foundation for hyperparameter optimization in autoregressive forecasting models.
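
The difficulty with long horizons in chaotic systems can be seen in a toy setting: two rollouts of the logistic map that start almost identically drift apart rapidly as the horizon grows. This sketch only illustrates sensitive dependence on the initial state and does not reproduce the paper's loss-landscape analysis. The choice of map, the parameter `r = 4.0`, the starting point, and the perturbation size are all assumptions made here.

```python
def logistic_rollout(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def rollout_gap(x0, delta, steps):
    """Per-step distance between two rollouts whose initial states differ by `delta`."""
    a = logistic_rollout(x0, steps)
    b = logistic_rollout(x0 + delta, steps)
    return [abs(u - v) for u, v in zip(a, b)]

# Assumed toy values: start at 0.2 and perturb the start by 1e-8.
gaps = rollout_gap(x0=0.2, delta=1e-8, steps=20)
for horizon in (1, 5, 10, 20):
    print(horizon, gaps[horizon])
```

A model trained to match a 20-step rollout therefore sees errors that are orders of magnitude more sensitive to small changes than one trained on a 5-step rollout, which is the intuition behind the trade-off the abstract formalizes.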

Comment: The paper analyzes the trade-off in training horizons for autoregressive models, providing theoretical insights into model training dynamics, relevant to representation learning.

Relevance: 8 Novelty: 8

14. Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

ArXiv ID: 2506.03595

Authors: Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi

Abstract: The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly can eliminate the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.

Comment: The paper investigates heuristics in the Shampoo optimization algorithm, focusing on Kronecker-factorization-based training algorithms, which is relevant to model architecture and efficiency improvements.

Relevance: 8 Novelty: 8

15. ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

ArXiv ID: 2506.03737

Authors: Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, Huaisong Zhang, Chun Yuan

Abstract: The Transformer architecture has revolutionized various fields since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to lack of robustness and flexibility of position. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues, which integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with limited transformation space, constraining the model's capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, which is an essential condition that ensures consistent performance with position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available at https://github.com/Longin-Yu/ComRoPE
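
The offset-consistency property mentioned in the abstract is easy to check for standard RoPE with fixed, hand-chosen rotation frequencies, which is the baseline that ComRoPE generalizes. The sketch below is not ComRoPE itself: it has no trainable angle matrices, and the frequencies, vectors, and function names are illustrative assumptions.

```python
import math

def rotate(vec, position, freqs):
    """Standard RoPE: rotate each consecutive pair of `vec` by the angle position * freq."""
    out = []
    for pair, freq in enumerate(freqs):
        x, y = vec[2 * pair], vec[2 * pair + 1]
        c, s = math.cos(position * freq), math.sin(position * freq)
        out.extend([c * x - s * y, s * x + c * y])
    return out

def score(q, k, q_pos, k_pos, freqs):
    """Attention logit between a rotated query and a rotated key."""
    rq, rk = rotate(q, q_pos, freqs), rotate(k, k_pos, freqs)
    return sum(a * b for a, b in zip(rq, rk))

# Assumed toy inputs: 4-dimensional vectors and two fixed frequencies.
q = [1.0, 2.0, -0.5, 0.3]
k = [0.7, -1.2, 0.4, 2.0]
freqs = [1.0, 0.01]

base = score(q, k, 3, 7, freqs)
shifted = score(q, k, 103, 107, freqs)  # both positions moved by the same offset
print(abs(base - shifted) < 1e-9)
```

The two logits agree because 2D rotations commute, so the score depends only on the position difference. The abstract's point is that this same commutativity requirement has to hold once the rotation angles become trainable matrices.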

Comment: The paper proposes a new method for positional encoding in Transformers, which is relevant to model architecture innovations.