Personalized Daily ArXiv Papers 2025-10-20

[gpt-5]	Prompt	Completion	Total
Token	45503	40129	85632
Cost	$0.06	$0.4	$0.46

Total arXiv papers: 520

Total scanned papers: 302

Total relevant papers: 24

Table of contents with paper titles:

Language Models are Injective and Hence Invertible Authors: Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodola'
A simple mean field model of feature learning Authors: Niclas G\"oring, Chris Mingard, Yoonsoo Nam, Ard Louis
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning Authors: Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
ParaFormer: Shallow Parallel Transformers with Progressive Approximation Authors: Wei Wang, Xiao-Yong Wei, Qing Li
On the Neural Feature Ansatz for Deep Neural Networks Authors: Edward Tansley, Estelle Massart, Coralia Cartis
The Coverage Principle: How Pre-training Enables Post-Training Authors: Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J. Foster
Continual Learning via Sparse Memory Finetuning Authors: Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas O\u{g}uz
On Universality of Deep Equivariant Networks Authors: Marco Pacini, Mircea Petrache, Bruno Lepri, Shubhendu Trivedi, Robin Walters
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential Authors: Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen
Sequence Modeling with Spectral Mean Flows Authors: Jinwoo Kim, Max Beier, Petar Bevanda, Nayun Kim, Seunghoon Hong
Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning Authors: Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Learning to Answer from Correct Demonstrations Authors: Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma, Nathan Srebro
WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables Authors: Lino Gerlach, Liv V{\aa}ge, Thore Gerlach, Elliott Kauffman
A Split-Client Approach to Second-Order Optimization Authors: El Mahdi Chayti, Martin Jaggi
SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients Authors: Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi
GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen
Revisiting Knowledge Distillation: The Hidden Role of Dataset Size Authors: Giulia Lanzillotta, Felix Sarnthein, Gil Kur, Thomas Hofmann, Bobby He
Particle Dynamics for Latent-Variable Energy-Based Models Authors: Shiqin Tang, Shuxin Zhuang, Rong Feng, Runsheng Yu, Hongzong Li, Youzhi Zhang
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs Authors: Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons Authors: Yi-Shan Chu, Yueh-Cheng Kuo
AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport Authors: Lingkai Kong, Molei Tao, Yang Liu, Bryan Wang, Jinmiao Fu, Chien-Chih Wang, Huidong Liu
Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection Authors: Denis Janiak, Jakub Binkowski, Tomasz Kajdanowicz
Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity Authors: Naoki Yoshida, Satoshi Hayakawa, Yuhta Takida, Toshimitsu Uesaka, Hiromi Wakaki, Yuki Mitsufuji

1. Language Models are Injective and Hence Invertible

ArXiv ID: 2510.15511

Authors: Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodola'

Abstract: Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.

Comment: Representation Learning: proves injectivity/invertibility of transformer LMs and provides an exact input reconstruction algorithm.

Relevance: 10 Novelty: 9

2. A simple mean field model of feature learning

ArXiv ID: 2510.15174

Authors: Niclas G\"oring, Chris Mingard, Yoonsoo Nam, Ard Louis

Abstract: Feature learning (FL), where neural networks adapt their internal representations during training, remains poorly understood. Using methods from statistical physics, we derive a tractable, self-consistent mean-field (MF) theory for the Bayesian posterior of two-layer non-linear networks trained with stochastic gradient Langevin dynamics (SGLD). At infinite width, this theory reduces to kernel ridge regression, but at finite width it predicts a symmetry breaking phase transition where networks abruptly align with target functions. While the basic MF theory provides theoretical insight into the emergence of FL in the finite-width regime, semi-quantitatively predicting the onset of FL with noise or sample size, it substantially underestimates the improvements in generalisation after the transition. We trace this discrepancy to a key mechanism absent from the plain MF description: \textit{self-reinforcing input feature selection}. Incorporating this mechanism into the MF theory allows us to quantitatively match the learning curves of SGLD-trained networks and provides mechanistic insight into FL.

Comment: Representation Learning: mean-field theory of feature learning and phase transitions in finite-width networks.

Relevance: 9 Novelty: 9

3. Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

ArXiv ID: 2510.15262

Authors: Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu

Abstract: Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($\mu$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $\mu$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the $\mu$P learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2\propto \sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields \emph{zero-shot} transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $\mu$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.

Comment: Training dynamics/scaling laws: proposes a weight-decay scaling rule for AdamW extending μP beyond the near-init regime for width-robust hyperparameter transfer in Transformers.

Relevance: 9 Novelty: 8

4. ParaFormer: Shallow Parallel Transformers with Progressive Approximation

ArXiv ID: 2510.15425

Authors: Wei Wang, Xiao-Yong Wei, Qing Li

Abstract: The widespread 'deeper is better' philosophy has driven the creation of architectures like ResNet and Transformer, which achieve high performance by stacking numerous layers. However, increasing model depth comes with challenges such as longer training times, higher inference latency, and impracticality on resource-constrained devices. To address these issues, we propose ParaFormer, a shallow Transformer architecture designed for true parallelism in both structure and computation. By formulating standard Transformers as function approximators in closed-form, our theoretical analysis shows that their performance relies on inter-layer collaboration for progressive approximation, rather than depth itself. While deep Transformers enforce this collaboration through sequential designs, we demonstrate that such collaboration is not inherently tied to sequential structures. ParaFormer removes the sequential constraint by organizing layers into parallel branches, enforcing inter-layer collaboration algorithmically. Specifically, we implement progressive approximation, ensuring that each new branch further reduces the loss from preceding branches, enabling faster convergence. Extensive experiments validate ParaFormer's effectiveness, outperforming standard Transformers like ViT. Moreover, ParaFormer supports up to 15.07x model compression and facilitates model expansion for adaptive continuous learning. Experimental results on multi-GPU deployment demonstrate that ParaFormer is 3.30x faster than widely used parallelism solutions such as FairScale. These advancements stem from our closed-form formulation of Transformers based on the Universal Approximation Theorem, which not only explains the ``depth belief'' but also opens new avenues for designing efficient Transformer architectures. Source code: https://(open-upon-acceptance)

Comment: Strongly matches Model Architecture and Efficiency/HPC: shallow parallel Transformer with progressive approximation enabling compression and multi-GPU speedups.

Relevance: 9 Novelty: 8

5. On the Neural Feature Ansatz for Deep Neural Networks

ArXiv ID: 2510.15563

Authors: Edward Tansley, Estelle Massart, Coralia Cartis

Abstract: Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power $\alpha>0$ of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent $\alpha = 1/2$ (Radhakrishnan et al., 2024). We extend this result to networks with $L \geq 2$ layers, showing that the NFA holds with exponent $\alpha = 1/L$, thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit arbitrarily well the training data. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates and initialization schemes.

Comment: Matches Representation Learning: theoretical analysis of Neural Feature Ansatz and training dynamics across depth.

Relevance: 9 Novelty: 8

6. The Coverage Principle: How Pre-training Enables Post-Training

ArXiv ID: 2510.15020

Authors: Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J. Foster

Abstract: Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of \emph{coverage}, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of \emph{the coverage principle}, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: \emph{coverage generalizes faster than cross entropy}, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.

Comment: Matches Representation Learning/Training Dynamics: theory of coverage from next-token pretraining predicting downstream/post-training success with provable interventions.

Relevance: 9 Novelty: 8

7. Continual Learning via Sparse Memory Finetuning

ArXiv ID: 2510.15103

Authors: Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas O\u{g}uz

Abstract: Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model's existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.

Comment: Matches Model Compression and Efficiency (Sparsity) for continual learning via sparsely updated memory layers to reduce interference/forgetting.

Relevance: 9 Novelty: 8

8. On Universality of Deep Equivariant Networks

ArXiv ID: 2510.15814

Authors: Marco Pacini, Mircea Petrache, Bruno Lepri, Shubhendu Trivedi, Robin Walters

Abstract: Universality results for equivariant neural networks remain rare. Those that do exist typically hold only in restrictive settings: either they rely on regular or higher-order tensor representations, leading to impractically high-dimensional hidden spaces, or they target specialized architectures, often confined to the invariant setting. This work develops a more general account. For invariant networks, we establish a universality theorem under separation constraints, showing that the addition of a fully connected readout layer secures approximation within the class of separation-constrained continuous functions. For equivariant networks, where results are even scarcer, we demonstrate that standard separability notions are inadequate and introduce the sharper criterion of $\textit{entry-wise separability}$. We show that with sufficient depth or with the addition of appropriate readout layers, equivariant networks attain universality within the entry-wise separable regime. Together with prior results showing the failure of universality for shallow models, our findings identify depth and readout layers as a decisive mechanism for universality, additionally offering a unified perspective that subsumes and extends earlier specialized results.

Comment: Model Architecture: universality results for invariant/equivariant networks highlighting depth/readout as mechanisms.

Relevance: 9 Novelty: 8

9. Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

ArXiv ID: 2510.15304

Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding

Abstract: Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.

Comment: Matches Model Compression and Efficiency via structured pruning with concatenation-based layer merging and hierarchical distillation to retain capacity.

Relevance: 9 Novelty: 7

10. Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential

ArXiv ID: 2510.15216

Authors: Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen

Abstract: Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses ("if-then" rules) built from features extracted from the LLM's latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules' soundness levels, becoming highly distinct for "strict" versus "noisy" rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL's predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model's reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model's internal mechanisms for selecting/designing stronger base models.

Comment: Representation learning/training dynamics: uses cross-layer sparse autoencoders to extract latent rules and introduces SAL to quantify soundness-aware internal distributions predicting reasoning potential.

Relevance: 8 Novelty: 8

11. Sequence Modeling with Spectral Mean Flows

ArXiv ID: 2510.15366

Authors: Jinwoo Kim, Max Beier, Petar Bevanda, Nayun Kim, Seunghoon Hong

Abstract: A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets. Code is available at https://github.com/jw9730/spectral-mean-flow.

Comment: Matches Model Architecture and Representation Learning: operator-theoretic sequence model with spectral tensor-network decomposition and flow matching.

Relevance: 8 Novelty: 8

12. Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

ArXiv ID: 2510.15244

Authors: Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen

Abstract: Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM's embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM --> ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.

Comment: Matches Model Architecture via a hybrid discrete diffusion planner with an autoregressive executor, including latent-space interfacing to reduce tokens and improve reasoning.

Relevance: 8 Novelty: 8

13. Learning to Answer from Correct Demonstrations

ArXiv ID: 2510.15464

Authors: Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma, Nathan Srebro

Abstract: We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.

Comment: Matches foundational Training Objective design for learning from correct demonstrations beyond MLE, with sample complexity guarantees under a low-cardinality reward class.

Relevance: 8 Novelty: 8

14. WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables

ArXiv ID: 2510.15655

Authors: Lino Gerlach, Liv V{\aa}ge, Thore Gerlach, Elliott Kauffman

Abstract: Fast and efficient machine learning is of growing interest to the scientific community and has spurred significant research into novel model architectures and hardware-aware design. Recent hard? and software co-design approaches have demonstrated impressive results with entirely multiplication-free models. Differentiable Logic Gate Networks (DLGNs), for instance, provide a gradient-based framework for learning optimal combinations of low-level logic gates, setting state-of-the-art trade-offs between accuracy, resource usage, and latency. However, these models suffer from high computational cost during training and do not generalize well to logic blocks with more inputs. In this work, we introduce Walsh-Assisted Relaxation for Probabilistic Look-Up Tables (WARP-LUTs) - a novel gradient-based method that efficiently learns combinations of logic gates with substantially fewer trainable parameters. We demonstrate that WARP-LUTs achieve significantly faster convergence on CIFAR-10 compared to DLGNs, while maintaining comparable accuracy. Furthermore, our approach suggests potential for extension to higher-input logic blocks, motivating future research on extremely efficient deployment on modern FPGAs and its real-time science applications.

Comment: Model Architecture and Efficiency: multiplication-free probabilistic LUT networks with Walsh-assisted relaxation for fewer parameters and faster convergence.

Relevance: 8 Novelty: 8

15. A Split-Client Approach to Second-Order Optimization

ArXiv ID: 2510.15714

Authors: El Mahdi Chayti, Martin Jaggi

Abstract: Second-order methods promise faster convergence but are rarely used in practice because Hessian computations and decompositions are far more expensive than gradients. We propose a \emph{split-client} framework where gradients and curvature are computed asynchronously by separate clients. This abstraction captures realistic delays and inexact Hessian updates while avoiding the manual tuning required by Lazy Hessian methods. Focusing on cubic regularization, we show that our approach retains strong convergence guarantees and achieves a provable wall-clock speedup of order $\sqrt{\tau}$, where $\tau$ is the relative time needed to compute and decompose the Hessian compared to a gradient step. Since $\tau$ can be orders of magnitude larger than one in high-dimensional problems, this improvement is practically significant. Experiments on synthetic and real datasets confirm the theory: asynchronous curvature consistently outperforms vanilla and Lazy Hessian baselines, while maintaining second-order accuracy.

Comment: Matches High Performance Computing criterion: proposes an algorithmic/system-level asynchronous split-client scheme for second-order training with provable wall-clock speedups, enabling practical large-scale optimization.

Relevance: 8 Novelty: 8

16. SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients

ArXiv ID: 2510.15830

Authors: Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi

Abstract: The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory - the pseudo-gradient - that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, claiming to even outperform AdamW in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5 - 2.5$\times$ in a non-distributed setting up to a scale of 1e23 training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.

Comment: Optimization for large-scale training: a Lookahead variant applying Nesterov momentum to pseudo-gradients (SNOO) for compute-efficient training with minimal overhead.