Personalized Daily ArXiv Papers 2025-10-20
| [gpt-5] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 45503 | 40129 | 85632 |
| Cost | $0.06 | $0.4 | $0.46 |
Total arXiv papers: 520
Total scanned papers: 302
Total relevant papers: 24
Table of contents with paper titles:
-
Language Models are Injective and Hence Invertible Authors: Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodola'
-
A simple mean field model of feature learning Authors: Niclas G\"oring, Chris Mingard, Yoonsoo Nam, Ard Louis
-
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning Authors: Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
-
ParaFormer: Shallow Parallel Transformers with Progressive Approximation Authors: Wei Wang, Xiao-Yong Wei, Qing Li
-
On the Neural Feature Ansatz for Deep Neural Networks Authors: Edward Tansley, Estelle Massart, Coralia Cartis
-
The Coverage Principle: How Pre-training Enables Post-Training Authors: Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J. Foster
-
Continual Learning via Sparse Memory Finetuning Authors: Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas O\u{g}uz
-
On Universality of Deep Equivariant Networks Authors: Marco Pacini, Mircea Petrache, Bruno Lepri, Shubhendu Trivedi, Robin Walters
-
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
-
Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential Authors: Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen
-
Sequence Modeling with Spectral Mean Flows Authors: Jinwoo Kim, Max Beier, Petar Bevanda, Nayun Kim, Seunghoon Hong
-
Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning Authors: Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
-
Learning to Answer from Correct Demonstrations Authors: Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma, Nathan Srebro
-
WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables Authors: Lino Gerlach, Liv V{\aa}ge, Thore Gerlach, Elliott Kauffman
-
A Split-Client Approach to Second-Order Optimization Authors: El Mahdi Chayti, Martin Jaggi
-
SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients Authors: Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi
-
GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen
-
Revisiting Knowledge Distillation: The Hidden Role of Dataset Size Authors: Giulia Lanzillotta, Felix Sarnthein, Gil Kur, Thomas Hofmann, Bobby He
-
Particle Dynamics for Latent-Variable Energy-Based Models Authors: Shiqin Tang, Shuxin Zhuang, Rong Feng, Runsheng Yu, Hongzong Li, Youzhi Zhang
-
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs Authors: Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
-
From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons Authors: Yi-Shan Chu, Yueh-Cheng Kuo
-
AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport Authors: Lingkai Kong, Molei Tao, Yang Liu, Bryan Wang, Jinmiao Fu, Chien-Chih Wang, Huidong Liu
-
Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection Authors: Denis Janiak, Jakub Binkowski, Tomasz Kajdanowicz
-
Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity Authors: Naoki Yoshida, Satoshi Hayakawa, Yuhta Takida, Toshimitsu Uesaka, Hiromi Wakaki, Yuki Mitsufuji
1. Language Models are Injective and Hence Invertible
ArXiv ID: 2510.15511
Authors: Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodola'
Abstract: Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
Comment: Representation Learning: proves injectivity/invertibility of transformer LMs and provides an exact input reconstruction algorithm.
Relevance: 10 Novelty: 9
2. A simple mean field model of feature learning
ArXiv ID: 2510.15174
Authors: Niclas G\"oring, Chris Mingard, Yoonsoo Nam, Ard Louis
Abstract: Feature learning (FL), where neural networks adapt their internal representations during training, remains poorly understood. Using methods from statistical physics, we derive a tractable, self-consistent mean-field (MF) theory for the Bayesian posterior of two-layer non-linear networks trained with stochastic gradient Langevin dynamics (SGLD). At infinite width, this theory reduces to kernel ridge regression, but at finite width it predicts a symmetry breaking phase transition where networks abruptly align with target functions. While the basic MF theory provides theoretical insight into the emergence of FL in the finite-width regime, semi-quantitatively predicting the onset of FL with noise or sample size, it substantially underestimates the improvements in generalisation after the transition. We trace this discrepancy to a key mechanism absent from the plain MF description: \textit{self-reinforcing input feature selection}. Incorporating this mechanism into the MF theory allows us to quantitatively match the learning curves of SGLD-trained networks and provides mechanistic insight into FL.
Comment: Representation Learning: mean-field theory of feature learning and phase transitions in finite-width networks.
Relevance: 9 Novelty: 9
3. Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
ArXiv ID: 2510.15262
Authors: Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
Abstract: Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($\mu$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $\mu$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the $\mu$P learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2\propto \sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields \emph{zero-shot} transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $\mu$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
Comment: Training dynamics/scaling laws: proposes a weight-decay scaling rule for AdamW extending μP beyond the near-init regime for width-robust hyperparameter transfer in Transformers.
Relevance: 9 Novelty: 8
4. ParaFormer: Shallow Parallel Transformers with Progressive Approximation
ArXiv ID: 2510.15425
Authors: Wei Wang, Xiao-Yong Wei, Qing Li
Abstract: The widespread 'deeper is better' philosophy has driven the creation of architectures like ResNet and Transformer, which achieve high performance by stacking numerous layers. However, increasing model depth comes with challenges such as longer training times, higher inference latency, and impracticality on resource-constrained devices. To address these issues, we propose ParaFormer, a shallow Transformer architecture designed for true parallelism in both structure and computation. By formulating standard Transformers as function approximators in closed-form, our theoretical analysis shows that their performance relies on inter-layer collaboration for progressive approximation, rather than depth itself. While deep Transformers enforce this collaboration through sequential designs, we demonstrate that such collaboration is not inherently tied to sequential structures. ParaFormer removes the sequential constraint by organizing layers into parallel branches, enforcing inter-layer collaboration algorithmically. Specifically, we implement progressive approximation, ensuring that each new branch further reduces the loss from preceding branches, enabling faster convergence. Extensive experiments validate ParaFormer's effectiveness, outperforming standard Transformers like ViT. Moreover, ParaFormer supports up to 15.07x model compression and facilitates model expansion for adaptive continuous learning. Experimental results on multi-GPU deployment demonstrate that ParaFormer is 3.30x faster than widely used parallelism solutions such as FairScale. These advancements stem from our closed-form formulation of Transformers based on the Universal Approximation Theorem, which not only explains the ``depth belief'' but also opens new avenues for designing efficient Transformer architectures. Source code: https://(open-upon-acceptance)
Comment: Strongly matches Model Architecture and Efficiency/HPC: shallow parallel Transformer with progressive approximation enabling compression and multi-GPU speedups.
Relevance: 9 Novelty: 8
5. On the Neural Feature Ansatz for Deep Neural Networks
ArXiv ID: 2510.15563
Authors: Edward Tansley, Estelle Massart, Coralia Cartis
Abstract: Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power $\alpha>0$ of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent $\alpha = 1/2$ (Radhakrishnan et al., 2024). We extend this result to networks with $L \geq 2$ layers, showing that the NFA holds with exponent $\alpha = 1/L$, thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit arbitrarily well the training data. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates and initialization schemes.
Comment: Matches Representation Learning: theoretical analysis of Neural Feature Ansatz and training dynamics across depth.
Relevance: 9 Novelty: 8
6. The Coverage Principle: How Pre-training Enables Post-Training
ArXiv ID: 2510.15020
Authors: Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J. Foster
Abstract: Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of \emph{coverage}, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of \emph{the coverage principle}, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: \emph{coverage generalizes faster than cross entropy}, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
Comment: Matches Representation Learning/Training Dynamics: theory of coverage from next-token pretraining predicting downstream/post-training success with provable interventions.
Relevance: 9 Novelty: 8
7. Continual Learning via Sparse Memory Finetuning
ArXiv ID: 2510.15103
Authors: Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas O\u{g}uz
Abstract: Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model's existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.
Comment: Matches Model Compression and Efficiency (Sparsity) for continual learning via sparsely updated memory layers to reduce interference/forgetting.
Relevance: 9 Novelty: 8
8. On Universality of Deep Equivariant Networks
ArXiv ID: 2510.15814
Authors: Marco Pacini, Mircea Petrache, Bruno Lepri, Shubhendu Trivedi, Robin Walters
Abstract: Universality results for equivariant neural networks remain rare. Those that do exist typically hold only in restrictive settings: either they rely on regular or higher-order tensor representations, leading to impractically high-dimensional hidden spaces, or they target specialized architectures, often confined to the invariant setting. This work develops a more general account. For invariant networks, we establish a universality theorem under separation constraints, showing that the addition of a fully connected readout layer secures approximation within the class of separation-constrained continuous functions. For equivariant networks, where results are even scarcer, we demonstrate that standard separability notions are inadequate and introduce the sharper criterion of $\textit{entry-wise separability}$. We show that with sufficient depth or with the addition of appropriate readout layers, equivariant networks attain universality within the entry-wise separable regime. Together with prior results showing the failure of universality for shallow models, our findings identify depth and readout layers as a decisive mechanism for universality, additionally offering a unified perspective that subsumes and extends earlier specialized results.
Comment: Model Architecture: universality results for invariant/equivariant networks highlighting depth/readout as mechanisms.
Relevance: 9 Novelty: 8
9. Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
ArXiv ID: 2510.15304
Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
Abstract: Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
Comment: Matches Model Compression and Efficiency via structured pruning with concatenation-based layer merging and hierarchical distillation to retain capacity.
Relevance: 9 Novelty: 7
10. Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
ArXiv ID: 2510.15216
Authors: Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen
Abstract: Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses ("if-then" rules) built from features extracted from the LLM's latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules' soundness levels, becoming highly distinct for "strict" versus "noisy" rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL's predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model's reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model's internal mechanisms for selecting/designing stronger base models.
Comment: Representation learning/training dynamics: uses cross-layer sparse autoencoders to extract latent rules and introduces SAL to quantify soundness-aware internal distributions predicting reasoning potential.
Relevance: 8 Novelty: 8
11. Sequence Modeling with Spectral Mean Flows
ArXiv ID: 2510.15366
Authors: Jinwoo Kim, Max Beier, Petar Bevanda, Nayun Kim, Seunghoon Hong
Abstract: A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets. Code is available at https://github.com/jw9730/spectral-mean-flow.
Comment: Matches Model Architecture and Representation Learning: operator-theoretic sequence model with spectral tensor-network decomposition and flow matching.
Relevance: 8 Novelty: 8
12. Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning
ArXiv ID: 2510.15244
Authors: Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Abstract: Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM's embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM --> ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.
Comment: Matches Model Architecture via a hybrid discrete diffusion planner with an autoregressive executor, including latent-space interfacing to reduce tokens and improve reasoning.
Relevance: 8 Novelty: 8
13. Learning to Answer from Correct Demonstrations
ArXiv ID: 2510.15464
Authors: Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma, Nathan Srebro
Abstract: We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.
Comment: Matches foundational Training Objective design for learning from correct demonstrations beyond MLE, with sample complexity guarantees under a low-cardinality reward class.
Relevance: 8 Novelty: 8
14. WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables
ArXiv ID: 2510.15655
Authors: Lino Gerlach, Liv V{\aa}ge, Thore Gerlach, Elliott Kauffman
Abstract: Fast and efficient machine learning is of growing interest to the scientific community and has spurred significant research into novel model architectures and hardware-aware design. Recent hard? and software co-design approaches have demonstrated impressive results with entirely multiplication-free models. Differentiable Logic Gate Networks (DLGNs), for instance, provide a gradient-based framework for learning optimal combinations of low-level logic gates, setting state-of-the-art trade-offs between accuracy, resource usage, and latency. However, these models suffer from high computational cost during training and do not generalize well to logic blocks with more inputs. In this work, we introduce Walsh-Assisted Relaxation for Probabilistic Look-Up Tables (WARP-LUTs) - a novel gradient-based method that efficiently learns combinations of logic gates with substantially fewer trainable parameters. We demonstrate that WARP-LUTs achieve significantly faster convergence on CIFAR-10 compared to DLGNs, while maintaining comparable accuracy. Furthermore, our approach suggests potential for extension to higher-input logic blocks, motivating future research on extremely efficient deployment on modern FPGAs and its real-time science applications.
Comment: Model Architecture and Efficiency: multiplication-free probabilistic LUT networks with Walsh-assisted relaxation for fewer parameters and faster convergence.
Relevance: 8 Novelty: 8
15. A Split-Client Approach to Second-Order Optimization
ArXiv ID: 2510.15714
Authors: El Mahdi Chayti, Martin Jaggi
Abstract: Second-order methods promise faster convergence but are rarely used in practice because Hessian computations and decompositions are far more expensive than gradients. We propose a \emph{split-client} framework where gradients and curvature are computed asynchronously by separate clients. This abstraction captures realistic delays and inexact Hessian updates while avoiding the manual tuning required by Lazy Hessian methods. Focusing on cubic regularization, we show that our approach retains strong convergence guarantees and achieves a provable wall-clock speedup of order $\sqrt{\tau}$, where $\tau$ is the relative time needed to compute and decompose the Hessian compared to a gradient step. Since $\tau$ can be orders of magnitude larger than one in high-dimensional problems, this improvement is practically significant. Experiments on synthetic and real datasets confirm the theory: asynchronous curvature consistently outperforms vanilla and Lazy Hessian baselines, while maintaining second-order accuracy.
Comment: Matches High Performance Computing criterion: proposes an algorithmic/system-level asynchronous split-client scheme for second-order training with provable wall-clock speedups, enabling practical large-scale optimization.
Relevance: 8 Novelty: 8
16. SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients
ArXiv ID: 2510.15830
Authors: Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi
Abstract: The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory - the pseudo-gradient - that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, claiming to even outperform AdamW in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5 - 2.5$\times$ in a non-distributed setting up to a scale of 1e23 training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
Comment: Optimization for large-scale training: a Lookahead variant applying Nesterov momentum to pseudo-gradients (SNOO) for compute-efficient training with minimal overhead.
Relevance: 8 Novelty: 7
17. GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device
ArXiv ID: 2510.15620
Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen
Abstract: Semantic top-K selection with cross-encoder rerankers underpins of on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, allowing pruning opportunities prior to completing full inference. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, GRATING. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via dual-layer sliding window and chunked execution. We evaluate GRATING against state-of-the-art baselines on rerankers from 0.6B to 8B parameters across Apple M2 and RTX 5070. GRATING consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, GRATING lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.
Comment: Systems-level inference efficiency: training-free monolithic forwarding with sequence-level sparsity for top-K reranking, reducing latency and peak memory on-device.
Relevance: 8 Novelty: 7
18. Revisiting Knowledge Distillation: The Hidden Role of Dataset Size
ArXiv ID: 2510.15516
Authors: Giulia Lanzillotta, Felix Sarnthein, Gil Kur, Thomas Hofmann, Bobby He
Abstract: The concept of knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size, and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale and relative number of samples on the observed phenomenon. Ultimately, this work reveals that the dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
Comment: Training dynamics/representation: identifies data-efficiency of knowledge distillation in low-data regimes and evaluates competing theories (label smoothing vs dark knowledge).
Relevance: 8 Novelty: 7
19. Particle Dynamics for Latent-Variable Energy-Based Models
ArXiv ID: 2510.15447
Authors: Shiqin Tang, Shuxin Zhuang, Rong Feng, Runsheng Yu, Hongzong Li, Youzhi Zhang
Abstract: Latent-variable energy-based models (LVEBMs) assign a single normalized energy to joint pairs of observed data and latent variables, offering expressive generative modeling while capturing hidden structure. We recast maximum-likelihood training as a saddle problem over distributions on the latent and joint manifolds and view the inner updates as coupled Wasserstein gradient flows. The resulting algorithm alternates overdamped Langevin updates for a joint negative pool and for conditional latent particles with stochastic parameter ascent, requiring no discriminator or auxiliary networks. We prove existence and convergence under standard smoothness and dissipativity assumptions, with decay rates in KL divergence and Wasserstein-2 distance. The saddle-point view further yields an ELBO strictly tighter than bounds obtained with restricted amortized posteriors. Our method is evaluated on numerical approximations of physical systems and performs competitively against comparable approaches.
Comment: Matches Representation Learning: latent-variable energy-based models with Wasserstein gradient flow training and convergence guarantees.
Relevance: 8 Novelty: 7
20. TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
ArXiv ID: 2510.15545
Authors: Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.
Comment: Matches Efficiency/HPC: universal speculative decoding via DTW-based alignment enabling draft–target mismatch and faster inference.
Relevance: 8 Novelty: 7
21. From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons
ArXiv ID: 2510.15012
Authors: Yi-Shan Chu, Yueh-Cheng Kuo
Abstract: We revisit the Universal Approximation Theorem(UAT) through the lens of the tropical geometry of neural networks and introduce a constructive, geometry-aware initialization for sigmoidal multi-layer perceptrons (MLPs). Tropical geometry shows that Rectified Linear Unit (ReLU) networks admit decision functions with a combinatorial structure often described as a tropical rational, namely a difference of tropical polynomials. Focusing on planar binary classification, we design purely sigmoidal MLPs that adhere to the finite-sum format of UAT: a finite linear combination of shifted and scaled sigmoids of affine functions. The resulting models yield decision boundaries that already align with prescribed shapes at initialization and can be refined by standard training if desired. This provides a practical bridge between the tropical perspective and smooth MLPs, enabling interpretable, shape-driven initialization without resorting to ReLU architectures. We focus on the construction and empirical demonstrations in two dimensions; theoretical analysis and higher-dimensional extensions are left for future work.
Comment: Matches Representation Learning/Architecture: geometry-aware initialization for sigmoidal MLPs via tropical perspective.
Relevance: 8 Novelty: 7
22. AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport
ArXiv ID: 2510.15038
Authors: Lingkai Kong, Molei Tao, Yang Liu, Bryan Wang, Jinmiao Fu, Chien-Chih Wang, Huidong Liu
Abstract: Flow-based Generative Models (FGMs) effectively transform noise into complex data distributions. Incorporating Optimal Transport (OT) to couple noise and data during FGM training has been shown to improve the straightness of flow trajectories, enabling more effective inference. However, existing OT-based methods estimate the OT plan using (mini-)batches of sampled noise and data points, which limits their scalability to large and high-dimensional datasets in FGMs. This paper introduces AlignFlow, a novel approach that leverages Semi-Discrete Optimal Transport (SDOT) to enhance the training of FGMs by establishing an explicit, optimal alignment between noise distribution and data points with guaranteed convergence. SDOT computes a transport map by partitioning the noise space into Laguerre cells, each mapped to a corresponding data point. During FGM training, i.i.d. noise samples are paired with data points via the SDOT map. AlignFlow scales well to large datasets and model architectures with negligible computational overhead. Experimental results show that AlignFlow improves the performance of a wide range of state-of-the-art FGM algorithms and can be integrated as a plug-and-play component. Code is available at: https://github.com/konglk1203/AlignFlow.
Comment: Matches Model Architecture/Training with Semi-Discrete Optimal Transport to align noise and data in flow-based models, improving trajectory straightness and efficiency.
Relevance: 8 Novelty: 7
23. Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
ArXiv ID: 2510.15202
Authors: Denis Janiak, Jakub Binkowski, Tomasz Kajdanowicz
Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. hile Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren't universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model's OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
Comment: Representation Learning: analyzes feature geometry/normalization for OOD and introduces radially scaled l2 normalization.
Relevance: 8 Novelty: 7
24. Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
ArXiv ID: 2510.15508
Authors: Naoki Yoshida, Satoshi Hayakawa, Yuhta Takida, Toshimitsu Uesaka, Hiromi Wakaki, Yuki Mitsufuji
Abstract: In this study, we propose an enhancement to the similarity computation mechanism in multi-modal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metrics between paired modalities should correspond to the pointwise mutual information (PMI) between the two modalities. However, the current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that our approach overall outperforms the standard CLIP formulation across several retrieval and classification tasks.
Comment: Matches Representation Learning criterion: introduces a theoretically grounded similarity (PMI in RKHS) for contrastive multi-modal models like CLIP, analyzing and improving the underlying representation/metric.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.