Personalized Daily ArXiv Papers 2025-10-13

[gpt-5]	Prompt	Completion	Total
Token	61176	61229	122405
Cost	$0.08	$0.61	$0.69

Total arXiv papers: 664

Total scanned papers: 395

Total relevant papers: 35

Table of contents with paper titles:

Transmuting prompts into weights Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo
LOTION: Smoothing the Optimization Landscape for Quantized Training Authors: Mujin Kwun, Depen Morwani, Chloe Huangyuan Su, Stephanie Gil, Nikhil Anand, Sham Kakade
SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions Authors: Ziyi Wang, Nan Jiang, Guang Lin, Qifan Song
Localist LLMs -- A Mathematical Framework for Dynamic Locality Control Authors: Joachim Diederich
FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference Authors: Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu
On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification Authors: Faried Abu Zaid, Tim Katzke, Emmanuel M\"uller, Daniel Neider
Efficient Autoregressive Inference for Transformer Probabilistic Models Authors: Conor Hassan, Nasrulloh Loka, Cen-You Li, Daolang Huang, Paul E. Chang, Yang Yang, Francesco Silvestrin, Samuel Kaski, Luigi Acerbi
AdaPM: a Partial Momentum Algorithm for LLM Training Authors: Yimu Zhang, Yuanshi Liu, Cong Fang
Design Principles for Sequence Models via Coefficient Dynamics Authors: Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso
Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers Authors: Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li
The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton Authors: Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry Authors: Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg
On the Alignment Between Supervised and Self-Supervised Contrastive Learning Authors: Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs Authors: Yifan Zhao, Egan Johnson, Prasanth Chatarasi, Vikram Adve, Sasa Misailovic
Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication Authors: Benjamin Brock, Renato Golin
dInfer: An Efficient Inference Framework for Diffusion Language Models Authors: Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
StreamingVLM: Real-Time Understanding for Infinite Video Streams Authors: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training Authors: T. Ed Li, Junyu Ren
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts Authors: Nathan Quiblier, Roy Friedman, Matthew Ricci
Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation Authors: Devleena Das, Rajeev Patwari, Ashish Sirasao
Task-Level Insights from Eigenvalues across Sequence Models Authors: Rahel Rickenbach, Jelena Trisovic, Alexandre Didier, Jerome Sieber, Melanie N. Zeilinger
Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization Authors: Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen
Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression Authors: Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Shaochu Zhang, Shengchao Liu, Guoxin Ma, Yu Lan, Chao Shen
Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting Authors: Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun
Auto-scaling Continuous Memory for GUI Agent Authors: Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, Biwei Huang
On the Implicit Adversariality of Catastrophic Forgetting in Deep Continual Learning Authors: Ze Peng, Jian Zhang, Jintao Guo, Lei Qi, Yang Gao, Yinghuan Shi
QuIRK: Quantum-Inspired Re-uploading KAN Authors: Vinayak Sharma, Ashish Padhy, Vijay Jagdish Karanjkar, Sourav Behera, Lord Sen, Shyamapada Mukherjee, Aviral Shrivastava
Geodesic Calculus on Latent Spaces Authors: Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, Benedikt Wirth
On the Representations of Entities in Auto-regressive Large Language Models Authors: Victor Morand, Josiane Mothe, Benjamin Piwowarski
PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning Authors: Daiki Yoshikawa, Takashi Matsubara
PAC Reasoning: Controlling the Performance Loss for Efficient Reasoning Authors: Hao Zeng, Jianguo Huang, Bingyi Jing, Hongxin Wei, Bo An
Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity Authors: Edward Y. Chang, Ethan Y. Chang
Deep Multimodal Subspace Clustering Networks Authors: Mahdi Abavisani, Vishal M. Patel
Sparse components distinguish visual pathways & their alignment to neural networks Authors: Ammar I Marvi, Nancy G Kanwisher, Meenakshi Khosla
Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings Authors: Shikun Liu, Haoyu Wang, Mufei Li, Pan Li

1. Transmuting prompts into weights

ArXiv ID: 2510.08734

Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo

Abstract: A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.

Comment: Model Architecture and Representation: theoretical mapping from prompts to implicit weight updates in deep Transformers; introduces token-independent thought vectors/matrices enabling principled weight-level steering.

Relevance: 10 Novelty: 9

2. LOTION: Smoothing the Optimization Landscape for Quantized Training

ArXiv ID: 2510.08757

Authors: Mujin Kwun, Depen Morwani, Chloe Huangyuan Su, Stephanie Gil, Nikhil Anand, Sham Kakade

Abstract: Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piece-wise constant, yielding zero gradients everywhere except at quantization thresholds where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like Straight Through Estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, \textbf{L}ow-precision \textbf{O}ptimization via s\textbf{T}ochastic-no\textbf{I}se sm\textbf{O}othi\textbf{N}g, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M- parameter language models.

Comment: Model Compression and Efficiency: principled optimization for quantized training via stochastic-noise smoothing with convergence guarantees; addresses core quantization challenges beyond STE.

Relevance: 10 Novelty: 8

3. SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions

ArXiv ID: 2510.08999

Authors: Ziyi Wang, Nan Jiang, Guang Lin, Qifan Song

Abstract: Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often resulting in suboptimal compression rates to preserve acceptable performance drops. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to inducing sparsity and model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide the consistent result of our proposed variational approach to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.

Comment: Model Compression and Efficiency: Bayesian joint sparsity (spike-and-slab) and low-bit quantization via GMM within a unified variational framework with consistency results.

Relevance: 10 Novelty: 8

4. Localist LLMs -- A Mathematical Framework for Dynamic Locality Control

ArXiv ID: 2510.09338

Authors: Joachim Diederich

Abstract: We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.

Comment: Matches Model Architecture and Sparsity: introduces a tunable locality dial via group sparsity on attention with theoretical guarantees, enabling dynamic control between localist and distributed representations.

Relevance: 10 Novelty: 8

5. FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

ArXiv ID: 2510.09332

Authors: Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu

Abstract: Although large language models (LLM) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.

Comment: Compression/Efficiency: fine-grained low-rank rank allocation per layer and progressive low-rank decoding for efficient LLM inference.

Relevance: 10 Novelty: 8

6. On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification

ArXiv ID: 2510.09452

Authors: Faried Abu Zaid, Tim Katzke, Emmanuel M\"uller, Daniel Neider

Abstract: Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.

Comment: Representation Learning criterion: introduces uniformly scaling flows linking Deep SVDD and normalizing flows, preventing collapse and tightening likelihood–latent norm alignment.

Relevance: 9 Novelty: 8

7. Efficient Autoregressive Inference for Transformer Probabilistic Models

ArXiv ID: 2510.09477

Authors: Conor Hassan, Nasrulloh Loka, Cen-You Li, Daolang Huang, Paul E. Chang, Yang Yang, Francesco Silvestrin, Samuel Kaski, Luigi Acerbi

Abstract: Transformer-based models for amortized probabilistic inference, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many real-world applications, from signal interpolation to multi-column tabular predictions, require coherent joint distributions that capture dependencies between predictions. While purely autoregressive architectures efficiently generate such distributions, they sacrifice the flexible set-conditioning that makes these models powerful for meta-learning. Conversely, the standard approach to obtain joint distributions from set-based models requires expensive re-encoding of the entire augmented conditioning set at each autoregressive step. We introduce a causal autoregressive buffer that preserves the advantages of both paradigms. Our approach decouples context encoding from updating the conditioning set. The model processes the context once and caches it. A dynamic buffer then captures target dependencies: as targets are incorporated, they enter the buffer and attend to both the cached context and previously buffered targets. This enables efficient batched autoregressive generation and one-pass joint log-likelihood evaluation. A unified training strategy allows seamless integration of set-based and autoregressive modes at minimal additional cost. Across synthetic functions, EEG signals, cognitive models, and tabular data, our method matches predictive accuracy of strong baselines while delivering up to 20 times faster joint sampling. Our approach combines the efficiency of autoregressive generative models with the representational power of set-based conditioning, making joint prediction practical for transformer-based probabilistic models.

Comment: Systems/architecture innovation: causal autoregressive buffer with cached context enables efficient joint sampling—cache/memory optimization for Transformer probabilistic models.

Relevance: 9 Novelty: 8

8. AdaPM: a Partial Momentum Algorithm for LLM Training

ArXiv ID: 2510.09103

Authors: Yimu Zhang, Yuanshi Liu, Cong Fang

Abstract: In the training of large language models, momentum is widely used and often demonstrated to achieve significant acceleration. However, storing momentum typically presents memory challenges. In this paper, we propose AdaPM, an adaptive training strategy that leverages partial momentum to implement a memory-efficient optimizer. To this end, AdaPM utilizes a non-uniform momentum design: for most blocks, full momentum is not necessary to preserve the performance of the optimization. In the momentum design of AdaPM, to mitigate the bias and performance loss caused by partial momentum, we enhance the partial momentum by a bias correction technique. Empirically, we verify that our approach reduces memory by over $90\%$ in momentum while maintaining both efficiency and performance for pretraining various language models ranging from 60M to 1.5B, as well as for supervised fine-tuning and RLHF. AdaPM can further reduce memory by up to $95\%$ in optimizer states by combining the memory-efficient technique on the second-order statistic, saving over $30\%$ GPU hours for pretraining GPT-2 1.5B.

Comment: HPC/Efficiency: memory-efficient optimizer for LLM training via partial momentum with bias correction, reducing momentum state memory by >90%.

Relevance: 9 Novelty: 8

9. Design Principles for Sequence Models via Coefficient Dynamics

ArXiv ID: 2510.09389

Authors: Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso

Abstract: Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties. Thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.

Comment: Model Architecture: unifies Transformers/SSMs/RNNs via coefficient-dynamics and derives design principles linking expressivity, efficient implementation, and stability.

Relevance: 9 Novelty: 8

10. Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers

ArXiv ID: 2510.09017

Authors: Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li

Abstract: Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing 'no-op' attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token's contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.

Comment: Model Architecture criterion: proposes Value-State Gated Attention to break attention/value reinforcement; also improves quantization fidelity (Compression/Efficiency criterion).

Relevance: 9 Novelty: 8

11. The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

ArXiv ID: 2510.09378

Authors: Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani

Abstract: Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.

Comment: High Performance Computing/Efficiency criterion: evaluates full and layerwise Gauss-Newton preconditioning for transformer training, showing large iteration reductions and insights on Hessian structure.

Relevance: 9 Novelty: 8

12. Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

ArXiv ID: 2510.08638

Authors: Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

Abstract: DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.

Comment: Representation Learning: dictionary/SAE-based interpretability of DINOv2 and a new Minkowski Representation Hypothesis about concept geometry in ViTs.

Relevance: 9 Novelty: 8

13. On the Alignment Between Supervised and Self-Supervised Contrastive Learning

ArXiv ID: 2510.08852

Authors: Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti

Abstract: Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: {\em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?} We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at [\href{https://github.com/DLFundamentals/understanding_ssl_v2}{code}, \href{https://dlfundamentals.github.io/cl-nscl-representation-alignment/}{project page}].

Comment: Strong match to representation learning: theoretical/empirical analysis of alignment between supervised and self-supervised contrastive learning at the representation level.

Relevance: 9 Novelty: 8

14. Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

ArXiv ID: 2510.08726

Authors: Yifan Zhao, Egan Johnson, Prasanth Chatarasi, Vikram Adve, Sasa Misailovic

Abstract: Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. The paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. On ten attention-based benchmarks, Neptune, starting from simple attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have average speedup of $1.35\times$ over the next best alternative, demonstrating its effectiveness for deep learning workloads.

Comment: Strong match to high-performance computing: novel tensor-compiler operator fusion for reductions (e.g., attention) improving locality/parallelism with algebraic correction.

Relevance: 9 Novelty: 8

15. Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication

ArXiv ID: 2510.08874

Authors: Benjamin Brock, Renato Golin

Abstract: Many important applications across science, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suitable for different problem sizes and partitionings including 1D, 2D, 1.5D, and 2.5D algorithms. A limitation of current work is that existing algorithms are limited to a subset of partitionings. Multiple algorithm implementations are required to support the full space of possible partitionings. If no algorithm implementation is available for a particular set of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework that performs direct GPU-to-GPU communication using intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our work is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.

Comment: Strong match to HPC: a universal algorithm for distributed matrix multiplication across arbitrary partitionings/replication, improving systems support for large-scale training/inference.

Relevance: 9 Novelty: 8

16. dInfer: An Efficient Inference Framework for Diffusion Language Models

ArXiv ID: 2510.08666

Authors: Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng

Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Even more and more open-sourced dLLM models emerge, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components-model, diffusion iteration manager, decoding strategy, and KV-cache manager-and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared with AR models (with a comparable number of activation parameters and performance) QWen2.5-3B, which is highly optimized with latest vLLM inference engine, dInfer still deliverers $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.

Comment: High Performance Computing and Model Efficiency: proposes an inference framework for diffusion LLMs with algorithmic and system-level optimizations (diffusion iteration manager, decoding, KV-cache manager) enabling large speedups.

Relevance: 9 Novelty: 7

17. StreamingVLM: Real-Time Understanding for Infinite Video Streams

ArXiv ID: 2510.09608

Authors: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

Abstract: Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.

Comment: Compression/Efficiency criterion: streaming KV-cache management (attention sinks, short/long windows) with training–inference alignment for real-time long-context VLMs.

Relevance: 9 Novelty: 7

18. Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training

ArXiv ID: 2510.08855

Authors: T. Ed Li, Junyu Ren

Abstract: Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.

Comment: Sparse Autoencoders training innovation (adaptive temporal masking) to reduce feature absorption; matches sparsity and representation-learning criteria.

Relevance: 9 Novelty: 7

19. MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts

ArXiv ID: 2510.09594

Authors: Nathan Quiblier, Roy Friedman, Matthew Ricci

Abstract: Dynamical systems in the life sciences are often composed of complex mixtures of overlapping behavioral regimes. Cellular subpopulations may shift from cycling to equilibrium dynamics or branch towards different developmental fates. The transitions between these regimes can appear noisy and irregular, posing a serious challenge to traditional, flow-based modeling techniques which assume locally smooth dynamics. To address this challenge, we propose MODE (Mixture Of Dynamical Experts), a graphical modeling framework whose neural gating mechanism decomposes complex dynamics into sparse, interpretable components, enabling both the unsupervised discovery of behavioral regimes and accurate long-term forecasting across regime transitions. Crucially, because agents in our framework can jump to different governing laws, MODE is especially tailored to the aforementioned noisy transitions. We evaluate our method on a battery of synthetic and real datasets from computational biology. First, we systematically benchmark MODE on an unsupervised classification task using synthetic dynamical snapshot data, including in noisy, few-sample settings. Next, we show how MODE succeeds on challenging forecasting tasks which simulate key cycling and branching processes in cell biology. Finally, we deploy our method on human, single-cell RNA sequencing data and show that it can not only distinguish proliferation from differentiation dynamics but also predict when cells will commit to their ultimate fate, a key outstanding challenge in computational biology.

Comment: Model Architecture: Mixture-Of-Experts with neural gating decomposes dynamics into sparse experts (conditional computation).

Relevance: 9 Novelty: 7

20. Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation

ArXiv ID: 2510.08600

Authors: Devleena Das, Rajeev Patwari, Ashish Sirasao

Abstract: Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.

Comment: Matches Model Compression and Efficiency: uses low-rank adaptation (LoRA) with synthetic data/logit distillation to recover accuracy after quantization/pruning/serialization-induced degradation.

Relevance: 9 Novelty: 7

21. Task-Level Insights from Eigenvalues across Sequence Models

ArXiv ID: 2510.09379

Authors: Rahel Rickenbach, Jelena Trisovic, Alexandre Didier, Jerome Sieber, Melanie N. Zeilinger

Abstract: Although softmax attention drives state-of-the-art performance for sequence models, its quadratic complexity limits scalability, motivating linear alternatives such as state space models (SSMs). While these alternatives improve efficiency, their fundamental differences in information processing remain poorly understood. In this work, we leverage the recently proposed dynamical systems framework to represent softmax, norm and linear attention as dynamical systems, enabling a structured comparison with SSMs by analyzing their respective eigenvalue spectra. Since eigenvalues capture essential aspects of dynamical system behavior, we conduct an extensive empirical analysis across diverse sequence models and benchmarks. We first show that eigenvalues influence essential aspects of memory and long-range dependency modeling, revealing spectral signatures that align with task requirements. Building on these insights, we then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.

Comment: Representation Learning: dynamical-systems eigenvalue analysis across attention variants and SSMs linking spectra to memory and long-range dependency modeling.

Relevance: 9 Novelty: 7

22. Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization

ArXiv ID: 2510.09160

Authors: Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen

Abstract: As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model's essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.

Comment: Matches compression/efficiency: subspace-restricted training (weight–activation subspace iteration) for ViTs reducing memory/FLOPs and enabling on-device learning.

Relevance: 9 Novelty: 7

23. Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression

ArXiv ID: 2510.08647

Authors: Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Shaochu Zhang, Shengchao Liu, Guoxin Ma, Yu Lan, Chao Shen

Abstract: Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), while long CoT suffers from high computational costs and significant latency losses owing to the autoregressive nature of generative LLMs. CoT compression aims to improve efficiency in the reasoning process by reducing output length. Previous works trade reasoning efficiency by either laborious discrete prompt designing or the construction of external compressed CoT datasets that sacrifice key reasoning details. In this work, we propose Upfront CoT (UCoT): an efficient reasoning framework with upfront thought embedding to automate CoT compression. UCoT is a cooperative workflow involving a small model (compressor) and a large model (executor). The first stage of UCoT trains compressor to generate upfront thought embeddings rich in reasoning information for the executor, avoiding the drawbacks of manually designed prompts. The second stage optimizes executor to utilize upfront thought embeddings to derive the correct answer with short reasoning, using a reward mechanism. Extensive experiments show that UCoT maintains the powerful reasoning ability of executor while significantly reducing the length of CoT. It is worth mentioning that when applying UCoT to the Qwen2.5-7B-Instruct model, the usage of tokens on GSM8K dataset is reduced by 50\%, while the performance is 3.08\% higher than that of the state-of-the-art (SOTA) method. The code and dataset are in supplementary material.

Comment: Matches Model Compression and Efficiency: CoT compression via an upfront thought-embedding compressor–executor framework to reduce token usage/latency while maintaining reasoning quality.

Relevance: 9 Novelty: 7

24. Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting

ArXiv ID: 2510.09152

Authors: Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun

Abstract: Large language models (LLMs) often face a trade-off in post-training: improvements on specialized domains frequently come at the expense of general capabilities. Existing solutions attempt to mitigate this tension via regularization, selective parameter updates, or data-centric replay, but each imposes significant costs in computation, data access, or adaptability. Recent work has shown that training signals can be compressed to subsets of logits without severe accuracy loss, suggesting a path toward efficient adaptation. However, naive truncation destabilizes optimization and exacerbates forgetting. We introduce Logits Replay + MoClip, a two-stage framework that compresses supervision in the logit space and stabilizes optimization at the update level. In Stage 0, we record dynamic Top-K token subsets that cover a probability threshold, always including the gold label. In Stage 1, we replay these compact subsets to compute exact renormalized losses, avoiding full softmax computation and implicitly regularizing. To ensure stability, we design MoClip, an optimizer that caps gradient-momentum rotation and applies an arctan2-based rescaling of updates. Empirically, our method improves domain performance on Communication Technology (CT) and NL2SQL tasks while mitigating forgetting on general benchmarks (MMLU, BBH, GPQA, MATH), and reduces training cost by over 40%. Together, these contributions offer a scalable, architecture-agnostic path for domain adaptation of LLMs without sacrificing generalization.

Comment: Model Efficiency: compresses supervision via Top-K logits replay and introduces an update-stabilizing optimizer (MoClip), reducing softmax cost and mitigating forgetting in post-training.