Personalized Daily ArXiv Papers 2025-09-22

[gpt-5]    Prompt   Completion   Total
Token      46024    49441        95465
Cost       $0.06    $0.49        $0.55

Total arXiv papers: 520

Total scanned papers: 289

Total relevant papers: 25

Table of contents with paper titles:

  1. IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs Authors: Junchen Zhao, Ali Derakhshan, Dushyant Bharadwaj, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, Ian Harris

  2. Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems Authors: Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida

  3. MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen

  4. RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation Authors: Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi

  5. Localmax dynamics for attention in transformers and its asymptotic behavior Authors: Henri Cimetière, Maria Teresa Chiri, Bahman Gharesifard

  6. Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification Authors: Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin

  7. Distribution-Aligned Decoding for Efficient LLM Task Adaptation Authors: Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang

  8. Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception Authors: Yulin Wang, Yang Yue, Yang Yue, Huanqian Wang, Haojun Jiang, Yizeng Han, Zanlin Ni, Yifan Pu, Minglei Shi, Rui Lu, Qisen Yang, Andrew Zhao, Zhuofan Xia, Shiji Song, Gao Huang

  9. Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models Authors: Tomoya Yamashita, Akira Ito, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara

  10. Region-Aware Deformable Convolutions Authors: Abolfazl Saheban Maleki, Maryam Imani

  11. Nonconvex Decentralized Stochastic Bilevel Optimization under Heavy-Tailed Noises Authors: Xinwen Zhang, Yihan Zhang, Hongchang Gao

  12. Universal Learning of Stochastic Dynamics for Exact Belief Propagation using Bernstein Normalizing Flows Authors: Peter Amorese, Morteza Lahijanian

  13. Synthetic bootstrapped pretraining Authors: Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang

  14. The Multi-Query Paradox in Zeroth-Order Optimization Authors: Wei Lin, Qingyu Song, Hong Xu

  15. On the Convergence of Muon and Beyond Authors: Da Chang, Yongxiang Liu, Ganzhao Yuan

  16. BEFT: Bias-Efficient Fine-Tuning of Language Models Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar

  17. Detail Across Scales: Multi-Scale Enhancement for Full Spectrum Neural Representations Authors: Yuan Ni, Zhantao Chen, Cheng Peng, Rajan Plumley, Chun Hong Yoon, Jana B. Thayer, Joshua J. Turner

  18. MTS-DMAE: Dual-Masked Autoencoder for Unsupervised Multivariate Time Series Representation Learning Authors: Yi Xu, Yitian Zhang, Yun Fu

  19. Stochastic Sample Approximations of (Local) Moduli of Continuity Authors: Rodion Nazarov, Allen Gehret, Robert Shorten, Jakub Marecek

  20. Computing Linear Regions in Neural Networks with Skip Connections Authors: Johnny Joyce, Jan Verschelde

  21. Attention Schema-based Attention Control (ASAC): A Cognitive-Inspired Approach for Attention Management in Transformers Authors: Krati Saxena, Federico Jurado Ruiz, Guido Manzi, Dianbo Liu, Alex Lamb

  22. Global Pre-fixing, Local Adjusting: A Simple yet Effective Contrastive Strategy for Continual Learning Authors: Jia Tang, Xinrui Wang, Songcan Chen

  23. SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection Authors: Maithili Joshi, Palash Nandi, Tanmoy Chakraborty

  24. Toward Efficient Influence Function: Dropout as a Compression Tool Authors: Yuchen Zhang, Mohammad Mohammadi Amiri

  25. On Optimal Steering to Achieve Exact Fairness Authors: Mohit Sharma, Amit Jayant Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah


1. IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs

ArXiv ID: 2509.15455

Authors: Junchen Zhao, Ali Derakhshan, Dushyant Bharadwaj, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, Ian Harris

Abstract: Large Language Models (LLMs) promise impressive capabilities, yet their multi-billion-parameter scale makes on-device or low-resource deployment prohibitive. Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits, as they rely on isolated, layer-specific metrics that overlook critical inter-layer interactions affecting overall performance. In this paper, we propose two innovations to address these limitations. First, we frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE) to efficiently obtain accurate Shapley estimates of layer sensitivities and inter-layer interactions. Second, building upon SPQE, we propose Interaction-aware Mixed-Precision Quantization (IMPQ) which translates these Shapley estimates into a binary quadratic optimization formulation, assigning either 2 or 4-bit precision to layers under strict memory constraints. Comprehensive experiments conducted on Llama-3, Gemma-2, and Qwen-3 models across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate IMPQ's scalability and consistently superior performance compared to methods relying solely on isolated metrics. Across average precisions spanning 4 bit down to 2 bit, IMPQ cuts Perplexity by 20 to 80 percent relative to the best baseline, with the margin growing as the bit-width tightens.

Comment: Model Compression and Efficiency: interaction-aware mixed-precision quantization using Shapley-based layer sensitivity/interactions and binary quadratic optimization for 2/4-bit LLMs.

Relevance: 10 Novelty: 8
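
The binary quadratic formulation mentioned in the IMPQ abstract can be illustrated with a toy sketch. This is not the authors' solver: the function name `assign_bits`, the exhaustive search, and all numbers below are assumptions chosen for illustration, and the per-layer and pairwise costs stand in for the Shapley-based estimates that SPQE would supply.

```python
from itertools import product

def assign_bits(linear, pairwise, sizes, budget_bits):
    """Choose 2 or 4 bits per layer by brute force over a binary quadratic cost.

    x[i] = 1 means layer i goes to 2 bits, x[i] = 0 means it stays at 4 bits.
    linear[i]      : hypothetical loss increase when only layer i is at 2 bits
    pairwise[i][j] : hypothetical extra loss when layers i and j (i < j) are both at 2 bits
    sizes[i]       : number of weights in layer i
    budget_bits    : total memory budget in bits
    """
    n = len(linear)
    best_cost, best_x = float("inf"), None
    for x in product((0, 1), repeat=n):
        memory = sum(s * (2 if xi else 4) for s, xi in zip(sizes, x))
        if memory > budget_bits:
            continue  # assignment violates the memory constraint
        cost = sum(linear[i] * x[i] for i in range(n))
        cost += sum(pairwise[i][j] * x[i] * x[j]
                    for i in range(n) for j in range(i + 1, n))
        if cost < best_cost:
            best_cost, best_x = cost, x
    if best_x is None:
        raise ValueError("no assignment fits the memory budget")
    return [2 if xi else 4 for xi in best_x], best_cost

# Three equally sized layers; the budget forces at least two layers down to 2 bits.
linear = [1.0, 0.5, 0.6]
pairwise = [[0.0, 0.1, 0.1],
            [0.0, 0.0, 2.0],   # layers 1 and 2 interact strongly
            [0.0, 0.0, 0.0]]
bits, cost = assign_bits(linear, pairwise, sizes=[100, 100, 100], budget_bits=800)
```

In this toy instance, ranking layers by their isolated cost alone would quantize layers 1 and 2 together, while the interaction term steers the search to layers 0 and 1 instead. Exhaustive search only works for a handful of layers; a real model would need a proper binary quadratic solver.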


2. Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

ArXiv ID: 2509.15448

Authors: Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida

Abstract: Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for the language data, they quickly found their way to the image, video, graph, etc. data modalities with various signal geometries. Despite this versatility, generalizing the attention mechanism to scenarios where data is presented at different scales from potentially different modalities is not straightforward. The attempts to incorporate hierarchy and multi-modality within transformers are largely based on ad hoc heuristics, which are not seamlessly generalizable to similar problems with potentially different structures. To address this problem, in this paper, we take a fundamentally different approach: we first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention while incorporating the inductive biases originating from the hierarchical/geometric information of the problem. We further propose an efficient algorithm based on dynamic programming to compute our derived attention mechanism. By incorporating it within transformers, we show that the proposed hierarchical attention mechanism not only can be employed to train transformer models in hierarchical/multi-modal settings from scratch, but it can also be used to inject hierarchical information into classical, pre-trained transformer models post training, resulting in more efficient models in zero-shot manner.

Comment: Model Architecture: derives a hierarchical self-attention mechanism from first principles with a dynamic-programming algorithm, enabling multi-scale transformers and post-hoc hierarchical injection.

Relevance: 10 Novelty: 8


3. MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

ArXiv ID: 2509.16197

Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen

Abstract: Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

Comment: Model Architecture: unified multimodal LLM with a hybrid vision tokenizer and dual adapters enabling joint image understanding and generation.

Relevance: 9 Novelty: 8


4. RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation

ArXiv ID: 2509.15724

Authors: Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi

Abstract: Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE, AG News, and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8x faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.

Comment: Compression/Efficiency: Random Matrix Theory–guided dimensionality reduction/knowledge distillation preserving informative directions without pruning/heuristic ranks.

Relevance: 9 Novelty: 8
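
As a rough illustration of what "informative directions identified via spectral properties" could mean in the RMT-KD entry above, the sketch below applies the Marchenko-Pastur upper edge, a standard random-matrix threshold for sample-covariance eigenvalues. Whether RMT-KD uses exactly this criterion is an assumption; the function names and the example spectrum are hypothetical.

```python
import math

def marchenko_pastur_upper_edge(n_samples, n_features, noise_var=1.0):
    """Largest covariance eigenvalue expected from pure noise (asymptotically)."""
    ratio = n_features / n_samples
    return noise_var * (1.0 + math.sqrt(ratio)) ** 2

def informative_directions(eigenvalues, n_samples, noise_var=1.0):
    """Indices of eigenvalues that exceed the noise edge (candidate directions to keep)."""
    edge = marchenko_pastur_upper_edge(n_samples, len(eigenvalues), noise_var)
    return [i for i, lam in enumerate(eigenvalues) if lam > edge]

# Hypothetical spectrum: 100 hidden dimensions, 400 samples, two spikes above the noise bulk.
spectrum = [9.0, 4.0] + [1.0] * 98
kept = informative_directions(spectrum, n_samples=400)
```

With 100 features and 400 samples the edge is 2.25 for unit noise variance, so only the two spiked eigenvalues would be retained in this toy spectrum.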


5. Localmax dynamics for attention in transformers and its asymptotic behavior

ArXiv ID: 2509.15958

Authors: Henri Cimetière, Maria Teresa Chiri, Bahman Gharesifard

Abstract: We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension lies in relaxing neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set, prompting the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.

Comment: Model Architecture / Representation Learning: theoretical analysis of transformer attention via localmax dynamics interpolating softmax and hardmax with asymptotic behavior results.

Relevance: 9 Novelty: 8


6. Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

ArXiv ID: 2509.15591

Authors: Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin

Abstract: Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.

Comment: Model Architecture + Representation Learning: unified encoder/decoder mapping to disjoint zones in a shared Gaussian latent space enabling generation, representation learning, and classification.

Relevance: 9 Novelty: 8


7. Distribution-Aligned Decoding for Efficient LLM Task Adaptation

ArXiv ID: 2509.15888

Authors: Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang

Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.

Comment: Compression/Efficiency: decoding-time task adaptation via a KL-gradient-derived steering vector; PEFT-compatible with theoretical first-order equivalence to full fine-tuning.

Relevance: 9 Novelty: 8
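
The steering-vector idea in the entry above can be made concrete with a small sketch. It computes the gradient of KL(p_warm || p_pre) with respect to the warm-started model's logits, which has the closed form p_j (log(p_j / q_j) - KL), and adds a scaled copy of it to the logits at decoding time. Applying the vector in logit space, the direction of the KL, and every name and number here are assumptions for illustration; the paper's actual construction may differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_grad_wrt_logits(warm_logits, pre_logits):
    """Gradient of KL(softmax(warm) || softmax(pre)) with respect to warm_logits."""
    p, q = softmax(warm_logits), softmax(pre_logits)
    divergence = kl(p, q)
    return [pi * (math.log(pi / qi) - divergence) for pi, qi in zip(p, q)]

def steer(logits, vector, strength):
    """Shift logits along the steering vector before sampling."""
    return [z + strength * v for z, v in zip(logits, vector)]

# Hypothetical 3-token vocabulary: the warm-started model prefers token 1.
pre_logits = [2.0, 1.0, 0.0]
warm_logits = [1.0, 2.0, 0.0]
vector = kl_grad_wrt_logits(warm_logits, pre_logits)
steered_probs = softmax(steer(warm_logits, vector, strength=0.5))
```

In this toy case the steered distribution puts more mass on the token that the warm-started model already favored over the pre-trained one, which is the qualitative effect the abstract describes. The `strength` value is arbitrary here, whereas the paper derives an optimal one.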


8. Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception

ArXiv ID: 2509.15333

Authors: Yulin Wang, Yang Yue, Yang Yue, Huanqian Wang, Haojun Jiang, Yizeng Han, Zanlin Ni, Yifan Pu, Minglei Shi, Rui Lu, Qisen Yang, Andrew Zhao, Zhuofan Xia, Shiji Song, Gao Huang

Abstract: Human vision is highly adaptive, efficiently sampling intricate environments by sequentially fixating on task-relevant regions. In contrast, prevailing machine vision models passively process entire scenes at once, resulting in excessive resource demands scaling with spatial-temporal input resolution and model size, yielding critical limitations impeding both future advancements and real-world application. Here we introduce AdaptiveNN, a general framework aiming to drive a paradigm shift from 'passive' to 'active, adaptive' vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process, progressively identifying and attending to regions pertinent to the task, incrementally combining information across fixations, and actively concluding observation when sufficient. We establish a theory integrating representation learning with self-rewarding reinforcement learning, enabling end-to-end training of the non-differentiable AdaptiveNN without additional supervision on fixation locations. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, processing images from real driving and medical scenarios, language-driven embodied AI, and side-by-side comparisons with humans. AdaptiveNN achieves up to 28x inference cost reduction without sacrificing accuracy, flexibly adapts to varying task demands and resource budgets without retraining, and provides enhanced interpretability via its fixation patterns, demonstrating a promising avenue toward efficient, flexible, and interpretable computer vision. Furthermore, AdaptiveNN exhibits closely human-like perceptual behaviors in many cases, revealing its potential as a valuable tool for investigating visual cognition. Code is available at https://github.com/LeapLabTHU/AdaptiveNN.

Comment: Model Architecture + Efficiency: conditional/dynamic vision framework (sequential fixations, early stopping) with theory enabling end-to-end training of non-differentiable policies.

Relevance: 9 Novelty: 8


9. Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models

ArXiv ID: 2509.15631

Authors: Tomoya Yamashita, Akira Ito, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara

Abstract: As large language models (LLMs) are increasingly deployed across various applications, privacy and copyright concerns have heightened the need for more effective LLM unlearning techniques. Many existing unlearning methods aim to suppress undesirable outputs through additional training (e.g., gradient ascent), which reduces the probability of generating such outputs. While such suppression-based approaches can control model outputs, they may not eliminate the underlying knowledge embedded in the model's internal activations; muting a response is not the same as forgetting it. Moreover, such suppression-based methods often suffer from model collapse. To address these issues, we propose a novel unlearning method that directly intervenes in the model's internal activations. In our formulation, forgetting is defined as a state in which the activation of a forgotten target is indistinguishable from that of "unknown" entities. Our method introduces an unlearning objective that modifies the activation of the target entity away from those of known entities and toward those of unknown entities in a sparse autoencoder latent space. By aligning the target's internal activation with those of unknown entities, we shift the model's recognition of the target entity from "known" to "unknown", achieving genuine forgetting while avoiding over-suppression and model collapse. Empirically, we show that our method effectively aligns the internal activations of the forgotten target, a result that the suppression-based approaches do not reliably achieve. Additionally, our method effectively reduces the model's recall of target knowledge in question-answering tasks without significant damage to the non-target knowledge.

Comment: Representation Learning: directly intervenes in LLM internal activations via a sparse autoencoder latent space to achieve genuine unlearning (aligning targets with “unknown” representations).

Relevance: 9 Novelty: 8


10. Region-Aware Deformable Convolutions

ArXiv ID: 2509.15436

Authors: Abolfazl Saheban Maleki, Maryam Imani

Abstract: We introduce Region-Aware Deformable Convolution (RAD-Conv), a new convolutional operator that enhances neural networks' ability to adapt to complex image structures. Unlike traditional deformable convolutions, which are limited to fixed quadrilateral sampling areas, RAD-Conv uses four boundary offsets per kernel element to create flexible, rectangular regions that dynamically adjust their size and shape to match image content. This approach allows precise control over the receptive field's width and height, enabling the capture of both local details and long-range dependencies, even with small 1x1 kernels. By decoupling the receptive field's shape from the kernel's structure, RAD-Conv combines the adaptability of attention mechanisms with the efficiency of standard convolutions. This innovative design offers a practical solution for building more expressive and efficient vision models, bridging the gap between rigid convolutional architectures and computationally costly attention-based methods.

Comment: Model Architecture: introduces Region-Aware Deformable Convolution with boundary-offset-defined receptive fields, combining attention-like adaptability with convolution efficiency.

Relevance: 9 Novelty: 7
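
To make the "four boundary offsets per kernel element" idea in the RAD-Conv entry above more tangible, here is a minimal sketch of a single kernel element that reads from a rectangle defined by left/right/top/bottom offsets around its center. Integer offsets, clamping at the image border, and mean aggregation are all assumptions made for illustration; the abstract does not say how the region is sampled or aggregated.

```python
def region_mean(image, cx, cy, left, right, top, bottom):
    """Average the pixels in the rectangle spanned by four boundary offsets.

    image is a list of rows (image[y][x]); the rectangle covers
    x in [cx - left, cx + right] and y in [cy - top, cy + bottom],
    clamped to the image bounds.
    """
    height, width = len(image), len(image[0])
    x0, x1 = max(0, cx - left), min(width - 1, cx + right)
    y0, y1 = max(0, cy - top), min(height - 1, cy + bottom)
    values = [image[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    return sum(values) / len(values)

# A 4x4 toy image whose pixel value is row * 4 + column.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
point = region_mean(image, cx=1, cy=1, left=0, right=0, top=0, bottom=0)   # plain 1x1 sample
wide = region_mean(image, cx=1, cy=1, left=1, right=2, top=0, bottom=1)    # 4-wide, 2-tall region
```

With all offsets at zero the element behaves like an ordinary 1x1 kernel tap, while nonzero offsets let the same single element cover a wider, non-square receptive field.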


11. Nonconvex Decentralized Stochastic Bilevel Optimization under Heavy-Tailed Noises

ArXiv ID: 2509.15543

Authors: Xinwen Zhang, Yihan Zhang, Hongchang Gao

Abstract: Existing decentralized stochastic optimization methods assume the lower-level loss function is strongly convex and the stochastic gradient noise has finite variance. These strong assumptions typically are not satisfied in real-world machine learning models. To address these limitations, we develop a novel decentralized stochastic bilevel optimization algorithm for the nonconvex bilevel optimization problem under heavy-tailed noises. Specifically, we develop a normalized stochastic variance-reduced bilevel gradient descent algorithm, which does not rely on any clipping operation. Moreover, we establish its convergence rate by innovatively bounding interdependent gradient sequences under heavy-tailed noises for nonconvex decentralized bilevel optimization problems. As far as we know, this is the first decentralized bilevel optimization algorithm with rigorous theoretical guarantees under heavy-tailed noises. The extensive experimental results confirm the effectiveness of our algorithm in handling heavy-tailed noises.

Comment: High Performance Computing: introduces a decentralized stochastic bilevel optimization algorithm with theoretical guarantees under heavy-tailed noise—an algorithmic contribution to distributed training.

Relevance: 8 Novelty: 8


12. Universal Learning of Stochastic Dynamics for Exact Belief Propagation using Bernstein Normalizing Flows

ArXiv ID: 2509.15533

Authors: Peter Amorese, Morteza Lahijanian

Abstract: Predicting the distribution of future states in a stochastic system, known as belief propagation, is fundamental to reasoning under uncertainty. However, nonlinear dynamics often make analytical belief propagation intractable, requiring approximate methods. When the system model is unknown and must be learned from data, a key question arises: can we learn a model that (i) universally approximates general nonlinear stochastic dynamics, and (ii) supports analytical belief propagation? This paper establishes the theoretical foundations for a class of models that satisfy both properties. The proposed approach combines the expressiveness of normalizing flows for density estimation with the analytical tractability of Bernstein polynomials. Empirical results show the efficacy of our learned model over state-of-the-art data-driven methods for belief propagation, especially for highly non-linear systems with non-additive, non-Gaussian noise.

Comment: Matches model-architecture innovation (normalizing-flow design with Bernstein polynomials) enabling analytical belief propagation; also advances representation learning of stochastic dynamics.

Relevance: 8 Novelty: 8


13. Synthetic bootstrapped pretraining

ArXiv ID: 2509.15248

Authors: Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang

Abstract: We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretrain a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.

Comment: Representation Learning/Foundation Models: proposes synthetic bootstrapped pretraining that models inter-document relations to improve LM pretraining.

Relevance: 8 Novelty: 8


14. The Multi-Query Paradox in Zeroth-Order Optimization

ArXiv ID: 2509.15552

Authors: Wei Lin, Qingyu Song, Hong Xu

Abstract: Zeroth-order (ZO) optimization provides a powerful framework for problems where explicit gradients are unavailable and have to be approximated using only function-value queries. The prevalent single-query approach is simple, but suffers from high estimation variance, motivating a multi-query paradigm to improve estimation accuracy. This, however, creates a critical trade-off: under a fixed budget of queries (i.e., cost), queries per iteration and the total number of optimization iterations are inversely proportional to one another. How to best allocate this budget is a fundamental, under-explored question. This work systematically resolves this query allocation problem. We analyze two aggregation methods: the de facto simple averaging (ZO-Avg), and a new Projection Alignment method (ZO-Align) we derive from local surrogate minimization. By deriving convergence rates for both methods that make the dependence on the number of queries explicit across strongly convex, convex, non-convex, and stochastic settings, we uncover a stark dichotomy: For ZO-Avg, we prove that using more than one query per iteration is always query-inefficient, rendering the single-query approach optimal. On the contrary, ZO-Align generally performs better with more queries per iteration, resulting in a full-subspace estimation as the optimal approach. Thus, our work clarifies that the multi-query problem boils down to a choice not about an intermediate query size, but between two classic algorithms, a choice dictated entirely by the aggregation method used. These theoretical findings are also consistently validated by extensive experiments.

Comment: Model Efficiency: resolves multi-query allocation in zeroth-order optimization with a new aggregation method (ZO-Align) and explicit convergence rates across settings.

Relevance: 8 Novelty: 8
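
The budget trade-off described in the entry above can be sketched with a plain averaging estimator. The code below is a generic multi-query zeroth-order gradient estimate, not the paper's exact ZO-Avg definition: the forward-difference form, the scaling by the dimension, and the helper names are assumptions. ZO-Align is not shown because the abstract does not specify it.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def zo_avg_gradient(f, x, num_queries, mu=1e-4, rng=None):
    """Average num_queries forward-difference estimates along random unit directions."""
    rng = rng or random.Random(0)
    dim = len(x)
    fx = f(x)
    grad = [0.0] * dim
    for _ in range(num_queries):
        u = random_unit_vector(dim, rng)
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(shifted) - fx) / mu          # directional derivative estimate
        for i in range(dim):
            grad[i] += dim * slope * u[i] / num_queries
    return grad

def iterations_for_budget(total_queries, queries_per_iter):
    """Under a fixed query budget, more queries per step means fewer steps."""
    return total_queries // queries_per_iter

# With a budget of 1000 queries: 1000 single-query steps or 100 ten-query steps.
single = iterations_for_budget(1000, 1)
multi = iterations_for_budget(1000, 10)
```

Each extra query lowers the variance of one step's estimate but costs an iteration elsewhere, which is the allocation question the paper analyzes.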


15. On the Convergence of Muon and Beyond

ArXiv ID: 2509.15816

Authors: Da Chang, Yongxiang Liu, Ganzhao Yuan

Abstract: The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap persists between its practical performance and theoretical understanding. Existing analyses indicate that the standard Muon variant achieves only a suboptimal convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To explore the theoretical limits of the Muon framework, we construct and analyze a variance-reduced variant, termed Muon-VR2. We provide the first rigorous proof that incorporating a variance-reduction mechanism enables Muon-VR2 to attain an optimal convergence rate of $\tilde{\mathcal{O}}(T^{-1/3})$, thereby matching the theoretical lower bound for this class of problems. Moreover, our analysis establishes convergence guarantees for Muon variants under the Polyak-Łojasiewicz (PŁ) condition. Extensive experiments on vision (CIFAR-10) and language (C4) benchmarks corroborate our theoretical findings on per-iteration convergence. Overall, this work provides the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.

Comment: High-Performance Training/Efficiency: variance-reduced Muon optimizer with optimal T^-1/3 convergence and guarantees under PL condition.

Relevance: 8 Novelty: 8
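
To see what the gap between the two rates in the entry above means in iteration counts, one can invert them for a target accuracy $\varepsilon$. This is a back-of-the-envelope reading that ignores constants and logarithmic factors and assumes the rate bounds the convergence measure directly; the value $\varepsilon = 10^{-2}$ is an arbitrary example.

```latex
\begin{align*}
T^{-1/4} \le \varepsilon &\iff T \ge \varepsilon^{-4}, & \varepsilon = 10^{-2} &\Rightarrow T \ge 10^{8},\\
T^{-1/3} \le \varepsilon &\iff T \ge \varepsilon^{-3}, & \varepsilon = 10^{-2} &\Rightarrow T \ge 10^{6}.
\end{align*}
```

Under these simplifications the variance-reduced rate needs roughly a factor of $\varepsilon^{-1}$ fewer iterations, here about 100 times fewer.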


16. BEFT: Bias-Efficient Fine-Tuning of Language Models

ArXiv ID: 2509.15974

Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar

Abstract: Fine-tuning all-bias-terms stands out among various parameter-efficient fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and competitive performance, especially in low-data regimes. Bias-only fine-tuning has the potential for unprecedented parameter efficiency. However, the link between fine-tuning different bias terms (i.e., bias terms in the query, key, or value projections) and downstream performance remains unclear. The existing approaches, e.g., based on the magnitude of bias change or empirical Fisher information, provide limited guidance for selecting the particular bias term for effective fine-tuning. In this paper, we propose an approach for selecting the bias term to be fine-tuned, forming the foundation of our bias-efficient fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against other bias-selection approaches, across a wide range of large language models (LLMs) spanning encoder-only and decoder-only architectures from 110M to 6.7B parameters. Our results demonstrate the effectiveness and superiority of our bias-efficient approach on diverse downstream tasks, including classification, multiple-choice, and generation tasks.

Comment: Compression/Efficiency: parameter-efficient fine-tuning via principled selection of which bias term (e.g., Q/K/V biases) to update.

Relevance: 8 Novelty: 7
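
As a rough illustration of what tuning a single bias term means in practice, the sketch below splits a list of parameter names into the biases of one attention projection (to be trained) and everything else (to be frozen). It does not implement the paper's selection criterion: the choice of projection is simply passed in, and the parameter names and the `select_trainable` helper are hypothetical.

```python
def select_trainable(param_names, projection):
    """Return the names of bias parameters belonging to one attention projection."""
    marker = f".{projection}.bias"
    return [name for name in param_names if name.endswith(marker)]


# Hypothetical parameter names, made up for illustration only.
names = [
    "layer.0.attention.query.weight",
    "layer.0.attention.query.bias",
    "layer.0.attention.key.bias",
    "layer.0.attention.value.bias",
    "layer.1.attention.query.bias",
]

trainable = select_trainable(names, "query")
frozen = [name for name in names if name not in trainable]
```

In a real training setup the `trainable` list would decide which parameters receive gradient updates, while everything in `frozen` stays fixed.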


17. Detail Across Scales: Multi-Scale Enhancement for Full Spectrum Neural Representations

ArXiv ID: 2509.15494

Authors: Yuan Ni, Zhantao Chen, Cheng Peng, Rajan Plumley, Chun Hong Yoon, Jana B. Thayer, Joshua J. Turner

Abstract: Implicit neural representations (INRs) have emerged as a compact and parametric alternative to discrete array-based data representations, encoding information directly in neural network weights to enable resolution-independent representation and memory efficiency. However, existing INR approaches, when constrained to compact network sizes, struggle to faithfully represent the multi-scale structures, high-frequency information, and fine textures that characterize the majority of scientific datasets. To address this limitation, we propose WIEN-INR, a wavelet-informed implicit neural representation that distributes modeling across different resolution scales and employs a specialized kernel network at the finest scale to recover subtle details. This multi-scale architecture allows for the use of smaller networks to retain the full spectrum of information while preserving the training efficiency and reducing storage cost. Through extensive experiments on diverse scientific datasets spanning different scales and structural complexities, WIEN-INR achieves superior reconstruction fidelity while maintaining a compact model size. These results demonstrate WIEN-INR as a practical neural representation framework for high-fidelity scientific data encoding, extending the applicability of INRs to domains where efficient preservation of fine detail is essential.

Comment: Model Architecture/Representation: wavelet-informed multi-scale implicit neural representation with a fine-scale kernel for high-frequency detail under compact models.

Relevance: 8 Novelty: 7


18. MTS-DMAE: Dual-Masked Autoencoder for Unsupervised Multivariate Time Series Representation Learning

ArXiv ID: 2509.16078

Authors: Yi Xu, Yitian Zhang, Yun Fu

Abstract: Unsupervised multivariate time series (MTS) representation learning aims to extract compact and informative representations from raw sequences without relying on labels, enabling efficient transfer to diverse downstream tasks. In this paper, we propose Dual-Masked Autoencoder (DMAE), a novel masked time-series modeling framework for unsupervised MTS representation learning. DMAE formulates two complementary pretext tasks: (1) reconstructing masked values based on visible attributes, and (2) estimating latent representations of masked features, guided by a teacher encoder. To further improve representation quality, we introduce a feature-level alignment constraint that encourages the predicted latent representations to align with the teacher's outputs. By jointly optimizing these objectives, DMAE learns temporally coherent and semantically rich representations. Comprehensive evaluations across classification, regression, and forecasting tasks demonstrate that our approach achieves consistent and superior performance over competitive baselines.

Comment: Model Architecture and Representation Learning: proposes a dual-masked autoencoder for multivariate time series with teacher-guided latent estimation and feature-level alignment.

Relevance: 8 Novelty: 7


19. Stochastic Sample Approximations of (Local) Moduli of Continuity

ArXiv ID: 2509.15368

Authors: Rodion Nazarov, Allen Gehret, Robert Shorten, Jakub Marecek

Abstract: Modulus of local continuity is used to evaluate the robustness of neural networks and fairness of their repeated uses in closed-loop models. Here, we revisit a connection between generalized derivatives and moduli of local continuity, and present a non-uniform stochastic sample approximation for moduli of local continuity. This is of importance in studying robustness of neural networks and fairness of their repeated uses.

Comment: Matches representation-learning/robustness theory: stochastic approximation of local moduli of continuity to assess neural network robustness in closed-loop use.

Relevance: 8 Novelty: 7
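
For orientation, a local modulus of continuity of a function f at a point x with radius delta can be written as the supremum of |f(y) - f(x)| over all y with |y - x| <= delta. The sketch below estimates it for a scalar function by random sampling. It is a minimal illustration, not the paper's method: it samples uniformly, whereas the abstract describes a non-uniform scheme, and the function name and parameters are assumptions.

```python
import random


def sampled_local_modulus(f, x, delta, n_samples=1000, seed=0):
    """Estimate sup |f(y) - f(x)| over |y - x| <= delta by random sampling.

    The maximum over finitely many samples can only underestimate the true
    supremum, so the result is a lower bound.
    """
    rng = random.Random(seed)
    fx = f(x)
    best = 0.0
    for _ in range(n_samples):
        y = x + rng.uniform(-delta, delta)  # uniform sampling is an assumption
        best = max(best, abs(f(y) - fx))
    return best


# For f(t) = 2t the true value at radius 0.5 is 1.0.
estimate = sampled_local_modulus(lambda t: 2.0 * t, x=0.0, delta=0.5)
```

Because the estimate is a lower bound, it approaches the true value from below as the number of samples grows.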


20. Computing Linear Regions in Neural Networks with Skip Connections

ArXiv ID: 2509.15441

Authors: Johnny Joyce, Jan Verschelde

Abstract: Neural networks are important tools in machine learning. Representing piecewise linear activation functions with tropical arithmetic enables the application of tropical geometry. Algorithms are presented to compute regions where the neural networks are linear maps. Through computational experiments, we provide insights into the difficulty of training neural networks, in particular into the problem of overfitting and the benefits of skip connections.

Comment: Representation Learning: algorithms to compute linear regions in piecewise-linear networks (including skip connections) via tropical geometry, yielding insights into training dynamics/overfitting.

Relevance: 8 Novelty: 7
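
To illustrate what a linear region is, the sketch below counts the distinct ReLU activation patterns of a one-hidden-layer network with scalar input; on each pattern the network is an affine map. This is a brute-force grid scan used only for intuition, not the tropical-geometry algorithm from the paper, and all function names are assumptions.

```python
def activation_pattern(x, weights, biases):
    """Return which hidden ReLU units are active at scalar input x."""
    return tuple(w * x + b > 0 for w, b in zip(weights, biases))


def count_activation_patterns(weights, biases, lo, hi, steps=1000):
    """Count distinct activation patterns on an evenly spaced grid over [lo, hi]."""
    patterns = set()
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        patterns.add(activation_pattern(x, weights, biases))
    return len(patterns)


# Two hidden units, relu(x) and relu(x - 1), with breakpoints at 0 and 1.
n_patterns = count_activation_patterns([1.0, 1.0], [0.0, -1.0], lo=-1.0, hi=2.0)
```

A grid scan can miss regions narrower than the grid spacing, which is one reason exact algorithms are of interest.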


21. Attention Schema-based Attention Control (ASAC): A Cognitive-Inspired Approach for Attention Management in Transformers

ArXiv ID: 2509.16058

Authors: Krati Saxena, Federico Jurado Ruiz, Guido Manzi, Dianbo Liu, Alex Lamb

Abstract: Attention mechanisms have become integral in AI, significantly enhancing model performance and scalability by drawing inspiration from human cognition. Concurrently, the Attention Schema Theory (AST) in cognitive science posits that individuals manage their attention by creating a model of the attention itself, effectively allocating cognitive resources. Inspired by AST, we introduce ASAC (Attention Schema-based Attention Control), which integrates the attention schema concept into artificial neural networks. Our initial experiments focused on embedding the ASAC module within transformer architectures. This module employs a Vector-Quantized Variational AutoEncoder (VQVAE) as both an attention abstractor and controller, facilitating precise attention management. By explicitly modeling attention allocation, our approach aims to enhance system efficiency. We demonstrate ASAC's effectiveness in both the vision and NLP domains, highlighting its ability to improve classification accuracy and expedite the learning process. Our experiments with vision transformers across various datasets illustrate that the attention controller not only boosts classification accuracy but also accelerates learning. Furthermore, we have demonstrated the model's robustness and generalization capabilities across noisy and out-of-distribution datasets. In addition, we have showcased improved performance in multi-task settings. Quick experiments reveal that the attention schema-based module enhances resilience to adversarial attacks, optimizes attention to improve learning efficiency, and facilitates effective transfer learning and learning from fewer examples. These promising results establish a connection between cognitive science and machine learning, shedding light on the efficient utilization of attention mechanisms in AI systems.

Comment: Model Architecture: introduces a VQVAE-based attention-schema controller for transformers to dynamically manage attention allocation (conditional/dynamic network) with efficiency aims.

Relevance: 8 Novelty: 7


22. Global Pre-fixing, Local Adjusting: A Simple yet Effective Contrastive Strategy for Continual Learning

ArXiv ID: 2509.15347

Authors: Jia Tang, Xinrui Wang, Songcan Chen

Abstract: Continual learning (CL) involves acquiring and accumulating knowledge from evolving tasks while alleviating catastrophic forgetting. Recently, leveraging contrastive loss to construct more transferable and less forgetful representations has been a promising direction in CL. Despite advancements, their performance is still limited due to confusion arising from both inter-task and intra-task features. To address the problem, we propose a simple yet effective contrastive strategy named Global Pre-fixing, Local Adjusting for Supervised Contrastive learning (GPLASC). Specifically, to avoid task-level confusion, we divide the entire unit hypersphere of representations into non-overlapping regions, with the centers of the regions forming an inter-task pre-fixed Equiangular Tight Frame (ETF). Meanwhile, for individual tasks, our method helps regulate the feature structure and form intra-task adjustable ETFs within their respective allocated regions. As a result, our method simultaneously ensures discriminative feature structures both between tasks and within tasks and can be seamlessly integrated into any existing contrastive continual learning framework. Extensive experiments validate its effectiveness.

Comment: Representation Learning: supervised contrastive continual-learning strategy enforcing inter-/intra-task ETF structure to shape embeddings and reduce forgetting.

Relevance: 7 Novelty: 7
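
For readers unfamiliar with equiangular tight frames, the sketch below builds the simplex ETF, a standard construction in which k unit vectors all have the same pairwise inner product, -1/(k-1). Whether GPLASC uses exactly this construction is not stated in the abstract, so it is an illustrative assumption, as are the function names.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k whose pairwise inner products all equal -1/(k-1)."""
    if k < 2:
        raise ValueError("need at least two vectors")
    scale = math.sqrt(k / (k - 1))
    # Row i of scale * (I - (1/k) * ones), i.e. a centered and rescaled basis vector.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


frame = simplex_etf(4)
```

With k = 4, every vector has unit norm and any two distinct vectors have inner product -1/3, so the directions are spread as far apart as possible.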


23. SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

ArXiv ID: 2509.16060

Authors: Maithili Joshi, Palash Nandi, Tanmoy Chakraborty

Abstract: Large Language Models (LLMs) with safe-alignment training are powerful instruments with robust language comprehension capabilities. These models typically undergo meticulous alignment procedures involving human feedback to ensure the acceptance of safe inputs while rejecting harmful or unsafe ones. However, despite their massive scale and alignment efforts, LLMs remain vulnerable to jailbreak attacks, where malicious users manipulate the model to produce harmful outputs that it was explicitly trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly embedded in the middle-to-late layers. Building on this insight, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), which connects two intermediate layers $s$ and $e$ such that $s < e$, through a residual connection. Our approach achieves a 51% improvement over the best-performing baseline on the HarmBench test set. Furthermore, SABER induces only a marginal shift in perplexity when evaluated on the HarmBench validation set. The source code is publicly available at https://github.com/PalGitts/SABER.

Comment: Model Architecture / Representation Learning: introduces a cross-layer residual connection to bypass safety and analyzes localization of alignment signals in mid–late transformer layers.

Relevance: 7 Novelty: 7


24. Toward Efficient Influence Function: Dropout as a Compression Tool

ArXiv ID: 2509.15651

Authors: Yuchen Zhang, Mohammad Mohammadi Amiri

Abstract: Assessing the impact of training data on machine learning models is crucial for understanding model behavior, enhancing transparency, and selecting training data. The influence function provides a theoretical framework for quantifying the effect of training data points on a model's performance given specific test data. However, the computational and memory costs of the influence function present significant challenges, especially for large-scale models, even when using approximation methods, since the gradients involved in the computation are as large as the model itself. In this work, we introduce a novel approach that leverages dropout as a gradient compression mechanism to compute the influence function more efficiently. Our method significantly reduces computational and memory overhead, not only during the influence function computation but also in the gradient compression process. Through theoretical analysis and empirical validation, we demonstrate that our method can preserve critical components of the data influence and enables its application to modern large-scale models.

Comment: Model Compression and Efficiency: leverages dropout as a gradient compression mechanism to scale influence-function computation with reduced memory/compute.

Relevance: 7 Novelty: 7


25. On Optimal Steering to Achieve Exact Fairness

ArXiv ID: 2509.15759

Authors: Mohit Sharma, Amit Jayant Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah

Abstract: To fix the 'bias in, bias out' problem in fair machine learning, it is important to steer feature distributions of data or internal representations of Large Language Models (LLMs) to ideal ones that guarantee group-fair outcomes. Previous work on fair generative models and representation steering could greatly benefit from provable fairness guarantees on the model output. We define a distribution as ideal if the minimizer of any cost-sensitive risk on it is guaranteed to have exact group-fair outcomes (e.g., demographic parity, equal opportunity); in other words, it has no fairness-utility trade-off. We formulate an optimization program for optimal steering by finding the nearest ideal distribution in KL-divergence, and provide efficient algorithms for it when the underlying distributions come from well-known parametric families (e.g., normal, log-normal). Empirically, our optimal steering techniques on both synthetic and real-world datasets improve fairness without diminishing utility (and sometimes even improve utility). We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in the Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs towards desired outputs so that the model works equally well across different groups.

Comment: Representation Learning: optimizes steering of features/LLM internal representations to ideal distributions guaranteeing group-fair outcomes with provable properties.

Relevance: 7 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.
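
A consumer of this output might parse and validate it along the following lines. This is a minimal sketch, assuming one JSON object per line with the four keys listed above and integer scores; the `parse_selection` helper and the sample record are illustrative and not part of the original pipeline.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_selection(jsonl_text):
    """Parse a JSONL response into a list of validated records."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            # Integer scores are an assumption; the prompt only says "1-10".
            if not isinstance(score, int) or not 1 <= score <= 10:
                raise ValueError(f"line {line_no}: {key} must be an integer from 1 to 10")
        records.append(record)
    return records


example = (
    '{"ARXIVID": "2509.15974", '
    '"COMMENT": "Compression/Efficiency: bias-term selection for PEFT.", '
    '"RELEVANCE": 8, "NOVELTY": 7}\n'
)
records = parse_selection(example)
```

Rejecting malformed lines early keeps a single bad record from silently skewing the ranked list.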

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)

  • Focus: Fully aligned with core topics with no deviation; score highest if the paper contains relevant keywords.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retains a solid link to the main research area, though it may touch on peripheral elements.
  • Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)

  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with weak motivation; purely engineering work such as a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure itself.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.