Personalized Daily ArXiv Papers 2025-09-12

[gpt-5]	Prompt	Completion	Total
Token	26364	29620	55984
Cost	$0.03	$0.3	$0.33

Total arXiv papers: 357

Total scanned papers: 196

Total relevant papers: 18

Table of contents with paper titles:

Fast attention mechanisms: a tale of parallelism Authors: Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu
Steering MoE LLMs via Expert (De)Activation Authors: Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms Authors: Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
Representation-Aware Distributionally Robust Optimization: A Knowledge Transfer Framework Authors: Zitao Wang, Nian Si, Molei Liu
An entropy formula for the Deep Linear Network Authors: Govind Menon, Tianmin Yu
ENSI: Efficient Non-Interactive Secure Inference for Large Language Models Authors: Zhiyu He, Maojiang Wang, Xinwen Gao, Yuchuan Luo, Lin Liu, Shaojing Fu
SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models Authors: Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang
Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation Authors: Hanfei Zhou, Lei Shi
MoWE : A Mixture of Weather Experts Authors: Dibyajyoti Chakraborty, Romit Maulik, Peter Harrington, Dallas Foster, Mohammad Amin Nabian, Sanjay Choudhry
Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language Models Authors: Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, Hao Xu
MoSE: Unveiling Structural Patterns in Graphs via Mixture of Subgraph Experts Authors: Junda Ye, Zhongbao Zhang, Li Sun, Siqiang Luo
ReBaNO: Reduced Basis Neural Operator Mitigating Generalization Gaps and Achieving Discretization Invariance Authors: Haolan Zheng, Yanlai Chen, Jiequn Han, Yue Yu
Semantic Concentration for Self-Supervised Dense Representations Learning Authors: Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang
Balancing Utility and Privacy: Dynamically Private SGD with Random Projection Authors: Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar
Group Distributionally Robust Machine Learning under Group Level Distributional Uncertainty Authors: Xenia Konti, Yi Shen, Zifan Wang, Karl Henrik Johansson, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos
Convexity of Optimization Curves: Local Sharp Thresholds, Robustness Impossibility, and New Counterexamples Authors: Le Duc Hieu
ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models Authors: Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhosseini, Farinaz Koushanfar
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Authors: Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping

1. Fast attention mechanisms: a tale of parallelism

ArXiv ID: 2509.09001

Authors: Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu

Abstract: Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.

Comment: Compression/Efficiency and Model Architecture — introduces sub-quadratic Approximate Nearest Neighbor Attention with theoretical guarantees (MPC-equivalence) and connections to low-rank transformers.

Relevance: 10 Novelty: 9

2. Steering MoE LLMs via Expert (De)Activation

ArXiv ID: 2509.09660

Authors: Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng

Abstract: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.

Comment: Model Architecture (MoE): identifies behavior-linked experts via activation patterns and steers behavior by selective expert (de)activation at inference without retraining.

Relevance: 10 Novelty: 8
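
Illustration (not from the paper): a minimal sketch of what forcing experts off or on could look like inside a top-k router. The function name `route_top_k`, the choice to rank deactivated experts last, and the softmax renormalization over the surviving experts are all assumptions made for this example; the abstract does not specify how SteerMoE implements (de)activation.

```python
import math

def route_top_k(router_logits, k, deactivated=frozenset(), activated=frozenset()):
    """Select k experts for one token, forcing some experts off or on.

    Hypothetical helper: returns {expert_index: mixing_weight}.
    """
    def priority(i):
        if i in deactivated:
            return -math.inf      # ranked last, so skipped when enough others exist
        if i in activated:
            return math.inf       # ranked first, so always selected
        return router_logits[i]

    chosen = sorted(range(len(router_logits)), key=priority, reverse=True)[:k]
    # Renormalize with a softmax over the original logits of the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

logits = [2.0, 1.0, 0.5, -1.0]
baseline = route_top_k(logits, k=2)                   # picks experts 0 and 1
steered = route_top_k(logits, k=2, deactivated={0})   # expert 0 is skipped
```

In this sketch the model weights are untouched; only the routing decision changes, which is the property the abstract emphasizes.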

3. ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

ArXiv ID: 2509.09679

Authors: Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang

Abstract: Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.

Comment: Model Compression and Efficiency: ultra-low-bit LLM quantization via learnable orthogonal butterfly transforms (O(n log n) complexity) enabling layer-adaptive rotations for outlier suppression.

Relevance: 10 Novelty: 8
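
Illustration (not from the paper): the computational-invariance identity y = Wx = (WQ^T)(Qx) quoted in the abstract can be checked numerically with a butterfly built from Givens rotations. The sketch below is pure Python and assumes a standard radix-2 butterfly layout (log2(n) stages, n/2 rotations per stage); the function names, the random angles, and the toy sizes are hypothetical, and no learning or quantization is performed.

```python
import math
import random

def butterfly_apply(x, angles):
    """Apply log2(n) stages of Givens rotations to x (len(x) must be a power of two)."""
    n = len(x)
    y = list(x)
    stride, stage = 1, 0
    while stride < n:
        out = list(y)
        pair = 0
        for start in range(0, n, 2 * stride):
            for offset in range(stride):
                i, j = start + offset, start + offset + stride
                c = math.cos(angles[stage][pair])
                s = math.sin(angles[stage][pair])
                out[i] = c * y[i] - s * y[j]   # 2x2 rotation acting on coordinates i, j
                out[j] = s * y[i] + c * y[j]
                pair += 1
        y = out
        stride *= 2
        stage += 1
    return y

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

n, rows = 8, 4
rng = random.Random(0)
n_stages = n.bit_length() - 1                       # log2(8) = 3
angles = [[rng.uniform(-math.pi, math.pi) for _ in range(n // 2)]
          for _ in range(n_stages)]
n_params = sum(len(stage) for stage in angles)      # n * log2(n) / 2 angles

# Materialize Q column by column by transforming the standard basis vectors.
cols = [butterfly_apply([1.0 if i == j else 0.0 for i in range(n)], angles)
        for j in range(n)]
Q = [[cols[j][i] for j in range(n)] for i in range(n)]

W = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(rows)]
x = [rng.gauss(0, 1) for _ in range(n)]
W_rot = [[sum(W[r][k] * Q[c][k] for k in range(n)) for c in range(n)]
         for r in range(rows)]                      # W @ Q^T

y_ref = matvec(W, x)                                # W x
y_rot = matvec(W_rot, butterfly_apply(x, angles))   # (W Q^T)(Q x)
max_err = max(abs(a - b) for a, b in zip(y_ref, y_rot))
```

Because every stage rotates disjoint coordinate pairs, the product is orthogonal for any choice of angles, so the two outputs agree up to floating-point error.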

4. Representation-Aware Distributionally Robust Optimization: A Knowledge Transfer Framework

ArXiv ID: 2509.09371

Authors: Zitao Wang, Nian Si, Molei Liu

Abstract: We propose REpresentation-Aware Distributionally Robust Estimation (READ), a novel framework for Wasserstein distributionally robust learning that accounts for predictive representations when guarding against distributional shifts. Unlike classical approaches that treat all feature perturbations equally, READ embeds a multidimensional alignment parameter into the transport cost, allowing the model to differentially discourage perturbations along directions associated with informative representations. This yields robustness to feature variation while preserving invariant structure. Our first contribution is a theoretical foundation: we show that seminorm regularizations for linear regression and binary classification arise as Wasserstein distributionally robust objectives, thereby providing tractable reformulations of READ and unifying a broad class of regularized estimators under the DRO lens. Second, we adopt a principled procedure for selecting the Wasserstein radius using the techniques of robust Wasserstein profile inference. This further enables the construction of valid, representation-aware confidence regions for model parameters with distinct geometric features. Finally, we analyze the geometry of READ estimators as the alignment parameters vary and propose an optimization algorithm to estimate the projection of the global optimum onto this solution surface. This procedure selects among equally robust estimators while optimally constructing a representation structure. We conclude by demonstrating the effectiveness of our framework through extensive simulations and a real-world study, providing a powerful robust estimation grounded in learning representation.

Comment: Representation Learning: introduces representation-aware Wasserstein DRO with theoretical reformulations (seminorm regularization equivalence) and an optimization method on the solution surface.

Relevance: 9 Novelty: 8

5. An entropy formula for the Deep Linear Network

ArXiv ID: 2509.09088

Authors: Govind Menon, Tianmin Yu

Abstract: We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.

Comment: Foundational analysis of deep linear networks via Riemannian geometry and an entropy formula—insights into representation/training dynamics.

Relevance: 9 Novelty: 8

6. ENSI: Efficient Non-Interactive Secure Inference for Large Language Models

ArXiv ID: 2509.09424

Authors: Zhiyu He, Maojiang Wang, Xinwen Gao, Yuchuan Luo, Lin Liu, Shaojing Fu

Abstract: Secure inference enables privacy-preserving machine learning by leveraging cryptographic protocols that support computations on sensitive user data without exposing it. However, integrating cryptographic protocols with large language models (LLMs) presents significant challenges, as the inherent complexity of these protocols, together with LLMs' massive parameter scale and sophisticated architectures, severely limits practical usability. In this work, we propose ENSI, a novel non-interactive secure inference framework for LLMs, based on the principle of co-designing the cryptographic protocols and LLM architecture. ENSI employs an optimized encoding strategy that seamlessly integrates the CKKS scheme with a lightweight LLM variant, BitNet, significantly reducing the computational complexity of encrypted matrix multiplications. In response to the prohibitive computational demands of softmax under homomorphic encryption (HE), we pioneer the integration of the sigmoid attention mechanism with HE as a seamless, retraining-free alternative. Furthermore, by embedding the Bootstrapping operation within the RMSNorm process, we efficiently refresh ciphertexts while markedly decreasing the frequency of costly bootstrapping invocations. Experimental evaluations demonstrate that ENSI achieves approximately an 8x acceleration in matrix multiplications and a 2.6x speedup in softmax inference on CPU compared to the state-of-the-art method, with the proportion of bootstrapping reduced to just 1%.

Comment: HE–LLM co-design (BitNet integration, sigmoid attention under HE, bootstrapping fused with RMSNorm) for efficient secure inference—systems/efficiency innovation.

Relevance: 9 Novelty: 8

7. SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models

ArXiv ID: 2509.09090

Authors: Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang

Abstract: Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93$\times$ speedup and up to a 4.5% average success rate enhancement compared to the original model.

Comment: Model Compression and Efficiency: co-designed quantization-aware token pruning with improved quantizer; training-free, structured inference acceleration for VLA models.

Relevance: 9 Novelty: 8

8. Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation

ArXiv ID: 2509.09362

Authors: Hanfei Zhou, Lei Shi

Abstract: A key challenge in scientific machine learning is solving partial differential equations (PDEs) on complex domains, where the curved geometry complicates the approximation of functions and their derivatives required by differential operators. This paper establishes the first simultaneous approximation theory for deep neural networks on manifolds. We prove that a constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights--a property that plays a crucial role in controlling generalization error--can approximate any function in the Sobolev space $\mathcal{W}_p^{k}(\mathcal{M}^d)$ to an error of $\varepsilon$ in the $\mathcal{W}_p^{s}(\mathcal{M}^d)$ norm, for $k\geq 3$ and $s<k$, using $\mathcal{O}(\varepsilon^{-d/(k-s)})$ nonzero parameters, a rate that overcomes the curse of dimensionality by depending only on the intrinsic dimension $d$. These results readily extend to functions in Hölder-Zygmund spaces. We complement this result with a matching lower bound, proving our construction is nearly optimal by showing the required number of parameters matches up to a logarithmic factor. Our proof of the lower bound introduces novel estimates for the Vapnik-Chervonenkis dimension and pseudo-dimension of the network's high-order derivative classes. These complexity bounds provide a theoretical cornerstone for learning PDEs on manifolds involving derivatives. Our analysis reveals that the network architecture leverages a sparse structure to efficiently exploit the manifold's low-dimensional geometry.

Comment: Representation Learning/Theory: simultaneous Sobolev approximation on manifolds with bounded-weight ReLU^{k-1} networks and matching lower bounds, leveraging sparse architectural structure.

Relevance: 8 Novelty: 9

9. MoWE : A Mixture of Weather Experts

ArXiv ID: 2509.09052

Authors: Dibyajyoti Chakraborty, Romit Maulik, Peter Harrington, Dallas Foster, Mohammad Amin Nabian, Sanjay Choudhry

Abstract: Data-driven weather models have recently achieved state-of-the-art performance, yet progress has plateaued in recent years. This paper introduces a Mixture of Experts (MoWE) approach as a novel paradigm to overcome these limitations, not by creating a new forecaster, but by optimally combining the outputs of existing models. The MoWE model is trained with significantly lower computational resources than the individual experts. Our model employs a Vision Transformer-based gating network that dynamically learns to weight the contributions of multiple "expert" models at each grid point, conditioned on forecast lead time. This approach creates a synthesized deterministic forecast that is more accurate than any individual component in terms of Root Mean Squared Error (RMSE). Our results demonstrate the effectiveness of this method, achieving up to a 10% lower RMSE than the best-performing AI weather model on a 2-day forecast horizon, significantly outperforming individual experts as well as a simple average across experts. This work presents a computationally efficient and scalable strategy to push the state of the art in data-driven weather prediction by making the most out of leading high-quality forecast models.

Comment: Model Architecture (MoE): ViT-based gating for conditional per-grid, per-lead-time expert weighting to combine multiple forecasters; also emphasizes computational efficiency of training the gating network.

Relevance: 9 Novelty: 7
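
Illustration (not from the paper): the per-grid-point weighting described in the abstract can be pictured as a convex combination of expert forecasts. The sketch below assumes the gating network emits one logit per expert at each grid point and that the weights are softmax-normalized; both are assumptions for this example, and the gate logits are given by hand instead of coming from a Vision Transformer.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def combine_forecasts(expert_forecasts, gate_logits):
    """Blend expert forecasts point by point.

    expert_forecasts[e][p]: forecast of expert e at grid point p.
    gate_logits[p][e]: hypothetical gating score for expert e at grid point p.
    """
    combined = []
    for p, logits in enumerate(gate_logits):
        weights = softmax(logits)
        combined.append(sum(w * expert[p]
                            for w, expert in zip(weights, expert_forecasts)))
    return combined

# Two experts, three grid points (toy temperature values).
forecasts = [[10.0, 12.0, 14.0],
             [20.0, 12.0, 10.0]]
gates = [[0.0, 0.0],      # equal trust: plain average
         [5.0, -5.0],     # experts agree, so the weights do not matter
         [-20.0, 20.0]]   # almost entirely expert 1
blended = combine_forecasts(forecasts, gates)
```

A simple average across experts, which the abstract uses as a baseline, corresponds to the special case where all gate logits are equal.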

10. Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language Models

ArXiv ID: 2509.09119

Authors: Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, Hao Xu

Abstract: Large Language Models (LLMs) have transformed both everyday life and scientific research. However, adapting LLMs from general-purpose models to specialized tasks remains challenging, particularly in resource-constrained environments. Low-Rank Adaptation (LoRA), a prominent method within Parameter-Efficient Fine-Tuning (PEFT), has emerged as a promising approach to LLMs by approximating model weight updates using low-rank decomposition. However, LoRA is limited by its uniform rank ($r$) allocation to each incremental matrix, and existing rank allocation techniques aimed at addressing this issue remain computationally inefficient, complex, and unstable, hindering practical applications. To address these limitations, we propose Sensitivity-LoRA, an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on both their global and local sensitivities. It leverages the second-order derivatives (Hessian Matrix) of the loss function to effectively capture weight sensitivity, enabling optimal rank allocation with minimal computational overhead. Our experimental results have demonstrated robust effectiveness, efficiency and stability of Sensitivity-LoRA across diverse tasks and benchmarks.

Comment: PEFT with low-rank adaptation (LoRA) using Hessian-based sensitivity for dynamic rank allocation—compression/efficiency criterion.

Relevance: 9 Novelty: 7

11. MoSE: Unveiling Structural Patterns in Graphs via Mixture of Subgraph Experts

ArXiv ID: 2509.09337

Authors: Junda Ye, Zhongbao Zhang, Li Sun, Siqiang Luo

Abstract: While graph neural networks (GNNs) have achieved great success in learning from graph-structured data, their reliance on local, pairwise message passing restricts their ability to capture complex, high-order subgraph patterns, leading to insufficient structural expressiveness. Recent efforts have attempted to enhance structural expressiveness by integrating random walk kernels into GNNs. However, these methods are inherently designed for graph-level tasks, which limits their applicability to other downstream tasks such as node classification. Moreover, their fixed kernel configurations hinder the model's flexibility in capturing diverse subgraph structures. To address these limitations, this paper proposes a novel Mixture of Subgraph Experts (MoSE) framework for flexible and expressive subgraph-based representation learning across diverse graph tasks. Specifically, MoSE extracts informative subgraphs via anonymous walks and dynamically routes them to specialized experts based on structural semantics, enabling the model to capture diverse subgraph patterns with improved flexibility and interpretability. We further provide a theoretical analysis of MoSE's expressivity within the Subgraph Weisfeiler-Lehman (SWL) Test, proving that it is more powerful than SWL. Extensive experiments, together with visualizations of learned subgraph experts, demonstrate that MoSE not only outperforms competitive baselines but also provides interpretable insights into structural patterns learned by the model.

Comment: Model Architecture — Mixture-of-Experts for graphs (Mixture of Subgraph Experts) with dynamic routing over subgraph semantics and formal expressivity beyond SWL.

Relevance: 8 Novelty: 8

12. ReBaNO: Reduced Basis Neural Operator Mitigating Generalization Gaps and Achieving Discretization Invariance

ArXiv ID: 2509.09611

Authors: Haolan Zheng, Yanlai Chen, Jiequn Han, Yue Yu

Abstract: We propose a novel data-lean operator learning algorithm, the Reduced Basis Neural Operator (ReBaNO), to solve a group of PDEs with multiple distinct inputs. Inspired by the Reduced Basis Method and the recently introduced Generative Pre-Trained Physics-Informed Neural Networks, ReBaNO relies on a mathematically rigorous greedy algorithm to build its network structure offline adaptively from the ground up. Knowledge distillation via task-specific activation function allows ReBaNO to have a compact architecture requiring minimal computational cost online while embedding physics. In comparison to state-of-the-art operator learning algorithms such as PCA-Net, DeepONet, FNO, and CNO, numerical results demonstrate that ReBaNO significantly outperforms them in terms of eliminating/shrinking the generalization gap for both in- and out-of-distribution tests and being the only operator learning algorithm achieving strict discretization invariance.

Comment: Strong match to Model Architecture and Representation Learning: reduced-basis–driven adaptive network construction yielding compact operator learners with discretization invariance.

Relevance: 8 Novelty: 8

13. Semantic Concentration for Self-Supervised Dense Representations Learning

ArXiv ID: 2509.09429

Authors: Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang

Abstract: Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon that patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. Code is available in https://github.com/KID-7391/CoTAP.

Comment: Representation Learning — proposes explicit semantic concentration for dense SSL via a noise-tolerant AP-based ranking loss and an object-aware prototype filtering mechanism for patch features.

Relevance: 8 Novelty: 7

14. Balancing Utility and Privacy: Dynamically Private SGD with Random Projection

ArXiv ID: 2509.09485

Authors: Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar

Abstract: Stochastic optimization is a pivotal enabler in modern machine learning, producing effective models for various tasks. However, several existing works have shown that model parameters and gradient information are susceptible to privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It exhibits provably sub-linear convergence rates across different objective functions, matching the best available rate. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD remarkably enhances accuracy while maintaining privacy. Our code is available here.

Comment: Matches Model Compression and Efficiency via random projection–accelerated SGD and dynamic DP (algorithmic optimizer improving training efficiency with provable rates).

Relevance: 8 Novelty: 7

15. Group Distributionally Robust Machine Learning under Group Level Distributional Uncertainty

ArXiv ID: 2509.08942

Authors: Xenia Konti, Yi Shen, Zifan Wang, Karl Henrik Johansson, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos

Abstract: The performance of machine learning (ML) models critically depends on the quality and representativeness of the training data. In applications with multiple heterogeneous data generating sources, standard ML methods often learn spurious correlations that perform well on average but degrade performance for atypical or underrepresented groups. Prior work addresses this issue by optimizing the worst-group performance. However, these approaches typically assume that the underlying data distributions for each group can be accurately estimated using the training data, a condition that is frequently violated in noisy, non-stationary, and evolving environments. In this work, we propose a novel framework that relies on Wasserstein-based distributionally robust optimization (DRO) to account for the distributional uncertainty within each group, while simultaneously preserving the objective of improving the worst-group performance. We develop a gradient descent-ascent algorithm to solve the proposed DRO problem and provide convergence results. Finally, we validate the effectiveness of our method on real-world data.

Comment: Representation Learning/Robust Training: group-level Wasserstein DRO to optimize worst-group performance under distributional uncertainty with a descent–ascent algorithm and convergence results.

Relevance: 8 Novelty: 7

16. Convexity of Optimization Curves: Local Sharp Thresholds, Robustness Impossibility, and New Counterexamples

ArXiv ID: 2509.08954

Authors: Le Duc Hieu

Abstract: We study when the optimization curve of first-order methods -- the sequence $\{f(x_n)\}_{n\ge0}$ produced by constant-stepsize iterations -- is convex, equivalently when the forward differences $f(x_n)-f(x_{n+1})$ are nonincreasing. For gradient descent (GD) on convex $L$-smooth functions, the curve is convex for all stepsizes $\eta \le 1.75/L$, and this threshold is tight. Moreover, gradient norms are nonincreasing for all $\eta \le 2/L$, and in continuous time (gradient flow) the curve is always convex. These results complement and refine the classical smooth convex optimization toolbox, connecting discrete and continuous dynamics as well as worst-case analyses.

Comment: Training dynamics/optimization theory — derives sharp step-size thresholds for convexity of GD optimization curves and links discrete GD with continuous-time gradient flow.

Relevance: 7 Novelty: 7
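
Illustration (not from the paper): the definition of a convex optimization curve can be checked numerically by running gradient descent and testing whether the forward differences f(x_n) - f(x_{n+1}) are nonincreasing. The sketch below uses a diagonal quadratic as a stand-in objective; the eigenvalues, starting point, and function names are hypothetical. A single quadratic only illustrates the definition and cannot probe the tightness of the 1.75/L threshold.

```python
def gd_curve(eigenvalues, x0, eta, steps):
    """Return [f(x_0), ..., f(x_steps)] for f(x) = 0.5 * sum(lam_i * x_i**2)."""
    x = list(x0)
    values = []
    for _ in range(steps + 1):
        values.append(0.5 * sum(lam * xi * xi for lam, xi in zip(eigenvalues, x)))
        # Gradient step: the gradient of f at x is (lam_i * x_i)_i.
        x = [xi - eta * lam * xi for lam, xi in zip(eigenvalues, x)]
    return values

def forward_differences(values):
    return [a - b for a, b in zip(values, values[1:])]

eigs = [1.0, 0.3, 0.05]
L = max(eigs)                      # smoothness constant of the quadratic
curve = gd_curve(eigs, [1.0, -2.0, 3.0], eta=1.75 / L, steps=30)
diffs = forward_differences(curve)

# Convex curve <=> forward differences are nonincreasing (small float tolerance).
is_convex = all(d1 >= d2 - 1e-12 for d1, d2 in zip(diffs, diffs[1:]))
```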

17. ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models

ArXiv ID: 2509.08972

Authors: Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhosseini, Farinaz Koushanfar

Abstract: The increasing reliance on generative AI models has accelerated the generation rate of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030. This shift to mainly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. Although prior studies have explored the causes and detection of model collapse, existing mitigation strategies remain limited. In this paper, we identify model overconfidence in their self-generated data as a key driver of collapse. Building on this observation, we propose a confidence-aware loss function that downweights high-confidence predictions during training. We introduce a novel loss function we call Truncated Cross Entropy (TCE). We demonstrate that TCE significantly delays model collapse in recursive training. We provide a model-agnostic framework that links the loss function design to model collapse mitigation and validate our approach both theoretically and empirically, showing that it can extend the model's fidelity interval before collapse by more than 2.3x. Finally, we show that our method generalizes across modalities. These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models in the era of increasing synthetic data.

Comment: Training dynamics/loss design — introduces Truncated Cross Entropy to mitigate recursive training collapse by down-weighting overconfident predictions with theoretical and cross-modal support.

Relevance: 7 Novelty: 7
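
Illustration (not from the paper): one simple way to downweight high-confidence predictions is to drop them from the cross-entropy average once the probability assigned to the target passes a threshold. The hard cutoff, the threshold value, and the choice to average only over the remaining targets are all assumptions for this sketch; the abstract does not give the exact form of Truncated Cross Entropy.

```python
import math

def truncated_cross_entropy(target_probs, threshold=0.9):
    """Mean of -log(p) over targets with p < threshold.

    target_probs: probability the model assigns to each correct token.
    Targets at or above the (hypothetical) threshold contribute no loss.
    """
    kept = [-math.log(p) for p in target_probs if p < threshold]
    return sum(kept) / len(kept) if kept else 0.0

probs = [0.99, 0.5, 0.25]
plain_ce = sum(-math.log(p) for p in probs) / len(probs)
tce = truncated_cross_entropy(probs)   # the 0.99 target is masked out
```

Under this assumption the loss keeps its training signal on uncertain tokens while tokens the model is already very sure about stop reinforcing themselves.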

18. The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

ArXiv ID: 2509.09677

Authors: Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping

Abstract: Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations -- curiously, we observe a self-conditioning effect -- models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.

Comment: Representation Learning/Training Dynamics: analyzes long-horizon execution, identifies self-conditioning error effects, and proposes a measurement framework for execution capability.

Relevance: 7 Novelty: 7
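
The compounding claim in the abstract of paper 18 can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes a constant per-step accuracy p and independent steps, so a task of H steps succeeds with probability p^H. The function name `horizon_length` and the default 50% success threshold are hypothetical choices made for illustration.

```python
import math


def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> int:
    """Longest task length H such that step_accuracy ** H >= success_rate.

    Assumption (illustrative only): every step succeeds independently with
    the same probability, so whole-task success is step_accuracy ** H.
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    # Solve p ** H >= s for H: H <= ln(s) / ln(p), since both logs are negative.
    return math.floor(math.log(success_rate) / math.log(step_accuracy))


if __name__ == "__main__":
    for p in (0.9, 0.99, 0.999):
        print(f"step accuracy {p}: horizon {horizon_length(p)} steps")
```

Under these assumptions, raising step accuracy from 0.9 to 0.99 lengthens the horizon by roughly a factor of ten, which is the sense in which small single-step gains compound.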

Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

ARXIVID: should be the ArXiv ID.

COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.

RELEVANCE: should be a score from 1-10.

NOVELTY: should be a score from 1-10.
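
One way to read the output format above is one JSON object per line, with the four keys named in the instructions. The sketch below is an assumption about how such lines could be produced and checked, not part of the prompt itself; the helper names `format_record` and `parse_jsonl` are hypothetical, and the example values come from paper 18 in this digest.

```python
import json

REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def format_record(arxiv_id: str, comment: str, relevance: int, novelty: int) -> str:
    """Serialize one paper as a single JSONL line (no trailing newline)."""
    for name, score in (("RELEVANCE", relevance), ("NOVELTY", novelty)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    record = {
        "ARXIVID": arxiv_id,
        "COMMENT": comment,
        "RELEVANCE": relevance,
        "NOVELTY": novelty,
    }
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps(record)


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text, skipping blank lines and rejecting incomplete records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record is missing keys: {missing}")
        records.append(record)
    return records


if __name__ == "__main__":
    line = format_record(
        "2509.09677",
        "Representation Learning/Training Dynamics: analyzes long-horizon execution.",
        7,
        7,
    )
    print(line)
    print(parse_jsonl(line))
```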

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Relevance 9-10 (Completely Relevant)

Focus: Fully aligned with the core topics, with no deviation; give the highest score if the paper contains relevant keywords.

Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

Relevance 7-8 (Relevant)

Focus: Retains a solid link to the main research area, though it may touch on peripheral elements.

Examples: Papers that study a fundamental part of MoE through a less critical aspect, such as its behavior in GNNs.

Relevance 5-6 (Borderline)

Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.

Examples: Work that references MoE but is centered on reinforcement learning.

Relevance 3-4 (Irrelevant)

Focus: Largely outside our interests with no association to our topics.

Examples: Application-focused papers, such as those using MoE to solve a real-world problem.

Relevance 1-2 (Ignore)

Focus: Purely unrelated to our topics. Completely a different domain.

Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (This usually applies only to a very rare concept that belongs to fundamental research.)

Novelty Scoring

Novelty 9-10 (Breakthrough)

Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.

Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

Novelty 7-8 (Improvements)

Definition: Substantial insights/enhancements, though not a full paradigm shift.

Examples: Modifications on existing methods yielding significantly better results.

Novelty 5-6 (Borderline)

Definition: Incremental contributions with possible long-term benefits, not immediately transformative.

Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

Novelty 3-4 (Tangential)

Definition: Minor or domain-specific improvements with limited broader impact.

Examples: Slight modifications to known methods with weak motivation; purely engineering work such as a new benchmark/dataset.

Novelty 1-2 (Low)

Definition: Minimal originality, applying standard approaches without real innovation.

Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
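
The two rubrics above share the same two-point bands, so a score can be mapped back to its band label with a small lookup. This is a minimal sketch for sanity-checking model output, assuming integer scores from 1 to 10; the band labels are copied from the rubric, while the names `RELEVANCE_BANDS`, `NOVELTY_BANDS`, and `band_label` are hypothetical.

```python
# Band labels keyed by the lower bound of each two-point band in the rubric.
RELEVANCE_BANDS = {
    9: "Completely Relevant",
    7: "Relevant",
    5: "Borderline",
    3: "Irrelevant",
    1: "Ignore",
}

NOVELTY_BANDS = {
    9: "Breakthrough",
    7: "Improvements",
    5: "Borderline",
    3: "Tangential",
    1: "Low",
}


def band_label(score: int, bands: dict[int, str]) -> str:
    """Return the rubric label for an integer score between 1 and 10."""
    if not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"score must be an integer from 1 to 10, got {score!r}")
    # Bands start at odd numbers (1, 3, 5, 7, 9), so round even scores down by one.
    lower_bound = score if score % 2 == 1 else score - 1
    return bands[lower_bound]


if __name__ == "__main__":
    # Paper 18 in this digest was scored Relevance 7, Novelty 7.
    print(band_label(7, RELEVANCE_BANDS), "/", band_label(7, NOVELTY_BANDS))
```

Keeping the two tables separate mirrors the instruction that relevance and novelty are scored independently.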

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the architectures themselves.

Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.

Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.

Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.