Personalized Daily ArXiv Papers 2025-12-30

[gpt-5]	Prompt	Completion	Total
Token	54533	46142	100675
Cost	$0.07	$0.46	$0.53

Total arXiv papers: 561

Total scanned papers: 359

Total relevant papers: 42

Table of contents with paper titles:

A Comedy of Estimators: On KL Regularization in RL Training of LLMs Authors: Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, Aaron Courville
End-to-End Test-Time Training for Long Context Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel R{\o}d, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
MoR: Mixture Of Representations For Mixed-Precision Training Authors: Bor-Yiing Su, Peter Dykas, Mike Chrzanowski, Jatin Chhugani
Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution Authors: Shuhuan Wang, Yuzhen Xie, Jiayi Li, Yinliang Diao
Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism Authors: Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu
Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks Authors: Zubair Shah, Noaman Khan
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
Likelihood-Preserving Embeddings for Statistical Inference Authors: Deniz Akdemir
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra
The Bayesian Geometry of Transformer Attention Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra
AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai
Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders Authors: Hans Jarett J. Ong, Brian Godwin S. Lim, Dominic Dayta, Renzo Roel P. Tan, Kazushi Ikeda
On Harnessing Idle Compute at the Edge for Foundation Model Training Authors: Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina
Visual Language Hypothesis Authors: Xiu Li
Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs Authors: Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia
Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation Authors: Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Lechen Ning, Zhenbo Xu, Huijia Wu, Zhaofeng He
The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models Authors: Dakuan Lu, Jiaqi Zhang, Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li
On the Existence and Behaviour of Secondary Attention Sinks Authors: Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao
FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence Authors: Guoan Wan, Tianyu Chen, Fangzheng Feng, Haoyi Zhou, Runhua Xu
The Quest for Winning Tickets in Low-Rank Adapters Authors: Hamed Damirchi, Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Zhen Zhang, Javen Shi
Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback Authors: Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani
Merge before Forget: A Single LoRA Continual Learning via Continual Merging Authors: Fuli Qiao, Mehrdad Mahdavi
Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models Authors: Tingyang Sun, Ting He, Bo Ji, Parimal Parag
Transformer Reconstructed with Dynamic Value Attention Authors: Xiaowei Wang
Theoretical Foundations of Scaling Law in Familial Models Authors: Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Chi Zhang, Xuelong Li
Random Controlled Differential Equations Authors: Francesco Piatti, Thomas Cass, William F. Turner
Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis Authors: Zijian Liu
Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model Authors: Nathan Kallus
Decomposing Task Vectors for Refined Model Editing Authors: Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi
GLUE: Gradient-free Learning to Unify Experts Authors: Jong-Ik Park, Shreyas Chaudhari, Srinivasa Pranav, Carlee Joe-Wong, Jos\'e M. F. Moura
Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models Authors: Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu
Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL Authors: Abolfazl Younesi, Abbas Shabrang Maryan, Elyas Oustad, Zahra Najafabadi Samani, Mohsen Ansari, Thomas Fahringer
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration Authors: Bruno Mlodozeniec, Pierre Ablin, Louis B\'ethune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi
Beyond Centralization: Provable Communication Efficient Decentralized Multi-Task Learning Authors: Donghwa Kang, Shana Moothedath
Understanding the Mechanisms of Fast Hyperparameter Transfer Authors: Nikhil Ghosh, Denny Wu, Alberto Bietti
How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure Authors: Paul M. Thompson
The Affine Divergence: Aligning Activation Updates Beyond Normalisation Authors: George Bird
LLMBoost: Make Large Language Models Stronger with Boosting Authors: Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban
A general framework for deep learning Authors: William Kengne, Modou Wade
Frequency Regularization: Unveiling the Spectral Inductive Bias of Deep Neural Networks Authors: Jiahao Lu
Multiple Token Divergence: Measuring and Steering In-Context Computation Density Authors: Vincent Herrmann, Eric Alcaide, Michael Wand, J\"urgen Schmidhuber
Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought Authors: Yuyi Zhang, Boyu Tang, Tianjie Ju, Sufeng Duan, Gongshen Liu

1. A Comedy of Estimators: On KL Regularization in RL Training of LLMs

ArXiv ID: 2512.21852

Authors: Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, Aaron Courville

Abstract: The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite its wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimators configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning \texttt{Qwen2.5-7B}, \texttt{Llama-3.1-8B-Instruct} and \texttt{Qwen3-4B-Instruct-2507} with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) using estimator configurations resulting in unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance resulting from different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training resulting from asynchronous setups.

Comment: Author match

2. End-to-End Test-Time Training for Long Context

ArXiv ID: 2512.23675

Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel R{\o}d, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun

Abstract: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.

Comment: Treats long-context handling as test-time training with meta-learned initialization on a standard sliding-window Transformer; achieves constant-latency inference — strong efficiency/training-dynamics innovation.

Relevance: 10 Novelty: 9

3. MoR: Mixture Of Representations For Mixed-Precision Training

ArXiv ID: 2512.22804

Authors: Bor-Yiing Su, Peter Dykas, Mike Chrzanowski, Jatin Chhugani

Abstract: Mixed-precision training is a crucial technique for scaling deep learning models, but successful mixedprecision training requires identifying and applying the right combination of training methods. This paper presents our preliminary study on Mixture-of-Representations (MoR), a novel, per-tensor and sub-tensor level quantization framework that dynamically analyzes a tensor's numerical properties to select between a variety of different representations. Based on the framework, we have proposed and experimented concrete algorithms that choose dynamically between FP8 and BF16 representations for both per-tensor and sub-tensor level granularities. Our universal approach is designed to preserve model quality across various quantization partition strategies and datasets. Our initial findings show that this approach can achieve state-of-the-art results with 98.38% of tensors quantized to the FP8 format. This work highlights the potential of dynamic, property-aware quantization while preserving model quality. We believe this approach can generally improve the robustness of low precision training, as demonstrated by achieving FP8 accuracies that are on par with existing approaches without the need for fine-grain partitioning, or can be used in combination with other training methods to improve the leverage of even lower precision number formats such as NVFP4.

Comment: Matches Compression/Efficiency: dynamic per-tensor and sub-tensor mixed-precision selection (FP8 vs BF16) based on tensor properties to robustly scale low-precision training.

Relevance: 10 Novelty: 8

4. Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution

ArXiv ID: 2512.23068

Authors: Shuhuan Wang, Yuzhen Xie, Jiayi Li, Yinliang Diao

Abstract: Selective State Space Models (SSMs) achieve linear-time inference, yet their gradient-based sensitivity analysis remains bottlenecked by O(L) memory scaling during backpropagation. This memory constraint precludes genomic-scale modeling (L > 10^5) on consumer-grade hardware. We introduce Phase Gradient Flow (PGF), a framework that computes exact analytical derivatives by operating directly in the state-space manifold, bypassing the need to materialize the intermediate computational graph. By reframing SSM dynamics as Tiled Operator-Space Evolution (TOSE), our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Unlike parallel prefix scans that exhibit numerical divergence in stiff ODE regimes, PGF ensures stability through invariant error scaling, maintaining near-machine precision across extreme sequences. We demonstrate the utility of PGF on an impulse-response benchmark with 128,000-step sequences - a scale where conventional Autograd encounters prohibitive memory overhead, often leading to out-of-memory (OOM) failures in multi-layered models. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.

Comment: Matches High Performance Computing: O(1) memory exact differentiation for SSMs via tiled operator-space evolution (PGF), enabling long-sequence training/sensitivity analysis on limited hardware.

Relevance: 10 Novelty: 8

5. Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism

ArXiv ID: 2512.21487

Authors: Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu

Abstract: The mixture-of-experts (MoE) architecture scales model size with sublinear computational increase but suffers from memory-intensive inference due to KV caches and sparse expert activation. Recent disaggregated expert parallelism (DEP) distributes attention and experts to dedicated GPU groups but lacks support for shared experts and efficient task scheduling, limiting performance. We propose FinDEP, a fine-grained task scheduling algorithm for DEP that maximizes task overlap to improve MoE inference throughput. FinDEP introduces three innovations: 1) partitioning computation/communication into smaller tasks for fine-grained pipelining, 2) formulating a scheduling optimization supporting variable granularity and ordering, and 3) developing an efficient solver for this large search space. Experiments on four GPU systems with DeepSeek-V2 and Qwen3-MoE show FinDEP improves throughput by up to 1.61x over prior methods, achieving up to 1.24x speedup on a 32-GPU system.

Comment: HPC + MoE: fine-grained scheduling for disaggregated expert parallelism to optimize MoE inference throughput with algorithmic solver.

Relevance: 10 Novelty: 8

6. Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks

ArXiv ID: 2512.22106

Authors: Zubair Shah, Noaman Khan

Abstract: Neural network pruning is widely used to reduce model size and computational cost. Yet, most existing methods treat sparsity as an externally imposed constraint, enforced through heuristic importance scores or training-time regularization. In this work, we propose a fundamentally different perspective: pruning as an equilibrium outcome of strategic interaction among model components. We model parameter groups such as weights, neurons, or filters as players in a continuous non-cooperative game, where each player selects its level of participation in the network to balance contribution against redundancy and competition. Within this formulation, sparsity emerges naturally when continued participation becomes a dominated strategy at equilibrium. We analyze the resulting game and show that dominated players collapse to zero participation under mild conditions, providing a principled explanation for pruning behavior. Building on this insight, we derive a simple equilibrium-driven pruning algorithm that jointly updates network parameters and participation variables without relying on explicit importance scores. This work focuses on establishing a principled formulation and empirical validation of pruning as an equilibrium phenomenon, rather than exhaustive architectural or large-scale benchmarking. Experiments on standard benchmarks demonstrate that the proposed approach achieves competitive sparsity-accuracy trade-offs while offering an interpretable, theory-grounded alternative to existing pruning methods.

Comment: Core compression: game-theoretic formulation where sparsity emerges at equilibrium; yields an interpretable pruning algorithm.

Relevance: 10 Novelty: 8

7. Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

ArXiv ID: 2512.23447

Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao

Abstract: Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.

Comment: Direct MoE advancement: auxiliary expert–router coupling loss aligning router embeddings with expert capabilities; computationally efficient and scales with experts, matching the MoE architecture criterion.

Relevance: 10 Novelty: 8

8. Likelihood-Preserving Embeddings for Statistical Inference

ArXiv ID: 2512.22638

Authors: Deniz Akdemir

Abstract: Modern machine learning embeddings provide powerful compression of high-dimensional data, yet they typically destroy the geometric structure required for classical likelihood-based statistical inference. This paper develops a rigorous theory of likelihood-preserving embeddings: learned representations that can replace raw data in likelihood-based workflows -- hypothesis testing, confidence interval construction, model selection -- without altering inferential conclusions. We introduce the Likelihood-Ratio Distortion metric $\Delta_n$, which measures the maximum error in log-likelihood ratios induced by an embedding. Our main theoretical contribution is the Hinge Theorem, which establishes that controlling $\Delta_n$ is necessary and sufficient for preserving inference. Specifically, if the distortion satisfies $\Delta_n = o_p(1)$, then (i) all likelihood-ratio based tests and Bayes factors are asymptotically preserved, and (ii) surrogate maximum likelihood estimators are asymptotically equivalent to full-data MLEs. We prove an impossibility result showing that universal likelihood preservation requires essentially invertible embeddings, motivating the need for model-class-specific guarantees. We then provide a constructive framework using neural networks as approximate sufficient statistics, deriving explicit bounds connecting training loss to inferential guarantees. Experiments on Gaussian and Cauchy distributions validate the sharp phase transition predicted by exponential family theory, and applications to distributed clinical inference demonstrate practical utility.

Comment: Likelihood-preserving embeddings with explicit bounds (approximate sufficient statistics); strong theory for representation learning/compression for inference.

Relevance: 9 Novelty: 9

9. Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

ArXiv ID: 2512.22473

Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, [ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}{\alpha_i}[b]\bigr), \qquad b$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).} := u_i^\top v_j, ] coupled with a \emph{responsibility-weighted update} for values, [ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, ] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij

Comment: Matches: Representation Learning (theoretical gradient dynamics of attention) explaining how cross-entropy sculpts internal geometry.

Relevance: 9 Novelty: 9

10. The Bayesian Geometry of Transformer Attention

ArXiv ID: 2512.22471

Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels} -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a \emph{frame-precision dissociation} predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

Comment: Matches Representation Learning and Architecture Analysis: mechanistic evidence that transformer attention implements Bayesian inference via geometric roles of residuals/FFN/attention.

Relevance: 9 Novelty: 8

11. AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing

ArXiv ID: 2512.22455

Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.

Comment: Compression/Efficiency: PEFT (LoRA) with annealed activation to gain non-linear expressivity while remaining mergeable for efficient adaptation.

Relevance: 9 Novelty: 8

12. Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders

ArXiv ID: 2512.22150

Authors: Hans Jarett J. Ong, Brian Godwin S. Lim, Dominic Dayta, Renzo Roel P. Tan, Kazushi Ikeda

Abstract: Unsupervised representation learning seeks to recover latent generative factors, yet standard methods relying on statistical independence often fail to capture causal dependencies. A central challenge is identifiability: as established in disentangled representation learning and nonlinear ICA literature, disentangling causal variables from observational data is impossible without supervision, auxiliary signals, or strong inductive biases. In this work, we propose the Latent Additive Noise Model Causal Autoencoder (LANCA) to operationalize the Additive Noise Model (ANM) as a strong inductive bias for unsupervised discovery. Theoretically, we prove that while the ANM constraint does not guarantee unique identifiability in the general mixing case, it resolves component-wise indeterminacy by restricting the admissible transformations from arbitrary diffeomorphisms to the affine class. Methodologically, arguing that the stochastic encoding inherent to VAEs obscures the structural residuals required for latent causal discovery, LANCA employs a deterministic Wasserstein Auto-Encoder (WAE) coupled with a differentiable ANM Layer. This architecture transforms residual independence from a passive assumption into an explicit optimization objective. Empirically, LANCA outperforms state-of-the-art baselines on synthetic physics benchmarks (Pendulum, Flow), and on photorealistic environments (CANDLE), where it demonstrates superior robustness to spurious correlations arising from complex background scenes.

Comment: Representation Learning: causal autoencoder with ANM inductive bias and identifiability analysis; architectural innovation via differentiable ANM layer in WAE.

Relevance: 9 Novelty: 8

13. On Harnessing Idle Compute at the Edge for Foundation Model Training

ArXiv ID: 2512.22142

Authors: Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina

Abstract: The ecosystem behind foundation model development today is highly centralized and limited to large-scale cloud data center operators: training foundation models is costly, needing immense compute resources. Decentralized foundation model training across edge devices, leveraging their spare compute, promises a democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-based training performance, exhibit limited scalability with model size, exceed device memory capacity, and have prohibitive communication overhead. They also fail to satisfactorily handle device heterogeneity and dynamism. We introduce a new paradigm, Cleave, which finely partitions training operations through a novel selective hybrid tensor parallelism method. Together with a parameter server centric training framework, Cleave copes with device memory limits and avoids communication bottlenecks, thereby enabling efficient training of large models on par with the cloud. Further, with a cost optimization model to guide device selection and training workload distribution, Cleave effectively accounts for device heterogeneity and churn. Our evaluations show that Cleave matches cloud-based GPU training by scaling efficiently to larger models and thousands of devices, supporting up to 8x more devices than baseline edge-training approaches. It outperforms state-of-the-art edge training methods by up to a factor of 10 in per-batch training time and efficiently handles device failures, achieving at least 100x faster recovery than prior methods.

Comment: Distributed training innovation for edge devices via selective hybrid tensor parallelism and PS-centric design (HPC/distributed training).

Relevance: 9 Novelty: 8

14. Visual Language Hypothesis

ArXiv ID: 2512.23335

Authors: Xiu Li

Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber bundle like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone, semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.

Comment: Matches: Representation Learning (topological theory of semantics) and Model Architecture (architectural requirements for topology change).

Relevance: 9 Novelty: 8

15. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

ArXiv ID: 2512.22219

Authors: Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

Abstract: We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance megakernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained kernel overlap, and other previously infeasible GPU optimizations. The MPK compiler lowers tensor programs into highly optimized SM-level task graphs and generates optimized CUDA implementations for all tasks, while the MPK in-kernel parallel runtime executes these tasks within a single mega-kernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems by reducing end-to-end inference latency by up to 1.7x, pushing LLM inference performance close to hardware limits. MPK is publicly available at https://github.com/mirage-project/mirage.

Comment: Matches: High Performance Computing (compiler/runtime mega-kernelization, SM-level task graphs for end-to-end fusion).

Relevance: 9 Novelty: 8

16. Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

ArXiv ID: 2512.23260

Authors: Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Lechen Ning, Zhenbo Xu, Huijia Wu, Zhaofeng He

Abstract: Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity--individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate--exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods--while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.

Comment: Matches: Model Compression and Efficiency (low-rank subspace adapters) and Representation Learning (SAE-based interpretable feature subspace).

Relevance: 9 Novelty: 8

17. The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models

ArXiv ID: 2512.23340

Authors: Dakuan Lu, Jiaqi Zhang, Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li

Abstract: Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM are inherently bounded. One solution originates from intricate interactions among multiple LLMs, rendering their collective performance surpasses that of any constituent model. Despite the rapid proliferation of multi-model integration techniques such as model routing and post-hoc ensembling, a unifying theoretical framework of performance scaling for multi-model collaboration remains absent. In this work, we propose the Law of Multi-model Collaboration, a scaling law that predicts the performance limits of LLM ensembles based on their aggregated parameter budget. To quantify the intrinsic upper bound of multi-model collaboration, we adopt a method-agnostic formulation and assume an idealized integration oracle where the total cross-entropy loss of each sample is determined by the minimum loss of any model in the model pool. Experimental results reveal that multi-model systems follow a power-law scaling with respect to the total parameter count, exhibiting a more significant improvement trend and a lower theoretical loss floor compared to single model scaling. Moreover, ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family, indicating that model diversity is a primary driver of collaboration gains. These findings suggest that model collaboration represents a critical axis for extending the intelligence frontier of LLMs.

Comment: Foundational scaling-law theory for ensembling LLMs (multi-model collaboration), directly addressing model architecture/ensembling limits and diversity effects.

Relevance: 9 Novelty: 8

18. On the Existence and Behaviour of Secondary Attention Sinks

ArXiv ID: 2512.22213

Authors: Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao

Abstract: Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, they were found to exhibit properties analogous to the BOS token. Specifically, they emerge at the same layer, persist throughout the network and draw a large amount of attention mass. Whereas, we find the existence of secondary sinks that arise primarily in middle layers and can persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules; these MLPs map token representations to vectors that align with the direction of the primary sink of that layer. (2) The $\ell_2$-norm of these vectors determines the sink score of the secondary sink, and also the number of layers it lasts for, thereby leading to different impacts on the attention mechanisms accordingly. (3) The primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B.

Comment: Mechanistic analysis of attention sinks (secondary sinks) in Transformers — deep representation/architecture insight into attention behavior and MLP roles.

Relevance: 9 Novelty: 8

19. FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence

ArXiv ID: 2512.23485

Authors: Guoan Wan, Tianyu Chen, Fangzheng Feng, Haoyi Zhou, Runhua Xu

Abstract: Parameter-efficient fine-tuning (PEFT) methods have emerged as a practical solution for adapting large foundation models to downstream tasks, reducing computational and memory costs by updating only a small subset of parameters. Among them, approaches like LoRA aim to strike a balance between efficiency and expressiveness, but often suffer from slow convergence and limited adaptation capacity due to their inherent low-rank constraints. This trade-off hampers the ability of PEFT methods to capture complex patterns needed for diverse tasks. To address these challenges, we propose FRoD, a novel fine-tuning method that combines hierarchical joint decomposition with rotational degrees of freedom. By extracting a globally shared basis across layers and injecting sparse, learnable perturbations into scaling factors for flexible full-rank updates, FRoD enhances expressiveness and efficiency, leading to faster and more robust convergence. On 20 benchmarks spanning vision, reasoning, and language understanding, FRoD matches full model fine-tuning in accuracy, while using only 1.72% of trainable parameters under identical training budgets.

Comment: Matches Model Compression and Efficiency: parameter-efficient fine-tuning via hierarchical decomposition and sparse rotational full-rank updates for faster, robust convergence.

Relevance: 9 Novelty: 7

20. The Quest for Winning Tickets in Low-Rank Adapters

ArXiv ID: 2512.22495

Authors: Hamed Damirchi, Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

Abstract: The Lottery Ticket Hypothesis (LTH) suggests that over-parameterized neural networks contain sparse subnetworks ("winning tickets") capable of matching full model performance when trained from scratch. With the growing reliance on fine-tuning large pretrained models, we investigate whether LTH extends to parameter-efficient fine-tuning (PEFT), specifically focusing on Low-Rank Adaptation (LoRA) methods. Our key finding is that LTH holds within LoRAs, revealing sparse subnetworks that can match the performance of dense adapters. In particular, we find that the effectiveness of sparse subnetworks depends more on how much sparsity is applied in each layer than on the exact weights included in the subnetwork. Building on this insight, we propose Partial-LoRA, a method that systematically identifies said subnetworks and trains sparse low-rank adapters aligned with task-relevant subspaces of the pre-trained model. Experiments across 8 vision and 12 language tasks in both single-task and multi-task settings show that Partial-LoRA reduces the number of trainable parameters by up to 87\%, while maintaining or improving accuracy. Our results not only deepen our theoretical understanding of transfer learning and the interplay between pretraining and fine-tuning but also open new avenues for developing more efficient adaptation strategies.

Comment: Matches Compression/Efficiency and Sparsity: extends Lottery Ticket Hypothesis to LoRA and introduces Partial-LoRA for sparse low-rank adapters with large parameter savings.

Relevance: 9 Novelty: 7

21. Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback

ArXiv ID: 2512.22623

Authors: Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani

Abstract: Distributed learning, particularly Federated Learning (FL), faces a significant bottleneck in the communication cost, particularly the uplink transmission of client-to-server updates, which is often constrained by asymmetric bandwidth limits at the edge. Biased compression techniques are effective in practice, but require error feedback mechanisms to provide theoretical guarantees and to ensure convergence when compression is aggressive. Standard error feedback, however, relies on client-specific control variates, which violates user privacy and is incompatible with stateless clients common in large-scale FL. This paper proposes two novel frameworks that enable biased compression without client-side state or control variates. The first, Compressed Aggregate Feedback (CAFe), uses the globally aggregated update from the previous round as a shared control variate for all clients. The second, Server-Guided Compressed Aggregate Feedback (CAFe-S), extends this idea to scenarios where the server possesses a small private dataset; it generates a server-guided candidate update to be used as a more accurate predictor. We consider Distributed Gradient Descent (DGD) as a representative algorithm and analytically prove CAFe's superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-convex regime with bounded gradient dissimilarity. We further prove that CAFe-S converges to a stationary point, with a rate that improves as the server's data become more representative. Experimental results in FL scenarios validate the superiority of our approaches over existing compression schemes.

Comment: Matches Distributed Training/Communication Compression: CAFe and CAFe-S enable biased compression with aggregate/server-guided feedback and convergence guarantees without client-side state.

Relevance: 9 Novelty: 7

22. Merge before Forget: A Single LoRA Continual Learning via Continual Merging

ArXiv ID: 2512.23017

Authors: Fuli Qiao, Mehrdad Mahdavi

Abstract: Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically utilizing these to support new LoRAs learn new tasks. However, these methods not only ignore growing computational memory with tasks and limited storage space but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRAs updates into a single unified LoRA. Our method leverages orthogonal basis extraction from previously learned LoRA to initialize the learning of new tasks, further exploits the intrinsic asymmetry property of LoRA components by using a time-aware scaling mechanism to balance new and old knowledge during continual merging. Our approach maintains constant memory complexity with respect to the number of tasks, minimizes interference between past and new tasks via orthogonal basis initialization, and improves performance over asymmetric LoRA merging via adaptive scaling. We provide theoretical analysis to justify our design and conduct extensive experiments across diverse continual learning benchmarks using various Llama models, demonstrating the effectiveness and efficiency of our method.

Comment: Compression/Efficiency: continual LoRA merging with orthogonal basis and time-aware scaling, maintaining constant memory and mitigating interference.

Relevance: 9 Novelty: 7

23. Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models

ArXiv ID: 2512.21884

Authors: Tingyang Sun, Ting He, Bo Ji, Parimal Parag

Abstract: Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower the barrier for deploying LLMs by splitting the model blocks across multiple servers with low-end GPUs distributed over the Internet, which was much faster than swapping the model parameters between the GPU memory and other cheaper but slower local storage media. However, the performance of such a distributed system critically depends on the resource allocation, and how to do so optimally remains unknown. In this work, we present the first systematic study of the resource allocation problem in distributed LLM inference, with focus on two important decisions: block placement and request routing. Our main results include: experimentally validated performance models that can predict the inference performance under given block placement and request routing decisions, a formulation of the offline optimization of block placement and request routing as a mixed integer linear programming problem together with the NP-hardness proof and a polynomial-complexity algorithm with guaranteed performance, and an adaptation of the offline algorithm for the online setting with the same performance guarantee under bounded load. Through both experiments and experimentally-validated simulations, we have verified that the proposed solution can substantially reduce the inference time compared to the state-of-the-art solution in diverse settings with geographically-distributed servers. As a byproduct, we have also developed a light-weighted CPU-only simulator capable of predicting the performance of distributed LLM inference on GPU servers, which can evaluate large deployments and facilitate future research for researchers with limited GPU access.

Comment: HPC/Systems: optimization of block placement and request routing for geographically distributed LLM inference with validated performance models and provable guarantees.

Relevance: 9 Novelty: 7

24. Transformer Reconstructed with Dynamic Value Attention

ArXiv ID: 2512.22212

Authors: Xiaowei Wang

Abstract: Since transformer was firstly published in 2017, several works have been proposed to optimize it. However, the major structure of transformer remains unchanged, ignoring one of its main intrinsic limitations, which is the same static value is used for every query in a head. Transformer itself tries to solve this problem by implementing multi-head attentions, yet the number of heads is limited by complexity. I propose a method to decide a value for each query dynamically, which could cut down all the redundant heads, keeping only one. Consequently, the following feed forward network could be cut down entirely, as each revised embedding has already fetched enough useful values far beyond the context. As a result, a single-head Dynamic Value Attention (DVA) is all you need in a transformer. According to the experiment, DVA may save 37.6% training time than the original transformer meanwhile increasing the learning capability.

Comment: Matches: Model Architecture (Dynamic Value Attention replacing multi-head+FFN) and Efficiency (reduced heads and training time).

Relevance: 9 Novelty: 7

25. Theoretical Foundations of Scaling Law in Familial Models

ArXiv ID: 2512.23407

Authors: Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Chi Zhang, Xuelong Li

Abstract: Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks "Familial models, a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one-run, many-models" paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.

Comment: Matches Scaling Laws/Architecture: unified scaling law L(N,D,G) for familial models with early exits/relay inference, quantifying granularity alongside size and data.

Relevance: 8 Novelty: 8

26. Random Controlled Differential Equations

ArXiv ID: 2512.23670

Authors: Francesco Piatti, Thomas Cass, William F. Turner

Abstract: We introduce a training-efficient framework for time-series learning that combines random features with controlled differential equations (CDEs). In this approach, large randomly parameterized CDEs act as continuous-time reservoirs, mapping input paths to rich representations. Only a linear readout layer is trained, resulting in fast, scalable models with strong inductive bias. Building on this foundation, we propose two variants: (i) Random Fourier CDEs (RF-CDEs): these lift the input signal using random Fourier features prior to the dynamics, providing a kernel-free approximation of RBF-enhanced sequence models; (ii) Random Rough DEs (R-RDEs): these operate directly on rough-path inputs via a log-ODE discretization, using log-signatures to capture higher-order temporal interactions while remaining stable and efficient. We prove that in the infinite-width limit, these model induces the RBF-lifted signature kernel and the rough signature kernel, respectively, offering a unified perspective on random-feature reservoirs, continuous-time deep architectures, and path-signature theory. We evaluate both models across a range of time-series benchmarks, demonstrating competitive or state-of-the-art performance. These methods provide a practical alternative to explicit signature computations, retaining their inductive bias while benefiting from the efficiency of random features.

Comment: Model Architecture and Efficiency: random-feature controlled differential equation reservoirs with only a linear readout; kernel-limit theory for efficient time-series representation learning.

Relevance: 8 Novelty: 8

27. Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

ArXiv ID: 2512.23178

Authors: Zijian Liu

Abstract: Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded ${\frak p}$-th moment where ${\frak p}\in(1,2]$ has been recognized to be more realistic (say being upper bounded by $\sigma_{\frak l}^{\frak p}$ for some $\sigma_{\frak l}\ge0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate ${\cal O}(\sigma_{\frak l}\ln(1/\delta)T^{1/{\frak p}-1})$ (resp. ${\cal O}(\sigma_{\frak l}^2\ln^2(1/\delta)T^{2/{\frak p}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta\in(0,1]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two faster rates, ${\cal O}(\sigma_{\frak l}d_{\rm eff}^{-1/2{\frak p}}\ln^{1-1/{\frak p}}(1/\delta)T^{1/{\frak p}-1})$ and ${\cal O}(\sigma_{\frak l}^2d_{\rm eff}^{-1/{\frak p}}\ln^{2-2/{\frak p}}(1/\delta)T^{2/{\frak p}-2})$, than the aforementioned best results, where $d_{\rm eff}\ge1$ is a quantity we call the $\textit{generalized effective dimension}$. Our analysis improves upon the existing approach on two sides: better utilization of Freedman's inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.

Comment: Optimization theory for SGD with gradient clipping under heavy-tailed noise; refined rates and lower bounds (training dynamics/optimization).

Relevance: 8 Novelty: 8

28. Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

ArXiv ID: 2512.21917

Authors: Nathan Kallus

Abstract: Aligning large language models to preference data is commonly implemented by assuming a known link function between the distribution of observed preferences and the unobserved rewards (e.g., a logistic link as in Bradley-Terry). If the link is wrong, however, inferred rewards can be biased and policies be misaligned. We study policy alignment to preferences under an unknown and unrestricted link. We consider an $f$-divergence-constrained reward maximization problem and show that realizability of the solution in a policy class implies a semiparametric single-index binary choice model, where a scalar-valued index determined by a policy captures the dependence on demonstrations and the rest of the preference distribution is an unrestricted function thereof. Rather than focus on estimation of identifiable finite-dimensional structural parameters in the index as in econometrics, we focus on policy learning, focusing on error to the optimal policy and allowing unidentifiable and nonparametric indices. We develop a variety of policy learners based on profiling the link function, orthogonalizing the link function, and using link-agnostic bipartite ranking objectives. We analyze these and provide finite-sample policy error bounds that depend on generic functional complexity measures of the index class. We further consider practical implementations using first-order optimization suited to neural networks and batched data. The resulting methods are robust to unknown preference noise distribution and scale, while preserving the direct optimization of policies without explicitly fitting rewards.

Comment: Matches: Representation Learning (preference alignment via semiparametric single-index modeling) with theoretical policy error bounds.

Relevance: 8 Novelty: 8

29. Decomposing Task Vectors for Refined Model Editing

ArXiv ID: 2512.22511

Authors: Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

Abstract: Large pre-trained models have transformed machine learning, yet adapting these models effectively to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors. This has given rise to large repositories dedicated to task vectors tailored for specific behaviors. The arithmetic operation of these task vectors allows for the seamless combination of desired behaviors without the need for large datasets. However, these vectors often contain overlapping concepts that can interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing shared knowledge across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise control over concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition method across three domains: improving multi-task merging in image classification by 5% using shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and achieving 47% toxicity reduction in language models while preserving performance on general knowledge tasks by negating the toxic information isolated to the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations.

Comment: Matches: Representation Learning (task vector decomposition into shared/unique subspaces) enabling controlled model editing.

Relevance: 8 Novelty: 8

30. GLUE: Gradient-free Learning to Unify Experts

ArXiv ID: 2512.22467

Authors: Jong-Ik Park, Shreyas Chaudhari, Srinivasa Pranav, Carlee Joe-Wong, Jos\'e M. F. Moura

Abstract: In many deployed systems (multilingual ASR, cross-hospital imaging, region-specific perception), multiple pretrained specialist models coexist. Yet, new target domains often require domain expansion: a generalized model that performs well beyond any single specialist's domain. Given such a new target domain, prior works seek a single strong initialization prior for the model parameters by first blending expert models to initialize a target model. However, heuristic blending -- using coefficients based on data size or proxy metrics -- often yields lower target-domain test accuracy, and learning the coefficients on the target loss typically requires computationally-expensive full backpropagation through the network. We propose GLUE, Gradient-free Learning To Unify Experts, which initializes the target model as a convex combination of fixed experts, learning the mixture coefficients of this combination via a gradient-free two-point (SPSA) update that requires only two forward passes per step. Across experiments on three datasets and three network architectures, GLUE produces a single prior that can be fine-tuned effectively to outperform baselines. GLUE improves test accuracy by up to 8.5% over data-size weighting and by up to 9.1% over proxy-metric selection. GLUE either outperforms backpropagation-based full-gradient mixing or matches its performance within 1.4%.

Comment: Matches Model Architecture: gradient-free convex weight-space mixing (SPSA-learned mixture coefficients) to unify multiple pretrained experts for initialization.