Previous Day 2025-12-29
Monthly Overview 2025-12
Next Day 2025-12-31

Personalized Daily ArXiv Papers 2025-12-30

[gpt-5] Prompt Completion Total
Token 54533 46142 100675
Cost $0.07 $0.46 $0.53

Total arXiv papers: 561

Total scanned papers: 359

Total relevant papers: 42

Table of contents with paper titles:

  1. A Comedy of Estimators: On KL Regularization in RL Training of LLMs Authors: Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, Aaron Courville

  2. End-to-End Test-Time Training for Long Context Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel R{\o}d, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun

  3. MoR: Mixture Of Representations For Mixed-Precision Training Authors: Bor-Yiing Su, Peter Dykas, Mike Chrzanowski, Jatin Chhugani

  4. Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution Authors: Shuhuan Wang, Yuzhen Xie, Jiayi Li, Yinliang Diao

  5. Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism Authors: Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu

  6. Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks Authors: Zubair Shah, Noaman Khan

  7. Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao

  8. Likelihood-Preserving Embeddings for Statistical Inference Authors: Deniz Akdemir

  9. Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

  10. The Bayesian Geometry of Transformer Attention Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

  11. AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai

  12. Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders Authors: Hans Jarett J. Ong, Brian Godwin S. Lim, Dominic Dayta, Renzo Roel P. Tan, Kazushi Ikeda

  13. On Harnessing Idle Compute at the Edge for Foundation Model Training Authors: Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina

  14. Visual Language Hypothesis Authors: Xiu Li

  15. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs Authors: Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

  16. Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation Authors: Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Lechen Ning, Zhenbo Xu, Huijia Wu, Zhaofeng He

  17. The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models Authors: Dakuan Lu, Jiaqi Zhang, Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li

  18. On the Existence and Behaviour of Secondary Attention Sinks Authors: Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao

  19. FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence Authors: Guoan Wan, Tianyu Chen, Fangzheng Feng, Haoyi Zhou, Runhua Xu

  20. The Quest for Winning Tickets in Low-Rank Adapters Authors: Hamed Damirchi, Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

  21. Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback Authors: Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani

  22. Merge before Forget: A Single LoRA Continual Learning via Continual Merging Authors: Fuli Qiao, Mehrdad Mahdavi

  23. Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models Authors: Tingyang Sun, Ting He, Bo Ji, Parimal Parag

  24. Transformer Reconstructed with Dynamic Value Attention Authors: Xiaowei Wang

  25. Theoretical Foundations of Scaling Law in Familial Models Authors: Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Chi Zhang, Xuelong Li

  26. Random Controlled Differential Equations Authors: Francesco Piatti, Thomas Cass, William F. Turner

  27. Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis Authors: Zijian Liu

  28. Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model Authors: Nathan Kallus

  29. Decomposing Task Vectors for Refined Model Editing Authors: Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

  30. GLUE: Gradient-free Learning to Unify Experts Authors: Jong-Ik Park, Shreyas Chaudhari, Srinivasa Pranav, Carlee Joe-Wong, Jos\'e M. F. Moura

  31. Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models Authors: Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu

  32. Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL Authors: Abolfazl Younesi, Abbas Shabrang Maryan, Elyas Oustad, Zahra Najafabadi Samani, Mohsen Ansari, Thomas Fahringer

  33. Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration Authors: Bruno Mlodozeniec, Pierre Ablin, Louis B\'ethune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi

  34. Beyond Centralization: Provable Communication Efficient Decentralized Multi-Task Learning Authors: Donghwa Kang, Shana Moothedath

  35. Understanding the Mechanisms of Fast Hyperparameter Transfer Authors: Nikhil Ghosh, Denny Wu, Alberto Bietti

  36. How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure Authors: Paul M. Thompson

  37. The Affine Divergence: Aligning Activation Updates Beyond Normalisation Authors: George Bird

  38. LLMBoost: Make Large Language Models Stronger with Boosting Authors: Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban

  39. A general framework for deep learning Authors: William Kengne, Modou Wade

  40. Frequency Regularization: Unveiling the Spectral Inductive Bias of Deep Neural Networks Authors: Jiahao Lu

  41. Multiple Token Divergence: Measuring and Steering In-Context Computation Density Authors: Vincent Herrmann, Eric Alcaide, Michael Wand, J\"urgen Schmidhuber

  42. Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought Authors: Yuyi Zhang, Boyu Tang, Tianjie Ju, Sufeng Duan, Gongshen Liu


1. A Comedy of Estimators: On KL Regularization in RL Training of LLMs

ArXiv ID: 2512.21852

Authors: Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, Aaron Courville

Abstract: The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite its wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimators configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning \texttt{Qwen2.5-7B}, \texttt{Llama-3.1-8B-Instruct} and \texttt{Qwen3-4B-Instruct-2507} with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) using estimator configurations resulting in unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance resulting from different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training resulting from asynchronous setups.

Comment: Author match


2. End-to-End Test-Time Training for Long Context

ArXiv ID: 2512.23675

Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel R{\o}d, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun

Abstract: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.

Comment: Treats long-context handling as test-time training with meta-learned initialization on a standard sliding-window Transformer; achieves constant-latency inference — strong efficiency/training-dynamics innovation.

Relevance: 10 Novelty: 9


3. MoR: Mixture Of Representations For Mixed-Precision Training

ArXiv ID: 2512.22804

Authors: Bor-Yiing Su, Peter Dykas, Mike Chrzanowski, Jatin Chhugani

Abstract: Mixed-precision training is a crucial technique for scaling deep learning models, but successful mixedprecision training requires identifying and applying the right combination of training methods. This paper presents our preliminary study on Mixture-of-Representations (MoR), a novel, per-tensor and sub-tensor level quantization framework that dynamically analyzes a tensor's numerical properties to select between a variety of different representations. Based on the framework, we have proposed and experimented concrete algorithms that choose dynamically between FP8 and BF16 representations for both per-tensor and sub-tensor level granularities. Our universal approach is designed to preserve model quality across various quantization partition strategies and datasets. Our initial findings show that this approach can achieve state-of-the-art results with 98.38% of tensors quantized to the FP8 format. This work highlights the potential of dynamic, property-aware quantization while preserving model quality. We believe this approach can generally improve the robustness of low precision training, as demonstrated by achieving FP8 accuracies that are on par with existing approaches without the need for fine-grain partitioning, or can be used in combination with other training methods to improve the leverage of even lower precision number formats such as NVFP4.

Comment: Matches Compression/Efficiency: dynamic per-tensor and sub-tensor mixed-precision selection (FP8 vs BF16) based on tensor properties to robustly scale low-precision training.

Relevance: 10 Novelty: 8


4. Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution

ArXiv ID: 2512.23068

Authors: Shuhuan Wang, Yuzhen Xie, Jiayi Li, Yinliang Diao

Abstract: Selective State Space Models (SSMs) achieve linear-time inference, yet their gradient-based sensitivity analysis remains bottlenecked by O(L) memory scaling during backpropagation. This memory constraint precludes genomic-scale modeling (L > 10^5) on consumer-grade hardware. We introduce Phase Gradient Flow (PGF), a framework that computes exact analytical derivatives by operating directly in the state-space manifold, bypassing the need to materialize the intermediate computational graph. By reframing SSM dynamics as Tiled Operator-Space Evolution (TOSE), our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Unlike parallel prefix scans that exhibit numerical divergence in stiff ODE regimes, PGF ensures stability through invariant error scaling, maintaining near-machine precision across extreme sequences. We demonstrate the utility of PGF on an impulse-response benchmark with 128,000-step sequences - a scale where conventional Autograd encounters prohibitive memory overhead, often leading to out-of-memory (OOM) failures in multi-layered models. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.

Comment: Matches High Performance Computing: O(1) memory exact differentiation for SSMs via tiled operator-space evolution (PGF), enabling long-sequence training/sensitivity analysis on limited hardware.

Relevance: 10 Novelty: 8


5. Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism

ArXiv ID: 2512.21487

Authors: Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu

Abstract: The mixture-of-experts (MoE) architecture scales model size with sublinear computational increase but suffers from memory-intensive inference due to KV caches and sparse expert activation. Recent disaggregated expert parallelism (DEP) distributes attention and experts to dedicated GPU groups but lacks support for shared experts and efficient task scheduling, limiting performance. We propose FinDEP, a fine-grained task scheduling algorithm for DEP that maximizes task overlap to improve MoE inference throughput. FinDEP introduces three innovations: 1) partitioning computation/communication into smaller tasks for fine-grained pipelining, 2) formulating a scheduling optimization supporting variable granularity and ordering, and 3) developing an efficient solver for this large search space. Experiments on four GPU systems with DeepSeek-V2 and Qwen3-MoE show FinDEP improves throughput by up to 1.61x over prior methods, achieving up to 1.24x speedup on a 32-GPU system.

Comment: HPC + MoE: fine-grained scheduling for disaggregated expert parallelism to optimize MoE inference throughput with algorithmic solver.

Relevance: 10 Novelty: 8


6. Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks

ArXiv ID: 2512.22106

Authors: Zubair Shah, Noaman Khan

Abstract: Neural network pruning is widely used to reduce model size and computational cost. Yet, most existing methods treat sparsity as an externally imposed constraint, enforced through heuristic importance scores or training-time regularization. In this work, we propose a fundamentally different perspective: pruning as an equilibrium outcome of strategic interaction among model components. We model parameter groups such as weights, neurons, or filters as players in a continuous non-cooperative game, where each player selects its level of participation in the network to balance contribution against redundancy and competition. Within this formulation, sparsity emerges naturally when continued participation becomes a dominated strategy at equilibrium. We analyze the resulting game and show that dominated players collapse to zero participation under mild conditions, providing a principled explanation for pruning behavior. Building on this insight, we derive a simple equilibrium-driven pruning algorithm that jointly updates network parameters and participation variables without relying on explicit importance scores. This work focuses on establishing a principled formulation and empirical validation of pruning as an equilibrium phenomenon, rather than exhaustive architectural or large-scale benchmarking. Experiments on standard benchmarks demonstrate that the proposed approach achieves competitive sparsity-accuracy trade-offs while offering an interpretable, theory-grounded alternative to existing pruning methods.

Comment: Core compression: game-theoretic formulation where sparsity emerges at equilibrium; yields an interpretable pruning algorithm.

Relevance: 10 Novelty: 8


7. Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

ArXiv ID: 2512.23447

Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao

Abstract: Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.

Comment: Direct MoE advancement: auxiliary expert–router coupling loss aligning router embeddings with expert capabilities; computationally efficient and scales with experts, matching the MoE architecture criterion.

Relevance: 10 Novelty: 8


8. Likelihood-Preserving Embeddings for Statistical Inference

ArXiv ID: 2512.22638

Authors: Deniz Akdemir

Abstract: Modern machine learning embeddings provide powerful compression of high-dimensional data, yet they typically destroy the geometric structure required for classical likelihood-based statistical inference. This paper develops a rigorous theory of likelihood-preserving embeddings: learned representations that can replace raw data in likelihood-based workflows -- hypothesis testing, confidence interval construction, model selection -- without altering inferential conclusions. We introduce the Likelihood-Ratio Distortion metric $\Delta_n$, which measures the maximum error in log-likelihood ratios induced by an embedding. Our main theoretical contribution is the Hinge Theorem, which establishes that controlling $\Delta_n$ is necessary and sufficient for preserving inference. Specifically, if the distortion satisfies $\Delta_n = o_p(1)$, then (i) all likelihood-ratio based tests and Bayes factors are asymptotically preserved, and (ii) surrogate maximum likelihood estimators are asymptotically equivalent to full-data MLEs. We prove an impossibility result showing that universal likelihood preservation requires essentially invertible embeddings, motivating the need for model-class-specific guarantees. We then provide a constructive framework using neural networks as approximate sufficient statistics, deriving explicit bounds connecting training loss to inferential guarantees. Experiments on Gaussian and Cauchy distributions validate the sharp phase transition predicted by exponential family theory, and applications to distributed clinical inference demonstrate practical utility.

Comment: Likelihood-preserving embeddings with explicit bounds (approximate sufficient statistics); strong theory for representation learning/compression for inference.

Relevance: 9 Novelty: 9


9. Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

ArXiv ID: 2512.22473

Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, [ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}{\alpha_i}[b]\bigr), \qquad b$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).} := u_i^\top v_j, ] coupled with a \emph{responsibility-weighted update} for values, [ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, ] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij

Comment: Matches: Representation Learning (theoretical gradient dynamics of attention) explaining how cross-entropy sculpts internal geometry.

Relevance: 9 Novelty: 9


10. The Bayesian Geometry of Transformer Attention

ArXiv ID: 2512.22471

Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels} -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a \emph{frame-precision dissociation} predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

Comment: Matches Representation Learning and Architecture Analysis: mechanistic evidence that transformer attention implements Bayesian inference via geometric roles of residuals/FFN/attention.

Relevance: 9 Novelty: 8


11. AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing

ArXiv ID: 2512.22455

Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.

Comment: Compression/Efficiency: PEFT (LoRA) with annealed activation to gain non-linear expressivity while remaining mergeable for efficient adaptation.

Relevance: 9 Novelty: 8


12. Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders

ArXiv ID: 2512.22150

Authors: Hans Jarett J. Ong, Brian Godwin S. Lim, Dominic Dayta, Renzo Roel P. Tan, Kazushi Ikeda

Abstract: Unsupervised representation learning seeks to recover latent generative factors, yet standard methods relying on statistical independence often fail to capture causal dependencies. A central challenge is identifiability: as established in disentangled representation learning and nonlinear ICA literature, disentangling causal variables from observational data is impossible without supervision, auxiliary signals, or strong inductive biases. In this work, we propose the Latent Additive Noise Model Causal Autoencoder (LANCA) to operationalize the Additive Noise Model (ANM) as a strong inductive bias for unsupervised discovery. Theoretically, we prove that while the ANM constraint does not guarantee unique identifiability in the general mixing case, it resolves component-wise indeterminacy by restricting the admissible transformations from arbitrary diffeomorphisms to the affine class. Methodologically, arguing that the stochastic encoding inherent to VAEs obscures the structural residuals required for latent causal discovery, LANCA employs a deterministic Wasserstein Auto-Encoder (WAE) coupled with a differentiable ANM Layer. This architecture transforms residual independence from a passive assumption into an explicit optimization objective. Empirically, LANCA outperforms state-of-the-art baselines on synthetic physics benchmarks (Pendulum, Flow), and on photorealistic environments (CANDLE), where it demonstrates superior robustness to spurious correlations arising from complex background scenes.

Comment: Representation Learning: causal autoencoder with ANM inductive bias and identifiability analysis; architectural innovation via differentiable ANM layer in WAE.

Relevance: 9 Novelty: 8


13. On Harnessing Idle Compute at the Edge for Foundation Model Training

ArXiv ID: 2512.22142

Authors: Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina

Abstract: The ecosystem behind foundation model development today is highly centralized and limited to large-scale cloud data center operators: training foundation models is costly, needing immense compute resources. Decentralized foundation model training across edge devices, leveraging their spare compute, promises a democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-based training performance, exhibit limited scalability with model size, exceed device memory capacity, and have prohibitive communication overhead. They also fail to satisfactorily handle device heterogeneity and dynamism. We introduce a new paradigm, Cleave, which finely partitions training operations through a novel selective hybrid tensor parallelism method. Together with a parameter server centric training framework, Cleave copes with device memory limits and avoids communication bottlenecks, thereby enabling efficient training of large models on par with the cloud. Further, with a cost optimization model to guide device selection and training workload distribution, Cleave effectively accounts for device heterogeneity and churn. Our evaluations show that Cleave matches cloud-based GPU training by scaling efficiently to larger models and thousands of devices, supporting up to 8x more devices than baseline edge-training approaches. It outperforms state-of-the-art edge training methods by up to a factor of 10 in per-batch training time and efficiently handles device failures, achieving at least 100x faster recovery than prior methods.

Comment: Distributed training innovation for edge devices via selective hybrid tensor parallelism and PS-centric design (HPC/distributed training).

Relevance: 9 Novelty: 8


14. Visual Language Hypothesis

ArXiv ID: 2512.23335

Authors: Xiu Li

Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber bundle like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone, semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.

Comment: Matches: Representation Learning (topological theory of semantics) and Model Architecture (architectural requirements for topology change).

Relevance: 9 Novelty: 8


15. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

ArXiv ID: 2512.22219

Authors: Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

Abstract: We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance megakernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained kernel overlap, and other previously infeasible GPU optimizations. The MPK compiler lowers tensor programs into highly optimized SM-level task graphs and generates optimized CUDA implementations for all tasks, while the MPK in-kernel parallel runtime executes these tasks within a single mega-kernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems by reducing end-to-end inference latency by up to 1.7x, pushing LLM inference performance close to hardware limits. MPK is publicly available at https://github.com/mirage-project/mirage.

Comment: Matches: High Performance Computing (compiler/runtime mega-kernelization, SM-level task graphs for end-to-end fusion).

Relevance: 9 Novelty: 8


16. Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

ArXiv ID: 2512.23260

Authors: Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Lechen Ning, Zhenbo Xu, Huijia Wu, Zhaofeng He

Abstract: Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity--individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate--exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods--while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.

Comment: Matches: Model Compression and Efficiency (low-rank subspace adapters) and Representation Learning (SAE-based interpretable feature subspace).

Relevance: 9 Novelty: 8


17. The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models

ArXiv ID: 2512.23340

Authors: Dakuan Lu, Jiaqi Zhang, Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li

Abstract: Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM are inherently bounded. One solution originates from intricate interactions among multiple LLMs, rendering their collective performance surpasses that of any constituent model. Despite the rapid proliferation of multi-model integration techniques such as model routing and post-hoc ensembling, a unifying theoretical framework of performance scaling for multi-model collaboration remains absent. In this work, we propose the Law of Multi-model Collaboration, a scaling law that predicts the performance limits of LLM ensembles based on their aggregated parameter budget. To quantify the intrinsic upper bound of multi-model collaboration, we adopt a method-agnostic formulation and assume an idealized integration oracle where the total cross-entropy loss of each sample is determined by the minimum loss of any model in the model pool. Experimental results reveal that multi-model systems follow a power-law scaling with respect to the total parameter count, exhibiting a more significant improvement trend and a lower theoretical loss floor compared to single model scaling. Moreover, ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family, indicating that model diversity is a primary driver of collaboration gains. These findings suggest that model collaboration represents a critical axis for extending the intelligence frontier of LLMs.

Comment: Foundational scaling-law theory for ensembling LLMs (multi-model collaboration), directly addressing model architecture/ensembling limits and diversity effects.

Relevance: 9 Novelty: 8


18. On the Existence and Behaviour of Secondary Attention Sinks

ArXiv ID: 2512.22213

Authors: Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao

Abstract: Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, they were found to exhibit properties analogous to the BOS token. Specifically, they emerge at the same layer, persist throughout the network and draw a large amount of attention mass. Whereas, we find the existence of secondary sinks that arise primarily in middle layers and can persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules; these MLPs map token representations to vectors that align with the direction of the primary sink of that layer. (2) The $\ell_2$-norm of these vectors determines the sink score of the secondary sink, and also the number of layers it lasts for, thereby leading to different impacts on the attention mechanisms accordingly. (3) The primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B.

Comment: Mechanistic analysis of attention sinks (secondary sinks) in Transformers — deep representation/architecture insight into attention behavior and MLP roles.

Relevance: 9 Novelty: 8


19. FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence

ArXiv ID: 2512.23485

Authors: Guoan Wan, Tianyu Chen, Fangzheng Feng, Haoyi Zhou, Runhua Xu

Abstract: Parameter-efficient fine-tuning (PEFT) methods have emerged as a practical solution for adapting large foundation models to downstream tasks, reducing computational and memory costs by updating only a small subset of parameters. Among them, approaches like LoRA aim to strike a balance between efficiency and expressiveness, but often suffer from slow convergence and limited adaptation capacity due to their inherent low-rank constraints. This trade-off hampers the ability of PEFT methods to capture complex patterns needed for diverse tasks. To address these challenges, we propose FRoD, a novel fine-tuning method that combines hierarchical joint decomposition with rotational degrees of freedom. By extracting a globally shared basis across layers and injecting sparse, learnable perturbations into scaling factors for flexible full-rank updates, FRoD enhances expressiveness and efficiency, leading to faster and more robust convergence. On 20 benchmarks spanning vision, reasoning, and language understanding, FRoD matches full model fine-tuning in accuracy, while using only 1.72% of trainable parameters under identical training budgets.

Comment: Matches Model Compression and Efficiency: parameter-efficient fine-tuning via hierarchical decomposition and sparse rotational full-rank updates for faster, robust convergence.

Relevance: 9 Novelty: 7


20. The Quest for Winning Tickets in Low-Rank Adapters

ArXiv ID: 2512.22495

Authors: Hamed Damirchi, Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

Abstract: The Lottery Ticket Hypothesis (LTH) suggests that over-parameterized neural networks contain sparse subnetworks ("winning tickets") capable of matching full model performance when trained from scratch. With the growing reliance on fine-tuning large pretrained models, we investigate whether LTH extends to parameter-efficient fine-tuning (PEFT), specifically focusing on Low-Rank Adaptation (LoRA) methods. Our key finding is that LTH holds within LoRAs, revealing sparse subnetworks that can match the performance of dense adapters. In particular, we find that the effectiveness of sparse subnetworks depends more on how much sparsity is applied in each layer than on the exact weights included in the subnetwork. Building on this insight, we propose Partial-LoRA, a method that systematically identifies said subnetworks and trains sparse low-rank adapters aligned with task-relevant subspaces of the pre-trained model. Experiments across 8 vision and 12 language tasks in both single-task and multi-task settings show that Partial-LoRA reduces the number of trainable parameters by up to 87\%, while maintaining or improving accuracy. Our results not only deepen our theoretical understanding of transfer learning and the interplay between pretraining and fine-tuning but also open new avenues for developing more efficient adaptation strategies.

Comment: Matches Compression/Efficiency and Sparsity: extends Lottery Ticket Hypothesis to LoRA and introduces Partial-LoRA for sparse low-rank adapters with large parameter savings.

Relevance: 9 Novelty: 7


21. Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback

ArXiv ID: 2512.22623

Authors: Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani

Abstract: Distributed learning, particularly Federated Learning (FL), faces a significant bottleneck in the communication cost, particularly the uplink transmission of client-to-server updates, which is often constrained by asymmetric bandwidth limits at the edge. Biased compression techniques are effective in practice, but require error feedback mechanisms to provide theoretical guarantees and to ensure convergence when compression is aggressive. Standard error feedback, however, relies on client-specific control variates, which violates user privacy and is incompatible with stateless clients common in large-scale FL. This paper proposes two novel frameworks that enable biased compression without client-side state or control variates. The first, Compressed Aggregate Feedback (CAFe), uses the globally aggregated update from the previous round as a shared control variate for all clients. The second, Server-Guided Compressed Aggregate Feedback (CAFe-S), extends this idea to scenarios where the server possesses a small private dataset; it generates a server-guided candidate update to be used as a more accurate predictor. We consider Distributed Gradient Descent (DGD) as a representative algorithm and analytically prove CAFe's superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-convex regime with bounded gradient dissimilarity. We further prove that CAFe-S converges to a stationary point, with a rate that improves as the server's data become more representative. Experimental results in FL scenarios validate the superiority of our approaches over existing compression schemes.

Comment: Matches Distributed Training/Communication Compression: CAFe and CAFe-S enable biased compression with aggregate/server-guided feedback and convergence guarantees without client-side state.

Relevance: 9 Novelty: 7


22. Merge before Forget: A Single LoRA Continual Learning via Continual Merging

ArXiv ID: 2512.23017

Authors: Fuli Qiao, Mehrdad Mahdavi

Abstract: Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically utilizing these to support new LoRAs learn new tasks. However, these methods not only ignore growing computational memory with tasks and limited storage space but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRAs updates into a single unified LoRA. Our method leverages orthogonal basis extraction from previously learned LoRA to initialize the learning of new tasks, further exploits the intrinsic asymmetry property of LoRA components by using a time-aware scaling mechanism to balance new and old knowledge during continual merging. Our approach maintains constant memory complexity with respect to the number of tasks, minimizes interference between past and new tasks via orthogonal basis initialization, and improves performance over asymmetric LoRA merging via adaptive scaling. We provide theoretical analysis to justify our design and conduct extensive experiments across diverse continual learning benchmarks using various Llama models, demonstrating the effectiveness and efficiency of our method.

Comment: Compression/Efficiency: continual LoRA merging with orthogonal basis and time-aware scaling, maintaining constant memory and mitigating interference.

Relevance: 9 Novelty: 7


23. Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models

ArXiv ID: 2512.21884

Authors: Tingyang Sun, Ting He, Bo Ji, Parimal Parag

Abstract: Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower the barrier for deploying LLMs by splitting the model blocks across multiple servers with low-end GPUs distributed over the Internet, which was much faster than swapping the model parameters between the GPU memory and other cheaper but slower local storage media. However, the performance of such a distributed system critically depends on the resource allocation, and how to do so optimally remains unknown. In this work, we present the first systematic study of the resource allocation problem in distributed LLM inference, with focus on two important decisions: block placement and request routing. Our main results include: experimentally validated performance models that can predict the inference performance under given block placement and request routing decisions, a formulation of the offline optimization of block placement and request routing as a mixed integer linear programming problem together with the NP-hardness proof and a polynomial-complexity algorithm with guaranteed performance, and an adaptation of the offline algorithm for the online setting with the same performance guarantee under bounded load. Through both experiments and experimentally-validated simulations, we have verified that the proposed solution can substantially reduce the inference time compared to the state-of-the-art solution in diverse settings with geographically-distributed servers. As a byproduct, we have also developed a light-weighted CPU-only simulator capable of predicting the performance of distributed LLM inference on GPU servers, which can evaluate large deployments and facilitate future research for researchers with limited GPU access.

Comment: HPC/Systems: optimization of block placement and request routing for geographically distributed LLM inference with validated performance models and provable guarantees.

Relevance: 9 Novelty: 7


24. Transformer Reconstructed with Dynamic Value Attention

ArXiv ID: 2512.22212

Authors: Xiaowei Wang

Abstract: Since transformer was firstly published in 2017, several works have been proposed to optimize it. However, the major structure of transformer remains unchanged, ignoring one of its main intrinsic limitations, which is the same static value is used for every query in a head. Transformer itself tries to solve this problem by implementing multi-head attentions, yet the number of heads is limited by complexity. I propose a method to decide a value for each query dynamically, which could cut down all the redundant heads, keeping only one. Consequently, the following feed forward network could be cut down entirely, as each revised embedding has already fetched enough useful values far beyond the context. As a result, a single-head Dynamic Value Attention (DVA) is all you need in a transformer. According to the experiment, DVA may save 37.6% training time than the original transformer meanwhile increasing the learning capability.

Comment: Matches: Model Architecture (Dynamic Value Attention replacing multi-head+FFN) and Efficiency (reduced heads and training time).

Relevance: 9 Novelty: 7


25. Theoretical Foundations of Scaling Law in Familial Models

ArXiv ID: 2512.23407

Authors: Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Chi Zhang, Xuelong Li

Abstract: Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks "Familial models, a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one-run, many-models" paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.

Comment: Matches Scaling Laws/Architecture: unified scaling law L(N,D,G) for familial models with early exits/relay inference, quantifying granularity alongside size and data.

Relevance: 8 Novelty: 8


26. Random Controlled Differential Equations

ArXiv ID: 2512.23670

Authors: Francesco Piatti, Thomas Cass, William F. Turner

Abstract: We introduce a training-efficient framework for time-series learning that combines random features with controlled differential equations (CDEs). In this approach, large randomly parameterized CDEs act as continuous-time reservoirs, mapping input paths to rich representations. Only a linear readout layer is trained, resulting in fast, scalable models with strong inductive bias. Building on this foundation, we propose two variants: (i) Random Fourier CDEs (RF-CDEs): these lift the input signal using random Fourier features prior to the dynamics, providing a kernel-free approximation of RBF-enhanced sequence models; (ii) Random Rough DEs (R-RDEs): these operate directly on rough-path inputs via a log-ODE discretization, using log-signatures to capture higher-order temporal interactions while remaining stable and efficient. We prove that in the infinite-width limit, these model induces the RBF-lifted signature kernel and the rough signature kernel, respectively, offering a unified perspective on random-feature reservoirs, continuous-time deep architectures, and path-signature theory. We evaluate both models across a range of time-series benchmarks, demonstrating competitive or state-of-the-art performance. These methods provide a practical alternative to explicit signature computations, retaining their inductive bias while benefiting from the efficiency of random features.

Comment: Model Architecture and Efficiency: random-feature controlled differential equation reservoirs with only a linear readout; kernel-limit theory for efficient time-series representation learning.

Relevance: 8 Novelty: 8


27. Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

ArXiv ID: 2512.23178

Authors: Zijian Liu

Abstract: Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded ${\frak p}$-th moment where ${\frak p}\in(1,2]$ has been recognized to be more realistic (say being upper bounded by $\sigma_{\frak l}^{\frak p}$ for some $\sigma_{\frak l}\ge0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate ${\cal O}(\sigma_{\frak l}\ln(1/\delta)T^{1/{\frak p}-1})$ (resp. ${\cal O}(\sigma_{\frak l}^2\ln^2(1/\delta)T^{2/{\frak p}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta\in(0,1]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two faster rates, ${\cal O}(\sigma_{\frak l}d_{\rm eff}^{-1/2{\frak p}}\ln^{1-1/{\frak p}}(1/\delta)T^{1/{\frak p}-1})$ and ${\cal O}(\sigma_{\frak l}^2d_{\rm eff}^{-1/{\frak p}}\ln^{2-2/{\frak p}}(1/\delta)T^{2/{\frak p}-2})$, than the aforementioned best results, where $d_{\rm eff}\ge1$ is a quantity we call the $\textit{generalized effective dimension}$. Our analysis improves upon the existing approach on two sides: better utilization of Freedman's inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.

Comment: Optimization theory for SGD with gradient clipping under heavy-tailed noise; refined rates and lower bounds (training dynamics/optimization).

Relevance: 8 Novelty: 8


28. Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

ArXiv ID: 2512.21917

Authors: Nathan Kallus

Abstract: Aligning large language models to preference data is commonly implemented by assuming a known link function between the distribution of observed preferences and the unobserved rewards (e.g., a logistic link as in Bradley-Terry). If the link is wrong, however, inferred rewards can be biased and policies be misaligned. We study policy alignment to preferences under an unknown and unrestricted link. We consider an $f$-divergence-constrained reward maximization problem and show that realizability of the solution in a policy class implies a semiparametric single-index binary choice model, where a scalar-valued index determined by a policy captures the dependence on demonstrations and the rest of the preference distribution is an unrestricted function thereof. Rather than focus on estimation of identifiable finite-dimensional structural parameters in the index as in econometrics, we focus on policy learning, focusing on error to the optimal policy and allowing unidentifiable and nonparametric indices. We develop a variety of policy learners based on profiling the link function, orthogonalizing the link function, and using link-agnostic bipartite ranking objectives. We analyze these and provide finite-sample policy error bounds that depend on generic functional complexity measures of the index class. We further consider practical implementations using first-order optimization suited to neural networks and batched data. The resulting methods are robust to unknown preference noise distribution and scale, while preserving the direct optimization of policies without explicitly fitting rewards.

Comment: Matches: Representation Learning (preference alignment via semiparametric single-index modeling) with theoretical policy error bounds.

Relevance: 8 Novelty: 8


29. Decomposing Task Vectors for Refined Model Editing

ArXiv ID: 2512.22511

Authors: Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi

Abstract: Large pre-trained models have transformed machine learning, yet adapting these models effectively to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors. This has given rise to large repositories dedicated to task vectors tailored for specific behaviors. The arithmetic operation of these task vectors allows for the seamless combination of desired behaviors without the need for large datasets. However, these vectors often contain overlapping concepts that can interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing shared knowledge across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise control over concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition method across three domains: improving multi-task merging in image classification by 5% using shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and achieving 47% toxicity reduction in language models while preserving performance on general knowledge tasks by negating the toxic information isolated to the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations.

Comment: Matches: Representation Learning (task vector decomposition into shared/unique subspaces) enabling controlled model editing.

Relevance: 8 Novelty: 8


30. GLUE: Gradient-free Learning to Unify Experts

ArXiv ID: 2512.22467

Authors: Jong-Ik Park, Shreyas Chaudhari, Srinivasa Pranav, Carlee Joe-Wong, Jos\'e M. F. Moura

Abstract: In many deployed systems (multilingual ASR, cross-hospital imaging, region-specific perception), multiple pretrained specialist models coexist. Yet, new target domains often require domain expansion: a generalized model that performs well beyond any single specialist's domain. Given such a new target domain, prior works seek a single strong initialization prior for the model parameters by first blending expert models to initialize a target model. However, heuristic blending -- using coefficients based on data size or proxy metrics -- often yields lower target-domain test accuracy, and learning the coefficients on the target loss typically requires computationally-expensive full backpropagation through the network. We propose GLUE, Gradient-free Learning To Unify Experts, which initializes the target model as a convex combination of fixed experts, learning the mixture coefficients of this combination via a gradient-free two-point (SPSA) update that requires only two forward passes per step. Across experiments on three datasets and three network architectures, GLUE produces a single prior that can be fine-tuned effectively to outperform baselines. GLUE improves test accuracy by up to 8.5% over data-size weighting and by up to 9.1% over proxy-metric selection. GLUE either outperforms backpropagation-based full-gradient mixing or matches its performance within 1.4%.

Comment: Matches Model Architecture: gradient-free convex weight-space mixing (SPSA-learned mixture coefficients) to unify multiple pretrained experts for initialization.

Relevance: 8 Novelty: 7


31. Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models

ArXiv ID: 2512.23073

Authors: Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu

Abstract: Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model's existing knowledge. Code available at: https://github.com/Ming-K9/MFT-VLM

Comment: Matches Model Architecture and Efficiency: mask fine-tuning with learnable gating scores to reconfigure internal subnetworks in VLMs without weight updates.

Relevance: 8 Novelty: 7


32. Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL

ArXiv ID: 2512.23310

Authors: Abolfazl Younesi, Abbas Shabrang Maryan, Elyas Oustad, Zahra Najafabadi Samani, Mohsen Ansari, Thomas Fahringer

Abstract: Deploying large language models (LLMs) on edge devices is challenging due to their limited memory and power resources. Cloud-only inference reduces device burden but introduces high latency and cost. Static edge-cloud partitions optimize a single metric and struggle when bandwidth fluctuates. We propose Splitwise, a novel Lyapunov-assisted deep reinforcement learning (DRL) framework for fine-grained, adaptive partitioning of LLMs across edge and cloud environments. Splitwise decomposes transformer layers into attention heads and feed-forward sub-blocks, exposing more partition choices than layer-wise schemes. A hierarchical DRL policy, guided by Lyapunov optimization, jointly minimizes latency, energy consumption, and accuracy degradation while guaranteeing queue stability under stochastic workloads and variable network bandwidth. Splitwise also guarantees robustness via partition checkpoints with exponential backoff recovery in case of communication failures. Experiments on Jetson Orin NX, Galaxy S23, and Raspberry Pi 5 with GPT-2 (1.5B), LLaMA-7B, and LLaMA-13B show that Splitwise reduces end-to-end latency by 1.4x-2.8x and cuts energy consumption by up to 41% compared with existing partitioners. It lowers the 95th-percentile latency by 53-61% relative to cloud-only execution, while maintaining accuracy and modest memory requirements.

Comment: Matches Systems/HPC for inference: fine-grained transformer sub-block partitioning across edge–cloud optimized via Lyapunov-assisted DRL to minimize latency/energy under bandwidth variability.

Relevance: 8 Novelty: 7


33. Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

ArXiv ID: 2512.22382

Authors: Bruno Mlodozeniec, Pierre Ablin, Louis B\'ethune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi

Abstract: Hyperparameter tuning can dramatically impact training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $\mu$P, have enabled transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of search for optimal global base hyperparameters at a small model size, and transfer to a large size. We extend these works in two key ways. To handle scaling along most important scaling axes, we propose the Complete$^{(d)}$ Parameterisation that unifies scaling in width and depth -- using an adaptation of CompleteP -- as well as in batch-size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape, and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.

Comment: Matches HPC/Training Optimization: Complete^(d) parameterization for transferring global and per-module hyperparameters across width, depth, batch size, and training duration.

Relevance: 8 Novelty: 7


34. Beyond Centralization: Provable Communication Efficient Decentralized Multi-Task Learning

ArXiv ID: 2512.22675

Authors: Donghwa Kang, Shana Moothedath

Abstract: Representation learning is a widely adopted framework for learning in data-scarce environments, aiming to extract common features from related tasks. While centralized approaches have been extensively studied, decentralized methods remain largely underexplored. We study decentralized multi-task representation learning in which the features share a low-rank structure. We consider multiple tasks, each with a finite number of data samples, where the observations follow a linear model with task-specific parameters. In the decentralized setting, task data are distributed across multiple nodes, and information exchange between nodes is constrained by a communication network. The goal is to recover the underlying feature matrix whose rank is much smaller than both the parameter dimension and the number of tasks. We propose a new alternating projected gradient and minimization algorithm with provable accuracy guarantees. We provide comprehensive characterizations of the time, communication, and sample complexities. Importantly, the communication complexity is independent of the target accuracy, which significantly reduces communication cost compared to prior methods. Numerical simulations validate the theoretical analysis across different dimensions and network topologies, and demonstrate regimes in which decentralized learning outperforms centralized federated approaches.

Comment: HPC/Distributed training and Representation Learning: communication-efficient decentralized multi-task learning with shared low-rank structure and provable time/communication/sample complexity.

Relevance: 8 Novelty: 7


35. Understanding the Mechanisms of Fast Hyperparameter Transfer

ArXiv ID: 2512.22768

Authors: Nikhil Ghosh, Denny Wu, Alberto Bietti

Abstract: The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such transfer strategy, we develop a general conceptual framework for reasoning about HP transfer across scale, characterizing transfer as fast when the suboptimality it induces vanishes asymptotically faster than the finite-scale performance gap. We show formally that fast transfer is equivalent to useful transfer for compute-optimal grid search, meaning that transfer is asymptotically more compute-efficient than direct tuning. While empirical work has found that the Maximal Update Parameterization ($\mu$P) exhibits fast transfer when scaling model width, the mechanisms remain poorly understood. We show that this property depends critically on problem structure by presenting synthetic settings where transfer either offers provable computational advantage or fails to outperform direct tuning even under $\mu$P. To explain the fast transfer observed in practice, we conjecture that decomposing the optimization trajectory reveals two contributions to loss reduction: (1) a width-stable component that determines the optimal HPs, and (2) a width-sensitive component that improves with width but weakly perturbs the HP optimum. We present empirical evidence for this hypothesis across various settings, including large language model pretraining.

Comment: Training dynamics/scale-aware tuning: theoretical framework for fast hyperparameter transfer (including μP) and compute-optimal grid search across model width.

Relevance: 8 Novelty: 7


36. How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure

ArXiv ID: 2512.23109

Authors: Paul M. Thompson

Abstract: Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations, rather than only on average-a critical issue in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation-an assumption supported by spectral structure in text and joint image-text embeddings-classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results give finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on intrinsic/effective dimension, not ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.

Comment: Finite-sample uniform convergence bounds under low-dimensional prompt/embedding structure (representation learning theory).

Relevance: 8 Novelty: 7


37. The Affine Divergence: Aligning Activation Updates Beyond Normalisation

ArXiv ID: 2512.22247

Authors: George Bird

Abstract: A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to constitute a more directly impactful quantity to prioritise in optimisation, as they are closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions to correct for this are trivial and, entirely incidentally, derive normalisation from first principles despite motivational independence. Consequently, such considerations offer a fresh and conceptual reframe of normalisation's action, with auxiliary experiments bolstering this mechanistically. Moreover, this analysis makes clear a second possibility: a solution that is functionally distinct from modern normalisations, without scale-invariance, yet remains empirically successful, outperforming conventional normalisers across several tests. This is presented as an alternative to the affine map. This generalises to convolution via a new functional form, "PatchNorm", a compositionally inseparable normaliser. Together, these provide an alternative mechanistic framework that adds to, and counters some of, the discussion of normalisation. Further, it is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, thereby aiding the prioritisation of representations during optimisation. Overall, this constitutes a theoretical-principled approach that yields several new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.

Comment: Matches: Model Architecture (new normalization mechanisms including PatchNorm) and Representation Learning (training dynamics aligning activation updates).

Relevance: 8 Novelty: 7


38. LLMBoost: Make Large Language Models Stronger with Boosting

ArXiv ID: 2512.22309

Authors: Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban

Abstract: Ensemble learning of LLMs has emerged as a promising alternative to enhance performance, but existing approaches typically treat models as black boxes, combining the inputs or final outputs while overlooking the rich internal representations and interactions across models.In this work, we introduce LLMBoost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging intermediate states of LLMs. Inspired by the boosting paradigm, LLMBoost incorporates three key innovations. First, a cross-model attention mechanism enables successor models to access and fuse hidden states from predecessors, facilitating hierarchical error correction and knowledge transfer. Second, a chain training paradigm progressively fine-tunes connected models with an error-suppression objective, ensuring that each model rectifies the mispredictions of its predecessor with minimal additional computation. Third, a near-parallel inference paradigm design pipelines hidden states across models layer by layer, achieving inference efficiency approaching single-model decoding. We further establish the theoretical foundations of LLMBoost, proving that sequential integration guarantees monotonic improvements under bounded correction assumptions. Extensive experiments on commonsense reasoning and arithmetic reasoning tasks demonstrate that LLMBoost consistently boosts accuracy while reducing inference latency.

Comment: Matches: Model Architecture (boosting-style ensemble with cross-model attention) and Efficiency (near-parallel inference).

Relevance: 8 Novelty: 7


39. A general framework for deep learning

ArXiv ID: 2512.23425

Authors: William Kengne, Modou Wade

Abstract: This paper develops a general approach for deep learning for a setting that includes nonparametric regression and classification. We perform a framework from data that fulfills a generalized Bernstein-type inequality, including independent, $\phi$-mixing, strongly mixing and $\mathcal{C}$-mixing observations. Two estimators are proposed: a non-penalized deep neural network estimator (NPDNN) and a sparse-penalized deep neural network estimator (SPDNN). For each of these estimators, bounds of the expected excess risk on the class of H\"older smooth functions and composition H\"older functions are established. Applications to independent data, as well as to $\phi$-mixing, strongly mixing, $\mathcal{C}$-mixing processes are considered. For each of these examples, the upper bounds of the expected excess risk of the proposed NPDNN and SPDNN predictors are derived. It is shown that both the NPDNN and SPDNN estimators are minimax optimal (up to a logarithmic factor) in many classical settings.

Comment: Matches: Model Compression and Efficiency (sparse-penalized DNN) and Representation Learning theory (minimax-optimal excess risk under mixing).

Relevance: 8 Novelty: 7


40. Frequency Regularization: Unveiling the Spectral Inductive Bias of Deep Neural Networks

ArXiv ID: 2512.22192

Authors: Jiahao Lu

Abstract: Regularization techniques such as L2 regularization (Weight Decay) and Dropout are fundamental to training deep neural networks, yet their underlying physical mechanisms regarding feature frequency selection remain poorly understood. In this work, we investigate the Spectral Bias of modern Convolutional Neural Networks (CNNs). We introduce a Visual Diagnostic Framework to track the dynamic evolution of weight frequencies during training and propose a novel metric, the Spectral Suppression Ratio (SSR), to quantify the "low-pass filtering" intensity of different regularizers. By addressing the aliasing issue in small kernels (e.g., 3x3) through discrete radial profiling, our empirical results on ResNet-18 and CIFAR-10 demonstrate that L2 regularization suppresses high-frequency energy accumulation by over 3x compared to unregularized baselines. Furthermore, we reveal a critical Accuracy-Robustness Trade-off: while L2 models are sensitive to broadband Gaussian noise due to over-specialization in low frequencies, they exhibit superior robustness against high-frequency information loss (e.g., low resolution), outperforming baselines by >6% in blurred scenarios. This work provides a signal-processing perspective on generalization, confirming that regularization enforces a strong spectral inductive bias towards low-frequency structures.

Comment: Provides representation learning and training-dynamics insights by quantifying spectral inductive bias of regularizers (SSR metric) and analyzing frequency behavior in CNNs.

Relevance: 8 Novelty: 7


41. Multiple Token Divergence: Measuring and Steering In-Context Computation Density

ArXiv ID: 2512.22944

Authors: Vincent Herrmann, Eric Alcaide, Michael Wand, J\"urgen Schmidhuber

Abstract: Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model's full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.

Comment: Introduces a metric (Multiple Token Divergence) for in-context computation density and a decoding method (Divergence Steering), offering insights and control over computational dynamics in LMs — representation/efficiency criterion.

Relevance: 8 Novelty: 7


42. Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought

ArXiv ID: 2512.21711

Authors: Yuyi Zhang, Boyu Tang, Tianjie Ju, Sufeng Duan, Gongshen Liu

Abstract: Latent tokens are gaining attention for enhancing reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective, uncovering fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encoding faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT and explicit CoT. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.

Comment: Causal/adversarial analysis of latent tokens in Chain-of-Continuous-Thought, revealing shortcut behavior — insightful representation/behavioral analysis of LM reasoning.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis/innovations on existing architectures. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  2. Model Compression and Efficiency - Relevant: Sparsity, pruning, quantization, low-rank approaches, cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  3. High Performance Computing - Relevant: Algorithmic or systems-level innovations enabling training of large-scale models, distributed training techniques, memory optimization. - Irrelevant: Incremental engineering improvements without novel algorithmic contributions.

  4. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.