Personalized Daily ArXiv Papers 2025-10-10

[gpt-5]	Prompt	Completion	Total
Token	66358	69233	135591
Cost	$0.08	$0.69	$0.78

Total arXiv papers: 741

Total scanned papers: 428

Total relevant papers: 35

Table of contents with paper titles:

FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts Authors: Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
MeSH: Memory-as-State-Highways for Recursive Transformers Authors: Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, Jiaang Li, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, Bo Zheng
From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill Authors: Gunjun Lee, Jiwon Kim, Jaiyoung Park, Younjoo Lee, Jung Ho Ahn
Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training Authors: Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong
Who Said Neural Networks Aren't Linear? Authors: Nimrod Berman, Assaf Hallak, Assaf Shocher
Lossless Vocabulary Reduction for Auto-Regressive Language Models Authors: Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi
GCPO: When Contrast Fails, Go Gold Authors: Hao Wu, Wei Liu
Base Models Know How to Reason, Thinking Models Learn When Authors: Constantin Venhoff, Iv\'an Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
Geodesics in the Deep Linear Network Authors: Alan Chen
Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization Authors: P\'al Zs\'amboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency Authors: Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference Authors: Hengrui Zhang, Pratyush Patel, August Ning, David Wentzlaff
Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning Authors: Ankur Mali, Lawrence Hall, Jake Williams, Gordon Richards
Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts Authors: Yeskendir Koishekenov, Aldo Lipani, Nicola Cancedda
MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting Authors: Yoli Shavit, Jacob Goldberger
Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization Authors: Jason Bohne, Pawel Polak, David Rosenberg, Brian Bloniarz, Gary Kazantsev
AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models Authors: Xiaoshuang Ji, Zhendong Zhao, Xiaoyan Gu, Xiaojun Chen, Xin Zhao, Zeyao Liu
Memory Retrieval and Consolidation in Large Language Models through Function Tokens Authors: Shaohua Zhang, Yuan Lin, Hang Li
R\'enyi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization Authors: Qiaozhe Zhang, Jun Sun, Ruijie Zhang, Yingzhuang Liu
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models Authors: Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola
Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity Authors: Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai
Beyond independent component analysis: identifiability and algorithms Authors: Alvaro Ribot, Anna Seigal, Piotr Zwiernik
OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs Authors: Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari
DeepPrune: Parallel Scaling without Inter-trace Redundancy Authors: Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li
On the Relationship Between the Choice of Representation and In-Context Learning Authors: Ioana Marinescu, Kyunghyun Cho, Eric Karl Oermann
gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity Authors: Hugh Blayney, \'Alvaro Arroyo, Xiaowen Dong, Michael M. Bronstein
Vocabulary embeddings organize linguistic structure early in language model training Authors: Isabel Papadimitriou, Jacob Prince
HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs Authors: Majid Jaberi-Douraki, Hossein Sholehrasa, Xuan Xu, Remya Ampadi Ramachandran
Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT Authors: Noor Ul Zain, Mohsin Raza, Ahsan Adeel
To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models Authors: Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal
First Try Matters: Revisiting the Role of Reflection in Reasoning Models Authors: Liwei Kang, Yue Deng, Yao Xiao, Zhanfeng Mo, Wee Sun Lee, Lidong Bing
Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning Authors: Aman Sharma, Paras Chopra
Robust and Efficient Collaborative Learning Authors: Abdellah El Mrini, Sadegh Farhadkhan, Rachid Guerraoui
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered Authors: Jason Jabbour, Dong-Ki Kim, Max Smith, Jay Patrikar, Radhika Ghosal, Youhui Wang, Ali Agha, Vijay Janapa Reddi, Shayegan Omidshafiei
Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training Authors: Qinglun Li, Yingqi Liu, Miao Zhang, Xiaochun Cao, Quanjun Yin, Li Shen

1. FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts

ArXiv ID: 2510.08396

Authors: Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji

Abstract: Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.

Comment: Model Architecture/Efficiency: implicit rank-wise MoE within LoRA using sparse random projection as router for parameter-efficient fine-tuning and task decoupling.

Relevance: 10 Novelty: 8

2. MeSH: Memory-as-State-Highways for Recursive Transformers

ArXiv ID: 2510.07739

Authors: Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, Jiaang Li, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, Bo Zheng

Abstract: Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address the issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperforms its larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.

Comment: Model Architecture: Memory-as-State-Highways adds explicit memory and lightweight routers to diversify computation across recursive iterations, strengthening recursive transformers.

Relevance: 10 Novelty: 8

3. From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill

ArXiv ID: 2510.08055

Authors: Gunjun Lee, Jiwon Kim, Jaiyoung Park, Younjoo Lee, Jung Ho Ahn

Abstract: Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to 39% and inflate energy consumption. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to 70%, End-to-End latency by 41% and per-token energy by up to 22%. Evaluations show that layered prefill consistently improves the TTFT--TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.

Comment: High Performance Computing and Efficiency: introduces layered prefill scheduling that reduces MoE expert weight reloads, lowering memory bandwidth and latency for stall-free serving.

Relevance: 10 Novelty: 8

4. Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training

ArXiv ID: 2510.08008

Authors: Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong

Abstract: The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Numerous computational costs have been invested in existing well-trained checkpoints, but many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose orthogonal growth method well-suited for converged Mixture-of-Experts model: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across checkpoints sequences, we perform comprehensive scaling experiments revealing that the final accuracy has a strong positive correlation with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.

Comment: Matches Model Architecture criterion (Mixture-of-Experts): orthogonal growth (depth/width) and checkpoint recycling for efficient pretraining.

Relevance: 10 Novelty: 8

5. Who Said Neural Networks Aren't Linear?

ArXiv ID: 2510.08570

Authors: Nimrod Berman, Assaf Hallak, Assaf Shocher

Abstract: Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f$$:$$X$$\to$$Y$. Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is, in fact, linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if we sandwich a linear operator $A$ between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, then the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling actions derived from $g_x$ and $g_y$. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e. $f(f(x))=f(x)$) on networks leading to a globally projective generative model and to demonstrate modular style transfer.

Comment: Matches Model Architecture: introduces a Linearizer architecture (invertible NNs around a linear map) enabling linear-algebraic analysis and composition properties for nonlinear networks.

Relevance: 9 Novelty: 9

6. Lossless Vocabulary Reduction for Auto-Regressive Language Models

ArXiv ID: 2510.08102

Authors: Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi

Abstract: Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has their own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions such as model ensemble. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into the one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.

Comment: Model efficiency/tokenization: lossless vocabulary reduction enabling smaller vocabularies and cross-tokenizer cooperation for AR LMs; strong alignment with efficiency and systems-level interoperability.

Relevance: 9 Novelty: 8

7. GCPO: When Contrast Fails, Go Gold

ArXiv ID: 2510.07790

Authors: Hao Wu, Wei Liu

Abstract: Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model's rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: https://github.com/AchoWu/GCPO.

Comment: High-Performance/Distributed Training: stability-based generalization and excess error bounds for multi-gossip decentralized training; algorithmic insights into communication/training efficiency.

Relevance: 9 Novelty: 8

8. Base Models Know How to Reason, Thinking Models Learn When

ArXiv ID: 2510.07364

Authors: Constantin Venhoff, Iv\'an Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda

Abstract: Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.

Comment: Representation Learning/Training Dynamics: causal elicitation of latent reasoning mechanisms in base models and analysis of when vs how reasoning is deployed; foundational interpretability insight.

Relevance: 9 Novelty: 8

9. Geodesics in the Deep Linear Network

ArXiv ID: 2510.07324

Authors: Alan Chen

Abstract: We derive a general system of ODEs and associated explicit solutions in a special case for geodesics between full rank matrices in the deep linear network geometry. In the process, we characterize all horizontal straight lines in the invariant balanced manifold that remain geodesics under Riemannian submersion.

Comment: Training dynamics/geometry: derives geodesics and ODEs in deep linear network geometry, offering theoretical insight into network optimization landscapes.

Relevance: 9 Novelty: 8

10. Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization

ArXiv ID: 2510.08341

Authors: P\'al Zs\'amboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang

Abstract: We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence -- an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.

Comment: Representation Learning/Training Dynamics: theoretical bounds for attention-only transformers and mechanisms (dropout, EMA) that improve length generalization.

Relevance: 9 Novelty: 8

11. Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

ArXiv ID: 2510.08431

Authors: Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

Abstract: This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.

Comment: HPC + Efficiency: introduces a parallelism-compatible FlashAttention-2 JVP kernel enabling 10B+ model sCM training and proposes score-regularized continuous-time consistency distillation for few-step generation.

Relevance: 9 Novelty: 8

12. SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference

ArXiv ID: 2510.08544

Authors: Hengrui Zhang, Pratyush Patel, August Ning, David Wentzlaff

Abstract: Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode phase. To efficiently serve LLMs, prior work proposes prefill-decode disaggregation to run each phase on separate hardware. However, existing hardware poorly matches the different requirements of each phase. Current datacenter GPUs and TPUs follow a more-is-better design philosophy that maximizes compute and memory resources, causing memory bandwidth underutilization in the prefill phase and compute underutilization in the decode phase. Such underutilization directly translates into increased serving costs. This paper proposes SPAD (Specialized Prefill and Decode hardware), adopting a less-is-more methodology to design specialized chips tailored to the distinct characteristics of prefill and decode phases. The proposed Prefill Chips have larger systolic arrays and use cost-effective GDDR memory, whereas the proposed Decode Chips retain high memory bandwidth but reduce compute capacity. Compared to modeled H100s, simulations show that the proposed Prefill Chips deliver 8% higher prefill performance on average at 52% lower hardware cost, while the proposed Decode Chips achieve 97% of the decode performance with 28% lower TDP. End-to-end simulations on production traces show that SPAD reduces hardware cost by 19%-41% and TDP by 2%-17% compared to modeled baseline clusters while offering the same performance. Even when models and workloads change, SPAD can reallocate either type of chip to run either phase and still achieve 11%-43% lower hardware costs, demonstrating the longevity of the SPAD design.

Comment: High Performance Computing: systems-level prefill/decode disaggregation with specialized hardware to optimize compute/memory utilization for LLM inference.

Relevance: 9 Novelty: 8

13. Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning

ArXiv ID: 2510.08456

Authors: Ankur Mali, Lawrence Hall, Jake Williams, Gordon Richards

Abstract: Activation functions govern the expressivity and stability of neural networks, yet existing comparisons remain largely heuristic. We propose a rigorous framework for their classification via a nine-dimensional integral signature S_sigma(phi), combining Gaussian propagation statistics (m1, g1, g2, m2, eta), asymptotic slopes (alpha_plus, alpha_minus), and regularity measures (TV(phi'), C(phi)). This taxonomy establishes well-posedness, affine reparameterization laws with bias, and closure under bounded slope variation. Dynamical analysis yields Lyapunov theorems with explicit descent constants and identifies variance stability regions through (m2', g2). From a kernel perspective, we derive dimension-free Hessian bounds and connect smoothness to bounded variation of phi'. Applying the framework, we classify eight standard activations (ReLU, leaky-ReLU, tanh, sigmoid, Swish, GELU, Mish, TeLU), proving sharp distinctions between saturating, linear-growth, and smooth families. Numerical Gauss-Hermite and Monte Carlo validation confirms theoretical predictions. Our framework provides principled design guidance, moving activation choice from trial-and-error to provable stability and kernel conditioning.

Comment: Architecture/Training Dynamics: rigorous activation-function taxonomy with Lyapunov stability and kernel Hessian bounds guiding network design.

Relevance: 9 Novelty: 8

14. Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts

ArXiv ID: 2510.07358

Authors: Yeskendir Koishekenov, Aldo Lipani, Nicola Cancedda

Abstract: Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of thought. Motivated by interpretability studies showing that the crucial computation required for reasoning tasks is concentrated in a limited range of layers, we introduce Encode-Think-Decode (ETD), a method that enhances the reasoning capabilities of a base model by training it to iterate over a small subset of reasoning-relevant layers during the mid-training stage. ETD amplifies latent reasoning while preserving the original architecture, parameter count, hyperparameters, and training data composition. When iterating on the selected layers at inference time, ETD models yield substantial gains on 17 reasoning benchmarks, including +28.4% relative accuracy improvement on GSM8K and +36% on MATH with the OLMo-2 1B Base model. We also explore an adaptive depth strategy that adjusts the computation per input token. Our results show that recursive latent reasoning offers a simple and effective path to stronger LLM reasoning.

Comment: Model Architecture and Efficiency: introduces recursive iteration over selected reasoning-relevant layers and adaptive depth for test-time compute scaling without increasing parameters.

Relevance: 9 Novelty: 8

15. MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting

ArXiv ID: 2510.07459

Authors: Yoli Shavit, Jacob Goldberger

Abstract: We introduce Mixture-of-Gaussians with Uncertainty-based Gating (MoGU), a novel Mixture-of-Experts (MoE) framework designed for regression tasks and applied to time series forecasting. Unlike conventional MoEs that provide only point estimates, MoGU models each expert's output as a Gaussian distribution. This allows it to directly quantify both the forecast (the mean) and its inherent uncertainty (variance). MoGU's core innovation is its uncertainty-based gating mechanism, which replaces the traditional input-based gating network by using each expert's estimated variance to determine its contribution to the final prediction. Evaluated across diverse time series forecasting benchmarks, MoGU consistently outperforms single-expert models and traditional MoE setups. It also provides well-quantified, informative uncertainties that directly correlate with prediction errors, enhancing forecast reliability. Our code is available from: https://github.com/yolish/moe_unc_tsf

Comment: Model Architecture (MoE): probabilistic experts with uncertainty-based gating replacing input-based routers for regression/forecasting.

Relevance: 9 Novelty: 7

16. Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

ArXiv ID: 2510.08256

Authors: Jason Bohne, Pawel Polak, David Rosenberg, Brian Bloniarz, Gary Kazantsev

Abstract: Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.

Comment: Strong match to Model Architecture: extends DPO with mixture models and MoE architectures using variational inference and ELBO optimization for expert specialization.

Relevance: 9 Novelty: 7

17. AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models

ArXiv ID: 2510.08034

Authors: Xiaoshuang Ji, Zhendong Zhao, Xiaoyan Gu, Xiaojun Chen, Xin Zhao, Zeyao Liu

Abstract: Parameter-efficient finetuning (PEFT) aims to mitigate the substantial computational and memory overhead involved in adapting large-scale pretrained models to diverse downstream tasks. Among numerous PEFT strategies, Low-Rank Adaptation (LoRA) has emerged as one of the most widely adopted approaches due to its robust empirical performance and low implementation complexity. In practical deployment, LoRA is typically applied to the $W^Q$ and $W^V$ projection matrices of self-attention modules, enabling an effective trade-off between model performance and parameter efficiency. While LoRA has achieved considerable empirical success, it still encounters challenges such as suboptimal performance and slow convergence. To address these limitations, we introduce \textbf{AILoRA}, a novel parameter-efficient method that incorporates function-aware asymmetric low-rank priors. Our empirical analysis reveals that the projection matrices $W^Q$ and $W^V$ in the self-attention mechanism exhibit distinct parameter characteristics, stemming from their functional differences. Specifically, $W^Q$ captures task-specific semantic space knowledge essential for attention distributions computation, making its parameters highly sensitive to downstream task variations. In contrast, $W^V$ encodes token-level feature representations that tend to remain stable across tasks and layers. Leveraging these insights, AILoRA performs a function-aware initialization by injecting the principal components of $W^Q$ to retain task-adaptive capacity, and the minor components of $W^V$ to preserve generalizable feature representations. This asymmetric initialization strategy enables LoRA modules to better capture the specialized roles of attention parameters, thereby enhancing both finetuning performance and convergence efficiency.

Comment: Strong match to Compression/Efficiency: enhances LoRA via function-aware asymmetric low-rank initialization with analysis of distinct W^Q and W^V roles in self-attention.

Relevance: 9 Novelty: 7

18. Memory Retrieval and Consolidation in Large Language Models through Function Tokens

ArXiv ID: 2510.08203

Authors: Shaohua Zhang, Yuan Lin, Hang Li

Abstract: The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.

Comment: Representation Learning: proposes the function token hypothesis with evidence on how function tokens retrieve features and drive memory consolidation in LLMs.

Relevance: 9 Novelty: 7

19. R\'enyi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization

ArXiv ID: 2510.07758

Authors: Qiaozhe Zhang, Jun Sun, Ruijie Zhang, Yingzhuang Liu

Abstract: Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better generalization might be. Unfortunately, the correlation between many existing sharpness measures and the generalization is usually not strong, sometimes even weak. To close the gap between the intuition and the reality, we propose a novel sharpness measure, i.e., \textit{R\'enyi sharpness}, which is defined as the negative R\'enyi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that \textit{uniform} (identical) eigenvalues of the loss Hessian is most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the \textit{R\'enyi entropy} to concisely characterize the extent of the spread of the eigenvalues of loss Hessian. Normally, the larger the spread, the smaller the (R\'enyi) entropy. To rigorously establish the relationship between generalization and (R\'enyi) sharpness, we provide several generalization bounds in terms of R\'enyi sharpness, by taking advantage of the reparametrization invariance property of R\'enyi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (in specific, Kendall rank correlation) between the R\'enyi sharpness and generalization. Moreover, we propose to use a variant of R\'enyi Sharpness as regularizer during training, i.e., R\'enyi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worthy noting that the test accuracy gain of our proposed RSAM method could be as high as nearly 2.5\%, compared against the classical SAM method.

Comment: Representation Learning/Training Dynamics: introduces Rényi-sharpness tied to Hessian spectra with generalization bounds and a new SAM-style regularizer (RSAM).

Relevance: 8 Novelty: 8

20. Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

ArXiv ID: 2510.08492

Authors: Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola

Abstract: Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/

Comment: Representation Learning: unpaired multimodal training with shared parameters; theory under linear assumptions showing unimodal gains from auxiliary modalities.

Relevance: 8 Novelty: 8

21. Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity

ArXiv ID: 2510.08023

Authors: Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai

Abstract: Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al., have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a 32 $\times$ width multiplier for ResNet-20). This is broadly believed to be because increasing the model width ensures a large enough space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when using a suitable softmax temperature calibration. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently the merged model's output matches that of an ensemble of the original models, which facilitates LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.

Comment: Matches Representation Learning/training dynamics: shows width expansion enables linear mode connectivity without permutations; introduces LEWC explanation.

Relevance: 8 Novelty: 8

22. Beyond independent component analysis: identifiability and algorithms

ArXiv ID: 2510.07525

Authors: Alvaro Ribot, Anna Seigal, Piotr Zwiernik

Abstract: Independent Component Analysis (ICA) is a classical method for recovering latent variables with useful identifiability properties. For independent variables, cumulant tensors are diagonal; relaxing independence yields tensors whose zero structure generalizes diagonality. These models have been the subject of recent work in non-independent component analysis. We show that pairwise mean independence answers the question of how much one can relax independence: it is identifiable, any weaker notion is non-identifiable, and it contains the models previously studied as special cases. Our results apply to distributions with the required zero pattern at any cumulant tensor. We propose an algebraic recovery algorithm based on least-squares optimization over the orthogonal group. Simulations highlight robustness: enforcing full independence can harm estimation, while pairwise mean independence enables more stable recovery. These findings extend the classical ICA framework and provide a rigorous basis for blind source separation beyond independence.

Comment: Representation Learning: identifiability theory beyond ICA (pairwise mean independence) with an algebraic recovery algorithm.

Relevance: 8 Novelty: 8

23. OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs

ArXiv ID: 2510.07535

Authors: Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari

Abstract: Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find current approaches degrade severely with long contexts; for instance, EAGLE3 even slows down the generation speed by 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, making it generalize to various lengths, (2) a special token [SPEC] in the verifier that produces richer representation for drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.

Comment: Model Compression and Efficiency: algorithmic speedups for long-context speculative decoding (LSTM drafter, [SPEC] verifier, hybrid tree/non-tree) to improve inference throughput.