Personalized Daily ArXiv Papers 2025-12-17

[gpt-5]	Prompt	Completion	Total
Token	35044	31055	66099
Cost	$0.04	$0.31	$0.35

Total arXiv papers: 459

Total scanned papers: 287

Total relevant papers: 25

Table of contents with paper titles:

Sliding Window Recurrences for Sequence Models Authors: Dragos Secrieru, Garyk Brixi, Yoshua Bengio, Taiji Suzuki, Michael Poli, Stefano Massaroli
SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations Authors: Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
Low-Rank Compression of Language Models via Differentiable Rank Selection Authors: Sidhant Sundrani, Francesco Tudisco, Pasquale Minervini
OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction Authors: Mohammad Mozaffari, Samuel Kushnir, Maryam Mehri Dehnavi, Amir Yazdanbakhsh
Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training Authors: Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohiremen Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas
EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training Authors: Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang
Spherical Leech Quantization for Visual Tokenization and Generation Authors: Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Kr\"ahenb\"uhl
CurvaDion: Curvature-Adaptive Distributed Orthonormalization Authors: Bhavesh Kumar, Roger Jin, Jeffrey Quesnelle
Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets Authors: Estelle Zheng (LORIA, ALE), Nathan Cerisara (LORIA), S\'ebastien Warichet (ALE), Emmanuel Helbert (ALE), Christophe Cerisara (SYNALP, LORIA)
MIDUS: Memory-Infused Depth Up-Scaling Authors: Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov
SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models Authors: Shizhuo Mao, Song Chen, Yi Kang
A Single Architecture for Representing Invariance Under Any Space Group Authors: Cindy Y. Zhang, Elif Ertekin, Peter Orbanz, Ryan P. Adams
Modular connectivity in neural networks emerges from Poisson noise-motivated regularisation, and promotes robustness and compositional generalisation Authors: Daoyuan Qian, Qiyao Liang, Ila Fiete
Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization Authors: Boyuan Yao, Dingcheng Luo, Lianghao Cao, Nikola Kovachki, Thomas O'Leary-Roseberry, Omar Ghattas
Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs Authors: Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Devvrit Khatri, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Sanjay Kale, Samy Jelassi
Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits Authors: Michael Murray, Tenzin Chan, Kedar Karhadker, Christopher J. Hillar
RePo: Language Models with Context Re-Positioning Authors: Huayang Li, Tianyu Zhao, Richard Sproat
Understanding the Gain from Data Filtering in Multimodal Contrastive Learning Authors: Divyansh Pareek, Sewoong Oh, Simon S. Du
STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning Authors: Jie Qin, Jiancheng Huang, Limeng Qiao, Lin Ma
RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees Authors: Junjie Ma, Jinlong Li
Massive Editing for Large Language Models Based on Dynamic Weight Generation Authors: Wentao Wan, Qiqing Lao, Zhiwei Xie, Hefeng Wu, Runnan Lin, Liang Lin, Keze Wang
Universal Reasoning Model Authors: Zitian Gao, Lynx Chen, Yihao Xiao, He Xing, Ran Tao, Haoming Luo, Joey Zhou, Bryan Dai
ReflCtrl: Controlling LLM Reflection via Representation Engineering Authors: Ge Yan (Lily), Chung-En Sun (Lily), Tsui-Wei (Lily), Weng
ParaFormer: A Generalized PageRank Graph Transformer for Graph Representation Learning Authors: Chaohao Yuan, Zhenjie Song, Ercan Engin Kuruoglu, Kangfei Zhao, Yang Liu, Deli Zhao, Hong Cheng, Yu Rong

1. Sliding Window Recurrences for Sequence Models

ArXiv ID: 2512.13921

Authors: Dragos Secrieru, Garyk Brixi, Yoshua Bengio, Taiji Suzuki, Michael Poli, Stefano Massaroli

Abstract: Multi-hybrid architectures are poised to take over language modeling due to better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that allows us to develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences. We focus specifically on truncating recurrences to hardware-aligned windows which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B parameter multi-hybrid models, Phalanx achieves over 10-40% speedup across 4K to 32K context length over optimized Transformers while matching perplexity.

Comment: Author match

2. SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

ArXiv ID: 2512.14080

Authors: Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao

Abstract: Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day comparable to ScatterMoE's 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-$K$ routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.

Comment: MoE Efficiency/HPC—memory-efficient MoE backward/forward, IO-overlap GPU kernels, and tile-aware token rounding for reduced padding.

Relevance: 10 Novelty: 9

3. Low-Rank Compression of Language Models via Differentiable Rank Selection

ArXiv ID: 2512.13733

Authors: Sidhant Sundrani, Francesco Tudisco, Pasquale Minervini

Abstract: Approaches for compressing large-language models using low-rank decomposition have made strides, particularly with the introduction of activation and loss-aware SVD, which improves the trade-off between decomposition rank and downstream task performance. Despite these advancements, a persistent challenge remains--selecting the optimal ranks for each layer to jointly optimise compression rate and downstream task accuracy. Current methods either rely on heuristics that can yield sub-optimal results due to their limited discrete search space or are gradient-based but are not as performant as heuristic approaches without post-compression fine-tuning. To address these issues, we propose Learning to Low-Rank Compress (LLRC), a gradient-based approach which directly learns the weights of masks that select singular values in a fine-tuning-free setting. Using a calibration dataset, we train only the mask weights to select fewer and fewer singular values while minimising the divergence of intermediate activations from the original model. Our approach outperforms competing ranking selection methods that similarly require no post-compression fine-tuning across various compression rates on common-sense reasoning and open-domain question-answering tasks. For instance, with a compression rate of 20% on Llama-2-13B, LLRC outperforms the competitive Sensitivity-based Truncation Rank Searching (STRS) on MMLU, BoolQ, and OpenbookQA by 12%, 3.5%, and 4.4%, respectively. Compared to other compression techniques, our approach consistently outperforms fine-tuning-free variants of SVD-LLM and LLM-Pruner across datasets and compression rates. Our fine-tuning-free approach also performs competitively with the fine-tuning variant of LLM-Pruner.

Comment: Model Compression—low-rank decomposition with differentiable rank selection masks; fine-tuning-free rank optimization.

Relevance: 10 Novelty: 8

4. OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction

ArXiv ID: 2512.13886

Authors: Mohammad Mozaffari, Samuel Kushnir, Maryam Mehri Dehnavi, Amir Yazdanbakhsh

Abstract: Post-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes a 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory. Together, these results set a new state-of-the-art accuracy-efficiency trade-offs for one-shot post-training pruning.

Comment: Model Compression and Efficiency: one-shot pruning with accelerator-friendly per-row QP reconstruction using shared layer Hessian (sparsity/pruning).

Relevance: 10 Novelty: 8

5. Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training

ArXiv ID: 2512.13996

Authors: Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohiremen Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

Abstract: Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.

Comment: Model Architecture (MoE) and Efficiency: sparsity-controllable dynamic Top-p routing with PI control and layer-wise routing normalization.

Relevance: 10 Novelty: 8

6. EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training

ArXiv ID: 2511.10333

Authors: Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, Liping Zhang

Abstract: Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they still suffer from considerable communication overhead. Existing approaches primarily rely on static gradient compression to enhance communication efficiency; however, these methods neglect the dynamic nature of evolving gradients during training, leading to performance degradation. Accelerating LLM training via compression without sacrificing performance remains a challenge. In this paper, we propose an entropy-driven dynamic gradient compression framework called EDGC. The core concept is to adjust the compression rate during LLM training based on the evolving trends of gradient entropy, taking into account both compression efficiency and error. EDGC consists of three key components.First, it employs a down-sampling method to efficiently estimate gradient entropy, reducing computation overhead. Second, it establishes a theoretical model linking compression rate with gradient entropy, enabling more informed compression decisions. Lastly, a window-based adjustment mechanism dynamically adapts the compression rate across pipeline stages, improving communication efficiency and maintaining model performance. We implemented EDGC on a 32-NVIDIA-V100 cluster and a 64-NVIDIA-H100 cluster to train GPT2-2.5B and GPT2-12.1B, respectively. The results show that EDGC significantly reduces communication latency and training time by up to 46.45% and 16.13% while preserving LLM accuracy.

Comment: HPC/Distributed Training—entropy-driven dynamic gradient compression adapting compression rates during LLM training.

Relevance: 9 Novelty: 8

7. Spherical Leech Quantization for Visual Tokenization and Generation

ArXiv ID: 2512.14697

Authors: Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Kr\"ahenb\"uhl

Abstract: Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ($\Lambda_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.

Comment: Compression/Quantization—non-parametric lattice-based codebook (Leech lattice) for autoencoder-style visual tokenization and compression.

Relevance: 9 Novelty: 8

8. CurvaDion: Curvature-Adaptive Distributed Orthonormalization

ArXiv ID: 2512.13728

Authors: Bhavesh Kumar, Roger Jin, Jeffrey Quesnelle

Abstract: As language models scale to trillions of parameters, distributed training across many GPUs becomes essential, yet gradient synchronization over high-bandwidth, low-latency networks remains a critical bottleneck. While recent methods like Dion reduce per-step communication through low-rank updates, they synchronize at every step regardless of the optimization landscape. We observe that synchronization requirements vary dramatically throughout training: workers naturally compute similar gradients in flat regions, making frequent synchronization redundant, while high-curvature regions require coordination to prevent divergence. We introduce CurvaDion, which uses Relative Maximum Momentum Change (RMMC) to detect high-curvature regions requiring synchronization. RMMC leverages momentum dynamics which are already computed during optimization as a computationally tractable proxy for directional curvature, adding only $\mathcal{O}(d)$ operations per layer. We establish theoretical connections between RMMC and loss curvature and demonstrate that CurvaDion achieves 99\% communication reduction while matching baseline convergence across models from 160M to 1.3B parameters.

Comment: High Performance Computing: communication-efficient distributed training via curvature-adaptive synchronization using momentum dynamics (RMMC).

Relevance: 9 Novelty: 8

9. Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets

ArXiv ID: 2512.14237

Authors: Estelle Zheng (LORIA, ALE), Nathan Cerisara (LORIA), S\'ebastien Warichet (ALE), Emmanuel Helbert (ALE), Christophe Cerisara (SYNALP, LORIA)

Abstract: Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50\%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing\textemdash conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.

Comment: Compression/Efficiency: PEFT via Ladder Side Tuning halves peak memory vs QLoRA; Model Architecture: xLadder increases effective depth via side-net cross-connections at fixed params

Relevance: 9 Novelty: 8

10. MIDUS: Memory-Infused Depth Up-Scaling

ArXiv ID: 2512.13751

Authors: Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song

Abstract: Scaling large language models (LLMs) demands approaches that increase capacity without incurring excessive parameter growth or inference cost. Depth Up-Scaling (DUS) has emerged as a promising strategy by duplicating layers and applying Continual Pre-training (CPT), but its reliance on feed-forward networks (FFNs) limits efficiency and attainable gains. We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces FFNs in duplicated blocks with a head-wise memory (HML) layer. Motivated by observations that attention heads have distinct roles both across and within layers, MIDUS assigns an independent memory bank to each head, enabling head-wise retrieval and injecting information into subsequent layers while preserving head-wise functional structure. This design combines sparse memory access with head-wise representations and incorporates an efficient per-head value factorization module, thereby relaxing the usual efficiency-performance trade-off. Across our CPT experiments, MIDUS exhibits robust performance improvements over strong DUS baselines while maintaining a highly efficient parameter footprint. Our findings establish MIDUS as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling by leveraging its head-wise memory design.

Comment: Model Architecture: head-wise memory layer replacing FFNs in duplicated blocks; Compression/Efficiency: sparse memory access and per-head value factorization for parameter-efficient depth up-scaling

Relevance: 9 Novelty: 8

11. Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

ArXiv ID: 2512.14067

Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

Abstract: Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.

Comment: Model Architecture/Efficiency: AR-to-diffusion LM conversion with block-wise causal/bidirectional attention preserving KV cache and pretrained weights; Training Dynamics: position-dependent masking to close train–test gap

Relevance: 9 Novelty: 8

12. SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models

ArXiv ID: 2512.14481

Authors: Shizhuo Mao, Song Chen, Yi Kang

Abstract: Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing approaches of quantization-aware training (QAT) further suffer from weight training costs. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ exclusively optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.

Comment: Model Compression and Efficiency: quantization-aware training focused on activation scaling factors enabling accurate static inference (quantization).

Relevance: 9 Novelty: 7

13. A Single Architecture for Representing Invariance Under Any Space Group

ArXiv ID: 2512.13989

Authors: Cindy Y. Zhang, Elif Ertekin, Peter Orbanz, Ryan P. Adams

Abstract: Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups, symmetries critical to modeling crystalline solids in materials science and condensed matter physics, this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to such crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.

Comment: Model Architecture—single architecture/layer enforcing invariance under any space group via symmetry-adapted Fourier weight sharing.

Relevance: 8 Novelty: 8

14. Modular connectivity in neural networks emerges from Poisson noise-motivated regularisation, and promotes robustness and compositional generalisation

ArXiv ID: 2512.13707

Authors: Daoyuan Qian, Qiyao Liang, Ila Fiete

Abstract: Circuits in the brain commonly exhibit modular architectures that factorise complex tasks, resulting in the ability to compositionally generalise and reduce catastrophic forgetting. In contrast, artificial neural networks (ANNs) appear to mix all processing, because modular solutions are difficult to find as they are vanishing subspaces in the space of possible solutions. Here, we draw inspiration from fault-tolerant computation and the Poisson-like firing of real neurons to show that activity-dependent neural noise, combined with nonlinear neural responses, drives the emergence of solutions that reflect an accurate understanding of modular tasks, corresponding to acquisition of a correct world model. We find that noise-driven modularisation can be recapitulated by a deterministic regulariser that multiplicatively combines weights and activations, revealing rich phenomenology not captured in linear networks or by standard regularisation methods. Though the emergence of modular structure requires sufficiently many training samples (exponential in the number of modular task dimensions), we show that pre-modularised ANNs exhibit superior noise-robustness and the ability to generalise and extrapolate well beyond training data, compared to ANNs without such inductive biases. Together, our work demonstrates a regulariser and architectures that could encourage modularity emergence to yield functional benefits.

Comment: Representation Learning/Regularization—Poisson-noise-motivated regularizer induces modular connectivity and compositional generalization.

Relevance: 8 Novelty: 8

15. Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization

ArXiv ID: 2512.14086

Authors: Boyuan Yao, Dingcheng Luo, Lianghao Cao, Nikola Kovachki, Thomas O'Leary-Roseberry, Omar Ghattas

Abstract: We present approximation theories and efficient training methods for derivative-informed Fourier neural operators (DIFNOs) with applications to PDE-constrained optimization. A DIFNO is an FNO trained by minimizing its prediction error jointly on output and Fr\'echet derivative samples of a high-fidelity operator (e.g., a parametric PDE solution operator). As a result, a DIFNO can closely emulate not only the high-fidelity operator's response but also its sensitivities. To motivate the use of DIFNOs instead of conventional FNOs as surrogate models, we show that accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Fr\'echet derivatives. Then, for continuously differentiable operators, we establish (i) simultaneous universal approximation of FNOs and their Fr\'echet derivatives on compact sets, and (ii) universal approximation of FNOs in weighted Sobolev spaces with input measures that have unbounded supports. Our theoretical results certify the capability of FNOs for accurate derivative-informed operator learning and accurate solution of PDE-constrained optimization. Furthermore, we develop efficient training schemes using dimension reduction and multi-resolution techniques that significantly reduce memory and computational costs for Fr\'echet derivative learning. Numerical examples on nonlinear diffusion--reaction, Helmholtz, and Navier--Stokes equations demonstrate that DIFNOs are superior in sample complexity for operator learning and solving infinite-dimensional PDE-constrained inverse problems, achieving high accuracy at low training sample sizes.

Comment: Model Architecture/Representation Learning: derivative-informed training of Fourier Neural Operators with universal approximation (incl. Fréchet derivatives) and efficient training schemes.

Relevance: 8 Novelty: 8

16. Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs

ArXiv ID: 2512.13898

Authors: Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Devvrit Khatri, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Sanjay Kale, Samy Jelassi

Abstract: Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.

Comment: Efficiency/Training dynamics: test-time training with targeted gradient updates to overcome static attention limitations in long-context LLMs.

Relevance: 8 Novelty: 8

17. Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits

ArXiv ID: 2512.14338

Authors: Michael Murray, Tenzin Chan, Kedar Karhadker, Christopher J. Hillar

Abstract: Many learning problems involve symmetries, and while invariance can be built into neural architectures, it can also emerge implicitly when training on group-structured data. We study this phenomenon in classical Hopfield networks and show they can infer the full isomorphism class of a graph from a small random sample. Our results reveal that: (i) graph isomorphism classes can be represented within a three-dimensional invariant subspace, (ii) using gradient descent to minimize energy flow (MEF) has an implicit bias toward norm-efficient solutions, which underpins a polynomial sample complexity bound for learning isomorphism classes, and (iii) across multiple learning rules, parameters converge toward the invariant subspace as sample sizes grow. Together, these findings highlight a unifying mechanism for generalization in Hopfield networks: a bias toward norm efficiency in learning drives the emergence of approximate invariance under group-structured data.

Comment: Representation Learning/Implicit Bias—emergent invariance in Hopfield networks and norm-efficient learning of graph orbits.