This is a remedial run for missed papers from 05/15/2025 to 05/15/2025.

Results generated on 05/26/2025.

Personalized Daily ArXiv Papers 2025-05-16

[gpt-4o]	Prompt	Completion	Total
Token	23827	2914	26741
Cost	$0.06	$0.03	$0.09

Total arXiv papers: 262

Total scanned papers: 262

Total relevant papers: 11

Table of contents with paper titles:

1. A probabilistic framework for dynamic quantization (Authors: Gabriele Santini, Francesco Paissan, Elisabetta Farella)
2. Parallel Scaling Law for Language Models (Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu)
3. Neural Thermodynamic Laws for Large Language Model Training (Authors: Ziming Liu, Yizhou Liu, Jeff Gore, Max Tegmark)
4. Superposition Yields Robust Neural Scaling (Authors: Yizhou Liu, Ziming Liu, Jeff Gore)
5. Learning Repetition-Invariant Representations for Polymer Informatics (Authors: Yihan Zhu, Gang Liu, Eric Inae, Tengfei Luo, Meng Jiang)
6. FlowVAT: Normalizing Flow Variational Inference with Affine-Invariant Tempering (Authors: Juehang Qin, Shixiao Liang, Christopher Tunnell)
7. Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates (Authors: Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang)
8. The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think (Authors: Seongyun Lee, Seungone Kim, Minju Seo, Yongrae Jo, Dongyoung Go, Hyeonbin Hwang, Jinho Park, Xiang Yue, Sean Welleck, Graham Neubig, Moontae Lee, Minjoon Seo)
9. ZEUS: Zero-shot Embeddings for Unsupervised Separation of Tabular Data (Authors: Patryk Marszałek, Tomasz Kuśmierczyk, Witold Wydmański, Jacek Tabor, Marek Śmieja)
10. SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity (Authors: Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong)
11. Emergence of Structure in Ensembles of Random Neural Networks (Authors: Luca Muscarnera, Luigi Loreti, Giovanni Todeschini, Alessio Fumagalli, Francesco Regazzoni)

1. A probabilistic framework for dynamic quantization

ArXiv ID: 2505.10689

Authors: Gabriele Santini, Francesco Paissan, Elisabetta Farella

Abstract: We propose a probabilistic framework for dynamic quantization of neural networks that allows for a computationally efficient input-adaptive rescaling of the quantization parameters. Our framework applies a probabilistic model to the network's pre-activations through a lightweight surrogate, enabling the adaptive adjustment of the quantization parameters on a per-input basis without significant memory overhead. We validate our approach on a set of popular computer vision tasks and models, observing only a negligible loss in performance. Our method strikes the best performance and computational overhead tradeoff compared to standard quantization strategies.
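Illustration (not the authors' method): a minimal sketch of input-adaptive rescaling, assuming the pre-activations of one input are roughly Gaussian, so that an 8-bit range of mean plus or minus `k` standard deviations covers most values. The abstract says the statistics come from a lightweight surrogate; here they are simply computed from the tensor itself. The function names, the clipping width `k`, and the affine quantization scheme are all assumptions made for this example.

```python
import numpy as np

def dynamic_quant_params(pre_act, k=3.0, n_bits=8):
    """Pick an affine quantization range [mu - k*sigma, mu + k*sigma] for ONE input."""
    mu, sigma = float(pre_act.mean()), float(pre_act.std())
    lo, hi = mu - k * sigma, mu + k * sigma
    qmax = 2 ** n_bits - 1
    scale = max(hi - lo, 1e-8) / qmax      # width of one quantization step
    zero_point = int(round(-lo / scale))   # integer offset; not clamped in this sketch
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    """Map floats to unsigned integers, clipping values outside the chosen range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, size=1000).astype(np.float32)
large = 10.0 * small                       # same shape of distribution, 10x wider

for name, x in (("small", small), ("large", large)):
    scale, zp = dynamic_quant_params(x)
    err = np.abs(dequantize(quantize(x, scale, zp), scale, zp) - x)
    print(name, "scale:", scale, "median abs error:", float(np.median(err)))
```

The point of the sketch is that the step size follows each input's own spread, whereas a static scheme would fix one range for all inputs.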

Comment: The paper proposes a probabilistic framework for dynamic quantization that rescales the quantization parameters per input, which makes it relevant to model compression.

Relevance: 9 Novelty: 8

2. Parallel Scaling Law for Language Models

ArXiv ID: 2505.10475

Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu

Abstract: It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
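Illustration (not the authors' implementation): the three steps named in the abstract, transforming the input $P$ ways, running the same model on each copy, and aggregating the $P$ outputs, can be sketched on a toy one-layer model. The abstract does not say what the transformations or the aggregation look like, so the additive per-stream shifts and the softmax gate below are assumptions, as are all names.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def parscale_forward(x, W, stream_shifts, gate):
    """Toy parallel-stream forward pass.

    x:             (d_in,)       one input vector
    W:             (d_in, d_out) weights shared by all streams
    stream_shifts: (P, d_in)     per-stream input transformation (assumed additive)
    gate:          (d_out,)      scores each stream's output (assumed aggregation rule)
    """
    xs = x[None, :] + stream_shifts   # (P, d_in): P transformed copies of the input
    outs = np.tanh(xs @ W)            # (P, d_out): P passes through the same weights
    weights = softmax(outs @ gate)    # (P,): input-dependent aggregation weights
    return weights @ outs             # (d_out,): weighted combination of the outputs

rng = np.random.default_rng(0)
d_in, d_out, P = 8, 4, 4
x = rng.normal(size=d_in)
W = rng.normal(size=(d_in, d_out))
shifts = 0.1 * rng.normal(size=(P, d_in))
gate = rng.normal(size=d_out)
print(parscale_forward(x, W, shifts, gate))
```

Only `stream_shifts` and `gate` grow with $P$ in this toy; `W` is reused by every stream, which is the sense in which the extra computation reuses existing parameters.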

Comment: The paper introduces a new scaling paradigm for language models, which aligns with the interest in model architecture and efficiency improvements.

Relevance: 9 Novelty: 8

3. Neural Thermodynamic Laws for Large Language Model Training

ArXiv ID: 2505.10559

Authors: Ziming Liu, Yizhou Liu, Jeff Gore, Max Tegmark

Abstract: Beyond neural scaling laws, little is known about the laws underlying large language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) -- a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge under river-valley loss landscape assumptions. On the practical side, this scientific perspective yields intuitive guidelines for designing learning rate schedules.

Comment: The paper introduces Neural Thermodynamic Laws, offering theoretical insights into LLM training dynamics, which aligns with the interest in foundational research on LLMs.

Relevance: 9 Novelty: 8

4. Superposition Yields Robust Neural Scaling

ArXiv ID: 2505.10465

Authors: Yizhou Liu, Ziming Liu, Jeff Gore

Abstract: The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.
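Illustration: the geometric claim in the abstract, that squared overlaps between many vectors packed into a low-dimensional space scale inversely with the dimension, can be checked numerically for random unit vectors, whose expected squared overlap is exactly $1/m$ in dimension $m$. This is a sketch with random directions, not the paper's trained toy model; the function name and sizes are assumptions.

```python
import numpy as np

def mean_squared_overlap(n_vectors, dim, seed=0):
    """Average squared dot product between distinct random unit vectors in R^dim."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)    # project onto the unit sphere
    gram = v @ v.T                                   # all pairwise overlaps
    off_diag = gram[~np.eye(n_vectors, dtype=bool)]  # drop self-overlaps (all equal to 1)
    return float(np.mean(off_diag ** 2))

# 1024 vectors is more than every dimension tested, so the vectors cannot be orthogonal.
for dim in (16, 64, 256):
    print(dim, mean_squared_overlap(1024, dim), 1.0 / dim)
```

Each printed pair should be close: quadrupling the dimension cuts the average interference by about four.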

Comment: The paper provides insights into neural scaling laws and representation superposition, which is relevant to representation learning and foundational research in LLMs.

Relevance: 9 Novelty: 8

5. Learning Repetition-Invariant Representations for Polymer Informatics

ArXiv ID: 2505.10726

Authors: Yihan Zhu, Gang Liu, Eric Inae, Tengfei Luo, Meng Jiang

Abstract: Polymers are large macromolecules composed of repeating structural units known as monomers and are widely applied in fields such as energy storage, construction, medicine, and aerospace. However, existing graph neural network methods, though effective for small molecules, only model the single unit of polymers and fail to produce consistent vector representations for the true polymer structure with varying numbers of units. To address this challenge, we introduce Graph Repetition Invariance (GRIN), a novel method to learn polymer representations that are invariant to the number of repeating units in their graph representations. GRIN integrates a graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency. We provide theoretical guarantees for repetition-invariance from both model and data perspectives, demonstrating that three repeating units are the minimal augmentation required for optimal invariant representation learning. GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.
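Illustration (not GRIN itself): a toy showing what invariance to the number of repeating units means at the readout level. It ignores bonds and message passing entirely, which is where the real difficulty lies, and compares only how a sum readout and a mean readout react when the same repeat unit is tiled several times. The feature values and function names are made up for the example.

```python
import numpy as np

def repeat_chain(unit_features, n_repeats):
    """Node feature matrix of a chain built from n_repeats copies of one repeat unit."""
    return np.tile(unit_features, (n_repeats, 1))

def sum_readout(node_features):
    """Graph-level vector as the sum over nodes."""
    return node_features.sum(axis=0)

def mean_readout(node_features):
    """Graph-level vector as the mean over nodes."""
    return node_features.mean(axis=0)

unit = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 atoms, 2 features each
one, three = repeat_chain(unit, 1), repeat_chain(unit, 3)

print(np.allclose(sum_readout(one), sum_readout(three)))    # False: grows with chain length
print(np.allclose(mean_readout(one), mean_readout(three)))  # True: unchanged by repetition
```

With message passing, atoms at the junctions between units see different neighborhoods than atoms in an isolated unit, so a readout choice alone would not give consistent representations.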

Comment: The paper introduces GRIN, a method for learning polymer representations that are invariant to the number of repeating units, which aligns with the interest in representation learning.

Relevance: 9 Novelty: 8

6. FlowVAT: Normalizing Flow Variational Inference with Affine-Invariant Tempering

ArXiv ID: 2505.10466

Authors: Juehang Qin, Shixiao Liang, Christopher Tunnell

Abstract: Multi-modal and high-dimensional posteriors present significant challenges for variational inference, causing mode-seeking behavior and collapse despite the theoretical expressiveness of normalizing flows. Traditional annealing methods require temperature schedules and hyperparameter tuning, falling short of the goal of truly black-box variational inference. We introduce FlowVAT, a conditional tempering approach for normalizing flow variational inference that addresses these limitations. Our method tempers both the base and target distributions simultaneously, maintaining affine-invariance under tempering. By conditioning the normalizing flow on temperature, we leverage overparameterized neural networks' generalization capabilities to train a single flow representing the posterior across a range of temperatures. This preserves modes identified at higher temperatures when sampling from the variational posterior at $T = 1$, mitigating standard variational methods' mode-seeking behavior. In experiments with 2, 10, and 20 dimensional multi-modal distributions, FlowVAT outperforms traditional and adaptive annealing methods, finding more modes and achieving better ELBO values, particularly in higher dimensions where existing approaches fail. Our method requires minimal hyperparameter tuning and does not require an annealing schedule, advancing toward fully-automatic black-box variational inference for complicated posteriors.
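Illustration (not the authors' code): a sketch of why tempering the base and the target together is compatible with affine maps, assuming the common definition of tempering, $p_T(x) \propto p(x)^{1/T}$. Under that assumption a standard normal base becomes $N(0, T I)$ and a Gaussian target $N(\mu, \Sigma)$ becomes $N(\mu, T\Sigma)$, so the same affine map $x = \mu + Lz$ with $LL^\top = \Sigma$ carries one to the other at every temperature. The names and the Gaussian example are assumptions.

```python
import numpy as np

def tempered_log_prob(log_prob, temperature):
    """Return x -> log p(x) / T, the tempered log-density up to an additive constant."""
    return lambda x: log_prob(x) / temperature

def sample_tempered_base(rng, n, dim, temperature):
    """Tempering N(0, I) by 1/T gives N(0, T*I): scale standard normals by sqrt(T)."""
    return np.sqrt(temperature) * rng.normal(size=(n, dim))

mu = np.array([1.0, -2.0])
L = np.array([[2.0, 0.0], [0.5, 1.0]])  # affine map x = mu + L z
sigma = L @ L.T                         # covariance of the T = 1 target
T = 4.0

# Pushing the tempered base N(0, T*I) through the same affine map gives
# covariance L (T*I) L^T = T * sigma, i.e. the tempered Gaussian target.
pushed_cov = L @ (T * np.eye(2)) @ L.T
print(np.allclose(pushed_cov, T * sigma))

# Empirical check with samples drawn from the tempered base.
rng = np.random.default_rng(0)
z = sample_tempered_base(rng, 200_000, 2, T)
x = mu + z @ L.T
print(np.cov(x, rowvar=False))
print(T * sigma)
```

The two printed matrices should agree up to sampling noise. For non-Gaussian, multi-modal targets no single affine map suffices, which is where a temperature-conditioned flow would come in.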

Comment: The paper introduces FlowVAT, a novel method for variational inference using normalizing flows, which aligns with representation learning by addressing mode-seeking behavior and improving ELBO values.