Personalized Daily Arxiv Papers 3/29/2025

[gpt-4o]	Prompt	Completion	Total
Token	24427	2937	27364
Cost	$0.06	$0.03	$0.09

Total arXiv papers: 205

Total scanned papers: 117

Total relevant papers: 15

Table of contents with paper titles:

Squared families: Searching beyond regular probability models Authors: Russell Tsuchida, Jiawei Liu, Cheng Soon Ong, Dino Sejdinovic
HOT: Hadamard-based Optimized Training Authors: Seonggon Kim, Juncheol Shin, Seung-taek Woo, Eunhyeok Park
How do language models learn facts? Dynamics, curricula and hallucinations Authors: Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De
Consistent Multigroup Low-Rank Approximation Authors: Antonis Matakos, Martino Ciaperoni, Heikki Mannila
MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness Authors: Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun (Eric) Liang, Xiang Chen
Model Assembly Learning with Heterogeneous Layer Weight Merging Authors: Yi-Kai Zhang, Jin Wang, Xu-Xiang Zhong, De-Chuan Zhan, Han-Jia Ye
Shared Global and Local Geometry of Language Model Embeddings Authors: Andrew Lee, Melanie Weber, Fernanda Viégas, Martin Wattenberg
F-INR: Functional Tensor Decomposition for Implicit Neural Representations Authors: Sai Karthikeya Vemuri, Tim Büchner, Joachim Denzler
Rethinking Graph Structure Learning in the Era of LLMs Authors: Zhihan Zhang, Xunkai Li, Guang Zeng, Hongchao Qin, Ronghua Li, Guoren Wang
Exploring the Energy Landscape of RBMs: Reciprocal Space Insights into Bosons, Hierarchical Learning and Symmetry Breaking Authors: J. Quetzalcóatl Toledo-Marin, Anindita Maiti, Geoffrey C. Fox, Roger G. Melko
Nonlinear Multiple Response Regression and Learning of Latent Spaces Authors: Ye Tian, Sanyou Wu, Long Feng
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models Authors: Pin-Yu Chen, Han Shen, Payel Das, Tianyi Chen
Uncertainty propagation in feed-forward neural network models Authors: Jeremy Diamzon, Daniele Venturi
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck Authors: Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos
Stochastic Engrams for Efficient Continual Learning with Binarized Neural Networks Authors: Isabelle Aguilar, Luis Fernando Herbozo Contreras, Omid Kavehei

1. Squared families: Searching beyond regular probability models

ArXiv ID: 2503.21128

Authors: Russell Tsuchida, Jiawei Liu, Cheng Soon Ong, Dino Sejdinovic

Abstract: We introduce squared families, which are families of probability densities obtained by squaring a linear transformation of a statistic. Squared families are singular, however their singularity can easily be handled so that they form regular models. After handling the singularity, squared families possess many convenient properties. Their Fisher information is a conformal transformation of the Hessian metric induced from a Bregman generator. The Bregman generator is the normalising constant, and yields a statistical divergence on the family. The normalising constant admits a helpful parameter-integral factorisation, meaning that only one parameter-independent integral needs to be computed for all normalising constants in the family, unlike in exponential families. Finally, the squared family kernel is the only integral that needs to be computed for the Fisher information, statistical divergence and normalising constant. We then describe how squared families are special in the broader class of $g$-families, which are obtained by applying a sufficiently regular function $g$ to a linear transformation of a statistic. After removing special singularities, positively homogeneous families and exponential families are the only $g$-families for which the Fisher information is a conformal transformation of the Hessian metric, where the generator depends on the parameter only through the normalising constant. Even-order monomial families also admit parameter-integral factorisations, unlike exponential families. We study parameter estimation and density estimation in squared families, in the well-specified and misspecified settings. We use a universal approximation property to show that squared families can learn sufficiently well-behaved target densities at a rate of $\mathcal{O}(N^{-1/2})+C n^{-1/4}$, where $N$ is the number of datapoints, $n$ is the number of parameters, and $C$ is some constant.

Comment: The paper introduces squared families, a novel statistical framework with foundational insights into probability models and their properties. It aligns with the 'Emerging Trends' criterion due to its theoretical contributions challenging established assumptions in statistical modeling.

Relevance: 9 Novelty: 9
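
Illustration: the parameter-integral factorisation described in the abstract above can be sketched in a toy setting. Everything concrete here is an assumption for illustration and is not taken from the paper: Lebesgue base measure on [0, 1], the hypothetical statistic t(x) = (1, x, x^2), and a scalar linear map theta . t(x). Under these assumptions the kernel K = integral of t(x) t(x)^T dx is computed once, and the normalising constant for any theta is the quadratic form theta^T K theta.

```python
# Toy squared family on [0, 1] (assumed setup, not the paper's experiments):
#   p(x | theta) = (theta . t(x))^2 / z(theta),  with  z(theta) = theta^T K theta.

def statistic(x):
    """Hypothetical statistic t(x) = (1, x, x^2)."""
    return [1.0, x, x * x]

def kernel_matrix(dim=3):
    """K[i][j] = integral over [0, 1] of x^i * x^j dx = 1 / (i + j + 1).

    This is the single parameter-independent integral; it is reused for every theta.
    """
    return [[1.0 / (i + j + 1) for j in range(dim)] for i in range(dim)]

def normalising_constant(theta, K):
    """Quadratic form theta^T K theta; no new integral is needed per theta."""
    n = len(theta)
    return sum(theta[i] * K[i][j] * theta[j] for i in range(n) for j in range(n))

def density(x, theta, K):
    """Squared linear function of the statistic, divided by the normaliser."""
    lin = sum(a * b for a, b in zip(theta, statistic(x)))
    return lin * lin / normalising_constant(theta, K)

def midpoint_integral(f, n=20000):
    """Plain midpoint rule on [0, 1], used only as an independent check."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))

K = kernel_matrix()
for theta in ([1.0, 0.0, 0.0], [1.0, -2.0, 0.5], [0.3, 1.0, -1.0]):
    total = midpoint_integral(lambda x: density(x, theta, K))
    print(theta, "integrates to approximately", round(total, 6))
```

Each density should integrate to approximately 1 even though K is computed only once. In an exponential family, by contrast, the log-partition function generally has to be re-evaluated for each parameter value.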

2. HOT: Hadamard-based Optimized Training

ArXiv ID: 2503.21261

Authors: Seonggon Kim, Juncheol Shin, Seung-taek Woo, Eunhyeok Park

Abstract: It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6 times acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.

Comment: The paper introduces Hadamard-based optimizations for backpropagation, which aligns with the 'Model Compression' criterion due to its focus on memory and computational efficiency.
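
Illustration: the abstract above mentions Hadamard quantization without giving details. The following is a minimal, generic sketch of the underlying idea only, not HOT's actual method: rotate a vector with an orthonormal Walsh-Hadamard transform so that a single outlier is spread across all coordinates, quantize in the rotated domain, then rotate back. The function names, the 4-bit symmetric per-tensor quantizer, and the example vector are all assumptions chosen for illustration.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) normalisation the transform is its own inverse.
    """
    out = list(values)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

def fake_quantize(values, bits=4):
    """Symmetric per-tensor quantize-dequantize (an assumed, simple quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def l2_error(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

# One large outlier followed by 15 small entries (made-up data).
x = [16.0, 0.9, -0.7, 0.8, -1.0, 0.6, -0.9, 1.0,
     -0.8, 0.7, -0.6, 0.9, -1.0, 0.8, -0.7, 1.0]

direct = fake_quantize(x)                    # quantize in the original basis
rotated = fwht(fake_quantize(fwht(x)))       # quantize in the Hadamard basis

direct_err = l2_error(x, direct)
hadamard_err = l2_error(x, rotated)
print("direct 4-bit error:  ", direct_err)
print("Hadamard 4-bit error:", hadamard_err)
```

In this example the outlier forces a coarse step size in the original basis, so every small entry rounds to zero; after the rotation the dynamic range shrinks and the reconstruction error is smaller. How HOT chooses between quantization and low-rank approximation per backward path is described in the paper itself and is not modelled here.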