Personalized Daily Arxiv Papers 02/14/2025

	Prompt	Completion	Total
Token	82335	6722	89057
Cost	$0.21	$0.07	$0.27

Total scanned papers: 337

Total relevant papers: 19

Table of contents with paper titles:

RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models Authors: Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang (Katie) Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
When do neural networks learn world models? Authors: Tianren Zhang, Guanyu Chen, Feng Chen
On the Importance of Embedding Norms in Self-Supervised Learning Authors: Andrew Draganov, Sharvaree Vadgama, Sebastian Damrich, Jan Niklas Böhm, Lucas Maes, Dmitry Kobak, Erik Bekkers
LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won't Fail) Authors: Junsu Kim, Jaeyeon Kim, Ernest K. Ryu
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation Authors: Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun
Spectral Journey: How Transformers Predict the Shortest Path Authors: Andrew Cohen, Andrey Gromov, Kaiyu Yang, Yuandong Tian
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU Authors: Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang
Scalable First-order Method for Certifying Optimal k-Sparse GLMs Authors: Jiachang Liu, Soroosh Shafiee, Andrea Lodi
On multi-token prediction for efficient LLM inference Authors: Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
Improving Deep Regression with Tightness Authors: Shihao Zhang, Yuguang Yan, Angela Yao
Generalizability through Explainability: Countering Overfitting with Counterfactual Examples Authors: Flavio Giorgi, Fabiano Veglianti, Fabrizio Silvestri, Gabriele Tolomei
New Bounds for Sparse Variational Gaussian Processes Authors: Michalis K. Titsias
Cost-Saving LLM Cascades with Early Abstention Authors: Michael J. Zellinger, Rex Liu, Matt Thomson
Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models Authors: Xin Zhou, Yiwen Guo, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Trust Me, I Know the Way: Predictive Uncertainty in the Presence of Shortcut Learning Authors: Lisa Wimmer, Bernd Bischl, Ludwig Bothmann
Biologically Plausible Brain Graph Transformer Authors: Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
Neural Force Field: Learning Generalized Physical Representation from a Few Examples Authors: Shiqian Li, Ruihong Shen, Chi Zhang, Yixin Zhu
Designing a Conditional Prior Distribution for Flow-Based Generative Models Authors: Noam Issachar, Mohammad Salama, Raanan Fattal, Sagie Benaim
CoT-Valve: Length-Compressible Chain-of-Thought Tuning Authors: Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang

1. RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models

ArXiv ID: 2502.09003

Authors: Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang (Katie) Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong

Abstract: Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations, and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performances across various tasks and different LLM architectures.
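
Illustration (not from the paper): a minimal Python sketch of why rotating a vector before quantization can help when the vector contains an outlier. It assumes a symmetric 4-bit per-tensor quantizer and a normalized Hadamard rotation; the function names and the toy values are hypothetical. It does not reproduce RoSTE's adaptive rotation selection or its fine-tuning loop.

```python
import math


def hadamard_rotate(x):
    """Normalized Walsh-Hadamard transform (an orthogonal rotation).

    Assumption for this sketch: len(x) is a power of two.
    """
    n = len(x)
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = math.sqrt(n)
    return [v / s for v in y]


def fake_quantize(values, bits=4):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


# Toy activation vector: one large outlier and three small entries.
x = [6.0, 0.5, -0.5, 0.5]
y = hadamard_rotate(x)  # the outlier's energy is spread over all entries

err_plain = squared_error(x, fake_quantize(x))
err_rotated = squared_error(y, fake_quantize(y))
print("max |x| =", max(abs(v) for v in x), " max |Hx| =", max(abs(v) for v in y))
print("quantization error without rotation:", err_plain)
print("quantization error with rotation:   ", err_rotated)
```

Because the rotation is orthogonal, the error measured in the rotated space equals the error after rotating back. On this toy vector the rotated error is smaller; other inputs can behave differently, which may be why a method would want to choose the rotation configuration adaptively.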

Comment: The paper proposes a quantization-aware fine-tuning approach for LLMs, which is highly relevant to model compression and efficiency.

Relevance: 10 Novelty: 8

2. When do neural networks learn world models?

ArXiv ID: 2502.09297

Authors: Tianren Zhang, Guanyu Chen, Feng Chen

Abstract: Humans develop world models that capture the underlying generation process of data. Whether neural networks can learn similar world models remains an open problem. In this work, we provide the first theoretical results for this problem, showing that in a multi-task setting, models with a low-degree bias provably recover latent data-generating variables under mild assumptions -- even if proxy tasks involve complex, non-linear functions of the latents. However, such recovery is also sensitive to model architecture. Our analysis leverages Boolean models of task solutions via the Fourier-Walsh transform and introduces new techniques for analyzing invertible Boolean transforms, which may be of independent interest. We illustrate the algorithmic implications of our results and connect them to related research areas, including self-supervised learning, out-of-distribution generalization, and the linear representation hypothesis in large language models.
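
Illustration (not from the paper): a short Python sketch of the Fourier-Walsh expansion that the abstract refers to, computed by brute force for Boolean functions on {-1, +1}^n. The helper names are hypothetical, and the exact notion of "low-degree bias" used in the paper is not specified in the abstract; here "degree" simply means the size of the largest subset with a nonzero coefficient.

```python
from itertools import combinations, product
from math import prod


def fourier_walsh(f, n):
    """Return {S: coefficient} for f: {-1,+1}^n -> R, where the coefficient
    of subset S is the average over all x of f(x) * prod_{i in S} x_i."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in combinations(range(n), k):
            total = sum(f(x) * prod(x[i] for i in S) for x in points)
            coeffs[S] = total / len(points)
    return coeffs


def weight_by_degree(coeffs, n):
    """Sum of squared coefficients at each degree 0..n."""
    weights = [0.0] * (n + 1)
    for S, c in coeffs.items():
        weights[len(S)] += c * c
    return weights


def majority(x):
    return 1 if sum(x) > 0 else -1


def parity(x):
    return prod(x)


maj = fourier_walsh(majority, 3)
par = fourier_walsh(parity, 3)
print(weight_by_degree(maj, 3))  # [0.0, 0.75, 0.0, 0.25]
print(weight_by_degree(par, 3))  # [0.0, 0.0, 0.0, 1.0]
```

Majority on three bits keeps most of its spectral weight at degree 1, while parity puts all of its weight at the top degree, so majority is the "lower-degree" of the two in this sense.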

Comment: The paper provides theoretical insights into when neural networks learn world models, which aligns with representation learning and foundational research into training dynamics.

Relevance: 9 Novelty: 9

3. On the Importance of Embedding Norms in Self-Supervised Learning

ArXiv ID: 2502.09252

Authors: Andrew Draganov, Sharvaree Vadgama, Sebastian Damrich, Jan Niklas Böhm, Lucas Maes, Dmitry Kobak, Erik Bekkers

Abstract: Self-supervised learning (SSL) allows training data representations without a supervised signal and has become an important paradigm in machine learning. Most SSL methods employ the cosine similarity between embedding vectors and hence effectively embed data on a hypersphere. While this seemingly implies that embedding norms cannot play any role in SSL, a few recent works have suggested that embedding norms have properties related to network convergence and confidence. In this paper, we resolve this apparent contradiction and systematically establish the embedding norm's role in SSL training. Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. Additionally, we show that manipulating embedding norms can have large effects on convergence speed. Our findings demonstrate that SSL embedding norms are integral to understanding and optimizing network behavior.
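
Illustration (not from the paper): a small Python sketch of one way an embedding's norm can matter even when the loss uses only cosine similarity. Rescaling an embedding leaves the cosine similarity unchanged, but the gradient of the cosine with respect to that embedding shrinks in proportion to 1/norm. This is a standard calculus fact; treating it as the mechanism behind the convergence-rate claim is an assumption here, and the function names are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def cosine_grad_wrt_a(a, b):
    """Analytic gradient of cosine(a, b) with respect to a:
    (b/|b| - cos(a, b) * a/|a|) / |a|."""
    na, nb = norm(a), norm(b)
    c = cosine(a, b)
    return [(y / nb - c * x / na) / na for x, y in zip(a, b)]


a = [3.0, 4.0]
b = [1.0, 0.0]
a_big = [10.0 * x for x in a]  # same direction, 10x the norm

print("cosine:", cosine(a, b), "vs", cosine(a_big, b))
print("gradient norm:", norm(cosine_grad_wrt_a(a, b)),
      "vs", norm(cosine_grad_wrt_a(a_big, b)))
```

With a fixed learning rate, the larger-norm embedding therefore receives a proportionally smaller update to its direction in this toy setting.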

Comment: This paper provides theoretical insights into the role of embedding norms in self-supervised learning, which aligns with representation learning and training dynamics in neural networks.