Personalized Daily Arxiv Papers 01/22/2025

Total cost: $6.3608

Total relevant papers: 70

Paper selection prompt and criteria at the bottom

Table of contents with paper titles:

RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? Authors: Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang
Physics of Skill Learning Authors: Ziming Liu, Yizhou Liu, Eric J. Michaud, Jeff Gore, Max Tegmark
Can Bayesian Neural Networks Make Confident Predictions? Authors: Katharine Fisher, Youssef Marzouk
Universality of Benign Overfitting in Binary Linear Classification Authors: Ichiro Hashimoto, Stanislav Volgushev, Piotr Zwiernik
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models Authors: Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models Authors: Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
Playing the Lottery With Concave Regularizers for Sparse Trainable Neural Networks Authors: Giulia Fracastoro, Sophie M. Fosson, Andrea Migliorati, Giuseppe C. Calafiore
Higher Order Approximation Rates for ReLU CNNs in Korobov Spaces Authors: Yuwen Li, Guozhi Zhang
DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference Authors: Yujie Zhang, Shivam Aggarwal, Tulika Mitra
Nested Annealed Training Scheme for Generative Adversarial Networks Authors: Chang Wan, Ming-Hsuan Yang, Minglu Li, Yunliang Jiang, Zhonglong Zheng
Generalizable Spectral Embedding with an Application to UMAP Authors: Nir Ben-Ari, Amitai Yacobi, Uri Shaham
Is logical analysis performed by transformers taking place in self-attention or in the fully connected part? Authors: Evgeniy Shin, Heinrich Matzinger
Meta-Sparsity: Learning Optimal Sparse Structures in Multi-task Networks through Meta-learning Authors: Richa Upadhyay, Ronald Phlypo, Rajkumar Saini, Marcus Liwicki
ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models Authors: Yassir Bendou, Amine Ouasfi, Vincent Gripon, Adnane Boukhayma
Issues with Neural Tangent Kernel Approach to Neural Networks Authors: Haoran Liu, Anthony Tai, David J. Crandall, Chunfeng Huang
The Geometry of Tokens in Internal Representations of Large Language Models Authors: Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti
Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy Authors: Saeid Asgari Taghanaki, Joao Monteiro
Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing Authors: Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai
Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference Authors: Pouya Hamadanian, Sadjad Fouladi
Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades Authors: Chongjie Si, Jingjing Jiang, Wei Shen
Test-time regression: a unifying framework for designing sequence models with associative memory Authors: Ke Alexander Wang, Jiaxin Shi, Emily B. Fox
The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities Authors: Yongwei Che, Benjamin Eysenbach
LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator Authors: Guoyu Li (University of Chinese Academy of Sciences, Microsoft Research), Shengyu Ye (Microsoft Research), Chunyun Chen (NTU Singapore), Yang Wang (Microsoft Research), Fan Yang (Microsoft Research), Ting Cao (Microsoft Research), Cheng Liu (University of Chinese Academy of Sciences), Mohamed M. Sabry (NTU Singapore), Mao Yang (Microsoft Research)
HAC++: Towards 100X Compression of 3D Gaussian Splatting Authors: Yihang Chen, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, Jianfei Cai
A Metric Topology of Deep Learning for Data Classification Authors: Jwo-Yuh Wu, Liang-Chi Huang, Wen-Hsuan Li, Chun-Hung Liu
Jailbreaking Large Language Models in Infinitely Many Ways Authors: Oliver Goldstein, Emanuele La Malfa, Felix Drinkall, Samuele Marro, Michael Wooldridge
FOCUS: First Order Concentrated Updating Scheme Authors: Yizhou Liu, Ziming Liu, Jeff Gore
Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities Authors: Qirun Dai, Dylan Zhang, Jiaqi W. Ma, Hao Peng
EDoRA: Efficient Weight-Decomposed Low-Rank Adaptation via Singular Value Decomposition Authors: Hamid Nasiri, Peter Garraghan
MirrorCBO: A consensus-based optimization method in the spirit of mirror descent Authors: Leon Bungert, Franca Hoffmann, Doh Yeon Kim, Tim Roith
SeRpEnt: Selective Resampling for Expressive State Space Models Authors: Stefano Rando, Luca Romani, Matteo Migliarini, Luca Franco, Denis Gudovskiy, Fabio Galasso
Fast sparse optimization via adaptive shrinkage Authors: Vito Cerone, Sophie M. Fosson, Diego Regruto
Simulation of Hypergraph Algorithms with Looped Transformers Authors: Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Zhen Zhuang
FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models Authors: Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu
ARD-VAE: A Statistical Formulation to Find the Relevant Latent Dimensions of Variational Autoencoders Authors: Surojit Saha, Sarang Joshi, Ross Whitaker
SILO: Solving Inverse Problems with Latent Operators Authors: Ron Raphaeli, Sean Man, Michael Elad
MoGERNN: An Inductive Traffic Predictor for Unobserved Locations in Dynamic Sensing Networks Authors: Qishen Zhou, Yifan Zhang, Michail A. Makridis, Anastasios Kouvelas, Yibing Wang, Simon Hu
Toward Effective Digraph Representation Learning: A Magnetic Adaptive Propagation based Approach Authors: Xunkai Li, Daohan Su, Zhengyu Wu, Guang Zeng, Hongchao Qin, Rong-Hua Li, Guoren Wang
Control LLM: Controlled Evolution for Intelligence Retention in LLM Authors: Haichao Wei, Yunxiang Ren, Zhoutong Fu, Aman Lunia, Yi-Lin Chen, Alice Leung, Ya Xu
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel
SMamba: Sparse Mamba for Event-based Object Detection Authors: Nan Yang, Yang Wang, Zhanwen Liu, Meng Li, Yisheng An, Xiangmo Zhao
Systematic Abductive Reasoning via Diverse Relation Representations in Vector-symbolic Architecture Authors: Zhong-Hua Sun, Ru-Yuan Zhang, Zonglei Zhen, Da-Hui Wang, Yong-Jie Li, Xiaohong Wan, Hongzhi You
Empirical Bayes Estimation for Lasso-Type Regularizers: Analysis of Automatic Relevance Determination Authors: Tsukasa Yoshida, Kazuho Watanabe
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity Authors: David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback Authors: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
Ditto: Accelerating Diffusion Model via Temporal Value Similarity Authors: Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, Won Woo Ro
Graph-defined Language Learning with LLMs Authors: Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
A margin-based replacement for cross-entropy loss Authors: Michael W. Spratling, Heiko H. Schütt
A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity Authors: Jamie Lohoff, Anil Kaya, Florian Assmuth, Emre Neftci
Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search Authors: Daniel Severo, Giuseppe Ottaviano, Matthew Muckley, Karen Ullrich, Matthijs Douze
Exploring Transferable Homogeneous Groups for Compositional Zero-Shot Learning Authors: Zhijie Rao, Jingcai Guo, Miaoge Li, Yang Chen
Uncertainty Quantification With Noise Injection in Neural Networks: A Bayesian Perspective Authors: Xueqiong Yuan, Jipeng Li, Ercan Engin Kuruoglu
SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks Authors: Wentao Wan, Zhuojie Yang, Yongcan Chen, Chenglin Luo, Ruilin Wang, Kehao Cai, Nan Kang, Liang Lin, Keze Wang
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling Authors: Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong
Quantitative Error Bounds for Scaling Limits of Stochastic Iterative Algorithms Authors: Xiaoyu Wang, Mikolaj J. Kasprzak, Jeffrey Negrea, Solesne Bourguin, Jonathan H. Huggins
Parallel Sequence Modeling via Generalized Spatial Propagation Network Authors: Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu
Score Combining for Contrastive OOD Detection Authors: Edward T. Reehorst, Philip Schniter
Block Flow: Learning Straight Flow on Data Blocks Authors: Zibin Wang, Zhiyuan Ouyang, Xiangyun Zhang
With Great Backbones Comes Great Adversarial Transferability Authors: Erik Arakelyan, Karen Hambardzumyan, Davit Papikyan, Pasquale Minervini, Albert Gordo, Isabelle Augenstein, Aram H. Markosyan
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding Authors: Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou
Dynamic Continual Learning: Harnessing Parameter Uncertainty for Improved Network Adaptation Authors: Christopher Angelini, Nidhal Bouaynaya
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık
Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos Authors: Yanlai Yang, Mengye Ren
Unsupervised Learning in Echo State Networks for Input Reconstruction Authors: Taiki Yamada, Yuichi Katori, Kantaro Fujiwara
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility Authors: William Held, Bhargavi Paranjape, Punit Singh Koura, Mike Lewis, Frank Zhang, Todor Mihaylov
Mutual Regression Distance Authors: Dong Qiao, Jicong Fan
A Hybrid Supervised and Self-Supervised Graph Neural Network for Edge-Centric Applications Authors: Eugenio Borzone, Leandro Di Persia, Matias Gerard
CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning Authors: Yuanheng Fang, Guoqing Chao, Wenqiang Lei, Shaobo Li, Dianhui Chu
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model Authors: Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Disentangled Interpretable Representation for Efficient Long-term Time Series Forecasting Authors: Yuang Zhao, Tianyu Li, Jiadong Chen, Shenrong Ye, Fuxin Jiang, Tieying Zhang, Xiaofeng Gao

0. RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?

ArXiv ID: 2501.11284

Authors: Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang

Abstract: Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs and different sizes, we uncover the ingredients for specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME), it solves 46.7% of problems using only 21k mixed-code-math datasets. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited data, and set a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.

Comment: Explores scaling Long Chain-of-Thought reasoning in LLMs, demonstrating breakthroughs in 'slow-thinking' reasoning improvements through detailed experiments. High relevance for architecture insights in LLMs with well-established novelty.

Relevance: 10 Novelty: 9

1. Physics of Skill Learning

ArXiv ID: 2501.12391

Authors: Ziming Liu, Yizhou Liu, Eric J. Michaud, Jeff Gore, Max Tegmark

Abstract: We aim to understand physics of skill learning, i.e., how skills are learned in neural networks during training. We start by observing the Domino effect, i.e., skills are learned sequentially, and notably, some skills kick off learning right after others complete learning, similar to the sequential fall of domino cards. To understand the Domino effect and relevant behaviors of skill learning, we take physicists' approach of abstraction and simplification. We propose three models with varying complexities -- the Geometry model, the Resource model, and the Domino model, trading between reality and simplicity. The Domino effect can be reproduced in the Geometry model, whose resource interpretation inspires the Resource model, which can be further simplified to the Domino model. These models present different levels of abstraction and simplification; each is useful to study some aspects of skill learning. The Geometry model provides interesting insights into neural scaling laws and optimizers; the Resource model sheds light on the learning dynamics of compositional tasks; the Domino model reveals the benefits of modularity. These models are not only conceptually interesting -- e.g., we show how Chinchilla scaling laws can emerge from the Geometry model, but also are useful in practice by inspiring algorithmic development -- e.g., we show how simple algorithmic changes, motivated by these toy models, can speed up the training of deep learning models.

Comment: Provides theoretical insights into how neural networks learn and encode information through novel models, directly aligning with representation learning and theoretical work.

Relevance: 10 Novelty: 9

2. Can Bayesian Neural Networks Make Confident Predictions?

ArXiv ID: 2501.11773

Authors: Katharine Fisher, Youssef Marzouk

Abstract: Bayesian inference promises a framework for principled uncertainty quantification of neural network predictions. Barriers to adoption include the difficulty of fully characterizing posterior distributions on network parameters and the interpretability of posterior predictive distributions. We demonstrate that under a discretized prior for the inner layer weights, we can exactly characterize the posterior predictive distribution as a Gaussian mixture. This setting allows us to define equivalence classes of network parameter values which produce the same likelihood (training error) and to relate the elements of these classes to the network's scaling regime -- defined via ratios of the training sample size, the size of each layer, and the number of final layer parameters. Of particular interest are distinct parameter realizations that map to low training error and yet correspond to distinct modes in the posterior predictive distribution. We identify settings that exhibit such predictive multimodality, and thus provide insight into the accuracy of unimodal posterior approximations. We also characterize the capacity of a model to "learn from data" by evaluating contraction of the posterior predictive in different scaling regimes.

Comment: Introduces a Bayesian framework that precisely characterizes predictive distributions in neural networks, offering theoretical insights valuable for understanding representation learning in scaling regimes. Strong alignment with foundational research.

Relevance: 9 Novelty: 9

3. Universality of Benign Overfitting in Binary Linear Classification

ArXiv ID: 2501.10538

Authors: Ichiro Hashimoto, Stanislav Volgushev, Piotr Zwiernik

Abstract: The practical success of deep learning has led to the discovery of several surprising phenomena. One of these phenomena, which has spurred intense theoretical research, is "benign overfitting": deep neural networks seem to generalize well in the over-parametrized regime even though the networks show a perfect fit to noisy training data. It is now known that benign overfitting also occurs in various classical statistical models. For linear maximum margin classifiers, benign overfitting has been established theoretically in a class of mixture models with very strong assumptions on the covariate distribution. However, even in this simple setting, many questions remain open. For instance, most of the existing literature focuses on the noiseless case where all true class labels are observed without errors, whereas the more interesting noisy case remains poorly understood. We provide a comprehensive study of benign overfitting for linear maximum margin classifiers. We discover a phase transition in test error bounds for the noisy model which was previously unknown and provide some geometric intuition behind it. We further considerably relax the required covariate assumptions in both the noisy and noiseless cases. Our results demonstrate that benign overfitting of maximum margin classifiers holds in a much wider range of scenarios than was previously known and provide new insights into the underlying mechanisms.

Comment: Provides theoretical insights into benign overfitting in linear classification models, significantly relaxing covariate assumptions and discovering new phase transitions. This paper aligns well with theoretical advancements in representation learning.

Relevance: 9 Novelty: 9

4. Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

ArXiv ID: 2501.12370

Authors: Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak

Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Expert models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in terms of both pretraining and downstream performance. We find that under different constraints (e.g. parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.

Comment: The paper investigates the interplay between sparsity in Mixture-of-Experts (MoE) models and scaling laws, which is highly relevant to model architecture and compression topics. The exploration of optimal sparsity levels provides theoretical insights into designing efficient MoE models.

Relevance: 10 Novelty: 8
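
The sparsity level used in the abstract above (the ratio of non-active to total parameters) can be made concrete with a small sketch. The helper names, the top-k routing assumption, and the parameter counts below are hypothetical illustrations, not taken from the paper.

```python
def sparsity_level(total_params: int, active_params: int) -> float:
    """Ratio of non-active to total parameters, as defined in the abstract."""
    if not 0 < active_params <= total_params:
        raise ValueError("need 0 < active_params <= total_params")
    return (total_params - active_params) / total_params


def moe_param_counts(num_experts: int, top_k: int,
                     params_per_expert: int, shared_params: int = 0):
    """Hypothetical accounting for one MoE model with top-k routing.

    Assumes every expert has the same size and that exactly top_k experts
    are active per token; shared_params covers always-active weights.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Illustrative numbers only: 8 experts of 1M parameters each, 2 active per token.
total, active = moe_param_counts(num_experts=8, top_k=2, params_per_expert=1_000_000)
print(sparsity_level(total, active))  # 0.75
```

Under this accounting, adding experts while holding top_k fixed raises the sparsity level and the parameter count but leaves the active parameters per token (a rough proxy for FLOPs per example) unchanged.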

5. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

ArXiv ID: 2501.11873

Authors: Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

Abstract: This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of the expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences. So, the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.

Comment: The paper improves load-balancing loss calculation for Mixture-of-Experts (MoE) models, directly addressing foundational architecture challenges. The focus on specialization and load balancing makes it highly relevant.

Relevance: 10 Novelty: 8
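
To illustrate the difference between micro-batch and global-batch LBL described in the abstract above, here is a minimal pure-Python sketch of $N_E \sum_{i=1}^{N_E} f_i p_i$. It is not the authors' implementation: top-1 routing, equal-sized micro-batches, the function names, and the toy data are all assumptions, and the "synchronization" of $f_i$ is modeled as a plain average across micro-batches.

```python
from typing import List, Sequence, Tuple

# One routed token: (index of the selected expert, gating scores over all experts).
Token = Tuple[int, Sequence[float]]


def expert_stats(tokens: List[Token], num_experts: int):
    """Return (f, p): selection frequency and mean gating score per expert."""
    n = len(tokens)
    freq = [0.0] * num_experts
    mean_score = [0.0] * num_experts
    for expert, scores in tokens:
        freq[expert] += 1.0 / n
        for i, s in enumerate(scores):
            mean_score[i] += s / n
    return freq, mean_score


def lbl(freq: Sequence[float], mean_score: Sequence[float]) -> float:
    """Load-balancing loss N_E * sum_i f_i * p_i."""
    return len(freq) * sum(f * p for f, p in zip(freq, mean_score))


def micro_batch_lbl(micro_batches: List[List[Token]], num_experts: int) -> float:
    """f_i and LBL computed inside each micro-batch, then averaged."""
    losses = [lbl(*expert_stats(mb, num_experts)) for mb in micro_batches]
    return sum(losses) / len(losses)


def global_batch_lbl(micro_batches: List[List[Token]], num_experts: int) -> float:
    """f_i averaged across micro-batches first (assumes equal sizes), then LBL."""
    stats = [expert_stats(mb, num_experts) for mb in micro_batches]
    global_freq = [sum(f[i] for f, _ in stats) / len(stats)
                   for i in range(num_experts)]
    losses = [lbl(global_freq, p) for _, p in stats]
    return sum(losses) / len(losses)


# Toy data: a "code" micro-batch routed entirely to expert 0 and a "prose"
# micro-batch routed entirely to expert 1 (fully specialized experts).
code_mb = [(0, [0.9, 0.1])] * 4
prose_mb = [(1, [0.1, 0.9])] * 4
print(micro_batch_lbl([code_mb, prose_mb], num_experts=2))
print(global_batch_lbl([code_mb, prose_mb], num_experts=2))
```

On this toy data the micro-batch loss is about 1.8 while the global-batch loss is about 1.0, the value a perfectly balanced router would obtain. The specialized routing is penalized only when balance is measured inside each micro-batch, which matches the motivation given in the abstract.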

6. Playing the Lottery With Concave Regularizers for Sparse Trainable Neural Networks

ArXiv ID: 2501.11135

Authors: Giulia Fracastoro, Sophie M. Fosson, Andrea Migliorati, Giuseppe C. Calafiore

Abstract: The design of sparse neural networks, i.e., of networks with a reduced number of parameters, has been attracting increasing research attention in the last few years. The use of sparse models may significantly reduce the computational and storage footprint in the inference phase. In this context, the lottery ticket hypothesis (LTH) constitutes a breakthrough result, that addresses not only the performance of the inference phase, but also of the training phase. It states that it is possible to extract effective sparse subnetworks, called winning tickets, that can be trained in isolation. The development of effective methods to play the lottery, i.e., to find winning tickets, is still an open problem. In this article, we propose a novel class of methods to play the lottery. The key point is the use of concave regularization to promote the sparsity of a relaxed binary mask, which represents the network topology. We theoretically analyze the effectiveness of the proposed method in the convex framework. Then, we propose extended numerical tests on various datasets and architectures, that show that the proposed method can improve the performance of state-of-the-art algorithms.

Comment: Presents a novel method for discovering sparse trainable neural networks using concave regularizers, directly addressing sparsity and efficient training, which aligns closely with model compression and theoretical insights.

Relevance: 10 Novelty: 8
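
The abstract above does not say which concave regularizer the authors use. As a rough illustration of why a concave penalty promotes sparsity in a relaxed mask, the sketch below compares the convex L1 penalty with the log-sum penalty, a standard concave choice that is an assumption here. The function names, the `eps` value, and the two example masks are hypothetical.

```python
import math
from typing import Sequence


def l1_penalty(mask: Sequence[float]) -> float:
    """Convex baseline: sum of absolute mask values."""
    return sum(abs(m) for m in mask)


def log_sum_penalty(mask: Sequence[float], eps: float = 0.1) -> float:
    """Concave (in |m|) penalty: sum_i log(1 + |m_i| / eps).

    Assumed stand-in for the paper's regularizer, which the abstract
    does not specify.
    """
    return sum(math.log(1.0 + abs(m) / eps) for m in mask)


# Two relaxed masks in [0, 1] with the same L1 norm.
sparse_mask = [1.0, 0.0, 0.0, 0.0]      # one connection fully kept
dense_mask = [0.25, 0.25, 0.25, 0.25]   # weight spread over all connections

print(l1_penalty(sparse_mask), l1_penalty(dense_mask))
print(log_sum_penalty(sparse_mask), log_sum_penalty(dense_mask))
```

Both masks have the same L1 penalty, so L1 cannot distinguish them, whereas the log-sum penalty is clearly lower for the sparse mask. That preference for concentrating weight on few entries is the general reason concave penalties are used to push relaxed masks toward binary, sparse solutions.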

7. Higher Order Approximation Rates for ReLU CNNs in Korobov Spaces

ArXiv ID: 2501.11275

Authors: Yuwen Li, Guozhi Zhang

Abstract: This paper investigates the $L_p$ approximation error for higher order Korobov functions using deep convolutional neural networks (CNNs) with ReLU activation. For target functions having a mixed derivative of order m+1 in each direction, we improve the classical approximation rate of second order to (m+1)-th order (modulo a logarithmic factor) in terms of the depth of CNNs. The key ingredient in our analysis is an approximate representation of high-order sparse grid basis functions by CNNs. The results suggest that higher order expressivity of CNNs does not severely suffer from the curse of dimensionality.

Comment: This paper delivers theoretical insights into CNNs with ReLU activations achieving higher-order approximation rates in Korobov spaces, closely aligning with fundamental topics in model architecture and theoretical representation learning.

Relevance: 9 Novelty: 9

8. DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference

ArXiv ID: 2501.10375

Authors: Yujie Zhang, Shivam Aggarwal, Tulika Mitra

Abstract: Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.

Comment: DAOP focuses on optimizing Mixture-of-Experts (MoE) inference on memory-constrained devices, introducing a novel mechanism for expert allocation and predictive pre-calculation. Its relevance to MoE and model efficiency makes it highly suitable.

Relevance: 9 Novelty: 8

9. Nested Annealed Training Scheme for Generative Adversarial Networks

ArXiv ID: 2501.11318

Authors: Chang Wan, Ming-Hsuan Yang, Minglu Li, Yunliang Jiang, Zhonglong Zheng

Abstract: Recently, researchers have proposed many deep generative models, including generative adversarial networks (GANs) and denoising diffusion models. Although significant breakthroughs have been made and empirical success has been achieved with the GAN, its mathematical underpinnings remain relatively unknown. This paper focuses on a rigorous mathematical theoretical framework: the composite-functional-gradient GAN (CFG) [1]. Specifically, we reveal the theoretical connection between the CFG model and score-based models. We find that the training objective of the CFG discriminator is equivalent to finding an optimal D(x). The optimal gradient of D(x) differentiates the integral of the differences between the score functions of real and synthesized samples. Conversely, training the CFG generator involves finding an optimal G(x) that minimizes this difference. In this paper, we aim to derive an annealed weight preceding the weight of the CFG discriminator. This new explicit theoretical explanation model is called the annealed CFG method. To overcome the limitation of the annealed CFG method, as the method is not readily applicable to the SOTA GAN model, we propose a nested annealed training scheme (NATS). This scheme keeps the annealed weight from the CFG method and can be seamlessly adapted to various GAN models, no matter their structural, loss, or regularization differences. We conduct thorough experimental evaluations on various benchmark datasets for image generation. The results show that our annealed CFG and NATS methods significantly improve the quality and diversity of the synthesized samples. This improvement is clear when comparing the CFG method and the SOTA GAN models.

Comment: This paper introduces a nested annealed training scheme for GANs and develops theoretical insights into GAN optimization. Its focus on foundational training paradigms for generative models aligns well with our interest in framework-level innovations.

Relevance: 9 Novelty: 8

10. Generalizable Spectral Embedding with an Application to UMAP

ArXiv ID: 2501.11305

Authors: Nir Ben-Ari, Amitai Yacobi, Uri Shaham

Abstract: Spectral Embedding (SE) is a popular method for dimensionality reduction, applicable across diverse domains. Nevertheless, its current implementations face three prominent drawbacks which curtail its broader applicability: generalizability (i.e., out-of-sample extension), scalability, and eigenvectors separation. In this paper, we introduce GrEASE: Generalizable and Efficient Approximate Spectral Embedding, a novel deep-learning approach designed to address these limitations. GrEASE incorporates an efficient post-processing step to achieve eigenvectors separation, while ensuring both generalizability and scalability, allowing for the computation of the Laplacian's eigenvectors on unseen data. This method expands the applicability of SE to a wider range of tasks and can enhance its performance in existing applications. We empirically demonstrate GrEASE's ability to consistently approximate and generalize SE, while ensuring scalability. Additionally, we show how GrEASE can be leveraged to enhance existing methods. Specifically, we focus on UMAP, a leading visualization technique, and introduce NUMAP, a generalizable version of UMAP powered by GrEASE. Our codes are publicly available.

Comment: GrEASE introduces a novel deep learning-based approach for spectral embedding, addressing scalability, generalizability, and eigenvector separation. It directly contributes to representation learning and introduces theoretical innovations in dimensionality reduction, particularly enhancing UMAP.

Relevance: 9 Novelty: 8

11. Is logical analysis performed by transformers taking place in self-attention or in the fully connected part?

ArXiv ID: 2501.11765

Authors: Evgeniy Shin, Heinrich Matzinger

Abstract: The Transformer architecture applies self-attention to tokens represented as vectors, before a fully connected (neural network) layer. These two parts can be layered many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper, we show that, quite counter-intuitively, the logical analysis can also be performed within the self-attention. For this we implement a handcrafted single-level encoder layer which performs the logical analysis within self-attention. We then study the scenario in which a one-level transformer model undergoes self-learning using gradient descent. We investigate whether the model utilizes fully connected layers or self-attention mechanisms for logical analysis when it has the choice. Given that gradient descent can become stuck at undesired zeros, we explicitly calculate these unwanted zeros and find ways to avoid them. We do all this in the context of predicting grammatical category pairs of adjacent tokens in a text. We believe that our findings have broader implications for understanding the potential logical operations performed by self-attention.

Comment: The paper provides theoretical insights into how Transformers perform logical analysis, especially focusing on self-attention versus fully connected layers. This directly aligns with understanding foundational aspects of Transformer architecture and presents an innovative perspective on their behavior.

Relevance: 9 Novelty: 8

12. Meta-Sparsity: Learning Optimal Sparse Structures in Multi-task Networks through Meta-learning

ArXiv ID: 2501.12115

Authors: Richa Upadhyay, Ronald Phlypo, Rajkumar Saini, Marcus Liwicki

Abstract: This paper presents meta-sparsity, a framework for learning model sparsity (that is, learning the parameter that controls the degree of sparsity) that allows deep neural networks (DNNs) to inherently generate optimal sparse shared structures in a multi-task learning (MTL) setting. This proposed approach enables the dynamic learning of sparsity patterns across a variety of tasks, unlike traditional sparsity methods that rely heavily on manual hyperparameter tuning. Inspired by Model Agnostic Meta-Learning (MAML), the emphasis is on learning shared and optimally sparse parameters in multi-task scenarios by implementing a penalty-based, channel-wise structured sparsity during the meta-training phase. This method improves the model's efficacy by removing unnecessary parameters and enhances its ability to handle both seen and previously unseen tasks. The effectiveness of meta-sparsity is rigorously evaluated by extensive experiments on two datasets, NYU-v2 and CelebAMask-HQ, covering a broad spectrum of tasks ranging from pixel-level to image-level predictions. The results show that the proposed approach performs well across many tasks, indicating its potential as a versatile tool for creating efficient and adaptable sparse neural networks. This work, therefore, presents an approach towards learning sparsity, contributing to the efforts in the field of sparse neural networks and suggesting new directions for research towards parsimonious models.

Comment: Proposes 'meta-sparsity,' a framework leveraging meta-learning to dynamically learn optimal sparsity in multi-task networks. This aligns with the 'Model Compression' topic through sparse/dynamic network adaptation, offering theoretical and methodological advances.

Relevance: 9 Novelty: 8

13. ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models

ArXiv ID: 2501.11175

Authors: Yassir Bendou, Amine Ouasfi, Vincent Gripon, Adnane Boukhayma

Abstract: The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed-form solution and achieves state-of-the-art performance across 11 datasets in the standard few-shot adaptation benchmark.

Comment: Proposes a theoretical advancement by reinterpreting caching methods like Tip-Adapter through a kernel perspective and introduces a proximal kernel regression method, which has notable implications for representation learning and efficiency.

Relevance: 9 Novelty: 8

14. Issues with Neural Tangent Kernel Approach to Neural Networks

ArXiv ID: 2501.10929

Authors: Haoran Liu, Anthony Tai, David J. Crandall, Chunfeng Huang

Abstract: Neural tangent kernels (NTKs) have been proposed to study the behavior of trained neural networks from the perspective of Gaussian processes. An important result in this body of work is the theorem of equivalence between a trained neural network and kernel regression with the corresponding NTK. This theorem allows for an interpretation of neural networks as special cases of kernel regression. However, does this theorem of equivalence hold in practice? In this paper, we revisit the derivation of the NTK rigorously and conduct numerical experiments to evaluate this equivalence theorem. We observe that adding a layer to a neural network and the corresponding updated NTK do not yield matching changes in the predictor error. Furthermore, we observe that kernel regression with a Gaussian process kernel from the literature that does not account for neural network training produces prediction errors very close to those of kernel regression with NTKs. These observations suggest the equivalence theorem does not hold well in practice and put into question whether neural tangent kernels adequately address the training process of neural networks.

Comment: This paper critiques the Neural Tangent Kernel (NTK) framework and questions its practical equivalence theorem, providing theoretical insights into neural network training behavior.

Relevance: 9 Novelty: 8

15. The Geometry of Tokens in Internal Representations of Large Language Models

ArXiv ID: 2501.10573

Authors: Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti

Abstract: We investigate the relationship between the geometry of token embeddings and their role in next-token prediction within transformer models. An important aspect of this connection uses the notion of empirical measure, which encodes the distribution of token point clouds across transformer layers and drives the evolution of token representations in the mean-field interacting picture. We use metrics such as intrinsic dimension, neighborhood overlap, and cosine similarity to observationally probe these empirical measures across layers. To validate our approach, we compare these metrics to a dataset where the tokens are shuffled, which disrupts the syntactic and semantic structure. Our findings reveal a correlation between the geometric properties of token embeddings and the cross-entropy loss of next-token predictions, implying that prompts with higher loss values have tokens represented in higher-dimensional spaces.

Comment: Analyzes the geometry of token embeddings in large language models to explore their relationship with next token prediction. This provides theoretical insights into LLM behavior, aligning with foundational advancements in representation learning and interpretability.

Relevance: 9 Novelty: 8

16. Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy

ArXiv ID: 2501.11721

Authors: Saeid Asgari Taghanaki, Joao Monteiro

Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT's performance is predictive of MMLU-Pro's, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models' ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at https://github.com/asgsaeid/EQT.

Comment: This work investigates LLM self-comprehension via a novel Explain-Query-Test pipeline and highlights gaps in LLM internal knowledge representation. The focus on theoretical understanding and evaluation mechanics is relevant for foundational LLM insights.

Relevance: 9 Novelty: 8

17. Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

ArXiv ID: 2501.11592

Authors: Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

Abstract: Pre-trained large models have attracted widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-established theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to substitute them because pre-trained models exhibit poor generality beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods, and adding the new ability to incorporate prior knowledge. In CL, a signal of length $n$ only needs a minimum of $n$ trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100 to 1000 times for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves structural similarity index by 292%, 98%, 45% for sampling rates of 0.1, 0.3, 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solutions.

Comment: The paper introduces a novel ultra-small model for rapid sparse reconstruction in compressed sensing, addressing efficiency and interpretability. The focus on sparsity and low computational cost aligns well with the model compression and representation learning criteria.

Relevance: 9 Novelty: 8

18. Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference

ArXiv ID: 2501.11779

Authors: Pouya Hamadanian, Sadjad Fouladi

Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their inference demands substantial resources, while under-utilizing high-end accelerators like GPUs. A major bottleneck arises from the attention mechanism, which requires storing large key-value caches, limiting the maximum achievable throughput far below the available computing resources. Current approaches attempt to mitigate this issue through memory-efficient attention and paging mechanisms, but remain constrained by the assumption that all operations must be performed on high-end accelerators. In this work, we propose Glinthawk, a two-tiered architecture that decouples the attention mechanism from the rest of the Transformer model. This approach allows the memory requirements for attention to scale independently, enabling larger batch sizes and more efficient use of the high-end accelerators. We prototype Glinthawk with NVIDIA T4 GPUs as one tier and standard CPU VMs as the other. Compared to a traditional single-tier setup, it improves throughput by $5.9\times$ and reduces cost of generation by $2.8\times$. For longer sequence lengths, it achieves $16.3\times$ throughput improvement at $2.4\times$ less cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-oriented applications such as batch processing. We shared our prototype publicly at https://github.com/microsoft/glinthawk.

Comment: Proposes a novel two-tiered architecture decoupling the attention mechanism for LLM inference, improving throughput and cost efficiency. Strong match with model architecture (Transformer-related innovations) and resource efficiency.

Relevance: 9 Novelty: 8

19. Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades

ArXiv ID: 2501.10661

Authors: Chongjie Si, Jingjing Jiang, Wei Shen

Abstract: This paper presents a pioneering exploration of the mechanisms underlying large foundation models' (LFMs) weights, aiming to simplify AI research. Through extensive observation and analysis on prevailing LFMs, we find that regardless of initialization strategies, their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns. We further discover that the weights share the i.i.d. properties of Gaussian noise, and explore their direct relationship. We find that transformation weights can be derived from Gaussian noise, and they primarily serve to increase the standard deviation of pre-trained weights, with their standard deviation growing with layer depth. In other words, transformation weights broaden the acceptable deviation from the optimal weights, facilitating adaptation to downstream tasks. Building upon the above conclusions, we thoroughly discuss the nature of optimal weights, ultimately concluding that they should exhibit zero-mean, symmetry, and sparsity, with the sparse values being a truncated Gaussian distribution and a few outliers. Our experiments in LFM adaptation and editing demonstrate the effectiveness of these insights. We hope these findings can provide a foundational understanding to pave the way for future advancements in the LFM community.

Comment: Explores the Gaussian distribution of weights in large foundation models and derives foundational insights into their nature and optimization. This is highly relevant to foundational understanding of large-scale models.

Relevance: 8 Novelty: 9

20. Test-time regression: a unifying framework for designing sequence models with associative memory

ArXiv ID: 2501.12352

Authors: Ke Alexander Wang, Jiaxin Shi, Emily B. Fox

Abstract: Sequences provide a remarkably general way to represent and process information. This powerful abstraction has placed sequence modeling at the center of modern deep learning applications, inspiring numerous architectures from transformers to recurrent networks. While this fragmented development has yielded powerful models, it has left us without a unified framework to understand their fundamental similarities and explain their effectiveness. We present a unifying framework motivated by an empirical observation: effective sequence models must be able to perform associative recall. Our key insight is that memorizing input tokens through an associative memory is equivalent to performing regression at test-time. This regression-memory correspondence provides a framework for deriving sequence models that can perform associative recall, offering a systematic lens to understand seemingly ad-hoc architectural choices. We show numerous recent architectures -- including linear attention models, their gated variants, state-space models, online learners, and softmax attention -- emerge naturally as specific approaches to test-time regression. Each architecture corresponds to three design choices: the relative importance of each association, the regressor function class, and the optimization algorithm. This connection leads to new understanding: we provide theoretical justification for QKNorm in softmax attention, and we motivate higher-order generalizations of softmax attention. Beyond unification, our work unlocks decades of rich statistical tools that can guide future development of more powerful yet principled sequence models.

Comment: Introduces a unifying framework for sequence models through test-time regression, providing a systematic lens for architectural choices and theoretical justifications (e.g., higher-order generalizations of softmax attention). Solidly relevant to model architecture through theoretical advancements in Transformers and related sequence models.

Relevance: 9 Novelty: 8
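
The regression-memory correspondence described in this abstract can be illustrated with a minimal sketch, assuming the simplest case: an unnormalized linear-attention memory built as a sum of outer products. The helper names `write_memory` and `recall` are hypothetical, and the orthonormal keys are an assumption chosen so that recall is exact (in that case the outer-product memory coincides with the minimum-norm least-squares fit of values on keys). This is not the authors' code.

```python
def write_memory(keys, values):
    """Build M = sum_i v_i k_i^T, an outer-product associative memory (hypothetical helper)."""
    d_v, d_k = len(values[0]), len(keys[0])
    memory = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for r in range(d_v):
            for c in range(d_k):
                memory[r][c] += v[r] * k[c]
    return memory


def recall(memory, query):
    """Return M q: the linear regressor's prediction for the query key."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]


# Two key-value associations; the keys are orthonormal by assumption.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[2.0, -1.0], [0.5, 4.0]]
memory = write_memory(keys, values)

print(recall(memory, keys[0]))  # value stored under the first key
print(recall(memory, keys[1]))  # value stored under the second key
```

With non-orthogonal keys the same memory returns a mixture of stored values, which is one way to see why the choice of regressor and optimizer matters in this framing.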

21. The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities

ArXiv ID: 2501.11326

Authors: Yongwei Che, Benjamin Eysenbach

Abstract: While internet-scale data often comes in pairs (e.g., audio/image, image/text), we often want to perform inferences over modalities unseen together in the training data (e.g., audio/text). Empirically, this can often be addressed by learning multiple contrastive embedding spaces between existing modality pairs, implicitly hoping that unseen modality pairs will end up being aligned. This theoretical paper proves that this hope is well founded, under certain assumptions. Starting with the proper Bayesian approach of integrating out intermediate modalities, we show that directly comparing the representations of data from unpaired modalities can recover the same likelihood ratio. Our analysis builds on prior work on the geometry and probabilistic interpretation of contrastive representations, showing how these representations can answer many of the same inferences as probabilistic graphical models. Our analysis suggests two new ways of using contrastive representations: in settings with pre-trained contrastive models, and for handling language ambiguity in reinforcement learning. Our numerical experiments study the importance of our assumptions and demonstrate these new applications.

Comment: The paper provides a theoretical framework for understanding probabilistic alignment in contrastive learning for unpaired modalities, addressing foundational aspects of representation learning and theoretical insights.

Relevance: 9 Novelty: 8

22. LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator

ArXiv ID: 2501.10658

Authors: Guoyu Li (University of Chinese Academy of Sciences, Microsoft Research), Shengyu Ye (Microsoft Research), Chunyun Chen (NTU Singapore), Yang Wang (Microsoft Research), Fan Yang (Microsoft Research), Ting Cao (Microsoft Research), Cheng Liu (University of Chinese Academy of Sciences), Mohamed M. Sabry (NTU Singapore), Mao Yang (Microsoft Research)

Abstract: The emergence of neural network capabilities invariably leads to a significant surge in computational demands due to expanding model sizes and increased computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce bit width below 1 bit, diminishing its benefits. To break through these limitations, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that utilizes vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which helps to transform various DNN models into LUT-based models via multistage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA achieves improvements in power efficiency and area efficiency with gains of $1.4\times$ to $7.0\times$ and $1.5\times$ to $146.1\times$, respectively, while maintaining only a modest accuracy drop. For CNNs, accuracy decreases by $0.1\%$ to $3.1\%$ using the $L_2$ distance similarity, $0.1\%$ to $3.4\%$ with the $L_1$ distance similarity, and $0.1\%$ to $3.8\%$ when employing the Chebyshev distance similarity. For transformer-based models, the accuracy drop ranges from $1.4\%$ to $3.0\%$.

Comment: Proposes LUT-DLA for efficient hardware acceleration using extreme low-bit quantization, related to model compression and efficiency.

Relevance: 9 Novelty: 8

23. HAC++: Towards 100X Compression of 3D Gaussian Splatting

ArXiv ID: 2501.12255

Authors: Yihang Chen, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, Jianfei Cai

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the large number of Gaussians and their associated attributes necessitates effective compression techniques, and the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To achieve a compact size, we propose HAC++, which leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100X compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20X size reduction compared to Scaffold-GS. Our code is available at https://github.com/YihangChen-ee/HAC-plus.

Comment: Proposes a method for compressing 3D Gaussian Splatting with over 100x compression, aligning with model compression. Some novel ideas such as structured hash grids and adaptive quantization add impact.

Relevance: 9 Novelty: 8

24. A Metric Topology of Deep Learning for Data Classification

ArXiv ID: 2501.11265

Authors: Jwo-Yuh Wu, Liang-Chi Huang, Wen-Hsuan Li, Chun-Hung Liu

Abstract: Empirically, Deep Learning (DL) has demonstrated unprecedented success in practical applications. However, DL remains by and large a mysterious "black-box", spurring recent theoretical research to build its mathematical foundations. In this paper, we investigate DL for data classification through the prism of metric topology. Considering that the conventional Euclidean metric over the network parameter space typically fails to discriminate DL networks according to their classification outcomes, we propose from a probabilistic point of view a meaningful distance measure, whereby DL networks yielding similar classification performances are close. The proposed distance measure defines an equivalence relation among network parameter vectors such that networks performing equally well belong to the same equivalence class. Interestingly, our proposed distance measure can provably serve as a metric on the quotient set modulo the equivalence relation. Then, under quite mild conditions it is shown that, apart from a vanishingly small subset of networks likely to predict non-unique labels, our proposed metric space is compact, and coincides with the well-known quotient topological space. Our study contributes to fundamental understanding of DL, and opens up new ways of studying DL using fruitful metric space theory.

Comment: This paper contributes theoretical insights into deep learning by exploring metric topology for data classification, which aligns with representation learning and foundational AI concepts.

Relevance: 9 Novelty: 8

25. Jailbreaking Large Language Models in Infinitely Many Ways

ArXiv ID: 2501.10800

Authors: Oliver Goldstein, Emanuele La Malfa, Felix Drinkall, Samuele Marro, Michael Wooldridge

Abstract: We discuss the "Infinitely Many Meanings" attacks (IMM), a category of jailbreaks that leverages the increasing capabilities of a model to handle paraphrases and encoded communications to bypass its defensive mechanisms. The viability of IMMs grows with a model's capability to handle and bind the semantics of simple mappings between tokens, and the attacks work extremely well in practice, posing a concrete threat to the users of the most powerful commercial LLMs. We show how one can bypass the safeguards of the most powerful open- and closed-source LLMs and generate content that explicitly violates their safety policies. One can protect against IMMs by improving the guardrails and making them scale with the LLMs' capabilities. For two categories of attacks that are straightforward to implement, i.e., bijection and encoding, we discuss two defensive strategies, one in token and the other in embedding space. We conclude with some research questions we believe should be prioritised to enhance the defensive mechanisms of LLMs and our understanding of their safety.

Comment: The paper discusses a novel jailbreak method (IMM) on LLMs, providing theoretical insights into their vulnerabilities and mechanisms, which aligns with the foundational topic of LLM behavior analysis. The proposed attacks and defenses introduce innovative perspectives.

Relevance: 9 Novelty: 8

26. FOCUS: First Order Concentrated Updating Scheme

ArXiv ID: 2501.12243

Authors: Yizhou Liu, Ziming Liu, Jeff Gore

Abstract: Large language models (LLMs) demonstrate remarkable performance, and improving their pre-training process appears to be key to enhancing their capabilities further. Based on the documented success of Adam, learning rate decay, and weight decay, we hypothesize that the pre-training loss landscape features a narrowing valley structure. Through experiments with synthetic loss functions, we discover that when gradient query noise is high relative to the valley's sharpness, Adam's performance falls behind that of Signum because Adam reduces the effective step size too drastically. This observation led us to develop FOCUS, an optimizer that enhances Signum by incorporating attraction toward moving averaged parameters, allowing it to handle noise better while maintaining larger step sizes. In training GPT-2, FOCUS proves to be more stable than Signum and faster than Adam. These results suggest that gradient noise may be an underappreciated limiting factor in LLM training, and FOCUS offers promising solutions.

Comment: The proposal of FOCUS as a training optimizer for large language models aligns with emerging trends and foundational insights into LLM training. Its focus on stability and noise handling during optimization could lead to advancements in pretraining methodologies, making it highly relevant and impactful.

Relevance: 9 Novelty: 8
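
To make the optimizer description in this abstract concrete, here is a minimal one-dimensional sketch. The Signum step (move along the sign of a momentum average of gradients) is the standard rule. The `focus_like_step` function is only one plausible reading of "attraction toward moving averaged parameters": the form of the extra term, the name, and the hyperparameters `gamma` and `beta_avg` are assumptions, not taken from the paper.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signum_step(theta, m, grad, lr=0.01, beta=0.9):
    """Signum: step along the sign of the momentum (an EMA of gradients)."""
    m = beta * m + (1 - beta) * grad
    return theta - lr * sign(m), m


def focus_like_step(theta, m, avg, grad, lr=0.01, beta=0.9,
                    beta_avg=0.99, gamma=0.2):
    """Signum plus a pull toward an EMA of past parameters (assumed form)."""
    m = beta * m + (1 - beta) * grad
    avg = beta_avg * avg + (1 - beta_avg) * theta
    theta = theta - lr * (sign(m) + gamma * sign(theta - avg))
    return theta, m, avg


def run(steps=300):
    """Minimize f(x) = x^2 / 2 (gradient x) from x = 1 with both rules."""
    x_s, m_s = 1.0, 0.0
    x_f, m_f, avg = 1.0, 0.0, 1.0
    for _ in range(steps):
        x_s, m_s = signum_step(x_s, m_s, x_s)
        x_f, m_f, avg = focus_like_step(x_f, m_f, avg, x_f)
    return x_s, x_f


x_signum, x_focus = run()
print(abs(x_signum), abs(x_focus))  # both end near the minimum at 0
```

The toy problem is noiseless, so it does not reproduce the noise regime the abstract discusses; it only shows the shape of the two update rules.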

27. Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

ArXiv ID: 2501.12147

Authors: Qirun Dai, Dylan Zhang, Jiaqi W. Ma, Hao Peng

Abstract: Selecting appropriate training data is crucial for effective instruction fine-tuning of large language models (LLMs), which aims to (1) elicit strong capabilities, and (2) achieve balanced performance across a diverse range of tasks. Influence-based methods show promise in achieving (1) by estimating the contribution of each training example to the model's predictions, but often struggle with (2). Our systematic investigation reveals that this underperformance can be attributed to an inherent bias where certain tasks intrinsically have greater influence than others. As a result, data selection is often biased towards these tasks, not only hurting the model's performance on others but also, counterintuitively, harming performance on these high-influence tasks themselves. As a remedy, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data, and then iteratively balances data selection by choosing the training example with the highest influence on the most underrepresented task. Experiments with both Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities show that BIDS consistently outperforms both state-of-the-art influence-based algorithms and other non-influence-based selection frameworks. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.

Comment: The proposed method, BIDS, innovatively balances data selection for instruction tuning of LLMs, contributing to training insights for large language models.

Relevance: 9 Novelty: 8
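
The two stages named in this abstract (normalize influence scores, then repeatedly pick the example with the highest influence on the most underrepresented task) can be sketched as a small greedy loop. Everything specific here is an assumption for illustration: the z-score normalization per task column, the use of accumulated normalized influence as the measure of underrepresentation, and the function names. It is not the BIDS implementation.

```python
from statistics import mean, pstdev


def normalize_columns(influence):
    """Z-score each task column so high-influence tasks do not dominate (assumed scheme)."""
    n_tasks = len(influence[0])
    cols = [[row[t] for row in influence] for t in range(n_tasks)]
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]  # guard against zero spread
    return [[(row[t] - stats[t][0]) / stats[t][1] for t in range(n_tasks)]
            for row in influence]


def select_balanced(influence, budget):
    """Greedy selection: always serve the task with the least accumulated influence."""
    scores = normalize_columns(influence)
    n_tasks = len(scores[0])
    covered = [0.0] * n_tasks
    chosen, remaining = [], list(range(len(scores)))
    for _ in range(min(budget, len(scores))):
        task = min(range(n_tasks), key=lambda t: covered[t])  # most underrepresented
        best = max(remaining, key=lambda i: scores[i][task])
        chosen.append(best)
        remaining.remove(best)
        for t in range(n_tasks):
            covered[t] += scores[best][t]
    return chosen


# Rows are training examples, columns are tasks; task 0 has much larger raw scores.
influence = [[10.0, 0.1], [9.0, 0.2], [8.0, 0.9], [1.0, 0.3]]
print(select_balanced(influence, 2))
```

On this toy matrix, ranking by raw summed influence would take examples 0 and 1, both of which mainly serve task 0, whereas the balanced loop picks example 2 second because it is the strongest candidate for task 1.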

28. EDoRA: Efficient Weight-Decomposed Low-Rank Adaptation via Singular Value Decomposition

ArXiv ID: 2501.12067

Authors: Hamid Nasiri, Peter Garraghan

Abstract: Parameter-efficient fine-tuning methods, such as LoRA, reduce the number of trainable parameters. However, they often suffer from scalability issues and differences between their learning pattern and full fine-tuning. To overcome these limitations, we propose Efficient Weight-Decomposed Low-Rank Adaptation (EDoRA): a novel PEFT method that decomposes pre-trained weights into magnitude and directional components. By freezing low-rank matrices, initializing them by singular value decomposition, and introducing a small trainable matrix between them, EDoRA achieves substantial reduction in trainable parameters while maintaining learning capacity. Experimental results on the GLUE benchmark demonstrate that EDoRA achieves competitive or superior performance compared to state-of-the-art methods, such as LoRA and DoRA, with up to 30x fewer trainable parameters. This makes EDoRA a highly efficient solution for adapting LLMs to diverse tasks under memory-constrained settings. Code is available at https://github.com/Hamid-Nasiri/EDoRA.

Comment: EDoRA proposes a novel parameter-efficient adaptation technique based on low-rank decomposition, directly contributing to model compression and low-rank techniques. This aligns well with foundational interests in compression methods.

Relevance: 9 Novelty: 8
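
A rough parameter count shows why freezing the low-rank factors and training only a small matrix between them shrinks the trainable footprint. The sketch below assumes a single weight matrix of shape `d_out x d_in`, a DoRA-style magnitude vector with one entry per column, and an `r x r` trainable core for the EDoRA-style variant. These counting rules and the function names are assumptions for illustration; the exact figures in the paper depend on its own configuration.

```python
def lora_trainable(d_out, d_in, rank):
    """LoRA trains both low-rank factors: B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def dora_trainable(d_out, d_in, rank):
    """DoRA adds a trainable magnitude vector with one entry per weight column."""
    return rank * (d_out + d_in) + d_in


def edora_style_trainable(d_out, d_in, rank):
    """Assumed count: frozen factors, trainable r x r core plus magnitude vector."""
    return rank * rank + d_in


d_out, d_in, rank = 768, 768, 8
print(lora_trainable(d_out, d_in, rank))         # 12288
print(dora_trainable(d_out, d_in, rank))         # 13056
print(edora_style_trainable(d_out, d_in, rank))  # 832
```

Under these assumptions the saving grows with the layer width, since the low-rank factors scale with `d_out + d_in` while the core scales only with the square of the rank.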

29. MirrorCBO: A consensus-based optimization method in the spirit of mirror descent

ArXiv ID: 2501.12189

Authors: Leon Bungert, Franca Hoffmann, Doh Yeon Kim, Tim Roith

Abstract: In this work we propose MirrorCBO, a consensus-based optimization (CBO) method which generalizes standard CBO in the same way that mirror descent generalizes gradient descent. For this we apply the CBO methodology to a swarm of dual particles and retain the primal particle positions by applying the inverse of the mirror map, which we parametrize as the subdifferential of a strongly convex function $\phi$. In this way, we combine the advantages of a derivative-free non-convex optimization algorithm with those of mirror descent. As a special case, the method extends CBO to optimization problems with convex constraints. Assuming bounds on the Bregman distance associated to $\phi$, we provide asymptotic convergence results for MirrorCBO with explicit exponential rate. Another key contribution is an exploratory numerical study of this new algorithm across different application settings, focusing on (i) sparsity-inducing optimization, and (ii) constrained optimization, demonstrating the competitive performance of MirrorCBO. We observe empirically that the method can also be used for optimization on (non-convex) submanifolds of Euclidean space, can be adapted to mirrored versions of other recent CBO variants, and that it inherits from mirror descent the capability to select desirable minimizers, like sparse ones. We also include an overview of recent CBO approaches for constrained optimization and compare their performance to MirrorCBO.

Comment: MirrorCBO proposes a novel optimization approach combining consensus-based optimization with mirror descent. This introduces theoretical contributions and sparsity-inducing optimization, making it highly relevant to foundational model compression topics.

Relevance: 9 Novelty: 8
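
The idea of running consensus-based dynamics on dual particles and mapping back through an inverse mirror map can be sketched as below. This is a hedged illustration under stated assumptions, not the paper's algorithm: it assumes phi(x) = 0.5*||x||^2 + lam*||x||_1 (whose inverse mirror map is soft thresholding), isotropic noise, and a plain Euler-Maruyama step; all function names and default parameters are hypothetical.

```python
import numpy as np

def soft_threshold(y, lam):
    """Inverse mirror map for the assumed phi(x) = 0.5*||x||^2 + lam*||x||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def consensus_point(X, f, alpha):
    """Gibbs-weighted average of particles; low objective values dominate."""
    vals = np.array([f(x) for x in X])
    w = np.exp(-alpha * (vals - vals.min()))    # shift for numerical stability
    return (w[:, None] * X).sum(axis=0) / w.sum()

def mirror_cbo(f, Y, lam=0.1, alpha=30.0, sigma=0.7, dt=0.05, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        X = soft_threshold(Y, lam)              # primal positions
        drift = X - consensus_point(X, f, alpha)
        dist = np.linalg.norm(drift, axis=1, keepdims=True)
        noise = rng.standard_normal(Y.shape)
        # The update acts on the dual particles, not on the primal ones.
        Y = Y - dt * drift + sigma * np.sqrt(dt) * dist * noise
    return soft_threshold(Y, lam)

# Toy objective with a sparse minimizer at (1, 0).
f = lambda x: float(np.sum((x - np.array([1.0, 0.0])) ** 2))
Y0 = np.random.default_rng(1).uniform(-3, 3, size=(50, 2))
X_final = mirror_cbo(f, Y0)
```

Because the primal positions pass through soft thresholding, coordinates with small dual values are exactly zero, which is one way to read the abstract's remark about selecting sparse minimizers.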

30. SeRpEnt: Selective Resampling for Expressive State Space Models

ArXiv ID: 2501.11729

Authors: Stefano Rando, Luca Romani, Matteo Migliarini, Luca Franco, Denis Gudovskiy, Fabio Galasso

Abstract: State Space Models (SSMs) have recently enjoyed a rise to prominence in the field of deep learning for sequence modeling, especially as an alternative to Transformers. Their success stems from avoiding two well-known drawbacks of attention-based models: quadratic complexity with respect to the sequence length and inability to model long-range dependencies. The SSM variant Mamba has demonstrated performance comparable to Transformers without any form of attention, thanks to the use of a selective mechanism for the state parameters. Selectivity, however, is only evaluated empirically and the reasons for its effectiveness remain unclear. In this work, we show how selectivity is related to the sequence processing. Our analysis shows that selective time intervals in Mamba act as linear approximators of information. Then, we propose our SeRpEnt architecture, an SSM that further exploits selectivity to compress sequences in an information-aware fashion. It employs a resampling mechanism that aggregates elements based on their information content. Our empirical results in the Long Range Arena benchmark and other language modeling tasks show benefits of SeRpEnt's resampling mechanism.

Comment: This paper introduces SeRpEnt, a selective resampling mechanism for State Space Models, positioning it as an alternative to Transformers. It aligns with emerging trends in architecture research and provides theoretical insights into sequence modeling.

Relevance: 8 Novelty: 8

31. Fast sparse optimization via adaptive shrinkage

ArXiv ID: 2501.12236

Authors: Vito Cerone, Sophie M. Fosson, Diego Regruto

Abstract: The need for fast sparse optimization is emerging, e.g., to deal with large-dimensional data-driven problems and to track time-varying systems. In the framework of linear sparse optimization, the iterative shrinkage-thresholding algorithm is a valuable method to solve Lasso, which is particularly appreciated for its ease of implementation. Nevertheless, it converges slowly. In this paper, we develop a proximal method, based on logarithmic regularization, which turns out to be an iterative shrinkage-thresholding algorithm with adaptive shrinkage hyperparameter. This adaptivity substantially enhances the trajectory of the algorithm, in a way that yields faster convergence, while keeping the simplicity of the original method. Our contribution is twofold: on the one hand, we derive and analyze the proposed algorithm; on the other hand, we validate its fast convergence via numerical experiments and we discuss the performance with respect to state-of-the-art algorithms.

Comment: The paper focuses on sparse optimization with an adaptive shrinkage method, aligning well with the 'Representation Learning' criterion, particularly sparse learning. It provides a methodological innovation for faster convergence.

Relevance: 9 Novelty: 7
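
For context, the baseline the abstract refers to, the iterative shrinkage-thresholding algorithm (ISTA) for Lasso, can be sketched as follows. The `adaptive=True` branch is only an assumed, reweighted-style illustration of a per-coordinate threshold that shrinks large entries less; it is not the update rule derived in the paper. The test problem and all parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Shrink each entry toward zero by tau (tau may be a scalar or a vector)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(A, b, x, lam):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def ista(A, b, lam, steps=500, adaptive=False, eps=1e-2):
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        z = x - A.T @ (A @ x - b) / L       # gradient step on the quadratic term
        if adaptive:
            # Assumed illustration: threshold depends on the current iterate.
            tau = lam / (L * (np.abs(x) + eps))
        else:
            tau = lam / L                   # classic ISTA: one fixed threshold
        x = soft_threshold(z, tau)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 58]] = [2.0, -1.5, 1.0]
b = A @ x_true

x_hat = ista(A, b, lam=0.1)
x_hat_adaptive = ista(A, b, lam=0.1, adaptive=True)
```

The fixed-threshold variant never increases the Lasso objective from one iteration to the next, which is what makes it simple but slow; the adaptive branch trades that guarantee for a data-dependent threshold.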

32. Simulation of Hypergraph Algorithms with Looped Transformers

ArXiv ID: 2501.10688

Authors: Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Zhen Zhuang

Abstract: Looped Transformers have shown exceptional capability in simulating traditional graph algorithms, but their application to more complex structures like hypergraphs remains underexplored. Hypergraphs generalize graphs by modeling higher-order relationships among multiple entities, enabling richer representations but introducing significant computational challenges. In this work, we extend the Loop Transformer architecture to simulate hypergraph algorithms efficiently, addressing the gap between neural networks and combinatorial optimization over hypergraphs. Specifically, we propose a novel degradation mechanism for reducing hypergraphs to graph representations, enabling the simulation of graph-based algorithms, such as Dijkstra's shortest path. Furthermore, we introduce a hyperedge-aware encoding scheme to simulate hypergraph-specific algorithms, exemplified by Helly's algorithm. The paper establishes theoretical guarantees for these simulations, demonstrating the feasibility of processing high-dimensional and combinatorial data using Loop Transformers. This work highlights the potential of Transformers as general-purpose algorithmic solvers for structured data.

Comment: Extends Loop Transformers to simulate hypergraph algorithms, introducing novel encoding schemes for hypergraph-specific tasks. This aligns with 'Model Architecture,' particularly for foundational work leveraging Transformer advancements.

Relevance: 8 Novelty: 8

33. FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

ArXiv ID: 2501.10714

Authors: Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu

Abstract: Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42$\times$ speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18$\times$-1.22$\times$ on 1458 MoE layers and 1.19$\times$-3.01$\times$ on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.

Comment: The paper introduces FSMoE, a training system for sparse MoE models. Its focus on optimizing task scheduling and efficiency for MoE aligns well with the model architecture topic. The improvements in training speed also contribute to model compression.

Relevance: 9 Novelty: 7
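
The token routing step that the FSMoE abstract lists among the MoE modules can be illustrated with a generic top-k softmax gate. This is a minimal single-device sketch of sparse routing in general, not FSMoE's scheduling system or any of the four routing functions it supports; the function names, square expert matrices, and the choice to renormalize over the selected experts are assumptions.

```python
import numpy as np

def top_k_gating(logits, k):
    """Pick the k highest-scoring experts per token and softmax over them."""
    idx = np.argsort(-logits, axis=1)[:, :k]            # (tokens, k)
    top = np.take_along_axis(logits, idx, axis=1)
    top = top - top.max(axis=1, keepdims=True)          # numerical stability
    w = np.exp(top)
    w /= w.sum(axis=1, keepdims=True)
    return idx, w

def moe_forward(x, W_gate, experts, k=2):
    """Each token is processed only by its k selected experts."""
    idx, w = top_k_gating(x @ W_gate, k)
    out = np.zeros_like(x)
    for e, W_e in enumerate(experts):
        for slot in range(k):
            mask = idx[:, slot] == e                    # tokens routed to expert e
            if mask.any():
                out[mask] += w[mask, slot][:, None] * (x[mask] @ W_e)
    return out

rng = np.random.default_rng(0)
tokens, d, n_experts = 8, 16, 4
x = rng.standard_normal((tokens, d))
W_gate = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, W_gate, experts, k=2)
```

In a distributed setting the per-expert masks become all-to-all token exchanges, which is the communication that a training system would need to overlap with expert computation.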

34. ARD-VAE: A Statistical Formulation to Find the Relevant Latent Dimensions of Variational Autoencoders

ArXiv ID: 2501.10901

Authors: Surojit Saha, Sarang Joshi, Ross Whitaker

Abstract: The variational autoencoder (VAE) is a popular, deep, latent-variable model (DLVM) due to its simple yet effective formulation for modeling the data distribution. Moreover, optimizing the VAE objective function is more manageable than other DLVMs. The bottleneck dimension of the VAE is a crucial design choice, and it has strong ramifications for the model's performance, such as finding the hidden explanatory factors of a dataset using the representations learned by the VAE. However, the size of the latent dimension of the VAE is often treated as a hyperparameter estimated empirically through trial and error. To this end, we propose a statistical formulation to discover the relevant latent factors required for modeling a dataset. In this work, we use a hierarchical prior in the latent space that estimates the variance of the latent axes using the encoded data, which identifies the relevant latent dimensions. For this, we replace the fixed prior in the VAE objective function with a hierarchical prior, keeping the remainder of the formulation unchanged. We call the proposed method the automatic relevancy detection in the variational autoencoder (ARD-VAE). We demonstrate the efficacy of the ARD-VAE on multiple benchmark datasets in finding the relevant latent dimensions and their effect on different evaluation metrics, such as FID score and disentanglement analysis.

Comment: Proposes ARD-VAE for automatically detecting relevant latent dimensions in Variational Autoencoders, contributing to foundational insights in representation learning and latent space modeling.

Relevance: 9 Novelty: 7

35. SILO: Solving Inverse Problems with Latent Operators

ArXiv ID: 2501.11746

Authors: Ron Raphaeli, Sean Man, Michael Elad

Abstract: Consistent improvement of image priors over the years has led to the development of better inverse problem solvers. Diffusion models are the newcomers to this arena, posing the strongest known prior to date. Recently, such models operating in a latent space have become increasingly predominant due to their efficiency. In recent works, these models have been applied to solve inverse problems. Working in the latent space typically requires multiple applications of an Autoencoder during the restoration process, which leads to both computational and restoration quality challenges. In this work, we propose a new approach for handling inverse problems with latent diffusion models, where a learned degradation function operates within the latent space, emulating a known image space degradation. Usage of the learned operator reduces the dependency on the Autoencoder to only the initial and final steps of the restoration process, facilitating faster sampling and superior restoration quality. We demonstrate the effectiveness of our method on a variety of image restoration tasks and datasets, achieving significant improvements over prior art.

Comment: This work introduces a novel framework for solving inverse problems using latent diffusion models with a new learned degradation function, making it relevant to representation learning and autoencoder-based methods.

Relevance: 9 Novelty: 7

36. MoGERNN: An Inductive Traffic Predictor for Unobserved Locations in Dynamic Sensing Networks

ArXiv ID: 2501.12281

Authors: Qishen Zhou, Yifan Zhang, Michail A. Makridis, Anastasios Kouvelas, Yibing Wang, Simon Hu

Abstract: Given a partially observed road network, how can we predict the traffic state of unobserved locations? While deep learning approaches show exceptional performance in traffic prediction, most assume sensors at all locations of interest, which is impractical due to financial constraints. Furthermore, these methods typically require costly retraining when sensor configurations change. We propose MoGERNN, an inductive spatio-temporal graph representation model, to address these challenges. Inspired by the Mixture of Experts approach in Large Language Models, we introduce a Mixture of Graph Expert (MoGE) block to model complex spatial dependencies through multiple graph message aggregators and a sparse gating network. This block estimates initial states for unobserved locations, which are then processed by a GRU-based Encoder-Decoder that integrates a graph message aggregator to capture spatio-temporal dependencies and predict future states. Experiments on two real-world datasets show MoGERNN consistently outperforms baseline methods for both observed and unobserved locations. MoGERNN can accurately predict congestion evolution even in areas without sensors, offering valuable information for traffic management. Moreover, MoGERNN is adaptable to dynamic sensing networks, maintaining competitive performance even compared to its retrained counterpart. Tests with different numbers of available sensors confirm its consistent superiority, and ablation studies validate the effectiveness of its key modules.

Comment: Incorporates elements of Mixture of Experts (MoE) for spatio-temporal graph modeling, which aligns with representation learning and model architecture innovations. The introduction of the Mixture of Graph Experts block is relevant for foundational architectural improvements.

Relevance: 9 Novelty: 7

37. Toward Effective Digraph Representation Learning: A Magnetic Adaptive Propagation based Approach

ArXiv ID: 2501.11817

Authors: Xunkai Li, Daohan Su, Zhengyu Wu, Guang Zeng, Hongchao Qin, Rong-Hua Li, Guoren Wang

Abstract: The $q$-parameterized magnetic Laplacian serves as the foundation of directed graph (digraph) convolution, enabling this kind of digraph neural network (MagDG) to encode node features and structural insights by complex-domain message passing. As a generalization of undirected methods, MagDG shows superior capability in modeling intricate web-scale topology. Despite the great success achieved by existing MagDGs, limitations still exist: (1) Hand-crafted $q$: The performance of MagDGs depends on selecting an appropriate $q$-parameter to construct suitable graph propagation equations in the complex domain. This parameter tuning, driven by downstream tasks, limits model flexibility and significantly increases manual effort. (2) Coarse Message Passing: Most approaches treat all nodes with the same complex-domain propagation and aggregation rules, neglecting their unique digraph contexts. This oversight results in sub-optimal performance. To address the above issues, we propose two key techniques: (1) MAP is crafted to be a plug-and-play complex-domain propagation optimization strategy in the context of digraph learning, enabling seamless integration into any MagDG to improve predictions while enjoying high running efficiency. (2) MAP++ is a new digraph learning framework, further incorporating a learnable mechanism to achieve adaptive edge-wise propagation and node-wise aggregation in the complex domain for better performance. Extensive experiments on 12 datasets demonstrate that MAP enjoys flexibility, as it can be incorporated with any MagDG, and scalability, as it can deal with web-scale digraphs. MAP++ achieves SOTA predictive performance on 4 different downstream tasks.

Comment: The paper introduces MAP++ for digraph neural networks, focusing on representation learning with advancements in adaptive propagation.

Relevance: 9 Novelty: 7
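
The $q$-parameterized magnetic Laplacian mentioned in the abstract can be written down in a few lines. The sketch below uses the common unnormalized definition (symmetrized adjacency, with edge direction encoded as a complex phase); it does not implement MAP or MAP++, and the example graph and the value of `q` are arbitrary choices.

```python
import numpy as np

def magnetic_laplacian(A, q):
    """Unnormalized magnetic Laplacian of a digraph with adjacency matrix A."""
    A_s = 0.5 * (A + A.T)                     # symmetrized adjacency
    theta = 2.0 * np.pi * q * (A - A.T)       # phase encodes edge direction
    H = A_s * np.exp(1j * theta)              # Hermitian "magnetic" adjacency
    D = np.diag(A_s.sum(axis=1))              # degrees of the symmetrized graph
    return D - H

# Directed 3-cycle: 0 -> 1 -> 2 -> 0.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

L_q = magnetic_laplacian(A, q=0.25)           # complex, Hermitian
L_0 = magnetic_laplacian(A, q=0.0)            # reduces to the undirected Laplacian
```

Setting `q = 0` discards direction entirely, which is why the choice of `q` matters for digraph models and why hand-tuning it is listed as a limitation.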

38. Control LLM: Controlled Evolution for Intelligence Retention in LLM

ArXiv ID: 2501.10979

Authors: Haichao Wei, Yunxiang Ren, Zhoutong Fu, Aman Lunia, Yi-Lin Chen, Alice Leung, Ya Xu

Abstract: Large Language Models (LLMs) demand significant computational resources, making it essential to enhance their capabilities without retraining from scratch. A key challenge in this domain is \textit{catastrophic forgetting} (CF), which hampers performance during Continuous Pre-training (CPT) and Continuous Supervised Fine-Tuning (CSFT). We propose \textbf{Control LLM}, a novel approach that leverages parallel pre-trained and expanded transformer blocks, aligning their hidden-states through interpolation strategies. This method effectively preserves performance on existing tasks while seamlessly integrating new knowledge. Extensive experiments demonstrate the effectiveness of Control LLM in both CPT and CSFT. On Llama3.1-8B-Instruct, it achieves significant improvements in mathematical reasoning ($+14.4\%$ on Math-Hard) and coding performance ($+10\%$ on MBPP-PLUS). On Llama3.1-8B, it enhances multilingual capabilities ($+10.6\%$ on C-Eval, $+6.8\%$ on CMMLU, and $+30.2\%$ on CMMLU-0shot-CoT). It surpasses existing methods and achieves SOTA among open-source models tuned from the same base model, using substantially less data and compute. Crucially, these gains are realized while preserving strong original capabilities, with minimal degradation ($35\%$ in open-source Math and Coding models. This approach has been successfully deployed in LinkedIn's GenAI-powered job seeker and Ads unit products. To support further research, we release the training and evaluation code (\url{https://github.com/linkedin/ControlLLM}) along with models trained on public datasets (\url{https://huggingface.co/ControlLLM}) to the community.

Comment: Introduces Control LLM, enhancing LLM capabilities by aligning transformer blocks to combat catastrophic forgetting, relevant for LLM advancement.

Relevance: 8 Novelty: 8

39. TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

ArXiv ID: 2501.12224

Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel

Abstract: We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/

Comment: Proposes an innovative method for multi-concept personalization in diffusion-based models, introducing novel techniques in token modulation space, which is relevant to foundational representation and model advances.

Relevance: 8 Novelty: 8

40. SMamba: Sparse Mamba for Event-based Object Detection

ArXiv ID: 2501.11971

Authors: Nan Yang, Yang Wang, Zhanwen Liu, Meng Li, Yisheng An, Xiangmo Zhao

Abstract: Transformer-based methods have achieved remarkable performance in event-based object detection, owing to their global modeling ability. However, they neglect the influence of non-event and noisy regions and process them uniformly, leading to high computational overhead. To mitigate computation cost, some researchers propose window attention based sparsification strategies to discard unimportant regions, which sacrifices the global modeling ability and results in suboptimal performance. To achieve a better trade-off between accuracy and efficiency, we propose Sparse Mamba (SMamba), which performs adaptive sparsification to reduce computational effort while maintaining global modeling capability. Specifically, a Spatio-Temporal Continuity Assessment module is proposed to measure the information content of tokens and discard uninformative ones by leveraging the spatiotemporal distribution differences between activity and noise events. Based on the assessment results, an Information-Prioritized Local Scan strategy is designed to shorten the scan distance between high-information tokens, facilitating interactions among them in the spatial dimension. Furthermore, to extend the global interaction from 2D space to 3D representations, a Global Channel Interaction module is proposed to aggregate channel information from a global spatial perspective. Results on three datasets (Gen1, 1Mpx, and eTram) demonstrate that our model outperforms other methods in both performance and efficiency.

Comment: Introducing a sparse token prioritization mechanism, this paper explores adaptive sparsification within a Mamba-based architecture as an alternative to Transformer-based detectors, aligning with sparsity and efficiency-focused innovations in model compression and representation learning.

Relevance: 8 Novelty: 8

41. Systematic Abductive Reasoning via Diverse Relation Representations in Vector-symbolic Architecture

ArXiv ID: 2501.11896

Authors: Zhong-Hua Sun, Ru-Yuan Zhang, Zonglei Zhen, Da-Hui Wang, Yong-Jie Li, Xiaohong Wan, Hongzhi You

Abstract: In abstract visual reasoning, monolithic deep learning models suffer from limited interpretability and generalization, while existing neuro-symbolic approaches fall short in capturing the diversity and systematicity of attributes and relation representations. To address these challenges, we propose a Systematic Abductive Reasoning model with diverse relation representations (Rel-SAR) in Vector-symbolic Architecture (VSA) to solve Raven's Progressive Matrices (RPM). To derive attribute representations with symbolic reasoning potential, we introduce not only various types of atomic vectors that represent numeric, periodic and logical semantics, but also the structured high-dimensional representation (SHDR) for the overall Grid component. For systematic reasoning, we propose novel numerical and logical relation functions and perform rule abduction and execution in a unified framework that integrates these relation representations. Experimental results demonstrate that Rel-SAR achieves significant improvement on RPM tasks and exhibits robust out-of-distribution generalization. Rel-SAR leverages the synergy between HD attribute representations and symbolic reasoning to achieve systematic abductive reasoning with both interpretable and computable semantics.

Comment: The paper introduces a novel abductive reasoning model with structured high-dimensional representations, which aligns with representation learning and shows theoretical depth.

Relevance: 8 Novelty: 8

42. Empirical Bayes Estimation for Lasso-Type Regularizers: Analysis of Automatic Relevance Determination

ArXiv ID: 2501.11280

Authors: Tsukasa Yoshida, Kazuho Watanabe

Abstract: This paper focuses on linear regression models with non-conjugate sparsity-inducing regularizers such as lasso and group lasso. Although the empirical Bayes approach enables us to estimate the regularization parameter, little is known about the properties of the estimators. In particular, there are many unexplained aspects regarding the specific conditions under which the mechanism of automatic relevance determination (ARD) occurs. In this paper, we derive the empirical Bayes estimators for the group lasso regularized linear regression models with a limited number of parameters. It is shown that the estimators diverge under a certain condition, giving rise to the ARD mechanism. We also prove that empirical Bayes methods can produce the ARD mechanism in general regularized linear regression models and clarify the conditions under which models such as ridge, lasso, and group lasso can produce the ARD mechanism.

Comment: This paper provides a theoretical analysis of sparsity-inducing regularizers like lasso and group lasso, directly aligning with model compression and sparsity topics.

Relevance: 9 Novelty: 7

43. Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity

ArXiv ID: 2501.11183

Authors: David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio

Abstract: As LLMs develop increasingly advanced capabilities, there is an increased need to minimize the harm that could be caused to society by certain model outputs; hence, most LLMs have safety guardrails added, for example via fine-tuning. In this paper, we argue the position that current safety fine-tuning is very similar to a traditional cat-and-mouse game (or arms race) between attackers and defenders in cybersecurity. Model jailbreaks and attacks are patched with bandaids to target the specific attack mechanism, but many similar attack vectors might remain. When defenders are not proactively coming up with principled mechanisms, it becomes very easy for attackers to sidestep any new defenses. We show how current defenses are insufficient to prevent new adversarial jailbreak attacks, reward hacking, and loss of control problems. In order to learn from past mistakes in cybersecurity, we draw analogies with historical examples and develop lessons learned that can be applied to LLM safety. These arguments support the need for new and more principled approaches to designing safe models, which are architected for security from the beginning. We describe several such approaches from the AI literature.

Comment: The paper critiques current safety fine-tuning of LLMs and suggests principled design inspired by cybersecurity. It aligns with foundational work if viewed as a methodological shift in safety for LLMs.

Relevance: 8 Novelty: 8

44. Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

ArXiv ID: 2501.10799

Authors: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang

Abstract: Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.

Comment: Presents a novel framework for improving mathematical reasoning in LLMs via process-level and outcome-level binary feedback. While relevant for insights into LLM training, it slightly deviates towards application-focused improvements rather than foundational changes.

Relevance: 8 Novelty: 7

45. Ditto: Accelerating Diffusion Model via Temporal Value Similarity

ArXiv ID: 2501.11211

Authors: Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, Won Woo Ro

Abstract: Diffusion models achieve superior performance in image generation tasks. However, they incur significant computation overheads due to their iterative structure. To address these overheads, we analyze this iterative structure and observe that adjacent time steps in diffusion models exhibit high value similarity, leading to narrower differences between consecutive time steps. We adapt these characteristics to a quantized diffusion model and reveal that the majority of these differences can be represented with reduced bit-width, and even zero. Based on our observations, we propose the Ditto algorithm, a difference processing algorithm that leverages temporal similarity with quantization to enhance the efficiency of diffusion models. By exploiting the narrower differences and the distributive property of layer operations, it performs full bit-width operations for the initial time step and processes subsequent steps with temporal differences. In addition, Ditto execution flow optimization is designed to mitigate the memory overhead of temporal difference processing, further boosting the efficiency of the Ditto algorithm. We also design the Ditto hardware, a specialized hardware accelerator, fully exploiting the dynamic characteristics of the proposed algorithm. As a result, the Ditto hardware achieves up to 1.5x speedup and 17.74% energy saving compared to other accelerators.

Comment: Proposes a novel method for improving the efficiency of diffusion models using quantization and temporal value similarity, which falls under the topic of model compression due to its focus on efficiency. It also provides algorithmic innovations specific to diffusion models.

Relevance: 8 Novelty: 7
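
The "distributive property of layer operations" that the Ditto abstract relies on can be checked on a single linear layer with integer (quantized) values. This is only a toy illustration of the identity W x_t = W x_{t-1} + W (x_t - x_{t-1}); the value ranges and the assumption that consecutive steps differ by at most 3 are made up for the example, and nothing here models the Ditto hardware or its execution flow.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.integers(-128, 128, size=(16, 32))        # int8-range weights
x_prev = rng.integers(-100, 101, size=32)         # activations at step t-1
x_curr = x_prev + rng.integers(-3, 4, size=32)    # step t: small change (assumed)

y_prev = W @ x_prev                               # full bit-width pass, done once
delta = x_curr - x_prev                           # narrow-range temporal difference
y_curr = y_prev + W @ delta                       # reuse the previous output

# The difference needs far fewer bits than the activations themselves.
max_abs_delta = int(np.abs(delta).max())
max_abs_activation = int(np.abs(x_curr).max())
```

Since integer arithmetic is exact, the incremental result matches a full recomputation bit for bit, and zero entries in `delta` contribute no work at all.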

46. Graph-defined Language Learning with LLMs

ArXiv ID: 2501.11478

Authors: Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang

Abstract: Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches describe graph structures for LLMs to understand or aggregate LLM-generated textual attribute embeddings through graph structure. However, these approaches face two main limitations in modeling graph structures with LLMs. (i) Graph descriptions become verbose in describing high-order graph structure. (ii) Textual attributes alone do not contain adequate graph structure information. It is challenging to model graph structure concisely and adequately with LLMs. LLMs lack built-in mechanisms to model graph structures directly. They also struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose \textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge \textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual attribute embeddings-based baselines by efficiently modeling different orders of graph structure with LLMs.

Comment: Introduces a novel framework for enabling LLMs to work directly with graph-structured data and proposes translating graphs into a new 'language', which is a potentially significant step in representation learning and LLM integration.

Relevance: 8 Novelty: 7

47. A margin-based replacement for cross-entropy loss

ArXiv ID: 2501.12191

Authors: Michael W. Spratling, Heiko H. Schütt

Abstract: Cross-entropy (CE) loss is the de-facto standard for training deep neural networks to perform classification. However, CE-trained deep neural networks struggle with robustness and generalisation issues. To alleviate these issues, we propose high error margin (HEM) loss, a variant of multi-class margin loss that overcomes the training issues of other margin-based losses. We evaluate HEM extensively on a range of architectures and datasets. We find that HEM loss is more effective than cross-entropy loss across a wide range of tasks: unknown class rejection, adversarial robustness, learning with imbalanced data, continual learning, and semantic segmentation (a pixel-level classification task). Despite all training hyper-parameters being chosen for CE loss, HEM is inferior to CE only in terms of clean accuracy and this difference is insignificant. We also compare HEM to specialised losses that have previously been proposed to improve performance on specific tasks. LogitNorm, a loss achieving state-of-the-art performance on unknown class rejection, produces similar performance to HEM for this task, but is much poorer for continual learning and semantic segmentation. Logit-adjusted loss, designed for imbalanced data, has superior results to HEM for that task, but performs more poorly on unknown class rejection and semantic segmentation. DICE, a popular loss for semantic segmentation, is inferior to HEM loss on all tasks, including semantic segmentation. Thus, HEM often out-performs specialised losses, and in contrast to them, is a general-purpose replacement for CE loss.

Comment: Proposes a margin-based loss function (HEM) as a replacement for cross-entropy loss, which ties to foundational innovation in representation learning. The focus on robustness and generalization challenges is moderately novel.

Relevance: 8 Novelty: 7
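
For readers unfamiliar with margin-based losses, the sketch below contrasts cross-entropy with the textbook multi-class margin (hinge) loss. It is not the HEM loss from the paper, whose exact form the abstract does not give; the function names and the default margin of 1.0 are illustrative assumptions.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def multiclass_margin(logits, label, margin=1.0):
    """Textbook multi-class margin loss (NOT the paper's HEM loss).

    Each wrong class contributes max(0, margin - (true_logit - wrong_logit)),
    averaged over the wrong classes.
    """
    diffs = margin - (logits[label] - logits)
    diffs[label] = 0.0  # the true class does not compete with itself
    return np.maximum(diffs, 0.0).sum() / (len(logits) - 1)

logits = np.array([3.0, 1.0, 0.0])
print(cross_entropy(logits, 0))      # still positive
print(multiclass_margin(logits, 0))  # zero once every margin is satisfied
```

The contrast shown is the usual one: a margin loss stops penalising an example once the true logit leads every other logit by the margin, whereas cross-entropy never reaches exactly zero.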

48. A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity

ArXiv ID: 2501.11407

Authors: Jamie Lohoff, Anil Kaya, Florian Assmuth, Emre Neftci

Abstract: Online synaptic plasticity rules derived from gradient descent achieve high accuracy on a wide range of practical tasks. However, their software implementation often requires tediously hand-derived gradients or using gradient backpropagation which sacrifices the online capability of the rules. In this work, we present a custom automatic differentiation (AD) pipeline for sparse and online implementation of gradient-based synaptic plasticity rules that generalizes to arbitrary neuron models. Our work combines the programming ease of backpropagation-type methods for forward AD while being memory-efficient. To achieve this, we exploit the advantageous compute and memory scaling of online synaptic plasticity by providing an inherently sparse implementation of AD where expensive tensor contractions are replaced with simple element-wise multiplications if the tensors are diagonal. Gradient-based synaptic plasticity rules such as eligibility propagation (e-prop) have exactly this property and thus profit immensely from this feature. We demonstrate the alignment of our gradients with respect to gradient backpropagation on a synthetic task where e-prop gradients are exact, as well as audio speech classification benchmarks. We demonstrate how memory utilization scales with network size without dependence on the sequence length, as expected from forward AD methods.

Comment: This work introduces a sparse, online implementation pipeline for gradient-based synaptic plasticity. It aligns with representation learning due to its sparse and memory-efficient approach, and provides methodological improvements for network scalability, making it relevant.

Relevance: 8 Novelty: 7

49. Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search

ArXiv ID: 2501.10479

Authors: Daniel Severo, Giuseppe Ottaviano, Matthew Muckley, Karen Ullrich, Matthijs Douze

Abstract: Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor of 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a reduction of 30% of the index size. Furthermore, we show that for some datasets, these methods can also compress the quantized vector codes losslessly, by exploiting sub-optimalities in the original quantization algorithm. The source code for our approach is available at https://github.com/facebookresearch/vector_db_id_compression.

Comment: This paper introduces compression schemes for vector IDs in approximate nearest neighbor search. It aligns well with the model compression criterion, focusing on efficiency through innovations in lossless compression. The proposed method demonstrates theoretical depth and practical impact.

Relevance: 8 Novelty: 7
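
The abstract's key observation is that the order of ids inside a list does not matter. A list of n distinct ids can be stored in n! orders that all mean the same thing, so an order-free code can save up to log2(n!) bits. The sketch below computes that bound and shows a deliberately simple order-free encoding (sort, then store gaps). This is not the paper's asymmetric-numeral-system or wavelet-tree scheme; the function names are hypothetical.

```python
import math

def order_savings_bits(n):
    """Upper bound on bits saved by ignoring order: log2(n!)."""
    return math.lgamma(n + 1) / math.log(2)

def delta_encode(ids):
    """Sort the ids and keep the first id plus successive gaps."""
    s = sorted(ids)
    if not s:
        return []
    return [s[0]] + [b - a for a, b in zip(s, s[1:])]

def delta_decode(deltas):
    """Rebuild the sorted id list from its gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

ids = [42, 7, 19, 3]
print(delta_encode(ids))                # small gaps instead of full-width ids
print(delta_decode(delta_encode(ids)))  # same set of ids, in sorted order
print(order_savings_bits(len(ids)))     # bits of ordering information dropped
```

The gaps are smaller numbers than the raw ids, so a variable-length integer code applied to them would typically use fewer bits; the round trip is lossless for the set of ids, which is all the index needs.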

50. Exploring Transferable Homogeneous Groups for Compositional Zero-Shot Learning

ArXiv ID: 2501.10695

Authors: Zhijie Rao, Jingcai Guo, Miaoge Li, Yang Chen

Abstract: Conditional dependency presents one of the trickiest problems in Compositional Zero-Shot Learning, leading to significant property variations of the same state (object) across different objects (states). To address this problem, existing approaches often adopt either all-to-one or one-to-one representation paradigms. However, these extremes create an imbalance in the seesaw between transferability and discriminability, favoring one at the expense of the other. Comparatively, humans are adept at analogizing and reasoning in a hierarchical clustering manner, intuitively grouping categories with similar properties to form cohesive concepts. Motivated by this, we propose Homogeneous Group Representation Learning (HGRL), a new perspective that formulates state (object) representation learning as multiple homogeneous sub-group representation learning. HGRL seeks to achieve a balance between semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. Our method integrates three core components designed to simultaneously enhance both the visual and prompt representation capabilities of the model. Extensive experiments on three benchmark datasets validate the effectiveness of our method.

Comment: The paper proposes Homogeneous Group Representation Learning (HGRL) for balancing transferability and discriminability in Compositional Zero-Shot Learning. This is relevant to representation learning with potential novel contributions to the field.

Relevance: 8 Novelty: 7

51. Uncertainty Quantification With Noise Injection in Neural Networks: A Bayesian Perspective

ArXiv ID: 2501.12314

Authors: Xueqiong Yuan, Jipeng Li, Ercan Engin Kuruoglu

Abstract: Model uncertainty quantification involves measuring and evaluating the uncertainty linked to a model's predictions, helping assess their reliability and confidence. Noise injection is a technique used to enhance the robustness of neural networks by introducing randomness. In this paper, we establish a connection between noise injection and uncertainty quantification from a Bayesian standpoint. We theoretically demonstrate that injecting noise into the weights of a neural network is equivalent to Bayesian inference on a deep Gaussian process. Consequently, we introduce a Monte Carlo Noise Injection (MCNI) method, which involves injecting noise into the parameters during training and performing multiple forward propagations during inference to estimate the uncertainty of the prediction. Through simulation and experiments on regression and classification tasks, our method demonstrates superior performance compared to the baseline model.

Comment: The paper examines the connection between noise injection and Bayesian uncertainty quantification, presenting a theoretical perspective and a new method (MCNI). This involves insights into neural networks and could align with theoretical foundations of representation learning.

Relevance: 7 Novelty: 8
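
The inference half of the Monte Carlo idea described above can be sketched in a few lines: perturb the weights with Gaussian noise, run several forward passes, and read the spread of the outputs as an uncertainty estimate. The single-neuron tanh model, the noise scale `sigma`, and the function name are assumptions made for illustration; the paper's training-time noise injection and its deep Gaussian process analysis are not reproduced here.

```python
import numpy as np

def mc_noise_predict(weights, x, sigma=0.1, n_samples=200, seed=0):
    """Mean and standard deviation of the output over noisy forward passes.

    Toy model: output = tanh(weights @ x). Each pass adds independent
    Gaussian noise of scale sigma to the weights.
    """
    rng = np.random.default_rng(seed)
    preds = np.empty(n_samples)
    for i in range(n_samples):
        noisy = weights + sigma * rng.standard_normal(weights.shape)
        preds[i] = np.tanh(noisy @ x)
    return preds.mean(), preds.std()

w = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])
mean, std = mc_noise_predict(w, x)
print(mean, std)  # std serves as the uncertainty estimate
```

With `sigma=0` every pass is identical and the reported spread collapses to zero, which is a quick sanity check on the estimator.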

52. SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks

ArXiv ID: 2501.11599

Authors: Wentao Wan, Zhuojie Yang, Yongcan Chen, Chenglin Luo, Ruilin Wang, Kehao Cai, Nan Kang, Liang Lin, Keze Wang

Abstract: Deductive reasoning is a crucial logical capability that assists us in solving complex problems based on existing knowledge. Although augmented by Chain-of-Thought prompts, Large Language Models (LLMs) might not follow the correct reasoning paths. Enhancing the deductive reasoning abilities of LLMs, and leveraging their extensive built-in knowledge for various reasoning tasks, remains an open question. Attempting to mimic the human deductive reasoning paradigm, we propose a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that enables LLMs to perform syllogistic deductive reasoning to handle complex knowledge-based reasoning tasks. Our SR-FoT begins by interpreting the question and then uses the interpretation and the original question to propose a suitable major premise. It proceeds by generating and answering minor premise questions in two stages to match the minor premises. Finally, it guides LLMs to use the previously generated major and minor premises to perform syllogistic deductive reasoning to derive the answer to the original question. Extensive and thorough experiments on knowledge-based reasoning tasks have demonstrated the effectiveness and advantages of our SR-FoT.

Comment: Proposes a novel framework (SR-FoT) for improving deductive reasoning in LLMs, aligning closely with theoretical insights into LLM behavior and reasoning improvements.

Relevance: 8 Novelty: 7

53. Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

ArXiv ID: 2501.11651

Authors: Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration and learning from feedback, recent attempts yield only modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We further employ an entropy bonus as an auxiliary loss, alongside a dynamic anchor for regularization to facilitate reward optimization. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. For example, T1 with Qwen2.5-32B as the base model outperforms the recent Qwen QwQ-32B-Preview model on MATH500, AIME2024, and Omni-math-500. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1's better performance without any additional verification. We will open-source the T1 models and the data used to train them at https://github.com/THUDM/T1.

Comment: This paper introduces an RL-based approach to improve reasoning in LLMs and examines inference scaling behavior. It aligns with the LLM behavior and scaling criteria, showing methodological innovation worth considering.

Relevance: 8 Novelty: 7

54. Quantitative Error Bounds for Scaling Limits of Stochastic Iterative Algorithms

ArXiv ID: 2501.12212

Authors: Xiaoyu Wang, Mikolaj J. Kasprzak, Jeffrey Negrea, Solesne Bourguin, Jonathan H. Huggins

Abstract: Stochastic iterative algorithms, including stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD), are widely utilized for optimization and sampling in large-scale and high-dimensional problems in machine learning, statistics, and engineering. Numerous works have bounded the parameter error in, and characterized the uncertainty of, these approximations. One common approach has been to use scaling limit analyses to relate the distribution of algorithm sample paths to a continuous-time stochastic process approximation, particularly in asymptotic setups. Focusing on the univariate setting, in this paper, we build on previous work to derive non-asymptotic functional approximation error bounds between the algorithm sample paths and the Ornstein-Uhlenbeck approximation using an infinite-dimensional version of Stein's method of exchangeable pairs. We show that this bound implies weak convergence under modest additional assumptions and leads to a bound on the error of the variance of the iterate averages of the algorithm. Furthermore, we use our main result to construct error bounds in terms of two common metrics: the Lévy-Prokhorov and bounded Wasserstein distances. Our results provide a foundation for developing similar error bounds for the multivariate setting and for more sophisticated stochastic approximation algorithms.

Comment: The paper develops non-asymptotic error bounds for stochastic iterative algorithms like SGD using a novel application of Stein's method. It contributes theoretical insights relevant to optimization methods in machine learning, though not directly to representation learning or LLM advancements.

Relevance: 7 Novelty: 8

55. Parallel Sequence Modeling via Generalized Spatial Propagation Network

ArXiv ID: 2501.12381

Authors: Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu

Abstract: We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over $84\times$ when generating 16K images.

Comment: Proposes a new attention mechanism, Generalized Spatial Propagation Network (GSPN), optimized for vision tasks with significant computational efficiency. Relevant to architectural advances in attention models.

Relevance: 8 Novelty: 7

56. Score Combining for Contrastive OOD Detection

ArXiv ID: 2501.12204

Authors: Edward T. Reehorst, Philip Schniter

Abstract: In out-of-distribution (OOD) detection, one is asked to classify whether a test sample comes from a known inlier distribution or not. We focus on the case where the inlier distribution is defined by a training dataset and there exists no additional knowledge about the novelties that one is likely to encounter. This problem is also referred to as novelty detection, one-class classification, and unsupervised anomaly detection. The current literature suggests that contrastive learning techniques are state-of-the-art for OOD detection. We aim to improve on those techniques by combining/ensembling their scores using the framework of null hypothesis testing and, in particular, a novel generalized likelihood ratio test (GLRT). We demonstrate that our proposed GLRT-based technique outperforms the state-of-the-art CSI and SupCSI techniques from Tack et al. 2020 in dataset-vs-dataset experiments with CIFAR-10, SVHN, LSUN, ImageNet, and CIFAR-100, as well as leave-one-class-out experiments with CIFAR-10. We also demonstrate that our GLRT outperforms the score-combining methods of Fisher, Bonferroni, Simes, Benjamini-Hochberg, and Stouffer in our application.

Comment: Focuses on contrastive learning for OOD detection and proposes improvements via a new GLRT method. Aligned with representation learning but lacks groundbreaking theoretical advancements.

Relevance: 8 Novelty: 7
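
Two of the classical score-combining baselines named in the abstract, Fisher's and Stouffer's methods, are short enough to sketch with the Python standard library. Both take per-score p-values and return one combined p-value. This illustrates the baselines only, not the paper's GLRT; it assumes independent p-values strictly between 0 and 1, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(ln p) is chi-squared with 2k degrees of freedom.

    For an even number of degrees of freedom the chi-squared tail has the
    closed form exp(-h) * sum_{j<k} h**j / j!, where h is half the statistic.
    """
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(k))

def stouffer_combine(pvalues):
    """Stouffer's method: average the z-scores, rescaled by sqrt(k)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1 - nd.cdf(z)

scores = [0.05, 0.05]
print(fisher_combine(scores))    # smaller than either input
print(stouffer_combine(scores))  # smaller than either input
```

Two individually borderline p-values combine into stronger evidence against the inlier hypothesis under either rule, which is the effect that score ensembling relies on.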

57. Block Flow: Learning Straight Flow on Data Blocks

ArXiv ID: 2501.11361

Authors: Zibin Wang, Zhiyuan Ouyang, Xiangyun Zhang

Abstract: Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter scale. Code is available at https://github.com/wpp13749/block_flow.

Comment: The paper introduces the concept of 'block matching' to improve flow-matching models, aligning partly with representation learning topics by proposing an innovative regularization strategy for generative trajectory flows.

Relevance: 7 Novelty: 8

58. With Great Backbones Comes Great Adversarial Transferability

ArXiv ID: 2501.12275

Authors: Erik Arakelyan, Karen Hambardzumyan, Davit Papikyan, Pasquale Minervini, Albert Gordo, Isabelle Augenstein, Aram H. Markosyan

Abstract: Advances in self-supervised learning (SSL) for machine vision have improved representation robustness and model performance, giving rise to pre-trained backbones like ResNet and ViT models tuned with SSL methods such as SimCLR. Due to the computational and data demands of pre-training, the utilization of such backbones becomes a strenuous necessity. However, employing these backbones may inherit vulnerabilities to adversarial attacks. While adversarial robustness has been studied under white-box and black-box settings, the robustness of models tuned on pre-trained backbones remains largely unexplored. Additionally, the role of tuning meta-information in mitigating exploitation risks is unclear. This work systematically evaluates the adversarial robustness of such models across 20,000 combinations of tuning meta-information, including fine-tuning techniques, backbone families, datasets, and attack types. We propose using proxy models to transfer attacks, simulating varying levels of target knowledge by fine-tuning these proxies with diverse configurations. Our findings reveal that proxy-based attacks approach the effectiveness of white-box methods, even with minimal tuning knowledge. We also introduce a naive "backbone attack," leveraging only the backbone to generate adversarial samples, which outperforms black-box attacks and rivals white-box methods, highlighting critical risks in model-sharing practices. Finally, our ablations reveal how increasing tuning meta-information impacts attack transferability, measuring each meta-information combination.

Comment: Investigates the adversarial robustness of SSL-tuned models, particularly in representation learning backbones like ResNet and ViT, touching on robustness in foundational architectures.

Relevance: 8 Novelty: 7

59. Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding

ArXiv ID: 2501.10967

Authors: Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou

Abstract: Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://github.com/SakuraTroyChen/PyPE.

Comment: Proposes Pyramid-descent Visual Position Encoding (PyPE) to enhance visual token perception in vision-language models. Relevant for its architectural improvements in foundational vision-language methods.

Relevance: 8 Novelty: 7

60. Dynamic Continual Learning: Harnessing Parameter Uncertainty for Improved Network Adaptation

ArXiv ID: 2501.10861

Authors: Christopher Angelini, Nidhal Bouaynaya

Abstract: When fine-tuning Deep Neural Networks (DNNs) to new data, DNNs are prone to overwriting network parameters required for task-specific functionality on previously learned tasks, resulting in a loss of performance on those tasks. We propose using parameter-based uncertainty to determine which parameters are relevant to a network's learned function and regularize training to prevent change in these important parameters. We approach this regularization in two ways: (1), we constrain critical parameters from significant changes by associating more critical parameters with lower learning rates, thereby limiting alterations in those parameters; (2), important parameters are restricted from change by imposing a higher regularization weighting, causing parameters to revert to their states prior to the learning of subsequent tasks. We leverage a Bayesian Moment Propagation framework which learns network parameters concurrently with their associated uncertainties while allowing each parameter to contribute uncertainty to the network's predictive distribution, avoiding the pitfalls of existing sampling-based methods. The proposed approach is evaluated for common sequential benchmark datasets and compared to existing published approaches from the Continual Learning community. Ultimately, we show improved Continual Learning performance for Average Test Accuracy and Backward Transfer metrics compared to sampling-based methods and other non-uncertainty-based approaches.

Comment: The paper proposes a novel approach to dynamic continual learning by leveraging Bayesian uncertainty, aligning with conditional/dynamic networks. It offers theoretical insights into adaptability but focuses on improving continual learning tasks rather than foundational network innovations.

Relevance: 7 Novelty: 7
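
Strategy (1) from the abstract, giving more critical parameters lower learning rates, can be sketched as a per-parameter step size. The sketch assumes that a small uncertainty `sigma` marks a parameter as important; the abstract does not spell out this mapping, and the linear scaling rule and the function name are likewise illustrative assumptions rather than the authors' method.

```python
import numpy as np

def uncertainty_scaled_step(params, grads, sigma, base_lr=0.1):
    """One gradient step with a per-parameter learning rate.

    Assumption: low sigma means the parameter matters for earlier tasks,
    so its learning rate is shrunk in proportion to sigma / max(sigma).
    """
    lr = base_lr * sigma / sigma.max()
    return params - lr * grads

params = np.zeros(2)
grads = np.ones(2)
sigma = np.array([0.01, 1.0])  # first parameter is "certain", second is not
print(uncertainty_scaled_step(params, grads, sigma))
```

The confident parameter barely moves while the uncertain one takes the full step, which is the protective behaviour the abstract describes for previously learned tasks.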

61. Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

ArXiv ID: 2501.10893

Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık

Abstract: Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environment without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning across realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2% for ICL with Claude-3.5 and 19.5% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed in real-world environments.

Comment: Learn-by-interact proposes an innovative framework for adapting LLM agents using synthesized interaction data, potentially relevant to improving foundational aspects of LLM behavior. However, the focus on instruction synthesis and data pipelines doesn’t strongly match architecture or representation breakthroughs.

Relevance: 7 Novelty: 7

62. Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos

ArXiv ID: 2501.12254

Authors: Yanlai Yang, Mengye Ren

Abstract: Self-supervised learning holds the promise to learn good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose "Memory Storyboard" that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations which outperform those produced by state-of-the-art unsupervised continual learning methods.

Comment: Focuses on self-supervised representation learning with innovative temporal segmentation and memory mechanisms, aligning partially with representation learning criteria. However, it emphasizes continuous video streams and application aspects, which reduce relevance.

Relevance: 7 Novelty: 7

63. Unsupervised Learning in Echo State Networks for Input Reconstruction

ArXiv ID: 2501.11409

Authors: Taiki Yamada, Yuichi Katori, Kantaro Fujiwara

Abstract: Conventional echo state networks (ESNs) require supervised learning to train the readout layer, using the desired outputs as training data. In this study, we focus on input reconstruction (IR), which refers to training the readout layer to reproduce the input time series in its output. We reformulate the learning algorithm of the ESN readout layer to perform IR using unsupervised learning (UL). By conducting theoretical analysis and numerical experiments, we demonstrate that IR in ESNs can be effectively implemented under realistic conditions without explicitly using the desired outputs as training data; in this way, UL is enabled. Furthermore, we demonstrate that applications relying on IR, such as dynamical system replication and noise filtering, can be reformulated within the UL framework. Our findings establish a theoretically sound and universally applicable IR formulation, along with its related tasks in ESNs. This work paves the way for novel predictions and highlights unresolved theoretical challenges in ESNs, particularly in the context of time-series processing methods and computational models of the brain.

Comment: The paper explores unsupervised learning in Echo State Networks with a focus on input reconstruction, which introduces a unique reformulation of the readout training process for time series. This is interesting for representation learning but largely tied to a specific model type (ESNs), limiting broader impact.

Relevance: 7 Novelty: 7

64. Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

ArXiv ID: 2501.11747

Authors: William Held, Bhargavi Paranjape, Punit Singh Koura, Mike Lewis, Frank Zhang, Todor Mihaylov

Abstract: Large Language Models improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x. Together, these approaches establish a new framework for automated, compute-efficient data mixing that is robust across training regimes.

Comment: Presents a framework for compute-efficient data mixing in LLM training, which ties to representation learning and foundational model advancements by addressing data utility estimation. However, it primarily focuses on practical methods rather than deep theoretical insights.

Relevance: 8 Novelty: 6

65. Mutual Regression Distance

ArXiv ID: 2501.10617

Authors: Dong Qiao, Jicong Fan

Abstract: The maximum mean discrepancy and Wasserstein distance are popular distance measures between distributions and play important roles in many machine learning problems such as metric learning, generative modeling, domain adaptation, and clustering. However, since they are functions of pair-wise distances between data points in two distributions, they do not exploit the potential manifold properties of data such as smoothness and hence are not effective in measuring the dissimilarity between the two distributions in the form of manifolds. In this paper, different from existing measures, we propose a novel distance called Mutual Regression Distance (MRD) induced by a constrained mutual regression problem, which can exploit the manifold property of data. We prove that MRD is a pseudometric that satisfies almost all the axioms of a metric. Since the optimization of the original MRD is costly, we provide a tight MRD and a simplified MRD, based on which a heuristic algorithm is established. We also provide kernel variants of MRDs that are more effective in handling nonlinear data. Our MRDs, especially the simplified MRDs, have much lower computational complexity than the Wasserstein distance. We provide theoretical guarantees, such as robustness, for MRDs. Finally, we apply MRDs to distribution clustering, generative models, and domain adaptation. The numerical results demonstrate the effectiveness and superiority of MRDs compared to the baselines.

Comment: Proposes Mutual Regression Distance (MRD), a novel pseudometric for distributions, with theoretical guarantees and applicability to generative models and domain adaptation. Relevant to representation learning but does not focus directly on foundational representation paradigms.

Relevance: 7 Novelty: 7

66. A Hybrid Supervised and Self-Supervised Graph Neural Network for Edge-Centric Applications

ArXiv ID: 2501.12309

Authors: Eugenio Borzone, Leandro Di Persia, Matias Gerard

Abstract: This paper presents a novel graph-based deep learning model for tasks involving relations between two nodes (edge-centric tasks), where the focus lies on predicting relationships and interactions between pairs of nodes rather than node properties themselves. The model combines supervised and self-supervised learning: its loss function takes into account both the learned embeddings and patterns with and without ground truth. Additionally, it incorporates an attention mechanism that leverages both node and edge features. The architecture, trained end-to-end, comprises two primary components: embedding generation and prediction. First, a graph neural network (GNN) transforms raw node features into dense, low-dimensional embeddings, incorporating edge attributes. Then, a feedforward neural model processes the node embeddings to produce the final output. Experiments demonstrate that our model matches or exceeds existing methods for protein-protein interaction prediction and Gene Ontology (GO) term prediction. The model also performs effectively with one-hot encoding for node features, providing a solution for the previously unsolved problem of predicting similarity between compounds with unknown structures.

Comment: Presents a hybrid supervised and self-supervised GNN model with innovation in learning embeddings and incorporating attention mechanisms; overlaps with representation learning.

Relevance: 7 Novelty: 7

67. CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning

ArXiv ID: 2501.12226

Authors: Yuanheng Fang, Guoqing Chao, Wenqiang Lei, Shaobo Li, Dianhui Chu

Abstract: Large Language Models (LLMs) have recently achieved impressive results in complex reasoning tasks through Chain of Thought (CoT) prompting. However, most existing CoT methods rely on using the same prompts, whether manually designed or automatically generated, to handle the entire dataset. This one-size-fits-all approach may fail to meet the specific needs arising from the diversities within a single dataset. To solve this problem, we propose the Clustered Distance-Weighted Chain of Thought (CDW-CoT) method, which dynamically constructs prompts tailored to the characteristics of each data instance by integrating clustering and prompt optimization techniques. Our method employs clustering algorithms to categorize the dataset into distinct groups, from which a candidate pool of prompts is selected to reflect the inherent diversity within the dataset. For each cluster, CDW-CoT trains the optimal prompt probability distribution tailored to their specific characteristics. Finally, it dynamically constructs a unique prompt probability distribution for each test instance, based on its proximity to cluster centers, from which prompts are selected for reasoning. CDW-CoT consistently outperforms traditional CoT methods across six datasets, including commonsense, symbolic, and mathematical reasoning tasks. Specifically, when compared to manual CoT, CDW-CoT achieves an average accuracy improvement of 25.34% on LLaMA2 (13B) and 15.72% on LLaMA3 (8B).

Comment: Proposes a novel approach called CDW-CoT for improving Chain of Thought reasoning in LLMs through clustering and prompt optimization. This is relevant to representation learning and theoretical insights into LLM behavior but lacks groundbreaking theoretical contributions.

Relevance: 7 Novelty: 7
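
The final step described in the CDW-CoT abstract, building a per-instance prompt distribution from proximity to cluster centers, can be sketched roughly as follows. This is only an illustration: the abstract does not give the weighting formula, so the softmax over negative Euclidean distances, the `temperature` parameter, and all function names and toy numbers are assumptions rather than the authors' method.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_weights(point, centers, temperature=1.0):
    """Weight each cluster by proximity to `point`.

    ASSUMPTION: softmax over negative distances; the paper may use a
    different weighting scheme.
    """
    scores = [-euclidean(point, c) / temperature for c in centers]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompt_distributions(point, centers, cluster_dists, temperature=1.0):
    """Blend per-cluster prompt distributions into one per-instance distribution."""
    weights = distance_weights(point, centers, temperature)
    n_prompts = len(cluster_dists[0])
    return [
        sum(w * dist[j] for w, dist in zip(weights, cluster_dists))
        for j in range(n_prompts)
    ]


# Hypothetical toy data: two clusters in 2-D and three candidate prompts.
centers = [[0.0, 0.0], [4.0, 0.0]]
cluster_dists = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
mixed = mix_prompt_distributions([1.0, 0.0], centers, cluster_dists)
print(mixed)
```

Because the cluster weights sum to one and each per-cluster distribution sums to one, the blended result is again a valid probability distribution from which prompts could be sampled.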

68. VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

ArXiv ID: 2501.12327

Authors: Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou

Abstract: We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT innovatively extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while seamlessly accommodating mixed-modal input and output within a single model framework. Our VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. The three stages are designed, respectively, to align visual and textual features, enhance instruction following for both understanding and generation, and improve visual generation quality. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports capabilities in autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks. Project page: https://vargpt-1.github.io/

Comment: VARGPT introduces a multimodal large language model, extending the LLaVA framework for unified visual understanding and generation. While it performs well on vision-centric benchmarks, it does not introduce fundamental theoretical advancements beyond incremental multimodal integration.

Relevance: 7 Novelty: 6

69. Disentangled Interpretable Representation for Efficient Long-term Time Series Forecasting

ArXiv ID: 2411.17257

Authors: Yuang Zhao, Tianyu Li, Jiadong Chen, Shenrong Ye, Fuxin Jiang, Tieying Zhang, Xiaofeng Gao

Abstract: Industry 5.0 introduces new challenges for Long-term Time Series Forecasting (LTSF), characterized by high-dimensional, high-resolution data and high-stakes application scenarios. Against this backdrop, developing efficient and interpretable models for LTSF becomes a key challenge. Existing deep learning and linear models often suffer from excessive parameter complexity and lack intuitive interpretability. To address these issues, we propose DiPE-Linear, a Disentangled interpretable Parameter-Efficient Linear network. DiPE-Linear incorporates three temporal components: Static Frequential Attention (SFA), Static Temporal Attention (STA), and Independent Frequential Mapping (IFM). These components alternate between learning in the frequency and time domains to achieve disentangled interpretability. The decomposed model structure reduces parameter complexity from quadratic in fully connected networks (FCs) to linear and computational complexity from quadratic to log-linear. Additionally, a Low-Rank Weight Sharing policy enhances the model's ability to handle multivariate series. Despite operating within a subspace of FCs with limited expressive capacity, DiPE-Linear demonstrates comparable or superior performance to both FCs and nonlinear models across multiple open-source and real-world LTSF datasets, validating the effectiveness of its sophisticatedly designed structure. The combination of efficiency, accuracy, and interpretability makes DiPE-Linear a strong candidate for advancing LTSF in both research and real-world applications. The source code is available at https://github.com/wintertee/DiPE-Linear.

Comment: This paper introduces a disentangled interpretable parameter-efficient model for long-term time series forecasting. The use of Low-Rank Weight Sharing and a novel combination of static attention mechanisms shows clear ties to representation learning and model compression principles, though it is domain-specific. This makes it moderately relevant to foundational representation learning topics.

Relevance: 7 Novelty: 6

Paper selection prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Relevant Topics

Representation Learning - Relevant: Feature learning, sparse/contrastive learning, dictionary learning, or theoretical insights into how deep networks encode information. - Irrelevant: Application-only work using standard representation learning without innovative insights.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, and other foundational structures. - Irrelevant: Simply applying existing architectures to new tasks without structural/theoretical innovation.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank, KV cache, or theoretical/algorithmic innovations for efficiency, etc. - Irrelevant: Simply applying existing compression to new tasks.
Large Language Models (LLMs) - Relevant: Strong theoretical insights on LLM behavior, architecture/training breakthroughs (e.g., MoE). - Irrelevant: Domain-specific usage or small tweaks (e.g., instruction tuning), lack of theoretical advancement (e.g., benchmarks/datasets, inference tricks like RAG).
AI for Science - Relevant: Foundational research in molecule/protein modeling (e.g., new training paradigms, advanced generative methods, or theoretical perspectives), or major architecture-level innovation. - Irrelevant: Conventional, domain-limited applications lacking insights on the foundational side.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging assumptions, or broad new paradigms/concepts in AI research. - Irrelevant: Trend-following or incremental extensions on existing methods.

Note: Foundation vs. Application - Foundational/theoretical papers (new theorems, architectures, or strong methodological insights) are of high relevance. - Subdomain papers and application-focused papers (e.g., "methods for xxx") are lower in relevance.

Hints on Irrelevant Domains: Federated Learning, Online Learning, Transfer Learning, Reinforcement Learning, etc.

Hints on Application Tasks: Image Segmentation, Medical Imaging, Speech Recognition, Video Understanding, Recommendation Systems, 3D Vision, Machine Translation, Information Retrieval, etc.

Papers

[PAPER LIST HERE]

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other. E.g., a paper with high relevance can be of low novelty, or vice versa.

Relevance Scoring

Relevance 9-10 (Completely Relevant)
Focus: Fully aligned with core topics; score the highest if the paper also contains the keywords below.
Keywords: “Mixture of Experts (MoE),” “Representation Learning,” “Compression,” “Sparse/Sparsity,” “Pruning,” “Quantization,” “Low-rank,” “Scaling,” “Foundation Models,” etc.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
Focus: Clearly tied to our main topics, but may not fully match our interest in foundational methods.
Examples: Pure research on representation/architecture on MoE with no other domain focus.
Relevance 5-6 (Optional)
Focus: Linked to our topics; covers relevant ideas but also includes another area of interest.
Examples: Work referencing MoE centered on another domain.
Relevance 3-4 (Irrelevant)
Focus: Largely outside our interests, with no association to our topics.
Examples: Application-focused papers, such as using MoE to solve a real-world problem.
Relevance 1-2 (Ignore)
Focus: Purely unrelated to our topics. Completely a different domain.
Exception: If you think the paper represents an emerging trend that may lead to a major breakthrough in the future, you can give a score of 9-10. (This is usually a very rare concept that belongs to fundamental research.)

Novelty Scoring

Novelty 9-10 (Breakthrough)
Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Moderate)
Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with weak motivation; purely engineering work such as a new benchmark/dataset.
Novelty 1-2 (Low)
Definition: Minimal originality, applying standard approaches without real innovation.
Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
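
As a quick illustration of how the two rubrics above partition the 1-10 range, the bands can be written as a lookup. This is a minimal sketch: the band boundaries and labels are copied from the rubric, while the helper name `band_label` and the table layout are assumptions made for the example.

```python
# (low, high, label) triples copied from the relevance and novelty rubrics.
RELEVANCE_BANDS = [
    (9, 10, "Completely Relevant"),
    (7, 8, "Relevant"),
    (5, 6, "Optional"),
    (3, 4, "Irrelevant"),
    (1, 2, "Ignore"),
]

NOVELTY_BANDS = [
    (9, 10, "Breakthrough"),
    (7, 8, "Improvements"),
    (5, 6, "Moderate"),
    (3, 4, "Tangential"),
    (1, 2, "Low"),
]


def band_label(score, bands):
    """Return the rubric label whose inclusive range contains `score`."""
    for low, high, label in bands:
        if low <= score <= high:
            return label
    raise ValueError(f"score must be an integer from 1 to 10, got {score!r}")


# A paper scored "Relevance: 7 Novelty: 6" falls into these bands.
print(band_label(7, RELEVANCE_BANDS))
print(band_label(6, NOVELTY_BANDS))
```

Keeping the two tables separate mirrors the point that relevance and novelty are scored independently of each other.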

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

ARXIVID: should be the ArXiv ID.
COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
RELEVANCE: should be a score from 1-10.
NOVELTY: should be a score from 1-10.
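
For readers consuming this output, a response in the requested JSONL shape could be parsed and checked roughly as below. This is a sketch under stated assumptions: the four key names come from the instructions above, but the function name `parse_scored_papers`, the validation rules, and the shortened sample comments are illustrative and not part of the original pipeline.

```python
import json

REQUIRED_KEYS = ("ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY")


def parse_scored_papers(jsonl_text):
    """Parse one JSON object per non-empty line and validate its fields."""
    records = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(score, bool) or not isinstance(score, int):
                raise ValueError(f"line {line_number}: {key} must be an integer")
            if not 1 <= score <= 10:
                raise ValueError(f"line {line_number}: {key} must be in 1-10")
        records.append(record)
    return records


# Sample lines built from two entries in this digest (comments shortened).
sample = "\n".join(
    [
        json.dumps({"ARXIVID": "2501.10617", "COMMENT": "Novel pseudometric for distributions.", "RELEVANCE": 7, "NOVELTY": 7}),
        json.dumps({"ARXIVID": "2501.12327", "COMMENT": "Extends LLaVA for unified understanding and generation.", "RELEVANCE": 7, "NOVELTY": 6}),
    ]
)

papers = parse_scored_papers(sample)
for paper in papers:
    print(paper["ARXIVID"], paper["RELEVANCE"], paper["NOVELTY"])
```

Storing the ArXiv ID as a string keeps trailing zeros in IDs such as 2501.12310 from being lost to numeric parsing.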