Previous Day 2025-06-02
Monthly Overview 2025-06
Next Day 2025-06-04

Personalized Daily ArXiv Papers 2025-06-03

[gpt-4o] Prompt Completion Total
Token 87195 11992 99187
Cost $0.22 $0.12 $0.34

Total arXiv papers: 1424

Total scanned papers: 816

Total relevant papers: 56

Table of contents with paper titles:

  1. FORT: Forward-Only Regression Training of Normalizing Flows Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose

  2. Esoteric Language Models Authors: Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat

  3. MLorc: Momentum Low-rank Compression for Large Language Model Adaptation Authors: Wei Shen, Yaxiang Zhang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen

  4. SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model Authors: Zhao Yang, Jiwei Zhu, Bing Su

  5. Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures Authors: Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch

  6. Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism Authors: Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long

  7. FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts Authors: Xinyi Wang, Lirong Gao, Haobo Wang, Yiming Zhang, Junbo Zhao

  8. Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer Authors: Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li

  9. Model Reprogramming Demystified: A Neural Tangent Kernel Perspective Authors: Ming-Yu Chung, Jiashuo Fan, Hancheng Ye, Qinsi Wang, Wei-Chen Shen, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo

  10. Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers Authors: Kazuki Irie, Morris Yau, Samuel J. Gershman

  11. It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs Authors: Jun Wu, Yirong Xiong, Jiangtao Wen, Yuxing Han

  12. LLM Cannot Discover Causality, and Should Be Restricted to Non-Decisional Support in Causal Discovery Authors: Xingyu Wu, Kui Yu, Jibin Wu, Kay Chen Tan

  13. Slow Feature Analysis as Variational Inference Objective Authors: Merlin Sch\"uler, Laurenz Wiskott

  14. Unified Scaling Laws for Compressed Representations Authors: Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Mher Safaryan, Dan Alistarh

  15. Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization Authors: Wojciech Masarczyk, Mateusz Ostaszewski, Tin Sum Cheng, Tomasz Trzci\'nski, Aurelien Lucchi, Razvan Pascanu

  16. Uni-LoRA: One Vector is All You Need Authors: Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji

  17. Ultra-Quantisation: Efficient Embedding Search via 1.58-bit Encodings Authors: Richard Connor, Alan Dearle, Ben Claydon

  18. Latent Structured Hopfield Network for Semantic Association and Retrieval Authors: Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng

  19. Probing Neural Topology of Large Language Models Authors: Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi

  20. Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning Authors: Ryotaro Kawata, Kohsei Matsutani, Yuri Kinoshita, Naoki Nishikawa, Taiji Suzuki

  21. On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning Authors: Magdalena Proszewska, Nikolay Malkin, N. Siddharth

  22. Tug-of-war between idiom's figurative and literal meanings in LLMs Authors: Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg

  23. Earley-Driven Dynamic Pruning for Efficient Structured Decoding Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni

  24. Overfitting has a limitation: a model-independent generalization error bound based on R\'enyi entropy Authors: Atsushi Suzuki

  25. Unlocking the Power of Rehearsal in Continual Learning: A Theoretical Perspective Authors: Junze Deng, Qinhang Wu, Peizhong Ju, Sen Lin, Yingbin Liang, Ness Shroff

  26. zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression Authors: Saibo Geng, Nathan Ranchin, Yunzhen yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West

  27. TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network Authors: Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kun Yuan, Binhang Yuan

  28. Learning DNF through Generalized Fourier Representations Authors: Mohsen Heidari, Roni Khardon

  29. PMNO: A novel physics guided multi-step neural operator predictor for partial differential equations Authors: Jin Song, Kenji Kawaguchi, Zhenya Yan

  30. Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis Authors: Qi Chen, Jierui Zhu, Florian Shkurti

  31. Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space Authors: Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan

  32. Existing Large Language Model Unlearning Evaluations Are Inconclusive Authors: Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter

  33. Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs Authors: Xue Xian Zheng, Weihang Liu, Xin Lou, Stefan Vlaski, Tareq Al-Naffouri

  34. Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness Authors: Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, Volkan Cevher

  35. Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts Authors: Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

  36. Self-supervised Latent Space Optimization with Nebula Variational Coding Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

  37. Quotient Network - A Network Similar to ResNet but Learning Quotients Authors: Peng Hui, Jiamuyang Zhao, Changxin Li, Qingzhen Zhu

  38. Connecting Neural Models Latent Geometries with Relative Geodesic Representations Authors: Hanlin Yu, Berfin Inal, Georgios Arvanitidis, Soren Hauberg, Francesco Locatello, Marco Fumero

  39. LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning Authors: Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu

  40. Trade-offs in Data Memorization via Strong Data Processing Inequalities Authors: Vitaly Feldman, Guy Kornowski, Xin Lyu

  41. Flexible Mixed Precision Quantization for Learned Image Compression Authors: Md Adnan Faisal Hossain, Zhihao Duan, Fengqing Zhu

  42. MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models Authors: Srivathsan Badrinarayanan, Rishikesh Magar, Akshay Antony, Radheesh Sharma Meda, Amir Barati Farimani

  43. Rapid yet accurate Tile-circuit and device modeling for Analog In-Memory Computing Authors: J. Luquin, C. Mackin, S. Ambrogio, A. Chen, F. Baldi, G. Miralles, M. J. Rasch, J. B\"uchel, M. Lalwani, W. Ponghiran, P. Solomon, H. Tsai, G. W. Burr, P. Narayanan

  44. Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods Authors: Yifan Hao, Xingyuan Pan, Hanning Zhang, Chenlu Ye, Rui Pan, Tong Zhang

  45. Unlocking Personalized Knowledge in Federated Large Language Model: The Power of Mixture of Experts Authors: Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi

  46. Mamba Drafters for Speculative Decoding Authors: Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

  47. VUSA: Virtually Upscaled Systolic Array Architecture to Exploit Unstructured Sparsity in AI Acceleration Authors: Shereef Helal, Alberto Garcia-Ortiz, Lennart Bamberg

  48. Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors Authors: Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

  49. Control-R: Towards controllable test-time scaling Authors: Di Zhang, Weida Wang, Junxian Li, Xunzhi Wang, Jiatong Li, Jianbo Wu, Jingdi Lei, Haonan He, Peng Ye, Shufei Zhang, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou

  50. Data Pruning by Information Maximization Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

  51. Boosting Bot Detection via Heterophily-Aware Representation Learning and Prototype-Guided Cluster Discovery Authors: Buyun He, Xiaorui Jiang, Qi Wu, Hao Liu, Yingguang Yang, Yong Liao

  52. From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models Authors: As{\i}m Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani

  53. Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval Authors: Seongwan Park, Taeklim Kim, Youngjoong Ko

  54. Concept-Centric Token Interpretation for Vector-Quantized Generative Models Authors: Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu

  55. Weight-Space Linear Recurrent Neural Networks Authors: Roussel Desmond Nzoyem, Nawid Keshtmand, Idriss Tsayem, David A. W. Barton, Tom Deakin

  56. Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers Authors: Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati


1. FORT: Forward-Only Regression Training of Normalizing Flows

ArXiv ID: 2506.01158

Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose

Abstract: Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to neural dynamical systems that encompass modern large-scale diffusion and flow matching models. Despite the scalability of training, the generation of high-quality samples and their corresponding likelihood under the model requires expensive numerical simulation -- inhibiting adoption in numerous scientific applications such as equilibrium sampling of molecular systems. In this paper, we revisit classical normalizing flows as one-step generative models with exact likelihoods and propose a novel, scalable training objective that does not require computing the expensive change of variable formula used in conventional maximum likelihood training. We propose Forward-Only Regression Training (FORT), a simple $\ell_2$-regression objective that maps prior samples under our flow to specifically chosen targets. We demonstrate that FORT supports a wide class of targets, such as optimal transport targets and targets from pre-trained continuous-time normalizing flows (CNF). We further demonstrate that by using CNF targets, our one-step flows allow for larger-scale training that exceeds the performance and stability of maximum likelihood training, while unlocking a broader class of architectures that were previously challenging to train. Empirically, we elucidate that our trained flows can perform equilibrium conformation sampling in Cartesian coordinates of alanine dipeptide, alanine tripeptide, and alanine tetrapeptide.

Comment: Author match


2. Esoteric Language Models

ArXiv ID: 2506.01928

Authors: Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat

Abstract: Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the first to introduce KV caching for MDMs while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to 65x faster inference than standard MDMs and 4x faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/Eso-LMs

Comment: The paper introduces a new family of models, Eso-LMs, which combines AR and MDM paradigms and introduces KV caching for MDMs, aligning with foundational research in LLM architecture.

Relevance: 9 Novelty: 8


3. MLorc: Momentum Low-rank Compression for Large Language Model Adaptation

ArXiv ID: 2506.01897

Authors: Wei Shen, Yaxiang Zhang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen

Abstract: With increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). By directly compressing and reconstructing momentum rather than gradients, MLorc avoids imposing a fixed-rank constraint on weight update matrices and better preserves the training dynamics of full-parameter fine-tuning, in contrast to existing low-rank approaches such as LoRA and GaLore. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning with a small rank (e.g., $r=4$), and generalizes well across different optimizers -- all while not compromising time or memory efficiency. Furthermore, we provide a theoretical guarantee for its convergence under reasonable assumptions.

Comment: The paper proposes a novel memory-efficient training paradigm for LLMs using momentum low-rank compression, aligning with foundational research in model compression.

Relevance: 9 Novelty: 8


4. SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model

ArXiv ID: 2506.01833

Authors: Zhao Yang, Jiwei Zhu, Bing Su

Abstract: Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our $\textbf{S}$pecies-$\textbf{P}$rofile $\textbf{A}$daptive $\textbf{C}$ollaborative $\textbf{E}$xperts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at https://github.com/ZhuJiwei111/SPACE.

Comment: The paper introduces a Mixture of Experts (MoE) model for DNA representation learning, which is relevant to both representation learning and model architecture.

Relevance: 9 Novelty: 8


5. Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

ArXiv ID: 2506.01197

Authors: Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch

Abstract: Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.

Comment: The paper introduces a modified sparse autoencoder architecture incorporating hierarchical semantics, relevant to representation learning and model architecture.

Relevance: 9 Novelty: 8


6. Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

ArXiv ID: 2506.01260

Authors: Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long

Abstract: Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model-parallel requires compressing activations and activation gradients as they propagate through layers, accumulating compression errors. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation with negligible memory/compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace to confine the activations and gradients, allowing full reconstruction in subsequent layers. Our method achieves up to 100x improvement in communication efficiency and enables training billion-parameter-scale models over low-end GPUs connected via consumer-grade internet speeds as low as 80Mbps, matching the convergence of centralized datacenter systems with 100Gbps connections with model parallel.

Comment: The paper proposes a novel compression algorithm for decentralized training, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8


7. FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts

ArXiv ID: 2506.00495

Authors: Xinyi Wang, Lirong Gao, Haobo Wang, Yiming Zhang, Junbo Zhao

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a widely adopted strategy for adapting pre-trained Large Language Models (LLMs) to downstream tasks, significantly reducing memory and computational costs. However, most existing PEFT techniques uniformly deploy LoRA adapters across all layers, disregarding the intrinsic heterogeneity of layer contributions and task-specific rank requirements. This uniform paradigm leads to redundant parameter allocation and suboptimal adaptation efficiency. To address these limitations, we propose FLoE, a novel PEFT framework that introduces two key innovations: (i) a Fisher information-guided importance scoring mechanism to dynamically identify task-critical transformer layers for MoE-based low-rank adaptation, enabling sparse adapter deployment; and (ii) a Bayesian optimization-driven rank allocator that automatically determines optimal LoRA ranks on specific datasets without exhaustive grid search. Extensive experiments across diverse LLMs and benchmarks reveal that FLoE achieves impressive efficiency-accuracy trade-offs, making FLoE particularly advantageous in resource-constrained environments that necessitate rapid adaptation.

Comment: The paper proposes a novel PEFT framework for LLMs using MoE-based low-rank adaptation, relevant to model architecture and compression.

Relevance: 9 Novelty: 8


8. Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

ArXiv ID: 2506.01115

Authors: Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li

Abstract: The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT -- the Mixing Transformer -- a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads -- a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.

Comment: The paper analyzes the roles of attention and MLP in Transformers, providing insights into model architecture.

Relevance: 9 Novelty: 8


9. Model Reprogramming Demystified: A Neural Tangent Kernel Perspective

ArXiv ID: 2506.00620

Authors: Ming-Yu Chung, Jiashuo Fan, Hancheng Ye, Qinsi Wang, Wei-Chen Shen, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo

Abstract: Model Reprogramming (MR) is a resource-efficient framework that adapts large pre-trained models to new tasks with minimal additional parameters and data, offering a promising solution to the challenges of training large models for diverse tasks. Despite its empirical success across various domains such as computer vision and time-series forecasting, the theoretical foundations of MR remain underexplored. In this paper, we present a comprehensive theoretical analysis of MR through the lens of the Neural Tangent Kernel (NTK) framework. We demonstrate that the success of MR is governed by the eigenvalue spectrum of the NTK matrix on the target dataset and establish the critical role of the source model's effectiveness in determining reprogramming outcomes. Our contributions include a novel theoretical framework for MR, insights into the relationship between source and target models, and extensive experiments validating our findings.

Comment: The paper provides a theoretical analysis of Model Reprogramming using the Neural Tangent Kernel framework, which aligns with representation learning insights.

Relevance: 9 Novelty: 8


10. Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers

ArXiv ID: 2506.00744

Authors: Kazuki Irie, Morris Yau, Samuel J. Gershman

Abstract: We develop hybrid memory architectures for general-purpose sequence processing neural networks, that combine key-value memory using softmax attention (KV-memory) with dynamic synaptic memory through fast-weight programming (FW-memory) -- the core principles of quadratic and linear transformers, respectively. These two memory systems have complementary but individually limited properties: KV-memory offers precise retrieval but is constrained by quadratic complexity in sequence length, while FW-memory supports arbitrarily long sequences and enables more expressive computation but sacrifices precise recall. We propose and compare three methods to blend these two systems into a single memory system to leverage the strengths of both. We conduct experiments on general language modeling and retrieval tasks by training 340M- and 1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks designed to precisely illustrate the benefits of certain hybrid methods over others. We also evaluate our hybrid memory systems on reinforcement learning in partially observable environments. Overall, we demonstrate how a well-designed hybrid can overcome the limitations of its individual components, offering new insights into the design principle of neural memory systems.

Comment: The paper discusses hybrid memory architectures in transformers, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8


11. It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs

ArXiv ID: 2506.00486

Authors: Jun Wu, Yirong Xiong, Jiangtao Wen, Yuxing Han

Abstract: Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters follow generalized Gaussian distributions (GGDs) better. By optimizing GG priors during training, BackSlash can reduce parameters by up to 90\% with minimal performance loss. Building on this foundational insight, we propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: (1) GG-based initialization scheme that aligns with the statistical structure of trained models, resulting in faster convergence and improved accuracy; (2) DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, improving compressibility with minimized degradation in performance; and (3) RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training, enabling low-cost inference without compromising accuracy. Experiments across diverse model architectures show that our framework consistently yields smaller and faster models that match or outperform standard training baselines. By grounding LLM development in principled statistical modeling, this work forges a new path toward efficient, scalable, and hardware-aware AI systems. The code is available on our project page: https://huggingface.co/spaces/shifeng3711/gg_prior.

Comment: The paper proposes a framework for LLM optimization using generalized Gaussian priors, which aligns with model compression and efficiency.

Relevance: 9 Novelty: 8


12. LLM Cannot Discover Causality, and Should Be Restricted to Non-Decisional Support in Causal Discovery

ArXiv ID: 2506.00844

Authors: Xingyu Wu, Kui Yu, Jibin Wu, Kay Chen Tan

Abstract: This paper critically re-evaluates LLMs' role in causal discovery and argues against their direct involvement in determining causal relationships. We demonstrate that LLMs' autoregressive, correlation-driven modeling inherently lacks the theoretical grounding for causal reasoning and introduces unreliability when used as priors in causal discovery algorithms. Through empirical studies, we expose the limitations of existing LLM-based methods and reveal that deliberate prompt engineering (e.g., injecting ground-truth knowledge) could overstate their performance, helping to explain the consistently favorable results reported in much of the current literature. Based on these findings, we strictly confined LLMs' role to a non-decisional auxiliary capacity: LLMs should not participate in determining the existence or directionality of causal relationships, but can assist the search process for causal graphs (e.g., LLM-based heuristic search). Experiments across various settings confirm that, by strictly isolating LLMs from causal decision-making, LLM-guided heuristic search can accelerate the convergence and outperform both traditional and LLM-based methods in causal structure learning. We conclude with a call for the community to shift focus from naively applying LLMs to developing specialized models and training method that respect the core principles of causal discovery.

Comment: The paper provides theoretical insights into the limitations of LLMs in causal discovery, aligning with the criteria for foundational research in LLM behavior.

Relevance: 9 Novelty: 8


13. Slow Feature Analysis as Variational Inference Objective

ArXiv ID: 2506.00580

Authors: Merlin Sch\"uler, Laurenz Wiskott

Abstract: This work presents a novel probabilistic interpretation of Slow Feature Analysis (SFA) through the lens of variational inference. Unlike prior formulations that recover linear SFA from Gaussian state-space models with linear emissions, this approach relaxes the key constraint of linearity. While it does not lead to full equivalence to non-linear SFA, it recasts the classical slowness objective in a variational framework. Specifically, it allows the slowness objective to be interpreted as a regularizer to a reconstruction loss. Furthermore, we provide arguments, why -- from the perspective of slowness optimization -- the reconstruction loss takes on the role of the constraints that ensure informativeness in SFA. We conclude with a discussion of potential new research directions.

Comment: The paper provides a novel probabilistic interpretation of Slow Feature Analysis through variational inference, which aligns with representation learning insights.

Relevance: 9 Novelty: 8


14. Unified Scaling Laws for Compressed Representations

ArXiv ID: 2506.01863

Authors: Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Mher Safaryan, Dan Alistarh

Abstract: Scaling laws have shaped recent advances in machine learning by enabling predictable scaling of model performance based on model size, computation, and data volume. Concurrently, the rise in computational cost for AI has motivated model compression techniques, notably quantization and sparsification, which have emerged to mitigate the steep computational demands associated with large-scale training and inference. This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it is applicable both individually but also composably across compression types. Based on this, our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric -- based on the representation's ability to fit random Gaussian data -- which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.

Comment: The paper explores the interplay between scaling laws and compression formats, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8


15. Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization

ArXiv ID: 2506.01562

Authors: Wojciech Masarczyk, Mateusz Ostaszewski, Tin Sum Cheng, Tomasz Trzci\'nski, Aurelien Lucchi, Razvan Pascanu

Abstract: The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias - a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logits norm, which is implicitly influenced by hyperparameters or directly modified by softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.

Comment: The paper provides insights into the role of the softmax function in representation learning, which is relevant to understanding training dynamics in neural networks.

Relevance: 9 Novelty: 8


16. Uni-LoRA: One Vector is All You Need

ArXiv ID: 2506.00799

Authors: Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji

Abstract: Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace R^d, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$.Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a "one-vector-only" solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.

Comment: The paper presents Uni-LoRA, a framework for parameter-efficient fine-tuning, relevant to model compression and efficiency.

Relevance: 9 Novelty: 8


17. Ultra-Quantisation: Efficient Embedding Search via 1.58-bit Encodings

ArXiv ID: 2506.00528

Authors: Richard Connor, Alan Dearle, Ben Claydon

Abstract: Many modern search domains comprise high-dimensional vectors of floating point numbers derived from neural networks, in the form of embeddings. Typical embeddings range in size from hundreds to thousands of dimensions, making the size of the embeddings, and the speed of comparison, a significant issue. Quantisation is a class of mechanism which replaces the floating point values with a smaller representation, for example a short integer. This gives an approximation of the embedding space in return for a smaller data representation and a faster comparison function. Here we take this idea almost to its extreme: we show how vectors of arbitrary-precision floating point values can be replaced by vectors whose elements are drawn from the set {-1,0,1}. This yields very significant savings in space and metric evaluation cost, while maintaining a strong correlation for similarity measurements. This is achieved by way of a class of convex polytopes which exist in the high-dimensional space. In this article we give an outline description of these objects, and show how they can be used for the basis of such radical quantisation while maintaining a surprising degree of accuracy.

Comment: The paper introduces a novel quantization method for efficient embedding search, which aligns with model compression and efficiency.

Relevance: 9 Novelty: 8


18. Latent Structured Hopfield Network for Semantic Association and Retrieval

ArXiv ID: 2506.01303

Authors: Chong Li, Xiangyang Xue, Jianfeng Feng, Taiping Zeng

Abstract: Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms.

Comment: The paper introduces a biologically inspired framework integrating Hopfield networks into an autoencoder architecture, which aligns with representation learning and model architecture.

Relevance: 9 Novelty: 8


19. Probing Neural Topology of Large Language Models

ArXiv ID: 2506.01042

Authors: Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi

Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at https://github.com/DavyMorgan/llm-graph-probing.

Comment: The paper introduces graph probing to uncover functional connectivity in LLMs, providing insights into their internal mechanisms, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8


20. Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning

ArXiv ID: 2506.01656

Authors: Ryotaro Kawata, Kohsei Matsutani, Yuri Kinoshita, Naoki Nishikawa, Taiji Suzuki

Abstract: Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically distributes each input to appropriate experts, has achieved successful results in the field of machine learning. However, theoretical understanding of this architecture is falling behind due to its inherent complexity. In this paper, we theoretically study the sample and runtime complexity of MoE following the stochastic gradient descent (SGD) when learning a regression task with an underlying cluster structure of single index models. On the one hand, we prove that a vanilla neural network fails in detecting such a latent organization as it can only process the problem as a whole. This is intrinsically related to the concept of information exponent which is low for each cluster, but increases when we consider the entire task. On the other hand, we show that a MoE succeeds in dividing this problem into easier subproblems by leveraging the ability of each expert to weakly recover the simpler function corresponding to an individual cluster. To the best of our knowledge, this work is among the first to explore the benefits of the MoE framework by examining its SGD dynamics in the context of nonlinear regression.

Comment: The paper provides a theoretical study on the sample and runtime complexity of MoE, aligning with foundational research in model architecture.

Relevance: 9 Novelty: 7


21. On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

ArXiv ID: 2506.00136

Authors: Magdalena Proszewska, Nikolay Malkin, N. Siddharth

Abstract: Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for tasks such as downstream classification, controllable generation, and interpolation. However, the generative performance of DAs relies heavily on how well the latent variables can be modelled and subsequently sampled from. Better generative modelling is also the primary goal of another class of diffusion models -- those that learn their forward (noising) process. While effective at adjusting the noise process in an input-dependent manner, they must satisfy additional constraints derived from the terminal conditions of the diffusion process. Here, we draw a connection between these two classes of models and show that certain design decisions (latent variable choice, conditioning method, etc.) in the DA framework -- leading to a model we term DMZ -- allow us to obtain the best of both worlds: effective representations as evaluated on downstream tasks, including domain transfer, as well as more efficient modelling and generation with fewer denoising steps compared to standard DMs.

Comment: The paper discusses diffusion autoencoders for efficient generation and representation learning, which aligns with foundational research in representation learning.

Relevance: 9 Novelty: 7


22. Tug-of-war between idiom's figurative and literal meanings in LLMs

ArXiv ID: 2506.01723

Authors: Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg

Abstract: Idioms present a unique challenge for language models due to their non-compositional figurative meanings, which often strongly diverge from the idiom's literal interpretation. This duality requires a model to learn representing and deciding between the two meanings to interpret an idiom in a figurative sense, or literally. In this paper, we employ tools from mechanistic interpretability to trace how a large pretrained causal transformer (LLama3.2-1B-base) deals with this ambiguity. We localize three steps of idiom processing: First, the idiom's figurative meaning is retrieved in early attention and MLP sublayers. We identify specific attention heads which boost the figurative meaning of the idiom while suppressing the idiom's literal interpretation. The model subsequently represents the figurative representation through an intermediate path. Meanwhile, a parallel bypass route forwards literal interpretation, ensuring that a both reading remain available. Overall, our findings provide a mechanistic evidence for idiom comprehension in an autoregressive transformer.

Comment: The paper provides mechanistic insights into how LLMs process idioms, which aligns with the interest in theoretical insights into LLM behavior.

Relevance: 9 Novelty: 7


23. Earley-Driven Dynamic Pruning for Efficient Structured Decoding

ArXiv ID: 2506.01151

Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni

Abstract: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.

Comment: The paper proposes a dynamic pruning strategy for efficient structured decoding, relevant to model compression.

Relevance: 9 Novelty: 7


24. Overfitting has a limitation: a model-independent generalization error bound based on R\'enyi entropy

ArXiv ID: 2506.00182

Authors: Atsushi Suzuki

Abstract: Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization error, which is the impact of overfitting. Understanding generalization error behavior of increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity, failing to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound for generalization error applicable to algorithms whose outputs are determined solely by the data's histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the R\'enyi entropy of the data-generating distribution, suggesting that a small generalization error can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data, where the performance degrade is attributed to the consequent increase in the data distribution's R\'enyi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the R\'enyi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.

Comment: The paper introduces a model-independent generalization error bound based on Rényi entropy, which is relevant to emerging trends in theoretical work.

Relevance: 8 Novelty: 8


25. Unlocking the Power of Rehearsal in Continual Learning: A Theoretical Perspective

ArXiv ID: 2506.00205

Authors: Junze Deng, Qinhang Wu, Peizhong Ju, Sen Lin, Yingbin Liang, Ness Shroff

Abstract: Rehearsal-based methods have shown superior performance in addressing catastrophic forgetting in continual learning (CL) by storing and training on a subset of past data alongside new data in current task. While such a concurrent rehearsal strategy is widely used, it remains unclear if this approach is always optimal. Inspired by human learning, where sequentially revisiting tasks helps mitigate forgetting, we explore whether sequential rehearsal can offer greater benefits for CL compared to standard concurrent rehearsal. To address this question, we conduct a theoretical analysis of rehearsal-based CL in overparameterized linear models, comparing two strategies: 1) Concurrent Rehearsal, where past and new data are trained together, and 2) Sequential Rehearsal, where new data is trained first, followed by revisiting past data sequentially. By explicitly characterizing forgetting and generalization error, we show that sequential rehearsal performs better when tasks are less similar. These insights further motivate a novel Hybrid Rehearsal method, which trains similar tasks concurrently and revisits dissimilar tasks sequentially. We characterize its forgetting and generalization performance, and our experiments with deep neural networks further confirm that the hybrid approach outperforms standard concurrent rehearsal. This work provides the first comprehensive theoretical analysis of rehearsal-based CL.

Comment: The paper provides a theoretical analysis of rehearsal-based continual learning, which is relevant to representation learning.

Relevance: 8 Novelty: 8


26. zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

ArXiv ID: 2506.01084

Authors: Saibo Geng, Nathan Ranchin, Yunzhen yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West

Abstract: Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60\%, with significant improvements in inference latency.

Comment: The paper introduces a framework for adaptive vocabularies in language models via token compression, which is relevant to model compression.

Relevance: 8 Novelty: 8


27. TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

ArXiv ID: 2506.01352

Authors: Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kun Yuan, Binhang Yuan

Abstract: Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants but faces significant network communication bottlenecks, particularly in pipeline-parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. Existing activation compression methods, such as AQ-SGD, mitigate quantization-induced errors through error compensation but impose prohibitive memory overhead by requiring storage of previous activations. To address these issues, we introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers. We further provide a theoretical analysis, proving that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of $\mathcal{O}(1/\sqrt{T})$, matching that of vanilla stochastic gradient descent. Extensive experiments on diverse LLM tasks demonstrate that TAH-Quant achieves aggressive activation quantization (3-4 bits) ratio, which provides up to 4.3$\times$ end-to-end speedup without compromising training convergence, matches state-of-the-art methods, incurs no extra memory overhead, and generalizes well across different training scenarios.

Comment: The paper introduces TAH-Quant, a novel activation quantization framework, relevant to model compression.

Relevance: 8 Novelty: 8


28. Learning DNF through Generalized Fourier Representations

ArXiv ID: 2506.01075

Authors: Mohsen Heidari, Roni Khardon

Abstract: The Fourier representation for the uniform distribution over the Boolean cube has found numerous applications in algorithms and complexity analysis. Notably, in learning theory, learnability of Disjunctive Normal Form (DNF) under uniform as well as product distributions has been established through such representations. This paper makes five main contributions. First, it introduces a generalized Fourier expansion that can be used with any distribution $D$ through the representation of the distribution as a Bayesian network (BN). Second, it shows that the main algorithmic tools for learning with the Fourier representation, that use membership queries to approximate functions by recovering their heavy Fourier coefficients, can be used with slight modifications with the generalized expansion. These results hold for any distribution. Third, it analyzes the $L_1$ spectral norm of conjunctions under the new expansion, showing that it is bounded for a class of distributions which can be represented by difference bounded tree BN, where a parent node in the BN representation can change the conditional expectation of a child node by at most $\alpha<0.5$. Lower bounds are presented to show that such constraints are necessary. The fourth contribution uses these results to show the learnability of DNF with membership queries under difference bounded tree BN. The final contribution is to develop an algorithm for learning difference-bounded tree BN distributions, thus extending the DNF learnability result to cases where the distribution is not known in advance.

Comment: The paper introduces a generalized Fourier representation for learning DNF, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 8


29. PMNO: A novel physics guided multi-step neural operator predictor for partial differential equations

ArXiv ID: 2506.01598

Authors: Jin Song, Kenji Kawaguchi, Zhenya Yan

Abstract: Neural operators, which aim to approximate mappings between infinite-dimensional function spaces, have been widely applied in the simulation and prediction of physical systems. However, the limited representational capacity of network architectures, combined with their heavy reliance on large-scale data, often hinder effective training and result in poor extrapolation performance. In this paper, inspired by traditional numerical methods, we propose a novel physics guided multi-step neural operator (PMNO) architecture to address these challenges in long-horizon prediction of complex physical systems. Distinct from general operator learning methods, the PMNO framework replaces the single-step input with multi-step historical data in the forward pass and introduces an implicit time-stepping scheme based on the Backward Differentiation Formula (BDF) during backpropagation. This design not only strengthens the model's extrapolation capacity but also facilitates more efficient and stable training with fewer data samples, especially for long-term predictions. Meanwhile, a causal training strategy is employed to circumvent the need for multi-stage training and to ensure efficient end-to-end optimization. The neural operator architecture possesses resolution-invariant properties, enabling the trained model to perform fast extrapolation on arbitrary spatial resolutions. We demonstrate the superior predictive performance of PMNO predictor across a diverse range of physical systems, including 2D linear system, modeling over irregular domain, complex-valued wave dynamics, and reaction-diffusion processes. Depending on the specific problem setting, various neural operator architectures, including FNO, DeepONet, and their variants, can be seamlessly integrated into the PMNO framework.

Comment: The paper presents a novel neural operator architecture for PDEs, which is relevant to foundational research in AI for science.

Relevance: 8 Novelty: 8


30. Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis

ArXiv ID: 2506.00849

Authors: Qi Chen, Jierui Zhu, Florian Shkurti

Abstract: Despite the empirical success of Diffusion Models (DMs) and Variational Autoencoders (VAEs), their generalization performance remains theoretically underexplored, especially lacking a full consideration of the shared encoder-generator structure. Leveraging recent information-theoretic tools, we propose a unified theoretical framework that provides guarantees for the generalization of both the encoder and generator by treating them as randomized mappings. This framework further enables (1) a refined analysis for VAEs, accounting for the generator's generalization, which was previously overlooked; (2) illustrating an explicit trade-off in generalization terms for DMs that depends on the diffusion time $T$; and (3) providing computable bounds for DMs based solely on the training data, allowing the selection of the optimal $T$ and the integration of such bounds into the optimization process to improve model performance. Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.

Comment: The paper provides a unified theoretical framework for analyzing the generalization of VAEs and DMs, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


31. Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space

ArXiv ID: 2506.00771

Authors: Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan

Abstract: Medicinal chemists often optimize drugs considering their 3D structures and designing structurally distinct molecules that retain key features, such as shapes, pharmacophores, or chemical properties. Previous deep learning approaches address this through supervised tasks like molecule inpainting or property-guided optimization. In this work, we propose a flexible zero-shot molecule manipulation method by navigating in a shared latent space of 3D molecules. We introduce a Variational AutoEncoder (VAE) for 3D molecules, named MolFLAE, which learns a fixed-dimensional, SE(3)-equivariant latent space independent of atom counts. MolFLAE encodes 3D molecules using an SE(3)-equivariant neural network into fixed number of latent nodes, distinguished by learned embeddings. The latent space is regularized, and molecular structures are reconstructed via a Bayesian Flow Network (BFN) conditioned on the encoder's latent output. MolFLAE achieves competitive performance on standard unconditional 3D molecule generation benchmarks. Moreover, the latent space of MolFLAE enables zero-shot molecule manipulation, including atom number editing, structure reconstruction, and coordinated latent interpolation for both structure and properties. We further demonstrate our approach on a drug optimization task for the human glucocorticoid receptor, generating molecules with improved hydrophilicity while preserving key interactions, under computational evaluations. These results highlight the flexibility, robustness, and real-world utility of our method, opening new avenues for molecule editing and optimization.

Comment: The paper introduces a VAE for 3D molecules with a SE(3)-equivariant latent space, which is relevant to representation learning and model architecture.

Relevance: 8 Novelty: 7


32. Existing Large Language Model Unlearning Evaluations Are Inconclusive

ArXiv ID: 2506.00688

Authors: Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter

Abstract: Machine unlearning aims to remove sensitive or undesired data from large language models. However, recent studies suggest that unlearning is often shallow, claiming that removed knowledge can easily be recovered. In this work, we critically examine standard unlearning evaluation practices and uncover key limitations that shake our trust in those findings. First, we show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance by re-teaching the model during testing. Second, we demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. Finally, we find that many evaluations rely on spurious correlations, making their results difficult to trust and interpret. Taken together, these issues suggest that current evaluation protocols may both overstate and understate unlearning success. To address this, we propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness. We validate these principles through a series of targeted experiments, showing how violations of each can lead to misleading conclusions.

Comment: The paper critiques existing LLM unlearning evaluations, which is relevant to large language models.

Relevance: 8 Novelty: 7


33. Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs

ArXiv ID: 2506.01404

Authors: Xue Xian Zheng, Weihang Liu, Xin Lou, Stefan Vlaski, Tareq Al-Naffouri

Abstract: This paper introduces an innovative error feedback framework designed to mitigate quantization noise in distributed graph filtering, where communications are constrained to quantized messages. It comes from error spectrum shaping techniques from state-space digital filters, and therefore establishes connections between quantized filtering processes over different domains. In contrast to existing error compensation methods, our framework quantitatively feeds back the quantization noise for exact compensation. We examine the framework under three key scenarios: (i) deterministic graph filtering, (ii) graph filtering over random graphs, and (iii) graph filtering with random node-asynchronous updates. Rigorous theoretical analysis demonstrates that the proposed framework significantly reduces the effect of quantization noise, and we provide closed-form solutions for the optimal error feedback coefficients. Moreover, this quantitative error feedback mechanism can be seamlessly integrated into communication-efficient decentralized optimization frameworks, enabling lower error floors. Numerical experiments validate the theoretical results, consistently showing that our method outperforms conventional quantization strategies in terms of both accuracy and robustness.

Comment: The paper introduces an error feedback framework for quantization noise reduction, relevant to model compression.

Relevance: 8 Novelty: 7


34. Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness

ArXiv ID: 2506.01913

Authors: Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, Volkan Cevher

Abstract: This work introduces a hybrid non-Euclidean optimization method which generalizes gradient norm clipping by combining steepest descent and conditional gradient approaches. The method achieves the best of both worlds by establishing a descent property under a generalized notion of ($L_0$,$L_1$)-smoothness. Weight decay is incorporated in a principled manner by identifying a connection to the Frank-Wolfe short step. In the stochastic case, we show an order optimal $O(n^{-1/4})$ convergence rate by leveraging a momentum based gradient estimator. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.

Comment: The paper introduces a hybrid non-Euclidean optimization method with insights into training dynamics, which is relevant to representation learning.

Relevance: 8 Novelty: 7


35. Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts

ArXiv ID: 2506.01000

Authors: Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

Abstract: Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming. Our code is available at https://github.com/tmlr-group/DecoupledVP.

Comment: The paper introduces a decoupling-and-reweighting framework for model reprogramming, which provides insights into representation learning.

Relevance: 8 Novelty: 7


36. Self-supervised Latent Space Optimization with Nebula Variational Coding

ArXiv ID: 2506.01414

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: Deep learning approaches process data in a layer-by-layer way with intermediate (or latent) features. We aim at designing a general solution to optimize the latent manifolds to improve the performance on classification, segmentation, completion and/or reconstruction through probabilistic models. This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called \textbf{nebula anchors}, that guide the latent variables to form clusters during training. To prevent the anchors from clustering among themselves, we employ the variational constraint that enforces the latent features within an anchor to form a Gaussian distribution, resulting in a generative model we refer as Nebula Variational Coding (NVC). Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit. As a consequence, the latent variables of our variational coder form clusters which adapt to the generated semantic of the training data, \textit{e.g.} the categorical labels of each sample. We demonstrate experimentally that it can be used within different architectures designed to solve different problems including text sequence, images, 3D point clouds and volumetric data, validating the advantage of our proposed method.

Comment: The paper proposes a variational inference model for optimizing latent manifolds, which is relevant to representation learning.

Relevance: 8 Novelty: 7


37. Quotient Network - A Network Similar to ResNet but Learning Quotients

ArXiv ID: 2506.00992

Authors: Peng Hui, Jiamuyang Zhao, Changxin Li, Qingzhen Zhu

Abstract: The emergence of ResNet provides a powerful tool for training extremely deep networks. The core idea behind it is to change the learning goals of the network. It no longer learns new features from scratch but learns the difference between the target and existing features. However, the difference between the two kinds of features does not have an independent and clear meaning, and the amount of learning is based on the absolute rather than the relative difference, which is sensitive to the size of existing features. We propose a new network that perfectly solves these two problems while still having the advantages of ResNet. Specifically, it chooses to learn the quotient of the target features with the existing features, so we call it the quotient network. In order to enable this network to learn successfully and achieve higher performance, we propose some design rules for this network so that it can be trained efficiently and achieve better performance than ResNet. Experiments on the CIFAR10, CIFAR100, and SVHN datasets prove that this network can stably achieve considerable improvements over ResNet by simply making tiny corresponding changes to the original ResNet network without adding new parameters.

Comment: The paper proposes a new network similar to ResNet but learning quotients, which is relevant to model architecture.

Relevance: 8 Novelty: 7


38. Connecting Neural Models Latent Geometries with Relative Geodesic Representations

ArXiv ID: 2506.01599

Authors: Hanlin Yu, Berfin Inal, Georgios Arvanitidis, Soren Hauberg, Francesco Locatello, Marco Fumero

Abstract: Neural models learn representations of high-dimensional data on low-dimensional manifolds. Multiple factors, including stochasticities in the training process, model architectures, and additional inductive biases, may induce different representations, even when learning the same task on the same data. However, it has recently been shown that when a latent structure is shared between distinct latent spaces, relative distances between representations can be preserved, up to distortions. Building on this idea, we demonstrate that exploiting the differential-geometric structure of latent spaces of neural models, it is possible to capture precisely the transformations between representational spaces trained on similar data distributions. Specifically, we assume that distinct neural models parametrize approximately the same underlying manifold, and introduce a representation based on the pullback metric that captures the intrinsic structure of the latent space, while scaling efficiently to large models. We validate experimentally our method on model stitching and retrieval tasks, covering autoencoders and vision foundation discriminative models, across diverse architectures, datasets, and pretraining schemes.

Comment: The paper explores the latent geometries of neural models, which is relevant to representation learning.

Relevance: 8 Novelty: 7


39. LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning

ArXiv ID: 2506.00772

Authors: Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu

Abstract: Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet, it has lagged behind in the LLM era due to the difficulty of identifying parameters truly critical for reasoning. In this work, we state that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT only updates the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge, compared to Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.

Comment: The paper introduces LIFT, a sparse fine-tuning method for LLMs, relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


40. Trade-offs in Data Memorization via Strong Data Processing Inequalities

ArXiv ID: 2506.01855

Authors: Vitaly Feldman, Guy Kornowski, Xin Lyu

Abstract: Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.

Comment: The paper discusses data memorization in LLMs, providing theoretical insights into memorization and learning.

Relevance: 8 Novelty: 7


41. Flexible Mixed Precision Quantization for Learned Image Compression

ArXiv ID: 2506.01221

Authors: Md Adnan Faisal Hossain, Zhihao Duan, Fengqing Zhu

Abstract: Despite its improvements in coding performance compared to traditional codecs, Learned Image Compression (LIC) suffers from large computational costs for storage and deployment. Model quantization offers an effective solution to reduce the computational complexity of LIC models. However, most existing works perform fixed-precision quantization which suffers from sub-optimal utilization of resources due to the varying sensitivity to quantization of different layers of a neural network. In this paper, we propose a Flexible Mixed Precision Quantization (FMPQ) method that assigns different bit-widths to different layers of the quantized network using the fractional change in rate-distortion loss as the bit-assignment criterion. We also introduce an adaptive search algorithm which reduces the time-complexity of searching for the desired distribution of quantization bit-widths given a fixed model size. Evaluation of our method shows improved BD-Rate performance under similar model size constraints compared to other works on quantization of LIC models. We have made the source code available at gitlab.com/viper-purdue/fmpq.

Comment: The paper proposes a flexible mixed precision quantization method, which is relevant to model compression through quantization.

Relevance: 8 Novelty: 7


42. MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models

ArXiv ID: 2506.00198

Authors: Srivathsan Badrinarayanan, Rishikesh Magar, Akshay Antony, Radheesh Sharma Meda, Amir Barati Farimani

Abstract: The discovery of Metal-Organic Frameworks (MOFs) with application-specific properties remains a central challenge in materials chemistry, owing to the immense size and complexity of their structural design space. Conventional computational screening techniques such as molecular simulations and density functional theory (DFT), while accurate, are computationally prohibitive at scale. Machine learning offers an exciting alternative by leveraging data-driven approaches to accelerate materials discovery. The complexity of MOFs, with their extended periodic structures and diverse topologies, creates both opportunities and challenges for generative modeling approaches. To address these challenges, we present a reinforcement learning-enhanced, transformer-based framework for the de novo design of MOFs. Central to our approach is MOFid, a chemically-informed string representation encoding both connectivity and topology, enabling scalable generative modeling. Our pipeline comprises three components: (1) a generative GPT model trained on MOFid sequences, (2) MOFormer, a transformer-based property predictor, and (3) a reinforcement learning (RL) module that optimizes generated candidates via property-guided reward functions. By integrating property feedback into sequence generation, our method drives the model toward synthesizable, topologically valid MOFs with desired functional attributes. This work demonstrates the potential of large language models, when coupled with reinforcement learning, to accelerate inverse design in reticular chemistry and unlock new frontiers in computational MOF discovery.

Comment: The paper presents a generative design framework for MOFs using LLMs, which is relevant to AI for Science with a focus on foundational research in molecular modeling.

Relevance: 8 Novelty: 7


43. Rapid yet accurate Tile-circuit and device modeling for Analog In-Memory Computing

ArXiv ID: 2506.00004

Authors: J. Luquin, C. Mackin, S. Ambrogio, A. Chen, F. Baldi, G. Miralles, M. J. Rasch, J. B\"uchel, M. Lalwani, W. Ponghiran, P. Solomon, H. Tsai, G. W. Burr, P. Narayanan

Abstract: Analog In-Memory Compute (AIMC) can improve the energy efficiency of Deep Learning by orders of magnitude. Yet analog-domain device and circuit non-idealities -- within the analog ``Tiles'' performing Matrix-Vector Multiply (MVM) operations -- can degrade neural-network task accuracy. We quantify the impact of low-level distortions and noise, and develop a mathematical model for Multiply-ACcumulate (MAC) operations mapped to analog tiles. Instantaneous-current IR-drop (the most significant circuit non-ideality), and ADC quantization effects are fully captured by this model, which can predict MVM tile-outputs both rapidly and accurately, as compared to much slower rigorous circuit simulations. A statistical model of PCM read noise at nanosecond timescales is derived from -- and matched against -- experimental measurements. We integrate these (statistical) device and (deterministic) circuit effects into a PyTorch-based framework to assess the accuracy impact on the BERT and ALBERT Transformer networks. We show that hardware-aware fine-tuning using simple Gaussian noise provides resilience against ADC quantization and PCM read noise effects, but is less effective against IR-drop. This is because IR-drop -- although deterministic -- is non-linear, is changing significantly during the time-integration window, and is ultimately dependent on all the excitations being introduced in parallel into the analog tile. The apparent inability of simple Gaussian noise applied during training to properly prepare a DNN network for IR-drop during inference implies that more complex training approaches -- incorporating advances such as the Tile-circuit model introduced here -- will be critical for resilient deployment of large neural networks onto AIMC hardware.

Comment: The paper models analog in-memory computing for neural networks, which is relevant to model compression and efficiency.

Relevance: 8 Novelty: 7


44. Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods

ArXiv ID: 2506.01901

Authors: Yifan Hao, Xingyuan Pan, Hanning Zhang, Chenlu Ye, Rui Pan, Tong Zhang

Abstract: Supervised fine-tuning (SFT) on domain-specific data is the dominant approach for adapting foundation models to specialized tasks. However, it has been observed that SFT models tend to forget knowledge acquired during pretraining. In vision models, ensembling a pretrained model with its fine-tuned counterpart has been shown to mitigate this issue. In this work, we demonstrate that the same holds for language models, and, more strikingly, we observe an overadaptation phenomenon: the ensemble model not only retains general knowledge from the foundation model but also outperforms the fine-tuned model even on the fine-tuning domain itself. Despite the empirical success of ensembling, a theoretical understanding of its benefits remains underexplored. We develop a formal theoretical analysis of the overadaptation phenomenon. Ensembling mitigates this by balancing two primary sources of error: bias, caused by insufficient fine-tuning, and variance, introduced by overfitting to fine-tuning data. While regularization techniques aim to address this trade-off, we show that ensembling provides a more effective solution. We analyze this phenomenon in over-parameterized linear settings and demonstrate that interpolating between pretrained and fine-tuned weights significantly improves performance. These findings offer theoretical justification for the observed advantages of model ensembling, supported by empirical experiments consistent with our analysis.

Comment: The paper provides a theoretical analysis of overadaptation in supervised fine-tuning, which aligns with foundational research in representation learning.

Relevance: 8 Novelty: 7


45. Unlocking Personalized Knowledge in Federated Large Language Model: The Power of Mixture of Experts

ArXiv ID: 2506.00965

Authors: Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi

Abstract: The Mixture of Experts (MoE) architecture has emerged as a prominent strategy for scaling large language models (LLMs), effectively leveraging sparse activation and facilitating task-specific personalization. However, current federated learning (FL) approaches are primarily designed for dense models, making them unable to directly exploit the sparsity inherent in MoE architectures. Treating MoE models as dense networks in federated scenarios results in excessive communication overhead and computational costs, undermining the potential for personalized knowledge sharing. To address these challenges, we propose FLEx (Federated LLMs with Personalized Experts), a novel federated learning framework explicitly tailored for MoE-based LLMs. FLEx efficiently personalizes by pruning the global MoE model to keep only one expert per client, and employs an adaptive gating mechanism to reintegrate these personalized experts into the pre-trained MoE layers, ensuring the original backbone architecture remains unchanged. These personalized experts are trained with local data and stored locally on each client, while the shared modules are aggregated globally. Extensive evaluations on diverse instruction-based datasets under non-IID conditions consistently demonstrate that FLEx outperforms existing federated baselines. Our code is available at https://anonymous.4open.science/r/FLEx-8F12.

Comment: The paper discusses the use of Mixture of Experts in federated learning, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 7


46. Mamba Drafters for Speculative Decoding

ArXiv ID: 2506.01206

Authors: Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

Abstract: Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.

Comment: The paper introduces a novel drafter for speculative decoding in LLMs, focusing on efficiency improvements, aligning with model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 7


47. VUSA: Virtually Upscaled Systolic Array Architecture to Exploit Unstructured Sparsity in AI Acceleration

ArXiv ID: 2506.01166

Authors: Shereef Helal, Alberto Garcia-Ortiz, Lennart Bamberg

Abstract: Leveraging high degrees of unstructured sparsity is a promising approach to enhance the efficiency of deep neural network DNN accelerators - particularly important for emerging Edge-AI applications. We introduce VUSA, a systolic-array architecture that virtually grows based on the present sparsity to perform larger matrix multiplications with the same number of physical multiply-accumulate MAC units. The proposed architecture achieves saving by 37% and 68% in area and power efficiency, respectively, at the same peak-performance, compared to a baseline systolic array architecture in a commercial 16-nm technology. Still, the proposed architecture supports acceleration for any DNN with any sparsity - even no sparsity at all. Thus, the proposed architecture is application-independent, making it viable for general-purpose AI acceleration.

Comment: The paper introduces a new systolic-array architecture to exploit unstructured sparsity, aligning with model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 7


48. Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors

ArXiv ID: 2506.01247

Authors: Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

Abstract: Steering vision foundation models at inference time without retraining or access to large labeled datasets is a desirable yet challenging objective, particularly in dynamic or resource-constrained settings. In this paper, we introduce Visual Sparse Steering (VS2), a lightweight, test-time method that guides vision models using steering vectors derived from sparse features learned by top-$k$ Sparse Autoencoders without requiring contrastive data. Specifically, VS2 surpasses zero-shot CLIP by 4.12% on CIFAR-100, 1.08% on CUB-200, and 1.84% on Tiny-ImageNet. We further propose VS2++, a retrieval-augmented variant that selectively amplifies relevant sparse features using pseudo-labeled neighbors at inference time. With oracle positive/negative sets, VS2++ achieves absolute top-1 gains over CLIP zero-shot of up to 21.44% on CIFAR-100, 7.08% on CUB-200, and 20.47% on Tiny-ImageNet. Interestingly, VS2 and VS2++ raise per-class accuracy by up to 25% and 38%, respectively, showing that sparse steering benefits specific classes by disambiguating visually or taxonomically proximate categories rather than providing a uniform boost. Finally, to better align the sparse features learned through the SAE reconstruction task with those relevant for downstream performance, we propose Prototype-Aligned Sparse Steering (PASS). By incorporating a prototype-alignment loss during SAE training, using labels only during training while remaining fully test-time unsupervised, PASS consistently, though modestly, outperforms VS2, achieving a 6.12% gain over VS2 only on CIFAR-100 with ViT-B/32.

Comment: The paper introduces a method for zero-shot image classification using sparse features, aligning with representation learning through sparse methods.

Relevance: 8 Novelty: 7


49. Control-R: Towards controllable test-time scaling

ArXiv ID: 2506.00189

Authors: Di Zhang, Weida Wang, Junxian Li, Xunzhi Wang, Jiatong Li, Jianbo Wu, Jingdi Lei, Haonan He, Peng Ye, Shufei Zhang, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou

Abstract: This paper target in addressing the challenges of underthinking and overthinking in long chain-of-thought (CoT) reasoning for Large Reasoning Models (LRMs) by introducing Reasoning Control Fields (RCF)--a novel test-time approach that injects structured control signals to guide reasoning from a tree search perspective. RCF enables models to adjust reasoning effort according to given control conditions when solving complex tasks. Additionally, we present the Control-R-4K dataset, which consists of challenging problems annotated with detailed reasoning processes and corresponding control fields. To further enhance reasoning control, we propose a Conditional Distillation Finetuning (CDF) method, which trains model--particularly Control-R-32B--to effectively adjust reasoning effort during test time. Experimental results on benchmarks such as AIME2024 and MATH500 demonstrate that our approach achieves state-of-the-art performance at the 32B scale while enabling a controllable Long CoT reasoning process (L-CoT). Overall, this work introduces an effective paradigm for controllable test-time scaling reasoning.

Comment: The paper introduces a novel test-time approach for reasoning in LRMs, which aligns with model architecture innovations through conditional networks.

Relevance: 8 Novelty: 7


50. Data Pruning by Information Maximization

ArXiv ID: 2506.01701

Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

Abstract: In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.

Comment: The paper introduces a novel data pruning method, InfoMax, which is relevant to model compression through sparsification techniques.

Relevance: 8 Novelty: 7


51. Boosting Bot Detection via Heterophily-Aware Representation Learning and Prototype-Guided Cluster Discovery

ArXiv ID: 2506.00989

Authors: Buyun He, Xiaorui Jiang, Qi Wu, Hao Liu, Yingguang Yang, Yong Liao

Abstract: Detecting social media bots is essential for maintaining the security and trustworthiness of social networks. While contemporary graph-based detection methods demonstrate promising results, their practical application is limited by label reliance and poor generalization capability across diverse communities. Generative Graph Self-Supervised Learning (GSL) presents a promising paradigm to overcome these limitations, yet existing approaches predominantly follow the homophily assumption and fail to capture the global patterns in the graph, which potentially diminishes their effectiveness when facing the challenges of interaction camouflage and distributed deployment in bot detection scenarios. To this end, we propose BotHP, a generative GSL framework tailored to boost graph-based bot detectors through heterophily-aware representation learning and prototype-guided cluster discovery. Specifically, BotHP leverages a dual-encoder architecture, consisting of a graph-aware encoder to capture node commonality and a graph-agnostic encoder to preserve node uniqueness. This enables the simultaneous modeling of both homophily and heterophily, effectively countering the interaction camouflage issue. Additionally, BotHP incorporates a prototype-guided cluster discovery pretext task to model the latent global consistency of bot clusters and identify spatially dispersed yet semantically aligned bot collectives. Extensive experiments on two real-world bot detection benchmarks demonstrate that BotHP consistently boosts graph-based bot detectors, improving detection performance, alleviating label reliance, and enhancing generalization capability.

Comment: The paper introduces a heterophily-aware representation learning framework for bot detection, which is relevant to representation learning.

Relevance: 8 Novelty: 7


52. From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models

ArXiv ID: 2506.01133

Authors: As{\i}m Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani

Abstract: The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts--showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility we made scripts and other resources available to the community.

Comment: The paper analyzes concept formation in speech and text-based foundation models, which is relevant to foundational research in LLMs.

Relevance: 8 Novelty: 7


53. Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval

ArXiv ID: 2506.00041

Authors: Seongwan Park, Taeklim Kim, Youngjoong Ko

Abstract: Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously uninterpretable dense embeddings from DPR models into distinct, interpretable latent concepts. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We further introduce Concept-Level Sparse Retrieval (CL-SR), a retrieval framework that directly utilizes the extracted latent concepts as indexing units. CL-SR effectively combines the semantic expressiveness of dense embeddings with the transparency and efficiency of sparse representations. We show that CL-SR achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.

Comment: The paper proposes a novel interpretability framework using Sparse Autoencoders, which aligns with representation learning and model architecture analysis.

Relevance: 8 Novelty: 7


54. Concept-Centric Token Interpretation for Vector-Quantized Generative Models

ArXiv ID: 2506.00698

Authors: Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu

Abstract: Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGMs transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at https://github.com/YangTianze009/CORTEX.

Comment: The paper introduces a novel approach for interpreting VQGMs, which aligns with representation learning and model architecture analysis.

Relevance: 8 Novelty: 7


55. Weight-Space Linear Recurrent Neural Networks

ArXiv ID: 2506.01153

Authors: Roussel Desmond Nzoyem, Nawid Keshtmand, Idriss Tsayem, David A. W. Barton, Tom Deakin

Abstract: We introduce WARP (Weight-space Adaptive Recurrent Prediction), a simple yet powerful framework that unifies weight-space learning with linear recurrence to redefine sequence modeling. Unlike conventional recurrent neural networks (RNNs) which collapse temporal dynamics into fixed-dimensional hidden states, WARP explicitly parametrizes the hidden state as the weights of a distinct root neural network. This formulation promotes higher-resolution memory, gradient-free adaptation at test-time, and seamless integration of domain-specific physical priors. Empirical validation shows that WARP matches or surpasses state-of-the-art baselines on diverse classification tasks, spanning synthetic benchmarks to real-world datasets. Furthermore, extensive experiments across sequential image completion, dynamical system reconstruction, and multivariate time series forecasting demonstrate its expressiveness and generalization capabilities. Critically, WARP's weight trajectories offer valuable insights into the model's inner workings. Ablation studies confirm the architectural necessity of key components, solidifying weight-space linear RNNs as a transformative paradigm for adaptive machine intelligence.

Comment: The paper introduces a new framework for sequence modeling with weight-space learning, which is relevant to representation learning.

Relevance: 8 Novelty: 7


56. Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

ArXiv ID: 2506.01215

Authors: Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

Abstract: As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and utilizes early exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.

Comment: The paper presents REFORM, a novel framework for efficient long-context processing in transformers, which relates to model compression and efficiency.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.