Personalized Daily ArXiv Papers 2025-04-09
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 34160 | 4448 | 38608 |
| Cost | $0.09 | $0.04 | $0.13 |
Total arXiv papers: 475
Total scanned papers: 261
Total relevant papers: 24
Table of contents with paper titles:
-
Quantum Mechanics and Neural Networks Authors: Christian Ferko, James Halverson
-
Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression Authors: Ivan Ilin, Peter Richtarik
-
Architecture independent generalization bounds for overparametrized deep ReLU networks Authors: Thomas Chen, Chun-Kai Kevin Chien, Patricia Mu\~noz Ewald, Andrew G. Moore
-
Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations Authors: Ajay Jaiswal, Jianyu Wang, Yixiao Li, Pingzhi Li, Tianlong Chen, Zhangyang Wang, Chong Wang, Ruoming Pang, Xianzhi Du
-
Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification Authors: Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, He He
-
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization Authors: Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
-
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models Authors: Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro
-
Lattice: Learning to Efficiently Compress the Memory Authors: Mahdi Karami, Vahab Mirrokni
-
MASS: MoErging through Adaptive Subspace Selection Authors: Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, Emanuele Rodol`a
-
Find A Winning Sign: Sign Is All We Need to Win the Lottery Authors: Junghun Oh, Sungyong Baik, Kyoung Mu Lee
-
Achieving binary weight and activation for LLMs using Post-Training Quantization Authors: Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xuyao Zhang
-
The Work Capacity of Channels with Memory: Maximum Extractable Work in Percept-Action Loops Authors: Lukas J. Fiderer, Paul C. Barth, Isaac D. Smith, Hans J. Briegel
-
TAGC: Optimizing Gradient Communication in Distributed Transformer Training Authors: Igor Polyakov, Alexey Dukhanov, Egor Spirin
-
Fractal and Regular Geometry of Deep Neural Networks Authors: Simmaco Di Lillo, Domenico Marinucci, Michele Salvi, Stefano Vigogna
-
GOLLuM: Gaussian Process Optimized LLMs -- Reframing LLM Finetuning through Bayesian Optimization Authors: Bojana Rankovi\'c, Philippe Schwaller
-
Meta-Continual Learning of Neural Fields Authors: Seungyoon Woo, Junhyeog Yun, Gunhee Kim
-
Leanabell-Prover: Posttraining Scaling in Formal Reasoning Authors: Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai
-
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding Authors: Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram
-
Curved representational Bregman divergences and their applications Authors: Frank Nielsen
-
DDT: Decoupled Diffusion Transformer Authors: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang
-
Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation Authors: Biao Zhang, Fedor Moiseev, Joshua Ainslie, Paul Suganthan, Min Ma, Surya Bhupatiraju, Fede Lebron, Orhan Firat, Armand Joulin, Zhe Dong
-
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling Authors: Benjamin Lipkin, Benjamin LeBrun, Jacob Hoover Vigly, Jo\~ao Loula, David R. MacIver, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Timothy J. O'Donnell, Alexander K. Lew, Tim Vieira
-
Intermediate Layer Classifiers for OOD generalization Authors: Arnas Uselis, Seong Joon Oh
-
Measuring D\'ej`a vu Memorization Efficiently Authors: Narine Kokhlikyan, Bargav Jayaraman, Florian Bordes, Chuan Guo, Kamalika Chaudhuri
1. Quantum Mechanics and Neural Networks
ArXiv ID: 2504.05462
Authors: Christian Ferko, James Halverson
Abstract: We demonstrate that any Euclidean-time quantum mechanical theory may be represented as a neural network, ensured by the Kosambi-Karhunen-Lo`eve theorem, mean-square path continuity, and finite two-point functions. The additional constraint of reflection positivity, which is related to unitarity, may be achieved by a number of mechanisms, such as imposing neural network parameter space splitting or the Markov property. Non-differentiability of the networks is related to the appearance of non-trivial commutators. Neural networks acting on Markov processes are no longer Markov, but still reflection positive, which facilitates the definition of deep neural network quantum systems. We illustrate these principles in several examples using numerical implementations, recovering classic quantum mechanical results such as Heisenberg uncertainty, non-trivial commutators, and the spectrum.
Comment: The paper explores the representation of quantum mechanical theories as neural networks, introducing theoretical insights into the intersection of quantum mechanics and neural networks. This aligns with emerging trends and foundational research.
Relevance: 9 Novelty: 9
2. Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression
ArXiv ID: 2504.05346
Authors: Ivan Ilin, Peter Richtarik
Abstract: This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as $n:m$ sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
Comment: The paper introduces a novel block-wise pruning algorithm, Thanos, for efficient LLM compression, which aligns with the 'Model Compression' criterion, particularly in sparsity and pruning methods.
Relevance: 9 Novelty: 8
3. Architecture independent generalization bounds for overparametrized deep ReLU networks
ArXiv ID: 2504.05695
Authors: Thomas Chen, Chun-Kai Kevin Chien, Patricia Mu\~noz Ewald, Andrew G. Moore
Abstract: We prove that overparametrized neural networks are able to generalize with a test error that is independent of the level of overparametrization, and independent of the Vapnik-Chervonenkis (VC) dimension. We prove explicit bounds that only depend on the metric geometry of the test and training sets, on the regularity properties of the activation function, and on the operator norms of the weights and norms of biases. For overparametrized deep ReLU networks with a training sample size bounded by the input space dimension, we explicitly construct zero loss minimizers without use of gradient descent, and prove that the generalization error is independent of the network architecture.
Comment: This paper provides theoretical generalization bounds for overparametrized deep ReLU networks, which aligns with 'Representation Learning' by offering insights into training dynamics and generalization.
Relevance: 9 Novelty: 8
4. Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
ArXiv ID: 2504.05586
Authors: Ajay Jaiswal, Jianyu Wang, Yixiao Li, Pingzhi Li, Tianlong Chen, Zhangyang Wang, Chong Wang, Ruoming Pang, Xianzhi Du
Abstract: Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. However, vanilla SMoEs have issues such as expert redundancy and heavy memory requirements, making them inefficient and non-scalable, especially for resource-constrained scenarios. Expert-level sparsification of SMoEs involves pruning the least important experts to address these limitations. In this work, we aim to address three questions: (1) What is the best recipe to identify the least knowledgeable subset of experts that can be dropped with minimal impact on performance? (2) How should we perform expert dropping (one-shot or iterative), and what correction measures can we undertake to minimize its drastic impact on SMoE subnetwork capabilities? (3) What capabilities of full-SMoEs are severely impacted by the removal of the least dominant experts, and how can we recover them? Firstly, we propose MoE Experts Compression Suite (MC-Suite), which is a collection of some previously explored and multiple novel recipes to provide a comprehensive benchmark for estimating expert importance from diverse perspectives, as well as unveil numerous valuable insights for SMoE experts. Secondly, unlike prior works with a one-shot expert pruning approach, we explore the benefits of iterative pruning with the re-estimation of the MC-Suite criterion. Moreover, we introduce the benefits of task-agnostic fine-tuning as a correction mechanism during iterative expert dropping, which we term MoE Lottery Subnetworks. Lastly, we present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt, which can be restored to a robust level subject to external augmentation of instruction-following capabilities using k-shot examples and supervised fine-tuning.
Comment: The paper addresses sparsity and pruning in Mixture-of-Experts (MoE) models, aligning closely with the model compression and sparsity criteria. It introduces iterative pruning and correction mechanisms, which are novel contributions.
Relevance: 9 Novelty: 8
5. Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification
ArXiv ID: 2504.05419
Authors: Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, He He
Abstract: Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work, we study whether reasoning models encode information about answer correctness through probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores. Additionally, we find models' hidden states encode correctness of future answers, enabling early prediction of the correctness before the intermediate answer is fully formulated. We then use the probe as a verifier to decide whether to exit reasoning at intermediate answers during inference, reducing the number of inference tokens by 24\% without compromising performance. These findings confirm that reasoning models do encode a notion of correctness yet fail to exploit it, revealing substantial untapped potential to enhance their efficiency.
Comment: The paper probes hidden states of reasoning models to verify correctness, which provides insights into representation learning and training dynamics in neural networks.
Relevance: 9 Novelty: 8
6. Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization
ArXiv ID: 2504.05812
Authors: Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian
Abstract: While large language models (LLMs) have demonstrated exceptional capabilities in challenging tasks such as mathematical reasoning, existing methods to enhance reasoning ability predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data after pre-training. However, these approaches critically depend on external supervisions--such as human labelled reasoning traces, verified golden answers, or pre-trained reward models--which limits scalability and practical applicability. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. EMPO does not require any supervised information for incentivizing reasoning capabilities (i.e., neither verifiable reasoning traces, problems with golden answers, nor additional pre-trained reward models). By continuously minimizing the predictive entropy of LLMs on unlabeled user queries in a latent semantic space, EMPO enables purely self-supervised evolution of reasoning capabilities with strong flexibility and practicality. Our experiments demonstrate competitive performance of EMPO on both mathematical reasoning and free-form commonsense reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7\% to 48.1\% on mathematical benchmarks and improves truthfulness accuracy of Qwen2.5-7B Instruct from 87.16\% to 97.25\% on TruthfulQA.
Comment: The paper introduces an unsupervised incentivization method for reasoning in LLMs, which aligns with foundational research in large language models and their training dynamics.
Relevance: 9 Novelty: 8
7. From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
ArXiv ID: 2504.06214
Authors: Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro
Abstract: Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce a efficient training recipe for building ultra-long context LLMs from aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.
Comment: The paper presents efficient training strategies for ultra-long context LLMs, which aligns with foundational research in large language models and efficiency improvements.
Relevance: 9 Novelty: 8
8. Lattice: Learning to Efficiently Compress the Memory
ArXiv ID: 2504.05646
Authors: Mahdi Karami, Vahab Mirrokni
Abstract: Attention mechanisms have revolutionized sequence learning but suffer from quadratic computational complexity. This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that leverages the inherent low-rank structure of K-V matrices to efficiently compress the cache into a fixed number of memory slots, achieving sub-quadratic complexity. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The resulting recurrence features a state- and input-dependent gating mechanism, offering an interpretable memory update process. The core innovation is the orthogonal update: each memory slot is updated exclusively with information orthogonal to its current state hence incorporation of only novel, non-redundant data, which minimizes the interference with previously stored information. The experimental results show that Lattice achieves the best perplexity compared to all baselines across diverse context lengths, with performance improvement becoming more pronounced as the context length increases.
Comment: The paper introduces a novel RNN mechanism leveraging low-rank compression for memory efficiency, which is highly relevant to model compression and efficiency breakthroughs.
Relevance: 9 Novelty: 8
9. MASS: MoErging through Adaptive Subspace Selection
ArXiv ID: 2504.05342
Authors: Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, Emanuele Rodol`a
Abstract: Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2 storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
Comment: The paper introduces MASS, a novel approach leveraging low-rank decomposition and adaptive subspace selection for model merging. This aligns with model compression and efficiency criteria, particularly through its innovative use of low-rank methods.
Relevance: 9 Novelty: 8
10. Find A Winning Sign: Sign Is All We Need to Win the Lottery
ArXiv ID: 2504.05357
Authors: Junghun Oh, Sungyong Baik, Kyoung Mu Lee
Abstract: The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at https://github.com/JungHunOh/AWS_ICLR2025.git
Comment: The paper explores the role of parameter sign configuration in sparse networks, contributing to the Lottery Ticket Hypothesis and sparsity-related research. This aligns well with model compression and sparsity criteria.
Relevance: 9 Novelty: 8
11. Achieving binary weight and activation for LLMs using Post-Training Quantization
ArXiv ID: 2504.05352
Authors: Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xuyao Zhang
Abstract: Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grain grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we decompose INT4-quantized activations into a 4 * INT1 format equivalently and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models.
Comment: The paper proposes a novel post-training quantization framework for LLMs, which directly aligns with the model compression criterion. The approach to achieve binary weight and activation is innovative.
Relevance: 9 Novelty: 8
12. The Work Capacity of Channels with Memory: Maximum Extractable Work in Percept-Action Loops
ArXiv ID: 2504.06209
Authors: Lukas J. Fiderer, Paul C. Barth, Isaac D. Smith, Hans J. Briegel
Abstract: Predicting future observations plays a central role in machine learning, biology, economics, and many other fields. It lies at the heart of organizational principles such as the variational free energy principle and has even been shown -- based on the second law of thermodynamics -- to be necessary for reaching the fundamental energetic limits of sequential information processing. While the usefulness of the predictive paradigm is undisputed, complex adaptive systems that interact with their environment are more than just predictive machines: they have the power to act upon their environment and cause change. In this work, we develop a framework to analyze the thermodynamics of information processing in percept-action loops -- a model of agent-environment interaction -- allowing us to investigate the thermodynamic implications of actions and percepts on equal footing. To this end, we introduce the concept of work capacity -- the maximum rate at which an agent can expect to extract work from its environment. Our results reveal that neither of two previously established design principles for work-efficient agents -- maximizing predictive power and forgetting past actions -- remains optimal in environments where actions have observable consequences. Instead, a trade-off emerges: work-efficient agents must balance prediction and forgetting, as remembering past actions can reduce the available free energy. This highlights a fundamental departure from the thermodynamics of passive observation, suggesting that prediction and energy efficiency may be at odds in active learning systems.
Comment: The paper develops a thermodynamic framework for percept-action loops, which introduces a novel perspective on active learning systems and their energy efficiency, aligning with emerging trends in foundational research.
Relevance: 8 Novelty: 9
13. TAGC: Optimizing Gradient Communication in Distributed Transformer Training
ArXiv ID: 2504.05638
Authors: Igor Polyakov, Alexey Dukhanov, Egor Spirin
Abstract: The increasing complexity of large language models (LLMs) necessitates efficient training strategies to mitigate the high computational costs associated with distributed training. A significant bottleneck in this process is gradient synchronization across multiple GPUs, particularly in the zero-redundancy parallelism mode. In this paper, we introduce Transformer-Aware Gradient Compression (TAGC), an optimized gradient compression algorithm designed specifically for transformer-based models. TAGC extends the lossless homomorphic compression method by adapting it for sharded models and incorporating transformer-specific optimizations, such as layer-selective compression and dynamic sparsification. Our experimental results demonstrate that TAGC accelerates training by up to 15% compared to the standard Fully Sharded Data Parallel (FSDP) approach, with minimal impact on model quality. We integrate TAGC into the PyTorch FSDP framework, the implementation is publicly available at https://github.com/ipolyakov/TAGC.
Comment: The paper proposes a gradient compression method tailored for transformer-based models, which is relevant to model compression and efficiency improvements.
Relevance: 9 Novelty: 7
14. Fractal and Regular Geometry of Deep Neural Networks
ArXiv ID: 2504.06250
Authors: Simmaco Di Lillo, Domenico Marinucci, Michele Salvi, Stefano Vigogna
Abstract: We study the geometric properties of random neural networks by investigating the boundary volumes of their excursion sets for different activation functions, as the depth increases. More specifically, we show that, for activations which are not very regular (e.g., the Heaviside step function), the boundary volumes exhibit fractal behavior, with their Hausdorff dimension monotonically increasing with the depth. On the other hand, for activations which are more regular (e.g., ReLU, logistic and $\tanh$), as the depth increases, the expected boundary volumes can either converge to zero, remain constant or diverge exponentially, depending on a single spectral parameter which can be easily computed. Our theoretical results are confirmed in some numerical experiments based on Monte Carlo simulations.
Comment: The paper investigates the fractal and regular geometry of neural networks, which aligns with 'Representation Learning' by exploring the geometric properties of network activations and their implications.
Relevance: 8 Novelty: 8
15. GOLLuM: Gaussian Process Optimized LLMs -- Reframing LLM Finetuning through Bayesian Optimization
ArXiv ID: 2504.06265
Authors: Bojana Rankovi\'c, Philippe Schwaller
Abstract: Large Language Models (LLMs) can encode complex relationships in their latent spaces, yet harnessing them for optimization under uncertainty remains challenging. We address this gap with a novel architecture that reframes LLM finetuning as Gaussian process (GP) marginal likelihood optimization via deep kernel methods. We introduce LLM-based deep kernels, jointly optimized with GPs to preserve the benefits of both - LLMs to provide a rich and flexible input space for Bayesian optimization and - GPs to model this space with predictive uncertainty for more efficient sampling. Applied to Buchwald-Hartwig reaction optimization, our method nearly doubles the discovery rate of high-performing reactions compared to static LLM embeddings (from 24% to 43% coverage of the top 5% reactions in just 50 optimization iterations). We also observe a 14% improvement over domain-specific representations without requiring specialized features. Extensive empirical evaluation across 19 benchmarks - ranging from general chemistry to reaction and molecular property optimization - demonstrates our method's robustness, generality, and consistent improvements across: (1) tasks, (2) LLM architectures (encoder, decoder, encoder-decoder), (3) pretraining domains (chemistry-related or general-purpose) and (4) hyperparameter settings (tuned once on a single dataset). Finally, we explain these improvements: joint LLM-GP optimization through marginal likelihood implicitly performs contrastive learning, aligning representations to produce (1) better-structured embedding spaces, (2) improved uncertainty calibration, and (3) more efficient sampling - without requiring any external loss. This work provides both practical advances in sample-efficient optimization and insights into what makes effective Bayesian optimization.
Comment: The paper reframes LLM finetuning through Bayesian optimization, introducing a novel integration of Gaussian processes and LLMs, which aligns with foundational research in representation learning and optimization.
Relevance: 8 Novelty: 8
16. Meta-Continual Learning of Neural Fields
ArXiv ID: 2504.05806
Authors: Seungyoon Woo, Junhyeog Yun, Gunhee Kim
Abstract: Neural Fields (NF) have gained prominence as a versatile framework for complex data representation. This work unveils a new problem setting termed \emph{Meta-Continual Learning of Neural Fields} (MCL-NF) and introduces a novel strategy that employs a modular architecture combined with optimization-based meta-learning. Focused on overcoming the limitations of existing methods for continual learning of neural fields, such as catastrophic forgetting and slow convergence, our strategy achieves high-quality reconstruction with significantly improved learning speed. We further introduce Fisher Information Maximization loss for neural radiance fields (FIM-NeRF), which maximizes information gains at the sample level to enhance learning generalization, with proved convergence guarantee and generalization bound. We perform extensive evaluations across image, audio, video reconstruction, and view synthesis tasks on six diverse datasets, demonstrating our method's superiority in reconstruction quality and speed over existing MCL and CL-NF approaches. Notably, our approach attains rapid adaptation of neural fields for city-scale NeRF rendering with reduced parameter requirement.
Comment: The paper introduces a new problem setting and modular architecture for neural fields, which aligns with emerging trends and architectural innovations.
Relevance: 8 Novelty: 8
17. Leanabell-Prover: Posttraining Scaling in Formal Reasoning
ArXiv ID: 2504.06122
Authors: Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai
Abstract: Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages.To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.
Comment: The paper focuses on posttraining scaling for automated theorem proving, which aligns with foundational research in LLMs and explores reinforcement learning for reasoning tasks. The approach to improve formal provers is novel.
Relevance: 8 Novelty: 8
18. DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
ArXiv ID: 2504.05598
Authors: Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram
Abstract: Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate if the tokens are drafted at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times$$\sim$$2.50\times$ over vanilla auto-regressive decoding and improves upon the state-of-the-art SD methods by up to $0.27\times$.
Comment: The paper introduces DEL, a dynamic exit layer method for efficient speculative decoding in LLMs, which aligns with 'Model Compression' and 'Large Language Models' by addressing efficiency in decoding.
Relevance: 8 Novelty: 7
19. Curved representational Bregman divergences and their applications
ArXiv ID: 2504.05654
Authors: Frank Nielsen
Abstract: By analogy to curved exponential families, we define curved Bregman divergences as restrictions of Bregman divergences to sub-dimensional parameter subspaces, and prove that the barycenter of a finite weighted parameter set with respect to a curved Bregman divergence amounts to the Bregman projection onto the subspace induced by the constraint of the barycenter with respect to the unconstrained full Bregman divergence. We demonstrate the significance of curved Bregman divergences with two examples: (1) symmetrized Bregman divergences and (2) the Kullback-Leibler divergence between circular complex normal distributions. We then consider monotonic embeddings to define representational curved Bregman divergences and show that the $\alpha$-divergences are representational curved Bregman divergences with respect to $\alpha$-embeddings of the probability simplex into the positive measure cone. As an application, we report an efficient method to calculate the intersection of a finite set of $\alpha$-divergence spheres.
Comment: The paper introduces curved Bregman divergences and explores their theoretical properties, which align with foundational research in representation learning, particularly in understanding divergence measures and embeddings.
Relevance: 8 Novelty: 7
20. DDT: Decoupled Diffusion Transformer
ArXiv ID: 2504.05741
Authors: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang
Abstract: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new \textbf{\color{ddt}D}ecoupled \textbf{\color{ddt}D}iffusion \textbf{\color{ddt}T}ransformer~(\textbf{\color{ddt}DDT}), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet $256\times256$, Our DDT-XL/2 achieves a new state-of-the-art performance of {1.31 FID}~(nearly $4\times$ faster training convergence compared to previous diffusion transformers). For ImageNet $512\times512$, Our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
Comment: The paper proposes a decoupled diffusion transformer architecture, which is relevant to model architecture innovations, particularly in transformer-based generative models.
Relevance: 8 Novelty: 7
21. Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation
ArXiv ID: 2504.06225
Authors: Biao Zhang, Fedor Moiseev, Joshua Ainslie, Paul Suganthan, Min Ma, Surya Bhupatiraju, Fede Lebron, Orhan Firat, Armand Joulin, Zhe Dong
Abstract: While decoder-only large language models (LLMs) have shown impressive results, encoder-decoder models are still widely adopted in real-world applications for their inference efficiency and richer encoder representation. In this paper, we study a novel problem: adapting pretrained decoder-only LLMs to encoder-decoder, with the goal of leveraging the strengths of both approaches to achieve a more favorable quality-efficiency trade-off. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation compared to pretraining from scratch. We rigorously explore different pretraining objectives and parameter initialization/optimization techniques. Through extensive experiments based on Gemma 2 (2B and 9B) and a suite of newly pretrained mT5-sized models (up to 1.6B), we demonstrate the effectiveness of adaptation and the advantage of encoder-decoder LLMs. Under similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterpart. For example, Gemma 2B-2B outperforms Gemma 2B by $\sim$7\% after instruction tuning. Encoder-decoder adaptation also allows for flexible combination of different-sized models, where Gemma 9B-2B significantly surpasses Gemma 2B-2B by $>$3\%. The adapted encoder representation also yields better results on SuperGLUE. We will release our checkpoints to facilitate future research.
Comment: The paper explores adapting decoder-only LLMs to encoder-decoder models, which provides insights into model architecture and efficiency trade-offs, aligning with foundational research in model architecture.
Relevance: 8 Novelty: 7
22. Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
ArXiv ID: 2504.05410
Authors: Benjamin Lipkin, Benjamin LeBrun, Jacob Hoover Vigly, Jo\~ao Loula, David R. MacIver, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Timothy J. O'Donnell, Alexander K. Lew, Tim Vieira
Abstract: The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed $100,000$ tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
Comment: The paper introduces a novel algorithm for constrained generation in LMs, which aligns with foundational research in efficiency and algorithmic improvements for LLMs.
Relevance: 8 Novelty: 7
23. Intermediate Layer Classifiers for OOD generalization
ArXiv ID: 2504.05461
Authors: Arnas Uselis, Seong Joon Oh
Abstract: Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network's last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce \textit{Intermediate Layer Classifiers} (ILCs). We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shifts. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate layer representation utility. Code is available at https://github.com/oshapio/intermediate-layer-generalization
Comment: The paper introduces Intermediate Layer Classifiers (ILCs) and explores the utility of intermediate layers for OOD generalization. This aligns with representation learning, particularly in understanding how information is distributed across network layers.
Relevance: 8 Novelty: 7
24. Measuring D\'ej`a vu Memorization Efficiently
ArXiv ID: 2504.05651
Authors: Narine Kokhlikyan, Bargav Jayaraman, Florian Bordes, Chuan Guo, Kamalika Chaudhuri
Abstract: Recent research has shown that representation learning models may accidentally memorize their training data. For example, the d\'ej`a vu method shows that for certain representation learning models and training images, it is sometimes possible to correctly predict the foreground label given only the representation of the background - better than through dataset-level correlations. However, their measurement method requires training two models - one to estimate dataset-level correlations and the other to estimate memorization. This multiple model setup becomes infeasible for large open-source models. In this work, we propose alternative simple methods to estimate dataset-level correlations, and show that these can be used to approximate an off-the-shelf model's memorization ability without any retraining. This enables, for the first time, the measurement of memorization in pre-trained open-source image representation and vision-language representation models. Our results show that different ways of measuring memorization yield very similar aggregate results. We also find that open-source models typically have lower aggregate memorization than similar models trained on a subset of the data. The code is available both for vision and vision language models.
Comment: The paper explores memorization in representation learning models, which aligns with the topic of representation learning and provides insights into training dynamics. The method to measure memorization in pre-trained models is novel.
Relevance: 8 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.