Personalized Daily Arxiv Papers 3/31/2025
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 25192 | 3444 | 28636 |
| Cost | $0.06 | $0.03 | $0.1 |
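The cost row can be sanity-checked from the token counts. The sketch below assumes per-million-token rates of $2.50 for prompt tokens and $10.00 for completion tokens; these rates are an assumption made for illustration and are not stated anywhere in this report.

```python
# Assumed rates in USD per 1M tokens (NOT stated in the report).
PROMPT_RATE_PER_M = 2.50
COMPLETION_RATE_PER_M = 10.00


def token_cost(tokens: int, rate_per_million: float) -> float:
    """Return the cost in USD for `tokens` at the given per-million rate."""
    return tokens * rate_per_million / 1_000_000


# Token counts taken from the table above.
prompt_tokens = 25192
completion_tokens = 3444

prompt_cost = token_cost(prompt_tokens, PROMPT_RATE_PER_M)
completion_cost = token_cost(completion_tokens, COMPLETION_RATE_PER_M)
total_cost = prompt_cost + completion_cost

print(f"prompt=${prompt_cost:.2f} completion=${completion_cost:.2f} total=${total_cost:.2f}")
```

Under these assumed rates the rounded figures agree with the table, which suggests the Total cost of $0.1 is a rounded sum rather than an exact one.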
Total arXiv papers: 353
Total scanned papers: 223
Total relevant papers: 15
Table of contents with paper titles:
- Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning
  Authors: Alexander Ororbia, Karl Friston, Rajesh P. N. Rao
- STADE: Standard Deviation as a Pruning Metric
  Authors: Diego Coello de Portugal Mecke, Haya Alyoussef, Ilia Koloiarov, Maximilian Stubbemann, Lars Schmidt-Thieme
- Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
  Authors: Raman Dutt, Harleen Hanspal, Guoxuan Xia, Petru-Daniel Tudosiu, Alexander Black, Yongxin Yang, Steven McDonagh, Sarah Parisot
- An Efficient Training Algorithm for Models with Block-wise Sparsity
  Authors: Ding Zhu, Zhiqun Zuo, Mohammad Mahdi Khalili
- Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation
  Authors: Zhuo-Yang Song, Zeyu Li, Qing-Hong Cao, Ming-xing Luo, Hua Xing Zhu
- Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
  Authors: Tom Kempton, Stuart Burrell
- Concise One-Layer Transformers Can Do Function Evaluation (Sometimes)
  Authors: Lena Strobl, Dana Angluin, Robert Frank
- AdaRank: Adaptive Rank Pruning for Enhanced Model Merging
  Authors: Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong
- Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment
  Authors: Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Dylan J. Foster, Akshay Krishnamurthy
- Arch-LLM: Taming LLMs for Neural Architecture Generation via Unsupervised Discrete Representation Learning
  Authors: Deshani Geethika Poddenige, Sachith Seneviratne, Damith Senanayake, Mahesan Niranjan, PN Suganthan, Saman Halgamuge
- MixFunn: A Neural Network for Differential Equations with Improved Generalization and Interpretability
  Authors: Tiago de Souza Farias, Gubio Gomes de Lima, Jonas Maziero, Celso Jorge Villas-Boas
- ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
  Authors: Chung-En Sun, Ge Yan, Tsui-Wei Weng
- MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning
  Authors: Jiancheng Zhao, Xingda Yu, Zhen Yang
- A Proposal for Networks Capable of Continual Learning
  Authors: Zeki Doruk Erden, Boi Faltings
- Efficient Joint Prediction of Multiple Future Tokens
  Authors: Kwangjun Ahn, Alex Lamb, John Langford
1. Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning
ArXiv ID: 2503.21796
Authors: Alexander Ororbia, Karl Friston, Rajesh P. N. Rao
Abstract: Self-supervised learning has become an increasingly important paradigm in the domain of machine intelligence. Furthermore, evidence for self-supervised adaptation, such as contrastive formulations, has emerged in recent computational neuroscience and brain-inspired research. Nevertheless, current work on self-supervised learning relies on biologically implausible credit assignment -- in the form of backpropagation of errors -- and feedforward inference, typically a forward-locked pass. Predictive coding, in its mechanistic form, offers a biologically plausible means to sidestep these backprop-specific limitations. However, unsupervised predictive coding rests on learning a generative model of raw pixel input (akin to "generative AI" approaches), which entails predicting a potentially high dimensional input; on the other hand, supervised predictive coding, which learns a mapping from inputs to target labels, requires human annotation, and thus incurs the drawbacks of supervised learning. In this work, we present a scheme for self-supervised learning within a neurobiologically plausible framework that appeals to the free energy principle, constructing a new form of predictive coding that we call meta-representational predictive coding (MPC). MPC sidesteps the need for learning a generative model of sensory input (e.g., pixel-level features) by learning to predict representations of sensory input across parallel streams, resulting in an encoder-only learning and inference scheme. This formulation rests on active inference (in the form of sensory glimpsing) to drive the learning of representations, i.e., the representational dynamics are driven by sequences of decisions made by the model to sample informative portions of its sensorium.
Comment: The paper introduces a novel self-supervised learning framework, Meta-Representational Predictive Coding (MPC), which aligns with representation learning by focusing on biologically plausible mechanisms and encoder-only learning. It provides theoretical insights into predictive coding and active inference.
Relevance: 9 Novelty: 8
2. STADE: Standard Deviation as a Pruning Metric
ArXiv ID: 2503.22451
Authors: Diego Coello de Portugal Mecke, Haya Alyoussef, Ilia Koloiarov, Maximilian Stubbemann, Lars Schmidt-Thieme
Abstract: Recently, Large Language Models (LLMs) have become very widespread and are used to solve a wide variety of tasks. To successfully handle these tasks, LLMs require longer training times and larger model sizes. This makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Previous methods require a retraining phase after pruning to maintain the original model's performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building upon Wanda's work, this study provides a theoretical explanation of why the method is effective and leverages these insights to enhance the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method. Furthermore, this analysis is extended to cases where Wanda is no longer optimal, leading to the development of a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that depending on the training conditions, Wanda's optimal performance varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: https://github.com/Coello-dev/STADE/
Comment: The paper proposes STADE, a pruning method for LLMs, and provides theoretical insights into pruning strategies, which aligns with the model compression criterion. It extends the understanding of pruning beyond existing methods like Wanda.
Relevance: 9 Novelty: 8
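For readers unfamiliar with the baseline the abstract builds on, the following is a minimal pure-Python sketch of Wanda-style scoring, where each weight is scored by its magnitude times the L2 norm of the corresponding input feature and the lowest-scoring weights are removed within each output row. The matrices `W` and `X` are made-up toy values, and the STADE metric itself is not reproduced here because the abstract does not give its formula.

```python
import math


def wanda_scores(weights, activations):
    """Wanda-style importance: |W[i][j]| times the L2 norm of input feature j.

    weights:     list of rows, shape (out_features, in_features)
    activations: list of calibration samples, shape (n_samples, in_features)
    """
    n_in = len(weights[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]


def prune_rows(weights, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for row, score_row in zip(weights, scores):
        n_prune = int(len(row) * sparsity)
        order = sorted(range(len(row)), key=lambda j: score_row[j])
        drop = set(order[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy values, chosen only for illustration.
W = [[0.5, -2.0, 0.1, 2.0],
     [1.5, 0.2, -0.3, 0.4]]
X = [[1.0, 0.1, 4.0, 0.5],
     [-1.0, 0.2, 6.0, 0.5],
     [1.0, 0.0, 5.0, 0.5]]

scores = wanda_scores(W, X)
W_pruned = prune_rows(W, scores, sparsity=0.5)
print(W_pruned)
```

In this toy case the large weight `-2.0` is pruned because its input feature is almost always near zero, while the small weight `0.1` survives because its input feature is large. That interaction between weight magnitude and input statistics is what activation-aware pruning metrics are designed to capture.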
3. Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
ArXiv ID: 2503.22517
Authors: Raman Dutt, Harleen Hanspal, Guoxuan Xia, Petru-Daniel Tudosiu, Alexander Black, Yongxin Yang, Steven McDonagh, Sarah Parisot
Abstract: In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: C1 preserving the original language generative capabilities with negligible performance degradation, and C2 adhering to a small parameter budget to learn the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts that can efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
Comment: The paper focuses on leveraging Mixture-of-Experts (MoE) redundancy for multi-modal generative capabilities, which aligns with the 'Model Architecture' and 'Representation Learning' criteria. The use of low-rank adaptation and insights into modality-specific pathways adds theoretical depth.
Relevance: 9 Novelty: 8
4. An Efficient Training Algorithm for Models with Block-wise Sparsity
ArXiv ID: 2503.21928
Authors: Ding Zhu, Zhiqun Zuo, Mohammad Mahdi Khalili
Abstract: Large-scale machine learning (ML) models are increasingly being used in critical domains such as education, lending, recruitment, healthcare, and criminal justice. However, the training, deployment, and utilization of these models demand substantial computational resources. To decrease computation and memory costs, machine learning models with sparse weight matrices are widely used in the literature. Among sparse models, those with special sparse structures (e.g., models with block-wise sparse weight matrices) fit better with hardware accelerators and can decrease the memory and computation costs during inference. Unfortunately, while there are several efficient training methods, none of them are designed to train a block-wise sparse model efficiently. As a result, the current methods for training block-wise sparse models start with full and dense models, leading to inefficient training. In this work, we focus on training models with block-wise sparse matrices and propose an efficient training algorithm to decrease both computation and memory costs during training and inference. In addition, we will show that our proposed method enables us to efficiently find the right block size for the sparsity pattern during the training process. Our extensive empirical and theoretical analyses show that our algorithms can decrease the computation and memory costs significantly without a performance drop compared to baselines.
Comment: The paper introduces an efficient training algorithm for block-wise sparse models, which aligns with the 'Model Compression' criterion. The focus on block-wise sparsity and efficient training adds theoretical and practical value.
Relevance: 9 Novelty: 8
5. Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation
ArXiv ID: 2503.22547
Authors: Zhuo-Yang Song, Zeyu Li, Qing-Hong Cao, Ming-xing Luo, Hua Xing Zhu
Abstract: The geometric evolution of token representations in large language models (LLMs) presents a fundamental paradox: while human language inherently organizes semantic information in low-dimensional spaces ($\sim 10^1$ dimensions), modern LLMs employ high-dimensional embeddings ($\sim 10^3$ dimensions) processed through Transformer architectures. To resolve this paradox, this work develops a geometric framework that tracks token dynamics across Transformer layers. Through layer-wise analysis of intrinsic dimensions across multiple architectures, we reveal an expansion-contraction pattern where tokens diffuse to a "working space" and then progressively project onto lower-dimensional submanifolds. Our finding implies a negative correlation between the working space dimension and parameter-sensitive performance of the LLMs, and indicates that effective models tend to compress tokens into approximately 10-dimensional submanifolds, closely resembling human semantic spaces. This work not only advances LLM interpretability by reframing Transformer layers as projectors that mediate between high-dimensional computation and low-dimensional semantics, but also provides practical tools for model diagnostics that do not rely on task-specific evaluations.
Comment: This paper provides a geometric framework for understanding token dynamics in Transformers, aligning with foundational research in representation learning and model architecture. The insights into dimensional reduction and token behavior are highly relevant.
Relevance: 9 Novelty: 8
6. Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
ArXiv ID: 2503.21929
Authors: Tom Kempton, Stuart Burrell
Abstract: Advances in hardware and language model architecture have spurred a revolution in natural language generation. However, autoregressive models compute probability distributions over next-token choices, and sampling from these distributions, known as decoding, has received significantly less attention than other design choices. Existing decoding strategies are largely based on heuristics, resulting in methods that are hard to apply or improve in a principled manner. We develop the theory of decoding strategies for language models by expressing popular decoding algorithms as equilibrium states in the language of ergodic theory and stating the functions they optimize. Using this, we analyze the effect of the local normalization step of top-k, nucleus, and temperature sampling, used to make probabilities sum to one. We argue that local normalization distortion is a fundamental defect of decoding strategies and quantify the size of this distortion and its effect on mathematical proxies for the quality and diversity of generated text. Contrary to the prevailing explanation, we argue that the major cause of the under-performance of top-k sampling relative to nucleus sampling is local normalization distortion. This yields conclusions for the future design of decoding algorithms and the detection of machine-generated text.
Comment: This paper develops a theoretical framework for decoding strategies in LLMs, analyzing local normalization distortion and its effects. It provides foundational insights into LLM behavior and decoding, aligning well with the criteria for LLM theoretical research.
Relevance: 9 Novelty: 8
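The local normalization step mentioned in the abstract can be made concrete with a small sketch. The code below is a generic pure-Python illustration of temperature scaling, top-k truncation, and nucleus (top-p) truncation, each followed by renormalizing the kept probabilities so they sum to one. It does not implement the paper's ergodic-theory analysis, and the example distribution is made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Local normalization: rescale the kept tokens so they sum to one."""
    kept = set(keep)
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def top_k(probs, k):
    """Keep the k most probable tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def nucleus(probs, p):
    """Keep the smallest top-ranked set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalize(probs, keep)


# Made-up next-token distribution over four tokens.
probs = [0.5, 0.3, 0.15, 0.05]
topk_probs = top_k(probs, k=2)
nucleus_probs = nucleus(probs, p=0.9)
print(topk_probs)
print(nucleus_probs)
```

The divisor in `renormalize` depends on how much probability mass was truncated at that particular step, so it changes from context to context. That context-dependent rescaling is the quantity the abstract refers to as local normalization distortion.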
7. Concise One-Layer Transformers Can Do Function Evaluation (Sometimes)
ArXiv ID: 2503.22076
Authors: Lena Strobl, Dana Angluin, Robert Frank
Abstract: While transformers have proven enormously successful in a range of tasks, their fundamental properties as models of computation are not well understood. This paper contributes to the study of the expressive capacity of transformers, focusing on their ability to perform the fundamental computational task of evaluating an arbitrary function from $[n]$ to $[n]$ at a given argument. We prove that concise 1-layer transformers (i.e., with a polylog bound on the product of the number of heads, the embedding dimension, and precision) are capable of doing this task under some representations of the input, but not when the function's inputs and values are only encoded in different input positions. Concise 2-layer transformers can perform the task even with the more difficult input representation. Experimentally, we find a rough alignment between what we have proven can be computed by concise transformers and what can be practically learned.
Comment: This paper provides theoretical insights into the computational capabilities of concise one-layer transformers, directly contributing to understanding transformer architecture. It aligns well with the 'Model Architecture' criterion.
Relevance: 9 Novelty: 8
8. AdaRank: Adaptive Rank Pruning for Enhanced Model Merging
ArXiv ID: 2503.22178
Authors: Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong
Abstract: Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks during test-time via entropy minimization. Our analysis demonstrates that such a method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and numbers of tasks, reducing the performance gap between fine-tuned models to nearly 1%.
Comment: The paper proposes AdaRank, a framework for adaptive rank pruning in model merging, which aligns with the model compression criterion by addressing low-rank approaches and pruning strategies. It provides insights into mitigating task interference during model merging.
Relevance: 8 Novelty: 8
9. Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment
ArXiv ID: 2503.21878
Authors: Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Dylan J. Foster, Akshay Krishnamurthy
Abstract: Inference-time computation provides an important axis for scaling language model performance, but naively scaling compute through techniques like Best-of-$N$ sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we focus on inference-time alignment which we formalize as the problem of improving a pre-trained policy's responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling: 1. We show that Best-of-$N$ alignment with an ideal choice for $N$ can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when $N$ is large, and fails to achieve tight guarantees under more realistic coverage conditions. 2. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with $N$, meaning it is scaling-monotonic. We complement our theoretical results with an experimental evaluation that demonstrates the benefits of $\texttt{InferenceTimePessimism}$ across a variety of tasks and models.
Comment: The paper analyzes inference-time alignment and introduces a new algorithm, InferenceTimePessimism, which aligns with 'Large Language Models' and 'Emerging Trends'. The theoretical insights into reward hacking and scaling are significant.
Relevance: 8 Novelty: 8
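As background for the abstract above, Best-of-N sampling can be sketched in a few lines: draw N candidate responses from the policy and return the one the reward model scores highest. The toy policy, `proxy_reward`, and `true_quality` below are hypothetical stand-ins invented for illustration; the sketch does not implement the paper's InferenceTimePessimism algorithm.

```python
import random


def best_of_n(prompt, n, sample_response, reward_model, rng):
    """Draw n candidates from the policy and return the highest-reward one."""
    candidates = [sample_response(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))


# Hypothetical toy setup: a "response" is an integer from 0 to 9.
def sample_response(prompt, rng):
    """Toy policy: uniform over the ten possible responses."""
    return rng.randrange(10)


def true_quality(prompt, response):
    """Ground truth, unknown to the algorithm: 5 is the best response."""
    return -abs(response - 5)


def proxy_reward(prompt, response):
    """Imperfect reward model: tracks true quality up to 5, then keeps rising."""
    return response


rng = random.Random(0)
for n in (1, 4, 64):
    choice = best_of_n("toy prompt", n, sample_response, proxy_reward, rng)
    print(n, choice, true_quality("toy prompt", choice))
```

In this toy setup, larger values of N make it increasingly likely that the selected response is 9, which maximizes the proxy reward but has lower true quality than 5. This is a qualitative picture of the reward hacking the abstract describes.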
10. Arch-LLM: Taming LLMs for Neural Architecture Generation via Unsupervised Discrete Representation Learning
ArXiv ID: 2503.22063
Authors: Deshani Geethika Poddenige, Sachith Seneviratne, Damith Senanayake, Mahesan Niranjan, PN Suganthan, Saman Halgamuge
Abstract: Unsupervised representation learning has been widely explored across various modalities, including neural architectures, where it plays a key role in downstream applications like Neural Architecture Search (NAS). These methods typically learn an unsupervised representation space before generating/sampling architectures for the downstream search. A common approach involves the use of Variational Autoencoders (VAEs) to map discrete architectures onto a continuous representation space; however, sampling from these spaces often leads to a high percentage of invalid or duplicate neural architectures. This could be due to the unnatural mapping of inherently discrete architectural space onto a continuous space, which emphasizes the need for a robust discrete representation of these architectures. To address this, we introduce a Vector Quantized Variational Autoencoder (VQ-VAE) to learn a discrete latent space more naturally aligned with the discrete neural architectures. In contrast to VAEs, VQ-VAEs (i) map each architecture into a discrete code sequence and (ii) allow the prior to be learned by any generative model rather than assuming a normal distribution. We then represent these architecture latent codes as numerical sequences and train a text-to-text model leveraging a Large Language Model to learn and generate sequences representing architectures. We evaluate our method on Inception/ResNet-like cell-based search spaces, namely NAS-Bench-101 and NAS-Bench-201. Compared to VAE-based methods, our approach improves the generation of valid and unique architectures by over 80% on NAS-Bench-101 and over 8% on NAS-Bench-201. Finally, we demonstrate the applicability of our method in NAS employing a sequence-modeling-based NAS algorithm.
Comment: The paper introduces a discrete representation learning approach for neural architecture generation, which aligns with 'Model Architecture' and 'Representation Learning'. The use of VQ-VAE and LLMs for architecture generation is novel.
Relevance: 8 Novelty: 8
11. MixFunn: A Neural Network for Differential Equations with Improved Generalization and Interpretability
ArXiv ID: 2503.22528
Authors: Tiago de Souza Farias, Gubio Gomes de Lima, Jonas Maziero, Celso Jorge Villas-Boas
Abstract: We introduce MixFunn, a novel neural network architecture designed to solve differential equations with enhanced precision, interpretability, and generalization capability. The architecture comprises two key components: the mixed-function neuron, which integrates multiple parameterized nonlinear functions to improve representational flexibility, and the second-order neuron, which combines a linear transformation of its inputs with a quadratic term to capture cross-combinations of input variables. These features significantly enhance the expressive power of the network, enabling it to achieve comparable or superior results with drastically fewer parameters and a reduction of up to four orders of magnitude compared to conventional approaches. We applied MixFunn in a physics-informed setting to solve differential equations in classical mechanics, quantum mechanics, and fluid dynamics, demonstrating its effectiveness in achieving higher accuracy and improved generalization to regions outside the training domain relative to standard machine learning models. Furthermore, the architecture facilitates the extraction of interpretable analytical expressions, offering valuable insights into the underlying solutions.
Comment: The paper introduces MixFunn, a novel neural network architecture for solving differential equations with enhanced generalization and interpretability. It provides architectural innovations and insights into representation learning, making it relevant to foundational research.
Relevance: 8 Novelty: 8
12. ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
ArXiv ID: 2503.22048
Authors: Chung-En Sun, Ge Yan, Tsui-Wei Weng
Abstract: Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 2%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to suppress the short reasoning direction. With changes to only 0.1% of the model's parameters, ThinkEdit effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+5.44%), along with an overall improvement across multiple math benchmarks (+2.43%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at https://github.com/Trustworthy-ML-Lab/ThinkEdit
Comment: The paper provides insights into reasoning length control in LLMs and introduces a weight-editing approach, which aligns with 'Representation Learning' and 'Large Language Models' criteria. The mechanistic understanding of reasoning length is a novel contribution.
Relevance: 8 Novelty: 7
13. MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning
ArXiv ID: 2503.21838
Authors: Jiancheng Zhao, Xingda Yu, Zhen Yang
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become an essential approach for adapting large-scale pre-trained models while reducing computational costs. Among PEFT methods, LoRA significantly reduces trainable parameters by decomposing weight updates into low-rank matrices. However, traditional LoRA applies a fixed rank across all layers, failing to account for the varying complexity of hierarchical information, which leads to inefficient adaptation and redundancy. To address this, we propose MSPLoRA (Multi-Scale Pyramid LoRA), which introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information, respectively. This hierarchical structure reduces inter-layer redundancy while maintaining strong adaptation capability. Experiments on various NLP tasks demonstrate that MSPLoRA achieves more efficient adaptation and better performance while significantly reducing the number of trainable parameters. Furthermore, additional analyses based on Singular Value Decomposition validate its information decoupling ability, highlighting MSPLoRA as a scalable and effective optimization strategy for parameter-efficient fine-tuning in large language models. Our code is available at https://github.com/Oblivioniss/MSPLoRA.
Comment: The paper proposes a hierarchical low-rank adaptation method (MSPLoRA) for efficient fine-tuning, which aligns with the 'Model Compression' criterion. The multi-scale approach and validation through SVD analysis add novelty.
Relevance: 8 Novelty: 7
14. A Proposal for Networks Capable of Continual Learning
ArXiv ID: 2503.22068
Authors: Zeki Doruk Erden, Boi Faltings
Abstract: We analyze the ability of computational units to retain past responses after parameter updates, a key property for system-wide continual learning. Neural networks trained with gradient descent lack this capability, prompting us to propose Modelleyen, an alternative approach with inherent response preservation. We demonstrate through experiments on modeling the dynamics of a simple environment and on MNIST that, despite increased computational complexity and some representational limitations at its current stage, Modelleyen achieves continual learning without relying on sample replay or predefined task boundaries.
Comment: The paper proposes a novel architecture for continual learning, which aligns with the model architecture criterion. The approach introduces a new paradigm for response preservation, making it relevant and moderately novel.
Relevance: 8 Novelty: 7
15. Efficient Joint Prediction of Multiple Future Tokens
ArXiv ID: 2503.21801
Authors: Kwangjun Ahn, Alex Lamb, John Langford
Abstract: In this short report, we introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction designed to enrich hidden state representations by jointly predicting multiple future tokens. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens through a carefully designed representation bottleneck, allowing the model to encode rich predictive information with minimal computational overhead during training. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so. We demonstrate the effectiveness of our method on the synthetic star graph navigation task from Bachmann and Nagarajan [2024], highlighting a significant performance improvement over existing methods. This manuscript presents promising preliminary results intended to stimulate further research.
Comment: The paper proposes joint multi-token prediction (JTP) to enrich hidden state representations in language models. It aligns with representation learning and introduces a lightweight architectural modification, making it relevant to foundational research.
Relevance: 8 Novelty: 7
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criterion that matches the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
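The response format described in the list above can be illustrated with a short parsing sketch. The two example lines reuse ArXiv IDs and scores that appear earlier in this report, but the COMMENT strings are shortened paraphrases, and `parse_selection` is a hypothetical helper written for illustration rather than part of the actual pipeline.

```python
import json

REQUIRED_KEYS = {"ARXIVID", "COMMENT", "RELEVANCE", "NOVELTY"}


def parse_selection(jsonl_text):
    """Parse a JSONL response and validate the four expected fields per line."""
    records = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        for key in ("RELEVANCE", "NOVELTY"):
            score = record[key]
            if not isinstance(score, int) or not 1 <= score <= 10:
                raise ValueError(f"line {line_number}: {key} must be an integer from 1 to 10")
        records.append(record)
    return records


# Illustrative response; the comments are paraphrased, not model output.
example = "\n".join([
    '{"ARXIVID": "2503.22451", "COMMENT": "Matches the model compression criterion (pruning).", "RELEVANCE": 9, "NOVELTY": 8}',
    '{"ARXIVID": "2503.21801", "COMMENT": "Matches the representation learning criterion.", "RELEVANCE": 8, "NOVELTY": 7}',
])

records = parse_selection(example)
relevant = [r["ARXIVID"] for r in records if r["RELEVANCE"] >= 9]
print(relevant)
```

Validating the score range at parse time is one possible design choice; it surfaces malformed model output early instead of letting an out-of-range score reach the ranking step.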
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHOGONAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
  - Focus: Fully aligned with core topics with no deviation; score the highest if the paper contains relevant keywords.
  - Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
- Relevance 7-8 (Relevant)
  - Focus: Retains a solid link to the main research area, though it may touch on peripheral elements.
  - Examples: Papers that study a fundamental part of MoE through a less critical aspect, like its behavior in GNNs.
- Relevance 5-6 (Borderline)
  - Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  - Examples: Work referencing MoE centered on reinforcement learning.
- Relevance 3-4 (Irrelevant)
  - Focus: Largely outside our interests with no association to our topics.
  - Examples: Application-focused papers like using MoE to solve a problem in the real world.
- Relevance 1-2 (Ignore)
  - Focus: Purely unrelated to our topics. A completely different domain.
  - Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9-10 despite initial appearances. (Usually a very rare concept that belongs to fundamental research.)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
  - Definition: Groundbreaking methods or theory that introduce new directions or solve major challenges.
  - Examples: An entirely new paradigm for foundational models; a novel theory transforming representation learning.
- Novelty 7-8 (Improvements)
  - Definition: Substantial insights or enhancements, though not a full paradigm shift.
  - Examples: Modifications of existing methods that yield significantly better results.
- Novelty 5-6 (Borderline)
  - Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  - Examples: A moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
- Novelty 3-4 (Tangential)
  - Definition: Minor or domain-specific improvements with limited broader impact.
  - Examples: Slight modifications to known methods with questionable motivation; purely engineering work such as a new benchmark or dataset.
- Novelty 1-2 (Low)
  - Definition: Minimal originality, applying standard approaches without real innovation.
  - Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies such as finetuning a pretrained model with existing methods.
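The two rubrics above map each 1-10 score to a named band. A minimal sketch of that lookup follows, assuming integer scores; the band ranges and labels are taken from the rubrics, while the names `RELEVANCE_BANDS`, `NOVELTY_BANDS`, and `band_label` are hypothetical.

```python
# (low, high) score ranges and labels as listed in the scoring criteria.
RELEVANCE_BANDS = {
    (9, 10): "Completely Relevant",
    (7, 8): "Relevant",
    (5, 6): "Borderline",
    (3, 4): "Irrelevant",
    (1, 2): "Ignore",
}
NOVELTY_BANDS = {
    (9, 10): "Breakthrough",
    (7, 8): "Improvements",
    (5, 6): "Borderline",
    (3, 4): "Tangential",
    (1, 2): "Low",
}


def band_label(score, bands):
    """Return the band label for an integer score (hypothetical helper)."""
    for (low, high), label in bands.items():
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside the 1-10 range")


# The two axes are looked up independently of each other.
relevance_label = band_label(8, RELEVANCE_BANDS)
novelty_label = band_label(3, NOVELTY_BANDS)
```

Keeping the two tables separate mirrors the requirement that relevance and novelty be scored as independent axes.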
Papers
[PAPER LIST HERE]
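The `[PAPER LIST HERE]` placeholder is presumably replaced with the day's titles and abstracts before the prompt is sent. One way this could be done is sketched below; the per-paper layout, the dictionary keys, and the helper names `render_papers` and `fill_prompt` are assumptions, and the ID and abstract in the sample entry are placeholders rather than real data.

```python
PLACEHOLDER = "[PAPER LIST HERE]"


def render_papers(papers):
    """Format each paper as a short text block (the layout is an assumption)."""
    blocks = []
    for paper in papers:
        blocks.append(
            f"ArXiv ID: {paper['id']}\n"
            f"Title: {paper['title']}\n"
            f"Abstract: {paper['abstract']}"
        )
    return "\n\n".join(blocks)


def fill_prompt(template, papers):
    """Substitute the rendered paper list for the placeholder (hypothetical helper)."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain the placeholder exactly once")
    return template.replace(PLACEHOLDER, render_papers(papers))


# Shortened stand-in for the full prompt text.
template = "Papers\n[PAPER LIST HERE]\nRelevant Topics"
papers = [
    {
        "id": "0000.00000",  # placeholder ID
        "title": "STADE: Standard Deviation as a Pruning Metric",
        "abstract": "(abstract text)",  # placeholder abstract
    }
]
prompt = fill_prompt(template, papers)
```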
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
- Representation Learning
  - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks.
  - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
- Model Architecture
  - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis of existing architectures (like encoder-decoder), or other architectural innovations.
  - Irrelevant: Merely using existing architectures for a certain task without insights into the structure itself.
- Model Compression
  - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs.
  - Irrelevant: Straightforward applications of existing compression methods to new tasks.
- Large Language Models (LLMs)
  - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability.
  - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thought, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g., hallucination, reasoning, safety).
- AI for Science
  - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations.
  - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
- Emerging Trends
  - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms.
  - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.