Personalized Daily ArXiv Papers 2025-06-13

[gpt-4o]	Prompt	Completion	Total
Token	39617	5178	44795
Cost	$0.1	$0.05	$0.15

Total arXiv papers: 596

Total scanned papers: 345

Total relevant papers: 32

Table of contents with paper titles:

A Conjecture on a Fundamental Trade-Off between Certainty and Scope in Symbolic and Generative AI Authors: Luciano Floridi
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers Authors: Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Farseer: A Refined Scaling Law in Large Language Models Authors: Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent Authors: Jing Liu, Toshiaki Koike-Akino, Ye Wang, Hassan Mansour, Matthew Brand
Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning Authors: Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang
Sequential-Parallel Duality in Prefix Scannable Models Authors: Morris Yau, Sharut Gupta, Valerie Engelmayer, Kazuki Irie, Stefanie Jegelka, Jacob Andreas
Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods Authors: Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao
Unsupervised Elicitation of Language Models Authors: Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike
Probabilistic Variational Contrastive Learning Authors: Minoh Jeong, Seonho Kim, Alfred Hero
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization Authors: Or Shafran, Atticus Geiger, Mor Geva
NoLoCo: No-all-reduce Low Communication Training Method for Large Models Authors: Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies
Resa: Transparent Reasoning Models via SAEs Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger
Dense Associative Memory with Epanechnikov Energy Authors: Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian, Dmitry Krotov, Parikshit Ram
Tina: Tiny Reasoning Models via LoRA Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger
Provably Learning from Language Feedback Authors: Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng
Foundation Models for Causal Inference via Prior-Data Fitted Networks Authors: Yuchen Ma, Dennis Frauen, Emil Javurek, Stefan Feuerriegel
Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models Authors: Defang Chen, Zhenyu Zhou, Can Wang, Siwei Lyu
VQC-MLPNet: An Unconventional Hybrid Quantum-Classical Architecture for Scalable and Robust Quantum Machine Learning Authors: Jun Qi, Chao-Han Yang, Pin-Yu Chen, Min-Hsiu Hsieh
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs Authors: Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Interior-Point Vanishing Problem in Semidefinite Relaxations for Neural Network Verification Authors: Ryota Ueda, Takami Sato, Ken Kobayashi, Kazuhide Nakata
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie
GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models Authors: Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang, Olgica Milenkovic, S. Rasoul Etesami
Box-Constrained Softmax Function and Its Application for Post-Hoc Calibration Authors: Kyohei Atarashi, Satoshi Oyama, Hiromi Arai, Hisashi Kashima
Principled Approaches for Extending Neural Architectures to Function Spaces for Operator Learning Authors: Julius Berner, Miguel Liu-Schiaffini, Jean Kossaifi, Valentin Duruisseaux, Boris Bonev, Kamyar Azizzadenesheli, Anima Anandkumar
Textual Bayes: Quantifying Uncertainty in LLM-Based Systems Authors: Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, Jesse C. Cresswell
TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree Authors: Yu-Yang Qian, Yuan-Ze Xu, Zhen-Yu Zhang, Peng Zhao, Zhi-Hua Zhou
Optimizing Latent Dimension Allocation in Hierarchical VAEs: Balancing Attenuation and Information Retention for OOD Detection Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer
Self-Adapting Language Models Authors: Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal
On the role of non-linear latent features in bipartite generative neural networks Authors: Tony Bonnaire, Giovanni Catania, Aurélien Decelle, Beatriz Seoane
Lattice Climber Attack: Adversarial attacks for randomized mixtures of classifiers Authors: Lucas Gnecco-Heredia, Benjamin Negrevergne, Yann Chevaleyre
Slimming Down LLMs Without Losing Their Minds Authors: Qingda (Michael), Mai
DynaSubVAE: Adaptive Subgrouping for Scalable and Robust OOD Detection Authors: Tina Behrouzi, Sana Tonekaboni, Rahul G. Krishnan, Anna Goldenberg

1. A Conjecture on a Fundamental Trade-Off between Certainty and Scope in Symbolic and Generative AI

ArXiv ID: 2506.10130

Authors: Luciano Floridi

Abstract: This article introduces a conjecture that formalises a fundamental trade-off between provable correctness and broad data-mapping capacity in Artificial Intelligence (AI) systems. When an AI system is engineered for deductively watertight guarantees (demonstrable certainty about the error-free nature of its outputs) -- as in classical symbolic AI -- its operational domain must be narrowly circumscribed and pre-structured. Conversely, a system that can input high-dimensional data to produce rich information outputs -- as in contemporary generative models -- necessarily relinquishes the possibility of zero-error performance, incurring an irreducible risk of errors or misclassification. By making this previously implicit trade-off explicit and open to rigorous verification, the conjecture significantly reframes both engineering ambitions and philosophical expectations for AI. After reviewing the historical motivations for this tension, the article states the conjecture in information-theoretic form and contextualises it within broader debates in epistemology, formal verification, and the philosophy of technology. It then offers an analysis of its implications and consequences, drawing on notions of underdetermination, prudent epistemic risk, and moral responsibility. The discussion clarifies how, if correct, the conjecture would help reshape evaluation standards, governance frameworks, and hybrid system design. The conclusion underscores the importance of eventually proving or refuting the inequality for the future of trustworthy AI.

Comment: The paper introduces a conjecture on a fundamental trade-off in AI systems, which aligns with emerging trends and theoretical insights.

Relevance: 9 Novelty: 9

2. Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

ArXiv ID: 2506.10887

Authors: Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei

Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

Comment: The paper provides a theoretical foundation for understanding out-of-context reasoning in transformers, which aligns with theoretical insights into LLM behavior.

Relevance: 9 Novelty: 9

3. Farseer: A Refined Scaling Law in Large Language Models

ArXiv ID: 2506.10972

Authors: Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

Abstract: Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, improving upon Chinchilla's law by reducing extrapolation error by 433%. This allows for the reliable evaluation of competing training strategies across all $(N,D)$ settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. We are comprehensively open-sourcing all models, data, results, and logs at https://github.com/Farseer-Scaling-Law/Farseer to foster further research.

Comment: The paper introduces a refined scaling law for LLMs, which is relevant to large language models and offers theoretical insights into their behavior.

Relevance: 9 Novelty: 9
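
For context, the Chinchilla-style baseline that the Farseer abstract compares against models loss as an additive power law in the parameter count N and the number of training tokens D. The sketch evaluates that baseline form only; Farseer's own functional form is not stated in the abstract and is not reproduced here. The function name is hypothetical, and the default constants are assumed to be roughly those of the published Chinchilla fit; treat them as illustrative.

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    e: float = 1.69, a: float = 406.4, b: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Baseline parametric loss L(N, D) = E + A / N**alpha + B / D**beta.

    The defaults are assumed, illustrative constants, not values from Farseer.
    """
    return e + a / n_params ** alpha + b / n_tokens ** beta


# Example: compare a small and a large (N, D) setting under the same law.
small = chinchilla_loss(n_params=1e9, n_tokens=2e10)
large = chinchilla_loss(n_params=1e10, n_tokens=2e11)
```

Under this form, the irreducible term E bounds the loss from below, and the two power-law terms shrink independently as N and D grow.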

4. AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent

ArXiv ID: 2506.10205

Authors: Jing Liu, Toshiaki Koike-Akino, Ye Wang, Hassan Mansour, Matthew Brand

Abstract: To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.

Comment: The paper focuses on model compression through activation-aware weight pruning and quantization, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8
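
The link between activation-aware pruning and Iterative Hard Thresholding mentioned in the AWP abstract can be illustrated with a minimal sketch. It prunes a single weight vector so that its output on calibration activations X stays close to the dense output, alternating a gradient step with a projection onto k-sparse vectors. This is a generic IHT loop under simplifying assumptions (one output neuron, no quantization, a conservative step size), not the authors' implementation; all function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gram(x: Matrix) -> Matrix:
    """Return X^T X for a (samples x features) activation matrix X."""
    d = len(x[0])
    return [[sum(r[i] * r[j] for r in x) for j in range(d)] for i in range(d)]


def keep_top_k(w: Vector, k: int) -> Vector:
    """Projection onto k-sparse vectors: keep the k largest magnitudes."""
    keep = set(sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)[:k])
    return [wi if i in keep else 0.0 for i, wi in enumerate(w)]


def output_error(c: Matrix, w: Vector, w_dense: Vector) -> float:
    """||X w - X w_dense||^2, written through the Gram matrix C = X^T X."""
    diff = [a - b for a, b in zip(w, w_dense)]
    return sum(d * g for d, g in zip(diff, matvec(c, diff)))


def prune_iht(x: Matrix, w_dense: Vector, k: int, steps: int = 200) -> Vector:
    """Projected gradient descent on the output error under a k-sparsity constraint."""
    c = gram(x)
    # trace(C) upper-bounds the largest eigenvalue of the PSD matrix C,
    # so 1 / trace(C) is a safe (if conservative) step size.
    step = 1.0 / sum(c[i][i] for i in range(len(c)))
    w = keep_top_k(w_dense, k)  # start from plain magnitude pruning
    for _ in range(steps):
        diff = [a - b for a, b in zip(w, w_dense)]
        grad = matvec(c, diff)  # gradient of 0.5 * output_error
        w = keep_top_k([wi - step * gi for wi, gi in zip(w, grad)], k)
    return w
```

Because the loop starts from magnitude pruning and uses a step size no larger than the inverse Lipschitz constant, the output error never increases from that starting point.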

5. Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

ArXiv ID: 2506.10378

Authors: Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang

Abstract: Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.

Comment: The paper uses causal representation learning to uncover latent capabilities of language models, aligning with representation learning and providing theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8

6. Sequential-Parallel Duality in Prefix Scannable Models

ArXiv ID: 2506.10918

Authors: Morris Yau, Sharut Gupta, Valerie Engelmayer, Kazuki Irie, Stefanie Jegelka, Jacob Andreas

Abstract: Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such "sequential-parallel duality." This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models -- state space models -- as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models -- in some cases exhibiting better length generalization than either.

Comment: The paper discusses a broad class of neural sequence models and introduces Prefix-Scannable Models, which is relevant to model architecture innovations.

Relevance: 9 Novelty: 8
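
The sequential-parallel duality described in the abstract above can be seen in the simplest associative case: a scalar linear recurrence h_t = a_t * h_{t-1} + b_t. The sketch computes the same states once with a sequential loop and once with a Hillis-Steele prefix scan over composed affine maps. It covers only the associative setting; the non-associative operators that define PSMs are not modeled here, and all names are hypothetical.

```python
from fractions import Fraction
from typing import List, Tuple

Affine = Tuple[Fraction, Fraction]  # (a, b) represents the map h -> a * h + b


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two affine updates (apply `first`, then `second`). Associative."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_states(updates: List[Affine], h0: Fraction) -> List[Fraction]:
    """Linear-time, constant-space recurrence h_t = a_t * h_{t-1} + b_t."""
    states, h = [], h0
    for a, b in updates:
        h = a * h + b
        states.append(h)
    return states


def parallel_prefix(updates: List[Affine]) -> List[Affine]:
    """Hillis-Steele inclusive scan: ceil(log2 n) rounds, each fully parallelizable."""
    prefix = list(updates)
    offset = 1
    while offset < len(prefix):
        prefix = [
            combine(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
    return prefix


def scanned_states(updates: List[Affine], h0: Fraction) -> List[Fraction]:
    """Apply every prefix-composed map to the initial state."""
    return [a * h0 + b for a, b in parallel_prefix(updates)]
```

Exact rational arithmetic is used so the two paths can be compared for equality without floating-point tolerance; each list comprehension inside the scan stands in for one parallel step.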

7. Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

ArXiv ID: 2506.10959

Authors: Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao

Abstract: While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding--particularly in the context of structured geometric data--remains unexplored. In this work, we initiate a theoretical study of ICL for regression of Hölder functions on manifolds. By establishing a novel connection between the attention mechanism and classical kernel methods, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.

Comment: The paper provides a theoretical study of in-context learning on structured manifolds, which is relevant to representation learning and provides foundational insights.

Relevance: 9 Novelty: 8
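
One well-known special case of the attention-kernel connection named in the abstract above can be checked numerically: when all keys have the same norm, softmax attention with scores q·k / temperature produces the same output as Nadaraya-Watson regression with a Gaussian kernel of squared bandwidth equal to the temperature. The sketch illustrates that identity only and is not the paper's construction; function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax_attention(query: Sequence[float], keys: List[Sequence[float]],
                      values: Sequence[float], temperature: float) -> float:
    """Single-query softmax attention over scalar values."""
    scores = [sum(q * k for q, k in zip(query, key)) / temperature for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def gaussian_kernel_regression(query: Sequence[float], keys: List[Sequence[float]],
                               values: Sequence[float], temperature: float) -> float:
    """Nadaraya-Watson estimate with kernel exp(-||q - k||^2 / (2 * temperature))."""
    sq_dists = [sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys]
    nearest = min(sq_dists)  # shift by the smallest distance for stability
    weights = [math.exp(-(d - nearest) / (2.0 * temperature)) for d in sq_dists]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# In-context examples: points on the unit circle (equal-norm keys) and their labels.
angles = [0.2 * i for i in range(20)]
keys = [(math.cos(t), math.sin(t)) for t in angles]
values = [math.sin(2.0 * t) for t in angles]
query = (math.cos(0.3), math.sin(0.3))
```

The identity follows from expanding ||q - k||^2 = ||q||^2 + ||k||^2 - 2 q·k: with equal-norm keys, the first two terms are constant across keys and cancel in the normalization.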

8. Unsupervised Elicitation of Language Models

ArXiv ID: 2506.10139

Authors: Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike

Abstract: To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

Comment: The paper presents an unsupervised algorithm for fine-tuning language models, which is relevant to large language models and representation learning.

Relevance: 9 Novelty: 8

9. Probabilistic Variational Contrastive Learning

ArXiv ID: 2506.10159

Authors: Minoh Jeong, Seonho Kim, Alfred Hero

Abstract: Deterministic embeddings learned by contrastive learning (CL) methods such as SimCLR and SupCon achieve state-of-the-art performance but lack a principled mechanism for uncertainty quantification. We propose Variational Contrastive Learning (VCL), a decoder-free framework that maximizes the evidence lower bound (ELBO) by interpreting the InfoNCE loss as a surrogate reconstruction term and adding a KL divergence regularizer to a uniform prior on the unit hypersphere. We model the approximate posterior $q_\theta(z|x)$ as a projected normal distribution, enabling the sampling of probabilistic embeddings. Our two instantiations--VSimCLR and VSupCon--replace deterministic embeddings with samples from $q_\theta(z|x)$ and incorporate a normalized KL term into the loss. Experiments on multiple benchmarks demonstrate that VCL mitigates dimensional collapse, enhances mutual information with class labels, and matches or outperforms deterministic baselines in classification accuracy, all the while providing meaningful uncertainty estimates through the posterior model. VCL thus equips contrastive learning with a probabilistic foundation, serving as a new basis for contrastive approaches.

Comment: The paper introduces a probabilistic approach to contrastive learning, which is relevant to representation learning and offers a new theoretical perspective.

Relevance: 9 Novelty: 8

10. Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

ArXiv ID: 2506.10920

Authors: Or Shafran, Atticus Geiger, Mor Geva

Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.

Comment: The paper focuses on representation learning by decomposing MLP activations into interpretable features using semi-nonnegative matrix factorization, which aligns with insights into how deep networks encode information.

Relevance: 9 Novelty: 8

11. NoLoCo: No-all-reduce Low Communication Training Method for Large Models

ArXiv ID: 2506.10911

Authors: Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies

Abstract: Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators, communicating over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, imposing limits on the size of models that can be trained. Several recent studies have proposed training methods that are less communication intensive, avoiding the need for a highly connected compute cluster. These state-of-the-art low communication training methods still employ a synchronization step for model parameters, which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly synchronizes model weights via a novel variant of the Nesterov momentum optimizer by partially averaging model weights with a randomly selected other one. We provide both a theoretical convergence analysis for our proposed optimizer and empirical results from language model training. We benchmark NoLoCo on a wide range of accelerator counts and model sizes, from 125M to 6.8B parameters. Our method requires significantly less communication overhead than fully sharded data parallel training or even the widely used low communication training method DiLoCo. The synchronization step itself is estimated to be one order of magnitude faster than the all-reduce used in DiLoCo for a few hundred accelerators training over the internet. We also have no global blocking communication, which reduces accelerator idling time. Compared to DiLoCo, we also observe up to 4% faster convergence across a wide range of model sizes and accelerator counts.

Comment: The paper proposes a novel optimization method, NoLoCo, for low communication training of large models, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
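
The idea of synchronizing replicas without an all-reduce, as described in the NoLoCo abstract, can be illustrated with a plain gossip-averaging sketch: in each round every replica is paired with a randomly chosen partner and the two pull their weights toward each other. This omits gradient steps and the Nesterov-momentum variant the abstract mentions, so it is a simplified stand-in and not the NoLoCo optimizer; the names `gossip_round`, `spread`, and `mix` are hypothetical.

```python
import random
from typing import List


def gossip_round(replicas: List[List[float]], mix: float, rng: random.Random) -> None:
    """Pair replicas at random and partially average each pair in place."""
    order = list(range(len(replicas)))
    rng.shuffle(order)
    # Consecutive entries of the shuffled order form pairs; with an odd
    # replica count, the last replica sits this round out.
    for i, j in zip(order[0::2], order[1::2]):
        wi, wj = replicas[i], replicas[j]
        for p in range(len(wi)):
            a, b = wi[p], wj[p]
            wi[p] = (1.0 - mix) * a + mix * b
            wj[p] = (1.0 - mix) * b + mix * a


def spread(replicas: List[List[float]]) -> float:
    """Largest per-coordinate gap between any two replicas."""
    return max(max(col) - min(col) for col in zip(*replicas))


def mean_weights(replicas: List[List[float]]) -> List[float]:
    """Per-coordinate average across replicas."""
    return [sum(col) / len(replicas) for col in zip(*replicas)]
```

Each exchange touches only two replicas, so no step requires collective communication. The symmetric update keeps the average of all replicas unchanged while the disagreement between them shrinks over rounds.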

12. Resa: Transparent Reasoning Models via SAEs

ArXiv ID: 2506.09967

Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger

Abstract: How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.

Comment: The paper introduces Resa, a reasoning model using sparse autoencoder tuning, which aligns with representation learning and insights into how deep networks encode information.

Relevance: 9 Novelty: 8

13. Dense Associative Memory with Epanechnikov Energy

ArXiv ID: 2506.10801

Authors: Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian, Dmitry Krotov, Parikshit Ram

Abstract: We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant additional emergent local minima while preserving perfect pattern recovery -- a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR's emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method's potential for both large-scale memory storage and generative tasks.

Comment: The paper proposes a novel energy function for Dense Associative Memory networks, which aligns with emerging trends in foundational research by introducing a new paradigm for memory storage.

Relevance: 8 Novelty: 9

14. Tina: Tiny Reasoning Models via LoRA

ArXiv ID: 2504.15777

Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger

Abstract: How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights & checkpoints.

Comment: The paper discusses low-rank adaptation (LoRA) for efficient reasoning in language models, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 7
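
The parameter efficiency that the Tina abstract attributes to LoRA comes from training only a low-rank correction to a frozen weight matrix. The sketch shows the standard LoRA forward pass, y = W x + (alpha / r) * B (A x), together with the fraction of parameters that are trainable. Matrix sizes, names, and the example values are hypothetical and not taken from the Tina setup.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    rank = len(a)  # A has shape (r, d_in), B has shape (d_out, r)
    low_rank = matvec(b, matvec(a, x))
    return [base + (alpha / rank) * delta
            for base, delta in zip(matvec(w, x), low_rank)]


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of a d_out x d_in layer's parameter count held by the LoRA factors."""
    return rank * (d_out + d_in) / (d_out * d_in)


# Tiny example: a 2 x 3 frozen layer with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward([1.0, 2.0, 3.0], W, A, B, alpha=2.0)
```

With B initialized to zeros, as is conventional for LoRA, the adapted layer starts out identical to the frozen base layer, which is consistent with the abstract's hypothesis that the base model's knowledge is largely preserved.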

15. Provably Learning from Language Feedback

ArXiv ID: 2506.10341

Authors: Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

Abstract: Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce transfer eluder dimension as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.

Comment: The paper formalizes the Learning from Language Feedback problem and introduces a new complexity measure, which aligns with emerging trends in theoretical insights.

Relevance: 8 Novelty: 8

16. Foundation Models for Causal Inference via Prior-Data Fitted Networks

ArXiv ID: 2506.10914

Authors: Yuchen Ma, Dennis Frauen, Emil Javurek, Stefan Feuerriegel

Abstract: Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train a foundation model for estimating conditional average treatment effects (CATEs) using back-door adjustment. We show that CausalFM performs competitively for CATE estimation using various synthetic and semi-synthetic benchmarks. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.

Comment: The paper introduces a framework for training foundation models for causal inference, which aligns with foundational research in AI for science.

Relevance: 8 Novelty: 8
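
For readers unfamiliar with the estimand in the CausalFM abstract above, the following is a minimal sketch of a conditional average treatment effect under back-door adjustment, CATE(x) = E[Y | T=1, X=x] - E[Y | T=0, X=x], estimated by simple stratification on a discrete covariate. This is a textbook baseline, not the PFN-based method the paper proposes; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def stratified_cate(rows):
    """Estimate CATE per stratum from (x, t, y) rows.

    x: discrete covariate assumed to satisfy the back-door criterion,
    t: binary treatment (0 or 1), y: numeric outcome.
    """
    # x -> [[sum_y, count] for t=0, [sum_y, count] for t=1]
    sums = defaultdict(lambda: [[0.0, 0], [0.0, 0]])
    for x, t, y in rows:
        cell = sums[x][t]
        cell[0] += y
        cell[1] += 1
    cate = {}
    for x, (ctrl, trt) in sums.items():
        if ctrl[1] and trt[1]:  # both arms must be observed (overlap)
            cate[x] = trt[0] / trt[1] - ctrl[0] / ctrl[1]
    return cate


# Toy data: stratum "c" has no control units, so it is skipped.
rows = [("a", 0, 1.0), ("a", 0, 3.0), ("a", 1, 5.0),
        ("b", 0, 2.0), ("b", 1, 2.5), ("c", 1, 9.0)]
print(stratified_cate(rows))
```

Stratification breaks down with continuous or high-dimensional covariates, which is one reason learned estimators such as the one in the abstract are of interest.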

17. Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models

ArXiv ID: 2506.10177

Authors: Defang Chen, Zhenyu Zhou, Can Wang, Siwei Lyu

Abstract: Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics: each simulated sampling trajectory lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical "boomerang" shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing ODE-based numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in regions with only 5 to 10 function evaluations.

Comment: The paper shows that deterministic sampling trajectories in diffusion-based generative models lie in a low-dimensional subspace and share a common shape, which aligns with emerging trends in theoretical insights.

Relevance: 8 Novelty: 8

18. VQC-MLPNet: An Unconventional Hybrid Quantum-Classical Architecture for Scalable and Robust Quantum Machine Learning

ArXiv ID: 2506.10275

Authors: Jun Qi, Chao-Han Yang, Pin-Yu Chen, Min-Hsiu Hsieh

Abstract: Variational Quantum Circuits (VQCs) offer a novel pathway for quantum machine learning, yet their practical application is hindered by inherent limitations such as constrained linear expressivity, optimization challenges, and acute sensitivity to quantum hardware noise. This work introduces VQC-MLPNet, a scalable and robust hybrid quantum-classical architecture designed to overcome these obstacles. By innovatively employing quantum circuits to dynamically generate parameters for classical Multi-Layer Perceptrons (MLPs) via amplitude encoding and parameterized quantum operations, VQC-MLPNet substantially expands representation capabilities and augments training stability. We provide rigorous theoretical guarantees via statistical learning techniques and Neural Tangent Kernel analysis, explicitly deriving upper bounds on approximation, uniform deviation, and optimization errors. These theoretical insights demonstrate exponential improvements in representation capacity relative to quantum circuit depth and the number of qubits, providing clear computational advantages over standalone quantum circuits and existing hybrid quantum architectures. Our theoretical claims are empirically corroborated through extensive experiments, including classifying semiconductor quantum-dot charge states and predicting genomic transcription factor binding sites, demonstrating resilient performance even under realistic IBM quantum noise simulations. This research establishes a theoretically sound and practically robust framework, advancing the frontiers of quantum-enhanced learning for unconventional computing paradigms in the Noisy Intermediate-Scale Quantum era and beyond.

Comment: The paper introduces a novel hybrid quantum-classical architecture, VQC-MLPNet, which enhances representation capabilities and training stability, aligning with representation learning and model architecture criteria.

Relevance: 8 Novelty: 8

19. Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

ArXiv ID: 2506.10967

Authors: Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang

Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.

Comment: The paper proposes a novel token pruning method for MLLMs, aligning with model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 8
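
The DPP-based selection described in the CDPruner abstract above can be sketched roughly as follows. The sketch assumes the standard quality-diversity DPP kernel, L[i][j] = r_i * sim(i, j) * r_j, where r_i is a per-token relevance score with respect to the instruction, and it picks tokens with a naive greedy determinant-maximizing loop. The paper's exact conditional kernel and inference procedure may differ; the function names, relevance scores, and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d


def greedy_dpp(tokens, relevance, k):
    """Greedily pick up to k token indices that maximize det(L restricted to the picks)."""
    n = len(tokens)
    L = [[relevance[i] * cosine(tokens[i], tokens[j]) * relevance[j]
          for j in range(n)] for i in range(n)]
    chosen = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no candidate adds positive volume
            break
        chosen.append(best)
    return chosen


# Tokens 0 and 1 are near-duplicates; token 2 is distinct but less relevant.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(greedy_dpp(tokens, relevance=[1.0, 0.9, 0.8], k=2))
```

The determinant shrinks toward zero when two selected tokens are nearly parallel, so a near-duplicate is passed over in favor of a distinct token even when the duplicate has a higher relevance score. This naive loop recomputes determinants from scratch and is meant only for illustration, not for the token counts found in real MLLMs.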

20. Interior-Point Vanishing Problem in Semidefinite Relaxations for Neural Network Verification

ArXiv ID: 2506.10269

Authors: Ryota Ueda, Takami Sato, Ken Kobayashi, Kazuhide Nakata

Abstract: Semidefinite programming (SDP) relaxation has emerged as a promising approach for neural network verification, offering tighter bounds than other convex relaxation methods for deep neural networks (DNNs) with ReLU activations. However, we identify a critical limitation in the SDP relaxation when applied to deep networks: interior-point vanishing, which leads to the loss of strict feasibility -- a crucial condition for the numerical stability and optimality of SDP. Through rigorous theoretical and empirical analysis, we demonstrate that as the depth of DNNs increases, strict feasibility is likely to be lost, creating a fundamental barrier to scaling SDP-based verification. To address the interior-point vanishing, we design and investigate five solutions to enhance the feasibility conditions of the verification problem. Our methods can successfully solve 88% of the problems that could not be solved by existing methods, accounting for 41% of the total. Our analysis also reveals that the valid constraints on the lower and upper bounds of each ReLU unit, traditionally inherited from prior work without solid justification, are not only unhelpful but actually harmful to the problem's feasibility. This work provides valuable insights into the fundamental challenges of SDP-based DNN verification and offers practical solutions to improve its applicability to deeper neural networks, contributing to the development of more reliable and secure systems with DNNs.

Comment: The paper addresses a fundamental challenge in neural network verification using semidefinite relaxations, which is relevant to understanding model behavior and efficiency.

Relevance: 8 Novelty: 8

21. Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

ArXiv ID: 2506.10395

Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie

Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.

Comment: The paper introduces a novel decoupled visual encoding architecture for a multimodal foundation model, which relates to model architecture innovations.