Personalized Daily ArXiv Papers 2025-06-11

[gpt-4o]	Prompt	Completion	Total
Token	43637	5405	49042
Cost	$0.11	$0.05	$0.16

Total arXiv papers: 668

Total scanned papers: 385

Total relevant papers: 34

Table of contents with paper titles:

Diffuse and Disperse: Image Generation with Representation Regularization Authors: Runqian Wang, Kaiming He
InfoDPCCA: Information-Theoretic Dynamic Probabilistic Canonical Correlation Analysis Authors: Shiqin Tang, Shujian Yu
Optimization over Sparse Support-Preserving Sets: Two-Step Projection with Global Optimality Guarantees Authors: William de Vazelhes, Xiao-Tong Yuan, Bin Gu
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs Authors: Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar
Diversity-Guided MLP Reduction for Efficient Large Vision Transformers Authors: Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang
Highly Compressed Tokenizer Can Generate Without Training Authors: L. Lao Beyer, T. Li, X. Chen, S. Karaman, K. He
Draft-based Approximate Inference for LLMs Authors: Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee
PropMEND: Hypernetworks for Knowledge Propagation in LLMs Authors: Zeyu Leo Liu, Greg Durrett, Eunsol Choi
Understanding Task Vectors in In-Context Learning: Emergence, Functionality, and Limitations Authors: Yuxin Dong, Jiachen Jiang, Zhihui Zhu, Xia Ning
SWAT-NN: Simultaneous Weights and Architecture Training for Neural Networks in a Latent Space Authors: Zitong Huang, Mansooreh Montazerin, Ajitesh Srivastava
Leveraging chaos in the training of artificial neural networks Authors: Pedro Jim\'enez-Gonz\'alez, Miguel C. Soriano, Lucas Lacasa
sparseGeoHOPCA: A Geometric Solution to Sparse Higher-Order PCA Without Covariance Estimation Authors: Renjie Xu, Chong Wu, Maolin Che, Zhuoheng Ran, Yimin Wei, Hong Yan
The Geometries of Truth Are Orthogonal Across Tasks Authors: Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, Louis Bethune, Michal Klein, Pierre Ablin, Marco Cuturi
Edit Flows: Flow Matching with Edit Operations Authors: Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen
Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion Authors: Alan N. Amin, Nate Gruver, Andrew Gordon Wilson
Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction Authors: Ruben Weitzman, Peter M{\o}rch Groth, Lood Van Niekerk, Aoi Otani, Yarin Gal, Debora Marks, Pascal Notin
Improved Scaling Laws in Linear Regression via Data Reuse Authors: Licong Lin, Jingfeng Wu, Peter L. Bartlett
Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings Authors: Evan Markou, Thalaiyasingam Ajanthan, Stephen Gould
AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) Authors: Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das
Explaining, Fast and Slow: Abstraction and Refinement of Provable Explanations Authors: Shahaf Bassan, Yizhak Yisrael Elboher, Tobias Ladner, Matthias Althoff, Guy Katz
BLUR: A Bi-Level Optimization Approach for LLM Unlearning Authors: Hadi Reisizadeh, Jinghan Jia, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, Sijia Liu, Mingyi Hong
On Reasoning Strength Planning in Large Reasoning Models Authors: Leheng Sheng, An Zhang, Zijian Wu, Weixiang Zhao, Changshuo Shen, Yi Zhang, Xiang Wang, Tat-Seng Chua
Thermodynamically Consistent Latent Dynamics Identification for Parametric Systems Authors: Xiaolong He, Yeonjong Shin, Anthony Gruber, Sohyeon Jung, Kookjin Lee, Youngsoo Choi
Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning Authors: Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu
What makes an Ensemble (Un) Interpretable? Authors: Shahaf Bassan, Guy Amir, Meirav Zehavi, Guy Katz
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings Authors: Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou
BioLangFusion: Multimodal Fusion of DNA, mRNA, and Protein Language Models Authors: Amina Mollaysa, Artem Moskale, Pushpak Pati, Tommaso Mansi, Mangal Prakash, Rui Liao
MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature Authors: Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Aligning Proteins and Language: A Foundation Model for Protein Retrieval Authors: Qifeng Wu, Zhengzhe Liu, Han Zhu, Yizhou Zhao, Daisuke Kihara, Min Xu
SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging Authors: Nhat Thanh Tran, Fanghui Xue, Shuai Zhang, Jiancheng Lyu, Yunling Zheng, Yingyong Qi, Jack Xin
AlphaFold Database Debiasing for Robust Inverse Folding Authors: Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Siyuan Li, Yufei Huang, Stan Z. Li
An Adaptive Method Stabilizing Activations for Enhanced Generalization Authors: Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Propositional Logic for Probing Generalization in Neural Networks Authors: Anna Langedijk, Jaap Jumelet, Willem Zuidema

1. Diffuse and Disperse: Image Generation with Representation Regularization

ArXiv ID: 2506.09027

Authors: Runqian Wang, Kaiming He

Abstract: The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose \textit{Dispersive Loss}, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate Dispersive Loss on the ImageNet dataset across a range of models and report consistent improvements over widely used and strong baselines. We hope our work will help bridge the gap between generative modeling and representation learning.

Comment: Author match

2. InfoDPCCA: Information-Theoretic Dynamic Probabilistic Canonical Correlation Analysis

ArXiv ID: 2506.08884

Authors: Shiqin Tang, Shujian Yu

Abstract: Extracting meaningful latent representations from high-dimensional sequential data is a crucial challenge in machine learning, with applications spanning natural science and engineering. We introduce InfoDPCCA, a dynamic probabilistic Canonical Correlation Analysis (CCA) framework designed to model two interdependent sequences of observations. InfoDPCCA leverages a novel information-theoretic objective to extract a shared latent representation that captures the mutual structure between the data streams and balances representation compression and predictive sufficiency while also learning separate latent components that encode information specific to each sequence. Unlike prior dynamic CCA models, such as DPCCA, our approach explicitly enforces the shared latent space to encode only the mutual information between the sequences, improving interpretability and robustness. We further introduce a two-step training scheme to bridge the gap between information-theoretic representation learning and generative modeling, along with a residual connection mechanism to enhance training stability. Through experiments on synthetic and medical fMRI data, we demonstrate that InfoDPCCA excels as a tool for representation learning. Code of InfoDPCCA is available at https://github.com/marcusstang/InfoDPCCA.

Comment: The paper introduces InfoDPCCA, a novel framework for representation learning using an information-theoretic approach, which aligns with the representation learning criterion.

Relevance: 9 Novelty: 8

3. Optimization over Sparse Support-Preserving Sets: Two-Step Projection with Global Optimality Guarantees

ArXiv ID: 2506.08558

Authors: William de Vazelhes, Xiao-Tong Yuan, Bin Gu

Abstract: In sparse optimization, enforcing hard constraints using the $\ell_0$ pseudo-norm offers advantages like controlled sparsity compared to convex relaxations. However, many real-world applications demand not only sparsity constraints but also some extra constraints. While prior algorithms have been developed to address this complex scenario with mixed combinatorial and convex constraints, they typically require the closed form projection onto the mixed constraints which might not exist, and/or only provide local guarantees of convergence which is different from the global guarantees commonly sought in sparse optimization. To fill this gap, in this paper, we study the problem of sparse optimization with extra \qw{\textit{support-preserving}} constraints commonly encountered in the literature. We present a new variant of iterative hard-thresholding algorithm equipped with a two-step consecutive projection operator customized for these mixed constraints, serving as a simple alternative to the Euclidean projection onto the mixed constraint. By introducing a novel trade-off between sparsity relaxation and sub-optimality, we provide global guarantees in objective value for the output of our algorithm, in the deterministic, stochastic, and zeroth-order settings, under the conventional restricted strong-convexity/smoothness assumptions. As a fundamental contribution in proof techniques, we develop a novel extension of the classic three-point lemma to the considered two-step non-convex projection operator, which allows us to analyze the convergence in objective value in an elegant way that has not been possible with existing techniques. In the zeroth-order case, such technique also improves upon the state-of-the-art result from de Vazelhes et. al. (2022), even in the case without additional constraints, by allowing us to remove a non-vanishing system error present in their work.

Comment: The paper presents a new sparse optimization algorithm with global optimality guarantees, relevant to model compression through sparsity.

Relevance: 9 Novelty: 8

4. SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

ArXiv ID: 2506.08889

Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang

Abstract: We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained model without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with 4K token budget in AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.

Comment: The paper introduces SeerAttention-R, a sparse attention framework, which is relevant to model architecture and efficiency through sparse methods.

Relevance: 9 Novelty: 8

5. e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

ArXiv ID: 2506.09026

Authors: Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar

Abstract: Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chains additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically-designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.

Comment: The paper discusses test-time scaling and in-context exploration for LLMs, which relates to foundational research in LLM architecture and theoretical insights.

Relevance: 9 Novelty: 8

6. Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

ArXiv ID: 2506.08591

Authors: Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang

Abstract: Transformer models achieve excellent scaling property, where the performance is improved with the increment of model capacity. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons of MLP hidden layer, while preserving weight diversity for better performance recover during distillation. Compared to the model trained from scratch, our pruned model only requires 0.06\% data of LAION-2B (for the training of large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0\% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5\% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.

Comment: The paper proposes a method for MLP reduction in vision transformers, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

7. Highly Compressed Tokenizer Can Generate Without Training

ArXiv ID: 2506.08257

Authors: L. Lao Beyer, T. Li, X. Chen, S. Karaman, K. He

Abstract: Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.

Comment: The paper discusses a highly compressed tokenizer with vector quantization, which relates to model compression and efficiency. It introduces a novel approach to image generation without training a generative model, aligning with foundational research in representation learning.

Relevance: 9 Novelty: 8

8. Draft-based Approximate Inference for LLMs

ArXiv ID: 2506.08373

Authors: Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

Abstract: Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.

Comment: The paper proposes a novel framework for approximate LLM inference using draft models, which relates to model efficiency and compression. It introduces new methods for inference acceleration, aligning with foundational research in LLMs.

Relevance: 9 Novelty: 8

9. PropMEND: Hypernetworks for Knowledge Propagation in LLMs

ArXiv ID: 2506.08920

Authors: Zeyu Leo Liu, Greg Durrett, Eunsol Choi

Abstract: Knowledge editing techniques for large language models (LLMs) can inject knowledge that is later reproducible verbatim, but they fall short on propagating that knowledge: models cannot answer questions that require reasoning with the injected knowledge. We present a hypernetwork-based approach for knowledge propagation, named PropMEND, where we meta-learn how to modify gradients of a language modeling loss to encourage injected information to propagate. Our approach extends the meta-objective of MEND [29] so that gradient updates on knowledge are transformed to enable answering multi-hop questions involving that knowledge. We show improved performance on the RippleEdit dataset, showing almost 2x accuracy on challenging multi-hop questions whose answers are not explicitly stated in the injected fact. We further introduce a new dataset, Controlled RippleEdit, to evaluate the generalization of our hypernetwork, testing knowledge propagation along relations and entities unseen during hypernetwork training. PropMEND still outperforms existing approaches in unseen entity-relation pairs, yet the performance gap decreases substantially, suggesting future work in propagating knowledge to a wide range of relations.

Comment: The paper introduces a hypernetwork-based approach for knowledge propagation in LLMs, which is relevant to foundational research in LLM behavior and architecture. It presents a novel method for improving knowledge propagation.

Relevance: 9 Novelty: 8

10. Understanding Task Vectors in In-Context Learning: Emergence, Functionality, and Limitations

ArXiv ID: 2506.09048

Authors: Yuxin Dong, Jiachen Jiang, Zhihui Zhu, Xia Ning

Abstract: Task vectors offer a compelling mechanism for accelerating inference in in-context learning (ICL) by distilling task-specific information into a single, reusable representation. Despite their empirical success, the underlying principles governing their emergence and functionality remain unclear. This work proposes the Linear Combination Conjecture, positing that task vectors act as single in-context demonstrations formed through linear combinations of the original ones. We provide both theoretical and empirical support for this conjecture. First, we show that task vectors naturally emerge in linear transformers trained on triplet-formatted prompts through loss landscape analysis. Next, we predict the failure of task vectors on representing high-rank mappings and confirm this on practical LLMs. Our findings are further validated through saliency analyses and parameter visualization, suggesting an enhancement of task vectors by injecting multiple ones into few-shot prompts. Together, our results advance the understanding of task vectors and shed light on the mechanisms underlying ICL in transformer-based models.

Comment: The paper provides theoretical insights into task vectors in in-context learning, which relates to representation learning and understanding how deep networks encode information.

Relevance: 9 Novelty: 8

11. SWAT-NN: Simultaneous Weights and Architecture Training for Neural Networks in a Latent Space

ArXiv ID: 2506.08270

Authors: Zitong Huang, Mansooreh Montazerin, Ajitesh Srivastava

Abstract: Designing neural networks typically relies on manual trial and error or a neural architecture search (NAS) followed by weight training. The former is time-consuming and labor-intensive, while the latter often discretizes architecture search and weight optimization. In this paper, we propose a fundamentally different approach that simultaneously optimizes both the architecture and the weights of a neural network. Our framework first trains a universal multi-scale autoencoder that embeds both architectural and parametric information into a continuous latent space, where functionally similar neural networks are mapped closer together. Given a dataset, we then randomly initialize a point in the embedding space and update it via gradient descent to obtain the optimal neural network, jointly optimizing its structure and weights. The optimization process incorporates sparsity and compactness penalties to promote efficient models. Experiments on synthetic regression tasks demonstrate that our method effectively discovers sparse and compact neural networks with strong performance.

Comment: The paper proposes a novel approach to simultaneously optimize neural network architecture and weights, relevant to model architecture innovations.

Relevance: 9 Novelty: 8

12. Leveraging chaos in the training of artificial neural networks

ArXiv ID: 2506.08523

Authors: Pedro Jim\'enez-Gonz\'alez, Miguel C. Soriano, Lucas Lacasa

Abstract: Traditional algorithms to optimize artificial neural networks when confronted with a supervised learning task are usually exploitation-type relaxational dynamics such as gradient descent (GD). Here, we explore the dynamics of the neural network trajectory along training for unconventionally large learning rates. We show that for a region of values of the learning rate, the GD optimization shifts away from purely exploitation-like algorithm into a regime of exploration-exploitation balance, as the neural network is still capable of learning but the trajectory shows sensitive dependence on initial conditions -- as characterized by positive network maximum Lyapunov exponent --. Interestingly, the characteristic training time required to reach an acceptable accuracy in the test set reaches a minimum precisely in such learning rate region, further suggesting that one can accelerate the training of artificial neural networks by locating at the onset of chaos. Our results -- initially illustrated for the MNIST classification task -- qualitatively hold for a range of supervised learning tasks, learning architectures and other hyperparameters, and showcase the emergent, constructive role of transient chaotic dynamics in the training of artificial neural networks.

Comment: The paper explores the dynamics of neural network training with large learning rates, providing insights into training dynamics and the role of chaos, which aligns with representation learning.

Relevance: 9 Novelty: 8

13. sparseGeoHOPCA: A Geometric Solution to Sparse Higher-Order PCA Without Covariance Estimation

ArXiv ID: 2506.08670

Authors: Renjie Xu, Chong Wu, Maolin Che, Zhuoheng Ran, Yimin Wei, Hong Yan

Abstract: We propose sparseGeoHOPCA, a novel framework for sparse higher-order principal component analysis (SHOPCA) that introduces a geometric perspective to high-dimensional tensor decomposition. By unfolding the input tensor along each mode and reformulating the resulting subproblems as structured binary linear optimization problems, our method transforms the original nonconvex sparse objective into a tractable geometric form. This eliminates the need for explicit covariance estimation and iterative deflation, enabling significant gains in both computational efficiency and interpretability, particularly in high-dimensional and unbalanced data scenarios. We theoretically establish the equivalence between the geometric subproblems and the original SHOPCA formulation, and derive worst-case approximation error bounds based on classical PCA residuals, providing data-dependent performance guarantees. The proposed algorithm achieves a total computational complexity of $O\left(\sum_{n=1}^{N} (k_n^3 + J_n k_n^2)\right)$, which scales linearly with tensor size. Extensive experiments demonstrate that sparseGeoHOPCA accurately recovers sparse supports in synthetic settings, preserves classification performance under 10$\times$ compression, and achieves high-quality image reconstruction on ImageNet, highlighting its robustness and versatility.

Comment: The paper presents sparseGeoHOPCA, a novel framework for sparse higher-order PCA, which is relevant to model compression and representation learning.

Relevance: 9 Novelty: 8

14. The Geometries of Truth Are Orthogonal Across Tasks

ArXiv ID: 2506.08572

Authors: Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, Louis Bethune, Michal Klein, Pierre Ablin, Marco Cuturi

Abstract: Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still mired by concerns on their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a "geometry of truth" can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these "geometries of truth" are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained across distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.

Comment: The paper discusses the task-dependent nature of 'geometries of truth' in LLMs, providing theoretical insights into LLM behavior.

Relevance: 9 Novelty: 7

15. Edit Flows: Flow Matching with Edit Operations

ArXiv ID: 2506.09018

Authors: Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen

Abstract: Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations-insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enable flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation.

Comment: The paper proposes Edit Flows, a non-autoregressive model using edit operations, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

16. Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion

ArXiv ID: 2506.08316

Authors: Alan N. Amin, Nate Gruver, Andrew Gordon Wilson

Abstract: Discrete diffusion models, like continuous diffusion models, generate high-quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and access to improved sampling algorithms. In practice, however, the consistently best performing discrete diffusion model is, surprisingly, masking diffusion, which does not denoise gradually. Here we explain the superior performance of masking diffusion by noting that it makes use of a fundamental difference between continuous and discrete Markov processes: discrete Markov processes evolve by discontinuous jumps at a fixed rate and, unlike other discrete diffusion models, masking diffusion builds in the known distribution of jump times and only learns where to jump to. We show that we can similarly bake in the known distribution of jump times into any discrete diffusion model. The resulting models - schedule-conditioned discrete diffusion (SCUD) - generalize classical discrete diffusion and masking diffusion. By applying SCUD to models with noising processes that incorporate inductive biases on images, text, and protein data, we build models that outperform masking.

Comment: The paper explains the superior performance of masking diffusion in discrete diffusion models, which is relevant to emerging trends in model architecture.

Relevance: 8 Novelty: 8

17. Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction

ArXiv ID: 2506.08954

Authors: Ruben Weitzman, Peter M{\o}rch Groth, Lood Van Niekerk, Aoi Otani, Yarin Gal, Debora Marks, Pascal Notin

Abstract: Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertions & deletions patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time -- offering a scalable alternative to alignment-centric approaches.

Comment: The paper introduces a new framework for protein homology search, which is foundational research in molecular modeling.

Relevance: 8 Novelty: 8

18. Improved Scaling Laws in Linear Regression via Data Reuse

ArXiv ID: 2506.08415

Authors: Licong Lin, Jingfeng Wu, Peter L. Bartlett

Abstract: Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling can be unsustainable when running out of new data. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a > b > 1$), we show that multi-pass SGD achieves a test error of $\Theta(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $\Theta(M^{1-b} + N^{(1-b)/a})$ (see e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L>N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.

Comment: The paper provides theoretical insights into improving scaling laws in linear regression, which is relevant to representation learning and model efficiency.

Relevance: 8 Novelty: 8

19. Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings

ArXiv ID: 2506.08428

Authors: Evan Markou, Thalaiyasingam Ajanthan, Stephen Gould

Abstract: Many high-dimensional optimisation problems exhibit rich geometric structures in their set of minimisers, often forming smooth manifolds due to over-parametrisation or symmetries. When this structure is known, at least locally, it can be exploited through reduction mappings that reparametrise part of the parameter space to lie on the solution manifold. These reductions naturally arise from inner optimisation problems and effectively remove redundant directions, yielding a lower-dimensional objective. In this work, we introduce a general framework to understand how such reductions influence the optimisation landscape. We show that well-designed reduction mappings improve curvature properties of the objective, leading to better-conditioned problems and theoretically faster convergence for gradient-based methods. Our analysis unifies a range of scenarios where structural information at optimality is leveraged to accelerate convergence, offering a principled explanation for the empirical gains observed in such optimisation algorithms.

Comment: The paper discusses a framework for understanding optimization landscapes, which is relevant to emerging trends in theoretical work.

Relevance: 8 Novelty: 8

20. AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)

ArXiv ID: 2506.08885

Authors: Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das

Abstract: Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.

Comment: The paper introduces a new framework for safeguarding LLMs against adversarial attacks, providing insights into LLM behavior and interpretability.

Relevance: 8 Novelty: 8

21. Explaining, Fast and Slow: Abstraction and Refinement of Provable Explanations

ArXiv ID: 2506.08505

Authors: Shahaf Bassan, Yizhak Yisrael Elboher, Tobias Ladner, Matthias Althoff, Guy Katz

Abstract: Despite significant advancements in post-hoc explainability techniques for neural networks, many current methods rely on heuristics and do not provide formally provable guarantees over the explanations provided. Recent work has shown that it is possible to obtain explanations with formal guarantees by identifying subsets of input features that are sufficient to determine that predictions remain unchanged using neural network verification techniques. Despite the appeal of these explanations, their computation faces significant scalability challenges. In this work, we address this gap by proposing a novel abstraction-refinement technique for efficiently computing provably sufficient explanations of neural network predictions. Our method abstracts the original large neural network by constructing a substantially reduced network, where a sufficient explanation of the reduced network is also provably sufficient for the original network, hence significantly speeding up the verification process. If the explanation is in sufficient on the reduced network, we iteratively refine the network size by gradually increasing it until convergence. Our experiments demonstrate that our approach enhances the efficiency of obtaining provably sufficient explanations for neural network predictions while additionally providing a fine-grained interpretation of the network's predictions across different abstraction levels.

Comment: The paper proposes an abstraction-refinement technique for computing provable explanations, which is relevant to representation learning and model interpretability.

Relevance: 8 Novelty: 8

22. BLUR: A Bi-Level Optimization Approach for LLM Unlearning

ArXiv ID: 2506.08164

Authors: Hadi Reisizadeh, Jinghan Jia, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, Sijia Liu, Mingyi Hong

Abstract: Enabling large language models (LLMs) to unlearn knowledge and capabilities acquired during training has proven vital for ensuring compliance with data regulations and promoting ethical practices in generative AI. Although there are growing interests in developing various unlearning algorithms, it remains unclear how to best formulate the unlearning problem. The most popular formulation uses a weighted sum of forget and retain loss, but it often leads to performance degradation due to the inherent trade-off between forget and retain losses. In this work, we argue that it is important to model the hierarchical structure of the unlearning problem, where the forget problem (which \textit{unlearns} certain knowledge and/or capabilities) takes priority over the retain problem (which preserves model utility). This hierarchical structure naturally leads to a bi-level optimization formulation where the lower-level objective focuses on minimizing the forget loss, while the upper-level objective aims to maintain the model's utility. Based on this new formulation, we propose a novel algorithm, termed Bi-Level UnleaRning (\texttt{BLUR}), which not only possesses strong theoretical guarantees but more importantly, delivers superior performance. In particular, our extensive experiments demonstrate that \texttt{BLUR} consistently outperforms all the state-of-the-art algorithms across various unlearning tasks, models, and metrics. Codes are available at https://github.com/OptimAI-Lab/BLURLLMUnlearning.

Comment: The paper proposes a novel bi-level optimization approach for LLM unlearning, which is relevant to foundational research in LLMs.