Personalized Daily ArXiv Papers 2025-07-15

[gpt-4o]	Prompt	Completion	Total
Token	54472	6802	61274
Cost	$0.14	$0.07	$0.2

Total arXiv papers: 866

Total scanned papers: 532

Total relevant papers: 36

Table of contents with paper titles:

Lizard: An Efficient Linearization Framework for Large Language Models Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
Continuous-Time Signal Decomposition: An Implicit Neural Generalization of PCA and ICA Authors: Shayan K. Azmoodeh, Krishna Subramani, Paris Smaragdis
Quantize-then-Rectify: Efficient VQ-VAE Training Authors: Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Compression Method for Deep Diagonal State Space Model Based on $H^2$ Optimal Reduction Authors: Hiroki Sakamoto, Kazuhiro Sato
From Sequence to Structure: Uncovering Substructure Reasoning in Transformers Authors: Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang
BitParticle: Partializing Sparse Dual-Factors to Build Quasi-Synchronizing MAC Arrays for Energy-efficient DNNs Authors: Feilong Qiaoyuan, Jihe Wang, Zhiyu Sun, Linying Wu, Yuanhua Xiao, Danghui Wang
Algorithm Development in Neural Networks: Insights from the Streaming Parity Task Authors: Loek van Rossem, Andrew M. Saxe
Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models Authors: Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Memorization Sinks: Isolating Memorization during LLM Training Authors: Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan
On Information Geometry and Iterative Optimization in Model Compression: Operator Factorization Authors: Zakhar Shumaylov, Vasileios Tsiaras, Yannis Stylianou
A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention Authors: Nandan Kumar Jha, Brandon Reagen
A Generalization Theory for Zero-Shot Prediction Authors: Ronak Mehta, Zaid Harchaoui
Transformers Don't In-Context Learn Least Squares Regression Authors: Joshua Hill, Benjamin Eyre, Elliot Creager
DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models Authors: Luolin Xiong, Haofen Wang, Xi Chen, Lu Sheng, Yun Xiong, Jingping Liu, Yanghua Xiao, Huajun Chen, Qing-Long Han, Yang Tang
QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models Authors: Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu
Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi
(Almost) Free Modality Stitching of Foundation Models Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto
Meta-autoencoders: An approach to discovery and representation of relationships between dynamically evolving classes Authors: Assaf Marron, Smadar Szekely, Irun Cohen, David Harel
Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts Authors: Aakash Tripathi, Ian E. Nielsen, Muhammad Umer, Ravi P. Ramachandran, Ghulam Rasool
Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving Authors: Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park
Graph World Model Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You
A Mixture of Linear Corrections Generates Secure Code Authors: Weichen Yu, Ravi Mangal, Terry Zhuo, Matt Fredrikson, Corina S. Pasareanu
Disentangling Neural Disjunctive Normal Form Models Authors: Kexin Gu Baugh, Vincent Perreault, Matthew Baugh, Luke Dickens, Katsumi Inoue, Alessandra Russo
BioAnalyst: A Foundation Model for Biodiversity Authors: Athanasios Trantas, Martino Mensio, Stylianos Stasinos, Sebastian Gribincea, Taimur Khan, Damian Podareanu, Aliene van der Veen
Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning Authors: Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye
Some Super-approximation Rates of ReLU Neural Networks for Korobov Functions Authors: Yuwen Li, Guozhi Zhang
Multiple Choice Learning of Low Rank Adapters for Language Modeling Authors: Victor Letzelter, Hugo Malard, Mathieu Fontaine, Ga\"el Richard, Slim Essid, Andrei Bursuc, Patrick P\'erez
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition Authors: Qinyuan Ye, Robin Jia, Xiang Ren
Text-Driven Causal Representation Learning for Source-Free Domain Generalization Authors: Lihua Zhou, Mao Ye, Nianxin Li, Shuaifeng Li, Jinlin Wu, Xiatian Zhu, Lei Deng, Hongbin Liu, Jiebo Luo, Zhen Lei
Zero-Shot Neural Architecture Search with Weighted Response Correlation Authors: Kun Jing, Luoyu Chen, Jungang Xu, Jianwei Tai, Yiyu Wang, Shuaimin Li
Scaling Laws for Optimal Data Mixtures Authors: Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin
Think Clearly: Improving Reasoning via Redundant Token Pruning Authors: Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati
MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora Authors: Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Trung Le, Dragan Ga\v{s}evi\'c, Yuan-Fang Li, Thanh-Toan Do
A Pre-training Framework for Relational Data with Information-theoretic Principles Authors: Quang Truong, Zhikai Chen, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang
Cameras as Relative Positional Encoding Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa

1. Lizard: An Efficient Linearization Framework for Large Language Models

ArXiv ID: 2507.09025

Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen

Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.

Comment: The paper proposes a linearization framework for LLMs, addressing memory and computational bottlenecks, which is relevant to model compression and architecture.

Relevance: 9 Novelty: 8

2. Continuous-Time Signal Decomposition: An Implicit Neural Generalization of PCA and ICA

ArXiv ID: 2507.09091

Authors: Shayan K. Azmoodeh, Krishna Subramani, Paris Smaragdis

Abstract: We generalize the low-rank decomposition problem, such as principal and independent component analysis (PCA, ICA) for continuous-time vector-valued signals and provide a model-agnostic implicit neural signal representation framework to learn numerical approximations to solve the problem. Modeling signals as continuous-time stochastic processes, we unify the approaches to both the PCA and ICA problems in the continuous setting through a contrast function term in the network loss, enforcing the desired statistical properties of the source signals (decorrelation, independence) learned in the decomposition. This extension to a continuous domain allows the application of such decompositions to point clouds and irregularly sampled signals where standard techniques are not applicable.

Comment: The paper generalizes PCA and ICA for continuous-time signals, which is relevant to representation learning and introduces a novel theoretical framework.

Relevance: 9 Novelty: 8

3. Quantize-then-Rectify: Efficient VQ-VAE Training

ArXiv ID: 2507.10547

Authors: Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu

Abstract: Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present \textbf{Quantize-then-Rectify (ReVQ)}, a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating \textbf{channel multi-group quantization} to enlarge codebook capacity and a \textbf{post rectifier} to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.

Comment: The paper presents a novel framework for efficient VQ-VAE training, which is relevant to model compression through quantization and efficiency improvements.

Relevance: 9 Novelty: 8

4. Compression Method for Deep Diagonal State Space Model Based on $H^2$ Optimal Reduction

ArXiv ID: 2507.10078

Authors: Hiroki Sakamoto, Kazuhiro Sato

Abstract: Deep learning models incorporating linear SSMs have gained attention for capturing long-range dependencies in sequential data. However, their large parameter sizes pose challenges for deployment on resource-constrained devices. In this study, we propose an efficient parameter reduction method for these models by applying $H^{2}$ model order reduction techniques from control theory to their linear SSM components. In experiments, the LRA benchmark results show that the model compression based on our proposed method outperforms an existing method using the Balanced Truncation, while successfully reducing the number of parameters in the SSMs to $1/32$ without sacrificing the performance of the original models.

Comment: The paper proposes a novel compression method for deep diagonal state space models using $H^2$ optimal reduction, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8

5. From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

ArXiv ID: 2507.10435

Authors: Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang

Abstract: Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.

Comment: The paper explores how transformers can perform substructure extraction from graph data, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 8

6. BitParticle: Partializing Sparse Dual-Factors to Build Quasi-Synchronizing MAC Arrays for Energy-efficient DNNs

ArXiv ID: 2507.09780

Authors: Feilong Qiaoyuan, Jihe Wang, Zhiyu Sun, Linying Wu, Yuanhua Xiao, Danghui Wang

Abstract: Bit-level sparsity in quantized deep neural networks (DNNs) offers significant potential for optimizing Multiply-Accumulate (MAC) operations. However, two key challenges still limit its practical exploitation. First, conventional bit-serial approaches cannot simultaneously leverage the sparsity of both factors, leading to a complete waste of one factor' s sparsity. Methods designed to exploit dual-factor sparsity are still in the early stages of exploration, facing the challenge of partial product explosion. Second, the fluctuation of bit-level sparsity leads to variable cycle counts for MAC operations. Existing synchronous scheduling schemes that are suitable for dual-factor sparsity exhibit poor flexibility and still result in significant underutilization of MAC units. To address the first challenge, this study proposes a MAC unit that leverages dual-factor sparsity through the emerging particlization-based approach. The proposed design addresses the issue of partial product explosion through simple control logic, resulting in a more area- and energy-efficient MAC unit. In addition, by discarding less significant intermediate results, the design allows for further hardware simplification at the cost of minor accuracy loss. To address the second challenge, a quasi-synchronous scheme is introduced that adds cycle-level elasticity to the MAC array, reducing pipeline stalls and thereby improving MAC unit utilization. Evaluation results show that the exact version of the proposed MAC array architecture achieves a 29.2% improvement in area efficiency compared to the state-of-the-art bit-sparsity-driven architecture, while maintaining comparable energy efficiency. The approximate variant further improves energy efficiency by 7.5%, compared to the exact version. Index-Terms: DNN acceleration, Bit-level sparsity, MAC unit

Comment: The paper proposes a novel MAC unit design leveraging bit-level sparsity, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8

7. Algorithm Development in Neural Networks: Insights from the Streaming Parity Task

ArXiv ID: 2507.09897

Authors: Loek van Rossem, Andrew M. Saxe

Abstract: Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks (RNNs) trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.

Comment: The paper provides insights into the learning dynamics of RNNs, which is relevant to representation learning and understanding neural network behavior.

Relevance: 9 Novelty: 8

8. Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models

ArXiv ID: 2507.09185

Authors: Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov

Abstract: Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.

Comment: The paper introduces a method for pruning neurons in LLMs to enhance generalization, which is relevant to model compression and LLM behavior.

Relevance: 9 Novelty: 8

9. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

ArXiv ID: 2507.10524

Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun

Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.

Comment: The paper introduces Mixture-of-Recursions (MoR), a novel framework combining parameter sharing and adaptive computation, which is highly relevant to model architecture innovations.

Relevance: 9 Novelty: 8

10. Memorization Sinks: Isolating Memorization during LLM Training

ArXiv ID: 2507.09937

Authors: Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan

Abstract: Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. A popular mitigation strategy is to remove memorized information from specific neurons post-hoc. However, such approaches have shown limited success so far. In a controlled setting, we show that the memorization of natural sequences (those that resemble linguistically plausible text) become mechanistically entangled with general language abilities, thereby becoming challenging to remove post-hoc. In this work, we put forward a new paradigm of MemSinks that promotes isolation of memorization by design. We leverage a sequence identifier that activates a unique set of memorization neurons for each sequence across repetitions. By analyzing the dynamics of learning and forgetting, we argue that MemSinks facilitates isolation of memorized content, making it easier to remove without compromising general language capabilities. We implement MemSinks at the billion-parameter and billion-token scale, and observe both effective isolation and strong generalization. To our knowledge, this is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable. We open-source our code at http://github.com/grghosal/MemSinks.

Comment: The paper introduces MemSinks, a new paradigm for isolating memorization in LLMs, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 8

11. On Information Geometry and Iterative Optimization in Model Compression: Operator Factorization

ArXiv ID: 2507.09428

Authors: Zakhar Shumaylov, Vasileios Tsiaras, Yannis Stylianou

Abstract: The ever-increasing parameter counts of deep learning models necessitate effective compression techniques for deployment on resource-constrained devices. This paper explores the application of information geometry, the study of density-induced metrics on parameter spaces, to analyze existing methods within the space of model compression, primarily focusing on operator factorization. Adopting this perspective highlights the core challenge: defining an optimal low-compute submanifold (or subset) and projecting onto it. We argue that many successful model compression approaches can be understood as implicitly approximating information divergences for this projection. We highlight that when compressing a pre-trained model, using information divergences is paramount for achieving improved zero-shot accuracy, yet this may no longer be the case when the model is fine-tuned. In such scenarios, trainability of bottlenecked models turns out to be far more important for achieving high compression ratios with minimal performance degradation, necessitating adoption of iterative methods. In this context, we prove convergence of iterative singular value thresholding for training neural networks subject to a soft rank constraint. To further illustrate the utility of this perspective, we showcase how simple modifications to existing methods through softer rank reduction result in improved performance under fixed compression rates.

Comment: The paper explores information geometry for model compression, focusing on operator factorization and iterative optimization, which is relevant to model compression techniques.

Relevance: 9 Novelty: 8

12. A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention

ArXiv ID: 2507.09394

Authors: Nandan Kumar Jha, Brandon Reagen

Abstract: In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer's internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the $W_{Q}W_{K}^\top$ gram matrix throughout training, comparing three variants: the standard multi-head attention (MHA) baseline, MLA-PreRoPE with rotary applied before compression, and MLA-Decoupled, which shares a single rotary sub-vector across all heads. Our random matrix analysis reveals \textbf{three key findings:} \textbf{ i)} capacity bottlenecks emerge locally: both MHA and MLA-PreRoPE exhibit sharp, early spikes in specific layers that persist and propagate, disrupting the balance between bulk and outlier directions; \textbf{ ii)} these spikes coincide with rank collapse, concentrating the model's expressivity into narrow subspaces; \textbf{ iii)} only the decoupled variant prevents this cascade, maintaining broad spectral support and suppressing outlier formation across layers. These results underscore that \emph{how} rotary embeddings are applied is just as critical as \emph{where} compression occurs. Sharing rotary components across heads mitigates spectral fragmentation and preserves representational capacity.

Comment: The paper provides a theoretical analysis of multi-head latent attention in transformers using random matrix theory, which is relevant to model architecture and representation learning.

Relevance: 9 Novelty: 8

13. A Generalization Theory for Zero-Shot Prediction

ArXiv ID: 2507.09128

Authors: Ronak Mehta, Zaid Harchaoui

Abstract: A modern paradigm for generalization in machine learning and AI consists of pre-training a task-agnostic foundation model, generally obtained using self-supervised and multimodal contrastive learning. The resulting representations can be used for prediction on a downstream task for which no labeled data is available. We present a theoretical framework to better understand this approach, called zero-shot prediction. We identify the target quantities that zero-shot prediction aims to learn, or learns in passing, and the key conditional independence relationships that enable its generalization ability.

Comment: The paper presents a theoretical framework for zero-shot prediction, contributing to foundational research in representation learning.

Relevance: 9 Novelty: 8

14. Transformers Don't In-Context Learn Least Squares Regression

ArXiv ID: 2507.09440

Authors: Joshua Hill, Benjamin Eyre, Elliot Creager

Abstract: In-context learning (ICL) has emerged as a powerful capability of large pretrained transformers, enabling them to solve new tasks implicit in example input-output pairs without any gradient updates. Despite its practical success, the mechanisms underlying ICL remain largely mysterious. In this work we study synthetic linear regression to probe how transformers implement learning at inference time. Previous works have demonstrated that transformers match the performance of learning rules such as Ordinary Least Squares (OLS) regression or gradient descent and have suggested ICL is facilitated in transformers through the learned implementation of one of these techniques. In this work, we demonstrate through a suite of out-of-distribution generalization experiments that transformers trained for ICL fail to generalize after shifts in the prompt distribution, a behaviour that is inconsistent with the notion of transformers implementing algorithms such as OLS. Finally, we highlight the role of the pretraining corpus in shaping ICL behaviour through a spectral analysis of the learned representations in the residual stream. Inputs from the same distribution as the training data produce representations with a unique spectral signature: inputs from this distribution tend to have the same top two singular vectors. This spectral signature is not shared by out-of-distribution inputs, and a metric characterizing the presence of this signature is highly correlated with low loss.

Comment: The paper provides theoretical insights into the behavior of transformers in in-context learning, which aligns with the interest in understanding LLM behavior and interpretability.

Relevance: 9 Novelty: 7

15. DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models

ArXiv ID: 2507.09955

Authors: Luolin Xiong, Haofen Wang, Xi Chen, Lu Sheng, Yun Xiong, Jingping Liu, Yanghua Xiao, Huajun Chen, Qing-Long Han, Yang Tang

Abstract: DeepSeek, a Chinese Artificial Intelligence (AI) startup, has released their V3 and R1 series models, which attracted global attention due to their low cost, high performance, and open-source advantages. This paper begins by reviewing the evolution of large AI models focusing on paradigm shifts, the mainstream Large Language Model (LLM) paradigm, and the DeepSeek paradigm. Subsequently, the paper highlights novel algorithms introduced by DeepSeek, including Multi-head Latent Attention (MLA), Mixture-of-Experts (MoE), Multi-Token Prediction (MTP), and Group Relative Policy Optimization (GRPO). The paper then explores DeepSeek engineering breakthroughs in LLM scaling, training, inference, and system-level optimization architecture. Moreover, the impact of DeepSeek models on the competitive AI landscape is analyzed, comparing them to mainstream LLMs across various fields. Finally, the paper reflects on the insights gained from DeepSeek innovations and discusses future trends in the technical and engineering development of large AI models, particularly in data, training, and reasoning.

Comment: The paper reviews the evolution of large AI models and introduces novel algorithms like Mixture-of-Experts, which aligns with the model architecture criterion.

Relevance: 9 Novelty: 7

16. QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models

ArXiv ID: 2507.09514

Authors: Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu

Abstract: State space models (SSMs) reduce the quadratic complexity of transformers by leveraging linear recurrence. Recently, VMamba has emerged as a strong SSM-based vision backbone, yet remains bottlenecked by spatial redundancy in its four-directional scan. We propose QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Our method improves throughput without retraining. On ImageNet-1K, QuarterMap achieves up to 11% speedup on VMamba with less than 0.9% accuracy drop, and yields similar gains on ADE20K segmentation. Beyond VMamba, we validate QuarterMap on MedMamba, a domain-specific model that shares the same four-directional scanning structure, where it consistently improves throughput while preserving accuracy across multiple medical imaging tasks. Compared to token merging methods like ToMe, QuarterMap is tailored for SSMs and avoids costly merge-unmerge operations. Our method offers a plug-and-play tool for deployment-time efficiency without compromising transferability.

Comment: The paper focuses on model compression through post-training token pruning, which aligns with the model compression criteria.

Relevance: 9 Novelty: 7

17. Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces

ArXiv ID: 2507.09709

Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi

Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. \baturay{However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors$\unicode{x2013}$even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.

Comment: The paper investigates the latent space geometry of LLMs, which is relevant to understanding LLM behavior and representation learning.

Relevance: 9 Novelty: 7

18. (Almost) Free Modality Stitching of Foundation Models

ArXiv ID: 2507.10015

Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto

Abstract: Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching process is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.

Comment: The paper proposes a novel method for aligning multi-modal models using hypernetworks, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

19. Meta-autoencoders: An approach to discovery and representation of relationships between dynamically evolving classes

ArXiv ID: 2507.09362

Authors: Assaf Marron, Smadar Szekely, Irun Cohen, David Harel

Abstract: An autoencoder (AE) is a neural network that, using self-supervised training, learns a succinct parameterized representation, and a corresponding encoding and decoding process, for all instances in a given class. Here, we introduce the concept of a meta-autoencoder (MAE): an AE for a collection of autoencoders. Given a family of classes that differ from each other by the values of some parameters, and a trained AE for each class, an MAE for the family is a neural net that has learned a compact representation and associated encoder and decoder for the class-specific AEs. One application of this general concept is in research and modeling of natural evolution -- capturing the defining and the distinguishing properties across multiple species that are dynamically evolving from each other and from common ancestors. In this interim report we provide a constructive definition of MAEs, initial examples, and the motivating research directions in machine learning and biology.

Comment: The paper introduces the concept of meta-autoencoders, which is relevant to model architecture and representation learning.

Relevance: 8 Novelty: 8

20. Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts

ArXiv ID: 2507.09754

Authors: Aakash Tripathi, Ian E. Nielsen, Muhammad Umer, Ravi P. Ramachandran, Ghulam Rasool

Abstract: Transcription Factor Binding Site (TFBS) prediction is crucial for understanding gene regulation and various biological processes. This study introduces a novel Mixture of Experts (MoE) approach for TFBS prediction, integrating multiple pre-trained Convolutional Neural Network (CNN) models, each specializing in different TFBS patterns. We evaluate the performance of our MoE model against individual expert models on both in-distribution and out-of-distribution (OOD) datasets, using six randomly selected transcription factors (TFs) for OOD testing. Our results demonstrate that the MoE model achieves competitive or superior performance across diverse TF binding sites, particularly excelling in OOD scenarios. The Analysis of Variance (ANOVA) statistical test confirms the significance of these performance differences. Additionally, we introduce ShiftSmooth, a novel attribution mapping technique that provides more robust model interpretability by considering small shifts in input sequences. Through comprehensive explainability analysis, we show that ShiftSmooth offers superior attribution for motif discovery and localization compared to traditional Vanilla Gradient methods. Our work presents an efficient, generalizable, and interpretable solution for TFBS prediction, potentially enabling new discoveries in genome biology and advancing our understanding of transcriptional regulation.

Comment: The paper introduces a novel Mixture of Experts approach for TFBS prediction, which is relevant to model architecture innovations.