Personalized Daily ArXiv Papers 2025-07-15
| [gpt-4o] | Prompt | Completion | Total |
|---|---|---|---|
| Token | 54472 | 6802 | 61274 |
| Cost | $0.14 | $0.07 | $0.2 |
Total arXiv papers: 866
Total scanned papers: 532
Total relevant papers: 36
Table of contents with paper titles:
-
Lizard: An Efficient Linearization Framework for Large Language Models Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
-
Continuous-Time Signal Decomposition: An Implicit Neural Generalization of PCA and ICA Authors: Shayan K. Azmoodeh, Krishna Subramani, Paris Smaragdis
-
Quantize-then-Rectify: Efficient VQ-VAE Training Authors: Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu
-
Compression Method for Deep Diagonal State Space Model Based on $H^2$ Optimal Reduction Authors: Hiroki Sakamoto, Kazuhiro Sato
-
From Sequence to Structure: Uncovering Substructure Reasoning in Transformers Authors: Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang
-
BitParticle: Partializing Sparse Dual-Factors to Build Quasi-Synchronizing MAC Arrays for Energy-efficient DNNs Authors: Feilong Qiaoyuan, Jihe Wang, Zhiyu Sun, Linying Wu, Yuanhua Xiao, Danghui Wang
-
Algorithm Development in Neural Networks: Insights from the Streaming Parity Task Authors: Loek van Rossem, Andrew M. Saxe
-
Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models Authors: Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov
-
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
-
Memorization Sinks: Isolating Memorization during LLM Training Authors: Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan
-
On Information Geometry and Iterative Optimization in Model Compression: Operator Factorization Authors: Zakhar Shumaylov, Vasileios Tsiaras, Yannis Stylianou
-
A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention Authors: Nandan Kumar Jha, Brandon Reagen
-
A Generalization Theory for Zero-Shot Prediction Authors: Ronak Mehta, Zaid Harchaoui
-
Transformers Don't In-Context Learn Least Squares Regression Authors: Joshua Hill, Benjamin Eyre, Elliot Creager
-
DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models Authors: Luolin Xiong, Haofen Wang, Xi Chen, Lu Sheng, Yun Xiong, Jingping Liu, Yanghua Xiao, Huajun Chen, Qing-Long Han, Yang Tang
-
QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models Authors: Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu
-
Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi
-
(Almost) Free Modality Stitching of Foundation Models Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto
-
Meta-autoencoders: An approach to discovery and representation of relationships between dynamically evolving classes Authors: Assaf Marron, Smadar Szekely, Irun Cohen, David Harel
-
Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts Authors: Aakash Tripathi, Ian E. Nielsen, Muhammad Umer, Ravi P. Ramachandran, Ghulam Rasool
-
Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving Authors: Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park
-
Graph World Model Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You
-
A Mixture of Linear Corrections Generates Secure Code Authors: Weichen Yu, Ravi Mangal, Terry Zhuo, Matt Fredrikson, Corina S. Pasareanu
-
Disentangling Neural Disjunctive Normal Form Models Authors: Kexin Gu Baugh, Vincent Perreault, Matthew Baugh, Luke Dickens, Katsumi Inoue, Alessandra Russo
-
BioAnalyst: A Foundation Model for Biodiversity Authors: Athanasios Trantas, Martino Mensio, Stylianos Stasinos, Sebastian Gribincea, Taimur Khan, Damian Podareanu, Aliene van der Veen
-
Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning Authors: Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye
-
Some Super-approximation Rates of ReLU Neural Networks for Korobov Functions Authors: Yuwen Li, Guozhi Zhang
-
Multiple Choice Learning of Low Rank Adapters for Language Modeling Authors: Victor Letzelter, Hugo Malard, Mathieu Fontaine, Ga\"el Richard, Slim Essid, Andrei Bursuc, Patrick P\'erez
-
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition Authors: Qinyuan Ye, Robin Jia, Xiang Ren
-
Text-Driven Causal Representation Learning for Source-Free Domain Generalization Authors: Lihua Zhou, Mao Ye, Nianxin Li, Shuaifeng Li, Jinlin Wu, Xiatian Zhu, Lei Deng, Hongbin Liu, Jiebo Luo, Zhen Lei
-
Zero-Shot Neural Architecture Search with Weighted Response Correlation Authors: Kun Jing, Luoyu Chen, Jungang Xu, Jianwei Tai, Yiyu Wang, Shuaimin Li
-
Scaling Laws for Optimal Data Mixtures Authors: Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin
-
Think Clearly: Improving Reasoning via Redundant Token Pruning Authors: Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati
-
MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora Authors: Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Trung Le, Dragan Ga\v{s}evi\'c, Yuan-Fang Li, Thanh-Toan Do
-
A Pre-training Framework for Relational Data with Information-theoretic Principles Authors: Quang Truong, Zhikai Chen, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang
-
Cameras as Relative Positional Encoding Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa
1. Lizard: An Efficient Linearization Framework for Large Language Models
ArXiv ID: 2507.09025
Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
Comment: The paper proposes a linearization framework for LLMs, addressing memory and computational bottlenecks, which is relevant to model compression and architecture.
Relevance: 9 Novelty: 8
2. Continuous-Time Signal Decomposition: An Implicit Neural Generalization of PCA and ICA
ArXiv ID: 2507.09091
Authors: Shayan K. Azmoodeh, Krishna Subramani, Paris Smaragdis
Abstract: We generalize the low-rank decomposition problem, such as principal and independent component analysis (PCA, ICA) for continuous-time vector-valued signals and provide a model-agnostic implicit neural signal representation framework to learn numerical approximations to solve the problem. Modeling signals as continuous-time stochastic processes, we unify the approaches to both the PCA and ICA problems in the continuous setting through a contrast function term in the network loss, enforcing the desired statistical properties of the source signals (decorrelation, independence) learned in the decomposition. This extension to a continuous domain allows the application of such decompositions to point clouds and irregularly sampled signals where standard techniques are not applicable.
Comment: The paper generalizes PCA and ICA for continuous-time signals, which is relevant to representation learning and introduces a novel theoretical framework.
Relevance: 9 Novelty: 8
3. Quantize-then-Rectify: Efficient VQ-VAE Training
ArXiv ID: 2507.10547
Authors: Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Abstract: Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present \textbf{Quantize-then-Rectify (ReVQ)}, a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating \textbf{channel multi-group quantization} to enlarge codebook capacity and a \textbf{post rectifier} to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
Comment: The paper presents a novel framework for efficient VQ-VAE training, which is relevant to model compression through quantization and efficiency improvements.
Relevance: 9 Novelty: 8
4. Compression Method for Deep Diagonal State Space Model Based on $H^2$ Optimal Reduction
ArXiv ID: 2507.10078
Authors: Hiroki Sakamoto, Kazuhiro Sato
Abstract: Deep learning models incorporating linear SSMs have gained attention for capturing long-range dependencies in sequential data. However, their large parameter sizes pose challenges for deployment on resource-constrained devices. In this study, we propose an efficient parameter reduction method for these models by applying $H^{2}$ model order reduction techniques from control theory to their linear SSM components. In experiments, the LRA benchmark results show that the model compression based on our proposed method outperforms an existing method using the Balanced Truncation, while successfully reducing the number of parameters in the SSMs to $1/32$ without sacrificing the performance of the original models.
Comment: The paper proposes a novel compression method for deep diagonal state space models using $H^2$ optimal reduction, which aligns with the model compression criterion.
Relevance: 9 Novelty: 8
5. From Sequence to Structure: Uncovering Substructure Reasoning in Transformers
ArXiv ID: 2507.10435
Authors: Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang
Abstract: Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.
Comment: The paper explores how transformers can perform substructure extraction from graph data, which aligns with the model architecture criterion.
Relevance: 9 Novelty: 8
6. BitParticle: Partializing Sparse Dual-Factors to Build Quasi-Synchronizing MAC Arrays for Energy-efficient DNNs
ArXiv ID: 2507.09780
Authors: Feilong Qiaoyuan, Jihe Wang, Zhiyu Sun, Linying Wu, Yuanhua Xiao, Danghui Wang
Abstract: Bit-level sparsity in quantized deep neural networks (DNNs) offers significant potential for optimizing Multiply-Accumulate (MAC) operations. However, two key challenges still limit its practical exploitation. First, conventional bit-serial approaches cannot simultaneously leverage the sparsity of both factors, leading to a complete waste of one factor' s sparsity. Methods designed to exploit dual-factor sparsity are still in the early stages of exploration, facing the challenge of partial product explosion. Second, the fluctuation of bit-level sparsity leads to variable cycle counts for MAC operations. Existing synchronous scheduling schemes that are suitable for dual-factor sparsity exhibit poor flexibility and still result in significant underutilization of MAC units. To address the first challenge, this study proposes a MAC unit that leverages dual-factor sparsity through the emerging particlization-based approach. The proposed design addresses the issue of partial product explosion through simple control logic, resulting in a more area- and energy-efficient MAC unit. In addition, by discarding less significant intermediate results, the design allows for further hardware simplification at the cost of minor accuracy loss. To address the second challenge, a quasi-synchronous scheme is introduced that adds cycle-level elasticity to the MAC array, reducing pipeline stalls and thereby improving MAC unit utilization. Evaluation results show that the exact version of the proposed MAC array architecture achieves a 29.2% improvement in area efficiency compared to the state-of-the-art bit-sparsity-driven architecture, while maintaining comparable energy efficiency. The approximate variant further improves energy efficiency by 7.5%, compared to the exact version. Index-Terms: DNN acceleration, Bit-level sparsity, MAC unit
Comment: The paper proposes a novel MAC unit design leveraging bit-level sparsity, which aligns with the model compression criterion.
Relevance: 9 Novelty: 8
7. Algorithm Development in Neural Networks: Insights from the Streaming Parity Task
ArXiv ID: 2507.09897
Authors: Loek van Rossem, Andrew M. Saxe
Abstract: Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks (RNNs) trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.
Comment: The paper provides insights into the learning dynamics of RNNs, which is relevant to representation learning and understanding neural network behavior.
Relevance: 9 Novelty: 8
8. Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models
ArXiv ID: 2507.09185
Authors: Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov
Abstract: Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
Comment: The paper introduces a method for pruning neurons in LLMs to enhance generalization, which is relevant to model compression and LLM behavior.
Relevance: 9 Novelty: 8
9. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
ArXiv ID: 2507.10524
Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
Comment: The paper introduces Mixture-of-Recursions (MoR), a novel framework combining parameter sharing and adaptive computation, which is highly relevant to model architecture innovations.
Relevance: 9 Novelty: 8
10. Memorization Sinks: Isolating Memorization during LLM Training
ArXiv ID: 2507.09937
Authors: Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan
Abstract: Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. A popular mitigation strategy is to remove memorized information from specific neurons post-hoc. However, such approaches have shown limited success so far. In a controlled setting, we show that the memorization of natural sequences (those that resemble linguistically plausible text) become mechanistically entangled with general language abilities, thereby becoming challenging to remove post-hoc. In this work, we put forward a new paradigm of MemSinks that promotes isolation of memorization by design. We leverage a sequence identifier that activates a unique set of memorization neurons for each sequence across repetitions. By analyzing the dynamics of learning and forgetting, we argue that MemSinks facilitates isolation of memorized content, making it easier to remove without compromising general language capabilities. We implement MemSinks at the billion-parameter and billion-token scale, and observe both effective isolation and strong generalization. To our knowledge, this is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable. We open-source our code at http://github.com/grghosal/MemSinks.
Comment: The paper introduces MemSinks, a new paradigm for isolating memorization in LLMs, which aligns with foundational research in LLM behavior and interpretability.
Relevance: 9 Novelty: 8
11. On Information Geometry and Iterative Optimization in Model Compression: Operator Factorization
ArXiv ID: 2507.09428
Authors: Zakhar Shumaylov, Vasileios Tsiaras, Yannis Stylianou
Abstract: The ever-increasing parameter counts of deep learning models necessitate effective compression techniques for deployment on resource-constrained devices. This paper explores the application of information geometry, the study of density-induced metrics on parameter spaces, to analyze existing methods within the space of model compression, primarily focusing on operator factorization. Adopting this perspective highlights the core challenge: defining an optimal low-compute submanifold (or subset) and projecting onto it. We argue that many successful model compression approaches can be understood as implicitly approximating information divergences for this projection. We highlight that when compressing a pre-trained model, using information divergences is paramount for achieving improved zero-shot accuracy, yet this may no longer be the case when the model is fine-tuned. In such scenarios, trainability of bottlenecked models turns out to be far more important for achieving high compression ratios with minimal performance degradation, necessitating adoption of iterative methods. In this context, we prove convergence of iterative singular value thresholding for training neural networks subject to a soft rank constraint. To further illustrate the utility of this perspective, we showcase how simple modifications to existing methods through softer rank reduction result in improved performance under fixed compression rates.
Comment: The paper explores information geometry for model compression, focusing on operator factorization and iterative optimization, which is relevant to model compression techniques.
Relevance: 9 Novelty: 8
12. A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention
ArXiv ID: 2507.09394
Authors: Nandan Kumar Jha, Brandon Reagen
Abstract: In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer's internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the $W_{Q}W_{K}^\top$ gram matrix throughout training, comparing three variants: the standard multi-head attention (MHA) baseline, MLA-PreRoPE with rotary applied before compression, and MLA-Decoupled, which shares a single rotary sub-vector across all heads. Our random matrix analysis reveals \textbf{three key findings:} \textbf{ i)} capacity bottlenecks emerge locally: both MHA and MLA-PreRoPE exhibit sharp, early spikes in specific layers that persist and propagate, disrupting the balance between bulk and outlier directions; \textbf{ ii)} these spikes coincide with rank collapse, concentrating the model's expressivity into narrow subspaces; \textbf{ iii)} only the decoupled variant prevents this cascade, maintaining broad spectral support and suppressing outlier formation across layers. These results underscore that \emph{how} rotary embeddings are applied is just as critical as \emph{where} compression occurs. Sharing rotary components across heads mitigates spectral fragmentation and preserves representational capacity.
Comment: The paper provides a theoretical analysis of multi-head latent attention in transformers using random matrix theory, which is relevant to model architecture and representation learning.
Relevance: 9 Novelty: 8
13. A Generalization Theory for Zero-Shot Prediction
ArXiv ID: 2507.09128
Authors: Ronak Mehta, Zaid Harchaoui
Abstract: A modern paradigm for generalization in machine learning and AI consists of pre-training a task-agnostic foundation model, generally obtained using self-supervised and multimodal contrastive learning. The resulting representations can be used for prediction on a downstream task for which no labeled data is available. We present a theoretical framework to better understand this approach, called zero-shot prediction. We identify the target quantities that zero-shot prediction aims to learn, or learns in passing, and the key conditional independence relationships that enable its generalization ability.
Comment: The paper presents a theoretical framework for zero-shot prediction, contributing to foundational research in representation learning.
Relevance: 9 Novelty: 8
14. Transformers Don't In-Context Learn Least Squares Regression
ArXiv ID: 2507.09440
Authors: Joshua Hill, Benjamin Eyre, Elliot Creager
Abstract: In-context learning (ICL) has emerged as a powerful capability of large pretrained transformers, enabling them to solve new tasks implicit in example input-output pairs without any gradient updates. Despite its practical success, the mechanisms underlying ICL remain largely mysterious. In this work we study synthetic linear regression to probe how transformers implement learning at inference time. Previous works have demonstrated that transformers match the performance of learning rules such as Ordinary Least Squares (OLS) regression or gradient descent and have suggested ICL is facilitated in transformers through the learned implementation of one of these techniques. In this work, we demonstrate through a suite of out-of-distribution generalization experiments that transformers trained for ICL fail to generalize after shifts in the prompt distribution, a behaviour that is inconsistent with the notion of transformers implementing algorithms such as OLS. Finally, we highlight the role of the pretraining corpus in shaping ICL behaviour through a spectral analysis of the learned representations in the residual stream. Inputs from the same distribution as the training data produce representations with a unique spectral signature: inputs from this distribution tend to have the same top two singular vectors. This spectral signature is not shared by out-of-distribution inputs, and a metric characterizing the presence of this signature is highly correlated with low loss.
Comment: The paper provides theoretical insights into the behavior of transformers in in-context learning, which aligns with the interest in understanding LLM behavior and interpretability.
Relevance: 9 Novelty: 7
15. DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models
ArXiv ID: 2507.09955
Authors: Luolin Xiong, Haofen Wang, Xi Chen, Lu Sheng, Yun Xiong, Jingping Liu, Yanghua Xiao, Huajun Chen, Qing-Long Han, Yang Tang
Abstract: DeepSeek, a Chinese Artificial Intelligence (AI) startup, has released their V3 and R1 series models, which attracted global attention due to their low cost, high performance, and open-source advantages. This paper begins by reviewing the evolution of large AI models focusing on paradigm shifts, the mainstream Large Language Model (LLM) paradigm, and the DeepSeek paradigm. Subsequently, the paper highlights novel algorithms introduced by DeepSeek, including Multi-head Latent Attention (MLA), Mixture-of-Experts (MoE), Multi-Token Prediction (MTP), and Group Relative Policy Optimization (GRPO). The paper then explores DeepSeek engineering breakthroughs in LLM scaling, training, inference, and system-level optimization architecture. Moreover, the impact of DeepSeek models on the competitive AI landscape is analyzed, comparing them to mainstream LLMs across various fields. Finally, the paper reflects on the insights gained from DeepSeek innovations and discusses future trends in the technical and engineering development of large AI models, particularly in data, training, and reasoning.
Comment: The paper reviews the evolution of large AI models and introduces novel algorithms like Mixture-of-Experts, which aligns with the model architecture criterion.
Relevance: 9 Novelty: 7
16. QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models
ArXiv ID: 2507.09514
Authors: Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu
Abstract: State space models (SSMs) reduce the quadratic complexity of transformers by leveraging linear recurrence. Recently, VMamba has emerged as a strong SSM-based vision backbone, yet remains bottlenecked by spatial redundancy in its four-directional scan. We propose QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Our method improves throughput without retraining. On ImageNet-1K, QuarterMap achieves up to 11% speedup on VMamba with less than 0.9% accuracy drop, and yields similar gains on ADE20K segmentation. Beyond VMamba, we validate QuarterMap on MedMamba, a domain-specific model that shares the same four-directional scanning structure, where it consistently improves throughput while preserving accuracy across multiple medical imaging tasks. Compared to token merging methods like ToMe, QuarterMap is tailored for SSMs and avoids costly merge-unmerge operations. Our method offers a plug-and-play tool for deployment-time efficiency without compromising transferability.
Comment: The paper focuses on model compression through post-training token pruning, which aligns with the model compression criteria.
Relevance: 9 Novelty: 7
17. Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
ArXiv ID: 2507.09709
Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi
Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. \baturay{However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors$\unicode{x2013}$even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.
Comment: The paper investigates the latent space geometry of LLMs, which is relevant to understanding LLM behavior and representation learning.
Relevance: 9 Novelty: 7
18. (Almost) Free Modality Stitching of Foundation Models
ArXiv ID: 2507.10015
Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto
Abstract: Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching process is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
Comment: The paper proposes a novel method for aligning multi-modal models using hypernetworks, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 8
19. Meta-autoencoders: An approach to discovery and representation of relationships between dynamically evolving classes
ArXiv ID: 2507.09362
Authors: Assaf Marron, Smadar Szekely, Irun Cohen, David Harel
Abstract: An autoencoder (AE) is a neural network that, using self-supervised training, learns a succinct parameterized representation, and a corresponding encoding and decoding process, for all instances in a given class. Here, we introduce the concept of a meta-autoencoder (MAE): an AE for a collection of autoencoders. Given a family of classes that differ from each other by the values of some parameters, and a trained AE for each class, an MAE for the family is a neural net that has learned a compact representation and associated encoder and decoder for the class-specific AEs. One application of this general concept is in research and modeling of natural evolution -- capturing the defining and the distinguishing properties across multiple species that are dynamically evolving from each other and from common ancestors. In this interim report we provide a constructive definition of MAEs, initial examples, and the motivating research directions in machine learning and biology.
Comment: The paper introduces the concept of meta-autoencoders, which is relevant to model architecture and representation learning.
Relevance: 8 Novelty: 8
20. Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts
ArXiv ID: 2507.09754
Authors: Aakash Tripathi, Ian E. Nielsen, Muhammad Umer, Ravi P. Ramachandran, Ghulam Rasool
Abstract: Transcription Factor Binding Site (TFBS) prediction is crucial for understanding gene regulation and various biological processes. This study introduces a novel Mixture of Experts (MoE) approach for TFBS prediction, integrating multiple pre-trained Convolutional Neural Network (CNN) models, each specializing in different TFBS patterns. We evaluate the performance of our MoE model against individual expert models on both in-distribution and out-of-distribution (OOD) datasets, using six randomly selected transcription factors (TFs) for OOD testing. Our results demonstrate that the MoE model achieves competitive or superior performance across diverse TF binding sites, particularly excelling in OOD scenarios. The Analysis of Variance (ANOVA) statistical test confirms the significance of these performance differences. Additionally, we introduce ShiftSmooth, a novel attribution mapping technique that provides more robust model interpretability by considering small shifts in input sequences. Through comprehensive explainability analysis, we show that ShiftSmooth offers superior attribution for motif discovery and localization compared to traditional Vanilla Gradient methods. Our work presents an efficient, generalizable, and interpretable solution for TFBS prediction, potentially enabling new discoveries in genome biology and advancing our understanding of transcriptional regulation.
Comment: The paper introduces a novel Mixture of Experts approach for TFBS prediction, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
21. Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving
ArXiv ID: 2507.10178
Authors: Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park
Abstract: Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context inferencing. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs), which we refer to as post-transformers. This shift presents a key challenge: building a serving system that efficiently supports both transformer and post-transformer LLMs within a unified framework. To address this challenge, we analyze the performance characteristics of transformer and post-transformer LLMs. Despite their algorithmic differences, both are fundamentally limited by memory bandwidth under batched inference due to attention in transformers and state updates in post-transformers. Further analyses suggest two additional insights: (1) state update operations, unlike attention, incur high hardware cost, making per-bank PIM acceleration inefficient, and (2) different low-precision arithmetic methods offer varying accuracy-area tradeoffs, while we identify Microsoft's MX as the Pareto-optimal choice. Building on these insights, we design Pimba as an array of State-update Processing Units (SPUs), each shared between two banks to enable interleaved access to PIM. Each SPU includes a State-update Processing Engine (SPE) that comprises element-wise multipliers and adders using MX-based quantized arithmetic, enabling efficient execution of state update and attention operations. Our evaluation shows that, compared to LLM-optimized GPU and GPU+PIM systems, Pimba achieves up to 3.2x and 2.1x higher token generation throughput, respectively.
Comment: The paper discusses a novel architecture for LLM serving, focusing on post-transformer models and memory efficiency, which aligns with model architecture and compression topics.
Relevance: 8 Novelty: 7
22. Graph World Model
ArXiv ID: 2507.10539
Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You
Abstract: World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. Our code for GWM is released at https://github.com/ulab-uiuc/GWM.
Comment: The paper introduces a graph world model that supports multi-modal information, which aligns with model architecture innovations.
Relevance: 8 Novelty: 7
23. A Mixture of Linear Corrections Generates Secure Code
ArXiv ID: 2507.09508
Authors: Weichen Yu, Ravi Mangal, Terry Zhuo, Matt Fredrikson, Corina S. Pasareanu
Abstract: Large language models (LLMs) have become proficient at sophisticated code-generation tasks, yet remain ineffective at reliably detecting or avoiding code vulnerabilities. Does this deficiency stem from insufficient learning about code vulnerabilities, or is it merely a result of ineffective prompting? Using representation engineering techniques, we investigate whether LLMs internally encode the concepts necessary to identify code vulnerabilities. We find that current LLMs encode precise internal representations that distinguish vulnerable from secure code--achieving greater accuracy than standard prompting approaches. Leveraging these vulnerability-sensitive representations, we develop an inference-time steering technique that subtly modulates the model's token-generation probabilities through a mixture of corrections (MoC). Our method effectively guides LLMs to produce less vulnerable code without compromising functionality, demonstrating a practical approach to controlled vulnerability management in generated code. Notably, MoC enhances the security ratio of Qwen2.5-Coder-7B by 8.9\%, while simultaneously improving functionality on HumanEval pass@1 by 2.1\%.
Comment: The paper discusses a mixture of linear corrections for secure code generation, which involves representation learning and aligns with foundational research.
Relevance: 8 Novelty: 7
24. Disentangling Neural Disjunctive Normal Form Models
ArXiv ID: 2507.10546
Authors: Kexin Gu Baugh, Vincent Perreault, Matthew Baugh, Luke Dickens, Katsumi Inoue, Alessandra Russo
Abstract: Neural Disjunctive Normal Form (DNF) based models are powerful and interpretable approaches to neuro-symbolic learning and have shown promising results in classification and reinforcement learning settings without prior knowledge of the tasks. However, their performance is degraded by the thresholding of the post-training symbolic translation process. We show here that part of the performance degradation during translation is due to its failure to disentangle the learned knowledge represented in the form of the networks' weights. We address this issue by proposing a new disentanglement method; by splitting nodes that encode nested rules into smaller independent nodes, we are able to better preserve the models' performance. Through experiments on binary, multiclass, and multilabel classification tasks (including those requiring predicate invention), we demonstrate that our disentanglement method provides compact and interpretable logical representations for the neural DNF-based models, with performance closer to that of their pre-translation counterparts. Our code is available at https://github.com/kittykg/disentangling-ndnf-classification.
Comment: The paper focuses on a new disentanglement method for neural DNF models, which aligns with representation learning by addressing how networks encode information.
Relevance: 8 Novelty: 7
25. BioAnalyst: A Foundation Model for Biodiversity
ArXiv ID: 2507.09080
Authors: Athanasios Trantas, Martino Mensio, Stylianos Stasinos, Sebastian Gribincea, Taimur Khan, Damian Podareanu, Aliene van der Veen
Abstract: The accelerating loss of biodiversity presents critical challenges for ecological research and conservation strategies. The preservation of biodiversity is paramount for maintaining ecological balance and ensuring the sustainability of ecosystems. However, biodiversity faces numerous threats, including habitat loss, climate change, and the proliferation of invasive species. Addressing these and other ecology-related challenges, both at local and global scales, requires comprehensive monitoring, predictive and conservation planning capabilities. Artificial Intelligence (AI) Foundation Models (FMs) have gained significant momentum in numerous scientific domains by leveraging vast datasets to learn general-purpose representations adaptable to various downstream tasks. This paradigm holds immense promise for biodiversity conservation. In response, we introduce BioAnalyst, the first Foundation Model tailored for biodiversity analysis and conservation planning. BioAnalyst employs a transformer-based architecture, pre-trained on extensive multi-modal datasets encompassing species occurrence records, remote sensing indicators, climate and environmental variables. BioAnalyst is designed for adaptability, allowing for fine-tuning of a range of downstream tasks, such as species distribution modelling, habitat suitability assessments, invasive species detection, and population trend forecasting. We evaluate the model's performance on two downstream use cases, demonstrating its generalisability compared to existing methods, particularly in data-scarce scenarios for two distinct use-cases, establishing a new accuracy baseline for ecological forecasting. By openly releasing BioAnalyst and its fine-tuning workflows to the scientific community, we aim to foster collaborative efforts in biodiversity modelling and advance AI-driven solutions to pressing ecological challenges.
Comment: The paper introduces BioAnalyst, a foundation model for biodiversity, which is relevant to AI for Science through foundational research in ecological modeling.
Relevance: 8 Novelty: 7
26. Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning
ArXiv ID: 2507.10085
Authors: Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye
Abstract: Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.
Comment: The paper introduces Critical Representation Fine-Tuning (CRFT) for enhancing reasoning tasks, which is relevant to representation learning through representation-level optimization.
Relevance: 8 Novelty: 7
27. Some Super-approximation Rates of ReLU Neural Networks for Korobov Functions
ArXiv ID: 2507.10345
Authors: Yuwen Li, Guozhi Zhang
Abstract: This paper examines the $L_p$ and $W^1_p$ norm approximation errors of ReLU neural networks for Korobov functions. In terms of network width and depth, we derive nearly optimal super-approximation error bounds of order $2m$ in the $L_p$ norm and order $2m-2$ in the $W^1_p$ norm, for target functions with $L_p$ mixed derivative of order $m$ in each direction. The analysis leverages sparse grid finite elements and the bit extraction technique. Our results improve upon classical lowest order $L_\infty$ and $H^1$ norm error bounds and demonstrate that the expressivity of neural networks is largely unaffected by the curse of dimensionality.
Comment: The paper examines approximation rates of ReLU neural networks, which is relevant to model architecture analysis through theoretical insights into neural network expressivity.
Relevance: 8 Novelty: 7
28. Multiple Choice Learning of Low Rank Adapters for Language Modeling
ArXiv ID: 2507.10419
Authors: Victor Letzelter, Hugo Malard, Mathieu Fontaine, Ga\"el Richard, Slim Essid, Andrei Bursuc, Patrick P\'erez
Abstract: We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All (WTA) loss to efficiently handle ambiguity through Low-Rank Adaptation (LoRA). We provide a theoretical interpretation of applying Multiple Choice Learning to Language Modeling, assuming the data is generated from a mixture of distributions. To illustrate the proposed approach, we use data sampled from mixtures of Markov chains. We then demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
Comment: The paper proposes a method for language modeling using low-rank adapters, which aligns with the model compression criterion.
Relevance: 8 Novelty: 7
29. Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
ArXiv ID: 2507.09875
Authors: Qinyuan Ye, Robin Jia, Xiang Ren
Abstract: Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their notable performance and present three key findings. First, we uncover a function induction mechanism that explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
Comment: The paper studies task generalization in LLMs through interpretability techniques, which aligns with the large language models criterion.
Relevance: 8 Novelty: 7
30. Text-Driven Causal Representation Learning for Source-Free Domain Generalization
ArXiv ID: 2507.09961
Authors: Lihua Zhou, Mao Ye, Nianxin Li, Shuaifeng Li, Jinlin Wu, Xiatian Zhu, Lei Deng, Hongbin Liu, Jiebo Luo, Zhen Lei
Abstract: Deep learning often struggles when training and test data distributions differ. Traditional domain generalization (DG) tackles this by including data from multiple source domains, which is impractical due to expensive data collection and annotation. Recent vision-language models like CLIP enable source-free domain generalization (SFDG) by using text prompts to simulate visual representations, reducing data demands. However, existing SFDG methods struggle with domain-specific confounders, limiting their generalization capabilities. To address this issue, we propose TDCRL (\textbf{T}ext-\textbf{D}riven \textbf{C}ausal \textbf{R}epresentation \textbf{L}earning), the first method to integrate causal inference into the SFDG setting. TDCRL operates in two steps: first, it employs data augmentation to generate style word vectors, combining them with class information to generate text embeddings to simulate visual representations; second, it trains a causal intervention network with a confounder dictionary to extract domain-invariant features. Grounded in causal learning, our approach offers a clear and effective mechanism to achieve robust, domain-invariant features, ensuring robust generalization. Extensive experiments on PACS, VLCS, OfficeHome, and DomainNet show state-of-the-art performance, proving TDCRL effectiveness in SFDG.
Comment: The paper introduces a novel method for causal representation learning, which is relevant to representation learning.
Relevance: 8 Novelty: 7
31. Zero-Shot Neural Architecture Search with Weighted Response Correlation
ArXiv ID: 2507.08841
Authors: Kun Jing, Luoyu Chen, Jungang Xu, Jianwei Tai, Yiyu Wang, Shuaimin Li
Abstract: Neural architecture search (NAS) is a promising approach for automatically designing neural network architectures. However, the architecture estimation of NAS is computationally expensive and time-consuming because of training multiple architectures from scratch. Although existing zero-shot NAS methods use training-free proxies to accelerate the architecture estimation, their effectiveness, stability, and generality are still lacking. We present a novel training-free estimation proxy called weighted response correlation (WRCor). WRCor utilizes correlation coefficient matrices of responses across different input samples to calculate the proxy scores of estimated architectures, which can measure their expressivity and generalizability. Experimental results on proxy evaluation demonstrate that WRCor and its voting proxies are more efficient estimation strategies than existing proxies. We also apply them with different search strategies in architecture search. Experimental results on architecture search show that our zero-shot NAS algorithm outperforms most existing NAS algorithms in different search spaces. Our NAS algorithm can discover an architecture with a 22.1% test error on the ImageNet-1k dataset within 4 GPU hours. All codes are publicly available at https://github.com/kunjing96/ZSNAS-WRCor.git.
Comment: The paper introduces a novel zero-shot NAS method with a new proxy for architecture estimation, which is relevant to model architecture and efficiency.
Relevance: 8 Novelty: 7
32. Scaling Laws for Optimal Data Mixtures
ArXiv ID: 2507.09404
Authors: Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin
Abstract: Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision models (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws allow to derive the optimal domain weights for any target domain under a given training budget ($N$,$D$), providing a principled alternative to costly trial-and-error methods.
Comment: The paper proposes a systematic method for determining optimal data mixtures using scaling laws, which is relevant to foundational research in large language models.
Relevance: 8 Novelty: 7
33. Think Clearly: Improving Reasoning via Redundant Token Pruning
ArXiv ID: 2507.08806
Authors: Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati
Abstract: Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, particularly incorrect answers exhibit greater attention sparsity. In this paper, we demonstrate that deliberately removing this redundancy in the reasoning process significantly improves performance through clear thinking, i.e., removing distraction. Specifically, we systematically identify reasoning redundancy by measuring token-level attention scores to a special end-of-thinking token, which is appended to an explicit instruction inserted to conclude each intermediate reasoning step. Furthermore, we propose structure-aware pruning that prioritizes removing tokens in low-contributing reasoning chunks over individual tokens. After evicting redundant tokens, we remove the injected end-of-thinking instruction, then resume the reasoning generation. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training involved. In particular, our method shows strong performance on challenging mathematical competition benchmarks such as AIME and AMC, where reasoning redundancy is more prevalent.
Comment: The paper introduces a method for improving reasoning in LLMs via token pruning, which is relevant to model compression and efficiency.
Relevance: 8 Novelty: 7
34. MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora
ArXiv ID: 2507.09924
Authors: Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Trung Le, Dragan Ga\v{s}evi\'c, Yuan-Fang Li, Thanh-Toan Do
Abstract: Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when significant number of OOD documents are detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.
Comment: The paper presents MixLoRA-DSI, a framework for generative retrieval using a mixture of Low-Rank Adaptation experts, which is relevant to model architecture and efficiency.
Relevance: 8 Novelty: 7
35. A Pre-training Framework for Relational Data with Information-theoretic Principles
ArXiv ID: 2507.09837
Authors: Quang Truong, Zhikai Chen, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang
Abstract: Relational databases underpin critical infrastructure across a wide range of domains, yet the design of generalizable pre-training strategies for learning from relational databases remains an open challenge due to task heterogeneity. Specifically, there exist infinitely many possible downstream tasks, as tasks are defined based on relational schema graphs, temporal dependencies, and SQL-defined label logics. An effective pre-training framework is desired to take these factors into account in order to obtain task-aware representations. By incorporating knowledge of the underlying distribution that drives label generation, downstream tasks can benefit from relevant side-channel information. To bridge this gap, we introduce Task Vector Estimation (TVE), a novel pre-training framework that constructs predictive supervisory signals via set-based aggregation over schema traversal graphs, explicitly modeling next-window relational dynamics. We formalize our approach through an information-theoretic lens, demonstrating that task-informed representations retain more relevant signals than those obtained without task priors. Extensive experiments on the RelBench benchmark show that TVE consistently outperforms traditional pre-training baselines. Our findings advocate for pre-training objectives that encode task heterogeneity and temporal structure as design principles for predictive modeling on relational databases.
Comment: The paper introduces a pre-training framework for relational data using information-theoretic principles, which is relevant to representation learning.
Relevance: 8 Novelty: 7
36. Cameras as Relative Positional Encoding
ArXiv ID: 2507.10496
Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa
Abstract: Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
Comment: The paper discusses techniques for conditioning transformers on camera geometry, which is relevant to model architecture innovations.
Relevance: 8 Novelty: 7
Paper Selection Prompt
System Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
User Prompt
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
Examples: Work referencing MoE centered on reinforcement learning.
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
Examples: Application-focused papers like using MoE to solve a problem in the real world.
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
Examples: Modifications on existing methods yielding significantly better results.
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Relevant Topics
Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords:
- Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
- Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
- Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.