Previous Day 2025-07-24
Monthly Overview 2025-07
Next Day 2025-07-28

Personalized Daily ArXiv Papers 2025-07-25

[gpt-4o] Prompt Completion Total
Token 28345 3259 31604
Cost $0.07 $0.03 $0.1

Total arXiv papers: 430

Total scanned papers: 250

Total relevant papers: 17

Table of contents with paper titles:

  1. AlphaGo Moment for Model Architecture Discovery Authors: Yixiu Liu, Yang Nan, Weixian Xu, Xiangkun Hu, Lyumanshan Ye, Zhen Qin, Pengfei Liu

  2. The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection Authors: Steven A. Frank

  3. Revisiting LLM Reasoning via Information Bottleneck Authors: Shiye Lei, Zhihao Cheng, Kai Jia, Dacheng Tao

  4. The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm Authors: Jiale Chen, Torsten Hoefler, Dan Alistarh

  5. Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method Authors: Qingcheng Zhu, Yangyang Ren, Linlin Yang, Mingbao Lin, Yanjing Li, Sheng Xu, Zichao Feng, Haodong Zhu, Yuguang Yang, Juan Zhang, Runqi Wang, Baochang Zhang

  6. Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden Weights Authors: Jun'ichi Takeuchia, Yoshinari Takeishia, Noboru Muratab, Kazushi Mimurac, Ka Long Keith Hod, Hiroshi Nagaoka

  7. Self-similarity Analysis in Deep Neural Networks Authors: Jingyi Ding, Chengwen Qi, Hongfei Wang, Jianshe Wu, Licheng Jiao, Yuwei Guo, Jian Gao

  8. Explainable Mapper: Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents Authors: Xinyuan Yan, Rita Sevastjanova, Sinie van der Ben, Mennatallah El-Assady, Bei Wang

  9. Are LLM Belief Updates Consistent with Bayes' Theorem? Authors: Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella, Rob Lamb, Peter M. Atkinson

  10. Euclidean Distance Deflation Under High-Dimensional Heteroskedastic Noise Authors: Keyi Li, Yuval Kluger, Boris Landa

  11. ChronoSelect: Robust Learning with Noisy Labels via Dynamics Temporal Memory Authors: Jianchao Wang, Qingfeng Li, Pengcheng Zheng, Xiaorong Pu, Yazhou Ren

  12. C2G-KD: PCA-Constrained Generator for Data-Free Knowledge Distillation Authors: Magnus Bengtsson, Kenneth \"Ostberg

  13. Logical Characterizations of GNNs with Mean Aggregation Authors: Moritz Sch\"onherr, Carsten Lutz

  14. Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models Authors: Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang

  15. Enhancing Quantization-Aware Training on Edge Devices via Relative Entropy Coreset Selection and Cascaded Layer Correction Authors: Yujia Tong, Jingling Yuan, Chuang Hu

  16. Incentivised Orchestrated Training Architecture (IOTA): A Technical Primer for Release Authors: Felix Quinque, Alan Aboudib, Szymon Fonau, Rodrigo Lopez Portillo Alcocer, Brian McCrindle, Steffen Cruz

  17. GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal


1. AlphaGo Moment for Model Architecture Discovery

ArXiv ID: 2507.18074

Authors: Yixiu Liu, Yang Nan, Weixian Xu, Xiangkun Hu, Lyumanshan Ye, Zhen Qin, Pengfei Liu

Abstract: While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery--a fully autonomous system that shatters this fundamental constraint by enabling AI to conduct its own architectural innovation. Moving beyond traditional Neural Architecture Search (NAS), which is fundamentally limited to exploring human-defined spaces, we introduce a paradigm shift from automated optimization to automated innovation. ASI-Arch can conduct end-to-end scientific research in the domain of architecture discovery, autonomously hypothesizing novel architectural concepts, implementing them as executable code, training and empirically validating their performance through rigorous experimentation and past experience. ASI-Arch conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures. Like AlphaGo's Move 37 that revealed unexpected strategic insights invisible to human players, our AI-discovered architectures demonstrate emergent design principles that systematically surpass human-designed baselines and illuminate previously unknown pathways for architectural innovation. Crucially, we establish the first empirical scaling law for scientific discovery itself--demonstrating that architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process. We provide comprehensive analysis of the emergent design patterns and autonomous research capabilities that enabled these breakthroughs, establishing a blueprint for self-accelerating AI systems.

Comment: The paper introduces ASI-Arch, an autonomous system for neural architecture discovery, which is a significant innovation in model architecture.

Relevance: 10 Novelty: 9


2. The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection

ArXiv ID: 2507.18549

Authors: Steven A. Frank

Abstract: Diverse learning algorithms, optimization methods, and natural selection share a common mathematical structure, despite their apparent differences. Here I show that a simple notational partitioning of change by the Price equation reveals a universal force-metric-bias (FMB) law: $\Delta\mathbf{\theta} = \mathbf{M}\,\mathbf{f} + \mathbf{b} + \mathbf{\xi}$. The force $\mathbf{f}$ drives improvement in parameters, $\Delta\mathbf{\theta}$, through the covariance between the parameters and performance. The metric $\mathbf{M}$ rescales movement by inverse curvature. The bias $\mathbf{b}$ adds momentum or changes in the frame of reference. The noise $\mathbf{\xi}$ enables exploration. This framework unifies natural selection, Bayesian updating, Newton's method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms as special cases of the same underlying process. The Price equation also reveals why Fisher information, Kullback-Leibler divergence, and d'Alembert's principle arise naturally in learning dynamics. By exposing this common structure, the FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines.

Comment: The paper presents a universal force-metric-bias law that unifies various learning algorithms, offering a theoretical insight into learning dynamics, which is relevant to representation learning.

Relevance: 9 Novelty: 9


3. Revisiting LLM Reasoning via Information Bottleneck

ArXiv ID: 2507.18391

Authors: Shiye Lei, Zhihao Cheng, Kai Jia, Dacheng Tao

Abstract: Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). By leveraging simple rule-based rewards, RL effectively incentivizes LLMs to produce extended chain-of-thought (CoT) reasoning trajectories, progressively guiding them toward correct answers. However, existing approaches remain largely heuristic and intuition-driven, limiting the development of principled methodologies. In this paper, we present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle, introducing IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable across diverse prompts. We derive a practical token-level surrogate objective and propose an efficient approximation, resulting in the lightweight IB regularization method. This technique integrates seamlessly into existing RL-based post-training frameworks without additional computational overhead, requiring only a one-line code modification. Empirically, we validate IB regularization across multiple mathematical reasoning benchmarks and RL algorithms, demonstrating consistent improvements in LLM reasoning performance.

Comment: The paper presents a theoretical characterization of LLM reasoning using the information bottleneck principle, offering insights into LLM behavior.

Relevance: 9 Novelty: 8


4. The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

ArXiv ID: 2507.18553

Authors: Jiale Chen, Torsten Hoefler, Dan Alistarh

Abstract: Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale. Yet, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure any geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: (i) the GPTQ error propagation step gains an intuitive geometric interpretation; (ii) GPTQ inherits the error upper bound of Babai's algorithm under the no-clipping condition. Taken together, these results place GPTQ on firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.

Comment: The paper provides a theoretical foundation for GPTQ quantization by relating it to Babai's nearest plane algorithm, relevant to model compression.

Relevance: 9 Novelty: 8


5. Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method

ArXiv ID: 2507.18073

Authors: Qingcheng Zhu, Yangyang Ren, Linlin Yang, Mingbao Lin, Yanjing Li, Sheng Xu, Zichao Feng, Haodong Zhu, Yuguang Yang, Juan Zhang, Runqi Wang, Baochang Zhang

Abstract: Deploying large language models (LLMs) is challenging due to their massive parameters and high computational costs. Ultra low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., mean bit-width <= 2) often leads to severe performance degradation. To address this, we propose Squeeze10-LLM, effectively "squeezing" 16-bit LLMs' weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework and achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. We introduce Squeeze10LLM with two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit settings. FIAS is a strategy that preserves full activation information during quantization to mitigate cumulative error propagation across layers. Experiments on LLaMA and LLaMA2 show that Squeeze10-LLM achieves state-of-the-art performance for sub-2bit weight-only quantization, improving average accuracy from 43% to 56% on six zero-shot classification tasks--a significant boost over existing PTQ methods. Our code will be released upon publication.

Comment: The paper presents a novel quantization method for LLMs, focusing on model compression through a staged mixed-precision approach, which aligns with the model compression criteria.

Relevance: 9 Novelty: 8


6. Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden Weights

ArXiv ID: 2507.18555

Authors: Jun'ichi Takeuchia, Yoshinari Takeishia, Noboru Muratab, Kazushi Mimurac, Ka Long Keith Hod, Hiroshi Nagaoka

Abstract: Fisher information matrices and neural tangent kernels (NTK) for 2-layer ReLU networks with random hidden weight are argued. We discuss the relation between both notions as a linear transformation and show that spectral decomposition of NTK with concrete forms of eigenfunctions with major eigenvalues. We also obtain an approximation formula of the functions presented by the 2-layer neural networks.

Comment: The paper discusses neural tangent kernels and Fisher information matrices for ReLU networks, which aligns with representation learning by providing insights into how networks encode information.

Relevance: 9 Novelty: 7


7. Self-similarity Analysis in Deep Neural Networks

ArXiv ID: 2507.17785

Authors: Jingyi Ding, Chengwen Qi, Hongfei Wang, Jianshe Wu, Licheng Jiao, Yuwei Guo, Jian Gao

Abstract: Current research has found that some deep neural networks exhibit strong hierarchical self-similarity in feature representation or parameter distribution. However, aside from preliminary studies on how the power-law distribution of weights across different training stages affects model performance,there has been no quantitative analysis on how the self-similarity of hidden space geometry influences model weight optimization, nor is there a clear understanding of the dynamic behavior of internal neurons. Therefore, this paper proposes a complex network modeling method based on the output features of hidden-layer neurons to investigate the self-similarity of feature networks constructed at different hidden layers, and analyzes how adjusting the degree of self-similarity in feature networks can enhance the classification performance of deep neural networks. Validated on three types of networks MLP architectures, convolutional networks, and attention architectures this study reveals that the degree of self-similarity exhibited by feature networks varies across different model architectures. Furthermore, embedding constraints on the self-similarity of feature networks during the training process can improve the performance of self-similar deep neural networks (MLP architectures and attention architectures) by up to 6 percentage points.

Comment: The paper investigates self-similarity in deep neural networks, providing insights into feature representation and training dynamics.

Relevance: 9 Novelty: 7


8. Explainable Mapper: Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents

ArXiv ID: 2507.18607

Authors: Xinyuan Yan, Rita Sevastjanova, Sinie van der Ben, Mennatallah El-Assady, Bei Wang

Abstract: Large language models (LLMs) produce high-dimensional embeddings that capture rich semantic and syntactic relationships between words, sentences, and concepts. Investigating the topological structures of LLM embedding spaces via mapper graphs enables us to understand their underlying structures. Specifically, a mapper graph summarizes the topological structure of the embedding space, where each node represents a topological neighborhood (containing a cluster of embeddings), and an edge connects two nodes if their corresponding neighborhoods overlap. However, manually exploring these embedding spaces to uncover encoded linguistic properties requires considerable human effort. To address this challenge, we introduce a framework for semi-automatic annotation of these embedding properties. To organize the exploration process, we first define a taxonomy of explorable elements within a mapper graph such as nodes, edges, paths, components, and trajectories. The annotation of these elements is executed through two types of customizable LLM-based agents that employ perturbation techniques for scalable and automated analysis. These agents help to explore and explain the characteristics of mapper elements and verify the robustness of the generated explanations. We instantiate the framework within a visual analytics workspace and demonstrate its effectiveness through case studies. In particular, we replicate findings from prior research on BERT's embedding properties across various layers of its architecture and provide further observations into the linguistic properties of topological neighborhoods.

Comment: The paper explores the topological structures of LLM embedding spaces, which aligns with representation learning and provides insights into LLM behavior.

Relevance: 9 Novelty: 7


9. Are LLM Belief Updates Consistent with Bayes' Theorem?

ArXiv ID: 2507.17951

Authors: Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella, Rob Lamb, Peter M. Atkinson

Abstract: Do larger and more capable language models learn to update their "beliefs" about propositions more consistently with Bayes' theorem when presented with evidence in-context? To test this, we formulate a Bayesian Coherence Coefficient (BCC) metric and generate a dataset with which to measure the BCC. We measure BCC for multiple pre-trained-only language models across five model families, comparing against the number of model parameters, the amount of training data, and model scores on common benchmarks. Our results provide evidence for our hypothesis that larger and more capable pre-trained language models assign credences that are more coherent with Bayes' theorem. These results have important implications for our understanding and governance of LLMs.

Comment: The paper investigates LLM belief updates in relation to Bayes' theorem, providing theoretical insights into LLM behavior.

Relevance: 9 Novelty: 7


10. Euclidean Distance Deflation Under High-Dimensional Heteroskedastic Noise

ArXiv ID: 2507.18520

Authors: Keyi Li, Yuval Kluger, Boris Landa

Abstract: Pairwise Euclidean distance calculation is a fundamental step in many machine learning and data analysis algorithms. In real-world applications, however, these distances are frequently distorted by heteroskedastic noise$\unicode{x2014}$a prevalent form of inhomogeneous corruption characterized by variable noise magnitudes across data observations. Such noise inflates the computed distances in a nontrivial way, leading to misrepresentations of the underlying data geometry. In this work, we address the tasks of estimating the noise magnitudes per observation and correcting the pairwise Euclidean distances under heteroskedastic noise. Perhaps surprisingly, we show that in general high-dimensional settings and without assuming prior knowledge on the clean data structure or noise distribution, both tasks can be performed reliably, even when the noise levels vary considerably. Specifically, we develop a principled, hyperparameter-free approach that jointly estimates the noise magnitudes and corrects the distances. We provide theoretical guarantees for our approach, establishing probabilistic bounds on the estimation errors of both noise magnitudes and distances. These bounds, measured in the normalized $\ell_1$ norm, converge to zero at polynomial rates as both feature dimension and dataset size increase. Experiments on synthetic datasets demonstrate that our method accurately estimates distances in challenging regimes, significantly improving the robustness of subsequent distance-based computations. Notably, when applied to single-cell RNA sequencing data, our method yields noise magnitude estimates consistent with an established prototypical model, enabling accurate nearest neighbor identification that is fundamental to many downstream analyses.

Comment: The paper addresses the correction of Euclidean distances under heteroskedastic noise, providing theoretical insights into noise estimation and correction.

Relevance: 8 Novelty: 8


11. ChronoSelect: Robust Learning with Noisy Labels via Dynamics Temporal Memory

ArXiv ID: 2507.18183

Authors: Jianchao Wang, Qingfeng Li, Pengcheng Zheng, Xiaorong Pu, Yazhou Ren

Abstract: Training deep neural networks on real-world datasets is often hampered by the presence of noisy labels, which can be memorized by over-parameterized models, leading to significant degradation in generalization performance. While existing methods for learning with noisy labels (LNL) have made considerable progress, they fundamentally suffer from static snapshot evaluations and fail to leverage the rich temporal dynamics of learning evolution. In this paper, we propose ChronoSelect (chrono denoting its temporal nature), a novel framework featuring an innovative four-stage memory architecture that compresses prediction history into compact temporal distributions. Our unique sliding update mechanism with controlled decay maintains only four dynamic memory units per sample, progressively emphasizing recent patterns while retaining essential historical knowledge. This enables precise three-way sample partitioning into clean, boundary, and noisy subsets through temporal trajectory analysis and dual-branch consistency. Theoretical guarantees prove the mechanism's convergence and stability under noisy conditions. Extensive experiments demonstrate ChronoSelect's state-of-the-art performance across synthetic and real-world benchmarks.

Comment: The paper introduces a novel framework for learning with noisy labels, focusing on training dynamics and memory architecture, which is relevant to representation learning.

Relevance: 8 Novelty: 8


12. C2G-KD: PCA-Constrained Generator for Data-Free Knowledge Distillation

ArXiv ID: 2507.18533

Authors: Magnus Bengtsson, Kenneth \"Ostberg

Abstract: We introduce C2G-KD, a data-free knowledge distillation framework where a class-conditional generator is trained to produce synthetic samples guided by a frozen teacher model and geometric constraints derived from PCA. The generator never observes real training data but instead learns to activate the teacher's output through a combination of semantic and structural losses. By constraining generated samples to lie within class-specific PCA subspaces estimated from as few as two real examples per class, we preserve topological consistency and diversity. Experiments on MNIST show that even minimal class structure is sufficient to bootstrap useful synthetic training pipelines.

Comment: The paper introduces a data-free knowledge distillation framework using PCA-constrained generators, which aligns with model compression and efficiency.

Relevance: 8 Novelty: 7


13. Logical Characterizations of GNNs with Mean Aggregation

ArXiv ID: 2507.18145

Authors: Moritz Sch\"onherr, Carsten Lutz

Abstract: We study the expressive power of graph neural networks (GNNs) with mean as the aggregation function. In the non-uniform setting, we show that such GNNs have exactly the same expressive power as ratio modal logic, which has modal operators expressing that at least a certain ratio of the successors of a vertex satisfies a specified property. The non-uniform expressive power of mean GNNs is thus higher than that of GNNs with max aggregation, but lower than for sum aggregation--the latter are characterized by modal logic and graded modal logic, respectively. In the uniform setting, we show that the expressive power relative to MSO is exactly that of alternation-free modal logic, under the natural assumptions that combination functions are continuous and classification functions are thresholds. This implies that, relative to MSO and in the uniform setting, mean GNNs are strictly less expressive than sum GNNs and max GNNs. When any of the assumptions is dropped, the expressive power increases.

Comment: The paper provides insights into the expressive power of GNNs with mean aggregation, which aligns with representation learning by analyzing how these networks encode information.

Relevance: 8 Novelty: 7


14. Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

ArXiv ID: 2507.18171

Authors: Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang

Abstract: Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising 'sticky tokens' can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model's internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.

Comment: The paper investigates 'sticky tokens' in text embedding models, which relates to representation learning by analyzing how these models encode information.

Relevance: 8 Novelty: 7


15. Enhancing Quantization-Aware Training on Edge Devices via Relative Entropy Coreset Selection and Cascaded Layer Correction

ArXiv ID: 2507.17768

Authors: Yujia Tong, Jingling Yuan, Chuang Hu

Abstract: With the development of mobile and edge computing, the demand for low-bit quantized models on edge devices is increasing to achieve efficient deployment. To enhance the performance, it is often necessary to retrain the quantized models using edge data. However, due to privacy concerns, certain sensitive data can only be processed on edge devices. Therefore, employing Quantization-Aware Training (QAT) on edge devices has become an effective solution. Nevertheless, traditional QAT relies on the complete dataset for training, which incurs a huge computational cost. Coreset selection techniques can mitigate this issue by training on the most representative subsets. However, existing methods struggle to eliminate quantization errors in the model when using small-scale datasets (e.g., only 10% of the data), leading to significant performance degradation. To address these issues, we propose QuaRC, a QAT framework with coresets on edge devices, which consists of two main phases: In the coreset selection phase, QuaRC introduces the ``Relative Entropy Score" to identify the subsets that most effectively capture the model's quantization errors. During the training phase, QuaRC employs the Cascaded Layer Correction strategy to align the intermediate layer outputs of the quantized model with those of the full-precision model, thereby effectively reducing the quantization errors in the intermediate layers. Experimental results demonstrate the effectiveness of our approach. For instance, when quantizing ResNet-18 to 2-bit using a 1% data subset, QuaRC achieves a 5.72% improvement in Top-1 accuracy on the ImageNet-1K dataset compared to state-of-the-art techniques.

Comment: The paper enhances quantization-aware training on edge devices, focusing on model compression through coreset selection and layer correction.

Relevance: 8 Novelty: 7


16. Incentivised Orchestrated Training Architecture (IOTA): A Technical Primer for Release

ArXiv ID: 2507.17766

Authors: Felix Quinque, Alan Aboudib, Szymon Fonau, Rodrigo Lopez Portillo Alcocer, Brian McCrindle, Steffen Cruz

Abstract: In August 2024, Bittensor's Subnet 9 (SN9) demonstrated that a distributed network of incentivized, permissionless actors could each pretrain large language models (LLMs) ranging from 700 million to 14 billion parameters, while surpassing established baselines. While that work validated blockchain-based decentralized pretraining as viable, it contained core issues: (i) every miner had to fit an entire model locally, and (ii) "winner-takes-all" rewards encouraged model hoarding. Here we introduce IOTA (Incentivized Orchestrated Training Architecture), an architecture that addresses these limitations by transforming SN9's previously isolated competitors into a single cooperating unit that can scale arbitrarily while still rewarding each contributor fairly. Key preliminary results: (1) Data- and Pipeline-parallel SWARM architecture - An orchestrator distributes model layers across heterogeneous miners and streams activations between them, enabling model sizes to scale with the number of participants rather than being constrained by the VRAM of a single machine; (2) Granular, continuous incentives - Validators measure each miner's contribution and allocate token emissions proportionally; (3) Activation compression - We used model-bottlenecks to cut communication bandwidths of activations by up to 128x, vastly improving training speed; (4) Butterfly All-Reduce - Miners average disjoint parameter slices in O(1) bandwidth, offering linear scalability, redundancy and built-in collusion detection; (5) CLASP (Contribution Loss Assessment via Sampling of Pathways) - A fair attribution scheme assigns credit to miners proportional to their marginal utility and detects exploits, even when contributions are interdependent across the pipeline.

Comment: The paper introduces a new architecture for distributed training of LLMs, focusing on scalability and efficiency, which aligns with model architecture and compression topics.

Relevance: 8 Novelty: 7


17. GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

ArXiv ID: 2507.18043

Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Abstract: Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model's logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model's fluency and general capabilities.

Comment: The paper introduces a novel inference-time steering method for LLMs and VLMs, which involves modifying internal activations and aligns with representation learning and model architecture criteria.

Relevance: 8 Novelty: 7


Paper Selection Prompt

System Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

User Prompt

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

  • ARXIVID: should be the ArXiv ID.
  • COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
  • RELEVANCE: should be a score from 1-10.
  • NOVELTY: should be a score from 1-10.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

  • Relevance 9-10 (Completely Relevant)
  • Focus: Fully aligned with core topics with no deviation, score the highest if contains relevant keywords in it.
  • Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".

  • Relevance 7-8 (Relevant)

  • Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
  • Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.

  • Relevance 5-6 (Borderline)

  • Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
  • Examples: Work referencing MoE centered on reinforcement learning.

  • Relevance 3-4 (Irrelevant)

  • Focus: Largely outside our interests with no association to our topics.
  • Examples: Application-focused papers like using MoE to solve a problem in the real world.

  • Relevance 1-2 (Ignore)

  • Focus: Purely unrelated to our topics. Completely a different domain.
  • Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)

Novelty Scoring

  • Novelty 9-10 (Breakthrough)
  • Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
  • Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.

  • Novelty 7-8 (Improvements)

  • Definition: Substantial insights/enhancements, though not a full paradigm shift.
  • Examples: Modifications on existing methods yielding significantly better results.

  • Novelty 5-6 (Borderline)

  • Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
  • Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.

  • Novelty 3-4 (Tangential)

  • Definition: Minor or domain-specific improvements with limited broader impact.
  • Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.

  • Novelty 1-2 (Low)

  • Definition: Minimal originality, applying standard approaches without real innovation.
  • Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords:

  • Relevant: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
  • Irrelevant: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
  • Application: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, Knowledge Graph, etc.