Personalized Daily ArXiv Papers 2025-07-03

[gpt-4o]	Prompt	Completion	Total
Token	28429	3474	31903
Cost	$0.07	$0.03	$0.11

Total arXiv papers: 441

Total scanned papers: 272

Total relevant papers: 19

Table of contents with paper titles:

LEDOM: An Open and Fundamental Reverse Language Model Authors: Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan
Proof of a perfect platonic representation hypothesis Authors: Liu Ziyin, Isaac Chuang
DBellQuant: Breaking the Bell with Double-Bell Transformation for LLMs Post Training Binarization Authors: Zijian Ye, Wei Huang, Yifei Yu, Tianhe Ren, Zhongrui Wang, Xiaojuan Qi
Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training Authors: Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud
Automatic Rank Determination for Low-Rank Adaptation via Submodular Function Maximization Authors: Yihang Gao, Vincent Y. F. Tan
GradMetaNet: An Equivariant Architecture for Learning on Gradients Authors: Yoav Gelberg (Moe), Yam Eitan (Moe), Aviv Navon (Moe), Aviv Shamsian (Moe), Theo (Moe), Putterman, Michael Bronstein, Haggai Maron
Parsimonious Gaussian mixture models with piecewise-constant eigenvalue profiles Authors: Tom Szwagier, Pierre-Alexandre Mattei, Charles Bouveyron, Xavier Pennec
GPT, But Backwards: Exactly Inverting Language Model Outputs Authors: Adrians Skapars, Edoardo Manino, Youcheng Sun, Lucas C. Cordeiro
Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations Authors: Yuchao Lin, Cong Fu, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji
Test-Time Scaling with Reflective Generative Model Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective Authors: Yuxin Mao, Zhen Qin, Jinxing Zhou, Hui Deng, Xuyang Shen, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai
Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach Authors: Hao Wei, Wanli Ni, Wen Wang, Wenjun Xu, Dusit Niyato, Ping Zhang
Towards Foundation Auto-Encoders for Time-Series Anomaly Detection Authors: Gast\'on Garc\'ia Gonz\'alez, Pedro Casas, Emilio Mart\'inez, Alicia Fern\'andez
Long-Sequence Memory with Temporal Kernels and Dense Hopfield Functionals Authors: Ahmed Farooq
Decomposing Prediction Mechanisms for In-Context Recall Authors: Sultan Daniels, Dylan Davis, Dhruv Gautam, Wentinn Liao, Gireeja Ranade, Anant Sahai
How Do Vision-Language Models Process Conflicting Information Across Modalities? Authors: Tianze Hua, Tian Yun, Ellie Pavlick
Analysis of Muon's Convergence and Critical Batch Size Authors: Naoki Sato, Hiroki Naganuma, Hideaki Iiduka
On Design Principles for Private Adaptive Optimizers Authors: Arun Ganesh, Brendan McMahan, Abhradeep Thakurta
ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks Authors: Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao

1. LEDOM: An Open and Fundamental Reverse Language Model

ArXiv ID: 2507.01335

Authors: Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan

Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM's unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.

Comment: The paper introduces LEDOM, a reverse language model, which is a novel approach in the context of foundational models and LLMs.

Relevance: 9 Novelty: 9

2. Proof of a perfect platonic representation hypothesis

ArXiv ID: 2507.01098

Authors: Liu Ziyin, Isaac Chuang

Abstract: In this note, we elaborate on and explain in detail the proof given by Ziyin et al. (2025) of the "perfect" Platonic Representation Hypothesis (PRH) for the embedded deep linear network model (EDLN). We show that if trained with SGD, two EDLNs with different widths and depths and trained on different data will become Perfectly Platonic, meaning that every possible pair of layers will learn the same representation up to a rotation. Because most of the global minima of the loss function are not Platonic, that SGD only finds the perfectly Platonic solution is rather extraordinary. The proof also suggests at least six ways the PRH can be broken. We also show that in the EDLN model, the emergence of the Platonic representations is due to the same reason as the emergence of progressive sharpening. This implies that these two seemingly unrelated phenomena in deep learning can, surprisingly, have a common cause. Overall, the theory and proof highlight the importance of understanding emergent "entropic forces" due to the irreversibility of SGD training and their role in representation learning. The goal of this note is to be instructive and avoid lengthy technical details.

Comment: The paper provides a theoretical insight into representation learning by proving the Platonic Representation Hypothesis for deep linear networks.

Relevance: 9 Novelty: 9

3. DBellQuant: Breaking the Bell with Double-Bell Transformation for LLMs Post Training Binarization

ArXiv ID: 2507.01027

Authors: Zijian Ye, Wei Huang, Yifei Yu, Tianhe Ren, Zhongrui Wang, Xiaojuan Qi

Abstract: Large language models (LLMs) demonstrate remarkable performance but face substantial computational and memory challenges that limit their practical deployment. Quantization has emerged as a promising solution; however, its effectiveness is often limited by quantization errors arising from weight distributions that are not quantization-friendly and the presence of activation outliers. To address these challenges, we introduce DBellQuant, an innovative post-training quantization (PTQ) framework that achieves nearly 1-bit weight compression and 6-bit activation quantization with minimal performance degradation. DBellQuant uses Learnable Transformation for Dual-Bell (LTDB) algorithm, which transforms single-bell weight distributions into dual-bell forms to reduce binarization errors and applies inverse transformations to smooth activations. DBellQuant sets a new state-of-the-art by preserving superior model performance under aggressive weight and activation quantization. For example, on the Wikitext2 dataset, DBellQuant achieves a perplexity of 14.39 on LLaMA2-13B with 6-bit activation quantization, significantly outperforming BiLLM's 21.35 without activation quantization, underscoring its potential in compressing LLMs for real-world applications.

Comment: The paper presents a novel quantization framework for LLMs, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

4. Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training

ArXiv ID: 2507.01752

Authors: Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud

Abstract: Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, its reliance on large volumes of labeled data raises privacy and security concerns such as susceptibility to data poisoning attacks and the risk of overfitting. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. However, black box methods also pose significant challenges, including poor scalability to high-dimensional parameter spaces, as prevalent in large language models (LLMs), and high computational costs due to reliance on numerous model evaluations. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide strong theoretical bounds on generalization, differential privacy, susceptibility to data poisoning attacks, and robustness to extraction attacks. BBoxER operates on top of pre-trained LLMs, offering a lightweight and modular enhancement suitable for deployment in restricted or privacy-sensitive environments, in addition to non-vacuous generalization guarantees. In experiments with LLMs, we demonstrate empirically that Retrofitting methods are able to learn, showing how a few iterations of BBoxER improve performance and generalize well on a benchmark of reasoning datasets. This positions BBoxER as an attractive add-on on top of gradient-based optimization.

Comment: The paper presents BBoxER, a black-box optimization method for LLM post-training, focusing on privacy and generalization, which aligns with foundational research in LLMs.

Relevance: 9 Novelty: 8

5. Automatic Rank Determination for Low-Rank Adaptation via Submodular Function Maximization

ArXiv ID: 2507.01841

Authors: Yihang Gao, Vincent Y. F. Tan

Abstract: In this paper, we propose SubLoRA, a rank determination method for Low-Rank Adaptation (LoRA) based on submodular function maximization. In contrast to prior approaches, such as AdaLoRA, that rely on first-order (linearized) approximations of the loss function, SubLoRA utilizes second-order information to capture the potentially complex loss landscape by incorporating the Hessian matrix. We show that the linearization becomes inaccurate and ill-conditioned when the LoRA parameters have been well optimized, motivating the need for a more reliable and nuanced second-order formulation. To this end, we reformulate the rank determination problem as a combinatorial optimization problem with a quadratic objective. However, solving this problem exactly is NP-hard in general. To overcome the computational challenge, we introduce a submodular function maximization framework and devise a greedy algorithm with approximation guarantees. We derive a sufficient and necessary condition under which the rank-determination objective becomes submodular, and construct a closed-form projection of the Hessian matrix that satisfies this condition while maintaining computational efficiency. Our method combines solid theoretical foundations, second-order accuracy, and practical computational efficiency. We further extend SubLoRA to a joint optimization setting, alternating between LoRA parameter updates and rank determination under a rank budget constraint. Extensive experiments on fine-tuning physics-informed neural networks (PINNs) for solving partial differential equations (PDEs) demonstrate the effectiveness of our approach. Results show that SubLoRA outperforms existing methods in both rank determination and joint training performance.

Comment: The paper proposes a new method for rank determination in low-rank adaptation, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8

6. GradMetaNet: An Equivariant Architecture for Learning on Gradients

ArXiv ID: 2507.01649

Authors: Yoav Gelberg (Moe), Yam Eitan (Moe), Aviv Navon (Moe), Aviv Shamsian (Moe), Theo (Moe), Putterman, Michael Bronstein, Haggai Maron

Abstract: Gradients of neural networks encode valuable information for optimization, editing, and analysis of models. Therefore, practitioners often treat gradients as inputs to task-specific algorithms, e.g. for pruning or optimization. Recent works explore learning algorithms that operate directly on gradients but use architectures that are not specifically designed for gradient processing, limiting their applicability. In this paper, we present a principled approach for designing architectures that process gradients. Our approach is guided by three principles: (1) equivariant design that preserves neuron permutation symmetries, (2) processing sets of gradients across multiple data points to capture curvature information, and (3) efficient gradient representation through rank-1 decomposition. Based on these principles, we introduce GradMetaNet, a novel architecture for learning on gradients, constructed from simple equivariant blocks. We prove universality results for GradMetaNet, and show that previous approaches cannot approximate natural gradient-based functions that GradMetaNet can. We then demonstrate GradMetaNet's effectiveness on a diverse set of gradient-based tasks on MLPs and transformers, such as learned optimization, INR editing, and estimating loss landscape curvature.

Comment: The paper introduces GradMetaNet, a novel architecture for learning on gradients, focusing on equivariant design and efficient gradient representation, which aligns with representation learning and model architecture criteria.

Relevance: 9 Novelty: 8

7. Parsimonious Gaussian mixture models with piecewise-constant eigenvalue profiles

ArXiv ID: 2507.01542

Authors: Tom Szwagier, Pierre-Alexandre Mattei, Charles Bouveyron, Xavier Pennec

Abstract: Gaussian mixture models (GMMs) are ubiquitous in statistical learning, particularly for unsupervised problems. While full GMMs suffer from the overparameterization of their covariance matrices in high-dimensional spaces, spherical GMMs (with isotropic covariance matrices) certainly lack flexibility to fit certain anisotropic distributions. Connecting these two extremes, we introduce a new family of parsimonious GMMs with piecewise-constant covariance eigenvalue profiles. These extend several low-rank models like the celebrated mixtures of probabilistic principal component analyzers (MPPCA), by enabling any possible sequence of eigenvalue multiplicities. If the latter are prespecified, then we can naturally derive an expectation-maximization (EM) algorithm to learn the mixture parameters. Otherwise, to address the notoriously-challenging issue of jointly learning the mixture parameters and hyperparameters, we propose a componentwise penalized EM algorithm, whose monotonicity is proven. We show the superior likelihood-parsimony tradeoffs achieved by our models on a variety of unsupervised experiments: density fitting, clustering and single-image denoising.

Comment: The paper introduces a new family of parsimonious Gaussian mixture models, which aligns with representation learning through its focus on model efficiency and low-rank approaches.

Relevance: 9 Novelty: 8

8. GPT, But Backwards: Exactly Inverting Language Model Outputs

ArXiv ID: 2507.01693

Authors: Adrians Skapars, Edoardo Manino, Youcheng Sun, Lucas C. Cordeiro

Abstract: While existing auditing techniques attempt to identify potential unwanted behaviours in large language models (LLMs), we address the complementary forensic problem of reconstructing the exact input that led to an existing LLM output - enabling post-incident analysis and potentially the detection of fake output reports. We formalize exact input reconstruction as a discrete optimisation problem with a unique global minimum and introduce SODA, an efficient gradient-based algorithm that operates on a continuous relaxation of the input search space with periodic restarts and parameter decay. Through comprehensive experiments on LLMs ranging in size from 33M to 3B parameters, we demonstrate that SODA significantly outperforms existing approaches. We succeed in fully recovering 79.5% of shorter out-of-distribution inputs from next-token logits, without a single false positive, but struggle to extract private information from the outputs of longer (15+ token) input sequences. This suggests that standard deployment practices may currently provide adequate protection against malicious use of our method. Our code is available at https://doi.org/10.5281/zenodo.15539879.

Comment: The paper introduces a novel algorithm for reconstructing inputs from LLM outputs, which could provide insights into LLM behavior and interpretability.

Relevance: 9 Novelty: 8

9. Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations

ArXiv ID: 2507.01131

Authors: Yuchao Lin, Cong Fu, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

Abstract: $\rm{SO}(3)$-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks whose CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of $\rm{SO}(3)$-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the $O(L^3)$ CG paths into a single path without compromising equivariance, where $L$ is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from $O(L^6)$ to $O(L^4)$. We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. We also use existing datasets, including OC20, and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations.

Comment: The paper introduces tensor decomposition networks for efficient computation in machine learning interatomic potentials, aligning with model compression and efficiency.

Relevance: 8 Novelty: 8

10. Test-Time Scaling with Reflective Generative Model

ArXiv ID: 2507.01951

Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3's performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model(PRM) into a unified interface without extra process annotation, reducing over 99% PRM parameters for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suitable for test time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high), based on the controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI-o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.

Comment: The paper introduces a reflective generative model with a self-supervised process reward model, which is relevant to model architecture and efficiency improvements.

Relevance: 8 Novelty: 8

11. Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective

ArXiv ID: 2507.01652

Authors: Yuxin Mao, Zhen Qin, Jinxing Zhou, Hui Deng, Xuyang Shen, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai

Abstract: Autoregressive (AR) models have garnered significant attention in image generation for their ability to effectively capture both local and global structures within visual data. However, prevalent AR models predominantly rely on the transformer architectures, which are beset by quadratic computational complexity concerning input sequence length and substantial memory overhead due to the necessity of maintaining key-value caches. Although linear attention mechanisms have successfully reduced this burden in language models, our initial experiments reveal that they significantly degrade image generation quality because of their inability to capture critical long-range dependencies in visual data. We propose Linear Attention with Spatial-Aware Decay (LASAD), a novel attention mechanism that explicitly preserves genuine 2D spatial relationships within the flattened image sequences by computing position-dependent decay factors based on true 2D spatial location rather than 1D sequence positions. Based on this mechanism, we present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency, bridging the gap between linear attention's efficiency and spatial understanding needed for high-quality generation.

Comment: The paper introduces LASAD, a novel attention mechanism for autoregressive image generation, focusing on spatial-aware decay, which is relevant to model architecture and efficiency.

Relevance: 8 Novelty: 8

12. Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach

ArXiv ID: 2507.01728

Authors: Hao Wei, Wanli Ni, Wen Wang, Wenjun Xu, Dusit Niyato, Ping Zhang

Abstract: This letter proposes UniToCom, a unified token communication paradigm that treats tokens as the fundamental units for both processing and wireless transmission. Specifically, to enable efficient token representations, we propose a generative information bottleneck (GenIB) principle, which facilitates the learning of tokens that preserve essential information while supporting reliable generation across multiple modalities. By doing this, GenIB-based tokenization is conducive to improving the communication efficiency and reducing computational complexity. Additionally, we develop $\sigma$-GenIB to address the challenges of variance collapse in autoregressive modeling, maintaining representational diversity and stability. Moreover, we employ a causal Transformer-based multimodal large language model (MLLM) at the receiver to unify the processing of both discrete and continuous tokens under the next-token prediction paradigm. Simulation results validate the effectiveness and superiority of the proposed UniToCom compared to baselines under dynamic channel conditions. By integrating token processing with MLLMs, UniToCom enables scalable and generalizable communication in favor of multimodal understanding and generation, providing a potential solution for next-generation intelligent communications.

Comment: The paper introduces a generative information bottleneck principle for token communication, which aligns with representation learning and efficiency improvements.