Personalized Daily ArXiv Papers 2025-07-22

[gpt-4o]	Prompt	Completion	Total
Token	42397	5082	47479
Cost	$0.11	$0.05	$0.16

Total arXiv papers: 843

Total scanned papers: 542

Total relevant papers: 25

Table of contents with paper titles:

IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning Authors: Jaeheun Jung, Jaehyuk Lee, Yeajin Lee, Donghun Lee
The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts Authors: Sungmin Yun, Seonyong Park, Hwayong Nam, Younjoo Lee, Gunjun Lee, Kwanhee Kyung, Sangpyo Kim, Nam Sung Kim, Jongmin Kim, Hyungyo Kim, Juhwan Cho, Seungmin Baek, Jung Ho Ahn
LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression Authors: Wenjie Huang, Qi Yang, Shuting Xia, He Huang, Zhu Li, Yiling Xu
Better Models and Algorithms for Learning Ising Models from Dynamics Authors: Jason Gaitonde, Ankur Moitra, Elchanan Mossel
The Other Mind: How Language Models Exhibit Human Temporal Cognition Authors: Lingyu Li, Yang Yao, Yixu Wang, Chubo Li, Yan Teng, Yingchun Wang
Rethinking Memorization Measures and their Implications in Large Language Models Authors: Bishwamittra Ghosh, Soumi Das, Qinyuan Wu, Mohammad Aflah Khan, Krishna P. Gummadi, Evimaria Terzi, Deepak Garg
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models Authors: Dachuan Shi (Celine), Yonggan Fu (Celine), Xiangchi Yuan (Celine), Zhongzhi Yu (Celine), Haoran You (Celine), Sixu Li (Celine), Xin Dong (Celine), Jan Kautz (Celine), Pavlo Molchanov (Celine), Yingyan (Celine), Lin
Language Generation in the Limit: Noise, Loss, and Feedback Authors: Yannan Bai, Debmalya Panigrahi, Ian Zhang
SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning Authors: Jeyoung Lee, Hochul Kang
DFQ-ViT: Data-Free Quantization for Vision Transformers without Fine-tuning Authors: Yujia Tong, Jingling Yuan, Tian Zhang, Jianquan Liu, Chuang Hu
A Sparsity Predicting Approach for Large Language Models via Activation Pattern Clustering Authors: Nobel Dhar, Bobin Deng, Md Romyull Islam, Xinyue Zhang, Kazi Fahim Ahmad Nasif, Kun Suo
Flow Equivariant Recurrent Neural Networks Authors: T. Anderson Keller
Generative Distribution Distillation Authors: Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaogang Xu, Pengguang Chen, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong
MAP Estimation with Denoisers: Convergence Rates and Guarantees Authors: Scott Pesme, Giacomo Meanti, Michael Arbel, Julien Mairal
Probing Information Distribution in Transformer Architectures through Entropy Analysis Authors: Amedeo Buonanno, Alessandro Rivetti, Francesco A. N. Palmieri, Giovanni Di Gennaro, Gianmarco Romano
All-atom inverse protein folding through discrete flow matching Authors: Kai Yi, Kiarash Jamali, Sjors H. W. Scheres
The Tsetlin Machine Goes Deep: Logical Learning and Reasoning With Graphs Authors: Ole-Christoffer Granmo, Youmna Abdelwahab, Per-Arne Andersen, Paul F. A. Clarke, Kunal Dumbre, Ylva Gr{\o}nnins{\ae}ter, Vojtech Halenka, Runar Helin, Lei Jiao, Ahmed Khalid, Rebekka Omslandseter, Rupsa Saha, Mayur Shende, Xuan Zhang
Universal crystal material property prediction via multi-view geometric fusion in graph transformers Authors: Liang Zhang, Kong Chen, Yuen Wu
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization Authors: Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang
A universal augmentation framework for long-range electrostatics in machine learning interatomic potentials Authors: Dongjin Kim, Xiaoyu Wang, Peichen Zhong, Daniel S. King, Theo Jaffrelot Inizan, Bingqing Cheng
Better Training Data Attribution via Better Inverse Hessian-Vector Products Authors: Andrew Wang, Elisa Nguyen, Runshi Yang, Juhan Bae, Sheila A. McIlraith, Roger Grosse
Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design Authors: Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou
Inverse Scaling in Test-Time Compute Authors: Aryo Pradipta Gema, Alexander H\"agele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
Learning Stochastic Hamiltonian Systems via Stochastic Generating Function Neural Network Authors: Chen Chen, Lijin Wang, Yanzhao Cao, Xupeng Cheng
DIVER-0 : A Fully Channel Equivariant EEG Foundation Model Authors: Danny Dongyeop Han, Ahhyun Lucy Lee, Taeyang Lee, Yonghyeon Gwon, Sebin Lee, Seongjin Lee, David Keetae Park, Shinjae Yoo, Jiook Cha, Chun Kee Chung

1. IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning

ArXiv ID: 2507.14171

Authors: Jaeheun Jung, Jaehyuk Lee, Yeajin Lee, Donghun Lee

Abstract: With the growth of demand on neural network compression methods, the structured pruning methods including importance-based approach are actively studied. The magnitude importance and many correlated modern importance criteria often limit the capacity of pruning decision, since the filters with larger magnitudes are not likely to be pruned if the smaller one didn't, even if it is redundant. In this paper, we propose a novel pruning strategy to challenge this dominating effect of magnitude and provide fair chance to each filter to be pruned, by placing it on projective space. After that, we observe the gradient descent movement whether the filters move toward the origin or not, to measure how the filter is likely to be pruned. This measurement is used to construct PROscore, a novel importance score for IPPRO, a novel importance-based structured pruning with magnitude-indifference. Our evaluation results shows that the proposed importance criteria using the projective space achieves near-lossless pruning by reducing the performance drop in pruning, with promising performance after the finetuning. Our work debunks the ``size-matters'' myth in pruning and expands the frontier of importance-based pruning both theoretically and empirically.

Comment: The paper introduces a novel pruning strategy that challenges the traditional magnitude-based pruning, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8

2. The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts

ArXiv ID: 2507.15465

Authors: Sungmin Yun, Seonyong Park, Hwayong Nam, Younjoo Lee, Gunjun Lee, Kwanhee Kyung, Sangpyo Kim, Nam Sung Kim, Jongmin Kim, Hyungyo Kim, Juhwan Cho, Seungmin Baek, Jung Ho Ahn

Abstract: Computational workloads composing traditional Transformer models are starkly bifurcated. Multi-Head Attention (MHA) is memory-bound, with low arithmetic intensity, while feedforward layers are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the MHA bottleneck. This paper argues that recent architectural shifts, namely Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE), challenge the premise of specialized attention hardware. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude greater than that of MHA, shifting it close to a compute-bound regime well-suited for modern accelerators like GPUs. Second, by distributing MoE experts across a pool of accelerators, their arithmetic intensity can be tuned through batching to match that of the dense layers, creating a more balanced computational profile. These findings reveal a diminishing need for specialized attention hardware. The central challenge for next-generation Transformers is no longer accelerating a single memory-bound layer. Instead, the focus must shift to designing balanced systems with sufficient compute, memory capacity, memory bandwidth, and high-bandwidth interconnects to manage the diverse demands of large-scale models.

Comment: The paper discusses the implications of Mixture-of-Experts and Latent Attention on system design for LLMs, aligning with the model architecture and LLM criteria.

Relevance: 9 Novelty: 8

3. LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

ArXiv ID: 2507.15686

Authors: Wenjie Huang, Qi Yang, Shuting Xia, He Huang, Zhu Li, Yiling Xu

Abstract: Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitation of encoding time and decoder size, current INR based methods only consider lossy geometry compression. In this paper, we propose the first INR based lossless point cloud geometry compression method called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding speed, we design a group of point clouds level coding framework with an effective network initialization strategy, which can reduce around 60% encoding time. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, with the convergence time in the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project can be seen on https://huangwenjie2023.github.io/LINR-PCGC/.

Comment: The paper focuses on model compression through a novel lossless point cloud geometry compression method using implicit neural representations, which aligns with the model compression criterion.

Relevance: 9 Novelty: 8

4. Better Models and Algorithms for Learning Ising Models from Dynamics

ArXiv ID: 2507.15173

Authors: Jason Gaitonde, Ankur Moitra, Elchanan Mossel

Abstract: We study the problem of learning the structure and parameters of the Ising model, a fundamental model of high-dimensional data, when observing the evolution of an associated Markov chain. A recent line of work has studied the natural problem of learning when observing an evolution of the well-known Glauber dynamics [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Mossel STOC 2024], which provides an arguably more realistic generative model than the classical i.i.d. setting. However, this prior work crucially assumes that all site update attempts are observed, \emph{even when this attempt does not change the configuration}: this strong observation model is seemingly essential for these approaches. While perhaps possible in restrictive contexts, this precludes applicability to most realistic settings where we can observe \emph{only} the stochastic evolution itself, a minimal and natural assumption for any process we might hope to learn from. However, designing algorithms that succeed in this more realistic setting has remained an open problem [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Moitra, Mossel, STOC 2025]. In this work, we give the first algorithms that efficiently learn the Ising model in this much more natural observation model that only observes when the configuration changes. For Ising models with maximum degree $d$, our algorithm recovers the underlying dependency graph in time $\mathsf{poly}(d)\cdot n^2\log n$ and then the actual parameters in additional $\widetilde{O}(2^d n)$ time, which qualitatively matches the state-of-the-art even in the i.i.d. setting in a much weaker observation model. Our analysis holds more generally for a broader class of reversible, single-site Markov chains that also includes the popular Metropolis chain by leveraging more robust properties of reversible Markov chains.

Comment: The paper presents a new algorithm for learning Ising models from dynamics, which is a fundamental research area in representation learning.

Relevance: 9 Novelty: 8

5. The Other Mind: How Language Models Exhibit Human Temporal Cognition

ArXiv ID: 2507.15851

Authors: Lingyu Li, Yang Yao, Yixu Wang, Chubo Li, Yan Teng, Yingchun Wang

Abstract: As Large Language Models (LLMs) continue to advance, they exhibit certain cognitive patterns similar to those of humans that are not directly specified in training data. This study investigates this phenomenon by focusing on temporal cognition in LLMs. Leveraging the similarity judgment task, we find that larger models spontaneously establish a subjective temporal reference point and adhere to the Weber-Fechner law, whereby the perceived distance logarithmically compresses as years recede from this reference point. To uncover the mechanisms behind this behavior, we conducted multiple analyses across neuronal, representational, and informational levels. We first identify a set of temporal-preferential neurons and find that this group exhibits minimal activation at the subjective reference point and implements a logarithmic coding scheme convergently found in biological systems. Probing representations of years reveals a hierarchical construction process, where years evolve from basic numerical values in shallow layers to abstract temporal orientation in deep layers. Finally, using pre-trained embedding models, we found that the training corpus itself possesses an inherent, non-linear temporal structure, which provides the raw material for the model's internal construction. In discussion, we propose an experientialist perspective for understanding these findings, where the LLMs' cognition is viewed as a subjective construction of the external world by its internal representational system. This nuanced perspective implies the potential emergence of alien cognitive frameworks that humans cannot intuitively predict, pointing toward a direction for AI alignment that focuses on guiding internal constructions. Our code is available at https://TheOtherMind.github.io.

Comment: The paper explores temporal cognition in large language models, providing theoretical insights into LLM behavior, which aligns with the large language models criterion.

Relevance: 9 Novelty: 8

6. Rethinking Memorization Measures and their Implications in Large Language Models

ArXiv ID: 2507.14777

Authors: Bishwamittra Ghosh, Soumi Das, Qinyuan Wu, Mohammad Aflah Khan, Krishna P. Gummadi, Evimaria Terzi, Deepak Garg

Abstract: Concerned with privacy threats, memorization in LLMs is often seen as undesirable, specifically for learning. In this paper, we study whether memorization can be avoided when optimally learning a language, and whether the privacy threat posed by memorization is exaggerated or not. To this end, we re-examine existing privacy-focused measures of memorization, namely recollection-based and counterfactual memorization, along with a newly proposed contextual memorization. Relating memorization to local over-fitting during learning, contextual memorization aims to disentangle memorization from the contextual learning ability of LLMs. Informally, a string is contextually memorized if its recollection due to training exceeds the optimal contextual recollection, a learned threshold denoting the best contextual learning without training. Conceptually, contextual recollection avoids the fallacy of recollection-based memorization, where any form of high recollection is a sign of memorization. Theoretically, contextual memorization relates to counterfactual memorization, but imposes stronger conditions. Memorization measures differ in outcomes and information requirements. Experimenting on 18 LLMs from 6 families and multiple formal languages of different entropy, we show that (a) memorization measures disagree on memorization order of varying frequent strings, (b) optimal learning of a language cannot avoid partial memorization of training strings, and (c) improved learning decreases contextual and counterfactual memorization but increases recollection-based memorization. Finally, (d) we revisit existing reports of memorized strings by recollection that neither pose a privacy threat nor are contextually or counterfactually memorized.

Comment: The paper provides theoretical insights into memorization in LLMs, which aligns with the interest in understanding LLM behavior and interpretability.

Relevance: 9 Novelty: 8

7. LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

ArXiv ID: 2507.14204

Authors: Dachuan Shi (Celine), Yonggan Fu (Celine), Xiangchi Yuan (Celine), Zhongzhi Yu (Celine), Haoran You (Celine), Sixu Li (Celine), Xin Dong (Celine), Jan Kautz (Celine), Pavlo Molchanov (Celine), Yingyan (Celine), Lin

Abstract: Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache's effectiveness in enhancing LLMs' long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.

Comment: The paper proposes a new KV cache optimization paradigm for LLMs, which is relevant to model compression and efficiency improvements.

Relevance: 9 Novelty: 8

8. Language Generation in the Limit: Noise, Loss, and Feedback

ArXiv ID: 2507.15319

Authors: Yannan Bai, Debmalya Panigrahi, Ian Zhang

Abstract: Kleinberg and Mullainathan (2024) recently proposed a formal framework called language generation in the limit and showed that given a sequence of example strings from an unknown target language drawn from any countable collection, an algorithm can correctly generate unseen strings from the target language within finite time. This notion was further refined by Li, Raman, and Tewari (2024), who defined stricter categories of non-uniform and uniform generation. They showed that a finite union of uniformly generatable collections is generatable in the limit, and asked if the same is true for non-uniform generation. We begin by resolving the question in the negative: we give a uniformly generatable collection and a non-uniformly generatable collection whose union is not generatable in the limit. We then use facets of this construction to further our understanding of several variants of language generation. The first two, generation with noise and without samples, were introduced by Raman and Raman (2025) and Li, Raman, and Tewari (2024) respectively. We show the equivalence of these models for uniform and non-uniform generation, and provide a characterization of non-uniform noisy generation. The former paper asked if there is any separation between noisy and non-noisy generation in the limit -- we show that such a separation exists even with a single noisy string. Finally, we study the framework of generation with feedback, introduced by Charikar and Pabbaraju (2025), where the algorithm is strengthened by allowing it to ask membership queries. We show finite queries add no power, but infinite queries yield a strictly more powerful model. In summary, the results in this paper resolve the union-closedness of language generation in the limit, and leverage those techniques (and others) to give precise characterizations for natural variants that incorporate noise, loss, and feedback.

Comment: The paper explores theoretical aspects of language generation in the limit, which aligns with emerging trends in foundational research.

Relevance: 9 Novelty: 8

9. SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning

ArXiv ID: 2507.14516

Authors: Jeyoung Lee, Hochul Kang

Abstract: We propose the Signal Dice Similarity Coefficient (SDSC), a structure-aware metric function for time series self-supervised representation learning. Most Self-Supervised Learning (SSL) methods for signals commonly adopt distance-based objectives such as mean squared error (MSE), which are sensitive to amplitude, invariant to waveform polarity, and unbounded in scale. These properties hinder semantic alignment and reduce interpretability. SDSC addresses this by quantifying structural agreement between temporal signals based on the intersection of signed amplitudes, derived from the Dice Similarity Coefficient (DSC).Although SDSC is defined as a structure-aware metric, it can be used as a loss by subtracting from 1 and applying a differentiable approximation of the Heaviside function for gradient-based optimization. A hybrid loss formulation is also proposed to combine SDSC with MSE, improving stability and preserving amplitude where necessary. Experiments on forecasting and classification benchmarks demonstrate that SDSC-based pre-training achieves comparable or improved performance over MSE, particularly in in-domain and low-resource scenarios. The results suggest that structural fidelity in signal representations enhances the semantic representation quality, supporting the consideration of structure-aware metrics as viable alternatives to conventional distance-based methods.

Comment: The paper proposes a new structure-aware metric for self-supervised representation learning in time series, aligning with the representation learning criterion.

Relevance: 9 Novelty: 7

10. DFQ-ViT: Data-Free Quantization for Vision Transformers without Fine-tuning

ArXiv ID: 2507.14481

Authors: Yujia Tong, Jingling Yuan, Tian Zhang, Jianquan Liu, Chuang Hu

Abstract: Data-Free Quantization (DFQ) enables the quantization of Vision Transformers (ViTs) without requiring access to data, allowing for the deployment of ViTs on devices with limited resources. In DFQ, the quantization model must be calibrated using synthetic samples, making the quality of these synthetic samples crucial. Existing methods fail to fully capture and balance the global and local features within the samples, resulting in limited synthetic data quality. Moreover, we have found that during inference, there is a significant difference in the distributions of intermediate layer activations between the quantized and full-precision models. These issues lead to a severe performance degradation of the quantized model. To address these problems, we propose a pipeline for Data-Free Quantization for Vision Transformers (DFQ-ViT). Specifically, we synthesize samples in order of increasing difficulty, effectively enhancing the quality of synthetic data. During the calibration and inference stage, we introduce the activation correction matrix for the quantized model to align the intermediate layer activations with those of the full-precision model. Extensive experiments demonstrate that DFQ-ViT achieves remarkable superiority over existing DFQ methods and its performance is on par with models quantized through real data. For example, the performance of DeiT-T with 3-bit weights quantization is 4.29% higher than the state-of-the-art. Our method eliminates the need for fine-tuning, which not only reduces computational overhead but also lowers the deployment barriers for edge devices. This characteristic aligns with the principles of Green Learning by improving energy efficiency and facilitating real-world applications in resource-constrained environments.

Comment: The paper focuses on data-free quantization for Vision Transformers, which aligns with model compression through quantization, a relevant topic.

Relevance: 9 Novelty: 7

11. A Sparsity Predicting Approach for Large Language Models via Activation Pattern Clustering

ArXiv ID: 2507.14179

Authors: Nobel Dhar, Bobin Deng, Md Romyull Islam, Xinyue Zhang, Kazi Fahim Ahmad Nasif, Kun Suo

Abstract: Large Language Models (LLMs) exhibit significant activation sparsity, where only a subset of neurons are active for a given input. Although this sparsity presents opportunities to reduce computational cost, efficiently utilizing it requires predicting activation patterns in a scalable manner. However, direct prediction at the neuron level is computationally expensive due to the vast number of neurons in modern LLMs. To enable efficient prediction and utilization of activation sparsity, we propose a clustering-based activation pattern compression framework. Instead of treating each neuron independently, we group similar activation patterns into a small set of representative clusters. Our method achieves up to 79.34% clustering precision, outperforming standard binary clustering approaches while maintaining minimal degradation in perplexity (PPL) scores. With a sufficiently large number of clusters, our approach attains a PPL score as low as 12.49, demonstrating its effectiveness in preserving model quality while reducing computational overhead. By predicting cluster assignments rather than individual neuron states, future models can efficiently infer activation patterns from pre-computed centroids. We detail the clustering algorithm, analyze its effectiveness in capturing meaningful activation structures, and demonstrate its potential to improve sparse computation efficiency. This clustering-based formulation serves as a foundation for future work on activation pattern prediction, paving the way for efficient inference in large-scale language models.

Comment: The paper presents a clustering-based approach to predict activation sparsity in LLMs, which aligns with model compression and efficiency improvements.

Relevance: 9 Novelty: 7

12. Flow Equivariant Recurrent Neural Networks

ArXiv ID: 2507.14793

Authors: T. Anderson Keller

Abstract: Data arrives at our senses as a continuous stream, smoothly transforming from one instant to the next. These smooth transformations can be viewed as continuous symmetries of the environment that we inhabit, defining equivalence relations between stimuli over time. In machine learning, neural network architectures that respect symmetries of their data are called equivariant and have provable benefits in terms of generalization ability and sample efficiency. To date, however, equivariance has been considered only for static transformations and feed-forward networks, limiting its applicability to sequence models, such as recurrent neural networks (RNNs), and corresponding time-parameterized sequence transformations. In this work, we extend equivariant network theory to this regime of `flows' -- one-parameter Lie subgroups capturing natural transformations over time, such as visual motion. We begin by showing that standard RNNs are generally not flow equivariant: their hidden states fail to transform in a geometrically structured manner for moving stimuli. We then show how flow equivariance can be introduced, and demonstrate that these models significantly outperform their non-equivariant counterparts in terms of training speed, length generalization, and velocity generalization, on both next step prediction and sequence classification. We present this work as a first step towards building sequence models that respect the time-parameterized symmetries which govern the world around us.

Comment: The paper extends equivariant network theory to recurrent neural networks, which is a novel approach in model architecture.

Relevance: 8 Novelty: 8

13. Generative Distribution Distillation

ArXiv ID: 2507.14503

Authors: Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaogang Xu, Pengguang Chen, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong

Abstract: In this paper, we formulate the knowledge distillation (KD) as a conditional generative problem and propose the \textit{Generative Distribution Distillation (GenDD)} framework. A naive \textit{GenDD} baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a \textit{Split Tokenization} strategy, achieving stable and effective unsupervised KD. Additionally, we develop the \textit{Distribution Contraction} technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that \textit{GenDD} with \textit{Distribution Contraction} serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that \textit{GenDD} performs competitively in the unsupervised setting, significantly surpassing KL baseline by \textbf{16.29\%} on ImageNet validation set. With label supervision, our ResNet-50 achieves \textbf{82.28\%} top-1 accuracy on ImageNet in 600 epochs training, establishing a new state-of-the-art.

Comment: The paper proposes a generative approach to knowledge distillation, which is a novel method in representation learning.

Relevance: 8 Novelty: 8

14. MAP Estimation with Denoisers: Convergence Rates and Guarantees

ArXiv ID: 2507.15397

Authors: Scott Pesme, Giacomo Meanti, Michael Arbel, Julien Mairal

Abstract: Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimisation problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate-despite the lack of general theoretical justification for this substitution. In this work, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior $p$. We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods.

Comment: The paper provides theoretical justification for using denoisers in MAP estimation, which is relevant to foundational research in representation learning.

Relevance: 8 Novelty: 8

15. Probing Information Distribution in Transformer Architectures through Entropy Analysis

ArXiv ID: 2507.15347

Authors: Amedeo Buonanno, Alessandro Rivetti, Francesco A. N. Palmieri, Giovanni Di Gennaro, Gianmarco Romano

Abstract: This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may offer insights into model behavior and contribute to the development of interpretability and evaluation frameworks for transformer-based models

Comment: The paper uses entropy analysis to probe information distribution in Transformer architectures, aligning with the model architecture criterion.