Previous Day 2025-04-07
Monthly Overview 2025-04
Next Day 2025-04-09

Personalized Daily ArXiv Papers 2025-04-08

[gpt-4o] Prompt Completion Total
Token 53257 7591 60848
Cost $0.13 $0.08 $0.21

Total arXiv papers: 856

Total scanned papers: 572

Total relevant papers: 38

Table of contents with paper titles:

  1. Language Models Are Implicitly Continuous Authors: Samuele Marro, Davide Evangelista, X. Angelo Huang, Emanuele La Malfa, Michele Lombardi, Michael Wooldridge

  2. On the Spatial Structure of Mixture-of-Experts in Transformers Authors: Daniel Bershatsky, Ivan Oseledets

  3. Generalising from Self-Produced Data: Model Training Beyond Human Constraints Authors: Alfath Daryl Alhajir, Jennifer Dodgson, Joseph Lim, Truong Ma Phi, Julian Peh, Akira Rafhael Janson Pattirane, Lokesh Poovaragan

  4. Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition Authors: Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, Luc De Raedt

  5. Exact Unlearning of Finetuning Data via Model Merging at Scale Authors: Kevin Kuo, Amrith Setlur, Kartik Srinivas, Aditi Raghunathan, Virginia Smith

  6. Variational Self-Supervised Learning Authors: Mehmet Can Yavuz, Berrin Yanikoglu

  7. Scalable Robust Bayesian Co-Clustering with Compositional ELBOs Authors: Ashwin Vinod, Chandrajit Bajaj

  8. Saliency-driven Dynamic Token Pruning for Large Language Models Authors: Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang

  9. HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs Authors: Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, Ion Stoica

  10. Towards Symmetric Low-Rank Adapters Authors: Tales Panoutsos, Rodrygo L. T. Santos, Flavio Figueiredo

  11. Capturing AI's Attention: Physics of Repetition, Hallucination, Bias and Beyond Authors: Frank Yingjie Huo, Neil F. Johnson

  12. RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm Authors: Yongyi Yang, Jianyang Gao, Wei Hu

  13. Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs Authors: Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs

  14. Entropy-Based Block Pruning for Efficient Large Language Models Authors: Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke

  15. AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability Authors: Fernando Rosas, Alexander Boyd, Manuel Baltieri

  16. Gating is Weighting: Understanding Gated Linear Attention through In-context Learning Authors: Yingcong Li, Davoud Ataee Tarzanagh, Ankit Singh Rawat, Maryam Fazel, Samet Oymak

  17. Directional Sign Loss: A Topology-Preserving Loss Function that Approximates the Sign of Finite Differences Authors: Harvey Dam, Tripti Agarwal, Ganesh Gopalakrishnan

  18. Nonlocal techniques for the analysis of deep ReLU neural network approximations Authors: Cornelia Schneider, Mario Ullrich, Jan Vybiral

  19. Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning Authors: Ximing Lu, Seungju Han, David Acuna, Hyunwoo Kim, Jaehun Jung, Shrimai Prabhumoye, Niklas Muennighoff, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

  20. Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models Authors: Yuheng Wu, Wentao Guo, Zirui Liu, Heng Ji, Zhaozhuo Xu, Denghui Zhang

  21. Rethinking Reflection in Pre-Training Authors: Essential AI, :, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski

  22. Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining Authors: Yingbin Liang, Lu Dai, Shuo Shi, Minghao Dai, Junliang Du, Haige Wang

  23. Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes Authors: Ruiqi Zhang, Jingfeng Wu, Licong Lin, Peter L. Bartlett

  24. Adversarial KA Authors: Sviatoslav Dzhenzher, Michael H. Freedman

  25. Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models Authors: Adri\'an Bazaga, Rexhina Blloshmi, Bill Byrne, Adri`a de Gispert

  26. Adaptive Elicitation of Latent Information Using Natural Language Authors: Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong

  27. Better Rates for Random Task Orderings in Continual Linear Models Authors: Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, Nathan Srebro

  28. asKAN: Active Subspace embedded Kolmogorov-Arnold Network Authors: Zhiteng Zhou, Zhaoyue Xu, Yi Liu, Shizhao Wang

  29. Learning symmetries in datasets Authors: Veronica Sanz

  30. Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models Authors: Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou

  31. Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning Authors: Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu

  32. Memory and Bandwidth are All You Need for Fully Sharded Data Parallel Authors: Jiangtao Wang, Jan Ebert, Oleg Filatov, Stefan Kesselheim

  33. LOGLO-FNO: Efficient Learning of Local and Global Features in Fourier Neural Operators Authors: Marimuthu Kalimuthu, David Holzm\"uller, Mathias Niepert

  34. The Effects of Grouped Structural Global Pruning of Vision Transformers on Domain Generalisation Authors: Hamza Riaz, Alan F. Smeaton

  35. PINNverse: Accurate parameter estimation in differential equations from noisy data with constrained physics-informed neural networks Authors: Marius Almanst\"otter, Roman Vetter, Dagmar Iber

  36. Cramer-Rao Bounds for Laplacian Matrix Estimation Authors: Morad Halihal, Tirza Routtenberg, H. Vincent Poor

  37. A Simultaneous Approach for Training Neural Differential-Algebraic Systems of Equations Authors: Laurens R. Lueg, Victor Alves, Daniel Schicksnus, John R. Kitchin, Carl D. Laird, Lorenz T. Biegler

  38. SnapPix: Efficient-Coding--Inspired In-Sensor Compression for Edge Vision Authors: Weikai Lin, Tianrui Ma, Adith Boloor, Yu Feng, Ruofan Xing, Xuan Zhang, Yuhao Zhu


1. Language Models Are Implicitly Continuous

ArXiv ID: 2504.03933

Authors: Samuele Marro, Davide Evangelista, X. Angelo Huang, Emanuele La Malfa, Michele Lombardi, Michael Wooldridge

Abstract: Language is typically modelled with discrete sequences. However, the most successful approaches to language modelling, namely neural networks, are continuous and smooth function approximators. In this work, we show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that fundamentally differ from humans. Our work formally extends Transformers to capture the nuances of time and space continuity in both input and output space. Our results challenge the traditional interpretation of how LLMs understand language, with several linguistic and engineering implications.

Comment: Explores the implicit continuous nature of LLMs, providing theoretical insights into their behavior, which is highly relevant to foundational research on LLMs.

Relevance: 10 Novelty: 9


2. On the Spatial Structure of Mixture-of-Experts in Transformers

ArXiv ID: 2504.04444

Authors: Daniel Bershatsky, Ivan Oseledets

Abstract: A common assumption is that MoE routers primarily leverage semantic features for expert selection. However, our study challenges this notion by demonstrating that positional token information also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence supporting this hypothesis, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.

Comment: The paper analyzes the spatial structure of Mixture-of-Experts (MoE) in Transformers, which directly aligns with the model architecture criterion, particularly MoE behavior.

Relevance: 10 Novelty: 8


3. Generalising from Self-Produced Data: Model Training Beyond Human Constraints

ArXiv ID: 2504.04711

Authors: Alfath Daryl Alhajir, Jennifer Dodgson, Joseph Lim, Truong Ma Phi, Julian Peh, Akira Rafhael Janson Pattirane, Lokesh Poovaragan

Abstract: Current large language models (LLMs) are constrained by human-derived training data and limited by a single level of abstraction that impedes definitive truth judgments. This paper introduces a novel framework in which AI models autonomously generate and validate new knowledge through direct interaction with their environment. Central to this approach is an unbounded, ungamable numeric reward - such as annexed disk space or follower count - that guides learning without requiring human benchmarks. AI agents iteratively generate strategies and executable code to maximize this metric, with successful outcomes forming the basis for self-retraining and incremental generalisation. To mitigate model collapse and the warm start problem, the framework emphasizes empirical validation over textual similarity and supports fine-tuning via GRPO. The system architecture employs modular agents for environment analysis, strategy generation, and code synthesis, enabling scalable experimentation. This work outlines a pathway toward self-improving AI systems capable of advancing beyond human-imposed constraints toward autonomous general intelligence.

Comment: The paper introduces a novel framework for AI models to autonomously generate and validate knowledge, which aligns with foundational research in LLMs and emerging trends. The focus on self-improving AI systems is highly relevant.

Relevance: 9 Novelty: 9


4. Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition

ArXiv ID: 2504.03930

Authors: Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, Luc De Raedt

Abstract: Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. However, recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit on statistical features. To study the reasoning capabilities in a principled fashion, we adopt a computational theory perspective and propose an experimental protocol centered on 3-SAT -- the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of state-of-the-art LLMs by varying the inherent hardness of the problem instances. By comparing DeepSeek R1 with other LLMs, our findings reveal two key insights (1) LLM accuracy drops significantly on harder instances, suggesting all current models struggle when statistical shortcuts are unavailable (2) Unlike other LLMs, R1 shows signs of having learned the underlying reasoning. Following a principled experimental protocol, our study moves beyond the benchmark-driven evidence often found in LLM reasoning research. Our findings highlight important gaps and suggest clear directions for future research.

Comment: The paper explores reasoning capabilities of LLMs using a principled experimental protocol, which aligns with foundational research on LLM behavior and interpretability.

Relevance: 9 Novelty: 8


5. Exact Unlearning of Finetuning Data via Model Merging at Scale

ArXiv ID: 2504.04626

Authors: Kevin Kuo, Amrith Setlur, Kartik Srinivas, Aditi Raghunathan, Virginia Smith

Abstract: Approximate unlearning has gained popularity as an approach to efficiently update an LLM so that it behaves (roughly) as if it was not trained on a subset of data to begin with. However, existing methods are brittle in practice and can easily be attacked to reveal supposedly unlearned information. To alleviate issues with approximate unlearning, we instead propose SIFT-Masks (SIgn-Fixed Tuning-Masks), an exact unlearning method based on model merging. SIFT-Masks addresses two key limitations of standard model merging: (1) merging a large number of tasks can severely harm utility; and (2) methods that boost utility by sharing extra information across tasks make exact unlearning prohibitively expensive. SIFT-Masks solves these issues by (1) applying local masks to recover task-specific performance; and (2) constraining finetuning to align with a global sign vector as a lightweight approach to determine masks independently before merging. Across four settings where we merge up to 500 models, SIFT-Masks improves accuracy by 5-80% over naive merging and uses up to 250x less compute for exact unlearning compared to other merging baselines.

Comment: The paper introduces SIFT-Masks for exact unlearning in LLMs, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8


6. Variational Self-Supervised Learning

ArXiv ID: 2504.04318

Authors: Mehmet Can Yavuz, Berrin Yanikoglu

Abstract: We present Variational Self-Supervised Learning (VSSL), a novel framework that combines variational inference with self-supervised learning to enable efficient, decoder-free representation learning. Unlike traditional VAEs that rely on input reconstruction via a decoder, VSSL symmetrically couples two encoders with Gaussian outputs. A momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views. The reconstruction term in the ELBO is replaced with a cross-view denoising objective, preserving the analytical tractability of Gaussian KL divergence. We further introduce cosine-based formulations of KL and log-likelihood terms to enhance semantic alignment in high-dimensional latent spaces. Experiments on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL achieves competitive or superior performance to leading self-supervised methods, including BYOL and MoCo V3. VSSL offers a scalable, probabilistically grounded approach to learning transferable representations without generative reconstruction, bridging the gap between variational modeling and modern self-supervised techniques.

Comment: The paper introduces Variational Self-Supervised Learning (VSSL), which aligns with foundational research in representation learning and probabilistic modeling.

Relevance: 9 Novelty: 8


7. Scalable Robust Bayesian Co-Clustering with Compositional ELBOs

ArXiv ID: 2504.04079

Authors: Ashwin Vinod, Chandrajit Bajaj

Abstract: Co-clustering exploits the duality of instances and features to simultaneously uncover meaningful groups in both dimensions, often outperforming traditional clustering in high-dimensional or sparse data settings. Although recent deep learning approaches successfully integrate feature learning and cluster assignment, they remain susceptible to noise and can suffer from posterior collapse within standard autoencoders. In this paper, we present the first fully variational Co-clustering framework that directly learns row and column clusters in the latent space, leveraging a doubly reparameterized ELBO to improve gradient signal-to-noise separation. Our unsupervised model integrates a Variational Deep Embedding with a Gaussian Mixture Model (GMM) prior for both instances and features, providing a built-in clustering mechanism that naturally aligns latent modes with row and column clusters. Furthermore, our regularized end-to-end noise learning Compositional ELBO architecture jointly reconstructs the data while regularizing against noise through the KL divergence, thus gracefully handling corrupted or missing inputs in a single training pipeline. To counteract posterior collapse, we introduce a scale modification that increases the encoder's latent means only in the reconstruction pathway, preserving richer latent representations without inflating the KL term. Finally, a mutual information-based cross-loss ensures coherent co-clustering of rows and columns. Empirical results on diverse real-world datasets from multiple modalities, numerical, textual, and image-based, demonstrate that our method not only preserves the advantages of prior Co-clustering approaches but also exceeds them in accuracy and robustness, particularly in high-dimensional or noisy settings.

Comment: The paper introduces a novel variational co-clustering framework with a compositional ELBO, which aligns with representation learning through its focus on latent space clustering and training dynamics.

Relevance: 9 Novelty: 8


8. Saliency-driven Dynamic Token Pruning for Large Language Models

ArXiv ID: 2504.04514

Authors: Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang

Abstract: Despite the recent success of large language models (LLMs), LLMs are particularly challenging in long-sequence inference scenarios due to the quadratic computational complexity of the attention mechanism. Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens have the same contribution. Based on this observation, we propose a novel token pruning framework, namely Saliency-driven Dynamic Token Pruning (SDTP), to gradually and dynamically prune redundant tokens based on the input context. Specifically, a lightweight saliency-driven prediction module is designed to estimate the importance score of each token with its hidden state, which is added to different layers of the LLM to hierarchically prune redundant tokens. Furthermore, a ranking-based optimization strategy is proposed to minimize the ranking divergence of the saliency score and the predicted importance score. Extensive experiments have shown that our framework is generalizable to various models and datasets. By hierarchically pruning 65\% of the input tokens, our method greatly reduces 33\% $\sim$ 47\% FLOPs and achieves speedup up to 1.75$\times$ during inference, while maintaining comparable performance. We further demonstrate that SDTP can be combined with KV cache compression method for further compression.

Comment: The paper proposes a token pruning framework for LLMs, which aligns with model compression and efficiency breakthroughs, particularly through saliency-driven dynamic pruning.

Relevance: 9 Novelty: 8


9. HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs

ArXiv ID: 2504.03871

Authors: Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, Ion Stoica

Abstract: The Mixture-of-Experts (MoE) architecture has become increasingly popular as a method to scale up large language models (LLMs). To save costs, heterogeneity-aware training solutions have been proposed to utilize GPU clusters made up of both newer and older-generation GPUs. However, existing solutions are agnostic to the performance characteristics of different MoE model components (i.e., attention and expert) and do not fully utilize each GPU's compute capability. In this paper, we introduce HeterMoE, a system to efficiently train MoE models on heterogeneous GPUs. Our key insight is that newer GPUs significantly outperform older generations on attention due to architectural advancements, while older GPUs are still relatively efficient for experts. HeterMoE disaggregates attention and expert computation, where older GPUs are only assigned with expert modules. Through the proposed zebra parallelism, HeterMoE overlaps the computation on different GPUs, in addition to employing an asymmetric expert assignment strategy for fine-grained load balancing to minimize GPU idle time. Our evaluation shows that HeterMoE achieves up to 2.3x speed-up compared to existing MoE training systems, and 1.4x compared to an optimally balanced heterogeneity-aware solution. HeterMoE efficiently utilizes older GPUs by maintaining 95% training throughput on average, even with half of the GPUs in a homogeneous A40 cluster replaced with V100.

Comment: The paper introduces HeterMoE, a system for efficient training of MoE models on heterogeneous GPUs, which aligns with model architecture innovations and efficiency improvements.

Relevance: 9 Novelty: 8


10. Towards Symmetric Low-Rank Adapters

ArXiv ID: 2504.03719

Authors: Tales Panoutsos, Rodrygo L. T. Santos, Flavio Figueiredo

Abstract: \newcommand{\mathds}[1]{\text{\usefont{U}{dsrom}{m}{n}#1}} In this paper, we introduce Symmetric Low-Rank Adapters, an optimized variant of LoRA with even fewer weights. This method utilizes Low-Rank Symmetric Weight Matrices to learn downstream tasks more efficiently. Traditional LoRA accumulates fine-tuning weights with the original pre-trained weights via a Singular Value Decomposition (SVD) like approach, i.e., model weights are fine-tuned via updates of the form $BA$ (where $B \in \mathbb{R}^{n\times r}$, $A \in \mathbb{R}^{r\times n}$, and $r$ is the rank of the merged weight matrix). In contrast, our approach, named SymLoRA, represents fine-tuning weights as a Spectral Decomposition, i.e., $Q \, diag(\Lambda)\, Q^T$, where $Q \in \mathbb{R}^{n\times r}$ and $\Lambda \in \mathbb{R}^r$. SymLoRA requires approximately half of the finetuning weights. Here, we show that this approach has negligible losses in downstream efficacy.

Comment: The paper introduces Symmetric Low-Rank Adapters (SymLoRA), which is a novel approach to low-rank adaptation with fewer weights. This aligns with the model compression criterion, specifically low-rank approaches.

Relevance: 9 Novelty: 8


11. Capturing AI's Attention: Physics of Repetition, Hallucination, Bias and Beyond

ArXiv ID: 2504.04600

Authors: Frank Yingjie Huo, Neil F. Johnson

Abstract: We derive a first-principles physics theory of the AI engine at the heart of LLMs' 'magic' (e.g. ChatGPT, Claude): the basic Attention head. The theory allows a quantitative analysis of outstanding AI challenges such as output repetition, hallucination and harmful content, and bias (e.g. from training and fine-tuning). Its predictions are consistent with large-scale LLM outputs. Its 2-body form suggests why LLMs work so well, but hints that a generalized 3-body Attention would make such AI work even better. Its similarity to a spin-bath means that existing Physics expertise could immediately be harnessed to help Society ensure AI is trustworthy and resilient to manipulation.

Comment: The paper provides a physics-based theoretical analysis of attention mechanisms in LLMs, which aligns with the criterion of theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8


12. RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm

ArXiv ID: 2504.03717

Authors: Yongyi Yang, Jianyang Gao, Wei Hu

Abstract: Post-training Quantization (PTQ) has become a widely used technique for improving inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and inflexible choice of target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of a randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA's efficacy in balancing efficiency and accuracy. The code is publicly available at https://github.com/FFTYYY/RaanA .

Comment: The paper introduces a novel post-training quantization framework, which aligns with the model compression criterion, particularly quantization methods.

Relevance: 9 Novelty: 8


13. Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs

ArXiv ID: 2504.03889

Authors: Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs

Abstract: Multi-head attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives most attention despite limited semantic importance, challenge our understanding of multi-head attention. To analyze this phenomenon, we propose a new definition for attention heads dominated by attention sinks, known as dormant attention heads. We compare our definition to prior work in a model intervention study where we test whether dormant heads matter for inference by zeroing out the output of dormant attention heads. Using six pretrained models and five benchmark datasets, we find our definition to be more model and dataset-agnostic. Using our definition on most models, more than 4% of a model's attention heads can be zeroed while maintaining average accuracy, and zeroing more than 14% of a model's attention heads can keep accuracy to within 1% of the pretrained model's average accuracy. Further analysis reveals that dormant heads emerge early in pretraining and can transition between dormant and active states during pretraining. Additionally, we provide evidence that they depend on characteristics of the input text.

Comment: The paper investigates dormant attention heads in LLMs, which aligns with foundational research on understanding LLM behavior and interpretability.

Relevance: 9 Novelty: 8


14. Entropy-Based Block Pruning for Efficient Large Language Models

ArXiv ID: 2504.03794

Authors: Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke

Abstract: As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.

Comment: Proposes an entropy-based pruning strategy for LLMs, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8


15. AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

ArXiv ID: 2504.04608

Authors: Fernando Rosas, Alexander Boyd, Manuel Baltieri

Abstract: Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. However, accurate world models often have high computational demands that can severely restrict the scope and depth of such assessments. Inspired by the classic `brain in a vat' thought experiment, here we investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation. By following principles from computational mechanics, our approach reveals a fundamental trade-off in world model construction between efficiency and interpretability, demonstrating that no single world model can optimise all desirable characteristics. Building on this trade-off, we identify procedures to build world models that either minimise memory requirements, delineate the boundaries of what is learnable, or allow tracking causes of undesirable outcomes. In doing so, this work establishes fundamental limits in world modelling, leading to actionable guidelines that inform core design choices related to effective agent evaluation.

Comment: The paper investigates fundamental trade-offs in world modeling for agent sandboxing, which aligns with emerging trends and foundational research in AI interpretability.

Relevance: 9 Novelty: 8


16. Gating is Weighting: Understanding Gated Linear Attention through In-context Learning

ArXiv ID: 2504.04308

Authors: Yingcong Li, Davoud Ataee Tarzanagh, Ankit Singh Rawat, Maryam Fazel, Samet Oymak

Abstract: Linear attention methods offer a compelling alternative to softmax attention due to their efficiency in recurrent decoding. Recent research has focused on enhancing standard linear attention by incorporating gating while retaining its computational benefits. Such Gated Linear Attention (GLA) architectures include competitive models such as Mamba and RWKV. In this work, we investigate the in-context learning capabilities of the GLA model and make the following contributions. We show that a multilayer GLA can implement a general class of Weighted Preconditioned Gradient Descent (WPGD) algorithms with data-dependent weights. These weights are induced by the gating mechanism and the input, enabling the model to control the contribution of individual tokens to prediction. To further understand the mechanics of this weighting, we introduce a novel data model with multitask prompts and characterize the optimization landscape of learning a WPGD algorithm. Under mild conditions, we establish the existence and uniqueness (up to scaling) of a global minimum, corresponding to a unique WPGD solution. Finally, we translate these findings to explore the optimization landscape of GLA and shed light on how gating facilitates context-aware learning and when it is provably better than vanilla linear attention.

Comment: The paper provides theoretical insights into Gated Linear Attention (GLA) and its in-context learning capabilities, aligning with foundational research in model architecture and training dynamics.

Relevance: 9 Novelty: 8


17. Directional Sign Loss: A Topology-Preserving Loss Function that Approximates the Sign of Finite Differences

ArXiv ID: 2504.04202

Authors: Harvey Dam, Tripti Agarwal, Ganesh Gopalakrishnan

Abstract: Preserving critical topological features in learned latent spaces is a fundamental challenge in representation learning, particularly for topology-sensitive data. This paper introduces directional sign loss (DSL), a novel loss function that approximates the number of mismatches in the signs of finite differences between corresponding elements of two arrays. By penalizing discrepancies in critical points between input and reconstructed data, DSL encourages autoencoders and other learnable compressors to retain the topological features of the original data. We present the mathematical formulation, complexity analysis, and practical implementation of DSL, comparing its behavior to its non-differentiable counterpart and to other topological measures. Experiments on one-, two-, and three-dimensional data show that combining DSL with traditional loss functions preserves topological features more effectively than traditional losses alone. Moreover, DSL serves as a differentiable, efficient proxy for common topology-based metrics, enabling its use in gradient-based optimization frameworks.

Comment: The introduction of a topology-preserving loss function (DSL) for representation learning is highly relevant. It provides a novel approach to preserving topological features in latent spaces, aligning with representation learning.

Relevance: 9 Novelty: 8


18. Nonlocal techniques for the analysis of deep ReLU neural network approximations

ArXiv ID: 2504.04847

Authors: Cornelia Schneider, Mario Ullrich, Jan Vybiral

Abstract: Recently, Daubechies, DeVore, Foucart, Hanin, and Petrova introduced a system of piece-wise linear functions, which can be easily reproduced by artificial neural networks with the ReLU activation function and which form a Riesz basis of $L_2([0,1])$. This work was generalized by two of the authors to the multivariate setting. We show that this system serves as a Riesz basis also for Sobolev spaces $W^s([0,1]^d)$ and Barron classes ${\mathbb B}^s([0,1]^d)$ with smoothness $0<1$. We apply this fact to re-prove some recent results on the approximation of functions from these classes by deep neural networks. Our proof method avoids using local approximations and allows us to track also the implicit constants as well as to show that we can avoid the curse of dimension. Moreover, we also study how well one can approximate Sobolev and Barron functions by ANNs if only function values are known.

Comment: The paper provides theoretical insights into ReLU neural network approximations, focusing on Sobolev and Barron function spaces. This aligns with representation learning and foundational theoretical work.

Relevance: 9 Novelty: 8


19. Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning

ArXiv ID: 2504.04383

Authors: Ximing Lu, Seungju Han, David Acuna, Hyunwoo Kim, Jaehun Jung, Shrimai Prabhumoye, Niklas Muennighoff, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

Abstract: Large reasoning models exhibit remarkable reasoning capabilities via long, elaborate reasoning trajectories. Supervised fine-tuning on such reasoning traces, also known as distillation, can be a cost-effective way to boost reasoning capabilities of student models. However, empirical observations reveal that these reasoning trajectories are often suboptimal, switching excessively between different lines of thought, resulting in under-thinking, over-thinking, and even degenerate responses. We introduce Retro-Search, an MCTS-inspired search algorithm, for distilling higher quality reasoning paths from large reasoning models. Retro-Search retrospectively revises reasoning paths to discover better, yet shorter traces, which can then lead to student models with enhanced reasoning capabilities with shorter, thus faster inference. Our approach can enable two use cases: self-improvement, where models are fine-tuned on their own Retro-Search-ed thought traces, and weak-to-strong improvement, where a weaker model revises stronger model's thought traces via Retro-Search. For self-improving, R1-distill-7B, fine-tuned on its own Retro-Search-ed traces, reduces the average reasoning length by 31.2% while improving performance by 7.7% across seven math benchmarks. For weak-to-strong improvement, we retrospectively revise R1-671B's traces from the OpenThoughts dataset using R1-distill-32B as the Retro-Search-er, a model 20x smaller. Qwen2.5-32B, fine-tuned on this refined data, achieves performance comparable to R1-distill-32B, yielding an 11.3% reduction in reasoning length and a 2.4% performance improvement compared to fine-tuning on the original OpenThoughts data. Our work counters recently emergent viewpoints that question the relevance of search algorithms in the era of large reasoning models, by demonstrating that there are still opportunities for algorithmic advancements, even for frontier models.

Comment: The Retro-Search algorithm for improving reasoning paths in LLMs introduces a novel search-based approach for distillation, which aligns with foundational research on LLM efficiency and reasoning capabilities.

Relevance: 9 Novelty: 8


20. Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models

ArXiv ID: 2504.04238

Authors: Yuheng Wu, Wentao Guo, Zirui Liu, Heng Ji, Zhaozhuo Xu, Denghui Zhang

Abstract: This paper investigates the emergence of Theory-of-Mind (ToM) capabilities in large language models (LLMs) from a mechanistic perspective, focusing on the role of extremely sparse parameter patterns. We introduce a novel method to identify ToM-sensitive parameters and reveal that perturbing as little as 0.001% of these parameters significantly degrades ToM performance while also impairing contextual localization and language understanding. To understand this effect, we analyze their interaction with core architectural components of LLMs. Our findings demonstrate that these sensitive parameters are closely linked to the positional encoding module, particularly in models using Rotary Position Embedding (RoPE), where perturbations disrupt dominant-frequency activations critical for contextual processing. Furthermore, we show that perturbing ToM-sensitive parameters affects LLM's attention mechanism by modulating the angle between queries and keys under positional encoding. These insights provide a deeper understanding of how LLMs acquire social reasoning abilities, bridging AI interpretability with cognitive science. Our results have implications for enhancing model alignment, mitigating biases, and improving AI systems designed for human interaction.

Comment: The paper investigates sparse parameter patterns in LLMs and their role in Theory-of-Mind capabilities, providing insights into sparsity and interpretability in neural networks.

Relevance: 9 Novelty: 8


21. Rethinking Reflection in Pre-Training

ArXiv ID: 2504.04022

Authors: Essential AI, :, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski

Abstract: A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.

Comment: The paper investigates the emergence of self-correction during pretraining in LLMs, which aligns with the criterion of theoretical insights into LLM behavior.

Relevance: 9 Novelty: 7


22. Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining

ArXiv ID: 2504.04032

Authors: Yingbin Liang, Lu Dai, Shuo Shi, Minghao Dai, Junliang Du, Haige Wang

Abstract: Complex data mining has wide application value in many fields, especially in the feature extraction and classification tasks of unlabeled data. This paper proposes an algorithm based on self-supervised learning and verifies its effectiveness through experiments. The study found that in terms of the selection of optimizer and learning rate, the combination of AdamW optimizer and 0.002 learning rate performed best in all evaluation indicators, indicating that the adaptive optimization method can improve the performance of the model in complex data mining tasks. In addition, the ablation experiment further analyzed the contribution of each module. The results show that contrastive learning, variational modules, and data augmentation strategies play a key role in the generalization ability and robustness of the model. Through the convergence curve analysis of the loss function, the experiment verifies that the method can converge stably during the training process and effectively avoid serious overfitting. Further experimental results show that the model has strong adaptability on different data sets, can effectively extract high-quality features from unlabeled data, and improves classification accuracy. At the same time, under different data distribution conditions, the method can still maintain high detection accuracy, proving its applicability in complex data environments. This study analyzed the role of self-supervised learning methods in complex data mining through systematic experiments and verified its advantages in improving feature extraction quality, optimizing classification performance, and enhancing model stability

Comment: The paper explores contrastive and variational approaches in self-supervised learning, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 7


23. Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes

ArXiv ID: 2504.04105

Authors: Ruiqi Zhang, Jingfeng Wu, Licong Lin, Peter L. Bartlett

Abstract: We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $\eta$. We show that after at most $1/\gamma^2$ burn-in steps, GD achieves a risk upper bounded by $\exp(-\Theta(\eta))$, where $\gamma$ is the margin of the dataset. As $\eta$ can be arbitrarily large, GD attains an arbitrarily small risk $\textit{immediately after the burn-in steps}$, though the risk evolution may be $\textit{non-monotonic}$. We further construct hard datasets with margin $\gamma$, where any batch or online first-order method requires $\Omega(1/\gamma^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is $\textit{minimax optimal}$ among first-order batch methods. Notably, the classical $\textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/\gamma^2$, matching GD even in constants. Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.

Comment: The paper provides theoretical insights into gradient descent with adaptive stepsizes, which aligns with training dynamics and foundational optimization methods.

Relevance: 8 Novelty: 8


24. Adversarial KA

ArXiv ID: 2504.05255

Authors: Sviatoslav Dzhenzher, Michael H. Freedman

Abstract: Regarding the representation theorem of Kolmogorov and Arnold (KA) as an algorithm for representing or {\guillemotleft}expressing{\guillemotright} functions, we test its robustness by analyzing its ability to withstand adversarial attacks. We find KA to be robust to countable collections of continuous adversaries, but unearth a question about the equi-continuity of the outer functions that, so far, obstructs taking limits and defeating continuous groups of adversaries. This question on the regularity of the outer functions is relevant to the debate over the applicability of KA to the general theory of NNs.

Comment: The paper investigates the robustness of the Kolmogorov-Arnold representation theorem under adversarial attacks, which aligns with foundational research in representation learning and theoretical insights.

Relevance: 8 Novelty: 8


25. Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models

ArXiv ID: 2504.05258

Authors: Adri\'an Bazaga, Rexhina Blloshmi, Bill Byrne, Adri`a de Gispert

Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.

Comment: The paper introduces TISER, a framework for improving temporal reasoning in LLMs, which is relevant to foundational research in LLM behavior and interpretability. The focus on timeline construction and iterative self-reflection is novel.

Relevance: 8 Novelty: 8


26. Adaptive Elicitation of Latent Information Using Natural Language

ArXiv ID: 2504.04204

Authors: Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong

Abstract: Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.

Comment: The paper proposes an adaptive elicitation framework for reducing uncertainty in latent entities using natural language, which aligns with foundational research in representation learning and emerging trends.

Relevance: 8 Novelty: 8


27. Better Rates for Random Task Orderings in Continual Linear Models

ArXiv ID: 2504.04579

Authors: Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, Nathan Srebro

Abstract: We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. For linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup, and apply them to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on the problem dimensionality or complexity. Specifically, in continual regression with replacement, we improve the best existing rate from $O((d-r)/k)$ to $O(\min(k^{-1/4}, \sqrt{d-r}/k, \sqrt{Tr}/k))$, where $d$ is the dimensionality and $r$ the average task rank. Furthermore, we establish the first rates for random task orderings without replacement. The obtained rate of $O(\min(T^{-1/4}, (d-r)/T))$ proves for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences. Finally, we prove a similar $O(k^{-1/4})$ universal rate for the forgetting in continual linear classification on separable data. Our universal rates apply for broader projection methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d and one-pass orderings.

Comment: The paper provides theoretical insights into training dynamics and forgetting in continual learning for linear models, which aligns with representation learning and foundational research.

Relevance: 8 Novelty: 7


28. asKAN: Active Subspace embedded Kolmogorov-Arnold Network

ArXiv ID: 2504.04669

Authors: Zhiteng Zhou, Zhaoyue Xu, Yi Liu, Shizhao Wang

Abstract: The Kolmogorov-Arnold Network (KAN) has emerged as a promising neural network architecture for small-scale AI+Science applications. However, it suffers from inflexibility in modeling ridge functions, which is widely used in representing the relationships in physical systems. This study investigates this inflexibility through the lens of the Kolmogorov-Arnold theorem, which starts the representation of multivariate functions from constructing the univariate components rather than combining the independent variables. Our analysis reveals that incorporating linear combinations of independent variables can substantially simplify the network architecture in representing the ridge functions. Inspired by this finding, we propose active subspace embedded KAN (asKAN), a hierarchical framework that synergizes KAN's function representation with active subspace methodology. The architecture strategically embeds active subspace detection between KANs, where the active subspace method is used to identify the primary ridge directions and the independent variables are adaptively projected onto these critical dimensions. The proposed asKAN is implemented in an iterative way without increasing the number of neurons in the original KAN. The proposed method is validated through function fitting, solving the Poisson equation, and reconstructing sound field. Compared with KAN, asKAN significantly reduces the error using the same network architecture. The results suggest that asKAN enhances the capability of KAN in fitting and solving equations with in the form of ridge functions.

Comment: The paper introduces a novel architecture (asKAN) for small-scale AI+Science applications, which aligns with foundational research in model architecture innovations.

Relevance: 8 Novelty: 7


29. Learning symmetries in datasets

ArXiv ID: 2504.05174

Authors: Veronica Sanz

Abstract: We investigate how symmetries present in datasets affect the structure of the latent space learned by Variational Autoencoders (VAEs). By training VAEs on data originating from simple mechanical systems and particle collisions, we analyze the organization of the latent space through a relevance measure that identifies the most meaningful latent directions. We show that when symmetries or approximate symmetries are present, the VAE self-organizes its latent space, effectively compressing the data along a reduced number of latent variables. This behavior captures the intrinsic dimensionality determined by the symmetry constraints and reveals hidden relations among the features. Furthermore, we provide a theoretical analysis of a simple toy model, demonstrating how, under idealized conditions, the latent space aligns with the symmetry directions of the data manifold. We illustrate these findings with examples ranging from two-dimensional datasets with $O(2)$ symmetry to realistic datasets from electron-positron and proton-proton collisions. Our results highlight the potential of unsupervised generative models to expose underlying structures in data and offer a novel approach to symmetry discovery without explicit supervision.

Comment: The paper investigates how symmetries in datasets affect the latent space of VAEs, which aligns with the representation learning criterion, particularly in understanding how models encode information.

Relevance: 8 Novelty: 7


30. Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

ArXiv ID: 2504.04823

Authors: Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou

Abstract: Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this study, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes will be open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.

Comment: Conducts an empirical study on quantized reasoning models, which aligns with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7


31. Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

ArXiv ID: 2504.04524

Authors: Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu

Abstract: Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) have achieved significant performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks , they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) haven't yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, making their exploration in this specific area still a worthwhile pursuit. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code of this paper are released and updating on https://github.com/XueruiSu/Trust-Region-Preference-Approximation.git.

Comment: The paper proposes TRPA, a preference-based optimization algorithm for reasoning tasks in LLMs, which aligns with foundational research in LLM reasoning and optimization methods.

Relevance: 8 Novelty: 7


32. Memory and Bandwidth are All You Need for Fully Sharded Data Parallel

ArXiv ID: 2504.03655

Authors: Jiangtao Wang, Jan Ebert, Oleg Filatov, Stefan Kesselheim

Abstract: Transformer models have revolutionized a wide spectrum of disciplines, especially in language processing. The recent success has proven that model size scalability is crucial for achieving superior performance metrics. However, training large transformer models is challenging even on modern hardware with powerful GPUs and high-speed interconnects. Existing studies primarily focus on optimizing model training distribution strategies to minimize memory footprint and enhance training speed, often overlooking the scalability challenges related to model size and hardware constraints. To address this oversight, we thoroughly investigate computational, memory, and network demands of training large transformers using the Fully Sharded Data Parallel (FSDP) distributed strategy across different hardware clusters. We explore the intricate relationships between model size and hardware setups to identify configurations that ensure maximum model and hardware efficiency, effective sequence length management, and optimal training throughput. A significant finding of our study is the critical interplay of the cluster's connection bandwidth and GPU memory size compared to the computational performance of GPUs. This interplay limits training efficiency, underscoring the role of both hardware characteristics as a possible bottleneck. By integrating theoretical analysis with simulations and empirical tests, we demonstrate how hardware limitations affect training efficacy, identifying key hardware thresholds and the impact of network connectivity. Our findings prompt a reassessment of training strategies guiding users on the way to finding hardware-optimal FSDP configurations, enhancing training efficiency for large-scale transformer models.

Comment: The paper investigates hardware constraints and optimization strategies for training large transformer models, which aligns with foundational research in model architecture and efficiency.

Relevance: 8 Novelty: 7


33. LOGLO-FNO: Efficient Learning of Local and Global Features in Fourier Neural Operators

ArXiv ID: 2504.04260

Authors: Marimuthu Kalimuthu, David Holzm\"uller, Mathias Niepert

Abstract: Modeling high-frequency information is a critical challenge in scientific machine learning. For instance, fully turbulent flow simulations of Navier-Stokes equations at Reynolds numbers 3500 and above can generate high-frequency signals due to swirling fluid motions caused by eddies and vortices. Faithfully modeling such signals using neural networks depends on accurately reconstructing moderate to high frequencies. However, it has been well known that deep neural nets exhibit the so-called spectral bias toward learning low-frequency components. Meanwhile, Fourier Neural Operators (FNOs) have emerged as a popular class of data-driven models in recent years for solving Partial Differential Equations (PDEs) and for surrogate modeling in general. Although impressive results have been achieved on several PDE benchmark problems, FNOs often perform poorly in learning non-dominant frequencies characterized by local features. This limitation stems from the spectral bias inherent in neural networks and the explicit exclusion of high-frequency modes in FNOs and their variants. Therefore, to mitigate these issues and improve FNO's spectral learning capabilities to represent a broad range of frequency components, we propose two key architectural enhancements: (i) a parallel branch performing local spectral convolutions (ii) a high-frequency propagation module. Moreover, we propose a novel frequency-sensitive loss term based on radially binned spectral errors. This introduction of a parallel branch for local convolutions reduces number of trainable parameters by up to 50% while achieving the accuracy of baseline FNO that relies solely on global convolutions. Experiments on three challenging PDE problems in fluid mechanics and biological pattern formation, and the qualitative and spectral analysis of predictions show the effectiveness of our method over the state-of-the-art neural operator baselines.

Comment: The paper proposes architectural enhancements to Fourier Neural Operators (FNOs) to address spectral bias, which aligns with architectural innovations and efficiency improvements.

Relevance: 8 Novelty: 7


34. The Effects of Grouped Structural Global Pruning of Vision Transformers on Domain Generalisation

ArXiv ID: 2504.04196

Authors: Hamza Riaz, Alan F. Smeaton

Abstract: With the growing sizes of AI models like large language models (LLMs) and vision transformers, deploying them on devices with limited computational resources is a significant challenge particularly when addressing domain generalisation (DG) tasks. This paper introduces a novel grouped structural pruning method for pre-trained vision transformers (ViT, BeiT, and DeiT), evaluated on the PACS and Office-Home DG benchmarks. Our method uses dependency graph analysis to identify and remove redundant groups of neurons, weights, filters, or attention heads within transformers, using a range of selection metrics. Grouped structural pruning is applied at pruning ratios of 50\%, 75\% and 95\% and the models are then fine-tuned on selected distributions from DG benchmarks to evaluate their overall performance in DG tasks. Results show significant improvements in inference speed and fine-tuning time with minimal trade-offs in accuracy and DG task performance. For instance, on the PACS benchmark, pruning ViT, BeiT, and DeiT models by 50\% using the Hessian metric resulted in accuracy drops of only -2.94\%, -1.42\%, and -1.72\%, respectively, while achieving speed boosts of 2.5x, 1.81x, and 2.15x. These findings demonstrate the effectiveness of our approach in balancing model efficiency with domain generalisation performance.

Comment: The paper introduces a novel grouped structural pruning method for vision transformers, which aligns with the model compression criterion. The focus on dependency graph analysis and pruning metrics adds methodological insights.

Relevance: 8 Novelty: 7


35. PINNverse: Accurate parameter estimation in differential equations from noisy data with constrained physics-informed neural networks

ArXiv ID: 2504.05248

Authors: Marius Almanst\"otter, Roman Vetter, Dagmar Iber

Abstract: Parameter estimation for differential equations from measured data is an inverse problem prevalent across quantitative sciences. Physics-Informed Neural Networks (PINNs) have emerged as effective tools for solving such problems, especially with sparse measurements and incomplete system information. However, PINNs face convergence issues, stability problems, overfitting, and complex loss function design. Here we introduce PINNverse, a training paradigm that addresses these limitations by reformulating the learning process as a constrained differential optimization problem. This approach achieves a dynamic balance between data loss and differential equation residual loss during training while preventing overfitting. PINNverse combines the advantages of PINNs with the Modified Differential Method of Multipliers to enable convergence on any point on the Pareto front. We demonstrate robust and accurate parameter estimation from noisy data in four classical ODE and PDE models from physics and biology. Our method enables accurate parameter inference also when the forward problem is expensive to solve.

Comment: The paper introduces a constrained optimization framework for physics-informed neural networks (PINNs), which is relevant to foundational AI for Science. The approach addresses key limitations in PINNs, offering methodological advancements.

Relevance: 8 Novelty: 7


36. Cramer-Rao Bounds for Laplacian Matrix Estimation

ArXiv ID: 2504.04576

Authors: Morad Halihal, Tirza Routtenberg, H. Vincent Poor

Abstract: In this paper, we analyze the performance of the estimation of Laplacian matrices under general observation models. Laplacian matrix estimation involves structural constraints, including symmetry and null-space properties, along with matrix sparsity. By exploiting a linear reparametrization that enforces the structural constraints, we derive closed-form matrix expressions for the Cramer-Rao Bound (CRB) specifically tailored to Laplacian matrix estimation. We further extend the derivation to the sparsity-constrained case, introducing two oracle CRBs that incorporate prior information of the support set, i.e. the locations of the nonzero entries in the Laplacian matrix. We examine the properties and order relations between the bounds, and provide the associated Slepian-Bangs formula for the Gaussian case. We demonstrate the use of the new CRBs in three representative applications: (i) topology identification in power systems, (ii) graph filter identification in diffused models, and (iii) precision matrix estimation in Gaussian Markov random fields under Laplacian constraints. The CRBs are evaluated and compared with the mean-squared-errors (MSEs) of the constrained maximum likelihood estimator (CMLE), which integrates both equality and inequality constraints along with sparsity constraints, and of the oracle CMLE, which knows the locations of the nonzero entries of the Laplacian matrix. We perform this analysis for the applications of power system topology identification and graphical LASSO, and demonstrate that the MSEs of the estimators converge to the CRB and oracle CRB, given a sufficient number of measurements.

Comment: The paper provides theoretical analysis on Laplacian matrix estimation, introducing new bounds and their applications. This aligns with foundational research in representation learning and sparsity.

Relevance: 8 Novelty: 7


37. A Simultaneous Approach for Training Neural Differential-Algebraic Systems of Equations

ArXiv ID: 2504.04665

Authors: Laurens R. Lueg, Victor Alves, Daniel Schicksnus, John R. Kitchin, Carl D. Laird, Lorenz T. Biegler

Abstract: Scientific machine learning is an emerging field that broadly describes the combination of scientific computing and machine learning to address challenges in science and engineering. Within the context of differential equations, this has produced highly influential methods, such as neural ordinary differential equations (NODEs). Recent works extend this line of research to consider neural differential-algebraic systems of equations (DAEs), where some unknown relationships within the DAE are learned from data. Training neural DAEs, similarly to neural ODEs, is computationally expensive, as it requires the solution of a DAE for every parameter update. Further, the rigorous consideration of algebraic constraints is difficult within common deep learning training algorithms such as stochastic gradient descent. In this work, we apply the simultaneous approach to neural DAE problems, resulting in a fully discretized nonlinear optimization problem, which is solved to local optimality and simultaneously obtains the neural network parameters and the solution to the corresponding DAE. We extend recent work demonstrating the simultaneous approach for neural ODEs, by presenting a general framework to solve neural DAEs, with explicit consideration of hybrid models, where some components of the DAE are known, e.g. physics-informed constraints. Furthermore, we present a general strategy for improving the performance and convergence of the nonlinear programming solver, based on solving an auxiliary problem for initialization and approximating Hessian terms. We achieve promising results in terms of accuracy, model generalizability and computational cost, across different problem settings such as sparse data, unobserved states and multiple trajectories. Lastly, we provide several promising future directions to improve the scalability and robustness of our approach.

Comment: The paper extends neural ODEs to neural DAEs with a simultaneous training approach, contributing to foundational research in scientific machine learning and hybrid modeling.

Relevance: 8 Novelty: 7


38. SnapPix: Efficient-Coding--Inspired In-Sensor Compression for Edge Vision

ArXiv ID: 2504.04535

Authors: Weikai Lin, Tianrui Ma, Adith Boloor, Yu Feng, Ruofan Xing, Xuan Zhang, Yuhao Zhu

Abstract: Energy-efficient image acquisition on the edge is crucial for enabling remote sensing applications where the sensor node has weak compute capabilities and must transmit data to a remote server/cloud for processing. To reduce the edge energy consumption, this paper proposes a sensor-algorithm co-designed system called SnapPix, which compresses raw pixels in the analog domain inside the sensor. We use coded exposure (CE) as the in-sensor compression strategy as it offers the flexibility to sample, i.e., selectively expose pixels, both spatially and temporally. SNAPPIX has three contributions. First, we propose a task-agnostic strategy to learn the sampling/exposure pattern based on the classic theory of efficient coding. Second, we co-design the downstream vision model with the exposure pattern to address the pixel-level non-uniformity unique to CE-compressed images. Finally, we propose lightweight augmentations to the image sensor hardware to support our in-sensor CE compression. Evaluating on action recognition and video reconstruction, SnapPix outperforms state-of-the-art video-based methods at the same speed while reducing the energy by up to 15.4x. We have open-sourced the code at: https://github.com/horizon-research/SnapPix.

Comment: The SnapPix system introduces an efficient in-sensor compression method inspired by efficient coding, which aligns with foundational research in model compression and efficiency.

Relevance: 8 Novelty: 7


Paper Selection Prompt

You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.

Instructions

Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.

Scoring Criteria

The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.

Relevance Scoring

Novelty Scoring

Papers

[PAPER LIST HERE]

Relevant Topics

Use the following relevance criteria to focus on foundational research. Keep relevant papers and filter out irrelevant ones. Avoid purely application-driven work.

  1. Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.

  2. Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.

  3. Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.

  4. Large Language Models (LLMs) - Relevant: Major breakthroughs in pretraining or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), finetuning or inference tricks (e.g., instruction tuning, chain-of-thoughts, data mixing), or empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).

  5. AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.

  6. Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.

Keywords: