Personalized Daily Arxiv Papers 3/25/2025

[gpt-4o]	Prompt	Completion	Total
Token	60688	8612	69300
Cost	$0.15	$0.09	$0.24

Total arXiv papers: 759

Total scanned papers: 449

Total relevant papers: 45

Table of contents with paper titles:

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry Authors: Chi-Ning Chou, Hang Le, Yichen Wang, SueYeon Chung
Intelligence Sequencing and the Path-Dependence of Intelligence Evolution: AGI-First vs. DCI-First as Irreversible Attractors Authors: Andy E. Williams
Self-Organizing Graph Reasoning Evolves into a Critical State for Continuous Discovery Through Structural-Semantic Dynamics Authors: Markus J. Buehler
Learning Multi-Level Features with Matryoshka Sparse Autoencoders Authors: Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
Reasoning to Learn from Latent Thoughts Authors: Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad Authors: Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Zeyu Zhang, Yue Huang, Kun Zhang
Decoupling Angles and Strength in Low-rank Adaptation Authors: Massimo Bini, Leander Girrbach, Zeynep Akata
Optimal Neural Compressors for the Rate-Distortion-Perception Tradeoff Authors: Eric Lei, Hamed Hassani, Shirin Saeedi Bidokhti
Improving Quantization with Post-Training Model Expansion Authors: Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott
xKV: Cross-Layer SVD for KV-Cache Compression Authors: Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Theory-to-Practice Gap for Neural Networks and Neural Operators Authors: Philipp Grohs, Samuel Lanthaler, Margaret Trautner
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache Authors: Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, Mao Yang
Variance Control via Weight Rescaling in LLM Pre-training Authors: Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model Authors: Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization Authors: Minsu Kim, Seongmin Hong, RyeoWook Ko, Soongyu Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park
Feature Qualification by Deep Nets: A Constructive Approach Authors: Feilong Cao, Shao-Bo Lin
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models Authors: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs Authors: Chang Gao, Kang Zhao, Jianfei Chen, Liping Jing
FFN Fusion: Rethinking Sequential Computation in Large Language Models Authors: Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters Authors: Roberto Garcia, Jerry Liu, Daniel Sorvisto, Sabri Eyuboglu
Efficient Knowledge Distillation via Curriculum Extraction Authors: Shivam Gupta, Sushrut Karmalkar
CODA: Repurposing Continuous VAEs for Discrete Tokenization Authors: Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang
Generative AI for Validating Physics Laws Authors: Maria Nareklishvili, Nicholas Polson, Vadim Sokolov
What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models Authors: Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan
AutoBayes: A Compositional Framework for Generalized Variational Inference Authors: Toby St Clere Smithe, Marco Perin
Does GCL Need a Large Number of Negative Samples? Enhancing Graph Contrastive Learning with Effective and Efficient Negative Sampling Authors: Yongqi Huang, Jitao Zhao, Dongxiao He, Di Jin, Yuxiao Huang, Zhen Wang
On the Minimax Regret of Sequential Probability Assignment via Square-Root Entropy Authors: Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin
Dynamic Gradient Sparse Update for Edge Training Authors: I-Hsuan Li, Tian-Sheuan Chang
Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2 Authors: Steven Abreu, Sumit Bam Shrestha, Rui-Jie Zhu, Jason Eshraghian
On the Optimality of Single-label and Multi-label Neural Network Decoders Authors: Yunus Can Gültekin, Péter Scheepers, Yuncheng Yuan, Federico Corradi, Alex Alvarado
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes Authors: Sharan Maiya, Yinhong Liu, Ramit Debnath, Anna Korhonen
Generative Modeling of Class Probability for Multi-Modal Representation Learning Authors: Jungkyoo Shin, Bumsoo Kim, Eunwoo Kim
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On Authors: Ken Ziyu Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot
Towards Human-Understandable Multi-Dimensional Concept Discovery Authors: Arne Grobrügge, Niklas Kühl, Gerhard Satzger, Philipp Spitzer
Exploring Energy Landscapes for Minimal Counterfactual Explanations: Applications in Cybersecurity and Beyond Authors: Spyridon Evangelatos, Eleni Veroni, Vasilis Efthymiou, Christos Nikolopoulos, Georgios Th. Papadopoulos, Panagiotis Sarigiannidis
TARDIS: Mitigate Temporal Misalignment via Representation Steering Authors: Changho Shin, Xinya Yan, Suenggwan Jo, Sungjun Cho, Shourjo Aditya Chaudhuri, Frederic Sala
Interpretable Feature Interaction via Statistical Self-supervised Learning on Tabular Data Authors: Xiaochen Zhang, Haoyi Xiong
Bayesian generative models can flag performance loss, bias, and out-of-distribution image content Authors: Miguel López-Pérez, Marco Miani, Valery Naranjo, Søren Hauberg, Aasa Feragen
Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer
ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently Authors: Jaeyeon Lee, Guantong Qi, Matthew Brady Neeley, Zhandong Liu, Hyun-Hwan Jeong
Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM Authors: Codefuse, Ling Team, Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, Wenjie Yang, Tong Ye, Hang Yu, Wei Zhang, Zhenduo Zhang, Hailin Zhao, Xunjin Zheng, Jun Zhou
Do Your Best and Get Enough Rest for Continual Learning Authors: Hankyul Kang, Gregor Seifer, Donghyun Lee, Jongbin Ryu
MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning Authors: Xu Han, Yuan Tang, Jinfeng Xu, Xianzhi Li
Neural Network Approach to Stochastic Dynamics for Smooth Multimodal Density Estimation Authors: Z. Zarezadeh, N. Zarezadeh

1. Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

ArXiv ID: 2503.18929

Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura

Abstract: Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
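
Illustrative sketch (not from the paper): the abstract describes a central replay buffer sampled by reward or recency, and a policy update using Trajectory Balance. The Python below is a minimal sketch under stated assumptions. It uses the standard Trajectory Balance residual from the GFlowNet literature, (log Z + log P_F - log R)^2, where the backward-policy term vanishes because a left-to-right sequence has a single parent at each step. The names `trajectory_balance_loss`, `ReplayBuffer`, and `take` are hypothetical, and deterministic top-k selection stands in for whatever sampling scheme the authors actually use.

```python
import math


def trajectory_balance_loss(log_z, token_logprobs, log_reward):
    """Squared Trajectory Balance residual for one generated sequence.

    log_z          : learned (or estimated) log partition function
    token_logprobs : per-token log-probabilities under the current policy
    log_reward     : log of the (positive) reward for the full sequence
    """
    log_pf = sum(token_logprobs)  # log-probability of the whole sequence
    return (log_z + log_pf - log_reward) ** 2


class ReplayBuffer:
    """Fixed-capacity buffer of off-policy sequences (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []
        self.clock = 0  # insertion counter used as a recency stamp

    def add(self, sequence, log_reward):
        self.entries.append(
            {"seq": sequence, "log_reward": log_reward, "t": self.clock}
        )
        self.clock += 1
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # evict the oldest entry

    def take(self, k, by="reward"):
        """Return the k entries with highest reward, or the k most recent."""
        key = "log_reward" if by == "reward" else "t"
        return sorted(self.entries, key=lambda e: e[key], reverse=True)[:k]


# Actors would push sequences here; the trainer reads from the same buffer.
buffer = ReplayBuffer(capacity=3)
buffer.add(["a"], math.log(0.1))
buffer.add(["b"], math.log(0.9))
buffer.add(["c"], math.log(0.5))
buffer.add(["d"], math.log(0.2))  # capacity exceeded: ["a"] is evicted

best = buffer.take(1, by="reward")[0]
newest = buffer.take(1, by="recency")[0]

# When the policy's sequence probability equals R / Z, the residual is zero.
loss_at_optimum = trajectory_balance_loss(
    log_z=0.0, token_logprobs=[math.log(0.9)], log_reward=math.log(0.9)
)
```

In an off-policy setting the stored sequences would be re-scored under the current policy before the loss is computed; the sketch passes the log-probabilities in directly to stay self-contained.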

Comment: Author match

2. Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry

ArXiv ID: 2503.18114

Authors: Chi-Ning Chou, Hang Le, Yichen Wang, SueYeon Chung

Abstract: The ability to integrate task-relevant information into neural representations is a fundamental aspect of both biological and artificial intelligence. To enable theoretical analysis, recent work has examined whether a network learns task-relevant features (rich learning) or resembles a random feature model (or a kernel machine, i.e., lazy learning). However, this simple lazy-versus-rich dichotomy overlooks the possibility of various subtypes of feature learning that emerge from different architectures, learning rules, and data properties. Furthermore, most existing approaches emphasize weight matrices or neural tangent kernels, limiting their applicability to neuroscience because they do not explicitly characterize representations. In this work, we introduce an analysis framework based on representational geometry to study feature learning. Instead of analyzing what are the learned features, we focus on characterizing how task-relevant representational manifolds evolve during the learning process. In both theory and experiment, we find that when a network learns features useful for solving a task, the task-relevant manifolds become increasingly untangled. Moreover, by tracking changes in the underlying manifold geometry, we uncover distinct learning stages throughout training, as well as different learning strategies associated with training hyperparameters, uncovering subtypes of feature learning beyond the lazy-versus-rich dichotomy. Applying our method to neuroscience and machine learning, we gain geometric insights into the structural inductive biases of neural circuits solving cognitive tasks and the mechanisms underlying out-of-distribution generalization in image classification. Our framework provides a novel geometric perspective for understanding and quantifying feature learning in both artificial and biological neural networks.

Comment: The paper introduces a geometric framework for analyzing feature learning, providing novel insights into representational geometry and task-relevant manifold evolution, which is highly relevant to representation learning.

Relevance: 10 Novelty: 9

3. Intelligence Sequencing and the Path-Dependence of Intelligence Evolution: AGI-First vs. DCI-First as Irreversible Attractors

ArXiv ID: 2503.17688

Authors: Andy E. Williams

Abstract: The trajectory of intelligence evolution is often framed around the emergence of artificial general intelligence (AGI) and its alignment with human values. This paper challenges that framing by introducing the concept of intelligence sequencing: the idea that the order in which AGI and decentralized collective intelligence (DCI) emerge determines the long-term attractor basin of intelligence. Using insights from dynamical systems, evolutionary game theory, and network models, it argues that intelligence follows a path-dependent, irreversible trajectory. Once development enters a centralized (AGI-first) or decentralized (DCI-first) regime, transitions become structurally infeasible due to feedback loops and resource lock-in. Intelligence attractors are modeled in functional state space as the co-navigation of conceptual and adaptive fitness spaces. Early-phase structuring constrains later dynamics, much like renormalization in physics. This has major implications for AI safety: traditional alignment assumes AGI will emerge and must be controlled after the fact, but this paper argues that intelligence sequencing is more foundational. If AGI-first architectures dominate before DCI reaches critical mass, hierarchical monopolization and existential risk become locked in. If DCI-first emerges, intelligence stabilizes around decentralized cooperative equilibrium. The paper further explores whether intelligence structurally biases itself toward an attractor based on its self-modeling method -- externally imposed axioms (favoring AGI) vs. recursive internal visualization (favoring DCI). Finally, it proposes methods to test this theory via simulations, historical lock-in case studies, and intelligence network analysis. The findings suggest that intelligence sequencing is a civilizational tipping point: determining whether the future is shaped by unbounded competition or unbounded cooperation.

Comment: The paper explores intelligence sequencing and path-dependence in intelligence evolution, introducing a novel theoretical framework. It aligns with emerging trends and challenges established assumptions.

Relevance: 9 Novelty: 9

4. Self-Organizing Graph Reasoning Evolves into a Critical State for Continuous Discovery Through Structural-Semantic Dynamics

ArXiv ID: 2503.18852

Authors: Markus J. Buehler

Abstract: We report fundamental insights into how agentic graph reasoning systems spontaneously evolve toward a critical state that sustains continuous semantic discovery. By rigorously analyzing structural (Von Neumann graph entropy) and semantic (embedding) entropy, we identify a subtle yet robust regime in which semantic entropy persistently dominates over structural entropy. This interplay is quantified by a dimensionless Critical Discovery Parameter that stabilizes at a small negative value, indicating a consistent excess of semantic entropy. Empirically, we observe a stable fraction (12%) of "surprising" edges, links between semantically distant concepts, providing evidence of long-range or cross-domain connections that drive continuous innovation. Concomitantly, the system exhibits scale-free and small-world topological features, alongside a negative cross-correlation between structural and semantic measures, reinforcing the analogy to self-organized criticality. These results establish clear parallels with critical phenomena in physical, biological, and cognitive complex systems, revealing an entropy-based principle governing adaptability and continuous innovation. Crucially, semantic richness emerges as the underlying driver of sustained exploration, despite not being explicitly used by the reasoning process. Our findings provide interdisciplinary insights and practical strategies for engineering intelligent systems with intrinsic capacities for long-term discovery and adaptation, and offer insights into how model training strategies can be developed that reinforce critical discovery.

Comment: This paper provides theoretical insights into self-organizing graph reasoning systems and their evolution into a critical state, which aligns with emerging trends and foundational research. The entropy-based principle governing adaptability and innovation is novel and interdisciplinary.

Relevance: 9 Novelty: 9

5. Learning Multi-Level Features with Matryoshka Sparse Autoencoders

ArXiv ID: 2503.17547

Authors: Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda

Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e. number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically - the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
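
Illustrative sketch (not from the paper): the nested-dictionary idea in the abstract can be read as summing reconstruction losses over prefixes of a single latent vector, so that each smaller dictionary must reconstruct the input without help from the later latents. The Python below is a minimal sketch assuming a ReLU encoder, a bias-free linear decoder, and squared error. The function names and `prefix_sizes` are hypothetical, and the sparsity mechanism and other training details of the actual method are omitted.

```python
def encode(x, enc_weights, enc_bias):
    """ReLU encoder: one row of enc_weights per latent."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def decode_prefix(latents, dec_weights, m):
    """Reconstruct using only the first m latents (the nested dictionary)."""
    dim = len(dec_weights[0])
    out = [0.0] * dim
    for j in range(m):
        for d in range(dim):
            out[d] += latents[j] * dec_weights[j][d]
    return out


def matryoshka_loss(x, enc_weights, enc_bias, dec_weights, prefix_sizes):
    """Sum of squared reconstruction errors over all nested prefix sizes."""
    latents = encode(x, enc_weights, enc_bias)
    total = 0.0
    for m in prefix_sizes:
        recon = decode_prefix(latents, dec_weights, m)
        total += sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return total


# Toy example: two latents with identity encoder and decoder.
x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = matryoshka_loss(x, identity, [0.0, 0.0], identity, prefix_sizes=[1, 2])
# Prefix of size 1 reconstructs [1.0, 0.0] (error 4.0); size 2 is exact.
```

Because the size-1 prefix is penalized on its own, the first latent is pushed to carry as much of the input as it can, which is the pressure toward general features in the smaller dictionaries that the abstract describes.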

Comment: The introduction of Matryoshka Sparse Autoencoders directly contributes to representation learning by addressing hierarchical feature learning and disentanglement, which is highly relevant.