Personalized Daily Arxiv Papers 02/13/2025
| Prompt | Completion | Total | |
|---|---|---|---|
| Token | 119253 | 9699 | 128952 |
| Cost | $0.3 | $0.1 | $0.4 |
Total scanned papers: 376
Total relevant papers: 39
Table of contents with paper titles:
-
Monte Carlo Tree Diffusion for System 2 Planning Authors: Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, Sungjin Ahn
-
Mixture of Decoupled Message Passing Experts with Entropy Constraint for General Node Classification Authors: Xuanze Chen, Jiajun Zhou, Jinsong Chen, Shanqing Yu, Qi Xuan
-
Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach Authors: Xu Zhang, Kaidi Xu, Ziqing Hu, Ren Wang
-
Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline Authors: Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng
-
Online Scheduling for LLM Inference with KV Cache Constraints Authors: Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou
-
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving Authors: Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, Chi Jin
-
Unsupervised categorization of similarity measures Authors: Yoshiyuki Ohmura, Wataru Shimaya, Yasuo Kuniyoshi
-
LUNAR: LLM Unlearning via Neural Activation Redirection Authors: William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane
-
LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits Authors: Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun
-
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
-
RomanLens: Latent Romanization and its role in Multilinguality in LLMs Authors: Alan Saji (Nilekani Centre at AI4Bharat), Jaavid Aktar Husain (Singapore University of Technology and Design), Thanmay Jayakumar (Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India), Raj Dabre (Nilekani Centre at AI4Bharat, Indian Institute of Technology Bombay, India), Anoop Kunchukuttan (Nilekani Centre at AI4Bharat, Microsoft, India), Mitesh M. Khapra (Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India), Ratish Puduppully (IT University of Copenhagen)
-
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning Authors: Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, Shaul Druckmann
-
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
-
Training-Free Restoration of Pruned Neural Networks Authors: Keonho Lee, Minsoo Kim, Dong-Wan Choi
-
Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning Authors: Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He
-
Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty Authors: Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun
-
Scalable Thermodynamic Second-order Optimization Authors: Kaelan Donatella, Samuel Duffield, Denis Melanson, Maxwell Aifer, Phoebe Klett, Rajath Salegame, Zach Belateche, Gavin Crooks, Antonio J. Martinez, Patrick J. Coles
-
Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization Authors: Yuqiao Wen, Yanshuai Cao, Lili Mou
-
LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-Tuning Authors: Zhekai Du, Yinjie Min, Jingjing Li, Ke Lu, Changliang Zou, Liuhua Peng, Tingjin Chu, Mingming Gong
-
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs Authors: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
-
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution Authors: Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A. B. Siddique
-
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension Authors: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
-
On Mechanistic Circuits for Extractive Question-Answering Authors: Samyadeep Basu, Vlad Morariu, Zichao Wang, Ryan Rossi, Cherry Zhao, Soheil Feizi, Varun Manjunatha
-
Can Large Language Models Understand Intermediate Representations? Authors: Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
-
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
-
Numerical Schemes for Signature Kernels Authors: Thomas Cass, Francesco Piatti, Jeffrey Pei
-
Enabling Autoregressive Models to Fill In Masked Tokens Authors: Daniel Israel, Aditya Grover, Guy Van den Broeck
-
What is a Sketch-and-Precondition Derivation for Low-Rank Approximation? Inverse Power Error or Inverse Power Estimation? Authors: Ruihan Xu, Yiping Lu
-
Harnessing Language's Fractal Geometry with Recursive Inference Scaling Authors: Ibrahim Alabdulmohsin, Xiaohua Zhai
-
The Observational Partial Order of Causal Structures with Latent Variables Authors: Marina Maciel Ansanelli, Elie Wolfe, Robert W. Spekkens
-
ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval Authors: Shubham Gupta, Zichao Li, Tianyi Chen, Cem Subakan, Siva Reddy, Perouz Taslakian, Valentina Zantedeschi
-
Loss Landscape Analysis for Reliable Quantized ML Models for Scientific Sensing Authors: Tommaso Baldi, Javier Campos, Olivia Weng, Caleb Geniesse, Nhan Tran, Ryan Kastner, Alessandro Biondi
-
Gradient Based Method for the Fusion of Lattice Quantizers Authors: Liyuan Zhang, Hanzhong Cao, Jiaheng Li, Minyang Yu
-
Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators Authors: Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim, Jong Hwan Ko
-
No Data, No Optimization: A Lightweight Method To Disrupt Neural Networks With Sign-Flips Authors: Ido Galil, Moshe Kimhi, Ran El-Yaniv
-
Automated Consistency Analysis of LLMs Authors: Aditya Patwardhan, Vivek Vaidya, Ashish Kundu
-
When More is Less: Understanding Chain-of-Thought Length in LLMs Authors: Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, Yisen Wang
-
EdgeEar: Efficient and Accurate Ear Recognition for Edge Devices Authors: Camile Lendering, Bernardo Perrone Ribeiro, \v{Z}iga Emer\v{s}i\v{c}, Peter Peer
-
Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task Authors: Junjie Wu, Mo Yu, Lemao Liu, Dit-Yan Yeung, Jie Zhou
1. Monte Carlo Tree Diffusion for System 2 Planning
ArXiv ID: 2502.07202
Authors: Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, Sungjin Ahn
Abstract: Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS)-whose performance naturally improves with additional test-time computation (TTC), standard diffusion-based planners offer only limited avenues for TTC scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS such as controlling exploration-exploitation trade-offs within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as TTC increases.
Comment: Author match
2. Mixture of Decoupled Message Passing Experts with Entropy Constraint for General Node Classification
ArXiv ID: 2502.08083
Authors: Xuanze Chen, Jiajun Zhou, Jinsong Chen, Shanqing Yu, Qi Xuan
Abstract: The varying degrees of homophily and heterophily in real-world graphs persistently constrain the universality of graph neural networks (GNNs) for node classification. Adopting a data-centric perspective, this work reveals an inherent preference of different graphs towards distinct message encoding schemes: homophilous graphs favor local propagation, while heterophilous graphs exhibit preference for flexible combinations of propagation and transformation. To address this, we propose GNNMoE, a universal node classification framework based on the Mixture-of-Experts (MoE) mechanism. The framework first constructs diverse message-passing experts through recombination of fine-grained encoding operators, then designs soft and hard gating layers to allocate the most suitable expert networks for each node's representation learning, thereby enhancing both model expressiveness and adaptability to diverse graphs. Furthermore, considering that soft gating might introduce encoding noise in homophilous scenarios, we introduce an entropy constraint to guide sharpening of soft gates, achieving organic integration of weighted combination and Top-K selection. Extensive experiments demonstrate that GNNMoE significantly outperforms mainstream GNNs, heterophilous GNNs, and graph transformers in both node classification performance and universality across diverse graph datasets.
Comment: The paper proposes a Mixture-of-Experts (MoE) framework for node classification, which is highly relevant to model architecture and MoE research. The entropy constraint adds a novel perspective to MoE design.
Relevance: 10 Novelty: 8
3. Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach
ArXiv ID: 2502.06832
Authors: Xu Zhang, Kaidi Xu, Ziqing Hu, Ren Wang
Abstract: Mixture of Experts (MoE) have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy JTDMoE for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods.
Comment: The paper addresses robustness in Mixture of Experts (MoE) models, which directly aligns with the model architecture criterion. The dual-model approach and robustness bounds are novel contributions.
Relevance: 10 Novelty: 8
4. Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
ArXiv ID: 2502.06888
Authors: Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng
Abstract: Mixture of Experts (MoE), with its distinctive sparse structure, enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. However, the substantial parameter size presents a challenge for inference, as the expansion in GPU memory cannot keep pace with the growth in parameters. Although offloading techniques utilise memory from the CPU and disk and parallelise the I/O and computation for efficiency, the computation for each expert in MoE models is often less than the I/O, resulting in numerous bubbles in the pipeline. Therefore, we propose Klotski, an efficient MoE inference engine that significantly reduces pipeline bubbles through a novel expert-aware multi-batch pipeline paradigm. The proposed paradigm uses batch processing to extend the computation time of the current layer to overlap with the loading time of the next layer. Although this idea has been effectively applied to dense models, more batches may activate more experts in the MoE, leading to longer loading times and more bubbles. Thus, unlike traditional approaches, we balance computation and I/O time and minimise bubbles by orchestrating their inference orders based on their heterogeneous computation and I/O requirements and activation patterns under different batch numbers. Moreover, to adapt to different hardware environments and models, we design a constraint-sensitive I/O-compute planner and a correlation-aware expert prefetcher for a schedule that minimises pipeline bubbles. Experimental results demonstrate that Klotski achieves a superior throughput-latency trade-off compared to state-of-the-art techniques, with throughput improvements of up to 85.12x.
Comment: The paper proposes Klotski, an efficient MoE inference engine, which directly aligns with the core topic of Mixture-of-Experts and efficiency improvements.
Relevance: 10 Novelty: 8
5. Online Scheduling for LLM Inference with KV Cache Constraints
ArXiv ID: 2502.07115
Authors: Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou
Abstract: Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose novel batching and scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory. We analyze both semi-online and fully online scheduling models, and our results are threefold. First, we provide a polynomial-time algorithm that achieves exact optimality in terms of average latency in the semi-online prompt arrival model. Second, in the fully online case with a stochastic prompt arrival, we introduce an efficient online scheduling algorithm with constant regret. Third, we prove that no algorithm (deterministic or randomized) can achieve a constant competitive ratio in fully online adversarial settings. Our empirical evaluations on a public LLM inference dataset, using the Llama-70B model on A100 GPUs, show that our approach significantly outperforms benchmark algorithms used currently in practice, achieving lower latency while reducing energy consumption. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.
Comment: The paper addresses KV cache constraints in LLM inference, which is directly relevant to model compression and efficiency. The theoretical scheduling algorithms and empirical results add novelty.
Relevance: 9 Novelty: 8
6. Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
ArXiv ID: 2502.07640
Authors: Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, Chi Jin
Abstract: We introduce Goedel-Prover, an open-source large language model (LLM) that achieves the state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems. The key challenge in this field is the scarcity of formalized math statements and proofs, which we tackle in the following ways. We train statement formalizers to translate the natural language math problems from Numina into formal language (Lean 4), creating a dataset of 1.64 million formal statements. LLMs are used to check that the formal statements accurately preserve the content of the original natural language problems. We then iteratively build a large dataset of formal proofs by training a series of provers. Each prover succeeds in proving many statements that the previous ones could not, and these new proofs are added to the training set for the next prover. The final prover outperforms all existing open-source models in whole-proof generation. On the miniF2F benchmark, it achieves a 57.6% success rate (Pass@32), exceeding the previous best open-source model by 7.6%. On PutnamBench, Goedel-Prover successfully solves 7 problems (Pass@512), ranking first on the leaderboard. Furthermore, it generates 29.7K formal proofs for Lean Workbook problems, nearly doubling the 15.7K produced by earlier works.
Comment: The paper introduces Goedel-Prover, a large language model for automated theorem proving. It contributes to foundational research in LLMs by addressing dataset creation and iterative training for formal proof generation.
Relevance: 9 Novelty: 8
7. Unsupervised categorization of similarity measures
ArXiv ID: 2502.08098
Authors: Yoshiyuki Ohmura, Wataru Shimaya, Yasuo Kuniyoshi
Abstract: In general, objects can be distinguished on the basis of their features, such as color or shape. In particular, it is assumed that similarity judgments about such features can be processed independently in different metric spaces. However, the unsupervised categorization mechanism of metric spaces corresponding to object features remains unknown. Here, we show that the artificial neural network system can autonomously categorize metric spaces through representation learning to satisfy the algebraic independence between neural networks, and project sensory information onto multiple high-dimensional metric spaces to independently evaluate the differences and similarities between features. Conventional methods often constrain the axes of the latent space to be mutually independent or orthogonal. However, the independent axes are not suitable for categorizing metric spaces. High-dimensional metric spaces that are independent of each other are not uniquely determined by the mutually independent axes, because any combination of independent axes can form mutually independent spaces. In other words, the mutually independent axes cannot be used to naturally categorize different feature spaces, such as color space and shape space. Therefore, constraining the axes to be mutually independent makes it difficult to categorize high-dimensional metric spaces. To overcome this problem, we developed a method to constrain only the spaces to be mutually independent and not the composed axes to be independent. Our theory provides general conditions for the unsupervised categorization of independent metric spaces, thus advancing the mathematical theory of functional differentiation of neural networks.
Comment: The paper explores unsupervised categorization of similarity measures through representation learning, which aligns with foundational research in representation learning. The focus on independent metric spaces is novel.
Relevance: 9 Novelty: 8
8. LUNAR: LLM Unlearning via Neural Activation Redirection
ArXiv ID: 2502.07218
Authors: William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane
Abstract: Large Language Models (LLMs) benefit from training on ever larger amounts of textual data, but as a result, they increasingly incur the risk of leaking private information. The ability to selectively remove knowledge from LLMs is, therefore, a highly desirable capability. In this paper, we propose LUNAR, a novel unlearning methodology grounded in the Linear Representation Hypothesis. LUNAR operates by redirecting the representations of unlearned data to regions that trigger the model's inherent ability to express its inability to answer. LUNAR achieves state-of-the-art unlearning performance while significantly enhancing the controllability of the unlearned model during inference. Specifically, LUNAR achieves between 2.9x to 11.7x improvements on combined "unlearning efficacy" and "model utility" score ("Deviation Score") on the PISTOL dataset across various base models. We also demonstrate, through quantitative analysis and qualitative examples, LUNAR's superior controllability in generating coherent and contextually aware responses, mitigating undesired side effects of existing methods. Moreover, we demonstrate that LUNAR is robust against white-box adversarial attacks and versatile in handling real-world scenarios, such as processing sequential unlearning requests.
Comment: The paper introduces a novel unlearning methodology for LLMs, which aligns with the criterion of theoretical insights into LLM behavior. The use of neural activation redirection is innovative.
Relevance: 9 Novelty: 8
9. LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits
ArXiv ID: 2502.08141
Authors: Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun
Abstract: Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization - mapping, threshold selection, and precision assignment - while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
Comment: The paper introduces LowRA, a framework for ultra-low-bit LoRA fine-tuning of LLMs, which aligns with the model compression criterion, specifically quantization and efficiency breakthroughs.
Relevance: 9 Novelty: 8
10. LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
ArXiv ID: 2502.07563
Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
Abstract: Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism, limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences. Compared to previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers, reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements of both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The Code is released as a part of: https://github.com/OpenSparseLLMs/Linear-MoE.
Comment: LASP-2 proposes a new sequence parallelism method for linear attention, which aligns with architectural innovations and efficiency improvements in transformer models.
Relevance: 9 Novelty: 8
11. RomanLens: Latent Romanization and its role in Multilinguality in LLMs
ArXiv ID: 2502.07424
Authors: Alan Saji (Nilekani Centre at AI4Bharat), Jaavid Aktar Husain (Singapore University of Technology and Design), Thanmay Jayakumar (Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India), Raj Dabre (Nilekani Centre at AI4Bharat, Indian Institute of Technology Bombay, India), Anoop Kunchukuttan (Nilekani Centre at AI4Bharat, Microsoft, India), Mitesh M. Khapra (Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India), Ratish Puduppully (IT University of Copenhagen)
Abstract: Large Language Models (LLMs) exhibit remarkable multilingual generalization despite being predominantly trained on English-centric corpora. A fundamental question arises: how do LLMs achieve such robust multilingual capabilities? For non-Latin script languages, we investigate the role of romanization - the representation of non-Latin scripts using Latin characters - as a bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and romanized scripts, suggesting a shared underlying representation. Additionally in translation towards non Latin languages, our findings reveal that when the target language is in romanized form, its representations emerge earlier in the model's layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of romanization in facilitating language transfer. Our work provides new directions for potentially improving multilingual language modeling and interpretability.
Comment: The paper provides theoretical insights into multilingual representation in LLMs, specifically the role of latent romanization, which aligns with the criterion of understanding LLM behavior and interpretability.
Relevance: 9 Novelty: 8
12. Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning
ArXiv ID: 2502.07154
Authors: Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, Shaul Druckmann
Abstract: Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in $N$ independent samples. We show, surprisingly, that training with cross-entropy (CE) loss can be ${\it misaligned}$ with pass@N in that pass@N accuracy ${\it decreases}$ with longer training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions; and (2) proving theorems by searching over proof trees of varying shapes. Overall our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
Comment: The paper addresses training misalignment in LLMs for mathematical reasoning, proposing a novel loss function to improve test-time performance. This aligns with the criterion of theoretical insights into LLM behavior.
Relevance: 9 Novelty: 8
13. LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
ArXiv ID: 2502.07374
Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Abstract: Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a Large Language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive to the proprietary o1-preview model's score of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers still achieves only 3.2% lower accuracy compared to training with fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper of our previous released Sky-T1-32B-Preview model. Codes are available at https://github.com/NovaSky-AI/SkyThought.
Comment: The paper provides insights into how LLMs learn reasoning structures, emphasizing the importance of structure over content in Chain-of-Thought reasoning. This aligns with foundational research on LLM behavior and training dynamics.
Relevance: 9 Novelty: 8
14. Training-Free Restoration of Pruned Neural Networks
ArXiv ID: 2502.08474
Authors: Keonho Lee, Minsoo Kim, Dong-Wan Choi
Abstract: Although network pruning has been highly popularized to compress deep neural networks, its resulting accuracy heavily depends on a fine-tuning process that is often computationally expensive and requires the original data. However, this may not be the case in real-world scenarios, and hence a few recent works attempt to restore pruned networks without any expensive retraining process. Their strong assumption is that every neuron being pruned can be replaced with another one quite similar to it, but unfortunately this does not hold in many neural networks, where the similarity between neurons is extremely low in some layers. In this article, we propose a more rigorous and robust method of restoring pruned networks in a fine-tuning free and data-free manner, called LBYL (Leave Before You Leave). LBYL significantly relaxes the aforementioned assumption in a way that each pruned neuron leaves its pieces of information to as many preserved neurons as possible and thereby multiple neurons together obtain a more robust approximation to the original output of the neuron who just left. Our method is based on a theoretical analysis on how to formulate the reconstruction error between the original network and its approximation, which nicely leads to a closed form solution for our derived loss function. Through the extensive experiments, LBYL is confirmed to be indeed more effective to approximate the original network and consequently able to achieve higher accuracy for restored networks, compared to the recent approaches exploiting the similarity between two neurons. The very first version of this work, which contains major technical and theoretical components, was submitted to NeurIPS 2021 and ICML 2022.
Comment: The paper proposes a training-free method for restoring pruned neural networks, which aligns with foundational research on model compression and sparsity.
Relevance: 9 Novelty: 8
15. Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning
ArXiv ID: 2502.08482
Authors: Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He
Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language model's reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at https://github.com/qifanyu/RELAY.
Comment: The paper introduces a novel approach to enhance Chain-of-Thought reasoning in LLMs using loop-aligned reasoning, contributing to foundational research on reasoning dynamics in LLMs.
Relevance: 9 Novelty: 8
16. Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty
ArXiv ID: 2502.06905
Authors: Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun
Abstract: Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model with a full dataset over a large number of epochs before being able to prune the dataset, which ironically makes the pruning process more expensive than just training the model on the entire dataset. To overcome this limitation, we introduce a Difficulty and Uncertainty-Aware Lightweight (DUAL) score, which aims to identify important samples from the early training stage by considering both example difficulty and prediction uncertainty. To address a catastrophic accuracy drop at an extreme pruning, we further propose a ratio-adaptive sampling using Beta distribution. Experiments on various datasets and learning scenarios such as image classification with label noise and image corruption, and model architecture generalization demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66% compared to previous methods while achieving a SOTA, specifically 60% test accuracy at a 90% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15% while maintaining SOTA performance.
Comment: The paper introduces a novel dataset pruning method based on difficulty and uncertainty, aligning with the model compression criterion.
Relevance: 9 Novelty: 8
17. Scalable Thermodynamic Second-order Optimization
ArXiv ID: 2502.08603
Authors: Kaelan Donatella, Samuel Duffield, Denis Melanson, Maxwell Aifer, Phoebe Klett, Rajath Salegame, Zach Belateche, Gavin Crooks, Antonio J. Martinez, Patrick J. Coles
Abstract: Many hardware proposals have aimed to accelerate inference in AI workloads. Less attention has been paid to hardware acceleration of training, despite the enormous societal impact of rapid training of AI models. Physics-based computers, such as thermodynamic computers, offer an efficient means to solve key primitives in AI training algorithms. Optimizers that normally would be computationally out-of-reach (e.g., due to expensive matrix inversions) on digital hardware could be unlocked with physics-based hardware. In this work, we propose a scalable algorithm for employing thermodynamic computers to accelerate a popular second-order optimizer called Kronecker-factored approximate curvature (K-FAC). Our asymptotic complexity analysis predicts increasing advantage with our algorithm as $n$, the number of neurons per layer, increases. Numerical experiments show that even under significant quantization noise, the benefits of second-order optimization can be preserved. Finally, we predict substantial speedups for large-scale vision and graph problems based on realistic hardware characteristics.
Comment: The paper proposes a scalable second-order optimization method leveraging thermodynamic computers, which aligns with model efficiency and optimization breakthroughs.
Relevance: 9 Novelty: 8
18. Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization
ArXiv ID: 2502.06844
Authors: Yuqiao Wen, Yanshuai Cao, Lili Mou
Abstract: Large language models have been increasing in size due to their success in a wide range of applications. This calls for a pressing need to reduce memory usage to make them more accessible. Post-training quantization is a popular technique which uses fewer bits (e.g., 4--8 bits) to represent the model without retraining it. However, it remains a challenging task to perform quantization in an ultra-low-bit setup (e.g., 2 bits). In this paper, we propose InvarExplore, a unified framework that systematically explores different model invariance at the same time, allowing us to take advantage of the synergy between each type of invariance. Importantly, InvarExplore features a discrete search algorithm that enables us to explore permutation invariance, which is under-studied as it cannot be optimized with gradient-based methods. Results show that InvarExplore is compatible with existing state-of-the-art methods, achieving an add-on performance improvement over strong competing methods.
Comment: The paper proposes a framework for ultra-low-bit quantization, which aligns with model compression and efficiency improvements. The discrete search algorithm for permutation invariance is a novel contribution.
Relevance: 9 Novelty: 8
19. LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-Tuning
ArXiv ID: 2502.06820
Authors: Zhekai Du, Yinjie Min, Jingjing Li, Ke Lu, Changliang Zou, Liuhua Peng, Tingjin Chu, Mingming Gong
Abstract: Low-rank adaptation (LoRA) has become a prevalent method for adapting pre-trained large language models to downstream tasks. However, the simple low-rank decomposition form may constrain the hypothesis space. To address this limitation, we introduce Location-aware Cosine Adaptation (LoCA), a novel frequency-domain parameter-efficient fine-tuning method based on inverse Discrete Cosine Transform (iDCT) with selective locations of learnable components. We begin with a comprehensive theoretical comparison between frequency-domain and low-rank decompositions for fine-tuning pre-trained large models. Our analysis reveals that frequency-domain approximation with carefully selected frequency components can surpass the expressivity of traditional low-rank-based methods. Furthermore, we demonstrate that iDCT offers a more efficient implementation compared to inverse Discrete Fourier Transform (iDFT), allowing for better selection and tuning of frequency components while maintaining equivalent expressivity to the optimal iDFT-based adaptation. By employing finite-difference approximation to estimate gradients for discrete locations of learnable coefficients on the DCT spectrum, LoCA dynamically selects the most informative frequency components during training. Experiments on diverse language and vision fine-tuning tasks demonstrate that LoCA offers enhanced parameter efficiency while maintains computational feasibility comparable to low-rank-based methods.
Comment: The paper introduces a novel frequency-domain parameter-efficient fine-tuning method (LoCA) that builds on low-rank adaptation (LoRA). This aligns with the 'Model Compression' criterion, particularly in low-rank approaches and efficiency breakthroughs.
Relevance: 9 Novelty: 8
20. Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
ArXiv ID: 2502.08640
Authors: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
Abstract: As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Comment: The paper explores emergent value systems in LLMs and proposes a new research agenda called utility engineering. This aligns with the 'Large Language Models (LLMs)' criterion, focusing on theoretical insights into LLM behavior and interpretability.
Relevance: 9 Novelty: 8
21. Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
ArXiv ID: 2502.06809
Authors: Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A. B. Siddique
Abstract: Interpreting and controlling the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Recent efforts have primarily focused on identifying and manipulating neurons by establishing discrete mappings between neurons and semantic concepts. However, such mappings struggle to handle the inherent polysemanticity in LLMs, where individual neurons encode multiple, distinct concepts. This makes precise control challenging and complicates downstream interventions. Through an in-depth analysis of both encoder and decoder-based LLMs across multiple text classification datasets, we uncover that while individual neurons encode multiple concepts, their activation magnitudes vary across concepts in distinct, Gaussian-like patterns. Building on this insight, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that provides a finer view of neuron activation distributions to localize concept attribution within a neuron. Extensive empirical evaluations demonstrate that NeuronLens significantly reduces unintended interference, while maintaining precise control for manipulation of targeted concepts, outperforming existing methods.
Comment: The paper introduces a novel framework (NeuronLens) for interpreting and manipulating neuron activations in LLMs, addressing polysemanticity. This aligns with the 'Large Language Models (LLMs)' criterion, focusing on interpretability and internal mechanisms.
Relevance: 9 Novelty: 8
22. Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
ArXiv ID: 2502.07752
Authors: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
Abstract: Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper makes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations of practical efficient optimizers for LLMs, involving the careful selection of structural assumptions to balance generality and efficiency, and enhancing memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves better than 2x faster convergence over Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.
Comment: The paper introduces efficient optimizers for LLMs using structured Fisher approximation with a low-rank extension. This aligns with foundational research in model efficiency and optimization, making it highly relevant.
Relevance: 9 Novelty: 8
23. On Mechanistic Circuits for Extractive Question-Answering
ArXiv ID: 2502.08059
Authors: Samyadeep Basu, Vlad Morariu, Zichao Wang, Ryan Rossi, Cherry Zhao, Soheil Feizi, Varun Manjunatha
Abstract: Large language models are increasingly used to process documents and facilitate question-answering on them. In our paper, we extract mechanistic circuits for this real-world language modeling task: context-augmented language modeling for extractive question-answering (QA) tasks and understand the potential benefits of circuits towards downstream applications such as data attribution to context information. We extract circuits as a function of internal model components (e.g., attention heads, MLPs) using causal mediation analysis techniques. Leveraging the extracted circuits, we first understand the interplay between the model's usage of parametric memory and retrieved context towards a better mechanistic understanding of context-augmented language models. We then identify a small set of attention heads in our circuit which performs reliable data attribution by default, thereby obtaining attribution for free in just the model's forward pass. Using this insight, we then introduce ATTNATTRIB, a fast data attribution algorithm which obtains state-of-the-art attribution results across various extractive QA benchmarks. Finally, we show the possibility to steer the language model towards answering from the context, instead of the parametric memory by using the attribution from ATTNATTRIB as an additional signal during the forward pass. Beyond mechanistic understanding, our paper provides tangible applications of circuits in the form of reliable data attribution and model steering.
Comment: The paper explores mechanistic circuits in extractive QA tasks, providing insights into the interplay between parametric memory and retrieved context. It aligns with foundational research in understanding LLM behavior and interpretability.
Relevance: 9 Novelty: 8
24. Can Large Language Models Understand Intermediate Representations?
ArXiv ID: 2502.06854
Authors: Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
Abstract: Intermediate Representations (IRs) are essential in compiler design and program analysis, yet their comprehension by Large Language Models (LLMs) remains underexplored. This paper presents a pioneering empirical study to investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA 3.1, and Code Llama, in understanding IRs. We analyze their performance across four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code summarization, and execution reasoning. Our results indicate that while LLMs demonstrate competence in parsing IR syntax and recognizing high-level structures, they struggle with control flow reasoning, execution semantics, and loop handling. Specifically, they often misinterpret branching instructions, omit critical IR operations, and rely on heuristic-based reasoning, leading to errors in CFG reconstruction, IR decompilation, and execution reasoning. The study underscores the necessity for IR-specific enhancements in LLMs, recommending fine-tuning on structured IR datasets and integration of explicit control flow models to augment their comprehension and handling of IR-related tasks.
Comment: The paper investigates LLMs' understanding of intermediate representations, which is relevant to foundational research in LLM behavior and interpretability. The focus on control flow and execution reasoning adds depth.
Relevance: 9 Novelty: 7
25. Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
ArXiv ID: 2502.07445
Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
Abstract: Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings indicating that both cases may overrely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.
Comment: The paper critiques LLM evaluation methods and introduces a meta-evaluation framework to detect overfitting. This aligns with the criterion of theoretical insights into LLM behavior.
Relevance: 9 Novelty: 7
26. Numerical Schemes for Signature Kernels
ArXiv ID: 2502.08470
Authors: Thomas Cass, Francesco Piatti, Jeffrey Pei
Abstract: Signature kernels have emerged as a powerful tool within kernel methods for sequential data. In the paper "The Signature Kernel is the solution of a Goursat PDE", the authors identify a kernel trick that demonstrates that, for continuously differentiable paths, the signature kernel satisfies a Goursat problem for a hyperbolic partial differential equation (PDE) in two independent time variables. While finite difference methods have been explored for this PDE, they face limitations in accuracy and stability when handling highly oscillatory inputs. In this work, we introduce two advanced numerical schemes that leverage polynomial representations of boundary conditions through either approximation or interpolation techniques, and rigorously establish the theoretical convergence of the polynomial approximation scheme. Experimental evaluations reveal that our approaches yield improvements of several orders of magnitude in mean absolute percentage error (MAPE) compared to traditional finite difference schemes, without increasing computational complexity. Furthermore, like finite difference methods, our algorithms can be GPU-parallelized to reduce computational complexity from quadratic to linear in the length of the input sequences, thereby improving scalability for high-frequency data. We have implemented these algorithms in a dedicated Python library, which is publicly available at: https://github.com/FrancescoPiatti/polysigkernel.
Comment: The paper introduces advanced numerical schemes for signature kernels, which are relevant to representation learning and efficiency. The theoretical convergence and GPU-parallelization aspects enhance its foundational contribution.
Relevance: 8 Novelty: 8
27. Enabling Autoregressive Models to Fill In Masked Tokens
ArXiv ID: 2502.06901
Authors: Daniel Israel, Aditya Grover, Guy Van den Broeck
Abstract: Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently incapable of masked infilling, which is the ability to predict masked tokens between past and future context. In contrast, MLM models suffer from intrinsic computational inefficiencies during both training and inference that hinder their scalability. This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that leverages the strengths of both paradigms to achieve state-of-the-art masked infilling performance. MARIA combines a pre-trained MLM and AR model by training a linear decoder that takes their concatenated hidden states as input. This minimal modification enables the AR model to perform infilling while retaining its inherent advantages in terms of faster inference with KV caching. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
Comment: MARIA introduces a novel approach combining AR and MLM objectives for masked infilling, which aligns with foundational research in LLM behavior and architecture innovations.
Relevance: 8 Novelty: 8
28. What is a Sketch-and-Precondition Derivation for Low-Rank Approximation? Inverse Power Error or Inverse Power Estimation?
ArXiv ID: 2502.07993
Authors: Ruihan Xu, Yiping Lu
Abstract: Randomized sketching accelerates large-scale numerical linear algebra by reducing computa- tional complexity. While the traditional sketch-and-solve approach reduces the problem size di- rectly through sketching, the sketch-and-precondition method leverages sketching to construct a computational friendly preconditioner. This preconditioner improves the convergence speed of iterative solvers applied to the original problem, maintaining accuracy in the full space. Further- more, the convergence rate of the solver improves at least linearly with the sketch size. Despite its potential, developing a sketch-and-precondition framework for randomized algorithms in low- rank matrix approximation remains an open challenge. We introduce the Error-Powered Sketched Inverse Iteration (EPSI) Method via run sketched Newton iteration for the Lagrange form as a sketch-and-precondition variant for randomized low-rank approximation. Our method achieves theoretical guarantees, including a convergence rate that improves at least linearly with the sketch size.
Comment: The paper introduces a novel sketch-and-precondition framework for low-rank approximation, which aligns with the criterion of foundational research in model compression and efficiency.
Relevance: 8 Novelty: 8
29. Harnessing Language's Fractal Geometry with Recursive Inference Scaling
ArXiv ID: 2502.07503
Authors: Ibrahim Alabdulmohsin, Xiaohua Zhai
Abstract: Recent research in language modeling reveals two scaling effects: the well-known improvement from increased training compute, and a lesser-known boost from applying more sophisticated or computationally intensive inference methods. Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time. For a given fixed model architecture and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. These advantages are maintained even when compared to state-of-the-art recursive techniques like the "repeat-all-over" (RAO) strategy in Mobile LLM. Finally, stochastic RINS not only can enhance performance further but also provides the flexibility to optionally forgo increased inference computation at test time with minimal performance degradation.
Comment: The paper introduces a novel inference scaling method inspired by fractal geometry, which aligns with the criterion of emerging trends in foundational research for LLMs.
Relevance: 8 Novelty: 8
30. The Observational Partial Order of Causal Structures with Latent Variables
ArXiv ID: 2502.07891
Authors: Marina Maciel Ansanelli, Elie Wolfe, Robert W. Spekkens
Abstract: For two causal structures with the same set of visible variables, one is said to observationally dominate the other if the set of distributions over the visible variables realizable by the first contains the set of distributions over the visible variables realizable by the second. Knowing such dominance relations is useful for adjudicating between these structures given observational data. We here consider the problem of determining the partial order of equivalence classes of causal structures with latent variables relative to observational dominance. We provide a complete characterization of the dominance order in the case of three visible variables, and a partial characterization in the case of four visible variables. Our techniques also help to identify which observational equivalence classes have a set of realizable distributions that is characterized by nontrivial inequality constraints, analogous to Bell inequalities and instrumental inequalities. We find evidence that as one increases the number of visible variables, the equivalence classes satisfying nontrivial inequality constraints become ubiquitous. (Because such classes are the ones for which there can be a difference in the distributions that are quantumly and classically realizable, this implies that the potential for quantum-classical gaps is also ubiquitous.) Furthermore, we find evidence that constraint-based causal discovery algorithms that rely solely on conditional independence constraints have a significantly weaker distinguishing power among observational equivalence classes than algorithms that go beyond these (i.e., algorithms that also leverage nested Markov constraints and inequality constraints).
Comment: The paper provides a theoretical analysis of causal structures with latent variables, which aligns with emerging trends in foundational research.
Relevance: 8 Novelty: 8
31. ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval
ArXiv ID: 2502.07971
Authors: Shubham Gupta, Zichao Li, Tianyi Chen, Cem Subakan, Siva Reddy, Perouz Taslakian, Valentina Zantedeschi
Abstract: Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
Comment: The paper introduces a tree-based hierarchical representation for document retrieval, which aligns with representation learning and efficiency improvements. The hierarchical structure and optimization for retrieval performance are novel.
Relevance: 8 Novelty: 8
32. Loss Landscape Analysis for Reliable Quantized ML Models for Scientific Sensing
ArXiv ID: 2502.08355
Authors: Tommaso Baldi, Javier Campos, Olivia Weng, Caleb Geniesse, Nhan Tran, Ryan Kastner, Alessandro Biondi
Abstract: In this paper, we propose a method to perform empirical analysis of the loss landscape of machine learning (ML) models. The method is applied to two ML models for scientific sensing, which necessitates quantization to be deployed and are subject to noise and perturbations due to experimental conditions. Our method allows assessing the robustness of ML models to such effects as a function of quantization precision and under different regularization techniques -- two crucial concerns that remained underexplored so far. By investigating the interplay between performance, efficiency, and robustness by means of loss landscape analysis, we both established a strong correlation between gently-shaped landscapes and robustness to input and weight perturbations and observed other intriguing and non-obvious phenomena. Our method allows a systematic exploration of such trade-offs a priori, i.e., without training and testing multiple models, leading to more efficient development workflows. This work also highlights the importance of incorporating robustness into the Pareto optimization of ML models, enabling more reliable and adaptive scientific sensing systems.
Comment: The paper explores loss landscape analysis in the context of quantized ML models, which aligns with model compression and robustness. The focus on quantization and robustness trade-offs is relevant to foundational research in efficiency.
Relevance: 8 Novelty: 7
33. Gradient Based Method for the Fusion of Lattice Quantizers
ArXiv ID: 2502.06887
Authors: Liyuan Zhang, Hanzhong Cao, Jiaheng Li, Minyang Yu
Abstract: In practical applications, lattice quantizers leverage discrete lattice points to approximate arbitrary points in the lattice. An effective lattice quantizer significantly enhances both the accuracy and efficiency of these approximations. In the context of high-dimensional lattice quantization, previous work proposed utilizing low-dimensional optimal lattice quantizers and addressed the challenge of determining the optimal length ratio in orthogonal splicing. Notably, it was demonstrated that fixed length ratios and orthogonality yield suboptimal results when combining low-dimensional lattices. Building on this foundation, another approach employed gradient descent to identify optimal lattices, which inspired us to explore the use of neural networks to discover matrices that outperform those obtained from orthogonal splicing methods. We propose two novel approaches to tackle this problem: the Household Algorithm and the Matrix Exp Algorithm. Our results indicate that both the Household Algorithm and the Matrix Exp Algorithm achieve improvements in lattice quantizers across dimensions 13, 15, 17 to 19, 21, and 22. Moreover, the Matrix Exp Algorithm demonstrates superior efficacy in high-dimensional settings.
Comment: The paper proposes novel algorithms for lattice quantization, which aligns with model compression and efficiency topics. The focus on gradient-based methods and high-dimensional settings adds theoretical depth.
Relevance: 8 Novelty: 7
34. Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators
ArXiv ID: 2502.07842
Authors: Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim, Jong Hwan Ko
Abstract: Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can re- duce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, im- posed by cell limitations and the need for multiple cells for higher- bit weights, present further challenges. While fine-grained partial- sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums effi- ciently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robust- ness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant.
Comment: The paper introduces a column-wise quantization method for compute-in-memory accelerators, which aligns with the model compression criterion, particularly in quantization and efficiency improvements.
Relevance: 8 Novelty: 7
35. No Data, No Optimization: A Lightweight Method To Disrupt Neural Networks With Sign-Flips
ArXiv ID: 2502.07408
Authors: Ido Galil, Moshe Kimhi, Ran El-Yaniv
Abstract: Deep Neural Networks (DNNs) can be catastrophically disrupted by flipping only a handful of sign bits in their parameters. We introduce Deep Neural Lesion (DNL), a data-free, lightweight method that locates these critical parameters and triggers massive accuracy drops. We validate its efficacy on a wide variety of computer vision models and datasets. The method requires no training data or optimization and can be carried out via common exploits software, firmware or hardware based attack vectors. An enhanced variant that uses a single forward and backward pass further amplifies the damage beyond DNL's zero-pass approach. Flipping just two sign bits in ResNet50 on ImageNet reduces accuracy by 99.8\%. We also show that selectively protecting a small fraction of vulnerable sign bits provides a practical defense against such attacks.
Comment: The paper introduces a lightweight method to disrupt neural networks via sign-flips, which aligns with the model compression criterion, particularly in sparsity and robustness.
Relevance: 8 Novelty: 7
36. Automated Consistency Analysis of LLMs
ArXiv ID: 2502.07036
Authors: Aditya Patwardhan, Vivek Vaidya, Ashish Kundu
Abstract: Generative AI (Gen AI) with large language models (LLMs) are being widely adopted across the industry, academia and government. Cybersecurity is one of the key sectors where LLMs can be and/or are already being used. There are a number of problems that inhibit the adoption of trustworthy Gen AI and LLMs in cybersecurity and such other critical areas. One of the key challenge to the trustworthiness and reliability of LLMs is: how consistent an LLM is in its responses? In this paper, we have analyzed and developed a formal definition of consistency of responses of LLMs. We have formally defined what is consistency of responses and then develop a framework for consistency evaluation. The paper proposes two approaches to validate consistency: self-validation, and validation across multiple LLMs. We have carried out extensive experiments for several LLMs such as GPT4oMini, GPT3.5, Gemini, Cohere, and Llama3, on a security benchmark consisting of several cybersecurity questions: informational and situational. Our experiments corroborate the fact that even though these LLMs are being considered and/or already being used for several cybersecurity tasks today, they are often inconsistent in their responses, and thus are untrustworthy and unreliable for cybersecurity.
Comment: The paper introduces a framework for consistency analysis of LLMs, which aligns with theoretical insights into LLM behavior and interpretability.
Relevance: 8 Novelty: 7
37. When More is Less: Understanding Chain-of-Thought Length in LLMs
ArXiv ID: 2502.07266
Authors: Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, Yisen Wang
Abstract: Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have been exploring ways to guide models to generate more complex CoT processes to improve the reasoning ability of LLMs, such as long CoT and the test-time scaling law. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide a piece of evidence that longer reasoning processes are increasingly susceptible to noise. We theoretically prove the existence of an optimal CoT length and derive a scaling law for this optimal length based on model capability and task difficulty. Inspired by our theory, we conduct experiments on both synthetic and real world datasets and propose Length-filtered Vote to alleviate the effects of excessively long or short CoTs. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.
Comment: The paper explores the optimal length of Chain-of-Thought reasoning in LLMs, providing theoretical and empirical insights into reasoning dynamics, which is relevant to foundational research on LLM behavior.
Relevance: 8 Novelty: 7
38. EdgeEar: Efficient and Accurate Ear Recognition for Edge Devices
ArXiv ID: 2502.07734
Authors: Camile Lendering, Bernardo Perrone Ribeiro, \v{Z}iga Emer\v{s}i\v{c}, Peter Peer
Abstract: Ear recognition is a contactless and unobtrusive biometric technique with applications across various domains. However, deploying high-performing ear recognition models on resource-constrained devices is challenging, limiting their applicability and widespread adoption. This paper introduces EdgeEar, a lightweight model based on a proposed hybrid CNN-transformer architecture to solve this problem. By incorporating low-rank approximations into specific linear layers, EdgeEar reduces its parameter count by a factor of 50 compared to the current state-of-the-art, bringing it below two million while maintaining competitive accuracy. Evaluation on the Unconstrained Ear Recognition Challenge (UERC2023) benchmark shows that EdgeEar achieves the lowest EER while significantly reducing computational costs. These findings demonstrate the feasibility of efficient and accurate ear recognition, which we believe will contribute to the wider adoption of ear biometrics.
Comment: The paper introduces a lightweight model for ear recognition using low-rank approximations, which aligns with the criterion of model compression and efficiency.
Relevance: 7 Novelty: 7
39. Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task
ArXiv ID: 2502.07190
Authors: Junjie Wu, Mo Yu, Lemao Liu, Dit-Yan Yeung, Jie Zhou
Abstract: While LLMs have exhibited strong performance on various NLP tasks, it is noteworthy that most of these tasks rely on utilizing the vast amount of knowledge encoded in LLMs' parameters, rather than solving new problems without prior knowledge. In cognitive research, the latter ability is referred to as fluid intelligence, which is considered to be critical for assessing human intelligence. Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs' abilities. In this paper, we analyze the challenges LLMs face in demonstrating fluid intelligence through controlled experiments, using the most representative ARC task as an example. Our study revealed three major limitations in existing LLMs: limited ability for skill composition, unfamiliarity with abstract input formats, and the intrinsic deficiency of left-to-right decoding. Our data and code can be found in https://wujunjie1998.github.io/araoc-benchmark.github.io/.
Comment: The paper analyzes LLMs' deficiencies in fluid intelligence, which provides theoretical insights into LLM behavior. However, it focuses on a specific task (ARC) and does not propose architectural or training innovations.
Relevance: 7 Novelty: 6
Paper Selection Prompt
You are a helpful paper reading assistant whose job is to read daily posts from ArXiv and identify a few papers that your friend will enjoy reading. Your job is to carefully read the paper titles and abstracts below and find the ones that match the criteria below.
Relevant Topics
Use the following relevance criteria to focus on foundational research, avoiding purely application-driven work:
-
Representation Learning - Relevant: Insights into how deep networks encode information, feature/dictionary learning, sparse/contrastive methods, training dynamics in neural networks. - Irrelevant: Standard applications of known techniques lacking new theoretical or methodological contributions.
-
Model Architecture - Relevant: Mixture-of-Experts (MoE), Transformers, Conditional/Dynamic Networks, Autoencoders, analysis on existing architectures (like encoder-decoder), or other architectural innovations. - Irrelevant: Merely using existing architectures for a certain task without insights into the structure themselves.
-
Model Compression - Relevant: Sparsity, pruning, quantization, low-rank approaches, KV cache, or other algorithmic/theoretical efficiency breakthroughs. - Irrelevant: Straightforward applications of existing compression methods to new tasks.
-
Large Language Models (LLMs) - Relevant: Major breakthroughs in training or architecture, theoretical insights into LLM behavior/interpretability. - Irrelevant: Domain-specific usage (e.g., translation, jail-breaking), minor tweaks (e.g., instruction tuning, CoT, data mixing), or purely empirical dataset/benchmark studies and text-level analysis (e.g. hallucination, reasoning, safety).
-
AI for Science - Relevant: Foundational research in molecular/protein modeling, new generative paradigms, or significant architecture-level innovations. - Irrelevant: Conventional, domain-specific applications without new theoretical perspectives.
-
Emerging Trends - Relevant: Cutting-edge theoretical work challenging established assumptions or introducing broad new paradigms. - Irrelevant: Incremental improvements or trend-following without novel insights.
Keywords for Relevant Domains: Mixture of Experts (MoE), Representation Learning, Compression/Efficiency, Sparse/Sparsity, Pruning, Quantization, Low-rank, Foundation Model, etc.
Hints on Irrelevant Domains: Reinforcement Learning, Transfer Learning, Federated Learning, Online Learning, Diffusion Models, etc.
Hints on Application Tasks: Image Segmentation, Medical Imaging, 3D Vision, Video Understanding, Information Retrieval, Summarization, Recommendation Systems, Machine Translation, Speech Recognition, Signal Processing, Spatial/Temporal Modeling, Time Series, etc.
Scoring Criteria
The "Relevance" score measures how closely the paper aligns with the core topics of the prompt. The "Novelty" score assesses the originality and impact of the paper. They are two ORTHONORMAL axes and SHOULD NOT be confused with each other.
Relevance Scoring
- Relevance 9-10 (Completely Relevant)
- Focus: Fully aligned with core topics with no deviation, score the highest if contains keywords in it.
-
Examples: Papers focused on foundational methods or theoretical research, whose titles contain topic keywords like "MoE".
-
Relevance 7-8 (Relevant)
- Focus: Retain a solid link to the main research area, though may touch on peripheral elements.
-
Examples: Papers research on the fundamental part of MoE through a less critical aspect like its behavior in GNN.
-
Relevance 5-6 (Borderline)
- Focus: Maintains a link to the core topic but also extends into at least one other domain/area beyond the primary focus.
-
Examples: Work referencing MoE centered on reinforcement learning.
-
Relevance 3-4 (Irrelevant)
- Focus: Largely outside our interests with no association to our topics.
-
Examples: Application-focused papers like using MoE to solve a problem in the real world.
-
Relevance 1-2 (Ignore)
- Focus: Purely unrelated to our topics. Completely a different domain.
- Exception: If the paper hints at a cutting-edge, radically new direction that could eventually transform the primary domain, consider a score of 9–10 despite initial appearances. (Usually a very rare concept that belongs to the fundamental research)
Novelty Scoring
- Novelty 9-10 (Breakthrough)
- Definition: Groundbreaking methods/theory introducing new directions or solving major challenges.
-
Examples: Entirely new paradigm for foundational models; a novel theory transforming representation learning.
-
Novelty 7-8 (Improvements)
- Definition: Substantial insights/enhancements, though not a full paradigm shift.
-
Examples: Modifications on existing methods yielding significantly better results.
-
Novelty 5-6 (Borderline)
- Definition: Incremental contributions with possible long-term benefits, not immediately transformative.
-
Examples: Moderately novel extension to an existing architecture; refining current methods without fundamentally altering them.
-
Novelty 3-4 (Tangential)
- Definition: Minor or domain-specific improvements with limited broader impact.
-
Examples: Slight modifications to known methods with strange motivation; purely engineering jobs like a new benchmark/dataset.
-
Novelty 1-2 (Low)
- Definition: Minimal originality, applying standard approaches without real innovation.
- Examples: Using an off-the-shelf model without adding new insights; purely application-driven studies like finetuning a pretrained model using existing methods.
Papers
[PAPER LIST HERE]
Instructions
Write the response in JSONL format with {ARXIVID, COMMENT, RELEVANCE, NOVELTY} on each line, one for each paper.
- ARXIVID: should be the ArXiv ID.
- COMMENT: should identify whether there is a criteria that match the paper very closely. These matches should not be based on general terms like "language modeling" or "advancements" and should specifically refer to a criterion. No need to mention the non-matching criteria.
- RELEVANCE: should be a score from 1-10.
- NOVELTY: should be a score from 1-10.