This is a remedial run for missed papers from 05/21/2025 to 05/21/2025.

Results generated on 05/26/2025.

Personalized Daily ArXiv Papers 2025-05-22

[gpt-4o]	Prompt	Completion	Total
Token	50079	6651	56730
Cost	$0.13	$0.07	$0.19

Total arXiv papers: 398

Total scanned papers: 398

Total relevant papers: 36

Table of contents with paper titles:

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning Authors: Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Generalization Through Growth: Hidden Dynamics Controls Depth Dependence Authors: Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda
Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention Authors: Huanxuan Liao, Wen Hu, Yao Xu, Shizhu He, Jun Zhao, Kang Liu
Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers Authors: Peter Súkeník, Christoph H. Lampert, Marco Mondelli
Pre-training Large Memory Language Models with Internal and External Knowledge Authors: Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space Authors: Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, Xin Eric Wang
Mechanistic Insights into Grokking from the Embedding Layer Authors: H. V. AlquBoj, Hilal AlQuabeh, Velibor Bojkovic, Munachiso Nwadike, Kentaro Inui
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging Authors: Shenghe Zheng, Hongzhi Wang, Chenyu Huang, Xiaohui Wang, Tao Chen, Jiayuan Fan, Shuyue Hu, Peng Ye
Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models Authors: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
Efficient Differentiable Approximation of Generalized Low-rank Regularization Authors: Naiqi Li, Yuqiu Xie, Peiyuan Liu, Tao Dai, Yong Jiang, Shu-Tao Xia
Is (Selective) Round-To-Nearest Quantization All You Need? Authors: Alex Kogan
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering Authors: Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du
Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs Authors: Federico Ranaldi, Andrea Zugarini, Leonardo Ranaldi, Fabio Massimo Zanzotto
The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation Authors: Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Model Merging is Secretly Certifiable: Non-Vacuous Generalisation Bounds for Low-Shot Learning Authors: Taehoon Kim, Henry Gouk, Minyoung Kim, Timothy Hospedales
HOPSE: Scalable Higher-Order Positional and Structural Encoder for Combinatorial Representations Authors: Martin Carrasco, Guillermo Bernardez, Marco Montagna, Nina Miolane, Lev Telyatnikov
Time Tracker: Mixture-of-Experts-Enhanced Foundation Time Series Forecasting Model with Decoupled Training Pipelines Authors: Xiaohou Shi, Ke Li, Aobo Liang, Yan Sun
An Efficient Private GPT Never Autoregressively Decodes Authors: Zhengyi Li, Yue Guan, Kang Yang, Yu Feng, Ning Liu, Yu Yu, Jingwen Leng, Minyi Guo
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing Authors: Adel Javanmard, Rudrajit Das, Alessandro Epasto, Vahab Mirrokni
GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection Authors: Mariia Seleznova, Hung-Hsu Chou, Claudio Mayrink Verdun, Gitta Kutyniok
Scaling Diffusion Transformers Efficiently via $μ$P Authors: Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement Authors: Jilin Hu, Jianyu Zhang, Yongwang Zhao, Talia Ringer
Mechanistic evaluation of Transformers and state space models Authors: Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csórdas, Dan Jurafsky, Christopher Potts
Fourier-Invertible Neural Encoder (FINE) for Homogeneous Flows Authors: Anqiao Ouyang, Hongyi Ke, Qi Wang
Bidirectional Variational Autoencoders Authors: Bart Kosko, Olaoluwa Adigun
Degree-Optimized Cumulative Polynomial Kolmogorov-Arnold Networks Authors: Mathew Vanherreweghe, Lirandë Pira, Patrick Rebentrost
Conformal Language Model Reasoning with Coherent Factuality Authors: Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, Surbhi Goel
Stronger ViTs With Octic Equivariance Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex Authors: Muquan Yu, Mu Nan, Hossein Adeli, Jacob S. Prince, John A. Pyles, Leila Wehbe, Margaret M. Henderson, Michael J. Tarr, Andrew F. Luo
An approach to identify the most semantically informative deep representations of text and images Authors: Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio
SSR: Speculative Parallel Scaling Reasoning in Test-time Authors: Yuanlin Chu, Bo Wang, Xiang Liu, Hong Chen, Aiwei Liu, Xuming Hu
On the creation of narrow AI: hierarchy and nonlocality of neural network skills Authors: Eric J. Michaud, Asher Parker-Sartori, Max Tegmark
Last Layer Empirical Bayes Authors: Valentin Villecroze, Yixin Wang, Gabriel Loaiza-Ganem
Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision Authors: Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, Hengli Li, Cheng-Fu Yang, Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu

1. From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

ArXiv ID: 2505.17117

Authors: Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv

Abstract: Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.

Comment: Author match

2. MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding

ArXiv ID: 2505.15946

Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun

Abstract: Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.

Comment: The paper introduces MoRE-Brain, a routed mixture of experts architecture for fMRI visual decoding, which is relevant to model architecture as it employs a hierarchical MoE framework for interpretable and generalizable decoding.

Relevance: 9 Novelty: 8

3. Generalization Through Growth: Hidden Dynamics Controls Depth Dependence

ArXiv ID: 2505.15064

Authors: Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda

Abstract: Recent theory has reduced the depth dependence of generalization bounds from exponential to polynomial and even depth-independent rates, yet these results remain tied to specific architectures and Euclidean inputs. We present a unified framework for arbitrary pseudo-metric spaces in which a depth-$k$ network is the composition of continuous hidden maps $f:\mathcal{X}\to \mathcal{X}$ and an output map $h:\mathcal{X}\to \mathbb{R}$. The resulting bound $O(\sqrt{(\alpha + \log \beta(k))/n})$ isolates the sole depth contribution in $\beta(k)$, the word-ball growth of the semigroup generated by the hidden layers. By Gromov's theorem, polynomial (resp. exponential) growth corresponds to virtually nilpotent (resp. expanding) dynamics, revealing a geometric dichotomy behind existing $O(\sqrt{k})$ (sublinear depth) and $\tilde{O}(1)$ (depth-independent) rates. We further provide covering-number estimates showing that expanding dynamics yield an exponential parameter saving via compositional expressivity. Our results decouple specification from implementation, offering architecture-agnostic and dynamical-systems-aware guarantees applicable to modern deep-learning paradigms such as test-time inference and diffusion models.

Comment: The paper presents a unified framework for understanding depth dependence in neural networks, which aligns with representation learning and model architecture analysis.

Relevance: 9 Novelty: 8
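
As a worked illustration of how the growth function enters the bound quoted in the abstract, the two growth regimes can be substituted directly. The constants $d$ and $c$ below are illustrative assumptions, not values taken from the paper.

```latex
\begin{align*}
\text{polynomial growth: } \beta(k) = k^{d}
  &\;\Rightarrow\; O\!\left(\sqrt{\frac{\alpha + d\log k}{n}}\right)
  \quad \text{(only logarithmic in } k\text{, i.e. } \tilde{O}(1) \text{ in depth)},\\
\text{exponential growth: } \beta(k) = c^{k},\ c > 1
  &\;\Rightarrow\; O\!\left(\sqrt{\frac{\alpha + k\log c}{n}}\right)
  \quad \text{(} O(\sqrt{k}) \text{ in depth)}.
\end{align*}
```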

4. Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention

ArXiv ID: 2505.15774

Authors: Huanxuan Liao, Wen Hu, Yao Xu, Shizhu He, Jun Zhao, Kang Liu

Abstract: Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose $\textbf{Hy}$brid $\textbf{Co}$ntext $\textbf{Co}$mpression (HyCo$_2$) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo$_2$ method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1% across seven knowledge-intensive QA benchmarks. Moreover, HyCo$_2$ matches the performance of uncompressed methods while reducing token consumption by 88.8%.

Comment: The paper proposes a hybrid context compression method for LLMs, which is relevant to model compression and efficiency in LLMs.

Relevance: 9 Novelty: 8

5. Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers

ArXiv ID: 2505.15239

Authors: Peter Súkeník, Christoph H. Lampert, Marco Mondelli

Abstract: The empirical emergence of neural collapse -- a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks -- has spurred a line of theoretical research aimed at its understanding. However, existing work focuses on data-agnostic models or, when data structure is taken into account, it remains limited to multi-layer perceptrons. Our paper fills both these gaps by analyzing modern architectures in a data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use in the literature even beyond data-agnostic settings. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.

Comment: The paper provides theoretical insights into neural collapse in deep networks, specifically in ResNets and Transformers, which aligns with representation learning and model architecture analysis.

Relevance: 9 Novelty: 8

6. Pre-training Large Memory Language Models with Internal and External Knowledge

ArXiv ID: 2505.15962

Authors: Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun

Abstract: Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.

Comment: The paper introduces Large Memory Language Models with a novel pre-training approach, which is relevant to foundational research in LLMs.

Relevance: 9 Novelty: 8

7. Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

ArXiv ID: 2505.15778

Authors: Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, Xin Eric Wang

Abstract: Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning. Code is available at https://github.com/eric-ai-lab/Soft-Thinking.

Comment: The paper introduces a novel method for reasoning in LLMs using continuous concept space, which is relevant to theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8
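
The abstract describes concept tokens as a probability-weighted mixture of token embeddings. A minimal sketch of that single step follows, assuming a toy three-token vocabulary with two-dimensional embeddings; the names `concept_token` and `embedding_table` are hypothetical, and this is not the authors' released implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def concept_token(logits, embedding_table):
    """Return the probability-weighted average of the token embeddings.

    embedding_table[i] is the embedding of vocabulary token i (toy values here).
    """
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of three tokens with 2-d embeddings (assumed values).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal logits give a uniform distribution, so the result is the plain mean.
soft = concept_token([0.0, 0.0, 0.0], embedding_table)

# A sharply peaked distribution collapses to (almost) one discrete embedding.
peaked = concept_token([50.0, 0.0, 0.0], embedding_table)
```

When the distribution is sharply peaked, the mixture is effectively the embedding of a single token, so ordinary discrete decoding is the limiting case of this construction.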

8. Mechanistic Insights into Grokking from the Embedding Layer

ArXiv ID: 2505.15624

Authors: H. V. AlquBoj, Hilal AlQuabeh, Velibor Bojkovic, Munachiso Nwadike, Kentaro Inui

Abstract: Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization. To confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric curvature of the bilinear loss landscape. We prove that an adaptive learning rate ratio, $\frac{\eta_E}{\eta_W} \propto \frac{\sigma_{\max}(E)}{\sigma_{\max}(W)} \cdot \frac{f_W}{f_E}$, mitigates bilinear coupling effects, accelerating convergence. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.

Comment: The paper provides mechanistic insights into grokking, focusing on embedding layers, which is relevant to representation learning and training dynamics.

Relevance: 9 Novelty: 8

9. The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

ArXiv ID: 2505.15134

Authors: Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng

Abstract: Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.

Comment: The paper explores entropy minimization in LLM reasoning, which is relevant to theoretical insights into LLM behavior.

Relevance: 9 Novelty: 8
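
To make the quantity being minimized concrete, the sketch below computes the entropy of a next-token distribution and shows that sharpening the logits lowers it. Temperature scaling is used here only as an assumed stand-in for "logit adjustment"; it is not the EM-INF procedure from the paper, and the logits are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Made-up next-token logits over a four-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0]

base = entropy(softmax(logits))
# A temperature below 1 concentrates mass on the most likely token.
sharpened = entropy(softmax(logits, temperature=0.5))
# Equal logits give the maximum possible entropy, log(vocabulary size).
uniform = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
```

Here `sharpened` is smaller than `base`, and `uniform` equals `log(4)`, the upper limit for a four-token vocabulary.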

10. Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging

ArXiv ID: 2505.15875

Authors: Shenghe Zheng, Hongzhi Wang, Chenyu Huang, Xiaohui Wang, Tao Chen, Jiayuan Fan, Shuyue Hu, Peng Ye

Abstract: With more open-source models available for diverse tasks, model merging has gained attention by combining models into one, reducing training, storage, and inference costs. Current research mainly focuses on model merging for full fine-tuning, overlooking the popular LoRA. However, our empirical analysis reveals that: a) existing merging methods designed for full fine-tuning perform poorly on LoRA; b) LoRA modules show much larger parameter magnitude variance than full fine-tuned weights; c) greater parameter magnitude variance correlates with worse merging performance. Considering that large magnitude variances cause deviations in the distribution of the merged parameters, resulting in information loss and performance degradation, we propose a Decoupled and Orthogonal merging approach (DO-Merging). By separating parameters into magnitude and direction components and merging them independently, we reduce the impact of magnitude differences on the directional alignment of the merged models, thereby preserving task information. Furthermore, we introduce a data-free, layer-wise gradient descent method with orthogonal constraints to mitigate interference during the merging of direction components. We provide theoretical guarantees for both the decoupling and orthogonal components. We also validate through extensive experiments across vision, language, and multi-modal domains that our proposed DO-Merging can achieve significantly higher performance than existing merging methods at a minimal cost. Notably, each component can be flexibly integrated with existing methods, offering near free-lunch improvements across tasks.

Comment: The paper introduces a data-free framework for LoRA merging, which is relevant to model compression and efficiency.

Relevance: 9 Novelty: 8
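
The decoupling idea in the abstract (merge magnitude and direction separately) can be illustrated on two toy update vectors. This is a rough sketch under stated assumptions: plain averaging stands in for the paper's orthogonality-constrained merge, the per-vector granularity is a guess, and the function names are hypothetical.

```python
import math


def decouple(vec):
    """Split a vector into (L2 magnitude, unit direction)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return 0.0, list(vec)
    return norm, [x / norm for x in vec]


def merge_naive(vectors):
    """Element-wise average of the raw update vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def merge_decoupled(vectors):
    """Average magnitudes and directions separately, then recombine."""
    parts = [decouple(v) for v in vectors]
    mean_mag = sum(m for m, _ in parts) / len(parts)
    dim = len(vectors[0])
    mean_dir = [sum(d[i] for _, d in parts) / len(parts) for i in range(dim)]
    _, unit_dir = decouple(mean_dir)  # renormalize the averaged direction
    return [mean_mag * x for x in unit_dir]


# Two toy task updates with very different magnitudes (assumed values).
task_a = [10.0, 0.0]
task_b = [0.0, 1.0]

naive = merge_naive([task_a, task_b])          # direction dominated by task_a
decoupled = merge_decoupled([task_a, task_b])  # both directions weigh equally
```

In the naive average the large-magnitude update fixes the merged direction almost entirely, whereas the decoupled merge points midway between the two task directions.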

11. Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

ArXiv ID: 2505.16056

Authors: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei

Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this local routing consistency varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) Segment Routing Best Performance (SRP), which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) Segment Cache Best Hit Rate (SCH), which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc.

Comment: The paper provides insights into the local routing consistency of Mixture-of-Experts models, which is relevant to model architecture and efficiency.

Relevance: 9 Novelty: 8
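
One way to read "optimal segment-level cache hit rate under a given cache size limit" is sketched below: for a fixed cache held over a whole segment, the best choice is the most frequently activated experts. This interpretation, the function name, and the routing data are assumptions for illustration; the paper's exact SCH definition may differ.

```python
from collections import Counter


def best_segment_hit_rate(segment, cache_size):
    """Best hit rate of a fixed expert cache over one segment of tokens.

    segment is a list of per-token lists of activated expert ids.
    """
    counts = Counter(e for token_experts in segment for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Caching the most frequently activated experts maximizes the hit count.
    hits = sum(c for _, c in counts.most_common(cache_size))
    return hits / total


# Made-up routing for four tokens with top-2 expert selection.
segment = [[0, 1], [0, 2], [0, 1], [3, 1]]
rate = best_segment_hit_rate(segment, cache_size=2)  # experts 0 and 1 cached
```

With experts 0 and 1 cached, 6 of the 8 activations are served from the cache, giving a rate of 0.75.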

12. Efficient Differentiable Approximation of Generalized Low-rank Regularization

ArXiv ID: 2505.15407

Authors: Naiqi Li, Yuqiu Xie, Peiyuan Liu, Tao Dai, Yong Jiang, Shu-Tao Xia

Abstract: Low-rank regularization (LRR) has been widely applied in various machine learning tasks, but the associated optimization is challenging. Directly optimizing the rank function under constraints is NP-hard in general. To overcome this difficulty, various relaxations of the rank function were studied. However, optimization of these relaxed LRRs typically depends on singular value decomposition, which is a time-consuming and nondifferentiable operator that cannot be optimized with gradient-based techniques. To address these challenges, in this paper we propose an efficient differentiable approximation of the generalized LRR. The considered LRR form subsumes many popular choices like the nuclear norm, the Schatten-$p$ norm, and various nonconvex relaxations. Our method enables LRR terms to be appended to loss functions in a plug-and-play fashion, and the GPU-friendly operations enable efficient and convenient implementation. Furthermore, convergence analysis is presented, which rigorously shows that both the bias and the variance of our rank estimator rapidly reduce with increased sample size and iteration steps. In the experimental study, the proposed method is applied to various tasks, which demonstrates its versatility and efficiency. Code is available at https://github.com/naiqili/EDLRR.

Comment: The paper proposes an efficient differentiable approximation for low-rank regularization, aligning with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8

13. Is (Selective) Round-To-Nearest Quantization All You Need?

ArXiv ID: 2505.15909

Authors: Alex Kogan

Abstract: Quantization became a necessary tool for serving ever-increasing Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique that has been around well before LLMs surged to the forefront of machine learning (ML) research. Yet, it has been largely dismissed by recent and more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established point of view, showing that RTN is not only much cheaper to apply, but also its token generation throughput can be better than and accuracy can be similar to more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.

Comment: The paper revisits RTN quantization for LLMs, providing insights into model compression techniques, which is relevant to model compression.

Relevance: 9 Novelty: 7
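
For readers unfamiliar with RTN, a minimal sketch of symmetric round-to-nearest quantization follows. It assumes a single per-tensor scale and signed 4-bit codes; it does not reproduce the Marlin-based kernels or the selective precision scheme discussed in the abstract, and the weights are toy values.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric per-tensor round-to-nearest quantization.

    Returns integer codes in [-qmax, qmax] and the scale used.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() sends exact ties to the nearest even integer.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


# Toy weights chosen so that the scale is exactly 0.25.
weights = [1.75, -0.5, 0.3, 0.0]
codes, scale = rtn_quantize(weights, bits=4)
restored = dequantize(codes, scale)
```

Here the only weight that changes is 0.3, which is rounded to 0.25; the rounding error of any weight is at most half the scale.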

14. Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

ArXiv ID: 2505.15038

Authors: Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

Abstract: Linear Concept Vectors have proven effective for steering large language models (LLMs). While existing approaches like linear probing and difference-in-means derive these vectors from LLM hidden representations, diverse data introduces noises (i.e., irrelevant features) that challenge steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses Sparse Autoencoders to filter out noisy features from hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.

Comment: The paper introduces a method for denoising concept vectors using sparse autoencoders, which is relevant to representation learning and model compression.

Relevance: 9 Novelty: 7
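
The difference-in-means baseline named in the abstract can be sketched in a few lines. The hidden states below are made-up two-dimensional values and the function names are hypothetical; the sparse-autoencoder denoising step that the paper adds is not shown.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(positive, negative):
    """Concept vector: mean of concept examples minus mean of the rest."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden, concept_vector, strength=1.0):
    """Shift a hidden state along the concept direction."""
    return [h + strength * v for h, v in zip(hidden, concept_vector)]


# Made-up hidden states for examples with and without the concept.
positive = [[1.0, 2.0], [3.0, 2.0]]
negative = [[0.0, 2.0], [0.0, 0.0]]

vector = difference_in_means(positive, negative)
steered = steer([0.0, 0.0], vector, strength=0.5)
```

The class means are `[2.0, 2.0]` and `[0.0, 1.0]`, so the concept vector is `[2.0, 1.0]` and the steered state is `[1.0, 0.5]`.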

15. Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs

ArXiv ID: 2505.15501

Authors: Federico Ranaldi, Andrea Zugarini, Leonardo Ranaldi, Fabio Massimo Zanzotto

Abstract: We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.

Comment: The paper explores the concept of protoknowledge in LLMs, focusing on how knowledge graphs are internalized and utilized, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 7

16. The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

ArXiv ID: 2505.15807

Authors: Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Abstract: Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards safer and more transparent language models.

Comment: The paper explores in-context learning in LLMs, focusing on attention heads and retrieval augmentation, which aligns with foundational research in LLM behavior and interpretability.

Relevance: 9 Novelty: 7

17. Model Merging is Secretly Certifiable: Non-Vacuous Generalisation Bounds for Low-Shot Learning

ArXiv ID: 2505.15798

Authors: Taehoon Kim, Henry Gouk, Minyoung Kim, Timothy Hospedales

Abstract: Certifying the IID generalisation ability of deep networks is the first of many requirements for trusting AI in high-stakes applications from medicine to security. However, when instantiating generalisation bounds for deep networks it remains challenging to obtain non-vacuous guarantees, especially when applying contemporary large models on the small-scale data prevalent in such high-stakes fields. In this paper, we draw a novel connection between a family of learning methods based on model fusion and generalisation certificates, and surprisingly show that with minor adjustment several existing learning strategies already provide non-trivial generalisation guarantees. Essentially, by focusing on data-driven learning of downstream tasks by fusion rather than fine-tuning, the certified generalisation gap becomes tiny and independent of the base network size, facilitating its certification. Our results show for the first time non-trivial generalisation guarantees for learning with as few as 100 examples, while using vision models such as ViT-B and language models such as Mistral-7B. This observation is significant as it has immediate implications for facilitating the certification of existing systems as trustworthy, and opens up new directions for research at the intersection of practice and theory.

Comment: The paper discusses model merging and provides non-vacuous generalization bounds for low-shot learning, which is relevant to representation learning and model architecture as it connects model fusion with generalization certificates.

Relevance: 8 Novelty: 8

18. HOPSE: Scalable Higher-Order Positional and Structural Encoder for Combinatorial Representations

ArXiv ID: 2505.15405

Authors: Martin Carrasco, Guillermo Bernardez, Marco Montagna, Nina Miolane, Lev Telyatnikov

Abstract: While Graph Neural Networks (GNNs) have proven highly effective at modeling relational data, pairwise connections cannot fully capture multi-way relationships naturally present in complex real-world systems. In response to this, Topological Deep Learning (TDL) leverages more general combinatorial representations -- such as simplicial or cellular complexes -- to accommodate higher-order interactions. Existing TDL methods often extend GNNs through Higher-Order Message Passing (HOMP), but face critical scalability challenges due to (i) a combinatorial explosion of message-passing routes, and (ii) significant complexity overhead from the propagation mechanism. To overcome these limitations, we propose HOPSE (Higher-Order Positional and Structural Encoder) -- a message passing-free framework that uses Hasse graph decompositions to derive efficient and expressive encodings over arbitrary higher-order domains. Notably, HOPSE scales linearly with dataset size while preserving expressive power and permutation equivariance. Experiments on molecular, expressivity and topological benchmarks show that HOPSE matches or surpasses state-of-the-art performance while achieving up to 7× speedups over HOMP-based models, opening a new path for scalable TDL.

Comment: The paper introduces a scalable higher-order encoder for combinatorial representations, which is relevant to model architecture innovations.

Relevance: 8 Novelty: 8

19. Time Tracker: Mixture-of-Experts-Enhanced Foundation Time Series Forecasting Model with Decoupled Training Pipelines

ArXiv ID: 2505.15151

Authors: Xiaohou Shi, Ke Li, Aobo Liang, Yan Sun

Abstract: In the past few years, time series foundation models have achieved superior prediction accuracy. However, real-world time series often exhibit significant diversity in their temporal patterns across different time spans and domains, making it challenging for a single model architecture to fit all complex scenarios. In addition, time series data may have multiple variables exhibiting complex correlations with each other. Recent mainstream works have focused on modeling time series in a channel-independent manner in both pretraining and finetuning stages, overlooking the valuable inter-series dependencies. To this end, we propose Time Tracker for better predictions on multivariate time series data. Firstly, we leverage sparse mixture of experts (MoE) within Transformers to handle the modeling of diverse time series patterns, thereby alleviating the learning difficulties of a single model while improving its generalization. Besides, we propose Any-variate Attention, enabling a unified model structure to seamlessly handle both univariate and multivariate time series, thereby supporting channel-independent modeling during pretraining and channel-mixed modeling for finetuning. Furthermore, we design a graph learning module that constructs relations among sequences from frequency-domain features, providing more precise guidance to capture inter-series dependencies in channel-mixed modeling. Based on these advancements, Time Tracker achieves state-of-the-art performance in prediction accuracy, model generalization and adaptability.

Comment: The paper proposes a mixture-of-experts-enhanced model for time series forecasting, which is relevant to model architecture innovations.