Personalized Daily ArXiv Papers 2025-04-15

[gpt-4o]	Prompt	Completion	Total
Token	57598	8329	65927
Cost	$0.14	$0.08	$0.23

Total arXiv papers: 843

Total scanned papers: 519

Total relevant papers: 33

Table of contents with paper titles:

Mixture of Group Experts for Learning Invariant Representations Authors: Lei Kang, Jia Li, Mi Tian, Hua Huang
Towards Combinatorial Interpretability of Neural Computation Authors: Micah Adler, Dan Alistarh, Nir Shavit
Towards Weaker Variance Assumptions for Stochastic Optimization Authors: Ahmet Alacaoglu, Yura Malitsky, Stephen J. Wright
PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations Authors: Vikas Natesh, H. T. Kung
Continuum-Interaction-Driven Intelligence: Human-Aligned Neural Architecture via Crystallized Reasoning and Fluid Generation Authors: Pengcheng Zhou, Zhiqiang Nie, Haochen Li
Expressivity of Quadratic Neural ODEs Authors: Joshua Hanson, Maxim Raginsky
DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models Authors: Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum
In almost all shallow analytic neural network optimization landscapes, efficient minimizers have strongly convex neighborhoods Authors: Felix Benning, Steffen Dereich
From Tokens to Lattices: Emergent Lattice Structures in Language Models Authors: Bo Xiong, Steffen Staab
Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries Authors: Neil He, Jiahong Liu, Buze Zhang, Ngoc Bui, Ali Maatouk, Menglin Yang, Irwin King, Melanie Weber, Rex Ying
KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference Authors: Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang
Weight Ensembling Improves Reasoning in Language Models Authors: Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, Aditi Raghunathan
MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation Authors: Vikas Natesh, H. T. Kung, David Kong
RouterKT: Mixture-of-Experts for Knowledge Tracing Authors: Han Liao, Shuaishuai Zu
Sparse Hybrid Linear-Morphological Networks Authors: Konstantinos Fotopoulos, Christos Garoufis, Petros Maragos
Long Context In-Context Compression by Getting to the Gist of Gisting Authors: Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov
Training Small Reasoning LLMs with Cognitive Preference Alignment Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Constants of motion network revisited Authors: Wenqi Fang, Chao Chen, Yongkui Yang, Zheng Wang
High-order expansion of Neural Ordinary Differential Equations flows Authors: Dario Izzo, Sebastien Origer, Giacomo Acciarini, Francesco Biscani
NetTAG: A Multimodal RTL-and-Layout-Aligned Netlist Foundation Model via Text-Attributed Graph Authors: Wenji Fang, Wenkai Li, Shang Liu, Yao Lu, Hongce Zhang, Zhiyao Xie
Preconditioned Gradient Descent for Over-Parameterized Nonconvex Matrix Factorization Authors: Gavin Zhang, Salar Fattahi, Richard Y. Zhang
IsoSEL: Isometric Structural Entropy Learning for Deep Graph Clustering in Hyperbolic Space Authors: Li Sun, Zhenhao Huang, Yujie Wang, Hongbo Lv, Chunyang Liu, Hao Peng, Philip S. Yu
Towards Scalable Bayesian Optimization via Gradient-Informed Bayesian Neural Networks Authors: Georgios Makrygiorgos, Joshua Hang Sai Ip, Ali Mesbah
HyperCore: The Core Framework for Building Hyperbolic Foundation Models with Comprehensive Modules Authors: Neil He, Menglin Yang, Rex Ying
LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks Authors: Soumyadeep Pal, Changsheng Wang, James Diffenderfer, Bhavya Kailkhura, Sijia Liu
How new data permeates LLM knowledge and how to dilute it Authors: Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler
Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning Authors: Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Javen Qinfeng Shi
SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting Authors: Jiaming Xu, Jiayi Pan, Yongkang Zhou, Siming Chen, Jinhao Li, Yaoxiu Lian, Junyi Wu, Guohao Dai
The Impact of Model Zoo Size and Composition on Weight Space Learning Authors: Damian Falk, Konstantin Schürholt, Damian Borth
Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data Authors: Shuai Zhao, Linchao Zhu, Yi Yang
Towards Quantifying Commonsense Reasoning with Mechanistic Insights Authors: Abhinav Joshi, Areeb Ahmad, Divyaksh Shukla, Ashutosh Modi
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training Authors: Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou
Measuring Leakage in Concept-Based Methods: An Information Theoretic Approach Authors: Mikael Makonnen, Moritz Vandenhirtz, Sonia Laguna, Julia E Vogt

1. Mixture of Group Experts for Learning Invariant Representations

ArXiv ID: 2504.09265

Authors: Lei Kang, Jia Li, Mi Tian, Hua Huang

Abstract: Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.

Comment: The paper proposes a novel group sparse regularization approach for Mixture-of-Experts (MoE) models, directly addressing architectural innovations and representation learning.

Relevance: 10 Novelty: 8
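
For readers unfamiliar with the vanilla top-$k$ routing that the abstract above builds on, the following is a minimal sketch in plain Python. It shows only the baseline routing step (select the k largest router logits, renormalize them with a softmax, and mix the selected experts). It does not implement MoGE's group sparse regularization or its 2D topographic map. The function names, the scalar toy experts, and the choice to apply the softmax after selection are assumptions made for illustration.

```python
import math

def top_k_route(logits, k):
    """Return {expert_index: gate_weight} for the k largest router logits."""
    # Pick the k experts with the largest logits (ties keep the lower index first).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the gate weights sum to 1.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, router_logits, experts, k=2):
    """Combine the outputs of the k selected experts, weighted by their gates."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Hypothetical scalar "experts", used only for illustration.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]
gates = top_k_route(router_logits, k=2)
y = moe_forward(3.0, router_logits, experts, k=2)
```

Only k experts are evaluated per input, which is why the per-token compute stays roughly constant as the number of experts grows.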

2. Towards Combinatorial Interpretability of Neural Computation

ArXiv ID: 2504.08842

Authors: Micah Adler, Dan Alistarh, Nir Shavit

Abstract: We introduce combinatorial interpretability, a methodology for understanding neural computation by analyzing the combinatorial structures in the sign-based categorization of a network's weights and biases. We demonstrate its power through feature channel coding, a theory that explains how neural networks compute Boolean expressions and potentially underlies other categories of neural network computation. According to this theory, features are computed via feature channels: unique cross-neuron encodings shared among the inputs the feature operates on. Because different feature channels share neurons, the neurons are polysemantic and the channels interfere with one another, making the computation appear inscrutable. We show how to decipher these computations by analyzing a network's feature channel coding, offering complete mechanistic interpretations of several small neural networks that were trained with gradient descent. Crucially, this is achieved via static combinatorial analysis of the weight matrices, without examining activations or training new autoencoding networks. Feature channel coding reframes the superposition hypothesis, shifting the focus from neuron activation directionality in high-dimensional space to the combinatorial structure of codes. It also allows us for the first time to exactly quantify and explain the relationship between a network's parameter size and its computational capacity (i.e. the set of features it can compute with low error), a relationship that is implicitly at the core of many modern scaling laws. Though our initial studies of feature channel coding are restricted to Boolean functions, we believe they provide a rich, controlled, and informative research space, and that the path we propose for combinatorial interpretation of neural computation can provide a basis for understanding both artificial and biological neural circuits.

Comment: The paper introduces a novel combinatorial interpretability framework for understanding neural computation, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 9

3. Towards Weaker Variance Assumptions for Stochastic Optimization

ArXiv ID: 2504.09951

Authors: Ahmet Alacaoglu, Yura Malitsky, Stephen J. Wright

Abstract: We revisit a classical assumption for analyzing stochastic gradient algorithms where the squared norm of the stochastic subgradient (or the variance for smooth problems) is allowed to grow as fast as the squared norm of the optimization variable. We contextualize this assumption in view of its inception in the 1960s, its seemingly independent appearance in the recent literature, its relationship to weakest-known variance assumptions for analyzing stochastic gradient algorithms, and its relevance in deterministic problems for non-Lipschitz nonsmooth convex optimization. We build on and extend a connection recently made between this assumption and the Halpern iteration. For convex nonsmooth, and potentially stochastic, optimization, we analyze horizon-free, anytime algorithms with last-iterate rates. For problems beyond simple constrained optimization, such as convex problems with functional constraints or regularized convex-concave min-max problems, we obtain rates for optimality measures that do not require boundedness of the feasible set.

Comment: The paper revisits variance assumptions in stochastic optimization, which aligns with the 'Emerging Trends' criterion by challenging established assumptions and providing theoretical insights.

Relevance: 9 Novelty: 8

4. PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations

ArXiv ID: 2504.09064

Authors: Vikas Natesh, H. T. Kung

Abstract: We present PQS, which uses three techniques together - Prune, Quantize, and Sort - to achieve low-bitwidth accumulation of dot products in neural network computations. In conventional quantized (e.g., 8-bit) dot products, partial results are accumulated into wide (e.g., 32-bit) accumulators to avoid overflows when accumulating intermediate partial sums. However, such wide accumulators increase memory bandwidth usage and reduce energy efficiency. We show that iterative N:M pruning in floating point followed by quantization to 8 (or fewer) bits, and accumulation of partial products in a sorted order ("small to large") allows for accurate, compressed models with short dot product lengths that do not require wide accumulators. We design, analyze, and implement the PQS algorithm to eliminate accumulation overflows at inference time for several neural networks. Our method offers a 2.5x reduction in accumulator bitwidth while achieving model accuracy on par with floating-point baselines for multiple image classification tasks.

Comment: The PQS method combines pruning, quantization, and sorting for low-bitwidth accumulation, directly addressing model compression and efficiency.

Relevance: 9 Novelty: 8
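
The abstract's point about wide accumulators can be made concrete with a little arithmetic. The sketch below computes the conservative worst-case accumulator width for a dot product of signed quantized operands, and shows how shortening the dot product shrinks that width. This is the standard worst-case bound, not the PQS algorithm itself, and the 1024-term length and 2:8 pruning ratio are hypothetical values chosen for illustration.

```python
def signed_bits_for(lo, hi):
    """Smallest two's-complement width that can hold every integer in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits

def worst_case_accumulator_bits(length, operand_bits=8):
    """Width needed so that no sum of `length` products can overflow."""
    q_min = -(1 << (operand_bits - 1))      # e.g. -128 for 8-bit operands
    q_max = (1 << (operand_bits - 1)) - 1   # e.g. 127 for 8-bit operands
    largest = length * q_min * q_min        # every product equals (-128) * (-128)
    smallest = length * q_min * q_max       # every product equals (-128) * 127
    return signed_bits_for(smallest, largest)

dense = worst_case_accumulator_bits(1024)            # dense dot product, 1024 terms
pruned = worst_case_accumulator_bits(1024 * 2 // 8)  # hypothetical 2:8 pruning keeps 256 terms
```

Under these assumptions the dense case needs 26 bits and the pruned case 24 bits. The worst-case bound shrinks only logarithmically with length, so it cannot by itself explain the larger reduction the abstract reports.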

5. Continuum-Interaction-Driven Intelligence: Human-Aligned Neural Architecture via Crystallized Reasoning and Fluid Generation

ArXiv ID: 2504.09301

Authors: Pengcheng Zhou, Zhiqiang Nie, Haochen Li

Abstract: Current AI systems based on probabilistic neural networks, such as large language models (LLMs), have demonstrated remarkable generative capabilities yet face critical challenges including hallucination, unpredictability, and misalignment with human decision-making. These issues fundamentally stem from the over-reliance on randomized (probabilistic) neural networks (oversimplified models of biological neural networks) while neglecting the role of procedural reasoning (chain-of-thought) in trustworthy decision-making. Inspired by the human cognitive duality of fluid intelligence (flexible generation) and crystallized intelligence (structured knowledge), this study proposes a dual-channel intelligent architecture that integrates probabilistic generation (LLMs) with white-box procedural reasoning (chain-of-thought) to construct interpretable, continuously learnable, and human-aligned AI systems. Concretely, this work: (1) redefines chain-of-thought as a programmable crystallized intelligence carrier, enabling dynamic knowledge evolution and decision verification through multi-turn interaction frameworks; (2) introduces a task-driven modular network design that explicitly demarcates the functional boundaries between randomized generation and procedural control to address trustworthiness in vertical-domain applications; (3) demonstrates that multi-turn interaction is a necessary condition for intelligence emergence, with dialogue depth positively correlating with the system's human-alignment degree. This research not only establishes a new paradigm for trustworthy AI deployment but also provides theoretical foundations for next-generation human-AI collaborative systems.

Comment: Proposes a dual-channel architecture inspired by human cognitive duality, integrating probabilistic generation with procedural reasoning. This aligns with the 'Model Architecture' criterion, offering insights into architectural innovations for trustworthy AI.

Relevance: 9 Novelty: 8

6. Expressivity of Quadratic Neural ODEs

ArXiv ID: 2504.09385

Authors: Joshua Hanson, Maxim Raginsky

Abstract: This work focuses on deriving quantitative approximation error bounds for neural ordinary differential equations having at most quadratic nonlinearities in the dynamics. The simple dynamics of this model form demonstrates how expressivity can be derived primarily from iteratively composing many basic elementary operations, versus from the complexity of those elementary operations themselves. Like the analog differential analyzer and universal polynomial DAEs, the expressivity is derived instead primarily from the "depth" of the model. These results contribute to our understanding of what depth specifically imparts to the capabilities of deep learning architectures.

Comment: The paper provides theoretical bounds on the expressivity of quadratic neural ODEs, focusing on the role of depth in model capabilities. This aligns with foundational research in representation learning and model architecture.

Relevance: 9 Novelty: 8

7. DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

ArXiv ID: 2504.09223

Authors: Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum

Abstract: Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduce Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which merges the advantages of QAT while training less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, our approach outperforms the previous state-of-the-art method by 4.2% on MMLU with the 3-bit LLaMA-7B model. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.

Comment: The paper introduces DL-QAT, a low-rank quantization-aware training method for LLMs, aligning with the 'Model Compression' criterion by addressing efficiency in LLMs with novel quantization techniques.

Relevance: 9 Novelty: 8

8. In almost all shallow analytic neural network optimization landscapes, efficient minimizers have strongly convex neighborhoods

ArXiv ID: 2504.08867

Authors: Felix Benning, Steffen Dereich

Abstract: Whether or not a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow, 1-hidden layer neural networks with analytic activation functions when applied to regression problems. The parameter space is divided into two domains: the 'efficient domain' (all parameters for which the respective realization function cannot be generated by a network having a smaller number of neurons) and the 'redundant domain' (the remaining parameters). In almost all regression problems on the efficient domain the optimization landscape only features local minima that are strongly convex. Formally, we will show that for certain randomly picked regression problems the optimization landscape is almost surely a Morse function on the efficient domain. The redundant domain has significantly smaller dimension than the efficient domain and on this domain, potential local minima are never isolated.

Comment: The paper provides theoretical insights into the optimization landscape of shallow neural networks, which aligns with foundational research in representation learning and training dynamics.

Relevance: 9 Novelty: 8

9. From Tokens to Lattices: Emergent Lattice Structures in Language Models

ArXiv ID: 2504.08778

Authors: Bo Xiong, Steffen Staab

Abstract: Pretrained masked language models (MLMs) have demonstrated an impressive capability to comprehend and encode conceptual knowledge, revealing a lattice structure among concepts. This raises a critical question: how does this conceptualization emerge from MLM pretraining? In this paper, we explore this problem from the perspective of Formal Concept Analysis (FCA), a mathematical framework that derives concept lattices from the observations of object-attribute relationships. We show that the MLM's objective implicitly learns a formal context that describes objects, attributes, and their dependencies, which enables the reconstruction of a concept lattice through FCA. We propose a novel framework for concept lattice construction from pretrained MLMs and investigate the origin of the inductive biases of MLMs in lattice structure learning. Our framework differs from previous work because it does not rely on human-defined concepts and allows for discovering "latent" concepts that extend beyond human definitions. We create three datasets for evaluation, and the empirical results verify our hypothesis.

Comment: The paper investigates how conceptual knowledge emerges in pretrained masked language models using Formal Concept Analysis, which provides theoretical insights into representation learning.

Relevance: 9 Novelty: 8
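
To illustrate what deriving a concept lattice from object-attribute relationships means in Formal Concept Analysis, here is a minimal sketch that enumerates the formal concepts of a tiny hand-written context. It covers only textbook FCA, not the paper's method of extracting a formal context from a masked language model. The toy objects and attributes are hypothetical.

```python
from itertools import combinations

# Hypothetical toy formal context: object -> set of attributes it has.
context = {
    "sparrow": {"bird", "flies"},
    "penguin": {"bird", "swims"},
    "bat": {"flies"},
}
attributes = set().union(*context.values())

def extent(attrs):
    """Objects that have every attribute in `attrs`."""
    return frozenset(o for o, a in context.items() if attrs <= a)

def intent(objs):
    """Attributes shared by every object in `objs`."""
    if not objs:
        return frozenset(attributes)
    return frozenset(set.intersection(*(context[o] for o in objs)))

def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every attribute subset."""
    concepts = set()
    for r in range(len(attributes) + 1):
        for subset in combinations(sorted(attributes), r):
            objs = extent(set(subset))
            concepts.add((objs, intent(objs)))
    return concepts

concepts = formal_concepts()
```

Each concept pairs a set of objects with exactly the attributes they share. Ordering the concepts by inclusion of their extents gives the concept lattice. The brute-force enumeration is exponential in the number of attributes and is meant only for small examples.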

10. Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries

ArXiv ID: 2504.08896

Authors: Neil He, Jiahong Liu, Buze Zhang, Ngoc Bui, Ali Maatouk, Menglin Yang, Irwin King, Melanie Weber, Rex Ying

Abstract: In the era of foundation models and Large Language Models (LLMs), Euclidean space has been the de facto geometric setting for machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. At a large scale, real-world data often exhibit inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling, in a variety of domains, such as languages, vision, and the natural sciences. It is challenging to effectively capture these structures within the constraints of Euclidean spaces. This position paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next-generation of foundation models. By adopting these geometries, foundation models could more efficiently leverage the aforementioned structures. Task-aware adaptability that dynamically reconfigures embeddings to match the geometry of downstream applications could further enhance efficiency and expressivity. Our position is supported by a series of theoretical and empirical investigations of prevalent foundation models. Finally, we outline a roadmap for integrating non-Euclidean geometries into foundation models, including strategies for building geometric foundation models via fine-tuning, training from scratch, and hybrid approaches.

Comment: This position paper argues for the adoption of non-Euclidean geometries in foundation models, which aligns with emerging trends and architectural innovations. It provides a theoretical perspective on improving model scalability and efficiency.

Relevance: 9 Novelty: 8

11. KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference

ArXiv ID: 2504.09936

Authors: Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang

Abstract: Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries based on attention scores or position heuristics, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing output perturbation and degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging method, keeping attention consistency and compensating for attention loss resulting from cache merging. KeepKV successfully retains essential context information within a significantly compressed cache. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x, and maintains superior generation quality even with 10% KV cache budgets.

Comment: The paper introduces KeepKV, a novel KV cache compression method for LLM inference, which aligns with foundational research in model compression and efficiency.

Relevance: 9 Novelty: 8

12. Weight Ensembling Improves Reasoning in Language Models

ArXiv ID: 2504.10478

Authors: Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, Aditi Raghunathan

Abstract: We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.

Comment: The paper proposes WiSE-FT, a weight ensembling method for improving reasoning in LLMs, contributing foundational insights into training dynamics and test-time scaling.

Relevance: 9 Novelty: 8
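
The intervention described in the abstract above, interpolating the weights of an early and a late checkpoint, can be sketched in a few lines. The checkpoint layout (a dict of flat float lists), the parameter names, and the mixing coefficient `alpha = 0.5` are assumptions for illustration. The abstract does not state which coefficient the authors use.

```python
def interpolate_weights(early, late, alpha):
    """Return (1 - alpha) * early + alpha * late, parameter by parameter."""
    if early.keys() != late.keys():
        raise ValueError("checkpoints must have the same parameter names")
    return {
        name: [(1.0 - alpha) * e + alpha * l for e, l in zip(early[name], late[name])]
        for name in early
    }

# Hypothetical checkpoints, stored as flat lists of floats per parameter.
early_ckpt = {"layer.weight": [0.0, 1.0], "layer.bias": [0.5]}
late_ckpt = {"layer.weight": [1.0, 3.0], "layer.bias": [-0.5]}
merged = interpolate_weights(early_ckpt, late_ckpt, alpha=0.5)
```

With `alpha = 0` the result is the early checkpoint and with `alpha = 1` it is the late one. Intermediate values give a single merged model, so inference costs the same as running either checkpoint alone.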

13. MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation

ArXiv ID: 2504.09072

Authors: Vikas Natesh, H. T. Kung, David Kong

Abstract: We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lead to loss of precision in the mantissa of the smaller term, which is right-shifted to align with the larger term's exponent. Such shifting (a.k.a. 'swamping') is a significant source of numerical errors in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point) as the mantissa has a small number of bits. We avoid most swamping errors by arranging the terms in dot product summation based on their exponents and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating point error at inference time for several neural networks. In contrast to traditional sequential summation, our method has significantly lowered numerical errors, achieving classification accuracy on par with high-precision floating-point baselines for multiple image classification tasks. Our dMAC hardware units can reduce power consumption by up to 34.1% relative to conventional MAC units.

Comment: The paper proposes a novel method for low-bitwidth floating-point accumulation, which aligns with model compression and efficiency breakthroughs.

Relevance: 9 Novelty: 8
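
The swamping effect described in the abstract above is easy to reproduce. The sketch below uses IEEE 754 half precision (16-bit) rather than the 8-bit formats the paper targets, because the Python standard library can round to half precision directly. That substitution is an assumption made for convenience. Sorting the terms is likewise a simple stand-in for showing that summation order matters, not the MGS algorithm.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum(values):
    """Sequential summation with the accumulator rounded to fp16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

terms = [2048.0] + [1.0] * 16          # the exact sum is 2064
large_first = fp16_sum(terms)          # each +1 is swamped by the large term
small_first = fp16_sum(sorted(terms))  # small terms accumulate before the large one
```

In half precision the spacing between representable values near 2048 is 2, so each `2048 + 1` rounds back to 2048 and `large_first` stays at 2048.0. Adding the sixteen ones first gives 16, and `16 + 2048 = 2064` is exactly representable, so `small_first` is 2064.0.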

14. RouterKT: Mixture-of-Experts for Knowledge Tracing

ArXiv ID: 2504.08989

Authors: Han Liao, Shuaishuai Zu

Abstract: Knowledge Tracing (KT) is a fundamental task in Intelligent Tutoring Systems (ITS), which aims to model the dynamic knowledge states of students based on their interaction histories. However, existing KT models often rely on a global forgetting decay mechanism for capturing learning patterns, assuming that students' performance is predominantly influenced by their most recent interactions. Such approaches fail to account for the diverse and complex learning patterns arising from individual differences and varying learning stages. To address this limitation, we propose RouterKT, a novel Mixture-of-Experts (MoE) architecture designed to capture heterogeneous learning patterns by enabling experts to specialize in different patterns without any handcrafted learning pattern bias such as forgetting decay. Specifically, RouterKT introduces a person-wise routing mechanism to effectively model individual-specific learning behaviors and employs multi-heads as experts to enhance the modeling of complex and diverse patterns. Comprehensive experiments on ten benchmark datasets demonstrate that RouterKT exhibits significant flexibility and improves the performance of various KT backbone models, with a maximum average AUC improvement of 3.29% across different backbones and datasets, outperforming other state-of-the-art models. Moreover, RouterKT demonstrates consistently superior inference efficiency compared to existing approaches based on handcrafted learning pattern bias, highlighting its usability for real-world educational applications. The source code is available at https://github.com/derek-liao/RouterKT.git.

Comment: The paper introduces a Mixture-of-Experts (MoE) architecture for knowledge tracing, which aligns with the interest in MoE and architectural innovations.

Relevance: 9 Novelty: 7

15. Sparse Hybrid Linear-Morphological Networks

ArXiv ID: 2504.09289

Authors: Konstantinos Fotopoulos, Christos Garoufis, Petros Maragos

Abstract: We investigate hybrid linear-morphological networks. Recent studies highlight the inherent affinity of morphological layers to pruning, but also their difficulty in training. We propose a hybrid network structure, wherein morphological layers are inserted between the linear layers of the network, in place of activation functions. We experiment with the following morphological layers: 1) maxout pooling layers (as a special case of a morphological layer), 2) fully connected dense morphological layers, and 3) a novel, sparsely initialized variant of (2). We conduct experiments on the Magna-Tag-A-Tune (music auto-tagging) and CIFAR-10 (image classification) datasets, replacing the linear classification heads of state-of-the-art convolutional network architectures with our proposed network structure for the various morphological layers. We demonstrate that these networks induce sparsity to their linear layers, making them more prunable under L1 unstructured pruning. We also show that on MTAT our proposed sparsely initialized layer achieves slightly better performance than ReLU, maxout, and densely initialized max-plus layers, and exhibits faster initial convergence.

Comment: The paper introduces a hybrid linear-morphological network with sparsity and pruning insights, aligning with the model compression and sparsity criteria.

Relevance: 9 Novelty: 7
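
As a rough picture of the hybrid structure described in the abstract above, the sketch below places a dense max-plus (dilation) layer between two linear layers where an activation function would normally sit. The max-plus form `y_j = max_i (x_i + w_ji)` is the usual definition of such a layer. The layer sizes and weight values are hypothetical, and the paper's sparse initialization is not reproduced.

```python
def linear(x, weights, bias):
    """Ordinary affine layer: y_j = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def max_plus(x, weights):
    """Dense max-plus (dilation) layer: y_j = max_i (x[i] + weights[j][i])."""
    return [max(xi + w for xi, w in zip(x, row)) for row in weights]

# Hypothetical tiny network: linear -> max-plus -> linear, with no ReLU in between.
x = [1.0, -2.0]
h = linear(x, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.5])
m = max_plus(h, [[0.0, -1.0, -10.0], [-10.0, 0.0, 2.0]])
y = linear(m, [[1.0, 1.0]], [0.0])
```

Each max-plus output depends only on the single input that attains the maximum. This gives some intuition for why such layers pair naturally with sparsity and pruning.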

16. Long Context In-Context Compression by Getting to the Gist of Gisting

ArXiv ID: 2504.08934

Authors: Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov

Abstract: Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While effective for short instructions, we demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. Surprisingly, a simple average pooling baseline consistently outperforms gisting. We analyze the limitations of gisting, including information flow interruptions, capacity limitations and the inability to restrict its attention to subsets of the context. Motivated by theoretical insights into the performance gap between gisting and average pooling, and supported by extensive experimentation, we propose GistPool, a new in-context compression method. GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.

Comment: The paper proposes GistPool, a method for in-context compression in LLMs, which aligns with model compression and efficiency breakthroughs.

Relevance: 8 Novelty: 8
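
The average pooling baseline that the abstract above reports as outperforming gisting can be pictured as follows. This is a minimal sketch that averages consecutive runs of token vectors. Which activations are pooled, at which layer, and with what window are not stated in the abstract, so the 2-dimensional toy vectors and the window of 2 are assumptions.

```python
def average_pool(tokens, window):
    """Compress a sequence of vectors by averaging each run of `window` vectors."""
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled

# Hypothetical context of five 2-d token activations, compressed by a factor of 2.
context = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0], [5.0, 5.0]]
compressed = average_pool(context, window=2)
```

The last window may be shorter than the others. Here it is averaged over the vectors it actually contains rather than being padded.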

17. Training Small Reasoning LLMs with Cognitive Preference Alignment

ArXiv ID: 2504.09802

Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang

Abstract: The reasoning capabilities of large language models (LLMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can sometimes be ineffective and can require a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperform other training methods by a large margin.

Comment: The CRV framework for training smaller reasoning LLMs introduces cognitive preference alignment, which is relevant to foundational research in LLM training dynamics.

Relevance: 8 Novelty: 8

18. Constants of motion network revisited

ArXiv ID: 2504.09434

Authors: Wenqi Fang, Chao Chen, Yongkui Yang, Zheng Wang

Abstract: Discovering constants of motion helps in understanding dynamical systems, but it inevitably requires proficient mathematical skills and keen analytical capabilities. With the prevalence of deep learning, methods employing neural networks, such as Constant Of Motion nETwork (COMET), are promising for this scientific problem. Although the COMET method can produce better predictions of dynamics by exploiting the discovered constants of motion, there is still plenty of room to improve it. In this paper, we propose a novel neural network architecture, built using the singular-value-decomposition (SVD) technique, and a two-phase training algorithm to improve the performance of COMET. Extensive experiments show that our approach not only retains the advantages of COMET, such as applying to non-Hamiltonian systems and indicating the number of constants of motion, but also can be more lightweight and noise-robust than COMET.

Comment: The paper proposes a novel neural network architecture using SVD and a two-phase training algorithm to improve the discovery of constants of motion. This aligns with foundational research in representation learning and architectural innovations.

Relevance: 8 Novelty: 8

19. High-order expansion of Neural Ordinary Differential Equations flows

ArXiv ID: 2504.08769

Authors: Dario Izzo, Sebastien Origer, Giacomo Acciarini, Francesco Biscani

Abstract: Artificial neural networks, widely recognised for their role in machine learning, are now transforming the study of ordinary differential equations (ODEs), bridging data-driven modelling with classical dynamical systems and enabling the development of infinitely deep neural models. However, the practical applicability of these models remains constrained by the opacity of their learned dynamics, which operate as black-box systems with limited explainability, thereby hindering trust in their deployment. Existing approaches for the analysis of these dynamical systems are predominantly restricted to first-order gradient information due to computational constraints, thereby limiting the depth of achievable insight. Here, we introduce Event Transition Tensors, a framework based on high-order differentials that provides a rigorous mathematical description of neural ODE dynamics on event manifolds. We demonstrate its versatility across diverse applications: characterising uncertainties in a data-driven prey-predator control model, analysing neural optimal feedback dynamics, and mapping landing trajectories in a three-body neural Hamiltonian system. In all cases, our method enhances the interpretability and rigour of neural ODEs by expressing their behaviour through explicit mathematical structures. Our findings contribute to a deeper theoretical foundation for event-triggered neural differential equations and provide a mathematical construct for explaining complex system dynamics.

Comment: The paper introduces a high-order expansion framework for neural ODEs, contributing to the theoretical understanding of neural dynamics. This aligns with foundational research in representation learning and interpretability.

Relevance: 8 Novelty: 8

20. NetTAG: A Multimodal RTL-and-Layout-Aligned Netlist Foundation Model via Text-Attributed Graph

ArXiv ID: 2504.09260

Authors: Wenji Fang, Wenkai Li, Shang Liu, Yao Lu, Hongce Zhang, Zhiyao Xie

Abstract: Circuit representation learning has shown promise in advancing Electronic Design Automation (EDA) by capturing structural and functional circuit properties for various tasks. Existing pre-trained solutions rely on graph learning with complex functional supervision, such as truth table simulation. However, they only handle simple and-inverter graphs (AIGs), struggling to fully encode other complex gate functionalities. While large language models (LLMs) excel at functional understanding, they lack the structural awareness for flattened netlists. To advance netlist representation learning, we present NetTAG, a netlist foundation model that fuses gate semantics with graph structure, handling diverse gate types and supporting a variety of functional and physical tasks. Moving beyond existing graph-only methods, NetTAG formulates netlists as text-attributed graphs, with gates annotated by symbolic logic expressions and physical characteristics as text attributes. Its multimodal architecture combines an LLM-based text encoder for gate semantics and a graph transformer for global structure. Pre-trained with gate and graph self-supervised objectives and aligned with RTL and layout stages, NetTAG captures comprehensive circuit intrinsics. Experimental results show that NetTAG consistently outperforms each task-specific method on four largely different functional and physical tasks and surpasses state-of-the-art AIG encoders, demonstrating its versatility.

Comment: The paper introduces a multimodal netlist foundation model combining graph and text attributes, which aligns with representation learning and architectural innovations.

Relevance: 8 Novelty: 8

21. Preconditioned Gradient Descent for Over-Parameterized Nonconvex Matrix Factorization

ArXiv ID: 2504.09708

Authors: Gavin Zhang, Salar Fattahi, Richard Y. Zhang

Abstract: In practical instances of nonconvex matrix factorization, the rank of the true solution $r^{\star}$ is often unknown, so the rank $r$ of the model can be overspecified as $r>r^{\star}$. This over-parameterized regime of matrix factorization significantly slows down the convergence of local search algorithms, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$. We propose an inexpensive preconditioner for the matrix sensing variant of nonconvex matrix factorization that restores the convergence rate of gradient descent back to linear, even in the over-parameterized case, while also making it agnostic to possible ill-conditioning in the ground truth. Classical gradient descent in a neighborhood of the solution slows down due to the need for the model matrix factor to become singular. Our key result is that this singularity can be corrected by $\ell_{2}$ regularization with a specific range of values for the damping parameter. In fact, a good damping parameter can be inexpensively estimated from the current iterate. The resulting algorithm, which we call preconditioned gradient descent or PrecGD, is stable under noise, and converges linearly to an information theoretically optimal error bound. Our numerical experiments find that PrecGD works equally well in restoring the linear convergence of other variants of nonconvex matrix factorization in the over-parameterized regime.

Comment: The paper proposes PrecGD, a preconditioned gradient descent method for nonconvex matrix factorization, contributing foundational insights into optimization and efficiency in over-parameterized regimes.

Relevance: 8 Novelty: 8
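
To make the idea in the abstract above concrete, here is a small sketch of a preconditioned gradient step on a fully observed symmetric factorization problem, f(X) = 1/4 ||X X^T - M||_F^2. Several things are assumptions rather than statements from the abstract: the right-preconditioner form (X^T X + eta I)^{-1}, the damping rule eta = sqrt(f(X)), the step size, and the function names (loss, precgd_step, run_precgd). The abstract itself targets matrix sensing, so this fully observed setting is a simplification.

```python
import numpy as np

def loss(X, M):
    """f(X) = 1/4 * ||X X^T - M||_F^2 for a symmetric target M."""
    E = X @ X.T - M
    return 0.25 * np.sum(E * E)

def precgd_step(X, M, alpha, eta):
    """One step: X <- X - alpha * grad f(X) @ (X^T X + eta * I)^{-1}."""
    G = (X @ X.T - M) @ X                    # gradient of f when M is symmetric
    P = X.T @ X + eta * np.eye(X.shape[1])   # damped preconditioner (assumed form)
    # P is symmetric, so G @ inv(P) equals solve(P, G.T).T without forming inv(P).
    return X - alpha * np.linalg.solve(P, G.T).T

def run_precgd(M, r, steps=100, alpha=0.2, seed=0):
    """Run the sketch from a small random start with search rank r."""
    rng = np.random.default_rng(seed)
    X = 0.1 * rng.standard_normal((M.shape[0], r))
    for _ in range(steps):
        # Damping estimated from the current iterate (assumed rule), kept positive.
        eta = np.sqrt(loss(X, M)) + 1e-12
        X = precgd_step(X, M, alpha, eta)
    return X

rng = np.random.default_rng(1)
Z = rng.standard_normal((10, 2))   # ground truth with rank r* = 2
M = Z @ Z.T
X_hat = run_precgd(M, r=4)         # over-parameterized search rank r = 4 > r*
```

The damping term keeps the preconditioner invertible even as the extra columns of X shrink toward zero, which is the singular situation the abstract identifies as the cause of slow convergence for plain gradient descent.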

22. IsoSEL: Isometric Structural Entropy Learning for Deep Graph Clustering in Hyperbolic Space

ArXiv ID: 2504.09970

Authors: Li Sun, Zhenhao Huang, Yujie Wang, Hongbo Lv, Chunyang Liu, Hao Peng, Philip S. Yu

Abstract: Graph clustering is a longstanding topic in machine learning. In recent years, deep learning methods have achieved encouraging results, but they still require a predefined number of clusters K, and typically struggle with imbalanced graphs, especially in identifying minority clusters. These limitations motivate us to study a challenging yet practical problem: deep graph clustering without a predefined K, taking into account the imbalance found in real graphs. We approach this problem from a fresh perspective of information theory (i.e., structural information). In the literature, structural information has rarely been touched in deep clustering, and the classic definition falls short in its discrete formulation, neglecting node attributes and exhibiting prohibitive complexity. In this paper, we first establish a new Differentiable Structural Information, generalizing the discrete formalism to the continuous realm, so that the optimal partitioning tree, revealing the cluster structure, can be created by gradient backpropagation. Theoretically, we demonstrate its capability in clustering without requiring K and identifying the minority clusters in imbalanced graphs, while reducing the time complexity to O(N) w.r.t. the number of nodes. Subsequently, we present a novel IsoSEL framework for deep graph clustering, where we design a hyperbolic neural network to learn the partitioning tree in the Lorentz model of hyperbolic space, and further conduct Lorentz Tree Contrastive Learning with isometric augmentation. As a result, the partitioning tree incorporates node attributes via mutual information maximization, while the cluster assignment is refined by the proposed tree contrastive learning. Extensive experiments on five benchmark datasets show that IsoSEL outperforms 14 recent baselines by an average of +1.3% in NMI.

Comment: The paper introduces IsoSEL, a novel framework for deep graph clustering in hyperbolic space, contributing foundational insights into representation learning and clustering methods.

Relevance: 8 Novelty: 8

23. Towards Scalable Bayesian Optimization via Gradient-Informed Bayesian Neural Networks

ArXiv ID: 2504.10076

Authors: Georgios Makrygiorgos, Joshua Hang Sai Ip, Ali Mesbah

Abstract: Bayesian optimization (BO) is a widely used method for data-driven optimization that generally relies on zeroth-order data of the objective function to construct probabilistic surrogate models. These surrogates guide the exploration-exploitation process toward finding the global optimum. While Gaussian processes (GPs) are commonly employed as surrogates of the unknown objective function, recent studies have highlighted the potential of Bayesian neural networks (BNNs) as scalable and flexible alternatives. Moreover, incorporating gradient observations into GPs, when available, has been shown to improve BO performance. However, the use of gradients within BNN surrogates remains unexplored. By leveraging automatic differentiation, gradient information can be seamlessly integrated into BNN training, resulting in more informative surrogates for BO. We propose a gradient-informed loss function for BNN training, effectively augmenting function observations with local gradient information. The effectiveness of this approach is demonstrated on well-known benchmarks in terms of improved BNN predictions and faster BO convergence as the number of decision variables increases.

Comment: The paper proposes gradient-informed Bayesian neural networks for Bayesian optimization, which aligns with the 'Representation Learning' criterion by enhancing surrogate models with gradient information.