Personalized Daily ArXiv Papers 2025-11-10

[gpt-5]	Prompt	Completion	Total
Token	29545	27496	57041
Cost	$0.04	$0.27	$0.31

Total arXiv papers: 384

Total scanned papers: 215

Total relevant papers: 18

Table of contents with paper titles:

Simplex-FEM Networks (SiFEN): Learning A Triangulated Function Approximator Authors: Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh
Autoencoding Dynamics: Topological Limitations and Capabilities Authors: Matthew D. Kvalheim, Eduardo D. Sontag
PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference Authors: Yushu Zhao, Zheng Wang, Minjia Zhang
Attention and Compression is all you need for Controllably Efficient Language Models Authors: Jatin Prakash, Aahlad Puli, Rajesh Ranganath
Deep Progressive Training: scaling up depth capacity of zero/one-layer models Authors: Zhiqi Bu
Linear Gradient Prediction with Control Variates Authors: Kamil Ciosek, Nicolò Felicioni, Juan Elenter Litwin
FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow Authors: Rubens Lacouture, Nathan Zhang, Ritvik Sharma, Marco Siracusa, Fredrik Kjolstad, Kunle Olukotun, Olivia Hsu
Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs Authors: Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, Sinead Williamson
How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need? Authors: Tuan Anh Tran, Duy M. H. Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D. Doan, Roger Wattenhofer, Ngo Anh Vien, Mathias Niepert, Daniel Sonntag, Paul Swoboda
SiamMM: A Mixture Model Perspective on Deep Unsupervised Learning Authors: Xiaodong Wang, Jing Huang, Kevin J Liang
APP: Accelerated Path Patching with Task-Specific Pruning Authors: Frauke Andersen, William Rudman, Ruochen Zhang, Carsten Eickhoff
Less Is More: Generating Time Series with LLaMA-Style Autoregression in Simple Factorized Latent Spaces Authors: Siyuan Li, Yifan Sun, Lei Cheng, Lewen Wang, Yang Liu, Weiqing Liu, Jianlong Li, Jiang Bian, Shikai Fang
When Data Falls Short: Grokking Below the Critical Threshold Authors: Vaibhav Singh, Eugene Belilovsky, Rahaf Aljundi
Sharp Minima Can Generalize: A Loss Landscape Perspective On Data Authors: Raymond Fan, Bryce Sandlund, Lin Myat Ko
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing Authors: Lei Gao, Chaoyi Jiang, Hossein Entezari Zarch, Daniel Wong, Murali Annavaram
Another BRIXEL in the Wall: Towards Cheaper Dense Features Authors: Alexander Lappe, Martin A. Giese
MDM: Manhattan Distance Mapping of DNN Weights for Parasitic-Resistance-Resilient Memristive Crossbars Authors: Matheus Farias, Wanghley Martins, H. T. Kung
First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation Authors: Dmytro Vitel, Anshuman Chhabra

1. Simplex-FEM Networks (SiFEN): Learning A Triangulated Function Approximator

ArXiv ID: 2511.04804

Authors: Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh

Abstract: We introduce Simplex-FEM Networks (SiFEN), a learned piecewise-polynomial predictor that represents f: R^d -> R^k as a globally C^r finite-element field on a learned simplicial mesh in an optionally warped input space. Each query activates exactly one simplex and at most d+1 basis functions via barycentric coordinates, yielding explicit locality, controllable smoothness, and cache-friendly sparsity. SiFEN pairs degree-m Bernstein-Bezier polynomials with a light invertible warp and trains end-to-end with shape regularization, semi-discrete OT coverage, and differentiable edge flips. Under standard shape-regularity and bi-Lipschitz warp assumptions, SiFEN achieves the classic FEM approximation rate M^(-m/d) with M mesh vertices. Empirically, on synthetic approximation tasks, tabular regression/classification, and as a drop-in head on compact CNNs, SiFEN matches or surpasses MLPs and KANs at matched parameter budgets, improves calibration (lower ECE/Brier), and reduces inference latency due to geometric locality. These properties make SiFEN a compact, interpretable, and theoretically grounded alternative to dense MLPs and edge-spline networks.
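
The locality claim in the abstract (one active simplex, at most d+1 basis functions per query) can be illustrated with a minimal sketch for d = 2 and degree-1 (piecewise-linear) elements. Everything here is an assumption for illustration: the fixed two-triangle mesh, the brute-force simplex search, and the names `barycentric`, `predict`, `VERTICES`, `TRIANGLES`, and `VALUES` are hypothetical, and the learned mesh, invertible warp, and higher-degree Bernstein-Bezier basis described in the abstract are omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def barycentric(p: Point, a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    if det == 0:
        raise ValueError("degenerate triangle")
    w_b = ((px - ax) * (cy - ay) - (cx - ax) * (py - ay)) / det  # weight of b
    w_c = ((bx - ax) * (py - ay) - (px - ax) * (by - ay)) / det  # weight of c
    return (1.0 - w_b - w_c, w_b, w_c)


def predict(
    p: Point,
    vertices: Sequence[Point],
    triangles: Sequence[Tuple[int, int, int]],
    values: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Piecewise-linear prediction: find the containing triangle, then blend
    its three vertex values with the barycentric weights."""
    for tri in triangles:  # brute-force search; an assumption, not the paper's lookup
        a, b, c = (vertices[i] for i in tri)
        weights = barycentric(p, a, b, c)
        if min(weights) >= -eps:  # all weights non-negative: p is inside
            return sum(w * values[i] for w, i in zip(weights, tri))
    raise ValueError("query lies outside the mesh")


# Hypothetical mesh: the unit square split into two triangles.
VERTICES: List[Point] = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
TRIANGLES = [(0, 1, 2), (0, 2, 3)]
# Vertex values sampled from f(x, y) = x + 2y, which linear elements reproduce exactly.
VALUES = [x + 2.0 * y for x, y in VERTICES]

if __name__ == "__main__":
    print(predict((0.25, 0.5), VERTICES, TRIANGLES, VALUES))
```

Only the three vertex values of the containing triangle contribute to each prediction, which is the sparsity the abstract refers to.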

Comment: Model Architecture (FEM-based piecewise-polynomial network on learned simplicial mesh) with explicit sparsity/locality for efficiency.

Relevance: 10 Novelty: 9

2. Autoencoding Dynamics: Topological Limitations and Capabilities

ArXiv ID: 2511.04807

Authors: Matthew D. Kvalheim, Eduardo D. Sontag

Abstract: Given a "data manifold" $M\subset \mathbb{R}^n$ and "latent space" $\mathbb{R}^\ell$, an autoencoder is a pair of continuous maps consisting of an "encoder" $E\colon \mathbb{R}^n\to \mathbb{R}^\ell$ and "decoder" $D\colon \mathbb{R}^\ell\to \mathbb{R}^n$ such that the "round trip" map $D\circ E$ is as close as possible to the identity map $\mbox{id}_M$ on $M$. We present various topological limitations and capabilities inherent to the search for an autoencoder, and describe capabilities for autoencoding dynamical systems having $M$ as an invariant manifold.

Comment: Representation Learning (theoretical/topological limits and capabilities of autoencoders, including dynamics on manifolds).

Relevance: 10 Novelty: 8

3. PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference

ArXiv ID: 2511.04805

Authors: Yushu Zhao, Zheng Wang, Minjia Zhang

Abstract: Mixture-of-Experts (MoE) models have shown strong potential in scaling language models efficiently by activating only a small subset of experts per input. However, their widespread deployment remains limited due to the high memory overhead associated with storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies, yet they often suffer from performance drops at high compression ratios. In this paper, we introduce PuzzleMoE, a training-free MoE compression method that achieves both high accuracy and efficient inference through two key innovations. First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. It uses a dual-mask to capture both shared and expert-specific parameters. Second, to avoid the overhead of storing binary masks and signs, PuzzleMoE introduces a bit-packed encoding scheme that reuses underutilized exponent bits, enabling efficient MoE inference on GPUs. Extensive experiments demonstrate that PuzzleMoE can compress MoE models by up to 50% while maintaining accuracy across various tasks. Specifically, it outperforms prior MoE compression methods by up to 16.7% on MMLU at a 50% compression ratio, and achieves up to a 1.28x inference speedup.

Comment: Model Compression and Efficiency: MoE compression via sparse expert merging with dual masks and bit-packed inference for memory-efficient deployment.

Relevance: 10 Novelty: 8

4. Attention and Compression is all you need for Controllably Efficient Language Models

ArXiv ID: 2511.05313

Authors: Jatin Prakash, Aahlad Puli, Rajesh Ranganath

Abstract: The quadratic cost of attention in transformers motivated the development of efficient approaches: namely sparse and sliding window attention, convolutions and linear attention. Although these approaches result in impressive reductions in compute and memory, they often trade off quality, specifically in-context recall performance. Moreover, fixing this quality-compute tradeoff a priori means being suboptimal from the get-go: some downstream applications require more memory for in-context recall, while others require lower latency and memory. Further, these approaches rely on heuristic choices that artificially restrict attention, or require handcrafted and complex recurrent state update rules, or they must be carefully composed with attention at specific layers to form a hybrid architecture that complicates the design process, especially at scale. To address these issues, we propose Compress & Attend Transformer (CAT), a conceptually simple architecture employing only two simple ingredients: dense attention and compression. CAT decodes chunks of tokens by attending to compressed chunks of the sequence so far. Compression results in decoding from a reduced sequence length that yields compute and memory savings, while choosing a particular chunk size trades off quality for efficiency. Moreover, CAT can be trained with multiple chunk sizes at once, unlocking control of quality-compute trade-offs directly at test-time without any retraining, all in a single adaptive architecture. In exhaustive evaluations on common language modeling tasks, in-context recall, and long-context understanding, a single adaptive CAT model outperforms existing efficient baselines, including hybrid architectures, across different compute-memory budgets. Further, a single CAT matches a dense transformer in language modeling across model scales while being 1.4-3x faster and requiring 2-9x lower total memory usage.

Comment: Matches Model Architecture and Efficiency: Compress & Attend Transformer uses dense attention over compressed context for controllable compute-memory tradeoffs and test-time adaptivity via multi-chunk training.

Relevance: 10 Novelty: 8

5. Deep Progressive Training: scaling up depth capacity of zero/one-layer models

ArXiv ID: 2511.04981

Authors: Zhiqi Bu

Abstract: Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To efficiently train models at scale, an effective strategy is progressive training, which scales up model capacity during training, hence significantly reducing computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training for the optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently accelerate $\approx 5\times$ while achieving almost the same loss, compared to a fully trained 60-layer model with 7B parameters.
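
The two figures quoted in the abstract are the same quantity under one assumption that the abstract does not state explicitly: that training time scales in proportion to compute. Writing $s$ for the fraction of compute saved (a symbol introduced here for illustration), the arithmetic is:

```latex
\text{speedup} \;=\; \frac{C_{\text{full}}}{C_{\text{progressive}}}
\;=\; \frac{C_{\text{full}}}{(1 - s)\,C_{\text{full}}}
\;=\; \frac{1}{1 - s},
\qquad
s \approx 0.8 \;\Longrightarrow\; \text{speedup} \approx \frac{1}{0.2} = 5 .
```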

Comment: High Performance Computing/Efficiency: progressive depth expansion (zero/one-layer) with theoretical guidance for compute-efficient training of deep models.

Relevance: 9 Novelty: 8

6. Linear Gradient Prediction with Control Variates

ArXiv ID: 2511.05187

Authors: Kamil Ciosek, Nicolò Felicioni, Juan Elenter Litwin

Abstract: We propose a new way of training neural networks, with the goal of reducing training cost. Our method uses approximate predicted gradients instead of the full gradients that require an expensive backward pass. We derive a control-variate-based technique that ensures our updates are unbiased estimates of the true gradient. Moreover, we propose a novel way to derive a predictor for the gradient inspired by the theory of the Neural Tangent Kernel. We empirically show the efficacy of the technique on a vision transformer classification task.
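
The abstract does not give the estimator itself, so the following is only a generic control-variate construction that shows how a cheap gradient prediction can be combined with an occasional true gradient while staying unbiased. The scalar setting, the correction probability `p`, and the names `control_variate_estimate` and `mean_estimate` are all hypothetical; the paper's NTK-inspired predictor is not modeled here.

```python
import random
from typing import Callable


def control_variate_estimate(
    true_grad_fn: Callable[[], float],
    predicted_grad: float,
    p: float,
    rng: random.Random,
) -> float:
    """Return an unbiased gradient estimate.

    With probability p the expensive true gradient is computed and the
    prediction error is added back, reweighted by 1/p. Otherwise only the
    cheap prediction is used. In expectation:
        p * (h + (g - h) / p) + (1 - p) * h = g
    where g is the true gradient and h the prediction.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if rng.random() < p:
        return predicted_grad + (true_grad_fn() - predicted_grad) / p
    return predicted_grad


def mean_estimate(true_grad: float, predicted_grad: float, p: float,
                  n: int, seed: int = 0) -> float:
    """Average n independent estimates to check unbiasedness empirically."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += control_variate_estimate(lambda: true_grad, predicted_grad, p, rng)
    return total / n


if __name__ == "__main__":
    # The prediction (1.5) is wrong, yet the average approaches the true value (2.0).
    print(mean_estimate(true_grad=2.0, predicted_grad=1.5, p=0.25, n=100_000))
```

In this sketch the expensive call happens on roughly a fraction `p` of the steps; a more accurate prediction lowers the variance of the estimate but is not needed for unbiasedness.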

Comment: Training efficiency: control-variate-based linear gradient prediction (NTK-inspired) enabling unbiased updates without full backpropagation.

Relevance: 9 Novelty: 8

7. FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow

ArXiv ID: 2511.04768

Authors: Rubens Lacouture, Nathan Zhang, Ritvik Sharma, Marco Siracusa, Fredrik Kjolstad, Kunle Olukotun, Olivia Hsu

Abstract: As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful solutions to address efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations like parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (entire cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models; fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.

Comment: HPC/Systems for sparse DL: a compiler enabling cross-expression fusion, dataflow ordering, and sparsity blocking on reconfigurable dataflow architectures.

Relevance: 9 Novelty: 8

8. Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

ArXiv ID: 2511.04869

Authors: Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, Sinead Williamson

Abstract: Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
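
One plausible reading of a "sampling-based notion of semantic calibration" is sketched below: sample several answers per question, group them into semantic classes, take the share of the largest class as the confidence, and then measure calibration with the standard binned expected calibration error (ECE). This is an assumption about the setup, not the paper's definition; the `canonicalize` function standing in for semantic equivalence and the names `semantic_confidence` and `expected_calibration_error` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def semantic_confidence(
    samples: Sequence[str],
    canonicalize: Callable[[str], str],
) -> Tuple[str, float]:
    """Map sampled answers to semantic classes and return the most common
    class together with the fraction of samples that fall into it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(canonicalize(s) for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Standard equal-width binned ECE: the weighted average over bins of
    |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(o for _, o in items) / len(items)
            ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Crude stand-in for semantic equivalence: case- and whitespace-insensitive match.
    answer, conf = semantic_confidence(
        ["Paris", "paris ", "Lyon", "PARIS"], lambda s: s.strip().lower()
    )
    print(answer, conf)
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False]))
```

A model is well calibrated in this sense when, among questions answered with confidence near c, about a fraction c of the majority answers are correct, which drives the ECE toward zero.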

Comment: Matches Representation Learning/Training Dynamics: theoretical mechanism and empirical analysis of semantic calibration emerging from next-token training; explains when calibration holds and when it breaks.

Relevance: 9 Novelty: 8

9. How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

ArXiv ID: 2511.05449

Authors: Tuan Anh Tran, Duy M. H. Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D. Doan, Roger Wattenhofer, Ngo Anh Vien, Mathias Niepert, Daniel Sonntag, Paul Swoboda

Abstract: Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io

Comment: Compression/Efficiency (aggressive token merging for 3D Transformers reducing tokens by 90–95%) and architectural efficiency insights.

Relevance: 9 Novelty: 7

10. SiamMM: A Mixture Model Perspective on Deep Unsupervised Learning

ArXiv ID: 2511.05462

Authors: Xiaodong Wang, Jing Huang, Kevin J Liang

Abstract: Recent studies have demonstrated the effectiveness of clustering-based approaches for self-supervised and unsupervised learning. However, the application of clustering is often heuristic, and the optimal methodology remains unclear. In this work, we establish connections between these unsupervised clustering methods and classical mixture models from statistics. Through this framework, we demonstrate significant enhancements to these clustering methods, leading to the development of a novel model named SiamMM. Our method attains state-of-the-art performance across various self-supervised learning benchmarks. Inspection of the learned clusters reveals a strong resemblance to unseen ground truth labels, uncovering potential instances of mislabeling.

Comment: Representation learning: connects clustering-based self-supervised learning to classical mixture models and proposes the SiamMM architecture.

Relevance: 9 Novelty: 7

11. APP: Accelerated Path Patching with Task-Specific Pruning

ArXiv ID: 2511.05442

Authors: Frauke Andersen, William Rudman, Ruochen Zhang, Carsten Eickhoff

Abstract: Circuit discovery is a key step in many mechanistic interpretability pipelines. Current methods, such as Path Patching, are computationally expensive and have limited in-depth circuit analysis to smaller models. In this study, we propose Accelerated Path Patching (APP), a hybrid approach leveraging our novel contrastive attention head pruning method to drastically reduce the search space of circuit discovery methods. Our Contrastive-FLAP pruning algorithm uses techniques from causal mediation analysis to assign higher pruning scores to task-specific attention heads, leading to higher-performing sparse models compared to traditional pruning techniques. Although Contrastive-FLAP is successful at preserving task-specific heads that existing pruning algorithms remove at low sparsity ratios, the circuits found by Contrastive-FLAP alone are too large to satisfy the minimality constraint required in circuit analysis. APP first applies Contrastive-FLAP to reduce the search space required for circuit discovery algorithms by, on average, 56%. Next, APP applies traditional Path Patching on the remaining attention heads, leading to a speedup of 59.63%-93.27% compared to Path Patching applied to the dense model. Despite the substantial computational savings that APP provides, circuits obtained from APP exhibit substantial overlap and similar performance to previously established Path Patching circuits.

Comment: Matches Model Compression/Efficiency: contrastive attention-head pruning (sparsity/pruning) to reduce search space and compute for circuit discovery; architecture-level head selection informed by causal mediation.

Relevance: 9 Novelty: 7

12. Less Is More: Generating Time Series with LLaMA-Style Autoregression in Simple Factorized Latent Spaces

ArXiv ID: 2511.04973

Authors: Siyuan Li, Yifan Sun, Lei Cheng, Lewen Wang, Yang Liu, Weiqing Liu, Jianlong Li, Jiang Bian, Shikai Fang

Abstract: Generative models for multivariate time series are essential for data augmentation, simulation, and privacy preservation, yet current state-of-the-art diffusion-based approaches are slow and limited to fixed-length windows. We propose FAR-TS, a simple yet effective framework that combines disentangled factorization with an autoregressive Transformer over a discrete, quantized latent space to generate time series. Each time series is decomposed into a data-adaptive basis that captures static cross-channel correlations and temporal coefficients that are vector-quantized into discrete tokens. A LLaMA-style autoregressive Transformer then models these token sequences, enabling fast and controllable generation of sequences with arbitrary length. Owing to its streamlined design, FAR-TS achieves orders-of-magnitude faster generation than Diffusion-TS while preserving cross-channel correlations and an interpretable latent space, enabling high-quality and flexible time series synthesis.
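
The step in which temporal coefficients are "vector-quantized into discrete tokens" can be sketched as a nearest-neighbor lookup against a codebook. This is the generic vector-quantization operation, not FAR-TS's specific implementation; the fixed toy codebook and the names `quantize` and `dequantize` are hypothetical, and codebook learning and the autoregressive Transformer are left out.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each vector with the index of its nearest codebook entry.
    The resulting integer sequence is what a token-level model would consume."""
    tokens = []
    for v in vectors:
        distances = [squared_distance(v, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Map token indices back to their codebook vectors (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]


if __name__ == "__main__":
    # Hypothetical 2D codebook and a short sequence of temporal coefficients.
    codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
    coefficients = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1)]
    tokens = quantize(coefficients, codebook)
    print(tokens)
    print(dequantize(tokens, codebook))
```

Because the output is a sequence of integer indices, generation length is limited only by how many tokens the autoregressive model is asked to produce, which is consistent with the arbitrary-length generation the abstract describes.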

Comment: Model Architecture (disentangled/quantized latent space with AR Transformer) and Compression/Efficiency (discrete tokens for fast, arbitrary-length generation).