Personalized Daily ArXiv Papers 2026-01-16

[gpt-5]	Prompt	Completion	Total
Token	38366	34286	72652
Cost	$0.05	$0.34	$0.39

Total arXiv papers: 500

Total scanned papers: 285

Total relevant papers: 19

Table of contents with paper titles:

Discrete Feynman-Kac Correctors Authors: Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov
STEM: Scaling Transformers with Embedding Modules Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models Authors: Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song
Global Context Compression with Interleaved Vision-Text Transformation Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang
The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit Authors: Faruk Alpay, Bilge Senturk
MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts Authors: Yuxuan Lou, Kai Yang, Yang You
An analytic theory of convolutional neural network inverse problems solvers Authors: Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss
TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks Authors: Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang, Aviral Kumar
In-Context Operator Learning on the Space of Probability Measures Authors: Frank Cole, Dixi Wang, Yineng Chen, Yulong Lu, Rongjie Lai
Unlabeled Data Can Provably Enhance In-Context Learning of Transformers Authors: Renpu Liu, Jing Yang
Enhancing LUT-based Deep Neural Networks Inference through Architecture and Connectivity Optimization Authors: Binglei Lou, Ruilin Wu, Philip Leong
Single-Stage Huffman Encoder for ML Compression Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
On the origin of neural scaling laws: from random graphs to natural language Authors: Maissam Barkeshli, Alberto Alfarano, Andrey Gromov
Training-Trajectory-Aware Token Selection Authors: Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao
A New Convergence Analysis of Plug-and-Play Proximal Gradient Descent Under Prior Mismatch Authors: Guixian Xu, Jinglai Li, Junqi Tang
Understanding and Preserving Safety in Fine-Tuned LLMs Authors: Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, Ruoxi Jia
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung
Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models Authors: Michael R. Metel, Yufei Cui, Boxing Chen, Prasanna Parthasarathi
Distributed Perceptron under Bounded Staleness, Partial Participation, and Noisy Communication Authors: Keval Jain, Anant Raj, Saurav Prakash, Girish Varma

1. Discrete Feynman-Kac Correctors

ArXiv ID: 2601.10403

Authors: Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov

Abstract: Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non-sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman-Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine-tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward-tilted protein sequence generation.

Comment: Author match
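
As a rough illustration of the temperature control described in the abstract, the sketch below anneals a tiny categorical distribution p toward p^beta / Z using a single reweight-and-resample step. This is plain importance resampling on a toy problem, not the paper's Feynman-Kac corrector for masked diffusion; the function names (`anneal_by_resampling`, `empirical`, `annealed_target`) and the three-state example are assumptions made for illustration.

```python
import random


def anneal_by_resampling(probs, beta, n_particles, rng):
    """Approximate samples from probs**beta / Z via importance resampling."""
    states = list(range(len(probs)))
    # Proposal: draw particles from the base distribution p.
    particles = rng.choices(states, weights=probs, k=n_particles)
    # Importance weight of each particle: p(x)**beta / p(x) = p(x)**(beta - 1).
    weights = [probs[x] ** (beta - 1.0) for x in particles]
    # Resampling: particles survive in proportion to their weights.
    return rng.choices(particles, weights=weights, k=n_particles)


def empirical(samples, n_states):
    """Empirical frequency of each state among the samples."""
    counts = [0] * n_states
    for s in samples:
        counts[s] += 1
    return [c / len(samples) for c in counts]


def annealed_target(probs, beta):
    """Exact annealed distribution, available here only because the toy space is tiny."""
    powered = [p ** beta for p in probs]
    z = sum(powered)
    return [p / z for p in powered]


rng = random.Random(0)
base = [0.5, 0.3, 0.2]  # hypothetical base distribution
samples = anneal_by_resampling(base, beta=2.0, n_particles=20000, rng=rng)
print("empirical:", empirical(samples, len(base)))
print("target:   ", annealed_target(base, 2.0))
```

With beta > 1 the resampled particles concentrate on the most likely state; a beta between 0 and 1 would flatten the distribution instead.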

2. STEM: Scaling Transformers with Embedding Modules

ArXiv ID: 2601.10639

Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen

Abstract: Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3-4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.

Comment: Model architecture and efficiency: static token-indexed sparsity replacing FFN up-projection; decouples capacity from per-token compute and enables CPU offload.

Relevance: 10 Novelty: 9
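
To make the architectural change concrete, here is a minimal sketch of a gated FFN in which the up-projection matmul is swapped for a per-token row lookup, as the abstract describes. The SiLU gate, the tiny dimensions, and all names (`dense_gated_ffn`, `stem_style_ffn`, `E`) are assumptions for illustration; the paper's actual layer may differ in detail.

```python
import math
import random


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dense_gated_ffn(x, W_gate, W_up, W_down):
    """Baseline gated FFN: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)  # dense up-projection
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


def stem_style_ffn(x, token_id, W_gate, E, W_down):
    """Same layer, but the up-projection is a lookup into a layer-local table E."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = E[token_id]  # one row per vocabulary entry; no matmul, no routing
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


rng = random.Random(0)
d_model, d_ff, vocab = 4, 8, 5  # hypothetical toy sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W_gate = rand_matrix(d_ff, d_model)
W_down = rand_matrix(d_model, d_ff)
E = rand_matrix(vocab, d_ff)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
y = stem_style_ffn(x, 3, W_gate, E, W_down)
print(y)
```

Because the row of `E` depends only on the token id, it can be fetched ahead of time, which is consistent with the prefetch and offload properties the abstract mentions. In a gated FFN with three equally sized matrices, dropping the up-projection matmul also lines up with the "roughly one-third" figure.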

3. Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

ArXiv ID: 2601.09719

Authors: Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song

Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT

Comment: Model Architecture/Efficiency: Bounded Hyperbolic Tanh as a normalization-free alternative to Pre-LN with theoretical stability and faster training/inference for LLMs.

Relevance: 10 Novelty: 8
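
The abstract does not give the BHyT formula, so the following is only a sketch of the general idea of data-driven input bounding before a tanh. It assumes the bound comes from the RMS of the input vector, which is a guess for illustration and not the authors' method; `dyt_like` (a fixed-scale tanh) and `bounded_tanh` are hypothetical names.

```python
import math


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def dyt_like(x, alpha=1.0):
    """Fixed-scale tanh: saturates once the input magnitude grows."""
    return [math.tanh(alpha * v) for v in x]


def bounded_tanh(x, bound=1.0, eps=1e-6):
    """Assumed variant: rescale by the input's RMS so tanh sees a bounded range."""
    scale = bound / (rms(x) + eps)
    return [math.tanh(scale * v) for v in x]


x_small = [0.5, -1.0, 2.0, -0.25]
x_large = [100.0 * v for v in x_small]  # mimics magnitude growth with depth

print(dyt_like(x_large))      # every entry is pushed to roughly +/-1
print(bounded_tanh(x_small))
print(bounded_tanh(x_large))  # nearly identical to the line above
```

The point of the comparison is that a fixed-scale tanh loses information once activations grow, whereas a data-dependent bound keeps the output independent of overall magnitude.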

4. Global Context Compression with Interleaved Vision-Text Transformation

ArXiv ID: 2601.10378

Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang

Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4× compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3× speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPs. Our code and datasets will be made public to support further studies.

Comment: Compression/Efficiency and Model Architecture: global context compression in Transformers via interleaved vision–text tokens, reducing memory/FLOPs and token count.

Relevance: 10 Novelty: 8

5. The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit

ArXiv ID: 2601.09775

Authors: Faruk Alpay, Bilge Senturk

Abstract: We prove that the Transformer self-attention mechanism in the high-confidence regime ($\beta \to \infty$, where $\beta$ is an inverse temperature) operates in the tropical semiring (max-plus algebra). In particular, we show that taking the tropical limit of the softmax attention converts it into a tropical matrix product. This reveals that the Transformer's forward pass is effectively executing a dynamic programming recurrence (specifically, a Bellman-Ford path-finding update) on a latent graph defined by token similarities. Our theoretical result provides a new geometric perspective for chain-of-thought reasoning: it emerges from an inherent shortest-path (or longest-path) algorithm being carried out within the network's computation.

Comment: Architecture theory: shows self-attention’s tropical (max-plus) limit, linking transformers to dynamic programming/shortest-path.

Relevance: 9 Novelty: 9
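
The limit in the abstract rests on a standard fact: (1/beta) * log(sum_k exp(beta * s_k)) tends to max_k s_k as beta grows, so a log-sum-exp "soft" matrix product tends to the max-plus (tropical) product. The sketch below checks this numerically on small hand-picked matrices; the function names and example values are assumptions for illustration and do not come from the paper.

```python
import math


def scaled_logsumexp(scores, beta):
    """(1/beta) * log(sum(exp(beta * s))), computed stably."""
    m = max(scores)
    return m + math.log(sum(math.exp(beta * (s - m)) for s in scores)) / beta


def softmax_attend(scores, values, beta):
    """Softmax-weighted average of values at inverse temperature beta."""
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    z = sum(exps)
    return sum(e / z * v for e, v in zip(exps, values))


def tropical_matmul(A, B):
    """Max-plus product: C[i][j] = max_k (A[i][k] + B[k][j])."""
    inner = len(B)
    return [[max(A[i][k] + B[k][j] for k in range(inner))
             for j in range(len(B[0]))] for i in range(len(A))]


def soft_matmul(A, B, beta):
    """Finite-beta counterpart that replaces max with scaled log-sum-exp."""
    inner = len(B)
    return [[scaled_logsumexp([A[i][k] + B[k][j] for k in range(inner)], beta)
             for j in range(len(B[0]))] for i in range(len(A))]


scores = [0.2, 1.5, 0.9]
values = [10.0, 20.0, 30.0]
for beta in (1.0, 10.0, 100.0):
    # As beta grows, the first number approaches max(scores) and the second
    # approaches the value attached to the highest-scoring key.
    print(beta, scaled_logsumexp(scores, beta), softmax_attend(scores, values, beta))

A = [[0.0, 1.0], [2.0, -1.0]]
B = [[1.0, 0.0], [0.0, 3.0]]
print(tropical_matmul(A, B))
print(soft_matmul(A, B, beta=50.0))
```

The gap between the soft and tropical products is at most log(n) / beta for an inner dimension n, which is why the two agree in the high-confidence regime.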

6. MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

ArXiv ID: 2601.10272

Authors: Yuxuan Lou, Kai Yang, Yang You

Abstract: We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. We release MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST

Comment: Model architecture: Modality-Aware Mixture-of-Experts with modality-specific routing and shared experts (MoE).
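
A minimal sketch of the routing idea in the abstract: each token may only be routed to experts in its own modality group, while shared experts process every token. The top-k selection, softmax gate, additive combination with the shared experts, and all names (`route`, `mamoe_layer`, `groups`) are assumptions for illustration, since the abstract does not specify them.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    z = sum(exps)
    return [e / z for e in exps]


def route(router_logits, modality, groups, top_k=2):
    """Pick the top-k experts, restricted to the token's modality group."""
    allowed = groups[modality]
    chosen = sorted(allowed, key=lambda e: router_logits[e], reverse=True)[:top_k]
    gate = softmax([router_logits[e] for e in chosen])
    return list(zip(chosen, gate))


def mamoe_layer(x, modality, router_logits, experts, shared_experts, groups, top_k=2):
    """Gate-weighted sum of routed experts plus the output of every shared expert."""
    out = [0.0] * len(x)
    for idx, g in route(router_logits, modality, groups, top_k):
        y = experts[idx](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    for shared in shared_experts:
        y = shared(x)
        out = [o + yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
shared_experts = [lambda x: [0.5 * v for v in x]]
groups = {"text": [0, 1], "speech": [2, 3]}
logits = [5.0, 0.1, 9.0, 0.2]

# Expert 2 has the highest logit overall, but a text token cannot reach it.
print(route(logits, "text", groups, top_k=1))
print(mamoe_layer([1.0, 2.0], "text", logits, experts, shared_experts, groups, top_k=1))
print(mamoe_layer([1.0, 2.0], "speech", logits, experts, shared_experts, groups, top_k=1))
```

Restricting the candidate set before the top-k step is one simple way to realize modality-specific routing; the shared experts are what let information flow between speech and text tokens.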