Personalized Daily ArXiv Papers 2025-09-08

[gpt-5]	Prompt	Completion	Total
Token	39153	40799	79952
Cost	$0.05	$0.41	$0.46

Total arXiv papers: 381

Total scanned papers: 243

Total relevant papers: 20

Table of contents with paper titles:

SpikingBrain Technical Report: Spiking Brain-inspired Large Models Authors: Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Zehao Liu, Bohan Sun, Yuhong Chou, Han Xu, Xuerui Qiu, Anlin Deng, Anjie Hu, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li
KVCompose: Efficient Structured KV Cache Compression with Composite Tokens Authors: Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed
HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models Authors: Chang Dai, Hongyu Shan, Mingyang Song, Di Liang
Interpreting Transformer Architectures as Implicit Multinomial Regression Authors: Jonas A. Actor, Anthony Gruber, Eric C. Cyr
Just-in-time and distributed task representations in language models Authors: Yuxuan Li, Declan Campbell, Stephanie C. Y. Chan, Andrew Kyle Lampinen
Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference Authors: Hao Zhang, Mengsi Lyu, Yulong Ao, Yonghua Lin
Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining Authors: Deniz Bayazit, Aaron Mueller, Antoine Bosselut
Dynamical Learning in Deep Asymmetric Recurrent Neural Networks Authors: Davide Badalotti, Carlo Baldassi, Marc Mézard, Mattia Scardecchia, Riccardo Zecchina
Sample-efficient Integration of New Modalities into Large Language Models Authors: Osman Batur İnce, André F. T. Martins, Oisin Mac Aodha, Edoardo M. Ponti
Beyond I-Con: Exploring New Dimension of Distance Measures in Representation Learning Authors: Jasmine Shone, Shaden Alshammari, Mark Hamilton, Zhening Li, William Freeman
Probabilistic operator learning: generative modeling and uncertainty quantification for foundation models of differential equations Authors: Benjamin J. Zhang, Siting Liu, Stanley J. Osher, Markos A. Katsoulakis
Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions Authors: Faruk Alpay, Taylan Alpay
VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation Authors: Mustafa Munir, Alex Zhang, Radu Marculescu
Natural Spectral Fusion: p-Exponent Cyclic Scheduling and Early Decision-Boundary Alignment in First-Order Optimization Authors: Gongyue Zhang, Honghai Liu
HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions Authors: Rafael Bischof, Michal Piovarči, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel
Any-Step Density Ratio Estimation via Interval-Annealed Secant Alignment Authors: Wei Chen, Shigui Li, Jiacheng Li, Jian Xu, Zhiqi Lin, Junmei Yang, Delu Zeng, John Paisley, Qibin Zhao
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute Authors: Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li
Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving Authors: Fangzhou Wu, Sandeep Silwal
Neuro-Spectral Architectures for Causal Physics-Informed Networks Authors: Arthur Bizzi, Leonardo M. Moreira, Márcio Marques, Leonardo Mendonça, Christian Júnior de Oliveira, Vitor Balestro, Lucas dos Santos Fernandez, Daniel Yukimura, Pavel Petrov, João M. Pereira, Tiago Novello, Lucas Nissenbaum
Adapt in the Wild: Test-Time Entropy Minimization with Sharpness and Feature Regularization Authors: Shuaicheng Niu, Guohao Chen, Deyu Chen, Yifan Zhang, Jiaxiang Wu, Zhiquan Wen, Yaofo Chen, Peilin Zhao, Chunyan Miao, Mingkui Tan

1. SpikingBrain Technical Report: Spiking Brain-inspired Large Models

ArXiv ID: 2509.05276

Authors: Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Zehao Liu, Bohan Sun, Yuhong Chou, Han Xu, Xuerui Qiu, Anlin Deng, Anjie Hu, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li

Abstract: Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.

Comment: Model Architecture, Compression/Efficiency, and HPC: spiking LLMs with linear/hybrid-linear attention and MoE, sparse/event-driven inference with near-constant memory, and custom distributed training on non-NVIDIA hardware.

Relevance: 10 Novelty: 9
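
Illustration (not from the paper): the abstract's claim of near-constant inference memory follows from the general shape of linear attention, which can be sketched as below. This is a generic, unnormalized linear attention with no feature map and no spiking neurons. The function names are hypothetical and it is not SpikingBrain's actual architecture. It only shows that a fixed-size running state reproduces the result of scoring every earlier token.

```python
def linear_attention(queries, keys, values):
    """Causal, unnormalized linear attention using a running d_k x d_v state."""
    d_k, d_v = len(keys[0]), len(values[0])
    # state accumulates outer(k_s, v_s) over all tokens seen so far;
    # its size does not depend on the sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Same quantity computed the expensive way: score every earlier token."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, keys[s]))
            for j in range(len(out)):
                out[j] += score * values[s][j]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    ks = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]]
    vs = [[1.0], [2.0], [4.0]]
    print(linear_attention(qs, ks, vs))
    print(quadratic_reference(qs, ks, vs))
```

The recurrent form keeps only a d_k x d_v matrix per head, whereas the reference form must keep every past key and value, which is the linear memory growth the abstract attributes to standard Transformers.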

2. KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

ArXiv ID: 2509.05165

Authors: Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

Abstract: Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache compression methods either enforce rigid heuristics, disrupt tensor layouts with per-attention-head variability, or require specialized compute kernels. We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens. Our method aggregates attention scores to estimate token importance, selects head-specific tokens independently, and aligns them into composite tokens that respect the uniform cache structure required by existing inference engines. A global allocation mechanism further adapts retention budgets across layers, assigning more capacity to layers with informative tokens. This approach achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods. Crucially, our approach remains fully compatible with standard inference pipelines, offering a practical and scalable solution for efficient long-context LLM deployment.

Comment: Model Compression and Efficiency: structured KV cache compression via attention-guided, layer-adaptive composite tokens compatible with standard inference engines.

Relevance: 10 Novelty: 8
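
Illustration (not from the paper): the selection-and-alignment step described in the abstract might look roughly like the sketch below. Several details are assumptions made for illustration: importance is a plain sum of attention weights over queries, each head keeps its top `budget` tokens, and slot j of the compressed cache pairs every head's j-th ranked token. The function name `compose_kv_indices` is hypothetical, and the layer-adaptive global budget allocation is omitted.

```python
def compose_kv_indices(attn, budget):
    """Pick tokens per head and align them into uniform composite slots.

    attn[h][q][t] is the attention weight head h gives cached token t for
    query q. Returns composite[j][h]: the index of the original token whose
    key/value head h would store in composite slot j.
    """
    num_tokens = len(attn[0][0])
    per_head = []
    for head in attn:
        # Assumed importance estimate: total attention a token receives.
        importance = [sum(row[t] for row in head) for t in range(num_tokens)]
        ranked = sorted(range(num_tokens), key=lambda t: importance[t], reverse=True)
        per_head.append(ranked[:budget])
    # Every head contributes exactly `budget` entries, so the cache keeps a
    # rectangular (slots x heads) layout even though heads chose different tokens.
    return [[selected[j] for selected in per_head] for j in range(budget)]


if __name__ == "__main__":
    attn = [
        [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]],  # head 0 favors early tokens
        [[0.1, 0.1, 0.2, 0.6], [0.1, 0.1, 0.3, 0.5]],  # head 1 favors late tokens
    ]
    print(compose_kv_indices(attn, budget=2))  # [[0, 3], [1, 2]]
```

Because each composite slot holds one entry per head, the compressed cache has the same tensor shape for all heads, which is the property the abstract says existing inference engines require.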

3. HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models

ArXiv ID: 2509.05218

Authors: Chang Dai, Hongyu Shan, Mingyang Song, Di Liang

Abstract: Positional encoding mechanisms enable Transformers to model sequential structure and long-range dependencies in text. While absolute positional encodings struggle with extrapolation to longer sequences due to fixed positional representations, and relative approaches like ALiBi exhibit performance degradation on extremely long contexts, the widely-used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling. We address these limitations through a geometric reformulation of positional encoding. Drawing inspiration from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Theoretical analysis demonstrates that RoPE is a special case of our generalized formulation. HoPE fundamentally resolves RoPE's oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experimental results, including perplexity evaluations under several extended sequence benchmarks, show that HoPE consistently exceeds existing positional encoding methods. These findings underscore HoPE's enhanced capacity for representing and generalizing long-range dependencies. Data and code will be available.

Comment: Model Architecture: introduces hyperbolic rotary positional encoding (HoPE), a geometric generalization of RoPE for stable long-range dependencies.

Relevance: 10 Novelty: 8
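
Illustration (not from the paper): the oscillatory behavior the abstract attributes to RoPE can be seen in a minimal sketch that applies standard RoPE to a single 2-D feature pair. The frequency `theta=0.5` and the function names are illustrative assumptions. The sketch covers only RoPE, not HoPE, whose exact formulation is not given in the abstract.

```python
import math


def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def rope_score(q, k, m, n, theta=0.5):
    """Dot product of q at position m and k at position n after RoPE rotation."""
    qm = rotate(q, m * theta)
    kn = rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]


if __name__ == "__main__":
    q = k = (1.0, 0.0)
    # The score depends only on the distance m - n and equals cos((m - n) * theta)
    # for this choice of q and k, so it rises and falls instead of decaying.
    for distance in range(0, 13, 2):
        print(distance, round(rope_score(q, k, distance, 0), 3))
```

For identical unit vectors the score is a cosine of the token distance, so it changes sign repeatedly as the distance grows. This is the non-monotonic pattern that the abstract says HoPE replaces with a monotonic decay.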

4. Interpreting Transformer Architectures as Implicit Multinomial Regression

ArXiv ID: 2509.04653

Authors: Jonas A. Actor, Anthony Gruber, Eric C. Cyr

Abstract: Mechanistic interpretability aims to understand how internal components of modern machine learning models, such as weights, activations, and layers, give rise to the model's overall behavior. One particularly opaque mechanism is attention: despite its central role in transformer models, its mathematical underpinnings and relationship to concepts like feature polysemanticity, superposition, and model performance remain poorly understood. This paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields optimal solutions that align with the dynamics induced by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.

Comment: Model Architecture/Mechanistic Interpretability: establishes a theoretical link between attention dynamics in transformers and optimal feature recovery in multinomial regression.

Relevance: 10 Novelty: 8

5. Just-in-time and distributed task representations in language models

ArXiv ID: 2509.04466

Authors: Yuxuan Li, Declan Campbell, Stephanie C. Y. Chan, Andrew Kyle Lampinen

Abstract: Many of language models' impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We focus on "transferrable" task representations -- vector representations that can restore task context in another instance of the model, even without the full prompt. We show that these representations evolve in non-monotonic and sporadic ways, and are distinct from a more inert representation of high-level task categories that persists throughout the context. Specifically, models often condense multiple evidence into these transferrable task representations, which align well with the performance improvement based on more examples in the context. However, this accrual process exhibits strong locality along the sequence dimension, coming online only at certain tokens -- despite task identity being reliably decodable throughout the context. Moreover, these local but transferrable task representations tend to capture minimal "task scopes", such as a semantically-independent subtask, and models rely on more temporally-distributed representations to support longer and composite tasks. This two-fold locality (temporal and semantic) underscores a kind of just-in-time computational process underlying language models' ability to adapt to new evidence and learn new tasks on the fly.

Comment: Matches Representation Learning criterion: empirical analysis of when/where transferable task representations form and evolve during in-context learning in LMs.