Personalized Daily ArXiv Papers 2025-11-24

[gpt-5]	Prompt	Completion	Total
Token	28863	28013	56876
Cost	$0.04	$0.28	$0.32

Total arXiv papers: 355

Total scanned papers: 197

Total relevant papers: 18

Table of contents with paper titles:

Selective Rotary Position Embedding Authors: Sajad Movahedi, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, Volkan Cevher
A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias Authors: Wei-Kai Chang, Rajiv Khanna
Fermions and Supersymmetry in Neural Network Field Theories Authors: Samuel Frank, James Halverson, Anindita Maiti, Fabian Ruehle
MuM: Multi-View Masked Image Modeling for 3D Vision Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach Authors: Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao, Jia Hao, Tao Chen
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design Authors: Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge
Gradient flow for deep equilibrium single-index models Authors: Sanjit Dandapanthula, Aaditya Ramdas
Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required? Authors: Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman
Efficient Penalty-Based Bilevel Methods: Improved Analysis, Novel Updates, and Flatness Condition Authors: Liuyuan Jiang, Quan Xiao, Lisha Chen, Tianyi Chen
Deep Improvement Supervision Authors: Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac
Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation Authors: Yeqin Zhang, Yizheng Zhao, Chen Hu, Binxing Jiao, Daxin Jiang, Ruihang Miao, Cam-Tu Nguyen
InTAct: Interval-based Task Activation Consolidation for Continual Learning Authors: Patryk Krukowski, Jan Miksa, Piotr Helm, Jacek Tabor, Paweł Wawrzyński, Przemysław Spurek
Topologic Attention Networks: Attending to Direct and Indirect Neighbors through Gaussian Belief Propagation Authors: Marshall Rosenhoover, Huaming Zhang
Spanning Tree Autoregressive Visual Generation Authors: Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu
Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation Authors: Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon
DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format Authors: Shady Agwa, Yikang Shen, Shiwei Wang, Themis Prodromakis
Self-Supervised Learning by Curvature Alignment Authors: Benyamin Ghojogh, M. Hadi Sepanj, Paul Fieguth
ManifoldFormer: Geometric Deep Learning for Neural Dynamics on Riemannian Manifolds Authors: Yihang Fu, Lifang He, Qingyu Chen

1. Selective Rotary Position Embedding

ArXiv ID: 2511.17388

Authors: Sajad Movahedi, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, Volkan Cevher

Abstract: Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism that generalizes RoPE and enables rotation by arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.

Comment: Model Architecture: Selective (input-dependent) Rotary Position Embeddings generalizing RoPE across softmax/linear transformers and SSMs with analysis of implicit rotations/forgetting.

Relevance: 10 Novelty: 8
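
Illustration (not from the paper): a minimal sketch of the difference between fixed-angle and input-dependent rotations on a single 2D query-key pair. It assumes the selective variant accumulates a per-token angle increment; the increments below are made-up numbers standing in for a learned function of the input, and the helper names are hypothetical. In both cases the attention score depends only on the difference between the two accumulated angles.

```python
import math

def rotate(pair, angle):
    """Rotate a 2D vector (x, y) counter-clockwise by `angle` radians."""
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def cumulative_angles(increments):
    """Running sum of per-token angle increments."""
    total, out = 0.0, []
    for w in increments:
        total += w
        out.append(total)
    return out

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

# Fixed-angle RoPE: every token advances the angle by the same theta.
theta = 0.3
fixed = cumulative_angles([theta] * 4)

# Hypothetical selective variant: the increment varies per token.
# These values are invented; a real model would compute them from the input.
token_increments = [0.1, 0.7, 0.0, 0.4]
selective = cumulative_angles(token_increments)

q, k = (1.0, 0.5), (0.2, -0.3)
t, s = 3, 1  # query position and key position

# Score with both vectors rotated by their own accumulated angle ...
score = dot(rotate(q, selective[t]), rotate(k, selective[s]))
# ... equals the score with only the relative angle applied to the query.
reference = dot(rotate(q, selective[t] - selective[s]), k)
```

With constant increments the accumulated angle is just position times theta, which recovers the fixed-angle case as a special instance of this sketch.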

2. A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias

ArXiv ID: 2511.17378

Authors: Wei-Kai Chang, Rajiv Khanna

Abstract: Understanding the dynamics of optimization in deep learning is increasingly important as models scale. While stochastic gradient descent (SGD) and its variants reliably find solutions that generalize well, the mechanisms driving this generalization remain unclear. Notably, these algorithms often prefer flatter or simpler minima, particularly in overparameterized settings. Prior work has linked flatness to generalization, and methods like Sharpness-Aware Minimization (SAM) explicitly encourage flatness, but a unified theory connecting data structure, optimization dynamics, and the nature of learned solutions is still lacking. In this work, we develop a linear stability framework that analyzes the behavior of SGD, random perturbations, and SAM, particularly in two layer ReLU networks. Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points, revealing why certain minima are stable and favored during training.

Comment: Training Dynamics Theory: unified stability analysis of SGD vs SAM using a data-coherence curvature measure, explaining flatness preference and simplicity bias.

Relevance: 9 Novelty: 8
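
Illustration (not from the paper): a minimal sketch of the standard SAM update next to a plain gradient step, using full (non-stochastic) gradients on a toy quadratic. The loss, step size `lr`, and radius `rho` are arbitrary choices for the example, and the paper's coherence measure is not implemented here.

```python
import math

def sgd_step(w, grad_fn, lr=0.1):
    """Plain gradient step at the current weights."""
    return [wi - lr * gi for wi, gi in zip(w, grad_fn(w))]

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: move to a nearby perturbed point, then use its gradient."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point, nothing to perturb along
    # Perturb the weights by rho along the normalized gradient direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    # Apply the perturbed-point gradient to the original weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy loss 0.5 * (10 * w0^2 + w1^2): sharp along w0, flat along w1.
def toy_grad(w):
    return [10.0 * w[0], 1.0 * w[1]]

w = [1.0, 1.0]
w_sgd = sgd_step(w, toy_grad)
w_sam = sam_step(w, toy_grad)
```

On this toy loss the SAM gradient is evaluated slightly uphill, so it is larger than the plain gradient in both coordinates, with most of the extra push along the sharp direction `w0`.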

3. Fermions and Supersymmetry in Neural Network Field Theories

ArXiv ID: 2511.16741

Authors: Samuel Frank, James Halverson, Anindita Maiti, Fabian Ruehle

Abstract: We introduce fermionic neural network field theories via Grassmann-valued neural networks. Free theories are obtained by a generalization of the Central Limit Theorem to Grassmann variables. This enables the realization of the free Dirac spinor at infinite width and a four fermion interaction at finite width. Yukawa couplings are introduced by breaking the statistical independence of the output weights for the fermionic and bosonic fields. A large class of interacting supersymmetric quantum mechanics and field theory models are introduced by super-affine transformations on the input that realize a superspace formalism.

Comment: Foundational Architecture Theory: Grassmann-valued neural networks realizing fermionic field theories, infinite-width limits, and supersymmetry via super-affine transformations.

Relevance: 8 Novelty: 9

4. MuM: Multi-View Masked Image Modeling for 3D Vision

ArXiv ID: 2511.17309

Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman

Abstract: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.

Comment: Representation Learning: multi-view masked autoencoding architecture with inter-frame attention tailored for 3D geometric features.

Relevance: 9 Novelty: 7

5. Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach

ArXiv ID: 2511.16786

Authors: Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao, Jia Hao, Tao Chen

Abstract: Multimodal large language models suffer from substantial inference overhead since multimodal KV Cache grows proportionally with the visual input length. Existing multimodal KV Cache compression methods mostly rely on attention scores to reduce cache size, which makes them incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV Cache compression from the perspective of the KV matrices' distribution. First, we observe that the frequency-domain energy of multimodal KV matrices is predominantly concentrated in low frequencies, and we extract this principal energy via a low-pass filter. Further, we find that removing KV pairs that deviate substantially from this principal energy, which we define as Outlier KVs, leads to a pronounced performance drop. Considering that Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain-guided, Outlier-KV-aware KV Cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, a Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV Cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimodal KV compression methods, achieving up to 1.69 times faster decoding with 80% lower KV memory usage while maintaining task performance.

Comment: Model Compression and Efficiency: proposes frequency-domain, outlier-KV-aware KV cache compression for multimodal LLMs with dynamic per-layer budget; compatible with FlashAttention kernels.

Relevance: 9 Novelty: 7
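
Illustration (not from the paper): a minimal sketch of the low-pass-then-deviation idea on a single scalar sequence, using a naive discrete Fourier transform. The actual method operates on KV matrices and its filter, deviation score, and budget rule are not specified here; `low_pass`, `outlier_indices`, `keep`, and `budget` are hypothetical names chosen for this example.

```python
import cmath

def low_pass(signal, keep):
    """Keep the `keep` lowest frequencies (and their mirrors) of a real sequence."""
    n = len(signal)
    # Naive O(n^2) forward DFT along the token axis.
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    # Frequency f and n - f are mirrors; zero out everything above the cutoff.
    filtered = [c if min(f, n - f) < keep else 0 for f, c in enumerate(spectrum)]
    # Inverse DFT; the result is real because the kept spectrum is symmetric.
    return [
        sum(filtered[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)).real / n
        for t in range(n)
    ]

def outlier_indices(values, keep, budget):
    """Indices of the `budget` entries that deviate most from the low-pass signal."""
    smooth = low_pass(values, keep)
    deviation = [abs(v - s) for v, s in zip(values, smooth)]
    ranked = sorted(range(len(values)), key=lambda i: deviation[i], reverse=True)
    return sorted(ranked[:budget])

# Toy "cache": a flat sequence with one entry that stands out.
signal = [1.0] * 8
signal[5] = 9.0
retained = outlier_indices(signal, keep=2, budget=1)
```

Because the selection uses only the cached values and not attention scores, a rule of this shape does not need access to the attention matrix, which is the compatibility point the abstract raises.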

6. Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

ArXiv ID: 2511.17127

Authors: Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge

Abstract: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.

Comment: High Performance Computing and Model Architecture: large-scale MoE pretraining on AMD Pollara with detailed systems/networking microbenchmarks and MI300X-aware transformer/MoE sizing rules for throughput/latency.

Relevance: 9 Novelty: 7

7. Gradient flow for deep equilibrium single-index models

ArXiv ID: 2511.16976

Authors: Sanjit Dandapanthula, Aaditya Ramdas

Abstract: Deep equilibrium models (DEQs) have recently emerged as a powerful paradigm for training infinitely deep weight-tied neural networks that achieve state of the art performance across many modern machine learning tasks. Despite their practical success, theoretically understanding the gradient descent dynamics for training DEQs remains an area of active research. In this work, we rigorously study the gradient descent dynamics for DEQs in the simple setting of linear models and single-index models, filling several gaps in the literature. We prove a conservation law for linear DEQs which implies that the parameters remain trapped on spheres during training and use this property to show that gradient flow remains well-conditioned for all time. We then prove linear convergence of gradient descent to a global minimizer for linear DEQs and deep equilibrium single-index models under appropriate initialization and with a sufficiently small step size. Finally, we validate our theoretical findings through experiments.

Comment: Training Dynamics/Model Architecture: theoretical analysis of gradient flow and convergence for deep equilibrium (DEQ) and single-index models, including conservation law and linear convergence.

Relevance: 9 Novelty: 7
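
Illustration (not from the paper): a minimal sketch of what "infinitely deep and weight-tied" means for a scalar linear layer. Repeating the same layer `z -> w * z + u * x` converges, when `|w| < 1`, to the fixed point `u * x / (1 - w)`, so the equilibrium can be computed directly instead of unrolling layers. The scalar parametrization and the names `deq_forward` and `closed_form` are assumptions made for this example and may differ from the paper's setup.

```python
def deq_forward(w, u, x, iters=200):
    """Apply the same weight-tied layer repeatedly, starting from z = 0."""
    z = 0.0
    for _ in range(iters):
        z = w * z + u * x
    return z

def closed_form(w, u, x):
    """Equilibrium of z = w * z + u * x, valid only for |w| < 1."""
    return u * x / (1.0 - w)

unrolled = deq_forward(w=0.5, u=2.0, x=3.0)
equilibrium = closed_form(w=0.5, u=2.0, x=3.0)
```

The requirement `|w| < 1` is what keeps the iteration contractive in this toy case; the conditioning results described in the abstract concern how such quantities behave during training.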

8. Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?

ArXiv ID: 2511.17400

Authors: Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman

Abstract: Vision Transformers (ViTs) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block - channel-wise comparisons lead to a quadratic growth in attention, resulting in excessive FLOPs and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: "Is it necessary to model all channel interactions?". Inspired by the philosophy of Sparse Mixture-of-Experts (MoE), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in ViTs, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets - JUMP-CP and So2Sat - demonstrate that MoE-ViT achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.

Comment: Model Architecture and Efficiency: sparse Mixture-of-Experts treating channels as experts to reduce cross-channel attention cost in multi-channel ViTs.

Relevance: 9 Novelty: 7
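
Illustration (not from the paper): a minimal sketch of why keeping only a few channels per patch reduces attention cost. It assumes one token per (patch, channel) pair and counts pairwise attention interactions; the router scores, the patch count of 196, and the helper names are made up for the example, and the paper's router is not reproduced.

```python
def top_k_channels(scores, k):
    """Indices of the k highest-scoring channels, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:k])

def attention_pairs(num_patches, channels_per_patch):
    """Number of query-key pairs when every (patch, channel) token attends to all tokens."""
    tokens = num_patches * channels_per_patch
    return tokens * tokens

# Invented router scores for one patch of an 8-channel image.
router_scores = [0.1, 2.3, -0.4, 1.7, 0.0, 0.9, -1.2, 0.3]
selected = top_k_channels(router_scores, k=2)

dense = attention_pairs(num_patches=196, channels_per_patch=8)
sparse = attention_pairs(num_patches=196, channels_per_patch=2)
savings = dense // sparse
```

Under these assumptions, going from 8 channels to 2 per patch cuts the number of attention pairs by a factor of 16, since the count grows with the square of the channel count.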

9. Efficient Penalty-Based Bilevel Methods: Improved Analysis, Novel Updates, and Flatness Condition

ArXiv ID: 2511.16796

Authors: Liuyuan Jiang, Quan Xiao, Lisha Chen, Tianyi Chen

Abstract: Penalty-based methods have become popular for solving bilevel optimization (BLO) problems, thanks to their effective first-order nature. However, they often require inner-loop iterations to solve the lower-level (LL) problem and small outer-loop step sizes to handle the increased smoothness induced by large penalty terms, leading to suboptimal complexity. This work considers the general BLO problems with coupled constraints (CCs) and leverages a novel penalty reformulation that decouples the upper- and lower-level variables. This yields an improved analysis of the smoothness constant, enabling larger step sizes and reduced iteration complexity for Penalty-Based Gradient Descent algorithms in ALTernating fashion (ALT-PBGD). Building on the insight of reduced smoothness, we propose PBGD-Free, a novel fully single-loop algorithm that avoids inner loops for the uncoupled constraint BLO. For BLO with CCs, PBGD-Free employs an efficient inner-loop with substantially reduced iteration complexity. Furthermore, we propose a novel curvature condition describing the "flatness" of the upper-level objective with respect to the LL variable. This condition relaxes the traditional upper-level Lipschitz requirement, enables smaller penalty constant choices, and results in a negligible penalty gradient term during upper-level variable updates. We provide rigorous convergence analysis and validate the method's efficacy through hyperparameter optimization for support vector machines and fine-tuning of large language models.

Comment: Optimization/Efficiency for training: improved penalty-based bilevel methods with larger steps, single-loop updates, and a flatness condition.

Relevance: 8 Novelty: 8

10. Deep Improvement Supervision

ArXiv ID: 2511.16886

Authors: Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

Abstract: Recently, it was shown that small, looped architectures, such as Tiny Recursive Models (TRMs), can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and implicit policy improvement algorithm. Building on these insights, we propose a novel training scheme that provides a target for each loop during training. We demonstrate that our approach significantly enhances training efficiency. Our method reduces the total number of forward passes by 18x and eliminates halting mechanisms, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.

Comment: Model Architecture/Training Efficiency: proposes a new supervision scheme for tiny recursive models that cuts forward passes 18x; insights into latent reasoning akin to classifier-free guidance.

Relevance: 8 Novelty: 8

11. Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation

ArXiv ID: 2511.17129

Authors: Yeqin Zhang, Yizheng Zhao, Chen Hu, Binxing Jiao, Daxin Jiang, Ruihang Miao, Cam-Tu Nguyen

Abstract: Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most of the LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.

Comment: Representation Learning: introduces a context-compression pretext objective that trains LLMs to produce compact memory tokens for holistic embeddings, further improved with contrastive learning.