Introduction
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
Empirically, Krause Attention delivers consistent and substantial gains across vision, generation, and language modeling tasks. For image classification, Krause Vision Transformers (KViTs) consistently outperform standard ViTs (Dosovitskiy et al., 2021) on CIFAR-10/100 and ImageNet-1K, achieving an average accuracy improvement of +3.7% while reducing FLOPs by approximately 30% across model scales. In autoregressive image generation (Parmar et al., 2018), Krause-based models achieve lower negative log-likelihood than standard Transformers while enabling more than 2× faster inference. For LLMs (Yang et al., 2024a; Grattafiori et al., 2024), integrating Krause Attention as an auxiliary pathway consistently improves zero-shot evaluation performance over LoRA-finetuned baselines (Hu et al., 2022) on a broad suite of challenging language reasoning benchmarks, indicating improved robustness to attention concentration effects. Together, these results demonstrate that bounded-confidence dynamics provide a scalable, computationally efficient, and practically effective inductive bias for self-attention mechanisms across diverse modalities and model regimes.
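For readers unfamiliar with the bounded-confidence ("Krause", or Hegselmann-Krause) dynamics referenced above: each agent averages only over peers whose state lies within a confidence radius, so the population settles into several local clusters rather than a single global consensus. The sketch below is a minimal NumPy illustration of that classical update; the radius `eps` and the toy opinion vector are illustrative choices, not values from the paper.

```python
import numpy as np

def hk_step(x, eps):
    """One Hegselmann-Krause bounded-confidence update.

    Each agent moves to the mean of all agents (itself included)
    whose opinion lies within distance `eps` of its own.
    """
    # Pairwise distances between opinions, shape (n, n).
    dist = np.abs(x[:, None] - x[None, :])
    # Bounded-confidence neighborhood mask.
    mask = dist <= eps
    # Average over each agent's neighborhood.
    return (mask * x[None, :]).sum(axis=1) / mask.sum(axis=1)

# Toy run: opinions form local clusters instead of one global mean.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
for _ in range(20):
    x = hk_step(x, eps=0.15)
print(np.round(np.sort(x), 3))  # a handful of distinct cluster values
```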
Krause Attention
Visual Illustration of our Krause Attention.
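Since the figure is only a visual illustration, here is a minimal, single-head PyTorch sketch of what a distance-based, bounded-confidence attention step could look like: each query aggregates only keys within a confidence radius `eps`, with weights decaying in distance rather than growing with dot-product similarity. The squared-Euclidean distance, the dense masking, and the value of `eps` are assumptions made for illustration; the paper's actual operator additionally restricts interactions to local neighborhoods to reach linear complexity and is not reproduced exactly here.

```python
import torch
import torch.nn.functional as F

def bounded_confidence_attention(q, k, v, eps=4.0):
    """Illustrative distance-based, bounded-confidence attention
    (a sketch, not the paper's exact operator).

    q, k, v: (batch, seq, dim). Each query aggregates only values whose
    key lies within squared-Euclidean distance `eps`, with weights that
    decay in distance instead of growing with dot-product similarity.
    """
    # Pairwise squared distances between queries and keys, shape (B, S, S).
    d2 = torch.cdist(q, k, p=2.0) ** 2
    # Bounded-confidence mask: interactions outside the radius are dropped.
    logits = -d2.masked_fill(d2 > eps, float("inf"))
    # Softmax over the surviving (local) neighborhood only.
    w = F.softmax(logits, dim=-1)
    # Guard against rows where no key fell inside the radius.
    w = torch.nan_to_num(w, nan=0.0)
    return w @ v

# Toy usage with hypothetical shapes.
q = torch.randn(2, 16, 32)
out = bounded_confidence_attention(q, q, torch.randn(2, 16, 32))
print(out.shape)  # torch.Size([2, 16, 32])
```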
Alleviating Attention Sinks in Krause-LLMs
Layer-wise dynamics of first-token attention scores on Llama3-8B.
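Curves like the ones above can be reproduced in spirit with the Hugging Face transformers API by averaging, for each layer, the attention mass that later queries place on the first token. The snippet below is a generic sketch; the checkpoint name, the prompt, and the averaging choices are assumptions rather than the paper's exact measurement protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any causal LM that returns attentions works.
name = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,
    attn_implementation="eager",  # needed so attention weights are returned
)
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
for layer, attn in enumerate(out.attentions):
    # Average attention mass that queries after position 0 assign to token 0.
    sink_mass = attn[:, :, 1:, 0].float().mean().item()
    print(f"layer {layer:2d}: first-token attention mass = {sink_mass:.3f}")
```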
Attention Heatmaps in Vision Transformers
Krause Attention yields more diverse attention heads.
Evolution of attention scores across layers in KViTs/ViTs. Krause Attention (left) achieves stable multi-cluster formation, while standard attention (right) progressively converges to a single global consensus.
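One simple way to quantify the head diversity illustrated above (not necessarily the metric used in the paper) is the mean pairwise cosine similarity between per-head attention maps within a layer: lower similarity indicates more diverse, less redundant heads.

```python
import torch

def head_diversity(attn):
    """Mean pairwise cosine similarity between heads' attention maps.

    attn: (heads, seq, seq) attention weights from one layer.
    Lower similarity indicates more diverse (less redundant) heads.
    """
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    flat = flat / flat.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    sim = flat @ flat.T                          # (heads, heads) cosine similarities
    off_diag = sim[~torch.eye(h, dtype=torch.bool)]
    return off_diag.mean().item()

# Toy usage with a random layer; in practice, feed attention maps
# collected from a (K)ViT forward pass with attention outputs enabled.
attn = torch.softmax(torch.randn(12, 197, 197), dim=-1)
print(f"mean inter-head similarity: {head_diversity(attn):.3f}")
```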
Krause Autoregressive Transformers for Image Generation
Samples completed by Krause Autoregressive Models (KARMs) on MNIST (left) and CIFAR-10 (right).
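The completions follow the standard autoregressive image-modeling recipe (Parmar et al., 2018): the model is shown a prefix of pixel tokens and samples the remainder one token at a time. Below is a generic sketch of that sampling loop, assuming a hypothetical `model` callable that maps a token prefix to next-token logits; it is not the paper's exact generation code.

```python
import torch

@torch.no_grad()
def complete_image(model, prefix, total_len, temperature=1.0):
    """Autoregressively complete a flattened image from a pixel prefix.

    model: callable mapping a (batch, t) token sequence to (batch, t, vocab)
           logits (hypothetical interface, not the paper's exact API).
    prefix: (batch, t0) observed pixel tokens; total_len: pixels per image.
    """
    seq = prefix
    while seq.shape[1] < total_len:
        logits = model(seq)[:, -1, :] / temperature   # next-pixel distribution
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # sample one pixel token
        seq = torch.cat([seq, nxt], dim=1)
    return seq  # e.g. reshape to (batch, 28, 28) for MNIST

# Usage (hypothetical): complete the bottom half of MNIST images.
# samples = complete_image(karm, top_half_tokens, total_len=28 * 28)
```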
BibTeX
@article{liukrause2026,
  title={Krause Synchronization Transformers},
  author={Jingkun Liu and Yisong Yue and Max Welling and Yue Song},
  journal={ArXiv},
  year={2026},
  url={https://arxiv.org/abs/2602.11534}
}