Krause Attention, grounded in bounded-confidence interactions, promotes localized multi-cluster synchronization (top). In contrast, standard self-attention tends to induce globally coupled dynamics that concentrate attention onto a dominant mode, often manifesting as attention sinks (Xiao et al., 2024) (bottom).
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length.
Empirically, Krause Attention delivers consistent and substantial gains across vision, generation, and language modeling tasks. For image classification, Krause Vision Transformers (ViTs) consistently outperform standard ViTs (Dosovitskiy et al., 2021) on CIFAR-10/100 and ImageNet-1K, achieving an average accuracy improvement of \( \mathbf{+3.0\%} \) while reducing FLOPs by approximately 30% across model scales. In autoregressive image generation (Parmar et al., 2018), Krause-based models achieve lower negative log-likelihood than standard Transformers while enabling more than 2× faster inference. For large language models (Yang et al., 2024a; Grattafiori et al., 2024) and models trained from scratch, incorporating Krause Attention as an auxiliary pathway consistently improves zero-shot performance over both LoRA-finetuned (Hu et al., 2022) and from-scratch baselines across a broad suite of language understanding benchmarks.
Visual illustration of Krause Attention dynamics.
Krause Attention computes RBF affinity scores, restricts updates to spatial local neighborhoods, and applies top-k selective interactions in the representation space.
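The three ingredients named above (RBF affinities, a local neighborhood, and top-k selection) can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation. It assumes a 1-D token sequence, a Gaussian RBF of the form exp(-||q - k||^2 / (2σ^2)), and the order "window first, then top-k, then normalize". The names `rbf_affinity`, `krause_attention_sketch`, `window`, `top_k`, and `sigma` are hypothetical.

```python
import math

def rbf_affinity(q, k, sigma=1.0):
    """Gaussian RBF affinity between two vectors (assumed form, not from the paper)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, k))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def krause_attention_sketch(queries, keys, values, window=2, top_k=2, sigma=1.0):
    """Return (outputs, weights) for a toy distance-based, local, top-k attention.

    weights[i] maps each kept key index j to its normalized weight for query i.
    """
    n = len(queries)
    dim = len(values[0])
    outputs, weights = [], []
    for i in range(n):
        # 1) Locality: only positions within `window` of i are candidates.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scored = [(rbf_affinity(queries[i], keys[j], sigma), j) for j in range(lo, hi)]
        # 2) Selectivity: keep the top_k candidates with the highest affinity.
        kept = sorted(scored, reverse=True)[:top_k]
        # 3) Normalize the kept affinities so they sum to 1
        #    (fall back to uniform weights if every affinity underflows to 0).
        total = sum(a for a, _ in kept)
        if total > 0.0:
            row = {j: a / total for a, j in kept}
        else:
            row = {j: 1.0 / len(kept) for _, j in kept}
        # 4) Aggregate the values of the selected neighbors.
        out = [sum(w * values[j][d] for j, w in row.items()) for d in range(dim)]
        outputs.append(out)
        weights.append(row)
    return outputs, weights

# Two well-separated groups of 1-D tokens; queries, keys, and values coincide here.
tokens = [[0.0], [0.1], [5.0], [5.1], [5.2]]
outputs, weights = krause_attention_sketch(tokens, tokens, tokens)
for i, row in enumerate(weights):
    print(i, sorted(row))  # indices each token actually aggregates from
```

In this toy setting each token only aggregates from nearby positions whose representations are also close, so the two groups stay separate instead of being averaged together.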
By encoding locality and selective interactions into the design, Krause Attention turns clustering from a fragile, emergent phenomenon into a more stable architectural inductive bias. This helps preserve token diversity and improve robustness against representation collapse.
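For intuition about why bounded-confidence interactions favor several stable clusters, the classical Hegselmann-Krause opinion model is a useful toy: each agent repeatedly moves to the average of all agents within a confidence radius. The sketch below uses scalar opinions and is only an analogy for the attention dynamics discussed here. The helper names `hk_step`, `simulate`, and `count_clusters` and the radius `eps` are hypothetical.

```python
def hk_step(opinions, eps):
    """One synchronous Hegselmann-Krause update on scalar opinions."""
    updated = []
    for x in opinions:
        # Each agent averages only the agents within its confidence radius.
        neighbors = [y for y in opinions if abs(x - y) <= eps]
        updated.append(sum(neighbors) / len(neighbors))
    return updated

def simulate(opinions, eps, steps=50):
    """Apply hk_step repeatedly and return the final opinions."""
    for _ in range(steps):
        opinions = hk_step(opinions, eps)
    return opinions

def count_clusters(opinions, tol=1e-6):
    """Count groups of opinions that differ by more than tol."""
    representatives = []
    for x in sorted(opinions):
        if not representatives or x - representatives[-1] > tol:
            representatives.append(x)
    return len(representatives)

initial = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
print(count_clusters(simulate(initial, eps=0.3)))  # 2: bounded radius keeps two clusters
print(count_clusters(simulate(initial, eps=2.0)))  # 1: everyone interacts, single consensus
```

A small radius plays the role of localized interaction and preserves both groups, while a radius that covers every agent behaves like global coupling and collapses them into one consensus value.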
Krause Attention yields more diverse attention heads.
Evolution of attention scores across layers in KViTs/ViTs. Krause Attention (left) achieves stable multi-cluster formation, while standard attention (right) progressively converges to a single global consensus.
Samples completed by Krause Autoregressive Models (KARMs) on MNIST (left) and CIFAR-10 (right).
Layer-wise dynamics of first-token attention in Llama3-8B. Bounded-confidence interactions naturally moderate attention concentration.
LLMs often suffer from the attention sink effect (Xiao et al., 2024), where softmax normalization allocates disproportionately high attention scores to early tokens, regardless of their semantic relevance. This behavior introduces positional bias, reduces model expressivity, and weakens representation robustness.
Krause Attention provides a complementary, bounded-confidence mechanism for mitigating this issue. Because attention is restricted to a local neighborhood, distant tokens can no longer allocate weight to the initial positions once those positions fall outside their receptive field. As shown in the figure above, the base Llama model exhibits large oscillations and persistent peaks across layers, whereas Krause-LLMs produce markedly more stable attention curves. This stabilization indicates that Krause Attention reduces reliance on fixed positional anchors and supports more robust representation learning.
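The receptive-field argument can be checked on a toy mask. The sketch below assumes a causal local window in which query `i` may attend to key `j` only when `0 <= i - j <= window`. The actual neighborhood definition used in Krause-LLMs may differ, and the function names are hypothetical.

```python
def causal_local_mask(n_tokens, window):
    """mask[i][j] is True when query i may attend to key j (j <= i and i - j <= window)."""
    return [[(j <= i) and (i - j <= window) for j in range(n_tokens)]
            for i in range(n_tokens)]

def queries_reaching_first_token(mask):
    """Indices of queries that are allowed to put weight on position 0."""
    return [i for i, row in enumerate(mask) if row[0]]

n_tokens, window = 8, 2
local = causal_local_mask(n_tokens, window)
# A window at least as long as the sequence recovers dense causal attention.
dense = causal_local_mask(n_tokens, n_tokens)

print(queries_reaching_first_token(local))       # [0, 1, 2]
print(len(queries_reaching_first_token(dense)))  # 8
```

Under the dense causal mask every query can place weight on the first token, whereas under the local mask only the first `window + 1` queries can, so later tokens have no direct path to a first-position sink.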
If you find our work useful, please consider citing our paper:
@article{liukrause2026,
title={Krause Synchronization Transformers},
author={Jingkun Liu and Yisong Yue and Max Welling and Yue Song},
journal={ICML},
year={2026},
url={https://arxiv.org/abs/2602.11534}
}