Krause Synchronization Transformers

Jingkun Liu (1,2,3), Yisong Yue (4), Max Welling (5,6), Yue Song (1,2)

(1) Shanghai Qi Zhi Institute     (2) Tsinghua University     (3) Shanghai Jiao Tong University
(4) California Institute of Technology     (5) University of Amsterdam     (6) Cusp AI

ICML 2026

Krause Attention, grounded in bounded-confidence interactions, promotes localized multi-cluster synchronization (top). In contrast, standard self-attention tends to induce globally coupled dynamics that concentrate attention onto a dominant mode, often manifesting as attention sinks (Xiao et al., 2024) (bottom).

Introduction

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length.
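For readers unfamiliar with bounded-confidence consensus dynamics, the following is a minimal sketch of the classical Hegselmann-Krause opinion model. It is not the Krause Attention layer itself. Each particle moves to the mean of all particles within a confidence radius; the names `hk_step`, `simulate`, and `eps`, and the initial values, are illustrative assumptions.

```python
import numpy as np

def hk_step(x, eps):
    """One synchronous Hegselmann-Krause update on 1-D opinions x."""
    # neighbors[i, j] is True when particle j lies within eps of particle i
    neighbors = np.abs(x[:, None] - x[None, :]) <= eps
    # Each particle moves to the mean of its neighbors (itself included)
    return (neighbors * x[None, :]).sum(axis=1) / neighbors.sum(axis=1)

def simulate(x0, eps, steps=50):
    """Iterate the update a fixed number of times."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = hk_step(x, eps)
    return x

# Two well-separated groups of opinions
x0 = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
local = simulate(x0, eps=0.3)    # bounded confidence: groups never interact
merged = simulate(x0, eps=10.0)  # unbounded confidence: everyone interacts
```

With the small radius the two groups settle at separate values, while the large radius pulls every particle to a single global mean. This is the contrast between localized multi-cluster synchronization and global consensus described above.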

Empirically, Krause Attention delivers consistent and substantial gains across vision, generation, and language modeling tasks. For image classification, Krause Vision Transformers (KViTs) consistently outperform standard ViTs (Dosovitskiy et al., 2021) on CIFAR-10/100 and ImageNet-1K, achieving an average accuracy improvement of +3.0% while reducing FLOPs by approximately 30% across model scales. In autoregressive image generation (Parmar et al., 2018), Krause-based models achieve lower negative log-likelihood than standard Transformers while enabling more than 2× faster inference. For large language models (Yang et al., 2024a; Grattafiori et al., 2024) and models trained from scratch, incorporating Krause Attention as an auxiliary pathway consistently improves zero-shot performance over both LoRA-finetuned (Hu et al., 2022) and from-scratch baselines across a broad suite of language understanding benchmarks.

Krause Attention Mechanism

Visual illustration of Krause Attention dynamics.

Krause Attention computes RBF affinity scores, restricts updates to spatially local neighborhoods, and applies top-k selective interactions in representation space.

By encoding locality and selective interactions into the design, Krause Attention turns clustering from a fragile, emergent phenomenon into a more stable architectural inductive bias. This helps preserve token diversity and improve robustness against representation collapse.
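The three ingredients named above (RBF affinity, local neighborhoods, top-k selection) can be sketched as a single-head layer in NumPy. This is an illustrative sketch, not the authors' implementation: the function name `krause_attention_sketch`, the bandwidth `sigma`, the symmetric positional `window`, and the row normalization are all assumptions, and the paper should be consulted for the exact formulation.

```python
import numpy as np

def krause_attention_sketch(q, k, v, window=2, top_k=2, sigma=1.0):
    """Illustrative layer: RBF affinity + local window + top-k selection.

    q, k, v: arrays of shape (n, d). All hyperparameters are assumptions.
    Assumes top_k <= window + 1 <= n so every row keeps a positive affinity.
    """
    n = q.shape[0]
    # Squared Euclidean distances between queries and keys, shape (n, n)
    d2 = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    # RBF affinity: nearby representations get affinity close to 1
    aff = np.exp(-d2 / (2.0 * sigma ** 2))
    # Positional locality: token i only interacts with tokens |i - j| <= window
    idx = np.arange(n)
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    aff = np.where(local, aff, 0.0)
    # Selective sparsity: keep the top_k largest affinities in each row
    # (exact ties at the threshold would keep more than top_k entries)
    kth = np.sort(aff, axis=1)[:, -top_k][:, None]
    aff = np.where(aff >= kth, aff, 0.0)
    # Normalize the surviving affinities so each row sums to one
    weights = aff / aff.sum(axis=1, keepdims=True)
    return weights @ v, weights

# Usage on random tokens (self-interaction, so q = k = v)
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
out, w = krause_attention_sketch(x, x, x, window=2, top_k=2)
```

For clarity this version still materializes an n × n matrix. A linear-time implementation would compute only the entries inside each window.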

Vision Experimental Results

Image classification results on CIFAR-10.

Model                   Acc. (%)   Parameters    FLOPs
KViT-T (Ours)           93.81      5,362,774     0.25G
ViT-T                   90.75      5,362,762     0.37G
KViT-S (Ours)           95.20      21,342,358    0.97G
ViT-S                   93.33      21,342,346    1.43G
KViT-B (Ours)           95.35      85,152,022    3.77G
ViT-B                   92.45      85,152,010    5.61G
KViT-B (RoPE) (Ours)    95.68      85,152,022    3.77G
ViT-B (RoPE)            94.10      85,152,010    5.61G
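The accuracy gains and FLOPs savings implied by the CIFAR-10 table can be checked with a few lines of arithmetic. The numbers below are copied from the table (non-RoPE rows only); the helper name `summarize` is an illustrative assumption.

```python
# (scale, KViT acc., ViT acc., KViT GFLOPs, ViT GFLOPs) from the CIFAR-10 table
rows = [
    ("T", 93.81, 90.75, 0.25, 0.37),
    ("S", 95.20, 93.33, 0.97, 1.43),
    ("B", 95.35, 92.45, 3.77, 5.61),
]

def summarize(rows):
    """Return (scale, accuracy gain in points, FLOPs saving in percent)."""
    results = []
    for name, k_acc, v_acc, k_flops, v_flops in rows:
        gain = k_acc - v_acc
        saving = 100.0 * (1.0 - k_flops / v_flops)
        results.append((name, gain, saving))
    return results

for name, gain, saving in summarize(rows):
    print(f"{name}: +{gain:.2f} acc. points, {saving:.1f}% fewer FLOPs")
```

On this table the FLOPs saving is between 32% and 33% at every scale, in line with the roughly 30% reduction reported in the introduction.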

Attention Heatmaps in ViTs

Krause Attention yields more diverse attention heads.

Attention Evolution

Evolution of attention scores across layers in KViTs/ViTs. Krause Attention (left) achieves stable multi-cluster formation, while standard attention (right) progressively converges to a single global consensus.

Autoregressive Image Generation

Samples completed by Krause Autoregressive Models (KARMs) on MNIST (left) and CIFAR-10 (right).

Language Experimental Results

Language understanding and reasoning results of Krause-Llama3-8B. Results are reported as accuracy (%); entries with two values give accuracy / macro-F1.

Model                            BoolQ   CB            PIQA    MNLI          ANLI-R1       ANLI-R2       ANLI-R3       MMLU-Pro   IFEval
Krause-Llama3-8B (Ours)          80.59   64.29/48.04   77.77   63.27/53.72   40.30/33.01   40.50/34.27   45.67/39.84   41.67      34.01
Llama3-8B (finetuned w/ LoRA)    80.41   60.71/47.81   75.16   59.53/55.29   38.70/30.62   39.90/33.37   44.92/39.57   41.67      32.72
Llama3-8B                        76.13   41.07/19.41   51.52   35.45/18.11   33.40/16.69   33.40/16.69   33.50/17.04   37.50      22.18

Alleviating Attention Sinks

Layer dynamics of first-token attentions on Llama3-8B. Bounded-confidence interactions naturally moderate attention concentration.

LLMs often suffer from the attention sink effect (Xiao et al., 2024), where softmax normalization allocates disproportionately high attention scores to early tokens, regardless of their semantic relevance. This behavior introduces positional bias, reduces model expressivity, and weakens representation robustness.

Krause Attention provides a complementary, bounded-confidence mechanism for mitigating this issue. Because attention is restricted to a local neighborhood, distant tokens can no longer allocate weight to the initial positions once those positions fall outside the receptive field. As shown in the figure above, the base Llama model exhibits large oscillations and persistent peaks across layers, whereas Krause-LLMs produce markedly more stable attention curves. This stabilization indicates that Krause Attention reduces reliance on fixed positional anchors and supports more robust representation learning.
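The receptive-field argument can be illustrated with a small NumPy sketch. Random scores and a plain softmax stand in for real attention logits here, purely to isolate the effect of the mask; this is not the Krause affinity or a measurement on Llama3-8B. The names `masked_softmax`, `window`, and the sizes are illustrative assumptions.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax over the entries where mask is True."""
    s = np.where(mask, scores, -np.inf)
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n, window = 16, 4                  # illustrative sizes, not from the paper
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))   # stand-in for attention logits

idx = np.arange(n)
offset = idx[:, None] - idx[None, :]       # query position minus key position
causal = offset >= 0                       # standard causal mask
causal_local = causal & (offset < window)  # causal mask limited to `window` keys

w_global = masked_softmax(scores, causal)
w_local = masked_softmax(scores, causal_local)

# Weight that each query places on the first token
first_global = w_global[:, 0]
first_local = w_local[:, 0]
```

Under the global causal mask every query keeps some weight on position 0, so a sink can form there. Under the local mask, queries at positions `window` and beyond assign it exactly zero weight.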

BibTeX

If you find our work useful, please consider citing our paper:

@article{liukrause2026,
  title={Krause Synchronization Transformers},
  author={Jingkun Liu and Yisong Yue and Max Welling and Yue Song},
  journal={ICML},
  year={2026},
  url={https://arxiv.org/abs/2602.11534}
}