[2602.17526] The Anxiety of Influence: Bloom Filters in Transformer Attention Heads
Summary
This article examines how certain transformer attention heads act as membership testers that detect token repetition, identifying such heads across several language models and analyzing their behavior as Bloom filters.
Why It Matters
Understanding the functionality of transformer attention heads as membership testers can enhance the design of language models, improve efficiency in token processing, and contribute to advancements in natural language processing and AI systems.
Key Takeaways
- Certain transformer attention heads function as high-precision membership filters.
- The study identifies a spectrum of membership-testing strategies across language models.
- Membership testing contributes to both repeated and novel token processing.
- Some heads generalize responses to any repeated token type, enhancing model versatility.
- Confound controls led one head, initially identified as a Bloom filter, to be reclassified as a general prefix-attention head, strengthening the remaining findings.
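As a toy illustration of the membership question the paper attributes to these heads ("has this token appeared before in the context?"), a classical Bloom filter can be run over a token stream. This is a minimal sketch, not the paper's probing setup; the parameters `m_bits` and `k_hashes` are illustrative choices, not values from the study.

```python
import hashlib

class BloomFilter:
    """Classical Bloom filter: m bits, k hash functions (illustrative sketch)."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # integer used as a bit array

    def _indexes(self, item):
        # Derive k bit positions by salting one hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= 1 << idx

    def contains(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits >> idx & 1 for idx in self._indexes(item))

# Membership test over a token stream: flag tokens seen earlier in context.
bf = BloomFilter(m_bits=64, k_hashes=3)
stream = ["the", "cat", "sat", "the", "mat"]
flags = []
for tok in stream:
    flags.append(bf.contains(tok))  # queried before insertion
    bf.add(tok)
print(flags)  # the repeated "the" is flagged True; first occurrences are False
```

A first occurrence is always reported `False` here because the filter starts empty, while a genuine repeat is always reported `True`; only novel tokens can be misflagged, which is the false-positive behavior the paper measures in the attention heads.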
Computer Science > Machine Learning
arXiv:2602.17526 (cs) [Submitted on 19 Feb 2026]
Title: The Anxiety of Influence: Bloom Filters in Transformer Attention Heads
Authors: Peter Balogh

Abstract: Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, medium, and large; Pythia-160M) and show that they form a spectrum of membership-testing strategies. Two heads (L0H1 and L0H5 in GPT-2 small) function as high-precision membership filters with false positive rates of 0-4% even at 180 unique context tokens -- well above the $d_\text{head} = 64$ bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula $p \approx (1 - e^{-kn/m})^k$ with $R^2 = 1.0$ and fitted capacity $m \approx 5$ bits, saturating by $n \approx 20$ unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix-attention head after confound controls revealed its apparent capacity curve was a sequence-length artifact. Together, the three genuine membership-testing heads form a multi-resolution system concentrated in early layers (0-1), t...
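The theoretical false-positive formula quoted in the abstract, $p \approx (1 - e^{-kn/m})^k$, can be evaluated directly to see why a fitted capacity of $m \approx 5$ bits saturates by $n \approx 20$ unique tokens. The choice $k = 1$ below is an assumption for illustration; the paper's fitted $k$ is not given in this excerpt.

```python
import math

def bloom_fpr(n, m, k):
    """Theoretical Bloom filter false-positive rate: p ≈ (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Fitted capacity m ≈ 5 bits from the abstract; k = 1 is an assumed value.
for n in (1, 5, 10, 20, 40):
    print(f"n={n:>2}  p={bloom_fpr(n, m=5, k=1):.3f}")
```

With only 5 bits of effective capacity the false-positive rate already exceeds 0.98 at $n = 20$, matching the saturation the abstract reports for head L1H11, whereas the high-precision heads (L0H1, L0H5) stay at 0-4% well past the 64-bit classical limit.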