Transformers
Attention
See how transformer attention links tokens through query-key-value structure.
Input Sentence
Angry lions hunt zebras, while cats sleep peacefully
Word Embeddings Visualization
Words are plotted in 2D space based on their embedding vectors.
Transformation to Query, Key, Value Vectors
Each word's embedding vector is multiplied by three learned projection matrices, $W_Q$, $W_K$, and $W_V$, to produce that word's query (Q), key (K), and value (V) vectors.
| Word | Embedding | Q | K | V |
|---|---|---|---|---|
| angry | [0.95, -0.90] | [0.37, -1.25] | [0.04, -1.31] | [-0.30, -1.27] |
| lions | [-0.95, 0.80] | [-0.42, 1.17] | [-0.11, 1.24] | [0.22, 1.22] |
| hunt | [-0.85, -0.75] | [-1.11, -0.22] | [-1.13, 0.07] | [-1.07, 0.36] |
| zebras | [-0.85, 0.95] | [-0.26, 1.25] | [0.07, 1.27] | [0.40, 1.21] |
| while | [-0.20, 0.00] | [-0.17, 0.10] | [-0.14, 0.14] | [-0.10, 0.17] |
| cats | [-0.80, 0.90] | [-0.24, 1.18] | [0.07, 1.20] | [0.38, 1.14] |
| sleep | [-0.75, -0.85] | [-1.07, -0.36] | [-1.13, -0.07] | [-1.11, 0.22] |
| peacefully | [0.85, -0.75] | [0.36, -1.07] | [0.07, -1.13] | [-0.22, -1.11] |
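As a rough illustration of the projection step, the sketch below reproduces the "angry" row. The page does not state which matrices it uses. The rotation angles here (-30, -45, and -60 degrees) are an assumption inferred from the table values, and the names `W_Q`, `W_K`, `W_V`, `rotation`, and `project` are hypothetical.

```python
import math

def rotation(degrees):
    """Return a 2x2 rotation matrix (as a list of rows) for the given angle."""
    t = math.radians(degrees)
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t),  math.cos(t)]]

def project(matrix, vector):
    """Multiply a 2x2 matrix by a 2D column vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Assumption: these angles are inferred from the table, not given by the source.
W_Q = rotation(-30)
W_K = rotation(-45)
W_V = rotation(-60)

embedding = [0.95, -0.90]  # embedding of "angry" from the table

q = project(W_Q, embedding)
k = project(W_K, embedding)
v = project(W_V, embedding)

print("q =", [round(c, 2) for c in q])
print("k =", [round(c, 2) for c in k])
print("v =", [round(c, 2) for c in v])
```

Under that assumption, the printed vectors agree with the Q, K, and V entries for "angry" to two decimals. A trained model would use arbitrary learned matrices rather than pure rotations.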
Attention Matrix
Rows are queries; columns are attended tokens.
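For reference, one entry of the matrix can be written in the standard scaled dot-product form. The page does not spell out its formula, so treat this as an assumption. The symbols $A_{ij}$, $q_i$, $k_j$, $v_j$, and $d_k$ are introduced here for illustration, with $d_k = 2$ for the two-dimensional vectors in the table.

$$
A_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{l} \exp\left(q_i \cdot k_l / \sqrt{d_k}\right)},
\qquad
\text{output}_i = \sum_{j} A_{ij}\, v_j
$$

Row $i$ holds the weights that query token $i$ assigns to every token $j$, so each row sums to 1.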
Query Detail
Inspect the token that is currently asking.
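| Attended token | Attention weight (query: angry) |
|---|---|
| angry | 34% |
| lions | 3% |
| hunt | 7% |
| zebras | 4% |
| while | 9% |
| cats | 4% |
| sleep | 8% |
| peacefully | 30% |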
- embedding = [0.95, -0.90]
- q = [0.37, -1.25]
- k = [0.04, -1.31]
- v = [-0.30, -1.27]
- output = [-0.319, -0.580]
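The numbers for the "angry" query can be approximately reproduced from the Q, K, V table. The sketch below is a minimal version that assumes scaled dot-product attention with a $\sqrt{d_k}$ scaling factor, which the page does not state explicitly. The variable names are illustrative.

```python
import math

# K and V rows copied from the table above (values rounded to two decimals).
tokens = ["angry", "lions", "hunt", "zebras", "while", "cats", "sleep", "peacefully"]
K = [[0.04, -1.31], [-0.11, 1.24], [-1.13, 0.07], [0.07, 1.27],
     [-0.14, 0.14], [0.07, 1.20], [-1.13, -0.07], [0.07, -1.13]]
V = [[-0.30, -1.27], [0.22, 1.22], [-1.07, 0.36], [0.40, 1.21],
     [-0.10, 0.17], [0.38, 1.14], [-1.11, 0.22], [-0.22, -1.11]]
q = [0.37, -1.25]  # query vector for "angry"

def softmax(scores):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = len(q)
# Assumption: dot products are divided by sqrt(d_k) before the softmax.
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
weights = softmax(scores)

# The output is the attention-weighted average of the value vectors.
output = [sum(w * v[dim] for w, v in zip(weights, V)) for dim in range(d_k)]

for token, w in zip(tokens, weights):
    print(f"{token:>10}: {w:.1%}")
print("output =", [round(c, 3) for c in output])
```

The weights come out close to the percentages listed for "angry". The output differs from the displayed one only in the third decimal, which is consistent with the table inputs being rounded to two decimals.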
Technical Notes
Notes carried over from the original visual, tuned for the new site.
Learning to Reposition Word Vectors
By learning the linear transformations $W_Q$, $W_K$, and $W_V$, the model repositions each word's query, key, and value vectors (Q, K, V) in space. Training moves a word's query closer to the keys of words that are contextually related or important to attend to, which raises their dot products and therefore their attention weights.
Using Numerically Stable Softmax
To ensure numerical stability, we implement softmax in a way that prevents overflow by subtracting the maximum value from each element before applying the exponential function.
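- Softmax, expressed as $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$, inherently provides stable values between 0 and 1, as all terms are positive and the denominator is never smaller than the numerator.
- However, if the values in $x$ are very large, the exponential function can overflow. To prevent this, we use the formula $\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}$, where $m = \max_j x_j$.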
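The two forms can be compared directly. This is a small sketch using only the Python standard library. The function names `softmax_naive` and `softmax_stable` are illustrative and are not taken from the visualization's code.

```python
import math

def softmax_naive(xs):
    """Direct definition: exp(x_i) / sum_j exp(x_j). Can overflow for large x."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    """Subtract the maximum first, so the largest exponent is exp(0) = 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

print(softmax_stable(small))
print(softmax_stable(large))  # same result: only differences between inputs matter

try:
    softmax_naive(large)
except OverflowError as err:
    # math.exp(1000.0) exceeds the range of a double-precision float.
    print("naive softmax failed:", err)
```

Subtracting the maximum leaves the result unchanged because the common factor $e^{-m}$ cancels between numerator and denominator.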
Simulated Weight Transformations
The weight transformations $W_Q$, $W_K$, and $W_V$ used in this visualization are simulated approximations, not trained values. In actual transformer models, these weights are learned through backpropagation as part of the neural network training process within transformer blocks.
Implementation note
Attention is structured context selection: each token asks which other tokens should matter for this computation.