Attention

See how transformer attention links tokens through query-key-value structure.

Input Sentence
Angry lions hunt zebras, while cats sleep peacefully
Word Embeddings Visualization
Words are plotted in 2D space based on their embedding vectors.
[Scatter plot: angry, lions, hunt, zebras, while, cats, sleep, peacefully. Legend: emotion, animal, activity, common.]
Transformation to Query, Key, Value Vectors
The embedding vector is multiplied by three learned projection matrices.
$$Q = \text{Embedding} \cdot W_Q, \qquad K = \text{Embedding} \cdot W_K, \qquad V = \text{Embedding} \cdot W_V$$
| Word | Embedding | Q | K | V |
|---|---|---|---|---|
| angry | [0.95, -0.90] | [0.37, -1.25] | [0.04, -1.31] | [-0.30, -1.27] |
| lions | [-0.95, 0.80] | [-0.42, 1.17] | [-0.11, 1.24] | [0.22, 1.22] |
| hunt | [-0.85, -0.75] | [-1.11, -0.22] | [-1.13, 0.07] | [-1.07, 0.36] |
| zebras | [-0.85, 0.95] | [-0.26, 1.25] | [0.07, 1.27] | [0.40, 1.21] |
| while | [-0.20, 0.00] | [-0.17, 0.10] | [-0.14, 0.14] | [-0.10, 0.17] |
| cats | [-0.80, 0.90] | [-0.24, 1.18] | [0.07, 1.20] | [0.38, 1.14] |
| sleep | [-0.75, -0.85] | [-1.07, -0.36] | [-1.13, -0.07] | [-1.11, 0.22] |
| peacefully | [0.85, -0.75] | [0.36, -1.07] | [0.07, -1.13] | [-0.22, -1.11] |
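The Q, K, and V values for "angry" can be approximately reproduced with plain 2D rotations of the embedding. The sketch below is an assumption, not the page's confirmed weights: the helper names `rotation` and `project` and the angles of -30, -45, and -60 degrees are hypothetical, chosen only because they roughly match the listed values.

```python
import math

def rotation(deg):
    """Hypothetical helper: 2x2 matrix W such that (row vector) @ W
    rotates the vector counter-clockwise by `deg` degrees."""
    t = math.radians(deg)
    return [[math.cos(t), math.sin(t)],
            [-math.sin(t), math.cos(t)]]

def project(e, W):
    """Row vector times matrix, i.e. Embedding · W."""
    return [sum(e[i] * W[i][j] for i in range(len(e)))
            for j in range(len(W[0]))]

# Assumed stand-ins for W_Q, W_K, W_V (not taken from the source).
W_Q, W_K, W_V = rotation(-30), rotation(-45), rotation(-60)

angry = [0.95, -0.90]  # embedding listed for "angry"
q = project(angry, W_Q)
k = project(angry, W_K)
v = project(angry, W_V)

for name, vec in (("q", q), ("k", k), ("v", v)):
    print(name, [round(x, 2) for x in vec])
```

With these assumed angles, each result agrees with the corresponding entry for "angry" to within rounding. In a trained model the three matrices would not be restricted to rotations.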
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
Attention Matrix
Rows are queries; columns are attended tokens.
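| Query | angry | lions | hunt | zebras | while | cats | sleep | peacefully |
|---|---|---|---|---|---|---|---|---|
| angry | 0.34 | 0.03 | 0.07 | 0.04 | 0.09 | 0.04 | 0.08 | 0.30 |
| lions | 0.03 | 0.22 | 0.11 | 0.22 | 0.09 | 0.20 | 0.10 | 0.03 |
| hunt | 0.11 | 0.08 | 0.22 | 0.07 | 0.10 | 0.07 | 0.23 | 0.11 |
| zebras | 0.02 | 0.23 | 0.10 | 0.23 | 0.09 | 0.22 | 0.09 | 0.03 |
| while | 0.11 | 0.13 | 0.14 | 0.13 | 0.12 | 0.13 | 0.14 | 0.11 |
| cats | 0.03 | 0.22 | 0.10 | 0.22 | 0.09 | 0.21 | 0.09 | 0.03 |
| sleep | 0.13 | 0.07 | 0.22 | 0.06 | 0.10 | 0.07 | 0.23 | 0.12 |
| peacefully | 0.31 | 0.04 | 0.08 | 0.04 | 0.10 | 0.05 | 0.09 | 0.28 |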
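As a rough check, the attention matrix can be recomputed from the rounded Q and K vectors listed earlier using scaled dot-product attention with d_k = 2. This is a minimal sketch; the function names are illustrative, and because the inputs are rounded to two decimals, the results should match the displayed matrix only to about two decimal places.

```python
import math

words = ["angry", "lions", "hunt", "zebras",
         "while", "cats", "sleep", "peacefully"]

# Q and K vectors copied from the page (rounded to two decimals).
Q = [[0.37, -1.25], [-0.42, 1.17], [-1.11, -0.22], [-0.26, 1.25],
     [-0.17, 0.10], [-0.24, 1.18], [-1.07, -0.36], [0.36, -1.07]]
K = [[0.04, -1.31], [-0.11, 1.24], [-1.13, 0.07], [0.07, 1.27],
     [-0.14, 0.14], [0.07, 1.20], [-1.13, -0.07], [0.07, -1.13]]

def softmax(xs):
    """Softmax over a list, shifted by the maximum for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Row i holds softmax(q_i · k_j / sqrt(d_k)) over all keys j."""
    scale = math.sqrt(len(K[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
            for q in Q]

weights = attention_weights(Q, K)
for word, row in zip(words, weights):
    print(f"{word:>10}", " ".join(f"{w:.2f}" for w in row))
```

Each printed row sums to 1, and the query "angry" puts most of its weight on "angry" and "peacefully", whose keys point in nearly the same direction as its query.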
Query Detail
Inspect the token that is currently asking.
angry: 34%
lions: 3%
hunt: 7%
zebras: 4%
while: 9%
cats: 4%
sleep: 8%
peacefully: 30%

embedding = [0.95, -0.90]
q = [0.37, -1.25]
k = [0.04, -1.31]
v = [-0.30, -1.27]
output = [-0.319, -0.580]
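The output vector is the attention-weighted sum of the value vectors. The sketch below recomputes it for the query "angry" from the rounded percentages and V vectors shown on the page; `weighted_sum` is an illustrative name. Since the displayed weights are rounded (they add up to 99%), the result should land near the displayed output rather than match it exactly.

```python
# Attention weights for the query "angry", copied from the page.
weights = [0.34, 0.03, 0.07, 0.04, 0.09, 0.04, 0.08, 0.30]

# Value vectors in the same token order (rounded to two decimals).
V = [[-0.30, -1.27], [0.22, 1.22], [-1.07, 0.36], [0.40, 1.21],
     [-0.10, 0.17], [0.38, 1.14], [-1.11, 0.22], [-0.22, -1.11]]

def weighted_sum(weights, vectors):
    """Sum of w_j * v_j, computed one coordinate at a time."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(dim)]

output = weighted_sum(weights, V)
print([round(x, 3) for x in output])  # close to [-0.319, -0.580]
```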
Technical Notes
Notes carried over from the original visual, tuned for the new site.

Learning to Reposition Word Vectors

By learning the linear transformations $W_Q$, $W_K$, and $W_V$, the model repositions the projected word vectors (Q, K, V) in space. This brings the vectors of words that are contextually related, or otherwise important to attend to, closer together, which strengthens the attention between them.

Using Numerically Stable Softmax

To ensure numerical stability, we implement softmax in a way that prevents overflow by subtracting the maximum value from each element before applying the exponential function.

  • Softmax, defined as $\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$, always yields values between 0 and 1, because every term is positive and the denominator is never smaller than the numerator.
  • However, if the values in $x$ are very large, the exponential function can overflow. To prevent this, we use the identity $\mathrm{softmax}(x) = \mathrm{softmax}(x + c)$ with $c = -\max(x)$: the common factor $\exp(c)$ cancels between numerator and denominator, and the largest exponent becomes 0.
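The effect of the shift can be sketched in a few lines of plain Python. The function names `naive_softmax` and `stable_softmax` are illustrative and not taken from the visualization's code.

```python
import math

def naive_softmax(xs):
    """Direct definition; math.exp raises OverflowError for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    """Same result, but shifted by c = -max(xs) so every exponent is <= 0."""
    c = -max(xs)
    exps = [math.exp(x + c) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]
print(stable_softmax(big))              # works: exponents are -2, -1, 0
print(naive_softmax([0.0, 1.0, 2.0]))   # same distribution, unshifted inputs

try:
    naive_softmax(big)
except OverflowError as err:
    print("naive version failed:", err)
```

Both printed distributions agree because shifting every input by the same constant leaves softmax unchanged.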

Simulated Weight Transformations

The weight transformations $W_Q$, $W_K$, $W_V$ used in this visualization are approximations. In actual transformer models, these weights are learned through backpropagation as part of the neural network training process within transformer blocks.