From Word Embeddings to Attention

Explore how attention mechanisms process and relate words in a sentence

Input Sentence

Angry lions hunt zebras, while cats sleep peacefully

Word Embeddings Visualization

Note: Words are plotted in 2D space based on their embedding vectors. Colors indicate word categories.
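
As a rough illustration of how such a plot can be produced, here is a minimal sketch that projects toy embedding vectors down to two dimensions with PCA. The word list matches the input sentence, but the embedding values are random placeholders rather than vectors from a trained model.

```python
# Minimal sketch: project toy word embeddings to 2D for plotting.
# The embedding values are made up for illustration; real models use
# learned vectors with hundreds of dimensions.
import numpy as np
import matplotlib.pyplot as plt

words = ["angry", "lions", "hunt", "zebras", "while", "cats", "sleep", "peacefully"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(words), 8))  # 8-dim toy embeddings

# PCA via SVD: center the vectors, then keep the two leading components.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T  # shape (n_words, 2)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title("Word embeddings projected to 2D")
plt.show()
```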

Transformation to Query, Key, Value Vectors

\[ \begin{aligned} Q &= \text{Embedding} \cdot W_Q \\ K &= \text{Embedding} \cdot W_K \\ V &= \text{Embedding} \cdot W_V \end{aligned} \]
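
A minimal NumPy sketch of these projections, with random matrices standing in for the learned weights \(W_Q\), \(W_K\), \(W_V\) and toy dimensions chosen only for illustration:

```python
# Minimal sketch of the Q/K/V projections above, using random matrices
# in place of the learned weights W_Q, W_K, W_V.
import numpy as np

n_words, d_model, d_k = 8, 16, 4   # toy sizes, for illustration only
rng = np.random.default_rng(0)

embedding = rng.normal(size=(n_words, d_model))   # one row per word
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = embedding @ W_Q   # queries, shape (n_words, d_k)
K = embedding @ W_K   # keys,    shape (n_words, d_k)
V = embedding @ W_V   # values,  shape (n_words, d_k)
```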

Attention Score Matrix

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
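
Continuing the sketch above, scaled dot-product attention can be written in a few lines of NumPy. The softmax is applied row-wise so that each word's attention weights sum to 1; the max-subtraction inside it is explained in the Technical Notes below.

```python
# Scaled dot-product attention, continuing the Q, K, V sketch above.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability (see Technical Notes).
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_words, n_words) score matrix
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights            # weighted sum of value vectors

output, attn_weights = attention(Q, K, V)  # Q, K, V from the previous sketch
```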

Detailed Calculations

Technical Notes

Learning to Reposition Word Vectors

By learning the linear transformations \(W_Q\), \(W_K\), and \(W_V\), the model repositions the projected word vectors (Q, K, V) in space. This brings the vectors of words that are contextually related, or important to attend to, closer together, strengthening the attention between them.
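
For illustration only (this is not the visualization's own code), a typical PyTorch formulation defines \(W_Q\), \(W_K\), and \(W_V\) as linear layers whose parameters are updated by backpropagation during training:

```python
# Illustrative sketch: in a typical PyTorch transformer, W_Q, W_K, W_V are
# learnable linear layers whose weights are adjusted by backpropagation.
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)  # learned W_Q
        self.w_k = nn.Linear(d_model, d_k, bias=False)  # learned W_K
        self.w_v = nn.Linear(d_model, d_k, bias=False)  # learned W_V

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ V
```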

Using Numerically Stable Softmax

To ensure numerical stability, we implement softmax in a way that prevents overflow by subtracting the maximum value from each element before applying the exponential function.

  • Softmax, expressed as \( \text{softmax}(x) = \frac{\exp(x)}{\sum \exp(x)} \), always yields outputs between 0 and 1: every term is positive, and the denominator includes the numerator's own term, so it is never smaller than the numerator.
  • However, if the values in \(x\) are very large, the exponential function can overflow. Softmax is shift-invariant, so \( \text{softmax}(x) = \text{softmax}(x + c) \) for any constant \(c\); choosing \(c = -\max(x)\) makes the largest exponent \(\exp(0) = 1\) and prevents overflow (see the sketch after this list).
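
A minimal NumPy sketch of the max-shift, using values large enough that the naive form would overflow:

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: np.exp(1000) overflows to inf, producing nan results.
# naive = np.exp(x) / np.exp(x).sum()

# Shifted softmax: with c = -max(x), every exponent is <= 0, so exp never overflows.
shifted = np.exp(x - x.max())
print(shifted / shifted.sum())   # approx. [0.090, 0.245, 0.665], the exact result
```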

Simulated Weight Transformations

The weight matrices \(W_Q\), \(W_K\), and \(W_V\) used in this visualization are approximations. In actual transformer models, these weights are learned through backpropagation as part of training the transformer blocks.