Inspect how text is broken into tokens and embedded for models.
The examples cover several semantic clusters: emotions, animals, weather, activities, colors, objects, and common words.
Hover over a point to highlight words from the same category.
Simplified Embeddings
These embeddings are a simplified example. Real-world embeddings capture richer semantic nuance and more complex relationships between words, and live in hundreds or even thousands of dimensions rather than the 2D representation shown here.
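The core idea still carries over from 2D to high dimensions: related words point in similar directions, which is typically measured with cosine similarity. A minimal sketch, using hand-placed toy vectors (hypothetical values, not taken from any real model):

```python
import math

# Toy 2D embeddings, hand-placed for illustration only.
# Real embeddings are learned and have hundreds of dimensions.
embeddings = {
    "happy": (0.9, 0.8),
    "joyful": (0.85, 0.75),
    "cat": (-0.6, 0.4),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related words point in similar directions, so similarity is near 1;
# unrelated words score much lower.
print(cosine_similarity(embeddings["happy"], embeddings["joyful"]))
print(cosine_similarity(embeddings["happy"], embeddings["cat"]))
```

The same function works unchanged on 768- or 4096-dimensional vectors, which is why the 2D picture is a useful, if lossy, stand-in.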
Advanced Tokenization Methods
The tokenization shown here is a basic version. Production systems use more sophisticated methods such as BPE (Byte-Pair Encoding) or WordPiece, which split rare words and morphological variants into reusable subword units instead of treating every word as a single token.
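To make the contrast concrete, here is a minimal sketch of the BPE training loop: start from characters and repeatedly merge the most frequent adjacent pair. The tiny corpus and merge count are made up for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny hypothetical corpus: word (as characters) -> frequency.
corpus = {
    tuple("lower"): 5,
    tuple("lowest"): 2,
    tuple("newer"): 6,
}

for _ in range(3):  # learn three merges
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)
```

Frequent fragments like the shared "er" ending become single tokens, so an unseen word such as "slower" can still be tokenized from learned pieces rather than mapped to an unknown token.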
Simulated Word Relationships
The positions and distances between words in this visualization are artificially created for demonstration purposes. In real language models, these relationships would be learned from vast amounts of text data, capturing contextual nuances and multiple word meanings that this simplified visualization cannot represent.
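One way to see what "learned from text data" means is a crude count-based sketch: build co-occurrence vectors from a micro-corpus and compare them. The sentences below are invented for illustration, and real models use neural training objectives on billions of tokens rather than raw counts:

```python
import math
from collections import defaultdict

# Hypothetical micro-corpus; real models train on vastly more text.
sentences = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "rain and snow are weather",
    "snow and rain fell today",
]

# Count how often each word appears in a sentence with each other word.
cooc = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    words = sent.split()
    for i, w in enumerate(words):
        for j, c in enumerate(words):
            if i != j:
                cooc[w][c] += 1

vocab = sorted({w for s in sentences for w in s.split()})

def similarity(w1, w2, vocab):
    """Cosine similarity between two words' co-occurrence vectors."""
    a = [cooc[w1][v] for v in vocab]
    b = [cooc[w2][v] for v in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Words that appear in similar contexts get similar vectors,
# without anyone placing them by hand.
print(similarity("rain", "snow", vocab))  # relatively high
print(similarity("rain", "cat", vocab))   # low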