From "what to look at" to the engine behind GPT and BERT. Scaled dot-product attention replaces recurrence with direct token-to-token relationships.
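As a rough preview of what the problems below build toward, here is a minimal sketch of scaled dot-product attention, softmax(QKᵀ / √d_k) · V, in plain Python. The function names `softmax` and `scaled_dot_product_attention` and the list-of-lists layout are assumptions made for illustration, not the reference solutions to the exercises. Every query scores every key directly, so no recurrent state is carried from token to token.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Illustrative sketch. Q is n x d_k, K is m x d_k, V is m x d_v (lists of lists)."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # Query-key scores: dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Attention weights: softmax over the scores, so they sum to 1.
        weights = softmax(scores)
        # Context vector: weighted sum of the value vectors.
        context = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        output.append(context)
    return output


if __name__ == "__main__":
    Q = [[0.0, 0.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    # A zero query scores both keys equally, so the result averages the two value rows.
    print(scaled_dot_product_attention(Q, K, V))
```

The sketch leaves out masking, multiple heads, and positional information, which the individual problems cover separately.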
Softmax Attention Weights
~12 min · Medium
Scaled Dot-Product Attention
~25 min · Hard
Sinusoidal Positional Encoding
Causal Attention Mask
~6 min · Medium
Multi-Head Attention
~18 min · Hard
Query-Key Attention Scores
~12 min · Easy
Attention Context Vector
~10 min · Easy