Hard

Multi-Head Attention

~18 min

code completion

Multi-Head Attention (Core Mechanic)

Single-head attention computes:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Multi-head attention splits the feature dimension across several heads, applies attention independently in each head, then concatenates the per-head outputs:

1. Split the columns of $Q$, $K$, and $V$ into num_heads contiguous chunks, each $d_k$ columns wide.

2. For each head $h$:

$$S_h = \frac{Q_h K_h^{\top}}{\sqrt{d_k}}, \qquad W_h = \mathrm{softmax}(S_h) \ \text{(row-wise)}, \qquad O_h = W_h V_h$$

3. Concatenate the outputs $O_h$ along the feature axis, in head order.

Where:

$$Q, K, V \in \mathbb{R}^{L \times d_{\text{model}}}$$

$$d_k = d_{\text{model}} / \text{num\_heads}$$

The output shape is $L \times d_{\text{model}}$.
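To make the splitting step concrete, here is a minimal sketch using plain Python lists. The helper name `split_heads` is hypothetical (it is not part of the problem statement), and the sketch assumes num_heads divides the number of columns evenly, which the statement implies but does not say outright.

```python
def split_heads(X, num_heads):
    """Split the columns of an L x d_model matrix into num_heads contiguous chunks."""
    d_model = len(X[0])
    if d_model % num_heads != 0:
        # Assumption: the problem only uses head counts that divide d_model.
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads
    # Head h owns columns [h * d_k, (h + 1) * d_k).
    return [[row[h * d_k:(h + 1) * d_k] for row in X] for h in range(num_heads)]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(X, 2)
print(heads[0])  # [[1, 2], [5, 6]]
print(heads[1])  # [[3, 4], [7, 8]]
```

Each head keeps all $L$ rows and sees only its own $d_k$ columns, so concatenating the per-head outputs restores the original width.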

Your task: Implement multi_head_attention(Q, K, V, num_heads).

Implementation requirements:

1. Use stable softmax (subtract row max before exp).

2. Split by contiguous column slices.

3. Concatenate outputs along axis=1.
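Requirement 1 is the one most easily overlooked, so here is a small sketch of a stable softmax for a single row. The function name `softmax_row` is illustrative only. Subtracting the row maximum leaves the result unchanged mathematically, because the common factor cancels in the ratio, but it keeps every exponent at or below zero.

```python
import math


def softmax_row(row):
    """Softmax of one row, shifted by its max so exp never overflows."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # the largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


print(softmax_row([1000.0, 1000.0]))  # [0.5, 0.5]
# Without the shift, math.exp(1000.0) raises OverflowError.
```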

Example Tests

Identity Q=K=V, one head

Input: {"K":[[1,0],[0,1]],"Q":[[1,0],[0,1]],"V":[[1,0],[0,1]],"num_heads":1}

Expected: [[0.66976,0.33024],[0.33024,0.66976]]
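This expected output can be checked by hand, assuming the scores are divided by $\sqrt{d_k}$. Here $d_k = 2$, because $d_{\text{model}} = 2$ and there is a single head, and $Q = K = V = I$:

$$S = \frac{I I^{\top}}{\sqrt{2}} = \begin{pmatrix} 1/\sqrt{2} & 0 \\ 0 & 1/\sqrt{2} \end{pmatrix}$$

$$W_{11} = \frac{e^{1/\sqrt{2}}}{e^{1/\sqrt{2}} + e^{0}} \approx \frac{2.02811}{3.02811} \approx 0.66976, \qquad W_{12} = 1 - W_{11} \approx 0.33024$$

Since $V = I$, the output is $O = W V = W$, and the second row follows by symmetry. Dividing by $d_k = 2$ instead of $\sqrt{2}$ would give $e^{0.5} / (e^{0.5} + 1) \approx 0.62246$, which does not match the expected value.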

Shape check: 2x4 inputs with 2 heads

Input: {"K":[[1,0,0,0],[0,1,0,0]],"Q":[[1,0,0,0],[0,1,0,0]],"V":[[1,0,0,0],[0,1,0,0]],"num_heads":2}

Expected: [2,4]

Shape check: 3x4 inputs with 1 head

Input: {"K":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"Q":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"V":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"num_heads":1}

Expected: [3,4]
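The example tests can be reproduced with a sketch like the one below. It is one possible solution, not a reference implementation: it uses plain lists and the standard library only, scales scores by $\sqrt{d_k}$ (the scaling the first expected output implies), and assumes num_heads divides $d_{\text{model}}$. The helpers `softmax_rows` and `matmul` are hypothetical names introduced for illustration.

```python
import math


def softmax_rows(S):
    """Row-wise softmax with the row max subtracted for stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(A, B):
    """Plain-list matrix product of A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def multi_head_attention(Q, K, V, num_heads):
    d_model = len(Q[0])
    d_k = d_model // num_heads  # assumes num_heads divides d_model
    scale = math.sqrt(d_k)
    output = [[] for _ in Q]  # one growing output row per query position
    for h in range(num_heads):
        cols = slice(h * d_k, (h + 1) * d_k)  # contiguous column slice for head h
        Qh = [row[cols] for row in Q]
        Kh = [row[cols] for row in K]
        Vh = [row[cols] for row in V]
        # S_h = Q_h K_h^T / sqrt(d_k)
        scores = [[sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in Kh]
                  for q_row in Qh]
        Oh = matmul(softmax_rows(scores), Vh)
        for out_row, head_row in zip(output, Oh):
            out_row.extend(head_row)  # concatenate along the feature axis
    return output


I2 = [[1, 0], [0, 1]]
out = multi_head_attention(I2, I2, I2, 1)
print([[round(x, 5) for x in row] for row in out])
# [[0.66976, 0.33024], [0.33024, 0.66976]]
```

With NumPy arrays the same structure would typically use column slicing per head and `np.concatenate(..., axis=1)` for the final step, which is presumably what requirement 3 refers to.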
