Multi-Head Attention (Core Mechanic)
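Single-head attention computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Here softmax is applied to each row of the score matrix, and $d_k$ is the number of columns of $K$.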
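A minimal sketch of single-head attention, assuming NumPy and 2-D inputs, with scores scaled by the square root of the key width. The names `single_head_attention` and `_softmax_rows` are illustrative helpers, not part of the task's required interface.

```python
import numpy as np

def _softmax_rows(x):
    # Row-wise softmax; subtracting each row's max keeps exp() from overflowing.
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def single_head_attention(Q, K, V):
    # Hypothetical helper name. Q and K share their column count d_k;
    # K and V share their row count.
    Q, K, V = (np.asarray(a, dtype=float) for a in (Q, K, V))
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled dot-product scores
    return _softmax_rows(scores) @ V  # attention-weighted sum of the rows of V
```

Each output row is a convex combination of the rows of V, because each row of softmax weights is non-negative and sums to 1.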
Multi-head attention splits the feature dimension into multiple heads, applies attention per head, then concatenates:
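1. Split the columns of $Q$, $K$, and $V$ into num_heads equal chunks.
2. For each head $h$: $\mathrm{head}_h = \mathrm{Attention}(Q_h, K_h, V_h)$, that is, single-head attention applied to the $h$-th chunks.
3. Concatenate the head outputs along the feature axis: $\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1, \ldots, \mathrm{head}_H]$.
Where: $H$ is num_heads; $Q_h$, $K_h$, and $V_h$ are the $h$-th column chunks of $Q$, $K$, and $V$; and each chunk has width $d_k = d / H$, with $d$ the number of feature columns.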
Your task: Implement multi_head_attention(Q, K, V, num_heads).
Implementation requirements:
1. Use stable softmax (subtract row max before exp).
2. Split by contiguous column slices.
3. Concatenate outputs along axis=1.
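One way the three requirements might fit together is sketched below, assuming NumPy, 2-D inputs of equal shape, and a feature dimension divisible by num_heads. The helper name `stable_softmax`, the divisibility check, and returning a NumPy array rather than nested lists are assumptions of this sketch, not part of the task statement.

```python
import numpy as np

def stable_softmax(x):
    # Requirement 1: subtract the row max before exp so large scores cannot overflow.
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    Q, K, V = (np.asarray(a, dtype=float) for a in (Q, K, V))
    d_model = Q.shape[1]
    if d_model % num_heads != 0:
        # Assumption of this sketch: heads must divide the feature dimension evenly.
        raise ValueError("feature dimension must be divisible by num_heads")
    d_k = d_model // num_heads

    heads = []
    for h in range(num_heads):
        # Requirement 2: each head takes one contiguous column slice.
        cols = slice(h * d_k, (h + 1) * d_k)
        Qh, Kh, Vh = Q[:, cols], K[:, cols], V[:, cols]
        scores = Qh @ Kh.T / np.sqrt(d_k)  # scaled by the per-head width
        heads.append(stable_softmax(scores) @ Vh)

    # Requirement 3: join the per-head outputs along the feature axis.
    return np.concatenate(heads, axis=1)
```

If the grader expects nested Python lists, calling `.tolist()` on the result converts it.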
Example Tests
Identity Q=K=V, one head
Input: {"K":[[1,0],[0,1]],"Q":[[1,0],[0,1]],"V":[[1,0],[0,1]],"num_heads":1}
Expected: [[0.66976,0.33024],[0.33024,0.66976]]
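The expected values can be checked by hand, assuming the scores are scaled by $1/\sqrt{d_k}$ with $d_k = 2$ (the scaling that reproduces the listed output):

$$\frac{QK^\top}{\sqrt{2}} = \begin{pmatrix} 0.70711 & 0 \\ 0 & 0.70711 \end{pmatrix}, \qquad \mathrm{softmax}(0.70711,\ 0) = \left( \frac{e^{0.70711}}{e^{0.70711} + 1},\ \frac{1}{e^{0.70711} + 1} \right) \approx (0.66976,\ 0.33024)$$

Because $V$ is the identity matrix, the output equals the matrix of softmax weights, with the second row mirroring the first.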
Shape check: 2x4 inputs with 2 heads
Input: {"K":[[1,0,0,0],[0,1,0,0]],"Q":[[1,0,0,0],[0,1,0,0]],"V":[[1,0,0,0],[0,1,0,0]],"num_heads":2}
Expected: [2,4]
Shape check: 3x4 inputs with 1 head
Input: {"K":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"Q":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"V":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"num_heads":1}
Expected: [3,4]