Multi-Head Attention

Hard
~18 min
code completion

Multi-Head Attention (Core Mechanic)

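Single-head attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Here $d_k$ is the number of feature columns in $Q$ and $K$, and the softmax is applied row-wise.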
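As a rough illustration of the single-head case, here is a minimal NumPy sketch. It assumes the usual scaling by the square root of the feature width, which is consistent with the first example test; the helper name `single_head_attention` is hypothetical and not part of the required interface.

```python
import numpy as np

def single_head_attention(Q, K, V):
    """Scaled dot-product attention for one head (illustrative helper)."""
    Q, K, V = (np.asarray(M, dtype=float) for M in (Q, K, V))
    d_k = Q.shape[1]                                # feature width of Q and K
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    scores -= scores.max(axis=1, keepdims=True)     # shift by row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # each row now sums to 1
    return weights @ V                              # weighted average of V's rows

I = [[1, 0], [0, 1]]
print(np.round(single_head_attention(I, I, I), 5))
# [[0.66976 0.33024]
#  [0.33024 0.66976]]
```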

Multi-head attention splits the feature dimension into multiple heads, applies attention per head, then concatenates:

1. Split the columns of $Q$, $K$, and $V$ into num_heads chunks.

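2. For each head $h$: $\text{head}_h = \text{softmax}\left(Q_h K_h^\top / \sqrt{d_h}\right) V_h$, where $Q_h$, $K_h$, $V_h$ are the $h$-th column chunks and $d_h = d / \text{num\_heads}$ is their width.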

3. Concatenate along feature axis.
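The split and concatenate steps can be sketched on a toy matrix. The array `X` and the variable names below are made up for illustration; the point is only that contiguous column slices followed by concatenation along axis 1 restore the original column order.

```python
import numpy as np

X = np.arange(8).reshape(2, 4)        # 2 rows, 4 feature columns
num_heads = 2
d_head = X.shape[1] // num_heads      # columns owned by each head

# Contiguous column slices: head h owns columns [h*d_head, (h+1)*d_head)
heads = [X[:, h * d_head:(h + 1) * d_head] for h in range(num_heads)]
print([h.tolist() for h in heads])    # [[[0, 1], [4, 5]], [[2, 3], [6, 7]]]

# Concatenating along the feature axis puts the columns back in order
restored = np.concatenate(heads, axis=1)
print(np.array_equal(restored, X))    # True
```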

Where:

  • Output shape is $(n, d)$, the same as the inputs (for example, 2x4 inputs give a 2x4 output).

Your task: Implement multi_head_attention(Q, K, V, num_heads).

Implementation requirements:

1. Use a numerically stable softmax (subtract each row's max before exp).

2. Split by contiguous column slices.

3. Concatenate outputs along axis=1.
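One possible way to meet these three requirements is sketched below in NumPy. It is not a reference solution: it assumes Q, K, and V share the same number of columns, that num_heads divides that number evenly, and that each head scales its scores by the square root of the per-head width (the one-head example test cannot distinguish this from scaling by the full width). The helper `stable_softmax` is a hypothetical name.

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    Q, K, V = (np.asarray(M, dtype=float) for M in (Q, K, V))
    d_model = Q.shape[1]
    if d_model % num_heads != 0:
        # Assumption: the feature dimension must split evenly across heads
        raise ValueError("num_heads must divide the feature dimension")
    d_head = d_model // num_heads

    outputs = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)    # contiguous column slice
        Qh, Kh, Vh = Q[:, cols], K[:, cols], V[:, cols]
        weights = stable_softmax(Qh @ Kh.T / np.sqrt(d_head))
        outputs.append(weights @ Vh)                  # (n, d_head) per head
    return np.concatenate(outputs, axis=1)            # (n, d_model)

I2 = [[1, 0], [0, 1]]
print(np.round(multi_head_attention(I2, I2, I2, 1), 5))

X = [[1, 0, 0, 0], [0, 1, 0, 0]]
print(multi_head_attention(X, X, X, 2).shape)         # (2, 4)
```

The function returns a NumPy array; if a grader expects nested lists, calling `.tolist()` on the result would likely be needed.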

Example Tests

Identity Q=K=V, one head

Input: {"K":[[1,0],[0,1]],"Q":[[1,0],[0,1]],"V":[[1,0],[0,1]],"num_heads":1}

Expected: [[0.66976,0.33024],[0.33024,0.66976]]
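The expected values for this test can be checked by hand, assuming the scores are scaled by $\sqrt{d_k}$ with $d_k = 2$. With $Q = K = V = I_2$:

```latex
\frac{Q K^\top}{\sqrt{2}} =
\begin{pmatrix} 1/\sqrt{2} & 0 \\ 0 & 1/\sqrt{2} \end{pmatrix}
\approx
\begin{pmatrix} 0.70711 & 0 \\ 0 & 0.70711 \end{pmatrix}

\text{softmax}(0.70711,\ 0) =
\left( \frac{e^{0.70711}}{e^{0.70711} + 1},\ \frac{1}{e^{0.70711} + 1} \right)
\approx \left( \frac{2.02811}{3.02811},\ \frac{1}{3.02811} \right)
\approx (0.66976,\ 0.33024)
```

Because $V$ is the identity, multiplying the attention weights by $V$ leaves them unchanged, so the output equals the weight matrix itself.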

Shape check: 2x4 inputs with 2 heads

Input: {"K":[[1,0,0,0],[0,1,0,0]],"Q":[[1,0,0,0],[0,1,0,0]],"V":[[1,0,0,0],[0,1,0,0]],"num_heads":2}

Expected: [2,4]

Shape check: 3x4 inputs with 1 head

Input: {"K":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"Q":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"V":[[1,0,0,0],[0,1,0,0],[0,0,1,0]],"num_heads":1}

Expected: [3,4]
