Medium

Causal Attention Mask

Medium

~6 min

code completion

In autoregressive decoding, token $i$ is only allowed to attend to tokens $j \leq i$ .

If future tokens are visible during training, the model can "cheat".

A causal mask $M \in R^{L \times L}$ is defined as:

M [i, j] = {10 if j \leq i if j > i

This is a lower-triangular matrix. In full attention implementations, entries with 0 are usually turned into $- \infty$ before softmax.

Your task: Implement causal_mask(seq_len) to return a float NumPy array of shape (seq_len, seq_len).

Implementation requirements:

1. Include the diagonal (self-attention is allowed).

2. Return numeric values 1.0 and 0.0.

3. Use a vectorized NumPy approach.

Example Tests

Single token mask

Input: {"seq_len":1}

Expected: [[1]]

Lower-triangular mask for seq_len=3

Input: {"seq_len":3}

Expected: [[1,0,0],[1,1,0],[1,1,1]]

Shape check: seq_len=6

Input: {"seq_len":6}

Expected: [6,6]

You can read the full problem statement above. Create a free account to run code in the browser, submit solutions, and track your progress.