Causal Attention Mask
In autoregressive decoding, token $i$ is only allowed to attend to tokens $j \le i$.
If future tokens are visible during training, the model can "cheat" by reading the tokens it is supposed to predict.
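A causal mask $M$ is defined as:
$$M_{ij} = \begin{cases} 1 & \text{if } j \le i \\ 0 & \text{otherwise} \end{cases}$$
This is a lower-triangular matrix. In full attention implementations, the attention scores at positions where the mask is 0 are usually set to $-\infty$ before softmax, so those positions receive zero weight.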
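To illustrate how such a mask is typically applied, here is a minimal sketch, not a reference implementation. The helper name masked_softmax and the all-zero scores matrix are assumptions made for illustration; the mask itself is built inline with np.tril so the snippet stands on its own.

```python
import numpy as np

def masked_softmax(scores, mask):
    # Hypothetical helper: positions where mask == 0 are set to -inf,
    # so they contribute exp(-inf) = 0 to the softmax.
    masked = np.where(mask > 0, scores, -np.inf)
    # Subtract the row maximum for numerical stability. Every row keeps
    # at least its diagonal entry, so the maximum is always finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

seq_len = 3
mask = np.tril(np.ones((seq_len, seq_len)))  # lower-triangular, diagonal included
scores = np.zeros((seq_len, seq_len))        # illustrative raw attention scores
weights = masked_softmax(scores, mask)
```

With uniform scores, row i of weights spreads its attention evenly over positions 0 through i and assigns zero weight to every later position.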
Your task: Implement causal_mask(seq_len) to return a float NumPy array of shape (seq_len, seq_len).
Implementation requirements:
1. Include the diagonal (self-attention is allowed).
2. Return numeric values 1.0 and 0.0.
3. Use a vectorized NumPy approach.
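One possible way to meet these requirements is sketched below, assuming np.tril applied to a matrix of ones is an acceptable vectorized approach. This is only one option; comparing broadcast row and column indices would work as well.

```python
import numpy as np

def causal_mask(seq_len):
    # Start from an all-ones float matrix, then keep the diagonal and
    # everything below it; np.tril zeroes the entries above the diagonal.
    return np.tril(np.ones((seq_len, seq_len), dtype=float))

mask3 = causal_mask(3)
```

Here mask3 is a (3, 3) float array whose row i has ones in columns 0 through i and zeros elsewhere.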
Example Tests
Single token mask
Input: {"seq_len":1}
Expected: [[1]]
Lower-triangular mask for seq_len=3
Input: {"seq_len":3}
Expected: [[1,0,0],[1,1,0],[1,1,1]]
Shape check: seq_len=6
Input: {"seq_len":6}
Expected: [6,6]