Epsilon-Greedy Action Selection
Hard
~15 min
code completion
Epsilon-Greedy Policy
The exploration vs. exploitation tradeoff is central to RL. An epsilon-greedy policy balances both:
    if random() < epsilon:
        action = random_choice(n_actions)
    else:
        action = argmax(q_values)

As training progresses, epsilon is typically decayed so the agent explores less and exploits more.
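To make the decay idea concrete, here is a minimal sketch of one common schedule: exponential decay with a floor. The exercise does not prescribe a schedule, so the function name `decayed_epsilon` and the parameters `eps_start`, `eps_end`, and `decay_rate` are illustrative assumptions.

```python
def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_rate=0.995):
    """Hypothetical schedule: shrink epsilon geometrically per step, never below eps_end."""
    # eps_start * decay_rate**step falls toward 0 as step grows;
    # max() keeps a small amount of exploration forever.
    return max(eps_end, eps_start * decay_rate ** step)


if __name__ == "__main__":
    for step in (0, 100, 500, 1000):
        print(step, decayed_epsilon(step))
```

Keeping a nonzero floor is a design choice: the agent still tries a non-greedy action occasionally late in training, which can help if early value estimates were wrong.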
Your task:
Implement epsilon_greedy(q_values, epsilon, seed). Use np.random.default_rng(seed) for reproducibility. Return the chosen action index.
Example Tests
epsilon=0: always greedy (argmax)
Input: {"seed":0,"epsilon":0,"q_values":[1,5,2]}
Expected: 1
epsilon=1: always random (check within range)
Input: {"seed":42,"epsilon":1,"q_values":[0,0,0,0]}
Expected: 2
epsilon=0: greedy regardless of seed
Input: {"seed":99,"epsilon":0,"q_values":[3,1,2]}
Expected: 0
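One possible way to approach the task is sketched below. It assumes the exploration check is drawn first with `rng.random()` and the random action second with `rng.integers(...)`. The exercise does not specify which generator methods or draw order the reference solution uses, so the exact action returned for epsilon > 0 may differ from the expected value in the tests above even though it stays in range.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, seed):
    """Return an action index chosen by an epsilon-greedy rule (sketch, not the reference solution)."""
    rng = np.random.default_rng(seed)      # seeded generator for reproducibility
    q = np.asarray(q_values, dtype=float)

    # Explore: rng.random() is uniform on [0, 1), so this branch is never
    # taken for epsilon = 0 and always taken for epsilon = 1.
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # uniform over 0 .. n_actions - 1

    # Exploit: np.argmax returns the first index of the maximum on ties.
    return int(np.argmax(q))


if __name__ == "__main__":
    print(epsilon_greedy([1, 5, 2], 0, 0))    # greedy case
    print(epsilon_greedy([3, 1, 2], 0, 99))   # greedy case, different seed
    print(epsilon_greedy([0, 0, 0, 0], 1, 42))  # random case, some index in 0..3
```

Because the generator is rebuilt from `seed` on every call, repeated calls with the same arguments return the same action. In a training loop one would more typically create the generator once and pass it in.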