Hard

Epsilon-Greedy Action Selection

~15 min

code completion

Epsilon-Greedy Policy

The exploration vs. exploitation tradeoff is central to reinforcement learning (RL). An epsilon-greedy policy balances the two:

With probability $ϵ$: explore by picking a random action

With probability $1 - ϵ$: exploit by picking the action with the highest Q-value

if random() < epsilon:
    action = random_choice(n_actions)
else:
    action = argmax(q_values)

As training progresses, $ϵ$ is typically decayed so that the agent explores less and exploits its learned Q-values more.
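
The decay itself is not specified here, so the following is only a sketch of one common choice, an exponential schedule with a floor. The function name `decayed_epsilon` and the parameters `eps_start`, `eps_min`, and `decay_rate` are hypothetical, chosen for illustration, and the default values are arbitrary.

```python
def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay_rate=0.99):
    """Return epsilon for a given training step (hypothetical schedule).

    Epsilon shrinks geometrically with the step count and is clamped
    so that it never drops below eps_min.
    """
    return max(eps_min, eps_start * decay_rate ** step)


if __name__ == "__main__":
    # Early steps explore almost always; later steps mostly exploit.
    for step in (0, 100, 500):
        print(step, decayed_epsilon(step))
```

Keeping a small floor such as `eps_min` means the agent never stops exploring entirely, which is one reason a clamped schedule is often preferred over decaying all the way to zero.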

Your task:

Implement epsilon_greedy(q_values, epsilon, seed). Use np.random.default_rng(seed) for reproducibility. Return the chosen action index.
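
One possible way to approach the task is sketched below. It assumes that a single uniform draw from the generator decides between exploring and exploiting, and that a second draw with `rng.integers` picks the random action. The problem statement does not specify the order or kind of draws, so for a given seed the action returned on the explore branch may differ from what the grader expects.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, seed):
    """Pick an action index with an epsilon-greedy rule (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    q_values = np.asarray(q_values)

    # rng.random() is uniform on [0, 1), so epsilon=0 never explores
    # and epsilon=1 always explores.
    if rng.random() < epsilon:
        # Explore: uniform random index in [0, n_actions).
        # Assumption: the grader may draw the random action differently.
        return int(rng.integers(len(q_values)))

    # Exploit: index of the largest Q-value (first one on ties).
    return int(np.argmax(q_values))
```

With `epsilon=0` the result depends only on `q_values`, so it is the same for every seed. With `epsilon=1` the result is always a valid index, but its exact value depends on how the random draws are made.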

Example Tests

epsilon=0: always greedy (argmax)

Input: {"seed":0,"epsilon":0,"q_values":[1,5,2]}

Expected: 1

epsilon=1: always random (check within range)

Input: {"seed":42,"epsilon":1,"q_values":[0,0,0,0]}

Expected: 2

epsilon=0: greedy regardless of seed

Input: {"seed":99,"epsilon":0,"q_values":[3,1,2]}

Expected: 0
