Epsilon-Greedy Action Selection

Hard
~15 min
code completion

Epsilon-Greedy Policy

The exploration vs. exploitation tradeoff is central to reinforcement learning (RL). An epsilon-greedy policy balances the two:

  • With probability ε, explore: pick a random action
  • With probability 1 − ε, exploit: pick the action with the highest Q-value

In pseudocode:

    if random() < epsilon:
        action = random_choice(n_actions)
    else:
        action = argmax(q_values)
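
One consequence of the rule above is that the random branch can also land on the greedy action, so with n actions the greedy one is chosen with probability 1 − ε + ε/n overall. The following is a minimal simulation sketch of that behaviour, not part of the problem statement; the helper name selection_frequencies and its parameters n_trials and seed are assumptions made for illustration.

```python
import numpy as np

def selection_frequencies(q_values, epsilon, n_trials=100_000, seed=0):
    """Estimate how often each action is chosen under epsilon-greedy.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    q_values = np.asarray(q_values, dtype=float)
    n_actions = q_values.size
    greedy = int(np.argmax(q_values))

    # True wherever a trial takes the exploration branch.
    explore = rng.random(n_trials) < epsilon
    # Uniform random actions; these may coincide with the greedy action.
    random_actions = rng.integers(n_actions, size=n_trials)
    actions = np.where(explore, random_actions, greedy)

    # Fraction of trials in which each action index was selected.
    return np.bincount(actions, minlength=n_actions) / n_trials

freqs = selection_frequencies([1.0, 5.0, 2.0], epsilon=0.3)
print(freqs)
```

With ε = 0.3 and three actions, the greedy action (index 1) should come out near 0.7 + 0.3/3 = 0.8, and each of the other two near 0.1.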

As training progresses, ε is typically decayed so the agent explores less and exploits its learned Q-values more.
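
The problem does not prescribe a decay schedule. As one common possibility, the sketch below decays epsilon exponentially per step down to a floor. The function name decayed_epsilon and the values of eps_start, eps_end, and decay_rate are illustrative assumptions, not part of the task.

```python
def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_rate=0.995):
    """Exponentially decay epsilon with the step count, never below eps_end.

    Hypothetical schedule for illustration; all defaults are assumptions.
    """
    return max(eps_end, eps_start * decay_rate ** step)

# Epsilon early in training, partway through, and after the floor is reached.
schedule = [decayed_epsilon(t) for t in (0, 100, 1000)]
print(schedule)
```

The floor keeps a small amount of exploration alive late in training; a linear schedule would be an equally reasonable choice.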

Your task:

Implement epsilon_greedy(q_values, epsilon, seed). Use np.random.default_rng(seed) for reproducibility. Return the chosen action index.

Example Tests

epsilon=0: always greedy (argmax)

Input: {"seed":0,"epsilon":0,"q_values":[1,5,2]}

Expected: 1

epsilon=1: always random (check within range)

Input: {"seed":42,"epsilon":1,"q_values":[0,0,0,0]}

Expected: 2

epsilon=0: greedy regardless of seed

Input: {"seed":99,"epsilon":0,"q_values":[3,1,2]}

Expected: 0
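
For orientation, here is one possible sketch of epsilon_greedy that follows the pseudocode and uses np.random.default_rng(seed). The order of the random draws (one rng.random() call, then one rng.integers() call) is an assumption: the problem statement does not specify it, so whether the epsilon=1 case returns exactly the listed value depends on the grader's reference draw sequence and is not verified here.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, seed):
    """Return an action index chosen by an epsilon-greedy rule.

    Sketch only: the draw order below is an assumption, not the
    reference solution.
    """
    rng = np.random.default_rng(seed)
    q_values = np.asarray(q_values, dtype=float)

    # rng.random() lies in [0, 1), so epsilon=0 never explores
    # and epsilon=1 always explores.
    if rng.random() < epsilon:
        return int(rng.integers(q_values.size))  # uniform over action indices
    return int(np.argmax(q_values))  # first index of the maximum Q-value

print(epsilon_greedy([1, 5, 2], epsilon=0, seed=0))
print(epsilon_greedy([3, 1, 2], epsilon=0, seed=99))
```

Note that np.argmax breaks ties by returning the first maximal index; a variant that breaks ties at random would behave differently when several Q-values are equal.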
