
I'm trying to follow a tutorial for Q-Table learning from this source, and am having difficulty understanding a small piece of the code. Here's the entire block:

import gym
import numpy as np


env = gym.make('FrozenLake-v0')

#Initialize table with all zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
#create lists to contain total rewards and steps per episode
#jList = []
rList = []
for i in range(num_episodes):
    #Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    #The Q-Table learning algorithm
    while j < 99:
        j+=1
        #Choose an action by greedily (with noise) picking from Q table
        a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
        #Get new state and reward from environment
        s1,r,d,_ = env.step(a)
        #Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d == True:
            break
    #jList.append(j)
    rList.append(rAll)


print "Score over time: " +  str(sum(rList)/num_episodes)

print "Final Q-Table Values"
print Q

The code runs well and I'm able to print my results, but here is where I'm having difficulties:

a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))

My question is, why are we multiplying by 1/(i+1)? Is this supposed to be an implementation of epsilon annealing? Any help is appreciated.

aalberti333

1 Answer


My question is, why are we multiplying by 1/(i+1)? Is this supposed to be an implementation of epsilon annealing?

The code looks like a relatively ad hoc* adjustment to ensure early exploration, and an alternative to $\epsilon$-greedy action choice. The 1/(i+1) factor is similar to a decaying $\epsilon$, but not identical.

A comparable $\epsilon$-greedy selection with a decaying exploration rate might look like this:

import math, random

epsilon = 1.0   # initial exploration rate (a hyperparameter to tune)
a = np.argmax(Q[s,:])   # greedy action by default
if epsilon/(1+math.sqrt(i)) > random.random():
    a = random.randrange(0, env.action_space.n)   # explore: pick a random action

The math.sqrt(i) is just a suggestion, but I feel that epsilon/(1+i) is probably too aggressive and would cut off exploration too quickly.
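To make that difference concrete, here is a quick sketch comparing the two schedules (the starting value of 1.0 and the episode indices are just illustrative assumptions):

import math

epsilon = 1.0                              # illustrative starting exploration rate
for i in [0, 10, 100, 1000]:
    fast = epsilon / (1 + i)               # decays with the episode number
    slow = epsilon / (1 + math.sqrt(i))    # decays with its square root
    print("episode %4d: fast %.4f  slow %.4f" % (i, fast, slow))

By episode 100 the first schedule explores on only about 1% of steps, while the square-root version still explores roughly 9% of the time.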

It is not something I have seen before when studying Q-Learning (e.g. in David Silver's lectures or Sutton & Barto's book). However, Q-Learning is not predicated on using any specific action-selection rule; it just needs enough exploration in the behaviour policy. For the given problem, adding some noise to a greedy selection obviously works well enough. Technically, for guaranteed convergence, tabular Q-Learning needs infinite exploration over infinite time steps. The code as supplied does achieve that, because the noise is drawn from a Normal distribution and is therefore unbounded. So there is always some small finite chance of selecting an action with a relatively low action-value estimate and refining that estimate later.
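As a rough illustration of that last point, the sketch below estimates how often the noisy argmax deviates from the purely greedy choice as the noise shrinks (the Q-values and episode indices are made up for the example, not taken from the FrozenLake run):

import numpy as np

np.random.seed(0)
q_row = np.array([0.0, 0.50, 0.48, 0.10])   # hypothetical Q[s,:] with two close estimates
for i in [0, 10, 100, 1000]:                # illustrative episode indices
    noise = np.random.randn(10000, 4) * (1.0 / (i + 1))
    picks = np.argmax(q_row + noise, axis=1)
    print("episode %4d: non-greedy rate ~ %.3f" % (i, np.mean(picks != np.argmax(q_row))))

The rate falls quickly as i grows, but because the Gaussian noise is unbounded it never becomes exactly zero, which is what keeps the exploration requirement satisfied (if only just).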

However, the fast decay (1/episode number) and the initial scaling factor for the noise are both hyperparameters that need tuning to the problem. You might prefer something more standard from the literature, such as $\epsilon$-greedy, softmax (Gibbs/Boltzmann) exploration or upper-confidence-bound (UCB) action selection (the example is quite similar to UCB, in that it adds something to the Q-values before taking the max).
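For comparison, minimal sketches of softmax (Gibbs/Boltzmann) and UCB1-style selection for the same tabular setting might look like the following. The temperature, the exploration constant c and the visit-count array n_row are illustrative assumptions and would need tuning to the problem, just like the noise scale in the original code:

import numpy as np

def softmax_action(q_row, temperature=1.0):
    # Gibbs / Boltzmann exploration: sample an action with probability
    # proportional to exp(Q / temperature)
    prefs = np.exp((q_row - np.max(q_row)) / temperature)
    probs = prefs / np.sum(prefs)
    return np.random.choice(len(q_row), p=probs)

def ucb_action(q_row, n_row, t, c=1.0):
    # UCB1-style selection: add an exploration bonus to each Q estimate
    # before taking the argmax, much as the tutorial adds noise
    bonus = c * np.sqrt(np.log(t + 1) / (n_row + 1e-9))
    return np.argmax(q_row + bonus)

Here n_row would hold how many times each action has been tried in state s, and t the total number of action selections so far; rarely tried actions get a large bonus and are revisited, which plays the same role as the decaying noise.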


* Perhaps the approach used in the example has a name (some variation of "Noisy Action Selection") but I don't know it, and could not find it on a quick search.

Neil Slater