
I am trying to build a policy-gradient RL agent. Let's look at REINFORCE's equation for updating the model parameters by taking a gradient ascent step (I apologize if the notation is slightly non-conventional):

$$\omega = \omega + \alpha \cdot \nabla_\omega \log \pi(A_t \mid S_t) \cdot V_t$$

The questions I am unsure about are the following:

  1. Do I calculate the gradient at each time step $t$ (in SGD fashion), or is averaging the gradient over all timesteps of the episode a better option?
  2. Do I only take the gradient of the selected action's probability output (ignoring the outputs for the other actions, in the discrete case)? In other words, do I treat the $V_t$ term for non-selected actions as 0, which makes their gradient values 0 as well?
  3. In the discrete case the cross-entropy (the loss) is defined as: $$H(P, Q) = -\sum_x P(x) \log Q(x)$$

    (source: Wikipedia)

    Does that mean that if I substitute the labels (denoted as $P(x)$) with the $V_t$ terms (non-zero for the selected action only) when training my neural network, I will get the correct gradient values of the log-loss, fully satisfying the REINFORCE definition?

Alexey Burnakov

1 Answer


For notation and visualizations, please take a look at this excellent tutorial on Policy Gradients.

For your questions:

  1. The second option is correct (see the sketch after this list). In PGs we try to maximize the expected return. To do this we approximate the expectation with the mean return over sampled trajectories under a parametrized policy. In other words, sample actions and collect their respective rewards over the timesteps of an episode, then compute the discounted return from the last step back to the first (this is the $V_t$ in your notation, which you will usually find written as $R_t$, the discounted return). Multiply the returns by the log-probabilities of the selected actions and sum. Take a look at slides 8 and 9 of (1) to see how REINFORCE is implemented, along with this code example (lines 59-75).

  2. As you may have already realized, no. $V_t$ is a return over an episode (multiple timesteps) and is calculated as the discounted sum of all the rewards you collected. Even if you get a reward of 1 at the end and 0 everywhere else, that reward is propagated back to the first step (doing it by hand helps a lot!), so the rewards at every timestep are converted into the return from that state and timestep onward.

  3. Look at slide 13 of (1). Your intuition is correct: the maximum-likelihood loss for a classification problem (cross-entropy) multiplied by the return equals the policy-gradient loss (a small numerical check follows the intuition paragraph below). If you use a simple neural network with REINFORCE on a two-action task, you will notice that the gradients propagated back are the same as in a classification task, except that here they are multiplied by the respective return (line 69 of (2)) instead of by the class label (0/1).
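
To make points 1 and 2 concrete, here is a minimal sketch of mine (not the tutorial's code) of a single REINFORCE update: one episode is sampled, the discounted returns are computed backwards, and the loss is the sum of $\log \pi(A_t \mid S_t) \cdot V_t$ over the episode. The network sizes and the Gym-style `env` interface are assumptions for illustration.

```python
import torch
import torch.nn as nn

gamma = 0.99

# Assumed toy setup: 4 state dimensions, 2 discrete actions
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode(env):
    """Sample one trajectory, storing log pi(A_t|S_t) of the chosen actions and the rewards."""
    log_probs, rewards = [], []
    state, done = env.reset(), False          # assumes a classic Gym-style API
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))   # only the selected action's log-probability
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    return log_probs, rewards

def discounted_returns(rewards, gamma):
    """Work backwards from the last step: G_t = r_t + gamma * G_{t+1} (the V_t of the question)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return returns

def reinforce_update(env):
    """One parameter update per episode (the 'second option' in question 1)."""
    log_probs, rewards = run_episode(env)
    returns = discounted_returns(rewards, gamma)
    # Sum log pi(A_t|S_t) * V_t over the timesteps; negated because optimizers minimize.
    loss = -(torch.stack(log_probs) * torch.as_tensor(returns, dtype=torch.float32)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```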

A bit of intuition for (3): this is in accordance with PG methods being model-free methods that map states to actions. If you apply a clustering technique to the hidden representations of the last layer of your network (after dimensionality reduction) and color the data points according to which action the network chose, you will find that the representations naturally cluster into separate groups depending on the action. At a high level, you can say that REINFORCE performs a kind of classification, with the reward signal playing the role of the label.
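
A small numerical check of that equivalence (a sketch with made-up numbers, not the linked repo's code): for the selected action, the policy-gradient term $-\log \pi(a \mid s) \cdot V_t$ is exactly the cross-entropy loss against a one-hot "label" for that action, scaled by the return.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.2, -0.3]], requires_grad=True)  # network output for one state (made up)
action = torch.tensor([0])                                 # the action that was actually sampled
G_t = 2.5                                                  # its discounted return (made up)

# Policy-gradient loss for this timestep
pg_loss = -torch.distributions.Categorical(logits=logits).log_prob(action) * G_t

# Cross-entropy against the selected action as the "label", weighted by the return
ce_loss = F.cross_entropy(logits, action) * G_t

print(torch.allclose(pg_loss, ce_loss))  # True: same loss, hence the same gradients
```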

Hope it helps!

Constantinos