I am trying to understand reinforcement learning and Markov decision processes (MDPs) in the case where a neural network is used as the function approximator.
I'm having difficulty with the relationship between the probabilistic exploration of the environment in an MDP, how this maps back to learning the network's parameters, and how the final solution/policy is found.
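To make what I mean by "probabilistic exploration" concrete, here is a minimal epsilon-greedy sketch of my current understanding (assuming PyTorch; the toy 4-dimensional state, the 2 discrete actions, and the names `q_net` and `epsilon` are just placeholders I made up):

```python
import random
import torch
import torch.nn as nn

# Hypothetical toy setup: 4-dim state, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
s = torch.randn(4)
epsilon = 0.1

# Epsilon-greedy exploration: mostly follow the current Q estimates,
# but sometimes act at random so all state-action pairs keep being visited.
if random.random() < epsilon:
    action = random.randrange(2)          # explore
else:
    with torch.no_grad():
        action = q_net(s).argmax().item() # exploit current estimates
```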
Am I correct in assuming that in the case of Q-learning, the neural network essentially acts as a function approximator for the Q-value itself, i.e. the expected discounted reward so many steps into the future? How does this map to updating the parameters via backpropagation or other methods?
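Here is how I currently picture that update: the one-step TD target, r + gamma * max over a' of Q(s', a'), is treated as a fixed regression label, and backpropagation minimizes the squared error against it. This is only a sketch under the same assumed toy setup as above, not a full DQN (no replay buffer or target network):

```python
import torch
import torch.nn as nn

# Same hypothetical setup: 4-dim state, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

# One transition (s, a, r, s') collected while exploring, e.g. epsilon-greedily.
s = torch.randn(4)
a = 1
r = 1.0
s_next = torch.randn(4)

# TD target: r + gamma * max_a' Q(s', a'), held fixed (no gradient through it).
with torch.no_grad():
    target = r + gamma * q_net(s_next).max()

# Squared TD error; backpropagation updates the network's weights so that
# Q(s, a) moves toward the target.
pred = q_net(s)[a]
loss = (pred - target) ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()
```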
Also, once the network has learned to predict the future reward, how does this fit into the system when it comes to actually making decisions? I am assuming the final system would not make state transitions probabilistically.
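In other words, I imagine the final decision-making reduces to taking the argmax action, with the epsilon-greedy randomness dropped once training is done (same hypothetical `q_net` as above):

```python
import torch
import torch.nn as nn

# Same hypothetical trained network: 4-dim state, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
s = torch.randn(4)

# At deployment: act greedily with respect to the learned Q-values,
# with no exploration noise.
with torch.no_grad():
    action = q_net(s).argmax().item()
```

Is that the right mental model, or does the deployed policy stay stochastic in some way?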
Thanks