I'm having some trouble describing in one line what I want, which is probably why I haven't had much luck with Google.
Say I have a game like 2048 where the set of possible actions at each step is fixed (and there are more than two). I want to train a neural network that chooses a move, so I have 4 neurons in the output layer, and I make the move with the highest prediction. The output vector is normalized by a softmax layer.
However, the training data I have is just the state, the move that was made, and whether that had good or bad results. If the chosen move is bad, I don't know which of the other ones was better (if any).
How should I train this? My current thought is like this:
- Good move? -> chosen action gets a positive error (so its prediction goes up)
- Bad move? -> chosen action gets a negative error (so its prediction goes down)
But I haven't found literature supporting this guess. There are alternatives:
- Maybe I should also update the options that weren't chosen (in the opposite direction)?
- Is it a good idea to set the error directly instead of deriving it from target predictions?
- The error for good and bad moves could be different, maybe to preserve normalization?
- ...
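
A minimal sketch of the scheme above may make the question more concrete. It rests on assumptions that are not in the original description: the "error" is applied as a signed gradient step on the log-probability of the chosen action, the outcome is encoded as `reward = +1` (good) or `-1` (bad), and the four softmax inputs (logits) are updated directly rather than through network weights. The names `softmax`, `logit_update`, `reward`, and `lr` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_update(logits, action, reward, lr=0.1):
    """Return new logits after one signed step on log p(action).

    Assumed encoding: reward is +1 for a good move, -1 for a bad one.
    The gradient of log p(action) with respect to logit i is
    (1 if i == action else 0) - p[i], so every logit receives an update.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# Toy example: four moves, uniform starting logits, move 2 was played.
logits = [0.0, 0.0, 0.0, 0.0]
good_logits = logit_update(logits, action=2, reward=+1.0)
bad_logits = logit_update(logits, action=2, reward=-1.0)
after_good = softmax(good_logits)
after_bad = softmax(bad_logits)

print("after good move:", [round(p, 3) for p in after_good])
print("after bad move: ", [round(p, 3) for p in after_bad])
```

In this toy setup the unchosen outputs move in the opposite direction from the chosen one without being updated explicitly, because the per-logit gradients sum to zero, and the softmax keeps the output vector normalized either way. Whether the same holds up in a full network trained by backpropagation is exactly the open part of the question.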
(I'm doing 2048 and using neural networks, but I think this is not limited to this game or this method.)