
I'm having some trouble describing in one line what I want, which is probably why I haven't had much luck with Google.

Say I have a game like 2048 where the possible actions at each step are fixed (and there are more than two). I want to train a neural network that chooses a move, so I have 4 neurons in the output layer and I make the move with the highest prediction. The output vector is normalized with a softmax layer.
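
For concreteness, here is roughly the setup I mean; the layer sizes, the tanh hidden layer, and the use of plain numpy are just illustrative choices:

```python
import numpy as np

# Illustrative sketch: a one-hidden-layer network mapping the flattened
# 4x4 board (16 values) to scores for the 4 moves, normalized by softmax.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(16, 32))   # input -> hidden
W2 = rng.normal(scale=0.1, size=(32, 4))    # hidden -> 4 move scores

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(board):
    """board: flat array of 16 tile values; returns 4 move probabilities."""
    h = np.tanh(board @ W1)
    return softmax(h @ W2)

move = int(np.argmax(predict(np.zeros(16))))   # play the highest-scoring move
```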

However, the training data I have is just the state, the move that was made, and whether that had good or bad results. If the chosen move is bad, I don't know which of the other ones was better (if any).

How should I train this? My current thought is like this (a rough code sketch follows the list):

  • Good move? -> chosen action gets positive error (so prediction goes up)
  • Bad move? -> chosen action gets negative error (so prediction goes down)
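
In code, on the sketch network above, that idea would look something like this; the +1/-1 rewards and the learning rate are illustrative values:

```python
def update(board, action, reward, lr=0.01):
    """One training step on the sketch network above: reward = +1 pushes
    p(action) up, reward = -1 pushes it down. Reuses W1, W2, softmax."""
    global W1, W2
    h = np.tanh(board @ W1)
    probs = softmax(h @ W2)
    onehot = np.zeros(4)
    onehot[action] = 1.0
    # Gradient of reward * log p(action) with respect to the logits; the
    # -probs term nudges the unchosen moves the opposite way, which is
    # what keeps the softmax outputs normalized.
    g_logits = reward * (onehot - probs)
    g_h = (W2 @ g_logits) * (1.0 - h**2)   # backprop through tanh (pre-update W2)
    W2 += lr * np.outer(h, g_logits)       # gradient ascent on the reward
    W1 += lr * np.outer(board, g_h)

update(np.zeros(16), action=2, reward=+1)  # chosen move 2 turned out good
```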

But I haven't found literature supporting this guess. There are alternatives:

  1. Maybe I should also update the options that weren't chosen (in the opposite direction)?
  2. Is it a good idea to set the error directly instead of deriving it from target outputs?
  3. The error magnitudes for good and bad moves could differ, maybe to preserve normalization?
  4. ...

(I'm working on 2048 with neural networks, but I don't think the question is limited to this game or this method.)

Mark
1 Answer


One way to frame your problem is reinforcement learning (RL). RL trains an agent to accomplish a goal in an environment. In your case, the environment is 2048, the goal is to solve the game, and the agent is the model you are training.
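
In code, that framing is just an interaction loop; everything below (the class names, the stubbed reward and termination) is a placeholder sketch rather than a specific RL library:

```python
import random

class Env2048Stub:
    """Placeholder environment, not a real 2048 implementation."""
    def reset(self):
        self.steps = 0
        return [0] * 16                      # flattened 4x4 board

    def step(self, action):
        self.steps += 1
        reward = random.choice([-1, 1])      # stand-in for good/bad outcome
        done = self.steps >= 10              # stand-in for game over
        return [0] * 16, reward, done

class RandomAgent:
    """Placeholder agent; your trained network would replace both methods."""
    def choose(self, state):
        return random.randrange(4)           # pick one of the 4 moves

    def learn(self, state, action, reward):
        pass                                 # parameter update would go here

env, agent = Env2048Stub(), RandomAgent()
state, done = env.reset(), False
while not done:
    action = agent.choose(state)             # agent acts in the environment
    state, reward, done = env.step(action)   # environment scores the action
    agent.learn(state, action, reward)       # agent learns from the outcome
```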

"If the chosen move is bad, I don't know which of the other ones was better."

That trade-off is frequently called explore-exploit: does the agent make the move it currently predicts is best (exploit), or does it try other moves in search of better ones (explore)?
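
A common, simple scheme for balancing the two is epsilon-greedy action selection. A sketch, assuming the 4 move scores from your network as input (the epsilon value is illustrative):

```python
import numpy as np

def epsilon_greedy(scores, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon try a uniformly random move (explore);
    otherwise play the highest-scoring move (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(scores)))   # explore
    return int(np.argmax(scores))               # exploit

move = epsilon_greedy(np.array([0.1, 0.6, 0.2, 0.1]))
```

Early in training you typically set epsilon high (explore a lot) and decay it as the agent's predictions improve.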

Brian Spiering