Implementing temporal difference learning in chess

Question

I have been developing a chess program that uses the alpha-beta pruning algorithm and an evaluation function that scores positions using features such as material, king safety, mobility, pawn structure, and trapped pieces. My evaluation function has the form

$$f(p) = w_1 \cdot \text{material} + w_2 \cdot \text{kingsafety} + w_3 \cdot \text{mobility} + w_4 \cdot \text{pawn-structure} + w_5 \cdot \text{trapped pieces}$$

where $w_i$ is the weight assigned to each feature. At this point I want to tune the weights of my evaluation function using temporal difference learning, where the agent plays against itself and, in the process, gathers training data from its environment (a form of reinforcement learning). I have read some books and articles to get some insight into how to implement this in Java, but they are theoretical rather than practical. I need a detailed explanation and pseudocode showing how to automatically tune the weights of my evaluation function based on previous games.

SmallChess · Answer 1 · 2016-03-29T04:52:04.283

I recommend that anybody interested in the topic take a look at the paper that combines TDL and deep learning.

Roughly, you'll need to make the engine play games against itself. Record the minimax evaluation for each position. At the end of the game, you'll get a reward, which for chess is one of {0, 1, -1} (draw, win, or loss). Then you'll need to adjust your parameters with:
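
A common form of such an update, written here as an assumption (a TD(0)-style rule with no discount factor, applied to the evaluation $f$ and weights $w$ from the question), is:

$$w \leftarrow w + \alpha \sum_{t=1}^{N} \nabla_w f(p_t)\, d_t, \qquad d_t = \begin{cases} f(p_{t+1}) - f(p_t), & t < N \\ r - f(p_N), & t = N \end{cases}$$

Here $p_1, \dots, p_N$ are the positions of one game, $r$ is the final reward, $\alpha$ is the learning rate, and $d_t$ is the temporal difference at step $t$. For a linear $f$, the component of $\nabla_w f(p_t)$ for weight $w_i$ is simply the value of feature $i$ in position $p_t$.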

This equation tells us we should adjust the weights by the temporal differences, scaled by a step size that controls how far each adjustment goes. If you had a perfect evaluation, your temporal differences would always be zero, so you wouldn't need to make any adjustment.

Next, you'll need to use the new parameters to play a new game. Repeat for as many games as you can afford, or until you think the weights have converged.
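
The procedure above can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a linear evaluation (so the gradient with respect to each weight is just the feature value), no discount factor, and an evaluation scaled to the same range as the reward. The class name `TdTuner`, its method names, and the representation of a game as a list of feature vectors are all hypothetical.

```java
import java.util.List;

/** Hypothetical sketch of a TD(0)-style weight update for a linear evaluation. */
final class TdTuner {

    /** Linear evaluation: dot product of weights and feature values. */
    static double evaluate(double[] weights, double[] features) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    /**
     * Returns updated weights after one finished game.
     *
     * @param positions feature vectors of the recorded positions, in game order
     *                  (assumed format; feature extraction is up to the engine)
     * @param reward    1, 0, or -1, from the same point of view as the evaluation
     * @param alpha     learning rate
     */
    static double[] update(double[] weights, List<double[]> positions,
                           double reward, double alpha) {
        double[] updated = weights.clone();
        for (int t = 0; t < positions.size(); t++) {
            double[] features = positions.get(t);
            double current = evaluate(weights, features);
            // The target is the next position's evaluation, or the reward at the end.
            double next = (t + 1 < positions.size())
                    ? evaluate(weights, positions.get(t + 1))
                    : reward;
            double temporalDifference = next - current;
            // For a linear evaluation the gradient is the feature vector itself.
            for (int i = 0; i < updated.length; i++) {
                updated[i] += alpha * temporalDifference * features[i];
            }
        }
        return updated;
    }
}
```

In a training loop one would play a self-play game with the current weights, collect the feature vectors of the positions whose evaluations were recorded, call `update`, and play the next game with the returned weights.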

A few remarks:

The paper I cite applies a discount factor. That is done for the backpropagation algorithm of its neural network. You don't need it.
You'll need to experiment to find a good learning rate (alpha in the equation). Too large a value will make your learning unstable; too small a value will make convergence slow. I've seen people use 0.70. The paper I cite used 1.0.

score 2 · Answer 2 · answered Aug 23 '14 at 15:25

A first remark: you should watch 'WarGames' to know what you're getting yourself into.

What you want is an f(p) that is as close as possible to the strength of the position.

A very simple solution using a genetic algorithm would be to set up 10000 players with different weights and see which ones win. Then keep the top 1000 winners' weights, copy them 10 times, alter them slightly to explore the weight space, and run the simulation again. That's a standard GA: given a functional form, find the best coefficients for it.
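
One generation of that scheme might look like the Java sketch below. It is only an illustration under assumptions: the class `WeightEvolver`, the `fitness` callback (for example a tournament score obtained by playing games with a given weight vector), and the Gaussian mutation are hypothetical choices, not something prescribed above. Scores are computed once per player before sorting so that a noisy fitness function cannot produce inconsistent comparisons.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of one genetic-algorithm generation over weight vectors. */
final class WeightEvolver {

    /** Keeps the best tenth of the population and refills it with mutated copies. */
    static List<double[]> nextGeneration(List<double[]> population,
                                         ToDoubleFunction<double[]> fitness,
                                         double mutationScale,
                                         Random rng) {
        int n = population.size();
        double[] scores = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            scores[i] = fitness.applyAsDouble(population.get(i)); // e.g. games won
            order[i] = i;
        }
        // Sort player indices by score, best first.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        int survivors = Math.max(1, n / 10);
        List<double[]> next = new ArrayList<>(n);
        for (int k = 0; next.size() < n; k++) {
            double[] child = population.get(order[k % survivors]).clone();
            // The first copy of each survivor is kept unchanged; later copies are
            // perturbed slightly to explore the weight space.
            if (k >= survivors) {
                for (int i = 0; i < child.length; i++) {
                    child[i] += mutationScale * rng.nextGaussian();
                }
            }
            next.add(child);
        }
        return next;
    }
}
```

Calling `nextGeneration` repeatedly, with `fitness` backed by actual games between the players, gives the loop described above; the population size and the one-in-ten survival ratio are parameters to experiment with.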

Another solution is to extract the positions, so you have a table '(material, kingsafety, mobility, pawn-structure, trappedpieces) -> goodness of position', where goodness of position is some objective factor (the win/lose outcome computed using the simulations above or known matches, the depth of the available tree, or the number of moves under the tree where one of the 5 factors gets better). You can then try different functional forms for your f(p), such as regression or an SVM.
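
As a rough illustration of the regression option, the Java sketch below fits a linear f(p) to such a table by stochastic gradient descent on the squared error. The class `WeightRegression`, its parameters, and the choice of plain least squares are assumptions made for the example; any other regression method could be substituted.

```java
/** Hypothetical sketch: least-squares fit of linear weights to a position table. */
final class WeightRegression {

    /**
     * @param features one row of feature values per position
     * @param targets  the "goodness of position" value for each row
     * @return weights minimising squared error, approximately
     */
    static double[] fit(double[][] features, double[] targets,
                        double learningRate, int epochs) {
        int dims = features[0].length;
        double[] weights = new double[dims];
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int row = 0; row < features.length; row++) {
                double prediction = 0.0;
                for (int i = 0; i < dims; i++) {
                    prediction += weights[i] * features[row][i];
                }
                double error = targets[row] - prediction;
                // Move each weight in the direction that reduces the error.
                for (int i = 0; i < dims; i++) {
                    weights[i] += learningRate * error * features[row][i];
                }
            }
        }
        return weights;
    }
}
```

The learning rate has to be small relative to the size of the feature values, so features on very different scales (for example material in centipawns) would normally be normalised first.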
