The cross-entropy (CE) method is a technique for solving optimization problems such as the travelling salesman problem. The paper “A Tutorial on the Cross-Entropy Method” by de Boer, Kroese, Mannor, and Rubinstein gives very detailed information about the CE method and shows how to apply it to problems like travelling salesman, max-cut, and maze traversal. Unfortunately, the paper is rather dense. Fortunately, you can just read the rest of this post to learn about cross entropy the easy way.

The CE method can be applied to any problem where you are trying to find the best values for a set of on/off switches. Since we can express the travelling salesman problem as a group of binary choices (take this road or don’t take this road), CE can be used to solve it. I have also used it to tune numeric parameters by turning a numeric range into a set of discrete switches. For instance, if I am programming something with a random event in it and I want to find the optimal value for that event in the range from 0 to 20, I make several switches with numeric values that sum to 20 when they are all on: I would divide the range 0 to 20 into switches worth 1, 2, 4, 8, and 5.
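To make that encoding concrete, here is a small sketch of decoding such a switch vector back into a number (the function name and layout are mine, just for illustration):

```cpp
#include <vector>
#include <cstddef>

// Decode a vector of on/off switches into a numeric parameter by
// summing the weights of the switches that are on. With the weights
// {1, 2, 4, 8, 5} from above, every value from 0 to 20 is reachable.
int decodeSwitches(const std::vector<bool>& switches,
                   const std::vector<int>& weights) {
    int value = 0;
    for (std::size_t i = 0; i < switches.size(); ++i) {
        if (switches[i]) {
            value += weights[i];
        }
    }
    return value;
}
```

For example, turning on only the 1 and 4 switches yields the parameter value 5.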

The way that the cross-entropy method finds the optimal values for the switches you give it is fairly simple. You start off with a group of binary switches and a function that uses those switches. Every time the function runs it reports a score that indicates its performance with the given switch values. The CE algorithm runs a series of trials, uses the performance of those trials to improve its choices for the binary switches, and then repeats the trials. It continues on this way until it either reaches your desired level of performance or can no longer improve.

Of course, if there are a large number of binary switches (let’s say 100) then running through all possible combinations of those switches would take 2^{100} trials. Instead, the CE algorithm runs a small number of trials with the switches set randomly based upon a vector of probabilities. It then looks at the switch values that gave the best performance in that set of trials to determine the probability that each switch should be on or off. For instance, let’s say the CE algorithm looks at the top ten runs from a round of trials and sees that switch 1 is on in all of them. This implies that switch 1 is important to getting a good score, so during the next round of trials switch 1 will always be turned on. If switch 2 was on in 8 of the 10 best trials, switch 2 will be turned on with probability 0.8 during the next round of trials. For the very first round you need to provide the CE algorithm an initial set of probabilities for each switch, but for later rounds it deduces the probabilities from the previous round.
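Sampling one trial’s switch settings from such a probability vector is a one-liner per switch; a minimal sketch (the function name and RNG choice are my own, not from the paper):

```cpp
#include <random>
#include <vector>

// Sample one trial's switch settings: switch i is turned on
// with the probability given in p_vector[i].
std::vector<bool> sampleSwitches(const std::vector<float>& p_vector,
                                 std::mt19937& rng) {
    std::vector<bool> switches;
    switches.reserve(p_vector.size());
    // uniform_real_distribution produces values in [0, 1)
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    for (float p : p_vector) {
        switches.push_back(unif(rng) < p);
    }
    return switches;
}
```

A probability of 1 forces a switch on and 0 forces it off, which is exactly how the algorithm “locks in” switches it has become certain about.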

To explain how the cross-entropy method works I’ll pick a concrete example and walk through it. Imagine that you own a carriage and 20 horses. Since you’re a busy person you would like to make the carriage go as fast as possible, but it’s not clear which horses, or how many, should be hitched to the carriage to get the fastest result. There are 2 states for each horse (hitched or unhitched) and there are 20 horses, so there are 2^{20}-1 possible ways to hitch your carriage with at least one horse.

Faced with this daunting task you will probably just try out a few combinations, pick the fastest one, and stop there. But wait – there is a way to use the information you get from each trial to improve your results in later trials. First you decide upon a probability of hitching each horse – let’s say it’s one half to start out with. Now you randomly pick your horses with this probability and ride your carriage a few times, recording the time of each trip. Let’s say you take 10 trips like this and end up with this table:

| Horse | Trial 1 | Trial 2 | Trial 3 | … | Trial 9 | Trial 10 |
|-------|---------|---------|---------|---|---------|----------|
| 1     | P       | P       | A       | … | A       | P        |
| 2     | P       | A       | A       | … | A       | P        |
| 3     | A       | P       | P       | … | A       | A        |
| …     |         |         |         |   |         |          |
| 10    | A       | A       | A       | … | P       | P        |
| Time  | 30m     | 25m     | 24m     | … | 28m     | 32m      |

The P’s and A’s denote that the horse was present or absent for that trial. For trial 1, horses 1 and 2 were present and horses 3 and 10 were absent.

Now, using the cross-entropy method, we choose an upper group of the trials (for instance the top 3 or top 5 fastest). In this case let’s choose the two fastest, trials 2 and 3. Now we look at how often each horse appears in these trials and use that frequency to set the probability of using that horse in the next round. Horse 1 was in one trial but not the other, so its selection probability remains one half. Horse 2 was absent from both of the best trials, so the probability of selecting it for the next round will be zero. Horse 3 was present in both of the fastest trials, so it will always be selected in the next round.
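That update step can be sketched in a few lines (again, the function name is mine, just for illustration):

```cpp
#include <vector>
#include <cstddef>

// Given the switch settings of the elite (best-scoring) trials,
// the next round's probability for each switch is simply the
// fraction of elite trials in which that switch was on.
std::vector<float> updateProbabilities(
        const std::vector<std::vector<bool>>& elite_trials) {
    std::vector<float> p(elite_trials[0].size(), 0.0f);
    for (const auto& trial : elite_trials) {
        for (std::size_t i = 0; i < trial.size(); ++i) {
            if (trial[i]) p[i] += 1.0f;
        }
    }
    for (float& prob : p) {
        prob /= static_cast<float>(elite_trials.size());
    }
    return p;
}
```

Feeding it horses 1 through 3 from trials 2 and 3 above (present/absent, absent/absent, present/present) gives the probabilities 0.5, 0, and 1 just described.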

Essentially, the cross-entropy method finds the correlation between each item and getting a good result. If a good result is highly correlated with a certain horse then after several rounds the probability of choosing that horse will go to 1. Similarly, if a horse is negatively correlated with getting a good result then its probability will quickly go towards 0. Because many random combinations are attempted, this approach can even discover specific combinations of horses that go faster together^{1}.

Now, let’s show a quick example with some of the code I’ve written. I have a program that simplifies using the CE method here. The cross entropy function looks like this:

struct CEIteration {
    int round;
    float min_quantile_performance;
    float avg_quantile_performance;
    float avg_performance;
    vector<float> probability_vector;
};

/**
 * Run the cross-entropy method with N simulations per round,
 * using the best quantile*N results of each round (those above
 * the (1-quantile) sample quantile) to form the probability
 * vector for the next round, and using initial_p_vector as the
 * probability vector for round 0.
 * The simulation tries to reach a score of goal.
 * When successive iterations fail to improve by more than min_improve
 * the algorithm will return.
 * The function scoring_sim accepts a decision vector that
 * specifies the behavior for the Monte Carlo simulation and
 * returns the score of the simulation with that decision vector.
 */
vector<CEIteration> runCE(std::size_t N, float quantile, float goal, float min_improve,
                          vector<float> initial_p_vector,
                          function<float (vector<bool>)> scoring_sim);
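Since the linked implementation isn’t reproduced here, below is a hypothetical sketch of how a function with this declaration could be implemented. The fixed seed, the stopping details, and the elite-count rounding are all my assumptions, not the author’s actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// Mirrors the CEIteration struct above, repeated so this sketch
// compiles on its own.
struct CEIteration {
    int round;
    float min_quantile_performance;
    float avg_quantile_performance;
    float avg_performance;
    std::vector<float> probability_vector;
};

// Sketch of runCE: each round samples N decision vectors from the
// probability vector, scores them, keeps the best quantile*N as the
// elite set, and re-estimates each switch's probability as its
// frequency among the elite.
std::vector<CEIteration> runCE(std::size_t N, float quantile, float goal,
        float min_improve, std::vector<float> p_vector,
        std::function<float(std::vector<bool>)> scoring_sim) {
    std::mt19937 rng(0);  // fixed seed so the sketch is reproducible
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    std::vector<CEIteration> history;
    float prev_best = 0.0f;
    for (int round = 0; ; ++round) {
        // Run N trials with randomly sampled decision vectors.
        std::vector<std::pair<float, std::vector<bool>>> trials;
        float total = 0.0f;
        for (std::size_t n = 0; n < N; ++n) {
            std::vector<bool> decisions;
            for (float p : p_vector) decisions.push_back(unif(rng) < p);
            float score = scoring_sim(decisions);
            total += score;
            trials.emplace_back(score, decisions);
        }
        // Sort best-first and keep the elite quantile.
        std::sort(trials.begin(), trials.end(),
                  [](const std::pair<float, std::vector<bool>>& a,
                     const std::pair<float, std::vector<bool>>& b) {
                      return a.first > b.first;
                  });
        std::size_t elite = std::max<std::size_t>(
            1, static_cast<std::size_t>(quantile * N));
        // New probability = frequency of each switch in the elite set.
        std::vector<float> new_p(p_vector.size(), 0.0f);
        float elite_total = 0.0f;
        for (std::size_t e = 0; e < elite; ++e) {
            elite_total += trials[e].first;
            for (std::size_t i = 0; i < new_p.size(); ++i)
                if (trials[e].second[i]) new_p[i] += 1.0f;
        }
        for (float& p : new_p) p /= static_cast<float>(elite);
        p_vector = new_p;
        history.push_back({round, trials[elite - 1].first,
                           elite_total / elite, total / N, p_vector});
        float best = trials[0].first;
        if (best >= goal) break;                                  // goal reached
        if (round > 0 && best - prev_best < min_improve) break;   // stalled
        prev_best = best;
    }
    return history;
}
```

The two break conditions mirror the documented behavior: stop at the goal, or stop when a round fails to improve on the last by at least min_improve.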

The result that runCE returns indicates how many rounds were run and what the performance of each round was. I’ll leave you with code for one of the examples in the cross-entropy paper. In this example we have a black box with 10 buttons. After some buttons are pressed, the black box tells you how many buttons were in the correct state but does not tell you which buttons were right. This is a toy example, but it should show you how to use the code I’ve given you.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

//Returns the number of buttons in the correct state.
float blackBoxDecoder(std::vector<bool> input) {
    if (input.size() != 10) {
        return 0;
    }
    std::vector<bool> ground_truths{false, false, false, false, false,
                                    true, true, true, true, true};
    float score = 0.0;
    for (std::size_t i = 0; i < 10; ++i) {
        if (input[i] == ground_truths[i]) {
            score += 1.0;
        }
    }
    return score;
}

int main(int argcount, char** argvector) {
    //Set the initial probabilities for each button to one half
    std::vector<float> initial_p(10, 0.5);
    //Get the results – fifty trials per round, use the results of the top ten percent
    //for the next round, aim for a score of 10, and stop if the improvement is less than .25.
    std::vector<CEIteration> iterations = runCE(50, 0.1, 10.0, 0.25, initial_p, blackBoxDecoder);
    for (auto I = iterations.begin(); I != iterations.end(); ++I) {
        std::cout << I->round << '\t' << I->min_quantile_performance << '\t'
                  << I->avg_quantile_performance;
        std::for_each(I->probability_vector.begin(), I->probability_vector.end(),
                      [](float arg) { std::cout << '\t' << arg; });
        std::cout << '\n';
    }
    return 0;
}

Keep in mind that you must run enough trials in every iteration for the CE algorithm to see each switch in both states. This means that you should run more trials per iteration if you have more binary switches. For instance, with 10 switches and 10 trials per iteration, the probability that no switch is always on or always off is (1 – 2*(0.5)^{10})^{10} ≈ 0.98. However, with 100 switches this probability falls to 0.82, and with 1000 switches it falls to 0.14. If a switch is always in one state then the CE algorithm cannot determine its correlation with the results, so make sure that the number of trials you run per iteration scales with the number of variables you need to optimize.
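Those numbers come from a simple calculation: a single switch avoids being always on or always off across T trials at p = 0.5 with probability 1 - 2*(0.5)^T, and with S independent switches that value is raised to the S. A throwaway helper (illustrative only) to reproduce them:

```cpp
#include <cmath>

// Probability that, across `trials` random settings at p = 0.5,
// none of the `switches` switches ends up always-on or always-off.
double probAllSwitchesVary(int switches, int trials) {
    double p_one_varies = 1.0 - 2.0 * std::pow(0.5, trials);
    return std::pow(p_one_varies, switches);
}
```

Evaluating it at 10, 100, and 1000 switches with 10 trials gives the 0.98, 0.82, and 0.14 figures above.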

^{1} The paper contains several mathematical proofs regarding cross entropy’s ability to move past local optima and converge upon the optimum result, so consult the paper for more details.