The OpenAI Gym lunar lander environment is an exciting way to test reinforcement learning algorithms. My solution uses deep Q learning with experience replay to solve the discrete lunar lander task after 1590 episodes of training.
The deep Q learning algorithm was popularized by DeepMind after their demonstration of it learning to succeed in a range of different Atari games. Deep Q learning is a variation of traditional Q learning that uses deep neural networks to estimate the Q value for large state spaces, which would otherwise be infeasible with traditional Q tables. The policy can then be inferred by choosing the action that maximizes the estimated Q value.
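To make that last point concrete, here is a minimal sketch of the vector-output style used in the Atari work: a network maps a state to one Q estimate per action, and the greedy policy is a single argmax over that vector. The use of PyTorch and the layer sizes are my assumptions, not details from the project.

```python
# Minimal sketch (assumptions: PyTorch, an 8-dimensional state and 4 discrete
# actions as in the lunar lander task, and a vector-output Q network).
import torch
import torch.nn as nn

q_net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),          # one Q estimate per discrete action
)

def greedy_action(state):
    """Infer the policy by picking the action with the highest Q estimate."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax())
```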
My model consists of a number of fully connected layers with rectified linear (ReLU) activations. The output is a single value: the current estimate of the Q function for the provided inputs. With a sufficiently small discrete action space, a reasonable alternative would be to output the value of each action as a vector. The action could then be chosen from the index of the maximum element, which would require only a single prediction to find the argmax and therefore reduce the cost of training. However, in this project I have chosen to estimate the Q function for each action separately.
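To contrast with the vector-output alternative, the sketch below estimates Q(S,A) with a scalar-output network that is fed the state concatenated with a one-hot encoded action, then evaluates each action separately to pick the best one. The layer sizes, the one-hot encoding, and the use of PyTorch are assumptions rather than the project's actual architecture.

```python
# Sketch of a scalar-output Q network (assumptions: PyTorch, 8-dim state,
# 4 discrete actions encoded one-hot and concatenated with the state).
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS = 4

scalar_q_net = nn.Sequential(
    nn.Linear(8 + N_ACTIONS, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),          # single value: the Q estimate for (state, action)
)

def q_value(state, action):
    """Estimate Q(S, A) for one state-action pair."""
    one_hot = F.one_hot(torch.tensor(action), N_ACTIONS).float()
    x = torch.cat([torch.as_tensor(state, dtype=torch.float32), one_hot])
    return scalar_q_net(x)

def best_action(state):
    """Evaluate every action separately and keep the best one."""
    with torch.no_grad():
        return max(range(N_ACTIONS), key=lambda a: q_value(state, a).item())
```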
A Q function takes a state S and an action A and estimates the expected future reward Q(S,A) for that combination. The Q estimates are trained using temporal difference learning: since the total episode reward is not known until the episode ends, the expected reward is instead estimated as:
Q(S,A) = R + gamma * max_A' Q(S', A')

Where S' is the state at step t+1, A' ranges over the actions available in S', and gamma is a discount factor between 0 and 1. The actor has shown adequate learning ability with this approach.
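A sketch of how that target can drive one training step is shown below. For brevity it uses the vector-output variant discussed earlier; the optimizer, learning rate, loss, and gamma value are assumptions, not the project's settings.

```python
# Sketch of one temporal-difference update (assumptions: PyTorch, a
# vector-output Q network, gamma = 0.99, MSE loss, Adam optimizer).
import torch
import torch.nn as nn

GAMMA = 0.99

q_net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(state, action, reward, next_state, done):
    """Move Q(S, A) towards R + gamma * max_A' Q(S', A')."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # Bootstrapped target; no bootstrap term once the episode has ended.
    with torch.no_grad():
        target = reward + (0.0 if done else GAMMA * q_net(next_state).max().item())

    prediction = q_net(state)[action]
    loss = nn.functional.mse_loss(prediction, torch.tensor(target))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```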
Each episode of training may last only a few hundred timesteps, or even fewer. This means the actor could need a very large number of episodes to gather enough data to converge on an acceptable policy. To increase sample efficiency, I have employed experience replay.
At each timestep the experienced transition is saved in a replay buffer of fixed size. A random sample from this buffer - the batch for that timestep - is then used for training. This allows the actor to reuse previous experiences, and therefore converge faster.
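A fixed-size buffer with uniform random sampling can be as small as the sketch below; the capacity, batch size, and use of collections.deque are assumptions rather than the project's exact configuration.

```python
# Sketch of an experience replay buffer (assumptions: capacity of 100 000
# transitions, batches of 64, uniform random sampling).
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Old transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sample; returns fewer items early in training.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# Usage each timestep: store the observed transition, then train on a batch.
# buffer.store(state, action, reward, next_state, done)
# for s, a, r, s2, d in buffer.sample():
#     td_update(s, a, r, s2, d)   # td_update is the sketch from the TD section
```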
The environment was confirmed solved after 1590 episodes of training, with an average score of 207 over 100 consecutive trials (an average of 200 is required for the task to be considered solved). After additional training (4950 episodes in total) the actor performed very well, achieving an average score of 266.
This project was originally published on the 4th of January 2019.
The code is open source and available on GitHub here.