July 11, 2016 - 16 mins
This project was done as the Capstone for the Udacity Machine Learning Nanodegree
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go. In this project we will be training Reinforcement Learning (RL) agents to solve environments based on the Doom video game, specifically the DoomCorridor-v0 and DoomHealthGathering-v0 environments. Either environment is considered solved when the agent averages an episode reward >= 1000 over 100 consecutive episodes. An episode is a sequence (s0, a1, s1, a2, s2, …, st), where s0 is the initial state, st is the terminal state, and (action, state) pairs fill the steps in between. A reward is given after every action, and the episode reward is the sum of these individual rewards.
The objective of the agent in DoomCorridor-v0 is to reach the vest at the end of the corridor as fast as possible without dying. There are 6 enemies (3 groups of 2) that can kill the agent. The input data (observations) we use for training the agent are the frames (pixels), and the action space is discrete, consisting of 6 actions.
The objective of the agent in DoomHealthGathering-v0 is to stay alive for as long as possible by collecting health packs. The ground is poison, so simply standing on it results in a loss of health. The input data (observations) we use for training the agent are the frames (pixels), and the action space is discrete, consisting of 3 actions.
So how do we train the agent?
The initial input is provided by the reset method on the environment.
initial_state = env.reset()
The following inputs will be provided by calling the step method.
next_state, reward, done, info = env.step(action)
next_state and reward are self explanatory, done indicates whether the agent has reached a terminal state, and info is miscellaneous information that will be ignored for the purposes of this project.
The following details one timestep:
The agent receives its input from reset or step, as described above. The input is an image representing the current state.
The above is repeated until a terminal state or the timestep limit is reached. During this process we accumulate the states, rewards, actions, etc., as they factor into the later training procedure. This sequence of (state, action) pairs is known as a trajectory.
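The loop described above can be sketched roughly as follows. This is only an illustration, not the project's code: `ToyEnv` is a hypothetical stand-in that mimics the reset/step interface, and `random_policy` is a placeholder for the neural network.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (same reset/step interface)."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the "state" is just the timestep counter here

    def step(self, action):
        self.t += 1
        reward = 1.0
        done = self.t >= self.episode_length
        return self.t, reward, done, {}


def random_policy(state, n_actions=6):
    """Placeholder for the policy network: picks an action uniformly at random."""
    return random.randrange(n_actions)


def sample_trajectory(env, policy, timestep_limit=100):
    """Run one episode, recording states, actions, and rewards along the way."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(timestep_limit):
        action = policy(state)
        next_state, reward, done, info = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:  # terminal state reached before the timestep limit
            break
    return {"states": states, "actions": actions, "rewards": rewards}


trajectory = sample_trajectory(ToyEnv(), random_policy)
print(sum(trajectory["rewards"]))  # the episode reward
```

The dictionary returned by `sample_trajectory` holds everything needed later for training: the visited states, the chosen actions, and the per-step rewards.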
Once we’ve collected a number of trajectories (dependent on the batch size), we calculate our loss and run our SGD variant to train the neural network. This counts as a single iteration.
For completeness, here’s the full algorithm in pseudocode. N is the number of iterations and M is the batch size.
for i in 1, …, N
    trajectories = []
    for j in 1, …, M
        Sample trajectory t_j and add t_j to trajectories
    Compute loss given trajectories
    Optimize neural network parameters w.r.t. loss
An environment is considered solved if the agent averages an episodic reward greater than or equal to 1000 over 100 consecutive episodes. Because of this, the episode reward is the main metric used.
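As a rough sketch of how this criterion could be checked from a list of episode rewards (the function name `is_solved` and its parameters are illustrative, not part of Gym):

```python
from collections import deque


def is_solved(episode_rewards, threshold=1000.0, window=100):
    """True once the mean over some `window` consecutive episodes reaches `threshold`."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    for reward in episode_rewards:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is evicted by the append below
        recent.append(reward)
        running_sum += reward
        if len(recent) == window and running_sum / window >= threshold:
            return True
    return False


print(is_solved([1000] * 100))
print(is_solved([1000] * 99))
```

Note that fewer than 100 episodes can never satisfy the criterion, no matter how high the rewards are.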
As mentioned previously, we use the observation space as input data for the RL agent. The observation comes in the form of 3D image data (width, height, channels). The Doom environment allows us to choose from a range of screen resolution (width, height) options. We choose the smallest available resolution (160, 120) for performance reasons. Thus, the observation space is a 3D tensor of shape (160, 120, 3).
This video shows the DoomHealthGathering environment.
The algorithms used in this project are based on deep RL, the intersection of deep learning and reinforcement learning. Reinforcement learning is concerned with an agent and an environment. The goal of the agent is to act optimally in the environment according to an objective. This is done by learning an expectation of future rewards from a state.
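Under the standard discounted formulation (assumed here; the symbols V, r_t, and s_0 are introduced only for illustration), this expectation can be written as:

```latex
V(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s \right]
```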
The discount factor γ (gamma) determines how much long-term rewards are valued. We pick the action that maximizes value.
There are two prevailing approaches to representing such a value function.
There are pros and cons to both approaches but function approximation was chosen and it comes down to two factors.
We settle on Policy Gradient methods and neural networks for our approach. Policy gradients are used because they directly optimize for the cumulative reward (unlike Q-Learning) and can be applied straightforwardly with nonlinear approximators such as neural networks. Policy gradient methods have been known for instability due to high-variance gradient estimates, which made them impractical. However, since the introduction of Trust Region Policy Optimization (TRPO), policy gradient methods have shown success in learning difficult problems.
So why does TRPO help?
A neural network with parameters represents a function space or manifold, the function output being dependent on the parameters. The idea of the Natural Gradient is to traverse this manifold to find the optimal function for the task. This approach is appealing because the parameterization of the network does not matter. For example, even though the gradients of tanh and relu activations are different, since they’re part of the same function space the activation choice shouldn’t affect the optimization procedure. Given this, it’s evident why using the natural gradient over vanilla backpropagation would be desirable. TRPO builds on the natural gradient by providing additional constraints for the gradient step size and direction updates.
During the training process we actually train two separate neural networks simultaneously, called the policy and the baseline networks respectively. Both share the same architecture until the final layer. The policy calculates action probabilities, so its final layer is a softmax, while the baseline outputs a scalar, so its final layer is linear with a single output. The scalar produced by the baseline is an estimate of the future reward given a state. Given the baseline value and actual rewards from a trajectory, we can calculate what’s known as the advantage.
The advantage is calculated by subtracting the baseline value from the sum of rewards (described above). The intuition is that if the advantage is > 0, the sampled trajectory is a profitable one, so the agent should be encouraged to follow similar ones. Conversely, if the advantage is < 0, the sampled trajectory is non-profitable and similar trajectories should be discouraged. The advantage helps reduce the high variance in gradient estimates.
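A minimal sketch of this calculation, assuming the "sum of rewards" at each timestep is the discounted sum of rewards from that timestep onward. The function names are illustrative, and the baseline values are passed in as plain numbers rather than produced by a network.

```python
def rewards_to_go(rewards, gamma):
    """Discounted sum of rewards from each timestep to the end of the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def compute_advantages(rewards, baseline_values, gamma):
    """Advantage = observed (discounted) return minus the baseline's estimate."""
    returns = rewards_to_go(rewards, gamma)
    return [ret - base for ret, base in zip(returns, baseline_values)]


# The baseline predicts 2.0 everywhere; early steps did better than predicted
# (positive advantage) and the last step did worse (negative advantage).
print(compute_advantages([1, 1, 1], [2.0, 2.0, 2.0], gamma=1.0))
```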
For our benchmark we will use the condition for solving the environment, that is, averaging an episode reward >= 1000 over 100 consecutive episodes.
We preprocess both the observation space and the action space. The observation space is initially of shape (160, 120, 3). We transform it to (15, 20, 1) by a grayscale and resize operation; note this is now read as (height, width, channels). We do this transformation for performance reasons: instead of 160 * 120 * 3 = 57600 features per observation we have 20 * 15 = 300 features. That’s ~0.5% of the original size!
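The transformation can be sketched with NumPy alone. This is an assumption-laden illustration rather than the project's filter: it grayscales by averaging the color channels, resizes by averaging 8x8 pixel blocks, and assumes the frame arrives as a (height, width, channels) array.

```python
import numpy as np


def preprocess(frame, out_height=15, out_width=20):
    """Grayscale an RGB frame, then shrink it by averaging equal-sized blocks."""
    gray = frame.astype(np.float32).mean(axis=2)  # (H, W)
    h, w = gray.shape
    bh, bw = h // out_height, w // out_width  # block size: 8x8 for a 120x160 frame
    cropped = gray[: bh * out_height, : bw * out_width]
    blocks = cropped.reshape(out_height, bh, out_width, bw)
    small = blocks.mean(axis=(1, 3))  # one value per block
    return small[:, :, np.newaxis]  # add the single channel back


frame = np.zeros((120, 160, 3), dtype=np.uint8)  # assumed (height, width, channels)
obs = preprocess(frame)
print(obs.shape, obs.size, frame.size)
```

A real pipeline would more likely use an image library's resize routine; block averaging is used here only to keep the sketch dependency-light.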
An action is represented as an array of 43 elements, each element representing a type of action. Simultaneous actions are supported, but for simplicity we pick a single action at each timestep. For the DoomCorridor-v0 environment we only have 6 possible actions.
array index -> name
The output of our neural network will be an integer 0-5, sampled from a softmax over 6 actions. The integer is mapped to one of the above action indexes and the element value is set to 1. We then pass the array to the environment as our action. The mapping for DoomHealthGathering-v0 is similar.
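The mapping can be sketched as below. The index list is a hypothetical placeholder, not the environment's real action indexes, and `sample_choice` stands in for sampling from the network's softmax output.

```python
import random

NUM_DOOM_ACTIONS = 43
# Hypothetical placeholder indexes; substitute the environment's real allowed-action indexes.
ALLOWED_ACTION_INDEXES = [0, 1, 2, 3, 4, 5]


def sample_choice(probabilities, rng=random):
    """Sample an integer 0..len-1 according to the given action probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def to_env_action(choice, allowed=ALLOWED_ACTION_INDEXES, size=NUM_DOOM_ACTIONS):
    """Build the 43-element action array with a single element set to 1."""
    action = [0] * size
    action[allowed[choice]] = 1
    return action


choice = sample_choice([0.1, 0.1, 0.5, 0.1, 0.1, 0.1])
action = to_env_action(choice)
print(choice, sum(action), len(action))
```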
We build upon Modular RL, an implementation of TRPO using Keras and Theano.
agentzoo.py - add support for CNN TRPO agents
filtered_env.py - add skiprate, action filters, support for Doom envs
run_cnn.py - do preprocessing with filters, implement loading snapshots
Skiprate & Preprocessing/Filters
With a skiprate, once we predict an action during training we repeat that action for the next k timesteps. The benefit is that we move through states k times faster, which encourages exploration. This also makes sense intuitively: think about taking a walk in a park. Chances are you don’t rethink where you should be going every step.
def _step(self, ac):
    nac = self.act_filter(ac) if self.act_filter else ac
    if self.skiprate:
        total_nrew = 0.0
        total_rew = 0.0
        # skiprate is assumed to be a (low, high) pair; randint(x, x) would raise
        num_steps = np.random.randint(self.skiprate[0], self.skiprate[1])
        nob = None
        done = False
        for _ in range(num_steps):
            ob, rew, done, info = self.env.step(nac)
            nob = self.ob_filter(ob) if self.ob_filter else ob
            nrew = self.rew_filter(rew) if self.rew_filter else rew
            total_nrew += nrew
            total_rew += rew
            if done:
                info["reward_raw"] = total_rew
                return (nob, total_nrew, done, info)
        info["reward_raw"] = total_rew
        return (nob, total_nrew, done, info)
    else:
        ob, rew, done, info = self.env.step(nac)
        nob = self.ob_filter(ob) if self.ob_filter else ob
        nrew = self.rew_filter(rew) if self.rew_filter else rew
        info["reward_raw"] = rew
        return (nob, nrew, done, info)
Above is the implementation of the skiprate along with the use of action and observation filters (we don’t use a reward filter).
The original action is processed so that it can be passed to the environment's step method.
nac = self.act_filter(ac) if self.act_filter else ac
Here we also see the state being processed from (160, 120, 3) to (15, 20, 1).
ob, rew, done, info = self.env.step(nac)
nob = self.ob_filter(ob) if self.ob_filter else ob
If we have a skiprate, every step we pick a number uniformly in our range.
num_steps = np.random.randint(self.skiprate[0], self.skiprate[1])
For each of these steps we repeat the same action and accumulate the reward. If we reach a terminal state during a step, the function returns early.
for _ in range(num_steps):
    ob, rew, done, info = self.env.step(nac)
    nob = self.ob_filter(ob) if self.ob_filter else ob
    nrew = self.rew_filter(rew) if self.rew_filter else rew
    total_nrew += nrew
    total_rew += rew
    if done:
        info["reward_raw"] = total_rew
        return (nob, total_nrew, done, info)
info["reward_raw"] = total_rew
return (nob, total_nrew, done, info)
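To see the skiprate in isolation, here is a stripped-down, self-contained sketch without the filters. `CountingEnv` and `SkipEnv` are hypothetical names used only for this illustration. Unlike `np.random.randint`, the standard library's `random.randint` includes the upper bound, so the (low, high) pair here is inclusive, and low is assumed to be at least 1.

```python
import random


class CountingEnv:
    """Hypothetical toy environment: reward 1 per step, ends after `length` steps."""

    def __init__(self, length=10):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length, {}


class SkipEnv:
    """Repeats each chosen action for a random number of underlying steps."""

    def __init__(self, env, skiprate):
        self.env = env
        self.skiprate = skiprate  # (low, high), both inclusive, low >= 1

    def reset(self):
        return self.env.reset()

    def step(self, action):
        low, high = self.skiprate
        num_steps = random.randint(low, high)
        total_reward = 0.0
        for _ in range(num_steps):
            ob, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating once the episode is over
                break
        return ob, total_reward, done, info


env = SkipEnv(CountingEnv(length=10), skiprate=(4, 4))
env.reset()
first = env.step(0)
second = env.step(0)
third = env.step(0)
print(first, second, third)
```

With a fixed skiprate of 4 and a 10-step episode, the third call ends early after only two underlying steps because the episode terminates.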
Models & Agents
class TrpoAgent(AgentWithPolicy):
    options = MLP_OPTIONS + PG_OPTIONS + TrpoUpdater.options + FILTER_OPTIONS

    def __init__(self, ob_space, ac_space, usercfg):
        cfg = update_default_config(self.options, usercfg)
        # Feedforward policy and baseline networks
        policy, self.baseline = make_mlps(ob_space, ac_space, cfg)
        obfilter, rewfilter = make_filters(cfg, ob_space)
        self.updater = TrpoUpdater(policy, cfg)
        AgentWithPolicy.__init__(self, policy, obfilter, rewfilter)
class TrpoAgentCNN(AgentWithPolicy):
    options = MLP_OPTIONS + PG_OPTIONS + TrpoUpdater.options + FILTER_OPTIONS

    def __init__(self, ob_space, ac_space, usercfg):
        cfg = update_default_config(self.options, usercfg)
        # Convolutional policy and baseline networks
        policy, self.baseline = make_cnn(ob_space, ac_space, cfg)
        obfilter, rewfilter = make_filters(cfg, ob_space)
        self.updater = TrpoUpdater(policy, cfg)
        AgentWithPolicy.__init__(self, policy, obfilter, rewfilter)
The code for the agents is the same except for the construction of the neural network: one builds a feedforward network (make_mlps) and the other a convolutional network (make_cnn).
The final piece of the puzzle is run_policy_gradient_algorithm.
def run_policy_gradient_algorithm(env, agent, usercfg=None, callback=None):
    cfg = update_default_config(PG_OPTIONS, usercfg)
    cfg.update(usercfg)
    print "policy gradient config", cfg
    if cfg["parallel"]:
        raise NotImplementedError
    tstart = time.time()
    seed_iter = itertools.count()
    for _ in xrange(cfg["n_iter"]):
        # Rollouts ========
        paths = get_paths(env, agent, cfg, seed_iter)
        compute_advantage(agent.baseline, paths, gamma=cfg["gamma"], lam=cfg["lam"])
        # VF Update ========
        vf_stats = agent.baseline.fit(paths)
        # Pol Update ========
        pol_stats = agent.updater(paths)
        # Stats ========
        stats = OrderedDict()
        add_episode_stats(stats, paths)
        add_prefixed_stats(stats, "vf", vf_stats)
        add_prefixed_stats(stats, "pol", pol_stats)
        stats["TimeElapsed"] = time.time() - tstart
        if callback:
            callback(stats)
The most important part is the loop:
for _ in xrange(cfg["n_iter"]):
    # Rollouts ========
    paths = get_paths(env, agent, cfg, seed_iter)
    compute_advantage(agent.baseline, paths, gamma=cfg["gamma"], lam=cfg["lam"])
    # VF Update ========
    vf_stats = agent.baseline.fit(paths)
    # Pol Update ========
    pol_stats = agent.updater(paths)
get_paths returns a list of episodes. Each entry contains information about that episode, such as the reward at each timestep. Once we have the episodes we calculate the advantage and update the baseline and policy parameters.
Initially we had no skiprate and used the default resolution of 640x480. This proved to be very costly in time, so the skiprate and 160x120 resolution were introduced. I also experimented with another policy gradient method called A3C, an async Actor-Critic method. I don’t talk much about A3C because I couldn’t get consistent results with it. This might be a fault of my implementation, but even in the A3C paper the authors note they ran each experiment 50 times and picked the 5 best results. So it may just be that A3C is unstable.
gamma = 0.995, discount factor
lam = 0.97, lambda parameter from Generalized Advantage Estimation
max_kl = 0.01, KL-divergence between old and new policy (averaged over state-space)
cg_damping = 0.1, multiple of the identity added to the Fisher matrix during optimization
activation = tanh for feedforward, relu for CNN
n_iter = 250 or manual stop
Both architectures are followed by a softmax layer for action probabilities.
The baselines follow the same structure as the policy networks except the softmax output layer is switched with a linear layer with 1 output value.
The model converges to a solution that averages ~2270 reward/episode over the last ~5k episodes. This is far beyond the benchmark of averaging 1000 reward/episode over 100 consecutive episodes.
The results are similar to the feedforward agent's, except convergence is quicker. This could be due to the convolutional network providing a better representation of the data.
The agent takes longer to converge than on the DoomCorridor-v0 task. This isn’t surprising since the task is more difficult. After almost 10k episodes the environment is solved.
The agent never converges to a solution and appears to get stuck in a local minimum.
The HealthGathering environment is interesting because the TRPO implementation used does not follow an exploration scheme such as linear decay (epsilon slowly decreasing over timesteps), so exploration isn’t explicitly encouraged. This might not matter if the agent were left to explore indefinitely or over an extended period of time. However, since the agent will die shortly if it does not receive a health pack, it might never properly learn the reward associated with receiving one. Because of this, it’s possible the CNN performs poorly simply due to the random placement of health packs in the environment.
The videos below are played in reverse chronological order: the first is the final recorded episode and the last is the first recorded episode.
In both the above videos we can see the agent figures out the best strategy is to run directly to the vest and not concern itself with the enemies.
The agent learns it needs to pick up health packs in order to survive. We can also see the agent becoming less timid in its decision making as the episodes progress. Another interesting note is the agent doesn’t appear to be completely convinced where the health pack is until it’s very close to it. We can see this in the video but also by running play.py for several episodes with the saved snapshot. This shows high variance in the episode reward depending on where the agent starts relative to the health packs.
The agent never converges to a solution. The agent was run with multiple seeds to make sure a bad initialization wasn’t at fault.
In this project we trained reinforcement learning agents on the DoomCorridor-v0 and DoomHealthGathering-v0 environments provided as part of the OpenAI Gym toolkit. We modeled the agent as a neural network, implementing both feedforward and convolutional network architectures. The agent was trained using TRPO, an algorithm from a family of methods called Policy Gradients, which directly maximize the cumulative episodic reward. We successfully solve both environments, as the agent averages an episodic reward of 1000 or greater over 100 consecutive episodes.
It’s interesting how well the feedforward model performs. When it comes to Machine Learning tasks involving images, convolutional networks have dramatically outperformed feedforward networks. So why does the feedforward model hold its own here? I hypothesize it’s due to the complexity of the input space.
In more traditional image processing tasks the images come from the real world. We can see images from the real world carry much more complexity than the images provided by the Doom environments. In fact, most images from our environment look very similar, there’s not much variance. Because of this the feedforward network can learn a representation of the input space where it otherwise couldn’t.
Because TRPO is empirically and theoretically sound, going into the project I expected it to perform well, especially since I had found good results running the algorithm on a variety of other environments that are part of the Gym toolkit. It’s astonishing how much more robust TRPO is than REINFORCE even though they are quite similar, the main difference being the use of the natural gradient.
There are several directions we can follow which may improve our solution.
For the Corridor environment it seems the optimal solution was found. However, the HealthGathering environment can most certainly be improved, as it’s far more open ended. Perhaps an entirely new benchmark for that environment can be proposed, such as consistently staying alive for a set number of timesteps.
Written by Dominique Luna with the help of ☕.