ConvNetJS Deep Q Learning Demo

Description

This demo follows the Deep Q Learning algorithm described in Playing Atari with Deep Reinforcement Learning, a DeepMind paper from the NIPS 2013 Deep Learning Workshop. The paper is a nice demonstration of a fairly standard (model-free) Reinforcement Learning algorithm (Q Learning) learning to play Atari games.

In this demo, instead of Atari games, we'll start out with something simpler: a 2D agent that has 9 eyes pointing in different directions ahead, and each eye senses 3 values along its direction (up to a certain maximum visibility distance): distance to a wall, distance to a green thing, and distance to a red thing. The agent navigates by using one of 5 actions that turn it by different angles. The red things are apples and the agent gets reward for eating them. The green things are poison and the agent gets negative reward for eating them. Training takes a few tens of minutes with the current parameter settings.
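For concreteness, here is a minimal sketch of how this sensory setup maps onto the learner's inputs. The sizes (9 eyes x 3 readings = 27 inputs, 5 actions) follow the description above; the eye object shape and the sense helper are hypothetical, not the demo's actual code:

   var num_eyes = 9;
   var num_inputs = num_eyes * 3; // wall, green and red proximity per eye = 27 numbers
   var num_actions = 5;           // 5 possible turning angles
   var brain = new deepqlearn.Brain(num_inputs, num_actions);

   // hypothetical helper: flatten the eye sensors into one input array per tick
   function sense(eyes) {
     var state = [];
     for (var i = 0; i < eyes.length; i++) {
       state.push(eyes[i].wall_proximity);
       state.push(eyes[i].green_proximity);
       state.push(eyes[i].red_proximity);
     }
     return state;
   }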

Over time, the agent learns to avoid states that lead to low-reward states, and to pick actions that lead to better states instead.
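Concretely, the learner regresses its value estimate Q(s, a) toward the standard one-step Q-learning target used in the paper above (shown here only as a reminder, not as demo-specific notation):

   y = r + \gamma \max_{a'} Q(s', a')

where r is the reward just received and s' is the next state. States from which little future reward is reachable therefore end up with low Q-values, and the (mostly) greedy policy steers away from them.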

Q-Learner full specification and options

The textfield below gets eval()'d to produce the Q-learner for this demo. This lets you fiddle with various parameters and settings, and it also shows how you can use the API for your own purposes. All of these settings are optional but are listed to give an idea of the possibilities. Feel free to change things around and hit reload! Documentation for all options is in the paper linked above, and there are also comments for every option in the javascript source file.
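For reference, the configuration in that textfield has roughly the following shape. This is only a sketch: the option names (temporal_window, experience_size, gamma, hidden_layer_sizes, ...) are the deepqlearn module's options, but the specific values here are illustrative rather than the demo's exact settings:

   var num_inputs = 27;  // 9 eyes, each sensing 3 values
   var num_actions = 5;  // 5 possible turning angles

   var opt = {};
   opt.temporal_window = 1;           // how many past states/actions to include as context
   opt.experience_size = 30000;       // size of the experience replay memory
   opt.gamma = 0.7;                   // discount factor for future rewards
   opt.learning_steps_total = 200000; // anneal exploration over this many steps
   opt.learning_steps_burnin = 3000;  // act randomly at first, while the replay memory fills up
   opt.epsilon_min = 0.05;            // exploration rate once annealing finishes
   opt.epsilon_test_time = 0.05;      // exploration rate used when learning is turned off
   opt.hidden_layer_sizes = [50, 50]; // simple fully-connected value network

   var brain = new deepqlearn.Brain(num_inputs, num_actions, opt);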

Q-Learner API

It's very simple to use deepqlearn.Brain. Initialize your network:

   var brain = new deepqlearn.Brain(num_inputs, num_actions);
   

And to train it proceed in loops as follows:

   var action = brain.forward(array_with_num_inputs_numbers);
   // action is a number in [0, num_actions), the index of the action the agent chooses
   // here, apply the action in the environment and observe the reward. Finally, communicate it:
   brain.backward(reward); // <-- learning magic happens here
   
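Put together, a training loop over a hypothetical environment object looks like the following (env and its observe/act methods are stand-ins for your own simulation, not part of the library):

   for (var t = 0; t < 200000; t++) {
     var state = env.observe();         // array of num_inputs numbers describing the world
     var action = brain.forward(state); // pick an action (epsilon-greedy while learning)
     var reward = env.act(action);      // apply it to the world and collect the reward
     brain.backward(reward);            // store the experience and do a training step
   }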

That's it! Let the agent learn over time (it will take on the order of opt.learning_steps_total steps), and it will get better and better at accumulating reward as it learns. Note that the agent will still take random actions with probability opt.epsilon_min even once it's fully trained. To completely disable this randomness, or change it, disable learning and set epsilon_test_time to 0:

   brain.epsilon_test_time = 0.0; // don't make any random choices, ever
   brain.learning = false;
   var action = brain.forward(array_with_num_inputs_numbers); // get optimal action from learned policy
   

State Visualizations

Left: Current input state (quite a useless thing to look at). Right: Average reward over time (this should go up as the agent gets better, on average, at collecting reward).


(Takes ~10 minutes to train with the current settings. If you're impatient, scroll down and load an example pre-trained network from the pre-filled JSON.)

Controls


I/O

You can save and load a network from JSON here. Note that the textfield is pre-filled with a pretrained network that works reasonably well, in case you're impatient to wait for yours to train long enough. Just hit the load button!
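Under the hood, saving and loading amounts to serializing the learned value network. A programmatic sketch, assuming the Brain exposes its network as value_net (a convnetjs Net with toJSON()/fromJSON()):

   // save: grab the learned value network as a JSON string
   var json_str = JSON.stringify(brain.value_net.toJSON());

   // load: restore it into a compatible brain later
   brain.value_net.fromJSON(JSON.parse(json_str));
   brain.learning = false; // typically you also stop training a loaded network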