Reinforcement Learning

In reinforcement learning (RL) an agent learns to act in an environment in order to obtain rewards. The agent's goal is to improve its performance over time so that it collects higher and higher rewards.

In the examples we will be working with this week we will use the gymnasium Python package to supply the environments our agents operate in. The gymnasium package allows us to set up simulation environments in which an agent participates in a series of rounds. At each round the environment gives the agent state information (an observation) and asks it to choose an action. Once the agent has chosen an action, the environment responds by computing a new state and possibly giving the agent a reward for its most recent action.
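
To make this protocol concrete, here is a minimal sketch of the interaction loop. The choose_action function is a hypothetical placeholder for whatever decision-making logic the agent uses; everything else is the standard gymnasium pattern.

import gymnasium as gym

def choose_action(observation, action_space):
    # Hypothetical placeholder policy: ignore the observation and act at random.
    return action_space.sample()

env = gym.make("CartPole-v1")
observation, info = env.reset()

episode_over = False
while not episode_over:
    action = choose_action(observation, env.action_space)
    # The environment computes the next state, the reward, and whether the episode ended.
    observation, reward, terminated, truncated, info = env.step(action)
    episode_over = terminated or truncated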

The main challenge agents face in RL scenarios is that rewards may be far removed in time from any individual action choice. For example, if we are training an agent to play chess, the only reward available may be a reward of 1 point when the agent wins the game. In these lectures we will work with two examples in which the reward is more closely tied to the agent's actions. In the first example below the agent gets a single point for each round that it manages to stay alive, while in the second example we will train an agent to play the Atari Breakout game: there the player has to move a paddle to hit a ball into a wall of bricks, and the agent gets a point for each brick it knocks out.

The code for both of the examples below is adapted from the book Deep Reinforcement Learning Hands-On, 2nd Edition by Maxim Lapan.

The Cartpole example

In the Cartpole game we train an agent to keep a pole upright on a cart that can move right and left.

The pole is attached to the cart by a hinge at its base. At each round of the game the agent can choose to push the cart to the right or to the left. To help the agent decide what to do, the environment provides state information consisting of the cart's position, the cart's velocity, the angle the pole makes with the vertical, and the pole's angular velocity. The agent's goal is to keep the pole upright for as many rounds as possible.
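
For reference, the snippet below inspects the observation and action spaces for CartPole-v1. The comments describe what the spaces contain; the exact printed output may vary slightly between gymnasium versions.

import gymnasium as gym

env = gym.make("CartPole-v1")

# The observation is a 4-element vector:
# [cart position, cart velocity, pole angle, pole angular velocity]
print(env.observation_space)

# Two discrete actions are available: 0 pushes the cart left, 1 pushes it right.
print(env.action_space)

obs, info = env.reset()
print(obs)   # the starting state: four small values near zero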

A random strategy

For our first solution we are going to use a purely random strategy: at each round of the game the program will randomly push the cart right or left. This strategy is not very effective, and usually only manages to keep the pole from falling over for about 10 time steps.

import gymnasium as gym

if __name__ == "__main__":
    env = gym.make("CartPole-v1")

    total_reward = 0.0
    total_steps = 0
    obs, info = env.reset()

    while True:
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        total_steps += 1
        if terminated or truncated:
            break

    print("Episode done in %d steps, total reward %.2f" % (
        total_steps, total_reward))

The code starts by creating an environment for the game. To begin an episode we call the environment's reset() method, which returns the initial observation together with a dictionary of auxiliary information. Each environment has an action_space member variable describing the actions available to the player, and the action space object offers a sample() method we can call to select an action at random.

To take a single step in the environment we call the environment's step() method, passing it the action we want to take. The environment responds with a tuple containing the new observation, the reward for this step, two booleans indicating whether the episode has terminated or been truncated, and a dictionary of extra information. The CartPole environment gives us a reward of 1 for each step that the pole remains upright.
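
As a small illustration of the return value, the sketch below takes a single step and unpacks the tuple. The comments reflect CartPole-v1's documented behaviour (an episode terminates when the pole leans too far or the cart leaves the track, and is truncated at 500 steps); treat the details as indicative rather than authoritative.

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

obs, reward, terminated, truncated, info = env.step(0)   # action 0: push the cart to the left
print(reward)      # 1.0 -- one point for keeping the pole up this step
print(terminated)  # True once the pole tips too far or the cart runs off the track
print(truncated)   # True if the episode reaches the built-in 500-step limit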

Training a network to play the game

In the second version of this problem we will instead use a neural network to decide what action to take at each step.

Specifically, we will use a technique known as the cross-entropy method. The network takes the current observation as input and produces one score per action. Applying a softmax to those scores turns them into a probability distribution, and the agent samples its action from that distribution. Training proceeds in batches of complete episodes: we play a batch of games with the current network, keep only the "elite" episodes whose total reward is at or above a chosen percentile (the 70th percentile here), and then train the network with a cross-entropy loss to reproduce the actions taken in those elite episodes.

We note several key points here:

  1. The network never sees the rewards directly; the rewards are only used to decide which episodes count as elite.
  2. Because actions are sampled from the softmax probabilities rather than chosen greedily, the agent continues to explore even as it improves.
  3. The cycle of playing a batch, filtering it, and training on the surviving episodes repeats until the mean reward over a batch is high enough for us to declare the problem solved.

Here now is the code to implement this more advanced strategy:

import gymnasium as gym
from collections import namedtuple
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim


HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()[0]
    sm = nn.Softmax(dim=0)
    while True:
        obs_v = torch.FloatTensor(obs)
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        step = EpisodeStep(observation=obs, action=action)
        episode_steps.append(step)
        if terminated or truncated:
            e = Episode(reward=episode_reward, steps=episode_steps)
            batch.append(e)
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()[0]
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for reward, steps in batch:
        if reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, steps))
        train_act.extend(map(lambda step: step.action, steps))

    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean


if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n
    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    
    for iter_no, batch in enumerate(iterate_batches(
            env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = \
            filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.1f, rw_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        if reward_m > 199:
            print("Solved!")
            break

Here are some observations about this code:

  1. The iterate_batches() function runs game episodes until it has collected enough of them to form a batch, then hands the batch to the caller.
  2. iterate_batches() is an example of a Python generator function. In place of return it uses yield. When execution reaches a yield statement, the function hands a value back to its caller; the next time the caller asks the generator for a value, execution resumes just after the yield. A small standalone example of this behaviour appears after this list.
  3. The batches produced by iterate_batches() contain both successful and unsuccessful runs. The filter_batch() function keeps only the episodes whose total reward is at or above the chosen percentile and returns the observations and actions collected from those elite episodes, along with the reward boundary and the mean reward for the batch.
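
To make point 2 concrete, here is a tiny standalone generator, unrelated to the RL code, showing how execution pauses at yield and resumes on the next request:

def count_up_to(limit):
    n = 0
    while n < limit:
        yield n      # hand n back to the caller and pause here
        n += 1       # execution resumes here on the next request

for value in count_up_to(3):
    print(value)     # prints 0, 1, 2 -- one value per resumption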

Programming assignment

The code above uses a PyTorch model for the network. Remove the PyTorch model and replace it with a Keras model that does the same thing.