The Basics of Reinforcement Learning

This is a guest contribution by Herman Schaaf, a final-year Electronic Engineering student at Stellenbosch University. Herman has just begun his final-year project, in which he is developing an artificial intelligence that should learn how to play StarCraft by using the principles of Reinforcement Learning. Because gaming technology and machine intelligence is very Medialabbish, we requested his permission to duplicate a recent blog post of his on the basics of this technique. — Ed.

When we write an AI for a strategy game, we are faced with a couple of early design choices. There are several different possible approaches to the problem, many of which can be applied with relative success. For instance, one could follow the rule-based approach, which basically involves telling the bot what to do in every specific type of situation. Most early game AIs used this method, but it can be extremely time-consuming to code if the game is very complex (such as in the case of Starcraft), and also requires in-depth understanding of the game itself to be able to specify actions in any general situation. Most of the time, this does not work extremely well, and it is often easy for human or other computer players to find and exploit holes in the approach.

The approach we will be taking is known as Reinforcement Learning. This is the name of the broad umbrella that actually covers several different AI techniques, such as Dynamic Programming, Monte Carlo methods and Temporal-Difference Learning, and many combinations of these. A task where reinforcement learning is appropriate is defined as one where there are a lot (even infinite) amount of possible states, but evaluating the relative success of any such state is very easy. Also, we have to be able to test the success of different approaches on-line, that is, our program must be able to make decisions and receive feedback based on these decisions. Unfortunately, there is not a lot of literature available on this topic, and the only real book on this topic is Reinforcement Learning: An Introduction (Sutton and Barto). Fortunately however, this is a gem of the book, and the apparent lack of literature might well be because this book just about covers so many topics in this area in sufficient detail and clarity that it’s hard to improve. I would definitely recommend reading this if you are interested in the Reinforcement Learning approach to AI.

In any case, I will now very briefly introduce some of the key elements of reinforcement learning that we will use later in our journey to build a full-blown AI. For now, to explain the concepts, we will be looking at one specific example (which is also the same example as in Sutton and Barto’s book, but with my own code, graphs and explanations).

For this example, we have a grid and our agent starts at position S and has the goal of ending at position G. Agent is the term assigned to the part of our program, or bot, that experiences the environment and is able to change it by means of the decisions he makes. Now we say that there is a wind present in certain columns of the grid, and the strength of the wind is given by the number below the column. Our agent is able to walk one block at a time, to any of the four adjacent blocks to his block (North, South, East, West). When he is in a column with wind however, in addition to his own movement he will also move upwards by the amount of blocks specified in the strength of the wind in that column. See the figure for a clearer explanation.

The problem our AI has to solve is to find the optimal strategy of walking so that goal G is reached in the minimum number of steps. In reinforcement learning, when we want our AI to achieve certain goals, we convey this information to it by means of ‘rewards’. Any reinforcement learning AI’s goal is to maximize the final amount of reward that it receives. In our example, we will make the reward at every time step -1 if the goal is not yet reached, and 100 if the agent finds the goal. In this way, the agent will be encouraged to not only find the goal, but also find it as quickly as possible.

We want our agent to learn the optimal strategy through experience, so we will allow it to experience this scenario over and over again, in the hopes of it discovering some general pattern that can lead to the optimal strategy. Obviously, once the agent reaches the goal, one episode is complete, and the next episode will start with the agent again finding himself at position S with 0 total reward. We will allow the agent to experience many such episodes, each of varying length. The length of any episode will depend on the decisions the agent makes within that episodes. Now the most pressing question is: How do we have our agent make decisions?

To answer this, we first have to define the terms state space and action space. A state is the current situation that the ‘world’ is in at any specific time, with the world being the world that the agent is able to explore. Such a state is summarized by all the changing elements within the scenario, that the AI has to take into account in order to make informed decisions. To think of it practically, you can imagine being the agent. You are not told anything about the environment such as the wind speed and the position of the goal, but you need some information in order to make sure you can learn from your previous experiences so as to remember the path to the goal once you have found it. This information is given to you via a state space representation. Obviously, we as programmers choose the state space representation in such a way that the AI can distinguish between different situations, and very often the appropriate choice of state space representation becomes the most critical problem to solve in any reinforcement learning environment (for reasons that will become clear later). The state space is simply all the possible states that any specific world can be in.

The action space is all the possible choices that the agent can make at any given time. The combination of the state and the action space is called the state-action space, and this is what our reinforcement learning AI will use to learn a good strategy for solving the problem. We will use the Sarsa algorithm, an algorithm similar to Q-learning, which states the following:

Initialize Q(s,a) arbitrarily
Repeat for each episode:
    Initialize s
    Choose a from s using policy derived from Q
    Repeat for each step of each episode:
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q
        Q(s,a) = Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        s = s'
        a = a'
    until s is the terminal state

You might be a bit confused right now, but that’s ok, I’ll explain the algorithm now – it’s not that difficult. Q is simply a hash-table of values identified by a state-action pair. For every episode, we initialize the state s to the starting state. In our case, this is with our agent standing at position (0,3).  Now we evaluate all our possible actions based on the current state. For our example initial state, this would be walking in one of 3 directions (the agent cannot move west, because he is already at x=0). Now, for every state until we find the terminal state, we allow our agent to take action a (which can be chosen by the ε-greedy method, more about this shortly). Now, we update the Q table with a new value. This value represents the total reward we can expect the next time we find ourselves in the same state s and take the same action a. The value r is the reward that we gave the agent for the action that it took at the previous time step. The values alpha and gamma are simply constants that indicate how much this experience we are having now should change the knowledge that we have build up previously. A value of 0 means our agent will not change it’s strategy at all, and a value of 1 means that each new experience weighs just as heavily as all our previous experiences combined. Usually we choose a value somewhere in between 1 and 0, but closer to 0. One approach that works well is to make these values decay over time, so that after many episodes new experiences have less of an effect.

The ε-greedy method is a clever way to choose what action to take next when we are given a certain state, our possible actions,  and a table of Q values with each cell having a state-action pair as key. We choose a value, ε, which is the probability of making a random choice. Say we make ε = 0.15. Then, at every opportunity for making a choice, there is a 15% chance that we will choose our action randomly, and a 85% chance that we will choose the action with the highest corresponding value in the Q table (representing the action with the highest expected award). Again here a good approach is sometimes to let ε strive to zero as the number of total time steps strive to infinity.

And that’s it! Over time, the values in the Q table will start to reflect more and more the expected reward from taking certain actions, and therefore our agent will take actions that bring in the highest total reward. And since we reward the agent for finding the goal and not taking unnecessary time steps, this should lead to an optimal solution to the wind-grid problem. When running the code, the following was found for different values of alpha (gamma set to 1):

The increase in slope indicates that time steps between episodes became progressively less and less, and at some stage the algorithm found the optimal strategy and kept to it (with some slight exploration as a result of the ε-greedy algorithm). The blue line on the right is with alpha set to 0.1, which clearly did not converge as quickly as other values of alpha that placed higher value on new experience than old. The next image shows a bit more clearly that the time steps in every episode became less and less as the algorithm gained experience, but also continued to variate:

That’s it for the introduction to reinforcement learning. For a more thorough introduction, be sure to read Reinforcement Learning: An Introduction (Sutton and Barto). Next time we can start talking about how to use this reinforcement learning technique in the StarCraft AI environment.



  1. 1
    Andrei on Thursday 29 July, 12:12 PM #

    Thanks for the awesome article! I wish that I learnt about all this @ varsity :)

  2. 2
    Michael Young on Monday 16 August, 16:18 PM #

    cool stuff. I have just recently started a masters, also using reinforcement learning, busy working my way through Sutton and Barto. I was going to apply it to robotics, but maybe I should switch to Starcraft 2 :P

  3. 3
    Herman on Monday 16 August, 22:43 PM #

    @ Michael: Haha, cool – Sutton and Barto’s book is awesome. But if I might offer some advice, rather stick with robotics ;) I’m finding (the hard way) that StarCraft might be a bit too much to bite off for reinforcement learning!

Leave a comment

Leave a Reply