Reinforcement Learning and Artificial Intelligence (RLAI)
RL interface documentation (development version)

The ambition of this web page is to fully describe how to use the Python module defining a standard reinforcement learning interface. We describe 1) how to construct an interface object for a given agent and environment, 2) the inputs and outputs of the interface object, and 3) the inputs and outputs of the functions (procedures) defining the agent and environment. Not covered are the internal workings of the interface object or of any particular agent and environment.

The RLI (Reinforcement Learning Interface) module provides a standard interface for computational experiments with reinforcement-learning agents and environments. The interface is designed to facilitate comparison of different agent designs and their application to different problems (environments). This documentation presents the general ideas of the interface and a few examples of its use, followed by the calling sequences of the RLinterface methods (episode, steps, episodes, and their variants) and a pointer to the source code for the RLinterface class, to answer any remaining questions.

An RLinterface is a Python object, created by calling RLinterface(agentStartFunction, agentStepFunction, environmentStartFunction, environmentStepFunction). These four functions define the agent and environment that will participate in the interface. There will be libraries of standard agent and environment functions, and of course you can write your own. The environment step function normally takes an action from the agent and produces a sensation and reward, while the agent step function does the reverse:

environmentStartFunction() ==> sensation

agentStartFunction(sensation) ==> action

environmentStepFunction(action) ==> sensation, reward

agentStepFunction(sensation, reward) ==> action

(An action is defined as anything accepted by environmentStepFunction, and a sensation is defined as anything produced by environmentStartFunction or environmentStepFunction; rewards must be numbers.) Together, the agent and environment functions can be used to generate episodes -- sequences of sensations s, actions a, and rewards r:

from RLinterface import RLinterface
rli = RLinterface(myAgentStart, myAgentStep, myEnvStart, myEnvStep)

rli.episode(maxSteps) ==> s0, a0, r1, s1, a1, r2, s2, a2, ..., rT, 'terminal'

where 'terminal' is a special sensation recognized by RLinterface and agentStepFunction. (In a continuing problem there would be just one never-terminating episode.)
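
Because the rewards in such an episode list occupy every third position (after the initial s0 and a0), the episode's length and undiscounted return can be read off the list directly. A minimal sketch, reusing the rli object constructed above:

episodeList = rli.episode(1000)      # run one episode of at most 1000 steps
rewards = episodeList[2::3]          # r1, r2, ..., rT, following the layout shown above
print(len(rewards), sum(rewards))    # number of steps and total (undiscounted) reward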

To produce the initial s0 and a0, the environmentStartFunction and agentStartFunction are used:

environmentStartFunction() ==> sensation

agentStartFunction(sensation) ==> action

When the environmentStartFunction is called it should start a new episode -- reset the environment to a characteristic initial state (or distribution of states) and produce just a sensation without a reward. When the agentStartFunction is called it should also initialize itself for the beginning of an episode. 
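
To make these conventions concrete, here is a minimal sketch of a matching agent and environment (the names and dynamics are purely illustrative, not part of the RLtoolkit): an environment that counts up from 0 and terminates when the count reaches 3, and an agent that always emits action 0.

from RLinterface import RLinterface

count = 0                                  # the environment's state: steps taken so far

def trivialEnvStart():                     # start an episode: reset and return s0 (no reward)
    global count
    count = 0
    return count

def trivialEnvStep(a):                     # any action advances the count by one
    global count
    count += 1
    if count == 3:
        return 'terminal', 1               # episode ends with a final reward of 1
    return count, 0

def trivialAgentStart(s):                  # first action of the episode
    return 0

def trivialAgentStep(s, r):                # no learning; always emit action 0
    if s != 'terminal':
        return 0

rliTrivial = RLinterface(trivialAgentStart, trivialAgentStep, trivialEnvStart, trivialEnvStep)
print(rliTrivial.episode())                # expect [0, 0, 0, 1, 0, 0, 2, 0, 1, 'terminal']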

Episodes can be generated by calling rli.episode(maxSteps) as above or, alternatively (and necessarily for continuing problems), segments of an episode can be generated by calling rli.steps(numSteps), which returns the sequence of experience over the next numSteps steps. For example, suppose rli is a freshly made RLinterface and we run it for a single step, then for one more step, and then for two steps after that:

rli.steps(1) ==> s0, a0

rli.steps(1) ==> r1, s1, a1

rli.steps(2) ==> r2, s2, a2, r3, s3, a3

Each call to rli.steps continues the current episode. To start a new episode, call rli.episode(1), which returns the same result as the first line above. Note that if rli.steps(numSteps) is called on an episodic problem it will run for numSteps steps even if episodes terminate and start along the way. Thus, for example,

rli.episode(1) ==> s0, a0

rli.steps(4) ==> r1, s1, a1, r2, 'terminal', s0, a0, r1, s1, a1
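
One consequence is that the number of episodes completed during such a run can be recovered from the returned list by counting the 'terminal' sensations. A minimal sketch:

experience = rli.steps(1000)                     # 1000 steps, possibly crossing episode boundaries
episodesFinished = experience.count('terminal')  # one 'terminal' per completed episode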

The method rli.episodes(numEpisodes, maxStepsPerEpisode, maxStepsTotal) is also provided for efficiently running multiple episodes.
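
For example, the call below would run at most 100 episodes of at most 1000 steps each, stopping early once 50000 steps have been taken in total (the particular numbers are arbitrary):

experience = rli.episodes(100, 1000, 50000)      # sensations, actions, and rewards from the episodes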

Examples

Here we do Q-learning with a random policy, presuming an MDP with N states and M actions.


import random
from RLinterface import RLinterface

N, M = 10, 4                          # number of states and actions (example values)
alpha = 0.1                           # step-size parameter
gamma = 0.9                           # discount rate
Q = [[0.0] * M for _ in range(N)]     # N x M array of action values, initially zero

lastS, lastA = None, None             # previous sensation and action, saved for learning

def agentStart(s):
    global lastS, lastA
    lastS, lastA = s, random.randrange(M)      # random policy; better to do epsilon-greedy
    return lastA

def agentStep(s, r):
    global lastS, lastA
    target = r if s == 'terminal' else r + gamma * max(Q[s])
    Q[lastS][lastA] += alpha * (target - Q[lastS][lastA])   # Q-learning update
    if s != 'terminal':
        lastS, lastA = s, random.randrange(M)  # random policy; better to do epsilon-greedy
        return lastA

def environmentStart():
    return 0                          # s0

def environmentStep(a):
    s = random.randrange(N)           # placeholder dynamics: a random next state
    return s, 0                       # next sensation and a reward of 0

rli = RLinterface(agentStart, agentStep, environmentStart, environmentStep)
rli.steps(1000)
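
The comments above suggest that epsilon-greedy action selection would do better than the uniformly random policy. Here is a minimal sketch of such a selection function, reusing the Q, M, and random names from the example (epsilon is a new, illustrative parameter); calling epsilonGreedy in place of random.randrange(M) in agentStart and agentStep gives standard epsilon-greedy Q-learning:

epsilon = 0.1                             # exploration probability (illustrative value)

def epsilonGreedy(s):
    if random.random() < epsilon:
        return random.randrange(M)        # explore: a uniformly random action
    return Q[s].index(max(Q[s]))          # exploit: a greedy action (first maximizer)
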
If additional arguments are needed for the routines, use lambda expressions (or make the required functions methods inside the agent and environment):

def agentStart(agent, s):
    ...                                   # initialize the agent for a new episode
    return a

def agentStep(agent, s, r):
    ...                                   # learn, then choose the next action
    return a

def environmentStart(environment):
    return s0                             # initial sensation

def environmentStep(environment, a):
    return s, r                           # next sensation and reward

env = makeEnvironment(...)
agt = makeAgent(...)
rli = RLinterface(lambda s: agentStart(agt, s),
                  lambda s, r: agentStep(agt, s, r),
                  lambda: environmentStart(env),
                  lambda a: environmentStep(env, a))
rli.episodes(10, 100000)
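
Alternatively, as noted above, the agent and environment can be ordinary Python objects whose bound methods are passed directly to RLinterface; a bound method carries its object with it, so no lambdas are needed. A minimal sketch (the class and method names are illustrative):

import random

class MyAgent:
    def __init__(self, numActions):
        self.numActions = numActions                 # extra argument carried by the object
    def start(self, s):
        return random.randrange(self.numActions)
    def step(self, s, r):
        if s != 'terminal':
            return random.randrange(self.numActions)

class MyEnvironment:
    def start(self):
        return 0                                     # initial sensation
    def step(self, a):
        return 'terminal', 1                         # trivially, every episode lasts one step

agt = MyAgent(4)
env = MyEnvironment()
rli = RLinterface(agt.start, agt.step, env.start, env.step)
rli.episodes(10, 100000)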

Calling Sequences for the RLinterface methods

Here are the details for calling the RLinterface methods introduced above:

RLinterface(agentStartFunction, agentStepFunction, environmentStartFunction, environmentStepFunction)

This function sets up an interface object, which can then be used to run simulated episodes and steps. The four arguments are all functions, and are described below.

agentStartFunction(s)
This function returns the initial action for an episode, given the initial sensation s.

def agentStart(s):
    return a0                             # return the initial action

agentStepFunction(s, r)
This function does the learning and chooses the actions for the agent. It will be called with sensation s and reward r. The agent function should always return an action, unless the sensation is the terminal state.

def agentStep(s, r=None):
    # learn from the previous action, using s and r (and previously saved info)
    if s != 'terminal':
        a = ...                           # choose the next action
        return a                          # return the next action

environmentStartFunction()
This function returns the initial sensation (for a new episode).

def environmentStart():
    return s0                             # return the initial sensation

environmentStepFunction(a)
This function does the environment's work, such as determining the next state or sensation after an action. It is called with an action a and should return a new sensation and a reward.

def environmentStep(a):
    # do action a, calculating the next sensation s and reward r
    return s, r                           # return the next sensation and reward
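
As a concrete illustration of this contract (not part of the RLtoolkit), here is an environment for a corridor of 10 states in which action 1 moves right, any other action moves left, and reaching the right end terminates the episode with reward 1:

corridorState = 0                          # current position in the corridor

def environmentStart():
    global corridorState
    corridorState = 0                      # reset to the left end
    return corridorState

def environmentStep(a):
    global corridorState
    corridorState += 1 if a == 1 else -1   # move right on action 1, otherwise left
    corridorState = max(corridorState, 0)  # cannot move past the left wall
    if corridorState == 9:
        return 'terminal', 1               # reached the right end
    return corridorState, 0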

The object created by RLinterface has the following methods:

step()
Runs the simulation for exactly one step. Returns the list of sensations, actions and rewards from that step.

steps(numSteps)
stepsQ(numSteps)
Runs the simulation for numSteps steps, regardless of episode endings (if any). If steps is used, it will return a list of the sensations, actions and rewards in the simulation. If this is not wanted, use stepsQ instead (the quicker and quieter version).
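
For example, a long learning run whose raw experience is not needed can use the quiet version (a minimal sketch; the step count is arbitrary):

rli.stepsQ(1000000)    # learn for a million steps without building the experience list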

episode([maxSteps])
episodeQ([maxSteps])
Runs a single episode (until state 'terminal' is reached). If episode is used, it will return a list of the sensations, actions and rewards in the episode. If this is not wanted, use episodeQ instead (the quicker and quieter version). If maxSteps is specified, the simulation will stop after that many steps even if the end of the episode hasn't been reached.

episodes(numEpisodes [, maxSteps, maxStepsTotal])
episodesQ(numEpisodes [, maxSteps, maxStepsTotal])
Runs numEpisodes episodes. If episodes is used, it will return a list of the sensations, actions and rewards in the episodes. If this is not wanted, use episodesQ instead (the quicker and quieter version). If maxSteps is specified, it indicates the maximum number of steps allowed for each episode. If maxStepsTotal is specified, it limits the number of steps for all of the episodes together (regardless of whether an episode has finished, or the specified number of episodes have run).

 

Source Code for RLinterface Module

You can get the source code for the RLinterface module by downloading the RLtoolkit.