Reinforcement Learning and Artificial Intelligence (RLAI)
RL interface documentation (development version)
The RLI (Reinforcement Learning Interface) module provides a
standard interface for computational experiments with
reinforcement-learning agents and environments. The interface is
designed to facilitate comparison of different agent designs and their
application to different problems (environments). This documentation
presents the general ideas of the interface and a few examples of its
use. After that is the source code for the RLinterface class and its three
methods (episode, steps, and episodes) to answer any remaining questions.
An RLinterface is a Python object, created by calling
RLinterface(agentStartFunction, agentStepFunction, environmentStartFunction,
environmentStepFunction). These agent and environment functions define the
agent and environment that will participate in the interface. There will be
libraries of standard agent and environment functions, and of course you can
write your own. An environment function normally takes an action from the
agent and produces a sensation and reward, while an agent function does the
reverse:
environmentStartFunction() ==> sensation
agentStartFunction(sensation) ==> action
environmentStepFunction(action) ==> sensation, reward
agentStepFunction(sensation, reward) ==> action
(An action is defined as anything accepted by environmentStepFunction, and a
sensation is defined as anything produced by the environment functions;
rewards must be numbers.) Together, the agent and environment functions can
be used to generate episodes -- sequences of sensations s, actions a, and
rewards r:
from RLinterface import RLinterface
rli = RLinterface(myAgentStart, myAgentStep, myEnvStart, myEnvStep)
rli.episode(maxSteps) ==> s0, a0, r1, s1, a1, r2, s2, a2, ..., rT, 'terminal'
where 'terminal' is a special sensation recognized by RLinterface and by the
agentStepFunction. (In a continuing problem there would be just one
never-terminating episode.)
To produce the initial s0 and a0, the agentStartFunction and
environmentStartFunction are used:
environmentStartFunction() ==> sensation
agentStartFunction(sensation) ==> action
When the environmentStartFunction
is called it should
start a new episode -- reset the
environment to a characteristic initial state (or distribution of
states) and produce just a sensation without a reward. When the agentStartFunction
is called it should also initialize itself for the
beginning of an episode.
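To make these contracts concrete, here is a minimal sketch of the four
functions for a toy ten-state corridor environment and a random agent.
Everything in it (the function names, the corridor, the rewards) is an
illustrative assumption made for this example, not part of the RLI module:
from RLinterface import RLinterface
import random

corridorLength = 10
state = None                            # the environment's current state

def myEnvStart():                       # start a new episode
    global state
    state = 0                           # reset to the initial state
    return state                        # return just a sensation, no reward

def myEnvStep(a):                       # take action a (0 = left, 1 = right)
    global state
    if a == 1:
        state = state + 1
    elif state > 0:
        state = state - 1
    if state == corridorLength - 1:
        return 'terminal', 1            # reaching the right end ends the episode
    return state, 0                     # otherwise, next sensation and reward 0

def myAgentStart(s):
    return random.choice([0, 1])        # initial action for the episode

def myAgentStep(s, r):
    if s == 'terminal':
        return None                     # episode over; no action to return
    return random.choice([0, 1])        # next action (here simply random)

rli = RLinterface(myAgentStart, myAgentStep, myEnvStart, myEnvStep)
print(rli.episode(1000))                # prints s0, a0, r1, s1, a1, ..., rT, 'terminal'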
Episodes can be generated by calling rli.episode(maxSteps) as above or,
alternatively (and necessarily for continuing problems), segments of an
episode can be generated by calling rli.steps(numSteps), which returns the
sequence of experience on the next numSteps steps. For example, suppose rli
is a freshly made
RLinterface and we run it for a single step, then for one more step,
and then for two steps after that:
rli.steps(1) ==> s0, a0
rli.steps(1) ==> r1, s1, a1
rli.steps(2) ==> r2, s2, a2, r3, s3, a3
Each call to rli.steps
continues the current episode.
To start a new episode, call rli.episode(1), which returns the same result
as the first line above. Note that if rli.steps(numSteps) is called on an
episodic problem it will run for numSteps steps even if episodes terminate
and start along the way. Thus, for example,
rli.episode(1) ==> s0, a0
rli.steps(4) ==> r1, s1, a1, r2, 'terminal', s0, a0, r1, s1, a1
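As a small illustration of this behavior (reusing the toy functions sketched
earlier, which are assumptions), the list returned by steps can be scanned
for 'terminal' to see how many episodes finished along the way:
rli = RLinterface(myAgentStart, myAgentStep, myEnvStart, myEnvStep)
experience = rli.steps(100)             # 100 steps, possibly spanning several episodes
print(experience.count('terminal'))     # number of episodes completed in those steps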
The method rli.episodes(numEpisodes,
maxStepsPerEpisode, maxStepsTotal)
is also provided for
efficiently running multiple episodes.
Here we do Q-learning with a random policy, presuming an MDP with N states
and M actions.
def agentStart(s):
    ...
    return a
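Filling this out, here is one sketch of what the agent functions for such an
example might look like. It is only an illustration: the values of N, M, the
step size, and the discount rate are arbitrary assumptions, states are
assumed to be the integers 0 to N-1, and matching environment functions for
the MDP are not shown.
import random

N, M = 10, 2                            # number of states and actions (assumed)
alpha, gamma = 0.1, 0.9                 # step size and discount rate (assumed)
Q = [[0.0] * M for i in range(N)]       # tabular action values
lastS, lastA = None, None               # previously saved state and action

def agentStart(s):
    global lastS, lastA
    lastS, lastA = s, random.randrange(M)      # random policy
    return lastA

def agentStep(s, r):
    global lastS, lastA
    # Q-learning update from the saved previous state and action
    if s == 'terminal':
        target = r
    else:
        target = r + gamma * max(Q[s])
    Q[lastS][lastA] = Q[lastS][lastA] + alpha * (target - Q[lastS][lastA])
    if s == 'terminal':
        return None                            # episode over; no action to return
    lastS, lastA = s, random.randrange(M)      # random policy again
    return lastA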
Here are the details for calling the RLinterface methods introduced above:
RLinterface(agentStartFunction, agentStepFunction, environmentStartFunction,
environmentStepFunction)
This function sets up an interface object, which can then be used to run
simulated episodes and steps. The four arguments are all functions, and are
described below.
agentStartFunction(s)
This function returns the initial action for an episode.
def agentStart(s):
    return a0    # return initial action
agentStepFunction(s, r)
This function learns from the previous action and returns the next action
(or nothing if the sensation is 'terminal').
def agentStep(s, r=None):
    # learn from previous action
    learn with s and r (and previously saved info)
    if s != 'terminal':
        a = choose next action
        return a    # return next action
environmentStartFunction()
This function starts a new episode and returns the initial sensation.
def environmentStart():
    return s0    # return initial sensation
environmentStepFunction(a)
This function executes action a and returns the next sensation and reward.
def environmentStep(a):
    do action a, calculating next state s and reward r
    return s, r    # return next sensation and reward
The object created by RLinterface has the following methods:

step()
steps(numSteps)
stepsQ(numSteps)
These run the simulation for one step (step), or for numSteps steps (steps
and stepsQ). If steps is used, it will return a list of the sensations,
actions and rewards in the simulation. If this is not wanted, use stepsQ
instead (the quicker and quieter version).
episode([maxSteps])
episodeQ([maxSteps])
These run the simulation for one episode. If episode is used, it will return
a list of the sensations, actions and rewards in the episode. If this is not
wanted, use episodeQ instead (the quicker and quieter version). If maxSteps
is specified, the simulation will stop after that many steps even if the end
of the episode hasn't been reached.
episodes(numEpisodes [, maxSteps, maxStepsTotal])
episodesQ(numEpisodes [, maxSteps, maxStepsTotal])
These run the simulation for numEpisodes episodes. If episodes is used, it
will return a list of the sensations, actions and rewards in the episodes.
If this is not wanted, use episodesQ instead (the quicker and quieter
version). If maxSteps is specified, it indicates the maximum number of steps
allowed for each episode. If maxStepsTotal is specified, it limits the
number of steps for all of the episodes together (regardless of whether an
episode has finished, or the specified number of episodes have run).
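For example, with agent and environment functions like those sketched above
(assuming Q-learning agent functions named agentStart and agentStep and
matching environment functions named environmentStart and environmentStep
have been defined), one might train with the quiet variants and then inspect
a single episode:
rli = RLinterface(agentStart, agentStep, environmentStart, environmentStep)
rli.episodesQ(1000, 500)    # 1000 episodes of up to 500 steps, keeping no experience list
trace = rli.episode(500)    # then run one more episode, keeping its experience
print(trace)                # s0, a0, r1, s1, a1, ..., rT, 'terminal'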