API¶
Solvers¶

class
lrl.solvers.
PolicyIteration
(env, value_function_initial_value=0.0, max_policy_eval_iters_per_improvement=10, policy_evaluation_type='onpolicyiterative', **kwargs)¶ Bases:
lrl.solvers.base_solver.BaseSolver
Solver for policy iteration
Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 80).
Notes
See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
Examples
See examples directory
 Parameters
value_function_initial_value (float) – Value to initialize all elements of the value function to
max_policy_eval_iters_per_improvement (int) – Maximum number of policy evaluation iterations performed per policy improvement step
policy_evaluation_type (str) – Type of solution method for calculating policy (see policy_evaluation() for more details). Typical usage should not need to change this, as doing so will make calculations slower and more memory intensive
**kwargs – See the BaseSolver class for additional arguments
 Returns
None

value
= None¶ Space-efficient dict-like storage of the current and all former value functions
 Type

iterate
()¶ Perform a single iteration of policy iteration, updating self.value and storing metadata about the iteration.
Side Effects:
self.value: Updated to the newest estimate of the value function
self.policy: Updated to the greedy policy according to the value function estimate
self.iteration: Increment iteration counter by 1
self.iteration_data: Add new record to iteration data store
 Returns
None
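The evaluate-then-improve cycle described above can be sketched on a tiny hand-written MDP. This is an illustrative sketch only, not lrl's implementation: the transition table `P` and the helpers `q_value`, `evaluate_policy`, and `improve_policy` are hypothetical names, and `P[s][a]` is assumed to hold `(prob, next_state, reward, done)` tuples in the style of gym's DiscreteEnv.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP: P[state][action] -> [(prob, next_state, reward, done)]
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 2.0, False)]},
}


def q_value(P, value, s, a, gamma=GAMMA):
    """Expected return of taking action a in state s, then following value."""
    return sum(p * (r + gamma * value[s2] * (not done))
               for p, s2, r, done in P[s][a])


def evaluate_policy(P, policy, value, max_iters=10, tolerance=1e-3):
    """Sweep the states in place, estimating the value of a fixed policy."""
    for _ in range(max_iters):
        delta_max = 0.0
        for s in P:
            new_v = q_value(P, value, s, policy[s])
            delta_max = max(delta_max, abs(new_v - value[s]))
            value[s] = new_v
        if delta_max < tolerance:
            break
    return value


def improve_policy(P, policy, value):
    """Make the policy greedy w.r.t. value; return the number of changes."""
    changes = 0
    for s in P:
        best = max(P[s], key=lambda a: q_value(P, value, s, a))
        if best != policy[s]:
            policy[s] = best
            changes += 1
    return changes


policy = {s: 0 for s in P}   # 'zeros' style initialization
value = {s: 0.0 for s in P}
iterations = 0
while True:
    evaluate_policy(P, policy, value)
    changes = improve_policy(P, policy, value)
    iterations += 1
    if changes == 0:         # converged: the policy did not change
        break
```

Capping the evaluation sweeps (here `max_iters=10`) mirrors the role of max_policy_eval_iters_per_improvement: the value estimate need not be exact before each improvement step.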

converged
()¶ Returns True if solver is converged.
Judge convergence by checking whether the most recent policy iteration resulted in any changes in policy
 Returns
Convergence status (True=converged)
 Return type
bool

_policy_evaluation
(max_iters=None)¶ Compute an estimate of the value function for the current policy to within self.tolerance
 Side Effects:
self.value: Updated to the newest estimate of the value function
 Returns
None

_policy_improvement
(return_differences=True)¶ Update the policy to be greedy relative to the most recent value function
 Side Effects:
self.policy: Updated to be greedy relative to self.value
 Parameters
return_differences – If True, return number of differences between old and new policies
 Returns
(if return_differences==True) Number of differences between the old and new policies
 Return type
int

init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
 Parameters
init_type (None, str) –
Method used for initializing policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
 random: Initialize policy to a random action (action indices are random integer from
[0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
 Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
 Returns
None
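The two initialization modes can be sketched as follows. This is a standalone illustration under assumptions, not the library's code: `init_policy_sketch` and the toy table `P` are hypothetical, and the random mode here draws only valid action indices (0 to the number of actions minus 1).

```python
import random


def init_policy_sketch(P, init_type="zeros", rng=None):
    """Return a dict mapping each state to an initial action index.

    P[state] is assumed to be a collection of the actions available in state.
    """
    rng = rng or random.Random()
    if init_type == "zeros":
        # Every state starts with the first action
        return {s: 0 for s in P}
    if init_type == "random":
        # randrange excludes the upper bound, so indices are always valid
        return {s: rng.randrange(len(P[s])) for s in P}
    raise ValueError(f"Unknown init_type: {init_type!r}")


# Hypothetical environment with 3 states and differing action counts
P = {0: [0, 1], 1: [0, 1, 2], 2: [0]}
zeros_policy = init_policy_sketch(P, "zeros")
random_policy = init_policy_sketch(P, "random", rng=random.Random(0))
```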

iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
 Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
 Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
 Returns
None

run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
 Side Effects:
self.env will be reset and optionally then forced into initial_state
 Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state). If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
tuple containing:
states (list): list of states visited during the episode
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
 Return type
(tuple)
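A walk of this kind can be sketched against a gym-style reset()/step() interface. The `ChainEnv` class and `run_policy_sketch` function below are hypothetical stand-ins written for illustration; how lrl counts step 0 against max_steps is not reproduced here.

```python
class ChainEnv:
    """Hypothetical 1-D chain: action 1 moves right, action 0 stays put.

    State 3 is terminal and pays +1; every other transition pays -0.1.
    """

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        self.s = min(self.s + action, 3)
        done = self.s == 3
        reward = 1.0 if done else -0.1
        return self.s, reward, done, {}


def run_policy_sketch(env, policy, max_steps=100):
    """Follow policy from a fresh reset; return (states, rewards, is_terminal)."""
    states = [env.reset()]
    rewards = [0.0]          # step 0 is entering the initial state
    is_terminal = False
    for _ in range(max_steps):
        state, reward, is_terminal, _info = env.step(policy[states[-1]])
        states.append(state)
        rewards.append(reward)
        if is_terminal:
            break
    return states, rewards, is_terminal


go_right = {s: 1 for s in range(4)}
stay_put = {s: 0 for s in range(4)}
finished = run_policy_sketch(ChainEnv(), go_right)
stuck = run_policy_sketch(ChainEnv(), stay_put, max_steps=5)
```

The `stay_put` policy never terminates on its own, which is the situation max_steps is meant to guard against.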

score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
 Side Effects:
self.env will be reset
 Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
 Return type

class
lrl.solvers.
ValueIteration
(env, value_function_initial_value=0.0, **kwargs)¶ Bases:
lrl.solvers.base_solver.BaseSolver
Solver for value iteration
Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 82).
Notes
See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
Examples
See examples directory
 Parameters
value_function_initial_value (float) – Value to initialize all elements of the value function to
**kwargs – See the BaseSolver class for additional arguments
 Returns
None

value
= None¶ Space-efficient dict-like storage of the current and all former value functions
 Type

iterate
()¶ Perform a single iteration of value iteration, updating self.value and storing metadata about the iteration.
Side Effects:
self.value: Updated to the newest estimate of the value function
self.policy: Updated to the greedy policy according to the value function estimate
self.iteration: Increment iteration counter by 1
self.iteration_data: Add new record to iteration data store
 Returns
None

converged
()¶ Returns True if solver is converged.
Test convergence by comparing the latest value function delta_max to the convergence tolerance
 Returns
Convergence status (True=converged)
 Return type
bool
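The delta_max convergence test can be sketched with a small value iteration loop. This is an illustrative sketch, not lrl's code: `value_iteration_sketch` and the transition table `P` are hypothetical, with `P[s][a]` assumed to hold `(prob, next_state, reward, done)` tuples.

```python
def value_iteration_sketch(P, gamma=0.9, tolerance=1e-3, max_iters=500):
    """Return (value, iterations_used, converged) for the MDP described by P."""
    value = {s: 0.0 for s in P}
    for iteration in range(1, max_iters + 1):
        delta_max = 0.0
        for s in P:
            # Bellman optimality backup: best expected return over actions
            best = max(
                sum(p * (r + gamma * value[s2] * (not done))
                    for p, s2, r, done in outcomes)
                for outcomes in P[s].values()
            )
            delta_max = max(delta_max, abs(best - value[s]))
            value[s] = best
        if delta_max < tolerance:   # largest change is below tolerance
            return value, iteration, True
    return value, max_iters, False


# Hypothetical chain: state 2 is terminal, reaching it from state 1 pays +1
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}
value, iterations, converged = value_iteration_sketch(P)
```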

init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
 Parameters
init_type (None, str) –
Method used for initializing policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
 random: Initialize policy to a random action (action indices are random integer from
[0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
 Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
 Returns
None

iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
 Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
 Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
 Returns
None

run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
 Side Effects:
self.env will be reset and optionally then forced into initial_state
 Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state). If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
tuple containing:
states (list): list of states visited during the episode
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
 Return type
(tuple)

score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
 Side Effects:
self.env will be reset
 Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
 Return type

class
lrl.solvers.
QLearning
(env, value_function_tolerance=0.1, alpha=None, epsilon=None, max_iters=2000, min_iters=250, num_episodes_for_convergence=20, **kwargs)¶ Bases:
lrl.solvers.base_solver.BaseSolver
Solver class for Q-Learning
Notes
See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
Examples
See examples directory
 Parameters
alpha (float, dict) –
(OPTIONAL)
If None, default linear decay schedule applied, decaying from 0.1 at iter 0 to 0.025 at max iter
If float, interpreted as a constant alpha value
If dict, interpreted as specifications to a decay function as defined in decay_functions()
epsilon (float, dict) –
(OPTIONAL)
If None, default linear decay schedule applied, decaying from 0.25 at iter 0 to 0.05 at max iter
If float, interpreted as a constant epsilon value
If dict, interpreted as specifications to a decay function as defined in decay_functions()
num_episodes_for_convergence (int) – Number of consecutive episodes with delta_Q < tolerance to say a solution is converged
**kwargs – Other arguments passed to BaseSolver
 Returns
None
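The default linear schedules described above (for example alpha decaying from 0.1 at iter 0 to 0.025 at max iter) amount to a straight-line interpolation. The helper below is a hypothetical sketch of that arithmetic; it does not reproduce the dict format accepted by decay_functions(), which is not documented on this page.

```python
def linear_decay_sketch(iteration, initial_value, final_value, max_iters):
    """Linearly interpolate from initial_value (iter 0) to final_value (max_iters)."""
    # Clamp so iterations outside [0, max_iters] hold the end values
    fraction = min(max(iteration / max_iters, 0.0), 1.0)
    return initial_value + fraction * (final_value - initial_value)


# Default alpha schedule from the parameter description, with max_iters=2000
alpha_start = linear_decay_sketch(0, 0.1, 0.025, 2000)
alpha_mid = linear_decay_sketch(1000, 0.1, 0.025, 2000)
alpha_end = linear_decay_sketch(2000, 0.1, 0.025, 2000)

# Default epsilon schedule, evaluated past the final iteration
epsilon_late = linear_decay_sketch(5000, 0.25, 0.05, 2000)
```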

transitions
= None¶ Counter for number of transitions experienced during all learning
 Type
int

q
= None¶ Space-efficient dict-like storage of the current and all former q functions
 Type

iteration_data
= None¶ Data store for iteration data
Overloads BaseSolver’s iteration_data attribute with one that includes more fields
 Type

episode_statistics
= None¶ Data store for statistics from training episodes
 Type

num_episodes_for_convergence
= None¶ Number of consecutive episodes with delta_Q < tolerance to say a solution is converged
 Type
int
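One way to read this criterion is as a check over the most recent per-episode changes in Q. The function below is a hypothetical sketch of that reading, assuming a plain list of per-episode delta_Q values; how lrl stores these values and combines the check with min_iters is an assumption here.

```python
def converged_sketch(delta_q_history, tolerance=0.1,
                     num_episodes_for_convergence=20, min_iters=250):
    """True if the last N episodes all changed Q by less than tolerance."""
    required = max(min_iters, num_episodes_for_convergence)
    if len(delta_q_history) < required:
        return False  # not enough episodes to judge yet
    recent = delta_q_history[-num_episodes_for_convergence:]
    return all(delta < tolerance for delta in recent)


settled = converged_sketch([0.5] * 5 + [0.01] * 3, tolerance=0.1,
                           num_episodes_for_convergence=3, min_iters=2)
still_moving = converged_sketch([0.01, 0.01, 0.5], tolerance=0.1,
                                num_episodes_for_convergence=3, min_iters=2)
too_early = converged_sketch([0.01] * 3, tolerance=0.1,
                             num_episodes_for_convergence=3, min_iters=10)
```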

_policy_improvement
(states=None)¶ Update the policy to be greedy relative to the most recent q function
 Side Effects:
self.policy: Updated to be greedy relative to self.q
 Parameters
states – List of states to update. If None, all states will be updated
 Returns
None

step
(count_transition=True)¶ Take and learn from a single step in the environment.
Applies the typical Q-Learning approach to learn from the experienced transition
 Parameters
count_transition (bool) – If True, increment transitions counter self.transitions. Else, do not.
 Returns
tuple containing:
transition (tuple): Tuple of (state, reward, next_state, is_terminal)
delta_q (float): The (absolute) change in q caused by this step
 Return type
(tuple)
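The "typical" update referred to here is the standard tabular Q-Learning rule, Q(s,a) += alpha * (r + gamma * max Q(s',a') - Q(s,a)). The sketch below applies it to a plain dict keyed by (state, action); `q_learning_update_sketch` and its argument list are hypothetical and do not mirror the signature of step().

```python
def q_learning_update_sketch(q, state, action, reward, next_state,
                             is_terminal, alpha, gamma, n_actions):
    """Apply one tabular Q-Learning update in place; return |change in q|."""
    if is_terminal:
        best_next = 0.0  # no future return after a terminal transition
    else:
        best_next = max(q.get((next_state, a), 0.0) for a in range(n_actions))
    old = q.get((state, action), 0.0)
    new = old + alpha * (reward + gamma * best_next - old)
    q[(state, action)] = new
    return abs(new - old)


q = {}
# First visit: target is 1.0 + 0.9 * 0.0, so q moves halfway from 0 to 1
delta_first = q_learning_update_sketch(q, 0, 1, 1.0, 1, False,
                                       alpha=0.5, gamma=0.9, n_actions=2)
# Pretend state 1 is now known to be valuable, then repeat the transition
q[(1, 0)] = 2.0
delta_second = q_learning_update_sketch(q, 0, 1, 1.0, 1, False,
                                        alpha=0.5, gamma=0.9, n_actions=2)
```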

iterate
()¶ Perform and learn from a single episode in the environment (one walk from start to finish)
Side Effects:
self.value: Updated to the newest estimate of the value function
self.policy: Updated to the greedy policy according to the value function estimate
self.iteration: Increment iteration counter by 1
self.iteration_data: Add new record to iteration data store
self.env: Reset and then walked through
 Returns
None

choose_epsilon_greedy_action
(state, epsilon=None)¶ Return an action chosen by epsilon-greedy scheme based on the current estimate of Q
 Parameters
state (int, tuple) – Descriptor of current state in environment
epsilon – Optional. If None, self.epsilon is used
 Returns
action chosen
 Return type
int or tuple
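An epsilon-greedy choice explores with probability epsilon and otherwise exploits the best known action. The sketch below works on a plain list of q values for one state; `epsilon_greedy_sketch` is a hypothetical helper, and tie-breaking in lrl may differ.

```python
import random


def epsilon_greedy_sketch(q_values, epsilon, rng):
    """Return an action index: random with prob. epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))   # explore
    # Exploit: index of the largest q value (first one wins on ties)
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_at_state = [0.1, 0.7, 0.2]
greedy_action = epsilon_greedy_sketch(q_at_state, 0.0, rng)
explored = [epsilon_greedy_sketch(q_at_state, 1.0, rng) for _ in range(20)]
```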

converged
()¶ Returns True if solver is converged.
 Returns
Convergence status (True=converged)
 Return type
bool

get_q_at_state
(state)¶ Returns a numpy array of q values at the current state in the same order as the standard action indexing
 Parameters
state (int, tuple) – Descriptor of current state in environment
 Returns
Numpy array of q for all actions
 Return type
np.array

init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
 Parameters
init_type (None, str) –
Method used for initializing policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
 random: Initialize policy to a random action (action indices are random integer from
[0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
 Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
 Returns
None

init_q
(init_val=0.0)¶ Initialize self.q, a dict-like DictWithHistory object for storing the state-action value function q
 Parameters
init_val (float) – Value to give all states in the initialized q
 Returns
None

iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
 Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
 Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
 Returns
None

run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
 Side Effects:
self.env will be reset and optionally then forced into initial_state
 Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state). If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
tuple containing:
states (list): list of states visited during the episode
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
 Return type
(tuple)

score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
 Side Effects:
self.env will be reset
 Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
 Return type

property
alpha
¶ Returns value of alpha at current iteration

property
epsilon
¶ Returns value of epsilon at current iteration

class
lrl.solvers.
BaseSolver
(env, gamma=0.9, value_function_tolerance=0.001, policy_init_mode='zeros', max_iters=500, min_iters=2, max_steps_per_episode=100, score_while_training=False, raise_if_not_converged=False)¶ Bases:
object
Base class for solvers
Examples
See examples directory
 Parameters
env – Environment instance, such as from RaceTrack() or RewardingFrozenLake()
gamma (float) – Discount factor
value_function_tolerance (float) – Tolerance for convergence of value function during solving (also used for Q (state-action) value function tolerance)
policy_init_mode (str) – Initialization mode for policy. See init_policy() for more detail
max_iters (int) – Maximum number of iterations to solve environment
min_iters (int) – Minimum number of iterations before checking for solver convergence
raise_if_not_converged (bool) – If True, will raise exception when environment hits max_iters without convergence. If False, a warning will be logged.
max_steps_per_episode (int) – Maximum number of steps allowed per episode (helps when evaluating policies that can lead to infinite walks)
score_while_training (dict, bool) –
Dict specifying whether the policy should be scored during training (eg: test how well a policy is doing every N iterations).
If dict, must be of format:
n_trains_per_eval (int): Number of training iters between evaluations
n_evals (int): Number of episodes for a given policy evaluation
If True, score with default settings of:
n_trains_per_eval: 500
n_evals: 500
If False, do not score during training.
 Returns
None
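The three accepted forms of score_while_training (dict, True, False) can be pictured as resolving to one settings dict. The helper below is a hypothetical sketch of that resolution, using the default values listed above; it is not how lrl necessarily handles the argument internally.

```python
DEFAULT_SCORING = {"n_trains_per_eval": 500, "n_evals": 500}


def resolve_score_while_training(setting):
    """Return a settings dict, or None when scoring during training is off."""
    if setting is True:
        return dict(DEFAULT_SCORING)      # score with default settings
    if setting is False:
        return None                       # do not score during training
    missing = set(DEFAULT_SCORING) - set(setting)
    if missing:
        raise ValueError(f"score_while_training is missing keys: {sorted(missing)}")
    return dict(setting)


defaults = resolve_score_while_training(True)
disabled = resolve_score_while_training(False)
custom = resolve_score_while_training({"n_trains_per_eval": 50, "n_evals": 25})
```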

policy
= None¶ Space-efficient dict-like storage of the current and all former policies.
 Type

iteration_data
= None¶ Data describing iteration results during solving of the environment.
Fields include:
time: time for this iteration
delta_max: maximum change in value function for this iteration
policy_changes: number of policy changes this iteration
converged: boolean denoting if solution is converged after this iteration
 Type

scoring_summary
= None¶ Summary data from scoring runs computed during training if score_while_training == True
Fields include:
reward_mean: mean reward obtained during a given scoring run
 Type

scoring_episode_statistics
= None¶ Detailed scoring data from scoring runs held as a dict of EpisodeStatistics objects.
Data is indexed by iteration number (from scoring_summary)
 Type
dict, EpisodeStatistics

init_policy
(init_type=None)¶ Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies
 Parameters
init_type (None, str) –
Method used for initializing policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
 random: Initialize policy to a random action (action indices are random integer from
[0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
 Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
 Returns
None

iterate
()¶ Perform a single iteration of the solver.
This may be an iteration through all states in the environment (like in policy iteration) or obtaining and learning from a single experience (like in Q-Learning)
This method should update self.value and may update self.policy, and also commit iteration statistics to self.iteration_data. Unless the subclass implements a custom self.converged, self.iteration_data should include a boolean entry for “converged”, which is used by the default converged() function.
 Returns
None

iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None)¶ Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
 Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
 Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
 Returns
None

converged
()¶ Returns True if solver is converged.
This may be custom for each solver, but as a default it checks whether the most recent iteration_data entry has converged==True
 Returns
Convergence status (True=converged)
 Return type
bool

run_policy
(max_steps=None, initial_state=None)¶ Perform a walk (episode) through the environment using the current policy
 Side Effects:
self.env will be reset and optionally then forced into initial_state
 Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state). If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
tuple containing:
states (list): list of states visited during the episode
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
 Return type
(tuple)

score_policy
(iters=500, max_steps=None, initial_state=None)¶ Score the current policy by performing iters greedy episodes in the environment and returning statistics
 Side Effects:
self.env will be reset
 Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
 Return type

class
lrl.solvers.
BaseSolver
(env, gamma=0.9, value_function_tolerance=0.001, policy_init_mode='zeros', max_iters=500, min_iters=2, max_steps_per_episode=100, score_while_training=False, raise_if_not_converged=False) Bases:
object
Base class for solvers
Examples
See examples directory
 Parameters
env – Environment instance, such as from RaceTrack() or RewardingFrozenLake()
gamma (float) – Discount factor
value_function_tolerance (float) – Tolerance for convergence of value function during solving (also used for Q (stateaction) value function tolerance
policy_init_mode (str) – Initialization mode for policy. See init_policy() for more detail
max_iters (int) – Maximum number of iterations to solve environment
min_iters (int) – Minimum number of iterations before checking for solver convergence
raise_if_not_converged (bool) – If True, will raise exception when environment hits max_iters without convergence. If False, a warning will be logged.
max_steps_per_episode (int) – Maximum number of steps allowed per episode (helps when evaluating policies that can lead to infinite walks)
score_while_training (dict, bool) –
Dict specifying whether the policy should be scored during training (eg: test how well a policy is doing every N iterations).
If dict, must be of format:
n_trains_per_eval (int): Number of training iters between evaluations
n_evals (int): Number of episodes for a given policy evaluation
If True, score with default settings of:
n_trains_per_eval: 500
n_evals: 500
If False, do not score during training.
 Returns
None

env
= None Environment being solved
 Type
Racetrack, RewardingFrozenLakeEnv

policy
= None Spaceefficient dictlike storage of the current and all former policies.
 Type

iteration_data
= None Data describing iteration results during solving of the environment.
Fields include:
time: time for this iteration
delta_max: maximum change in value function for this iteration
policy_changes: number of policy changes this iteration
converged: boolean denoting if solution is converged after this iteration
 Type

scoring_summary
= None Summary data from scoring runs computed during training if score_while_training == True
Fields include:
reward_mean: mean reward obtained during a given scoring run
 Type

scoring_episode_statistics
= None Detailed scoring data from scoring runs held as a dict of EpisodeStatistics objects.
Data is indexed by iteration number (from scoring_summary)
 Type
dict, EpisodeStatistics

init_policy
(init_type=None) Initialize self.policy, which is a dictionarylike DictWithHistory object for storing current and past policies
 Parameters
init_type (None, str) –
Method used for initializing policy. Can be any of:
None: Uses value in self.policy_init_type
zeros: Initialize policy to all 0’s (first action)
 random: Initialize policy to a random action (action indices are random integer from
[0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)
 Side Effects:
If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)
 Returns
None

iterate
() Perform the a single iteration of the solver.
This may be an iteration through all states in the environment (like in policy iteration) or obtaining and learning from a single experience (like in QLearning)
This method should update self.value and may update self.policy, and also commit iteration statistics to self.iteration_data. Unless the subclass implements a custom self.converged, self.iteration_data should include a boolean entry for “converged”, which is used by the default converged() function.
 Returns
None

iterate_to_convergence
(raise_if_not_converged=None, score_while_training=None) Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically
 Side Effects:
Many, but depends on the subclass of the solver’s .iterate()
 Parameters
raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged
score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs
 Returns
None

converged
() Returns True if solver is converged.
This may be custom for each solver, but as a default it checks whether the most recent iteration_data entry has converged==True
 Returns
Convergence status (True=converged)
 Return type
bool

run_policy
(max_steps=None, initial_state=None) Perform a walk (episode) through the environment using the current policy
 Side Effects:
self.env will be reset and optionally then forced into initial_state
 Parameters
max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
tuple containing:
states (list): boolean indicating if the episode was terminal according to the environment
rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)
is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally
 Return type
(tuple)

score_policy
(iters=500, max_steps=None, initial_state=None) Score the current policy by performing iters greedy episodes in the environment and returning statistics
 Side Effects:
self.env will be reset more side effects
more side effects
 Parameters
iters – Number of episodes in the environment
max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode
initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)
 Returns
Object containing statistics about the episodes (rewards, number of steps, etc.)
 Return type
Environments¶

class
lrl.environments.
Racetrack
(track=None, x_vel_limits=None, y_vel_limits=None, x_accel_limits=None, y_accel_limits=None, max_total_accel=2)¶ Bases:
gym.envs.toy_text.discrete.DiscreteEnv
A car-race-like environment that uses location and velocity for state and acceleration for actions, in 2D
Loosely inspired by the Racetrack example of Sutton and Barto’s Reinforcement Learning (Exercise 5.8, http://www.incompleteideas.net/book/thebook.html)
The objective of this environment is to traverse a racetrack from a start location to any goal location. Reaching a goal location returns a large reward and terminates the episode, whereas landing on a grass location returns a large negative reward and terminates the episode. All non-terminal transitions return a small negative reward. Oily road surfaces are non-terminal but also react to an agent’s action stochastically, sometimes causing the agent to “slip”, whereby its requested action is ignored (interpreted as if a=(0,0)).
The tiles in the environment are:
(blank): Clean open (deterministic) road
O: Oily (stochastic) road
G: (terminal) grass
S: Starting location (agent starts at a random starting location). After starting, S tiles behave like open road
F: Finish location(s) (agent must reach any of these tiles to receive positive reward)
The state space of the environment is described by x-y location and x-y velocity (with maximum velocity being a user-specified parameter). For example, s=(3, 5, 1, 1) means the Agent is currently in the x=3, y=5 location with Vx=1, Vy=1.
The action space of the environment is x-y acceleration (with maximum acceleration being a user-specified parameter). For example, a=(2, 1) means ax=2, ay=1. Transitions are determined by the current velocity as well as the requested acceleration (with a cap set by Vmax of the environment), for example:
s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(1, 5, -2, 0)
But if the x velocity limits are (-1, 1) then:
s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(2, 5, -1, 0)
Note that sign conventions for location are:
x: 0 at leftmost column, positive to the right
y: 0 at bottommost row, positive up
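The transition arithmetic can be checked with a short sketch: acceleration is added to velocity, the new velocity is clipped to its limits, and the clipped velocity is added to the location. This is an illustration under assumptions, not the environment's code: `next_state_sketch` is a hypothetical helper, the example uses s=(3, 5, 1, -1) and a=(-3, 1) with explicit signs, and tile effects (grass, oil, finish) are ignored.

```python
def clip(value, low, high):
    """Clamp value into the inclusive range [low, high]."""
    return max(low, min(high, value))


def next_state_sketch(state, action,
                      x_vel_limits=(-2, 2), y_vel_limits=(-2, 2)):
    """Deterministic (x, y, vx, vy) update for an (ax, ay) acceleration."""
    x, y, vx, vy = state
    ax, ay = action
    vx = clip(vx + ax, *x_vel_limits)   # velocity changes first, then is capped
    vy = clip(vy + ay, *y_vel_limits)
    return (x + vx, y + vy, vx, vy)     # location moves by the new velocity


uncapped = next_state_sketch((3, 5, 1, -1), (-3, 1))
capped = next_state_sketch((3, 5, 1, -1), (-3, 1), x_vel_limits=(-1, 1))
```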
 Parameters
track (list) – List of strings describing the track (see racetrack_tracks.py for examples)
x_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in x. Default is (-2, 2).
y_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in y. Default is (-2, 2).
x_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in x. Default is (-2, 2).
y_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in y. Default is (-2, 2).
max_total_accel (int) – (OPTIONAL) Integer maximum total acceleration in one action. Total acceleration is computed as abs(x_a)+abs(y_a), the sum of the absolute accelerations in both directions. The signature default is 2; a value at least as large as the sum of the x and y acceleration limits allows any accel described by the x and y limits
Notes
See also discrete.DiscreteEnv for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)
DOCTODO: Add examples
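The effect of max_total_accel on the action space can be sketched by enumerating integer accelerations and keeping those with abs(x_a)+abs(y_a) within the cap. `valid_actions_sketch` is a hypothetical helper written for illustration, assuming limits of (-2, 2) in each direction; the ordering of actions in the real environment may differ.

```python
def valid_actions_sketch(x_accel_limits=(-2, 2), y_accel_limits=(-2, 2),
                         max_total_accel=2):
    """List (ax, ay) pairs inside the limits whose total acceleration is allowed."""
    return [
        (ax, ay)
        for ax in range(x_accel_limits[0], x_accel_limits[1] + 1)
        for ay in range(y_accel_limits[0], y_accel_limits[1] + 1)
        if abs(ax) + abs(ay) <= max_total_accel
    ]


restricted = valid_actions_sketch(max_total_accel=2)
unrestricted = valid_actions_sketch(max_total_accel=4)
```

With a cap of 2 the diagonal extremes such as (2, 1) are excluded, while a cap of 4 (the sum of both limits) leaves the full 5 x 5 grid of accelerations.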

track
= None¶ List of strings describing track or the string name of a default track
 Type
list, str

desc
= None¶ Numpy character array of the track (better for printing on screen/accessing track at xy locations)
 Type
np.array

color_map
= None¶ Map from grid tile type to display color
 Type
dict

index_to_state
= None¶ Attribute to map from state index to full tuple describing state
Ex: index_to_state[state_index] -> state_tuple
 Type
list

state_to_index
= None¶ Attribute to map from state tuple to state index
Ex: state_to_index[state_tuple] -> state_index
 Type
dict
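The two mappings are inverses of one another. A minimal sketch of how such a pair could be built, using a hypothetical toy state space (a 2x2 grid with velocities in {-1, 0, 1}) rather than a real track:

```python
from itertools import product

# Hypothetical tiny state space: 2x2 grid locations, velocities in {-1, 0, 1}
locations = product(range(2), range(2))
velocities = list(product(range(-1, 2), repeat=2))

# A list maps index -> state tuple; a dict maps state tuple -> index
index_to_state = [(x, y, vx, vy) for (x, y) in locations for (vx, vy) in velocities]
state_to_index = {state: index for index, state in enumerate(index_to_state)}

state = index_to_state[10]
assert state_to_index[state] == 10  # round trip returns the original index
print(len(index_to_state), state)   # 36 (0, 1, -1, 0)
```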

is_location_terminal
= None¶ Attribute to map whether a state is terminal (eg: no rewards/transitions leading out of state).
Keyed by state tuple
 Type
dict

s
= None¶ Current state (inherited from parent)
 Type
int, tuple

reset
()¶ Reset the environment to a random starting location
 Returns
None

render
(mode='human', current_location='*')¶ Render the environment.
Warning
This method does not follow the prototype of its parent. It is presently a very simple version for printing the environment’s current state to the screen
 Parameters
mode – (NOT USED)
current_location – Character to denote the current location
 Returns
None

step
(a)¶ Take a step in the environment.
This wraps the parent object’s step(), interpreting integer actions as mapped to human-readable actions
 Parameters
a (tuple, int) – Action to take, either as an integer (0..nA-1) or true action (tuple of (x_accel,y_accel))
 Returns
Next state, either as a tuple or int depending on type of state used

close
()¶ Override _close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.

seed
(seed=None)¶ Sets the seed for this env’s random number generator(s).
Note
Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
 Returns
Returns the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
 Return type
list<bigint>

property
unwrapped
¶ Completely unwrap this env.
 Returns
The base non-wrapped gym.Env instance
 Return type
gym.Env
Experiment Runners¶

lrl.utils.experiment_runners.
run_experiment
(env, params, output_path)¶ Run a single experiment (env/solver combination), outputting results to a given location
 FUTURE: Improve easy reproducibility by outputting a settings file or similar? Could use gin-config or just output params. Outputting params doesn’t cover env though…
 Parameters
env – An instanced environment object (eg: Racetrack() or RewardingFrozenLake())
params – A dictionary of solver parameters for this run
output_path (str) – Path to output data (plots and csvs)
Output to output_path:
iteration_data.csv: Data about each solver iteration (shows how long each iteration took, how quickly the solver converged, etc.)
solver_results*.png: Images of policy (and value for planners). If environment state is defined by xy alone, a single image is returned. Else, an image for each additional state is returned (eg: for state = (x, y, vx, vy), plots of solver_results_vx_vy.png are returned for each (vx, vy))
scored_episodes.csv and scored_episodes.png: Detailed data for each episode taken during the final scoring, and a composite image of those episodes in the environment
intermediate_scoring_results.csv: Summary data from each evaluation during training (shows history of how the solver improved over time)
intermediate_scoring_results_*.png: Composite images of the intermediate scoring results taken during training, indexed by the iteration at which they were produced
training_episodes.csv and training_episodes.png: Detailed data for each episode taken during training, and a composite image of those episodes exploring the environment (only available for an explorational learner like QLearning)
 Returns
dict containing:
solver (BaseSolver, ValueIteration, PolicyIteration, QLearner): Fully populated solver object (after solving env)
scored_results (EpisodeStatistics): EpisodeStatistics object of results from scoring the final policy
solve_time (float): Time in seconds used to solve the env (eg: run solver.iterate_to_convergence())
 Return type
(dict)

lrl.utils.experiment_runners.
run_experiments
(environments, solver_param_grid, output_path='./output/')¶ Runs a set of experiments defined by solver_param_grid, writing results to output_path
 Parameters
environments (list) – List of instanced environments
solver_param_grid (dict) – Solver parameters in suitable form for sklearn.model_selection.ParameterGrid
output_path (str) – Relative path to which results will be output
Output to output_path:
For each environment:
env_name/grid_search_summary.csv: high-level summary of results for this env
env_name/case_name: Directory with detailed results for each env/case combination (see run_experiment for details on case-wise output)
 Returns
None
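To show what "a set of experiments defined by a parameter grid" means, the sketch below expands a grid into one parameter dictionary per case using only the standard library. It approximates what sklearn.model_selection.ParameterGrid does for a single dictionary of lists; the helper `expand_grid` and the example parameter names are hypothetical, and this is not the code used by run_experiments.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one parameter dict per combination of the listed values."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Hypothetical solver parameters; the key names are for illustration only
solver_param_grid = {"alpha": [0.1, 0.2], "epsilon": [0.05, 0.1, 0.2]}
cases = list(expand_grid(solver_param_grid))
print(len(cases))  # 6
print(cases[0])    # {'alpha': 0.1, 'epsilon': 0.05}
```

Each resulting dictionary corresponds to one case, so the number of cases per environment grows as the product of the list lengths.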
Plotting¶

lrl.utils.plotting.
plot_solver_convergence
(solver, **kwargs)¶ Convenience binding to plot convergence statistics for a solver object.
Also useful as a recipe for custom plotting.
 Parameters
solver (BaseSolver (or child)) – Solver object to be plotted
kwargs – Other args; see plot_solver_convergence_from_df()
 Returns
Matplotlib axes object
 Return type
Axes

lrl.utils.plotting.
plot_solver_convergence_from_df
(df, y='delta_max', y_label=None, x='iteration', x_label='Iteration', label=None, ax=None, savefig=None, **kwargs)¶ Convenience binding to plot convergence statistics for a set of solver objects.
Also useful as a recipe for custom plotting.
 Parameters
df (pandas.DataFrame) – DataFrame with solver convergence data
y (str) – Convergence statistic to be plotted (eg: delta_max, delta_mean, time, or policy_changes)
y_label (str) – Optional label for y_axis (if omitted, will use y as default name unless axis is already labeled)
x (str) – X axis data (typically ‘iteration’, but could be any convergence data)
x_label (str) – Optional label for x_axis (if omitted, will use ‘Iteration’)
label (str) – Optional label for the data set (shows up in axes legend)
ax (Axes) – Optional Matplotlib Axes object to add this line to
savefig (str) – Optional filename to save the figure to
kwargs – Additional args passed to matplotlib’s plot
 Returns
Matplotlib axes object
 Return type
Axes

lrl.utils.plotting.
plot_env
(env, ax=None, edgecolor='k', resize_figure=True, savefig=None)¶ Plot the map of an environment
 Parameters
env – Environment to plot
ax (axes) – (Optional) Axes object to plot on
edgecolor (str) – Color of the edge of each grid square (matplotlib format)
resize_figure (bool) –
If true, resize the figure to:
width = 0.5 * n_cols inches
height = 0.5 * n_rows inches
savefig (str) – If not None, save the figure to this filename
 Returns
Matplotlib axes object
 Return type
Axes

lrl.utils.plotting.
plot_solver_results
(env, solver=None, policy=None, value=None, savefig=None, **kwargs)¶ Convenience function to plot results from a solver over the environment map
Input can be using a BaseSolver or child object, or by specifying policy and/or value directly via dict or DictWithHistory.
See plot_solver_result() for more info on generation of individual plots and additional arguments for color/precision.
 Parameters
env – Augmented OpenAI Gym-like environment object
solver (BaseSolver) – Solver object used to solve the environment
policy (dict, DictWithHistory) – Policy for the environment, keyed by integer state-index or tuples of state
value (dict, DictWithHistory) – Value function for the environment, keyed by integer state-index or tuples of state
savefig (str) – If not None, save figures to this name. For cases with multiple policies per grid square, this will be the suffix on the name (eg: for policy at Vx=1, Vy=2, we get name of savefig_1_2.png)
**kwargs (dict) – Other arguments passed to plot_solver_result
 Returns
list of Matplotlib Axes for the plots
 Return type
list

lrl.utils.plotting.
plot_policy
(env, policy, **kwargs)¶ Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail

lrl.utils.plotting.
plot_value
(env, value, **kwargs)¶ Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail

lrl.utils.plotting.
plot_solver_result
(env, policy=None, value=None, ax=None, add_env_to_plot=True, hide_terminal_locations=True, color='k', title=None, savefig=None, size_policy='auto', size_value='auto', value_precision=2)¶ Plot result for a single xy map using a numpy array of shaped policy and/or value
 Parameters
env (Racetrack, FrozenLake, other environment) – Instantiated environment object
policy (np.array) – Policy for each grid square in the environment, in the same shape as env.desc. For plotting environments where we have multiple states for a given grid square (eg for Racetrack), will call plotting for each given additional state (eg: for v=(0, 0), v=(1, 0), ..)
value (np.array) – Value for each grid square in the environment, in the same shape as env.desc. For plotting environments where we have multiple states for a given grid square (eg for Racetrack), will call plotting for each given additional state (eg: for v=(0, 0), v=(1, 0), ..)
ax (Axes) – (OPTIONAL) Matplotlib axes object to plot to
add_env_to_plot (bool) – If True, add the environment map to the axes before plotting policy using plot_env()
hide_terminal_locations (bool) – If True, all known terminal locations will have no text printed (as policy here doesn’t matter)
color (str) – Matplotlib color string denoting color of the text for policy/value
title (str) – (Optional) title added to the axes object
savefig (str) – (Optional) string filename to output the figure to
size_policy (str, numeric) –
(Optional) Specification of text font size for policy printing. One of:
’auto’: Will automatically choose a font size based on the number of characters to be printed
str or numeric: Interpreted as a Matplotlib style font size designation
size_value (str, numeric) – (Optional) Specification of text font size for value printing. Same interface as size_policy
value_precision (int) – Precision of value function to be included on figures
 Returns
Matplotlib Axes object

lrl.utils.plotting.
plot_episodes
(episodes, env=None, add_env_to_plot=True, max_episodes=100, alpha=None, color='k', title=None, ax=None, savefig=None)¶ Plot a list of episodes through an environment over a drawing of the environment
 Parameters
episodes (list, EpisodeStatistics) – Series of episodes to be plotted. If EpisodeStatistics instance, .episodes will be extracted
env – Environment traversed
add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image
alpha (float) – (Optional) alpha (transparency) used for plotting the episode. If left as None, a value will be chosen based on the number of episodes to be plotted
color (str) – Matplotlib-style color designation
title (str) – (Optional) Title to be added to the axes
ax (axes) – (Optional) Matplotlib axes object to write the plot to
savefig (str) – (Optional) string filename to output the figure to
max_episodes (int) – Maximum number of episodes to add to the plot. If len(episodes) exceeds this value, randomly chosen episodes will be used
 Returns
Matplotlib Axes object with episodes plotted to it

lrl.utils.plotting.
plot_episode
(episode, env=None, add_env_to_plot=True, alpha=None, color='k', title=None, ax=None, savefig=None)¶ Plot a single episode (walk) through the environment
 Parameters
episode (list) – List of states encountered in the episode
env – Environment traversed
add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image
alpha (float) – (Optional) alpha (transparency) used for plotting the episode.
color (str) – Matplotlib-style color designation
title (str) – (Optional) Title to be added to the axes
ax (axes) – (Optional) Matplotlib axes object to write the plot to
savefig (str) – (Optional) string filename to output the figure to
 Returns
Matplotlib Axes object with a single episode plotted to it

lrl.utils.plotting.
choose_text_size
(n_chars, boxsize=1.0)¶ Helper to choose an appropriate text size when plotting policies. Size is chosen based on length of text
Return is calibrated to something that typically looked nice in testing
 Parameters
n_chars – Text caption to be added to plot
boxsize (float) – Size of box inside which text should print nicely. Used as a scaling factor. Default is 1 inch
 Returns
Matplotlib-style text size argument

lrl.utils.plotting.
policy_dict_to_array
(env, policy_dict)¶ Convert a policy stored as a dictionary into a dictionary of one or more policy numpy arrays shaped like env.desc
Can also be used for a value_dict.
policy_dict is a dictionary relating state to policy at that state in one of several forms. The dictionary can be keyed by state-index or a tuple of state (eg: (x, y, [other_state]), with x=0 in left column, y=0 in bottom row). If using tuples of state, state may be more than just x,y location as shown above, eg: (x, y, v_x, v_y). If len(state_tuple) > 2, we must plot each additional state separately.
Translate policy_dict into a policy_list_of_tuples of:
[(other_state_0, array_of_policy_at_other_state_0), (other_state_1, array_of_policy_at_other_state_1), ... ]
where the array_of_policy_at_other_state_* is in the same shape as env.desc (eg: cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).
Examples
If state is described by tuples of (x, y) (where there is a single unique state for each grid location), eg:
policy_dict = {
    (0, 0): policy_0_0,
    (0, 1): policy_0_1,
    (0, 2): policy_0_2,
    ...
    (1, 0): policy_1_0,
    (1, 1): policy_1_1,
    ...
    (xmax, ymax): policy_xmax_ymax,
}
then a single-element list is returned of the form:
returned = [ (None, np_array_of_policy), ]
where np_array_of_policy is of the same shape as env.desc (eg: the map), with each element corresponding to the policy at that grid location (for example, cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).
If state is described by tuples of (x, y, something_else, [more_something_else…]), for example if state = (x, y, Vx, Vy) like below:
policy_dict = {
    (0, 0, 0, 0): policy_0_0_0_0,
    (0, 0, 1, 0): policy_0_0_1_0,
    (0, 0, 0, 1): policy_0_0_0_1,
    ...
    (1, 0, 0, 0): policy_1_0_0_0,
    (1, 0, 0, 1): policy_1_0_0_1,
    ...
    (xmax, ymax, Vxmax, Vymax): policy_xmax_ymax_Vxmax_Vymax,
}
then a list is returned of the form:
returned = [
    # (other_state, np_array_of_policies_for_this_other_state)
    ((0, 0), np_array_of_policies_with_Vx0_Vy0),
    ((1, 0), np_array_of_policies_with_Vx1_Vy0),
    ((0, 1), np_array_of_policies_with_Vx0_Vy1),
    ...
    ((Vxmax, Vymax), np_array_of_policies_with_Vxmax_Vymax),
]
where each element corresponds to a different combination of all the non-location state. This means that each element of the list is:
(Identification_of_this_case, shaped_xygrid_of_policies_for_this_case)
and can be easily plotted over the environment’s map.
If policy_dict is keyed by state-index rather than state directly, the same logic as above still applies.
Notes
If using an environment (with policy keyed by either index or state) that has more than one unique state per grid location (eg: state has more than (x, y)), then the environment must also have an index_to_state attribute to identify overlapping states. This constraint applies to policies keyed by either index or state, but the code could be refactored to avoid this limitation for state-keyed policies if required.
 Parameters
env – Augmented OpenAI Gym-like environment object
policy_dict (dict) – Dictionary of policy for the environment, keyed by integer state-index or tuples of state
 Returns
list of (description, shaped_policy) elements as described above
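The grouping described above can be sketched without numpy, using nested lists in place of arrays. This is an illustrative re-implementation for state-keyed dictionaries only, not the library's code: `policy_dict_to_grids` and its `n_rows`/`n_cols`/`fill` arguments are hypothetical, and the returned grids use row 0 as the top row, which is assumed to match the layout of env.desc.

```python
from collections import defaultdict


def policy_dict_to_grids(policy_dict, n_rows, n_cols, fill=None):
    """Group a state-keyed policy by non-location state into row-major grids."""
    grids = defaultdict(lambda: [[fill] * n_cols for _ in range(n_rows)])
    for (x, y, *other), action in policy_dict.items():
        # States with only (x, y) are collected under the key None
        key = tuple(other) if other else None
        # y=0 is the bottom row, while grid row 0 is the top row
        grids[key][n_rows - 1 - y][x] = action
    return list(grids.items())


policy = {(0, 0): "R", (1, 0): "U", (0, 1): "R", (1, 1): "G"}
print(policy_dict_to_grids(policy, n_rows=2, n_cols=2))
# [(None, [['R', 'G'], ['R', 'U']])]

racing = {(0, 0, 0, 0): "a", (0, 0, 1, 0): "b"}
print(policy_dict_to_grids(racing, n_rows=1, n_cols=1))
# [((0, 0), [['a']]), ((1, 0), [['b']])]
```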

lrl.utils.plotting.
get_ax
(ax)¶ Returns figure and axes objects associated with an axes, instantiating if input is None
Data Stores¶

class
lrl.data_stores.
GeneralIterationData
(columns=None)¶ Bases:
object
Class to store data about solver iterations
Data is stored as a list of dictionaries. This is a placeholder for more advanced storage. Class gives a minimal set of extra bindings for convenience.
The present object has no checks to ensure consistency between added records (all have same fields, etc.). If any columns are missing from an added record, outputting to a dataframe will result in Pandas treating these as missing columns from a record.
 Parameters
columns (list) – An optional list of column names for the data (if specified, this sets the order of the columns in any output Pandas DataFrame or csv)
DOCTODO: Add example of usage

columns
= None¶ Column names used for data output.
If specified, this sets the order of any columns being output to Pandas DataFrame or csv
 Type
list

data
= None¶ List of dictionaries representing records.
Intended to be internal in future, but public at present to give easy access to records for slicing
 Type
list

add
(d)¶ Add a dictionary record to the data structure.
 Parameters
d (dict) – Dictionary of data to be stored
 Returns
None

get
(i=-1)¶ Return the i-th entry in the data store (index of storage is in order in which data is committed to this object)
 Parameters
i (int) – Index of data to return (can be any valid list index, including -1 and slices)
 Returns
i-th entry in the data store
 Return type
dict

to_dataframe
()¶ Returns the data structure as a Pandas DataFrame
 Returns
Pandas DataFrame of the data
 Return type
dataframe

to_csv
(filename, **kwargs)¶ Write data structure to a csv via the Pandas DataFrame
 Parameters
filename (str) – Filename or full path to output data to
kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()
 Returns
None

class
lrl.data_stores.
DictWithHistory
(timepoint_mode='explicit', tolerance=1e-07)¶ Bases:
collections.abc.MutableMapping
Dictionary-like object that maintains a history of all changes, either incrementally or at set timepoints
This object has access like a dictionary, but stores data internally such that the user can later recreate the state of the data from a past timepoint.
The intended use of this object is to store large objects which are iterated on (such as value or policy functions) in a way that a history of changes can be reproduced without having to store a new copy of the object every time. For example, when doing 10000 episodes of QLearning in a grid world with 2500 states, we can retain the full policy history during convergence (eg: answer “what was my policy after episode 527”) without keeping 10000 copies of a nearly-identical 2500 element numpy array or dict. The cost for this is some computation, although this has generally not been significant (tens of seconds for a large QLearning problem in testing).
 Parameters
timepoint_mode (str) – One of:
explicit – Timepoint incrementing is handled explicitly by the user (the timepoint only changes if the user invokes .increment_timepoint())
implicit – Timepoint incrementing is automatic and occurs on every setting action, including redundant sets (setting a key to a value it already holds). This is useful for a time history of all sets to the object
tolerance (float) – Absolute tolerance to test for when replacing values. If a value to be set is less than tolerance different from the current value, the current value is not changed.
Warning
Deletion of keys is not specifically supported. Deletion likely works for the most recent timepoint, but the history does not handle deleted keys properly
Numeric data may work best due to how new values are compared to existing data, although tuples have also been tested. See __setitem__ for more detail
DOCTODO: Add example

timepoint_mode
= None¶ See Parameters for definition
 Type
str

current_timepoint
= None¶ Timepoint that will be written to next
 Type
int

__getitem__
(key)¶ Return the most recent value for key
 Returns
Whatever is contained in ._data[key][-1][1] (return only the value from the most recent timepoint, not the timepoint associated with it)

__setitem__
(key, value)¶ Set the value at a key if it is different from the current data stored at key
Data stored here is stored under the self.current_timepoint.
Difference between new and current values is assessed by testing:
new_value == old_value
np.isclose(new_value, old_value)
where if neither returns True, the new value is taken to be different from the current value
 Side Effects:
If timepoint_mode == ‘implicit’, self.current_timepoint will be incremented after setting data
 Parameters
key (immutable) – Key under which data is stored
value – Value to store at key
 Returns
None
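The two-stage difference test described above can be sketched as follows. This is an illustration only: `is_different` is a hypothetical helper, and math.isclose with an absolute tolerance is used as a standard-library stand-in for np.isclose, so the exact tolerance behaviour may differ from the library's.

```python
import math


def is_different(new_value, old_value, tolerance=1e-7):
    """Return True if new_value should replace old_value."""
    if new_value == old_value:
        return False
    try:
        # math.isclose stands in for np.isclose here (absolute tolerance only)
        return not math.isclose(new_value, old_value, rel_tol=0.0, abs_tol=tolerance)
    except TypeError:
        # Non-numeric values (eg: tuples) that failed == are treated as different
        return True


print(is_different(1.0, 1.0 + 1e-9))  # False: within tolerance
print(is_different(1.0, 1.1))         # True
print(is_different((1, 2), (1, 3)))   # True
```

Skipping writes that fall within tolerance is what keeps the stored history small when values have effectively converged.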

update
(d)¶ Update this instance with a dictionary of data, d (similar to dict.update())
Keys in d that are present in this object overwrite the previous value. Keys in d that are missing in this object are added.
All data written from d is given the same timepoint (even if timepoint_mode=implicit); the addition is treated as a single update to the object rather than a series of updates.
 Parameters
d (dict) – Dictionary of data to be added here
 Returns
None

get_value_history
(key)¶ Returns a list of tuples of the value at a given key over the entire history of that key
 Parameters
key (immutable) – Any valid dictionary key
 Returns
list containing tuples of:
timepoint (int): Integer timepoint for this value
value (float): The value of key at the corresponding timepoint
 Return type
(list)

get_value_at_timepoint
(key, timepoint=-1)¶ Returns the value corresponding to a key at the timepoint that is closest to but not greater than timepoint
Raises a KeyError if key did not exist at timepoint. Raises an IndexError if no timepoint exists that applies
 Parameters
key (immutable) – Any valid dictionary key
timepoint (int) – Integer timepoint to return value for. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)
 Returns
Value corresponding to key at the timepoint closest to but not over timepoint
 Return type
numeric

to_dict
(timepoint=-1)¶ Return the state of the data at a given timepoint as a dict
 Parameters
timepoint (int) – Integer timepoint to return data as of. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)
 Returns
Data at timepoint
 Return type
dict
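To make the storage idea concrete, here is a toy history-keeping mapping in which each key holds a list of (timepoint, value) pairs and a new pair is appended only when the value changes. It is a simplified sketch, not DictWithHistory itself: the class name `HistoryDict` is hypothetical, only explicit timepoints and non-negative timepoint lookups are handled, and no tolerance is applied.

```python
from bisect import bisect_right


class HistoryDict:
    """Toy history-keeping mapping: each key stores a list of (timepoint, value)."""

    def __init__(self):
        self._data = {}
        self.current_timepoint = 0

    def __setitem__(self, key, value):
        history = self._data.setdefault(key, [])
        if history and history[-1][1] == value:
            return  # unchanged values add nothing to the history
        if history and history[-1][0] == self.current_timepoint:
            history[-1] = (self.current_timepoint, value)  # overwrite within a timepoint
        else:
            history.append((self.current_timepoint, value))

    def __getitem__(self, key):
        return self._data[key][-1][1]

    def increment_timepoint(self):
        self.current_timepoint += 1

    def get_value_at_timepoint(self, key, timepoint):
        """Value at the latest stored timepoint that is <= timepoint."""
        history = self._data[key]
        position = bisect_right([t for t, _ in history], timepoint)
        if position == 0:
            raise KeyError(f"{key!r} did not exist at timepoint {timepoint}")
        return history[position - 1][1]


value = HistoryDict()
value["s0"] = 0.0
value.increment_timepoint()
value["s0"] = 0.5
value.increment_timepoint()
value["s0"] = 0.5          # no change, so no new history entry
print(value._data["s0"])   # [(0, 0.0), (1, 0.5)]
print(value.get_value_at_timepoint("s0", 0))  # 0.0
print(value.get_value_at_timepoint("s0", 2))  # 0.5
```

Because unchanged values are never re-stored, memory grows with the number of changes rather than with the number of timepoints.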

clear
() → None. Remove all items from D.¶

get
(k[, d]) → D[k] if k in D, else d. d defaults to None.¶

increment_timepoint
()¶ Increments the timepoint at which the object is currently writing
 Returns
None

items
() → a set-like object providing a view on D's items¶

keys
() → a set-like object providing a view on D's keys¶

pop
(k[, d]) → v, remove specified key and return the corresponding value.¶ If key is not found, d is returned if given, otherwise KeyError is raised.

popitem
() → (k, v), remove and return some (key, value) pair¶ as a 2-tuple; but raise KeyError if D is empty.

setdefault
(k[, d]) → D.get(k,d), also set D[k]=d if k not in D¶

values
() → an object providing a view on D's values¶

class
lrl.data_stores.
EpisodeStatistics
¶ Bases:
object
Container for statistics about a set of independent episodes through an environment, typically following one policy
Statistics are lazily computed and memoized
DOCTODO: Add example usage. show plot_episodes

rewards
= None¶ List of the total reward for each episode (raw data)
 Type
list

episodes
= None¶ List of all episodes passed to the data object (raw data)
 Type
list

steps
= None¶ List of the total steps taken for each episode (raw data)
 Type
list

terminals
= None¶ List of whether each input episode was terminal (raw data)
 Type
list

add
(reward, episode, terminal)¶ Add an episode to the data store
 Parameters
reward (float) – Total reward from the episode
episode (list) – List of states encountered in the episode, including the starting and final state
terminal (bool) – Boolean indicating if episode was terminal (did environment say episode has ended)
 Returns
None

get_statistic
(statistic='reward_mean', index=-1)¶ Return a lazily computed and memoized statistic about the rewards from episodes 0 to index
If the statistic has not been previously computed, it will be computed and returned. See .get_statistics() for list of statistics available
 Side Effects:
self.statistics[index] will be computed using self.compute() if it has not been already
 Parameters
statistic (str) – See .compute() for available statistics
index (int) – Episode index for requested statistic
Notes
Statistics are computed for all episodes up to and including the requested index. For example, if episodes have rewards of [1, 3, 5, 10], get_statistic(‘reward_mean’, index=2) returns 3 (mean of [1, 3, 5]).
DOCTODO: Example usage (show getting some statistics)
 Returns
Value of the statistic requested
 Return type
int, float
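The lazy-and-cached behaviour described in the Notes can be sketched with a small stand-in class. This is not EpisodeStatistics: `RewardStats` is hypothetical, tracks rewards only, and computes a reduced set of statistics with the standard library.

```python
from statistics import mean, median


class RewardStats:
    """Toy container that caches statistics over rewards[0..index]."""

    def __init__(self, rewards):
        self.rewards = list(rewards)
        self._cache = {}
        self.computations = 0  # counts how often statistics are actually computed

    def get_statistic(self, statistic="reward_mean", index=-1):
        if index < 0:
            index += len(self.rewards)  # python-style negative indexing
        if index not in self._cache:
            window = self.rewards[: index + 1]  # up to and including index
            self._cache[index] = {
                "reward_mean": mean(window),
                "reward_median": median(window),
                "reward_max": max(window),
                "reward_min": min(window),
            }
            self.computations += 1
        return self._cache[index][statistic]


stats = RewardStats([1, 3, 5, 10])
print(stats.get_statistic("reward_mean", index=2))  # 3
print(stats.get_statistic("reward_max", index=2))   # 5, served from the cache
print(stats.computations)                           # 1
```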

get_statistics
(index=-1)¶ Return a lazily computed and memoized dictionary of all statistics about episodes 0 to index
If the statistics have not been previously computed, they will be computed here.
 Side Effects:
self.statistics[index] will be computed using self.compute() if it has not been already
 Parameters
index (int) – Episode index for requested statistic
 Returns
Details and statistics about this iteration, with keys:
Details about this iteration:
episode_index (int): Index of episode
terminal (bool): Boolean of whether this episode was terminal
reward (float): This episode’s reward (included to give easy access to per-iteration data)
steps (int): This episode’s steps (included to give easy access to per-iteration data)
Statistics computed for all episodes up to and including this episode:
reward_mean (float):
reward_median (float):
reward_std (float):
reward_max (float):
reward_min (float):
steps_mean (float):
steps_median (float):
steps_std (float):
steps_max (float):
steps_min (float):
terminal_fraction (float):
 Return type
dict

compute
(index=-1, force=False)¶ Compute and store statistics about rewards and steps for episodes up to and including the index-th episode
 Side Effects:
self.statistics[index] will be updated
 Parameters
index (int or 'all') – If integer, the index of the episode for which statistics are computed. Eg: If index==3, compute the statistics (see get_statistics() for list) for the series of episodes from 0 up to and not including 3 (typical python indexing rules). If ‘all’, compute statistics for all indices, skipping any that have been previously computed unless force == True
force (bool) –
If True, always recompute statistics even if they already exist.
If False, only compute if no previous statistics exist.
 Returns
None

to_dataframe
(include_episodes=False)¶ Return a Pandas DataFrame of the episode statistics
See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns
 Parameters
include_episodes (bool) – If True, add column including the entire episode for each iteration
 Returns
Pandas DataFrame

to_csv
(filename, **kwargs)¶ Write statistics to csv via the Pandas DataFrame
See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns
 Parameters
filename (str) – Filename or full path to output data to
kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()
 Returns
None

Miscellaneous Utilities¶

class
lrl.utils.misc.
Timer
¶ Bases:
object
A Simple Timer class for timing code

start
= None¶ timeit.default_timer object initialized at instantiation

elapsed
()¶ Return the time elapsed since this object was instantiated, in seconds
 Returns
Time elapsed in seconds
 Return type
float
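A timer of this kind is only a few lines. The sketch below shows the idea under the assumption that the start time is taken from timeit.default_timer at instantiation; `SimpleTimer` is a hypothetical stand-in, not the library class.

```python
from timeit import default_timer


class SimpleTimer:
    """Records a start time at instantiation and reports elapsed seconds."""

    def __init__(self):
        self.start = default_timer()

    def elapsed(self):
        return default_timer() - self.start


timer = SimpleTimer()
total = sum(i * i for i in range(100_000))  # some work to time
print(f"elapsed: {timer.elapsed():.4f} s")
```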


lrl.utils.misc.
print_dict_by_row
(d, fmt='{key:20s}: {val:d}')¶ Print a dictionary with a little extra structure, printing a different key/value to each line.
 Parameters
d (dict) – Dictionary to be printed
fmt (str) – Format string to be used for printing. Must contain key and val formatting references
 Returns
None

lrl.utils.misc.
count_dict_differences
(d1, d2, keys=None, raise_on_missing_key=True, print_differences=False)¶ Return the number of differences between two dictionaries. Useful to compare two policies stored as dictionaries.
Does not properly handle floats that are approximately equal. Mainly use for int and objects with __eq__
Optionally raise an error on missing keys (otherwise missing keys are counted as differences)
 Parameters
d1 (dict) – Dictionary to compare
d2 (dict) – Dictionary to compare
keys (list) – Optional list of keys to consider for differences. If None, all keys will be considered
raise_on_missing_key (bool) – If true, raise KeyError on any keys not shared by both dictionaries
print_differences (bool) – If true, print all differences to screen
 Returns
Number of differences between the two dictionaries
 Return type
int
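The counting behaviour described above, including the two ways of handling missing keys, can be sketched as follows. `count_differences` is a hypothetical simplified version (no `keys` or `print_differences` options), not the library function.

```python
def count_differences(d1, d2, raise_on_missing_key=True):
    """Count keys whose values differ (by ==) between two dictionaries."""
    differences = 0
    for key in d1.keys() | d2.keys():
        if key not in d1 or key not in d2:
            if raise_on_missing_key:
                raise KeyError(f"Key {key!r} is not in both dictionaries")
            differences += 1  # a missing key counts as one difference
        elif d1[key] != d2[key]:
            differences += 1
    return differences


old_policy = {0: "left", 1: "up", 2: "up"}
new_policy = {0: "left", 1: "right", 2: "up"}
print(count_differences(old_policy, new_policy))  # 1
```

Because the comparison uses plain equality, two floats that differ only by rounding error would be counted as a difference, which is why this approach suits integer or categorical policies better than value functions.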

lrl.utils.misc.
dict_differences
(d1, d2)¶ Return the maximum and mean of the absolute difference between all elements of two dictionaries of numbers
 Parameters
d1 (dict) – Dictionary to compare
d2 (dict) – Dictionary to compare
 Returns
tuple containing:
float: Maximum elementwise difference
float: Mean of elementwise differences
 Return type
tuple

lrl.utils.misc.
rc_to_xy
(row, col, rows)¶ Convert from (row, col) coordinates (eg: numpy array) to (x, y) coordinates (bottom left = 0,0)
(x, y) convention:
(0,0) in bottom left
x +ve to the right
y +ve up
(row,col) convention:
(0,0) in top left
row +ve down
col +ve to the right
 Parameters
row (int) – row coordinate to be converted
col (int) – col coordinate to be converted
rows (int) – Total number of rows
 Returns
(x, y)
 Return type
tuple
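Given the two conventions listed above, the conversion reduces to keeping the column as x and flipping the row index. The sketch below is a re-derivation from those conventions, named `rc_to_xy_sketch` to make clear it is not copied from the library source.

```python
def rc_to_xy_sketch(row, col, rows):
    """Convert (row, col) with a top-left origin to (x, y) with a bottom-left origin."""
    x = col             # both conventions count columns from the left
    y = rows - 1 - row  # row 0 is the top, so it maps to the largest y
    return x, y


print(rc_to_xy_sketch(0, 0, rows=3))  # (0, 2): top-left cell of a 3-row grid
print(rc_to_xy_sketch(2, 1, rows=3))  # (1, 0): bottom row, second column
```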

lrl.utils.misc.
params_to_name
(params, n_chars=4, sep='_', first_fields=None, key_remap=None)¶ Convert a mappable of parameters into a string for easy test naming
Warning
Currently includes hardcoded formatting that interprets keys named ‘alpha’ or ‘epsilon’
 Parameters
params (dict) – Dictionary to convert to a string
n_chars (int) – Number of characters per key to add to string. Eg: if key=’abcdefg’ and n_chars=4, output will be ‘abcd’
sep (str) – Separator character between fields (uses one of these between key and value, and two between different key-value pairs)
first_fields (list) – Optional list of keys to write ahead of other keys (otherwise, output order is sorted)
key_remap (list) – List of dictionaries of {key_name: new_key_name} for rewriting keys into more readable strings
 Returns
 Return type
str