API

Solvers

class lrl.solvers.PolicyIteration(env, value_function_initial_value=0.0, max_policy_eval_iters_per_improvement=10, policy_evaluation_type='on-policy-iterative', **kwargs)

Bases: lrl.solvers.base_solver.BaseSolver

Solver for policy iteration

Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 80).

Notes

See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

Examples

See examples directory

Parameters
  • value_function_initial_value (float) – Value to initialize all elements of the value function to

  • max_policy_eval_iters_per_improvement (int) – Maximum number of policy evaluation iterations performed per policy improvement step (passed to _policy_evaluation() as max_iters)

  • policy_evaluation_type (str) – Type of solution method for calculating the policy (see _policy_evaluation() for more details). Typical usage should not need to change this, as doing so can make calculations slower and more memory intensive

  • See the BaseSolver class for additional parameters

Returns

None
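
A minimal usage sketch. The track name and parameter values below are illustrative (see racetrack_tracks.py in the lrl source for the tracks actually shipped); any instanced lrl environment can be used - see the Environments section and the examples directory.

from lrl.environments import Racetrack
from lrl.solvers import PolicyIteration

# Any instanced lrl environment works here; the track name is illustrative
env = Racetrack(track='10x10')

pi = PolicyIteration(env, gamma=0.9, max_iters=50)
pi.iterate_to_convergence()         # evaluate/improve until the policy stops changing

stats = pi.score_policy(iters=100)  # 100 greedy episodes under the converged policy
print(stats.get_statistics())       # reward/step statistics (see EpisodeStatistics)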

value = None

Space-efficient dict-like storage of the current and all former value functions

Type

DictWithHistory

iterate()

Perform a single iteration of policy iteration, updating self.value and storing metadata about the iteration.

Side Effects:

  • self.value: Updated to the newest estimate of the value function

  • self.policy: Updated to the greedy policy according to the value function estimate

  • self.iteration: Increment iteration counter by 1

  • self.iteration_data: Add new record to iteration data store

Returns

None

converged()

Returns True if solver is converged.

Judge convergence by checking whether the most recent policy iteration resulted in any changes in policy

Returns

Convergence status (True=converged)

Return type

bool

_policy_evaluation(max_iters=None)

Compute an estimate of the value function for the current policy to within self.tolerance

Side Effects:

self.value: Updated to the newest estimate of the value function

Returns

None

_policy_improvement(return_differences=True)

Update the policy to be greedy relative to the most recent value function

Side Effects:

self.policy: Updated to be greedy relative to self.value

Parameters

return_differences – If True, return number of differences between old and new policies

Returns

(if return_differences==True) Number of differences between the old and new policies

Return type

int

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)
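
A short sketch of consuming this return value (assuming solver is any solved BaseSolver subclass):

states, rewards, is_terminal = solver.run_policy(max_steps=100)

print(len(states), 'states visited')        # includes the starting state
print(sum(rewards), 'total reward')         # rewards[0] == 0 (step 0 only enters the start state)
print('terminated naturally:', is_terminal)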

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics
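
For example (the initial_state value is illustrative and its format depends on the environment):

stats = solver.score_policy(iters=200, initial_state=(2, 3, 0, 0))  # illustrative state tuple
print(stats.get_statistic('reward_mean'))   # mean reward over the 200 episodes
print(stats.get_statistic('steps_mean'))    # mean episode length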

class lrl.solvers.ValueIteration(env, value_function_initial_value=0.0, **kwargs)

Bases: lrl.solvers.base_solver.BaseSolver

Solver for value iteration

Implemented as per Sutton and Barto’s Reinforcement Learning (http://www.incompleteideas.net/book/RLbook2018.pdf, page 82).

Notes

See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

Examples

See examples directory

Parameters
  • value_function_initial_value (float) – Value to initialize all elements of the value function to

  • See the BaseSolver class for additional parameters

Returns

None
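
A minimal usage sketch (reusing an env instanced as in the PolicyIteration sketch above):

from lrl.solvers import ValueIteration

vi = ValueIteration(env, gamma=0.9, value_function_tolerance=0.001)
vi.iterate_to_convergence()

# The value function history is kept in a DictWithHistory (see Data Stores)
print(vi.value.to_dict())                       # value function at the final timepoint
print(vi.iteration_data.to_dataframe().tail())  # per-iteration convergence data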

value = None

Space-efficient dict-like storage of the current and all former value functions

Type

DictWithHistory

iterate()

Perform a single iteration of value iteration, updating self.value and storing metadata about the iteration.

Side Effects:

  • self.value: Updated to the newest estimate of the value function

  • self.policy: Updated to the greedy policy according to the value function estimate

  • self.iteration: Increment iteration counter by 1

  • self.iteration_data: Add new record to iteration data store

Returns

None

converged()

Returns True if solver is converged.

Test convergence by comparing the latest value function delta_max to the convergence tolerance

Returns

Convergence status (True=converged)

Return type

bool

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics

class lrl.solvers.QLearning(env, value_function_tolerance=0.1, alpha=None, epsilon=None, max_iters=2000, min_iters=250, num_episodes_for_convergence=20, **kwargs)

Bases: lrl.solvers.base_solver.BaseSolver

Solver class for Q-Learning

Notes

See also BaseSolver for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

Examples

See examples directory

Parameters
  • alpha (float, dict) –

    (OPTIONAL)

    • If None, default linear decay schedule applied, decaying from 0.1 at iter 0 to 0.025 at max iter

    • If float, interpreted as a constant alpha value

    • If dict, interpreted as specifications to a decay function as defined in decay_functions()

  • epsilon (float, dict) –

    (OPTIONAL)

    • If None, default linear decay schedule applied, decaying from 0.25 at iter 0 to 0.05 at max iter

    • If float, interpreted as a constant epsilon value

    • If dict, interpreted as specifications to a decay function as defined in decay_functions()

  • num_episodes_for_convergence (int) – Number of consecutive episodes with delta_Q < tolerance to say a solution is converged

  • **kwargs – Other arguments passed to BaseSolver

Returns

None
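
A minimal usage sketch (constant alpha/epsilon are used for simplicity; env is any instanced lrl environment, as in the sketches above):

from lrl.solvers import QLearning

ql = QLearning(env, alpha=0.1, epsilon=0.1, max_iters=2000,
               num_episodes_for_convergence=20)
ql.iterate_to_convergence(raise_if_not_converged=False)

print(ql.transitions)                          # total transitions experienced during learning
print(ql.episode_statistics.get_statistics())  # statistics over the training episodes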

transitions = None

Counter for number of transitions experienced during all learning

Type

int

q = None

Space-efficient dict-like storage of the current and all former q functions

Type

DictWithHistory

iteration_data = None

Data store for iteration data

Overloads BaseSolver’s iteration_data attribute with one that includes more fields

Type

GeneralIterationData

episode_statistics = None

Data store for statistics from training episodes

Type

EpisodeStatistics

num_episodes_for_convergence = None

Number of consecutive episodes with delta_Q < tolerance to say a solution is converged

Type

int

_policy_improvement(states=None)

Update the policy to be greedy relative to the most recent q function

Side Effects:

self.policy: Updated to be greedy relative to self.q

Parameters

states – List of states to update. If None, all states will be updated

Returns

None

step(count_transition=True)

Take and learn from a single step in the environment.

Applies the typical Q-Learning approach to learn from the experienced transition

Parameters

count_transition (bool) – If True, increment transitions counter self.transitions. Else, do not.

Returns

tuple containing:

  • transition (tuple): Tuple of (state, reward, next_state, is_terminal)

  • delta_q (float): The (absolute) change in q caused by this step

Return type

(tuple)

iterate()

Perform and learn from a single episode in the environment (one walk from start to finish)

Side Effects:

  • self.value: Updated to the newest estimate of the value function

  • self.policy: Updated to the greedy policy according to the value function estimate

  • self.iteration: Increment iteration counter by 1

  • self.iteration_data: Add new record to iteration data store

  • self.env: Reset and then walked through

Returns

None

choose_epsilon_greedy_action(state, epsilon=None)

Return an action chosen by epsilon-greedy scheme based on the current estimate of Q

Parameters
  • state (int, tuple) – Descriptor of current state in environment

  • epsilon – Optional. If None, self.epsilon is used

Returns

action chosen

Return type

int or tuple

converged()

Returns True if solver is converged.

Returns

Convergence status (True=converged)

Return type

bool

get_q_at_state(state)

Returns a numpy array of q values at the current state, in the same order as the standard action indexing

Parameters

state (int, tuple) – Descriptor of current state in environment

Returns

Numpy array of q for all actions

Return type

np.array

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

init_q(init_val=0.0)

Initialize self.q, a dict-like DictWithHistory object for storing the state-action value function q

Parameters

init_val (float) – Value to give all states in the initialized q

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics

property alpha

Returns value of alpha at current iteration

property epsilon

Returns value of epsilon at current iteration

class lrl.solvers.BaseSolver(env, gamma=0.9, value_function_tolerance=0.001, policy_init_mode='zeros', max_iters=500, min_iters=2, max_steps_per_episode=100, score_while_training=False, raise_if_not_converged=False)

Bases: object

Base class for solvers

Examples

See examples directory

Parameters
  • env – Environment instance, such as from RaceTrack() or RewardingFrozenLake()

  • gamma (float) – Discount factor

  • value_function_tolerance (float) – Tolerance for convergence of the value function during solving (also used as the tolerance for the Q (state-action) value function)

  • policy_init_mode (str) – Initialization mode for policy. See init_policy() for more detail

  • max_iters (int) – Maximum number of iterations to solve environment

  • min_iters (int) – Minimum number of iterations before checking for solver convergence

  • raise_if_not_converged (bool) – If True, will raise exception when environment hits max_iters without convergence. If False, a warning will be logged.

  • max_steps_per_episode (int) – Maximum number of steps allowed per episode (helps when evaluating policies that can lead to infinite walks)

  • score_while_training (dict, bool) –

    Dict specifying whether the policy should be scored during training (eg: test how well a policy is doing every N iterations).

    If dict, must be of format:

    • n_trains_per_eval (int): Number of training iters between evaluations

    • n_evals (int): Number of episodes for a given policy evaluation

    If True, score with default settings of:

    • n_trains_per_eval: 500

    • n_evals: 500

    If False, do not score during training.

Returns

None
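
The score_while_training dict format described above, sketched with one of the concrete solvers (values are illustrative; env is any instanced lrl environment):

from lrl.solvers import ValueIteration

# Score the current policy every 100 training iterations, using 250 episodes per evaluation
solver = ValueIteration(env, gamma=0.9,
                        score_while_training={'n_trains_per_eval': 100, 'n_evals': 250})
solver.iterate_to_convergence()

print(solver.scoring_summary.to_dataframe())   # one row per intermediate scoring run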

env = None

Environment being solved

Type

Racetrack, RewardingFrozenLakeEnv

policy = None

Space-efficient dict-like storage of the current and all former policies.

Type

DictWithHistory

iteration_data = None

Data describing iteration results during solving of the environment.

Fields include:

  • time: time for this iteration

  • delta_max: maximum change in value function for this iteration

  • policy_changes: number of policy changes this iteration

  • converged: boolean denoting if solution is converged after this iteration

Type

GeneralIterationData

scoring_summary = None

Summary data from scoring runs computed during training if score_while_training == True

Fields include:

  • reward_mean: mean reward obtained during a given scoring run

Type

GeneralIterationData

scoring_episode_statistics = None

Detailed scoring data from scoring runs held as a dict of EpisodeStatistics objects.

Data is indexed by iteration number (from scoring_summary)

Type

dict, EpisodeStatistics

init_policy(init_type=None)

Initialize self.policy, which is a dictionary-like DictWithHistory object for storing current and past policies

Parameters

init_type (None, str) –

Method used for initializing policy. Can be any of:

  • None: Uses value in self.policy_init_type

  • zeros: Initialize policy to all 0’s (first action)

  • random: Initialize policy to a random action (action indices are random integer from

    [0, len(self.env.P[this_state])], where P is the transition matrix and P[state] is a list of all actions available in the state)

Side Effects:

If init_type is specified as argument, it is also stored to self.policy_init_type (overwriting previous value)

Returns

None

iterate()

Perform a single iteration of the solver.

This may be an iteration through all states in the environment (like in policy iteration) or obtaining and learning from a single experience (like in Q-Learning)

This method should update self.value and may update self.policy, and also commit iteration statistics to self.iteration_data. Unless the subclass implements a custom self.converged, self.iteration_data should include a boolean entry for “converged”, which is used by the default converged() function.

Returns

None

iterate_to_convergence(raise_if_not_converged=None, score_while_training=None)

Perform self.iterate repeatedly until convergence, optionally scoring the current policy periodically

Side Effects:

Many, but depends on the subclass of the solver’s .iterate()

Parameters
  • raise_if_not_converged (bool) – If true, will raise an exception if convergence is not reached before hitting maximum number of iterations. If None, uses self.raise_if_not_converged

  • score_while_training (bool, dict, None) – If None, use self.score_while_training. Else, accepts inputs of same format as accepted for score_while_training solver inputs

Returns

None

converged()

Returns True if solver is converged.

This may be custom for each solver, but as a default it checks whether the most recent iteration_data entry has converged==True

Returns

Convergence status (True=converged)

Return type

bool

run_policy(max_steps=None, initial_state=None)

Perform a walk (episode) through the environment using the current policy

Side Effects:
  • self.env will be reset and optionally then forced into initial_state

Parameters
  • max_steps – Maximum number of steps to be taken in the walk (step 0 is taken to be entering initial state) If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the walk (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

tuple containing:

  • states (list): list of states visited during the episode (including the starting and final state)

  • rewards (list): list of rewards obtained during the episode (rewards[0] == 0 as step 0 is simply starting the game)

  • is_terminal (bool): Boolean denoting whether the environment returned that the episode terminated naturally

Return type

(tuple)

score_policy(iters=500, max_steps=None, initial_state=None)

Score the current policy by performing iters greedy episodes in the environment and returning statistics

Side Effects:

self.env will be reset

Parameters
  • iters – Number of episodes in the environment

  • max_steps – Maximum number of steps allowed per episode. If None, defaults to self.max_steps_per_episode

  • initial_state – State for the environment to be placed in to start the episode (used to force a deterministic start from anywhere in the environment rather than the typical start position)

Returns

Object containing statistics about the episodes (rewards, number of steps, etc.)

Return type

EpisodeStatistics

Environments

class lrl.environments.Racetrack(track=None, x_vel_limits=None, y_vel_limits=None, x_accel_limits=None, y_accel_limits=None, max_total_accel=2)

Bases: gym.envs.toy_text.discrete.DiscreteEnv

A car-race-like environment that uses location and velocity for state and acceleration for actions, in 2D

Loosely inspired by the Racetrack example of Sutton and Barto’s Reinforcement Learning (Exercise 5.8, http://www.incompleteideas.net/book/the-book.html)

The objective of this environment is to traverse a racetrack from a start location to any goal location. Reaching a goal location returns a large reward and terminates the episode, whereas landing on a grass location returns a large negative reward and terminates the episode. All non-terminal transitions return a small negative reward. Oily road surfaces are non-terminal but also react to an agent’s action stochastically, sometimes causing an Agent to “slip” whereby their requested action is ignored (interpreted as if a=(0,0)).

The tiles in the environment are:

  • (blank): Clean open (deterministic) road

  • O: Oily (stochastic) road

  • G: (terminal) grass

  • S: Starting location (agent starts at a random starting location). After starting, S tiles behave like open road

  • F: Finish location(s) (agent must reach any of these tiles to receive a positive reward)

The state space of the environment is described by xy location and xy velocity (with maximum velocity being a user-specified parameter). For example, s=(3, 5, 1, -1) means the Agent is currently in the x=3, y=5 location with Vx=1, Vy=-1.

The action space of the environment is xy acceleration (with maximum acceleration being a user-specified parameter). For example, a=(-2, 1) means ax=-2, ay=1. Transitions are determined by the current velocity as well as the requested acceleration (with a cap set by Vmax of the environment), for example:

  • s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(1, 5, -2, 0)

But if vx_max == +-1 then:

  • s=(3, 5, 1, -1), a=(-3, 1) –> s_prime=(2, 5, -1, 0)

Note that sign conventions for location are:

  • x: 0 at leftmost column, positive to the right

  • y: 0 at bottommost row, positive up

Parameters
  • track (list) – List of strings describing the track (see racetrack_tracks.py for examples)

  • x_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in x. Default is (-2, 2).

  • y_vel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid velocity in y. Default is (-2, 2).

  • x_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in x. Default is (-2, 2).

  • y_accel_limits (tuple) – (OPTIONAL) Tuple of (min, max) valid acceleration in y. Default is (-2, 2).

  • max_total_accel (int) – (OPTIONAL) Integer maximum total acceleration in one action, computed as abs(x_a) + abs(y_a) (the sum of the magnitudes of acceleration in both directions). Default is 2.

Notes

See also discrete.DiscreteEnv for additional attributes, members, and arguments (missing here due to Sphinx bug with inheritance in docs)

DOCTODO: Add examples
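
A brief sketch (the track name is illustrative; see racetrack_tracks.py for the tracks shipped with lrl):

from lrl.environments import Racetrack

rt = Racetrack(track='10x10', x_vel_limits=(-2, 2), y_vel_limits=(-2, 2),
               max_total_accel=2)

rt.reset()        # place the agent at a random starting (S) location
rt.render()       # print the track, marking the current location with '*'
rt.step((1, 0))   # request ax=+1, ay=0 (an integer action index is also accepted)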

track = None

List of strings describing track or the string name of a default track

Type

list, str

desc = None

Numpy character array of the track (better for printing on screen/accessing track at xy locations)

Type

np.array

color_map = None

Map from grid tile type to display color

Type

dict

index_to_state = None

Attribute to map from state index to full tuple describing state

Ex: index_to_state[state_index] -> state_tuple

Type

list

state_to_index = None

Attribute to map from state tuple to state index

Ex: state_to_index[state_tuple] -> state_index

Type

dict

is_location_terminal = None

Attribute to map whether a state is terminal (eg: no rewards/transitions leading out of the state). Keyed by state tuple

Type

dict

s = None

Current state (inherited from parent)

Type

int, tuple

reset()

Reset the environment to a random starting location

Returns

None

render(mode='human', current_location='*')

Render the environment.

Warning

This method does not follow the prototype of its parent. It is presently a very simple version for printing the environment’s current state to the screen

Parameters
  • mode – (NOT USED)

  • current_location – Character to denote the current location

Returns

None

step(a)

Take a step in the environment.

This wraps the parent object’s step(), interpreting integer actions as mapped to human-readable actions

Parameters

a (tuple, int) – Action to take, either as an integer (0..nA-1) or true action (tuple of (x_accel,y_accel))

Returns

Next state, either as a tuple or int depending on type of state used

close()

Override _close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

Returns the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

Return type

list<bigint>

property unwrapped

Completely unwrap this env.

Returns

The base non-wrapped gym.Env instance

Return type

gym.Env

Experiment Runners

lrl.utils.experiment_runners.run_experiment(env, params, output_path)

Run a single experiment (env/solver combination), outputting results to a given location

FUTURE: Improve reproducibility by outputting a settings file or similar? Could use gin-config or just output params, although outputting params doesn’t cover env…

Parameters
  • env – An instanced environment object (eg: Racetrack() or RewardingFrozenLake())

  • params – A dictionary of solver parameters for this run

  • output_path (str) – Path to output data (plots and csvs)

Output to output_path:

  • iteration_data.csv: Data about each solver iteration (shows how long each iteration took, how quickly the solver converged, etc.)

  • solver_results*.png: Images of policy (and value for planners). If environment state is defined by xy alone, a single image is returned. Else, an image for each additional state is returned (eg: for state = (x, y, vx, vy), plots of solver_results_vx_vy.png are returned for each (vx, vy))

  • scored_episodes.csv and scored_episodes.png: Detailed data for each episode taken during the final scoring, and a composite image of those episodes in the environment

  • intermediate_scoring_results.csv: Summary data from each evaluation during training (shows history of how the solver improved over time)

  • intermediate_scoring_results_*.png: Composite images of the intermediate scoring results taken during training, indexed by the iteration at which they were produced

  • training_episodes.csv and training_episodes.png: Detailed data for each episode taken during training, and a composite image of those episodes exploring the environment (only available for an exploratory learner like Q-Learning)

Returns

dict containing:

  • solver (BaseSolver, ValueIteration, PolicyIteration, QLearning): Fully populated solver object (after solving env)

  • scored_results (EpisodeStatistics): EpisodeStatistics object of results from scoring the final policy

  • solve_time (float): Time in seconds used to solve the env (eg: run solver.iterate_to_convergence())

Return type

(dict)
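
A usage sketch. The params keys shown are illustrative only (the exact keys accepted depend on the solver; see the examples directory for complete parameter dictionaries), and the track name is illustrative:

from lrl.environments import Racetrack
from lrl.utils.experiment_runners import run_experiment

params = {'solver': 'PolicyIteration', 'gamma': 0.9, 'max_iters': 500}  # illustrative keys

results = run_experiment(env=Racetrack(track='10x10'),
                         params=params,
                         output_path='./output/pi_racetrack/')
print(results['solve_time'])   # seconds spent solving the environment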

lrl.utils.experiment_runners.run_experiments(environments, solver_param_grid, output_path='./output/')

Runs a set of experiments defined by param_grid, writing results to output_path

Parameters
  • environments (list) – List of instanced environments

  • solver_param_grid (dict) – Solver parameters in suitable form for sklearn.model_selection.ParameterGrid

  • output_path (str) – Relative path to which results will be output

Output to output_path:

  • For each environment:

    • env_name/grid_search_summary.csv: high-level summary of results for this env

    • env_name/case_name: Directory with detailed results for each env/case combination (see run_experiment() for details on the casewise output)

Returns

None
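
A usage sketch. The grid follows sklearn.model_selection.ParameterGrid conventions (each key maps to a list of candidate values and the cross-product of settings is run); the keys and track name below are illustrative:

from lrl.environments import Racetrack
from lrl.utils.experiment_runners import run_experiments

param_grid = {'solver': ['ValueIteration', 'PolicyIteration'],   # illustrative keys
              'gamma': [0.8, 0.9]}

run_experiments(environments=[Racetrack(track='10x10')],   # list of instanced environments
                solver_param_grid=param_grid,
                output_path='./output/grid_search/')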

Plotting

lrl.utils.plotting.plot_solver_convergence(solver, **kwargs)

Convenience binding to plot convergence statistics for a solver object.

Also useful as a recipe for custom plotting.

Parameters
  • solver (BaseSolver (or child)) – Solver object to be plotted

  • **kwargs – Other arguments; see plot_solver_convergence_from_df()

Returns

Matplotlib axes object

Return type

Axes

lrl.utils.plotting.plot_solver_convergence_from_df(df, y='delta_max', y_label=None, x='iteration', x_label='Iteration', label=None, ax=None, savefig=None, **kwargs)

Convenience binding to plot convergence statistics for a set of solver objects.

Also useful as a recipe for custom plotting.

Parameters
  • df (pandas.DataFrame) – DataFrame with solver convergence data

  • y (str) – Convergence statistic to be plotted (eg: delta_max, delta_mean, time, or policy_changes)

  • y_label (str) – Optional label for y_axis (if omitted, will use y as default name unless axis is already labeled)

  • x (str) – X axis data (typically ‘iteration’, but could be any convergence data)

  • x_label (str) – Optional label for x_axis (if omitted, will use ‘Iteration’)

  • label (str) – Optional label for the data set (shows up in axes legend)

  • ax (Axes) – Optional Matplotlib Axes object to add this line to

  • savefig (str) – Optional filename to save the figure to

  • kwargs – Additional args passed to matplotlib’s plot

Returns

Matplotlib axes object

Return type

Axes

lrl.utils.plotting.plot_env(env, ax=None, edgecolor='k', resize_figure=True, savefig=None)

Plot the map of an environment

Parameters
  • env – Environment to plot

  • ax (axes) – (Optional) Axes object to plot on

  • edgecolor (str) – Color of the edge of each grid square (matplotlib format)

  • resize_figure (bool) –

    If true, resize the figure to:

    • width = 0.5 * n_cols inches

    • height = 0.5 * n_rows inches

  • savefig (str) – If not None, save the figure to this filename

Returns

Matplotlib axes object

Return type

Axes

lrl.utils.plotting.plot_solver_results(env, solver=None, policy=None, value=None, savefig=None, **kwargs)

Convenience function to plot results from a solver over the environment map

Input can be either a BaseSolver (or child) object, or a policy and/or value specified directly via dict or DictWithHistory.

See plot_solver_result() for more info on generation of individual plots and additional arguments for color/precision.

Parameters
  • env – Augmented OpenAI Gym-like environment object

  • solver (BaseSolver) – Solver object used to solve the environment

  • policy (dict, DictWithHistory) – Policy for the environment, keyed by integer state-index or tuples of state

  • value (dict, DictWithHistory) – Value function for the environment, keyed by integer state-index or tuples of state

  • savefig (str) – If not None, save figures to this name. For cases with multiple policies per grid square, this will be the suffix on the name (eg: for policy at Vx=1, Vy=2, we get name of savefig_1_2.png)

  • **kwargs (dict) – Other arguments passed to plot_solver_result

Returns

list of Matplotlib Axes for the plots

Return type

list
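
A usage sketch (assuming solver is a solved BaseSolver subclass for env, as in the Solvers section):

from lrl.utils import plotting

# Pass a solver object directly...
axes = plotting.plot_solver_results(env, solver=solver, savefig='solver_results')

# ...or pass policy/value explicitly (dict or DictWithHistory)
axes = plotting.plot_solver_results(env, policy=solver.policy, value=solver.value)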

lrl.utils.plotting.plot_policy(env, policy, **kwargs)

Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail

lrl.utils.plotting.plot_value(env, value, **kwargs)

Convenience binding for plot_policy_or_value(). See plot_policy_or_value for more detail

lrl.utils.plotting.plot_solver_result(env, policy=None, value=None, ax=None, add_env_to_plot=True, hide_terminal_locations=True, color='k', title=None, savefig=None, size_policy='auto', size_value='auto', value_precision=2)

Plot the result for a single xy map using numpy arrays of policy and/or value shaped like env.desc

Parameters
  • env (Racetrack, FrozenLake, other environment) – Instantiated environment object

  • policy (np.array) – Policy for each grid square in the environment, in the same shape as env.desc. For environments that have multiple states per grid square (eg: Racetrack), plotting is called for each additional state (eg: for v=(0, 0), v=(1, 0), ..)

  • value (np.array) – Value for each grid square in the environment, in the same shape as env.desc. For environments that have multiple states per grid square (eg: Racetrack), plotting is called for each additional state (eg: for v=(0, 0), v=(1, 0), ..)

  • ax (Axes) – (OPTIONAL) Matplotlib axes object to plot to

  • add_env_to_plot (bool) – If True, add the environment map to the axes before plotting policy using plot_env()

  • hide_terminal_locations (bool) – If True, all known terminal locations will have no text printed (as policy here doesn’t matter)

  • color (str) – Matplotlib color string denoting color of the text for policy/value

  • title (str) – (Optional) title added to the axes object

  • savefig (str) – (Optional) string filename to output the figure to

  • size_policy (str, numeric) –

    (Optional) Specification of text font size for policy printing. One of:

    • ’auto’: Will automatically choose a font size based on the number of characters to be printed

    • str or numeric: Interpreted as a Matplotlib style font size designation

  • size_value (str, numeric) – (Optional) Specification of text font size for value printing. Same interface as size_policy

  • value_precision (int) – Precision of value function to be included on figures

Returns

Matplotlib Axes object

lrl.utils.plotting.plot_episodes(episodes, env=None, add_env_to_plot=True, max_episodes=100, alpha=None, color='k', title=None, ax=None, savefig=None)

Plot a list of episodes through an environment over a drawing of the environment

Parameters
  • episodes (list, EpisodeStatistics) – Series of episodes to be plotted. If EpisodeStatistics instance, .episodes will be extracted

  • env – Environment traversed

  • add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image

  • alpha (float) – (Optional) alpha (transparency) used for plotting the episode. If left as None, a value will be chosen based on the number of episodes to be plotted

  • color (str) – Matplotlib-style color designation

  • title (str) – (Optional) Title to be added to the axes

  • ax (axes) – (Optional) Matplotlib axes object to write the plot to

  • savefig (str) – (Optional) string filename to output the figure to

  • max_episodes (int) – Maximum number of episodes to add to the plot. If len(episodes) exceeds this value, randomly chosen episodes will be used

Returns

Matplotlib Axes object with episodes plotted to it

lrl.utils.plotting.plot_episode(episode, env=None, add_env_to_plot=True, alpha=None, color='k', title=None, ax=None, savefig=None)

Plot a single episode (walk) through the environment

Parameters
  • episode (list) – List of states encountered in the episode

  • env – Environment traversed

  • add_env_to_plot (bool) – If True, use plot_env to plot the environment to the image

  • alpha (float) – (Optional) alpha (transparency) used for plotting the episode.

  • color (str) – Matplotlib-style color designation

  • title (str) – (Optional) Title to be added to the axes

  • ax (axes) – (Optional) Matplotlib axes object to write the plot to

  • savefig (str) – (Optional) string filename to output the figure to

Returns

Matplotlib Axes object with a single episode plotted to it

lrl.utils.plotting.choose_text_size(n_chars, boxsize=1.0)

Helper to choose an appropriate text size when plotting policies. Size is chosen based on length of text

Return is calibrated to something that typically looked nice in testing

Parameters
  • n_chars – Number of characters in the text caption to be added to the plot

  • boxsize (float) – Size of box inside which text should print nicely. Used as a scaling factor. Default is 1 inch

Returns

Matplotlib-style text size argument

lrl.utils.plotting.policy_dict_to_array(env, policy_dict)

Convert a policy stored as a dictionary into a dictionary of one or more policy numpy arrays shaped like env.desc

Can also be used for a value_dict.

policy_dict is a dictionary relating state to policy at that state in one of several forms. The dictionary can be keyed by state-index or a tuple of state (eg: (x, y, [other_state]), with x=0 in left column, y=0 in bottom row). If using tuples of state, state may be more than just x,y location as shown above, eg: (x, y, v_x, v_y). If len(state_tuple) > 2, we must plot each additional state separately.

Translate policy_dict into a policy_list_of_tuples of:

[(other_state_0, array_of_policy_at_other_state_0),
 (other_state_1, array_of_policy_at_other_state_1),
  ... ]

where the array_of_policy_at_other_state_* is in the same shape as env.desc (eg: cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).

Examples

If state is described by tuples of (x, y) (where there is a single unique state for each grid location), eg:

policy_dict = {
    (0, 0): policy_0_0,
    (0, 1): policy_0_1,
    (0, 2): policy_0_2,
    ...
    (1, 0): policy_1_0,
    (1, 1): policy_1_1,
    ...
    (xmax, ymax): policy_xmax_ymax,
    }

then a single-element list is returned of the form:

returned = [
  (None, np_array_of_policy),
]

where np_array_of_policy is of the same shape as env.desc (eg: the map), with each element corresponding to the policy at that grid location (for example, cell [3, 2] of the array is the policy for the env.desc[3, 2] location in the env).

If state is described by tuples of (x, y, something_else, [more_something_else…]), for example if state = (x, y, Vx, Vy) like below:

policy_dict = {
    (0, 0, 0, 0): policy_0_0_0_0,
    (0, 0, 1, 0): policy_0_0_1_0,
    (0, 0, 0, 1): policy_0_0_0_1,
    ...
    (1, 0, 0, 0): policy_1_0_0_0,
    (1, 0, 0, 1): policy_1_0_0_1,
    ...
    (xmax, ymax, Vxmax, Vymax): policy_xmax_ymax_Vxmax_Vymax,
    }

then a list is returned of the form:

returned = [
#   (other_state, np_array_of_policies_for_this_other_state)
    ((0, 0), np_array_of_policies_with_Vx-0_Vy-0),
    ((1, 0), np_array_of_policies_with_Vx-1_Vy-0),
    ((0, 1), np_array_of_policies_with_Vx-0_Vy-1),
    ...
    ((Vxmax, Vymax), np_array_of_policies_with_Vxmax_Vymax),
]

where each element corresponds to a different combination of all the non-location state. This means that each element of the list is:

(Identification_of_this_case, shaped_xy-grid_of_policies_for_this_case)

and can be easily plotted over the environment’s map.

If policy_dict is keyed by state-index rather than state directly, the same logic as above still applies.

Notes

If using an environment (with policy keyed by either index or state) that has more than one unique state per grid location (eg: state has more than (x, y)), then environment must also have an index_to_state attribute to identify overlapping states. This constraint exists both for policies keyed by index or state, but the code could be refactored to avoid this limitation for state-keyed policies if required.

Parameters
  • env – Augmented OpenAI Gym-like environment object

  • policy_dict (dict) – Dictionary of policy for the environment, keyed by integer state-index or tuples of state

Returns

list of (description, shaped_policy) elements as described above

lrl.utils.plotting.get_ax(ax)

Returns figure and axes objects associated with an axes, instantiating if input is None

Data Stores

class lrl.data_stores.GeneralIterationData(columns=None)

Bases: object

Class to store data about solver iterations

Data is stored as a list of dictionaries. This is a placeholder for more advanced storage. Class gives a minimal set of extra bindings for convenience.

The present object has no checks to ensure consistency between added records (all have same fields, etc.). If any columns are missing from an added record, Pandas will treat them as missing data when outputting to a DataFrame.

Parameters

columns (list) – An optional list of column names for the data (if specified, this sets the order of the columns in any output Pandas DataFrame or csv)

DOCTODO: Add example of usage
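
A brief usage sketch (the column names and values are illustrative):

from lrl.data_stores import GeneralIterationData

data = GeneralIterationData(columns=['iteration', 'delta_max', 'converged'])
data.add({'iteration': 0, 'delta_max': 1.3, 'converged': False})
data.add({'iteration': 1, 'delta_max': 0.0005, 'converged': True})

print(data.get(-1)['delta_max'])   # most recently added record
df = data.to_dataframe()           # columns ordered as specified above
data.to_csv('iteration_data.csv')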

columns = None

Column names used for data output.

If specified, this sets the order of any columns being output to Pandas DataFrame or csv

Type

list

data = None

List of dictionaries representing records.

Intended to be internal in future, but public at present to give easy access to records for slicing

Type

list

add(d)

Add a dictionary record to the data structure.

Parameters

d (dict) – Dictionary of data to be stored

Returns

None

get(i=-1)

Return the ith entry in the data store (index of storage is in order in which data is committed to this object)

Parameters

i (int) – Index of data to return (can be any valid list index, including -1 and slices)

Returns

ith entry in the data store

Return type

dict

to_dataframe()

Returns the data structure as a Pandas DataFrame

Returns

Pandas DataFrame of the data

Return type

dataframe

to_csv(filename, **kwargs)

Write data structure to a csv via the Pandas DataFrame

Parameters
  • filename (str) – Filename or full path to output data to

  • kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()

Returns

None

class lrl.data_stores.DictWithHistory(timepoint_mode='explicit', tolerance=1e-07)

Bases: collections.abc.MutableMapping

Dictionary-like object that maintains a history of all changes, either incrementally or at set timepoints

This object has access like a dictionary, but stores data internally such that the user can later recreate the state of the data from a past timepoint.

The intended use of this object is to store large objects which are iterated on (such as value or policy functions) in a way that a history of changes can be reproduced without having to store a new copy of the object every time. For example, when doing 10000 episodes of Q-Learning in a grid world with 2500 states, we can retain the full policy history during convergence (eg: answer “what was my policy after episode 527”) without keeping 10000 copies of a nearly-identical 2500 element numpy array or dict. The cost for this is some computation, although this generally has not been seen to be too significant (~10’s of seconds for a large Q-Learning problem in testing)

Parameters
  • timepoint_mode (str) – One of:

    • explicit: Timepoint incrementing is handled explicitly by the user (the timepoint only changes when the user invokes .increment_timepoint())

    • implicit: Timepoint incrementing is automatic and occurs on every setting action, including redundant sets (setting a key to a value it already holds). This is useful for a time history of all sets to the object

  • tolerance (float) – Absolute tolerance to test for when replacing values. If a value to be set is less than tolerance different from the current value, the current value is not changed.

Warning

  • Deletion of keys is not specifically supported. Deletion likely works for the most recent timepoint, but the history does not handle deleted keys properly

  • Numeric data may work best due to how new values are compared to existing data, although tuples have also been tested. See __setitem__ for more detail

DOCTODO: Add example
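
A brief usage sketch:

from lrl.data_stores import DictWithHistory

d = DictWithHistory(timepoint_mode='explicit')
d['a'] = 1.0               # written at timepoint 0
d.increment_timepoint()    # later writes go to timepoint 1
d['a'] = 2.0
d['a'] = 2.0               # within tolerance of the current value -> history unchanged

print(d['a'])                    # most recent value: 2.0
print(d.get_value_history('a'))  # (timepoint, value) pairs, eg: [(0, 1.0), (1, 2.0)]
print(d.to_dict(timepoint=0))    # state of the data at timepoint 0: {'a': 1.0}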

timepoint_mode = None

See Parameters for definition

Type

str

current_timepoint = None

Timepoint that will be written to next

Type

int

__getitem__(key)

Return the most recent value for key

Returns

Whatever is contained in ._data[key][-1][-1] (return only the value from the most recent timepoint, not the timepoint associated with it)

__setitem__(key, value)

Set the value at a key if it is different from the current data stored at key

Data stored here is stored under the self.current_timepoint.

Difference between new and current values is assessed by testing:

  • new_value == old_value

  • np.isclose(new_value, old_value)

where if neither returns True, the new value is taken to be different from the current value

Side Effects:

If timepoint_mode == ‘implicit’, self.current_timepoint will be incremented after setting data

Parameters
  • key (immutable) – Key under which data is stored

  • value – Value to store at key

Returns

None

update(d)

Update this instance with a dictionary of data, d (similar to dict.update())

Keys in d that are present in this object overwrite the previous value. Keys in d that are missing in this object are added.

All data written from d is given the same timepoint (even if timepoint_mode=implicit) - the addition is treated as a single update to the object rather than a series of updates.

Parameters

d (dict) – Dictionary of data to be added here

Returns

None

get_value_history(key)

Returns a list of tuples of the value at a given key over the entire history of that key

Parameters

key (immutable) – Any valid dictionary key

Returns

list containing tuples of:

  • timepoint (int): Integer timepoint for this value

  • value (float): The value of key at the corresponding timepoint

Return type

(list)

get_value_at_timepoint(key, timepoint=-1)

Returns the value corresponding to a key at the timepoint that is closest to but not greater than timepoint

Raises a KeyError if key did not exist at timepoint. Raises an IndexError if no timepoint exists that applies

Parameters
  • key (immutable) – Any valid dictionary key

  • timepoint (int) – Integer timepoint to return value for. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)

Returns

Value corresponding to key at the timepoint closest to but not over timepoint

Return type

numeric

to_dict(timepoint=-1)

Return the state of the data at a given timepoint as a dict

Parameters

timepoint (int) – Integer timepoint to return data as of. If negative, it is interpreted like typical python indexing (-1 means most recent, -2 means second most recent, …)

Returns

Data at timepoint

Return type

dict

clear() → None. Remove all items from D.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
increment_timepoint()

Increments the timepoint at which the object is currently writing

Returns

None

items() → a set-like object providing a view on D's items

keys() → a set-like object providing a view on D's keys

pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), remove and return some (key, value) pair

as a 2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
values() → an object providing a view on D's values
class lrl.data_stores.EpisodeStatistics

Bases: object

Container for statistics about a set of independent episodes through an environment, typically following one policy

Statistics are lazily computed and memoized

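Examples

A minimal sketch of recording a few episodes and pulling statistics (the episode state lists below are arbitrary placeholders):

>>> from lrl.data_stores import EpisodeStatistics
>>> stats = EpisodeStatistics()
>>> stats.add(reward=1.0, episode=[0, 1, 2], terminal=True)
>>> stats.add(reward=3.0, episode=[0, 4, 2], terminal=True)
>>> stats.get_statistic('reward_mean', index=-1)   # -> 2.0, the mean reward over both episodes
>>> stats.get_statistics(index=-1)                 # -> dict of details and statistics for the latest episode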

rewards = None

List of the total reward for each episode (raw data)

Type

list

episodes = None

List of all episodes passed to the data object (raw data)

Type

list

steps = None

List of the total steps taken for each episode (raw data)

Type

list

terminals = None

List of whether each input episode was terminal (raw data)

Type

list

add(reward, episode, terminal)

Add an episode to the data store

Parameters
  • reward (float) – Total reward from the episode

  • episode (list) – List of states encountered in the episode, including the starting and final state

  • terminal (bool) – Boolean indicating whether the episode was terminal (i.e., the environment reported that the episode ended)

Returns

None

get_statistic(statistic='reward_mean', index=-1)

Return a lazily computed and memoized statistic about the rewards from episodes 0 to index

If the statistic has not been previously computed, it will be computed and returned. See .get_statistics() for the list of statistics available

Side Effects:

self.statistics[index] will be computed using self.compute() if it has not been already

Parameters
  • statistic (str) – See .compute() for available statistics

  • index (int) – Episode index for requested statistic

Notes

Statistics are computed for all episodes up to and including the requested statistic. For example if episodes have rewards of [1, 3, 5, 10], get_statistic(‘reward_mean’, index=2) returns 3 (mean of [1, 3, 5]).


Returns

Value of the statistic requested

Return type

int, float
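A short sketch of the Notes example above (EpisodeStatistics and add() as documented; the single-state episodes are placeholders):

>>> from lrl.data_stores import EpisodeStatistics
>>> stats = EpisodeStatistics()
>>> for r in [1, 3, 5, 10]:
...     stats.add(reward=r, episode=[0], terminal=True)
>>> stats.get_statistic('reward_mean', index=2)   # -> 3.0, the mean of [1, 3, 5]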

get_statistics(index=-1)

Return a lazily computed and memoized dictionary of all statistics about episodes 0 to index

If the statistics have not been previously computed, they will be computed here.

Side Effects:

self.statistics[index] will be computed using self.compute() if it has not been already

Parameters

index (int) – Episode index for requested statistic

Returns

Details and statistics about this episode, with keys:

Details about this episode:

  • episode_index (int): Index of episode

  • terminal (bool): Boolean of whether this episode was terminal

  • reward (float): This episode’s reward (included to give easy access to per-episode data)

  • steps (int): This episode’s steps (included to give easy access to per-episode data)

Statistics computed for all episodes up to and including this episode:

  • reward_mean (float): Mean of episode rewards

  • reward_median (float): Median of episode rewards

  • reward_std (float): Standard deviation of episode rewards

  • reward_max (float): Maximum episode reward

  • reward_min (float): Minimum episode reward

  • steps_mean (float): Mean number of steps per episode

  • steps_median (float): Median number of steps per episode

  • steps_std (float): Standard deviation of steps per episode

  • steps_max (float): Maximum number of steps in an episode

  • steps_min (float): Minimum number of steps in an episode

  • terminal_fraction (float): Fraction of episodes that were terminal

Return type

dict

compute(index=-1, force=False)

Compute and store statistics about rewards and steps for episodes up to and including the indexth episode

Side Effects:

self.statistics[index] will be updated

Parameters
  • index (int or 'all') – If an integer, the index of the episode for which statistics are computed. Eg: if index==3, compute the statistics (see get_statistics() for the list) for the series of episodes from 0 up to and including episode 3. If ‘all’, compute statistics for all indices, skipping any that have been previously computed unless force == True

  • force (bool) –

    If True, always recompute statistics even if they already exist.

    If False, only compute if no previous statistics exist.

Returns

None

to_dataframe(include_episodes=False)

Return a Pandas DataFrame of the episode statistics

See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns

Parameters

include_episodes (bool) – If True, add a column containing the entire episode (the list of states) for each row

Returns

Pandas DataFrame

to_csv(filename, **kwargs)

Write statistics to csv via the Pandas DataFrame

See .get_statistics() for a definition of each column. Order of columns is set through self.statistics_columns

Parameters
  • filename (str) – Filename or full path to output data to

  • kwargs (dict) – Optional arguments to be passed to DataFrame.to_csv()

Returns

None
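For example (construction and add() as documented above; extra keyword arguments are assumed to pass straight through to pandas):

>>> from lrl.data_stores import EpisodeStatistics
>>> stats = EpisodeStatistics()
>>> stats.add(reward=1.0, episode=[0, 1], terminal=True)
>>> df = stats.to_dataframe()                            # one row per episode, columns per get_statistics()
>>> stats.to_csv('episode_statistics.csv', index=False)  # index=False is forwarded to DataFrame.to_csv()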

Miscellaneous Utilities

class lrl.utils.misc.Timer

Bases: object

A simple timer class for timing code

start = None

Start time, as returned by timeit.default_timer(), captured at instantiation

elapsed()

Return the time elapsed since this object was instantiated, in seconds

Returns

Time elapsed in seconds

Return type

float
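A small sketch of timing a block of code:

>>> from lrl.utils.misc import Timer
>>> timer = Timer()
>>> _ = sum(range(10 ** 6))   # some work worth timing
>>> timer.elapsed()           # -> seconds since the Timer was instantiated, as a float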

lrl.utils.misc.print_dict_by_row(d, fmt='{key:20s}: {val:d}')

Print a dictionary with a little extra structure, printing a different key/value to each line.

Parameters
  • d (dict) – Dictionary to be printed

  • fmt (str) – Format string to be used for printing. Must contain key and val formatting references

Returns

None
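For example, using the default fmt (which expects integer values) and a custom fmt for floats:

>>> from lrl.utils.misc import print_dict_by_row
>>> print_dict_by_row({'episodes': 100, 'steps': 2500})                  # prints one 'key: value' line per entry
>>> print_dict_by_row({'reward_mean': 2.0}, fmt='{key:20s}: {val:.2f}')  # fmt must reference key and val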

lrl.utils.misc.count_dict_differences(d1, d2, keys=None, raise_on_missing_key=True, print_differences=False)

Return the number of differences between two dictionaries. Useful to compare two policies stored as dictionaries.

Does not properly handle floats that are approximately equal. Mainly intended for ints and objects with a meaningful __eq__

Optionally raise an error on missing keys (otherwise missing keys are counted as differences)

Parameters
  • d1 (dict) – Dictionary to compare

  • d2 (dict) – Dictionary to compare

  • keys (list) – Optional list of keys to consider for differences. If None, all keys will be considered

  • raise_on_missing_key (bool) – If true, raise KeyError on any keys not shared by both dictionaries

  • print_differences (bool) – If true, print all differences to screen

Returns

Number of differences between the two dictionaries

Return type

int
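For example, comparing two policies stored as dicts of state -> action index:

>>> from lrl.utils.misc import count_dict_differences
>>> policy_a = {'s0': 0, 's1': 1, 's2': 2}
>>> policy_b = {'s0': 0, 's1': 3, 's2': 2}
>>> count_dict_differences(policy_a, policy_b)   # -> 1, the policies differ only at 's1'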

lrl.utils.misc.dict_differences(d1, d2)

Return the maximum and mean of the absolute difference between all elements of two dictionaries of numbers

Parameters
  • d1 (dict) – Dictionary to compare

  • d2 (dict) – Dictionary to compare

Returns

tuple containing:

  • float: Maximum elementwise difference

  • float: Sum of elementwise differences

Return type

tuple
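For example, comparing two value functions stored as dicts of state -> value:

>>> from lrl.utils.misc import dict_differences
>>> v1 = {'s0': 1.0, 's1': 2.0}
>>> v2 = {'s0': 1.5, 's1': 2.0}
>>> max_diff, agg_diff = dict_differences(v1, v2)   # max_diff -> 0.5; agg_diff as described in Returns above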

lrl.utils.misc.rc_to_xy(row, col, rows)

Convert from (row, col) coordinates (eg: numpy array) to (x, y) coordinates (bottom left = 0,0)

(x, y) convention

  • (0,0) in bottom left

  • x +ve to the right

  • y +ve up

(row,col) convention:

  • (0,0) in top left

  • row +ve down

  • col +ve to the right

Parameters
  • row (int) – row coordinate to be converted

  • col (int) – col coordinate to be converted

  • rows (int) – Total number of rows

Returns

(x, y)

Return type

tuple
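For example, in a grid with 4 rows, the conventions above imply that the top-left cell maps to x=0, y=3 and the bottom-left cell maps to x=0, y=0:

>>> from lrl.utils.misc import rc_to_xy
>>> rc_to_xy(row=0, col=0, rows=4)   # -> (0, 3)
>>> rc_to_xy(row=3, col=0, rows=4)   # -> (0, 0)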

lrl.utils.misc.params_to_name(params, n_chars=4, sep='_', first_fields=None, key_remap=None)

Convert a mapping (eg: dict) of parameters into a string for easy test naming

Warning

Currently includes hard-coded formatting that interprets keys named ‘alpha’ or ‘epsilon’

Parameters
  • params (dict) – Dictionary to convert to a string

  • n_chars (int) – Number of characters per key to add to string. Eg: if key=’abcdefg’ and n_chars=4, output will be ‘abcd’

  • sep (str) – Separator character between fields (one separator is used between a key and its value, and two between different key-value pairs)

  • first_fields (list) – Optional list of keys to write ahead of other keys (otherwise, output order is sorted)

  • key_remap (list) – List of dictionaries of {key_name: new_key_name} for rewriting keys into more readable strings

Returns

String representation of params, suitable for use as a name

Return type

str
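A hedged sketch of typical usage (the exact string returned depends on the hard-coded formatting noted in the Warning above):

>>> from lrl.utils.misc import params_to_name
>>> params = {'alpha': 0.1, 'epsilon': 0.9, 'gamma': 0.95}
>>> params_to_name(params, n_chars=4, sep='_')   # -> a compact name along the lines of 'alph_0.1__epsi_0.9__gamm_0.95'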